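Document Extraction Microservice

A FastAPI-based document text extraction service that supports PDF, Docx, and Txt formats.

Features

  • PDF extraction: fast PDF text extraction with PyMuPDF
  • Docx extraction: Word document extraction with Mammoth (Day 3)
  • Txt extraction: support for multiple encodings (Day 3)
  • Language detection: automatic detection of the PDF's language (Day 2)
  • Nougat integration: high-quality parsing of academic PDFs (Day 2)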
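
The multi-encoding Txt extraction listed above could work roughly as follows. This is a minimal sketch, not the service's actual code: the function name decode_text and the candidate encoding list are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical candidate list; the encodings the service really tries
# are not documented in this README.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030", "big5")


def decode_text(data: bytes,
                encodings: Iterable[str] = CANDIDATE_ENCODINGS) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: latin-1 maps every byte to a character, so it never
    # fails, although the result may be mojibake.
    return data.decode("latin-1"), "latin-1"
```

Trying strict decoders in order and keeping the first success is simple and predictable; a statistical detector would be an alternative when the candidate list cannot be known in advance.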

Quick start

1. Install dependencies

cd extraction_service

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configure environment variables

# Copy the example configuration
cp .env.example .env

# Edit the configuration (optional)
# SERVICE_PORT=8000
# DEBUG=True
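
For illustration, the two settings above might be read at startup along these lines. This is a sketch under assumptions: load_settings is a hypothetical helper, the default of DEBUG=False is a guess, and loading the .env file into the environment is left to whatever mechanism main.py actually uses.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read SERVICE_PORT and DEBUG from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    port = int(env.get("SERVICE_PORT", "8000"))
    if not 1 <= port <= 65535:
        raise ValueError(f"SERVICE_PORT out of range: {port}")
    # Accept a few common spellings of "true"; anything else is False.
    debug = env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes")
    return {"port": port, "debug": debug}
```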

3. Start the service

# Development mode (auto-reload)
python main.py

# Or run it with uvicorn
uvicorn main:app --reload --port 8000

The service starts at http://localhost:8000

4. Test the service

Health check

curl http://localhost:8000/api/health

Response:

{
  "status": "healthy",
  "checks": {
    "pymupdf": {
      "available": true,
      "version": "1.23.8"
    },
    "temp_dir": {
      "path": "/tmp/extraction_service",
      "writable": true
    }
  }
}

PDF text extraction

curl -X POST http://localhost:8000/api/extract/pdf \
  -F "file=@test.pdf"

Response:

{
  "success": true,
  "method": "pymupdf",
  "text": "Extracted text content...",
  "metadata": {
    "page_count": 20,
    "char_count": 50000,
    "file_size": 1024000,
    "filename": "test.pdf"
  }
}
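
A client can sanity-check a response of this shape before using the text. The sketch below relies only on the fields shown in the sample response; summarize_extraction is a hypothetical helper name, not part of the service.

```python
import json


def summarize_extraction(payload: str) -> str:
    """Turn an /api/extract/pdf JSON response into a one-line summary."""
    data = json.loads(payload)
    if not data.get("success"):
        raise ValueError("extraction failed")
    meta = data.get("metadata", {})
    pages = meta.get("page_count", 0)
    chars = meta.get("char_count", 0)
    # Average characters per page; a very low value can hint at a scanned PDF.
    per_page = chars / pages if pages else 0
    return (f"{meta.get('filename', '?')}: {pages} pages, {chars} chars "
            f"({per_page:.0f}/page) via {data.get('method')}")
```

The function takes the response body as a string, so it works with whichever HTTP client is used to call the endpoint.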

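API documentation

Once the service is running, the interactive API documentation is available. Assuming FastAPI's default documentation routes are unchanged:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc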

Project structure

extraction_service/
├── main.py              # Application entry point
├── requirements.txt     # Python dependencies
├── .env.example         # Example environment variables
├── README.md            # This file
├── services/            # Service modules
│   ├── __init__.py
│   ├── pdf_extractor.py      # PDF extraction (PyMuPDF)
│   ├── nougat_extractor.py   # Nougat extraction (Day 2)
│   ├── docx_extractor.py     # Docx extraction (Day 3)
│   ├── txt_extractor.py      # Txt extraction (Day 3)
│   ├── language_detector.py  # Language detection (Day 2)
│   └── file_utils.py         # File utilities
└── tests/               # Tests (to be added)

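Development plan

Day 1 (completed)

  • FastAPI project setup
  • PyMuPDF integration
  • PDF text extraction
  • Health check API

Day 2 (in progress)

  • Install Nougat
  • Language detection
  • Nougat extraction logic
  • Sequential fallback mechanism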
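
The sequential fallback mechanism planned for Day 2 could be sketched as follows. The ordering (Nougat first, PyMuPDF second), the function name extract_with_fallback, and the returned error map are all assumptions; only the "success", "method", and "text" keys mirror the response format documented above.

```python
from typing import Callable, Dict, List, Tuple

# An extractor takes a file path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(path: str,
                          extractors: List[Tuple[str, Extractor]]) -> dict:
    """Try each extractor in order and return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure triggers the next extractor
            errors[name] = str(exc)
            continue
        if text.strip():
            return {"success": True, "method": name, "text": text}
        errors[name] = "empty result"
    return {"success": False, "method": None, "text": "", "errors": errors}
```

A call such as extract_with_fallback("paper.pdf", [("nougat", run_nougat), ("pymupdf", run_pymupdf)]) would then report which method produced the text; run_nougat and run_pymupdf stand in for the real extractor functions.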

Day 3

  • Docx extraction (Mammoth)
  • Txt extraction (multiple encodings)
  • File format validation
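
File format validation could combine an extension check with a look at the file's first bytes. This is a hedged sketch: detect_format and matches_signature are hypothetical names, and the README does not say how the service validates uploads. The signatures themselves are standard: PDF files begin with %PDF- and Docx files are ZIP archives beginning with PK\x03\x04.

```python
from pathlib import Path

# The three formats the README lists as supported.
SUPPORTED_FORMATS = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}

# Leading bytes for the binary formats; plain text has no signature.
SIGNATURES = {"pdf": b"%PDF-", "docx": b"PK\x03\x04"}


def detect_format(filename: str) -> str:
    """Map a filename to 'pdf', 'docx', or 'txt', or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_FORMATS[suffix]


def matches_signature(kind: str, head: bytes) -> bool:
    """Check that the first bytes of the file agree with the claimed format."""
    expected = SIGNATURES.get(kind)
    return expected is None or head.startswith(expected)
```

Checking the signature as well as the extension catches renamed files before they reach an extractor.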

Dependencies

Package      Version   Purpose
fastapi      0.104.1   Web framework
uvicorn      0.24.0    ASGI server
PyMuPDF      1.23.8    PDF text extraction
pdfplumber   0.10.3    PDF language detection
mammoth      1.6.0     Docx extraction
langdetect   1.0.9     Language detection
loguru       0.7.2     Logging

Performance targets

Operation                Target time
20-page PDF (PyMuPDF)    < 30 s
10-page Docx             < 10 s
1 MB Txt                 < 5 s

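FAQ

Q: PyMuPDF fails to install?

A: Make sure you are running Python 3.8 or later, then install it with pip: pip install PyMuPDF

Q: The service will not start?

A: Check whether port 8000 is already in use. You can change the port through SERVICE_PORT in .env.

Q: Where are temporary files stored?

A: In /tmp/extraction_service by default. The location can be changed with the TEMP_DIR environment variable.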
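
As an illustration of the TEMP_DIR behaviour, the directory could be resolved and checked for writability like this (the health check above reports a similar "writable" flag). The helper names resolve_temp_dir and is_writable are assumptions, not the service's actual functions.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_TEMP_DIR = "/tmp/extraction_service"  # default stated in this README


def resolve_temp_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the temp directory from TEMP_DIR (or the default), creating it."""
    env = os.environ if env is None else env
    path = Path(env.get("TEMP_DIR") or DEFAULT_TEMP_DIR)
    path.mkdir(parents=True, exist_ok=True)
    return path


def is_writable(path: Path) -> bool:
    """Probe writability by creating and removing a throwaway file."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False
```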

License

MIT