AIclinicalresearch/extraction_service
HaHafeng b64896a307 feat(deploy): Complete PostgreSQL migration and Docker image build
Summary:
- PostgreSQL database migration to RDS completed (90MB SQL, 11 schemas)
- Frontend Nginx Docker image built and pushed to ACR (v1.0, ~50MB)
- Python microservice Docker image built and pushed to ACR (v1.0, 1.12GB)
- Created 3 deployment documentation files

Docker Configuration Files:
- frontend-v2/Dockerfile: Multi-stage build with nginx:alpine
- frontend-v2/.dockerignore: Optimize build context
- frontend-v2/nginx.conf: SPA routing and API proxy
- frontend-v2/docker-entrypoint.sh: Dynamic env injection
- extraction_service/Dockerfile: Multi-stage build with Aliyun Debian mirror
- extraction_service/.dockerignore: Optimize build context
- extraction_service/requirements-prod.txt: Production dependencies (removed Nougat)

Deployment Documentation:
- docs/05-部署文档/00-部署进度总览.md: One-stop deployment status overview
- docs/05-部署文档/07-前端Nginx-SAE部署操作手册.md: Frontend deployment guide
- docs/05-部署文档/08-PostgreSQL数据库部署操作手册.md: Database deployment guide
- docs/00-系统总体设计/00-系统当前状态与开发指南.md: Updated with deployment status

Database Migration:
- RDS instance: pgm-2zex1m2y3r23hdn5 (2C4G, PostgreSQL 15.0)
- Database: ai_clinical_research
- Schemas: 11 business schemas migrated successfully
- Data: 3 users, 2 projects, 1204 literatures verified
- Backup: rds_init_20251224_154529.sql (90MB)

Docker Images:
- Frontend: crpi-cd5ij4pjt65mweeo.cn-beijing.personal.cr.aliyuncs.com/ai-clinical/ai-clinical_frontend-nginx:v1.0
- Python: crpi-cd5ij4pjt65mweeo.cn-beijing.personal.cr.aliyuncs.com/ai-clinical/python-extraction:v1.0

Key Achievements:
- Resolved Docker Hub network issues (using generic tags)
- Fixed 30 TypeScript compilation errors
- Removed Nougat OCR to reduce image size by 1.5GB
- Used Aliyun Debian mirror to resolve apt-get network issues
- Implemented multi-stage builds for optimization

Next Steps:
- Deploy Python microservice to SAE
- Build Node.js backend Docker image
- Deploy Node.js backend to SAE
- Deploy frontend Nginx to SAE
- End-to-end verification testing

Status: Docker images ready, SAE deployment pending
2025-12-24 18:21:55 +08:00

Document Extraction Microservice

A FastAPI-based document text extraction service that supports PDF, Docx, and Txt formats.

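Features

  • PDF extraction: fast PDF text extraction with PyMuPDF
  • Docx extraction: extracts Word documents with Mammoth (Day 3)
  • Txt extraction: supports multiple encodings (Day 3)
  • Language detection: automatically detects the language of a PDF (Day 2)
  • Nougat integration: high-quality parsing of academic PDFs (Day 2)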
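
The Txt extraction feature mentions support for multiple encodings. One common way to do this is to try candidate encodings in order, which the minimal sketch below illustrates. The function name `decode_with_fallback` and the candidate list are assumptions for illustration only; this README does not say which encodings `txt_extractor.py` actually tries.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Iterable[str] = ("utf-8", "gbk", "latin-1"),  # assumed candidates
) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

Because `latin-1` accepts any byte sequence, placing it last makes it a catch-all; drop it if a hard failure is preferable to possibly wrong characters.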

Quick Start

1. Install dependencies

cd extraction_service

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configure environment variables

# Copy the example configuration
cp .env.example .env

# Edit the configuration (optional)
# SERVICE_PORT=8000
# DEBUG=True
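
For reference, here is a minimal sketch of how a service might read these settings in Python. `SERVICE_PORT`, `DEBUG`, and `TEMP_DIR` are the variable names used in this README; the `Settings` class, the `load_settings` helper, and the rule for parsing `DEBUG` are assumptions. The sketch also assumes the values are already present in the process environment, since the README does not show how `main.py` loads `.env`.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    service_port: int
    debug: bool
    temp_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read service settings from environment variables, with fallbacks."""
    return Settings(
        service_port=int(env.get("SERVICE_PORT", "8000")),
        # Assumption: any of these strings (case-insensitive) enables debug mode.
        debug=env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        temp_dir=env.get("TEMP_DIR", "/tmp/extraction_service"),
    )
```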

3. Start the service

# Development mode (auto-reload)
python main.py

# Or use uvicorn
uvicorn main:app --reload --port 8000

The service will start at http://localhost:8000

4. Test the service

Health check

curl http://localhost:8000/api/health

Response:

{
  "status": "healthy",
  "checks": {
    "pymupdf": {
      "available": true,
      "version": "1.23.8"
    },
    "temp_dir": {
      "path": "/tmp/extraction_service",
      "writable": true
    }
  }
}
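
The same check can be scripted, for example in a deployment smoke test. The sketch below uses only the standard library and assumes the response has exactly the shape shown above; `fetch_health` and `is_healthy` are hypothetical helper names, not part of the service.

```python
import json
from urllib import request


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /api/health and return the decoded JSON body."""
    with request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
        return json.load(resp)


def is_healthy(payload: dict) -> bool:
    """True only if the status is healthy and both listed checks pass."""
    checks = payload.get("checks", {})
    return (
        payload.get("status") == "healthy"
        and checks.get("pymupdf", {}).get("available") is True
        and checks.get("temp_dir", {}).get("writable") is True
    )
```

With the service running, `is_healthy(fetch_health())` combines the two steps.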

PDF text extraction

curl -X POST http://localhost:8000/api/extract/pdf \
  -F "file=@test.pdf"

Response:

{
  "success": true,
  "method": "pymupdf",
  "text": "Extracted text content...",
  "metadata": {
    "page_count": 20,
    "char_count": 50000,
    "file_size": 1024000,
    "filename": "test.pdf"
  }
}
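
A rough Python equivalent of the curl upload above, using only the standard library. The endpoint path and the `file` field name come from the curl command; the helper names `build_multipart` and `extract_pdf`, the timeout, and the fixed `application/pdf` part type are assumptions.

```python
import json
import uuid
from pathlib import Path
from typing import Tuple
from urllib import request


def build_multipart(field: str, filename: str, data: bytes, boundary: str) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_pdf(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a PDF to /api/extract/pdf and return the decoded JSON response."""
    pdf = Path(path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes(), uuid.uuid4().hex)
    req = request.Request(
        f"{base_url}/api/extract/pdf",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

With the service running, `extract_pdf("test.pdf")["metadata"]["page_count"]` would read the page count from a response shaped like the one above.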

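API Documentation

After starting the service, open the interactive API documentation. FastAPI serves Swagger UI at /docs and ReDoc at /redoc by default.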

Project Structure

extraction_service/
├── main.py              # Main application entry point
├── requirements.txt     # Python dependencies
├── .env.example         # Example environment variables
├── README.md            # This file
├── services/            # Service modules
│   ├── __init__.py
│   ├── pdf_extractor.py      # PDF extraction (PyMuPDF)
│   ├── nougat_extractor.py   # Nougat extraction (Day 2)
│   ├── docx_extractor.py     # Docx extraction (Day 3)
│   ├── txt_extractor.py      # Txt extraction (Day 3)
│   ├── language_detector.py  # Language detection (Day 2)
│   └── file_utils.py         # File utilities
└── tests/               # Test files (to be added)

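Development Plan

Day 1 (completed)

  • FastAPI project setup
  • PyMuPDF integration
  • PDF text extraction
  • Health check API

Day 2 (in progress)

  • Install Nougat
  • Language detection
  • Nougat extraction logic
  • Sequential fallback mechanism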
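
The sequential fallback mechanism in the Day 2 plan is not described further in this README. As a hedged illustration of the general idea, the sketch below tries extractors in order and returns the first usable result. The `extract_with_fallback` name, the treatment of empty output as a failure, and the extractor ordering in the usage note are all assumptions.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[bytes], str]


def extract_with_fallback(
    data: bytes,
    extractors: Sequence[Tuple[str, Extractor]],
) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return (method, text) from the first that works."""
    last_error: Optional[Exception] = None
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:  # a failed extractor falls through to the next one
            last_error = exc
            continue
        if text.strip():  # assumption: empty output counts as a failure
            return name, text
    raise RuntimeError("all extractors failed") from last_error
```

A caller might pass `[("nougat", nougat_extract), ("pymupdf", pymupdf_extract)]`, where both callables are hypothetical wrappers around the modules in `services/`; the returned name would then fill the `method` field of the response.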

Day 3

  • Docx extraction (Mammoth)
  • Txt extraction (multiple encodings)
  • File format validation
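
File format validation is still on the plan, so the following is only a sketch of one plausible approach: check the extension against the three supported formats, then confirm the leading bytes where a signature exists. PDF files begin with `%PDF-`, and `.docx` files are ZIP archives that begin with `PK\x03\x04`. The `validate_upload` name and the exact rules are assumptions.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAGIC_PREFIXES = {
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",  # .docx files are ZIP archives
}


def validate_upload(filename: str, head: bytes) -> bool:
    """Check the extension and, where a signature is known, the first bytes."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False
    magic = MAGIC_PREFIXES.get(ext)
    # Plain text has no signature, so .txt passes on extension alone.
    return magic is None or head.startswith(magic)
```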

Dependencies

Dependency   Version   Purpose
fastapi      0.104.1   Web framework
uvicorn      0.24.0    ASGI server
PyMuPDF      1.23.8    PDF text extraction
pdfplumber   0.10.3    PDF language detection
mammoth      1.6.0     Docx extraction
langdetect   1.0.9     Language detection
loguru       0.7.2     Logging

Performance Metrics

Operation               Target time
20-page PDF (PyMuPDF)   < 30 seconds
10-page Docx            < 10 seconds
1 MB Txt                < 5 seconds
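
To check an operation against these targets, a small timing helper is enough. The sketch below is illustrative: the `TARGETS` keys and the `timed` helper are hypothetical names, and only the numeric limits come from the table above.

```python
import time
from typing import Any, Callable, Tuple

# Target times in seconds, taken from the table above; the keys are made up.
TARGETS = {"pdf_20_pages": 30.0, "docx_10_pages": 10.0, "txt_1mb": 5.0}


def timed(operation: str, func: Callable[..., Any], *args: Any) -> Tuple[Any, float, bool]:
    """Run func(*args); return (result, elapsed_seconds, met_target)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < TARGETS[operation]
```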

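FAQ

Q: PyMuPDF fails to install?

A: Make sure your Python version is >= 3.8, then install it with pip: pip install PyMuPDF

Q: The service won't start?

A: Check whether port 8000 is already in use. You can change SERVICE_PORT in .env.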
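
One quick way to check whether a port is already taken is to try connecting to it. This is a minimal sketch using the standard library; `port_in_use` is a hypothetical helper name, and it only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8000)` returns True, either stop the other process or set a different `SERVICE_PORT`.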

Q: Where are temporary files stored?

A: By default in the /tmp/extraction_service directory. This can be changed with the TEMP_DIR environment variable.
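
The health check response reports whether this directory is writable. A sketch of how such a check could be done follows; the `check_temp_dir` name and the create-a-temporary-file strategy are assumptions, not the service's confirmed implementation.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping


def check_temp_dir(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve TEMP_DIR (or the default) and test whether it is writable."""
    path = Path(env.get("TEMP_DIR", "/tmp/extraction_service"))
    try:
        path.mkdir(parents=True, exist_ok=True)
        # Creating and auto-deleting a file proves write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        writable = True
    except OSError:
        writable = False
    return {"path": str(path), "writable": writable}
```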

License

MIT