Files
AIclinicalresearch/extraction_service
HaHafeng 98d862dbd4 feat(aia): Complete AIA V2.0 and sync all changes
AIA V2.0 Major Updates:
- Add StreamingService with OpenAI Compatible format (backend/common/streaming)
- Upgrade Chat component V2 with Ant Design X deep integration
- Implement 12 intelligent agents (5 phases: topic/design/review/data/writing)
- Create AgentHub with 100% prototype V11 restoration
- Create ChatWorkspace with fullscreen immersive experience
- Add ThinkingBlock for deep thinking display
- Add useAIStream Hook for stream handling
- Add ConversationList for conversation management

Backend (~1300 lines):
- common/streaming: OpenAI adapter and streaming service
- modules/aia: 12 agents config, conversation service, attachment service
- Unified API routes to /api/v1 (RVW, PKB, AIA modules)
- Update authentication and permission helpers

Frontend (~3500 lines):
- modules/aia: AgentHub + ChatWorkspace + AgentCard components
- shared/Chat: AIStreamChat, ThinkingBlock, useAIStream, useConversations
- Update all modules API endpoints to v1
- Modern design with theme colors (blue/yellow/teal/purple)

Documentation (~2500 lines):
- AIA module status and development guide
- Universal capabilities catalog (11 services)
- Quick reference card
- System overview updates
- All module documentation synchronization

Other Updates:
- DC Tool C: Python operations and frontend components
- IIT Manager: session memory and wechat service
- PKB/RVW/ASL: API route updates
- Docker configs and deployment scripts
- Database migrations and scripts
- Test files and documentation

Tested: AIA streaming verified, authentication working, core features functional
Status: AIA V2.0 completed (85%), all changes synchronized
2026-01-14 19:19:00 +08:00
..

文档提取微服务

基于FastAPI的文档文本提取服务支持PDF、Docx、Txt格式。

功能特性

  • PDF提取使用PyMuPDF快速提取PDF文本
  • Docx提取使用Mammoth提取Word文档Day 3
  • Txt提取支持多种编码Day 3
  • 语言检测自动检测PDF语言Day 2
  • Nougat集成高质量学术PDF解析Day 2

快速开始

1. 安装依赖

cd extraction_service

# 创建虚拟环境(推荐)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 安装依赖
pip install -r requirements.txt

2. 配置环境变量

# 复制示例配置
cp .env.example .env

# 编辑配置(可选)
# SERVICE_PORT=8000
# DEBUG=True

3. 启动服务

# 开发模式(自动重载)
python main.py

# 或使用uvicorn
uvicorn main:app --reload --port 8000

服务将在 http://localhost:8000 启动

4. 测试服务

健康检查

curl http://localhost:8000/api/health

返回:

{
  "status": "healthy",
  "checks": {
    "pymupdf": {
      "available": true,
      "version": "1.23.8"
    },
    "temp_dir": {
      "path": "/tmp/extraction_service",
      "writable": true
    }
  }
}

PDF文本提取

curl -X POST http://localhost:8000/api/extract/pdf \
  -F "file=@test.pdf"

返回:

{
  "success": true,
  "method": "pymupdf",
  "text": "提取的文本内容...",
  "metadata": {
    "page_count": 20,
    "char_count": 50000,
    "file_size": 1024000,
    "filename": "test.pdf"
  }
}

API文档

启动服务后访问:

项目结构

extraction_service/
├── main.py              # 主应用入口
├── requirements.txt     # Python依赖
├── .env.example         # 环境变量示例
├── README.md           # 本文件
├── services/           # 服务模块
│   ├── __init__.py
│   ├── pdf_extractor.py      # PDF提取PyMuPDF
│   ├── nougat_extractor.py   # Nougat提取Day 2
│   ├── docx_extractor.py     # Docx提取Day 3
│   ├── txt_extractor.py      # Txt提取Day 3
│   ├── language_detector.py  # 语言检测Day 2
│   └── file_utils.py         # 文件工具
└── tests/              # 测试文件(待添加)

开发计划

Day 1已完成

  • FastAPI项目搭建
  • PyMuPDF集成
  • PDF文本提取功能
  • 健康检查API

Day 2进行中

  • 安装Nougat
  • 语言检测功能
  • Nougat提取逻辑
  • 顺序降级机制

Day 3

  • Docx提取Mammoth
  • Txt提取多编码
  • 文件格式验证

依赖说明

版本 用途
fastapi 0.104.1 Web框架
uvicorn 0.24.0 ASGI服务器
PyMuPDF 1.23.8 PDF文本提取
pdfplumber 0.10.3 PDF语言检测
mammoth 1.6.0 Docx提取
langdetect 1.0.9 语言检测
loguru 0.7.2 日志管理

性能指标

操作 目标时间
20页PDFPyMuPDF <30秒
10页Docx <10秒
1MB Txt <5秒

常见问题

Q: PyMuPDF安装失败

A: 确保Python版本>=3.8使用pip安装pip install PyMuPDF

Q: 服务无法启动?

A: 检查端口8000是否被占用可修改.env中的SERVICE_PORT

Q: 临时文件在哪里?

A: 默认在/tmp/extraction_service目录可通过TEMP_DIR环境变量配置

License

MIT