feat: add extraction_service (PDF/Docx/Txt) and update .gitignore to exclude venv

2025-11-16 15:32:44 +08:00
parent 2a4f59b08b
commit 39eb62ee79
18 changed files with 2706 additions and 0 deletions
--- a/extraction_service/README.md
+++ b/extraction_service/README.md
@@ -0,0 +1,181 @@
+# Document Extraction Microservice
+
+A FastAPI-based service that extracts text from PDF, Docx, and Txt documents.
+
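+## Features
+
+- ✅ **PDF extraction**: fast PDF text extraction with PyMuPDF
+- ⏳ **Docx extraction**: Word document extraction with Mammoth (Day 3)
+- ⏳ **Txt extraction**: support for multiple encodings (Day 3)
+- ⏳ **Language detection**: automatic detection of a PDF's language (Day 2)
+- ⏳ **Nougat integration**: high-quality parsing of academic PDFs (Day 2)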
+
+## Quick Start
+
+### 1. Install dependencies
+
+```bash
+cd extraction_service
+
+# Create a virtual environment (recommended)
+python -m venv venv
+source venv/bin/activate  # Windows: venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### 2. Configure environment variables
+
+```bash
+# Copy the example configuration
+cp .env.example .env
+
+# Edit the configuration (optional)
+# SERVICE_PORT=8000
+# DEBUG=True
+```
+
+### 3. Start the service
+
+```bash
+# Development mode (auto-reload)
+python main.py
+
+# Or run it with uvicorn directly
+uvicorn main:app --reload --port 8000
+```
+
+The service starts at http://localhost:8000
+
+### 4. Test the service
+
+#### Health check
+
+```bash
+curl http://localhost:8000/api/health
+```
+
+Response:
+```json
+{
+  "status": "healthy",
+  "checks": {
+    "pymupdf": {
+      "available": true,
+      "version": "1.23.8"
+    },
+    "temp_dir": {
+      "path": "/tmp/extraction_service",
+      "writable": true
+    }
+  }
+}
+```
+
+#### PDF text extraction
+
+```bash
+curl -X POST http://localhost:8000/api/extract/pdf \
+  -F "file=@test.pdf"
+```
+
+Response:
+```json
+{
+  "success": true,
+  "method": "pymupdf",
+  "text": "Extracted text content...",
+  "metadata": {
+    "page_count": 20,
+    "char_count": 50000,
+    "file_size": 1024000,
+    "filename": "test.pdf"
+  }
+}
+```
+
+## API Documentation
+
+Once the service is running, visit:
+- Swagger UI: http://localhost:8000/docs
+- ReDoc: http://localhost:8000/redoc
+
+## Project Structure
+
+```
+extraction_service/
+├── main.py              # Main application entry point
+├── requirements.txt     # Python dependencies
+├── .env.example         # Example environment variables
+├── README.md            # This file
+├── services/            # Service modules
+│   ├── __init__.py
+│   ├── pdf_extractor.py      # PDF extraction (PyMuPDF)
+│   ├── nougat_extractor.py   # Nougat extraction (Day 2)
+│   ├── docx_extractor.py     # Docx extraction (Day 3)
+│   ├── txt_extractor.py      # Txt extraction (Day 3)
+│   ├── language_detector.py  # Language detection (Day 2)
+│   └── file_utils.py         # File utilities
+└── tests/               # Tests (to be added)
+```
+
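+## Development Plan
+
+### ✅ Day 1 (done)
+- [x] FastAPI project scaffolding
+- [x] PyMuPDF integration
+- [x] PDF text extraction
+- [x] Health check API
+
+### ⏳ Day 2 (in progress)
+- [ ] Install Nougat
+- [ ] Language detection
+- [ ] Nougat extraction logic
+- [ ] Sequential fallback mechanism
+
+### ⏳ Day 3
+- [ ] Docx extraction (Mammoth)
+- [ ] Txt extraction (multiple encodings)
+- [ ] File format validation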
+
+## Dependencies
+
+| Library | Version | Purpose |
+|---|---|---|
+| fastapi | 0.104.1 | Web framework |
+| uvicorn | 0.24.0 | ASGI server |
+| PyMuPDF | 1.23.8 | PDF text extraction |
+| pdfplumber | 0.10.3 | PDF language detection |
+| mammoth | 1.6.0 | Docx extraction |
+| langdetect | 1.0.9 | Language detection |
+| loguru | 0.7.2 | Logging |
+
+## Performance Targets
+
+| Operation | Target time |
+|---|---|
+| 20-page PDF (PyMuPDF) | < 30 seconds |
+| 10-page Docx | < 10 seconds |
+| 1 MB Txt | < 5 seconds |
+
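+## FAQ
+
+### Q: PyMuPDF fails to install?
+A: Make sure your Python version is >= 3.8, then install it with pip: `pip install PyMuPDF`
+
+### Q: The service will not start?
+A: Check whether port 8000 is already in use; you can change SERVICE_PORT in .env
+
+### Q: Where are the temporary files?
+A: By default they are in the /tmp/extraction_service directory; this can be changed with the TEMP_DIR environment variable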
+
+## License
+
+MIT
+
+
+
+
+
+
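The README added above documents `POST /api/extract/pdf` as a multipart upload with a single `file` field (per its curl example). As a rough illustration of calling it from Python, here is a minimal standard-library sketch. The helper names `build_multipart` and `extract_pdf` are hypothetical and not part of this commit, and the base URL assumes the default port 8000 from the README.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/pdf") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns (body, value for the Content-Type request header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_pdf(path: str, base_url: str = "http://localhost:8000") -> dict:
    """Upload a PDF to the (assumed running) service and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart("file", os.path.basename(path), fh.read())
    req = urllib.request.Request(
        f"{base_url}/api/extract/pdf",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

With the service running, `extract_pdf("test.pdf")["metadata"]["page_count"]` would read the page count, assuming the response shape shown in the README.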