Summary:
- Implement async file upload processing (Platform-Only pattern)
- Add parseExcelWorker with pg-boss queue
- Implement React Query polling mechanism
- Add clean data caching (avoid duplicate parsing)
- Fix pivot single-value column tuple issue
- Optimize performance by 99 percent

Technical Details:

1. Async Architecture (Postgres-Only):
- SessionService.createSession: Fast upload + push to queue (3s)
- parseExcelWorker: Background parsing + save clean data (53s)
- SessionController.getSessionStatus: Status query API for polling
- React Query Hook: useSessionStatus (auto-serial polling)
- Frontend progress bar with real-time feedback

2. Performance Optimization:
- Clean data caching: Worker saves processed data to OSS
- getPreviewData: Read from clean data cache (0.5s vs 43s, -99 percent)
- getFullData: Read from clean data cache (0.5s vs 43s, -99 percent)
- Intelligent cleaning: Boundary detection + ghost column/row removal
- Safety valve: Max 3000 columns, 5M cells

3. Bug Fixes:
- Fix pivot column name tuple issue for single value column
- Fix queue name format (colon to underscore: asl:screening -> asl_screening)
- Fix polling storm (15+ concurrent requests -> 1 serial request)
- Fix QUEUE_TYPE environment variable (memory -> pgboss)
- Fix logger import in PgBossQueue
- Fix formatSession to return cleanDataKey
- Fix saveProcessedData to update clean data synchronously

4. Database Changes:
- ALTER TABLE dc_tool_c_sessions ADD COLUMN clean_data_key VARCHAR(1000)
- ALTER TABLE dc_tool_c_sessions ALTER COLUMN total_rows DROP NOT NULL
- ALTER TABLE dc_tool_c_sessions ALTER COLUMN total_cols DROP NOT NULL
- ALTER TABLE dc_tool_c_sessions ALTER COLUMN columns DROP NOT NULL

5. Documentation:
- Create Postgres-Only async task processing guide (588 lines)
- Update Tool C status document (Day 10 summary)
- Update DC module status document
- Update system overview document
- Update cloud-native development guide

Performance Improvements:
- Upload + preview: 96s -> 53.5s (-44 percent)
- Filter operation: 44s -> 2.5s (-94 percent)
- Pivot operation: 45s -> 2.5s (-94 percent)
- Concurrent requests: 15+ -> 1 (-93 percent)
- Complete workflow (upload + 7 ops): 404s -> 70.5s (-83 percent)

Files Changed:
- Backend: 15 files (Worker, Service, Controller, Schema, Config)
- Frontend: 4 files (Hook, Component, API)
- Docs: 4 files (Guide, Status, Overview, Spec)
- Database: 4 column modifications
- Total: ~1388 lines of new/modified code

Status: Fully tested and verified, production ready
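Document Extraction Microservice
A FastAPI-based document text extraction service that supports PDF, Docx, and Txt formats.
Features
- ✅ PDF extraction: fast PDF text extraction with PyMuPDF
- ⏳ Docx extraction: Word document extraction with Mammoth (Day 3)
- ⏳ Txt extraction: support for multiple encodings (Day 3)
- ⏳ Language detection: automatic detection of a PDF's language (Day 2)
- ⏳ Nougat integration: high-quality parsing of academic PDFs (Day 2)
Quick Start
1. Install dependencies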
cd extraction_service
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
2. Configure environment variables
# Copy the example configuration
cp .env.example .env
# Edit the configuration (optional)
# SERVICE_PORT=8000
# DEBUG=True
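As a rough illustration of how these two settings might be consumed at startup, the sketch below reads SERVICE_PORT and DEBUG from the environment. The `Settings` class and `load_settings` helper are hypothetical names, the default of `False` for DEBUG is an assumption, and the service's own main.py may load its configuration differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not taken from the service's code."""
    service_port: int
    debug: bool


def load_settings(env=None) -> Settings:
    """Read SERVICE_PORT and DEBUG from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # 8000 mirrors the commented example value; the DEBUG default is assumed.
    port = int(env.get("SERVICE_PORT", "8000"))
    if not 1 <= port <= 65535:
        raise ValueError(f"SERVICE_PORT out of range: {port}")
    debug = env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"}
    return Settings(service_port=port, debug=debug)


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to exercise in tests.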
3. Start the service
# Development mode (auto-reload)
python main.py
# Or use uvicorn
uvicorn main:app --reload --port 8000
The service starts at http://localhost:8000
4. Test the service
Health check
curl http://localhost:8000/api/health
Response:
{
"status": "healthy",
"checks": {
"pymupdf": {
"available": true,
"version": "1.23.8"
},
"temp_dir": {
"path": "/tmp/extraction_service",
"writable": true
}
}
}
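One way a payload of this shape could be assembled is sketched below using only the standard library. The function names are hypothetical, the "unhealthy" status value is an assumption (the example above only shows "healthy"), and the real endpoint may perform its checks differently.

```python
import importlib.util
import os
import tempfile
from importlib import metadata


def check_pymupdf() -> dict:
    """Report whether PyMuPDF is importable and, if installed, its version."""
    # PyMuPDF 1.23.x is imported under the module name "fitz".
    available = importlib.util.find_spec("fitz") is not None
    version = None
    if available:
        try:
            version = metadata.version("PyMuPDF")
        except metadata.PackageNotFoundError:
            version = None
    return {"available": available, "version": version}


def check_temp_dir(path: str) -> dict:
    """Create the directory if needed and verify that a file can be written."""
    writable = True
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        writable = False
    return {"path": path, "writable": writable}


def health_payload(temp_dir: str) -> dict:
    """Combine the individual checks into a single response body."""
    checks = {"pymupdf": check_pymupdf(), "temp_dir": check_temp_dir(temp_dir)}
    healthy = checks["pymupdf"]["available"] and checks["temp_dir"]["writable"]
    # "unhealthy" is an assumed label for the failing case.
    return {"status": "healthy" if healthy else "unhealthy", "checks": checks}
```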
PDF text extraction
curl -X POST http://localhost:8000/api/extract/pdf \
-F "file=@test.pdf"
Response:
{
"success": true,
"method": "pymupdf",
"text": "Extracted text content...",
"metadata": {
"page_count": 20,
"char_count": 50000,
"file_size": 1024000,
"filename": "test.pdf"
}
}
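For orientation, a response like this could be produced along the following lines. This is a minimal sketch, not the contents of pdf_extractor.py: the helper names are hypothetical, joining pages with a newline is an assumption, and so is counting `char_count` over the joined text.

```python
from typing import List

try:
    import fitz  # PyMuPDF; optional so the sketch still imports without it
except ImportError:
    fitz = None


def extract_pdf_pages(data: bytes) -> List[str]:
    """Return the plain text of each page of an in-memory PDF."""
    if fitz is None:
        raise RuntimeError("PyMuPDF is not installed")
    with fitz.open(stream=data, filetype="pdf") as doc:
        return [page.get_text() for page in doc]


def build_response(pages: List[str], filename: str, file_size: int) -> dict:
    """Assemble a response body with the fields shown in the example above."""
    text = "\n".join(pages)  # assumed page separator
    return {
        "success": True,
        "method": "pymupdf",
        "text": text,
        "metadata": {
            "page_count": len(pages),
            "char_count": len(text),
            "file_size": file_size,
            "filename": filename,
        },
    }
```

Keeping the response assembly separate from the PyMuPDF call means the metadata logic can be tested without a real PDF.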
API Documentation
After starting the service, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Project Structure
extraction_service/
├── main.py # Main application entry point
├── requirements.txt # Python dependencies
├── .env.example # Example environment variables
├── README.md # This file
├── services/ # Service modules
│ ├── __init__.py
│ ├── pdf_extractor.py # PDF extraction (PyMuPDF)
│ ├── nougat_extractor.py # Nougat extraction (Day 2)
│ ├── docx_extractor.py # Docx extraction (Day 3)
│ ├── txt_extractor.py # Txt extraction (Day 3)
│ ├── language_detector.py # Language detection (Day 2)
│ └── file_utils.py # File utilities
└── tests/ # Test files (to be added)
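Development Plan
✅ Day 1 (done)
- FastAPI project setup
- PyMuPDF integration
- PDF text extraction
- Health check API
⏳ Day 2 (in progress)
- Install Nougat
- Language detection
- Nougat extraction logic
- Sequential fallback mechanism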
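The sequential fallback planned for Day 2 is not specified further here. One plausible reading is "try extractors in a fixed order and use the first one that yields text", which could look like the sketch below. The `extract_with_fallback` function, the result fields, and the Nougat-then-PyMuPDF ordering in the usage example are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Extractor = Callable[[bytes], str]


def extract_with_fallback(
    data: bytes, extractors: Sequence[Tuple[str, Extractor]]
) -> dict:
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:  # any failure moves on to the next extractor
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return {"success": True, "method": name, "text": text, "errors": errors}
        errors.append(f"{name}: empty result")
    return {"success": False, "method": None, "text": "", "errors": errors}


if __name__ == "__main__":
    def fake_nougat(_: bytes) -> str:
        raise RuntimeError("model not loaded")

    def fake_pymupdf(_: bytes) -> str:
        return "plain text"

    chain = [("nougat", fake_nougat), ("pymupdf", fake_pymupdf)]
    print(extract_with_fallback(b"%PDF-", chain))
```

Recording the error from each skipped extractor makes it possible to see why a lower-quality method ended up being used.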
⏳ Day 3
- Docx extraction (Mammoth)
- Txt extraction (multiple encodings)
- File format validation
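Multi-encoding Txt extraction is commonly handled by attempting a short list of candidate encodings in order. The sketch below shows that approach with the standard library only. The candidate list and the `decode_text` name are assumptions, not the planned txt_extractor.py implementation.

```python
from typing import Sequence, Tuple

# Assumed candidate order: strict encodings first, permissive ones last.
# "utf-8-sig" also decodes plain UTF-8 and strips a leading BOM if present.
DEFAULT_ENCODINGS = ("utf-8-sig", "gb18030", "latin-1")


def decode_text(
    data: bytes, encodings: Sequence[str] = DEFAULT_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with any candidate encoding")


if __name__ == "__main__":
    text, used = decode_text("中文".encode("gbk"))
    print(used, text)
```

Because latin-1 accepts every byte sequence, it only makes sense as the last entry. A detector-based approach would be needed to tell similar legacy encodings apart reliably.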
Dependencies
| Library | Version | Purpose |
|---|---|---|
| fastapi | 0.104.1 | Web framework |
| uvicorn | 0.24.0 | ASGI server |
| PyMuPDF | 1.23.8 | PDF text extraction |
| pdfplumber | 0.10.3 | PDF language detection |
| mammoth | 1.6.0 | Docx extraction |
| langdetect | 1.0.9 | Language detection |
| loguru | 0.7.2 | Logging |
Performance Targets
| Operation | Target time |
|---|---|
| 20-page PDF (PyMuPDF) | < 30 s |
| 10-page Docx | < 10 s |
| 1 MB Txt | < 5 s |
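FAQ
Q: PyMuPDF fails to install?
A: Make sure your Python version is >= 3.8, then install it with pip: pip install PyMuPDF
Q: The service won't start?
A: Check whether port 8000 is already in use; you can change SERVICE_PORT in .env.
Q: Where are temporary files stored?
A: By default in the /tmp/extraction_service directory; this can be changed with the TEMP_DIR environment variable.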
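To check quickly whether the port is already taken before changing SERVICE_PORT, a small standard-library probe like the following can help. The `is_port_in_use` helper is a hypothetical name and is not part of the service.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8000 in use:", is_port_in_use(8000))
```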
License
MIT