# Document Processing Engine

> **Capability tier:** Common capability layer  
> **Reuse rate:** 86% (6 dependent modules)  
> **Priority:** P0  
> **Status:** ✅ Implemented (Python microservice)

---

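## 📋 Capability Overview

The document processing engine is a core foundational capability of the platform. It is responsible for:

- Text extraction from multiple document formats (PDF, Docx, Txt, Excel)
- OCR processing
- Table extraction
- Language detection
- Quality assessment

---
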
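## 📊 Dependent Modules

**6 dependent modules (86% reuse rate):**

1. **ASL** - AI Intelligent Literature (literature PDF extraction)
2. **PKB** - Personal Knowledge Base (knowledge base document upload)
3. **DC** - Data Cleaning (Excel/Docx data import)
4. **SSA** - Intelligent Statistical Analysis (data import)
5. **ST** - Statistical Analysis Tools (data import)
6. **RVW** - Manuscript Review (manuscript document extraction)

---
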
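## 💡 Core Features

### 1. PDF Extraction

- **Nougat**: English academic papers (high quality)
- **PyMuPDF**: Chinese PDFs, and the fallback path (fast)
- **Language detection**: automatically distinguishes Chinese from English
- **Quality assessment**: scores the quality of the extracted text
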
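
The routing described above can be sketched as follows. This is only an illustration under stated assumptions, not the service's actual code: `detect_language`, `extract_pdf`, the CJK-ratio heuristic, and its 0.3 threshold are hypothetical, and the two extractors are passed in as plain callables so the sketch does not depend on the Nougat or PyMuPDF APIs.

```python
from typing import Callable, Dict


def detect_language(sample: str, cjk_threshold: float = 0.3) -> str:
    """Hypothetical heuristic: "zh" if CJK ideographs dominate the letters."""
    letters = [ch for ch in sample if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(letters) >= cjk_threshold else "en"


def extract_pdf(
    path: str,
    sample_text: str,
    nougat: Callable[[str], str],
    pymupdf: Callable[[str], str],
) -> Dict[str, str]:
    """Route English PDFs to Nougat; use PyMuPDF for Chinese and as fallback."""
    language = detect_language(sample_text)
    if language == "en":
        try:
            return {"method": "nougat", "language": language, "text": nougat(path)}
        except Exception:
            # Assumption: any Nougat failure falls through to the fast path.
            pass
    return {"method": "pymupdf", "language": language, "text": pymupdf(path)}
```

Here `sample_text` is assumed to be a short excerpt obtained from a quick first pass over the file. Where that excerpt really comes from is not stated in this document.
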
### 2. Docx Extraction

- **Mammoth**: converts to Markdown
- **python-docx**: structured reading

### 3. Txt Extraction

- **Multi-encoding support**: UTF-8, GBK, and others
- **chardet**: automatic encoding detection

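
A minimal sketch of multi-encoding decoding, assuming a "detector first, then a fixed candidate list" strategy. The function name `decode_text`, the candidate order, and the 0.8 confidence cutoff are illustrative assumptions, not taken from `txt_extractor.py`. The detector is an optional callable that returns a dict with `encoding` and `confidence` keys, which is the shape `chardet.detect` returns.

```python
from typing import Callable, Optional, Sequence, Tuple


def decode_text(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "gbk"),
    detect: Optional[Callable[[bytes], dict]] = None,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying a detector guess before candidates."""
    order = []
    if detect is not None:
        guess = detect(raw) or {}
        encoding = guess.get("encoding")
        # Only trust the detector when it is reasonably confident (assumed cutoff).
        if encoding and (guess.get("confidence") or 0.0) >= min_confidence:
            order.append(encoding)
    order.extend(enc for enc in candidates if enc not in order)

    for enc in order:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one
    # Last resort: never fail, but mark undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

With chardet installed, the detector can be supplied as `decode_text(raw, detect=chardet.detect)`. Without it, the function falls back to the candidate list alone.
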
### 4. Excel Processing

- **openpyxl**: reads Excel files
- **pandas**: data processing

---

## 🏗️ Technical Architecture

**Python microservice (FastAPI):**

```
extraction_service/
├── main.py (509 lines)                    - FastAPI main service
├── services/
│   ├── pdf_extractor.py (242 lines)       - PDF extraction coordinator
│   ├── pdf_processor.py (280 lines)       - PyMuPDF implementation
│   ├── language_detector.py (120 lines)   - Language detection
│   ├── nougat_extractor.py (242 lines)    - Nougat implementation
│   ├── docx_extractor.py (253 lines)      - Docx extraction
│   └── txt_extractor.py (316 lines)       - Txt extraction (multi-encoding)
└── requirements.txt
```

---

## 📚 API Endpoints

```
POST /api/extract/pdf     - PDF text extraction
POST /api/extract/docx    - Docx text extraction
POST /api/extract/txt     - Txt text extraction
POST /api/extract/excel   - Excel table extraction
GET  /health              - Health check
```

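
As a rough client-side illustration of calling the endpoints above, the sketch below builds a file-upload request using only the standard library. The request format is not specified in this document, so the `multipart/form-data` encoding, the form field name `file`, the base URL `http://localhost:8000`, and the helper name `build_extract_request` are all assumptions. Only the endpoint paths come from the list above. The request is built but not sent; passing it to `urllib.request.urlopen` would perform the call.

```python
import urllib.request
import uuid

# Endpoint paths as listed in this document.
EXTRACT_ENDPOINTS = {
    "pdf": "/api/extract/pdf",
    "docx": "/api/extract/docx",
    "txt": "/api/extract/txt",
    "excel": "/api/extract/excel",
}


def build_extract_request(
    base_url: str,
    kind: str,
    filename: str,
    data: bytes,
    field_name: str = "file",  # assumed form field name
) -> urllib.request.Request:
    """Build (but do not send) a multipart upload for an extraction endpoint."""
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        disposition.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    return urllib.request.Request(
        base_url.rstrip("/") + EXTRACT_ENDPOINTS[kind],
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Example: prepare an upload for a PDF (assumed local development URL).
request = build_extract_request(
    "http://localhost:8000", "pdf", "paper.pdf", b"%PDF-1.4 example bytes"
)
```
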

---

## 🔗 Related Documents

- [Common capability layer overview](../README.md)
- [Python microservice code](../../../extraction_service/)

---

**Last updated:** 2025-11-06  
**Maintainer:** Technical Architect