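# Document Processing Engine
> **Capability layer:** Common capability layer
> **Reuse rate:** 86% (6 dependent modules)
> **Priority:** P0
> **Status:** ✅ Implemented (Python microservice)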
---
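## 📋 Capability Overview
The document processing engine is a core foundational capability of the platform. It is responsible for:
- Multi-format text extraction (PDF, Docx, Txt, Excel)
- OCR processing
- Table extraction
- Language detection
- Quality assessment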
---
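## 📊 Dependent Modules
**6 dependent modules (86% reuse rate):**
1. **ASL** - AI Intelligent Literature (PDF extraction from literature)
2. **PKB** - Personal Knowledge Base (knowledge base document upload)
3. **DC** - Data Cleaning (Excel/Docx data import)
4. **SSA** - Smart Statistical Analysis (data import)
5. **ST** - Statistical Analysis Tools (data import)
6. **RVW** - Manuscript Review (manuscript document extraction)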
---
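## 💡 Core Features
### 1. PDF Extraction
- **Nougat**: English academic papers (high quality)
- **PyMuPDF**: Chinese PDFs and fallback path (fast)
- **Language detection**: automatically distinguishes Chinese from English
- **Quality assessment**: scores the quality of the extracted text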
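
The routing described in the list above can be sketched roughly as follows. This is a minimal illustration and not the service's actual code: the function names `detect_language`, `choose_extractor`, and `quality_score`, the 30% CJK threshold, and the scoring heuristic are all assumptions made for this example.

```python
import re

# Basic CJK Unified Ideographs block; enough for a coarse zh/en split.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_language(sample: str, cjk_threshold: float = 0.3) -> str:
    """Return 'zh', 'en', or 'unknown' from the share of CJK letters (assumed heuristic)."""
    letters = [ch for ch in sample if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk_count = sum(1 for ch in letters if _CJK.match(ch))
    return "zh" if cjk_count / len(letters) > cjk_threshold else "en"


def choose_extractor(language: str, nougat_available: bool) -> str:
    """Pick Nougat for English papers; otherwise fall back to PyMuPDF."""
    if language == "en" and nougat_available:
        return "nougat"
    return "pymupdf"


def quality_score(text: str) -> float:
    """Share of characters that are neither U+FFFD nor stray control characters."""
    if not text:
        return 0.0
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t")
    )
    return 1.0 - bad / len(text)
```

A caller would typically sample text from the first pages with the fast extractor, run `detect_language` on it, and only then decide whether the slower high-quality path is worth invoking.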
### 2. Docx Extraction
- **Mammoth**: converts to Markdown
- **python-docx**: structured reading
### 3. Txt Extraction
- **Multi-encoding support**: UTF-8, GBK, and others
- **chardet**: automatic encoding detection
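
As a rough illustration of multi-encoding support, the sketch below tries a fixed list of encodings in order. It deliberately uses only the standard library; in the service itself chardet handles detection, and a fallback chain like this would more plausibly sit behind it. The function name `decode_text` and the candidate order are assumptions for this example.

```python
from typing import Sequence, Tuple


def decode_text(
    data: bytes,
    candidates: Sequence[str] = ("utf-8-sig", "gbk"),
) -> Tuple[str, str]:
    """Decode bytes with the first candidate encoding that succeeds.

    Returns (text, encoding_used). "utf-8-sig" also accepts plain UTF-8
    and strips a leading byte order mark if one is present.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is readable and mark the result as lossy.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Trying UTF-8 first matters because strict UTF-8 decoding rejects most GBK byte sequences, whereas GBK will often accept UTF-8 bytes and silently produce the wrong characters.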
### 4. Excel Processing
- **openpyxl**: reads Excel files
- **pandas**: data processing
---
## 🏗️ Technical Architecture
**Python microservice (FastAPI)**
```
extraction_service/
├── main.py (509 lines) - FastAPI main service
├── services/
│   ├── pdf_extractor.py (242 lines) - PDF extraction coordinator
│   ├── pdf_processor.py (280 lines) - PyMuPDF implementation
│   ├── language_detector.py (120 lines) - Language detection
│   ├── nougat_extractor.py (242 lines) - Nougat implementation
│   ├── docx_extractor.py (253 lines) - Docx extraction
│   └── txt_extractor.py (316 lines) - Txt extraction (multi-encoding)
└── requirements.txt
```
---
## 📚 API Endpoints
```
POST /api/extract/pdf   - PDF text extraction
POST /api/extract/docx  - Docx text extraction
POST /api/extract/txt   - Txt text extraction
POST /api/extract/excel - Excel table extraction
GET  /health            - Health check
```
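
A calling module needs to pick one of these endpoints per file. The sketch below shows one way to do that by file extension. The endpoint paths come from the list above; the helper name `endpoint_for` and the extension mapping (in particular treating only `.xlsx` as Excel) are assumptions for illustration.

```python
from pathlib import PurePath

# Extension-to-endpoint mapping (assumed; paths taken from the endpoint list).
ENDPOINTS = {
    ".pdf": "/api/extract/pdf",
    ".docx": "/api/extract/docx",
    ".txt": "/api/extract/txt",
    ".xlsx": "/api/extract/excel",
}


def endpoint_for(filename: str) -> str:
    """Return the extraction endpoint path for a file, based on its suffix."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ENDPOINTS:
        raise ValueError(f"unsupported file type: {filename!r}")
    return ENDPOINTS[suffix]
```

Rejecting unknown extensions on the client side avoids uploading a file the service has no extractor for.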
---
## 🔗 Related Documents
- [Common capability layer overview](../README.md)
- [Python microservice code](../../../extraction_service/)
---
**Last updated:** 2025-11-06
**Maintainer:** Technical Architect