docs: complete documentation system (250+ files)

- System architecture and design documentation
- Business module docs (ASL/AIA/PKB/RVW/DC/SSA/ST)
- ASL module complete design (quality assurance, tech selection)
- Platform layer and common capabilities docs
- Development standards and API specifications
- Deployment and operations guides
- Project management and milestone tracking
- Architecture implementation reports
- Documentation templates and guides

2025-11-16 15:43:55 +08:00
parent 0fe6821a89
commit e52020409c
173 changed files with 46227 additions and 11964 deletions
--- a/docs/02-通用能力层/02-文档处理引擎/README.md
+++ b/docs/02-通用能力层/02-文档处理引擎/README.md
@@ -0,0 +1,107 @@
+# Document Processing Engine
+
+> **Capability tier:** Common capability layer  
+> **Reuse rate:** 86% (6 modules depend on it)  
+> **Priority:** P0  
+> **Status:** ✅ Implemented (Python microservice)
+
+---
+
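+## 📋 Capability Overview
+
+The document processing engine is a core foundational capability of the platform. It is responsible for:
+- Text extraction from multiple document formats (PDF, Docx, Txt, Excel)
+- OCR processing
+- Table extraction
+- Language detection
+- Quality assessment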
+
+---
+
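+## 📊 Dependent Modules
+
+**6 dependent modules (86% reuse rate):**
+1. **ASL** - AI Intelligent Literature (PDF extraction for literature)
+2. **PKB** - Personal Knowledge Base (knowledge-base document upload)
+3. **DC** - Data Cleaning (Excel/Docx data import)
+4. **SSA** - Smart Statistical Analysis (data import)
+5. **ST** - Statistical Tools (data import)
+6. **RVW** - Manuscript Review (manuscript document extraction)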
+
+---
+
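+## 💡 Core Features
+
+### 1. PDF Extraction
+- **Nougat**: English academic papers (high quality)
+- **PyMuPDF**: Chinese PDFs, plus the fallback path (fast)
+- **Language detection**: automatically distinguishes Chinese from English
+- **Quality assessment**: scores the quality of the extraction
+
+### 2. Docx Extraction
+- **Mammoth**: converts to Markdown
+- **python-docx**: structured reading
+
+### 3. Txt Extraction
+- **Multi-encoding support**: UTF-8, GBK, etc.
+- **chardet**: automatic encoding detection
+
+### 4. Excel Processing
+- **openpyxl**: reads Excel files
+- **pandas**: data processing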
+
+---
+
+## 🏗️ Technical Architecture
+
+**Python microservice (FastAPI):**
+```
+extraction_service/
+  ├── main.py (509 lines)                   - FastAPI main service
+  ├── services/
+  │   ├── pdf_extractor.py (242 lines)      - PDF extraction coordinator
+  │   ├── pdf_processor.py (280 lines)      - PyMuPDF implementation
+  │   ├── language_detector.py (120 lines)  - Language detection
+  │   ├── nougat_extractor.py (242 lines)   - Nougat implementation
+  │   ├── docx_extractor.py (253 lines)     - Docx extraction
+  │   └── txt_extractor.py (316 lines)      - Txt extraction (multi-encoding)
+  └── requirements.txt
+```
+
+---
+
+## 📚 API Endpoints
+
+```
+POST /api/extract/pdf      - PDF text extraction
+POST /api/extract/docx     - Docx text extraction
+POST /api/extract/txt      - Txt text extraction
+POST /api/extract/excel    - Excel table extraction
+GET  /health               - Health check
+```
+
+---
+
+## 🔗 Related Documents
+
+- [Common capability layer overview](../README.md)
+- [Python microservice code](../../../extraction_service/)
+
+---
+
+**Last updated:** 2025-11-06  
+**Maintainer:** Technical Architect
+
+
+
+
+
+
+
+
+
+
+
+
+
+
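The multi-encoding Txt extraction listed in the README above (UTF-8, GBK, etc.) can be approximated with a standard-library fallback chain. This is a minimal sketch under stated assumptions, not the code in `txt_extractor.py`: the function name `decode_text`, the candidate order, and the `latin-1` last resort are all hypothetical, and the README lists chardet for detection, which this sketch deliberately does not use so that it stays dependency-free.

```python
from typing import Iterable, Tuple

# Hypothetical candidate order; the README only says "UTF-8, GBK, etc."
DEFAULT_ENCODINGS = ("utf-8", "gbk")


def decode_text(
    data: bytes, encodings: Iterable[str] = DEFAULT_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # Assumed last resort: latin-1 maps every byte value, so it never raises,
    # although the resulting text may be meaningless for non-Latin input.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    text, used = decode_text("中文文档".encode("gbk"))
    print(used, text)
```

Trying strict UTF-8 first is a common choice because GBK-encoded Chinese text is rarely valid UTF-8, so a successful UTF-8 decode is a strong signal. A detector such as chardet would replace the trial loop when the set of candidate encodings is open-ended.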