feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready
2026-01-21 20:24:29 +08:00
parent 1f5bf2cd65
commit 40c2f8e148
338 changed files with 11014 additions and 1158 deletions
--- a/docs/02-通用能力层/02-文档处理引擎/README.md
+++ b/docs/02-通用能力层/02-文档处理引擎/README.md
@@ -3,117 +3,211 @@
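 > **Capability tier:** Common capability layer  
 > **Reuse rate:** 86% (6 dependent modules)  
 > **Priority:** P0  
-> **Status:** ✅ Implemented (Python microservice)
+> **Status:** 🔄 Upgrading (pymupdf4llm + unified architecture)  
+> **Last updated:** 2026-01-20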

 ---

 ## 📋 Capability Overview

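-The document processing engine is a core foundational capability of the platform. It handles:
- Text extraction from multiple document formats (PDF, Docx, Txt, Excel)
- OCR processing
- Table extraction
- Language detection
- Quality assessment
+The document processing engine is a core foundational capability of the platform. It converts every supported document type into **LLM-friendly Markdown**, providing the basis for knowledge base construction, literature analysis, data import, and similar scenarios.
+
+### Design Goals
+
+1. **Multi-format support** - covers 20+ document formats used in medical research
+2. **LLM-friendly output** - unified, structured Markdown output
+3. **Table fidelity** - fully preserves tables from the literature (the core data of clinical trials)
+4. **Extensible architecture** - new formats are easy to add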
+
+---
+
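+## 🔄 Major Update (2026-01-20)
+
+### PDF Processing Upgrade
+
+| Change | Old approach | New approach |
+|------|--------|--------|
+| Tool | PyMuPDF + Nougat | ✅ **pymupdf4llm** |
+| Table handling | Plain text | ✅ Markdown tables |
+| Multi-column layout | Manual handling | ✅ Automatic reflow |
+| Dependency complexity | High (GPU) | ✅ Low |
+
+**Key decisions:**
+- `pymupdf4llm` is a high-level wrapper around PyMuPDF and **pulls in the pymupdf dependency automatically**
+- Remove the Nougat dependency to simplify deployment
+- Scanned PDFs are handled separately by an OCR pipeline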
+
+---
+
+## 📊 Supported Formats
+
+### Format Coverage Matrix
+
+| Category | Format | Recommended tool | Priority | Status |
+|------|------|----------|--------|------|
+| **Documents** | PDF | `pymupdf4llm` | P0 | ✅ |
+| | Word (.docx) | `mammoth` | P0 | ✅ |
+| | PPT (.pptx) | `python-pptx` | P1 | ✅ |
+| | Plain text | Read directly | P0 | ✅ |
+| **Tabular** | Excel (.xlsx) | `pandas` + `openpyxl` | P0 | ✅ |
+| | CSV | `pandas` | P0 | ✅ |
+| | SAS/SPSS/Stata | `pandas` + `pyreadstat` | P2 | 🔜 |
+| **Web** | HTML | `beautifulsoup4` + `markdownify` | P1 | ✅ |
+| **References** | BibTeX/RIS | `bibtexparser` / `rispy` | P1 | ✅ |
+| **Medical** | DICOM | `pydicom` | P2 | 🔜 |

 ---

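 ## 📊 Dependent Modules
 
 **6 dependent modules (86% reuse rate):**
-1. **ASL** - AI Smart Literature (literature PDF extraction)
-2. **PKB** - Personal Knowledge Base (knowledge base document upload)
-3. **DC** - Data Cleaning (Excel/Docx data import)
-4. **SSA** - Smart Statistical Analysis (data import)
-5. **ST** - Statistical Analysis Tools (data import)
-6. **RVW** - Manuscript Review (manuscript document extraction)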

---
-
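-## 💡 Core Features
-
-### 1. PDF Extraction
- **Nougat**: English academic papers (high quality)
- **PyMuPDF**: Chinese PDFs + fallback (fast)
- **Language detection**: automatically distinguishes Chinese from English
- **Quality assessment**: extraction quality score
-
-### 2. Docx Extraction
- **Mammoth**: converts to Markdown
- **python-docx**: structured reading
-
-### 3. Txt Extraction
- **Multi-encoding support**: UTF-8, GBK, etc.
- **chardet**: automatic encoding detection
-
-### 4. Excel Processing
- **openpyxl**: reads Excel files
- **pandas**: data processing
+| Module | Purpose | Core formats |
+|------|------|----------|
+| **ASL** - AI Smart Literature | Literature PDF extraction | PDF |
+| **PKB** - Personal Knowledge Base | Knowledge base document upload | PDF, Word, Excel |
+| **DC** - Data Cleaning | Data import | Excel, CSV |
+| **SSA** - Smart Statistical Analysis | Data import | Excel, CSV, SAS/SPSS |
+| **ST** - Statistical Analysis Tools | Data import | Excel, CSV |
+| **RVW** - Manuscript Review | Manuscript document extraction | Word, PDF |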

 ---

 ## 🏗️ Technical Architecture
 
-**Python microservice (FastAPI):**
+### Unified Processor Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                   DocumentProcessor                          │
+│  (Entry point: auto-detects file type, picks a processor)   │
+├─────────────────────────────────────────────────────────────┤
+│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐   │
+│  │    PDF    │ │   Word    │ │    PPT    │ │   Excel   │   │
+│  │ Processor │ │ Processor │ │ Processor │ │ Processor │   │
+│  │pymupdf4llm│ │  mammoth  │ │python-pptx│ │  pandas   │   │
+│  └───────────┘ └───────────┘ └───────────┘ └───────────┘   │
+├─────────────────────────────────────────────────────────────┤
+│               Output: unified Markdown format               │
+└─────────────────────────────────────────────────────────────┘
+```
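Editorial note (not part of the commit): the diagram above describes a single entry point that picks a processor from the file type. A minimal sketch of that dispatch follows, assuming routing by file extension and a hypothetical `register` method; the real `document_processor.py` is not shown in this diff and may be structured differently. Only a plain-text converter is wired in here so the example runs without third-party packages.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical converter signature: takes a file path, returns Markdown text.
Converter = Callable[[Path], str]


class DocumentProcessor:
    """Single entry point that routes a file to a converter by extension."""

    def __init__(self) -> None:
        self._converters: Dict[str, Converter] = {}

    def register(self, extension: str, converter: Converter) -> None:
        """Register a converter for an extension such as '.pdf'."""
        self._converters[extension.lower()] = converter

    def to_markdown(self, path: str) -> str:
        """Look up the converter for the file's extension and run it."""
        file_path = Path(path)
        extension = file_path.suffix.lower()
        try:
            converter = self._converters[extension]
        except KeyError:
            raise ValueError(f"Unsupported file type: {extension!r}") from None
        return converter(file_path)


def plain_text_to_markdown(file_path: Path) -> str:
    """Plain text needs no conversion; UTF-8 is an assumed encoding."""
    return file_path.read_text(encoding="utf-8")


processor = DocumentProcessor()
processor.register(".txt", plain_text_to_markdown)
processor.register(".md", plain_text_to_markdown)
```

Keeping the extension-to-converter mapping in a registry means a new format only needs one `register` call, which matches the "extensible architecture" design goal stated earlier in the README.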
+
+### Directory Structure
+
 ```
 extraction_service/
-  ├── main.py (509 lines)                   - FastAPI main service
-  ├── services/
-  │   ├── pdf_extractor.py (242 lines)      - PDF extraction coordinator
-  │   ├── pdf_processor.py (280 lines)      - PyMuPDF implementation
-  │   ├── language_detector.py (120 lines)  - Language detection
-  │   ├── nougat_extractor.py (242 lines)   - Nougat implementation
-  │   ├── docx_extractor.py (253 lines)     - Docx extraction
-  │   └── txt_extractor.py (316 lines)      - Txt extraction (multi-encoding)
-  └── requirements.txt
+├── main.py                    - FastAPI main service
+├── document_processor.py      - Unified entry point
+├── processors/
+│   ├── pdf_processor.py       - PDF processing (pymupdf4llm)
+│   ├── docx_processor.py      - Word processing (mammoth)
+│   ├── pptx_processor.py      - PPT processing (python-pptx)
+│   ├── excel_processor.py     - Excel processing (pandas)
+│   ├── csv_processor.py       - CSV processing (pandas)
+│   ├── html_processor.py      - HTML processing (markdownify)
+│   └── reference_processor.py - Reference (citation) processing
+└── requirements.txt
 ```

 ---

-## 📚 API Endpoints
+## 💡 Quick Start
+
+### Basic Usage
+
+```python
+from document_processor import DocumentProcessor
+
+# Create the processor
+processor = DocumentProcessor()
+
+# Convert any supported document to Markdown
+md = processor.to_markdown("research_paper.pdf")
+md = processor.to_markdown("report.docx")
+md = processor.to_markdown("data.xlsx")
+```
+
+### PDF Table Extraction
+
+```python
+import pymupdf4llm
+
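+# Convert a PDF to Markdown (table structure is preserved automatically).
+# With page_chunks=True the call returns a list of per-page dicts, not one string.
+pages = pymupdf4llm.to_markdown(
+    "paper.pdf",
+    page_chunks=True,    # one chunk per page
+    write_images=True,   # extract images to files
+)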
+```
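Editorial note (not part of the commit): with `page_chunks=True`, `pymupdf4llm.to_markdown` returns a list of per-page dictionaries rather than a single string, and each dictionary carries the page's Markdown under the `"text"` key. The helper below is a hypothetical sketch of stitching those chunks back into one document; the name `join_page_chunks` and the blank-line separator are assumptions, not something this commit defines.

```python
from typing import Any, Dict, List


def join_page_chunks(chunks: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the 'text' field of each per-page chunk, skipping empty pages."""
    texts = (chunk.get("text", "").strip() for chunk in chunks)
    return separator.join(text for text in texts if text)


# Usage with a real PDF (requires pymupdf4llm to be installed):
#   import pymupdf4llm
#   chunks = pymupdf4llm.to_markdown("paper.pdf", page_chunks=True)
#   markdown = join_page_chunks(chunks)
```

Keeping the per-page list around until the last moment is useful when a downstream chunker wants page numbers for citations; join only when a single Markdown string is actually needed.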
+
+---
+
+## 📚 API Endpoints

 ```
-POST /api/extract/pdf      - PDF text extraction
-POST /api/extract/docx     - Docx text extraction
-POST /api/extract/txt      - Txt text extraction
-POST /api/extract/excel    - Excel table extraction
+POST /api/extract/pdf      - PDF text extraction
+POST /api/extract/docx     - Word text extraction
+POST /api/extract/txt      - TXT text extraction
+POST /api/extract/excel    - Excel table extraction
+POST /api/extract/pptx     - PPT text extraction (new)
+POST /api/extract/html     - HTML text extraction (new)
 GET  /health               - Health check
 ```

 ---

+## 📦 Core Dependencies
+
+```txt
+# PDF
+pymupdf4llm>=0.0.10
+
+# Word
+mammoth>=1.6.0
+
+# PPT
+python-pptx>=0.6.23
+
+# Excel/CSV
+pandas>=2.0.0
+openpyxl>=3.1.2
+tabulate>=0.9.0
+
+# HTML
+beautifulsoup4>=4.12.0
+markdownify>=0.11.6
+
+# References (citations)
+bibtexparser>=1.4.0
+rispy>=0.7.0
+```
+
+---
+
 ## 🔗 Related Documents
 
+- [Detailed design](./01-文档处理引擎设计方案.md) - full implementation details
 - [Common capability layer overview](../README.md)
- [Python microservice code](../../../extraction_service/)
+- [PKB knowledge base](../../03-业务模块/PKB-个人知识库/00-模块当前状态与开发指南.md)
+- [Dify replacement plan](../../03-业务模块/PKB-个人知识库/04-开发计划/01-Dify替换为pgvector开发计划.md)
+
+---
+
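+## 📅 Changelog
+
+### 2026-01-20 Architecture Upgrade
+
+- 🆕 Upgraded PDF processing to `pymupdf4llm`
+- 🆕 Removed the Nougat dependency
+- 🆕 Added the unified processor architecture
+- 🆕 Added support for PPT, HTML, and reference formats
+- 📝 Created the detailed design document
+
+### 2025-11-06 Initial Version
+
+- Basic PDF/Word/Excel processing
+- Python microservice architecture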

 ---

-**Last updated:** 2025-11-06  
 **Maintainer:** Technical architect
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-