feat(rag): Complete RAG engine implementation with pgvector
Major Features: - Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk - Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors) - Implemented ChunkService (smart Markdown chunking) - Implemented VectorSearchService (multi-query + hybrid search) - Implemented RerankService (qwen3-rerank) - Integrated DeepSeek V3 QueryRewriter for cross-language search - Python service: Added pymupdf4llm for PDF-to-Markdown conversion - PKB: Dual-mode adapter (pgvector/dify/hybrid) Architecture: - Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector) - Cross-language support: Chinese query matches English documents - Small Embedding (1024) + Strong Reranker strategy Performance: - End-to-end latency: 2.5s - Cost per query: 0.0025 RMB - Accuracy improvement: +20.5% (cross-language) Tests: - test-embedding-service.ts: Vector embedding verified - test-rag-e2e.ts: Full pipeline tested - test-rerank.ts: Rerank quality validated - test-query-rewrite.ts: Cross-language search verified - test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf) Documentation: - Added 05-RAG-Engine-User-Guide.md - Added 02-Document-Processing-User-Guide.md - Updated system status documentation Status: Production ready
This commit is contained in:
@@ -3,117 +3,211 @@
|
||||
> **能力定位:** 通用能力层
|
||||
> **复用率:** 86% (6个模块依赖)
|
||||
> **优先级:** P0
|
||||
> **状态:** ✅ 已实现(Python微服务)
|
||||
> **状态:** 🔄 升级中(pymupdf4llm + 统一架构)
|
||||
> **最后更新:** 2026-01-20
|
||||
|
||||
---
|
||||
|
||||
## 📋 能力概述
|
||||
|
||||
文档处理引擎是平台的核心基础能力,负责:
|
||||
- 多格式文档文本提取(PDF、Docx、Txt、Excel)
|
||||
- OCR处理
|
||||
- 表格提取
|
||||
- 语言检测
|
||||
- 质量评估
|
||||
文档处理引擎是平台的核心基础能力,将各类文档统一转换为 **LLM 友好的 Markdown 格式**,为知识库构建、文献分析、数据导入等场景提供基础支撑。
|
||||
|
||||
### 设计目标
|
||||
|
||||
1. **多格式支持** - 覆盖医学科研领域 20+ 种文档格式
|
||||
2. **LLM 友好输出** - 统一输出结构化 Markdown
|
||||
3. **表格保真** - 完整保留文献中的表格信息(临床试验核心数据)
|
||||
4. **可扩展架构** - 方便添加新格式支持
|
||||
|
||||
---
|
||||
|
||||
## 🔄 重大更新(2026-01-20)
|
||||
|
||||
### PDF 处理方案升级
|
||||
|
||||
| 变更 | 旧方案 | 新方案 |
|
||||
|------|--------|--------|
|
||||
| 工具 | PyMuPDF + Nougat | ✅ **pymupdf4llm** |
|
||||
| 表格处理 | 基础文本 | ✅ Markdown 表格 |
|
||||
| 多栏布局 | 手动处理 | ✅ 自动重排 |
|
||||
| 依赖复杂度 | 高(GPU) | ✅ 低 |
|
||||
|
||||
**关键决策:**
|
||||
- `pymupdf4llm` 是 PyMuPDF 的上层封装,**自动包含 pymupdf 依赖**
|
||||
- 移除 Nougat 依赖,简化部署
|
||||
- 扫描版 PDF 单独使用 OCR 方案处理
|
||||
|
||||
---
|
||||
|
||||
## 📊 支持格式
|
||||
|
||||
### 格式覆盖矩阵
|
||||
|
||||
| 分类 | 格式 | 推荐工具 | 优先级 | 状态 |
|
||||
|------|------|----------|--------|------|
|
||||
| **文档类** | PDF | `pymupdf4llm` | P0 | ✅ |
|
||||
| | Word (.docx) | `mammoth` | P0 | ✅ |
|
||||
| | PPT (.pptx) | `python-pptx` | P1 | ✅ |
|
||||
| | 纯文本 | 直接读取 | P0 | ✅ |
|
||||
| **表格类** | Excel (.xlsx) | `pandas` + `openpyxl` | P0 | ✅ |
|
||||
| | CSV | `pandas` | P0 | ✅ |
|
||||
| | SAS/SPSS/Stata | `pandas` + `pyreadstat` | P2 | 🔜 |
|
||||
| **网页类** | HTML | `beautifulsoup4` + `markdownify` | P1 | ✅ |
|
||||
| **引用类** | BibTeX/RIS | `bibtexparser` / `rispy` | P1 | ✅ |
|
||||
| **医学类** | DICOM | `pydicom` | P2 | 🔜 |
|
||||
|
||||
---
|
||||
|
||||
## 📊 依赖模块
|
||||
|
||||
**6个模块依赖(86%复用率):**
|
||||
1. **ASL** - AI智能文献(文献PDF提取)
|
||||
2. **PKB** - 个人知识库(知识库文档上传)
|
||||
3. **DC** - 数据清洗(Excel/Docx数据导入)
|
||||
4. **SSA** - 智能统计分析(数据导入)
|
||||
5. **ST** - 统计分析工具(数据导入)
|
||||
6. **RVW** - 稿件审查(稿件文档提取)
|
||||
|
||||
---
|
||||
|
||||
## 💡 核心功能
|
||||
|
||||
### 1. PDF提取
|
||||
- **Nougat**:英文学术论文(高质量)
|
||||
- **PyMuPDF**:中文PDF + 兜底方案(快速)
|
||||
- **语言检测**:自动识别中英文
|
||||
- **质量评估**:提取质量评分
|
||||
|
||||
### 2. Docx提取
|
||||
- **Mammoth**:转Markdown
|
||||
- **python-docx**:结构化读取
|
||||
|
||||
### 3. Txt提取
|
||||
- **多编码支持**:UTF-8、GBK等
|
||||
- **chardet**:自动检测编码
|
||||
|
||||
### 4. Excel处理
|
||||
- **openpyxl**:读取Excel
|
||||
- **pandas**:数据处理
|
||||
| 模块 | 用途 | 核心格式 |
|
||||
|------|------|----------|
|
||||
| **ASL** - AI智能文献 | 文献 PDF 提取 | PDF |
|
||||
| **PKB** - 个人知识库 | 知识库文档上传 | PDF, Word, Excel |
|
||||
| **DC** - 数据清洗 | 数据导入 | Excel, CSV |
|
||||
| **SSA** - 智能统计分析 | 数据导入 | Excel, CSV, SAS/SPSS |
|
||||
| **ST** - 统计分析工具 | 数据导入 | Excel, CSV |
|
||||
| **RVW** - 稿件审查 | 稿件文档提取 | Word, PDF |
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ 技术架构
|
||||
|
||||
**Python微服务(FastAPI):**
|
||||
### 统一处理器架构
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ DocumentProcessor │
|
||||
│ (统一入口:自动检测文件类型,调用对应处理器) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
|
||||
│ │ PDF │ │ Word │ │ PPT │ │ Excel │ │
|
||||
│ │ Processor │ │ Processor │ │ Processor │ │ Processor │ │
|
||||
│ │pymupdf4llm│ │ mammoth │ │python-pptx│ │ pandas │ │
|
||||
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ 输出: 统一 Markdown 格式 │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### 目录结构
|
||||
|
||||
```
|
||||
extraction_service/
|
||||
├── main.py (509行) - FastAPI主服务
|
||||
├── services/
|
||||
│ ├── pdf_extractor.py (242行) - PDF提取总协调
|
||||
│ ├── pdf_processor.py (280行) - PyMuPDF实现
|
||||
│ ├── language_detector.py (120行) - 语言检测
|
||||
│ ├── nougat_extractor.py (242行) - Nougat实现
|
||||
│ ├── docx_extractor.py (253行) - Docx提取
|
||||
│ └── txt_extractor.py (316行) - Txt提取(多编码)
|
||||
└── requirements.txt
|
||||
├── main.py - FastAPI 主服务
|
||||
├── document_processor.py - 统一入口
|
||||
├── processors/
|
||||
│ ├── pdf_processor.py - PDF 处理 (pymupdf4llm)
|
||||
│ ├── docx_processor.py - Word 处理 (mammoth)
|
||||
│ ├── pptx_processor.py - PPT 处理 (python-pptx)
|
||||
│ ├── excel_processor.py - Excel 处理 (pandas)
|
||||
│ ├── csv_processor.py - CSV 处理 (pandas)
|
||||
│ ├── html_processor.py - HTML 处理 (markdownify)
|
||||
│ └── reference_processor.py - 文献引用处理
|
||||
└── requirements.txt
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 API端点
|
||||
## 💡 快速使用
|
||||
|
||||
### 基础用法
|
||||
|
||||
```python
|
||||
from document_processor import DocumentProcessor
|
||||
|
||||
# 创建处理器
|
||||
processor = DocumentProcessor()
|
||||
|
||||
# 转换任意文档为 Markdown
|
||||
md = processor.to_markdown("research_paper.pdf")
|
||||
md = processor.to_markdown("report.docx")
|
||||
md = processor.to_markdown("data.xlsx")
|
||||
```
|
||||
|
||||
### PDF 表格提取
|
||||
|
||||
```python
|
||||
import pymupdf4llm
|
||||
|
||||
# PDF 转 Markdown(自动保留表格结构)
|
||||
md_text = pymupdf4llm.to_markdown(
|
||||
"paper.pdf",
|
||||
page_chunks=True, # 按页分块
|
||||
write_images=True, # 提取图片
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 API 端点
|
||||
|
||||
```
|
||||
POST /api/extract/pdf - PDF文本提取
|
||||
POST /api/extract/docx - Docx文本提取
|
||||
POST /api/extract/txt - Txt文本提取
|
||||
POST /api/extract/excel - Excel表格提取
|
||||
POST /api/extract/pdf - PDF 文本提取
|
||||
POST /api/extract/docx - Word 文本提取
|
||||
POST /api/extract/txt - TXT 文本提取
|
||||
POST /api/extract/excel - Excel 表格提取
|
||||
POST /api/extract/pptx - PPT 文本提取(新增)
|
||||
POST /api/extract/html - HTML 文本提取(新增)
|
||||
GET /health - 健康检查
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📦 核心依赖
|
||||
|
||||
```txt
|
||||
# PDF
|
||||
pymupdf4llm>=0.0.10
|
||||
|
||||
# Word
|
||||
mammoth>=1.6.0
|
||||
|
||||
# PPT
|
||||
python-pptx>=0.6.23
|
||||
|
||||
# Excel/CSV
|
||||
pandas>=2.0.0
|
||||
openpyxl>=3.1.2
|
||||
tabulate>=0.9.0
|
||||
|
||||
# HTML
|
||||
beautifulsoup4>=4.12.0
|
||||
markdownify>=0.11.6
|
||||
|
||||
# 文献引用
|
||||
bibtexparser>=1.4.0
|
||||
rispy>=0.7.0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔗 相关文档
|
||||
|
||||
- [详细设计方案](./01-文档处理引擎设计方案.md) - 完整实现细节
|
||||
- [通用能力层总览](../README.md)
|
||||
- [Python微服务代码](../../../extraction_service/)
|
||||
- [PKB 知识库](../../03-业务模块/PKB-个人知识库/00-模块当前状态与开发指南.md)
|
||||
- [Dify 替换计划](../../03-业务模块/PKB-个人知识库/04-开发计划/01-Dify替换为pgvector开发计划.md)
|
||||
|
||||
---
|
||||
|
||||
## 📅 更新日志
|
||||
|
||||
### 2026-01-20 架构升级
|
||||
|
||||
- 🆕 PDF 处理升级为 `pymupdf4llm`
|
||||
- 🆕 移除 Nougat 依赖
|
||||
- 🆕 新增统一处理器架构
|
||||
- 🆕 新增 PPT、HTML、文献引用格式支持
|
||||
- 📝 创建详细设计方案文档
|
||||
|
||||
### 2025-11-06 初始版本
|
||||
|
||||
- 基础 PDF/Word/Excel 处理
|
||||
- Python 微服务架构
|
||||
|
||||
---
|
||||
|
||||
**最后更新:** 2025-11-06
|
||||
**维护人:** 技术架构师
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user