Files

HaHafeng 40c2f8e148 feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready

2026-01-21 20:24:29 +08:00

01-文档处理引擎设计方案.md

feat(rag): Complete RAG engine implementation with pgvector

2026-01-21 20:24:29 +08:00

02-文档处理引擎使用指南.md

feat(rag): Complete RAG engine implementation with pgvector

2026-01-21 20:24:29 +08:00

README.md

feat(rag): Complete RAG engine implementation with pgvector

2026-01-21 20:24:29 +08:00

README.md

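Document Processing Engine

Capability tier: Common capability layer
Reuse rate: 86% (6 dependent modules)
Priority: P0
Status: 🔄 Upgrading (pymupdf4llm + unified architecture)
Last updated: 2026-01-20

📋 Capability Overview

The document processing engine is a core foundational capability of the platform. It converts documents of all kinds into LLM-friendly Markdown and underpins knowledge base construction, literature analysis, data import, and similar scenarios.

Design Goals

Multi-format support - covers 20+ document formats used in medical research
LLM-friendly output - every format is converted to structured Markdown
Table fidelity - tables in the literature are preserved in full (they carry the core data of clinical trials)
Extensible architecture - support for new formats is easy to add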

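🔄 Major Update (2026-01-20)

PDF Processing Upgrade

Change	Old approach	New approach
Tool	PyMuPDF + Nougat	✅ pymupdf4llm
Table handling	Plain text only	✅ Markdown tables
Multi-column layout	Manual handling	✅ Automatic reflow
Dependency complexity	High (GPU)	✅ Low

Key decisions:

pymupdf4llm is a higher-level wrapper around PyMuPDF and pulls in the pymupdf dependency automatically
The Nougat dependency is removed to simplify deployment
Scanned PDFs are handled separately by an OCR pipeline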
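The README does not say how a scanned PDF is recognized before it is routed to OCR. One common heuristic, sketched here purely as an assumption, is to look at how much text each page yields: a scan usually produces little or none. The function names and both thresholds are hypothetical and would need tuning on real documents.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    min_chars_per_page: int = 50,    # assumed threshold, not from the README
    max_sparse_ratio: float = 0.8,   # assumed threshold, not from the README
) -> bool:
    """Guess whether a PDF is a scan from the text extracted per page."""
    if not page_texts:
        return True  # nothing extractable at all: treat as scanned
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) >= max_sparse_ratio


def choose_pipeline(page_texts: Sequence[str]) -> str:
    """Return a label for the (hypothetical) route a document should take."""
    return "ocr" if looks_scanned(page_texts) else "pymupdf4llm"
```

The per-page texts could come from any text extraction pass; the sketch deliberately takes plain strings so it does not depend on a particular PDF library.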

📊 Supported Formats

Format Coverage Matrix

Category	Format	Recommended tool	Priority	Status
Documents	PDF	`pymupdf4llm`	P0	✅
	Word (.docx)	`mammoth`	P0	✅
	PPT (.pptx)	`python-pptx`	P1	✅
	Plain text	Read directly	P0	✅
Tabular data	Excel (.xlsx)	`pandas` + `openpyxl`	P0	✅
	CSV	`pandas`	P0	✅
	SAS/SPSS/Stata	`pandas` + `pyreadstat`	P2	🔜
Web	HTML	`beautifulsoup4` + `markdownify`	P1	✅
References	BibTeX/RIS	`bibtexparser` / `rispy`	P1	✅
Medical	DICOM	`pydicom`	P2	🔜
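To make the target output for tabular formats concrete, here is a minimal sketch that turns CSV text into a Markdown table. It uses only the standard library so it runs anywhere; the engine itself lists `pandas` for this job, so treat `csv_to_markdown` as a hypothetical stand-in that illustrates the output shape, not as the project's implementation. The sample values are made up.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a pipe-delimited Markdown table (first row = header)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes and pad short rows so a cell cannot break the table.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *map(render, body)])


# Made-up sample values, for illustration only.
print(csv_to_markdown("group,n\nA,10\nB,12\n"))
```

With `pandas` and `tabulate` installed, `DataFrame.to_markdown(index=False)` produces a comparable table in one call.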

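📊 Dependent Modules

6 dependent modules (86% reuse rate):

Module	Purpose	Core formats
ASL - AI Smart Literature	Literature PDF extraction	PDF
PKB - Personal Knowledge Base	Knowledge base document upload	PDF, Word, Excel
DC - Data Cleaning	Data import	Excel, CSV
SSA - Smart Statistical Analysis	Data import	Excel, CSV, SAS/SPSS
ST - Statistical Analysis Tools	Data import	Excel, CSV
RVW - Manuscript Review	Manuscript document extraction	Word, PDF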

🏗️ Technical Architecture

Unified Processor Architecture

┌─────────────────────────────────────────────────────────────┐
│                      DocumentProcessor                      │
│  (Single entry: detects file type, calls its processor)     │
├─────────────────────────────────────────────────────────────┤
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐    │
│  │    PDF    │ │   Word    │ │    PPT    │ │   Excel   │    │
│  │ Processor │ │ Processor │ │ Processor │ │ Processor │    │
│  │pymupdf4llm│ │  mammoth  │ │python-pptx│ │  pandas   │    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘    │
├─────────────────────────────────────────────────────────────┤
│               Output: unified Markdown format               │
└─────────────────────────────────────────────────────────────┘
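The diagram describes a single entry point that inspects the file type and hands off to a format-specific processor. A minimal sketch of that dispatch pattern might look like the following. The `register` method, the `UnsupportedFormatError` class, and the extension-keyed registry are assumptions made for illustration; the README only confirms that `DocumentProcessor()` exposes `to_markdown(path)`.

```python
from pathlib import Path
from typing import Callable, Dict

Processor = Callable[[Path], str]


class UnsupportedFormatError(ValueError):
    """Raised when no processor is registered for a file extension."""


class DocumentProcessor:
    """Single entry point that picks a processor from the file extension."""

    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}

    def register(self, extension: str, processor: Processor) -> None:
        # Hypothetical hook: the real project may wire processors differently.
        self._processors[extension.lower()] = processor

    def to_markdown(self, path: str) -> str:
        file_path = Path(path)
        ext = file_path.suffix.lower()
        try:
            processor = self._processors[ext]
        except KeyError:
            raise UnsupportedFormatError(f"no processor for {ext!r}") from None
        return processor(file_path)


def read_plain_text(path: Path) -> str:
    # Plain text needs no conversion, so it is simply read from disk.
    return path.read_text(encoding="utf-8")


processor = DocumentProcessor()
processor.register(".txt", read_plain_text)
```

Keying the registry on the lowercased extension keeps the entry point unaware of individual formats, which is one way to get the extensibility the design goals ask for: adding a format means registering one more callable.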

Directory Structure

extraction_service/
├── main.py                    - FastAPI main service
├── document_processor.py      - Unified entry point
├── processors/
│   ├── pdf_processor.py       - PDF processing (pymupdf4llm)
│   ├── docx_processor.py      - Word processing (mammoth)
│   ├── pptx_processor.py      - PPT processing (python-pptx)
│   ├── excel_processor.py     - Excel processing (pandas)
│   ├── csv_processor.py       - CSV processing (pandas)
│   ├── html_processor.py      - HTML processing (markdownify)
│   └── reference_processor.py - Reference processing
└── requirements.txt

💡 Quick Start

Basic Usage

from document_processor import DocumentProcessor

# Create the processor
processor = DocumentProcessor()

# Convert any supported document to Markdown
md = processor.to_markdown("research_paper.pdf")
md = processor.to_markdown("report.docx")
md = processor.to_markdown("data.xlsx")

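PDF Table Extraction

import pymupdf4llm

# Convert a PDF to Markdown (table structure is preserved automatically).
# With page_chunks=True the call returns a list of per-page dicts
# rather than a single Markdown string.
page_chunks = pymupdf4llm.to_markdown(
    "paper.pdf",
    page_chunks=True,    # one chunk per page
    write_images=True,   # extract images
)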
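When `page_chunks=True` is passed, `pymupdf4llm.to_markdown` returns one dictionary per page instead of a single string, so downstream code usually has to stitch the pages back together. The sketch below shows one way to do that. It assumes each chunk carries a `"text"` key and a `"metadata"` dictionary with a `"page"` number; `join_page_chunks` and the HTML-comment page marker are hypothetical choices, and the sample list is a stand-in so the example runs without a PDF.

```python
from typing import Any, Dict, List


def join_page_chunks(chunks: List[Dict[str, Any]]) -> str:
    """Concatenate per-page Markdown, marking each page with an HTML comment."""
    parts = []
    for index, chunk in enumerate(chunks, start=1):
        # Fall back to the position in the list if no page number is present.
        page = chunk.get("metadata", {}).get("page", index)
        parts.append(f"<!-- page {page} -->\n{chunk['text'].strip()}")
    return "\n\n".join(parts)


# Stand-in for the list returned by to_markdown(..., page_chunks=True).
sample = [
    {"metadata": {"page": 1}, "text": "# Title\n"},
    {"metadata": {"page": 2}, "text": "| a | b |\n|---|---|\n| 1 | 2 |\n"},
]
print(join_page_chunks(sample))
```

Keeping a page marker in the joined text makes it possible to trace a retrieved passage back to its source page later.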

📚 API Endpoints

POST /api/extract/pdf      - PDF text extraction
POST /api/extract/docx     - Word text extraction
POST /api/extract/txt      - TXT text extraction
POST /api/extract/excel    - Excel table extraction
POST /api/extract/pptx     - PPT text extraction (new)
POST /api/extract/html     - HTML text extraction (new)
GET  /health               - Health check
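A client has to pick the right endpoint for each file. The helper below is a small sketch of that choice using the paths from the endpoint list. The pairing of file extensions with endpoints (for example `.xlsx` with `/api/extract/excel`), the function names, and the base URL in the usage note are assumptions; the README does not document the request body, so the sketch stops at building the URL.

```python
from pathlib import PurePath

# Endpoint paths come from the list above; the extension keys are assumed.
ENDPOINTS = {
    ".pdf": "/api/extract/pdf",
    ".docx": "/api/extract/docx",
    ".txt": "/api/extract/txt",
    ".xlsx": "/api/extract/excel",
    ".pptx": "/api/extract/pptx",
    ".html": "/api/extract/html",
}


def endpoint_for(filename: str) -> str:
    """Return the extraction endpoint path for a file name."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ENDPOINTS:
        raise ValueError(f"no extraction endpoint for {ext!r}")
    return ENDPOINTS[ext]


def extraction_url(base_url: str, filename: str) -> str:
    """Join a service base URL with the endpoint for the given file."""
    return base_url.rstrip("/") + endpoint_for(filename)
```

For example, `extraction_url("http://localhost:8000", "data.xlsx")` yields `http://localhost:8000/api/extract/excel`, where the host and port are placeholders.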

📦 Core Dependencies

# PDF
pymupdf4llm>=0.0.10

# Word
mammoth>=1.6.0

# PPT
python-pptx>=0.6.23

# Excel/CSV
pandas>=2.0.0
openpyxl>=3.1.2
tabulate>=0.9.0

# HTML
beautifulsoup4>=4.12.0
markdownify>=0.11.6

# References
bibtexparser>=1.4.0
rispy>=0.7.0
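Before starting the service it can be useful to confirm that the listed packages are actually installed. The following sketch checks distribution names with the standard library's `importlib.metadata`. The `missing_packages` helper is hypothetical, and it only tests for presence: the minimum versions from the list are not compared.

```python
from importlib import metadata
from typing import Iterable, List

# Distribution names taken from the dependency list above.
CORE_PACKAGES = [
    "pymupdf4llm", "mammoth", "python-pptx", "pandas", "openpyxl",
    "tabulate", "beautifulsoup4", "markdownify", "bibtexparser", "rispy",
]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(CORE_PACKAGES)
    print("missing:", ", ".join(absent) if absent else "none")
```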

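🔗 Related Documentation

📅 Changelog

2026-01-20 Architecture Upgrade

🆕 PDF processing upgraded to pymupdf4llm
🆕 Removed the Nougat dependency
🆕 Added the unified processor architecture
🆕 Added support for PPT, HTML, and reference formats
📝 Created the detailed design document

2025-11-06 Initial Release

Basic PDF/Word/Excel processing
Python microservice architecture

Maintainer: Technical Architect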
