# **Document Processing Engine Design**
Version: v1.2 (minimalist edition)
Last updated: 2026-01-20
Key change: removed PaddleOCR in pursuit of a minimal footprint
Scope: PKB knowledge base, ASL intelligent literature, DC data cleaning
## **📋 Overview**
### **Design Goals**
Build an "ultra-lightweight, zero-OCR, LLM-friendly" document parsing microservice.
Core principle: handle only editable (born-digital) documents and drop scanned-document support, in exchange for fast deployment and low resource usage. For a two-person team this means prioritising ruthlessly: make PDF/Word/Excel handling rock-solid and drop niche formats.
### **Architecture Overview (Pipeline)**
```mermaid
graph LR
    Input[Document input] --> Router{Format routing}
    Router -->|PDF| pymupdf4llm[pymupdf4llm]
    pymupdf4llm -->|success| MD_Out
    pymupdf4llm -->|too little text| Error[Error: scanned PDFs unsupported]
    Router -->|Word| Mammoth[Mammoth]
    Router -->|PPT| Pptx[python-pptx]
    Router -->|Excel/CSV| Pandas[Pandas + context]
    Mammoth --> MD_Out
    Pptx --> MD_Out
    Pandas --> MD_Out[Markdown output]
```
## **🔧 Core Implementation**
### **1. PDF Processing (minimalist)**
Strategy: use pymupdf4llm only.
Logic: attempt to parse -> if the extracted text is too short -> surface an error (so the frontend can prompt the user to upload a born-digital copy).
#### **Code (pdf_processor.py)**
```python
import pymupdf4llm
import logging

logger = logging.getLogger(__name__)

class PdfProcessor:
    def to_markdown(self, pdf_path: str) -> str:
        """Convert a PDF to Markdown (born-digital PDFs only)."""
        try:
            # 1. Fast parse, preserving table structure
            md_text = pymupdf4llm.to_markdown(pdf_path, show_progress=False)
            # 2. Quality check: almost no text (<50 chars) => likely a scan
            if len(md_text.strip()) < 50:
                msg = (f"Parse failed: extracted only {len(md_text)} characters. "
                       "Likely a scanned PDF, which this system does not support.")
                logger.warning(msg)
                # Design choice: return an empty string and let the pipeline
                # continue, or raise? We return a notice instead, so the LLM
                # knows this file could not be read.
                return ("> **System notice**: this document appears to be a "
                        "scan (images only); no text could be extracted.")
            return md_text
        except Exception as e:
            logger.error(f"pymupdf4llm failed: {e}")
            raise ValueError(f"PDF parsing failed: {e}")
```
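The 50-character threshold above can be isolated into a reusable heuristic, which keeps the cutoff in one place if other processors later need the same check (the function name is ours, not from the codebase):

```python
def looks_scanned(md_text: str, min_chars: int = 50) -> bool:
    """Heuristic: a PDF whose extracted Markdown is nearly empty is
    probably a scanned (image-only) document."""
    return len(md_text.strip()) < min_chars
```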
### **2. Word Processing**
**Strategy**: mammoth. Lightweight, fast, and produces clean HTML/Markdown.
#### **Code (docx_processor.py)**
```python
import mammoth

class DocxProcessor:
    def to_markdown(self, docx_path: str) -> str:
        with open(docx_path, "rb") as f:
            result = mammoth.convert_to_markdown(f)
        if not result.value.strip():
            return "> **System notice**: the Word document is empty or could not be read."
        return result.value
```
### **3. Excel/CSV Processing**
**Strategy**: pandas, with filename context added to the output.
#### **Code (excel_processor.py)**
```python
import pandas as pd
import os

class ExcelProcessor:
    def to_markdown(self, file_path: str, max_rows: int = 200) -> str:
        """Convert Excel/CSV to Markdown."""
        ext = os.path.splitext(file_path)[1].lower()
        filename = os.path.basename(file_path)
        md_output = []
        try:
            if ext == '.csv':
                dfs = {'Sheet1': pd.read_csv(file_path)}
            else:
                dfs = pd.read_excel(file_path, sheet_name=None)
            for sheet_name, df in dfs.items():
                md_output.append(f"## Data source: {filename} - {sheet_name}")
                md_output.append(f"- **Shape**: {len(df)} rows x {len(df.columns)} columns")
                if len(df) > max_rows:
                    md_output.append(f"> (showing first {max_rows} rows only)")
                    df = df.head(max_rows)
                df = df.fillna('')
                md_output.append(df.to_markdown(index=False))
                md_output.append("\n---\n")
            return "\n".join(md_output)
        except Exception as e:
            return f"Error processing Excel: {e}"
```
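To see the intended output shape without pulling in pandas, here is a stdlib-only sketch of the same idea for the CSV branch (the function name and exact wording are illustrative, not from the service):

```python
import csv
import os

def csv_to_markdown(file_path: str, max_rows: int = 200) -> str:
    """Minimal CSV -> Markdown table, with filename context in the header."""
    filename = os.path.basename(file_path)
    with open(file_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        return f"## Data source: {filename}\n> (empty file)"
    header, body = rows[0], rows[1:]
    lines = [
        f"## Data source: {filename}",
        f"- **Shape**: {len(body)} rows x {len(header)} columns",
    ]
    if len(body) > max_rows:
        lines.append(f"> (showing first {max_rows} rows only)")
        body = body[:max_rows]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("|" + " --- |" * len(header))
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```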
## **🏗️ Unified Entry Point (document_processor.py)**
```python
import os
from .pdf_processor import PdfProcessor
from .docx_processor import DocxProcessor
from .excel_processor import ExcelProcessor
from .pptx_processor import PptxProcessor

class DocumentProcessor:
    def __init__(self):
        self.pdf = PdfProcessor()
        self.docx = DocxProcessor()
        self.excel = ExcelProcessor()
        self.pptx = PptxProcessor()

    def process(self, file_path: str) -> str:
        ext = os.path.splitext(file_path)[1].lower()
        if ext == '.pdf':
            return self.pdf.to_markdown(file_path)
        elif ext == '.docx':
            # Note: mammoth only reads .docx; legacy binary .doc is not supported.
            return self.docx.to_markdown(file_path)
        elif ext in ['.xlsx', '.xls', '.csv']:
            return self.excel.to_markdown(file_path)
        elif ext == '.pptx':
            return self.pptx.to_markdown(file_path)
        elif ext in ['.txt', '.md']:
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                return f.read()
        else:
            return f"Unsupported file format: {ext}"
```
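The `.txt`/`.md` branch above uses `errors='ignore'`, which silently drops undecodable bytes. A gentler stdlib fallback chain might look like the sketch below; chardet (already in the dependency list) could replace the hard-coded guessing, and the encoding order here is an assumption tuned for a Chinese/English corpus:

```python
def read_text_best_effort(path: str,
                          encodings: tuple = ("utf-8", "gb18030", "latin-1")) -> str:
    """Try each encoding in order. latin-1 at the end always decodes,
    so no bytes are silently dropped."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the chain, kept as a safety net.
    return raw.decode("utf-8", errors="replace")
```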
## **📦 Minimal Dependency List (requirements.txt)**
**Size estimate**
* The compressed Docker image should come in around **200-300 MB**.
* Compared with the PaddleOCR variant (1.5 GB+), that is more than a 5x reduction.
```text
# Core parsers
pymupdf4llm>=0.0.17
mammoth>=1.8.0
python-pptx>=1.0.2
pandas>=2.2.0
openpyxl>=3.1.5
tabulate>=0.9.0
# Base tooling
chardet>=5.2.0
fastapi>=0.109.0
uvicorn>=0.27.0
python-multipart>=0.0.9
```
## **🚀 Deployment Notes**
1. **Docker base image**: python:3.11-slim is very small and sufficient.
2. **Resource limits**: the service can run in a micro container with **0.5 CPU core / 512 MB RAM**.
3. **User guidance**: add a small note to the upload UI: "Only born-digital PDFs are supported for now; scanned PDFs and images are not." This is far more cost-effective than building a complex OCR pipeline on the backend.
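A minimal Dockerfile consistent with the notes above; the `main:app` module path is an assumption about how the FastAPI app is laid out:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```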