feat(rag): Complete RAG engine implementation with pgvector
Major Features: - Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk - Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors) - Implemented ChunkService (smart Markdown chunking) - Implemented VectorSearchService (multi-query + hybrid search) - Implemented RerankService (qwen3-rerank) - Integrated DeepSeek V3 QueryRewriter for cross-language search - Python service: Added pymupdf4llm for PDF-to-Markdown conversion - PKB: Dual-mode adapter (pgvector/dify/hybrid) Architecture: - Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector) - Cross-language support: Chinese query matches English documents - Small Embedding (1024) + Strong Reranker strategy Performance: - End-to-end latency: 2.5s - Cost per query: 0.0025 RMB - Accuracy improvement: +20.5% (cross-language) Tests: - test-embedding-service.ts: Vector embedding verified - test-rag-e2e.ts: Full pipeline tested - test-rerank.ts: Rerank quality validated - test-query-rewrite.ts: Cross-language search verified - test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf) Documentation: - Added 05-RAG-Engine-User-Guide.md - Added 02-Document-Processing-User-Guide.md - Updated system status documentation Status: Production ready
This commit is contained in:
185
docs/08-项目管理/01-文档处理引擎设计方案_v1.2.md
Normal file
185
docs/08-项目管理/01-文档处理引擎设计方案_v1.2.md
Normal file
@@ -0,0 +1,185 @@
|
||||
# **文档处理引擎设计方案**
|
||||
|
||||
文档版本: v1.2 (极简版)
|
||||
更新日期: 2026-01-20
|
||||
核心变更: 移除 PaddleOCR,追求极致轻量化
|
||||
适用范围: PKB 知识库、ASL 智能文献、DC 数据清洗
|
||||
|
||||
## **📋 概述**
|
||||
|
||||
### **设计目标**
|
||||
|
||||
构建一个 "极轻量、零OCR、LLM 友好" 的文档解析微服务。
|
||||
核心原则:只处理可编辑文档(电子版),放弃扫描件支持,换取极致的部署速度和低资源占用。
|
||||
|
||||
构建一个 "容错性强、LLM 友好" 的文档解析微服务。对于 2 人团队,核心原则是:抓大放小,确保 PDF/Word/Excel 的绝对准确,放弃冷门格式。
|
||||
|
||||
### **架构概览 (Pipeline)**
|
||||
|
||||
graph LR
|
||||
Input\[文档输入\] \--\> Router{格式路由}
|
||||
|
||||
Router \--\>|PDF| pymupdf4llm\[pymupdf4llm\]
|
||||
pymupdf4llm \--\>|成功| MD\_Out
|
||||
pymupdf4llm \--\>|文本过少| Error\[报错:不支持扫描件\]
|
||||
|
||||
Router \--\>|Word| Mammoth\[Mammoth\]
|
||||
Router \--\>|PPT| Pptx\[Python-pptx\]
|
||||
Router \--\>|Excel/CSV| Pandas\[Pandas \+ Context\]
|
||||
|
||||
Mammoth \--\> MD\_Out
|
||||
Pptx \--\> MD\_Out
|
||||
Pandas \--\> MD\_Out\[Markdown 输出\]
|
||||
|
||||
## **🔧 核心实现方案**
|
||||
|
||||
### **1\. PDF 文档处理 (极简版)**
|
||||
|
||||
策略:只用 pymupdf4llm。
|
||||
逻辑:尝试解析 \-\> 如果字数太少 \-\> 抛出异常(告诉前端提示用户上传电子版)。
|
||||
|
||||
#### **代码实现 (pdf\_processor.py)**
|
||||
|
||||
import pymupdf4llm
|
||||
import logging
|
||||
|
||||
logger \= logging.getLogger(\_\_name\_\_)
|
||||
|
||||
class PdfProcessor:
|
||||
def to\_markdown(self, pdf\_path: str) \-\> str:
|
||||
"""
|
||||
PDF 转 Markdown (仅支持电子版)
|
||||
"""
|
||||
try:
|
||||
\# 1\. 尝试快速解析 (保留表格结构)
|
||||
md\_text \= pymupdf4llm.to\_markdown(pdf\_path, show\_progress=False)
|
||||
|
||||
\# 2\. 质量检查:如果提取内容极少(\<50字符),视为扫描件
|
||||
if len(md\_text.strip()) \< 50:
|
||||
msg \= f"解析失败:提取文本过少({len(md\_text)}字符)。可能为扫描版PDF,本系统暂不支持。"
|
||||
logger.warning(msg)
|
||||
\# 选择策略:是返回空字符串让流程继续,还是报错?
|
||||
\# 建议:返回一段提示文本,让 LLM 知道这个文件没读出来
|
||||
return "\> \*\*系统提示\*\*:此文档似乎是扫描件(图片),无法提取文本内容。"
|
||||
|
||||
return md\_text
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"pymupdf4llm failed: {e}")
|
||||
raise ValueError(f"PDF解析失败: {str(e)}")
|
||||
|
||||
### **2\. Word 文档处理**
|
||||
|
||||
**策略**:mammoth。轻量、快速、HTML/Markdown 转换效果好。
|
||||
|
||||
#### **代码实现 (docx\_processor.py)**
|
||||
|
||||
import mammoth
|
||||
|
||||
class DocxProcessor:
|
||||
def to\_markdown(self, docx\_path: str) \-\> str:
|
||||
with open(docx\_path, "rb") as f:
|
||||
result \= mammoth.convert\_to\_markdown(f)
|
||||
|
||||
if not result.value.strip():
|
||||
return "\> \*\*系统提示\*\*:Word文档内容为空或无法识别。"
|
||||
|
||||
return result.value
|
||||
|
||||
### **3\. Excel/CSV 处理**
|
||||
|
||||
**策略**:pandas。加上文件名上下文。
|
||||
|
||||
#### **代码实现 (excel\_processor.py)**
|
||||
|
||||
import pandas as pd
|
||||
import os
|
||||
|
||||
class ExcelProcessor:
|
||||
def to\_markdown(self, file\_path: str, max\_rows: int \= 200\) \-\> str:
|
||||
"""Excel/CSV 转 Markdown"""
|
||||
ext \= os.path.splitext(file\_path)\[1\].lower()
|
||||
filename \= os.path.basename(file\_path)
|
||||
md\_output \= \[\]
|
||||
|
||||
try:
|
||||
if ext \== '.csv':
|
||||
dfs \= {'Sheet1': pd.read\_csv(file\_path)}
|
||||
else:
|
||||
dfs \= pd.read\_excel(file\_path, sheet\_name=None)
|
||||
|
||||
for sheet\_name, df in dfs.items():
|
||||
md\_output.append(f"\#\# 数据来源: {filename} \- {sheet\_name}")
|
||||
md\_output.append(f"- \*\*行列\*\*: {len(df)}行 x {len(df.columns)}列")
|
||||
|
||||
if len(df) \> max\_rows:
|
||||
md\_output.append(f"\> (仅显示前 {max\_rows} 行)")
|
||||
df \= df.head(max\_rows)
|
||||
|
||||
df \= df.fillna('')
|
||||
md\_output.append(df.to\_markdown(index=False))
|
||||
md\_output.append("\\n---\\n")
|
||||
|
||||
return "\\n".join(md\_output)
|
||||
|
||||
except Exception as e:
|
||||
return f"Error processing Excel: {str(e)}"
|
||||
|
||||
## **🏗️ 统一入口 (document\_processor.py)**
|
||||
|
||||
import os
|
||||
from .pdf\_processor import PdfProcessor
|
||||
from .docx\_processor import DocxProcessor
|
||||
from .excel\_processor import ExcelProcessor
|
||||
from .pptx\_processor import PptxProcessor
|
||||
|
||||
class DocumentProcessor:
|
||||
def \_\_init\_\_(self):
|
||||
self.pdf \= PdfProcessor()
|
||||
self.docx \= DocxProcessor()
|
||||
self.excel \= ExcelProcessor()
|
||||
self.pptx \= PptxProcessor()
|
||||
|
||||
def process(self, file\_path: str) \-\> str:
|
||||
ext \= os.path.splitext(file\_path)\[1\].lower()
|
||||
|
||||
if ext \== '.pdf':
|
||||
return self.pdf.to\_markdown(file\_path)
|
||||
elif ext in \['.docx', '.doc'\]:
|
||||
return self.docx.to\_markdown(file\_path)
|
||||
elif ext in \['.xlsx', '.xls', '.csv'\]:
|
||||
return self.excel.to\_markdown(file\_path)
|
||||
elif ext \== '.pptx':
|
||||
return self.pptx.to\_markdown(file\_path)
|
||||
elif ext in \['.txt', '.md'\]:
|
||||
with open(file\_path, 'r', encoding='utf-8', errors='ignore') as f:
|
||||
return f.read()
|
||||
else:
|
||||
return f"Unsupported file format: {ext}"
|
||||
|
||||
## **📦 极简依赖清单 (requirements.txt)**
|
||||
|
||||
**体积预估**:
|
||||
|
||||
* 整个 Docker 镜像压缩后可能只有 **200MB \- 300MB**。
|
||||
* 相比带 PaddleOCR 的版本(1.5GB+),缩小了 5 倍以上。
|
||||
|
||||
\# 核心解析库
|
||||
pymupdf4llm\>=0.0.17
|
||||
mammoth\>=1.8.0
|
||||
python-pptx\>=1.0.2
|
||||
pandas\>=2.2.0
|
||||
openpyxl\>=3.1.5
|
||||
tabulate\>=0.9.0
|
||||
|
||||
\# 基础工具
|
||||
chardet\>=5.2.0
|
||||
fastapi\>=0.109.0
|
||||
uvicorn\>=0.27.0
|
||||
python-multipart\>=0.0.9
|
||||
|
||||
## **🚀 部署建议**
|
||||
|
||||
1. **Docker 基础镜像**:可以使用 python:3.11-slim,非常小。
|
||||
2. **资源限制**:这个服务甚至可以在 **0.5核 CPU / 512MB 内存** 的微型容器里跑起来。
|
||||
3. **用户引导**:在前端上传界面加一行小字:“目前仅支持电子版 PDF,暂不支持扫描件或图片”。这比在后端搞复杂的 OCR 性价比高得多。
|
||||
Reference in New Issue
Block a user