feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready
2026-01-21 20:24:29 +08:00
parent 1f5bf2cd65
commit 40c2f8e148
338 changed files with 11014 additions and 1158 deletions
--- a/docs/08-项目管理/01-文档处理引擎设计方案_v1.2.md
+++ b/docs/08-项目管理/01-文档处理引擎设计方案_v1.2.md
@@ -0,0 +1,185 @@
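+# **Document Processing Engine Design**
+
+Document version: v1.2 (minimal edition)  
+Last updated: 2026-01-20  
+Key change: removed PaddleOCR in pursuit of the lightest possible footprint  
+Scope: PKB knowledge base, ASL intelligent literature, DC data cleaning
+
+## **📋 Overview**
+
+### **Design Goals**
+
+Build an "ultra-lightweight, zero-OCR, LLM-friendly" document parsing microservice.  
+Core principle: handle only editable (digital) documents and drop support for scanned files, in exchange for fast deployment and low resource usage.
+
+The service should also stay fault-tolerant. For a two-person team, that means prioritizing: keep PDF/Word/Excel parsing reliably accurate and drop niche formats.
+
+### **Architecture Overview (Pipeline)**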
+
+graph LR  
+    Input\[Document input\] \--\> Router{Format router}  
+      
+    Router \--\>|PDF| pymupdf4llm\[pymupdf4llm\]  
+    pymupdf4llm \--\>|success| MD\_Out  
+    pymupdf4llm \--\>|too little text| Error\[Error: scanned PDFs not supported\]  
+      
+    Router \--\>|Word| Mammoth\[Mammoth\]  
+    Router \--\>|PPT| Pptx\[Python-pptx\]  
+    Router \--\>|Excel/CSV| Pandas\[Pandas \+ Context\]  
+      
+    Mammoth \--\> MD\_Out  
+    Pptx \--\> MD\_Out  
+    Pandas \--\> MD\_Out\[Markdown output\]
+
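+## **🔧 Core Implementation**
+
+### **1\. PDF Processing (minimal edition)**
+
+Strategy: use pymupdf4llm only.  
+Logic: attempt to parse \-\> if too little text comes back \-\> treat the file as a scan. The implementation below returns a notice string in that case rather than raising, and the front end should prompt the user to upload a digital version.
+
+#### **Implementation (pdf\_processor.py)**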
+
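+import pymupdf4llm  
+import logging
+
+logger \= logging.getLogger(\_\_name\_\_)
+
+class PdfProcessor:  
+    def to\_markdown(self, pdf\_path: str) \-\> str:  
+        """  
+        Convert a PDF to Markdown (digital PDFs only).  
+        """  
+        try:  
+            \# 1\. Fast parse (preserves table structure)  
+            md\_text \= pymupdf4llm.to\_markdown(pdf\_path, show\_progress=False)  
+              
+            \# 2\. Quality check: fewer than 50 characters of text is treated as a scan  
+            text\_len \= len(md\_text.strip())  
+            if text\_len \< 50:  
+                msg \= f"Parsing failed: too little text extracted ({text\_len} characters). This may be a scanned PDF, which the system does not support yet."  
+                logger.warning(msg)  
+                \# Design choice: return an empty string so the pipeline continues, or raise?  
+                \# Recommendation: return a short notice so the LLM knows this file could not be read  
+                return "\> \*\*System notice\*\*: This document appears to be a scan (images only); no text could be extracted."  
+              
+            return md\_text  
+              
+        except Exception as e:  
+            logger.error(f"pymupdf4llm failed: {e}")  
+            \# Chain the original exception so the root cause stays in the traceback  
+            raise ValueError(f"PDF parsing failed: {str(e)}") from e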
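The comment in the code above weighs returning a notice string against raising. For callers that would rather branch on an exception, the quality check could be factored out as in the minimal sketch below. `ScannedPdfError` and `ensure_digital_pdf` are hypothetical names introduced here for illustration, not part of the committed code; the 50-character threshold is the one used above.

```python
class ScannedPdfError(ValueError):
    """Raised when a PDF yields too little text to be a digital document."""


def ensure_digital_pdf(md_text: str, min_chars: int = 50) -> str:
    """Return md_text unchanged, or raise ScannedPdfError if it looks like a scan."""
    # Count characters after stripping whitespace, matching the check above.
    text_len = len(md_text.strip())
    if text_len < min_chars:
        raise ScannedPdfError(
            f"too little text extracted ({text_len} characters); "
            "the file may be a scanned PDF"
        )
    return md_text
```

An API layer could catch `ScannedPdfError` and map it to the "please upload a digital PDF" prompt, while the notice-string approach remains the better fit when the output goes straight to an LLM.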
+
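+### **2\. Word Processing**
+
+**Strategy**: mammoth. Lightweight and fast, with good HTML/Markdown conversion. Note that it reads .docx only, not the legacy .doc format.
+
+#### **Implementation (docx\_processor.py)**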
+
+import mammoth
+
+class DocxProcessor:  
+    def to\_markdown(self, docx\_path: str) \-\> str:  
+        with open(docx\_path, "rb") as f:  
+            result \= mammoth.convert\_to\_markdown(f)  
+              
+        if not result.value.strip():  
+            return "\> \*\*System notice\*\*: The Word document is empty or could not be read."  
+              
+        return result.value
+
+### **3\. Excel/CSV Processing**
+
+**Strategy**: pandas, with the file name added as context.
+
+#### **Implementation (excel\_processor.py)**
+
+import pandas as pd  
+import os
+
+class ExcelProcessor:  
+    def to\_markdown(self, file\_path: str, max\_rows: int \= 200\) \-\> str:  
+        """Convert Excel/CSV to Markdown."""  
+        ext \= os.path.splitext(file\_path)\[1\].lower()  
+        filename \= os.path.basename(file\_path)  
+        md\_output \= \[\]
+
+        try:  
+            if ext \== '.csv':  
+                dfs \= {'Sheet1': pd.read\_csv(file\_path)}  
+            else:  
+                dfs \= pd.read\_excel(file\_path, sheet\_name=None)
+
+            for sheet\_name, df in dfs.items():  
+                md\_output.append(f"\#\# Data source: {filename} \- {sheet\_name}")  
+                md\_output.append(f"- \*\*Shape\*\*: {len(df)} rows x {len(df.columns)} columns")  
+                  
+                if len(df) \> max\_rows:  
+                    md\_output.append(f"\> (showing only the first {max\_rows} rows)")  
+                    df \= df.head(max\_rows)  
+                  
+                df \= df.fillna('')  
+                md\_output.append(df.to\_markdown(index=False))  
+                md\_output.append("\\n---\\n")
+
+            return "\\n".join(md\_output)
+
+        except Exception as e:  
+            return f"Error processing Excel: {str(e)}"
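To make the output layout concrete (source heading, total shape, truncation note, then the table), here is a standard-library stand-in for the CSV branch. It swaps pandas and `DataFrame.to_markdown` for the `csv` module so it runs without extra dependencies; `csv_preview_markdown` is a hypothetical helper, not the committed implementation.

```python
import csv
import io


def csv_preview_markdown(csv_text: str, name: str, max_rows: int = 200) -> str:
    """Render CSV text as a Markdown preview with file-name context."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return f"## Data source: {name}\n- **Shape**: 0 rows x 0 columns"
    header, body = rows[0], rows[1:]
    lines = [
        f"## Data source: {name}",
        # Report the full size before truncating, as the pandas version does.
        f"- **Shape**: {len(body)} rows x {len(header)} columns",
    ]
    if len(body) > max_rows:
        lines.append(f"> (showing only the first {max_rows} rows)")
        body = body[:max_rows]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        # Pad short rows so every line has the same number of cells.
        cells = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(cells[: len(header)]) + " |")
    return "\n".join(lines)
```

Reporting the shape before truncation tells the LLM how much data it is not seeing, which is the point of the context lines.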
+
+## **🏗️ Unified Entry Point (document\_processor.py)**
+
+import os  
+from .pdf\_processor import PdfProcessor  
+from .docx\_processor import DocxProcessor  
+from .excel\_processor import ExcelProcessor  
+from .pptx\_processor import PptxProcessor
+
+class DocumentProcessor:  
+    def \_\_init\_\_(self):  
+        self.pdf \= PdfProcessor()  
+        self.docx \= DocxProcessor()  
+        self.excel \= ExcelProcessor()  
+        self.pptx \= PptxProcessor()
+
+    def process(self, file\_path: str) \-\> str:  
+        ext \= os.path.splitext(file\_path)\[1\].lower()  
+          
+        if ext \== '.pdf':  
+            return self.pdf.to\_markdown(file\_path)  
+        \# mammoth reads .docx only; legacy .doc falls through to the unsupported branch  
+        elif ext \== '.docx':  
+            return self.docx.to\_markdown(file\_path)  
+        elif ext in \['.xlsx', '.xls', '.csv'\]:  
+            return self.excel.to\_markdown(file\_path)  
+        elif ext \== '.pptx':  
+            return self.pptx.to\_markdown(file\_path)  
+        elif ext in \['.txt', '.md'\]:  
+            with open(file\_path, 'r', encoding='utf-8', errors='ignore') as f:  
+                return f.read()  
+        else:  
+            return f"Unsupported file format: {ext}"
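The if/elif chain above can also be written as an extension-to-handler table, which keeps the supported formats in one place. The sketch below is one possible shape under stated assumptions: `build_router`, `read_plain_text`, and `UnsupportedFormatError` are hypothetical names, and only the plain-text handlers are wired in so that it runs without the parsing libraries.

```python
import os
from typing import Callable, Dict


class UnsupportedFormatError(ValueError):
    """Raised when no handler is registered for a file extension."""


def read_plain_text(path: str) -> str:
    """Handler for .txt/.md: return the file content as-is."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()


def build_router(handlers: Dict[str, Callable[[str], str]]) -> Callable[[str], str]:
    """Build a process(path) function that dispatches on the lowercased extension."""
    def process(path: str) -> str:
        ext = os.path.splitext(path)[1].lower()
        handler = handlers.get(ext)
        if handler is None:
            raise UnsupportedFormatError(f"Unsupported file format: {ext or '(none)'}")
        return handler(path)
    return process


# In the real service, '.pdf', '.docx', '.xlsx', etc. would map to the
# processors' to_markdown methods; they are omitted here on purpose.
process = build_router({".txt": read_plain_text, ".md": read_plain_text})
```

Raising for unknown extensions, instead of returning an error string, keeps failure messages from being indexed as if they were document content.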
+
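+## **📦 Minimal Dependency List (requirements.txt)**
+
+**Estimated size**:
+
+* The compressed Docker image is estimated at only **200MB \- 300MB**.  
+* That is at least 5x smaller than the version with PaddleOCR (1.5GB+).
+
+\# Core parsing libraries  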
+pymupdf4llm\>=0.0.17  
+mammoth\>=1.8.0  
+python-pptx\>=1.0.2  
+pandas\>=2.2.0  
+openpyxl\>=3.1.5  
+tabulate\>=0.9.0
+
+\# Basic utilities  
+chardet\>=5.2.0  
+fastapi\>=0.109.0  
+uvicorn\>=0.27.0  
+python-multipart\>=0.0.9
+
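+## **🚀 Deployment Recommendations**
+
+1. **Docker base image**: python:3.11-slim works and is very small.  
+2. **Resource limits**: the service can run even in a micro container with **0.5 CPU cores / 512MB of memory**.  
+3. **User guidance**: add a line of small print to the upload page: "Only digital PDFs are currently supported; scanned files and images are not." This is far more cost-effective than building complex OCR on the back end.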