feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready
2026-01-21 20:24:29 +08:00
parent 1f5bf2cd65
commit 40c2f8e148
338 changed files with 11014 additions and 1158 deletions
--- a/docs/02-通用能力层/02-文档处理引擎/02-文档处理引擎使用指南.md
+++ b/docs/02-通用能力层/02-文档处理引擎/02-文档处理引擎使用指南.md
@@ -0,0 +1,544 @@
+# Document Processing Engine User Guide
+
+> **Document version**: v1.0  
+> **Last updated**: 2026-01-21  
+> **Status**: ✅ Production ready  
+> **Audience**: developers of business modules (PKB, ASL, DC, RVW, etc.)
+
+---
+
+## 📋 Quick Start
+
+### Up and running in 5 seconds
+
+```typescript
+// Call the Python microservice (the address comes from an environment variable)
+const EXTRACTION_SERVICE_URL = process.env.EXTRACTION_SERVICE_URL || 'http://localhost:8000';
+
+// fileBuffer and filename are supplied by the caller
+const formData = new FormData();
+formData.append('file', new Blob([fileBuffer]), filename);
+
+const response = await fetch(`${EXTRACTION_SERVICE_URL}/api/document/to-markdown`, {
+  method: 'POST',
+  body: formData,  // file: PDF/Word/Excel/PPT
+});
+
+const result = await response.json();
+// {
+//   success: true,
+//   text: "# Title\n\nContent...",  // Markdown
+//   format: "markdown",
+//   metadata: { page_count: 10, char_count: 5000 }
+// }
+```
+
+---
+
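+## 🎯 Core Principles
+
+### Ultra-lightweight + zero OCR + LLM-friendly
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│  Design philosophy (suited to a two-person team)                        │
+├─────────────────────────────────────────────────────────────────────────┤
+│  • Prioritize: PDF/Word/Excel must be accurate; rare formats on demand  │
+│  • Zero OCR: digital files only; scans get a friendly notice            │
+│  • Graceful failure: never halts the flow; returns an LLM-readable note │
+│  • LLM-friendly: always outputs Markdown, keeping tables and structure  │
+└─────────────────────────────────────────────────────────────────────────┘
+```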
+
+---
+
+## 📄 Supported Formats
+
+| Format | Tool | Priority | Status |
+|------|------|--------|------|
+| **PDF** | pymupdf4llm | P0 | ✅ Implemented and tested |
+| **Word (.docx)** | mammoth | P0 | ✅ Implemented |
+| **Excel (.xlsx)** | pandas + openpyxl | P0 | ✅ Implemented |
+| **CSV** | pandas | P0 | ✅ Implemented |
+| **PPT (.pptx)** | python-pptx | P1 | ✅ Implemented |
+| **Plain text (.txt/.md)** | Read directly | P0 | ✅ Implemented |
+
+**Note**: Handlers for HTML/BibTeX/RIS already exist in the codebase but are not wired into the unified entry point; call them individually when needed.
+
+---
+
+## 🚀 Usage
+
+### Method 1: Call the Python microservice directly (not recommended)
+
+**For debugging or special cases only. Business code should use Method 2.**
+
+```typescript
+// Read the service address from the environment (it differs between development and production)
+const EXTRACTION_SERVICE_URL = process.env.EXTRACTION_SERVICE_URL || 'http://localhost:8000';
+
+async function convertToMarkdown(file: Buffer, filename: string): Promise<string> {
+  const formData = new FormData();
+  const blob = new Blob([file], { type: 'application/pdf' });  // set according to the file type
+  formData.append('file', blob, filename);
+
+  const response = await fetch(
+    `${EXTRACTION_SERVICE_URL}/api/document/to-markdown`,
+    { 
+      method: 'POST', 
+      body: formData,
+      // Note: do not set Content-Type; fetch generates the multipart/form-data header itself
+    }
+  );
+
+  if (!response.ok) {
+    const errorText = await response.text();
+    throw new Error(`Document conversion failed: ${response.status} - ${errorText}`);
+  }
+
+  const result = await response.json();
+  
+  if (!result.success) {
+    throw new Error(result.error || 'Document conversion failed');
+  }
+
+  return result.text;  // Markdown
+}
+```
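
Method 1 hard-codes `application/pdf` as the `Blob` type and leaves the per-format choice to the caller. One way to fill that in is a small lookup keyed by file extension. This is a minimal sketch: `mimeTypeFor` and `MIME_TYPES` are hypothetical names that are not part of the codebase, and this guide does not say whether the extraction service inspects the MIME type at all or relies on the filename.

```typescript
// Hypothetical helper (not part of the codebase): choose the Blob MIME type
// from the file extension, covering the formats listed in this guide.
const MIME_TYPES: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  pptx: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  csv: 'text/csv',
  txt: 'text/plain',
  md: 'text/markdown',
};

function mimeTypeFor(filename: string): string {
  const dot = filename.lastIndexOf('.');
  // No extension (or a dotfile such as ".env"): fall back to a generic binary type.
  if (dot <= 0) return 'application/octet-stream';
  const ext = filename.slice(dot + 1).toLowerCase();
  return MIME_TYPES[ext] ?? 'application/octet-stream';
}
```

Inside `convertToMarkdown`, the Blob could then be built as `new Blob([file], { type: mimeTypeFor(filename) })`.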
+
+### Method 2: Use the RAG engine (recommended)
+
+**99% of business scenarios should use this method. Document processing is built into the RAG engine.**
+
+```typescript
+// Business module code (e.g. PKB, ASL, AIA)
+import { getDocumentIngestService } from '@/common/rag';
+
+// 1. Get the ingestion service
+const ingestService = getDocumentIngestService(prisma);
+
+// 2. A single call performs: conversion → chunking → embedding → storage
+const result = await ingestService.ingestDocument(
+  {
+    filename: 'research.pdf',
+    fileBuffer: pdfBuffer,  // file contents
+  },
+  {
+    kbId: 'your-knowledge-base-id',
+    contentType: 'LITERATURE',  // optional
+    tags: ['medicine', 'RCT'],  // optional
+  }
+);
+
+// 3. Inspect the result
+console.log(`✅ Ingested: ${result.documentId}`);
+console.log(`   Chunks: ${result.chunkCount}`);
+console.log(`   Tokens: ${result.tokenCount}`);
+console.log(`   Duration: ${result.duration}ms`);
+
+// Internally, DocumentIngestService:
+// ✅ Calls the Python microservice to convert the file to Markdown
+// ✅ Chunks the Markdown
+// ✅ Embeds the chunks in batches
+// ✅ Stores them in ekb_schema
+```
+
+**Comparison:**
+```
+Method 1 (direct call): returns Markdown only; you handle the remaining steps yourself
+Method 2 (RAG engine): one call runs the whole pipeline; the document is immediately searchable ✅
+```
+
+---
+
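+## 📦 Format Details
+
+### PDF (.pdf)
+
+**Tool**: `pymupdf4llm`
+
+**Features**:
+- ✅ Preserves table structure automatically (as Markdown tables)
+- ✅ Reorders multi-column layouts automatically
+- ✅ Keeps math formulas as LaTeX
+- ✅ Detects scanned files automatically (returns a friendly notice)
+
+**Sample output**:
+```markdown
+# Article Title
+
+## Abstract
+
+Aspirin is a...
+
+## Methods
+
+| Group | Sample size | Dose |
+|------|--------|------|
+| Treatment | 150 | 100 mg/day |
+| Control | 150 | Placebo |
+```
+
+### Word (.docx)
+
+**Tool**: `mammoth`
+
+**Features**:
+- ✅ Preserves heading levels
+- ✅ Preserves list structure
+- ✅ Preserves tables (converted to Markdown)
+- ⚠️ Images may be lost
+
+### Excel (.xlsx) / CSV
+
+**Tool**: `pandas + openpyxl`
+
+**Features**:
+- ✅ Supports multiple sheets
+- ✅ Adds data-source context automatically
+- ✅ Truncates large data (200 rows by default)
+- ✅ Handles empty values
+
+**Sample output**:
+```markdown
+## Data source: patient_data.xlsx - Sheet1
+- **Shape**: 500 rows × 12 columns
+
+> ⚠️ Large dataset: showing only the first 200 rows (500 total)
+
+| Patient ID | Age | Sex | Diagnosis |
+|--------|------|------|------|
+| P001 | 65 | Male | Lung cancer |
+```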
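
The truncation behaviour in the sample output can be sketched as follows. The real implementation lives in the Python service and uses pandas; this TypeScript version only illustrates the logic. `renderSheet`, `Row`, and the English wording of the labels are assumptions made for the example.

```typescript
// Illustrative sketch only: the actual renderer is in the Python service.
type Row = Record<string, string | number | null>;

function renderSheet(source: string, sheet: string, rows: Row[], maxRows = 200): string {
  const first = rows[0];
  const columns = first ? Object.keys(first) : [];
  const lines: string[] = [
    `## Data source: ${source} - ${sheet}`,
    `- **Shape**: ${rows.length} rows × ${columns.length} columns`,
    '',
  ];
  // An empty sheet gets the context header but no table.
  if (columns.length === 0) return lines.join('\n');

  if (rows.length > maxRows) {
    lines.push(`> ⚠️ Large dataset: showing only the first ${maxRows} rows (${rows.length} total)`, '');
  }
  lines.push(`| ${columns.join(' | ')} |`);
  lines.push(`|${columns.map(() => '---').join('|')}|`);
  for (const row of rows.slice(0, maxRows)) {
    // Empty cells (null or missing) are rendered as an empty string.
    const cells = columns.map((c) => String(row[c] ?? ''));
    lines.push(`| ${cells.join(' | ')} |`);
  }
  return lines.join('\n');
}
```

The row limit is a parameter so that a caller who needs a different cutoff does not have to change the renderer.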
+
+### PPT (.pptx)
+
+**Tool**: `python-pptx`
+
+**Features**:
+- ✅ Splits output by slide
+- ✅ Extracts titles and body text
+- ⚠️ Charts may be lost
+---
+
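+## ⚠️ Caveats
+
+### 1. Scanned PDFs
+
+**Problem**: no text can be extracted
+
+**Handling**: the service returns the following notice (emitted in Chinese) in place of the document text: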
+```markdown
+> **系统提示**：文档 `scan.pdf` 似乎是扫描件（图片型 PDF）。
+> 
+> - 提取文本量：15 字符
+> - 本系统暂不支持扫描版 PDF 的文字识别
+> - 建议：请上传电子版 PDF
+```
+
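+**The pipeline is not interrupted.** The notice says that the file appears to be a scanned (image-only) PDF, reports how many characters were extracted, states that text recognition for scanned PDFs is not supported, and asks for a digital PDF instead. An LLM can read it as ordinary text.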
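
How the service decides that a PDF is scanned is not documented in this guide; the notice only reports the extracted character count. One plausible heuristic, sketched below, compares the amount of extracted text with the page count. `looksScanned`, `ExtractionResult`, and the threshold of 50 characters per page are assumptions chosen for illustration, not the service's actual rule.

```typescript
// Hypothetical heuristic: very little extractable text per page suggests an image-only PDF.
interface ExtractionResult {
  text: string;
  pageCount: number;
}

// Assumed threshold; tune it against real documents.
const MIN_CHARS_PER_PAGE = 50;

function looksScanned(result: ExtractionResult): boolean {
  // Whitespace is ignored so that blank lines do not count as content.
  const chars = result.text.replace(/\s/g, '').length;
  // Guard against a page count of zero.
  const pages = Math.max(result.pageCount, 1);
  return chars / pages < MIN_CHARS_PER_PAGE;
}
```

A per-page ratio is used rather than an absolute count so that a short one-page letter and a 300-page scan are judged on the same scale.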
+
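+### 2. Large files
+
+**Excel/CSV**:
+- Data beyond 200 rows is truncated automatically
+- The returned Markdown includes a truncation notice
+
+**PDF**:
+- pymupdf4llm can handle large PDFs (several hundred pages)
+- Recommendation: consider splitting very large PDFs (>100 MB)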
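
A caller may want to catch unsupported or oversized uploads before sending them to the service. The sketch below encodes the guidance from this guide (the supported extensions and the >100 MB suggestion for PDFs). `preflight` and its return shape are hypothetical, and the size check is a warning rather than a hard limit because the guide only recommends splitting.

```typescript
// Hypothetical pre-upload check based on the formats and size guidance in this guide.
const SUPPORTED_EXTENSIONS = ['pdf', 'docx', 'xlsx', 'csv', 'pptx', 'txt', 'md'];
const LARGE_PDF_BYTES = 100 * 1024 * 1024; // the ">100 MB" guidance

interface PreflightResult {
  ok: boolean;        // false when the format is not supported
  warnings: string[]; // human-readable notes for the caller
}

function preflight(filename: string, sizeBytes: number): PreflightResult {
  const dot = filename.lastIndexOf('.');
  const ext = dot > 0 ? filename.slice(dot + 1).toLowerCase() : '';
  const warnings: string[] = [];

  const ok = SUPPORTED_EXTENSIONS.indexOf(ext) !== -1;
  if (!ok) {
    warnings.push(ext ? `Unsupported format: .${ext}` : 'File has no extension');
  }
  if (ext === 'pdf' && sizeBytes > LARGE_PDF_BYTES) {
    warnings.push('PDF is larger than 100 MB; consider splitting it');
  }
  return { ok, warnings };
}
```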
+
+### 3. Encoding
+
+**CSV/TXT**:
+- The encoding is detected automatically (with chardet)
+- UTF-8, GBK, GB2312, and others are supported
+
+---
+
+## 🔧 Configuration
+
+### Environment variables (required)
+
+```bash
+# backend/.env
+# EXTRACTION_SERVICE_URL is the address of the Python microservice.
+# Set it to ONE of the values below, depending on the environment.
+
+# Development (local)
+EXTRACTION_SERVICE_URL=http://localhost:8000
+
+# Production (Docker)
+EXTRACTION_SERVICE_URL=http://extraction-service:8000
+
+# Production (Alibaba Cloud SAE)
+EXTRACTION_SERVICE_URL=http://172.17.173.66:8000
+```
+
+### Start the Python microservice
+
+```bash
+# Option 1: development (Windows)
+cd extraction_service
+.\venv\Scripts\python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000
+
+# Option 2: production (Docker)
+docker-compose up -d extraction-service
+
+# Verify that the service is running
+curl http://localhost:8000/api/health
+```
+
+### Python dependencies
+
+```txt
+# extraction_service/requirements.txt
+pymupdf4llm>=0.0.17    # PDF → Markdown
+mammoth==1.6.0         # Word → Markdown
+pandas>=2.0.0          # Excel/CSV
+openpyxl>=3.1.2        # Excel reader
+tabulate>=0.9.0        # Markdown tables
+python-pptx>=0.6.23    # PPT reader
+```
+
+---
+
+## 📊 Performance
+
+| Operation | Time | Notes |
+|------|------|------|
+| PDF → Markdown (10 pages) | 1-3 s | Digital PDF |
+| PDF → Markdown (50 pages) | 5-10 s | Large document |
+| Word → Markdown | 0.5-2 s | |
+| Excel → Markdown | 0.5-3 s | Depends on data volume |
+---
+
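+## 💡 Best Practices
+
+### 1. Error handling
+
+```typescript
+try {
+  const markdown = await convertToMarkdown(file, filename);
+  
+  // Detect the scanned-PDF notice. The service emits it in Chinese as
+  // "> **系统提示**：文档 ... 扫描件 ...". The bold markers sit between
+  // "系统提示" and "：", so the two keywords are matched separately.
+  if (markdown.includes('系统提示') && markdown.includes('扫描件')) {
+    // Surface a friendly message to the user
+    throw new Error('The document is a scanned file; please upload a digital version');
+  }
+  
+  return markdown;
+} catch (error) {
+  logger.error('Document processing failed', { error, filename });
+  throw error;
+}
+```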
+
+### 2. Integrating with the RAG engine
+
+```typescript
+// Full document ingestion flow
+import { getDocumentIngestService } from '@/common/rag';
+
+async function ingestPDF(kbId: string, file: Buffer, filename: string) {
+  const ingestService = getDocumentIngestService(prisma);
+  
+  // Internally, DocumentIngestService:
+  // 1. Calls the Python microservice to convert the file to Markdown
+  // 2. Chunks the Markdown
+  // 3. Embeds the chunks
+  // 4. Stores them in the database
+  
+  const result = await ingestService.ingestDocument(
+    { filename, fileBuffer: file },
+    { kbId }
+  );
+  
+  return result;
+}
+```
+
+### 3. Batch processing
+
+```typescript
+// Upload documents in bulk
+for (const file of files) {
+  try {
+    await ingestPDF(kbId, file.buffer, file.name);
+  } catch (error) {
+    logger.error(`Document ingestion failed: ${file.name}`, { error });
+    // Continue with the next file
+  }
+}
+```
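
When many files are uploaded at once, it is often useful to report which ones failed instead of only logging them. Below is a minimal sketch of the same loop that also collects a summary. `ingestAll`, `UploadFile`, and `BatchSummary` are illustrative names, and the ingest step is passed in as a function so that the helper does not depend on the RAG engine.

```typescript
// Illustrative sketch: sequential batch ingestion that records successes and failures.
interface UploadFile {
  name: string;
  buffer: Uint8Array;
}

interface BatchSummary {
  succeeded: string[];
  failed: { name: string; reason: string }[];
}

async function ingestAll(
  files: UploadFile[],
  ingest: (file: UploadFile) => Promise<unknown>,
): Promise<BatchSummary> {
  const summary: BatchSummary = { succeeded: [], failed: [] };
  for (const file of files) {
    try {
      await ingest(file);
      summary.succeeded.push(file.name);
    } catch (error) {
      // One bad file must not stop the batch; record why it failed and move on.
      const reason = error instanceof Error ? error.message : String(error);
      summary.failed.push({ name: file.name, reason });
    }
  }
  return summary;
}
```

With the `ingestPDF` function from the previous section, a call could look like `ingestAll(files, (f) => ingestPDF(kbId, Buffer.from(f.buffer), f.name))`. Files are processed one at a time, as in the loop above, which keeps load on the extraction service predictable.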
+
+---
+
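+## 🐛 FAQ
+
+### Q1: PDF conversion fails?
+
+**Possible causes**:
+- Scanned PDF (no text layer)
+- The PDF is encrypted
+- The PDF is corrupted
+
+**Fix**:
+- Check that the file is a digital PDF
+- Try opening it in a PDF reader
+
+### Q2: Excel shows only part of the data?
+
+**Cause**:
+- Automatic truncation (200 rows by default)
+
+**Fix**:
+- This is by design, to keep the LLM input from growing too long
+- For the complete data, use the Excel processing in the DC module
+
+### Q3: Cannot connect to the Python microservice?
+
+**Checks**:
+```bash
+# Check that the service responds
+curl http://localhost:8000/api/health
+
+# Check the container status (Docker); "extraction" matches both
+# the extraction-service and extraction_service naming
+docker ps | grep extraction
+
+# Development: run the service in the foreground to watch its logs
+cd extraction_service
+.\venv\Scripts\python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000
+```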
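
After the service starts, it can take a moment before `/api/health` responds. A small polling helper can make startup scripts and tests less flaky. This is a sketch under assumptions: `waitForService` and `HealthProbe` are hypothetical names, the retry count and delay are arbitrary defaults, and the probe is injected so that the function can be exercised without a network.

```typescript
// Hypothetical helper: poll the health endpoint until it answers or attempts run out.
type HealthProbe = (url: string) => Promise<{ ok: boolean }>;

async function waitForService(
  baseUrl: string,
  probe: HealthProbe,
  attempts = 5,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await probe(`${baseUrl}/api/health`);
      if (res.ok) return true;
    } catch {
      // Connection refused or DNS failure: treat as "not ready yet".
    }
    // Wait between attempts, but not after the last one.
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

In application code the probe can simply wrap the global `fetch`, for example `waitForService(EXTRACTION_SERVICE_URL, (url) => fetch(url))`.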
+
+---
+
+## 💡 Complete Example (required reading for business-module developers)
+
+### Scenario: the PKB module uploads a PDF document
+
+```typescript
+// modules/pkb/controllers/documentController.ts
+
+import { FastifyRequest, FastifyReply } from 'fastify';
+import { getDocumentIngestService } from '../../../common/rag/index.js';
+import { prisma } from '../../../config/database.js';
+import { storage } from '../../../common/storage/index.js';
+// NOTE: `logger` (used in the catch block) is the project's shared logger;
+// its import is not shown in the original example.
+
+/**
+ * Upload a document to a knowledge base
+ * POST /api/pkb/knowledge-bases/:kbId/documents
+ */
+export async function uploadDocument(
+  request: FastifyRequest<{ Params: { kbId: string } }>,
+  reply: FastifyReply
+) {
+  try {
+    const { kbId } = request.params;
+    const userId = (request as any).user?.userId;  // taken from the JWT
+    const file = await request.file();  // Fastify multipart
+
+    if (!file) {
+      return reply.status(400).send({ error: 'Please upload a file' });
+    }
+
+    // 1. Read the file contents
+    const fileBuffer = await file.toBuffer();
+    const filename = file.filename;
+
+    // 2. Upload to OSS (optional, as a backup)
+    const fileUrl = await storage.upload(fileBuffer, filename);
+
+    // 3. Ingest through the RAG engine (it calls the document processing engine itself)
+    const ingestService = getDocumentIngestService(prisma);
+    const result = await ingestService.ingestDocument(
+      {
+        filename,
+        fileBuffer,
+      },
+      {
+        kbId,
+        contentType: 'LITERATURE',
+        tags: ['upload'],
+      }
+    );
+
+    // 4. Return the result
+    return reply.status(201).send({
+      success: true,
+      data: {
+        documentId: result.documentId,
+        filename,
+        fileUrl,
+        chunkCount: result.chunkCount,
+        tokenCount: result.tokenCount,
+        duration: result.duration,
+      },
+    });
+
+  } catch (error) {
+    logger.error('Document upload failed', { error });
+    return reply.status(500).send({ error: 'Document upload failed' });
+  }
+}
+```
+
+**Key points:**
+1. ✅ Get the service with `getDocumentIngestService(prisma)`
+2. ✅ Call `ingestDocument()`, which runs the entire document-processing flow
+3. ✅ No need to call the Python microservice manually
+4. ✅ No need to chunk or embed manually
+5. ✅ The environment variable (`EXTRACTION_SERVICE_URL`) is read automatically
+
+### Key configuration
+
+**Required environment variables:**
+```bash
+# backend/.env
+EXTRACTION_SERVICE_URL=http://localhost:8000  # Python microservice address
+DASHSCOPE_API_KEY=sk-xxx                      # Alibaba Cloud API key (embeddings)
+```
+
+**Service start order:**
+```bash
+# 1. Start the database
+docker-compose up -d postgres
+
+# 2. Start the Python microservice
+cd extraction_service
+.\venv\Scripts\python -m uvicorn main:app --reload --port 8000
+
+# 3. Start the Node.js backend
+cd backend
+npm run dev
+```
+
+---
+
+## 📚 Related Documents
+
+- [Document Processing Engine Design](./01-文档处理引擎设计方案.md) - Detailed technical design
+- 📖 [RAG Engine User Guide](../03-RAG引擎/05-RAG引擎使用指南.md) - **Recommended reading** (complete business integration)
+
+---
+
+## 🚀 Quick Test
+
+```bash
+# 1. Test the Python microservice
+curl http://localhost:8000/api/health
+
+# 2. Test single-file conversion (for debugging)
+curl -X POST http://localhost:8000/api/document/to-markdown \
+  -F "file=@test.pdf"
+
+# 3. Test the full ingestion flow (recommended)
+cd backend
+npx tsx src/tests/test-pdf-ingest.ts "path/to/your.pdf"
+```
+
+---
+
+## 📅 Version History
+
+| Version | Date | Changes |
+|------|------|----------|
+| v1.0 | 2026-01-21 | Initial release: user guide for the pymupdf4llm-based document processing engine |
+