Files

HaHafeng 40c2f8e148 feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready

2026-01-21 20:24:29 +08:00

13 KiB

Raw Blame History

文档处理引擎使用指南

文档版本: v1.0
最后更新: 2026-01-21
状态: ✅ 生产就绪
目标读者: 业务模块开发者（PKB、ASL、DC、RVW 等）

📋 快速开始

5 秒上手

// 调用 Python 微服务（使用环境变量）
const EXTRACTION_SERVICE_URL = process.env.EXTRACTION_SERVICE_URL || 'http://localhost:8000';

const response = await fetch(`${EXTRACTION_SERVICE_URL}/api/document/to-markdown`, {
  method: 'POST',
  body: formData,  // file: PDF/Word/Excel/PPT
});

const result = await response.json();
// {
//   success: true,
//   text: "# 标题\n\n内容...",  // Markdown 格式
//   format: "markdown",
//   metadata: { page_count: 10, char_count: 5000 }
// }

🎯 核心原则

极轻量 + 零 OCR + LLM 友好

┌─────────────────────────────────────────────────────────────┐
│  设计理念（适合 2 人小团队）                                  │
├─────────────────────────────────────────────────────────────┤
│  • 抓大放小：PDF/Word/Excel 绝对准确，冷门格式按需扩展       │
│  • 零 OCR：只处理电子版，扫描件返回友好提示                   │
│  • 容错优雅：解析失败不中断流程，返回 LLM 可读的提示          │
│  • LLM 友好：统一输出 Markdown，保留表格和结构               │
└─────────────────────────────────────────────────────────────┘

📄 支持格式

格式	工具	优先级	状态
PDF	pymupdf4llm	P0	✅ 已实现并测试
Word (.docx)	mammoth	P0	✅ 已实现
Excel (.xlsx)	pandas + openpyxl	P0	✅ 已实现
CSV	pandas	P0	✅ 已实现
PPT (.pptx)	python-pptx	P1	✅ 已实现
纯文本 (.txt/.md)	直接读取	P0	✅ 已实现

注意：HTML/BibTeX/RIS 格式的处理器代码已存在，但未集成到统一入口，需要时可单独调用。

🚀 使用方式

方式 1: 直接调用 Python 微服务（不推荐）

仅用于调试或特殊场景。业务开发应该使用方式 2。

// 从环境变量获取服务地址（生产/开发环境自动切换）
const EXTRACTION_SERVICE_URL = process.env.EXTRACTION_SERVICE_URL || 'http://localhost:8000';

async function convertToMarkdown(file: Buffer, filename: string): Promise<string> {
  const formData = new FormData();
  const blob = new Blob([file], { type: 'application/pdf' });  // 根据文件类型设置
  formData.append('file', blob, filename);

  const response = await fetch(
    `${EXTRACTION_SERVICE_URL}/api/document/to-markdown`,
    { 
      method: 'POST', 
      body: formData,
      // 注意：不要设置 Content-Type，让 fetch 自动处理 multipart/form-data
    }
  );

  if (!response.ok) {
    const errorText = await response.text();
    throw new Error(`文档转换失败: ${response.status} - ${errorText}`);
  }

  const result = await response.json();
  
  if (!result.success) {
    throw new Error(result.error || '文档转换失败');
  }

  return result.text;  // Markdown 格式
}

方式 2: 使用 RAG 引擎（推荐）

99% 的业务场景应该使用这种方式！文档处理已集成在 RAG 引擎内部。

// 业务模块代码（如 PKB、ASL、AIA）
import { getDocumentIngestService } from '@/common/rag';

// 1. 获取入库服务
const ingestService = getDocumentIngestService(prisma);

// 2. 一行代码完成：文档转换 → 分块 → 向量化 → 存储
const result = await ingestService.ingestDocument(
  {
    filename: 'research.pdf',
    fileBuffer: pdfBuffer,  // 文件内容
  },
  {
    kbId: 'your-knowledge-base-id',
    contentType: 'LITERATURE',  // 可选
    tags: ['医学', 'RCT'],      // 可选
  }
);

// 3. 返回结果
console.log(`✅ 入库成功: ${result.documentId}`);
console.log(`   分块数: ${result.chunkCount}`);
console.log(`   Token数: ${result.tokenCount}`);
console.log(`   耗时: ${result.duration}ms`);

// DocumentIngestService 内部自动完成：
// ✅ 调用 Python 微服务转换为 Markdown
// ✅ 智能分块
// ✅ 批量向量化
// ✅ 存入 ekb_schema

对比：

方式 1 (直接调用): 只得到 Markdown，需要自己处理后续步骤
方式 2 (RAG引擎): 一行代码完成所有流程，直接可检索 ✅

📦 格式特性

PDF (.pdf)

工具：pymupdf4llm

特点：

✅ 自动保留表格结构（Markdown 表格）
✅ 多栏布局自动重排
✅ 数学公式保留 LaTeX
✅ 自动检测扫描件（返回友好提示）

输出示例：

# 文章标题

## 摘要

阿司匹林是一种...

## 研究方法

| 组别 | 样本量 | 剂量 |
|------|--------|------|
| 实验组 | 150 | 100mg/日 |
| 对照组 | 150 | 安慰剂 |

Word (.docx)

工具：mammoth

特点：

✅ 保留标题层级
✅ 保留列表结构
✅ 保留表格（转为 Markdown）
⚠️ 图片可能丢失

Excel (.xlsx) / CSV

工具：pandas + openpyxl

特点：

✅ 多 Sheet 支持
✅ 自动添加数据来源上下文
✅ 大数据截断（默认 200 行）
✅ 空值处理

输出示例：

## 数据来源: patient_data.xlsx - Sheet1
- **行列**: 500 行 × 12 列

> ⚠️ 数据量较大，仅显示前 200 行（共 500 行）

| 患者ID | 年龄 | 性别 | 诊断 |
|--------|------|------|------|
| P001 | 65 | 男 | 肺癌 |

PPT (.pptx)

工具：python-pptx

特点：

✅ 按幻灯片分段
✅ 提取标题和正文
⚠️ 图表可能丢失

⚠️ 注意事项

1. 扫描版 PDF

问题：无法提取文字

处理：

> **系统提示**：文档 `scan.pdf` 似乎是扫描件（图片型 PDF）。
> 
> - 提取文本量：15 字符
> - 本系统暂不支持扫描版 PDF 的文字识别
> - 建议：请上传电子版 PDF

不会中断流程，LLM 可以理解这个提示。

2. 大文件处理

Excel/CSV：

自动截断超过 200 行的数据
返回带提示的 Markdown

PDF：

pymupdf4llm 可处理大 PDF（几百页）
建议：超大 PDF（>100MB）考虑拆分

3. 编码问题

CSV/TXT：

自动检测编码（使用 chardet）
支持 UTF-8, GBK, GB2312 等

🔧 配置

环境变量（必须配置）

# backend/.env
EXTRACTION_SERVICE_URL=http://localhost:8000  # Python 微服务地址

# 开发环境（本地）
EXTRACTION_SERVICE_URL=http://localhost:8000

# 生产环境（Docker）
EXTRACTION_SERVICE_URL=http://extraction-service:8000

# 生产环境（阿里云 SAE）
EXTRACTION_SERVICE_URL=http://172.17.173.66:8000

启动 Python 微服务

# 方式 1: 开发环境（Windows）
cd extraction_service
.\venv\Scripts\python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000

# 方式 2: 生产环境（Docker）
docker-compose up -d extraction-service

# 验证服务运行
curl http://localhost:8000/api/health

Python 依赖

# extraction_service/requirements.txt
pymupdf4llm>=0.0.17    # PDF → Markdown
mammoth==1.6.0         # Word → Markdown
pandas>=2.0.0          # Excel/CSV
openpyxl>=3.1.2        # Excel 读取
tabulate>=0.9.0        # Markdown 表格
python-pptx>=0.6.23    # PPT 读取

📊 性能指标

操作	耗时	说明
PDF → Markdown (10页)	1-3秒	电子版 PDF
PDF → Markdown (50页)	5-10秒	大文档
Word → Markdown	0.5-2秒
Excel → Markdown	0.5-3秒	取决于数据量

💡 最佳实践

1. 错误处理

try {
  const markdown = await convertToMarkdown(file, filename);
  
  // 检查是否为扫描件提示
  if (markdown.includes('系统提示：文档') && markdown.includes('扫描件')) {
    // 友好提示用户
    throw new Error('文档是扫描件，请上传电子版');
  }
  
  return markdown;
} catch (error) {
  logger.error('文档处理失败', { error, filename });
  throw error;
}

2. 与 RAG 引擎集成

// 完整的文档入库流程
import { getDocumentIngestService } from '@/common/rag';

async function ingestPDF(kbId: string, file: Buffer, filename: string) {
  const ingestService = getDocumentIngestService(prisma);
  
  // DocumentIngestService 内部会：
  // 1. 调用 Python 微服务转换为 Markdown
  // 2. 分块
  // 3. 向量化
  // 4. 存入数据库
  
  const result = await ingestService.ingestDocument(
    { filename, fileBuffer: file },
    { kbId }
  );
  
  return result;
}

3. 批量处理

// 批量上传文档
for (const file of files) {
  try {
    await ingestPDF(kbId, file.buffer, file.name);
  } catch (error) {
    logger.error(`文档入库失败: ${file.name}`, { error });
    // 继续处理下一个文件
  }
}

🐛 常见问题

Q1: PDF 转换失败？

可能原因：

扫描版 PDF（无文字层）
PDF 已加密
PDF 损坏

解决：

检查是否为电子版 PDF
尝试用 PDF 阅读器打开测试

Q2: Excel 只显示部分数据？

原因：

自动截断（默认 200 行）

解决：

这是设计行为，避免 LLM 输入过长
如需完整数据，使用 DC 模块的 Excel 处理功能

Q3: Python 微服务连接失败？

检查：

# 测试服务是否运行
curl http://localhost:8000/api/health

# 检查服务状态
docker ps | grep extraction_service

# 查看服务日志
cd extraction_service
.\venv\Scripts\python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000

💡 完整示例（业务模块开发者必读）

场景：PKB 模块上传 PDF 文档

// modules/pkb/controllers/documentController.ts

import { FastifyRequest, FastifyReply } from 'fastify';
import { getDocumentIngestService } from '../../../common/rag/index.js';
import { prisma } from '../../../config/database.js';
import { storage } from '../../../common/storage/index.js';

/**
 * 上传文档到知识库
 * POST /api/pkb/knowledge-bases/:kbId/documents
 */
export async function uploadDocument(
  request: FastifyRequest<{ Params: { kbId: string } }>,
  reply: FastifyReply
) {
  try {
    const { kbId } = request.params;
    const userId = (request as any).user?.userId;  // 从 JWT 获取
    const file = await request.file();  // Fastify multipart

    if (!file) {
      return reply.status(400).send({ error: '请上传文件' });
    }

    // 1. 读取文件内容
    const fileBuffer = await file.toBuffer();
    const filename = file.filename;

    // 2. 上传到 OSS（可选，用于备份）
    const fileUrl = await storage.upload(fileBuffer, filename);

    // 3. 调用 RAG 引擎入库（自动调用文档处理引擎）
    const ingestService = getDocumentIngestService(prisma);
    const result = await ingestService.ingestDocument(
      {
        filename,
        fileBuffer,
      },
      {
        kbId,
        contentType: 'LITERATURE',
        tags: ['上传'],
      }
    );

    // 4. 返回结果
    return reply.status(201).send({
      success: true,
      data: {
        documentId: result.documentId,
        filename,
        fileUrl,
        chunkCount: result.chunkCount,
        tokenCount: result.tokenCount,
        duration: result.duration,
      },
    });

  } catch (error) {
    logger.error('文档上传失败', { error });
    return reply.status(500).send({ error: '文档上传失败' });
  }
}

关键点：

✅ 使用 getDocumentIngestService(prisma) 获取服务
✅ 调用 ingestDocument() - 自动完成文档处理全流程
✅ 不需要手动调用 Python 微服务
✅ 不需要手动分块、向量化
✅ 环境变量自动处理（EXTRACTION_SERVICE_URL）

关键配置

必须配置的环境变量：

# backend/.env
EXTRACTION_SERVICE_URL=http://localhost:8000  # Python 微服务地址
DASHSCOPE_API_KEY=sk-xxx                      # 阿里云 API Key（向量化）

启动服务顺序：

# 1. 启动数据库
docker-compose up -d postgres

# 2. 启动 Python 微服务
cd extraction_service
.\venv\Scripts\python -m uvicorn main:app --reload --port 8000

# 3. 启动 Node.js 后端
cd backend
npm run dev

📚 相关文档

01-文档处理引擎设计方案.md - 详细技术方案
📖 RAG 引擎使用指南 - 推荐阅读（完整业务集成）

🚀 快速测试

# 1. 测试 Python 微服务
curl http://localhost:8000/api/health

# 2. 测试单文件转换（调试用）
curl -X POST http://localhost:8000/api/document/to-markdown \
  -F "file=@test.pdf"

# 3. 测试完整入库流程（推荐）
cd backend
npx tsx src/tests/test-pdf-ingest.ts "path/to/your.pdf"

📅 版本历史

版本	日期	变更内容
v1.0	2026-01-21	初版：基于 pymupdf4llm 的文档处理引擎使用指南

13 KiB Raw Blame History Unescape Escape