Files

HaHafeng 40c2f8e148 feat(rag): Complete RAG engine implementation with pgvector

Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready

2026-01-21 20:24:29 +08:00

17 KiB

Raw Blame History

知识库引擎分阶段实施方案

文档版本： v1.0
创建日期： 2026-01-20
最后更新： 2026-01-20
核心原则： 先跑通 MVP，让业务走起来，再逐步完善

📋 概述

为什么分阶段实施？

完整的知识库引擎包含多个复杂功能，一次性全部实现风险高、周期长。采用分阶段实施：

✅ 降低风险：每阶段可交付、可验证
✅ 快速见效：MVP 3天即可让业务跑起来
✅ 灵活调整：根据业务反馈调整后续优先级

三阶段总览

┌─────────────────────────────────────────────────────────────┐
│  Phase 1: MVP（3天）                                         │
│  ─────────────────                                          │
│  目标：让业务跑起来                                           │
│  能力：入库 + 向量检索 + 全文获取                             │
│  场景：PKB 基础问答                                          │
├─────────────────────────────────────────────────────────────┤
│  Phase 2: 增强检索（2天）                                    │
│  ─────────────────                                          │
│  目标：检索质量提升                                           │
│  能力：+ 关键词检索 + 混合检索 + rerank                       │
│  场景：PKB 高质量检索                                        │
├─────────────────────────────────────────────────────────────┤
│  Phase 3: 完整功能（3天）                                    │
│  ─────────────────                                          │
│  目标：完整架构落地                                           │
│  能力：+ 异步入库 + 摘要生成 + 临床要素提取                    │
│  场景：ASL、AIA 完整功能                                     │
└─────────────────────────────────────────────────────────────┘

🚀 Phase 1: MVP（3天）

目标

最小可用版本：PKB 能上传文档、能检索、能问答

交付能力

能力	方法	说明
文档入库	`ingestDocument()`	同步处理（简化版）
向量检索	`vectorSearch()`	pgvector 语义检索
全文获取	`getDocumentFullText()`	获取单个文档
	`getAllDocumentsText()`	获取知识库所有文档
管理操作	`deleteDocument()`	删除文档

暂不实现

❌ 异步入库（pg-boss）
❌ 关键词检索（pg_trgm）
❌ 混合检索（RRF）
❌ rerank 重排序
❌ 摘要生成
❌ 临床要素提取（PICO）

数据模型

📌 数据模型详见：04-数据模型设计.md

⚠️ 重要：MVP 阶段就创建完整 Schema，但只使用部分字段。避免后续阶段改表迁移。

字段使用阶段说明

字段分层	Phase 1 MVP	Phase 2 增强	Phase 3 完整
Layer 0-1: 基础信息
filename, fileType, fileSizeBytes	✅ 填充	-	-
fileUrl, extractedText	✅ 填充	-	-
status, errorMessage	✅ 使用	-	-
Layer 2: 内容增强
summary, tokenCount, pageCount	❌ 预留	❌ 预留	✅ 填充
Layer 3: 分类标签
contentType, tags, category	⚪ 可选	⚪ 可选	✅ 启用
Layer 4: 结构化数据
metadata, structuredData	❌ 预留	❌ 预留	✅ 填充
EkbChunk
content, chunkIndex	✅ 填充	-	-
pageNumber, sectionType	⚪ 可选	-	-
embedding	✅ 填充	-	-

核心代码

// Phase 1: MVP 版本 KnowledgeBaseEngine
export class KnowledgeBaseEngine {
  constructor(private prisma: PrismaClient) {}

  /**
   * 同步入库（MVP 简化版，小文件场景）
   * Phase 3 将升级为异步
   */
  async ingestDocument(params: {
    kbId: string;
    userId: string;
    file: Buffer;
    filename: string;
  }): Promise<{ documentId: string }> {
    // 1. 解析文档 → Markdown
    const markdown = await documentProcessor.toMarkdown(params.file, params.filename);
    
    // 2. 切片
    const chunks = chunkService.split(markdown, { size: 512, overlap: 50 });
    
    // 3. 向量化
    const embeddings = await embeddingService.embedBatch(chunks.map(c => c.text));
    
    // 4. 上传原始文件到 OSS
    const fileUrl = await storage.upload(params.file, params.filename);
    
    // 5. 存储文档（使用完整 Schema，MVP 只填充部分字段）
    const document = await this.prisma.ekbDocument.create({
      data: {
        kbId: params.kbId,
        userId: params.userId,
        filename: params.filename,
        fileType: getFileType(params.filename),
        fileSizeBytes: BigInt(params.file.length),
        fileUrl: fileUrl,
        extractedText: markdown,
        status: 'completed',
        // Phase 3 才填充的字段保持 null：
        // summary, tokenCount, pico, studyDesign, regimen, safety, criteria, endpoints
      }
    });
    
    // 6. 存储切片 + 向量（使用完整 Schema）
    for (let i = 0; i < chunks.length; i++) {
      await this.prisma.$executeRaw`
        INSERT INTO "ekb_schema"."EkbChunk" 
        (id, document_id, content, chunk_index, page_number, section_type, embedding, created_at)
        VALUES (
          ${crypto.randomUUID()},
          ${document.id},
          ${chunks[i].text},
          ${i},
          ${chunks[i].pageNumber || null},
          ${chunks[i].sectionType || null},
          ${embeddings[i]}::vector,
          NOW()
        )
      `;
    }
    
    return { documentId: document.id };
  }

  /**
   * 向量检索
   */
  async vectorSearch(
    kbIds: string[],
    query: string,
    topK: number = 10
  ): Promise<SearchResult[]> {
    const queryVector = await embeddingService.embed(query);
    
    const results = await this.prisma.$queryRaw<SearchResult[]>`
      SELECT 
        c.id,
        c.content,
        c.document_id,
        d.filename,
        1 - (c.embedding <=> ${queryVector}::vector) as score
      FROM "ekb_schema"."EkbChunk" c
      JOIN "ekb_schema"."EkbDocument" d ON c.document_id = d.id
      WHERE d.kb_id = ANY(${kbIds}::text[])
      ORDER BY c.embedding <=> ${queryVector}::vector
      LIMIT ${topK}
    `;
    
    return results;
  }

  /**
   * 获取单个文档全文
   */
  async getDocumentFullText(documentId: string): Promise<DocumentText> {
    const doc = await this.prisma.ekbDocument.findUnique({
      where: { id: documentId },
      select: { id: true, filename: true, extractedText: true }
    });
    
    if (!doc) throw new Error('Document not found');
    
    return {
      id: doc.id,
      filename: doc.filename,
      text: doc.extractedText || '',
    };
  }

  /**
   * 获取知识库所有文档全文
   */
  async getAllDocumentsText(kbId: string): Promise<DocumentText[]> {
    const docs = await this.prisma.ekbDocument.findMany({
      where: { kbId, status: 'completed' },
      select: { id: true, filename: true, extractedText: true }
    });
    
    return docs.map(doc => ({
      id: doc.id,
      filename: doc.filename,
      text: doc.extractedText || '',
    }));
  }

  /**
   * 删除文档
   */
  async deleteDocument(documentId: string): Promise<void> {
    await this.prisma.ekbDocument.delete({
      where: { id: documentId }
    });
  }
}

索引设计（完整版，一次性创建）

-- ===== MVP 阶段必须创建 =====

-- 1. HNSW 向量索引（语义检索核心）
CREATE INDEX IF NOT EXISTS ekb_chunk_embedding_idx 
ON "ekb_schema"."EkbChunk" 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- ===== Phase 2 阶段使用（MVP 可预创建）=====

-- 2. pg_bigm 中文关键词索引
CREATE EXTENSION IF NOT EXISTS pg_bigm;
CREATE INDEX IF NOT EXISTS ekb_chunk_content_bigm_idx 
ON "ekb_schema"."EkbChunk" 
USING gin (content gin_bigm_ops);

-- ===== Phase 3 阶段使用（MVP 可预创建）=====

-- 3. JSONB GIN 索引（临床数据查询）
CREATE INDEX IF NOT EXISTS ekb_document_pico_idx 
ON "ekb_schema"."EkbDocument" USING gin (pico);

CREATE INDEX IF NOT EXISTS ekb_document_safety_idx 
ON "ekb_schema"."EkbDocument" USING gin (safety);

CREATE INDEX IF NOT EXISTS ekb_document_studydesign_idx 
ON "ekb_schema"."EkbDocument" USING gin ("studyDesign");

💡 建议：MVP 阶段一次性创建所有索引，避免后续 DDL 操作。空表创建索引几乎无成本。

任务清单

任务	预估	产出
Schema 迁移（完整版）	3h	`ekb_schema` + 完整表结构 + 全部索引
EmbeddingService	3h	阿里云 API 封装
ChunkService	2h	文本切片
KnowledgeBaseEngine MVP	4h	核心类（只实现 MVP 方法）
单元测试	3h	基础测试用例
PKB 集成	4h	替换 Dify 调用
合计	19h (3天)

💡 关键点：Schema 迁移一次到位，后续阶段只写代码，不改表。

验收标准

PKB 可上传 PDF 文档
PKB 可向量检索
PKB 可获取文档全文
PKB 可删除文档
基础问答功能正常

🔍 Phase 2: 增强检索（2天）

目标

检索质量提升：支持中文关键词、混合检索、结果重排序

新增能力

能力	方法	说明
关键词检索	`keywordSearch()`	pg_bigm 中文精确匹配
混合检索	`hybridSearch()`	向量 + 关键词 + RRF
重排序	`rerank()`	Qwen-Rerank API

新增索引

-- Phase 2: 关键词检索索引（pg_bigm，专为中文优化）
CREATE EXTENSION IF NOT EXISTS pg_bigm;

CREATE INDEX IF NOT EXISTS ekb_chunk_content_bigm_idx 
ON "ekb_schema"."EkbChunk" 
USING gin (content gin_bigm_ops);

新增代码

// Phase 2 新增方法

/**
 * 关键词检索（pg_bigm 中文精确匹配）
 */
async keywordSearch(
  kbIds: string[],
  query: string,
  topK: number = 10
): Promise<SearchResult[]> {
  const results = await this.prisma.$queryRaw<SearchResult[]>`
    SELECT 
      c.id,
      c.content,
      c.document_id,
      d.filename,
      bigm_similarity(c.content, ${query}) as score
    FROM "ekb_schema"."EkbChunk" c
    JOIN "ekb_schema"."EkbDocument" d ON c.document_id = d.id
    WHERE d.kb_id = ANY(${kbIds}::text[])
      AND c.content LIKE ${'%' + query + '%'}
    ORDER BY bigm_similarity(c.content, ${query}) DESC
    LIMIT ${topK}
  `;
  
  return results;
}

/**
 * 混合检索（向量 + 关键词 + RRF 融合）
 */
async hybridSearch(
  kbIds: string[],
  query: string,
  topK: number = 10
): Promise<SearchResult[]> {
  // 并发执行两路检索
  const [vectorResults, keywordResults] = await Promise.all([
    this.vectorSearch(kbIds, query, 20),
    this.keywordSearch(kbIds, query, 20),
  ]);
  
  // RRF 融合
  return rrfFusion(vectorResults, keywordResults, topK);
}

/**
 * 重排序（Qwen-Rerank API）
 */
async rerank(
  documents: SearchResult[],
  query: string,
  topK: number = 10
): Promise<SearchResult[]> {
  const response = await dashscope.rerank({
    model: 'gte-rerank',
    query,
    documents: documents.map(d => d.content),
    top_n: topK,
  });
  
  return response.results.map(r => ({
    ...documents[r.index],
    score: r.relevance_score,
  }));
}

RRF 融合算法

// utils/rrfFusion.ts
export function rrfFusion(
  vectorResults: SearchResult[],
  keywordResults: SearchResult[],
  topK: number,
  k: number = 60
): SearchResult[] {
  const scores = new Map<string, number>();
  
  // 向量检索得分
  vectorResults.forEach((r, rank) => {
    const score = scores.get(r.id) || 0;
    scores.set(r.id, score + 1 / (k + rank + 1));
  });
  
  // 关键词检索得分
  keywordResults.forEach((r, rank) => {
    const score = scores.get(r.id) || 0;
    scores.set(r.id, score + 1 / (k + rank + 1));
  });
  
  // 合并去重 + 排序
  const allResults = [...vectorResults, ...keywordResults];
  const uniqueResults = Array.from(
    new Map(allResults.map(r => [r.id, r])).values()
  );
  
  return uniqueResults
    .map(r => ({ ...r, score: scores.get(r.id) || 0 }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

任务清单

任务	预估	产出
pg_bigm 索引	1h	SQL 迁移
KeywordSearchService	2h	关键词检索（pg_bigm）
RRF 融合算法	2h	rrfFusion.ts
RerankService	2h	阿里云 API 封装
hybridSearch 集成	2h	混合检索
测试	3h	检索质量验证
合计	12h (2天)

验收标准

关键词检索支持中文（"帕博利珠" 可匹配 "帕博利珠单抗"）
混合检索可用
rerank 可用
检索召回率提升（对比 Phase 1）

🎯 Phase 3: 完整功能（3天）

目标

完整架构落地：支持大文件、高级提取功能

新增能力

能力	方法	说明
异步入库	`submitIngestTask()`	pg-boss 队列
	`getIngestStatus()`	任务状态查询
摘要获取	`getDocumentSummary()`	LLM 生成摘要
	`getAllDocumentsSummaries()`	批量获取摘要
临床数据	`getClinicalData()`	PICO 等结构化数据

Schema 说明

✅ 无需 Schema 升级：MVP 阶段已创建完整表结构，Phase 3 只需填充字段。

// Phase 3: 开始填充 summary、tokenCount、临床数据字段
await prisma.ekbDocument.update({
  where: { id: documentId },
  data: {
    summary: generatedSummary,           // 🆕 Phase 3 填充
    tokenCount: calculatedTokens,        // 🆕 Phase 3 填充
    pico: extractedPico,                 // 🆕 Phase 3 填充
    studyDesign: extractedStudyDesign,   // 🆕 Phase 3 填充
    regimen: extractedRegimen,           // 🆕 Phase 3 填充
    safety: extractedSafety,             // 🆕 Phase 3 填充
  }
});

异步入库实现

详见：Postgres-Only异步任务处理指南

// Phase 3: 异步入库

/**
 * 提交入库任务
 */
async submitIngestTask(params: {
  kbId: string;
  userId: string;
  file: Buffer;
  filename: string;
  options?: {
    enableSummary?: boolean;           // 💰 默认 false
    enableClinicalExtraction?: boolean; // 💰 默认 false
  };
}): Promise<{ taskId: string; documentId: string }> {
  // 1. 快速上传到 OSS
  const fileUrl = await storage.upload(params.file);
  
  // 2. 创建文档记录（status: processing）
  const document = await this.prisma.ekbDocument.create({
    data: {
      kbId: params.kbId,
      userId: params.userId,
      filename: params.filename,
      fileUrl,
      status: 'processing',
    }
  });
  
  // 3. 推送任务到 pg-boss
  const taskId = await jobQueue.send('ekb-ingest', {
    documentId: document.id,
    fileUrl,
    options: params.options,
  });
  
  return { taskId, documentId: document.id };
}

/**
 * 获取任务状态
 */
async getIngestStatus(taskId: string): Promise<{
  status: 'pending' | 'processing' | 'completed' | 'failed';
  progress: number;
  error?: string;
}> {
  const job = await jobQueue.getJobById(taskId);
  
  return {
    status: job.state,
    progress: job.data?.progress || 0,
    error: job.state === 'failed' ? job.output?.error : undefined,
  };
}

任务清单

任务	预估	产出
pg-boss Worker	4h	ingestWorker.ts（异步入库）
SummaryService	3h	LLM 摘要生成
ClinicalExtractionService	4h	PICO 提取 + JSON 容错
摘要/临床数据 API	2h	新增 getDocumentSummary 等方法
测试	4h	完整流程验证
合计	17h (3天)

✅ 优势：无需 Schema 迁移，只写业务代码。

验收标准

大文件上传不超时
任务状态可查询
摘要生成可用
PICO 提取可用（ASL 场景）
AIA 摘要筛选策略可用

📊 总体进度

阶段	工期	累计	状态
Phase 1 MVP	3天	3天	🔜 待开始
Phase 2 增强	2天	5天	📋 规划中
Phase 3 完整	3天	8天	📋 规划中

🔗 相关文档

知识库引擎架构设计 - 完整架构目标
pgvector 替换 Dify 技术方案 - 详细技术实现
Postgres-Only异步任务处理指南 - 异步架构参考

📅 更新日志

v1.0 (2026-01-20)

初始版本
确定三阶段实施方案：MVP → 增强 → 完整

维护人： 技术架构师
最后更新： 2026-01-20

17 KiB Raw Blame History Unescape Escape

知识库引擎分阶段实施方案

📋 概述

为什么分阶段实施？

三阶段总览

🚀 Phase 1: MVP（3天）

目标

交付能力

暂不实现

数据模型

字段使用阶段说明

核心代码

索引设计（完整版，一次性创建）

任务清单

验收标准

🔍 Phase 2: 增强检索（2天）

目标

新增能力

新增索引

新增代码

RRF 融合算法

任务清单

验收标准

🎯 Phase 3: 完整功能（3天）

目标

新增能力

Schema 说明

异步入库实现

任务清单

验收标准

📊 总体进度

🔗 相关文档

📅 更新日志

v1.0 (2026-01-20)

17 KiB

Raw Blame History