docs(rag-engine): update architecture design with building-blocks principle

- Add core design principle: provide building blocks, no strategy selection
- Remove chat() method, strategy determined by business modules
- Add new capabilities: getDocumentFullText(), getAllDocumentsText()
- Add new capabilities: getDocumentSummary(), getAllDocumentsSummaries()
- Add business module strategy examples (PKB/AIA/ASL/RVW)
- Add strategy selection guide (by scale, by scenario)
- Update data model with summary and tokenCount fields
- Add SummaryService to code structure
2026-01-20 20:35:26 +08:00
parent 74a209317e
commit 1f5bf2cd65
3 changed files with 1583 additions and 82 deletions
--- a/docs/02-通用能力层/03-RAG引擎/README.md
+++ b/docs/02-通用能力层/03-RAG引擎/README.md
@@ -1,114 +1,325 @@
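-# RAG Engine
+# Knowledge Base Engine (RAG Engine)

 > **Capability tier:** Common capability layer  
-> **Reuse rate:** 43% (3 dependent modules)  
-> **Priority:** P1  
-> **Status:** ✅ Implemented (on Dify)
+> **Reuse rate:** 57% (4 dependent modules)  
+> **Priority:** P0  
+> **Status:** 🔄 Upgrading (Dify → PostgreSQL + pgvector)  
+> **Core principle:** Provide building blocks (Lego bricks); make no strategy choices  
+> **Last updated:** 2026-01-20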

 ---

 ## 📋 Capability Overview

-The RAG engine is responsible for:
- Vector storage (Embedding)
- Semantic search
- Retrieval-augmented generation (RAG)
- Rerank
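+The Knowledge Base Engine is a **core common capability** of the platform. It provides the **building blocks (Lego bricks)** for knowledge base features.
+
+### ⭐ Core Design Principle
+
+```
+✅ Provide building blocks (Lego bricks)
+❌ Make no strategy choices (business modules decide how to assemble them)
+```
+
+**Rationale:** different business scenarios call for different strategies
+- Small knowledge base (10 files) → pass the full text directly to the LLM
+- Large knowledge base (100+ files) → RAG vector retrieval
+- Special scenarios → summary screening + Top-K full text
+
+### Building Blocks
+
+- 📄 **Document ingestion** - parse → chunk → embed → store
+- 📝 **Full-text retrieval** - fetch full text for one document or in batch
+- 📋 **Summary retrieval** - fetch summaries for one document or in batch
+- 🔍 **Vector search** - semantic search backed by pgvector
+- 🔤 **Keyword search** - backed by PostgreSQL FTS
+- 🔀 **Hybrid search** - vector + keyword + RRF fusion
+
+> ⚠️ **Note:** no `chat()` method is provided! Each business module decides its Q&A strategy for its own scenario.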
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│         Business module layer (strategy selection)          │
+│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
+│ │    PKB     │ │    AIA     │ │    ASL     │ │    RVW     │ │
+│ │ full / RAG │ │summary+full│ │vec + rerank│ │full compare│ │
+│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
+│       │      modules combine blocks as needed      │        │
+│       └──────────────┴──────────────┴──────────────┘        │
+│                              │                              │
+│                              ▼                              │
+├─────────────────────────────────────────────────────────────┤
+│   Knowledge Base Engine (building blocks / Lego bricks)    │
+│   ┌─────────────────────────────────────────────────────┐   │
+│   │              KnowledgeBaseEngine                     │   │
+│   │   ingest() / getFullText() / getSummary()           │   │
+│   │   vectorSearch() / keywordSearch() / hybridSearch() │   │
+│   │   ❌ no chat() - strategy left to business modules  │   │
+│   └─────────────────────────────────────────────────────┘   │
+├─────────────────────────────────────────────────────────────┤
+│                PostgreSQL + pgvector                        │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
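+## 🔄 Major Update (2026-01-20)
+
+### Architecture upgrade: Dify → PostgreSQL + pgvector
+
+| Dimension | Previous (Dify) | New (pgvector) |
+|------|----------------|-------------------|
+| **Storage** | Qdrant (external) | PostgreSQL (internal) |
+| **Data control** | External API | Fully self-managed |
+| **Extensibility** | Limited | Highly flexible |
+| **Fits the architecture** | ❌ | ✅ Postgres-Only |
+
+### Repositioning: from PKB module to common capability layer
+
+| Dimension | Previous | New |
+|------|--------|--------|
+| Code location | `modules/pkb/` | `common/rag/` |
+| Scope of use | PKB only | Whole platform |
+| Design goal | Single-module feature | Common capability |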

 ---

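 ## 📊 Dependent Modules

-**3 dependent modules (43% reuse rate):**
-1. **AIA** - AI Q&A (@knowledge-base Q&A)
-2. **ASL** - AI literature (literature content retrieval)
-3. **PKB** - Personal knowledge base (RAG Q&A)
+**4 dependent modules (57% reuse rate):**
+
+| Module | Use cases | Priority |
+|------|----------|--------|
+| **PKB** Personal Knowledge Base | Knowledge base management, RAG Q&A, full-text reading | P0 (first to integrate) |
+| **AIA** AI Q&A | @knowledge-base Q&A, attachment understanding | P0 |
+| **ASL** AI Literature | Literature library search, automated review generation | P1 |
+| **RVW** Manuscript Review | Manuscript-to-literature comparison, duplicate-content detection | P1 |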

 ---

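-## 💡 Core Features
+## 💡 Using the Building Blocks

-### 1. Vector storage
- Built on the Dify platform
- Qdrant vector database (bundled with Dify)
+### 1. Document ingestion

-### 2. Semantic search
- Top-K retrieval
- Relevance scoring
- Joint search across multiple knowledge bases
-
-### 3. RAG Q&A
- Retrieval + generation
- Smart citation system (100% accurate source tracing)
-
---
-
-## 🏗️ Technical Architecture
-
-**Built on the Dify platform:**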
 ```typescript
-// DifyClient wrapper
-interface RAGEngine {
-  // Create a knowledge base
-  createDataset(name: string): Promise<string>;
-  
-  // Upload a document
-  uploadDocument(datasetId: string, file: File): Promise<string>;
-  
-  // Semantic search
-  search(datasetId: string, query: string, topK?: number): Promise<SearchResult[]>;
-  
-  // RAG Q&A
-  chatWithRAG(datasetId: string, query: string): Promise<string>;
+import { KnowledgeBaseEngine } from '@/common/rag';
+
+const kbEngine = new KnowledgeBaseEngine(prisma);
+
+await kbEngine.ingestDocument({
+  kbId: 'kb-123',
+  userId: 'user-456',
+  file: pdfBuffer,
+  filename: 'research.pdf',
+  options: {
+    generateSummary: true,      // generate a summary
+    extractClinicalData: true,  // extract clinical data such as PICO
+  }
+});
+```
+
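+### 2. Full-text / summary retrieval
+
+```typescript
+// Fetch the full text of a single document
+const doc = await kbEngine.getDocumentFullText(documentId);
+
+// Fetch the full text of every document in a knowledge base (small-KB scenario)
+const allDocs = await kbEngine.getAllDocumentsText(kbId);
+
+// Fetch the summary of every document in a knowledge base (screening scenario)
+const summaries = await kbEngine.getAllDocumentsSummaries(kbId);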
+```
+
+### 3. Search
+
+```typescript
+// Vector search
+const vectorResults = await kbEngine.vectorSearch(kbIds, query, 20);
+
+// Keyword search
+const keywordResults = await kbEngine.keywordSearch(kbIds, query, 20);
+
+// Hybrid search (vector + keyword + RRF)
+const hybridResults = await kbEngine.hybridSearch(kbIds, query, 10);
+```
+
+---
+
+## 🎯 Business Module Strategy Examples
+
+### PKB: full-text mode for small knowledge bases
+
+```typescript
+// 10 documents → pass the full text directly to the LLM
+async function pkbSmallKbChat(kbId: string, query: string) {
+  const docs = await kbEngine.getAllDocumentsText(kbId);
+  const context = docs.map(d => `## ${d.filename}\n${d.text}`).join('\n\n');
+  return llmChat(context, query);
+}
+```
+
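+### AIA: summary screening + Top-K full text
+
+```typescript
+// Summary screening → LLM picks the Top 5 → read their full text
+async function aiaSmartChat(kbIds: string[], query: string) {
+  // getAllDocumentsSummaries() takes a single kbId, so query each KB and flatten
+  const summaries = (
+    await Promise.all(kbIds.map(id => kbEngine.getAllDocumentsSummaries(id)))
+  ).flat();
+  const topDocIds = await llmSelectTopK(summaries, query, 5);
+  const fullTexts = await Promise.all(
+    topDocIds.map(id => kbEngine.getDocumentFullText(id))
+  );
+  // DocumentText is an object, so join the text fields rather than the objects
+  return llmChat(fullTexts.map(d => d.text).join('\n\n'), query);
+}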
+```
+
+### ASL: vector search + Rerank
+
+```typescript
+// Large-scale literature retrieval
+async function aslSearch(kbIds: string[], query: string) {
+  const candidates = await kbEngine.vectorSearch(kbIds, query, 50);
+  const reranked = await rerankService.rerank(candidates, query, 10);
+  return reranked;
 }
 ```

 ---

-## 📈 Optimization Results
+## 🏗️ Technical Architecture

-**Retrieval parameter tuning:**
-| Metric | Before | After | Gain |
-|------|--------|--------|------|
-| Chunks retrieved | 3 chunks | 15 chunks | 5× |
-| Chunk size | 500 tokens | 1500 tokens | 3× |
-| Total coverage | 1,500 tokens | 22,500 tokens | 15× |
-| Coverage rate | ~5% | ~40-50% | 8-10× |
+### Code structure
+
+```
+backend/src/common/rag/
+├── index.ts                      # unified exports
+├── KnowledgeBaseEngine.ts        # single entry class (building blocks)
+├── services/
+│   ├── ChunkService.ts           # document chunking
+│   ├── EmbeddingService.ts       # embedding (Alibaba Cloud)
+│   ├── SummaryService.ts         # summary generation ⭐
+│   ├── VectorSearchService.ts    # vector search
+│   ├── KeywordSearchService.ts   # keyword search
+│   ├── HybridSearchService.ts    # hybrid search + RRF
+│   └── ClinicalExtractionService.ts  # clinical element extraction
+├── types/
+│   └── index.ts                  # type definitions
+└── utils/
+    └── rrfFusion.ts              # RRF algorithm
+```
+
+### Data model
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    ekb_schema                                │
+│  ┌───────────────────────┐    ┌───────────────────────┐    │
+│  │     EkbDocument       │    │      EkbChunk         │    │
+│  │  ─────────────────    │    │  ─────────────────    │    │
+│  │  id                   │    │  id                   │    │
+│  │  kbId                 │───>│  documentId           │    │
+│  │  filename             │    │  content              │    │
+│  │  extractedText        │    │  embedding (vector)   │    │
+│  │  summary ⭐           │    │  pageNumber           │    │
+│  │  tokenCount           │    │  sectionType          │    │
+│  │  pico (JSONB)         │    └───────────────────────┘    │
+│  │  studyDesign (JSONB)  │                                  │
+│  │  regimen (JSONB)      │                                  │
+│  │  safety (JSONB)       │                                  │
+│  └───────────────────────┘                                  │
+└─────────────────────────────────────────────────────────────┘
+```
+
+> ⭐ The `summary` field supports the "summary screening + Top-K full text" strategy

 ---

-## 🔗 Related Documents
+## 📚 API

- [Common capability layer overview](../README.md)
- [Dify integration document](../../00-系统总体设计/03-数据库架构说明.md)
+### KnowledgeBaseEngine
+
+```typescript
+class KnowledgeBaseEngine {
+  // ========== Document ingestion ==========
+  ingestDocument(params: IngestParams): Promise<IngestResult>;
+  ingestBatch(documents: IngestParams[]): Promise<IngestResult[]>;
+  
+  // ========== Content retrieval ==========
+  getDocumentFullText(documentId: string): Promise<DocumentText>;
+  getAllDocumentsText(kbId: string): Promise<DocumentText[]>;
+  getDocumentSummary(documentId: string): Promise<DocumentSummary>;
+  getAllDocumentsSummaries(kbId: string): Promise<DocumentSummary[]>;
+  
+  // ========== Search ==========
+  vectorSearch(kbIds: string[], query: string, topK?: number): Promise<SearchResult[]>;
+  keywordSearch(kbIds: string[], query: string, topK?: number): Promise<SearchResult[]>;
+  hybridSearch(kbIds: string[], query: string, topK?: number): Promise<SearchResult[]>;
+  
+  // ========== Management ==========
+  deleteDocument(documentId: string): Promise<void>;
+  clearKnowledgeBase(kbId: string): Promise<void>;
+  getKnowledgeBaseStats(kbId: string): Promise<KBStats>;
+  
+  // ❌ No chat() method - each business module decides the strategy for its scenario
+}
+```
+
+---
+
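+## 🔗 Relationship to Other Common Capabilities
+
+| Dependency | Purpose |
+|----------|------|
+| **Document processing engine** | PDF/Word/Excel → Markdown |
+| **LLM gateway** | Summary generation, clinical element extraction |
+| **Storage service** | Stores original documents in OSS |
+
+> Note: business modules implement LLM Q&A themselves by calling the LLM gateway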
+
+---
+
+## 📅 Development Plan
+
+See: [02-pgvector替换Dify计划.md](./02-pgvector替换Dify计划.md)
+
+| Milestone | Scope | Duration | Status |
+|--------|------|------|------|
+| **M1** | Database design + core services | 5 days | 🔜 Not started |
+| **M2** | PKB module integration + testing | 3 days | 📋 Planned |
+| **M3** | Data migration + launch | 2 days | 📋 Planned |
+
+---
+
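+## 📂 Related Documents
+
+- [Knowledge Base Engine architecture design](./01-知识库引擎架构设计.md)
+- [Development plan for replacing Dify with pgvector](./02-pgvector替换Dify计划.md)
+- [Document processing engine](../02-文档处理引擎/01-文档处理引擎设计方案.md)
+- [Common capability layer inventory](../00-通用能力层清单.md)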
+
+---
+
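+## 📅 Changelog
+
+### 2026-01-20 v1.1 Major design-principle update
+
+- ⭐ **Core principle:** provide building blocks (Lego bricks); make no strategy choices
+- ❌ Removed the `chat()` method; business modules decide the strategy
+- 🆕 Added `getDocumentFullText()` / `getAllDocumentsText()` for full-text retrieval
+- 🆕 Added `getDocumentSummary()` / `getAllDocumentsSummaries()` for summary retrieval
+- 🆕 Added business module strategy examples (PKB/AIA/ASL/RVW)
+
+### 2026-01-20 v1.0 Architecture upgrade
+
+- 🔄 Repositioned: promoted from a PKB module to the common capability layer
+- 🆕 Created the architecture design document
+- 🆕 Reworked the development plan (from the common-capability-layer perspective)
+- 📦 Planned code structure: `common/rag/`
+
+### 2025-11-06 Initial version
+
+- Built on Dify
+- Used by the PKB module only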

 ---

-**Last updated:** 2025-11-06  
 **Maintainer:** Technical architect
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
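The README in this diff lists hybrid search as "vector + keyword + RRF fusion" and plans a `utils/rrfFusion.ts`, but the commit does not show the algorithm. As a rough illustration only, Reciprocal Rank Fusion could be sketched as below. The names `RankedItem` and `rrfFuse`, the `id` field, and the constant `k = 60` are assumptions made for this sketch, not part of the committed code.

```typescript
// Hypothetical result shape: the real SearchResult type is not shown in the diff.
interface RankedItem {
  id: string;
}

/**
 * Reciprocal Rank Fusion: each list contributes 1 / (k + rank) to an item's
 * score, where rank is the 1-based position in that list.
 * k = 60 is a commonly used default and is an assumption here.
 */
function rrfFuse<T extends RankedItem>(
  lists: T[][],
  topK: number,
  k = 60
): Array<T & { score: number }> {
  const fused = new Map<string, { item: T; score: number }>();
  for (const list of lists) {
    list.forEach((item, index) => {
      const contribution = 1 / (k + index + 1);
      const entry = fused.get(item.id);
      if (entry) {
        entry.score += contribution;
      } else {
        fused.set(item.id, { item, score: contribution });
      }
    });
  }
  // Highest fused score first, then keep the top K.
  return Array.from(fused.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ item, score }) => ({ ...item, score }));
}

// Usage: fuse a vector ranking with a keyword ranking.
const vectorHits = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const keywordHits = [{ id: 'b' }, { id: 'd' }];
const fusedHits = rrfFuse([vectorHits, keywordHits], 3);
```

Because RRF uses only rank positions, it needs no score normalization between the vector and keyword result lists; an item that appears in both lists (here `b`) accumulates two contributions and rises to the top.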