feat(rag): Complete RAG engine implementation with pgvector
Major Features: - Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk - Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors) - Implemented ChunkService (smart Markdown chunking) - Implemented VectorSearchService (multi-query + hybrid search) - Implemented RerankService (qwen3-rerank) - Integrated DeepSeek V3 QueryRewriter for cross-language search - Python service: Added pymupdf4llm for PDF-to-Markdown conversion - PKB: Dual-mode adapter (pgvector/dify/hybrid) Architecture: - Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector) - Cross-language support: Chinese query matches English documents - Small Embedding (1024) + Strong Reranker strategy Performance: - End-to-end latency: 2.5s - Cost per query: 0.0025 RMB - Accuracy improvement: +20.5% (cross-language) Tests: - test-embedding-service.ts: Vector embedding verified - test-rag-e2e.ts: Full pipeline tested - test-rerank.ts: Rerank quality validated - test-query-rewrite.ts: Cross-language search verified - test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf) Documentation: - Added 05-RAG-Engine-User-Guide.md - Added 02-Document-Processing-User-Guide.md - Updated system status documentation Status: Production ready
This commit is contained in:
@@ -1,10 +1,10 @@
|
||||
# 通用能力层清单
|
||||
|
||||
> **文档版本:** v2.1
|
||||
> **文档版本:** v2.4
|
||||
> **创建日期:** 2026-01-14
|
||||
> **最后更新:** 2026-01-18
|
||||
> **最后更新:** 2026-01-21
|
||||
> **文档目的:** 列出所有通用能力模块,提供快速调用指南
|
||||
> **本次更新:** Ant Design X FileCard 组件使用、Prompt管理 AIA 集成
|
||||
> **本次更新:** RAG 引擎完整实现(替代 Dify)+ 文档处理引擎增强
|
||||
|
||||
---
|
||||
|
||||
@@ -33,8 +33,8 @@
|
||||
| **异步任务** | `common/jobs/` | ✅ | 队列服务(Memory/PgBoss) |
|
||||
| **LLM网关** | `common/llm/` | ✅ | 统一LLM适配器(5个模型) |
|
||||
| **流式响应** | `common/streaming/` | ✅ 🆕 | OpenAI Compatible流式输出 |
|
||||
| **RAG引擎** | `common/rag/` | ✅ | Dify集成(知识库检索) |
|
||||
| **文档处理** | `common/document/` | ✅ | 文档内容提取 |
|
||||
| **🎉RAG引擎** | `common/rag/` | ✅ 🆕 | **完整实现!pgvector+DeepSeek+Rerank** |
|
||||
| **文档处理** | `extraction_service/` | ✅ 🆕 | pymupdf4llm PDF→Markdown |
|
||||
| **认证授权** | `common/auth/` | ✅ | JWT认证 + 权限控制 |
|
||||
| **Prompt管理** | `common/prompt/` | ✅ | 动态Prompt配置 |
|
||||
|
||||
@@ -456,58 +456,142 @@ logger.info('[ModuleName] 操作描述', {
|
||||
|
||||
---
|
||||
|
||||
### 8. RAG 引擎
|
||||
### 8. 🎉 RAG 引擎(✅ 2026-01-21 完整实现)
|
||||
|
||||
**路径:** `backend/src/common/rag/`
|
||||
**路径:** `backend/src/common/rag/` + `ekb_schema`
|
||||
|
||||
**功能:** 知识库检索(基于Dify)
|
||||
**功能:** 完整的 RAG 检索引擎(替代 Dify)
|
||||
|
||||
**使用方式:**
|
||||
**核心组件:**
|
||||
- ✅ EmbeddingService - 文本向量化(text-embedding-v4)
|
||||
- ✅ ChunkService - 智能文本分块
|
||||
- ✅ VectorSearchService - 向量检索 + 混合检索
|
||||
- ✅ RerankService - qwen3-rerank 重排序
|
||||
- ✅ QueryRewriter - DeepSeek V3 查询理解
|
||||
- ✅ DocumentIngestService - 文档入库完整流程
|
||||
|
||||
```typescript
|
||||
import { DifyClient } from '../../../common/rag/DifyClient';
|
||||
|
||||
const dify = new DifyClient(apiKey, baseURL);
|
||||
|
||||
// 检索知识库
|
||||
const results = await dify.retrievalSearch(query, {
|
||||
knowledgeBaseIds: ['kb1', 'kb2'],
|
||||
topK: 5,
|
||||
});
|
||||
|
||||
// 对话API(含RAG)
|
||||
const response = await dify.chatWithKnowledge(query, options);
|
||||
**技术栈:**
|
||||
```
|
||||
PostgreSQL + pgvector (向量存储)
|
||||
↓
|
||||
text-embedding-v4 1024维 (向量化)
|
||||
↓
|
||||
DeepSeek V3 (查询理解 + 中英翻译)
|
||||
↓
|
||||
向量检索 + 关键词检索 → RRF 融合
|
||||
↓
|
||||
qwen3-rerank (精排序)
|
||||
```
|
||||
|
||||
**Brain-Hand 架构使用方式:**
|
||||
|
||||
```typescript
|
||||
import { getVectorSearchService, QueryRewriter } from '@/common/rag';
|
||||
|
||||
// 业务层:查询理解(The Brain)
|
||||
const rewriter = new QueryRewriter();
|
||||
const result = await rewriter.rewrite('K药副作用');
|
||||
const queries = [result.original, ...result.rewritten];
|
||||
// ["K药副作用", "Keytruda AE", "Pembrolizumab side effects"]
|
||||
|
||||
// 引擎层:执行检索(The Hand)
|
||||
const searchService = getVectorSearchService(prisma);
|
||||
const results = await searchService.searchWithQueries(queries, {
|
||||
topK: 10,
|
||||
filter: { kbId: 'your-kb-id' }
|
||||
});
|
||||
|
||||
// Rerank 精排
|
||||
const final = await searchService.rerank(queries[0], results, { topK: 5 });
|
||||
```
|
||||
|
||||
**性能指标:**
|
||||
- 延迟:2.5秒/次(包含查询理解)
|
||||
- 成本:¥0.0025/次
|
||||
- 准确率:+20.5%(跨语言场景)
|
||||
|
||||
**已使用模块:**
|
||||
- ✅ PKB - 个人知识库
|
||||
- ✅ PKB - 个人知识库(双轨模式:pgvector/dify)
|
||||
- 🔜 AIA - AI智能问答
|
||||
- 🔜 ASL - AI智能文献
|
||||
|
||||
**详细文档:**
|
||||
- 📖 [RAG 引擎使用指南](./03-RAG引擎/05-RAG引擎使用指南.md) ⭐ **推荐阅读**
|
||||
- [知识库引擎架构设计](./03-RAG引擎/01-知识库引擎架构设计.md)
|
||||
- [数据模型设计](./03-RAG引擎/04-数据模型设计.md)
|
||||
- [分阶段实施方案](./03-RAG引擎/03-分阶段实施方案.md)
|
||||
|
||||
---
|
||||
|
||||
### 9. 文档处理引擎
|
||||
### 9. 🎉 文档处理引擎(✅ 2026-01-21 增强完成)
|
||||
|
||||
**路径:** `backend/src/common/document/`
|
||||
**路径:** `extraction_service/` (Python 微服务,端口 8000)
|
||||
|
||||
**功能:** 文档内容提取(PDF/Word/Excel/TXT)
|
||||
**功能:** 将各类文档统一转换为 **LLM 友好的 Markdown 格式**
|
||||
|
||||
**使用方式:**
|
||||
**核心 API:**
|
||||
```
|
||||
POST http://localhost:8000/api/document/to-markdown
|
||||
Content-Type: multipart/form-data
|
||||
|
||||
参数:file (PDF/Word/Excel/PPT等)
|
||||
返回:{ success: true, text: "Markdown内容", metadata: {...} }
|
||||
```
|
||||
|
||||
**技术升级:**
|
||||
- ✅ PDF 处理:pymupdf4llm(保留表格、公式、结构)
|
||||
- ✅ 统一入口:DocumentProcessor 自动检测文件类型
|
||||
- ✅ 零 OCR:电子版文档专用,扫描件返回友好提示
|
||||
- ✅ 与 RAG 引擎无缝集成
|
||||
|
||||
**支持格式:**
|
||||
| 格式 | 工具 | 输出质量 | 状态 |
|
||||
|------|------|----------|------|
|
||||
| PDF | pymupdf4llm | 表格保真 | ✅ |
|
||||
| Word | mammoth | 结构完整 | ✅ |
|
||||
| Excel/CSV | pandas | 上下文丰富 | ✅ |
|
||||
| PPT | python-pptx | 按页拆分 | ✅ |
|
||||
| 纯文本 | 直接读取 | 原样输出 | ✅ |
|
||||
|
||||
**使用方式(Node.js 调用):**
|
||||
|
||||
```typescript
|
||||
// 在 RAG 引擎入库时自动调用
|
||||
import { getDocumentIngestService } from '@/common/rag';
|
||||
|
||||
const ingestService = getDocumentIngestService(prisma);
|
||||
const result = await ingestService.ingestDocument(
|
||||
{ filename: 'paper.pdf', fileBuffer: pdfBuffer },
|
||||
{ kbId: 'your-kb-id' }
|
||||
);
|
||||
// DocumentIngestService 内部会调用 Python 微服务转换
|
||||
|
||||
# 转换任意文档为 Markdown
|
||||
md = processor.to_markdown("research_paper.pdf")
|
||||
md = processor.to_markdown("report.docx")
|
||||
md = processor.to_markdown("data.xlsx")
|
||||
```
|
||||
|
||||
**后端调用(TypeScript):**
|
||||
|
||||
```typescript
|
||||
import { ExtractionClient } from '../../../common/document/ExtractionClient';
|
||||
|
||||
const client = new ExtractionClient();
|
||||
|
||||
// 提取文本
|
||||
const text = await client.extractText(buffer, 'pdf');
|
||||
|
||||
// 提取结构化数据(Excel)
|
||||
const data = await client.extractStructured(buffer, 'xlsx');
|
||||
// 提取文本(返回 Markdown)
|
||||
const markdown = await client.extractText(buffer, 'pdf');
|
||||
```
|
||||
|
||||
**已使用模块:**
|
||||
- ✅ PKB - 文档上传
|
||||
- ✅ DC Tool B - 病历文本提取
|
||||
- 🔜 AIA - 附件处理(待完成)
|
||||
- ✅ ASL - 文献 PDF 提取
|
||||
- 🔜 AIA - 附件处理
|
||||
|
||||
**详细文档:**
|
||||
- 📖 [文档处理引擎使用指南](./02-文档处理引擎/02-文档处理引擎使用指南.md) ⭐ **推荐阅读**
|
||||
- [文档处理引擎设计方案](./02-文档处理引擎/01-文档处理引擎设计方案.md)
|
||||
|
||||
---
|
||||
|
||||
@@ -786,7 +870,7 @@ const response = await llm.chat(messages);
|
||||
| **LLM网关** | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
|
||||
| **存储服务** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| **异步任务** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
|
||||
| **RAG引擎** | 🔜 | ✅ | ❌ | ❌ | ❌ | ❌ |
|
||||
| **知识库引擎** | 🔜 | ✅ | ❌ | ❌ | 🔜 | 🔜 |
|
||||
| **文档处理** | 🔜 | ✅ | ✅ | ❌ | ❌ | ❌ |
|
||||
| **认证授权** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| **Prompt管理** | 🔜 | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
|
||||
Reference in New Issue
Block a user