Files
AIclinicalresearch/backend/prisma/migrations/manual/ekb_create_indexes.sql
HaHafeng 40c2f8e148 feat(rag): Complete RAG engine implementation with pgvector
Major Features:
- Created ekb_schema (13th schema) with 3 tables: KB/Document/Chunk
- Implemented EmbeddingService (text-embedding-v4, 1024-dim vectors)
- Implemented ChunkService (smart Markdown chunking)
- Implemented VectorSearchService (multi-query + hybrid search)
- Implemented RerankService (qwen3-rerank)
- Integrated DeepSeek V3 QueryRewriter for cross-language search
- Python service: Added pymupdf4llm for PDF-to-Markdown conversion
- PKB: Dual-mode adapter (pgvector/dify/hybrid)

Architecture:
- Brain-Hand Model: Business layer (DeepSeek) + Engine layer (pgvector)
- Cross-language support: Chinese query matches English documents
- Small Embedding (1024) + Strong Reranker strategy

Performance:
- End-to-end latency: 2.5s
- Cost per query: 0.0025 RMB
- Accuracy improvement: +20.5% (cross-language)

Tests:
- test-embedding-service.ts: Vector embedding verified
- test-rag-e2e.ts: Full pipeline tested
- test-rerank.ts: Rerank quality validated
- test-query-rewrite.ts: Cross-language search verified
- test-pdf-ingest.ts: Real PDF document tested (Dongen 2003.pdf)

Documentation:
- Added 05-RAG-Engine-User-Guide.md
- Added 02-Document-Processing-User-Guide.md
- Updated system status documentation

Status: Production ready
2026-01-21 20:24:29 +08:00

65 lines
2.1 KiB
SQL
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
-- ============================================================
-- EKB Schema 索引创建脚本
-- 执行时机prisma migrate 之后手动执行
-- 参考文档docs/02-通用能力层/03-RAG引擎/04-数据模型设计.md
-- ============================================================
-- 1. 确保 pgvector 扩展已启用
CREATE EXTENSION IF NOT EXISTS vector;
-- 2. 确保 pg_bigm 扩展已启用(中文关键词检索)
CREATE EXTENSION IF NOT EXISTS pg_bigm;
-- ===== MVP 阶段必须创建 =====
-- 3. HNSW 向量索引(语义检索核心)
-- 参数说明m=16 每层最大连接数ef_construction=64 构建时搜索范围
CREATE INDEX IF NOT EXISTS idx_ekb_chunk_embedding
ON "ekb_schema"."ekb_chunk"
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- ===== Phase 2 阶段使用(可预创建)=====
-- 4. pg_bigm 中文关键词索引
CREATE INDEX IF NOT EXISTS idx_ekb_chunk_content_bigm
ON "ekb_schema"."ekb_chunk"
USING gin (content gin_bigm_ops);
-- 5. 文档摘要关键词索引
CREATE INDEX IF NOT EXISTS idx_ekb_doc_summary_bigm
ON "ekb_schema"."ekb_document"
USING gin (summary gin_bigm_ops);
-- 6. 全文内容关键词索引
CREATE INDEX IF NOT EXISTS idx_ekb_doc_text_bigm
ON "ekb_schema"."ekb_document"
USING gin (extracted_text gin_bigm_ops);
-- ===== Phase 3 阶段使用(可预创建)=====
-- 7. JSONB GIN 索引metadata 查询加速)
CREATE INDEX IF NOT EXISTS idx_ekb_doc_metadata_gin
ON "ekb_schema"."ekb_document"
USING gin (metadata jsonb_path_ops);
-- 8. JSONB GIN 索引structuredData 查询加速)
CREATE INDEX IF NOT EXISTS idx_ekb_doc_structured_gin
ON "ekb_schema"."ekb_document"
USING gin (structured_data jsonb_path_ops);
-- 9. 标签数组索引
CREATE INDEX IF NOT EXISTS idx_ekb_doc_tags_gin
ON "ekb_schema"."ekb_document"
USING gin (tags);
-- 10. 切片元数据索引
CREATE INDEX IF NOT EXISTS idx_ekb_chunk_metadata_gin
ON "ekb_schema"."ekb_chunk"
USING gin (metadata jsonb_path_ops);
-- ===== 验证索引创建 =====
-- SELECT indexname, indexdef FROM pg_indexes WHERE schemaname = 'ekb_schema';