Core Components:
- PDFStorageService with Dify/OSS adapters
- LLM12FieldsService with Nougat-first + dual-model + 3-layer JSON parsing
- PromptBuilder for dynamic prompt assembly
- MedicalLogicValidator with 5 rules + fault tolerance
- EvidenceChainValidator for citation integrity
- ConflictDetectionService for dual-model comparison

Prompt Engineering:
- System Prompt (6601 chars, Section-Aware strategy)
- User Prompt template (PICOS context injection)
- JSON Schema (12 fields constraints)
- Cochrane standards (not loaded in MVP)

Key Innovations:
- 3-layer JSON parsing (JSON.parse + json-repair + code block extraction)
- Promise.allSettled for dual-model fault tolerance
- safeGetFieldValue for robust field extraction
- Mixed CN/EN token calculation

Integration Tests:
- integration-test.ts (full test)
- quick-test.ts (quick test)
- cached-result-test.ts (fault tolerance test)

Documentation Updates:
- Development record (Day 2-3 summary)
- Quality assurance strategy (full-text screening)
- Development plan (progress update)
- Module status (v1.1 update)
- Technical debt (10 new items)

Test Results:
- JSON parsing success rate: 100%
- Medical logic validation: 5/5 passed
- Dual-model parallel processing: OK
- Cost per PDF: CNY 0.10

Files: 238 changed, 14383 insertions(+), 32 deletions(-)
Docs: docs/03-业务模块/ASL-AI智能文献/05-开发记录/2025-11-22_Day2-Day3_LLM服务与验证系统开发.md

Data ETL Engine
Capability positioning: General capability layer
Reuse rate: 29% (2 dependent modules)
Priority: P2
Status: ⏳ Not yet implemented
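📋 Capability Overview
The data ETL engine is responsible for:
- JOINs across multiple Excel tables
- Data cleaning
- Data transformation
- Data validation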
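Of the four responsibilities above, data validation is the only one not detailed later in this document. The following is a minimal sketch of what rule-based validation could look like, assuming rows are plain dictionaries. The `RULES` mapping, the column names, and the `validate` function are hypothetical illustrations, not part of the planned engine.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical rule set: column name -> predicate a valid value must satisfy.
RULES: Dict[str, Callable[[Any], bool]] = {
    "patient_id": lambda v: isinstance(v, str) and v != "",
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}


def validate(rows: List[Row], rules: Dict[str, Callable[[Any], bool]]) -> List[str]:
    """Return one message per failed check; an empty list means every row passed."""
    errors: List[str] = []
    for index, row in enumerate(rows):
        for column, is_valid in rules.items():
            if column not in row:
                errors.append(f"row {index}: missing column '{column}'")
            elif not is_valid(row[column]):
                errors.append(f"row {index}: invalid {column}={row[column]!r}")
    return errors
```

Collecting messages instead of raising on the first failure lets a caller report every problem in an uploaded file in one pass.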
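📊 Dependent Modules
Two modules depend on this engine (29% reuse rate):
- DC - Data Cleaning and Organization (core dependency)
- SSA - Smart Statistical Analysis (data preprocessing)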
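💡 Core Features
1. Excel multi-table processing
- Read multiple Excel files
- Automatic JOIN operations
- GROUP BY aggregation
2. Data cleaning
- Missing-value handling
- Duplicate handling
- Outlier detection
3. Data transformation
- Type conversion
- Format standardization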
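The cleaning and transformation steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the engine's implementation: rows are plain dictionaries, outliers are flagged with Tukey's IQR rule (one common choice that the document does not prescribe), and all function names are hypothetical.

```python
from statistics import quantiles
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def fill_missing(rows: List[Row], column: str, default: Any) -> List[Row]:
    """Replace None or empty-string values in one column with a default."""
    return [
        {**r, column: default} if r.get(column) in (None, "") else dict(r)
        for r in rows
    ]


def drop_duplicates(rows: List[Row], keys: List[str]) -> List[Row]:
    """Keep the first row for each distinct combination of key columns."""
    seen = set()
    unique: List[Row] = []
    for r in rows:
        key = tuple(r.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def to_float(text: str) -> Optional[float]:
    """Convert text such as ' 1,234.5 ' to a float; return None if it is not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None
```

Each helper returns a new list and leaves its input untouched, so the steps can be chained in any order without one step silently changing the data another step sees.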
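🏗️ Technical Approach
Cloud version (preferred)
# Based on Polars (very high performance)
from typing import BinaryIO, Dict, List
from polars import DataFrame

File = BinaryIO  # assumption: an uploaded Excel file handle

class ETLEngine:
    def read_excel(self, files: List[File]) -> List[DataFrame]: ...
    def join(self, dfs: List[DataFrame], keys: List[str]) -> DataFrame: ...
    def clean(self, df: DataFrame, rules: Dict) -> DataFrame: ...
    def export(self, df: DataFrame, format: str) -> bytes: ...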
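To make the intended semantics of `join` and `export` concrete, here is a dependency-free sketch. It deliberately does not use Polars: tables are lists of dictionaries, the join is a plain inner hash join, and CSV is the only export format shown. The names `inner_join`, `join_all`, and `export_csv` are hypothetical stand-ins for the interface above, not its implementation.

```python
import csv
import io
from functools import reduce
from typing import Any, Dict, List

Row = Dict[str, Any]
Table = List[Row]


def inner_join(left: Table, right: Table, keys: List[str]) -> Table:
    """Hash join: index the right table by key, then probe it with each left row."""
    index: Dict[tuple, List[Row]] = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined: Table = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            # On non-key name clashes, the right table's value wins.
            joined.append({**row, **match})
    return joined


def join_all(tables: List[Table], keys: List[str]) -> Table:
    """Fold a non-empty list of tables into one by joining them left to right."""
    return reduce(lambda acc, t: inner_join(acc, t, keys), tables)


def export_csv(table: Table) -> bytes:
    """Serialize a table to UTF-8 CSV bytes, taking the header from the first row."""
    if not table:
        return b""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(table[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue().encode("utf-8")
```

Folding the list left to right mirrors the `join(dfs, keys)` signature, where any number of tables share the same key columns. A real implementation would also need to decide between inner, left, and outer joins, which the signature leaves open.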
Standalone version (compatibility)
# Based on SQLite (memory-friendly)
# Read in chunks; the database engine handles the JOINs
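The chunked-read approach can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumptions: the rows come from in-memory tuples instead of a real Excel reader, the table and column names are made up, and the helpers `chunks` and `load` are hypothetical.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunks(rows: Iterable[Tuple], size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most `size` rows so the full input is never held at once."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def load(conn: sqlite3.Connection, table: str, columns: List[str],
         rows: Iterable[Tuple], chunk_size: int = 1000) -> None:
    """Create `table` and insert the rows one chunk at a time."""
    # Table and column names are interpolated into SQL, so they must be trusted.
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    for batch in chunks(rows, chunk_size):
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", batch)
    conn.commit()


# ":memory:" keeps the demo self-contained; a file path would keep the data on disk.
conn = sqlite3.connect(":memory:")
load(conn, "patients", ["id", "site"], [(1, "A"), (2, "A"), (3, "B")], chunk_size=2)
load(conn, "labs", ["patient_id", "value"],
     [(1, 5.0), (1, 7.0), (2, 6.0), (3, 9.0)], chunk_size=2)

# The database engine performs the JOIN and the GROUP BY aggregation.
result = conn.execute(
    """
    SELECT p.site, COUNT(*) AS n, AVG(l.value) AS mean_value
    FROM patients AS p
    JOIN labs AS l ON l.patient_id = p.id
    GROUP BY p.site
    ORDER BY p.site
    """
).fetchall()
```

Because only one chunk of input is materialized in Python at a time, peak memory depends on `chunk_size` and not on the size of the source files, provided the database itself lives on disk.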
🔗 Related Documents
Last updated: 2025-11-06
Maintainer: Technical Architect