feat(dc): Complete Phase 1 - Portal workbench page development

Summary:
- Implement DC module Portal page with 3 tool cards
- Create ToolCard component with decorative background and hover animations
- Implement TaskList component with table layout and progress bars
- Implement AssetLibrary component with tab switching and file cards
- Complete database verification (4 tables confirmed)
- Complete backend API verification (6 endpoints ready)
- Optimize UI to match prototype design (V2.html)

Frontend Components (~715 lines):
- components/ToolCard.tsx - Tool cards with animations
- components/TaskList.tsx - Recent tasks table view
- components/AssetLibrary.tsx - Data asset library with tabs
- hooks/useRecentTasks.ts - Task state management
- hooks/useAssets.ts - Asset state management
- pages/Portal.tsx - Main portal page
- types/portal.ts - TypeScript type definitions

Backend Verification:
- Backend API: 1495 lines code verified
- Database: dc_schema with 4 tables verified
- API endpoints: 6 endpoints tested (templates API works)

Documentation:
- Database verification report
- Backend API test report
- Phase 1 completion summary
- UI optimization report
- Development task checklist
- Development plan for Tool B

Status: Phase 1 completed (100%), ready for browser testing
Next: Phase 2 - Tool B Step 1 and 2 development
2025-12-02 21:53:24 +08:00
parent f240aa9236
commit d4d33528c7
83 changed files with 21863 additions and 1601 deletions
--- a/docs/03-业务模块/DC-数据清洗整理/02-技术设计/技术设计文档：工具
+++ b/docs/03-业务模块/DC-数据清洗整理/02-技术设计/技术设计文档：工具
@@ -0,0 +1,188 @@
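+# **Technical Design Document: Tool B \- Medical Record Structuring Robot (The AI Structurer)**
+
+| Document type | Technical Design Document (TDD) |
+| :---- | :---- |
+| **Related PRD** | **PRD\_工具B\_病历结构化机器人\_V2.md** |
+| **Version** | **V2.0** (architecture upgrade: dual-model cross-validation) |
+| **Status** | Draft |
+| **Core goal** | Build a high-trust medical text structuring engine that addresses AI hallucination through **concurrent extraction by two models (DeepSeek & Qwen)** and **automatic cross-validation**. |
+
+## **1\. Overall Architecture Design (Architecture Overview)**
+
+The architecture is upgraded from a "single linear pipeline" to a **"Y-shaped concurrent pipeline"**. Incoming data is dispatched to two different LLMs for parallel processing, the results converge in a "conflict detection engine" for comparison, and the output goes to a grid for manual verification.
+
+### **1.1 System Architecture Diagram**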
+
+graph TD  
+    Client\[React frontend (Grid & Drawer UI)\]  
+      
+    subgraph API\_Server \[Fastify API service\]  
+        JobAPI\[Job and template API\]  
+        VerifyAPI\[Panoramic grid API\]  
+    end  
+      
+    subgraph Async\_Cluster \[Background worker cluster\]  
+        BullMQ\[BullMQ job queue\]  
+        Orchestrator\[Task orchestrator\]  
+        PII\_Engine\[PII de-identification engine\]  
+          
+        subgraph Dual\_LLM\_Engine \[Double-blind extraction engine\]  
+            ClientA\[DeepSeek client\]  
+            ClientB\[Qwen client\]  
+        end  
+          
+        CrossValidator\[Cross-validation / conflict detector\]  
+    end  
+      
+    subgraph Storage \[Data storage\]  
+        PG\[(PostgreSQL \- business data)\]  
+        VectorDB\[(pgvector \- optional, for semantic comparison)\]  
+        Redis\[(Redis \- queue)\]  
+    end
+
+    Client \--1. Upload & health check--\> JobAPI  
+    JobAPI \--2. Create concurrent jobs--\> BullMQ  
+    BullMQ \--3. Consume--\> Orchestrator  
+    Orchestrator \--4. De-identify--\> PII\_Engine  
+    PII\_Engine \--5. Parallel calls--\> ClientA & ClientB  
+    ClientA & ClientB \--6. Return JSON--\> CrossValidator  
+    CrossValidator \--7. Compute consistency--\> PG  
+    Client \--8. Fetch grid data--\> VerifyAPI  
+    VerifyAPI \--9. Manual adjudication--\> PG
+
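+## **2\. Technology Selection (Tech Stack)**
+
+| Layer | Component | Rationale |
+| :---- | :---- | :---- |
+| **Backend framework** | **Fastify 5.x** | High-performance asynchronous I/O, well suited to highly concurrent model calls. |
+| **Model access** | **LangChain.js** | Wraps the DeepSeek and Qwen calls behind one interface, which makes switching models easy. |
+| **Job queue** | **BullMQ** | Core component. V2 needs either the Flow feature or manual orchestration to implement the "wait until both models have returned" logic. |
+| **Conflict detection** | **Lodash (basic) \+ Dice Coefficient (advanced)** | Used to compare field differences between two JSON objects. Text similarity can use a simple Dice coefficient or Levenshtein distance; a heavyweight vector store is not needed for now. |
+| **Database** | **PostgreSQL 15** | Stores both models' results in JSONB format. |
+| **Frontend interaction** | **React \+ TanStack Table** | V2 switches to a panoramic grid; with large data volumes it needs TanStack Table (headless) combined with virtual scrolling. |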
+
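+## **3\. Core Flow Design (Core Logic)**
+
+### **3.1 Smart Health Check (Health Check Logic)**
+
+* **Trigger:** The moment the user selects the "text column" in the frontend.  
+* **Execution logic:**  
+  1. The backend reads the first 100 rows of that column (not the full dataset).  
+  2. Compute the statistics:  
+     * emptyRate: empty values / number of rows read.  
+     * avgLength: average character count of the non-empty rows.  
+  3. **Blocking rule:** if emptyRate \> 0.8 or avgLength \< 10, return status: 'BAD'.  
+  4. **Token estimate:** totalRows \* avgLength \* 1.5 (rough estimate).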
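
The health-check steps above can be sketched as a pure function. This is a minimal sketch under stated assumptions: the `healthCheck` name, the `HealthCheckResult` shape, and the `'GOOD'` status value are hypothetical (the design only names `'BAD'`), and the column is assumed to be already loaded as an array.

```typescript
// Hypothetical result shape; field names follow the metrics named in the design.
interface HealthCheckResult {
  status: "GOOD" | "BAD"; // "GOOD" is an assumed counterpart to the documented 'BAD'
  emptyRate: number;
  avgLength: number;
  estimatedTokens: number;
}

const SAMPLE_SIZE = 100; // only the first 100 rows are inspected

function healthCheck(
  column: Array<string | null | undefined>,
  totalRows: number
): HealthCheckResult {
  const sample = column.slice(0, SAMPLE_SIZE);
  const nonEmpty = sample
    .map((v) => (v ?? "").trim())
    .filter((v) => v.length > 0);

  // An empty sample is treated as fully empty so it gets blocked.
  const emptyRate =
    sample.length === 0 ? 1 : (sample.length - nonEmpty.length) / sample.length;
  const avgLength =
    nonEmpty.length === 0
      ? 0
      : nonEmpty.reduce((sum, v) => sum + v.length, 0) / nonEmpty.length;

  // Blocking rule from the design: too many blanks or text that is too short.
  const status = emptyRate > 0.8 || avgLength < 10 ? "BAD" : "GOOD";
  // Rough estimate: total rows * average length * 1.5.
  const estimatedTokens = Math.ceil(totalRows * avgLength * 1.5);

  return { status, emptyRate, avgLength, estimatedTokens };
}
```

Because the estimate extrapolates from a 100-row sample, it is only as representative as those first rows.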
+
+### **3.2 Double-Blind Extraction and Cross-Validation (Double-Blind & Validation)**
+
+This is the heart of V2.
+
+#### **A. Prompt Engineering**
+
+To make comparison straightforward, both models must be forced to output **exactly the same JSON structure**.
+
+* **System Prompt:** "You are a medical structural extraction assistant..."  
+* **Constraint:** "Output strictly in JSON format. Keys must be: \['tumor\_size', 'lymph\_node', ...\]."  
+* **Temperature:** set to 0 for maximum determinism.
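
One way to keep both models on an identical key set is to build the request from a single list of keys. The sketch below is illustrative only: `buildExtractionRequest` and `ExtractionRequest` are hypothetical names, the key list is passed in by the caller, and the system prompt reuses only the wording quoted above.

```typescript
// Hypothetical provider-agnostic request shape shared by both models.
interface ExtractionRequest {
  system: string;
  user: string;
  temperature: number;
}

function buildExtractionRequest(keys: string[], reportText: string): ExtractionRequest {
  // Render the keys exactly once so both models receive the same constraint.
  const keyList = keys.map((k) => `'${k}'`).join(", ");
  const system = [
    "You are a medical structural extraction assistant.",
    `Output strictly in JSON format. Keys must be: [${keyList}].`,
  ].join("\n");
  // Temperature 0 for maximum determinism, as specified in the design.
  return { system, user: reportText, temperature: 0 };
}
```

The same `ExtractionRequest` would then be handed to both the DeepSeek and the Qwen client, so any difference in output comes from the models rather than from the prompts.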
+
+#### **B. Cross-Validation Algorithm (The Judge)**
+
+Once Model A (DeepSeek) and Model B (Qwen) have both returned results, compare them:
+
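+// normalize, isNumber, parse and similarity are helpers assumed to be defined elsewhere  
+function validate(jsonA, jsonB) {  
+  const conflicts \= \[\];  
+  // Compare the union of keys so a field present only in jsonB is not skipped  
+  const keys \= new Set(\[...Object.keys(jsonA), ...Object.keys(jsonB)\]);  
+    
+  for (const key of keys) {  
+    const valA \= normalize(jsonA\[key\]); // Normalize: strip whitespace, lowercase, full-width to half-width  
+    const valB \= normalize(jsonB\[key\]);  
+      
+    // 1\. Exact match  
+    if (valA \=== valB) continue;  
+      
+    // 2\. Normalized numeric match (e.g. "3cm" vs "3.0cm")  
+    if (isNumber(valA) && isNumber(valB) && parse(valA) \=== parse(valB)) continue;  
+      
+    // 3\. (Optional) semantic similarity match  
+    // if (similarity(valA, valB) \> 0.95) continue;  
+      
+    conflicts.push(key);  
+  }  
+    
+  return conflicts.length \=== 0 ? 'CLEAN' : 'CONFLICT';  
+}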
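
The comparison above relies on helpers that the document does not define. The following is one possible sketch of them, not the project's implementation: `parseMeasurement`, `diceCoefficient`, and `valuesAgree` are hypothetical names, the similarity measure is a character-bigram Dice coefficient (one of the options named in the tech stack table), and the numeric check assumes values of the form number plus unit.

```typescript
// Fold full-width characters (NFKC), lower-case, and strip all whitespace.
function normalize(value: unknown): string {
  return String(value ?? "").normalize("NFKC").toLowerCase().replace(/\s+/g, "");
}

// Split "3.0cm" into a number and a unit; null when there is no leading number.
function parseMeasurement(value: string): { amount: number; unit: string } | null {
  const match = /^(-?\d+(?:\.\d+)?)(.*)$/.exec(value);
  return match ? { amount: Number(match[1]), unit: match[2] ?? "" } : null;
}

// Sorensen-Dice coefficient over character bigrams, in the range [0, 1].
function diceCoefficient(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const counts = new Map<string, number>();
  for (let i = 0; i < a.length - 1; i++) {
    const gram = a.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  let overlap = 0;
  for (let i = 0; i < b.length - 1; i++) {
    const gram = b.slice(i, i + 2);
    const remaining = counts.get(gram) ?? 0;
    if (remaining > 0) {
      counts.set(gram, remaining - 1); // each bigram in a can be matched once
      overlap++;
    }
  }
  return (2 * overlap) / (a.length - 1 + (b.length - 1));
}

// Exact match, then numeric match, then similarity above the threshold.
function valuesAgree(rawA: unknown, rawB: unknown, threshold = 0.95): boolean {
  const a = normalize(rawA);
  const b = normalize(rawB);
  if (a === b) return true;
  const numA = parseMeasurement(a);
  const numB = parseMeasurement(b);
  if (numA && numB) return numA.amount === numB.amount && numA.unit === numB.unit;
  return diceCoefficient(a, b) > threshold;
}
```

Two numeric values with different amounts are reported as disagreeing without falling back to similarity, since "3cm" and "8cm" are textually close but clinically different.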
+
+## **4\. Database Design (Database Schema)**
+
+V2 needs to store both AI results as well as the user's adjudication result.
+
+### **Prisma Schema Update**
+
+// Job table  
+model ExtractionJob {  
+  id          String   @id @default(uuid())  
+  // ...other fields  
+  diseaseType String   // Disease type (lung cancer)  
+  reportType  String   // Report type (pathology)  
+  targetFields Json    // Target field definitions \[{name: "Tumor size", desc: "..."}\]  
+}
+
+// Per-row record table  
+model ExtractionItem {  
+  id          String   @id @default(uuid())  
+  jobId       String  
+  originalText String  @db.Text  
+    
+  // V2 core fields  
+  resultA     Json?    // DeepSeek result { "size": "3cm" }  
+  resultB     Json?    // Qwen result { "size": "3.0 cm" }  
+    
+  // Conflict detection result  
+  status      ItemStatus // PENDING, CLEAN, CONFLICT, RESOLVED  
+  conflictFields String\[\] // \["size"\] records which fields are in conflict  
+    
+  // Final accepted result (written after user adjudication, or automatically when the two results agree)  
+  finalResult Json?      
+}
+
+## **5\. API Design (API Endpoints)**
+
+### **5.1 Templates and Configuration**
+
+* GET /api/templates: returns the list of preset disease and report templates.  
+* POST /api/jobs: creates a job; the payload must include diseaseType and reportType so the backend can assemble the prompt.
+
+### **5.2 Grid Verification**
+
+* GET /api/jobs/:id/rows: returns the verification data, paginated.  
+  * **Response:** originalText, resultA, resultB, conflictFields.  
+* POST /api/items/:id/resolve: adjudicates a single row.  
+  * **Payload:** { field: "tumor\_size", chosenValue: "3cm" }.  
+  * **Logic:** update finalResult; if every conflicting field in the row has been resolved, set status to RESOLVED.
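
The resolve logic can be illustrated as a pure state transition, independent of Fastify or Prisma. This is a sketch under an assumption the design does not state: a conflicting field counts as resolved once it appears in `finalResult`, so `finalResult` is assumed to hold only the agreed fields until the user adjudicates. `applyResolution` and `ExtractionItemState` are hypothetical names.

```typescript
type ItemStatus = "PENDING" | "CLEAN" | "CONFLICT" | "RESOLVED";

// Hypothetical in-memory view of the ExtractionItem columns used here.
interface ExtractionItemState {
  status: ItemStatus;
  conflictFields: string[];
  finalResult: Record<string, unknown>;
}

function applyResolution(
  item: ExtractionItemState,
  field: string,
  chosenValue: unknown
): ExtractionItemState {
  if (item.conflictFields.indexOf(field) === -1) {
    throw new Error(`Field "${field}" is not in conflict`);
  }
  // Write the user's choice without mutating the input item.
  const finalResult = { ...item.finalResult, [field]: chosenValue };
  // Assumption: a conflict is resolved once finalResult contains the field.
  const allResolved = item.conflictFields.every((f) =>
    Object.prototype.hasOwnProperty.call(finalResult, f)
  );
  return { ...item, finalResult, status: allResolved ? "RESOLVED" : item.status };
}
```

In a real handler the returned state would be persisted in one update so that `finalResult` and `status` cannot drift apart.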
+
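+## **6\. Detailed Frontend Design (Frontend)**
+
+### **6.1 Panoramic Verification Grid (Verification Grid)**
+
+* **Component choice:** **TanStack Table** (logic layer) \+ a **UI component library** (rendering layer) is still the recommendation.  
+* **Conflict cell rendering:**  
+  * When conflictFields.includes(column.id), render the cell in **comparison mode**.  
+  * Show two small buttons: \[DS: 3cm\] and \[QW: 3.0cm\].  
+  * Clicking either button calls the resolve API, and the frontend applies an optimistic update to show the selected state.
+
+### **6.2 Source Text Sidebar (Context Drawer)**
+
+* **Trigger:** clicking an empty area of a table row or the "View source text" icon.  
+* **Function:** displays originalText.  
+* **Highlighting:** a simple implementation uses String.indexOf to find the current field's value and highlight it in yellow.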
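
The indexOf-based highlighting can be sketched as a function that splits the source text into segments for the drawer to render. `splitForHighlight` and `TextSegment` are hypothetical names, and only the first occurrence is highlighted.

```typescript
interface TextSegment {
  text: string;
  highlighted: boolean;
}

function splitForHighlight(originalText: string, value: string): TextSegment[] {
  // An empty value would match at index 0, so treat it as "no match".
  const start = value.length > 0 ? originalText.indexOf(value) : -1;
  if (start === -1) return [{ text: originalText, highlighted: false }];
  const end = start + value.length;
  return [
    { text: originalText.slice(0, start), highlighted: false },
    { text: originalText.slice(start, end), highlighted: true },
    { text: originalText.slice(end), highlighted: false },
  ].filter((segment) => segment.text.length > 0); // drop empty edges
}
```

An exact substring search finds nothing when a model has reformatted the value (for example "3.0 cm" in the text versus "3.0cm" in the result); in that case the sketch returns the text unhighlighted.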
+
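+## **7\. Risk Control and Performance Optimization**
+
+| Potential risk | Mitigation |
+| :---- | :---- |
+| **Double token cost** | 1\. Default to the DeepSeek (very low cost) \+ Qwen (low cost) combination. 2\. Strictly filter out invalid data at the "health check" stage. |
+| **Slow processing** | The two models must be called **concurrently (Promise.all)**, not serially. Total latency is determined by the slower of the two models. |
+| **Models ignoring the output format** | Add few-shot examples to the prompt that show the JSON format explicitly. If JSON parsing fails, retry automatically once. |
+| **Frontend grid lag** | If there are more than 1,000 rows, enable virtual scrolling. |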
+
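
The concurrency and retry mitigations in the table can be combined in a small sketch. The model clients are abstracted as plain async functions (`ModelCall` is a hypothetical type, not a LangChain.js API), and the retry wrapper retries on any failure, including a JSON parse error, up to one extra attempt.

```typescript
// Hypothetical abstraction: a model call takes text and returns raw model output.
type ModelCall = (text: string) => Promise<string>;

async function callWithJsonRetry(
  call: ModelCall,
  text: string,
  retries = 1
): Promise<Record<string, unknown>> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      // JSON.parse throws on malformed output, which triggers the retry.
      return JSON.parse(await call(text)) as Record<string, unknown>;
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function extractWithBothModels(
  callA: ModelCall,
  callB: ModelCall,
  text: string
): Promise<{ resultA: Record<string, unknown>; resultB: Record<string, unknown> }> {
  // Both calls start immediately; total latency is that of the slower model.
  const [resultA, resultB] = await Promise.all([
    callWithJsonRetry(callA, text),
    callWithJsonRetry(callB, text),
  ]);
  return { resultA, resultB };
}
```

`Promise.all` rejects as soon as either model exhausts its retries; if one model's result should be kept when the other fails, `Promise.allSettled` would be the alternative.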