
HaHafeng d4d33528c7 feat(dc): Complete Phase 1 - Portal workbench page development

Summary:
- Implement DC module Portal page with 3 tool cards
- Create ToolCard component with decorative background and hover animations
- Implement TaskList component with table layout and progress bars
- Implement AssetLibrary component with tab switching and file cards
- Complete database verification (4 tables confirmed)
- Complete backend API verification (6 endpoints ready)
- Optimize UI to match prototype design (V2.html)

Frontend Components (~715 lines):
- components/ToolCard.tsx - Tool cards with animations
- components/TaskList.tsx - Recent tasks table view
- components/AssetLibrary.tsx - Data asset library with tabs
- hooks/useRecentTasks.ts - Task state management
- hooks/useAssets.ts - Asset state management
- pages/Portal.tsx - Main portal page
- types/portal.ts - TypeScript type definitions

Backend Verification:
- Backend API: 1495 lines code verified
- Database: dc_schema with 4 tables verified
- API endpoints: 6 endpoints tested (templates API works)

Documentation:
- Database verification report
- Backend API test report
- Phase 1 completion summary
- UI optimization report
- Development task checklist
- Development plan for Tool B

Status: Phase 1 completed (100%), ready for browser testing
Next: Phase 2 - Tool B Step 1 and 2 development

2025-12-02 21:53:24 +08:00

7.8 KiB


Technical Design Document: Tool B - Medical Record Structuring Robot (The AI Structurer)

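Document type	Technical Design Document (TDD)
Related PRD	PRD_工具B_病历结构化机器人_V2.md
Version	V2.0 (architecture upgrade: dual-model cross-validation)
Status	Draft
Core goal	Build a high-trust engine for structuring medical text that runs two models (DeepSeek and Qwen) concurrently and cross-validates their output automatically, to address the AI hallucination problem.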

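1. Architecture Overview

The system architecture is upgraded from a "single linear pipeline" to a "Y-shaped concurrent pipeline". Incoming data is dispatched to two different LLMs for parallel processing, the two results converge at a "conflict detection engine" for comparison, and the output finally goes to a manual verification grid.

1.1 System Architecture Diagram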

graph TD
    Client["React Frontend (Grid & Drawer UI)"]

    subgraph API_Server [Fastify API Service]
        JobAPI[Job and Template API]
        VerifyAPI[Panoramic Grid API]
    end

    subgraph Async_Cluster [Background Worker Cluster]
        BullMQ[BullMQ Job Queue]
        Orchestrator[Job Orchestrator]
        PII_Engine[PII De-identification Engine]

        subgraph Dual_LLM_Engine [Double-Blind Extraction Engine]
            ClientA[DeepSeek Client]
            ClientB[Qwen Client]
        end

        CrossValidator["Cross-Validation / Conflict Detector"]
    end

    subgraph Storage [Data Storage]
        PG[("PostgreSQL - business data")]
        VectorDB[("pgvector - optional, for semantic comparison")]
        Redis[("Redis - queue")]
    end

    Client -- "1: Upload & health check" --> JobAPI
    JobAPI -- "2: Create concurrent jobs" --> BullMQ
    BullMQ -- "3: Consume" --> Orchestrator
    Orchestrator -- "4: De-identify" --> PII_Engine
    PII_Engine -- "5: Parallel calls" --> ClientA & ClientB
    ClientA & ClientB -- "6: Return JSON" --> CrossValidator
    CrossValidator -- "7: Compute consistency" --> PG
    Client -- "8: Fetch grid data" --> VerifyAPI
    VerifyAPI -- "9: Manual adjudication" --> PG

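2. Tech Stack

Layer	Component	Rationale
Backend framework	Fastify 5.x	High-performance asynchronous I/O, well suited to highly concurrent model calls.
Model access	LangChain.js	Wraps DeepSeek and Qwen calls behind one interface, making it easy to switch models.
Job queue	BullMQ	Core component. V2 needs the Flow feature or manual orchestration to implement the "wait for both models to return" logic.
Conflict detection	Lodash (basic) + Dice Coefficient (advanced)	Compares field-level differences between two JSON objects. Text similarity can use a simple Dice coefficient or Levenshtein distance; a heavyweight vector store is not needed for now.
Database	PostgreSQL 15	Stores both models' results as JSONB.
Frontend interaction	React + TanStack Table	V2 switches to a panoramic grid; with large data volumes, TanStack Table (headless) must be paired with virtual scrolling.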
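
For the text-similarity option named in the conflict-detection row, a bigram-based Dice coefficient is one possibility. The sketch below is only an illustration under assumptions: the names `bigrams` and `diceCoefficient` are hypothetical, and strings are compared by UTF-16 code unit, which is adequate for typical ASCII and Chinese text but not for characters outside the Basic Multilingual Plane.

```typescript
// Count every pair of adjacent characters (bigram) in the text.
function bigrams(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < text.length - 1; i++) {
    const gram = text.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|).
// Returns a value between 0 (nothing shared) and 1 (identical).
function diceCoefficient(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0; // no bigrams to compare
  const gramsA = bigrams(a);
  const gramsB = bigrams(b);
  let overlap = 0;
  gramsA.forEach((countA, gram) => {
    overlap += Math.min(countA, gramsB.get(gram) ?? 0);
  });
  return (2 * overlap) / (a.length - 1 + (b.length - 1));
}
```

A threshold on this score is a tuning decision: short values such as measurements change the score sharply with a single character, so exact and numeric matching should probably run first.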

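3. Core Logic

3.1 Smart Health Check (Health Check Logic)

Trigger: The moment the user selects the "text column" in the frontend.
Execution logic:
1. The backend reads the first 100 rows of that column (not the full dataset).
2. Compute statistics:
  - emptyRate: empty values / total rows.
  - avgLength: average character count of non-empty rows.
3. Interception policy: if emptyRate > 0.8 or avgLength < 10, return status: 'BAD'.
4. Token estimate: totalRows * avgLength * 1.5 (rough estimate).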
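
The four steps above could be sketched as one pure function. This is a minimal sketch under assumptions: the name `healthCheck`, the `HealthCheckResult` shape, the `'GOOD'` status value, and treating whitespace-only cells as empty are all hypothetical. Only `emptyRate`, `avgLength`, the two thresholds, `status: 'BAD'`, and the token formula come from the text. It also assumes `emptyRate` is measured over the 100-row sample while `totalRows` is the size of the full column.

```typescript
// Hypothetical result shape; only emptyRate, avgLength and status 'BAD' are named in the design.
interface HealthCheckResult {
  status: "GOOD" | "BAD";
  emptyRate: number;
  avgLength: number;
  estimatedTokens: number;
}

const SAMPLE_SIZE = 100; // read only the first 100 rows

function healthCheck(
  column: Array<string | null | undefined>,
  totalRows: number = column.length,
): HealthCheckResult {
  const sample = column.slice(0, SAMPLE_SIZE);
  // Assumption: null, undefined and whitespace-only cells all count as empty.
  const nonEmpty = sample
    .map((cell) => (cell ?? "").trim())
    .filter((cell) => cell.length > 0);

  const emptyRate =
    sample.length === 0 ? 1 : (sample.length - nonEmpty.length) / sample.length;
  const totalChars = nonEmpty.reduce((sum, cell) => sum + cell.length, 0);
  const avgLength = nonEmpty.length === 0 ? 0 : totalChars / nonEmpty.length;

  // Interception policy from the design: mostly empty or very short text is rejected.
  const status = emptyRate > 0.8 || avgLength < 10 ? "BAD" : "GOOD";
  // Rough token estimate: totalRows * avgLength * 1.5.
  const estimatedTokens = Math.ceil(totalRows * avgLength * 1.5);

  return { status, emptyRate, avgLength, estimatedTokens };
}
```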

3.2 Double-Blind Extraction and Cross-Validation (Double-Blind & Validation)

This is the heart of V2.

A. Prompt Engineering

To make comparison easy, both models must be forced to output exactly the same JSON structure.

System Prompt: "You are a medical structural extraction assistant..."
Constraint: "Output strictly in JSON format. Keys must be: ['tumor_size', 'lymph_node', ...]."
Temperature: set to 0 for maximum determinism.
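
One way to keep both models on the same structure is to build the request from a single key list and to check each reply against that list. The sketch below assumes a generic chat-message shape and is not tied to the LangChain.js API; `buildExtractionRequest`, `hasExactKeys`, and the `ExtractionRequest` type are hypothetical names. The system prompt is passed in as a parameter because its full wording is not given here.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical request shape shared by both model clients.
interface ExtractionRequest {
  messages: ChatMessage[];
  temperature: number;
}

function buildExtractionRequest(
  systemPrompt: string,
  keys: string[],
  recordText: string,
): ExtractionRequest {
  // The same key list goes to both models, so the two JSON outputs stay comparable.
  const keyList = keys.map((k) => `'${k}'`).join(", ");
  const constraint = `Output strictly in JSON format. Keys must be: [${keyList}].`;
  return {
    messages: [
      { role: "system", content: `${systemPrompt}\n${constraint}` },
      { role: "user", content: recordText },
    ],
    temperature: 0, // maximum determinism, as specified above
  };
}

// True only when the value is a plain object whose keys are exactly the expected ones.
function hasExactKeys(value: unknown, keys: string[]): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const actual = Object.keys(value).sort();
  const expected = keys.slice().sort();
  return (
    actual.length === expected.length &&
    actual.every((k, i) => k === expected[i])
  );
}
```

Checking the keys before cross-validation separates "the model ignored the format" from "the models disagree", which are handled differently.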

B. Cross-Validation Algorithm (The Judge)

Once Model A (DeepSeek) and Model B (Qwen) have returned their results, run the comparison:

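// normalize, isNumber and parse are helpers assumed to be defined elsewhere.
function validate(jsonA, jsonB) {
  const conflicts = [];
  // Compare the union of keys so a field missing from either model is flagged.
  const keys = new Set([...Object.keys(jsonA), ...Object.keys(jsonB)]);

  for (const key of keys) {
    const valA = normalize(jsonA[key]); // Normalize: strip whitespace, lowercase, convert to half-width
    const valB = normalize(jsonB[key]);

    // 1. Exact match
    if (valA === valB) continue;

    // 2. Numeric normalization match (e.g. "3cm" vs "3.0cm")
    if (isNumber(valA) && isNumber(valB) && parse(valA) === parse(valB)) continue;

    // 3. (Optional) Semantic similarity match
    // if (similarity(valA, valB) > 0.95) continue;

    conflicts.push(key);
  }

  // Return the conflicting keys too, since ExtractionItem.conflictFields stores them.
  const status = conflicts.length === 0 ? 'CLEAN' : 'CONFLICT';
  return { status, conflictFields: conflicts };
}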
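
The comparison relies on three helpers, `normalize`, `isNumber`, and `parse`, that are not defined in this document. The following is one possible sketch of them, not the project's implementation. It assumes values are scalars, that a "number" may carry a trailing unit such as `cm`, and that `parse` returns a canonical string (number plus unit) so that `3cm` and `3mm` do not compare equal; the `MEASUREMENT` pattern is a hypothetical name.

```typescript
// An optional sign, a decimal number, and an optional trailing unit, e.g. "3.0cm".
const MEASUREMENT = /^([+-]?\d+(?:\.\d+)?)([^\d\s]*)$/;

// Normalize: convert full-width characters to half-width (NFKC),
// remove all whitespace, and lowercase.
function normalize(value: unknown): string {
  if (value === null || value === undefined) return "";
  return String(value).normalize("NFKC").replace(/\s+/g, "").toLowerCase();
}

// True when the normalized value looks like a number with an optional unit.
function isNumber(value: string): boolean {
  return MEASUREMENT.test(value);
}

// Canonical form of a measurement: "3.0cm" and "3cm" both become "3cm".
// Values that are not measurements are returned unchanged.
function parse(value: string): string {
  const match = MEASUREMENT.exec(value);
  if (match === null) return value;
  return `${Number(match[1])}${match[2]}`;
}
```

This sketch does not convert between units, so `30mm` and `3cm` would still be reported as a conflict and left for manual adjudication.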

4. Database Schema

V2 needs to store both AI results as well as the user's adjudication result.

Prisma Schema Update

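// Job table
model ExtractionJob {
  id           String @id @default(uuid())
  // ...other fields
  diseaseType  String // Disease type (lung cancer)
  reportType   String // Report type (pathology)
  targetFields Json   // Target field definitions [{name: "tumor size", desc: "..."}]
}

// Per-row record table
model ExtractionItem {
  id           String @id @default(uuid())
  jobId        String
  originalText String @db.Text

  // V2 core fields
  resultA Json? // DeepSeek result { "size": "3cm" }
  resultB Json? // Qwen result { "size": "3.0 cm" }

  // Conflict detection result
  status         ItemStatus // PENDING, CLEAN, CONFLICT, RESOLVED
  conflictFields String[]   // ["size"] records which fields are in conflict

  // Final adopted result (written after user adjudication, or automatically when the models agree)
  finalResult Json?
}

// Status values referenced by ExtractionItem.status
enum ItemStatus {
  PENDING
  CLEAN
  CONFLICT
  RESOLVED
}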
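
To illustrate how the conflict detector's output might map onto these columns, here is a small sketch. It rests on assumptions that the schema does not state: the name `toItemUpdate` is hypothetical, Model A's value is used when the models agree, and for a conflicted row `finalResult` is pre-filled with the agreed fields only.

```typescript
type ItemStatus = "PENDING" | "CLEAN" | "CONFLICT" | "RESOLVED";
type FieldValues = Record<string, string | null>;

// The columns written back to ExtractionItem after cross-validation.
interface ItemUpdate {
  status: ItemStatus;
  conflictFields: string[];
  finalResult: FieldValues;
}

function toItemUpdate(resultA: FieldValues, conflictFields: string[]): ItemUpdate {
  // No conflicts: write the final result automatically (assumption: take Model A's values).
  if (conflictFields.length === 0) {
    return { status: "CLEAN", conflictFields: [], finalResult: { ...resultA } };
  }
  // Conflicts: keep only the agreed fields and leave the rest for the user.
  const agreed: FieldValues = {};
  for (const key of Object.keys(resultA)) {
    const value = resultA[key];
    if (value !== undefined && conflictFields.indexOf(key) === -1) {
      agreed[key] = value;
    }
  }
  return { status: "CONFLICT", conflictFields, finalResult: agreed };
}
```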

5. API Endpoints

5.1 Templates and Configuration

GET /api/templates: Fetch the list of preset disease and report templates.
POST /api/jobs: Create a job. The payload must include diseaseType and reportType so the backend can assemble the prompt.

5.2 Grid Verification

GET /api/jobs/:id/rows: Fetch verification data page by page.
- Response: returns originalText, resultA, resultB, conflictFields.
POST /api/items/:id/resolve: Adjudicate a single row.
- Payload: { field: "tumor_size", chosenValue: "3cm" }.
- Logic: update finalResult; once every conflicting field in the row has been resolved, set status to RESOLVED.
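
The resolve logic can be sketched as a pure function that takes the stored row and the payload and returns the updated row. This is a sketch under assumptions: `applyResolution`, `Item`, and `ResolvePayload` are hypothetical names, a field counts as resolved once it has an entry in `finalResult`, and persistence (for example through Prisma) is left out.

```typescript
type ItemStatus = "PENDING" | "CLEAN" | "CONFLICT" | "RESOLVED";

interface Item {
  status: ItemStatus;
  conflictFields: string[];
  finalResult: Record<string, string | null> | null;
}

interface ResolvePayload {
  field: string;
  chosenValue: string;
}

function applyResolution(item: Item, payload: ResolvePayload): Item {
  // Reject fields that were never in conflict for this row.
  if (item.conflictFields.indexOf(payload.field) === -1) {
    throw new Error(`Field "${payload.field}" is not in conflict`);
  }
  // Record the user's choice without mutating the stored row.
  const finalResult: Record<string, string | null> = {
    ...(item.finalResult ?? {}),
    [payload.field]: payload.chosenValue,
  };
  // The row is resolved once every conflicting field has a chosen value.
  const allResolved = item.conflictFields.every((field) =>
    Object.prototype.hasOwnProperty.call(finalResult, field),
  );
  return { ...item, finalResult, status: allResolved ? "RESOLVED" : item.status };
}
```

Keeping the rule in a pure function makes it easy to unit test and to reuse on the frontend for the optimistic update.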

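6. Frontend Detailed Design (Frontend)

6.1 Panoramic Verification Grid (Verification Grid)

Component choice: TanStack Table (logic layer) plus a UI component library (rendering layer) is still recommended.
Conflict cell rendering:
- When conflictFields.includes(column.id) is true, the cell renders in comparison mode.
- It shows two small buttons: [DS: 3cm] and [QW: 3.0cm].
- Clicking either button calls the resolve API, and the frontend applies an optimistic update to show it as selected.

6.2 Source Text Sidebar (Context Drawer)

Trigger: Click a blank area of a table row or the "view original text" icon.
Function: Display originalText.
Highlighting: A simple implementation uses String.indexOf to find the current field's value and highlight it in yellow.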
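
A minimal sketch of the `String.indexOf` approach: split the original text into segments and mark the first occurrence of the field value, leaving the rendering (for example a yellow `<mark>`) to the UI layer. The names `Segment` and `highlightFirstMatch` are hypothetical.

```typescript
interface Segment {
  text: string;
  highlighted: boolean;
}

// Split the text around the first exact occurrence of `value`.
function highlightFirstMatch(originalText: string, value: string): Segment[] {
  const start = value.length === 0 ? -1 : originalText.indexOf(value);
  if (start === -1) {
    // No exact match: show the text without highlighting.
    return [{ text: originalText, highlighted: false }];
  }
  const end = start + value.length;
  const segments: Segment[] = [
    { text: originalText.slice(0, start), highlighted: false },
    { text: originalText.slice(start, end), highlighted: true },
    { text: originalText.slice(end), highlighted: false },
  ];
  // Drop empty segments when the match sits at the start or end of the text.
  return segments.filter((segment) => segment.text.length > 0);
}
```

Because the match is exact, a model value that was reformatted (say `3.0cm` where the record reads `3 cm`) will not be found; in that case the drawer simply shows the text without a highlight.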

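7. Risk Control and Performance Optimization

Potential risk	Mitigation
Doubled token cost	1. Default to the DeepSeek (very low cost) + Qwen (low cost) combination. 2. Strictly intercept invalid data at the "health check" stage.
Slow processing	The two models must be called concurrently (Promise.all), not serially. Overall latency is determined by the slower model.
Models ignoring the output format	Add few-shot examples to the prompt that show the JSON format explicitly. If JSON parsing fails, retry automatically once.
Frontend grid lag	If there are more than 1000 rows, enable virtual scrolling.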
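
The concurrency and retry mitigations in the table could be combined roughly as follows. This is a sketch under assumptions: `ModelCall` stands in for whatever client wraps each model and is assumed to return the raw reply text, and `parseJsonObject`, `callWithRetry`, and `extractBoth` are hypothetical names. Only JSON parse failures are retried; a rejected call (for example a network error) propagates and makes `Promise.all` reject.

```typescript
// Hypothetical client signature: takes the de-identified text, returns the raw model reply.
type ModelCall = (text: string) => Promise<string>;
type JsonObject = Record<string, unknown>;

// Parse a reply and accept it only if it is a JSON object (not an array or scalar).
function parseJsonObject(raw: string): JsonObject | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return parsed as JsonObject;
    }
    return null;
  } catch {
    return null;
  }
}

// Call the model; if the reply is not valid JSON, retry up to maxRetries times.
async function callWithRetry(
  call: ModelCall,
  text: string,
  maxRetries: number = 1,
): Promise<JsonObject | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const parsed = parseJsonObject(await call(text));
    if (parsed !== null) return parsed;
  }
  return null; // still unparseable; maps to a null resultA / resultB
}

// Run both models concurrently; total latency is that of the slower one.
async function extractBoth(callA: ModelCall, callB: ModelCall, text: string) {
  const [resultA, resultB] = await Promise.all([
    callWithRetry(callA, text),
    callWithRetry(callB, text),
  ]);
  return { resultA, resultB };
}
```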
