feat(dc): Complete Phase 1 - Portal workbench page development
Summary: - Implement DC module Portal page with 3 tool cards - Create ToolCard component with decorative background and hover animations - Implement TaskList component with table layout and progress bars - Implement AssetLibrary component with tab switching and file cards - Complete database verification (4 tables confirmed) - Complete backend API verification (6 endpoints ready) - Optimize UI to match prototype design (V2.html) Frontend Components (~715 lines): - components/ToolCard.tsx - Tool cards with animations - components/TaskList.tsx - Recent tasks table view - components/AssetLibrary.tsx - Data asset library with tabs - hooks/useRecentTasks.ts - Task state management - hooks/useAssets.ts - Asset state management - pages/Portal.tsx - Main portal page - types/portal.ts - TypeScript type definitions Backend Verification: - Backend API: 1495 lines code verified - Database: dc_schema with 4 tables verified - API endpoints: 6 endpoints tested (templates API works) Documentation: - Database verification report - Backend API test report - Phase 1 completion summary - UI optimization report - Development task checklist - Development plan for Tool B Status: Phase 1 completed (100%), ready for browser testing Next: Phase 2 - Tool B Step 1 and 2 development
This commit is contained in:
512
docs/03-业务模块/DC-数据清洗整理/02-技术设计/数据库设计文档-DC模块(完整版).md
Normal file
512
docs/03-业务模块/DC-数据清洗整理/02-技术设计/数据库设计文档-DC模块(完整版).md
Normal file
@@ -0,0 +1,512 @@
|
||||
# 数据库设计文档 - 工具B(病历结构化机器人)
|
||||
|
||||
> **模块**: DC数据清洗整理 - 工具B
|
||||
> **版本**: V1.0
|
||||
> **Schema**: `dc_schema`
|
||||
> **更新日期**: 2025-12-02
|
||||
> **状态**: ✅ 已验证(数据库表已创建并初始化)
|
||||
|
||||
---
|
||||
|
||||
## 📋 目录
|
||||
|
||||
- [一、概述](#一概述)
|
||||
- [二、Schema设计原则](#二schema设计原则)
|
||||
- [三、数据表设计](#三数据表设计)
|
||||
- [四、索引设计](#四索引设计)
|
||||
- [五、外键约束](#五外键约束)
|
||||
- [六、数据生命周期](#六数据生命周期)
|
||||
|
||||
---
|
||||
|
||||
## 一、概述
|
||||
|
||||
### 1.1 设计目标
|
||||
|
||||
工具B的数据库设计旨在支持:
|
||||
- ✅ 双大模型交叉验证的文本结构化
|
||||
- ✅ 大规模异步任务处理(1000+条记录)
|
||||
- ✅ 冲突检测与人工裁决
|
||||
- ✅ 预设模板管理与复用
|
||||
- ✅ 健康检查缓存优化
|
||||
|
||||
### 1.2 表关系总览
|
||||
|
||||
```
|
||||
dc_schema ✅ 已创建
|
||||
├── dc_health_checks [健康检查缓存] ✅ 已创建(2条记录)
|
||||
├── dc_templates [预设模板] ✅ 已创建(3条预设模板)
|
||||
├── dc_extraction_tasks [提取任务] ✅ 已创建(1条记录)
|
||||
│ └── dc_extraction_items [提取记录] (1:N) ✅ 已创建(4条记录)
|
||||
```
|
||||
|
||||
**✅ 验证状态(2025-12-02)**:
|
||||
- 所有表已创建并包含测试数据
|
||||
- 3个预设模板已初始化:肺癌病理报告、糖尿病入院记录、高血压门诊病历
|
||||
- 验证脚本:`backend/scripts/check-dc-tables.mjs`
|
||||
|
||||
### 1.3 技术栈
|
||||
|
||||
- **数据库**: PostgreSQL 15
|
||||
- **ORM**: Prisma 6
|
||||
- **Schema隔离**: `dc_schema`(独立命名空间)
|
||||
- **JSON字段**: 使用JSONB类型(高性能查询)
|
||||
|
||||
---
|
||||
|
||||
## 二、Schema设计原则
|
||||
|
||||
### 2.1 Schema隔离
|
||||
|
||||
```sql
|
||||
-- 所有表使用dc_schema命名空间
|
||||
CREATE TABLE "dc_schema"."dc_health_checks" (...);
|
||||
CREATE TABLE "dc_schema"."dc_extraction_tasks" (...);
|
||||
```
|
||||
|
||||
**优势**:
|
||||
- ✅ 与其他模块完全隔离(platform_schema、asl_schema等)
|
||||
- ✅ 数据安全,避免误操作
|
||||
- ✅ 便于模块化管理和迁移
|
||||
|
||||
### 2.2 命名规范
|
||||
|
||||
| 规则 | 说明 | 示例 |
|
||||
|------|------|------|
|
||||
| **表名前缀** | `dc_` | `dc_extraction_tasks` |
|
||||
| **字段命名** | snake_case | `user_id`, `source_file_key` |
|
||||
| **时间戳** | 统一后缀 | `created_at`, `started_at` |
|
||||
| **外键** | 实体名_id | `task_id`, `user_id` |
|
||||
|
||||
### 2.3 JSONB字段使用场景
|
||||
|
||||
| 字段 | 类型 | 原因 |
|
||||
|------|------|------|
|
||||
| `target_fields` | JSONB | 灵活的字段配置 |
|
||||
| `result_a/result_b` | JSONB | 动态提取结果 |
|
||||
| `final_result` | JSONB | 最终裁决结果 |
|
||||
|
||||
---
|
||||
|
||||
## 三、数据表设计
|
||||
|
||||
### 3.1 dc_health_checks(健康检查缓存表)
|
||||
|
||||
**用途**: 缓存健康检查结果,避免重复计算
|
||||
|
||||
```sql
|
||||
CREATE TABLE "dc_schema"."dc_health_checks" (
|
||||
"id" TEXT NOT NULL PRIMARY KEY,
|
||||
"user_id" TEXT NOT NULL,
|
||||
"file_name" TEXT NOT NULL,
|
||||
"column_name" TEXT NOT NULL,
|
||||
|
||||
-- 统计指标
|
||||
"empty_rate" DOUBLE PRECISION NOT NULL,
|
||||
"avg_length" DOUBLE PRECISION NOT NULL,
|
||||
"total_rows" INTEGER NOT NULL,
|
||||
"estimated_tokens" INTEGER NOT NULL,
|
||||
|
||||
-- 检查结果
|
||||
"status" TEXT NOT NULL, -- 'good' | 'bad'
|
||||
"message" TEXT NOT NULL,
|
||||
|
||||
"created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
```
|
||||
|
||||
**字段说明**:
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `id` | TEXT | UUID主键 | `uuid()` |
|
||||
| `user_id` | TEXT | 用户ID | `user-123` |
|
||||
| `file_name` | TEXT | 文件名 | `患者数据.xlsx` |
|
||||
| `column_name` | TEXT | 检查的列名 | `病历文本` |
|
||||
| `empty_rate` | DOUBLE | 空值率 (0-1) | 0.15 (15%) |
|
||||
| `avg_length` | DOUBLE | 平均文本长度 | 256.8 |
|
||||
| `total_rows` | INT | 总行数 | 500 |
|
||||
| `estimated_tokens` | INT | 预估Token数 | 150000 |
|
||||
| `status` | TEXT | 健康状态 | `good` / `bad` |
|
||||
| `message` | TEXT | 提示信息 | `健康度良好` |
|
||||
|
||||
**索引**:
|
||||
```sql
|
||||
CREATE INDEX "dc_health_checks_user_id_file_name_idx"
|
||||
ON "dc_schema"."dc_health_checks"("user_id", "file_name");
|
||||
```
|
||||
|
||||
**业务规则**:
|
||||
- 空值率 > 80% → `status = 'bad'`
|
||||
- 平均长度 < 10 → `status = 'bad'`
|
||||
- 缓存有效期:24小时(应用层实现)
|
||||
|
||||
---
|
||||
|
||||
### 3.2 dc_templates(预设模板表)
|
||||
|
||||
**用途**: 存储疾病类型的预设提取模板
|
||||
|
||||
```sql
|
||||
CREATE TABLE "dc_schema"."dc_templates" (
|
||||
"id" TEXT NOT NULL PRIMARY KEY,
|
||||
"disease_type" TEXT NOT NULL, -- 'lung_cancer', 'diabetes', 'hypertension'
|
||||
"report_type" TEXT NOT NULL, -- 'pathology', 'admission', 'outpatient'
|
||||
"display_name" TEXT NOT NULL, -- '肺癌病理报告'
|
||||
"fields" JSONB NOT NULL, -- [{name, desc, width}]
|
||||
"prompt_template" TEXT NOT NULL,
|
||||
|
||||
"created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
|
||||
"updated_at" TIMESTAMP(3) NOT NULL,
|
||||
|
||||
CONSTRAINT "dc_templates_disease_type_report_type_key"
|
||||
UNIQUE ("disease_type", "report_type")
|
||||
);
|
||||
```
|
||||
|
||||
**字段说明**:
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `disease_type` | TEXT | 疾病类型 | `lung_cancer` |
|
||||
| `report_type` | TEXT | 报告类型 | `pathology` |
|
||||
| `display_name` | TEXT | 显示名称 | `肺癌病理报告` |
|
||||
| `fields` | JSONB | 提取字段配置 | 见下方示例 |
|
||||
| `prompt_template` | TEXT | Prompt模板 | `请从以下病理报告中提取...` |
|
||||
|
||||
**fields字段结构**:
|
||||
```json
|
||||
[
|
||||
{
|
||||
"name": "病理类型",
|
||||
"desc": "如:浸润性腺癌、鳞状细胞癌",
|
||||
"width": "w-40"
|
||||
},
|
||||
{
|
||||
"name": "分化程度",
|
||||
"desc": "高/中/低分化",
|
||||
"width": "w-32"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
**唯一约束**:
|
||||
```sql
|
||||
UNIQUE ("disease_type", "report_type")
|
||||
```
|
||||
同一疾病+报告类型组合只能有一个模板
|
||||
|
||||
---
|
||||
|
||||
### 3.3 dc_extraction_tasks(提取任务表)
|
||||
|
||||
**用途**: 管理批量提取任务,追踪进度和成本
|
||||
|
||||
```sql
|
||||
CREATE TABLE "dc_schema"."dc_extraction_tasks" (
|
||||
"id" TEXT NOT NULL PRIMARY KEY,
|
||||
"user_id" TEXT NOT NULL,
|
||||
"project_name" TEXT NOT NULL,
|
||||
"source_file_key" TEXT NOT NULL, -- Storage中的路径
|
||||
"text_column" TEXT NOT NULL,
|
||||
|
||||
-- 模板配置
|
||||
"disease_type" TEXT NOT NULL,
|
||||
"report_type" TEXT NOT NULL,
|
||||
"target_fields" JSONB NOT NULL,
|
||||
|
||||
-- 双模型配置
|
||||
"model_a" TEXT NOT NULL DEFAULT 'deepseek-v3',
|
||||
"model_b" TEXT NOT NULL DEFAULT 'qwen3-72b',
|
||||
|
||||
-- 任务状态
|
||||
"status" TEXT NOT NULL DEFAULT 'pending',
|
||||
"total_count" INTEGER NOT NULL DEFAULT 0,
|
||||
"processed_count" INTEGER NOT NULL DEFAULT 0,
|
||||
"clean_count" INTEGER NOT NULL DEFAULT 0,
|
||||
"conflict_count" INTEGER NOT NULL DEFAULT 0,
|
||||
"failed_count" INTEGER NOT NULL DEFAULT 0,
|
||||
|
||||
-- 成本统计
|
||||
"total_tokens" INTEGER NOT NULL DEFAULT 0,
|
||||
"total_cost" DOUBLE PRECISION NOT NULL DEFAULT 0,
|
||||
|
||||
-- 错误信息
|
||||
"error" TEXT,
|
||||
|
||||
-- 时间戳
|
||||
"created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
|
||||
"started_at" TIMESTAMP(3),
|
||||
"completed_at" TIMESTAMP(3)
|
||||
);
|
||||
```
|
||||
|
||||
**字段说明**:
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `source_file_key` | TEXT | Storage路径 | `uploads/user123/data.xlsx` |
|
||||
| `text_column` | TEXT | 文本列名 | `病历文本` |
|
||||
| `target_fields` | JSONB | 提取字段 | `[{name, desc}]` |
|
||||
| `status` | TEXT | 任务状态 | `pending/processing/completed/failed` |
|
||||
| `total_count` | INT | 总记录数 | 500 |
|
||||
| `processed_count` | INT | 已处理数 | 250 |
|
||||
| `clean_count` | INT | 一致数 | 200 |
|
||||
| `conflict_count` | INT | 冲突数 | 45 |
|
||||
| `failed_count` | INT | 失败数 | 5 |
|
||||
| `total_tokens` | INT | 总Token数 | 150000 |
|
||||
| `total_cost` | DOUBLE | 总成本($) | 0.27 |
|
||||
|
||||
**状态流转**:
|
||||
```
|
||||
pending → processing → completed
|
||||
→ failed
|
||||
```
|
||||
|
||||
**索引**:
|
||||
```sql
|
||||
CREATE INDEX "dc_extraction_tasks_user_id_status_idx"
|
||||
ON "dc_schema"."dc_extraction_tasks"("user_id", "status");
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.4 dc_extraction_items(提取记录表)
|
||||
|
||||
**用途**: 存储每条记录的双模型提取结果和冲突状态
|
||||
|
||||
```sql
|
||||
CREATE TABLE "dc_schema"."dc_extraction_items" (
|
||||
"id" TEXT NOT NULL PRIMARY KEY,
|
||||
"task_id" TEXT NOT NULL,
|
||||
|
||||
-- 原始数据
|
||||
"row_index" INTEGER NOT NULL,
|
||||
"original_text" TEXT NOT NULL,
|
||||
|
||||
-- 双模型结果
|
||||
"result_a" JSONB,
|
||||
"result_b" JSONB,
|
||||
|
||||
-- 冲突检测
|
||||
"status" TEXT NOT NULL DEFAULT 'pending',
|
||||
"conflict_fields" TEXT[] DEFAULT ARRAY[]::TEXT[],
|
||||
|
||||
-- 最终结果
|
||||
"final_result" JSONB,
|
||||
|
||||
-- Token统计
|
||||
"tokens_a" INTEGER NOT NULL DEFAULT 0,
|
||||
"tokens_b" INTEGER NOT NULL DEFAULT 0,
|
||||
|
||||
-- 错误信息
|
||||
"error" TEXT,
|
||||
|
||||
"created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
|
||||
"resolved_at" TIMESTAMP(3),
|
||||
|
||||
CONSTRAINT "dc_extraction_items_task_id_fkey"
|
||||
FOREIGN KEY ("task_id")
|
||||
REFERENCES "dc_schema"."dc_extraction_tasks"("id")
|
||||
ON DELETE CASCADE
|
||||
);
|
||||
```
|
||||
|
||||
**字段说明**:
|
||||
|
||||
| 字段 | 类型 | 说明 | 示例 |
|
||||
|------|------|------|------|
|
||||
| `row_index` | INT | Excel行号 | 5 |
|
||||
| `original_text` | TEXT | 原始病历文本 | `患者,男,45岁...` |
|
||||
| `result_a` | JSONB | DeepSeek结果 | `{"肿瘤大小": "3cm"}` |
|
||||
| `result_b` | JSONB | Qwen结果 | `{"肿瘤大小": "3.0cm"}` |
|
||||
| `status` | TEXT | 处理状态 | `clean/conflict/resolved/failed` |
|
||||
| `conflict_fields` | TEXT[] | 冲突字段列表 | `["肿瘤大小"]` |
|
||||
| `final_result` | JSONB | 最终裁决结果 | `{"肿瘤大小": "3cm"}` |
|
||||
|
||||
**result_a/result_b结构示例**:
|
||||
```json
|
||||
{
|
||||
"病理类型": "浸润性腺癌",
|
||||
"分化程度": "中分化",
|
||||
"肿瘤大小": "3cm",
|
||||
"淋巴结转移": "无"
|
||||
}
|
||||
```
|
||||
|
||||
**状态说明**:
|
||||
- `pending`: 等待处理
|
||||
- `clean`: 双模型结果一致
|
||||
- `conflict`: 存在冲突,需人工裁决
|
||||
- `resolved`: 冲突已解决
|
||||
- `failed`: 提取失败
|
||||
|
||||
**索引**:
|
||||
```sql
|
||||
CREATE INDEX "dc_extraction_items_task_id_status_idx"
|
||||
ON "dc_schema"."dc_extraction_items"("task_id", "status");
|
||||
```
|
||||
|
||||
**外键约束**:
|
||||
- `ON DELETE CASCADE`: 删除任务时自动删除所有记录
|
||||
|
||||
---
|
||||
|
||||
## 四、索引设计
|
||||
|
||||
### 4.1 索引列表
|
||||
|
||||
| 表名 | 索引字段 | 类型 | 用途 |
|
||||
|------|---------|------|------|
|
||||
| `dc_health_checks` | `(user_id, file_name)` | 复合 | 查询用户的历史检查 |
|
||||
| `dc_templates` | `(disease_type, report_type)` | 唯一 | 防止重复模板 |
|
||||
| `dc_extraction_tasks` | `(user_id, status)` | 复合 | 查询用户的任务列表 |
|
||||
| `dc_extraction_items` | `(task_id, status)` | 复合 | 查询任务的记录列表 |
|
||||
|
||||
### 4.2 性能考虑
|
||||
|
||||
**查询优化**:
|
||||
```sql
|
||||
-- 高效查询:利用索引
|
||||
SELECT * FROM dc_extraction_tasks
|
||||
WHERE user_id = 'user123' AND status = 'processing';
|
||||
|
||||
-- 高效查询:利用索引
|
||||
SELECT * FROM dc_extraction_items
|
||||
WHERE task_id = 'task456' AND status = 'conflict';
|
||||
```
|
||||
|
||||
**避免全表扫描**:
|
||||
- ✅ 始终在WHERE子句中包含索引字段
|
||||
- ✅ 使用`status`字段过滤可以显著减少扫描行数
|
||||
|
||||
---
|
||||
|
||||
## 五、外键约束
|
||||
|
||||
### 5.1 级联删除
|
||||
|
||||
```sql
|
||||
ALTER TABLE "dc_schema"."dc_extraction_items"
|
||||
ADD CONSTRAINT "dc_extraction_items_task_id_fkey"
|
||||
FOREIGN KEY ("task_id")
|
||||
REFERENCES "dc_schema"."dc_extraction_tasks"("id")
|
||||
ON DELETE CASCADE;
|
||||
```
|
||||
|
||||
**行为**:
|
||||
- 删除任务 → 自动删除所有关联的提取记录
|
||||
- 保证数据一致性
|
||||
|
||||
### 5.2 无外键的表
|
||||
|
||||
- `dc_health_checks`: 独立表,无外键
|
||||
- `dc_templates`: 独立表,无外键
|
||||
- `dc_extraction_tasks`: 无外键(user_id仅为标识,不强制关联)
|
||||
|
||||
**原因**:
|
||||
- ✅ 减少跨Schema依赖
|
||||
- ✅ 提高模块独立性
|
||||
- ✅ 简化迁移和回滚
|
||||
|
||||
---
|
||||
|
||||
## 六、数据生命周期
|
||||
|
||||
### 6.1 数据保留策略
|
||||
|
||||
| 表名 | 保留时间 | 清理策略 |
|
||||
|------|---------|---------|
|
||||
| `dc_health_checks` | 7天 | 定期清理旧记录 |
|
||||
| `dc_templates` | 永久 | 手动管理 |
|
||||
| `dc_extraction_tasks` | 90天 | 归档后删除 |
|
||||
| `dc_extraction_items` | 90天 | 随任务删除 |
|
||||
|
||||
### 6.2 归档策略
|
||||
|
||||
**大任务归档** (> 1000条记录):
|
||||
1. 任务完成后,导出结果到CSV/Excel
|
||||
2. 上传到Storage(永久保存)
|
||||
3. 删除数据库记录(释放空间)
|
||||
|
||||
### 6.3 清理脚本(示例)
|
||||
|
||||
```typescript
|
||||
// 清理7天前的健康检查记录
|
||||
await prisma.dCHealthCheck.deleteMany({
|
||||
where: {
|
||||
createdAt: {
|
||||
lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
// 归档90天前的已完成任务
|
||||
const oldTasks = await prisma.dCExtractionTask.findMany({
|
||||
where: {
|
||||
status: 'completed',
|
||||
completedAt: {
|
||||
lt: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
|
||||
}
|
||||
},
|
||||
include: { items: true }
|
||||
});
|
||||
|
||||
// 导出后删除
|
||||
for (const task of oldTasks) {
|
||||
await exportTaskToStorage(task);
|
||||
await prisma.dCExtractionTask.delete({ where: { id: task.id } });
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 七、数据安全
|
||||
|
||||
### 7.1 PII保护
|
||||
|
||||
**敏感字段**:
|
||||
- `original_text`: 可能包含患者姓名、身份证号
|
||||
- `result_a/result_b/final_result`: 可能包含结构化的敏感信息
|
||||
|
||||
**保护措施**:
|
||||
- ✅ 发送LLM前自动脱敏(PIIMaskUtil)
|
||||
- ✅ 数据库加密(PostgreSQL SSL)
|
||||
- ✅ 定期清理历史数据
|
||||
|
||||
### 7.2 用户隔离
|
||||
|
||||
**机制**:
|
||||
- 所有表包含`user_id`字段
|
||||
- 应用层强制过滤:`WHERE user_id = currentUserId`
|
||||
- 永不跨用户查询
|
||||
|
||||
---
|
||||
|
||||
## 八、附录
|
||||
|
||||
### 8.1 完整Schema DDL
|
||||
|
||||
完整的Schema创建脚本位于:
|
||||
```
|
||||
backend/prisma/migrations/20251127_add_dc_tool_b_tables/migration.sql
|
||||
```
|
||||
|
||||
### 8.2 Prisma模型定义
|
||||
|
||||
完整的Prisma模型定义位于:
|
||||
```
|
||||
backend/prisma/schema.prisma
|
||||
```
|
||||
搜索 `dc_schema` 查看所有模型。
|
||||
|
||||
### 8.3 变更历史
|
||||
|
||||
| 版本 | 日期 | 变更内容 |
|
||||
|------|------|---------|
|
||||
| V1.0 | 2025-11-27 | 初始版本,4个表 |
|
||||
|
||||
---
|
||||
|
||||
**文档结束** ✅
|
||||
|
||||
Reference in New Issue
Block a user