feat(admin): Add user management and upgrade to module permission system

Features - User Management (Phase 4.1):
- Database: Add user_modules table for fine-grained module permissions
- Database: Add 4 user permissions (view/create/edit/delete) to role_permissions
- Backend: UserService (780 lines) - CRUD with tenant isolation
- Backend: UserController + UserRoutes (648 lines) - 13 API endpoints
- Backend: Batch import users from Excel
- Frontend: UserListPage (412 lines) - list/filter/search/pagination
- Frontend: UserFormPage (341 lines) - create/edit with module config
- Frontend: UserDetailPage (393 lines) - details/tenant/module management
- Frontend: 3 modal components (592 lines) - import/assign/configure
- API: GET/POST/PUT/DELETE /api/admin/users/* endpoints

Architecture Upgrade - Module Permission System:
- Backend: Add getUserModules() method in auth.service
- Backend: Login API returns modules array in user object
- Frontend: AuthContext adds hasModule() method
- Frontend: Navigation filters modules based on user.modules
- Frontend: RouteGuard checks requiredModule instead of requiredVersion
- Frontend: Remove deprecated version-based permission system
- UX: Only show accessible modules in navigation (clean UI)
- UX: Smart redirect after login (avoid 403 for regular users)

Fixes:
- Fix UTF-8 encoding corruption in ~100 docs files
- Fix pageSize type conversion in userService (String to Number)
- Fix authUser undefined error in TopNavigation
- Fix login redirect logic with role-based access check
- Update Git commit guidelines v1.2 with UTF-8 safety rules

Database Changes:
- CREATE TABLE user_modules (user_id, tenant_id, module_code, is_enabled)
- ADD UNIQUE CONSTRAINT (user_id, tenant_id, module_code)
- INSERT 4 permissions + role assignments
- UPDATE PUBLIC tenant with 8 module subscriptions

Technical:
- Backend: 5 new files (~2400 lines)
- Frontend: 10 new files (~2500 lines)
- Docs: 1 development record + 2 status updates + 1 guideline update
- Total: ~4900 lines of code

Status: User management 100% complete, module permission system operational

@@ -4,7 +4,8 @@
> **Version**: V2.0 (MVP)
> **Base URL**: `/api/v1/dc/tool-b`
> **Last updated**: 2025-12-03
> **Status**: ✅ MVP complete (all 8 API endpoints available and verified)

---

## 📋 Table of Contents
@@ -12,8 +13,8 @@
- [1. API Overview](#一api概览)
- [2. Authentication and Authorization](#二认证与鉴权)
- [3. API Endpoint Details](#三api端点详情)
- [4. Data Models](#四数据模型)
- [5. Error Handling](#五错误处理)
- [6. Performance Targets](#六性能指标)

---
@@ -22,47 +23,63 @@

### 1.1 Endpoint List

| # | Method | Path | Description | Backend | Frontend | Tests |
|---|------|------|------|---------|---------|---------|
| 0 | POST | `/upload` | File upload | ✅ Done | ✅ Integrated | ✅ Passed |
| 1 | POST | `/health-check` | Health check | ✅ Done | ✅ Integrated | ✅ Passed |
| 2 | GET | `/templates` | Get template list | ✅ Done | ✅ Integrated | ✅ Passed |
| 3 | POST | `/tasks` | Create extraction task | ✅ Done | ✅ Integrated | ✅ Passed |
| 4 | GET | `/tasks/:taskId/progress` | Query task progress | ✅ Done | ✅ Integrated | ✅ Passed |
| 5 | GET | `/tasks/:taskId/items` | Get validation grid data | ✅ Done | ✅ Integrated | ✅ Passed |
| 6 | POST | `/items/:itemId/resolve` | Resolve conflict | ✅ Done | ✅ Integrated | ✅ Passed |
| 7 | GET | `/tasks/:taskId/export` | Export Excel results | ✅ Done | ✅ Integrated | ✅ Passed |

**✅ MVP completion status (2025-12-03)**:
- Backend code: ~2200 lines (Service, Controller, Routes)
- Frontend code: ~1400 lines (complete 5-step workflow)
- Database: 4 tables created, 3 preset templates ready
- API integration: all 8 endpoints integrated and tested
- LLM calls: dual-model validation with DeepSeek-V3 + Qwen-Max succeeded
- Real-data test: 9 pathology records extracted successfully, ~10k tokens consumed
- **Known issues**: 4 technical-debt items (see `07-技术债务/Tool-B技术债务清单.md`)

### 1.2 General Conventions

**Request headers**:
```http
Content-Type: application/json
Authorization: Bearer {token}  # planned, not yet implemented
```

**Response format**:
```json
{
  "data": {...},      // returned on success
  "error": "...",     // returned on failure
  "code": 200
}
```
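
A client could model the envelope above with a small generic type. This is only a sketch: `ApiEnvelope` and `unwrap` are hypothetical names that are not part of the documented API, and the code assumes `error` is simply absent on success.

```typescript
// Hypothetical client-side model of the response envelope described above.
// Field names mirror the JSON example; the type and helper names are assumptions.
interface ApiEnvelope<T> {
  data?: T;       // present on success
  error?: string; // present on failure
  code: number;
}

// Returns the payload, or throws when the envelope carries an error.
function unwrap<T>(res: ApiEnvelope<T>): T {
  if (res.error !== undefined || res.data === undefined) {
    throw new Error(`API error ${res.code}: ${res.error ?? "missing data"}`);
  }
  return res.data;
}
```

Throwing on failure keeps the calling code on a single success path; a caller that prefers explicit branching could return the envelope unchanged instead.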

**HTTP status codes**:
- `200`: Success
- `400`: Invalid request parameters
- `401`: Unauthenticated
- `403`: Forbidden
- `404`: Resource not found
- `500`: Internal server error

---

## 2. Authentication and Authorization

### 2.1 Authentication

**Current phase (MVP)**:
- ❌ Authentication is not implemented yet
- A temporary `userId` identifies the caller (taken from the request context)

**Planned (V1.0)**:
```http
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
```

@@ -70,35 +87,38 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

| Operation | Required permission | Notes |
|------|---------|------|
| Health check | user | All users |
| View templates | user | All users |
| Create task | user | All users |
| Query task | owner | Task creator only |
| Resolve conflict | owner | Task creator only |

---

## 3. API Endpoint Details

### 3.1 Health Check

**Endpoint**: `POST /api/v1/dc/tool-b/health-check`

**Purpose**: Checks the data quality of an Excel column and blocks low-quality data

**Request body**:
```json
{
  "fileKey": "uploads/user123/data.xlsx",
  "columnName": "Medical record text"
}
```

**Request parameters**:

| Field | Type | Required | Description |
|------|------|------|------|
| `fileKey` | string | ✅ | File path in Storage |
| `columnName` | string | ✅ | Name of the column to check |

**Response** (success - 200):
```json
{
  "status": "good",
@@ -106,11 +126,11 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
  "avgLength": 256.8,
  "totalRows": 500,
  "estimatedTokens": 150000,
  "message": "Health is good; estimated consumption is about 150.0k tokens (about 300.0k tokens for dual models)"
}
```

**Response** (failure - 200 with status=bad):
```json
{
  "status": "bad",
@@ -118,26 +138,30 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
  "avgLength": 256.8,
  "totalRows": 500,
  "estimatedTokens": 0,
  "message": "Empty-value rate is too high (85.0%); this column is not suitable for extraction"
}
```

**Response fields**:

| Field | Type | Description |
|------|------|------|
| `status` | string | `good` or `bad` |
| `emptyRate` | number | Empty-value rate (0-1) |
| `avgLength` | number | Average text length |
| `totalRows` | number | Total number of rows |
| `estimatedTokens` | number | Estimated token count |
| `message` | string | Human-readable message |

**Business rules**:
- Empty-value rate > 80% → `status = 'bad'`
- Average length < 10 → `status = 'bad'`
- Only the first 100 rows are checked (performance optimization)
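
The three rules above can be sketched as a pure function. This is an illustration under stated assumptions, not the service's actual code: `evaluateHealth` and `HealthSketch` are hypothetical names, and the sketch assumes the average length is taken over non-empty sampled cells, which the rules do not specify.

```typescript
// Hypothetical result shape; a subset of the documented response fields.
interface HealthSketch {
  status: "good" | "bad";
  emptyRate: number;
  avgLength: number;
  totalRows: number;
}

const SAMPLE_ROWS = 100;    // only the first 100 rows are inspected
const MAX_EMPTY_RATE = 0.8; // empty-value rate above 80% is "bad"
const MIN_AVG_LENGTH = 10;  // average length below 10 is "bad"

function evaluateHealth(column: Array<string | null | undefined>): HealthSketch {
  const sample = column.slice(0, SAMPLE_ROWS);
  const texts = sample.map((v) => (v ?? "").trim());
  const nonEmpty = texts.filter((t) => t.length > 0);

  // An empty sample counts as fully empty.
  const emptyRate = sample.length === 0 ? 1 : 1 - nonEmpty.length / sample.length;
  // Assumption: average over non-empty cells only.
  const avgLength =
    nonEmpty.length === 0
      ? 0
      : nonEmpty.reduce((sum, t) => sum + t.length, 0) / nonEmpty.length;

  const status = emptyRate > MAX_EMPTY_RATE || avgLength < MIN_AVG_LENGTH ? "bad" : "good";
  return { status, emptyRate, avgLength, totalRows: column.length };
}
```

Keeping the thresholds as named constants makes it easy to tune them without touching the rule logic.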

**Error response**:
```json
{
  "error": "Column 'Medical record text' does not exist",
  "code": 400
}
```

@@ -148,10 +172,11 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

**Endpoint**: `GET /api/v1/dc/tool-b/templates`

**Purpose**: Returns all preset extraction templates

**Request**: No parameters

**Response** (200):
```json
{
  "templates": [
@@ -167,7 +192,7 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
        },
        {
          "name": "Differentiation grade",
          "desc": "Well/moderately/poorly differentiated",
          "width": "w-32"
        }
      ]
@@ -175,14 +200,15 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
    {
      "diseaseType": "diabetes",
      "reportType": "admission",
      "displayName": "Diabetes admission record",
      "fields": [...]
    }
  ]
}
```

**Response fields**:

| Field | Type | Description |
|------|------|------|
| `templates` | array | Template list |
@@ -191,7 +217,8 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
| `templates[].displayName` | string | Display name |
| `templates[].fields` | array | Extraction field configuration |

**Caching strategy**:
- Client-side cache: 1 hour
- Server-side cache: permanent (until restart)

---
@@ -200,9 +227,10 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

**Endpoint**: `POST /api/v1/dc/tool-b/tasks`

**Purpose**: Creates a batch extraction task and pushes it to the asynchronous queue

**Request body**:
```json
{
  "projectName": "Lung cancer pathology data extraction - 2025Q1",
  "fileKey": "uploads/user123/lung_cancer_pathology.xlsx",
@@ -216,36 +244,41 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
    },
    {
      "name": "Differentiation grade",
      "desc": "Well/moderately/poorly differentiated"
    }
  ]
}
```

**Request parameters**:

| Field | Type | Required | Description |
|------|------|------|------|
| `projectName` | string | ✅ | Task name |
| `fileKey` | string | ✅ | File path in Storage |
| `textColumn` | string | ✅ | Name of the text column |
| `diseaseType` | string | ✅ | Disease type |
| `reportType` | string | ✅ | Report type |
| `targetFields` | array | ✅ | Extraction field configuration |

**Response** (200):
```json
{
  "taskId": "550e8400-e29b-41d4-a716-446655440000"
}
```

**Flow**:
1. Verify that the file exists
2. Parse the Excel file and count the total rows
3. Create the task record (status=pending)
4. Push the job to the BullMQ queue
5. Return the taskId immediately

**Error response**:
```json
{
  "error": "File not found: uploads/user123/lung_cancer_pathology.xlsx",
  "code": 404
}
```

@@ -256,14 +289,14 @@ Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/progress`

**Purpose**: Queries task processing progress in real time

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400-e29b-41d4-a716-446655440000/progress
```

**Response** (200):
```json
{
  "taskId": "550e8400-e29b-41d4-a716-446655440000",
@@ -281,41 +314,46 @@ GET /api/v1/dc/tool-b/tasks/550e8400-e29b-41d4-a716-446655440000/progress
}
```

**Response fields**:

| Field | Type | Description |
|------|------|------|
| `status` | string | `pending/processing/completed/failed` |
| `progress` | number | Progress percentage (0-100) |
| `totalCount` | number | Total number of records |
| `processedCount` | number | Number of processed records |
| `cleanCount` | number | Number of consistent (clean) records |
| `conflictCount` | number | Number of conflicting records |
| `failedCount` | number | Number of failed records |
| `totalTokens` | number | Cumulative token count |
| `totalCost` | number | Cumulative cost ($) |

**Polling recommendation**:
- The client polls once every 3 seconds
- Stop polling when `status = 'completed'`
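
A polling loop along these lines could look like the sketch below. It is illustrative only: `pollUntilDone` and `ProgressSketch` are hypothetical names, the progress fetcher is injected so the example stays independent of any HTTP client, and stopping on `failed` as well as `completed` plus the attempt cap are assumptions added here rather than documented behavior.

```typescript
type TaskStatus = "pending" | "processing" | "completed" | "failed";

// Hypothetical subset of the progress response.
interface ProgressSketch {
  status: TaskStatus;
  progress: number;
}

// Polls until the task reaches a terminal status or the attempt cap is hit.
async function pollUntilDone(
  fetchProgress: () => Promise<ProgressSketch>,
  intervalMs = 3000,  // recommended interval: every 3 seconds
  maxAttempts = 200,  // assumption: cap so a stuck task cannot poll forever
): Promise<ProgressSketch> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = await fetchProgress();
    if (current.status === "completed" || current.status === "failed") {
      return current;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Polling timed out before the task finished");
}
```

In a browser client, `fetchProgress` would typically wrap a `fetch` call to the progress endpoint and parse the JSON body.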

---

### 3.5 Get Validation Grid Data

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/items`

**Purpose**: Returns the dual-model extraction results for manual adjudication

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict
```

**Query parameters**:

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| `page` | number | ❌ | 1 | Page number |
| `limit` | number | ❌ | 50 | Items per page |
| `status` | string | ❌ | - | Status filter |

**Response** (200):
```json
{
  "items": [
@@ -324,13 +362,13 @@ GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict
      "rowIndex": 5,
      "originalText": "Patient, male, 45 years old, diagnosed with invasive adenocarcinoma, moderately differentiated, maximum tumor diameter 3cm...",
      "resultA": {
        "Pathology type": "Invasive adenocarcinoma",
        "Differentiation grade": "Moderately differentiated",
        "Tumor size": "3cm"
      },
      "resultB": {
        "Pathology type": "Invasive adenocarcinoma",
        "Differentiation grade": "Moderately differentiated",
        "Tumor size": "3.0cm"
      },
      "status": "conflict",
@@ -347,7 +385,8 @@ GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict
}
```

**Response fields**:

| Field | Type | Description |
|------|------|------|
| `items` | array | Record list |
@@ -361,7 +400,8 @@ GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict

**Endpoint**: `POST /api/v1/dc/tool-b/items/:itemId/resolve`

**Purpose**: Manually selects the correct extraction result

**Request**:
```json
{
@@ -370,20 +410,22 @@ GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict
}
```

**Request parameters**:

| Field | Type | Required | Description |
|------|------|------|------|
| `field` | string | ✅ | Name of the conflicting field |
| `chosenValue` | string | ✅ | Chosen value |

**Response** (200):
```json
{
  "success": true
}
```

**Business logic**:
1. Update `finalResult[field] = chosenValue`
2. Remove the field from `conflictFields`
3. When all conflicts are resolved, set `status = 'resolved'`
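
The three steps above can be expressed as a small pure function. This is a sketch under assumptions: `applyResolution` and `ItemSketch` are hypothetical names, only the three item fields named in the steps are modeled, and the status union is limited to the values this document mentions for items.

```typescript
type ItemStatus = "clean" | "conflict" | "resolved";

// Hypothetical subset of an extraction item.
interface ItemSketch {
  finalResult: Record<string, string>;
  conflictFields: string[];
  status: ItemStatus;
}

// Returns an updated copy; the input item is left untouched.
function applyResolution(item: ItemSketch, field: string, chosenValue: string): ItemSketch {
  // Step 2: remove the field from the list of open conflicts.
  const conflictFields = item.conflictFields.filter((name) => name !== field);
  return {
    ...item,
    // Step 1: record the chosen value.
    finalResult: { ...item.finalResult, [field]: chosenValue },
    conflictFields,
    // Step 3: mark as resolved once no conflicts remain.
    status: conflictFields.length === 0 ? "resolved" : item.status,
  };
}
```

Returning a copy rather than mutating in place keeps the function easy to test; a persistence layer would write the returned object back in a single update.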
@@ -393,27 +435,33 @@ GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/export`

**Purpose**: Exports the final extraction results as an Excel file

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400.../export?format=xlsx
```

**Query parameters**:

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| `format` | string | ❌ | `xlsx` | Export format: `xlsx/csv` |

**Response** (200):
- File stream download
- Content-Type: `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`
- Content-Disposition: `attachment; filename="extraction_result_2025-11-27.xlsx"`

**Export contents**:
- Includes the original columns plus all extracted fields
- Includes only records with status `clean` or `resolved`
- Conflicting records are not exported (manual adjudication required)
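
The export rules above amount to a filter followed by a merge. The sketch below is illustrative only: `buildExportRows` and `ExportItemSketch` are hypothetical names, and `originalRow` is an assumed field standing in for the source Excel columns.

```typescript
// Hypothetical record shape; `originalRow` is an assumption, not a documented field.
interface ExportItemSketch {
  status: string;
  originalRow: Record<string, string>;
  finalResult: Record<string, string>;
}

// Keeps only clean/resolved records and merges original columns with extracted fields.
function buildExportRows(items: ExportItemSketch[]): Array<Record<string, string>> {
  return items
    .filter((item) => item.status === "clean" || item.status === "resolved")
    .map((item) => ({ ...item.originalRow, ...item.finalResult }));
}
```

If an extracted field shares a name with an original column, the spread order here lets the extracted value win; a real exporter might prefix extracted columns to avoid the collision.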

---

## 4. Data Models

### 4.1 HealthCheckResult

```typescript
@@ -498,7 +546,8 @@ interface ExtractionItem {

---

## 5. Error Handling

### 5.1 Error Response Format

```json
@@ -507,24 +556,27 @@ interface ExtractionItem {
  "code": 400,
  "details": {
    "field": "fileKey",
    "reason": "File not found"
  }
}
```

### 5.2 Common Error Codes

| HTTP status | code | Description | Example |
|----------|------|------|------|
| 400 | `INVALID_PARAMS` | Invalid parameters | Missing fileKey |
| 400 | `COLUMN_NOT_FOUND` | Column not found | Column "Medical record text" does not exist |
| 400 | `BAD_HEALTH` | Health check failed | Empty-value rate too high |
| 404 | `FILE_NOT_FOUND` | File not found | Invalid file path |
| 404 | `TASK_NOT_FOUND` | Task not found | Invalid taskId |
| 403 | `FORBIDDEN` | Access denied | Users can access only their own tasks |
| 500 | `INTERNAL_ERROR` | Server error | Database connection failed |

### 5.3 Error-Handling Best Practices

**Client**:
```typescript
try {
  const response = await fetch('/api/v1/dc/tool-b/health-check', {
    method: 'POST',
@@ -543,8 +595,9 @@ try {
    return;
  }

  // Continue to the next step
} catch (error) {
  console.error('Health check failed:', error);
}
```

@@ -556,41 +609,54 @@ try {

| API | Target | Notes |
|-----|------|------|
| `/health-check` | < 3 s | Excel parsing + statistics |
| `/templates` | < 100 ms | In-memory cache |
| `/tasks` (create) | < 500 ms | Create quickly and return |
| `/tasks/:id/progress` | < 100 ms | Single database query |
| `/tasks/:id/items` | < 500 ms | Paginated query |
| `/items/:id/resolve` | < 200 ms | Single-row update |
| `/tasks/:id/export` | < 10 s | Generate the Excel file |

### 6.2 Concurrency Capacity

- **Health check**: 10 req/s (I/O-intensive)
- **Task creation**: 5 req/s (database writes)
- **Progress query**: 100 req/s (read-intensive, cacheable)
- **Validation grid**: 50 req/s (paginated query)

### 6.3 Optimization Strategies

**Caching**:
- `/templates` → permanent cache (in memory)
- `/tasks/:id/progress` → Redis cache (5-second TTL)

**Asynchronous processing**:
- Task processing uses a BullMQ background queue
- Avoids blocking user requests

**Pagination**:
- The validation grid defaults to 50 items per page
- Maximum 1000 items per page
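
The pagination limits above could be enforced when parsing query parameters, which arrive as strings. The following is a minimal sketch, assuming a hypothetical helper `normalizePaging`; the fallback to defaults for missing or non-numeric input is an assumption, not documented behavior.

```typescript
const DEFAULT_LIMIT = 50; // documented default page size
const MAX_LIMIT = 1000;   // documented maximum page size

// Converts raw query strings to safe numeric paging values.
function normalizePaging(
  pageRaw?: string,
  limitRaw?: string,
): { page: number; limit: number; offset: number } {
  // Number() yields NaN for undefined or non-numeric text; `||` then falls back to the default.
  const page = Math.max(1, Math.floor(Number(pageRaw)) || 1);
  const limit = Math.min(
    MAX_LIMIT,
    Math.max(1, Math.floor(Number(limitRaw)) || DEFAULT_LIMIT),
  );
  return { page, limit, offset: (page - 1) * limit };
}
```

Converting to numbers at the boundary avoids passing string values into the database layer, where they can cause type errors.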

---

## 7. Versioning

### 7.1 API Versioning Strategy

**Current version**: `v1`

**URL format**: `/api/v1/dc/tool-b/*`

**Backward-compatibility commitment**:
- v1 remains stable until 2026
- New features are added through optional parameters
- Breaking changes are released as v2

### 7.2 Deprecation Notice

When an API needs to be deprecated:
```http
HTTP/1.1 200 OK
X-API-Deprecated: true
X-API-Sunset: 2026-12-31
@@ -599,16 +665,19 @@ X-API-Replacement: /api/v2/dc/tool-b/health-check

---

## 8. Testing

### 8.1 Postman Collection

Complete API test collection:
```
docs/03-业务模块/DC-数据清洗整理/02-技术设计/ToolB-API.postman_collection.json
```

### 8.2 Example Requests

**Health check**:
```bash
curl -X POST http://localhost:3001/api/v1/dc/tool-b/health-check \
  -H "Content-Type: application/json" \
  -d '{
@@ -617,11 +686,13 @@ curl -X POST http://localhost:3001/api/v1/dc/tool-b/health-check \
  }'
```

**Get templates**:
```bash
curl http://localhost:3001/api/v1/dc/tool-b/templates
```

**Create task**:
```bash
curl -X POST http://localhost:3001/api/v1/dc/tool-b/tasks \
  -H "Content-Type: application/json" \
  -d '{
@@ -636,19 +707,21 @@ curl -X POST http://localhost:3001/api/v1/dc/tool-b/tasks \

---

## 9. Appendix

### 9.1 Related Documents

- [Database design document](./数据库设计文档-工具B.md)
- [PRD](../01-需求分析/PRD:Tool B - 病历结构化机器人 (The AI Structurer).md)
- [Development plan](../04-开发计划/工具B开发计划-病历结构化机器人.md)

### 9.2 Changelog

| Version | Date | Changes |
|------|------|---------|
| V1.0 | 2025-11-27 | Initial version, 7 API endpoints |

---

**End of document** ✅

@@ -1,99 +1,114 @@
|
||||
# **工具 C:AI 辅助医疗数据清洗场景分级清单**
|
||||
|
||||
餈嗘遢皜<EFBFBD><EFBFBD><EFBFBD>?*<2A><><EFBFBD>臬<EFBFBD><E887AC>圈𠗕摨?*<2A>?*銝𡁜𦛚<F0A1819C>餉<EFBFBD>憭齿<E686AD>摨?*隞𡒊<E99A9E><F0A1928A>訫<EFBFBD>憭齿<E686AD><E9BDBF>鍦<EFBFBD><E98DA6><EFBFBD><EFBFBD><EFBFBD>匧㦤<E58CA7>臬<EFBFBD><E887AC><EFBFBD>挽<EFBFBD>唳旿撌脣<E6928C>頧賭蛹 Pandas DataFrame (df)<EFBFBD>?
|
||||
这份清单按**技术实现难度**和**业务逻辑复杂度**从简单到复杂排列。所有场景均假设数据已加载为 Pandas DataFrame (df)。
|
||||
|
||||
## **Level 1: 基础卫生清理 (Data Hygiene)**
|
||||
|
||||
*<2A>格<EFBFBD>嚗𡁏<E59A97><F0A1818F>𡏭<EFBFBD><F0A18FAD>脲㺭<E884B2>桀<EFBFBD><E6A180>鐥<EFBFBD>𡏭<EFBFBD>霂領<E99C82>萘<EFBFBD><E89098>唳旿<E594B3><E697BF>xcel 銋蠘<E98A8B><E8A098>𡄯<EFBFBD>雿?Python <20>游翰<E6B8B8>游<EFBFBD><E6B8B8>?
|
||||
*目标:把“脏”数据变成“能读”的数据。Excel 也能做,但 Python 更快更准。*
|
||||
|
||||
### **1.1 变量名标准化 (Rename)**
|
||||
|
||||
* **<EFBFBD>箸艶嚗?* <20>笔<EFBFBD>銵典仍<E585B8>臭葉<E887AD><E89189><EFBFBD><EFBFBD>怎鸌畾羓泵<E7BE93>瘀<EFBFBD>撟湧<E6929F>(撗?, <20>批<EFBFBD>/Gender, <20>仿堺\_<>交<EFBFBD>嚗㚁<E59A97>SPSS <20>仿<EFBFBD><E4BBBF>?
|
||||
* **<EFBFBD>冽<EFBFBD><EFBFBD><EFBFBD>誘嚗?* <20>𨀣<EFBFBD><F0A880A3><EFBFBD><EFBFBD>匧<EFBFBD><E58CA7>滩蓮銝箇滲<E7AE87>望<EFBFBD>撠誩<E692A0>嚗<EFBFBD>縧<EFBFBD>㗇𡠺<E39787>瑯<EFBFBD><E791AF><EFBFBD>?
|
||||
* **Python <EFBFBD>餉<EFBFBD>嚗?* 雿輻鍂<E8BCBB>惩<EFBFBD>摮堒<E691AE><E5A092>𡝗迤<F0A19D97>蹱𤜯<E8B9B1>W<EFBFBD><EFBCB7>溻<EFBFBD>?
|
||||
### **1.2 <20>啣<EFBFBD>澆<EFBFBD><E6BE86>𨀣<EFBFBD>瘥圝<E798A5>?(Clean Numeric)**
|
||||
* **场景:** 原始表头是中文或含特殊符号(年龄(岁), 性别/Gender, 入院\_日期),SPSS 报错。
|
||||
* **用户指令:** “把所有列名转为纯英文小写,去掉括号。”
|
||||
* **Python 逻辑:** 使用映射字典或正则替换列名。
|
||||
|
||||
* **<2A>箸艶嚗?* 璉<>撉𣬚<E69289>撖澆枂<E6BE86><E69E82>㺭<EFBFBD>殷<EFBFBD><E6AEB7>啣<EFBFBD>澆<EFBFBD>瘛瑕<E7989B>鈭<EFBFBD>泵<EFBFBD>瘀<EFBFBD>\>100, \<0.1, 12.5+, <20>芣䰻嚗剹<E59A97>?
|
||||
* **<2A>冽<EFBFBD><E586BD><EFBFBD>誘嚗?* <20>𨀣<EFBFBD><F0A880A3>䁅<EFBFBD><E48185>鐥<EFBFBD>坔<EFBFBD><E59D94>𣬚<EFBFBD><F0A3AC9A>墧㺭摮㛖泵<E39B96>瑕縧<E79195>㚁<EFBFBD><E39A81>娫<0.1<EFBFBD>蹱<EFBFBD><EFBFBD>?.05<EFBFBD>坔<EFBFBD><EFBFBD><EFBFBD><EFBFBD>頧砌蛹瘚桃<EFBFBD><EFBFBD>啜<EFBFBD><EFBFBD><EFBFBD>?
|
||||
* **Python <20>餉<EFBFBD>嚗?* str.replace \+ 甇<><E79487><EFBFBD>𣂼<EFBFBD> \+ pd.to\_numeric(errors='coerce')<29>?
|
||||
### **1.3 蝏煺<E89D8F>蝻箏仃<E7AE8F>?(Standardize Nulls)**
|
||||
### **1.2 数值列“排毒” (Clean Numeric)**
|
||||
|
||||
* **<EFBFBD>箸艶嚗?* <20>唳旿<E594B3>峕毽<E5B395><E6AFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>隞<EFBFBD>”<EFBFBD>𦦵征<F0A6A6B5>萘<EFBFBD>霂㵪<E99C82>NA, N/A, \-, \\, 銝滩祕<E6BBA9>?
|
||||
* **<EFBFBD>冽<EFBFBD><EFBFBD><EFBFBD>誘嚗?* <20>𨀣<EFBFBD><F0A880A3><EFBFBD><EFBFBD>劐誨銵兩<E98AB5>䀹瓷<E480B9>争<EFBFBD>嗵<EFBFBD>摮㛖泵<E39B96>賜<EFBFBD>銝<EFBFBD><E98A9D>踵揢銝箸<E98A9D><E7AEB8><EFBFBD><EFBFBD>蝛箏<E89D9B>潦<EFBFBD><E6BDA6><EFBFBD>?
|
||||
* **Python <EFBFBD>餉<EFBFBD>嚗?* df.replace(\['-', '銝滩祕', 'NA'\], np.nan, inplace=True)<29>?
|
||||
## **Level 2: <20>㗛<EFBFBD><E3979B><EFBFBD><EFBFBD><EFBFBD>碶<EFBFBD><E7A2B6>滨<EFBFBD><E6BBA8>?(Recode & Standardization)**
|
||||
* **场景:** 检验科导出的数据,数值列混入了符号(\>100, \<0.1, 12.5+, 未查)。
|
||||
* **用户指令:** “把‘肌酐’列里的非数字符号去掉,‘\<0.1’按‘0.05’处理,转为浮点数。”
|
||||
* **Python 逻辑:** str.replace \+ 正则提取 \+ pd.to\_numeric(errors='coerce')。
|
||||
|
||||
*<2A>格<EFBFBD>嚗帋蛹蝏蠘恣<E8A098><E681A3><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>掩<EFBFBD>㗛<EFBFBD><E3979B>?
|
||||
### **1.3 统一缺失值 (Standardize Nulls)**
|
||||
|
||||
### **2.1 <20><>𧋦頧祆㺭<E7A586>潭<EFBFBD>撠?(Map Categorical)**
|
||||
* **场景:** 数据里混杂了各种代表“空”的词:NA, N/A, \-, \\, 不详。
|
||||
* **用户指令:** “把所有代表‘没有’的字符都统一替换为标准的空值。”
|
||||
* **Python 逻辑:** df.replace(\['-', '不详', 'NA'\], np.nan, inplace=True)。
|
||||
|
||||
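## **Level 2: Variable Standardization and Recoding (Recode & Standardization)**
|
||||
|
||||
*Goal: prepare categorical variables for statistical analysis.*
|
||||
|
||||
### **2.1 Text-to-Numeric Mapping (Map Categorical)**
|
||||
|
||||
* **Scenario:** The sex column holds Male/Female, and the smoking-history column holds Yes/No.
|
||||
* **User instruction:** "Convert sex to 1 (male) / 0 (female), and smoking history to 1/0."
|
||||
* **Python logic:** df\['sex'\].map({'Male': 1, 'Female': 0}).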
|
||||
|
||||
### **2.2 Binning Continuous Variables (Binning)**
|
||||
|
||||
* **Scenario:** A chi-square test requires patients to be grouped by age.
|
||||
* **User instruction:** "Split age into three groups, 0-18, 19-60, and 60+, labeled 'minor', 'adult', and 'senior'."
|
||||
* **Python logic:** the pd.cut() function.
|
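A small sketch of the binning step, under two assumptions that the instruction leaves open: the bin edges are right-closed (so age 18 is a minor and age 60 is an adult), and the English labels stand in for whatever labels the user asks for.

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 18, 19, 60, 61, 90]})

# pd.cut uses right-closed bins by default: (-1, 18], (18, 60], (60, inf]
df["age_group"] = pd.cut(
    df["age"],
    bins=[-1, 18, 60, float("inf")],
    labels=["minor", "adult", "senior"],
)
print(df)
```

The result is a categorical column, which can be passed straight to a crosstab before the chi-square test.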
||||
|
||||
### **2.3 Complex Date Calculation (Date Logic)**
|
||||
|
||||
* **Scenario:** Computing survival time (OS). Excel often gets leap years and month lengths wrong.
|
||||
* **User instruction:** "Compute survival in months from 'diagnosis date' and 'follow-up date', rounded to one decimal place."
|
||||
* **Python logic:** (df\['end\_date'\] \- df\['start\_date'\]).dt.days / 30.4.
|
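The survival-month calculation might look like the sketch below. The column names follow the snippet above; the sample dates are made up, and the divisor 30.4 is the document's approximation of the average month length (365.25 / 12 is about 30.44).

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-15", "2020-03-01"],  # diagnosis date
    "end_date": ["2021-01-15", "2020-06-01"],    # follow-up date
})

# errors="coerce" turns unparseable dates into NaT instead of raising
start = pd.to_datetime(df["start_date"], errors="coerce")
end = pd.to_datetime(df["end_date"], errors="coerce")

# Day difference is calendar-aware (2020 is a leap year: 366 days in row 0)
df["os_months"] = ((end - start).dt.days / 30.4).round(1)
print(df)
```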
||||
|
||||
## **Level 3: Clinical-Logic Feature Engineering (Feature Engineering)**
|
||||
|
||||
*Goal: derive new analysis indicators from medical knowledge.*
|
||||
|
||||
### **3.1 Composite Formula Calculation (Complex Formula)**
|
||||
|
||||
* **Scenario:** Computing eGFR (estimated glomerular filtration rate) or BMI.
|
||||
* **User instruction:** "Calculate BMI for me. If BMI \> 28, add a new column flagging the row as 'obese'."
|
||||
* **Python logic:** vectorized calculation df\['weight'\] / (df\['height'\]/100)\*\*2 \+ conditional assignment with np.where.
|
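As a rough sketch of the BMI rule, assuming weight in kilograms and height in centimetres (the units implied by the `/100` in the formula); the flag labels and sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [60.0, 95.0], "height": [170.0, 175.0]})  # kg, cm

# Vectorized BMI: weight (kg) divided by height (m) squared
df["bmi"] = (df["weight"] / (df["height"] / 100) ** 2).round(1)

# Rows with a missing BMI compare as False and therefore land in "not obese"
df["obese"] = np.where(df["bmi"] > 28, "obese", "not obese")
print(df)
```

If missing BMI values should stay unflagged rather than fall into "not obese", a follow-up mask on `df["bmi"].isna()` would be needed.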
||||
|
||||
### **3.2 Applying Inclusion/Exclusion Criteria (Cohort Selection)**
|
||||
|
||||
* **Scenario:** Selecting the cohort that meets the enrollment criteria.
|
||||
* **User instruction:** "Select patients diagnosed with lung adenocarcinoma who are older than 18 and have no history of hypertension."
|
||||
* **Python logic:** df.query("diagnosis \== 'Lung Adenocarcinoma' & age \> 18 & hypertension \== 0").
|
||||
|
||||
### **3.3 Dummy Variable Generation (One-Hot Encoding)**
|
||||
|
||||
* **Scenario:** Preparing a logistic regression that includes an unordered multi-category variable, "blood type (A, B, AB, O)".
|
||||
* **User instruction:** "Generate dummy variables for blood type."
|
||||
* **Python logic:** pd.get\_dummies(df\['blood\_type'\], prefix='blood').
|
||||
|
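||||
## **Level 4: Structural Reshaping and Advanced Governance (Reshaping & Governance)**
|
||||
|
||||
*Goal: change the table structure to fit a specific statistical model, or perform advanced data repair.*
|
||||
|
||||
### **4.1 Long/Wide Table Conversion (Pivot/Melt): Excel's Nightmare**
|
||||
|
||||
* **Scenario:** The table is currently "multiple rows per patient" (Zhang San, 1st lab test; Zhang San, 2nd lab test), but repeated-measures analysis needs "one row per patient" (Zhang San, test 1, test 2).
|
||||
* **User instruction:** "Convert the table from long to wide, indexed by patient ID, using 'visit order' as the suffix to spread out the 'white blood cell' column."
|
||||
* **Python logic:** df.pivot(index='id', columns='visit', values='wbc').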
|
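A minimal sketch of the long-to-wide step, using the column names from the snippet above. The sample rows and the `wbc_<visit>` naming of the output columns are assumptions for illustration.

```python
import pandas as pd

long_df = pd.DataFrame({
    "id": ["P001", "P001", "P002", "P002"],
    "visit": [1, 2, 1, 2],
    "wbc": [5.6, 6.1, 7.2, 6.8],
})

# One row per patient, one column per visit
wide = long_df.pivot(index="id", columns="visit", values="wbc")

# Use the visit order as a suffix: 1 -> wbc_1, 2 -> wbc_2
wide.columns = [f"wbc_{visit}" for visit in wide.columns]
wide = wide.reset_index()
print(wide)
```

`DataFrame.pivot` raises a `ValueError` when the same (id, visit) pair appears more than once, so duplicate visits have to be resolved (or aggregated with `pivot_table`) before reshaping.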
||||
|
||||
### **4.2 Smart Deduplication**
|
||||
|
||||
* **Scenario:** One patient has two records, one complete and one with missing fields.
|
||||
* **User instruction:** "Deduplicate by patient ID. When duplicates exist, keep the record with the most recent 'examination date'; if the dates are equal, keep the one with the highest 'data completeness'."
|
||||
* **Python logic:** df.sort\_values(\['date', 'completeness'\]).drop\_duplicates(subset=\['id'\], keep='last').
|
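The deduplication rule can be sketched as below. The `completeness` column is not defined in the text, so here it is assumed to be the share of non-null clinical fields per row; the sample data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["P001", "P001", "P002", "P002"],
    "date": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-02-01", "2023-02-01"]),
    "sex": ["M", None, "F", "F"],
    "bmi": [22.0, None, None, 24.5],
})

# Hypothetical helper column: fraction of non-null fields (excluding id and date)
df["completeness"] = df.drop(columns=["id", "date"]).notna().mean(axis=1)

# Ascending sort + keep="last" keeps the latest date, then the most complete row
deduped = (
    df.sort_values(["date", "completeness"])
      .drop_duplicates(subset=["id"], keep="last")
      .sort_values("id")
      .reset_index(drop=True)
)
print(deduped)
```

Note that the date takes priority: for P001 the newer but emptier record wins, exactly as the instruction specifies.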
||||
|
||||
### **4.3 Cross-Column Logic Validation (Cross-Check)**
|
||||
|
||||
* **Scenario:** Detecting dirty data.
|
||||
* **User instruction:** "Check for erroneous records where sex is 'male' but 'pregnancy count \> 0', and flag them."
|
||||
* **Python logic:** df.loc\[(df\['sex'\]=='男') & (df\['preg\_count'\]\>0), 'error\_flag'\] \= 1 (where '男' means "male").
|
||||
|
||||
### **4.4 Multiple Imputation: Advanced Statistical Imputation**
|
||||
|
||||
* **Scenario:** The dataset has missing values (for example, missing BMI), and plain mean imputation would distort the data distribution. The missing values should instead be predicted from correlated variables (such as age, sex, and creatinine).
|
||||
* **User instruction:** "Use multiple imputation (MICE) to fill the missing values in the 'BMI' and 'age' columns."
|
||||
|
||||
* **Python logic:**

```python
# This import is required to expose the experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Impute numeric columns only
cols = ['bmi', 'age', 'creatinine']
imp = IterativeImputer(max_iter=10, random_state=0)
df[cols] = imp.fit_transform(df[cols])
```
|
||||
|
||||
## **Level 5: Unstructured Text Mining (Text Mining): Where Python Dominates**
|
||||
|
||||
*Goal: "dig" data out of remarks or report text. Excel simply cannot do this.*
|
||||
|
||||
### **5.1 Regular Expression Extraction (Regex Extraction)**
|
||||
|
||||
* **Scenario:** There is only one text column, "pathological diagnosis", with content such as "(左肺上叶)浸润性腺癌,大小3.5\*2cm" ("(left upper lobe) invasive adenocarcinoma, size 3.5\*2cm"). The tumor size must be extracted.
|
||||
* **User instruction:** "Extract the tumor's long diameter (the largest number) from 'pathological diagnosis'."
|
||||
* **Python logic:** df\['text'\].str.extract(r'(\\d+\\.?\\d\*)\\s\*\[\\\*xX\]\\s\*(\\d+\\.?\\d\*)'), then take the maximum.
|
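The extraction might be sketched as follows, reusing the regular expression from the bullet above. The second and third sample texts and the output column name `long_diameter_cm` are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "(左肺上叶)浸润性腺癌,大小3.5*2cm",
    "Invasive adenocarcinoma, size 1.2 x 4 cm",
    "No size recorded",
]})

# Two capture groups: the numbers on either side of '*', 'x' or 'X'
dims = df["text"].str.extract(r"(\d+\.?\d*)\s*[*xX]\s*(\d+\.?\d*)").astype(float)

# Long diameter = the larger of the two; rows without a match stay NaN
df["long_diameter_cm"] = dims.max(axis=1)
print(df)
```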
||||
|
||||
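### **5.2 Fuzzy String Matching (Fuzzy Matching)**
|
||||
|
||||
* **Scenario:** Hospital names are entered inconsistently: "协和医院", "北京协和", "协和" (all variants of Peking Union Medical College Hospital). They need to be unified.
|
||||
* **User instruction:** "In the 'hospital name' column, change every value containing '协和' to 'PUMCH'."
|
||||
* **Python logic:** df.loc\[df\['hospital'\].str.contains('协和', na=False), 'hospital'\] \= 'PUMCH' (na=False keeps rows with a missing hospital name from breaking the boolean mask).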
|
||||
@@ -4,15 +4,16 @@
|
||||
| :---- | :---- |
|
||||
| **Corresponding PRD** | **PRD\_总体\_医疗科研智能数据清洗平台.md** |
|
||||
| **Version** | **V1.0** |
|
||||
| **Status** | Final Draft |
|
||||
| **Core Goal** | Establish a unified technical standard for the platform, define the technical boundary between common and tool-specific modules, and guide parallel development across multiple teams. |
|
||||
|
||||
## **1\. Overall System Architecture Diagram (System Architecture)**
|
||||
|
||||
The platform adopts a **"Modular Monolith"** (a monolith organized like microservices) or a **"BFF \+ Worker"** architecture. The frontend has a single entry point, and the backend is split into services or modules by function.
|
||||
|
||||
graph TD
|
||||
subgraph Client\_Layer \[Frontend Interaction Layer (Browser)\]
|
||||
Portal\[Workbench (Portal)\]
|
||||
ToolA\_UI\[Tool A: Super Merger\]
|
||||
ToolB\_UI\[Tool B: Structuring Robot\]
|
||||
ToolC\_UI\[Tool C: Research Editor\]
|
||||
@@ -47,89 +48,93 @@ graph TD
|
||||
ToolC\_UI \--Local First--\> IndexedDB\[(Browser DB)\]
|
||||
ToolC\_UI \--Snapshot sync--\> BFF
|
||||
|
||||
## **2\. Common Technical Foundation (The Common Foundation)**
|
||||
|
||||
This part of the stack runs through every module and is the standard that all teams must follow.
|
||||
|
||||
### **2.1 Frontend Common Stack (Frontend Core)**
|
||||
|
||||
| Component | Choice | Notes |
|
||||
| :---- | :---- | :---- |
|
||||
| **Framework** | **React 19** | Uses the latest Hooks and concurrent features. |
|
||||
| **Build tool** | **Vite 5.x** | Very fast builds with HMR support. |
|
||||
| **Language** | **TypeScript 5.x** | Strict strong typing; frontend and backend share type definitions (shared-types). |
|
||||
| **Styling** | **Tailwind CSS** | Unified UI style and rapid development. |
|
||||
| **Icons** | **Lucide React** | Lightweight SVG icons with a consistent style. |
|
||||
| **Routing** | **React Router v6** | Manages nested routing between the Portal and the individual Tools. |
|
||||
| **Data fetching** | **SWR** or **TanStack Query** | Handles API requests, caching, and **polling** of task status. |
|
||||
|
||||
### **2.2 Backend Common Stack (Backend Core)**
|
||||
|
||||
| Component | Choice | Notes |
|
||||
| :---- | :---- | :---- |
|
||||
| **Runtime** | **Node.js 22 (LTS)** | Stay on the latest LTS release. |
|
||||
| **Web framework** | **Fastify 5.x** | High performance, low overhead, schema-validation friendly. |
|
||||
| **ORM** | **Prisma 6** | Type-safe database access with Schema Migration support. |
|
||||
| **Parameter validation** | **Zod** | Runtime schema validation from which TypeScript types can be derived. |
|
||||
| **Logging** | **Winston / Pino** | Structured JSON logs. |
|
||||
|
||||
### **2.3 Infrastructure Stack (Infrastructure)**
|
||||
|
||||
| Component | Choice | Notes |
|
||||
| :---- | :---- | :---- |
|
||||
| **Database** | **PostgreSQL 15** | Stores users, tasks, asset metadata, and structured results (JSONB). |
|
||||
| **Cache/Queue** | **Redis 7** | Redis serves both as the cache and as the backend for **BullMQ**. |
|
||||
| **File storage** | **MinIO / AWS S3** | Stores user-uploaded Excel and PDF files as well as generated intermediate files. |
|
||||
|
||||
## **3\. Module-Specific Technology Stack (Module-Specific Stack)**
|
||||
|
||||
To meet the special requirements of each scenario, each tool introduces specific technical components.
|
||||
|
||||
### **3.1 Tool A: Super Merger (IO-Intensive)**
|
||||
|
||||
*Core challenges: streaming large files, date parsing, hash matching.*
|
||||
|
||||
| Area | Dedicated component | Rationale |
|
||||
| :---- | :---- | :---- |
|
||||
| **Backend (Excel)** | **ExcelJS** | Compared with SheetJS, it has better **Stream** support and can process large files that exceed the memory limit. |
|
||||
| **Backend (Date)** | **Day.js \+ CustomParseFormat** | Handles Excel's messy date formats (44927, 2023/1/1); lightweight yet powerful. |
|
||||
| **Async queue** | **BullMQ** | Runs time-consuming merge jobs and supports progress reporting. |
|
||||
| **Frontend components** | **Ant Design Steps / Upload** | Quickly builds a wizard-style UI. |
|
||||
|
||||
### **3.2 Tool B: Medical Record Structuring Robot (API/Compute-Intensive)**
|
||||
|
||||
*Core challenges: LLM orchestration, dual-model concurrency, text comparison.*
|
||||
|
||||
| Area | Dedicated component | Rationale |
|
||||
| :---- | :---- | :---- |
|
||||
| **Backend (AI)** | **LangChain.js** | Unifies the calling interface for DeepSeek and Qwen and manages Prompt Templates. |
|
||||
| **Backend (Diff)** | **diff-match-patch** (Google) | Computes text differences between the two models' outputs, or highlight positions in the source text. |
|
||||
| **Backend (Comparison)** | **Lodash / Dice Coefficient** | Deep comparison of JSON objects and string-similarity calculation. |
|
||||
| **Frontend (Grid)** | **TanStack Table** (Headless) | The "conflict cell" UI (side-by-side buttons) needs heavy customization, and a headless library is more flexible than AntD Table. |
|
||||
|
||||
### **3.3 Tool C: Research Data Editor (Interaction-Intensive)**
|
||||
|
||||
*Core challenges: high-performance frontend rendering, local computation, undo/redo.*
|
||||
|
||||
| Area | Dedicated component | Rationale |
|
||||
| :---- | :---- | :---- |
|
||||
| **Frontend (Grid)** | **AG Grid Community** | **Core component**. The only free library that supports virtual scrolling, column dragging, and Excel-grade interaction. |
|
||||
| **Frontend (Storage)** | **Dexie.js (IndexedDB)** | **Core of the Local-First architecture**. Stores 50,000 to 100,000 rows in the browser to avoid frequent network requests. |
|
||||
| **Frontend (State)** | **Zustand \+ Immer** | Uses Immer's Patches feature to implement the **Undo/Redo** stack. |
|
||||
| **Frontend (Calc)** | **Math.js** | Works around JS floating-point precision issues and parses user-entered medical formulas (ln, pow). |
|
||||
| **Frontend (Chart)** | **Ant Design Charts (G2)** | Draws histograms and frequency charts in the smart sidebar. |
|
||||
|
||||
## **4\. Data Exchange Standards (Data Standards)**
|
||||
|
||||
To connect the A \-\> B \-\> C flow, a unified data exchange format must be defined.
|
||||
|
||||
### **4.1 Internal Exchange Format**
|
||||
|
||||
* **Physical file format:** Use **CSV (UTF-8 with BOM)** or **JSON Lines (.jsonl)** throughout.
|
||||
* *Rationale:* Stream processing is fastest, with no dependence on Excel's complex XML structure.
|
||||
* **Date standard:** Every date produced by any tool must be normalized to a YYYY-MM-DD string.
|
||||
* **Null standard:** Use null (JSON) or "" (CSV) only; "NA" and "-" are strictly forbidden.
|
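The date and null rules above can be illustrated with a short sketch. It is written in Python only for brevity (the platform backend described here is Node.js), and the function name, the set of null tokens, and the accepted input formats are assumptions rather than part of the standard.

```python
from datetime import date, datetime, timedelta

EXCEL_EPOCH = date(1899, 12, 30)        # day 0 of Excel's 1900 date system
NULL_TOKENS = {"", "NA", "N/A", "-"}    # assumed list of tokens treated as empty

def normalize_date(value):
    """Return 'YYYY-MM-DD', or None when the value is empty or unparseable."""
    text = str(value).strip()
    if text in NULL_TOKENS:
        return None
    if text.isdigit():  # Excel serial number such as 44927
        return (EXCEL_EPOCH + timedelta(days=int(text))).isoformat()
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date(44927), normalize_date("2023/1/1"), normalize_date("NA"))
```

Returning `None` for anything unparseable maps directly onto the null standard: `null` in JSON Lines output and an empty string in CSV output.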
||||
|
||||
### **4.2 API Response Structure (Standard Response)**
|
||||
|
||||
interface ApiResponse\<T\> {
|
||||
code: number; // 0: success, \>0: error code
|
||||
data: T; // business data
|
||||
message?: string; // error message
|
||||
meta?: { // pagination or metadata
|
||||
@@ -140,8 +145,8 @@ interface ApiResponse\<T\> {
|
||||
|
||||
## **5\. Development Environment and Deployment (DevOps)**
|
||||
|
||||
* **Package management:** **pnpm** (recommended; saves disk space and installs quickly).
|
||||
* **Monorepo (optional):** Turborepo or Nx is recommended for managing packages such as frontend, backend-api, worker-merger, and worker-ai, with shared type definitions.
|
||||
* **Containerization:**
|
||||
* **API Service:** Stateless and horizontally scalable.
|
||||
* **Worker Service:** Deployed separately and scaled according to CPU/memory load (Worker A in particular consumes a lot of memory when processing large files).
|
||||
@@ -1,16 +1,17 @@
|
||||
# **Technical Design Document: Tool A \- Medical Data Super Merger (The Super Merger)**
|
||||
|
||||
| Document type | Technical Design Document (TDD) |
|
||||
| :---- | :---- |
|
||||
| **Corresponding PRD** | **PRD\_工具A\_超级合并器\_V2.md** |
|
||||
| **Version** | **V2.0** (architecture upgrade: visit anchor \+ time window) |
|
||||
| **Status** | Draft |
|
||||
| **Core Goal** | Build a web-based ETL tool that solves the "one-to-many" data alignment problem in clinical research and performs precise merging based on time windows. |
|
||||
|
||||
## **1\. Overall Architecture Design (Architecture Overview)**
|
||||
|
||||
Because processing Excel files (parsing, merging, writing) is CPU-intensive and memory-sensitive, we adopt an **"asynchronous task queue \+ streaming"** architecture to avoid blocking the Node.js main thread.
|
||||
|
||||
### **1.1 System Architecture Diagram**
|
||||
|
||||
graph TD
|
||||
Client\[React Frontend (Wizard UI)\]
|
||||
@@ -29,7 +30,7 @@ graph TD
|
||||
end
|
||||
|
||||
subgraph Storage \[Data Storage\]
|
||||
PG\[(PostgreSQL Business DB)\]
|
||||
FileSys\[Temporary File Storage (Local/S3)\]
|
||||
Redis\[(Redis Cache/Queue)\]
|
||||
end
|
||||
@@ -42,36 +43,38 @@ graph TD
|
||||
BullMQ \--Consume job--\> Merger
|
||||
Merger \--Read auxiliary tables (full load)--\> FileSys
|
||||
Merger \--Read main table (streaming)--\> FileSys
|
||||
Merger \--Streaming merge and write--\> FileSys
|
||||
Merger \--Update status--\> PG
|
||||
Client \--3.Poll/WS progress--\> TaskAPI
|
||||
Client \--4.Download result--\> API\_Server
|
||||
|
||||
## **2\. Technology Selection (Tech Stack)**
|
||||
|
||||
Targeted choices based on the existing technology stack:
|
||||
|
||||
| Layer | Component | Rationale |
|
||||
| :---- | :---- | :---- |
|
||||
| **Frontend** | **React 19 \+ Ant Design 5** | Uses AntD's Steps, Upload, and Tree (tree selector) to build the UI quickly. |
|
||||
| **Backend framework** | **Fastify 5.x** | High-performance HTTP framework suited to highly concurrent I/O. |
|
||||
| **Excel processing** | **ExcelJS** | **Core component**. Supports streaming reads and writes (Streaming I/O), which is the key to processing large data volumes without crashing. |
|
||||
| **Date handling** | **Day.js \+ CustomParseFormat** | **New**. The core library for dealing with "date hell"; highly fault-tolerant parsing is required. |
|
||||
| **Task queue** | **BullMQ \+ Redis** | Processing must be asynchronous: the merge logic is complex and slow, so a queue is required. |
|
||||
| **Database** | **PostgreSQL 15 \+ Prisma** | Stores task status and file metadata. **Storing raw Excel data in PG is not recommended**. |
|
||||
| **Validation** | **Zod** | Validates the complex mapping configuration submitted by the frontend. |
|
||||
|
||||
### **2.1 Key Technical Decision (ADR): Why Not Python (Pandas)?**
|
||||
|
||||
Although Python Pandas makes data-merging code more concise, for **this tool's** scenario we decided to stay with **Node.js**, for the following reasons:
|
||||
|
||||
1. **Streaming advantage:** Pandas tends to load everything into memory and can easily hit OOM. Node.js's Stream API supports backpressure natively and can reliably handle the "data inflation" problem.
|
||||
2. **Architectural consistency:** Avoids the operations cost and IPC overhead of introducing a Python runtime.
|
||||
3. **Conclusion:** For exact matching and rule-based cleaning, Node.js is fast enough and easier to control.
|
||||
|
||||
## **3\. Database Design (Database Schema)**
|
||||
|
||||
### **Prisma Schema Definition**
|
||||
|
||||
// Task status enum
|
||||
enum TaskStatus {
|
||||
PENDING
|
||||
PROCESSING
|
||||
@@ -79,7 +82,7 @@ enum TaskStatus {
|
||||
FAILED
|
||||
}
|
||||
|
||||
// Merge task table
|
||||
model MergeTask {
|
||||
id String @id @default(uuid())
|
||||
userId String
|
||||
@@ -89,9 +92,9 @@ model MergeTask {
|
||||
// Core configuration field (updated in V2)
|
||||
// Structure: {
|
||||
// anchorFileId: string,
|
||||
// anchorKeys: { id: "住院号", time: "入院日期" }, (admission number, admission date)
|
||||
// window: { daysBefore: 7, daysAfter: 7 },
|
||||
// files: \[{ id: "f2", timeCol: "报告时间", columns: \["白细胞"\] }\] (report time, white blood cells)
|
||||
// }
|
||||
config Json?
|
||||
|
||||
@@ -110,7 +113,7 @@ model SourceFile {
|
||||
task MergeTask @relation(fields: \[taskId\], references: \[id\])
|
||||
filename String
|
||||
filepath String
|
||||
headers Json // \["住院号", "姓名", "入院日期"\] (admission number, name, admission date)
|
||||
rowCount Int
|
||||
fileSize Int
|
||||
uploadedAt DateTime @default(now())
|
||||
|
||||
@@ -4,13 +4,14 @@
|
||||
| :---- | :---- |
|
||||
| **Corresponding PRD** | **PRD\_工具B\_病历结构化机器人\_V2.md** |
|
||||
| **Version** | **V2.0** (architecture upgrade: dual-model cross-validation) |
|
||||
| **Status** | Draft |
|
||||
| **Core Goal** | Build a high-confidence medical text structuring engine that tackles AI hallucination through **concurrent extraction by two models (DeepSeek & Qwen)** and **automatic cross-validation**. |
|
||||
|
||||
## **1\. Overall Architecture Design (Architecture Overview)**
|
||||
|
||||
蝟餌<EFBFBD><EFBFBD>嗆<EFBFBD>隞𢛶<EFBFBD>𨅯<EFBFBD>蝥踵<EFBFBD>扳<EFBFBD>瘞渡瑪<EFBFBD>嘥<EFBFBD>蝥找蛹 **<EFBFBD>陖<EFBFBD>见僎<EFBFBD>烐<EFBFBD>瘞渡瑪<EFBFBD>?*<2A><>㺭<EFBFBD>株<EFBFBD><E6A0AA>亙<EFBFBD>嚗<EFBFBD><E59A97><EFBFBD>𤑳<EFBFBD>銝支葵銝滚<E98A9D><E6BB9A>?LLM 璅∪<E79285>撟嗉<E6929F>憭<EFBFBD><E686AD>嚗𣬚<E59A97><F0A3AC9A>𨀣<EFBFBD><F0A880A3>𡁜<EFBFBD><F0A1819C>𨅯<EFBFBD>蝒<EFBFBD><E89D92>瘚见<E7989A><E8A781>𢛶<EFBFBD>肽<EFBFBD>銵峕<E98AB5>撖對<E69296><E5B08D><EFBFBD><EFBFBD>舘<EFBFBD><E88898>箏<EFBFBD>鈭箏極撉諹<E69289>蝵烐聢<E78390>?
|
||||
### **1.1 蝟餌<E89D9F><E9A48C>嗆<EFBFBD><E59786>?*
|
||||
系统架构从“单线性流水线”升级为 **“Y型并发流水线”**。数据进入后,分发给两个不同的 LLM 模型并行处理,结果汇聚到“冲突检测引擎”进行比对,最后输出到人工验证网格。
|
||||
|
||||
### **1.1 系统架构图**
|
||||
|
||||
graph TD
|
||||
Client\[React 前端 (Grid & Drawer UI)\]
|
||||
@@ -45,45 +46,50 @@ graph TD
|
||||
Orchestrator \--4.脱敏--\> PII\_Engine
|
||||
PII\_Engine \--5.并行调用--\> ClientA & ClientB
|
||||
ClientA & ClientB \--6.返回JSON--\> CrossValidator
|
||||
CrossValidator \--7.霈∠<EFBFBD>銝<EFBFBD><EFBFBD>湔<EFBFBD>?-\> PG
|
||||
CrossValidator \--7.计算一致性--\> PG
|
||||
Client \--8.拉取网格数据--\> VerifyAPI
|
||||
VerifyAPI \--9.人工裁决--\> PG
|
||||
|
||||
## **2\. Tech Stack**

| Layer | Component | Rationale |
| :---- | :---- | :---- |
| **Backend framework** | **Fastify 5.x** | High-performance asynchronous I/O, well suited to highly concurrent model calls. |
| **Model access** | **LangChain.js** | Wraps the DeepSeek and Qwen call interfaces uniformly so that models are easy to swap. |
| **Task queue** | **BullMQ** | Core component. V2 needs its Flow feature or manual orchestration to implement the "wait for both models to return" logic. |
| **Conflict detection** | **Lodash (basic) \+ Dice Coefficient (advanced)** | Compares field-level differences between two JSON objects. Text similarity can use a simple Dice coefficient or Levenshtein distance; a heavyweight vector store is not needed for now. |
| **Database** | **PostgreSQL 15** | Stores the dual-model results as JSONB. |
| **Frontend interaction** | **React \+ TanStack Table** | V2 switches to a panoramic grid; with large data volumes it needs TanStack Table (headless) combined with virtual scrolling. |

## **3\. Core Logic**

### **3.1 Smart Health Check (Health Check Logic)**

* **Trigger:** The moment the user selects the "text column" in the frontend.
* **Execution logic:**
  1. The backend reads the first 100 rows of that column (not the full dataset).
  2. Compute the statistics:
     * emptyRate: empty values / total rows.
     * avgLength: average character count of the non-empty rows.
  3. **Blocking rule:** If emptyRate \> 0.8 or avgLength \< 10, return status: 'BAD'.
  4. **Token estimate:** totalRows \* avgLength \* 1.5 (rough estimate).
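
The health check above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `healthCheck`, the `'GOOD'` status for passing columns, and the treatment of whitespace-only cells as empty are assumptions; only the 100-row sample, the two thresholds, and the 1.5 multiplier come from the list above.

```javascript
// Hypothetical helper: samples the first `sampleSize` cells of one column
// and applies the thresholds described in the health check.
function healthCheck(columnValues, totalRows, sampleSize = 100) {
  const sample = columnValues.slice(0, sampleSize);

  // Assumption: null, undefined and whitespace-only strings count as empty.
  const isEmpty = (v) => v === null || v === undefined || String(v).trim() === '';
  const nonEmpty = sample.filter((v) => !isEmpty(v)).map((v) => String(v));

  // emptyRate: empty values / sampled rows (an empty sample is treated as fully empty).
  const emptyRate = sample.length === 0
    ? 1
    : (sample.length - nonEmpty.length) / sample.length;

  // avgLength: average character count of the non-empty rows.
  const avgLength = nonEmpty.length === 0
    ? 0
    : nonEmpty.reduce((sum, s) => sum + s.length, 0) / nonEmpty.length;

  // Blocking rule from the design; 'GOOD' is an assumed name for the passing state.
  const status = emptyRate > 0.8 || avgLength < 10 ? 'BAD' : 'GOOD';

  // Rough token estimate over the whole column, not just the sample.
  const estimatedTokens = Math.round(totalRows * avgLength * 1.5);

  return { status, emptyRate, avgLength, estimatedTokens };
}
```

Because only a fixed-size sample is read, the check stays cheap regardless of file size, while the token estimate still scales with `totalRows`.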

### **3.2 Double-Blind Extraction and Cross-Validation (Double-Blind & Validation)**

This is the heart of V2.

#### **A. Prompt Engineering**

To make comparison straightforward, both models must be forced to output **exactly the same JSON structure**.

* **System Prompt:** "You are a medical structural extraction assistant..."
* **Constraint:** "Output strictly in JSON format. Keys must be: \['tumor\_size', 'lymph\_node', ...\]."
* **Temperature:** Set to 0 for maximum determinism.

#### **B. Cross-Validation Algorithm (The Judge)**

After Model A (DeepSeek) and Model B (Qwen) return their results, run the comparison:

function validate(jsonA, jsonB) {
const conflicts \= \[\];
const keys \= Object.keys(jsonA);
@@ -95,10 +101,10 @@ function validate(jsonA, jsonB) {
// 1\. Exact match
if (valA \=== valB) continue;

// 2\. Normalized numeric match (e.g. "3cm" vs "3.0cm")
if (isNumber(valA) && isNumber(valB) && parse(valA) \=== parse(valB)) continue;

// 3\. (Optional) Semantic similarity match
// if (similarity(valA, valB) \> 0.95) continue;

conflicts.push(key);
@@ -107,12 +113,13 @@ function validate(jsonA, jsonB) {
return conflicts.length \=== 0 ? 'CLEAN' : 'CONFLICT';
}
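
The excerpt above omits the loop header and relies on `isNumber` and `parse`, which are not shown in the diff. The following self-contained sketch fills those gaps under stated assumptions: `parseNumeric` is a hypothetical helper that accepts values shaped like "3cm" or "3.0 cm", keys are taken from both objects rather than only `jsonA`, and the function returns the conflict list alongside the status so it can feed a `conflictFields` column. None of these choices is confirmed by the source.

```javascript
// Hypothetical helper: splits "3cm" / "3.0 cm" into a number and a unit.
// Returns null when the value is not a plain number with an optional unit.
function parseNumeric(value) {
  const match = /^\s*(-?\d+(?:\.\d+)?)\s*([a-zA-Z%]*)\s*$/.exec(String(value));
  return match ? { num: Number(match[1]), unit: match[2].toLowerCase() } : null;
}

// Compares two extraction results field by field.
function validate(jsonA, jsonB) {
  const conflicts = [];
  // Assumption: a key present in only one result also counts as a conflict.
  const keys = new Set([...Object.keys(jsonA), ...Object.keys(jsonB)]);

  for (const key of keys) {
    const valA = jsonA[key];
    const valB = jsonB[key];

    // 1. Exact match
    if (valA === valB) continue;

    // 2. Normalized numeric match (e.g. "3cm" vs "3.0 cm")
    const numA = parseNumeric(valA);
    const numB = parseNumeric(valB);
    if (numA && numB && numA.num === numB.num && numA.unit === numB.unit) continue;

    conflicts.push(key);
  }

  return { status: conflicts.length === 0 ? 'CLEAN' : 'CONFLICT', conflicts };
}
```

For example, `validate({ size: '3cm' }, { size: '3.0 cm' })` yields a `CLEAN` status, while differing lymph-node values would be listed in `conflicts`.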

## **4\. Database Schema**

V2 needs to store both AI results as well as the user's adjudication result.

### **Prisma Schema Update**

// Job table
model ExtractionJob {
id String @id @default(uuid())
// ...other fields
@@ -121,7 +128,7 @@ model ExtractionJob {
targetFields Json // Target field definitions \[{name: "肿瘤大小", desc: "..."}\] (tumor size)
}

// Per-row record table
model ExtractionItem {
id String @id @default(uuid())
jobId String
@@ -131,47 +138,51 @@ model ExtractionItem {
resultA Json? // DeepSeek result { "size": "3cm" }
resultB Json? // Qwen result { "size": "3.0 cm" }

// Conflict detection result
status ItemStatus // PENDING, CLEAN, CONFLICT, RESOLVED
conflictFields String\[\] // \["size"\] records which fields conflict

// Final adopted result (written after the user adjudicates, or automatically when the two models agree)
finalResult Json?
}

## **5\. API Endpoints**

### **5.1 Templates and Configuration**

* GET /api/templates: Returns the list of preset disease and report templates.
* POST /api/jobs: Creates a job. The payload must include diseaseType and reportType so that the backend can assemble the prompt.

### **5.2 Grid Verification**

* GET /api/jobs/:id/rows: Returns verification data page by page.
  * **Response:** originalText, resultA, resultB, conflictFields.
* POST /api/items/:id/resolve: Adjudicates a single row.
  * **Payload:** { field: "tumor\_size", chosenValue: "3cm" }.
  * **Logic:** Update finalResult; once every conflicting field in the row has been resolved, set status to RESOLVED.
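
The resolve logic can be illustrated on an in-memory item. This is a sketch under assumptions: `resolveField` is a hypothetical name, persistence is left out, and a conflicting field is considered resolved once `finalResult` contains a key for it (which presumes `finalResult` never pre-populates conflicting fields).

```javascript
// Applies one field-level decision and returns an updated copy of the item.
function resolveField(item, field, chosenValue) {
  if (!item.conflictFields.includes(field)) {
    throw new Error(`Field "${field}" is not in conflict`);
  }

  // Merge the chosen value into finalResult without mutating the input.
  const finalResult = { ...(item.finalResult || {}), [field]: chosenValue };

  // Assumption: a conflict is resolved once finalResult has a key for it.
  const allResolved = item.conflictFields.every((f) => f in finalResult);

  return {
    ...item,
    finalResult,
    status: allResolved ? 'RESOLVED' : item.status,
  };
}
```

Returning a copy rather than mutating the item keeps the function easy to reuse for the optimistic update on the frontend as well as inside the request handler.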

## **6\. Detailed Frontend Design (Frontend)**

### **6.1 Panoramic Verification Grid (Verification Grid)**

* **Component choice:** Still recommended: **TanStack Table** (logic layer) \+ a **UI component library** (rendering layer).
* **Conflict cell rendering:**
  * When conflictFields.includes(column.id) is true, the cell renders in **comparison mode**.
  * Show two small buttons: \[DS: 3cm\] and \[QW: 3.0cm\].
  * When the user clicks either button, the resolve API is called and the frontend optimistically updates (Optimistic Update) the cell to the selected state.

### **6.2 Source Text Sidebar (Context Drawer)**

* **Trigger:** Clicking a blank area of a table row or the "view source text" icon.
* **Function:** Displays originalText.
* **Highlighting:** A simple implementation uses String.indexOf to find the current field's value and highlight it in yellow.
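
A minimal sketch of the `String.indexOf` highlighting follows. The function name `highlightSegments` and the segment shape are assumptions made for illustration; the sketch marks only the first occurrence and leaves rendering (for example, wrapping the highlighted segment in a yellow `<mark>`) to the component.

```javascript
// Splits originalText into segments and flags the first occurrence of `value`.
function highlightSegments(originalText, value) {
  const needle = value === null || value === undefined ? '' : String(value);
  const start = needle === '' ? -1 : originalText.indexOf(needle);

  // Value not found (or empty): return the text as a single plain segment.
  if (start === -1) return [{ text: originalText, highlight: false }];

  const end = start + needle.length;
  return [
    { text: originalText.slice(0, start), highlight: false },
    { text: originalText.slice(start, end), highlight: true },
    { text: originalText.slice(end), highlight: false },
  ].filter((segment) => segment.text !== '');
}
```

An exact substring search will miss values the model normalised (such as "3.0 cm" when the source says "3cm"), which is an inherent limit of this simple approach.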

## **7\. Risk Control and Performance Optimization**

| Potential risk | Mitigation |
| :---- | :---- |
| **Doubled token cost** | 1\. Default to the DeepSeek (very low cost) \+ Qwen (low cost) combination. 2\. Strictly block invalid data at the "health check" stage. |
| **Slow processing** | The two models must be called **concurrently (Promise.all)**, not serially. Overall latency is determined by the slower model. |
| **Models ignore the output format** | Add few-shot examples to the prompt that show the JSON format explicitly. If JSON parsing fails, retry automatically once. |
| **Sluggish frontend grid** | Enable virtual scrolling when the data exceeds 1,000 rows. |
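
The concurrency and retry mitigations from the table can be combined in one small sketch. The model clients are passed in as plain async functions (`callDeepSeek`, `callQwen`), which are hypothetical stand-ins for whatever LangChain.js wrappers the project uses; only the use of `Promise.all` and the single retry on a JSON parse failure come from the table.

```javascript
// Calls one model and parses its reply as JSON.
// Retries only when parsing fails; errors thrown by callModel propagate at once.
async function extractWithRetry(callModel, prompt, maxRetries = 1) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const raw = await callModel(prompt);
    try {
      return JSON.parse(raw);
    } catch (err) {
      lastError = err; // malformed JSON: try again if attempts remain
    }
  }
  throw lastError;
}

// Runs both models concurrently; total latency is bounded by the slower one.
async function extractBoth(callDeepSeek, callQwen, prompt) {
  const [resultA, resultB] = await Promise.all([
    extractWithRetry(callDeepSeek, prompt),
    extractWithRetry(callQwen, prompt),
  ]);
  return { resultA, resultB };
}
```

Note that `Promise.all` rejects as soon as either model fails for good; if a partial result should still be stored, `Promise.allSettled` would be the alternative to consider.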

@@ -1,17 +1,18 @@
# **Technical Design Document: Tool C \- Research Data Editor (V7 Cloud Sandbox, Risk-Hardened Edition)**

| Document type | Technical Design Document (TDD) |
| :---- | :---- |
| **Corresponding prototype** | **工具C\_科研数据编辑器\_原型设计\_V6\_修复版.html** |
| **Version** | **V7.1** (full record of the architecture decisions (ADR) and red-team risk countermeasures) |
| **Status** | Final Standard |
| **Core Goal** | Build a highly reliable cloud-based Python data cleaning platform. While guaranteeing that "data never leaves the domain", use Apache Arrow and style separation to solve the latency and format loss introduced by server-side execution. |

## **1\. System Architecture**

Given the file size limit (\<20MB) and the de-identification precondition, the design adopts a "Node.js BFF \+ Python Microservice" architecture.

V7 core upgrade: Apache Arrow is introduced as the data exchange standard between frontend and backend, replacing the inefficient repeated reading and writing of Excel files and cutting single-interaction latency from 8s to 0.5s.

### **1.1 Architecture Topology (V7 Optimized)**

graph TD
subgraph Client\_Layer \[Client side\]
@@ -19,13 +20,13 @@ graph TD
ArrowClient\[Apache Arrow JS\]
end

subgraph Aliyun\_SAE \[Alibaba Cloud Serverless App Engine\]
BFF\[Node.js web service (Fastify)\]
PythonService\[Python compute microservice (FastAPI)\]
end

subgraph Cache\_Layer \[High-speed cache layer\]
Redis\_Session\[Redis (stores Arrow-serialized DataFrames)\]
end

subgraph AI\_PaaS \[AI capability layer\]
@@ -34,15 +35,15 @@ graph TD
end

subgraph Cloud\_Infra \[Persistence layer\]
OSS\[Object storage (stores the Excel base template)\]
RDS\[RDS PostgreSQL (stores metadata)\]
end

%% Interaction flow
User\[User\] \--\>|1. Upload| BFF
BFF \--\>|2. Store original Excel| OSS
BFF \--\>|3. Warm up session| PythonService
PythonService \--\>|4. Load and convert to Arrow| Redis\_Session

User \--\>|5. AI instruction| Dify
Dify \--\>|6. Python code| BFF
@@ -60,69 +61,77 @@ graph TD

## **2\. Key Architecture Decision Record (ADR)**

This section records why the design moved from "frontend Pyodide" to a "backend sandbox", for the team's reference.

### **Decision: Frontend Execution (WASM) vs. Backend Execution (Server-side)**

| Dimension | Option A: Frontend Pyodide (WASM) | Option B: Backend Python (this design) | Verdict |
| :---- | :---- | :---- | :---- |
| **Startup latency** | **Very slow (15s+)**. Requires downloading a \~20MB engine bundle; very poor user experience. | **Instant**. The environment is pre-warmed on the server and ready on open. | **Backend wins** |
| **Interaction latency** | **Very fast (\< 0.1s)**. Local in-memory operations, no network overhead. | **Moderate (0.5s)**. Acceptable after Apache Arrow optimization. | Frontend wins |
| **Stability** | **High risk**. Browser tab memory is limited; prone to OOM crashes. | **Highly stable**. Ample server memory and container isolation; a crash does not affect the frontend. | **Backend wins** |
| **Library support** | **Limited**. Some C extension libraries (e.g. complex statistics libraries) are unsupported. | **Unrestricted**. Standard Linux environment with the full ecosystem. | **Backend wins** |
| **Development difficulty** | **Very high**. Requires handling JS-Python communication and memory management. | **Low**. Standard Web API development. | **Backend wins** |

**Conclusion:** Given that "data is already de-identified" and "files are small", the **backend execution option** wins across stability, compatibility, and development cost.

## **3\. Tech Stack Selection and Integration (Tech Stack Fusion)**

### **3.1 Core Component Updates**

| Area | Choice | Why added in V7 |
| :---- | :---- | :---- |
| **Data exchange** | **Apache Arrow** | **Key upgrade**. High-performance data transfer between Python and Node.js/the frontend that avoids JSON serialization overhead; **the core fix for I/O latency**. |
| **Excel handling** | **openpyxl** | Replaces pure Pandas. Used at export time to **preserve the original Excel styles** (e.g. colors, borders); **the core fix for format loss**. |
| **Session cache** | **Redis** | Temporarily stores the user's DataFrame (serialized as Parquet/Arrow) so that each operation does not have to read the file from OSS. |
| **Compute service** | **FastAPI \+ Celery** | Celery handles asynchronous tasks so that long computations do not block HTTP threads. |

## **4\. Adversarial Risk Assessment and Countermeasures (Red Teaming Analysis)**

This section records in detail the potentially fatal risks found during "red team testing" and their engineering solutions.

### **Risk 1: Interaction Latency and the Collapse of Perceived Responsiveness**

* **Red-team challenge:** If every AI operation goes through OSS download \-\> Pandas read \-\> compute \-\> upload, a single operation may take more than 8 seconds, which users will not tolerate.
* **V7 solution:** **Session-resident mode (Memory-Resident)**
  1. **Initialization:** After the user uploads an Excel file, the backend loads it into a DataFrame, serializes it in **Arrow** format, and stores it in Redis (TTL 30min).
  2. **Incremental interaction:** The frontend sends an instruction; Python reads the Arrow data from Redis (milliseconds), runs the Pandas computation, and writes the result back to Redis.
  3. **Lightweight feedback:** After the computation, only the **first 100 rows of preview data** are returned to the frontend AG Grid for rendering.
  4. **Effect:** Latency drops to **0.5s \- 1s**.

### **Risk 2: Excel Format Loss (The Format Loss)**

* **Red-team challenge:** Pandas' to\_excel resets all cell styles, so the colors and comments that physicians added are lost.
* **V7 solution:** **Template separation strategy (Template Separation)**
  1. **On upload:** Mark the original Excel file as the **"Style Template"** and keep it permanently in OSS.
  2. **During computation:** Process only the raw data (values) in memory/Redis, ignoring styles.
  3. **On export:** Use openpyxl to load the "style template" and **fill** the new in-memory data into the corresponding cells of the template, preserving the background colors and comments of unmodified regions.


### **Risk 3: "Double-Write Conflict" in State Synchronization**

* **Red-team challenge:** The user manually edits row 5 while the AI deletes row 5, leaving the data in an inconsistent state.
* **V7 solution:** **UI mutex (UI Locking)**
  * While the AI is generating code or the backend is computing, AG Grid is forced into **readOnly** mode and the UI shows an "AI is processing..." overlay, so that **concurrent operations are physically blocked**.

### **Risk 4: Security Sandbox Escape**

* **Red-team challenge:** The AI generates import os; os.system('rm \-rf /').
* **V7 solution:** **AST static analysis \+ container isolation**
  * **Pre-check:** Before calling exec(), scan the code tree with the Python ast module and raise an exception as soon as import os or a similar keyword is detected.
  * **Resource limits:** Use the Python resource module to limit each execution's CPU time (10s) and memory (1GB).

## **5\. Database Design (Metadata Layer)**

A new TaskAudit table records the context of every AI operation to support rollback and auditing.

model TaskAudit {
id String @id @default(uuid())
datasetId String
version Int // Version number after the operation

actionType String // "AI\_CODE" or "MANUAL\_EDIT"
prompt String? // The user's natural-language instruction
code String? // Python code generated by the AI

executionTime Int // Execution time (ms)
status String // SUCCESS / FAILED
@@ -132,23 +141,24 @@ model TaskAudit {

## **6\. API Definitions (V7 Optimized)**

* POST /api/session/init: Uploads a file, initializes the Redis session, and returns a sessionId.
* POST /api/session/execute:
  * **Input:** { sessionId, code, version }
  * **Output:** { previewData: ArrowBase64, newVersion: int, logs: string }
  * **Note:** Returns preview data only; no Excel file is generated.
* POST /api/session/save:
  * **Input:** { sessionId }
  * **Output:** { downloadUrl }
  * **Note:** Triggers the openpyxl merge logic, generates the final Excel file, and uploads it to OSS.

## **7\. Suggested Division of Work**

* **Python team (top priority):**
  * Implement the Arrow \<-\> Pandas serialization logic.
  * Wrap the openpyxl style backfill logic.
  * Set up the FastAPI \+ Redis environment.
* **Node.js team:**
  * Handle Dify forwarding and authentication.
* **Frontend team:**
  * Integrate the apache-arrow JS library, parse the binary stream returned by the backend, and display it in AG Grid.
  * Implement the full-screen lock interaction for "AI is processing".

@@ -1,34 +1,37 @@
# **Technical Design Document: Tool C \- Research Data Editor (The Research Editor)**

| Document type | Technical Design Document (TDD) |
| :---- | :---- |
| **Corresponding PRD** | **PRD\_工具C\_科研数据编辑器\_V2.1.md** |
| **Version** | **V2.1** (adds the Pivot algorithm and the Web Worker architecture) |
| **Status** | Final Draft |
| **Core Goal** | Build a high-performance web-based data editor that supports real-time cleaning, variable derivation (including long/wide reshaping), and logic governance on datasets of around 50,000 rows, with a "zero-latency" feel. |

## **1\. Architecture Overview**

To meet the **PRD V2.1** requirements for "instant feedback", "undo/redo", and the complex "long/wide reshaping", Tool C adopts a **"Local-First"** architecture.

Core strategy:

1. **Data residency:** After loading, data lives mainly in the browser's **IndexedDB (Dexie.js)** and in **memory (Zustand)**.
2. **Offloaded computation:** Complex computation (such as Pivot and formula parsing) is offloaded to a **Web Worker** so that the UI main thread is not blocked.

### **1.1 System Architecture Diagram**

graph TD
subgraph Browser\_Layer \[Browser (React SPA)\]
UI\_Shell\[UI shell: flat Toolbar \+ smart Sidebar\]

subgraph Core\_Engine \[Core engine\]
GridComponent\[AG Grid (view layer)\]
StateManager\[Zustand Store (state layer)\]

subgraph Worker\_Thread \[Web Worker thread\]
ComputeEngine\[Compute engine (Math.js / Pivot Alg)\]
StatEngine\[Statistics engine (histogram/frequency)\]
end

HistoryManager\[Immer Patches (undo stack)\]
end

subgraph Local\_Storage \[Persistence layer\]
@@ -36,13 +39,13 @@ graph TD
end
end

subgraph Server\_Layer \[Server (Node.js)\]
API\[Fastify API\]
S3\[Object storage (MinIO/OSS)\]
end

User \--1. Action (e.g. Pivot)--\> UI\_Shell
UI\_Shell \--2. Send message (postMessage)--\> Worker\_Thread
Worker\_Thread \--3. Computation result--\> StateManager
StateManager \--4. Update view--\> GridComponent
StateManager \--5. Async backup--\> Dexie
@@ -51,14 +54,14 @@ graph TD

## **2\. Tech Stack**

| Layer | Component | Rationale |
| :---- | :---- | :---- |
| **Grid core** | **AG Grid Community** | Chosen because it provides virtual scrolling, column drag-and-drop, and high-performance rendering for React free of charge. |
| **Local database** | **Dexie.js (IndexedDB)** | Compared with localStorage (5MB limit), IndexedDB has much larger capacity and is asynchronous, making it suitable for JSON datasets of 50,000+ rows. |
| **State management** | **Zustand \+ Immer** | Zustand is lightweight and efficient; Immer handles immutable data structures, and its produce and patches features are the core of the Undo/Redo implementation. |
| **Compute engine** | **Math.js \+ Web Worker** | Math.js addresses JS floating-point precision issues (0.1+0.2\!=0.3); the Web Worker moves heavy computations such as Pivot off the main thread. |
| **Data processing** | **Lodash** | Basic data operations (grouping, filtering, deep copy). |
| **Visualization** | **Ant Design Charts** | Draws histograms (Histogram) and frequency bar charts (Bar) in the smart sidebar. |

## **3\. 核心模块详细设计**
|
||||
|
||||
@@ -66,68 +69,74 @@ graph TD
|
||||
|
||||
#### **A. Long-to-Wide Conversion (Pivot / Reshaping Algorithm) \- V2.1 Core Challenge**

This is the most expensive computation in the editor. It must run in a Web Worker; otherwise the page will freeze.

* **Input parameters:**
  * data: the original array of row objects, Row\[\]
  * indexCol: primary-key column name (e.g., 'patient\_id') \- determines the "rows"
  * pivotKeyCol: distinguishing column name (e.g., 'visit\_date') \- determines the "column suffix"
  * valueCols: array of value column names (e.g., \['wbc', 'bmi'\]) \- determines the "cell values"
* **Algorithm:**
  1. **Pre-check (Guard):** Compute Unique(pivotKeyCol).length \* valueCols.length. If the number of potential generated columns is \> 1000, throw the error "Too many generated columns; please filter the data first".
  2. **Grouping:** Use \_.groupBy(data, indexCol) to group by primary key.
  3. **Transformation:** Iterate over each group:
     * Create a new row object that keeps the primary key.
     * Iterate over every record in the group and read its pivotKeyCol value (e.g., "2023-01-01").
     * Iterate over valueCols and map each value to ValueCol\_PivotKey (e.g., "wbc\_2023-01-01").
  4. **Schema generation:** Dynamically generate the new ColumnDefs.
* **Output:** { newRows, newColumnDefs }

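
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it groups with a plain `Map` instead of `_.groupBy` so that it has no dependencies, and the names `pivotLongToWide`, `PivotResult`, and `MAX_GENERATED_COLUMNS`, as well as the `{ field }` shape of a column definition, are assumptions made for this example.

```typescript
type Row = Record<string, unknown>;

interface PivotResult {
  newRows: Row[];
  newColumnDefs: { field: string }[];
}

// Hypothetical constant name; the 1000-column limit comes from the guard step.
const MAX_GENERATED_COLUMNS = 1000;

function pivotLongToWide(
  data: Row[],
  indexCol: string,
  pivotKeyCol: string,
  valueCols: string[],
): PivotResult {
  // Guard: estimate the number of generated columns before doing any work.
  const pivotKeys = Array.from(new Set(data.map((row) => String(row[pivotKeyCol]))));
  if (pivotKeys.length * valueCols.length > MAX_GENERATED_COLUMNS) {
    throw new Error("Too many generated columns; please filter the data first");
  }

  // Grouping and transformation: one wide row per distinct index value.
  const rowsByIndex = new Map<unknown, Row>();
  for (const row of data) {
    const indexValue = row[indexCol];
    let wideRow = rowsByIndex.get(indexValue);
    if (wideRow === undefined) {
      wideRow = { [indexCol]: indexValue };
      rowsByIndex.set(indexValue, wideRow);
    }
    const pivotKey = String(row[pivotKeyCol]);
    for (const valueCol of valueCols) {
      // Column name follows the ValueCol_PivotKey convention.
      wideRow[`${valueCol}_${pivotKey}`] = row[valueCol];
    }
  }

  // Schema generation: index column first, then one column per (pivot key, value column).
  const newColumnDefs: { field: string }[] = [{ field: indexCol }];
  for (const pivotKey of pivotKeys) {
    for (const valueCol of valueCols) {
      newColumnDefs.push({ field: `${valueCol}_${pivotKey}` });
    }
  }

  return { newRows: Array.from(rowsByIndex.values()), newColumnDefs };
}
```

If a group contains two records with the same pivot key, this sketch lets the later record overwrite the earlier one; the original text does not say how such duplicates should be handled.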
#### **B. Formula Variables (Formula)**

* Use math.evaluate(formula, row).
* **Security sandbox:** Restrict the variables a formula can access to the data of the current row, to prevent XSS.
* **Error handling:** Handle division by zero (Infinity) and non-numeric computation (NaN); in both cases return null or an error marker.

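
A possible shape for the error handling described above is sketched below. To stay self-contained, the evaluator is passed in as a parameter rather than imported; in the real editor this would presumably be Math.js's `math.evaluate`. The function name `evaluateFormula` and the choice to return `null` (rather than an error marker) are assumptions for illustration.

```typescript
type Row = Record<string, unknown>;

// Any function with the (expression, scope) calling convention.
type Evaluate = (formula: string, scope: Row) => unknown;

function evaluateFormula(evaluate: Evaluate, formula: string, row: Row): number | null {
  // Pass a shallow copy so the formula only sees this row's values
  // and cannot write back into the original row object.
  const scope: Row = { ...row };
  let result: unknown;
  try {
    result = evaluate(formula, scope);
  } catch {
    // Syntax errors, unknown symbols, and similar failures.
    return null;
  }
  // Infinity (division by zero) and NaN are normalized to null.
  return typeof result === "number" && Number.isFinite(result) ? result : null;
}
```

Normalizing everything to `null` keeps the grid cell type simple; a variant that returns a tagged error object would make it easier to show the user why a cell is empty.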
### **3.2 Smart Sidebar Engine (Insight Engine)**

* **Trigger:** Listen for the AG Grid onColumnHeaderClicked event.
* **Debounce:** Delay computation by 200ms to prevent UI flicker when the user switches columns quickly.
* **Statistics:**
  * **Numeric columns:** Compute Min, Max, Mean, and SD, and use the Freedman-Diaconis rule to compute the histogram bins.
  * **Text columns:** Compute the 10 most frequent terms.

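
For the numeric-column case, the Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3). The sketch below shows one way to turn that into a bin count. The quantile method (linear interpolation) and the fallback to a single bin for degenerate data are assumptions, as are the function names.

```typescript
// Quantile by linear interpolation on an already sorted array (assumed method).
function quantile(sorted: number[], q: number): number {
  const position = (sorted.length - 1) * q;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}

function freedmanDiaconisBinCount(values: number[]): number {
  const sorted = values.filter((v) => Number.isFinite(v)).sort((a, b) => a - b);
  const n = sorted.length;
  if (n < 2) return 1;

  const iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
  const range = sorted[n - 1] - sorted[0];
  // Degenerate data (all values equal, or zero IQR): fall back to one bin.
  if (iqr === 0 || range === 0) return 1;

  // Freedman-Diaconis bin width: 2 * IQR / cube root of n.
  const binWidth = (2 * iqr) / Math.cbrt(n);
  return Math.max(1, Math.ceil(range / binWidth));
}
```

For heavily skewed columns the rule can produce a very large bin count, so a production version would likely also cap the result.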
### **3.3 History and Undo (History Manager)**

* **Undo/Redo strategy:**
  * **Ordinary operations (edit/replace):** Record patches (Immer).
  * **Structural operations (Pivot / split / generate new variable):** Because the table structure changes completely, recording patches is too expensive and hard to roll back. The strategy is instead: **before running such an operation, force-save a full snapshot (Checkpoint)**. Undo simply reloads the snapshot.

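
The two-tier strategy can be illustrated with a small undo stack that mixes both entry kinds. This is only a sketch under stated assumptions: a plain `undo` callback stands in for Immer's inverse patches, snapshots are copied with a JSON round trip (so the table is assumed to be JSON-serializable), redo is omitted, and all names are hypothetical.

```typescript
type Table = { columns: string[]; rows: Record<string, unknown>[] };

type HistoryEntry =
  | { kind: "patch"; undo: (table: Table) => Table } // ordinary edit
  | { kind: "checkpoint"; snapshot: Table }; // structural operation

class HistoryManager {
  private entries: HistoryEntry[] = [];

  // Ordinary operations record how to reverse themselves.
  recordPatch(undo: (table: Table) => Table): void {
    this.entries.push({ kind: "patch", undo });
  }

  // Call this BEFORE a structural operation such as Pivot.
  checkpoint(current: Table): void {
    const snapshot = JSON.parse(JSON.stringify(current)) as Table;
    this.entries.push({ kind: "checkpoint", snapshot });
  }

  // Returns the table state after undoing the most recent entry.
  undo(current: Table): Table {
    const entry = this.entries.pop();
    if (entry === undefined) return current;
    return entry.kind === "patch" ? entry.undo(current) : entry.snapshot;
  }
}
```

Because a checkpoint ignores the current state entirely, undoing a Pivot costs one snapshot reload regardless of how many columns the operation created.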
## **4\. Data Flow and Storage Design**

### **4.1 Browser-Side Storage (Dexie Schema)**

Temporarily stores the data the user is editing, providing "automatic snapshots" and "crash recovery".

const db \= new Dexie('ResearchEditorDB');

db.version(2).stores({

// Project metadata

projects: '++id, name, lastModified, rowCount',

// Data chunks: split the 50,000 rows into multiple chunks so that no single read or write is large enough to crash the browser

dataChunks: '\[projectId+chunkIndex\], projectId',

// Operation history (used to restore the working state)

history: 'projectId, stack',

// Full snapshots (used to roll back large operations such as Pivot)

checkpoints: '++id, projectId, createdAt'

});

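
As an illustration of the chunking idea behind the `dataChunks` store, the sketch below splits a row array into records keyed by `projectId` and `chunkIndex`. The chunk size of 5000 rows, the `rows` property, and the function name are assumptions; the original schema does not specify them.

```typescript
interface DataChunk<T> {
  projectId: number;
  chunkIndex: number;
  rows: T[]; // hypothetical payload property
}

// chunkSize default is an assumed value, not taken from the design.
function splitIntoChunks<T>(projectId: number, rows: T[], chunkSize = 5000): DataChunk<T>[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: DataChunk<T>[] = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    chunks.push({
      projectId,
      chunkIndex: chunks.length,
      rows: rows.slice(start, start + chunkSize),
    });
  }
  return chunks;
}
```

The resulting array matches the compound key `[projectId+chunkIndex]`, so it could be written in one call with Dexie's `bulkPut` on the `dataChunks` table.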
### **4.2 Backend Storage (PostgreSQL \+ OSS)**

The backend only stores "saved" snapshots and takes no part in real-time editing.

model DatasetSnapshot {

id String @id @default(uuid())

taskId String // associated task

version Int // version number

// Stored as a large JSON blob or as a path to an OSS file (OSS recommended)

// Contents: rows\[\], columnDefs\[\], metadata

ossKey String

@@ -136,31 +145,32 @@ model DatasetSnapshot {
## **5\. API Definitions**

* POST /api/editor/init: Initialize an editor session and load the original file from OSS (if the data was handed over from Tool A/B).
* POST /api/editor/save: Save the current snapshot.
* POST /api/editor/export: Ask the backend to generate an Excel/SPSS file.
  * *Payload:* { rows: \[...\], format: 'spss' }
  * *Note:* For small datasets, generate the file in the frontend with SheetJS; for large datasets (\>5MB), send the request to the backend.

## **6\. Performance Limits and Boundaries (Performance Guardrails)**

| Data size | Strategy |
| :---- | :---- |
| **\< 50,000 rows** | **Full-load mode.** All data lives in memory/IndexedDB, and operations are very fast. |
| **\> 50,000 rows** | **Downsampling mode.** The frontend loads only the first 50,000 rows for preview and rule authoring. On export, the cleaning rules (Recipe) are sent to the backend, and a backend Worker batch-processes the full dataset. |

## **7\. Development Plan (Milestones)**

1. **Week 1: Core grid and storage**
   * Set up the React \+ AG Grid environment.
   * Implement SheetJS import and Dexie.js persistence.
2. **Week 2: Flat toolbar and Web Worker**
   * Build the Web Worker communication architecture.
   * Implement Formula computation and the Math.js integration.
   * Implement the Undo/Redo stack (Immer).
3. **Week 3: Complex computation (Pivot)**
   * **Key focus:** Implement the Pivot algorithm in a Web Worker.
   * Implement the Pivot configuration dialog.
4. **Week 4: Smart sidebar and export**
   * Build the histogram / frequency chart components (AntD Charts).
   * Implement binning, mapping, and missing-value imputation.
   * Integrate with the backend save endpoint.

@@ -1,10 +1,11 @@
# Database Design Document - Tool B (Medical Record Structuring Bot)

> **Module**: DC Data Cleaning and Organization - Tool B
> **Version**: V2.0 (MVP)
> **Schema**: `dc_schema`
> **Last updated**: 2025-12-03
> **Status**: ✅ MVP complete (verified usable; real-data tests passed)

---

## 📋 Table of Contents

@@ -12,37 +13,49 @@
- [1. Overview](#1-overview)
- [2. Schema Design Principles](#2-schema-design-principles)
- [3. Table Design](#3-table-design)
- [4. Index Design](#4-index-design)
- [5. Foreign Key Constraints](#5-foreign-key-constraints)
- [6. Data Lifecycle](#6-data-lifecycle)

---

## 1. Overview

### 1.1 Design Goals

The Tool B database design is intended to support:

- ✅ Text structuring with cross-validation between two large language models
- ✅ Large-scale asynchronous task processing (1000+ records)
- ✅ Conflict detection and manual adjudication
- ✅ Preset template management and reuse
- ✅ Health-check cache optimization

### 1.2 Table Relationship Overview

```
dc_schema                                            ✅ created and running
├── dc_health_checks        [health-check cache]     ✅ running normally
├── dc_templates            [preset templates]       ✅ 3 preset templates available
├── dc_extraction_tasks     [extraction tasks]       ✅ multiple tasks completed
│   └── dc_extraction_items [extraction records] (1:N) ✅ dual-model results saved correctly
```

**✅ MVP completion status (2025-12-03)**:

- All tables work correctly and have processed multiple real tasks
- 3 preset templates: lung cancer pathology report, diabetes admission record, hypertension outpatient record
- Real-data test: 9 pathology records extracted successfully, 100% success rate
- Dual-model results: the resultA, resultB, and finalResult fields are saved correctly
- Token statistics: the totalTokens field accumulates correctly
- Conflict detection: the conflictFields array works correctly
- Verification script: `backend/scripts/check-task-progress.mjs`

### 1.3 Technology Stack

- **Database**: PostgreSQL 15
- **ORM**: Prisma 6
- **Schema isolation**: `dc_schema` (dedicated namespace)
- **JSON fields**: JSONB type (high-performance queries)

---

## 2. Schema Design Principles

@@ -55,9 +68,10 @@ CREATE TABLE "dc_schema"."dc_health_checks" (...);
|
||||
CREATE TABLE "dc_schema"."dc_extraction_tasks" (...);
|
||||
```
|
||||
|

**Advantages**:

- ✅ Fully isolated from other modules (platform_schema, asl_schema, etc.)
- ✅ Data safety: avoids accidental operations on other modules' data
- ✅ Easier modular management and migration

### 2.2 Naming Conventions

@@ -65,23 +79,24 @@ CREATE TABLE "dc_schema"."dc_extraction_tasks" (...);
|------|------|------|
| **Table prefix** | `dc_` | `dc_extraction_tasks` |
| **Column names** | snake_case | `user_id`, `source_file_key` |
| **Timestamps** | Uniform suffix | `created_at`, `started_at` |
| **Foreign keys** | entity_id | `task_id`, `user_id` |

### 2.3 JSONB Field Use Cases

| Field | Type | Reason |
|------|------|------|
| `target_fields` | JSONB | Flexible field configuration |
| `result_a/result_b` | JSONB | Dynamic extraction results |
| `final_result` | JSONB | Final adjudicated result |

---

## 3. Table Design

### 3.1 dc_health_checks (Health-Check Cache Table)

**Purpose**: Cache health-check results to avoid recomputation

```sql
CREATE TABLE "dc_schema"."dc_health_checks" (
@@ -96,39 +111,46 @@ CREATE TABLE "dc_schema"."dc_health_checks" (
  "total_rows" INTEGER NOT NULL,
  "estimated_tokens" INTEGER NOT NULL,

  -- Check result
  "status" TEXT NOT NULL, -- 'good' | 'bad'
  "message" TEXT NOT NULL,

  "created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

**Field descriptions**:

| Field | Type | Description | Example |
|------|------|------|------|
| `id` | TEXT | UUID primary key | `uuid()` |
| `user_id` | TEXT | User ID | `user-123` |
| `file_name` | TEXT | File name | `患者数据.xlsx` |
| `column_name` | TEXT | Name of the checked column | `病历文本` |
| `empty_rate` | DOUBLE | Empty-value rate (0-1) | 0.15 (15%) |
| `avg_length` | DOUBLE | Average text length | 256.8 |
| `total_rows` | INT | Total row count | 500 |
| `estimated_tokens` | INT | Estimated token count | 150000 |
| `status` | TEXT | Health status | `good` / `bad` |
| `message` | TEXT | Message shown to the user | `健康度良好` |

**Index**:

```sql
CREATE INDEX "dc_health_checks_user_id_file_name_idx"
ON "dc_schema"."dc_health_checks"("user_id", "file_name");
```

**Business rules**:

- Empty-value rate > 80% → `status = 'bad'`
- Average length < 10 → `status = 'bad'`
- Cache validity: 24 hours (enforced in the application layer)

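
The business rules above are simple enough to express directly in the application layer. The following sketch is one possible reading, assuming camelCase property names (`emptyRate`, `avgLength`) on the application side and that a cache entry's age is measured from `created_at`; the function names are hypothetical.

```typescript
interface HealthCheckInput {
  emptyRate: number; // 0-1, corresponds to empty_rate
  avgLength: number; // corresponds to avg_length
}

type HealthStatus = "good" | "bad";

function classifyHealth({ emptyRate, avgLength }: HealthCheckInput): HealthStatus {
  if (emptyRate > 0.8) return "bad"; // more than 80% empty values
  if (avgLength < 10) return "bad"; // text too short to extract from
  return "good";
}

// 24-hour cache validity, enforced by the application rather than the database.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

function isCacheFresh(createdAt: Date, now: Date = new Date()): boolean {
  return now.getTime() - createdAt.getTime() < CACHE_TTL_MS;
}
```

A lookup would first query by `(user_id, file_name)`, which the index above supports, and recompute the check only when `isCacheFresh` returns false.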

---

### 3.2 dc_templates (Preset Template Table)

**Purpose**: Store preset extraction templates per disease type

```sql
CREATE TABLE "dc_schema"."dc_templates" (
  "id" TEXT NOT NULL PRIMARY KEY,
@@ -146,16 +168,18 @@ CREATE TABLE "dc_schema"."dc_templates" (
);
```

**Field descriptions**:

| Field | Type | Description | Example |
|------|------|------|------|
| `disease_type` | TEXT | Disease type | `lung_cancer` |
| `report_type` | TEXT | Report type | `pathology` |
| `display_name` | TEXT | Display name | `肺癌病理报告` |
| `fields` | JSONB | Extraction field configuration | See the example below |
| `prompt_template` | TEXT | Prompt template | `请从以下病理报告中提取...` |

**Structure of the `fields` column**:

```json
[
  {
    "name": "病理类型",
@@ -164,20 +188,23 @@ CREATE TABLE "dc_schema"."dc_templates" (
  },
  {
    "name": "分化程度",
    "desc": "高/中/低分化",
    "width": "w-32"
  }
]
```

**Unique constraint**:

```sql
UNIQUE ("disease_type", "report_type")
```

Each disease type + report type combination can have only one template.

---

### 3.3 dc_extraction_tasks (Extraction Task Table)

**Purpose**: Manage batch extraction tasks and track progress and cost

```sql
CREATE TABLE "dc_schema"."dc_extraction_tasks" (
@@ -192,10 +219,12 @@ CREATE TABLE "dc_schema"."dc_extraction_tasks" (
  "report_type" TEXT NOT NULL,
  "target_fields" JSONB NOT NULL,

  -- Dual-model configuration
  "model_a" TEXT NOT NULL DEFAULT 'deepseek-v3',
  "model_b" TEXT NOT NULL DEFAULT 'qwen3-72b',

  -- Task status
  "status" TEXT NOT NULL DEFAULT 'pending',
  "total_count" INTEGER NOT NULL DEFAULT 0,
  "processed_count" INTEGER NOT NULL DEFAULT 0,
  "clean_count" INTEGER NOT NULL DEFAULT 0,
@@ -209,41 +238,47 @@ CREATE TABLE "dc_schema"."dc_extraction_tasks" (
  -- Error information
  "error" TEXT,

  -- Timestamps
  "created_at" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "started_at" TIMESTAMP(3),
  "completed_at" TIMESTAMP(3)
);
```

**Field descriptions**:

| Field | Type | Description | Example |
|------|------|------|------|
| `source_file_key` | TEXT | Storage path | `uploads/user123/data.xlsx` |
| `text_column` | TEXT | Text column name | `病历文本` |
| `target_fields` | JSONB | Fields to extract | `[{name, desc}]` |
| `status` | TEXT | Task status | `pending/processing/completed/failed` |
| `total_count` | INT | Total record count | 500 |
| `processed_count` | INT | Processed count | 250 |
| `clean_count` | INT | Count of records where both models agree | 200 |
| `conflict_count` | INT | Conflict count | 45 |
| `failed_count` | INT | Failure count | 5 |
| `total_tokens` | INT | Total token count | 150000 |
| `total_cost` | DOUBLE | Total cost ($) | 0.27 |

**Status flow**:

```
pending → processing → completed
         → failed
```
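
One way to enforce this status flow in application code is a small transition table. The sketch below assumes that `failed` is reached from `processing` and that `completed` and `failed` are terminal; the diagram does not state either point explicitly, and the names used here are hypothetical.

```typescript
type TaskStatus = "pending" | "processing" | "completed" | "failed";

// Assumed transitions: failure happens while processing; end states are terminal.
const ALLOWED_TRANSITIONS: Record<TaskStatus, readonly TaskStatus[]> = {
  pending: ["processing"],
  processing: ["completed", "failed"],
  completed: [],
  failed: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Checking `canTransition` before every status update keeps a retried or duplicated worker message from moving a finished task back to `processing`.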

**Index**:

```sql
CREATE INDEX "dc_extraction_tasks_user_id_status_idx"
ON "dc_schema"."dc_extraction_tasks"("user_id", "status");
```

---

### 3.4 dc_extraction_items (Extraction Record Table)

**Purpose**: Store the dual-model extraction results and conflict status of each record

```sql
CREATE TABLE "dc_schema"."dc_extraction_items" (
  "id" TEXT NOT NULL PRIMARY KEY,
@@ -253,13 +288,16 @@ CREATE TABLE "dc_schema"."dc_extraction_items" (
  "row_index" INTEGER NOT NULL,
  "original_text" TEXT NOT NULL,

  -- Dual-model results
  "result_a" JSONB,
  "result_b" JSONB,

  -- Conflict detection
  "status" TEXT NOT NULL DEFAULT 'pending',
  "conflict_fields" TEXT[] DEFAULT ARRAY[]::TEXT[],

  -- Final result
  "final_result" JSONB,

  -- Token statistics
  "tokens_a" INTEGER NOT NULL DEFAULT 0,
@@ -278,63 +316,78 @@ CREATE TABLE "dc_schema"."dc_extraction_items" (
);
```

**Field descriptions**:

| Field | Type | Description | Example |
|------|------|------|------|
| `row_index` | INT | Excel row number | 5 |
| `original_text` | TEXT | Original medical record text | `患者,男,45岁...` |
| `result_a` | JSONB | DeepSeek result | `{"肿瘤大小": "3cm"}` |
| `result_b` | JSONB | Qwen result | `{"肿瘤大小": "3.0cm"}` |
| `status` | TEXT | Processing status | `clean/conflict/resolved/failed` |
| `conflict_fields` | TEXT[] | List of conflicting fields | `["肿瘤大小"]` |
| `final_result` | JSONB | Final adjudicated result | `{"肿瘤大小": "3cm"}` |

**Example structure of result_a/result_b**:

```json
{
  "病理类型": "浸润性腺癌",
  "分化程度": "中分化",
  "肿瘤大小": "3cm",
  "淋巴结转移": "无"
}
```

**Status values**:

- `pending`: waiting to be processed
- `clean`: both models agree
- `conflict`: the models disagree; manual adjudication required
- `resolved`: conflict resolved
- `failed`: extraction failed

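
The relationship between `result_a`, `result_b`, `status`, and `conflict_fields` can be sketched as a field-by-field comparison. The version below uses strict string equality, which is an assumption, though it is consistent with the field table's example where `3cm` and `3.0cm` count as a conflict. The function and type names are hypothetical.

```typescript
type Extraction = Record<string, string | null>;

interface ConflictCheck {
  status: "clean" | "conflict";
  conflictFields: string[];
}

function detectConflicts(
  resultA: Extraction,
  resultB: Extraction,
  targetFields: string[],
): ConflictCheck {
  // A field conflicts when the two models returned different values.
  // Missing keys are treated the same as null.
  const conflictFields = targetFields.filter(
    (field) => (resultA[field] ?? null) !== (resultB[field] ?? null),
  );
  return {
    status: conflictFields.length === 0 ? "clean" : "conflict",
    conflictFields,
  };
}
```

Strict comparison is deliberately conservative: it sends formatting differences to manual adjudication instead of guessing that two values mean the same thing.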

**Index**:

```sql
CREATE INDEX "dc_extraction_items_task_id_status_idx"
ON "dc_schema"."dc_extraction_items"("task_id", "status");
```

**Foreign key constraint**:

- `ON DELETE CASCADE`: deleting a task automatically deletes all of its records

---

## 4. Index Design

### 4.1 Index List

| Table | Indexed columns | Type | Purpose |
|------|---------|------|------|
| `dc_health_checks` | `(user_id, file_name)` | Composite | Look up a user's past checks |
| `dc_templates` | `(disease_type, report_type)` | Unique | Prevent duplicate templates |
| `dc_extraction_tasks` | `(user_id, status)` | Composite | List a user's tasks |
| `dc_extraction_items` | `(task_id, status)` | Composite | List a task's records |

### 4.2 Performance Considerations

**Query optimization**:

```sql
-- Efficient query: uses the index
SELECT * FROM dc_extraction_tasks
WHERE user_id = 'user123' AND status = 'processing';

-- Efficient query: uses the index
SELECT * FROM dc_extraction_items
WHERE task_id = 'task456' AND status = 'conflict';
```

**Avoiding full table scans**:

- ✅ Always include indexed columns in the WHERE clause
- ✅ Filtering on the `status` column can significantly reduce the number of rows scanned


---

## 5. Foreign Key Constraints

### 5.1 Cascading Deletes

```sql
@@ -345,34 +398,46 @@ REFERENCES "dc_schema"."dc_extraction_tasks"("id")
ON DELETE CASCADE;
```

**Behavior**:

- Deleting a task → automatically deletes all associated extraction records
- Guarantees data consistency

### 5.2 Tables Without Foreign Keys

- `dc_health_checks`: standalone table, no foreign keys
- `dc_templates`: standalone table, no foreign keys
- `dc_extraction_tasks`: no foreign keys (user_id is only an identifier; the relationship is not enforced)

**Reasons**:

- ✅ Fewer cross-schema dependencies
- ✅ Greater module independence
- ✅ Simpler migration and rollback

---

## 6. Data Lifecycle

### 6.1 Data Retention Policy

| Table | Retention | Cleanup strategy |
|------|---------|---------|
| `dc_health_checks` | 7 days | Periodically purge old records |
| `dc_templates` | Permanent | Managed manually |
| `dc_extraction_tasks` | 90 days | Delete after archiving |
| `dc_extraction_items` | 90 days | Deleted along with the task |

### 6.2 Archiving Strategy

**Archiving large tasks** (> 1000 records):

1. After the task completes, export the results to CSV/Excel
2. Upload to Storage (kept permanently)
3. Delete the database records (to free space)

### 6.3 Cleanup Script (Example)

```typescript
// Purge health-check records older than 7 days
await prisma.dCHealthCheck.deleteMany({
  where: {
    createdAt: {
      lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
@@ -391,7 +456,8 @@ const oldTasks = await prisma.dCExtractionTask.findMany({
  include: { items: true }
});

// Export, then delete
for (const task of oldTasks) {
  await exportTaskToStorage(task);
  await prisma.dCExtractionTask.delete({ where: { id: task.id } });
}
@@ -399,39 +465,52 @@ const oldTasks = await prisma.dCExtractionTask.findMany({

---

## 7. Data Security

### 7.1 PII Protection

**Sensitive fields**:

- `original_text`: may contain patient names and national ID numbers
- `result_a/result_b/final_result`: may contain structured sensitive information

**Safeguards**:

- ✅ Automatic de-identification before sending to the LLM (PIIMaskUtil)
- ✅ Encrypted database connections (PostgreSQL SSL)
- ✅ Periodic purging of historical data

### 7.2 User Isolation

**Mechanism**:

- Every table contains a `user_id` column
- The application layer always filters: `WHERE user_id = currentUserId`
- Never query across users

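
Application-layer filtering is easiest to keep consistent when every query passes through one helper. The sketch below shows the idea for a Prisma-style `where` object. The helper name `scopeToUser` and the camelCase field name `userId` are assumptions; the actual Prisma field mapping is not shown in this document.

```typescript
type Where = Record<string, unknown>;

function scopeToUser(where: Where, currentUserId: string): Where {
  if (currentUserId.trim() === "") {
    throw new Error("currentUserId is required");
  }
  // Spread the caller's filter first so a caller-supplied userId
  // can never override the authenticated user's ID.
  return { ...where, userId: currentUserId };
}
```

A query would then be written as `findMany({ where: scopeToUser({ status: "processing" }, currentUserId) })`, so that forgetting the user filter requires bypassing the helper on purpose.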

---

## 8. Appendix

### 8.1 Full Schema DDL

The complete schema creation script is located at:

```
backend/prisma/migrations/20251127_add_dc_tool_b_tables/migration.sql
```

### 8.2 Prisma Model Definitions

The complete Prisma model definitions are located at:

```
backend/prisma/schema.prisma
```

Search for `dc_schema` to see all models.

### 8.3 Change History

| Version | Date | Changes |
|------|------|---------|
| V1.0 | 2025-11-27 | Initial version, 4 tables |

---

**End of document** ✅