
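# API Design Document - Tool B (Medical Record Structuring Bot)

> **Module**: DC Data Cleaning - Tool B
> **Version**: V2.0 (MVP)
> **Base URL**: `/api/v1/dc/tool-b`
> **Last updated**: 2025-12-03
> **Status**: ✅ MVP complete (all 8 API endpoints available and verified)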

---

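## 📋 Table of Contents

- [1. API Overview](#1-api-overview)
- [2. Authentication and Authorization](#2-authentication-and-authorization)
- [3. API Endpoint Details](#3-api-endpoint-details)
- [4. Data Models](#4-data-models)
- [5. Error Handling](#5-error-handling)
- [6. Performance Targets](#6-performance-targets)
- [7. Versioning](#7-versioning)
- [8. Testing](#8-testing)
- [9. Appendix](#9-appendix)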

---

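## 1. API Overview

### 1.1 Endpoint List

| # | Method | Path | Description | Backend | Frontend | Tests |
|---|--------|------|-------------|---------|----------|-------|
| 0 | POST | `/upload` | Upload a file | ✅ Done | ✅ Integrated | ✅ Passed |
| 1 | POST | `/health-check` | Data health check | ✅ Done | ✅ Integrated | ✅ Passed |
| 2 | GET | `/templates` | List templates | ✅ Done | ✅ Integrated | ✅ Passed |
| 3 | POST | `/tasks` | Create an extraction task | ✅ Done | ✅ Integrated | ✅ Passed |
| 4 | GET | `/tasks/:taskId/progress` | Query task progress | ✅ Done | ✅ Integrated | ✅ Passed |
| 5 | GET | `/tasks/:taskId/items` | Get validation grid data | ✅ Done | ✅ Integrated | ✅ Passed |
| 6 | POST | `/items/:itemId/resolve` | Resolve a conflict | ✅ Done | ✅ Integrated | ✅ Passed |
| 7 | GET | `/tasks/:taskId/export` | Export results to Excel | ✅ Done | ✅ Integrated | ✅ Passed |

**✅ MVP completion status (2025-12-03)**:
- Backend code: ~2,200 lines (Service, Controller, and Routes)
- Frontend code: ~1,400 lines (full 5-step workflow)
- Database: 4 tables created, 3 preset templates ready
- API integration: all 8 endpoints integrated and tested
- LLM calls: dual-model setup (DeepSeek-V3 + Qwen-Max) verified
- Real-data test: 9 pathology records extracted successfully, ~10k tokens consumed
- **Known issues**: 4 technical-debt items (see `07-技术债务/Tool-B技术债务清单.md`)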

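### 1.2 General Conventions

**Request headers**:
```http
Content-Type: application/json
Authorization: Bearer {token}  # planned, not yet implemented
```

**Response format**:
```json
{
  "data": {...},      // returned on success
  "error": "...",     // returned on failure
  "code": 200
}
```

**HTTP status codes**:
- `200`: Success
- `400`: Invalid request parameters
- `401`: Not authenticated
- `403`: Not authorized
- `404`: Resource not found
- `500`: Internal server error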

---

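## 2. Authentication and Authorization

### 2.1 Authentication

**Current phase (MVP)**:
- ❌ Authentication is not implemented yet
- A temporary `userId` (taken from the request context) identifies the caller

**Planned (V1.0)**:
```http
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
```

### 2.2 Permission Model

| Operation | Required permission | Notes |
|-----------|---------------------|-------|
| Health check | user | All users |
| View templates | user | All users |
| Create task | user | All users |
| Query task | owner | Task creator only |
| Resolve conflict | owner | Task creator only |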
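
The permission table above can be expressed as a small lookup plus an ownership check. This is only a sketch: `Operation`, `REQUIRED_PERMISSION`, and `canPerform` are hypothetical names, and the MVP does not enforce any of this yet.

```typescript
// Hypothetical names for illustration; not the project's actual code.
type Permission = 'user' | 'owner';
type Operation =
  | 'healthCheck'
  | 'listTemplates'
  | 'createTask'
  | 'queryTask'
  | 'resolveConflict';

// Mirrors the permission table: which level each operation requires.
const REQUIRED_PERMISSION: Record<Operation, Permission> = {
  healthCheck: 'user',
  listTemplates: 'user',
  createTask: 'user',
  queryTask: 'owner',
  resolveConflict: 'owner',
};

// Returns true when `userId` may perform `operation`.
// `taskOwnerId` is only consulted for owner-level operations.
function canPerform(operation: Operation, userId: string, taskOwnerId?: string): boolean {
  if (REQUIRED_PERMISSION[operation] === 'user') {
    return true; // open to every user
  }
  return taskOwnerId !== undefined && taskOwnerId === userId;
}
```

A failed owner check would map naturally to the `403` status listed in section 1.2.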

---

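## 3. API Endpoint Details

### 3.1 Health Check

**Endpoint**: `POST /api/v1/dc/tool-b/health-check`

**Purpose**: Checks the data quality of one Excel column and blocks low-quality data before extraction

**Request body**: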
```json
{
  "fileKey": "uploads/user123/data.xlsx",
  "columnName": "病历文本"
}
```

**Request parameters**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `fileKey` | string | ✅ | File path in Storage |
| `columnName` | string | ✅ | Name of the column to check |

**Response** (success - 200):
```json
{
  "status": "good",
  "emptyRate": 0.12,
  "avgLength": 256.8,
  "totalRows": 500,
  "estimatedTokens": 150000,
  "message": "健康度良好，预计消耗约 150.0k Token（双模型约 300.0k Token）"
}
```

**Response** (check failed - still HTTP 200, but `status` is `bad`):
```json
{
  "status": "bad",
  "emptyRate": 0.85,
  "avgLength": 256.8,
  "totalRows": 500,
  "estimatedTokens": 0,
  "message": "空值率过高（85.0%），该列不适合提取"
}
```

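**Response fields**:

| Field | Type | Description |
|-------|------|-------------|
| `status` | string | `good` or `bad` |
| `emptyRate` | number | Empty-value rate (0-1) |
| `avgLength` | number | Average text length |
| `totalRows` | number | Total number of rows |
| `estimatedTokens` | number | Estimated token count |
| `message` | string | Human-readable message |

**Business rules**:
- Empty rate > 80% → `status = 'bad'`
- Average length < 10 → `status = 'bad'`
- Only the first 100 rows are checked (performance optimization)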
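
The three rules above fit in one small function. The sketch below is illustrative only: `summarizeColumn` is a hypothetical helper, and it assumes that whitespace-only cells count as empty and that the average length is taken over non-empty cells, neither of which this document specifies.

```typescript
// Thresholds taken from the business rules above.
const SAMPLE_SIZE = 100;    // only the first 100 rows are inspected
const MAX_EMPTY_RATE = 0.8; // empty rate > 80% => bad
const MIN_AVG_LENGTH = 10;  // average length < 10 => bad

interface HealthSummary {
  status: 'good' | 'bad';
  emptyRate: number;
  avgLength: number;
}

// Hypothetical helper: classifies one column from its raw cell values.
function summarizeColumn(values: Array<string | null | undefined>): HealthSummary {
  const sample = values.slice(0, SAMPLE_SIZE);
  // Assumption: whitespace-only cells are treated as empty.
  const nonEmpty = sample.map((v) => (v ?? '').trim()).filter((v) => v.length > 0);

  const emptyRate = sample.length === 0 ? 1 : 1 - nonEmpty.length / sample.length;
  // Assumption: the average is computed over non-empty cells only.
  const totalLength = nonEmpty.reduce((sum, v) => sum + v.length, 0);
  const avgLength = nonEmpty.length === 0 ? 0 : totalLength / nonEmpty.length;

  const status = emptyRate > MAX_EMPTY_RATE || avgLength < MIN_AVG_LENGTH ? 'bad' : 'good';
  return { status, emptyRate, avgLength };
}
```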

**Error response**:
```json
{
  "error": "列'病历文本'不存在",
  "code": 400
}
```

---

### 3.2 List Templates

**Endpoint**: `GET /api/v1/dc/tool-b/templates`

**Purpose**: Returns all preset extraction templates

**Request**: No parameters

**Response** (200):
```json
{
  "templates": [
    {
      "diseaseType": "lung_cancer",
      "reportType": "pathology",
      "displayName": "肺癌病理报告",
      "fields": [
        {
          "name": "病理类型",
          "desc": "如：浸润性腺癌、鳞状细胞癌",
          "width": "w-40"
        },
        {
          "name": "分化程度",
          "desc": "高/中/低分化",
          "width": "w-32"
        }
      ]
    },
    {
      "diseaseType": "diabetes",
      "reportType": "admission",
      "displayName": "糖尿病入院记录",
      "fields": [...]
    }
  ]
}
```

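**Response fields**:

| Field | Type | Description |
|-------|------|-------------|
| `templates` | array | List of templates |
| `templates[].diseaseType` | string | Disease type |
| `templates[].reportType` | string | Report type |
| `templates[].displayName` | string | Display name |
| `templates[].fields` | array | Extraction field configuration |

**Caching strategy**:
- Client cache: 1 hour
- Server cache: permanent (until the service restarts)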

---

### 3.3 Create Extraction Task

**Endpoint**: `POST /api/v1/dc/tool-b/tasks`

**Purpose**: Creates a batch extraction task and pushes it to the asynchronous queue

**Request body**:
```json
{
  "projectName": "肺癌病理数据提取-2025Q1",
  "fileKey": "uploads/user123/lung_cancer_pathology.xlsx",
  "textColumn": "病历文本",
  "diseaseType": "lung_cancer",
  "reportType": "pathology",
  "targetFields": [
    {
      "name": "病理类型",
      "desc": "如：浸润性腺癌、鳞状细胞癌"
    },
    {
      "name": "分化程度",
      "desc": "高/中/低分化"
    }
  ]
}
```

**Request parameters**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `projectName` | string | ✅ | Task name |
| `fileKey` | string | ✅ | File path in Storage |
| `textColumn` | string | ✅ | Name of the text column |
| `diseaseType` | string | ✅ | Disease type |
| `reportType` | string | ✅ | Report type |
| `targetFields` | array | ✅ | Extraction field configuration |

**Response** (200):
```json
{
  "taskId": "550e8400-e29b-41d4-a716-446655440000"
}
```

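**Flow**:
1. Verify that the file exists
2. Parse the Excel file and count the total rows
3. Create the task record (`status=pending`)
4. Push the task to the BullMQ queue
5. Return `taskId` immediately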
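
The five steps above can be sketched with injected dependencies, so the ordering is visible without tying the example to a real storage, database, or queue client. Everything in `TaskDeps` is a hypothetical stand-in; in particular `enqueue` only represents the BullMQ push and is not BullMQ's API.

```typescript
// Hypothetical dependency interface: stand-ins for storage, Excel parsing,
// the database, and the queue. None of these are the project's real APIs.
interface TaskDeps {
  fileExists(fileKey: string): Promise<boolean>;
  countRows(fileKey: string): Promise<number>;
  insertTask(task: { id: string; status: 'pending'; totalCount: number; fileKey: string }): Promise<void>;
  enqueue(taskId: string): Promise<void>;
  newId(): string;
}

async function createTask(fileKey: string, deps: TaskDeps): Promise<{ taskId: string }> {
  // 1. Verify that the file exists
  if (!(await deps.fileExists(fileKey))) {
    throw new Error(`文件不存在: ${fileKey}`);
  }
  // 2. Parse the Excel file and count the total rows
  const totalCount = await deps.countRows(fileKey);
  // 3. Create the task record with status=pending
  const id = deps.newId();
  await deps.insertTask({ id, status: 'pending', totalCount, fileKey });
  // 4. Push to the queue; the extraction itself runs in the background
  await deps.enqueue(id);
  // 5. Return the task ID without waiting for processing
  return { taskId: id };
}
```

Because the request returns before any row is processed, the client then tracks the task through the progress endpoint.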

**Error response**:
```json
{
  "error": "文件不存在: uploads/user123/lung_cancer_pathology.xlsx",
  "code": 404
}
```

---

### 3.4 Query Task Progress

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/progress`

**Purpose**: Returns the current processing progress of a task

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400-e29b-41d4-a716-446655440000/progress
```

**Response** (200):
```json
{
  "taskId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "progress": 50,
  "totalCount": 500,
  "processedCount": 250,
  "cleanCount": 200,
  "conflictCount": 45,
  "failedCount": 5,
  "totalTokens": 75000,
  "totalCost": 0.135,
  "startedAt": "2025-11-27T10:00:00.000Z",
  "completedAt": null
}
```

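**Response fields**:

| Field | Type | Description |
|-------|------|-------------|
| `status` | string | `pending/processing/completed/failed` |
| `progress` | number | Progress percentage (0-100) |
| `totalCount` | number | Total number of records |
| `processedCount` | number | Records processed so far |
| `cleanCount` | number | Records where both models agree |
| `conflictCount` | number | Records with conflicts |
| `failedCount` | number | Records that failed |
| `totalTokens` | number | Cumulative token count |
| `totalCost` | number | Cumulative cost (USD) |

**Polling guidance**:
- The client polls every 3 seconds
- Stop polling when `status = 'completed'` or `status = 'failed'`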
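
A minimal polling loop for this endpoint might look like the sketch below. `pollProgress`, `isTerminal`, and `progressPercent` are hypothetical helpers. The loop treats both `completed` and `failed` as terminal, on the assumption that neither status changes afterwards, and it assumes `progress` is `processedCount / totalCount` rounded down, which matches the example response (250 of 500 gives 50).

```typescript
type TaskStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface TaskProgress {
  status: TaskStatus;
  totalCount: number;
  processedCount: number;
}

// Polling should stop once the task reaches a terminal state.
function isTerminal(status: TaskStatus): boolean {
  return status === 'completed' || status === 'failed';
}

// Assumption: progress is processedCount / totalCount as a whole percent.
function progressPercent(p: TaskProgress): number {
  return p.totalCount === 0 ? 0 : Math.floor((p.processedCount / p.totalCount) * 100);
}

// `fetchProgress` is injected so the loop does not depend on an HTTP client.
async function pollProgress(
  fetchProgress: () => Promise<TaskProgress>,
  intervalMs = 3000, // the 3-second interval suggested above
): Promise<TaskProgress> {
  for (;;) {
    const progress = await fetchProgress();
    if (isTerminal(progress.status)) {
      return progress;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```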

---

### 3.5 Get Validation Grid Data

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/items`

**Purpose**: Returns the dual-model extraction results for manual review

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400.../items?page=1&limit=50&status=conflict
```

**Query parameters**:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `page` | number | ❌ | 1 | Page number |
| `limit` | number | ❌ | 50 | Items per page |
| `status` | string | ❌ | - | Status filter |

**Response** (200):
```json
{
  "items": [
    {
      "id": "item-123",
      "rowIndex": 5,
      "originalText": "患者，男，45岁，诊断为浸润性腺癌，中分化，肿瘤最大径3cm...",
      "resultA": {
        "病理类型": "浸润性腺癌",
        "分化程度": "中分化",
        "肿瘤大小": "3cm"
      },
      "resultB": {
        "病理类型": "浸润性腺癌",
        "分化程度": "中分化",
        "肿瘤大小": "3.0cm"
      },
      "status": "conflict",
      "conflictFields": ["肿瘤大小"],
      "finalResult": null
    }
  ],
  "pagination": {
    "total": 45,
    "page": 1,
    "pageSize": 50,
    "totalPages": 1
  }
}
```

**Response fields**:

| Field | Type | Description |
|-------|------|-------------|
| `items` | array | List of records |
| `items[].status` | string | `clean/conflict/resolved/failed` |
| `items[].conflictFields` | array | List of conflicting fields |
| `pagination` | object | Pagination info |
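
To make `conflictFields` and `pagination` concrete, here is a sketch of how they could be derived. Both helpers are hypothetical. The comparison is assumed to be an exact string match, which is consistent with the example response where `"3cm"` and `"3.0cm"` are reported as a conflict; the real comparison logic is not described in this document.

```typescript
type Extraction = Record<string, string>;

// Returns the fields whose values differ between the two model outputs.
// Assumption: exact string comparison, so "3cm" and "3.0cm" conflict.
function findConflictFields(resultA: Extraction, resultB: Extraction): string[] {
  // Union of both key sets, preserving the order of resultA first.
  const fields = Object.keys(resultA).concat(
    Object.keys(resultB).filter((key) => !(key in resultA)),
  );
  return fields.filter((field) => resultA[field] !== resultB[field]);
}

// Hypothetical helper that builds the `pagination` object of the response.
function paginate(total: number, page: number, pageSize: number) {
  return { total, page, pageSize, totalPages: Math.ceil(total / pageSize) };
}
```

With the numbers from the example response, `paginate(45, 1, 50)` yields `totalPages: 1`.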

---

### 3.6 Resolve Conflict

**Endpoint**: `POST /api/v1/dc/tool-b/items/:itemId/resolve`

**Purpose**: Lets a reviewer choose the correct extraction result

**Request**:
```json
{
  "field": "肿瘤大小",
  "chosenValue": "3cm"
}
```

**Request parameters**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `field` | string | ✅ | Name of the conflicting field |
| `chosenValue` | string | ✅ | The selected value |

**Response** (200):
```json
{
  "success": true
}
```

**Business logic**:
1. Set `finalResult[field] = chosenValue`
2. Remove the field from `conflictFields`
3. Once all conflicts are resolved, set `status = 'resolved'`
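
The three steps above amount to a small state transition. The sketch below models it as a pure function over a trimmed-down item shape; `GridItem` and `resolveField` are hypothetical names, and persistence is left out.

```typescript
// Trimmed-down item shape: only the fields the resolution touches.
interface GridItem {
  status: 'pending' | 'clean' | 'conflict' | 'resolved' | 'failed';
  conflictFields: string[];
  finalResult: Record<string, string> | null;
}

// Returns an updated copy instead of mutating the input.
function resolveField(item: GridItem, field: string, chosenValue: string): GridItem {
  // 1. finalResult[field] = chosenValue
  const finalResult = { ...(item.finalResult ?? {}), [field]: chosenValue };
  // 2. Remove the field from conflictFields
  const conflictFields = item.conflictFields.filter((f) => f !== field);
  // 3. Mark the item resolved once no conflicts remain
  const status = conflictFields.length === 0 ? 'resolved' : item.status;
  return { status, conflictFields, finalResult };
}
```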

---

### 3.7 Export Results

**Endpoint**: `GET /api/v1/dc/tool-b/tasks/:taskId/export`

**Purpose**: Exports the final extraction results as an Excel file

**Request**:
```
GET /api/v1/dc/tool-b/tasks/550e8400.../export?format=xlsx
```

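**Query parameters**:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `format` | string | ❌ | `xlsx` | Export format: `xlsx/csv` |

**Response** (200):
- File stream download
- Content-Type: `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`
- Content-Disposition: `attachment; filename="extraction_result_2025-11-27.xlsx"`

**Export contents**:
- The original columns plus all extracted fields
- Only records in the `clean` or `resolved` state
- Conflict records are not exported (they require manual resolution first)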
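
The filtering rule above could be sketched as follows. `buildExportRows` is a hypothetical helper and simplifies in two ways that this document does not confirm: the only original column carried over is `originalText`, and each row starts from model A's result with `finalResult` overriding any manually resolved fields.

```typescript
interface ExportItem {
  rowIndex: number;
  originalText: string;
  status: 'pending' | 'clean' | 'conflict' | 'resolved' | 'failed';
  resultA?: Record<string, string>;
  finalResult?: Record<string, string> | null;
}

// Builds flat export rows from clean and resolved items only.
function buildExportRows(items: ExportItem[]): Array<Record<string, string | number>> {
  return items
    .filter((item) => item.status === 'clean' || item.status === 'resolved')
    .map((item) => {
      const row: Record<string, string | number> = {
        rowIndex: item.rowIndex,
        originalText: item.originalText,
      };
      // Assumption: model A's values are the base; resolved values win.
      Object.assign(row, item.resultA ?? {}, item.finalResult ?? {});
      return row;
    });
}
```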

---

## 4. Data Models

### 4.1 HealthCheckResult

```typescript
interface HealthCheckResult {
  status: 'good' | 'bad';
  emptyRate: number;
  avgLength: number;
  totalRows: number;
  estimatedTokens: number;
  message: string;
}
```

### 4.2 Template

```typescript
interface Template {
  diseaseType: string;
  reportType: string;
  displayName: string;
  fields: TemplateField[];
}

interface TemplateField {
  name: string;
  desc: string;
  width?: string;
}
```

### 4.3 ExtractionTask

```typescript
interface ExtractionTask {
  id: string;
  userId: string;
  projectName: string;
  sourceFileKey: string;
  textColumn: string;

  diseaseType: string;
  reportType: string;
  targetFields: TemplateField[];

  status: 'pending' | 'processing' | 'completed' | 'failed';
  totalCount: number;
  processedCount: number;
  cleanCount: number;
  conflictCount: number;
  failedCount: number;

  totalTokens: number;
  totalCost: number;

  createdAt: Date;
  startedAt?: Date;
  completedAt?: Date;
}
```

### 4.4 ExtractionItem

```typescript
interface ExtractionItem {
  id: string;
  taskId: string;
  rowIndex: number;
  originalText: string;

  resultA?: Record<string, any>;
  resultB?: Record<string, any>;

  status: 'pending' | 'clean' | 'conflict' | 'resolved' | 'failed';
  conflictFields: string[];

  finalResult?: Record<string, any>;

  tokensA: number;
  tokensB: number;
}
```

---

## 5. Error Handling

### 5.1 Error Response Format

```json
{
  "error": "Error description",
  "code": 400,
  "details": {
    "field": "fileKey",
    "reason": "文件不存在"
  }
}
```

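### 5.2 Common Error Codes

| HTTP status | code | Description | Example |
|-------------|------|-------------|---------|
| 400 | `INVALID_PARAMS` | Invalid parameters | Missing `fileKey` |
| 400 | `COLUMN_NOT_FOUND` | Column not found | Column "病历文本" does not exist |
| 400 | `BAD_HEALTH` | Health check failed | Empty rate too high |
| 404 | `FILE_NOT_FOUND` | File not found | Invalid file path |
| 404 | `TASK_NOT_FOUND` | Task not found | Invalid `taskId` |
| 403 | `FORBIDDEN` | Access denied | Users can only access their own tasks |
| 500 | `INTERNAL_ERROR` | Server error | Database connection failed |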
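
One way to keep the codes and HTTP statuses from this table in sync is a single typed map. The sketch below is an assumption about how that could look; `ApiError` and `isRetryable` are hypothetical and not part of the documented API.

```typescript
// Error codes and HTTP statuses copied from the table above.
const ERROR_STATUS = {
  INVALID_PARAMS: 400,
  COLUMN_NOT_FOUND: 400,
  BAD_HEALTH: 400,
  FILE_NOT_FOUND: 404,
  TASK_NOT_FOUND: 404,
  FORBIDDEN: 403,
  INTERNAL_ERROR: 500,
} as const;

type ErrorCode = keyof typeof ERROR_STATUS;

// Hypothetical error class carrying both the code and its HTTP status.
class ApiError extends Error {
  readonly code: ErrorCode;
  readonly status: number;

  constructor(code: ErrorCode, message: string) {
    super(message);
    this.name = 'ApiError';
    this.code = code;
    this.status = ERROR_STATUS[code];
  }
}

// Assumption: only server-side (5xx) failures are worth retrying.
function isRetryable(error: ApiError): boolean {
  return error.status >= 500;
}
```

Since `ErrorCode` is derived from the map, an unknown code fails at compile time rather than at runtime.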

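### 5.3 Error Handling Best Practices

**Client**:
```typescript
// Runs the health check for one column.
// Returns true when the column passed and extraction may proceed.
async function runHealthCheck(fileKey: string, columnName: string): Promise<boolean> {
  try {
    const response = await fetch('/api/v1/dc/tool-b/health-check', {
      method: 'POST',
      // JSON bodies need an explicit Content-Type (see section 1.2)
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ fileKey, columnName })
    });

    if (!response.ok) {
      const error = await response.json();
      throw new Error(error.error);
    }

    const data = await response.json();

    if (data.status === 'bad') {
      alert(data.message); // health check did not pass
      return false;
    }

    // continue to the next step
    return true;
  } catch (error) {
    console.error('Health check failed:', error);
    return false;
  }
}
```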

---

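## 6. Performance Targets

### 6.1 Response Time Targets

| API | Target | Notes |
|-----|--------|-------|
| `/health-check` | < 3 s | Excel parsing + statistics |
| `/templates` | < 100 ms | In-memory cache |
| `/tasks` (create) | < 500 ms | Create quickly and return |
| `/tasks/:id/progress` | < 100 ms | Single database query |
| `/tasks/:id/items` | < 500 ms | Paginated query |
| `/items/:id/resolve` | < 200 ms | Single-row update |
| `/tasks/:id/export` | < 10 s | Excel file generation |

### 6.2 Concurrency Capacity

- **Health check**: 10 req/s (I/O-bound)
- **Task creation**: 5 req/s (database writes)
- **Progress query**: 100 req/s (read-heavy, cacheable)
- **Validation grid**: 50 req/s (paginated query)

### 6.3 Optimization Strategies

**Caching**:
- `/templates` → permanent cache (in memory)
- `/tasks/:id/progress` → Redis cache (5-second TTL)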
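
The 5-second TTL behaviour can be illustrated with a small in-memory cache. This is a stand-in for Redis rather than a Redis client: `TtlCache` is a hypothetical class, and the injectable clock exists only to make expiry easy to test.

```typescript
// In-memory stand-in for the Redis cache; hypothetical class for illustration.
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop stale entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

With a 3-second polling interval and a 5-second TTL, a cached progress value is served for at most one or two consecutive polls before it is refreshed.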

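**Asynchronous processing**:
- Task processing runs in a BullMQ background queue
- This avoids blocking user requests

**Pagination**:
- The validation grid defaults to 50 items per page
- Maximum of 1000 items per page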
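
The pagination limits above can be enforced when the `limit` query parameter is parsed. The sketch below is one possible approach; `normalizeLimit` is a hypothetical helper, and falling back to the default for invalid input (rather than returning a `400`) is an assumption.

```typescript
const DEFAULT_LIMIT = 50; // default page size for the validation grid
const MAX_LIMIT = 1000;   // upper bound per page

// Normalizes a raw `limit` query value into a safe page size.
function normalizeLimit(raw: string | undefined): number {
  const parsed = parseInt(raw ?? '', 10);
  // Assumption: missing, non-numeric, or non-positive values use the default.
  if (isNaN(parsed) || parsed < 1) {
    return DEFAULT_LIMIT;
  }
  return Math.min(parsed, MAX_LIMIT);
}
```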

---

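## 7. Versioning

### 7.1 API Versioning Strategy

**Current version**: `v1`

**URL format**: `/api/v1/dc/tool-b/*`

**Backward-compatibility commitment**:
- v1 remains stable until 2026
- New features are added through optional parameters
- Breaking changes are released as v2

### 7.2 Deprecation Notice

When an API is about to be deprecated, responses carry these headers: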
```http
HTTP/1.1 200 OK
X-API-Deprecated: true
X-API-Sunset: 2026-12-31
X-API-Replacement: /api/v2/dc/tool-b/health-check
```
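
A client could surface these headers with a small reader like the sketch below. `HeaderReader` is a minimal structural type so that the helper accepts the `Headers` object from `fetch` as well as a test double; `readDeprecation` itself is a hypothetical helper.

```typescript
// Minimal structural type: anything with a `get` method, such as fetch's Headers.
interface HeaderReader {
  get(name: string): string | null;
}

interface DeprecationInfo {
  sunset: string | null;      // value of X-API-Sunset
  replacement: string | null; // value of X-API-Replacement
}

// Returns deprecation details, or null when the endpoint is not deprecated.
function readDeprecation(headers: HeaderReader): DeprecationInfo | null {
  if (headers.get('X-API-Deprecated') !== 'true') {
    return null;
  }
  return {
    sunset: headers.get('X-API-Sunset'),
    replacement: headers.get('X-API-Replacement'),
  };
}
```

A typical use would be to log a warning once per endpoint when the result is not null.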

---

## 8. Testing

### 8.1 Postman Collection

The complete API test collection:
```
docs/03-业务模块/DC-数据清洗整理/02-技术设计/ToolB-API.postman_collection.json
```

### 8.2 Example Requests

**Health check**:
```bash
curl -X POST http://localhost:3001/api/v1/dc/tool-b/health-check \
  -H "Content-Type: application/json" \
  -d '{
    "fileKey": "uploads/test.xlsx",
    "columnName": "病历文本"
  }'
```

**List templates**:
```bash
curl http://localhost:3001/api/v1/dc/tool-b/templates
```

**Create task**:
```bash
curl -X POST http://localhost:3001/api/v1/dc/tool-b/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "projectName": "测试任务",
    "fileKey": "uploads/test.xlsx",
    "textColumn": "病历文本",
    "diseaseType": "lung_cancer",
    "reportType": "pathology",
    "targetFields": [{"name": "病理类型", "desc": "..."}]
  }'
```

---

## 9. Appendix

### 9.1 Related Documents

- [Database Design Document](./数据库设计文档-工具B.md)
- [PRD](../01-需求分析/PRD：Tool B - 病历结构化机器人 (The AI Structurer).md)
- [Development Plan](../04-开发计划/工具B开发计划-病历结构化机器人.md)

### 9.2 Changelog

| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2025-11-27 | Initial version, 7 API endpoints |

---

**End of document** ✅