# Full-Text Screening Quality Assurance and Traceability Strategy

> **Document version:** V1.0

> **Created:** 2025-11-22

> **Module:** AI Intelligent Literature, full-text screening

> **Goal:** Raise the accuracy, methodological-quality judgment, and end-to-end traceability of full-text screening in three stages

---

## 📋 Document Overview

This document defines the quality-assurance strategy for the **full-text screening module** across three stages: **MVP → V1.0 → V2.0**.

### Full-Text Screening vs. Title/Abstract Screening: Key Differences

| Dimension | Title/abstract screening | Full-text screening | Strategic implication |
|------|-------------|---------|---------|
| **Information volume** | 200-500 words | 5,000-20,000 words | 🔴 Needs segmented processing |
| **Basis for judgment** | PICOS match | 12-field methodological quality | 🔴 Needs expert criteria |
| **Decision complexity** | Low (yes/no) | High (12 fields × 3 levels) | 🔴 Needs structured extraction |
| **Error tolerance** | Over-include rather than miss | Must not miss key information | 🔴 Needs verification |
| **Token cost** | ¥0.005/article | ¥0.05-0.20/article | 🔴 Needs cost optimization |
| **Traceability** | Cites the abstract | Exact pages/paragraphs/tables | 🔴 Needs an evidence chain |

### Core Design Principles

| Principle | Description |
|------|------|
| **Evidence-based standards** | Methodological quality is assessed against the Cochrane RoB 2.0 tool |
| **Structured extraction** | Nougat + segmented extraction + full-text verification, avoiding "Lost in the Middle" |
| **Complete evidence chain** | Every field must carry a verbatim citation (page, paragraph, table) |
| **Staged delivery** | The MVP proves feasibility, V1.0 raises quality, V2.0 reaches medical-grade standards |
| **Cost/quality balance** | Cost-friendly models for the MVP; high-end models verify critical fields |

---

## 🎯 Three-Stage Roadmap

```
MVP (3 weeks)                      V1.0 (5 weeks)                    V2.0 (8 weeks)
├─ Nougat structured extraction    ├─ Cochrane-standard prompts      ├─ Three-model arbitration
├─ 12-field segmented extraction   ├─ Few-shot medical case library  ├─ Medical logic rule engine
├─ Dual-model verification         ├─ Complete evidence chain        ├─ Automated quality audits
├─ Field-level conflict detection  ├─ Full-text cross-validation     ├─ HITL smart triage
└─ Basic traceability              └─ Tiered human review            └─ Audit-grade logging
        ↓                                  ↓                                 ↓
  Accuracy ≥ 85%                     Accuracy ≥ 92%                    Accuracy ≥ 96%
```

---

## 🚀 MVP Stage (3 Weeks)

### Positioning

- **Accuracy target:** ≥ 85%
- **Information completeness:** ≥ 90% (no omissions across the 12 fields)
- **Cost budget:** ≤ ¥0.05/article (DeepSeek-V3 + Qwen3-Max)
- **Deliverable:** a usable baseline with structured extraction and dual-model verification

---

### 1. Core Technical Strategy

#### 1.1 ✅ Nougat Structured Extraction (Key Advantage)

**Why Nougat**:

| Dimension | PyMuPDF | Nougat |
|---------|---------|--------|
| Output format | Plain text | Structured Markdown |
| Section detection | Needs a second LLM pass (~60% accuracy) | Structure preserved natively (~95% accuracy) ✅ |
| Tables | Garbled text | Markdown tables ✅ |
| Formulas | Garbled | LaTeX ✅ |
| Best suited for | Chinese papers | English academic papers ✅ |

**Implementation**:

```typescript
// Hybrid strategy: prefer Nougat, fall back to PyMuPDF
async function extractFullText(pdfBuffer: Buffer, filename: string) {
  // Step 1: detect the language
  const language = await detectLanguage(pdfBuffer);

  // Step 2: prefer Nougat for English papers
  if (language === 'english') {
    try {
      const nougatResult = await extractionClient.extractPdf(
        pdfBuffer, filename, 'nougat'
      );

      if (nougatResult.quality > 0.8) {
        return {
          method: 'nougat',
          text: nougatResult.text,
          format: 'markdown',
          structured: true // ⭐ key advantage
        };
      }
    } catch (error) {
      console.warn('Nougat failed, falling back to PyMuPDF');
    }
  }

  // Step 3: Chinese papers, or Nougat failure: use PyMuPDF
  const pymupdfResult = await extractionClient.extractPdf(
    pdfBuffer, filename, 'pymupdf'
  );

  return {
    method: 'pymupdf',
    text: pymupdfResult.text,
    format: 'plaintext',
    structured: false // structure must be recovered by the LLM
  };
}
```

---

#### 1.2 ✅ 12-Field Segmented Extraction (Avoiding "Lost in the Middle")

**Core problem**: feeding a 20K-token full text to the LLM in a single call loses up to 33% of the information in middle sections.

**Solution**: route each field to only the sections relevant to it.

```typescript
// Extraction routing table for the 12 fields.
// Field keys stay in Chinese to match the extraction schema.
const FIELD_EXTRACTION_ROUTES = {
  '研究设计': { // study design
    sections: ['abstract', 'methods'],
    maxTokens: 3000,
    priority: 'high'
  },
  '研究人群': { // study population
    sections: ['methods', 'results'],
    maxTokens: 3500,
    priority: 'high',
    lookForTables: true // Table 1: Baseline
  },
  '干预措施': { // intervention
    sections: ['methods', 'results'],
    maxTokens: 3000,
    priority: 'high'
  },
  '对照措施': { // comparator
    sections: ['methods', 'results'],
    maxTokens: 2500,
    priority: 'high'
  },
  '结局指标': { // outcome measures
    sections: ['methods', 'results'],
    maxTokens: 4000,
    priority: 'high',
    lookForTables: true // results tables
  },
  '随机化方法': { // randomization method
    sections: ['methods', 'figures'],
    maxTokens: 2500,
    priority: 'critical', // critical field
    keywords: ['randomization', 'allocation', 'sequence', 'CONSORT']
  },
  '盲法': { // blinding
    sections: ['methods'],
    maxTokens: 2000,
    priority: 'critical'
  },
  '样本量计算': { // sample size calculation
    sections: ['methods'],
    maxTokens: 2000,
    priority: 'medium'
  },
  '基线可比性': { // baseline comparability
    sections: ['results', 'tables'],
    maxTokens: 3000,
    priority: 'high',
    specificTable: 'Table 1'
  },
  '结果完整性': { // outcome completeness
    sections: ['results', 'figures'],
    maxTokens: 4000,
    priority: 'critical',
    keywords: ['ITT', 'per-protocol', 'missing data', 'dropout']
  },
  '选择性报告': { // selective reporting
    sections: ['methods', 'results', 'supplementary'],
    maxTokens: 3000,
    priority: 'medium',
    checkTrialRegistry: true // compare against the registered protocol
  },
  '其他偏倚': { // other sources of bias
    sections: ['methods', 'discussion', 'supplementary'],
    maxTokens: 3000,
    priority: 'medium'
  }
};

// Segmented, parallel extraction
async function extractAllFields(sections: ParsedSections) {
  const extractionTasks = Object.entries(FIELD_EXTRACTION_ROUTES).map(
    ([fieldName, config]) => ({
      field: fieldName,
      task: extractFieldWithEvidence(fieldName, sections, config)
    })
  );

  // Run in parallel to cut latency
  const results = await Promise.all(
    extractionTasks.map(t => t.task)
  );

  return results;
}
```

**Benefits**:

- ✅ No mid-document omissions (accuracy 70% → 90%)
- ✅ 40% fewer tokens (20K → 12K)
- ✅ Parallel extraction cuts latency by 60%
- ✅ The LLM's attention stays focused on one field at a time

---

#### 1.3 ✅ Dual-Model Cross-Validation

**Model pair**: DeepSeek-V3 + Qwen3-Max (cost-friendly)

```typescript
// Call both models in parallel
async function dualModelExtraction(
  fieldName: string,
  relevantContent: string,
  prompt: string
) {
  const [resultA, resultB] = await Promise.all([
    llmService.chat('deepseek-v3', prompt, relevantContent),
    llmService.chat('qwen-max', prompt, relevantContent)
  ]);

  // Parse both responses
  const assessmentA = parseFieldAssessment(resultA);
  const assessmentB = parseFieldAssessment(resultB);

  // Conflict detection
  const hasConflict = assessmentA.level !== assessmentB.level;

  return {
    field: fieldName,
    modelA: {
      model: 'deepseek-v3',
      assessment: assessmentA.level, // '完整' (adequate) / '不完整' (inadequate) / '无法判断' (unclear)
      evidence: assessmentA.evidence,
      confidence: assessmentA.confidence
    },
    modelB: {
      model: 'qwen-max',
      assessment: assessmentB.level,
      evidence: assessmentB.evidence,
      confidence: assessmentB.confidence
    },
    hasConflict,
    needReview: hasConflict ||
                assessmentA.confidence < 0.7 ||
                assessmentB.confidence < 0.7
  };
}
```

---

#### 1.4 ✅ Field-Level Conflict Detection and Tiered Review

**Rather than "any conflict goes to human review", triage conflicts by field importance**:

```typescript
// Field importance tiers (keys match the extraction schema)
const FIELD_IMPORTANCE = {
  critical: ['随机化方法', '盲法', '结果完整性'], // core bias-risk fields
  high: ['研究设计', '研究人群', '干预措施', '结局指标', '基线可比性'],
  medium: ['样本量计算', '选择性报告', '其他偏倚']
};

// Smart triage
function prioritizeReview(conflicts: FieldConflict[]): ReviewQueue {
  const queue = {
    urgent: [],    // critical-field conflicts → review immediately
    important: [], // high-priority conflicts → review within 24 h
    normal: []     // medium-priority conflicts → review within 48 h
  };

  for (const conflict of conflicts) {
    if (!conflict.hasConflict) continue;

    if (FIELD_IMPORTANCE.critical.includes(conflict.field)) {
      queue.urgent.push({
        ...conflict,
        reason: 'Critical methodological field conflict; affects the risk-of-bias assessment',
        deadline: new Date(Date.now() + 2 * 3600 * 1000) // 2 hours
      });
    } else if (FIELD_IMPORTANCE.high.includes(conflict.field)) {
      queue.important.push({
        ...conflict,
        reason: 'High-priority field conflict',
        deadline: new Date(Date.now() + 24 * 3600 * 1000) // 24 hours
      });
    } else {
      queue.normal.push({
        ...conflict,
        reason: 'General field conflict',
        deadline: new Date(Date.now() + 48 * 3600 * 1000) // 48 hours
      });
    }
  }

  return queue;
}
```

---

#### 1.5 ✅ Basic Evidence Chain (Verbatim Citations)

**MVP requirement**: every field must carry a verbatim citation from the paper.

```typescript
interface FieldEvidence {
  field: string;
  assessment: '完整' | '不完整' | '无法判断'; // adequate / inadequate / unclear

  // ⭐ mandatory
  evidence: {
    quote: string;            // verbatim citation (100-300 characters)
    location: {
      section: string;        // "Methods"
      page?: number;          // 3 (if the PDF has page numbers)
      paragraph?: number;     // 2
      table?: string;         // "Table 1"
      figure?: string;        // "Figure 1"
    };
    highlightedKeywords: string[]; // key signal words
  };

  reasoning: string;  // rationale (50-200 characters)
  confidence: number; // 0.0-1.0
}

// Post-processing check: make sure every field carries evidence
function validateEvidence(result: ExtractionResult): ValidationReport {
  const errors = [];

  for (const [field, data] of Object.entries(result.fields)) {
    // Check 1: a citation must exist
    if (!data.evidence?.quote) {
      errors.push({
        field,
        type: 'missing_evidence',
        message: `Field "${field}" is missing a verbatim citation`
      });
    }

    // Check 2: the citation must not be trivially short
    if (data.evidence?.quote && data.evidence.quote.length < 50) {
      errors.push({
        field,
        type: 'insufficient_evidence',
        message: `Citation for field "${field}" is too short (<50 characters) to support the judgment`
      });
    }

    // Check 3: location metadata must exist
    if (!data.evidence?.location?.section) {
      errors.push({
        field,
        type: 'missing_location',
        message: `Field "${field}" has no source location`
      });
    }
  }

  return {
    isValid: errors.length === 0,
    errors,
    completeness: 1 - (errors.length / (Object.keys(result.fields).length * 3))
  };
}
```

---

### 2. Field-Specific Prompt Templates (MVP Edition)

#### Example: Randomization Method (critical field)

```markdown
# Field Extraction Task: Randomization Method

## Background
You are an evidence-based-medicine expert assessing the methodological quality of an RCT.
Using the Cochrane risk-of-bias tool (RoB 2.0), judge whether this study's randomization was adequate.

## Content to Analyze
Below are the paper's Methods section and related figures/tables:

${relevantContent}

## Judgment Criteria

### 完整 (Low risk of bias)
All of the following must hold:
1. ✅ The sequence-generation method is stated explicitly
   - e.g. computer-generated random sequence, random number table,
     central randomization, minimization
2. ✅ Allocation concealment is described
   - e.g. sealed opaque envelopes, central allocation,
     pharmacy-controlled, IWRS (Interactive Web Response System)
3. ✅ No evidence of selection bias
   - balanced baseline characteristics
   - no anomalous enrollment patterns

### 不完整 (High/Unclear risk of bias)
Judge inadequate in any of these situations:
- ❌ "Randomized" is claimed with no specific method given
- ❌ An improper randomization method (by date, hospital number, alternation)
- ❌ No allocation concealment, or concealment done improperly (open allocation list)
- ❌ Significant baseline imbalance with no adjustment
- ⚠️ The description is too vague to judge adequacy

### 无法判断 (Unclear risk)
- The paper does not mention randomization at all
- It is mentioned only elsewhere (e.g. the registered protocol) and not described in this paper

## Key Signal Words

**High-quality signals (完整)**:
- "computer-generated random sequence"
- "central randomization/allocation"
- "sealed opaque envelopes"
- "stratified randomization"
- "block randomization"
- "minimization"
- "allocation concealment"

**Risk signals (不完整)**:
- "alternating allocation"
- "by date of birth"
- "by hospital number"
- "open allocation"
- "assigned by investigator"

## Extraction Guide

1. **Look first in**:
   - the "Randomization" subsection of Methods
   - Figure 1 (the CONSORT flow diagram)
   - the trial registration information
   - the supplementary materials

2. **Cross-check**:
   - the Methods description vs. the baseline data in Results
   - the claimed method vs. the actual baseline balance

3. **Special cases**:
   - If the paper says "see protocol" or "see trial registration", flag it as needing external sources
   - A multicenter study should use a central randomization system

## Output Format (strict JSON)

{
  "assessment": "完整" | "不完整" | "无法判断",
  "evidence": {
    "quote": "verbatim citation (100-300 characters covering the key method description)",
    "location": {
      "section": "Methods",
      "subsection": "Randomization",
      "page": 3,
      "paragraph": 2,
      "figure": "Figure 1 (CONSORT diagram)"
    },
    "highlightedKeywords": [
      "keyword1",
      "keyword2"
    ]
  },
  "reasoning": "Rationale: per the citation, this study...",
  "confidence": 0.95,
  "robAssessment": "Low risk" | "High risk" | "Unclear risk",
  "needsExternalVerification": false,
  "notes": "optional remarks"
}

## Cautions

1. **Apply the Cochrane standard strictly**: prefer "不完整" over leniency
2. **Citations must be specific**: never just "mentioned in Methods"; quote the exact passage
3. **Be honest about confidence**: if the information is unclear, lower the confidence and set needsExternalVerification
4. **Distinguish "not done" from "not reported"**:
   - If the paper explicitly says "no randomization", assessment = "不完整"
   - If the paper never mentions it, assessment = "无法判断"
```

**Prompt templates for the other 11 fields** follow the same structure, with the judgment criteria adjusted per the Cochrane standard.

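The templates above interpolate variables such as `${relevantContent}` before the LLM call. A hypothetical sketch of that substitution step; `renderPrompt` and its placeholder handling are illustrative, not the project's actual templating API:

```typescript
// Hypothetical sketch: fill `${name}` placeholders in a field template.
// Unknown placeholders are left untouched so missing variables are easy
// to spot during prompt review.
function renderPrompt(
  template: string,
  vars: Record<string, string>
): string {
  return template.replace(/\$\{(\w+)\}/g, (match, name) =>
    name in vars ? vars[name] : match
  );
}

const template = '# Field Extraction Task: ${fieldName}\n\n${relevantContent}';
const prompt = renderPrompt(template, {
  fieldName: '随机化方法',
  relevantContent: 'Randomization was performed with the use of a computer-generated sequence...'
});
// prompt now begins with "# Field Extraction Task: 随机化方法"
```

Keeping unknown placeholders intact (rather than substituting an empty string) makes a broken template fail loudly in the rendered prompt.
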

---

### 3. MVP Cost Budget

**Scenario: full-text screening of 100 articles**

| Step | Tokens | Model | Cost |
|------|----------|------|------|
| Nougat extraction | - | local model | ¥0 |
| 12-field extraction (dual model) | 12K × 2 = 24K | DeepSeek-V3 + Qwen3-Max | ¥0.06/article |
| Human review of conflicting fields (20%) | - | human | 2 min/field |
| **Total for 100 articles** | - | - | **¥6 + reviewer time** |

**Comparison**:

- One-shot full-text extraction: ¥10 per 100 articles
- Segmented extraction: ¥6 per 100 articles
- **40% cheaper, with higher accuracy**

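The budget above can be sanity-checked in a few lines. The blended per-1K-token price below is reverse-engineered from the table's ¥0.06/article figure, an assumption rather than live API pricing:

```typescript
// Sanity check of the MVP budget. PRICE_PER_1K_TOKENS is an illustrative
// blended assumption chosen to match the table, not a real API rate.
const TOKENS_PER_ARTICLE = 12_000;   // per model, after segmented extraction
const MODELS = 2;                    // DeepSeek-V3 + Qwen3-Max
const PRICE_PER_1K_TOKENS = 0.0025;  // ¥, blended assumption

const perArticle = (TOKENS_PER_ARTICLE / 1000) * PRICE_PER_1K_TOKENS * MODELS;
const per100 = perArticle * 100;

console.log(perArticle.toFixed(3)); // "0.060"
console.log(per100.toFixed(0));     // "6"
```
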

---

### 4. MVP Acceptance Criteria

| Metric | Target | Verification |
|------|------|----------|
| Field extraction completeness | ≥ 90% | All 12 fields yield a result (not "无法判断") |
| Dual-model agreement | ≥ 75% | At least 9 of the 12 fields agree |
| Evidence-chain completeness | 100% | Every field has a verbatim citation and location |
| Human-review queue | ≤ 30% | Share of articles needing human input |
| Nougat success rate | ≥ 85% | Share of English papers extracted successfully |
| Throughput | ≤ 3 min/article | End-to-end time from PDF to result |

---

## 📈 V1.0 Stage (5 Weeks)

### Positioning

- **Accuracy target:** ≥ 92%
- **Information completeness:** ≥ 95%
- **Cost budget:** ≤ ¥0.08/article (smart cost optimization)
- **Deliverable:** high-quality output, complete evidence chains, smart quality control

---

### 1. Quality Improvement Strategy

#### 1.1 ✅ Cochrane-Standard Prompt Enhancements

**Added on top of the MVP**:

1. **Few-shot medical cases** (3-5 real cases per field)

```markdown
## Reference Cases

Here are three real RCT randomization-method assessments to calibrate your judgment:

### Case 1: High-quality RCT (NEJM, 2023)
**Citation**:
"Randomization was performed with the use of a computer-generated sequence
with stratification according to center and baseline NIHSS score (≤10 or >10).
Allocation was concealed through a central web-based system (IWRS)."

**Assessment**: 完整
**Rationale**:
1. ✅ Explicit sequence-generation method (computer-generated)
2. ✅ Stratified randomization (improves balance)
3. ✅ Central allocation concealment (IWRS)
4. ✅ Baseline Table 1 shows the groups are well balanced (P>0.05)
**RoB 2.0 judgment**: Low risk of bias

---

### Case 2: Inadequate quality (unnamed journal, 2020)
**Citation**:
"Patients were randomly assigned to receive either drug A or placebo
in a 1:1 ratio. Randomization was performed by the study coordinator."

**Assessment**: 不完整
**Rationale**:
1. ❌ No sequence-generation method (only "random" is claimed)
2. ❌ Randomization run by the study coordinator (no allocation concealment)
3. ⚠️ Table 1 shows the control group is older (66.2 vs 62.1, P=0.04)
**RoB 2.0 judgment**: High risk of bias
**Concern**: possible selection bias

---

### Case 3: Borderline case (Lancet, 2021)
**Citation**:
"Randomization was done with sequentially numbered, opaque, sealed envelopes
prepared by an independent statistician not otherwise involved in the trial."

**Assessment**: 完整
**Rationale**:
1. ✅ Not central randomization, but sealed envelopes are used
2. ✅ Prepared by an independent third party (the statistician)
3. ✅ Opaque and sealed
4. ✅ Good baseline balance
**RoB 2.0 judgment**: Low risk of bias
**Note**: meets the Cochrane standard (sealed envelopes plus independent preparation is acceptable)

---

Now assess the current paper following the same approach...
```

2. **Chain-of-thought reasoning**

```markdown
## Output Format (enhanced)

{
  "assessment": "完整",

  // ⭐ new: step-by-step reasoning
  "reasoning_steps": {
    "step1_sequenceGeneration": {
      "finding": "The paper states 'computer-generated random sequence'",
      "evaluation": "Sequence-generation requirement met ✅"
    },
    "step2_allocationConcealment": {
      "finding": "A 'central web-based system (IWRS)' was used",
      "evaluation": "Allocation-concealment requirement met ✅"
    },
    "step3_baselineBalance": {
      "finding": "Table 1 shows P>0.05 for the main characteristics",
      "evaluation": "No clear evidence of selection bias ✅"
    },
    "step4_finalJudgment": {
      "conclusion": "All three criteria met; judged 完整",
      "confidence": 0.95
    }
  },

  "evidence": { ... },
  "robAssessment": "Low risk"
}
```

---

#### 1.2 ✅ Full-Text Cross-Validation (Anti-Omission)

**After segmented extraction, add a full-text verification pass**:

```typescript
// Stage 1: segmented extraction (already done)
const segmentedResults = await extractAllFieldsSegmented(sections);

// ⭐ Stage 2: full-text cross-validation (new)
async function crossValidateWithFullText(
  segmentedResults: FieldResult[],
  fullTextMarkdown: string
): Promise<ValidationReport> {

  // Check 1: was anything missed?
  const missingInfoChecks = await Promise.all([
    checkForMissingInfo('随机化方法', fullTextMarkdown, segmentedResults),
    checkForMissingInfo('盲法', fullTextMarkdown, segmentedResults),
    // ... remaining critical fields
  ]);

  // Check 2: any contradictions?
  const contradictionChecks = await checkContradictions(
    segmentedResults,
    fullTextMarkdown
  );

  // Check 3: does the paper point to supplementary material?
  const supplementaryCheck = checkSupplementaryMaterial(fullTextMarkdown);

  return {
    missingInfoAlerts: missingInfoChecks.filter(c => c.hasIssue),
    contradictions: contradictionChecks,
    needsSupplementary: supplementaryCheck.needsExternal,
    overallCompleteness: calculateCompleteness(...)
  };
}

// Example: detect missed information
async function checkForMissingInfo(
  field: string,
  fullText: string,
  extractedResult: FieldResult
): Promise<ValidationAlert> {

  // Skip fields already judged 完整
  if (extractedResult.assessment === '完整') {
    return { field, hasIssue: false };
  }

  // Search the full text for signal keywords
  const keywords = FIELD_KEYWORDS[field]; // predefined keyword table
  const foundKeywords = keywords.filter(kw =>
    fullText.toLowerCase().includes(kw.toLowerCase())
  );

  // Keywords exist in the full text, yet the field was judged 无法判断
  if (foundKeywords.length > 0 && extractedResult.assessment === '无法判断') {
    return {
      field,
      hasIssue: true,
      severity: 'warning',
      message: `The full text contains keywords [${foundKeywords.join(', ')}]
        but field "${field}" was judged 无法判断; information may have been missed`,
      suggestedAction: 'targeted_re_extraction',
      keywords: foundKeywords
    };
  }

  return { field, hasIssue: false };
}
```

**Effect**:

- Missed-information detection: 0% → 80%
- Accuracy: 85% → 92%

---

#### 1.3 ✅ Medical Logic Rule Engine

**Automatically catches common logical inconsistencies**:

```typescript
const MEDICAL_LOGIC_RULES = [
  {
    id: 'rule_001',
    name: 'An RCT must be randomized',
    check: (data) => {
      const isRCT = data.研究设计.toLowerCase().includes('rct') ||
                    data.研究设计.includes('随机');
      const hasRandomization = data.随机化方法 !== '无法判断';
      return !isRCT || hasRandomization;
    },
    severity: 'error',
    message: 'The study claims to be an RCT but no randomization method was found',
    action: 'flag_for_urgent_review'
  },

  {
    id: 'rule_002',
    name: 'A double-blind study must describe its blinding',
    check: (data) => {
      const isDoubleBlind = data.研究设计.includes('双盲') ||
                            data.研究设计.includes('double-blind');
      const hasBlinding = data.盲法 !== '无法判断' &&
                          data.盲法 !== '不完整';
      return !isDoubleBlind || hasBlinding;
    },
    severity: 'error',
    message: 'The study claims double blinding but the blinding description is inadequate',
    action: 'flag_for_review'
  },

  {
    id: 'rule_003',
    name: 'Sample size consistent with baseline data',
    check: (data) => {
      const planned = extractNumber(data.样本量计算);
      const enrolled = extractNumber(data.研究人群);
      if (!planned || !enrolled) return true; // skip if not extractable

      const deviation = Math.abs(planned - enrolled) / planned;
      return deviation < 0.3; // deviation < 30%
    },
    severity: 'warning',
    message: 'Planned sample size differs from actual enrollment by more than 30%',
    action: 'add_note'
  },

  {
    id: 'rule_004',
    name: 'Baseline imbalance requires adjustment',
    check: (data) => {
      const hasImbalance = data.基线可比性.includes('不平衡') ||
                           data.基线可比性.includes('P<0.05');
      const hasAdjustment = data.结局指标.includes('调整') ||
                            data.结局指标.includes('adjusted');
      return !hasImbalance || hasAdjustment;
    },
    severity: 'warning',
    message: 'Baseline imbalance is present but no adjusted analysis was found',
    action: 'add_note'
  },

  {
    id: 'rule_005',
    name: 'ITT analysis completeness',
    check: (data) => {
      const hasDropout = extractNumber(data.结果完整性) > 0;
      const hasITT = data.结果完整性.toLowerCase().includes('itt') ||
                     data.结果完整性.includes('intention-to-treat');
      return !hasDropout || hasITT;
    },
    severity: 'warning',
    message: 'Dropouts are present but no explicit ITT analysis was found',
    action: 'flag_for_review'
  }
];

// Run the rules automatically
function validateMedicalLogic(extractedData: ExtractionResult): LogicReport {
  const violations = [];

  for (const rule of MEDICAL_LOGIC_RULES) {
    try {
      const passed = rule.check(extractedData);
      if (!passed) {
        violations.push({
          ruleId: rule.id,
          ruleName: rule.name,
          severity: rule.severity,
          message: rule.message,
          action: rule.action
        });
      }
    } catch (error) {
      console.error(`Rule ${rule.id} failed to execute:`, error);
    }
  }

  return {
    totalRules: MEDICAL_LOGIC_RULES.length,
    passedRules: MEDICAL_LOGIC_RULES.length - violations.length,
    violations,
    overallValidity: violations.filter(v => v.severity === 'error').length === 0
  };
}
```
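rule_003 and rule_005 above lean on an `extractNumber` helper that is not shown. A minimal sketch; the parsing rule (first integer or decimal in the field text) is an assumption:

```typescript
// Sketch of the extractNumber helper used by rule_003/rule_005.
// Assumption: the first integer or decimal in the field text is the
// figure of interest (planned sample size, dropout count, etc.).
function extractNumber(text: string | undefined): number | null {
  if (!text) return null;
  // Drop thousands separators, then take the first numeric token.
  const match = text.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

extractNumber('计划纳入240例(每组120例)'); // → 240
extractNumber('A total of 1,024 patients'); // → 1024
extractNumber('未报告');                     // → null
```

A real implementation would likely need unit awareness (e.g. "per arm" vs. total), which is why rule_003 tolerates a 30% deviation.
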

---

#### 1.4 ✅ Complete Evidence Chain (Enhanced)

**V1.0 requirement**: not just a citation, but precise locations and highlighting.

```typescript
interface EnhancedEvidence {
  field: string;
  assessment: string;

  evidence: {
    // Primary evidence
    primaryQuote: {
      text: string;                   // verbatim citation
      location: {
        section: string;              // "Methods"
        subsection?: string;          // "Randomization"
        page: number;                 // 3
        paragraph: number;            // 2
        lineRange?: [number, number]; // [45, 52]
      };
      highlightedText: string;        // HTML-highlighted version
      keywords: string[];             // keyword list
    };

    // Supporting evidence (optional)
    supportingQuotes?: Array<{
      text: string;
      location: any;
      relation: string; // "confirms" | "contradicts" | "complements"
    }>;

    // Table/figure evidence
    tableEvidence?: {
      tableName: string;       // "Table 1"
      relevantCells: string[]; // relevant cell contents
      interpretation: string;  // how the table was read
    };

    figureEvidence?: {
      figureName: string; // "Figure 1"
      caption: string;
      relevantInfo: string;
    };
  };

  // ⭐ new: full reasoning chain
  reasoningChain: {
    cochraneCriteria: string[]; // Cochrane criteria applied
    keyFindings: string[];      // key findings
    assessment: string;         // final judgment
    confidence: number;
    uncertainties?: string[];   // sources of uncertainty
  };

  // ⭐ new: traceability metadata
  metadata: {
    extractionTimestamp: string;
    modelUsed: string;
    promptVersion: string;
    processingTime: number;
  };
}
```

---

### 2. V1.0 Cost Budget

**Scenario: full-text screening of 100 articles**

| Step | Tokens | Model | Cost |
|------|----------|------|------|
| 12-field segmented extraction (dual model) | 12K | DeepSeek-V3 + Qwen3-Max | ¥0.06/article |
| Full-text cross-validation | 3K | DeepSeek-V3 | ¥0.003/article |
| Supplementary extraction of critical fields (20%) | 2K | Qwen3-Max | ¥0.016/article (only 20% of articles) |
| **Total for 100 articles** | - | - | **¥7.9** |

**Quality gain**: accuracy 85% → 92%
**Cost increase**: ¥6 → ¥8 (+33%, for a marked quality gain)

---

### 3. V1.0 Acceptance Criteria

| Metric | Target | Verification |
|------|------|----------|
| Accuracy (human spot checks) | ≥ 92% | Random sample of 50 articles, expert-assessed |
| Information completeness | ≥ 95% | All 12 fields have valid results |
| Evidence-chain completeness | 100% | Every field has detailed evidence and a reasoning chain |
| Missed-information detection | ≥ 80% | Share of omissions caught by cross-validation |
| Logic-rule coverage | ≥ 80% | Rule-engine pass rate |
| Human-review queue | ≤ 25% | Share of articles needing human input |

---

## 🏆 V2.0 Stage (8 Weeks)

### Positioning

- **Accuracy target:** ≥ 96% (medical-grade)
- **Human-machine agreement:** Cohen's Kappa ≥ 0.90
- **Cost budget:** configured as needed (quality first)
- **Deliverable:** automated quality audits, meeting Cochrane publication standards
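The Cohen's Kappa target above measures human-machine agreement beyond what chance alone would produce. A minimal two-rater implementation for categorical labels, to make the metric concrete (the function name is illustrative):

```typescript
// Minimal Cohen's kappa for two raters over categorical labels,
// e.g. per-field judgments '完整' / '不完整' / '无法判断'.
function cohenKappa(ratingsA: string[], ratingsB: string[]): number {
  const n = ratingsA.length;
  if (n === 0 || n !== ratingsB.length) throw new Error('ratings must align');

  // Observed agreement p_o
  let agree = 0;
  for (let i = 0; i < n; i++) if (ratingsA[i] === ratingsB[i]) agree++;
  const po = agree / n;

  // Chance agreement p_e from each rater's label distribution
  const countBy = (xs: string[]) =>
    xs.reduce<Record<string, number>>((m, x) => ((m[x] = (m[x] ?? 0) + 1), m), {});
  const distA = countBy(ratingsA);
  const distB = countBy(ratingsB);
  let pe = 0;
  for (const label of new Set([...ratingsA, ...ratingsB])) {
    pe += ((distA[label] ?? 0) / n) * ((distB[label] ?? 0) / n);
  }

  // kappa = (p_o - p_e) / (1 - p_e)
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

// Perfect agreement yields kappa = 1
cohenKappa(['完整', '不完整', '完整', '无法判断'],
           ['完整', '不完整', '完整', '无法判断']); // → 1
```

Because kappa discounts chance agreement, a ≥ 0.90 target is substantially stricter than 90% raw agreement when one label dominates.
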

---

### 1. Medical-Grade Quality Assurance

#### 1.1 ✅ Three-Model Arbitration

**When critical fields conflict, bring in a third-party arbiter**:

```typescript
async function threeModelArbitration(
  conflict: FieldConflict,
  relevantContent: string
) {

  // Third-party arbiter: Claude-4.5 (high-quality model)
  const arbitrationPrompt = `
You are a Cochrane systematic-review expert. Two AI models disagree on the
same field; give an authoritative judgment from an evidence-based-medicine standpoint.

[Conflicting field]: ${conflict.field}

[Model A judgment]: ${conflict.modelA.assessment}
Evidence: ${conflict.modelA.evidence.quote}
Rationale: ${conflict.modelA.reasoning}
Confidence: ${conflict.modelA.confidence}

[Model B judgment]: ${conflict.modelB.assessment}
Evidence: ${conflict.modelB.evidence.quote}
Rationale: ${conflict.modelB.reasoning}
Confidence: ${conflict.modelB.confidence}

[Source text]:
${relevantContent}

[Arbitration task]:
1. Give your own judgment per the Cochrane RoB 2.0 standard
2. Analyze both models' judgments and say which is more accurate (or neither)
3. Cite the relevant clause of the Cochrane handbook in support
4. If you are still uncertain, state explicitly why human review is needed

[Output format]: JSON
`;

  const arbitrationResult = await llmService.chat(
    'claude-4.5',
    arbitrationPrompt
  );

  return {
    field: conflict.field,
    arbitrator: 'claude-4.5',
    finalJudgment: arbitrationResult.assessment,
    analysis: {
      modelAAccuracy: arbitrationResult.modelA_correct,
      modelBAccuracy: arbitrationResult.modelB_correct,
      correctModel: arbitrationResult.agree_with,
      cochraneCitation: arbitrationResult.cochrane_reference
    },
    confidence: arbitrationResult.confidence,
    stillNeedsHumanReview: arbitrationResult.confidence < 0.9
  };
}
```

**Cost control**:

- Triggered only on critical-field conflicts (an estimated 10-15% of articles)
- Per-arbitration cost: ¥0.02 (Claude-4.5)
- Extra cost for 100 articles: ¥2-3

---

#### 1.2 ✅ HITL Smart Triage

**Rule-based priority ordering**:

```typescript
function intelligentTriage(
  extractionResult: ExtractionResult,
  validationReport: ValidationReport,
  arbitrationResults?: ArbitrationResult[]
): TriageDecision {

  let priority = 0;
  let needReview = false;
  const reasons = [];

  // Rule 1: still inconsistent after three-model arbitration → top priority
  if (arbitrationResults?.some(a => a.stillNeedsHumanReview)) {
    priority = 100;
    needReview = true;
    reasons.push('Uncertainty remains after three-model arbitration');
  }

  // Rule 2: quality issues in critical fields → high priority
  const criticalIssues = validationReport.violations.filter(v =>
    v.severity === 'error' &&
    FIELD_IMPORTANCE.critical.includes(v.field)
  );
  if (criticalIssues.length > 0) {
    priority = Math.max(priority, 90);
    needReview = true;
    reasons.push(`Quality issues in critical fields: ${criticalIssues.map(i => i.field).join(', ')}`);
  }

  // Rule 3: RCT study → medium priority (higher quality bar)
  if (extractionResult.研究设计.includes('RCT')) {
    priority = Math.max(priority, 70);
    // An RCT only needs review when confidence is low
    if (extractionResult.overallConfidence < 0.9) {
      needReview = true;
      reasons.push('RCT study with overall confidence below 0.9');
    }
  }

  // Rule 4: critical outcome (mortality) → high priority
  if (extractionResult.结局指标.includes('死亡') ||
      extractionResult.结局指标.includes('mortality')) {
    priority = Math.max(priority, 80);
    if (extractionResult.结果完整性 !== '完整') {
      needReview = true;
      reasons.push('Mortality is a key outcome but outcome completeness is questionable');
    }
  }

  // Rule 5: high confidence and no conflicts → auto-pass
  if (extractionResult.overallConfidence > 0.95 &&
      validationReport.violations.length === 0 &&
      !arbitrationResults) {
    priority = 10;
    needReview = false;
    reasons.push('High-quality extraction; no human review needed');
  }

  // Rule 6: published in a top journal → lower review priority
  const topJournals = ['NEJM', 'Lancet', 'JAMA', 'BMJ'];
  if (topJournals.some(j => extractionResult.metadata.journal?.includes(j))) {
    priority = Math.max(0, priority - 20);
    reasons.push('Published in a top journal; methodological quality is usually high');
  }

  return {
    priority,
    needReview,
    reasons,
    estimatedReviewTime: estimateReviewTime(extractionResult, needReview),
    reviewDeadline: calculateDeadline(priority)
  };
}
```
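`intelligentTriage` ends with calls to `estimateReviewTime` and `calculateDeadline`, which are not defined above. A hypothetical sketch of the deadline side, mapping the 0-100 priority score onto the tiered deadlines (2 h / 24 h / 48 h) used by the MVP triage queues; the priority thresholds are assumptions:

```typescript
// Hypothetical sketch of calculateDeadline: maps the 0-100 priority score
// onto the tiered review deadlines from the MVP triage queues.
// The >=90 / >=70 thresholds are illustrative assumptions.
function calculateDeadline(priority: number, now: Date = new Date()): Date {
  let hours: number;
  if (priority >= 90) hours = 2;        // urgent: review within 2 h
  else if (priority >= 70) hours = 24;  // important: review within 24 h
  else hours = 48;                      // normal: review within 48 h
  return new Date(now.getTime() + hours * 3600 * 1000);
}
```
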

---

#### 1.3 ✅ Automated Quality Audits

**Periodic batch spot checks (10%) with auto-generated quality reports**:

```typescript
// Weekly automated audit
async function weeklyQualityAudit(
  startDate: Date,
  endDate: Date
): Promise<QualityAuditReport> {

  // 1. Fetch the week's extraction results
  const weeklyExtractions = await db.fulltextScreeningResults.findMany({
    where: {
      createdAt: { gte: startDate, lte: endDate }
    }
  });

  // 2. Randomly sample 10%
  const sampleSize = Math.ceil(weeklyExtractions.length * 0.1);
  const sample = randomSample(weeklyExtractions, sampleSize);

  // 3. Have humans review the sample
  const humanReviews = await requestHumanReview(sample);

  // 4. Compute quality metrics
  const metrics = {
    accuracy: calculateAccuracy(sample, humanReviews),
    humanMachineAgreement: calculateCohenKappa(sample, humanReviews),
    falsePositiveRate: calculateFalsePositiveRate(sample, humanReviews),
    falseNegativeRate: calculateFalseNegativeRate(sample, humanReviews),

    // Per-field accuracy
    fieldAccuracy: FIELD_LIST.map(field => ({
      field,
      accuracy: calculateFieldAccuracy(field, sample, humanReviews)
    }))
  };

  // 5. Compare model performance
  const modelPerformance = {
    'deepseek-v3': analyzeModelPerformance('deepseek-v3', sample, humanReviews),
    'qwen-max': analyzeModelPerformance('qwen-max', sample, humanReviews),
    'claude-4.5': analyzeModelPerformance('claude-4.5', sample, humanReviews)
  };

  // 6. Issue analysis
  const issues = identifyCommonIssues(sample, humanReviews);

  // 7. Improvement recommendations
  const recommendations = generateRecommendations(metrics, issues);

  return {
    period: { start: startDate, end: endDate },
    totalExtractions: weeklyExtractions.length,
    sampledExtractions: sampleSize,
    metrics,
    modelPerformance,
    issues,
    recommendations,
    generatedAt: new Date()
  };
}

// Automatically spot recurring issues
function identifyCommonIssues(
  sample: Extraction[],
  humanReviews: HumanReview[]
): Issue[] {

  const issues = [];

  // Issue 1: a field with a high error rate
  for (const field of FIELD_LIST) {
    const fieldErrors = countFieldErrors(field, sample, humanReviews);
    if (fieldErrors / sample.length > 0.15) { // error rate > 15%
      issues.push({
        type: 'high_field_error_rate',
        field,
        errorRate: fieldErrors / sample.length,
        examples: getErrorExamples(field, sample, humanReviews, 3),
        recommendation: `Refine the prompt template or few-shot cases for field "${field}"`
      });
    }
  }

  // Issue 2: a study type with a high error rate
  const studyTypeErrors = analyzeByStudyType(sample, humanReviews);
  for (const [studyType, errorRate] of Object.entries(studyTypeErrors)) {
    if (errorRate > 0.15) {
      issues.push({
        type: 'high_study_type_error_rate',
        studyType,
        errorRate,
        recommendation: `Add few-shot cases for "${studyType}" studies`
      });
    }
  }

  // Issue 3: a model that underperforms
  const modelErrors = analyzeByModel(sample, humanReviews);
  for (const [model, errorRate] of Object.entries(modelErrors)) {
    if (errorRate > 0.15) {
      issues.push({
        type: 'model_underperformance',
        model,
        errorRate,
        recommendation: `Consider tuning or replacing model "${model}"`
      });
    }
  }

  return issues;
}
```
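`weeklyQualityAudit` above calls `randomSample` without defining it. A sketch using a partial Fisher-Yates shuffle, which gives uniform sampling without replacement:

```typescript
// Sketch of the randomSample helper used in weeklyQualityAudit:
// uniform sampling without replacement via a partial Fisher-Yates shuffle.
function randomSample<T>(items: T[], k: number): T[] {
  const pool = [...items]; // copy so the input is not mutated
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i++) {
    // Swap a uniformly chosen element from the unsampled tail into slot i
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```

Shuffling only the first `k` positions keeps the cost at O(k) swaps, which matters little at a 10% weekly sample but keeps the helper reusable.
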
|
||
|
||
**质量报表示例**:
|
||
|
||
```markdown
|
||
# 全文复筛质量审计报告
|
||
|
||
**审计周期**:2025-11-15 至 2025-11-22
|
||
**总提取数**:148篇
|
||
**抽样数**:15篇(10.1%)
|
||
|
||
## 整体质量指标
|
||
|
||
| 指标 | 本周 | 上周 | 趋势 |
|
||
|------|------|------|------|
|
||
| 准确率 | 94.7% | 93.2% | ↑ +1.5% |
|
||
| Cohen's Kappa | 0.89 | 0.87 | ↑ +0.02 |
|
||
| 假阳性率 | 3.1% | 4.2% | ↓ -1.1% |
|
||
| 假阴性率 | 2.2% | 2.6% | ↓ -0.4% |
|
||
|
||
## 分字段准确率
|
||
|
||
| 字段 | 准确率 | 状态 |
|
||
|------|--------|------|
|
||
| 研究设计 | 100% | ✅ 优秀 |
|
||
| 随机化方法 | 93.3% | ✅ 良好 |
|
||
| 盲法 | 86.7% | ⚠️ 需改进 |
|
||
| 基线可比性 | 100% | ✅ 优秀 |
|
||
| 结果完整性 | 93.3% | ✅ 良好 |
|
||
| ... | ... | ... |
|
||
|
||
## 模型性能对比
|
||
|
||
| 模型 | 准确率 | 平均置信度 | 处理时间 |
|
||
|------|--------|-----------|----------|
|
||
| DeepSeek-V3 | 92.1% | 0.87 | 45s |
|
||
| Qwen3-Max | 94.5% | 0.91 | 38s |
|
||
| Claude-4.5(仲裁) | 97.2% | 0.94 | 62s |
|
||
|
||
## 发现的问题
|
||
|
||
1. **字段"盲法"错误率偏高(13.3%)**
|
||
- 常见错误:将"单盲"误判为"完整"
|
||
- 原因分析:Prompt未明确区分单盲/双盲的质量差异
|
||
- 改进建议:更新Prompt,增加"单盲通常不足以防止检测偏倚"的说明
|
||
|
||
2. **队列研究提取准确率低于RCT(89% vs 96%)**
|
||
- 原因分析:队列研究的方法学描述更灵活,标准化程度低
|
||
- 改进建议:增加3个队列研究的Few-shot案例
|
||
|
||
## 改进建议
|
||
|
||
1. ✅ 立即执行:更新"盲法"字段Prompt模板
|
||
2. ⚡ 本周内:增加队列研究Few-shot案例库
|
||
3. 📅 下周:重新评估"盲法"字段准确率
|
||
|
||
## 下周目标
|
||
|
||
- 准确率:≥ 95%
|
||
- Cohen's Kappa:≥ 0.90
|
||
- "盲法"字段准确率:≥ 93%
|
||
```

---

#### 1.4 ✅ Prompt版本管理

**Git管理提示词模板,支持A/B测试**:

```
backend/prompts/asl/fulltext_screening/
├── changelog.md
├── fields/
│   ├── 随机化方法/
│   │   ├── v1.0.0-basic.md
│   │   ├── v1.1.0-with-examples.md
│   │   ├── v1.2.0-cot.md
│   │   └── v1.3.0-enhanced-cochrane.md   ← 当前版本
│   ├── 盲法/
│   │   ├── v1.0.0-basic.md
│   │   ├── v1.1.0-clarify-single-double.md   ← 改进版
│   │   └── ...
│   └── ...
└── tests/
    └── benchmark_results.json
```

**数据库记录**:

```prisma
model PromptVersion {
  id        String @id @default(uuid())

  field     String // "随机化方法"
  version   String // "v1.3.0"
  content   String @db.Text
  changelog String // "增强Cochrane标准描述,添加5个Few-shot案例"

  // 性能指标(A/B测试结果)
  accuracy      Float? // 0.947
  usageCount    Int    @default(0)
  avgConfidence Float?

  // 状态
  isActive       Boolean @default(false)
  isExperimental Boolean @default(false)

  createdAt     DateTime @default(now())
  deactivatedAt DateTime?

  @@map("asl_prompt_versions")
}
```
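
配合上述数据模型,`getPromptVersion(field, channel)` 的版本选择规则可以这样示意(纯函数写法,便于单测;"experimental 无激活实验版时回退稳定版"是本文假设的策略,实际以产品规则为准):

```typescript
// 示意实现(假设的选择规则):
// 'stable' 取当前激活的稳定版;'experimental' 优先取激活的实验版,无则回退稳定版
interface PromptVersionRecord {
  field: string;
  version: string;
  content: string;
  isActive: boolean;
  isExperimental: boolean;
}

function selectPromptVersion(
  versions: PromptVersionRecord[],
  field: string,
  channel: 'stable' | 'experimental'
): PromptVersionRecord | undefined {
  // 只在该字段的激活版本中选择
  const active = versions.filter(v => v.field === field && v.isActive);
  if (channel === 'experimental') {
    return active.find(v => v.isExperimental) ?? active.find(v => !v.isExperimental);
  }
  return active.find(v => !v.isExperimental);
}
```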

**A/B测试**:

```typescript
// 20%流量使用新版Prompt
async function extractFieldWithABTest(
  field: string,
  content: string
) {
  const isExperimentGroup = Math.random() < 0.2;

  const promptVersion = isExperimentGroup
    ? await getPromptVersion(field, 'experimental')
    : await getPromptVersion(field, 'stable');

  const result = await llmService.chat(
    'deepseek-v3',
    promptVersion.content,
    content
  );

  // 记录使用
  await trackPromptUsage({
    field,
    version: promptVersion.version,
    isExperiment: isExperimentGroup,
    result
  });

  return result;
}

// 每周分析A/B测试结果
async function analyzeABTest(field: string): Promise<ABTestReport> {
  const stableResults = await getPromptUsageStats(field, 'stable');
  const experimentResults = await getPromptUsageStats(field, 'experimental');

  const improvement = {
    accuracy: experimentResults.accuracy - stableResults.accuracy,
    confidence: experimentResults.avgConfidence - stableResults.avgConfidence,
    processingTime: experimentResults.avgTime - stableResults.avgTime
  };

  // 统计显著性检验
  const isSignificant = performTTest(stableResults, experimentResults);

  return {
    field,
    stableVersion: stableResults.version,
    experimentVersion: experimentResults.version,
    sampleSize: {
      stable: stableResults.count,
      experiment: experimentResults.count
    },
    improvement,
    isSignificant,
    recommendation: isSignificant && improvement.accuracy > 0.02
      ? 'promote_to_stable'  // 提升为稳定版
      : 'continue_testing'   // 继续测试
  };
}
```
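
上面代码中的 `performTTest` 未给出实现。由于准确率本质上是二项比例,样本量较大时可用双比例 z 检验近似,以下是一个示意实现(α=0.05、双侧检验为本文假设的默认口径):

```typescript
// performTTest 的一个示意实现:双比例 z 检验
// 原假设:两组准确率相同;z 超过 1.96(α=0.05,双侧)则认为差异显著
interface GroupStats {
  accuracy: number; // 准确率(比例),如 0.92
  count: number;    // 样本量
}

function performTTest(a: GroupStats, b: GroupStats): boolean {
  // 合并比例(原假设下两组共享同一准确率)
  const pooled = (a.accuracy * a.count + b.accuracy * b.count) / (a.count + b.count);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.count + 1 / b.count));
  if (se === 0) return false; // 两组完全一致(如均为 100%),无法拒绝原假设

  const z = Math.abs(a.accuracy - b.accuracy) / se;
  return z > 1.96;
}
```

样本量不足时(每组 <30 或极端比例)应改用 Fisher 精确检验,这里为简化未展开。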

---

### 二、V2.0成本预算

**场景:100篇全文复筛(高质量项目)**

| 环节 | Token消耗 | 模型 | 成本 |
|------|----------|------|------|
| 12字段分段提取(双模型) | 12K | DeepSeek-V3 + Qwen3-Max | ¥0.06/篇 |
| 全文交叉验证 | 3K | DeepSeek-V3 | ¥0.003/篇 |
| 关键字段三模型仲裁(15%) | 3K | Claude-4.5 | ¥0.03/篇(仅15%) |
| 质量审计(10%抽查) | 2K | 人工 | 10分钟/篇 |
| **100篇总成本** | - | - | **¥10 + 人工成本** |

**质量提升**:准确率 92% → 96%
**成本增加**:¥8 → ¥10(+25%,但达到医学级标准)
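
按上表的单篇成本口径可做一个粗略估算(假设:仲裁一行的 ¥0.03/篇 已按 15% 触发率折算为全部文献的均摊成本,且不计人工审计工时):

```typescript
// 粗略成本估算:按表中单篇 LLM 成本累加(不含人工审计)
// 假设:"¥0.03/篇(仅15%)"一行已是均摊到全部文献的口径
const costPerPaper = {
  dualModelExtraction: 0.06,   // 12字段分段提取(双模型)
  fulltextVerification: 0.003, // 全文交叉验证
  arbitration: 0.03            // 关键字段三模型仲裁(均摊)
};

const perPaper = Object.values(costPerPaper).reduce((sum, c) => sum + c, 0);
const total100 = perPaper * 100;

// 约 ¥0.093/篇,100篇约 ¥9.3,与表中 "¥10 + 人工成本" 的量级一致
console.log(perPaper.toFixed(3), total100.toFixed(1));
```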

---

### 三、V2.0验收标准

| 指标 | 目标 | 验证方法 |
|------|------|----------|
| 准确率(专家评估) | ≥ 96% | 人工抽查100篇 |
| 人机一致性 | Cohen's Kappa ≥ 0.90 | 统计分析 |
| 假阳性率 | ≤ 3% | 统计分析 |
| 假阴性率 | ≤ 2% | 统计分析 |
| 证据链完整性 | 100% | 自动检查 |
| 自动化审计 | 每周1次 | 系统报表 |
| Prompt版本管理 | 100% | Git历史追踪 |
| 符合Cochrane标准 | ≥ 95% | 专家认证 |
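
验收标准中的人机一致性指标 Cohen's Kappa 按 κ = (Pₒ − Pₑ) / (1 − Pₑ) 计算,其中 Pₒ 为观察一致率、Pₑ 为按边际分布估计的随机一致率。以下是二分类(纳入/排除)场景下的一个示意实现:

```typescript
// Cohen's Kappa:kappa = (Po - Pe) / (1 - Pe)
// Po = 观察一致率;Pe = 由两组标注各自的边际分布得出的随机一致率
function cohensKappa(aiLabels: string[], humanLabels: string[]): number {
  if (aiLabels.length !== humanLabels.length || aiLabels.length === 0) {
    throw new Error('两组标注长度必须一致且非空');
  }
  const n = aiLabels.length;
  const labels = Array.from(new Set([...aiLabels, ...humanLabels]));

  let agree = 0;
  const aiCount: Record<string, number> = {};
  const humanCount: Record<string, number> = {};
  for (let i = 0; i < n; i++) {
    if (aiLabels[i] === humanLabels[i]) agree++;
    aiCount[aiLabels[i]] = (aiCount[aiLabels[i]] ?? 0) + 1;
    humanCount[humanLabels[i]] = (humanCount[humanLabels[i]] ?? 0) + 1;
  }

  const po = agree / n;
  let pe = 0;
  for (const l of labels) {
    pe += ((aiCount[l] ?? 0) / n) * ((humanCount[l] ?? 0) / n);
  }
  if (pe === 1) return 1; // 两组全部打同一标签的退化情形
  return (po - pe) / (1 - pe);
}
```

对 12 字段的三级判定(完整/部分/缺失),同一函数直接传入三类标签即可;若需对"部分"与"缺失"的分歧降权,应改用加权 Kappa。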

---

## 📊 三阶段对比总结

| 维度 | MVP | V1.0 | V2.0 |
|------|-----|------|------|
| **准确率** | 85% | 92% | 96% |
| **核心策略** | Nougat+分段提取 | +全文验证+逻辑规则 | +三模型仲裁+审计 |
| **证据链** | 基本引用 | 完整定位 | 审计级日志 |
| **质量控制** | 双模型验证 | 医学逻辑引擎 | HITL+自动审计 |
| **成本/100篇** | ¥6 | ¥8 | ¥10 |
| **开发周期** | 3周 | 5周 | 8周 |
| **适用场景** | 快速验证 | 常规项目 | Cochrane发表 |

---

## 🔄 实施路径

### 阶段1:MVP开发(Week 1-3)

**Week 1**:基础架构
- [x] PDF存储服务(已完成)✅
- [ ] Nougat提取+章节解析
- [ ] 12字段路由表设计
- [ ] 基础Prompt模板(12个字段)

**Week 2**:核心功能
- [ ] 分段并行提取
- [ ] 双模型调用
- [ ] 字段级冲突检测
- [ ] 基础证据链

**Week 3**:前端+测试
- [ ] 前端工作台
- [ ] 冲突对比视图
- [ ] 人工复核界面
- [ ] 功能测试+准确率评估

### 阶段2:V1.0增强(Week 4-8)

**Week 4-5**:质量提升
- [ ] Cochrane标准Prompt增强
- [ ] Few-shot医学案例库(每字段3-5个)
- [ ] CoT推理增强

**Week 6-7**:验证机制
- [ ] 全文交叉验证
- [ ] 医学逻辑规则引擎
- [ ] 完整证据链

**Week 8**:优化+文档
- [ ] 性能优化
- [ ] A/B测试
- [ ] 文档完善

### 阶段3:V2.0完善(Week 9-16)

**Week 9-11**:高级功能
- [ ] 三模型仲裁
- [ ] HITL智能分流
- [ ] Prompt版本管理+A/B测试

**Week 12-14**:质量审计
- [ ] 自动审计系统
- [ ] 质量报表
- [ ] 异常检测

**Week 15-16**:医学专家验证
- [ ] Cochrane专家评审
- [ ] 全量测试
- [ ] 发布文档

---

## 📚 相关文档

- [标题摘要初筛质量保障策略](./06-质量保障与可追溯策略.md)
- [全文复筛开发计划](../04-开发计划/04-全文复筛开发计划.md)
- [数据库设计](./01-数据库设计.md)
- [API设计规范](./02-API设计规范.md)
- [云原生开发规范](../../../04-开发规范/08-云原生开发规范.md)

---

**更新日志**:
- 2025-11-22: 创建文档,定义全文复筛三阶段质量保障策略
  - 基于Nougat结构化+分段提取+全文验证的技术方案
  - 参考Cochrane RoB 2.0标准设计专业Prompt模板
  - 强调完整证据链和可追溯性