docs(asl): Complete Tool 3 extraction workbench V2.0 development plan (v1.5)
ASL Tool 3 Development Plan: - Architecture blueprint v1.5 (6 rounds of architecture review, 13 red lines) - M1/M2/M3 sprint checklists (Skeleton Pipeline / HITL Workbench / Dynamic Template Engine) - Code patterns cookbook (9 chapters: Fan-out, Prompt engineering, ACL, SSE dual-track, etc.) - Key patterns: Fan-out with Last Child Wins, Optimistic Locking, teamConcurrency throttling - PKB ACL integration (anti-corruption layer), MinerU Cache-Aside, NOTIFY/LISTEN cross-pod SSE - Data consistency snapshot for long-running extraction tasks Platform capability: - Add distributed Fan-out task pattern development guide (7 patterns + 10 anti-patterns) - Add system-level async architecture risk analysis blueprint - Add PDF table extraction engine design and usage guide (MinerU integration) - Add table extraction source code (TableExtractionManager + MinerU engine) Documentation updates: - Update ASL module status with Tool 3 V2.0 plan readiness - Update system status document (v6.2) with latest milestones - Add V2.0 product requirements, prototypes, and data dictionary specs - Add architecture review documents (4 rounds of review feedback) - Add test PDF files for extraction validation Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
471
docs/02-通用能力层/02-文档处理引擎/04-PDF表格提取引擎使用指南.md
Normal file
471
docs/02-通用能力层/02-文档处理引擎/04-PDF表格提取引擎使用指南.md
Normal file
@@ -0,0 +1,471 @@
|
||||
# PDF 表格提取引擎使用指南
|
||||
|
||||
> **文档版本**: v1.0
|
||||
> **最后更新**: 2026-02-23
|
||||
> **状态**: ✅ 已测试通过(MinerU 引擎)
|
||||
> **目标读者**: 业务模块开发者(ASL 全文复筛、系统综述数据提取等)
|
||||
> **前置条件**: `backend/.env` 中已配置 `MINERU_API_TOKEN`
|
||||
|
||||
---
|
||||
|
||||
## 快速开始
|
||||
|
||||
### 5 秒上手
|
||||
|
||||
```typescript
|
||||
import { getTableExtractionManager } from '../common/document/tableExtraction/index.js';
|
||||
|
||||
const manager = getTableExtractionManager();
|
||||
const result = await manager.extractTables(pdfBuffer, 'paper.pdf');
|
||||
|
||||
for (const table of result.tables) {
|
||||
console.log(`${table.title}: ${table.rows.length} 行 × ${table.headers.length} 列`);
|
||||
}
|
||||
```
|
||||
|
||||
### 完整调用示例
|
||||
|
||||
```typescript
|
||||
import fs from 'fs';
|
||||
import { getTableExtractionManager } from '../common/document/tableExtraction/index.js';
|
||||
|
||||
const manager = getTableExtractionManager();
|
||||
|
||||
// 读取 PDF 文件
|
||||
const pdf = fs.readFileSync('/path/to/medical-paper.pdf');
|
||||
|
||||
// 提取表格(自动使用默认引擎 MinerU)
|
||||
const result = await manager.extractTables(pdf, 'medical-paper.pdf', {
|
||||
keepRaw: true, // 保留原始 Markdown
|
||||
});
|
||||
|
||||
console.log(`引擎: ${result.engine}`); // "mineru"
|
||||
console.log(`耗时: ${result.duration}ms`); // ~6000-20000ms
|
||||
console.log(`表格数: ${result.tables.length}`);
|
||||
|
||||
// 遍历每个表格
|
||||
for (const table of result.tables) {
|
||||
console.log(`\n[${table.title}]`);
|
||||
console.log(` 列: ${table.headers.join(' | ')}`);
|
||||
console.log(` 行数: ${table.rows.length}`);
|
||||
console.log(` 合并单元格: ${table.mergedCells.length}`);
|
||||
|
||||
// 访问具体数据
|
||||
for (const row of table.rows) {
|
||||
// row 是 string[],与 headers 一一对应
|
||||
console.log(` ${row.join(' | ')}`);
|
||||
}
|
||||
|
||||
// 原始 HTML(可直接渲染到前端)
|
||||
if (table.rawHtml) {
|
||||
console.log(` [HTML] ${table.rawHtml.substring(0, 100)}...`);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 核心概念
|
||||
|
||||
### 架构设计
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────┐
|
||||
│ 业务代码(ASL / 系统综述 / Meta 分析) │
|
||||
│ │
|
||||
│ manager.extractTables(pdf, filename) │
|
||||
│ → 返回 ExtractedTable[] │
|
||||
└──────────────────────┬─────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────▼─────────────────────────────┐
|
||||
│ TableExtractionManager (统一入口) │
|
||||
│ │
|
||||
│ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
|
||||
│ │ MinerU (VLM) │ │ Qwen-VL │ │ Paddle │ │
|
||||
│ │ ✅ 已接入 │ │ 📋 待接入 │ │ 📋 待接入 │ │
|
||||
│ └──────────────┘ └──────────────┘ └──────────┘ │
|
||||
└────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**核心原则:使用者不需要关心底层引擎。** 提交 PDF → 获取结构化表格。
|
||||
|
||||
### 数据结构
|
||||
|
||||
```typescript
|
||||
// 提取结果
|
||||
interface ExtractionResult {
|
||||
tables: ExtractedTable[]; // 表格列表
|
||||
engine: string; // 使用的引擎名
|
||||
duration: number; // 耗时 (ms)
|
||||
pageCount?: number; // PDF 页数
|
||||
fullMarkdown?: string; // 完整 Markdown (需 keepRaw: true)
|
||||
}
|
||||
|
||||
// 单个表格
|
||||
interface ExtractedTable {
|
||||
title: string; // "Table 1 Baseline characteristics"
|
||||
headers: string[]; // 表头列名
|
||||
rows: string[][]; // 数据行(二维数组)
|
||||
mergedCells: MergedCell[]; // 合并单元格
|
||||
footnotes: string[]; // 脚注
|
||||
pageNumber?: number; // 页码
|
||||
rawHtml?: string; // 原始 HTML
|
||||
rawMarkdown?: string; // 原始 Markdown
|
||||
}
|
||||
|
||||
// 合并单元格
|
||||
interface MergedCell {
|
||||
row: number; // 起始行 (0-based)
|
||||
col: number; // 起始列 (0-based)
|
||||
rowSpan: number;
|
||||
colSpan: number;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## API 参考
|
||||
|
||||
### `getTableExtractionManager()`
|
||||
|
||||
获取全局管理器单例。首次调用时自动注册 MinerU 引擎。
|
||||
|
||||
```typescript
|
||||
import { getTableExtractionManager } from '../common/document/tableExtraction/index.js';
|
||||
|
||||
const manager = getTableExtractionManager();
|
||||
```
|
||||
|
||||
### `manager.extractTables(pdf, filename, options?)`
|
||||
|
||||
提取 PDF 中的表格。
|
||||
|
||||
| 参数 | 类型 | 必填 | 说明 |
|
||||
|------|------|------|------|
|
||||
| `pdf` | `Buffer` | ✅ | PDF 文件内容 |
|
||||
| `filename` | `string` | ✅ | 文件名(含 .pdf 后缀) |
|
||||
| `options.language` | `'zh' \| 'en' \| 'auto'` | ❌ | 语言提示 |
|
||||
| `options.pages` | `number[]` | ❌ | 指定页码 |
|
||||
| `options.keepRaw` | `boolean` | ❌ | 保留原始 Markdown |
|
||||
| `options.engine` | `EngineType` | ❌ | 覆盖默认引擎 |
|
||||
|
||||
返回:`Promise<ExtractionResult>`
|
||||
|
||||
### `manager.availableEngines()`
|
||||
|
||||
返回已注册的引擎名称列表。
|
||||
|
||||
```typescript
|
||||
console.log(manager.availableEngines()); // ['mineru']
|
||||
```
|
||||
|
||||
### `manager.getEngine(name?)`
|
||||
|
||||
获取指定引擎实例。
|
||||
|
||||
### `manager.setDefault(name)`
|
||||
|
||||
切换默认引擎。
|
||||
|
||||
---
|
||||
|
||||
## 实战场景
|
||||
|
||||
### 场景 1:ASL 全文复筛 — 提取基线特征表
|
||||
|
||||
```typescript
|
||||
import { getTableExtractionManager } from '../common/document/tableExtraction/index.js';
|
||||
|
||||
async function extractBaselineTable(pdfBuffer: Buffer, filename: string) {
|
||||
const manager = getTableExtractionManager();
|
||||
const result = await manager.extractTables(pdfBuffer, filename);
|
||||
|
||||
// 找到 "Table 1" 或包含 "Baseline" 的表格
|
||||
const baseline = result.tables.find(
|
||||
(t) =>
|
||||
/table\s*1\b/i.test(t.title) ||
|
||||
/baseline/i.test(t.title),
|
||||
);
|
||||
|
||||
if (baseline) {
|
||||
return {
|
||||
title: baseline.title,
|
||||
columns: baseline.headers,
|
||||
data: baseline.rows,
|
||||
hasMergedCells: baseline.mergedCells.length > 0,
|
||||
};
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
```
|
||||
|
||||
### 场景 2:系统综述 — 提取所有表格为 JSON
|
||||
|
||||
```typescript
|
||||
async function extractAllTablesAsJson(pdfBuffer: Buffer, filename: string) {
|
||||
const manager = getTableExtractionManager();
|
||||
const result = await manager.extractTables(pdfBuffer, filename);
|
||||
|
||||
return result.tables.map((table) => ({
|
||||
title: table.title,
|
||||
headers: table.headers,
|
||||
rows: table.rows.map((row) => {
|
||||
const obj: Record<string, string> = {};
|
||||
table.headers.forEach((h, i) => {
|
||||
obj[h] = row[i] || '';
|
||||
});
|
||||
return obj;
|
||||
}),
|
||||
}));
|
||||
}
|
||||
|
||||
// 输出示例:
|
||||
// [
|
||||
// {
|
||||
// title: "Table 1 Baseline characteristics",
|
||||
// headers: ["", "", "EGb 761®(N=200)", "Placebo(N=202)", "p-value"],
|
||||
// rows: [
|
||||
// { "": "Sex female", "": "", "EGb 761®(N=200)": "139 (69.5)", ... },
|
||||
// ...
|
||||
// ]
|
||||
// }
|
||||
// ]
|
||||
```
|
||||
|
||||
### 场景 3:Meta 分析 — 提取效应值
|
||||
|
||||
```typescript
|
||||
async function extractEffectSizes(pdfBuffer: Buffer, filename: string) {
|
||||
const manager = getTableExtractionManager();
|
||||
const result = await manager.extractTables(pdfBuffer, filename);
|
||||
|
||||
// 找结局指标表
|
||||
const outcomeTable = result.tables.find(
|
||||
(t) => /outcome|result|efficacy|effect/i.test(t.title),
|
||||
);
|
||||
|
||||
if (!outcomeTable) return [];
|
||||
|
||||
return outcomeTable.rows.map((row) => ({
|
||||
measure: row[0],
|
||||
treatment: row[1],
|
||||
control: row[2],
|
||||
pValue: row[3],
|
||||
}));
|
||||
}
|
||||
```
|
||||
|
||||
### 场景 4:在 API 路由中使用
|
||||
|
||||
```typescript
|
||||
import { getTableExtractionManager } from '../../../common/document/tableExtraction/index.js';
|
||||
|
||||
async function handleTableExtraction(request: FastifyRequest, reply: FastifyReply) {
|
||||
const data = await request.file();
|
||||
if (!data) return reply.status(400).send({ error: 'No file uploaded' });
|
||||
|
||||
const buffer = await data.toBuffer();
|
||||
const manager = getTableExtractionManager();
|
||||
const result = await manager.extractTables(buffer, data.filename);
|
||||
|
||||
return reply.send({
|
||||
success: true,
|
||||
engine: result.engine,
|
||||
duration: result.duration,
|
||||
tables: result.tables.map((t) => ({
|
||||
title: t.title,
|
||||
headers: t.headers,
|
||||
rowCount: t.rows.length,
|
||||
rows: t.rows,
|
||||
mergedCells: t.mergedCells,
|
||||
})),
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 环境配置
|
||||
|
||||
### 必需环境变量
|
||||
|
||||
```bash
|
||||
# backend/.env
|
||||
|
||||
# MinerU Cloud API(必需)
|
||||
MINERU_API_TOKEN=your_mineru_api_token
|
||||
MINERU_API_BASE=https://mineru.net/api/v4
|
||||
MINERU_MODEL_VERSION=vlm
|
||||
```
|
||||
|
||||
### 获取 MinerU Token
|
||||
|
||||
1. 注册 [OpenDataLab](https://sso.openxlab.org.cn/login)
|
||||
2. 登录 [MinerU 控制台](https://mineru.net/)
|
||||
3. 个人中心 → API Token → 复制
|
||||
4. 写入 `backend/.env` 的 `MINERU_API_TOKEN`
|
||||
|
||||
### 免费额度
|
||||
|
||||
| 项目 | 限制 |
|
||||
|------|------|
|
||||
| 日解析页数 | 2000 页 |
|
||||
| 单文件大小 | ≤ 200 MB |
|
||||
| 单文件页数 | ≤ 600 页 |
|
||||
|
||||
小型综述 20 篇 (200 页) → 1 天免费完成。大型综述 500 篇 (5000 页) → 分 3 天免费完成。
|
||||
|
||||
---
|
||||
|
||||
## 运行测试
|
||||
|
||||
```bash
|
||||
cd backend
|
||||
|
||||
# 测试指定 PDF(推荐)
|
||||
npx tsx src/tests/test-table-extraction.ts "../docs/03-业务模块/ASL-AI智能文献/05-测试文档/PDF/Herrschaft 2012.pdf"
|
||||
|
||||
# 自动选取测试目录中的第一个 PDF
|
||||
npx tsx src/tests/test-table-extraction.ts
|
||||
```
|
||||
|
||||
### 期望输出
|
||||
|
||||
```
|
||||
========================================
|
||||
PDF 表格提取引擎 — 集成测试
|
||||
========================================
|
||||
|
||||
文件: Herrschaft 2012.pdf
|
||||
引擎: mineru
|
||||
耗时: 6.5s
|
||||
检出表格: 3 个
|
||||
|
||||
────────────────────────────────────────
|
||||
表格 1: Table 1 Baseline characteristics...
|
||||
列数: 5
|
||||
行数: 18
|
||||
合并单元格: 2
|
||||
表头: ... | EGb 761®(N = 200) | Placebo(N = 202) | p-value
|
||||
|
||||
表格 2: Table 2
|
||||
列数: 4
|
||||
行数: 10
|
||||
|
||||
表格 3: Table 3 Adverse events...
|
||||
列数: 6
|
||||
行数: 7
|
||||
合并单元格: 4
|
||||
|
||||
测试通过
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 文件清单
|
||||
|
||||
```
|
||||
backend/src/common/document/tableExtraction/
|
||||
├── types.ts # 统一接口 + 类型定义
|
||||
├── htmlTableParser.ts # HTML <table> → ExtractedTable 解析器
|
||||
├── TableExtractionManager.ts # 引擎管理器(使用者入口)
|
||||
├── engines/
|
||||
│ └── MinerUEngine.ts # MinerU Cloud API 适配器
|
||||
└── index.ts # 统一导出 + 全局单例
|
||||
|
||||
backend/src/tests/
|
||||
└── test-table-extraction.ts # 集成测试脚本
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 扩展新引擎
|
||||
|
||||
添加新引擎只需 3 步:
|
||||
|
||||
### Step 1: 实现接口
|
||||
|
||||
```typescript
|
||||
// engines/Qwen3VLEngine.ts
|
||||
import type { ITableExtractionEngine, ExtractionOptions, ExtractionResult } from '../types.js';
|
||||
|
||||
export class Qwen3VLEngine implements ITableExtractionEngine {
|
||||
readonly name = 'qwen3vl';
|
||||
readonly displayName = 'Qwen3-VL 多模态';
|
||||
|
||||
async extractTables(
|
||||
pdf: Buffer,
|
||||
filename: string,
|
||||
options?: ExtractionOptions,
|
||||
): Promise<ExtractionResult> {
|
||||
// 实现提取逻辑 ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Step 2: 注册引擎
|
||||
|
||||
```typescript
|
||||
// index.ts 中添加
|
||||
import { Qwen3VLEngine } from './engines/Qwen3VLEngine.js';
|
||||
|
||||
// 在 getTableExtractionManager() 中
|
||||
if (process.env.QWEN3VL_API_KEY) {
|
||||
_instance.register(new Qwen3VLEngine());
|
||||
}
|
||||
```
|
||||
|
||||
### Step 3: 使用
|
||||
|
||||
```typescript
|
||||
const manager = getTableExtractionManager();
|
||||
|
||||
// 显式指定引擎
|
||||
const result = await manager.extractTables(pdf, 'paper.pdf', {
|
||||
engine: 'qwen3vl',
|
||||
});
|
||||
|
||||
// 或切换默认引擎
|
||||
manager.setDefault('qwen3vl');
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 常见问题
|
||||
|
||||
### Q: 提取耗时多久?
|
||||
|
||||
MinerU Cloud API 通常 5-20 秒(取决于 PDF 页数和云端负载)。首次请求可能较慢(云端冷启动),后续请求更快。
|
||||
|
||||
### Q: 没有检出表格?
|
||||
|
||||
1. 确认 PDF 中确实包含表格(扫描件图片中的表格也能识别)
|
||||
2. 检查 `fullMarkdown` 输出中是否有 `<table>` 标签
|
||||
3. MinerU 对极端复杂的嵌套表格可能识别不完整
|
||||
|
||||
### Q: 合并单元格数据如何处理?
|
||||
|
||||
`ExtractedTable.mergedCells` 记录了所有合并单元格的位置和跨度。在 `rows` 中,被合并的单元格只在起始位置有值,其余位置为空字符串。
|
||||
|
||||
### Q: 和文档处理引擎 (pymupdf4llm) 的关系?
|
||||
|
||||
两者分别负责不同场景:
|
||||
|
||||
| 引擎 | 路径 | 场景 |
|
||||
|------|------|------|
|
||||
| 文档处理引擎 | `ExtractionClient.ts` | 全文文本提取(标题摘要初筛、PKB 入库) |
|
||||
| **PDF 表格提取引擎** | `tableExtraction/` | 结构化表格提取(全文复筛、Meta 分析) |
|
||||
|
||||
---
|
||||
|
||||
## 相关文档
|
||||
|
||||
- [PDF 表格提取引擎设计方案](./03-PDF表格提取引擎设计方案.md) — 架构设计 + 候选引擎 + 对比测试
|
||||
- [文档处理引擎使用指南](./02-文档处理引擎使用指南.md) — 全文文本提取 (pymupdf4llm)
|
||||
- [文档处理引擎 README](./README.md) — 引擎总览
|
||||
|
||||
---
|
||||
|
||||
**维护人**: 技术架构师
|
||||
**核心依赖**: `adm-zip` (ZIP 解析), `axios` (HTTP 请求)
|
||||
Reference in New Issue
Block a user