docs(asl): Complete Tool 3 extraction workbench V2.0 development plan (v1.5)

ASL Tool 3 Development Plan:
- Architecture blueprint v1.5 (6 rounds of architecture review, 13 red lines)
- M1/M2/M3 sprint checklists (Skeleton Pipeline / HITL Workbench / Dynamic Template Engine)
- Code patterns cookbook (9 chapters: Fan-out, Prompt engineering, ACL, SSE dual-track, etc.)
- Key patterns: Fan-out with Last Child Wins, Optimistic Locking, teamConcurrency throttling
- PKB ACL integration (anti-corruption layer), MinerU Cache-Aside, NOTIFY/LISTEN cross-pod SSE
- Data consistency snapshot for long-running extraction tasks

Platform capability:
- Add distributed Fan-out task pattern development guide (7 patterns + 10 anti-patterns)
- Add system-level async architecture risk analysis blueprint
- Add PDF table extraction engine design and usage guide (MinerU integration)
- Add table extraction source code (TableExtractionManager + MinerU engine)

Documentation updates:
- Update ASL module status with Tool 3 V2.0 plan readiness
- Update system status document (v6.2) with latest milestones
- Add V2.0 product requirements, prototypes, and data dictionary specs
- Add architecture review documents (4 rounds of review feedback)
- Add test PDF files for extraction validation

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-23 22:49:16 +08:00
parent 8f06d4f929
commit dc6b292308
42 changed files with 16615 additions and 41 deletions
--- a/docs/02-通用能力层/00-通用能力层清单.md
+++ b/docs/02-通用能力层/00-通用能力层清单.md
@@ -34,7 +34,7 @@
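 | **LLM Gateway** | `common/llm/` | ✅ | Unified LLM adapter (5 models) |
 | **Streaming responses** | `common/streaming/` | ✅ 🆕 | OpenAI-compatible streaming output |
 | **🎉RAG engine** | `common/rag/` | ✅ 🆕 | **Fully implemented: pgvector + DeepSeek + Rerank** |
-| **Document processing** | `extraction_service/` | ✅ 🆕 | pymupdf4llm PDF→Markdown |
+| **Document processing** | `extraction_service/` | ✅ V2 | pymupdf4llm (full text) + **PDF table extraction engine** (pluggable, multi-engine) |
 | **Authentication and authorization** | `common/auth/` | ✅ | JWT authentication + permission control |
 | **Prompt management** | `common/prompt/` | ✅ | Dynamic prompt configuration |
 | **🆕R statistics engine** | `r-statistics-service/` | ✅ | Dockerized R statistics service (plumber) |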
@@ -525,11 +525,26 @@ const final = await searchService.rerank(queries[0], results, { topK: 5 });

 ---

-### 9. 🎉 Document Processing Engine (✅ enhancement completed 2026-01-21)
+### 9. 🎉 Document Processing Engine (✅ V2: table extraction engine upgrade, 2026-02-23)

-**Path:** `extraction_service/` (Python microservice, port 8000)
+**Path:** `extraction_service/` (Python microservice) + `backend/src/common/document/tableExtraction/` (TypeScript)

-**Function:** Converts all document types into **LLM-friendly Markdown**
+**Function:** Converts all document types into LLM-friendly Markdown + **structured PDF table extraction**
+
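+**V2 layered architecture: full text and structured tables are extracted separately:**
+| Engine layer | Role | Output | Status |
+|--------|------|------|------|
+| **pymupdf4llm** | Full-text extraction | Markdown | ✅ Existing |
+| **PDF table extraction engine** | Structured table extraction (unified abstraction layer) | ExtractedTable[] | ✅ New in V2 |
+
+**PDF table extraction engine: candidate engines (pluggable):**
+| Engine | Status | Notes |
+|------|------|------|
+| MinerU Cloud API (VLM) | ✅ Integrated (current default) | Overall score 4.6/5 |
+| Qwen3-VL | 📋 Pending evaluation | Strongest multimodal understanding |
+| PaddleOCR-VL 1.5 | 📋 Pending evaluation | Many medical-domain use cases |
+| Qwen-OCR + Qwen-Long | 📋 Pending evaluation | Lowest cost |
+| Docling (IBM) | 📋 Pending evaluation | MIT-licensed open source, offline deployment |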

 **Core API:**
 ```
@@ -540,16 +555,11 @@ Content-Type: multipart/form-data
 Returns: { success: true, text: "Markdown content", metadata: {...} }
 ```

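-**Technical upgrades:**
-- ✅ PDF processing: pymupdf4llm (preserves tables, formulas, and structure)
-- ✅ Single entry point: DocumentProcessor detects the file type automatically
-- ✅ Zero OCR: for born-digital documents only; scanned files return a friendly message
-- ✅ Seamless integration with the RAG engine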
-
 **Supported formats:**
 | Format | Tool | Output quality | Status |
 |------|------|----------|------|
-| PDF | pymupdf4llm | Table fidelity | ✅ |
+| PDF (full text) | pymupdf4llm | Markdown text | ✅ |
+| PDF (tables) | **MinerU VLM** | Structured HTML tables | ✅ V2 |
 | Word | mammoth | Structure preserved | ✅ |
 | Excel/CSV | pandas | Rich context | ✅ |
 | PPT | python-pptx | Split by slide | ✅ |
@@ -592,7 +602,9 @@ const markdown = await client.extractText(buffer, 'pdf');
 - 🔜 AIA - attachment processing

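 **Detailed documentation:**
-- 📖 [Document Processing Engine Usage Guide](./02-文档处理引擎/02-文档处理引擎使用指南.md) ⭐ **Recommended reading**
+- 📖 [PDF Table Extraction Engine Usage Guide](./02-文档处理引擎/04-PDF表格提取引擎使用指南.md) ⭐ **5-second quick start + practical scenarios**
+- 📖 [PDF Table Extraction Engine Design](./02-文档处理引擎/03-PDF表格提取引擎设计方案.md): unified abstraction + pluggable multi-engine
+- 📖 [Document Processing Engine Usage Guide](./02-文档处理引擎/02-文档处理引擎使用指南.md)
 - [Document Processing Engine Design](./02-文档处理引擎/01-文档处理引擎设计方案.md)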

 ---
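
The diff above describes the PDF table extraction engine as a unified abstraction layer with pluggable engines that outputs `ExtractedTable[]`, with MinerU as the current default. The sketch below shows one way such a layer could be shaped. It is not the code from this commit: the fields of `ExtractedTable`, the `TableExtractionEngine` interface, the `EngineRegistry` class, and the stub engines are all hypothetical names chosen for illustration, and no real engine API is called.

```typescript
// Hypothetical shape: the diff names ExtractedTable[] as the output type
// but does not show its fields, so these two are assumptions.
interface ExtractedTable {
  pageNumber: number;
  html: string; // the diff lists structured HTML tables as the MinerU output
}

// Hypothetical contract that every candidate engine would implement.
interface TableExtractionEngine {
  readonly name: string;
  extractTables(pdf: Uint8Array): Promise<ExtractedTable[]>;
}

// Hypothetical registry: callers ask for an engine by name or take the default.
class EngineRegistry {
  private readonly engines = new Map<string, TableExtractionEngine>();
  private readonly defaultName: string;

  constructor(defaultName: string) {
    this.defaultName = defaultName;
  }

  register(engine: TableExtractionEngine): this {
    this.engines.set(engine.name, engine);
    return this;
  }

  // Falls back to the default engine when no name is given.
  resolve(name?: string): TableExtractionEngine {
    const key = name === undefined ? this.defaultName : name;
    const engine = this.engines.get(key);
    if (!engine) {
      throw new Error(`Unknown table extraction engine: ${key}`);
    }
    return engine;
  }

  names(): string[] {
    return Array.from(this.engines.keys());
  }
}

// Stub standing in for a real integration; it makes no network calls.
class StubEngine implements TableExtractionEngine {
  readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  async extractTables(pdf: Uint8Array): Promise<ExtractedTable[]> {
    const cell = `${this.name}: ${pdf.length} bytes`;
    return [{ pageNumber: 1, html: `<table><tr><td>${cell}</td></tr></table>` }];
  }
}

const registry = new EngineRegistry('mineru')
  .register(new StubEngine('mineru'))
  .register(new StubEngine('docling'));
```

Under these assumptions, a caller would write `await registry.resolve().extractTables(buffer)` to use the default engine, and pass a name such as `'docling'` to `resolve` to evaluate another candidate without changing the calling code.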