docs(asl): Upgrade Tool 3 architecture from Fan-out to Scatter+Aggregator (v2.0)
Architecture transformation:
- Replace Fan-out (Manager -> Child -> Last Child Wins) with the Scatter+Aggregator pattern
- API layer directly dispatches N independent jobs (no Manager)
- Worker only writes its own Result row and never touches the Task table (zero row-lock contention)
- Aggregator polls groupBy for completion + zombie cleanup (replaces the Sweeper)
- Red lines reduced from 13 to 9; distributed complexity eliminated

Documents updated (10 files):
- 08-Tool3 main architecture doc: v2.0 rewrite (schema, Task 2.3/2.4, red lines, risks)
- 08d-Code patterns: rewrite sections 4.1-4.6 (API dispatch, SingleWorker, Aggregator)
- 08a-M1 sprint: rewrite M1-3 core (Worker + Aggregator), red lines, acceptance criteria
- 08b-M2 sprint: simplify SSE (NOTIFY/LISTEN downgraded to P2 optional)
- 08c-M3 sprint: milestone table wording update
- New: Scatter+Polling Aggregator pattern guide v1.1 (Level 2 cookbook)
- New: v2.0 architecture deep review and gap-fix report
- Updated: ASL module status, system status, capability layer index

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -148,12 +148,15 @@ if (pkbExtractedText) {
---
## 4. Scatter Dispatch + Aggregator Polling Close-out (Core)

> 🚀 **v2.0 architecture shift:** this section is fully rewritten, aligned with `散装派发与轮询收口任务模式指南.md` (the Scatter Dispatch + Polling Close-out pattern guide).
> The Fan-out pattern (Manager / Child / Last Child Wins / atomic increments / standalone Sweeper) is deprecated.

### 4.1 ExtractionService interface
```typescript
// 🚀 v2.0: concurrency control is delegated to pg-boss teamConcurrency; the Worker only writes its own Result
class ExtractionService {
  constructor(
    private promptBuilder: DynamicPromptBuilder,
    // … (other dependencies elided in this diff hunk)
    private pkbBridge: PkbBridgeService,
  ) {}

  // Extract a single document (called by the Worker)
  async extractOne(resultId: string, taskId: string): Promise<ExtractionOutput>;

  // Internal flow (per document):
  // 1. Load the project template → assemble the Schema
  // 2. Read extractedText from PKB (zero cost); access OSS via snapshotStorageKey (guards against PKB deletion)
  // 3. MinerU table extraction (with OSS Cache-Aside caching)
  // 4. Assemble the Prompt (XML isolation zone + injection guardrails) → LLM call
  // 5. Parse JSON → fuzzyQuoteMatch validation
  // 6. Return ExtractionOutput (the Worker writes the Result; this method does not touch the DB)
}
```
### 4.2 API-layer scatter dispatch (ExtractionController)

> 🚀 v2.0: **ExtractionManagerWorker is deleted.** The API layer directly creates Results + snapshots PKB metadata + scatter-dispatches the Jobs.
```typescript
// ExtractionController.ts — POST /api/v1/asl/extraction/tasks
async function createTask(req, reply) {
  const { projectId, templateId, documentIds, idempotencyKey } = req.body;

  if (documentIds.length === 0) throw new Error('No documents selected');

  // ═══════════════════════════════════════════════════════
  // DB-level idempotency: @unique index + P2002 conflict capture
  // ═══════════════════════════════════════════════════════
  let task;
  try {
    task = await prisma.aslExtractionTask.create({
      data: { projectId, templateId, totalCount: documentIds.length, status: 'processing', idempotencyKey },
    });
  } catch (error) {
    if (error.code === 'P2002' && idempotencyKey) {
      const existing = await prisma.aslExtractionTask.findFirst({ where: { idempotencyKey } });
      return reply.send({ success: true, taskId: existing.id, note: 'Idempotent return' });
    }
    throw error;
  }

  // ═══════════════════════════════════════════════════════
  // PKB snapshot freeze (moved from the old Manager to the API layer)
  // Extraction may run for 50 minutes, during which users may delete/modify documents in PKB
  // ═══════════════════════════════════════════════════════
  const pkbDocs = await Promise.all(
    documentIds.map(id => pkbBridge.getDocumentDetail(id))
  );
  const resultsData = pkbDocs.map(doc => ({
    taskId: task.id,
    projectId,
    pkbDocumentId: doc.documentId,
    snapshotStorageKey: doc.storageKey,
    snapshotFilename: doc.filename,
    status: 'pending',
  }));
  await prisma.aslExtractionResult.createMany({ data: resultsData });

  const createdResults = await prisma.aslExtractionResult.findMany({ where: { taskId: task.id } });

  // ═══════════════════════════════════════════════════════
  // Scatter dispatch: enqueue N independent Jobs in one call (no Manager relay)
  // ═══════════════════════════════════════════════════════
  const jobs = createdResults.map(result => ({
    name: 'asl_extract_single',
    data: { resultId: result.id, taskId: task.id, pkbDocumentId: result.pkbDocumentId },
    options: {
      retryLimit: 3,
      retryBackoff: true,
      expireInMinutes: 30, // execution timeout, not queue-wait timeout
      singletonKey: `extract-${result.id}`,
    },
  }));
  await jobQueue.insert(jobs);

  return reply.send({ success: true, taskId: task.id });
}
```
**Key points:**
- `idempotencyKey @unique` + P2002 = physical DB-level idempotency
- PKB snapshot freeze happens at the API layer (no Manager needed)
- `singletonKey: extract-${result.id}` prevents duplicate dispatch
- `jobQueue.insert(jobs)` batch-enqueues; 100 documents in < 0.1 s

### 4.3 ExtractionSingleWorker (phantom guard + error triage)

> 🚀 v2.0: **$transaction deleted, increment deleted, Last Child Wins deleted.**
> The Worker only updates its own Result row and never touches the Task table. Close-out is handled by the Aggregator (§4.6).
```typescript
// 🚀 v2.0 scatter Worker — owns only its own Result, never touches the Task table
class ExtractionSingleWorker {
  async handle(job: { data: { resultId: string; taskId: string; pkbDocumentId: string } }) {
    const { resultId, taskId, pkbDocumentId } = job.data;

    // ═══════════════════════════════════════════════════════
    // Phantom-retry guard: only pending → extracting is allowed
    // OOM crash → status stuck at extracting → the retry is skipped
    // → the Aggregator marks extracting rows as error after 30 minutes
    // ═══════════════════════════════════════════════════════
    const lock = await prisma.aslExtractionResult.updateMany({
      where: { id: resultId, status: 'pending' },
      data: { status: 'extracting' },
    });
    if (lock.count === 0) {
      return { success: true, note: 'Phantom retry skipped' };
    }

    try {
      const data = await this.extractionService.extractOne(resultId, taskId);

      // ✅ Update only this Worker's own Result row — never the Task table!
      await prisma.aslExtractionResult.update({
        where: { id: resultId },
        data: { status: 'completed', extractedData: data, processedAt: new Date() },
      });
    } catch (error) {
      if (isPermanentError(error)) {
        // Fatal error: mark as error, do not retry
        await prisma.aslExtractionResult.update({
          where: { id: resultId },
          data: { status: 'error', errorMessage: error.message },
        });
        return { success: false, note: 'Permanent error' };
      }

      // Transient error: roll status back to pending so the next retry passes the phantom guard
      await prisma.aslExtractionResult.update({
        where: { id: resultId },
        data: { status: 'pending' },
      });
      throw error; // pg-boss exponential-backoff retry
    }
  }
}

function isPermanentError(error: any): boolean {
  return error instanceof PkbDocumentNotFoundError
    || error.name === 'PdfCorruptedError'
    || (error.status && error.status >= 400 && error.status < 500 && error.status !== 429);
}
```
**Key differences from the old Fan-out Child Worker:**
- ❌ ~~`prisma.$transaction([...Result, ...Task])`~~ → ✅ only `prisma.aslExtractionResult.update()`
- ❌ ~~`successCount: { increment: 1 }`~~ → ✅ the Worker never touches the Task table
- ❌ ~~Last Child Wins terminator~~ → ✅ the Aggregator closes out on a schedule (§4.6)
- ✅ Phantom guard + transient-error rollback to pending are kept unchanged
### 4.4 Worker + Aggregator registration (flat single queue)

> 🚀 v2.0: the three-level nested queues are deprecated in favor of a **single flat queue** `asl_extract_single`.
> MinerU / LLM calls complete serially via `await` inside the Worker; `teamConcurrency` is the total concurrency cap.

```typescript
// ═══════════════════════════════════════════════════════
// Single Worker queue: at most 10 documents extracted in parallel globally
// ═══════════════════════════════════════════════════════
jobQueue.work('asl_extract_single', { teamConcurrency: 10 }, async (job) => {
  await extractionSingleWorker.handle(job);
});

// ═══════════════════════════════════════════════════════
// Aggregator scheduled close-out: scan every 2 minutes (code in §4.6)
// ═══════════════════════════════════════════════════════
await jobQueue.schedule('asl_extraction_aggregator', '*/2 * * * *');
await jobQueue.work('asl_extraction_aggregator', aggregatorHandler);
```

> **Flat architecture comparison:**
> ```
> Old Fan-out (three nested levels, deprecated):
>   asl_extraction_manager → asl_extraction_child(10) → asl_mineru_extract(2) + asl_llm_extract(5)
>
> New scatter (one flat level):
>   asl_extract_single (teamConcurrency: 10)           ← Worker calls MinerU + LLM serially inside
>   asl_extraction_aggregator (schedule: */2 * * * *)  ← scheduled close-out + zombie cleanup
> ```
> `teamConcurrency: 10` is the implicit total concurrency cap for MinerU + LLM; no extra sub-queues needed.
### 4.5 Postgres-Only safety checklist

| Rule | Requirement | v2.0 scatter-mode implementation |
|------|------|-----------|
| **API idempotency** | Prevent duplicate task creation | `idempotencyKey @unique` + P2002 conflict capture |
| **Worker idempotency** | Tolerate pg-boss redelivery | `updateMany({ where: { status: 'pending' } })` phantom guard |
| **Worker isolation** | The Worker never touches the Task table | Writes only its own Result row; the Aggregator closes out |
| **Lightweight payload** | Job data under a few KB | Only `{ resultId, taskId, pkbDocumentId }` (< 200 bytes) |
| **Expiration** | `expireInMinutes` is mandatory | `expireInMinutes: 30` (execution timeout, not queue-wait timeout) |
| **Error triage** | Distinguish retryable vs. permanent failure | Transient → roll back to pending + throw; fatal → mark error + return |
| **Dead letters** | Jobs past retryLimit | Aggregator doubles as Sweeper: `extracting` > 30 min → error |
| **Progress queries** | No redundant counters on Task | Frontend aggregates Result statuses in real time via `groupBy` |
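The last rule above relies on aggregating Result statuses with `groupBy` instead of keeping counters on the Task row. As a minimal sketch of that fold (the `foldStats` helper and the `ProgressCounts` shape are illustrative names, not from the source docs):

```typescript
// Fixed-shape progress counts; statuses missing from the groupBy rows default to 0.
interface ProgressCounts {
  pendingCount: number;
  extractingCount: number;
  completedCount: number;
  errorCount: number;
}

type StatusRow = { status: string; _count: number };

// Fold Prisma-style groupBy rows ({ status, _count }) into a ProgressCounts object.
function foldStats(rows: StatusRow[]): ProgressCounts {
  const counts: ProgressCounts = { pendingCount: 0, extractingCount: 0, completedCount: 0, errorCount: 0 };
  for (const row of rows) {
    const key = (row.status + 'Count') as keyof ProgressCounts;
    if (key in counts) counts[key] += row._count;
  }
  return counts;
}
```

A task is ready for close-out exactly when `pendingCount` and `extractingCount` are both zero, which is the same predicate the Aggregator applies in §4.6.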
### 4.6 ExtractionAggregator — polling close-out + zombie cleanup

> 🚀 v2.0: **the Aggregator doubles as the Sweeper: one component, two duties.**
> It replaces the old standalone `asl_extraction_sweeper` plus the Fan-out `Last Child Wins` close-out logic.

```typescript
// ExtractionAggregator.ts — runs every 2 minutes via pg-boss schedule
async function aggregatorHandler() {
  const tasks = await prisma.aslExtractionTask.findMany({
    where: { status: 'processing' },
  });

  for (const task of tasks) {
    // ═══════════════════════════════════════════════════════
    // Duty 1: zombie cleanup (Sweeper role)
    // extracting for over 30 minutes → the Worker process most likely crashed (OOM/SIGKILL)
    // ═══════════════════════════════════════════════════════
    await prisma.aslExtractionResult.updateMany({
      where: {
        taskId: task.id,
        status: 'extracting',
        updatedAt: { lt: new Date(Date.now() - 30 * 60 * 1000) },
      },
      data: { status: 'error', errorMessage: '[Aggregator] Timeout, likely worker crash.' },
    });

    // ═══════════════════════════════════════════════════════
    // Duty 2: aggregate stats — one groupBy query
    // ═══════════════════════════════════════════════════════
    const stats = await prisma.aslExtractionResult.groupBy({
      by: ['status'],
      where: { taskId: task.id },
      _count: true,
    });
    const pending = stats.find(s => s.status === 'pending')?._count ?? 0;
    const extracting = stats.find(s => s.status === 'extracting')?._count ?? 0;

    // ═══════════════════════════════════════════════════════
    // Duty 3: close-out — done once pending and extracting are both zero
    // ═══════════════════════════════════════════════════════
    if (pending === 0 && extracting === 0) {
      await prisma.aslExtractionTask.update({
        where: { id: task.id },
        data: { status: 'completed', completedAt: new Date() },
      });
      logger.info(`[Aggregator] Task ${task.id} completed`);
    }
  }
}
```
**Aggregator vs. the old Sweeper / Last Child Wins:**

| Old approach (deprecated) | New Aggregator |
|---|---|
| Last Child Wins close-out + standalone Sweeper | One Aggregator holding both duties |
| Worker must `$transaction`-increment Task counters | Worker writes only its Result; zero row-lock contention |
| Sweeper flagged stuck tasks via `updatedAt > 2h` | Aggregator clears zombies via `extracting > 30min` |
| Two components registered separately | One `pg-boss schedule` covers it |
---
@@ -658,34 +624,38 @@ function useExtractionLogs(taskId: string) {
```typescript
// Step 2 page component: dual-track composition
function ExtractionProgress({ taskId }: { taskId: string }) {
  // 🚀 v2.0: the progress API returns groupBy aggregates (no redundant successCount/failedCount fields)
  const { data: progress } = useTaskStatus(taskId);  // primary driver: polling
  const { logs } = useExtractionLogs(taskId);        // enhancement: SSE logs

  const processed = (progress?.completedCount ?? 0) + (progress?.errorCount ?? 0);
  const percent = progress ? Math.round(processed / progress.totalCount * 100) : 0;

  useEffect(() => {
    if (progress?.status === 'completed' || progress?.status === 'failed') {
      navigate(`/asl/extraction/workbench/${taskId}`);
    }
  }, [progress?.status]);

  return (
    <>
      <Progress percent={percent} />
      <ProcessingTerminal logs={logs} />
    </>
  );
}
```
> **Dual-track split:** React Query polling drives the progress bar and step navigation (robust and reliable); SSE only feeds the log stream into ProcessingTerminal (a visual enhancement; a dropped connection has no impact).
> **v2.0 change:** the progress API returns `groupBy`-aggregated `completedCount` / `errorCount` / `extractingCount` / `pendingCount`; the redundant counter columns on the Task table are gone.
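Given the groupBy-sourced counts described above, the percent shown by the progress bar reduces to a small pure function. A sketch (`progressPercent` is an illustrative name; it mirrors the `(completedCount + errorCount) / totalCount` arithmetic in the component and guards the divide-by-zero case while data is still loading):

```typescript
interface TaskProgress {
  totalCount: number;
  completedCount: number;
  errorCount: number;
}

// Percent of documents in a terminal state (completed or error), rounded to an integer.
function progressPercent(progress: TaskProgress | undefined): number {
  if (!progress || progress.totalCount === 0) return 0;
  const processed = progress.completedCount + progress.errorCount;
  return Math.round((processed / progress.totalCount) * 100);
}
```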
### 7.6 [P2 optional] SSE cross-Pod broadcast — PostgreSQL NOTIFY/LISTEN

> ⚠️ **v2.0 downgrade note:** under the scatter architecture, the progress bar and step navigation are driven entirely by React Query polling (§7.5);
> the SSE log stream is only a visual enhancement. Cross-Pod broadcast is a **P2 optional enhancement**; early in M1/M2 an in-Pod memory event can stand in.
>
> **Physical limitation:** `sseEmitter.emit()` is based on an in-memory EventEmitter; if the user connects to Pod A while the Worker runs on Pod B, Pod A sees zero logs.
> If cross-Pod log broadcast is needed, the following PostgreSQL `NOTIFY/LISTEN` approach can be used.
```typescript
// ===== Worker send side (inside ExtractionChildWorker) =====
// … (rest of this code block elided by the diff)
```

@@ -757,8 +727,8 @@ class SseNotifyBridge {
- The NOTIFY payload has a physical cap of **~8000 bytes** → truncate to **7000 bytes** before sending (v1.6 mandatory rule)
- **Never use `$executeRawUnsafe` + string concatenation!** Always use the `$executeRaw` tagged template + `pg_notify()` (v1.6 mandatory rule)
- The LISTEN connection must be **separate from the Prisma connection pool** (a dedicated PgClient)
- NOTIFY is fire-and-forget (no persistence), a perfect fit for the dual-track design (an SSE drop does not affect the business flow)
- In v2.0 the `complete` event is triggered by the frontend polling until `status === 'completed'`; it no longer hard-depends on SSE broadcast
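The first rule above (truncate NOTIFY payloads to 7000 bytes) has a subtle point: the ~8000-byte cap applies to UTF-8 bytes, not characters. A hedged sketch of one possible guard (`truncateNotifyPayload` is an illustrative helper, not from the source docs):

```typescript
const NOTIFY_PAYLOAD_LIMIT = 7000; // bytes, leaving headroom under the ~8000-byte pg cap

// Shrink the message field until the serialized JSON payload fits the byte budget.
function truncateNotifyPayload(payload: { taskId: string; message: string }): string {
  const encoder = new TextEncoder();
  let message = payload.message;
  let json = JSON.stringify({ ...payload, message });
  while (encoder.encode(json).length > NOTIFY_PAYLOAD_LIMIT && message.length > 0) {
    // Cut 10% per pass; coarse, but converges in a handful of iterations.
    message = message.slice(0, Math.floor(message.length * 0.9));
    json = JSON.stringify({ ...payload, message, truncated: true });
  }
  return json;
}
```

The resulting string would then be handed to `pg_notify()` via a `$executeRaw` tagged template, never via string concatenation, as the rules above require.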
---