feat(platform): Fix pg-boss queue conflict and add safety standards

Summary:
- Fix pg-boss queue conflict (duplicate key violation on queue_pkey)
- Add global error listener to prevent process crash
- Reduce connection pool from 10 to 4
- Add graceful shutdown handling (SIGTERM/SIGINT)
- Fix researchWorker recursive call bug in catch block
- Make screeningWorker idempotent using upsert

Security Standards (v1.1):
- Prohibit recursive retry in Worker catch blocks
- Prohibit payload bloat (only store fileKey/ID in job.data)
- Require Worker idempotency (upsert + unique constraint)
- Recommend task-specific expireInSeconds settings
- Document graceful shutdown pattern

New Features:
- PKB signed URL endpoint for document preview/download
- pg_bigm installation guide for Docker
- Dockerfile.postgres-with-extensions for pgvector + pg_bigm

Documentation:
- Update Postgres-Only async task processing guide (v1.1)
- Add troubleshooting SQL queries
- Update safety checklist

Tested: Local verification passed
@@ -1,10 +1,13 @@
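# Postgres-Only Async Task Processing Guide

> **Document version:** v1.0
> **Document version:** v1.1 (2026-01-23 safety standards update)
> **Created:** 2025-12-22
> **Last updated:** 2026-01-23
> **Maintainer:** Platform Architecture Team
> **Applies to:** long-running tasks (>30 s), large-file processing, background Workers
> **Reference implementations:** DC Tool C Excel parsing, ASL literature screening, DC Tool B data extraction
>
> ⚠️ **Important update in v1.1**: added the [🛡️ Safety Standards](#-安全规范强制) section, which contains mandatory rules on idempotency, error handling, and related topics

---
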
@@ -537,6 +540,160 @@ async saveProcessedData(recordId, newData) {

---

## 🛡️ Safety Standards (Mandatory)

> **Updated**: 2026-01-23
> **Source**: internal reverse review report + production incident fixes

The following rules come from problems this project actually ran into and **must be followed**:

### Rule 1: No recursive infinite loops in Workers ❌

**Wrong**:
```typescript
// ❌ Forbidden: retrying business logic inside the catch block
jobQueue.process('your_task', async (job) => {
  try {
    await doSomething(job.data);
  } catch (error) {
    // ❌ Wrong! This causes an infinite loop or duplicate execution
    await doSomething(job.data);
    throw error;
  }
});
```

**Correct**:
```typescript
// ✅ Correct: just throw and let pg-boss take over retries (3 by default)
jobQueue.process('your_task', async (job) => {
  try {
    await doSomething(job.data);
  } catch (error) {
    // A caught value is not guaranteed to be an Error, so narrow it before reading .message
    const message = error instanceof Error ? error.message : String(error);
    logger.error('Job failed', { jobId: job.id, error: message });
    throw error; // ✅ pg-boss retries automatically
  }
});
```

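
To make the cost of the in-catch retry concrete, the sketch below counts how often the business logic runs under both handler styles when it always fails. `runWithRetries` is a hypothetical stand-in for a queue's retry loop (it is not the pg-boss API), and the retry limit of 3 simply mirrors the default mentioned above.

```typescript
// Hypothetical stand-in for a queue's retry loop; NOT the pg-boss API.
// Runs the handler once, then up to `retryLimit` more times on failure.
function runWithRetries(handler: () => void, retryLimit: number): boolean {
  for (let attempt = 0; attempt <= retryLimit; attempt++) {
    try {
      handler();
      return true; // job completed
    } catch {
      // Swallow the error; the loop schedules the next attempt.
    }
  }
  return false; // job failed after exhausting its retries
}

// Counts executions of business logic that always fails.
function countExecutions(retryInCatch: boolean, retryLimit = 3): number {
  let executions = 0;
  const doSomething = (): void => {
    executions++;
    throw new Error("downstream unavailable");
  };

  runWithRetries(() => {
    try {
      doSomething();
    } catch (error) {
      if (retryInCatch) doSomething(); // anti-pattern: manual retry in catch
      throw error;
    }
  }, retryLimit);

  return executions;
}

console.log(countExecutions(false)); // 4: one initial run plus 3 retries
console.log(countExecutions(true)); // 8: every attempt runs the logic twice
```
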
---

### Rule 2: No payload bloat ❌

**Wrong**:
```typescript
// ❌ Forbidden: storing large files in job.data
await jobQueue.push('parse_excel', {
  fileContent: base64EncodedFile, // ❌ bloats the job table
  imageData: base64Image,         // ❌ slows down the database
});
```

**Correct**:
```typescript
// ✅ Correct: store only the fileKey or a database ID
await jobQueue.push('parse_excel', {
  sessionId: session.id,   // ✅ ID only
  fileKey: 'path/to/file', // ✅ OSS path only
  userId: 'user-123',
});
```

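
One way to enforce this rule mechanically is to check the serialized payload before it is enqueued. The sketch below is a minimal guard under stated assumptions: `assertSmallPayload` and the 8,192-character budget are hypothetical (neither is a pg-boss setting), and the JSON string length is used only as a rough proxy for the stored size.

```typescript
// Assumed budget for illustration only; tune it to your own job table.
const MAX_PAYLOAD_CHARS = 8 * 1024;

// Length of the JSON form of the payload, a rough proxy for its stored size.
function payloadLength(data: unknown): number {
  return JSON.stringify(data).length;
}

// Throws when the payload looks like it carries file content rather than references.
function assertSmallPayload(data: unknown, maxChars = MAX_PAYLOAD_CHARS): void {
  const length = payloadLength(data);
  if (length > maxChars) {
    throw new Error(
      `Job payload is ${length} characters (limit ${maxChars}); store a fileKey or ID instead`
    );
  }
}

// Reference-only payloads pass the guard.
assertSmallPayload({ sessionId: "s-1", fileKey: "path/to/file", userId: "user-123" });

// An embedded file is rejected before it reaches the queue.
try {
  assertSmallPayload({ fileContent: "A".repeat(1_000_000) });
} catch (error) {
  console.log(error instanceof Error ? error.message : String(error));
}
```
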
---

### Rule 3: Workers must be idempotent ⭐

**Problem**: when a failed job is retried, it can cause duplicate writes, duplicate charges, or duplicate emails.

**Wrong**:
```typescript
// ❌ Not idempotent: a retry creates multiple records
await prisma.screeningResult.create({
  data: { projectId, literatureId, result }
});
```

**Correct**:
```typescript
// ✅ Option 1: upsert + unique constraint
await prisma.screeningResult.upsert({
  where: {
    projectId_literatureId: { projectId, literatureId }
  },
  create: { projectId, literatureId, result },
  update: { result }, // overwrite on retry
});

// ✅ Option 2: check the status before doing the work
const existing = await prisma.task.findUnique({ where: { id: taskId } });
if (existing?.status === 'completed') {
  logger.info('Task already completed, skipping');
  return;
}
await doWork();
```

**Idempotency checklist**:
| Operation | Idempotency approach |
|---------|---------|
| Create a record | Use `upsert` + a unique constraint |
| Update a record | `update` is naturally idempotent |
| Send an email | Check the `notificationSent` flag first |
| Charge a fee | Use an idempotency key (e.g. the order number) |
| Call an external API | Check whether the call already succeeded |

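
The "idempotency key" row of the table above can be sketched as follows. Everything here is illustrative: `chargeOnce` and `chargeCustomer` are hypothetical names, and the in-memory `Map` stands in for a database table with a unique constraint on the key. In a real system the unique constraint, not the in-process check, is what closes the race between two concurrent retries.

```typescript
// In-memory stand-in for a table with a unique constraint on the idempotency key.
const processedCharges = new Map<string, number>();

// Hypothetical billing call; it only records totals so the effect is observable.
let totalChargedCents = 0;
function chargeCustomer(amountCents: number): void {
  totalChargedCents += amountCents;
}

// Charges at most once per order, no matter how often the job is retried.
function chargeOnce(orderId: string, amountCents: number): "charged" | "skipped" {
  if (processedCharges.has(orderId)) {
    return "skipped"; // a previous attempt already charged this order
  }
  chargeCustomer(amountCents);
  processedCharges.set(orderId, amountCents);
  return "charged";
}

console.log(chargeOnce("order-1001", 2500)); // "charged"
console.log(chargeOnce("order-1001", 2500)); // "skipped" (simulated retry)
console.log(totalChargedCents); // 2500
```
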
---

### Rule 4: Set a sensible job expiration time

**Current default**:
```typescript
expireInSeconds: 6 * 60 * 60 // 6 hours
```

**Recommended settings** (by task type):
| Task type | Expiration | Rationale |
|---------|---------|------|
| `asl_screening_batch` | 30 minutes | Single-article literature screening |
| `dc_extraction_batch` | 1 hour | Batch data extraction |
| `dc_toolc_parse_excel` | 30 minutes | Excel parsing |
| `rvw_review_task` | 20 minutes | Manuscript review task |
| `asl_research_execute` | 30 minutes | DeepSearch retrieval |

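
The recommendations in the table above can be kept in one lookup so each queue gets its own expiration and unknown queues fall back to the current 6-hour default. This is a minimal sketch: the names `EXPIRE_SECONDS_BY_QUEUE` and `expireInSecondsFor` are hypothetical, and how the value is passed on to the queue depends on the project's own wrapper.

```typescript
// Fallback that matches the current default of 6 hours.
const DEFAULT_EXPIRE_SECONDS = 6 * 60 * 60;

// Per-queue expiration in seconds, taken from the recommendation table.
const EXPIRE_SECONDS_BY_QUEUE: Record<string, number> = {
  asl_screening_batch: 30 * 60,
  dc_extraction_batch: 60 * 60,
  dc_toolc_parse_excel: 30 * 60,
  rvw_review_task: 20 * 60,
  asl_research_execute: 30 * 60,
};

// Returns the recommended expiration, or the default for unlisted queues.
function expireInSecondsFor(queueName: string): number {
  return EXPIRE_SECONDS_BY_QUEUE[queueName] ?? DEFAULT_EXPIRE_SECONDS;
}

console.log(expireInSecondsFor("rvw_review_task")); // 1200
console.log(expireInSecondsFor("some_new_queue")); // 21600
```
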
---

### Rule 5: Graceful shutdown ✅

**Already implemented in `index.ts`**:
```typescript
// Shut down gracefully when the process exits
process.on('SIGTERM', async () => {
  await fastify.close();      // stop accepting new requests
  await jobQueue.stop();      // wait for in-flight jobs to finish
  await prisma.$disconnect(); // close the database connection
  process.exit(0);
});
```

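
The same shutdown order can be shared between SIGTERM and SIGINT and protected against a second signal arriving mid-shutdown. The sketch below is one possible shape, not the code in `index.ts`: `createShutdownHandler` and `ShutdownStep` are hypothetical names, and the wiring to `fastify`, `jobQueue` and `prisma` is shown only in comments because those instances live in the application.

```typescript
type ShutdownStep = { name: string; close: () => Promise<void> };

// Builds a handler that runs the steps in order, exactly once.
function createShutdownHandler(
  steps: ShutdownStep[],
  onDone: (exitCode: number) => void
): () => Promise<void> {
  let inFlight: Promise<void> | undefined;

  return () => {
    // A second signal reuses the in-flight shutdown instead of starting over.
    if (!inFlight) {
      inFlight = (async () => {
        let exitCode = 0;
        for (const step of steps) {
          try {
            await step.close();
          } catch (error) {
            exitCode = 1; // remember the failure but keep closing the rest
            console.error(`Shutdown step "${step.name}" failed`, error);
          }
        }
        onDone(exitCode);
      })();
    }
    return inFlight;
  };
}

// Wiring sketch (assumes the instances used in index.ts):
// const shutdown = createShutdownHandler(
//   [
//     { name: "http", close: () => fastify.close() },
//     { name: "queue", close: () => jobQueue.stop() },
//     { name: "db", close: () => prisma.$disconnect() },
//   ],
//   (code) => process.exit(code)
// );
// process.on("SIGTERM", shutdown);
// process.on("SIGINT", shutdown);
```
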
---

### Rule 6: Global error listener ✅

**Already implemented in `PgBossQueue.ts`**:
```typescript
// Without a listener, an emitted 'error' event would crash the process
this.boss.on('error', (err) => {
  if (err.code === '23505' && err.constraint === 'queue_pkey') {
    // Queue conflict: handle quietly
    console.log('Queue concurrency conflict auto-resolved');
  } else {
    console.error('PgBoss critical error:', err);
  }
});
```

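
The condition inside that listener can be pulled into a small, separately testable predicate. The sketch below assumes the error object carries `code` and `constraint` fields as in the listener above; `isQueueCreationConflict` is a hypothetical helper name. `23505` is PostgreSQL's SQLSTATE for `unique_violation`.

```typescript
// PostgreSQL SQLSTATE for unique_violation.
const UNIQUE_VIOLATION = "23505";

// True only for a duplicate-key error on the queue primary key.
function isQueueCreationConflict(err: unknown): boolean {
  if (typeof err !== "object" || err === null) {
    return false;
  }
  const { code, constraint } = err as { code?: unknown; constraint?: unknown };
  return code === UNIQUE_VIOLATION && constraint === "queue_pkey";
}

console.log(isQueueCreationConflict({ code: "23505", constraint: "queue_pkey" })); // true
console.log(isQueueCreationConflict({ code: "23505", constraint: "job_pkey" })); // false
console.log(isQueueCreationConflict(new Error("connection refused"))); // false
```

Keeping the match this narrow means any other unique violation, or any unrelated failure, still reaches the critical-error branch.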
---

## ⚠️ FAQ

### Q1: The Worker is registered but does not run?
@@ -569,7 +726,7 @@ async saveProcessedData(recordId, newData) {

## ✅ Checklist

Before implementing an async task, confirm:
### Basic configuration checks

- [ ] Business tables store business data only (no status-style fields)
- [ ] Queue names use underscores (no colons)
@@ -579,11 +736,49 @@ async saveProcessedData(recordId, newData) {
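- [ ] Service reads clean data first
- [ ] saveProcessedData keeps clean data updated in sync

### 🛡️ Safety standards checks (mandatory)

- [ ] **Idempotency**: use `upsert` or check the status first so that retries are safe
- [ ] **Payload**: `job.data` stores only IDs and the fileKey, never large files
- [ ] **Error handling**: `throw error` directly in the catch block; do not retry business logic there
- [ ] **Unique constraints**: tables have suitable unique indexes to prevent duplicate writes
- [ ] **Expiration**: set a sensible `expireInSeconds` for each task type
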
---

## 📊 Troubleshooting SQL

```sql
-- Queue health overview
SELECT
  name AS queue_name,
  state,
  COUNT(*) AS count
FROM platform_schema.job
WHERE created_on > NOW() - INTERVAL '24 hours'
GROUP BY name, state
ORDER BY name, state;

-- Failed jobs
SELECT id, name, data, output, created_on
FROM platform_schema.job
WHERE state = 'failed'
ORDER BY created_on DESC
LIMIT 10;

-- Stuck jobs (in the active state for more than 1 hour)
SELECT id, name, data, created_on, started_on
FROM platform_schema.job
WHERE state = 'active'
  AND started_on < NOW() - INTERVAL '1 hour';
```

---

**Maintainer**: Platform Architecture Team
**Last updated**: 2025-12-22
**Document status**: ✅ Complete
**Last updated**: 2026-01-23
**Document status**: ✅ Complete (v1.1 safety standards update)