# PgBoss 队列监控与维护手册 > **文档版本**:v1.0 > **创建日期**:2026-01-27 > **基于故障**:2026-01-27 RVW 模块任务无限循环故障 --- ## 📋 目录 1. [故障背景](#故障背景) 2. [架构说明](#架构说明) 3. [日常监控 SQL](#日常监控-sql) 4. [故障排查指南](#故障排查指南) 5. [清理操作](#清理操作) 6. [预防措施](#预防措施) --- ## 故障背景 ### 2026-01-27 故障复盘 **现象**:RVW 审稿模块任务完成后继续无限循环执行,前端不显示结果 **表层原因**:数据库中残留 7 个重复的队列定义,导致单实例在一次事件循环中为同一 taskId 创建了 7 个 Job **根本原因**:**数据库迁移时 `platform_schema.queue` 表的主键约束丢失** | 环境 | `queue` 表主键 | 结果 | |------|---------------|------| | **本地** | ✅ `queue_pkey` (name 唯一) | `createQueue()` 重复调用会报错被忽略 | | **RDS** | ❌ **无主键**(迁移丢失) | `createQueue()` 每次都插入新行 | **证据**: ```sql -- 同一 taskId 被处理 7 次 task_id: bd19c3d3-80cc-42f7-85a4-d38b17319a1b created_on: 2026-01-27 16:06:07.446015+08 (全部相同!) job_count: 7 ``` **修复**: 1. 清理 32 个重复队列定义 2. 添加主键约束:`ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);` 3. 代码四层防御(前端锁 + API幂等 + singletonKey + Worker检查) 详见:[03-数据库迁移注意事项](./03-数据库迁移注意事项.md) --- ## 架构说明 ### PgBoss 表结构 | 表名 | 说明 | Schema | |------|------|--------| | `queue` | 队列定义(每种任务类型一条记录) | platform_schema | | `job` | 任务记录(旧版,可能未使用) | platform_schema | | `job_common` | 任务记录(当前使用) | platform_schema | | `schedule` | 定时任务配置 | platform_schema | | `subscription` | 订阅配置 | platform_schema | | `version` | pg-boss 版本信息 | platform_schema | ### 任务状态流转 ``` created → active → completed → failed → retry → active → expired ``` --- ## 日常监控 SQL ### 1. 检查重复队列定义(🔴 每日必查) ```sql -- 如果返回结果,说明有重复队列定义需要清理 SELECT name, COUNT(*) as cnt FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1 ORDER BY cnt DESC; ``` **预期结果**:无返回(0 行) **异常处理**:参考 [清理操作](#清理操作) --- ### 2. 检查任务状态分布 ```sql -- 查看各队列的任务状态分布 SELECT name, state, COUNT(*) as count FROM platform_schema.job_common GROUP BY name, state ORDER BY name, state; ``` **关注点**: - `active` 状态任务不应该长期存在 - `created` 状态任务堆积说明 Worker 未启动或有问题 - `retry` 状态任务过多说明有系统性错误 --- ### 3. 检查同一 taskId 重复处理 ```sql -- 检查是否有任务被重复处理 SELECT data->>'taskId' as task_id, COUNT(*) as job_count, MIN(created_on) as first_run, MAX(completed_on) as last_run FROM platform_schema.job_common WHERE name = 'rvw_review_task' AND created_on > NOW() - INTERVAL '24 hours' GROUP BY data->>'taskId' HAVING COUNT(*) > 1 ORDER BY job_count DESC; ``` **预期结果**:无返回(每个 taskId 只应处理一次) --- ### 4. 检查卡住的任务 ```sql -- 查找运行超过 30 分钟的活跃任务 SELECT id, name, state, data->>'taskId' as task_id, created_on, started_on, EXTRACT(EPOCH FROM (NOW() - started_on))/60 as running_minutes FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes' ORDER BY started_on; ``` --- ### 5. 队列健康检查汇总 ```sql -- 一键健康检查 SELECT 'queue_duplicates' as check_type, CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END as status, COUNT(*) as count FROM ( SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1 ) t UNION ALL SELECT 'stuck_active_jobs', CASE WHEN COUNT(*) > 0 THEN '⚠️ 警告' ELSE '✅ 正常' END, COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes' UNION ALL SELECT 'duplicate_tasks_24h', CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END, COUNT(*) FROM ( SELECT data->>'taskId' FROM platform_schema.job_common WHERE created_on > NOW() - INTERVAL '24 hours' GROUP BY data->>'taskId' HAVING COUNT(*) > 1 ) t; ``` --- ## 故障排查指南 ### 症状 1:任务无限循环 **现象**:同一任务反复执行,日志显示不断 "Processing job" **排查步骤**: 1. 检查重复队列定义 ```sql SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1; ``` 2. 检查同一 taskId 的 Job 数量 ```sql SELECT data->>'taskId', COUNT(*) FROM platform_schema.job_common WHERE name = 'rvw_review_task' GROUP BY data->>'taskId' HAVING COUNT(*) > 1; ``` 3. 检查任务状态 ```sql SELECT id, name, state, data->>'taskId', retry_count, created_on FROM platform_schema.job_common WHERE data->>'taskId' = '问题taskId' ORDER BY created_on; ``` **修复**:清理重复队列定义 + 更新任务状态 --- ### 症状 2:任务卡住不执行 **现象**:任务状态一直是 `created`,不变成 `active` **排查步骤**: 1. 检查 Worker 是否注册 ```bash # 查看 SAE 日志 grep "Worker registered" /logs/app.log ``` 2. 检查 pg-boss 连接 ```sql SELECT * FROM pg_stat_activity WHERE application_name = 'aiclinical-queue'; ``` 3. 重启后端服务 --- ### 症状 3:任务状态不一致 **现象**:pg-boss 显示 completed,但业务表显示 pending **排查步骤**: 1. 对比两边状态 ```sql -- pg-boss 状态 SELECT id, state, data->>'taskId' as task_id, completed_on FROM platform_schema.job_common WHERE data->>'taskId' = 'xxx'; -- 业务表状态 SELECT id, status, "completedAt" FROM rvw_schema."ReviewTask" WHERE id = 'xxx'; ``` 2. 手动同步状态(如确认已完成) ```sql UPDATE rvw_schema."ReviewTask" SET status = 'completed', "completedAt" = NOW() WHERE id = 'xxx'; ``` --- ## 清理操作 ### 清理重复队列定义 ```sql -- 删除重复的队列定义,保留最新的一个 DELETE FROM platform_schema.queue a USING platform_schema.queue b WHERE a.name = b.name AND a.created_on < b.created_on; -- 验证 SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name; ``` --- ### 清理卡住的任务 ```sql -- 将卡住的 active 任务标记为 failed UPDATE platform_schema.job_common SET state = 'failed', completed_on = NOW() WHERE state = 'active' AND started_on < NOW() - INTERVAL '1 hour'; ``` --- ### 清理历史完成任务(释放空间) ```sql -- 删除 7 天前的已完成任务(谨慎操作) DELETE FROM platform_schema.job_common WHERE state = 'completed' AND completed_on < NOW() - INTERVAL '7 days'; ``` --- ## 预防措施 ### 代码层面(四层防御) | 层级 | 措施 | 代码位置 | |------|------|---------| | 前端 | `isSubmitting` 防重复点击 | `Dashboard.tsx` | | API | `updateMany` 乐观锁 | `reviewService.ts` | | 队列 | `singletonKey` 去重 | `PgBossQueue.ts` | | Worker | 状态检查跳过已完成 | `reviewWorker.ts` | ### 运维层面 1. **每日巡检**:执行健康检查 SQL 2. **部署后检查**:确认队列定义无重复 3. **告警配置**:设置队列异常告警(待实现) ### 代码审查要点 - [ ] 新增队列时,确保使用 `singletonKey` - [ ] Worker 处理前检查业务状态 - [ ] 避免在 `push()` 中调用 `createQueue()` - [ ] 前端异步操作添加 loading 状态 --- ## 📝 连接信息 ```bash # RDS 测试环境(仅供运维使用) # 外网访问需先开启白名单 psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" # 通过本地 Docker 连接 RDS docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "SQL语句" ``` --- ## 📚 相关文档 - [故障分析报告](../06-测试文档/故障分析报告%20(1).md) - [数据库运维手册](./03-数据库运维手册.md) - [日常更新快速操作手册](../05-部署文档/19-日常更新快速操作手册.md)