Backend fixes: - Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints) - Add singletonKey to prevent duplicate job enqueueing - Add idempotency check in reviewWorker (skip completed tasks) - Add optimistic locking in reviewService (atomic status update) Frontend fixes: - Add isSubmitting state to prevent duplicate submissions in RVW Dashboard - Fix API baseURL in knowledgeBaseApi (relative path) Cleanup (removed): - Old frontend/ directory (migrated to frontend-v2) - python-microservice/ (unused, replaced by extraction_service) - Root package.json and node_modules (accidentally created) - redcap-docker-dev/ (external dependency) - Various temporary files and outdated docs in root New documentation: - docs/07-运维文档/01-PgBoss队列监控与维护.md - docs/07-运维文档/02-故障预防检查清单.md - docs/07-运维文档/03-数据库迁移注意事项.md Database fix applied to RDS: - Added PRIMARY KEY to platform_schema.queue - Added 3 missing foreign key constraints Tested: Local build passed, RDS constraints verified
348 lines
8.4 KiB
Markdown
348 lines
8.4 KiB
Markdown
# PgBoss 队列监控与维护手册
|
||
|
||
> **文档版本**:v1.0
|
||
> **创建日期**:2026-01-27
|
||
> **基于故障**:2026-01-27 RVW 模块任务无限循环故障
|
||
|
||
---
|
||
|
||
## 📋 目录
|
||
|
||
1. [故障背景](#故障背景)
|
||
2. [架构说明](#架构说明)
|
||
3. [日常监控 SQL](#日常监控-sql)
|
||
4. [故障排查指南](#故障排查指南)
|
||
5. [清理操作](#清理操作)
|
||
6. [预防措施](#预防措施)
|
||
|
||
---
|
||
|
||
## 故障背景
|
||
|
||
### 2026-01-27 故障复盘
|
||
|
||
**现象**:RVW 审稿模块任务完成后继续无限循环执行,前端不显示结果
|
||
|
||
**表层原因**:数据库中残留 7 个重复的队列定义,导致单实例在一次事件循环中为同一 taskId 创建了 7 个 Job
|
||
|
||
**根本原因**:**数据库迁移时 `platform_schema.queue` 表的主键约束丢失**
|
||
|
||
| 环境 | `queue` 表主键 | 结果 |
|
||
|------|---------------|------|
|
||
| **本地** | ✅ `queue_pkey` (name 唯一) | `createQueue()` 重复调用会报错被忽略 |
|
||
| **RDS** | ❌ **无主键**(迁移丢失) | `createQueue()` 每次都插入新行 |
|
||
|
||
**证据**:
|
||
```sql
|
||
-- 同一 taskId 被处理 7 次
|
||
task_id: bd19c3d3-80cc-42f7-85a4-d38b17319a1b
|
||
created_on: 2026-01-27 16:06:07.446015+08 (全部相同!)
|
||
job_count: 7
|
||
```
|
||
|
||
**修复**:
|
||
1. 清理 32 个重复队列定义
|
||
2. 添加主键约束:`ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);`
|
||
3. 代码四层防御(前端锁 + API幂等 + singletonKey + Worker检查)
|
||
|
||
详见:[03-数据库迁移注意事项](./03-数据库迁移注意事项.md)
|
||
|
||
---
|
||
|
||
## 架构说明
|
||
|
||
### PgBoss 表结构
|
||
|
||
| 表名 | 说明 | Schema |
|
||
|------|------|--------|
|
||
| `queue` | 队列定义(每种任务类型一条记录) | platform_schema |
|
||
| `job` | 任务记录(旧版,可能未使用) | platform_schema |
|
||
| `job_common` | 任务记录(当前使用) | platform_schema |
|
||
| `schedule` | 定时任务配置 | platform_schema |
|
||
| `subscription` | 订阅配置 | platform_schema |
|
||
| `version` | pg-boss 版本信息 | platform_schema |
|
||
|
||
### 任务状态流转
|
||
|
||
```
|
||
created → active → completed
|
||
→ failed → retry → active
|
||
→ expired
|
||
```
|
||
|
||
---
|
||
|
||
## 日常监控 SQL
|
||
|
||
### 1. 检查重复队列定义(🔴 每日必查)
|
||
|
||
```sql
|
||
-- 如果返回结果,说明有重复队列定义需要清理
|
||
SELECT name, COUNT(*) as cnt
|
||
FROM platform_schema.queue
|
||
GROUP BY name
|
||
HAVING COUNT(*) > 1
|
||
ORDER BY cnt DESC;
|
||
```
|
||
|
||
**预期结果**:无返回(0 行)
|
||
|
||
**异常处理**:参考 [清理操作](#清理操作)
|
||
|
||
---
|
||
|
||
### 2. 检查任务状态分布
|
||
|
||
```sql
|
||
-- 查看各队列的任务状态分布
|
||
SELECT name, state, COUNT(*) as count
|
||
FROM platform_schema.job_common
|
||
GROUP BY name, state
|
||
ORDER BY name, state;
|
||
```
|
||
|
||
**关注点**:
|
||
- `active` 状态任务不应该长期存在
|
||
- `created` 状态任务堆积说明 Worker 未启动或有问题
|
||
- `retry` 状态任务过多说明有系统性错误
|
||
|
||
---
|
||
|
||
### 3. 检查同一 taskId 重复处理
|
||
|
||
```sql
|
||
-- 检查是否有任务被重复处理
|
||
SELECT
|
||
data->>'taskId' as task_id,
|
||
COUNT(*) as job_count,
|
||
MIN(created_on) as first_run,
|
||
MAX(completed_on) as last_run
|
||
FROM platform_schema.job_common
|
||
WHERE name = 'rvw_review_task'
|
||
AND created_on > NOW() - INTERVAL '24 hours'
|
||
GROUP BY data->>'taskId'
|
||
HAVING COUNT(*) > 1
|
||
ORDER BY job_count DESC;
|
||
```
|
||
|
||
**预期结果**:无返回(每个 taskId 只应处理一次)
|
||
|
||
---
|
||
|
||
### 4. 检查卡住的任务
|
||
|
||
```sql
|
||
-- 查找运行超过 30 分钟的活跃任务
|
||
SELECT
|
||
id,
|
||
name,
|
||
state,
|
||
data->>'taskId' as task_id,
|
||
created_on,
|
||
started_on,
|
||
EXTRACT(EPOCH FROM (NOW() - started_on))/60 as running_minutes
|
||
FROM platform_schema.job_common
|
||
WHERE state = 'active'
|
||
AND started_on < NOW() - INTERVAL '30 minutes'
|
||
ORDER BY started_on;
|
||
```
|
||
|
||
---
|
||
|
||
### 5. 队列健康检查汇总
|
||
|
||
```sql
|
||
-- 一键健康检查
|
||
SELECT
|
||
'queue_duplicates' as check_type,
|
||
CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END as status,
|
||
COUNT(*) as count
|
||
FROM (
|
||
SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
|
||
) t
|
||
|
||
UNION ALL
|
||
|
||
SELECT
|
||
'stuck_active_jobs',
|
||
CASE WHEN COUNT(*) > 0 THEN '⚠️ 警告' ELSE '✅ 正常' END,
|
||
COUNT(*)
|
||
FROM platform_schema.job_common
|
||
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
|
||
|
||
UNION ALL
|
||
|
||
SELECT
|
||
'duplicate_tasks_24h',
|
||
CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END,
|
||
COUNT(*)
|
||
FROM (
|
||
SELECT data->>'taskId' FROM platform_schema.job_common
|
||
WHERE created_on > NOW() - INTERVAL '24 hours'
|
||
GROUP BY data->>'taskId' HAVING COUNT(*) > 1
|
||
) t;
|
||
```
|
||
|
||
---
|
||
|
||
## 故障排查指南
|
||
|
||
### 症状 1:任务无限循环
|
||
|
||
**现象**:同一任务反复执行,日志显示不断 "Processing job"
|
||
|
||
**排查步骤**:
|
||
|
||
1. 检查重复队列定义
|
||
```sql
|
||
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
|
||
```
|
||
|
||
2. 检查同一 taskId 的 Job 数量
|
||
```sql
|
||
SELECT data->>'taskId', COUNT(*) FROM platform_schema.job_common
|
||
WHERE name = 'rvw_review_task' GROUP BY data->>'taskId' HAVING COUNT(*) > 1;
|
||
```
|
||
|
||
3. 检查任务状态
|
||
```sql
|
||
SELECT id, name, state, data->>'taskId', retry_count, created_on
|
||
FROM platform_schema.job_common
|
||
WHERE data->>'taskId' = '问题taskId' ORDER BY created_on;
|
||
```
|
||
|
||
**修复**:清理重复队列定义 + 更新任务状态
|
||
|
||
---
|
||
|
||
### 症状 2:任务卡住不执行
|
||
|
||
**现象**:任务状态一直是 `created`,不变成 `active`
|
||
|
||
**排查步骤**:
|
||
|
||
1. 检查 Worker 是否注册
|
||
```bash
|
||
# 查看 SAE 日志
|
||
grep "Worker registered" /logs/app.log
|
||
```
|
||
|
||
2. 检查 pg-boss 连接
|
||
```sql
|
||
SELECT * FROM pg_stat_activity WHERE application_name = 'aiclinical-queue';
|
||
```
|
||
|
||
3. 重启后端服务
|
||
|
||
---
|
||
|
||
### 症状 3:任务状态不一致
|
||
|
||
**现象**:pg-boss 显示 completed,但业务表显示 pending
|
||
|
||
**排查步骤**:
|
||
|
||
1. 对比两边状态
|
||
```sql
|
||
-- pg-boss 状态
|
||
SELECT id, state, data->>'taskId' as task_id, completed_on
|
||
FROM platform_schema.job_common WHERE data->>'taskId' = 'xxx';
|
||
|
||
-- 业务表状态
|
||
SELECT id, status, "completedAt" FROM rvw_schema."ReviewTask" WHERE id = 'xxx';
|
||
```
|
||
|
||
2. 手动同步状态(如确认已完成)
|
||
```sql
|
||
UPDATE rvw_schema."ReviewTask"
|
||
SET status = 'completed', "completedAt" = NOW()
|
||
WHERE id = 'xxx';
|
||
```
|
||
|
||
---
|
||
|
||
## 清理操作
|
||
|
||
### 清理重复队列定义
|
||
|
||
```sql
|
||
-- 删除重复的队列定义,保留最新的一个
|
||
DELETE FROM platform_schema.queue a
|
||
USING platform_schema.queue b
|
||
WHERE a.name = b.name
|
||
AND a.created_on < b.created_on;
|
||
|
||
-- 验证
|
||
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name;
|
||
```
|
||
|
||
---
|
||
|
||
### 清理卡住的任务
|
||
|
||
```sql
|
||
-- 将卡住的 active 任务标记为 failed
|
||
UPDATE platform_schema.job_common
|
||
SET state = 'failed', completed_on = NOW()
|
||
WHERE state = 'active'
|
||
AND started_on < NOW() - INTERVAL '1 hour';
|
||
```
|
||
|
||
---
|
||
|
||
### 清理历史完成任务(释放空间)
|
||
|
||
```sql
|
||
-- 删除 7 天前的已完成任务(谨慎操作)
|
||
DELETE FROM platform_schema.job_common
|
||
WHERE state = 'completed'
|
||
AND completed_on < NOW() - INTERVAL '7 days';
|
||
```
|
||
|
||
---
|
||
|
||
## 预防措施
|
||
|
||
### 代码层面(四层防御)
|
||
|
||
| 层级 | 措施 | 代码位置 |
|
||
|------|------|---------|
|
||
| 前端 | `isSubmitting` 防重复点击 | `Dashboard.tsx` |
|
||
| API | `updateMany` 乐观锁 | `reviewService.ts` |
|
||
| 队列 | `singletonKey` 去重 | `PgBossQueue.ts` |
|
||
| Worker | 状态检查跳过已完成 | `reviewWorker.ts` |
|
||
|
||
### 运维层面
|
||
|
||
1. **每日巡检**:执行健康检查 SQL
|
||
2. **部署后检查**:确认队列定义无重复
|
||
3. **告警配置**:设置队列异常告警(待实现)
|
||
|
||
### 代码审查要点
|
||
|
||
- [ ] 新增队列时,确保使用 `singletonKey`
|
||
- [ ] Worker 处理前检查业务状态
|
||
- [ ] 避免在 `push()` 中调用 `createQueue()`
|
||
- [ ] 前端异步操作添加 loading 状态
|
||
|
||
---
|
||
|
||
## 📝 连接信息
|
||
|
||
```bash
|
||
# RDS 测试环境(仅供运维使用)
|
||
# 外网访问需先开启白名单
|
||
psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test"
|
||
|
||
# 通过本地 Docker 连接 RDS
|
||
docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "SQL语句"
|
||
```
|
||
|
||
---
|
||
|
||
## 📚 相关文档
|
||
|
||
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md)
|
||
- [数据库运维手册](./03-数据库运维手册.md)
|
||
- [日常更新快速操作手册](../05-部署文档/19-日常更新快速操作手册.md)
|