fix(backend): Resolve PgBoss infinite loop issue and cleanup unused files
Backend fixes:
- Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints)
- Add singletonKey to prevent duplicate job enqueueing
- Add idempotency check in reviewWorker (skip completed tasks)
- Add optimistic locking in reviewService (atomic status update)

Frontend fixes:
- Add isSubmitting state to prevent duplicate submissions in RVW Dashboard
- Fix API baseURL in knowledgeBaseApi (relative path)

Cleanup (removed):
- Old frontend/ directory (migrated to frontend-v2)
- python-microservice/ (unused, replaced by extraction_service)
- Root package.json and node_modules (accidentally created)
- redcap-docker-dev/ (external dependency)
- Various temporary files and outdated docs in root

New documentation:
- docs/07-运维文档/01-PgBoss队列监控与维护.md
- docs/07-运维文档/02-故障预防检查清单.md
- docs/07-运维文档/03-数据库迁移注意事项.md

Database fix applied to RDS:
- Added PRIMARY KEY to platform_schema.queue
- Added 3 missing foreign key constraints

Tested: Local build passed, RDS constraints verified
347
docs/07-运维文档/01-PgBoss队列监控与维护.md
Normal file
@@ -0,0 +1,347 @@
# PgBoss Queue Monitoring and Maintenance Manual

> **Document version**: v1.0
> **Created**: 2026-01-27
> **Triggering incident**: 2026-01-27 infinite task loop in the RVW module

---

## 📋 Table of contents

1. [Incident background](#incident-background)
2. [Architecture](#architecture)
3. [Routine monitoring SQL](#routine-monitoring-sql)
4. [Troubleshooting guide](#troubleshooting-guide)
5. [Cleanup operations](#cleanup-operations)
6. [Preventive measures](#preventive-measures)

---

## Incident background

### 2026-01-27 incident review

**Symptom**: After an RVW review task finished, it kept executing in an endless loop, and the frontend never showed the result.

**Surface cause**: 7 duplicate queue definitions were left in the database, so a single instance created 7 jobs for the same taskId within one event-loop tick.

**Root cause**: **The primary key constraint on `platform_schema.queue` was lost during the database migration.**

| Environment | Primary key on `queue` | Result |
|-------------|------------------------|--------|
| **Local** | ✅ `queue_pkey` (unique `name`) | A repeated `createQueue()` call raises an error, which is ignored |
| **RDS** | ❌ **No primary key** (lost in migration) | Every `createQueue()` call inserts a new row |

**Evidence**:
```sql
-- The same taskId was processed 7 times
task_id: bd19c3d3-80cc-42f7-85a4-d38b17319a1b
created_on: 2026-01-27 16:06:07.446015+08 (identical for all 7 jobs!)
job_count: 7
```

**Fix**:
1. Removed 32 duplicate queue definitions
2. Added the primary key constraint: `ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);`
3. Added four layers of defence in code (frontend lock + API idempotency + singletonKey + worker check)

See: [03-Database migration notes](./03-数据库迁移注意事项.md)

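
The difference between the two environments can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `QueueTable`, `createQueue`, and `simulateRestarts` are hypothetical stand-ins written for this example, not pg-boss APIs, and the count of 7 startups simply mirrors the incident above.

```typescript
// Hypothetical in-memory model of platform_schema.queue; not the pg-boss API.
type QueueRow = { name: string; createdOn: number };

class QueueTable {
  rows: QueueRow[] = [];
  constructor(private readonly hasPrimaryKey: boolean) {}

  // Mimics INSERT: a duplicate name is rejected only when the primary key exists.
  insert(row: QueueRow): void {
    if (this.hasPrimaryKey && this.rows.some((r) => r.name === row.name)) {
      throw new Error('duplicate key value violates unique constraint "queue_pkey"');
    }
    this.rows.push(row);
  }
}

// Mimics a startup routine that always tries to create the queue and
// swallows the duplicate-key error, as described in the table above.
function createQueue(table: QueueTable, name: string, now: number): void {
  try {
    table.insert({ name, createdOn: now });
  } catch {
    // Queue already defined: nothing to do.
  }
}

function simulateRestarts(hasPrimaryKey: boolean, restarts: number): number {
  const table = new QueueTable(hasPrimaryKey);
  for (let i = 0; i < restarts; i++) {
    createQueue(table, "rvw_review_task", i);
  }
  return table.rows.length;
}

const localRows = simulateRestarts(true, 7); // primary key present
const rdsRows = simulateRestarts(false, 7); // primary key lost
console.log({ localRows, rdsRows });
```

With the primary key in place the table keeps a single definition; without it, every startup adds another row.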

---

## Architecture

### PgBoss table structure

| Table | Description | Schema |
|-------|-------------|--------|
| `queue` | Queue definitions (one row per task type) | platform_schema |
| `job` | Job records (legacy, possibly unused) | platform_schema |
| `job_common` | Job records (currently in use) | platform_schema |
| `schedule` | Scheduled job configuration | platform_schema |
| `subscription` | Subscription configuration | platform_schema |
| `version` | pg-boss version information | platform_schema |

### Job state transitions

```
created → active → completed
                 → failed → retry → active
                 → expired
```

---

## Routine monitoring SQL

### 1. Check for duplicate queue definitions (🔴 check daily)

```sql
-- Any row returned means duplicate queue definitions exist and need cleanup
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY cnt DESC;
```

**Expected result**: no rows returned (0 rows)

**If abnormal**: see [Cleanup operations](#cleanup-operations)

---

### 2. Check job state distribution

```sql
-- Show the job state distribution for each queue
SELECT name, state, COUNT(*) as count
FROM platform_schema.job_common
GROUP BY name, state
ORDER BY name, state;
```

**What to watch for**:
- Jobs should not stay in the `active` state for long
- A backlog of `created` jobs means the worker is not running or is failing
- A large number of `retry` jobs points to a systemic error

---

### 3. Check for the same taskId being processed repeatedly

```sql
-- Check whether any task was processed more than once
SELECT
  data->>'taskId' as task_id,
  COUNT(*) as job_count,
  MIN(created_on) as first_run,
  MAX(completed_on) as last_run
FROM platform_schema.job_common
WHERE name = 'rvw_review_task'
  AND created_on > NOW() - INTERVAL '24 hours'
GROUP BY data->>'taskId'
HAVING COUNT(*) > 1
ORDER BY job_count DESC;
```

**Expected result**: no rows returned (each taskId should be processed only once)

---

### 4. Check for stuck jobs

```sql
-- Find active jobs that have been running for more than 30 minutes
SELECT
  id,
  name,
  state,
  data->>'taskId' as task_id,
  created_on,
  started_on,
  EXTRACT(EPOCH FROM (NOW() - started_on))/60 as running_minutes
FROM platform_schema.job_common
WHERE state = 'active'
  AND started_on < NOW() - INTERVAL '30 minutes'
ORDER BY started_on;
```

---

### 5. Queue health check summary

```sql
-- One-shot health check
SELECT
  'queue_duplicates' as check_type,
  CASE WHEN COUNT(*) > 0 THEN '❌ abnormal' ELSE '✅ OK' END as status,
  COUNT(*) as count
FROM (
  SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
) t

UNION ALL

SELECT
  'stuck_active_jobs',
  CASE WHEN COUNT(*) > 0 THEN '⚠️ warning' ELSE '✅ OK' END,
  COUNT(*)
FROM platform_schema.job_common
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'

UNION ALL

SELECT
  'duplicate_tasks_24h',
  CASE WHEN COUNT(*) > 0 THEN '❌ abnormal' ELSE '✅ OK' END,
  COUNT(*)
FROM (
  SELECT data->>'taskId' FROM platform_schema.job_common
  WHERE created_on > NOW() - INTERVAL '24 hours'
    -- Skip jobs without a taskId; otherwise they all fall into one NULL group
    AND data->>'taskId' IS NOT NULL
  GROUP BY data->>'taskId' HAVING COUNT(*) > 1
) t;
```

---

## Troubleshooting guide

### Symptom 1: infinite task loop

**Symptom**: the same task runs over and over, and the log keeps printing "Processing job"

**Diagnosis steps**:

1. Check for duplicate queue definitions
   ```sql
   SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
   ```

2. Check the number of jobs for the same taskId
   ```sql
   SELECT data->>'taskId', COUNT(*) FROM platform_schema.job_common
   WHERE name = 'rvw_review_task' GROUP BY data->>'taskId' HAVING COUNT(*) > 1;
   ```

3. Check the job states
   ```sql
   SELECT id, name, state, data->>'taskId', retry_count, created_on
   FROM platform_schema.job_common
   WHERE data->>'taskId' = '<problem-task-id>' ORDER BY created_on;
   ```

**Fix**: clean up the duplicate queue definitions and update the task status

---

### Symptom 2: job stuck and never executed

**Symptom**: the job state stays at `created` and never becomes `active`

**Diagnosis steps**:

1. Check whether the worker is registered
   ```bash
   # Inspect the SAE logs
   grep "Worker registered" /logs/app.log
   ```

2. Check the pg-boss connection
   ```sql
   SELECT * FROM pg_stat_activity WHERE application_name = 'aiclinical-queue';
   ```

3. Restart the backend service

---

### Symptom 3: inconsistent task status

**Symptom**: pg-boss shows `completed`, but the business table shows `pending`

**Diagnosis steps**:

1. Compare the status on both sides
   ```sql
   -- pg-boss state
   SELECT id, state, data->>'taskId' as task_id, completed_on
   FROM platform_schema.job_common WHERE data->>'taskId' = 'xxx';

   -- Business table status
   SELECT id, status, "completedAt" FROM rvw_schema."ReviewTask" WHERE id = 'xxx';
   ```

2. Synchronize the status manually (only after confirming the task really completed)
   ```sql
   UPDATE rvw_schema."ReviewTask"
   SET status = 'completed', "completedAt" = NOW()
   WHERE id = 'xxx';
   ```

---

## Cleanup operations

### Clean up duplicate queue definitions

```sql
-- Delete duplicate queue definitions, keeping the newest row for each name
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name
  AND a.created_on < b.created_on;

-- Verify: every name should now have a count of 1
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name;
```
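
One detail of the `DELETE ... USING` statement above is worth spelling out: a row is removed only when a strictly newer row with the same name exists, so rows that share the newest `created_on` all survive. The sketch below models that rule in memory; `survivors` and the sample timestamps are hypothetical and exist only for illustration.

```typescript
type QueueRow = { name: string; createdOn: string };

// Models: DELETE FROM queue a USING queue b
//         WHERE a.name = b.name AND a.created_on < b.created_on;
// A row survives unless a strictly newer row with the same name exists.
function survivors(rows: QueueRow[]): QueueRow[] {
  return rows.filter(
    (a) => !rows.some((b) => a.name === b.name && a.createdOn < b.createdOn),
  );
}

// Hypothetical sample data; ISO timestamps compare correctly as strings.
const remaining = survivors([
  { name: "rvw_review_task", createdOn: "2026-01-26T10:00:00Z" },
  { name: "rvw_review_task", createdOn: "2026-01-27T08:00:00Z" },
  { name: "rvw_review_task", createdOn: "2026-01-27T08:00:00Z" }, // ties with the newest
  { name: "other_queue", createdOn: "2026-01-20T09:00:00Z" },
]);
console.log(remaining.length);
```

If the verification query still reports a count above 1 after the cleanup, the remaining rows are probably such ties, and they would need a different tiebreaker (for example PostgreSQL's `ctid` system column) before the primary key can be added.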

---

### Clean up stuck jobs

```sql
-- Mark stuck active jobs as failed
UPDATE platform_schema.job_common
SET state = 'failed', completed_on = NOW()
WHERE state = 'active'
  AND started_on < NOW() - INTERVAL '1 hour';
```

---

### Clean up old completed jobs (free up space)

```sql
-- Delete completed jobs older than 7 days (use with caution)
DELETE FROM platform_schema.job_common
WHERE state = 'completed'
  AND completed_on < NOW() - INTERVAL '7 days';
```

---

## Preventive measures

### Code level (four layers of defence)

| Layer | Measure | Code location |
|-------|---------|---------------|
| Frontend | `isSubmitting` guards against repeated clicks | `Dashboard.tsx` |
| API | `updateMany` optimistic lock | `reviewService.ts` |
| Queue | `singletonKey` de-duplication | `PgBossQueue.ts` |
| Worker | Status check that skips completed tasks | `reviewWorker.ts` |

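
The API and worker layers in the table can be sketched with an in-memory stand-in for the `ReviewTask` table. This is not the code from `reviewService.ts` or `reviewWorker.ts`; `updateStatusIf`, `startReview`, `handleJob`, and the status names are assumptions made for the example. The point is only that a conditional update returning a row count lets exactly one caller win, and that the worker ignores tasks that are already completed.

```typescript
// Hypothetical in-memory stand-in for the ReviewTask table.
type Status = "pending" | "reviewing" | "completed";
type ReviewTask = { id: string; status: Status };

const tasks = new Map<string, ReviewTask>([["t1", { id: "t1", status: "pending" }]]);

// Models an atomic "UPDATE ... WHERE id = ? AND status = ?" and returns the
// number of rows changed, similar to the count reported by an updateMany call.
function updateStatusIf(id: string, expected: Status, next: Status): number {
  const task = tasks.get(id);
  if (!task || task.status !== expected) return 0;
  task.status = next;
  return 1;
}

// API layer: only the caller that flips pending -> reviewing may enqueue a job.
function startReview(id: string): boolean {
  return updateStatusIf(id, "pending", "reviewing") === 1;
}

// Worker layer: skip jobs whose task is already completed.
let executions = 0;
function handleJob(id: string): "done" | "skipped" {
  const task = tasks.get(id);
  if (!task || task.status === "completed") return "skipped";
  executions += 1; // the actual review work would happen here
  task.status = "completed";
  return "done";
}

const firstStart = startReview("t1"); // wins the lock
const secondStart = startReview("t1"); // loses: status is no longer pending
const firstRun = handleJob("t1");
const duplicateRun = handleJob("t1"); // a duplicate job is ignored
console.log({ firstStart, secondStart, firstRun, duplicateRun, executions });
```
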
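
### Operations level

1. **Daily inspection**: run the health check SQL
2. **Post-deployment check**: confirm there are no duplicate queue definitions
3. **Alerting**: set up alerts for queue anomalies (not yet implemented)

### Code review points

- [ ] When adding a queue, make sure `singletonKey` is used
- [ ] The worker checks the business status before processing
- [ ] Avoid calling `createQueue()` inside `push()`
- [ ] Frontend async operations have a loading state
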
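
The idea behind the `singletonKey` item can be modelled without a database. The sketch assumes that a new job is rejected while another job with the same key is still queued or active; the `send` function, the `Job` type, and the key format below are hypothetical and are not the pg-boss implementation.

```typescript
// Hypothetical in-memory queue; it models the de-duplication idea only.
type JobState = "created" | "active" | "completed";
type Job = { id: number; singletonKey: string; state: JobState };

const jobs: Job[] = [];
let nextId = 1;

// Returns the new job id, or null when a job with the same key is still pending.
function send(singletonKey: string): number | null {
  const pending = jobs.some(
    (j) => j.singletonKey === singletonKey && j.state !== "completed",
  );
  if (pending) return null;
  const job: Job = { id: nextId++, singletonKey, state: "created" };
  jobs.push(job);
  return job.id;
}

const key = "rvw_review_task:bd19c3d3"; // hypothetical key format: queue name + taskId
const first = send(key);
const second = send(key); // rejected while the first job is still pending

const firstJob = jobs.find((j) => j.id === first);
if (firstJob) firstJob.state = "completed";

const third = send(key); // accepted again once the first job has finished
console.log({ first, second, third });
```

Deriving the key from the business taskId is what makes seven identical enqueue attempts collapse into one job.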

---

## 📝 Connection information

```bash
# RDS test environment (for operations staff only)
# External access requires adding your IP to the whitelist first
# <PASSWORD> is a placeholder: keep the real (URL-encoded) password out of version control
psql "postgresql://airesearch:<PASSWORD>@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test"

# Connect to RDS through the local Docker container
docker exec ai-clinical-postgres psql "postgresql://airesearch:<PASSWORD>@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "<SQL statement>"
```

---

## 📚 Related documents

- [Incident analysis report](../06-测试文档/故障分析报告%20(1).md)
- [Database operations manual](./03-数据库运维手册.md)
- [Quick guide for routine updates](../05-部署文档/19-日常更新快速操作手册.md)
200
docs/07-运维文档/02-故障预防检查清单.md
Normal file
@@ -0,0 +1,200 @@
# Incident Prevention Checklist

> **Document version**: v1.0
> **Created**: 2026-01-27
> **Use cases**: pre- and post-deployment checks, routine inspection, incident prevention

---

## 📋 Pre-deployment checks

### Code checks

- [ ] **Frontend duplicate-submission protection**
  - [ ] Async operations have an `isSubmitting` or `loading` state
  - [ ] The submit button is disabled while the request is in flight
  - [ ] `finally` is used so the state is always unlocked

- [ ] **Backend API idempotency**
  - [ ] Status updates use `updateMany` with a WHERE condition
  - [ ] The update count is checked, and an error is returned when it is 0
  - [ ] Avoid `update` followed by a status check (not atomic)

- [ ] **Queue jobs**
  - [ ] `singletonKey` is used to prevent duplicate enqueueing
  - [ ] The worker checks the business status before processing
  - [ ] `createQueue()` is not called inside `push()`

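
The frontend items in the checklist above can be sketched as a submit guard. This is a minimal illustration, not the code in `Dashboard.tsx`: `isSubmitting` would be component state in a real React component, and `submitReviewRequest` is a hypothetical stand-in for the real API call.

```typescript
// Hypothetical submit guard; in a React component `isSubmitting` would be state.
let isSubmitting = false;
let requestsSent = 0;

// Stand-in for the real API call (assumption: it completes asynchronously).
async function submitReviewRequest(): Promise<void> {
  requestsSent += 1;
  await Promise.resolve();
}

async function handleSubmit(): Promise<boolean> {
  if (isSubmitting) return false; // ignore clicks while a request is in flight
  isSubmitting = true;
  try {
    await submitReviewRequest();
    return true;
  } finally {
    isSubmitting = false; // always unlock, even if the request throws
  }
}

// Two clicks in the same tick: only the first one sends a request.
async function demo(): Promise<boolean[]> {
  return Promise.all([handleSubmit(), handleSubmit()]);
}
```

Releasing the flag in `finally` matters because a failed request would otherwise leave the button locked for good.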

### Dependency checks

- [ ] Database connection is healthy
- [ ] Redis connection is healthy (if used)
- [ ] Third-party APIs are reachable
- [ ] Environment variables are fully configured

---

## 📋 Post-deployment checks

### Immediate checks (within 5 minutes of deployment)

```sql
-- 0. 🔴 Check the 4 key pg-boss constraints (mandatory after a database migration!)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
  AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- Expected: 4 rows
-- If fewer than 4 rows are returned, follow 03-数据库迁移注意事项.md to repair

-- 1. Check for duplicate queue definitions
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- Expected: no rows
```

```sql
-- 2. Check whether the worker is registered (look at the logs)
-- Search the SAE logs for:
-- "Worker registered" or "Handler registered"
```

```bash
# 3. Health check endpoint
curl https://your-domain/api/health
# Expected: {"status":"ok"}
```

### Functional verification (within 30 minutes of deployment)

- [ ] User login works
- [ ] Core features are available
- [ ] Async jobs run normally
- [ ] File upload and download work

---

## 📋 Routine inspection checklist

### Daily checks (09:00)

| Check | SQL / command | Expected result |
|-------|---------------|-----------------|
| Duplicate queue definitions | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | No rows |
| Stuck jobs | `SELECT COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes';` | 0 |
| Database connections | `SELECT count(*) FROM pg_stat_activity;` | < 80 |

### One-shot health check

```sql
SELECT
  'queue_duplicates' as check_type,
  CASE WHEN COUNT(*) > 0 THEN '❌ abnormal' ELSE '✅ OK' END as status,
  COUNT(*) as count
FROM (
  SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
) t

UNION ALL

SELECT
  'stuck_active_jobs',
  CASE WHEN COUNT(*) > 0 THEN '⚠️ warning' ELSE '✅ OK' END,
  COUNT(*)
FROM platform_schema.job_common
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'

UNION ALL

SELECT
  'db_connections',
  CASE WHEN COUNT(*) > 80 THEN '⚠️ warning' ELSE '✅ OK' END,
  COUNT(*)
FROM pg_stat_activity;
```

---

## 📋 Code review checklist (PR review)

### Frontend

- [ ] Form submission has a loading state
- [ ] Buttons are debounced or disabled
- [ ] Error handling is complete (try-catch-finally)
- [ ] Async requests have a timeout

### Backend API

- [ ] Transactions or atomic operations are used
- [ ] Status is checked before the update
- [ ] Error messages are explicit
- [ ] Key steps are logged

### Queue jobs

- [ ] `singletonKey` is set
- [ ] The worker has an idempotency check
- [ ] Timeout and retry settings are reasonable
- [ ] Failure handling is complete

---

## 📋 Incident response process

### 1. Detection

- User reports
- Monitoring alerts
- Log anomalies

### 2. Initial diagnosis (within 5 minutes)

```sql
-- Quick health check
SELECT
  'queue_duplicates' as t, COUNT(*) FROM (SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1) x
UNION ALL
SELECT 'stuck_jobs', COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
UNION ALL
SELECT 'db_connections', COUNT(*) FROM pg_stat_activity;
```

### 3. Emergency fix

| Problem type | Emergency action |
|--------------|------------------|
| Infinite task loop | Clean up duplicate queue definitions, restart the service |
| Database connections exhausted | Force-close idle connections, restart the service |
| Service unavailable | Restart the SAE application |

### 4. Root cause analysis

- Collect logs
- Analyze the database state
- Review the code logic

### 5. Long-term fix

- Submit a PR with the code fix
- Update the documentation
- Add preventive measures

---

## 📋 Incident history

| Date | Incident type | Scope | Root cause | Fix |
|------|---------------|-------|------------|-----|
| 2026-01-27 | Infinite task loop | RVW module | **Primary key constraint lost in database migration** → duplicate queue definitions | Added the primary key + four layers of defence |
| 2026-01-27 | Garbled Chinese text | Whole system | PowerShell encoding issue | Run the migration inside Docker |
| 2026-01-11 | Database incident | Whole system | Operator error | Backup and restore procedure |

---

## 📚 Related documents

- [PgBoss queue monitoring and maintenance](./01-PgBoss队列监控与维护.md)
- [Database operations manual](./03-数据库运维手册.md)
- [Incident analysis report](../06-测试文档/故障分析报告%20(1).md)
379
docs/07-运维文档/03-数据库迁移注意事项.md
Normal file
@@ -0,0 +1,379 @@
# Database Migration Notes

> **Document version**: v1.1
> **Created**: 2026-01-27
> **Last updated**: 2026-01-27
> **Triggering incident**: 2026-01-27 RVW infinite task loop (root cause: lost pg-boss table constraints)

---

## 📋 Background

### Incident review

**Symptom**: After an RVW review task finished, it kept executing in an endless loop, and the same taskId was processed 7 times.

**Root cause**: When the database was migrated from the local environment to RDS, **the constraints on the pg-boss tables** in `platform_schema` were lost.

| Lost constraint | Type | Impact |
|-----------------|------|--------|
| `queue_pkey` | Primary key | 🔴 **Caused duplicate queue definitions** |
| `queue_dead_letter_fkey` | Foreign key | Referential integrity of dead-letter queues lost |
| `schedule_name_fkey` | Foreign key | Referential integrity of scheduled jobs lost |
| `subscription_name_fkey` | Foreign key | Referential integrity of subscriptions lost |

**Impact**: Every SAE restart created a new queue definition, so the same task was enqueued multiple times.

---

## 🔍 Why were only the pg-boss constraints lost?

### The "two owners" problem of the pg-boss tables

| Category | Example tables | Managed by | Constraint creation |
|----------|----------------|------------|---------------------|
| **Business tables** | `users`, `tenants`, `ReviewTask` | Prisma Migration | ✅ Migration files explicitly contain all constraints |
| **pg-boss tables** | `queue`, `job`, `schedule` | Created automatically by pg-boss at runtime | ⚠️ Created only on first startup |

**Key points**:
1. **Business tables are managed by Prisma**: they are created through `prisma migrate deploy`, with complete constraints
2. **pg-boss tables are defined in the Prisma schema**, but only because they were pulled in with `db pull`; Prisma is not responsible for creating them
3. **pg-boss tables are actually created by `boss.start()`**: if the tables already exist, pg-boss skips the creation step

### What happened during the migration?

```
Local database           pg_dump      RDS
┌───────────────────┐  ────────►  ┌───────────────────┐
│ queue table       │             │ queue table       │
│ ✅ queue_pkey     │             │ ❌ queue_pkey     │ ← lost!
│ ✅ 3 foreign keys │             │ ❌ 3 foreign keys │ ← lost!
└───────────────────┘             └───────────────────┘

Causes:
1. pg_dump defers foreign keys when exporting
2. The queue table has a self-referencing foreign key (dead_letter → name)
3. The constraint creation order during import caused failures
4. pg-boss detected that the tables already existed and skipped constraint creation
```

### The self-referencing foreign key is the culprit

```sql
-- Self-referencing foreign key on the queue table (easily lost during migration)
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name)
```

---

## 🔴 Mandatory checks after migration

### 1. Check the 4 key pg-boss constraints (most important!)

```sql
-- 🔴 One-shot check of the key pg-boss constraints (must return 4 rows)
SELECT conname, contype,
  CASE contype WHEN 'p' THEN 'primary key' WHEN 'f' THEN 'foreign key' END as type_desc
FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
  AND conname IN (
    'queue_pkey',              -- primary key (prevents duplicate queues)
    'queue_dead_letter_fkey',  -- self-referencing foreign key
    'schedule_name_fkey',      -- schedule → queue foreign key
    'subscription_name_fkey'   -- subscription → queue foreign key
  )
ORDER BY conname;

-- Expected result (must be 4 rows):
-- conname                  | contype | type_desc
-- -------------------------+---------+-------------
-- queue_dead_letter_fkey   | f       | foreign key
-- queue_pkey               | p       | primary key
-- schedule_name_fkey       | f       | foreign key
-- subscription_name_fkey   | f       | foreign key
```

**If fewer than 4 rows are returned, repair immediately**:

```sql
-- Fix 1: add queue_pkey (if missing)
-- Remove duplicate rows first
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name
  AND a.created_on < b.created_on;
-- Add the primary key
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);

-- Fix 2: add queue_dead_letter_fkey (if missing)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);

-- Fix 3: add schedule_name_fkey (if missing)
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;

-- Fix 4: add subscription_name_fkey (if missing)
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
```
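
To decide which of the repair statements above are actually needed, the `conname` values returned by the check query can be compared against the four expected names. The helper below is a small sketch; `missingConstraints` is a hypothetical name, and the input arrays are sample data rather than real query output.

```typescript
// The four constraints named in this document.
const REQUIRED_CONSTRAINTS = [
  "queue_pkey",
  "queue_dead_letter_fkey",
  "schedule_name_fkey",
  "subscription_name_fkey",
] as const;

// `found` is the list of conname values returned by the check query.
function missingConstraints(found: string[]): string[] {
  const present = new Set(found);
  return REQUIRED_CONSTRAINTS.filter((name) => !present.has(name));
}

// Sample inputs: nothing present (all four lost), and a partially repaired state.
const allLost = missingConstraints([]);
const partial = missingConstraints(["queue_pkey", "schedule_name_fkey"]);
console.log(allLost, partial);
```

Only the fixes whose constraint name appears in the returned list need to be run, since adding a constraint that already exists fails with an error.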

---

### 2. Check for duplicate queue definitions

```sql
-- Check whether duplicate queue definitions exist
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;

-- Expected result: no rows returned (0 rows)
```

---

### 3. Compare global constraint counts

```sql
-- Check the number of constraints per schema (compare with local)
SELECT n.nspname as schema, COUNT(*) as constraint_count
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY n.nspname;

-- 🔴 platform_schema must have 33 constraints
-- Fewer than 33 means constraints were lost
```

---

### 4. Check the indexes on all pg-boss tables

```sql
-- List every table in platform_schema with its index count
SELECT
  t.tablename,
  COUNT(i.indexname) as index_count
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND t.schemaname = i.schemaname
WHERE t.schemaname = 'platform_schema'
GROUP BY t.tablename
ORDER BY t.tablename;

-- Key tables should have indexes:
-- queue: 1 (queue_pkey)
-- job_common: multiple indexes
-- schedule: has indexes
```

---

## 📋 Complete database migration checklist

### Before migration

- [ ] Back up the source database
- [ ] Record the table structure and indexes of the source database
- [ ] Confirm that the pg_dump options include constraints and indexes

### During migration

```bash
# Recommended pg_dump options (full structure included)
pg_dump -h localhost -U postgres -d ai_clinical_research \
  --no-owner \
  --no-acl \
  --format=plain \
  --encoding=UTF8 \
  > backup.sql

# Do not use --data-only (constraints would be lost)
```

### After migration (🔴 key checks)

| Check | SQL | Expected |
|-------|-----|----------|
| queue primary key | `SELECT indexname FROM pg_indexes WHERE schemaname='platform_schema' AND tablename='queue';` | `queue_pkey` |
| Duplicate queues | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | No rows |
| Table count | `SELECT COUNT(*) FROM pg_tables WHERE schemaname='platform_schema';` | Same as source |
| Index count | `SELECT COUNT(*) FROM pg_indexes WHERE schemaname='platform_schema';` | Same as source |

---

## 🛠️ Common problems and fixes

### Problem 1: pg-boss table constraints lost (2026-01-27 incident)

**Symptoms**:
- A new queue definition is created every time the application starts
- The same task is processed multiple times
- RVW review tasks loop endlessly

**Lost constraints (4 in total)**:

| Constraint | Type | Impact |
|------------|------|--------|
| `queue_pkey` | Primary key | 🔴 Causes duplicate queue definitions |
| `queue_dead_letter_fkey` | Foreign key | Dead-letter queue references break |
| `schedule_name_fkey` | Foreign key | Scheduled job references break |
| `subscription_name_fkey` | Foreign key | Subscription references break |

**One-shot repair script**:

```sql
-- 🔴 Before running, check which constraints are missing
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
  AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');

-- Repair steps:

-- 1. Clean up duplicate queue definitions (if any)
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name AND a.created_on < b.created_on;

-- 2. Add the primary key (if missing)
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);

-- 3. Add the foreign key constraints (if missing)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);

ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;

ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;

-- 4. Verify (must return 4 rows)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
  AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
```

---

### Problem 2: garbled Chinese text after migration

**Symptom**: user names and tenant names are displayed as `????`

**Cause**: PowerShell's default encoding is not UTF-8

**Fix**: run the export and import inside the Docker container

```bash
# Export inside the container (bypasses the PowerShell encoding issue)
docker exec -it postgres-container bash -c "pg_dump ... > /tmp/backup.sql"

# Import inside the container
docker exec -i postgres-container psql ... < backup.sql
```

See: [05-部署文档/0126部署/08-部署完成总结.md](../05-部署文档/0126部署/08-部署完成总结.md)

---

### Problem 3: foreign key constraint failures after migration

**Symptom**: `ERROR: insert or update violates foreign key constraint`

**Cause**: tables were imported in the wrong order

**Fix**: use a transaction or import in the correct dependency order

```sql
-- Temporarily disable foreign key checks (use with caution)
SET session_replication_role = replica;

-- Import the data...

-- Restore foreign key checks
SET session_replication_role = DEFAULT;
```

---

## 📊 Migration comparison script

Run this script after the migration to compare the source and target databases:

```sql
-- 🔴 One-shot health check (must be run after migration)

-- 1. Global statistics (compare with local)
SELECT
  (SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
  (SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
  (SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;

-- Expected values (2026-01-27): tables=64, indexes=236, constraints=108

-- 2. Number of constraints in platform_schema
SELECT COUNT(*) as platform_constraints
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname = 'platform_schema';

-- Expected value: 33

-- 3. 🔴 The 4 key pg-boss constraints
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
  AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');

-- Expected: 4 rows

-- 4. Duplicate queue check
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;

-- Expected: 0 rows
```

### Quick comparison commands: local vs RDS

```bash
# Local
docker exec ai-clinical-postgres psql "postgresql://postgres:postgres123@localhost:5432/ai_clinical_research" -c "
SELECT
  (SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
  (SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
  (SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"

# RDS (<PASSWORD> is a placeholder: keep the real, URL-encoded password out of version control)
docker exec ai-clinical-postgres psql "postgresql://airesearch:<PASSWORD>@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "
SELECT
  (SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
  (SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
  (SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
```
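
Once both commands have produced their three counts, the comparison itself is mechanical. The sketch below shows one way to report the differences; `compareStats` and `SchemaStats` are hypothetical names, the source values are the expected counts recorded above for 2026-01-27, and the target values are invented for the example.

```typescript
type SchemaStats = { tables: number; indexes: number; constraints: number };

// Returns one message per metric that differs between source and target.
function compareStats(source: SchemaStats, target: SchemaStats): string[] {
  const keys: (keyof SchemaStats)[] = ["tables", "indexes", "constraints"];
  return keys
    .filter((k) => source[k] !== target[k])
    .map((k) => `${k}: source=${source[k]}, target=${target[k]}`);
}

// Source: expected counts from this document. Target: hypothetical values.
const local: SchemaStats = { tables: 64, indexes: 236, constraints: 108 };
const rds: SchemaStats = { tables: 64, indexes: 235, constraints: 104 };
const differences = compareStats(local, rds);
console.log(differences);
```

An empty result means the totals match; any entry is a cue to drill into the per-schema checks earlier in this document.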

---

## 📚 Related documents

- [01-PgBoss queue monitoring and maintenance](./01-PgBoss队列监控与维护.md) - routine queue monitoring
- [02-Incident prevention checklist](./02-故障预防检查清单.md) - deployment checklist
- [Incident analysis report](../06-测试文档/故障分析报告%20(1).md) - original incident analysis

---

## 📝 Change log

| Date | Version | Changes |
|------|---------|---------|
| 2026-01-27 | v1.1 | Added the analysis of the 3 lost foreign key constraints and the explanation of the "two owners" problem |
| 2026-01-27 | v1.0 | Initial version, based on the RVW infinite task loop incident |
@@ -1,50 +1,59 @@
 # Operations Documentation

-> **Document scope:** system operations, monitoring, troubleshooting
-> **Audience:** operations team, SRE team
+> **Purpose**: record monitoring, troubleshooting, and preventive measures related to system operations
+> **Created**: 2026-01-27
+> **Maintainer**: operations team

 ---

-## 📋 Operations document list
+## 📚 Document index

-| Document | Description | Status |
-|------|------|------|
-| **01-环境配置指南.md** | Environment variables, database connection, API key configuration | ✅ Done |
-| **02-环境变量配置模板.md** | .env configuration template, including CloseAI settings ⭐ | ✅ Done |
-| **03-监控告警.md** | Monitoring metrics, alert rules | ⏳ To be created |
-| **04-故障排查.md** | Troubleshooting manual for common problems | ⏳ To be created |
-| **05-备份恢复.md** | Data backup and restore strategy | ⏳ To be created |
+| Document | Description | Priority |
+|------|------|--------|
+| [01-PgBoss queue monitoring and maintenance](./01-PgBoss队列监控与维护.md) | Monitoring, cleanup, and troubleshooting of the pg-boss job queue | 🔴 High |
+| [02-Incident prevention checklist](./02-故障预防检查清单.md) | Pre- and post-deployment checklists that prevent common incidents | 🔴 High |
+| [03-Database migration notes](./03-数据库迁移注意事项.md) | Checks to run during database migration to avoid losing constraints | 🔴 High |

 ---

-## 🎯 Core operations tasks
+## 🔧 Quick reference

-### 1. Monitoring
-- System health checks
-- Performance monitoring
-- Alert notifications
+### Routine check SQL

-### 2. Logging
-- Log collection
-- Log analysis
-- Log archiving
+```sql
+-- Check for duplicate queue definitions
+SELECT name, COUNT(*) as cnt
+FROM platform_schema.queue
+GROUP BY name
+HAVING COUNT(*) > 1;

-### 3. Backup
-- Database backup
-- File backup
-- Restore drills
+-- Check job state distribution
+SELECT name, state, COUNT(*)
+FROM platform_schema.job_common
+GROUP BY name, state
+ORDER BY name, state;
+```

-### 4. Incident handling
-- Incident diagnosis
-- Contingency plans
-- Post-incident review
+### Emergency incident handling

+1. **Infinite task loop** → see [01-PgBoss queue monitoring and maintenance](./01-PgBoss队列监控与维护.md)
+2. **Database connections exhausted** → see [03-Database operations manual](./03-数据库运维手册.md)
+3. **Service unavailable** → restart the SAE application and check the logs

 ---

-**Last updated:** 2025-11-06
-**Maintainer:** technical architect

+## 📈 Monitoring and alerting

+| Metric | Threshold | Action |
+|--------|------|---------|
+| Duplicate queue definitions | > 1 | Clean up the duplicate entries |
+| Active jobs | > 100 | Check whether any job is stuck |
+| Database connections | > 80% | Check for connection leaks |

+---

+## 📝 Related documents

+- [Deployment documentation](../05-部署文档/README.md)
+- [Testing documentation](../06-测试文档/README.md)
+- [Incident analysis report](../06-测试文档/故障分析报告%20(1).md)