Backend fixes: - Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints) - Add singletonKey to prevent duplicate job enqueueing - Add idempotency check in reviewWorker (skip completed tasks) - Add optimistic locking in reviewService (atomic status update) Frontend fixes: - Add isSubmitting state to prevent duplicate submissions in RVW Dashboard - Fix API baseURL in knowledgeBaseApi (relative path) Cleanup (removed): - Old frontend/ directory (migrated to frontend-v2) - python-microservice/ (unused, replaced by extraction_service) - Root package.json and node_modules (accidentally created) - redcap-docker-dev/ (external dependency) - Various temporary files and outdated docs in root New documentation: - docs/07-运维文档/01-PgBoss队列监控与维护.md - docs/07-运维文档/02-故障预防检查清单.md - docs/07-运维文档/03-数据库迁移注意事项.md Database fix applied to RDS: - Added PRIMARY KEY to platform_schema.queue - Added 3 missing foreign key constraints Tested: Local build passed, RDS constraints verified
12 KiB
12 KiB
数据库迁移注意事项
文档版本:v1.1
创建日期:2026-01-27
最后更新:2026-01-27
基于故障:2026-01-27 RVW 任务无限循环故障(根因:pg-boss 表约束丢失)
📋 背景
故障复盘
现象:RVW 审稿模块任务完成后继续无限循环执行,同一 taskId 被处理 7 次
根因:数据库从本地迁移到 RDS 时,platform_schema 中 pg-boss 表的约束丢失
| 丢失的约束 | 类型 | 影响 |
|---|---|---|
queue_pkey |
主键 | 🔴 导致队列定义重复 |
queue_dead_letter_fkey |
外键 | 死信队列引用完整性丢失 |
schedule_name_fkey |
外键 | 定时任务引用完整性丢失 |
subscription_name_fkey |
外键 | 订阅引用完整性丢失 |
影响:每次 SAE 重启都会创建新的队列定义,导致同一任务被入队多次
🔍 为什么只有 pg-boss 相关约束丢失?
pg-boss 表的"双重管理者"问题
| 类别 | 表示例 | 管理方式 | 约束创建 |
|---|---|---|---|
| 业务表 | users, tenants, ReviewTask |
Prisma Migration | ✅ 迁移文件明确包含所有约束 |
| pg-boss 表 | queue, job, schedule |
pg-boss 运行时自动创建 | ⚠️ 仅在首次启动时创建 |
关键点:
- 业务表由 Prisma 管理:通过
prisma migrate deploy创建,约束完整 - pg-boss 表在 Prisma Schema 中定义:但只是通过
db pull拉取的,Prisma 不负责创建 - pg-boss 表实际由
boss.start()创建:如果表已存在,pg-boss 跳过创建步骤
迁移时发生了什么?
本地数据库 pg_dump RDS
┌─────────────────┐ ────────► ┌─────────────────┐
│ queue 表 │ │ queue 表 │
│ ✅ queue_pkey │ │ ❌ queue_pkey │ ← 丢失!
│ ✅ 3个外键约束 │ │ ❌ 3个外键约束 │ ← 丢失!
└─────────────────┘ └─────────────────┘
原因:
1. pg_dump 导出时外键延迟处理
2. queue 表有自引用外键 (dead_letter → name)
3. 导入时约束创建顺序问题导致失败
4. pg-boss 检测到表已存在,跳过约束创建
自引用外键是罪魁祸首
-- queue 表的自引用外键(容易在迁移时丢失)
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name)
🔴 迁移后必须检查项
1. pg-boss 4 个关键约束检查(最重要!)
-- 🔴 一键检查 pg-boss 关键约束(必须返回 4 行)
SELECT conname, contype,
CASE contype WHEN 'p' THEN '主键' WHEN 'f' THEN '外键' END as type_desc
FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN (
'queue_pkey', -- 主键(防止重复队列)
'queue_dead_letter_fkey', -- 自引用外键
'schedule_name_fkey', -- schedule → queue 外键
'subscription_name_fkey' -- subscription → queue 外键
)
ORDER BY conname;
-- 预期结果(必须 4 行):
-- conname | contype | type_desc
-- -------------------------+---------+-----------
-- queue_dead_letter_fkey | f | 外键
-- queue_pkey | p | 主键
-- schedule_name_fkey | f | 外键
-- subscription_name_fkey | f | 外键
如果少于 4 行,立即修复:
-- 修复 1:添加 queue_pkey(如果缺失)
-- 先清理重复数据
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name
AND a.created_on < b.created_on;
-- 添加主键
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 修复 2:添加 queue_dead_letter_fkey(如果缺失)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
-- 修复 3:添加 schedule_name_fkey(如果缺失)
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 修复 4:添加 subscription_name_fkey(如果缺失)
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
2. 检查重复队列定义
-- 检查是否有重复队列定义
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- 预期结果:无返回(0 行)
3. 全局约束数量对比
-- 检查各 schema 约束数量(与本地对比)
SELECT n.nspname as schema, COUNT(*) as constraint_count
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY n.nspname;
-- 🔴 platform_schema 约束数量必须是 33
-- 如果少于 33,说明有约束丢失
4. 检查所有 pg-boss 表索引
-- 列出 platform_schema 中的所有表和索引
SELECT
t.tablename,
COUNT(i.indexname) as index_count
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND t.schemaname = i.schemaname
WHERE t.schemaname = 'platform_schema'
GROUP BY t.tablename
ORDER BY t.tablename;
-- 关键表应该有索引:
-- queue: 1 (queue_pkey)
-- job_common: 多个索引
-- schedule: 索引
📋 数据库迁移完整检查清单
迁移前
- 备份源数据库
- 记录源数据库的表结构和索引
- 确认 pg_dump 参数包含约束和索引
迁移中
# 推荐的 pg_dump 参数(包含完整结构)
pg_dump -h localhost -U postgres -d ai_clinical_research \
--no-owner \
--no-acl \
--format=plain \
--encoding=UTF8 \
> backup.sql
# 不要使用 --data-only(会丢失约束)
迁移后(🔴 关键检查)
| 检查项 | SQL | 预期 |
|---|---|---|
| queue 主键 | SELECT indexname FROM pg_indexes WHERE tablename='queue'; |
queue_pkey |
| 重复队列 | SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1; |
无返回 |
| 表数量 | SELECT COUNT(*) FROM pg_tables WHERE schemaname='platform_schema'; |
与源一致 |
| 索引数量 | SELECT COUNT(*) FROM pg_indexes WHERE schemaname='platform_schema'; |
与源一致 |
🛠️ 常见问题修复
问题 1:pg-boss 表约束丢失(2026-01-27 故障)
症状:
- 每次应用启动都创建新的队列定义
- 同一任务被重复处理多次
- RVW 审稿任务无限循环
丢失的约束(共 4 个):
| 约束 | 类型 | 影响 |
|---|---|---|
queue_pkey |
主键 | 🔴 导致队列定义重复 |
queue_dead_letter_fkey |
外键 | 死信队列引用失效 |
schedule_name_fkey |
外键 | 定时任务引用失效 |
subscription_name_fkey |
外键 | 订阅引用失效 |
一键修复脚本:
-- 🔴 执行前先检查哪些约束缺失
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 修复步骤:
-- 1. 清理重复队列定义(如果有)
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name AND a.created_on < b.created_on;
-- 2. 添加主键(如果缺失)
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 3. 添加外键约束(如果缺失)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 4. 验证(必须返回 4 行)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
问题 2:迁移后中文乱码
症状:用户名、租户名显示为 ????
原因:PowerShell 默认编码不是 UTF-8
修复:在 Docker 容器内执行导入导出
# 在容器内导出(绕过 PowerShell 编码问题)
docker exec -it postgres-container bash -c "pg_dump ... > /tmp/backup.sql"
# 在容器内导入
docker exec -i postgres-container psql ... < backup.sql
详见:05-部署文档/0126部署/08-部署完成总结.md
问题 3:迁移后外键约束失败
症状:ERROR: insert or update violates foreign key constraint
原因:表导入顺序错误
修复:使用事务或正确的依赖顺序
-- 临时禁用外键检查(谨慎使用)
SET session_replication_role = replica;
-- 导入数据...
-- 恢复外键检查
SET session_replication_role = DEFAULT;
📊 迁移对比脚本
迁移后执行此脚本,对比源和目标数据库:
-- 🔴 一键健康检查(迁移后必须执行)
-- 1. 全局统计(与本地对比)
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
-- 预期值(2026-01-27):tables=64, indexes=236, constraints=108
-- 2. platform_schema 约束数量
SELECT COUNT(*) as platform_constraints
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname = 'platform_schema';
-- 预期值:33
-- 3. 🔴 pg-boss 4 个关键约束
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 预期:4 行
-- 4. 重复队列检查
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
-- 预期:0 行
本地 vs RDS 快速对比命令
# 本地
docker exec ai-clinical-postgres psql "postgresql://postgres:postgres123@localhost:5432/ai_clinical_research" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
# RDS
docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
📚 相关文档
- 01-PgBoss队列监控与维护 - 队列日常监控
- 02-故障预防检查清单 - 部署检查清单
- 故障分析报告 - 原始故障分析
📝 变更记录
| 日期 | 版本 | 变更内容 |
|---|---|---|
| 2026-01-27 | v1.1 | 补充 3 个外键约束丢失分析,添加"双重管理者"问题说明 |
| 2026-01-27 | v1.0 | 初始版本,基于 RVW 任务无限循环故障 |