Backend fixes: - Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints) - Add singletonKey to prevent duplicate job enqueueing - Add idempotency check in reviewWorker (skip completed tasks) - Add optimistic locking in reviewService (atomic status update) Frontend fixes: - Add isSubmitting state to prevent duplicate submissions in RVW Dashboard - Fix API baseURL in knowledgeBaseApi (relative path) Cleanup (removed): - Old frontend/ directory (migrated to frontend-v2) - python-microservice/ (unused, replaced by extraction_service) - Root package.json and node_modules (accidentally created) - redcap-docker-dev/ (external dependency) - Various temporary files and outdated docs in root New documentation: - docs/07-运维文档/01-PgBoss队列监控与维护.md - docs/07-运维文档/02-故障预防检查清单.md - docs/07-运维文档/03-数据库迁移注意事项.md Database fix applied to RDS: - Added PRIMARY KEY to platform_schema.queue - Added 3 missing foreign key constraints Tested: Local build passed, RDS constraints verified
380 lines
12 KiB
Markdown
380 lines
12 KiB
Markdown
# 数据库迁移注意事项
|
||
|
||
> **文档版本**:v1.1
|
||
> **创建日期**:2026-01-27
|
||
> **最后更新**:2026-01-27
|
||
> **基于故障**:2026-01-27 RVW 任务无限循环故障(根因:pg-boss 表约束丢失)
|
||
|
||
---
|
||
|
||
## 📋 背景
|
||
|
||
### 故障复盘
|
||
|
||
**现象**:RVW 审稿模块任务完成后继续无限循环执行,同一 taskId 被处理 7 次
|
||
|
||
**根因**:数据库从本地迁移到 RDS 时,`platform_schema` 中 **pg-boss 表的约束丢失**
|
||
|
||
| 丢失的约束 | 类型 | 影响 |
|
||
|-----------|------|------|
|
||
| `queue_pkey` | 主键 | 🔴 **导致队列定义重复** |
|
||
| `queue_dead_letter_fkey` | 外键 | 死信队列引用完整性丢失 |
|
||
| `schedule_name_fkey` | 外键 | 定时任务引用完整性丢失 |
|
||
| `subscription_name_fkey` | 外键 | 订阅引用完整性丢失 |
|
||
|
||
**影响**:每次 SAE 重启都会创建新的队列定义,导致同一任务被入队多次
|
||
|
||
---
|
||
|
||
## 🔍 为什么只有 pg-boss 相关约束丢失?
|
||
|
||
### pg-boss 表的"双重管理者"问题
|
||
|
||
| 类别 | 表示例 | 管理方式 | 约束创建 |
|
||
|------|--------|---------|---------|
|
||
| **业务表** | `users`, `tenants`, `ReviewTask` | Prisma Migration | ✅ 迁移文件明确包含所有约束 |
|
||
| **pg-boss 表** | `queue`, `job`, `schedule` | pg-boss 运行时自动创建 | ⚠️ 仅在首次启动时创建 |
|
||
|
||
**关键点**:
|
||
1. **业务表由 Prisma 管理**:通过 `prisma migrate deploy` 创建,约束完整
|
||
2. **pg-boss 表在 Prisma Schema 中定义**:但只是通过 `db pull` 拉取的,Prisma 不负责创建
|
||
3. **pg-boss 表实际由 `boss.start()` 创建**:如果表已存在,pg-boss 跳过创建步骤
|
||
|
||
### 迁移时发生了什么?
|
||
|
||
```
|
||
本地数据库 pg_dump RDS
|
||
┌─────────────────┐ ────────► ┌─────────────────┐
|
||
│ queue 表 │ │ queue 表 │
|
||
│ ✅ queue_pkey │ │ ❌ queue_pkey │ ← 丢失!
|
||
│ ✅ 3个外键约束 │ │ ❌ 3个外键约束 │ ← 丢失!
|
||
└─────────────────┘ └─────────────────┘
|
||
|
||
原因:
|
||
1. pg_dump 导出时外键延迟处理
|
||
2. queue 表有自引用外键 (dead_letter → name)
|
||
3. 导入时约束创建顺序问题导致失败
|
||
4. pg-boss 检测到表已存在,跳过约束创建
|
||
```
|
||
|
||
### 自引用外键是罪魁祸首
|
||
|
||
```sql
|
||
-- queue 表的自引用外键(容易在迁移时丢失)
|
||
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name)
|
||
```
|
||
|
||
---
|
||
|
||
## 🔴 迁移后必须检查项
|
||
|
||
### 1. pg-boss 4 个关键约束检查(最重要!)
|
||
|
||
```sql
|
||
-- 🔴 一键检查 pg-boss 关键约束(必须返回 4 行)
|
||
SELECT conname, contype,
|
||
CASE contype WHEN 'p' THEN '主键' WHEN 'f' THEN '外键' END as type_desc
|
||
FROM pg_constraint
|
||
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
|
||
AND conname IN (
|
||
'queue_pkey', -- 主键(防止重复队列)
|
||
'queue_dead_letter_fkey', -- 自引用外键
|
||
'schedule_name_fkey', -- schedule → queue 外键
|
||
'subscription_name_fkey' -- subscription → queue 外键
|
||
)
|
||
ORDER BY conname;
|
||
|
||
-- 预期结果(必须 4 行):
|
||
-- conname | contype | type_desc
|
||
-- -------------------------+---------+-----------
|
||
-- queue_dead_letter_fkey | f | 外键
|
||
-- queue_pkey | p | 主键
|
||
-- schedule_name_fkey | f | 外键
|
||
-- subscription_name_fkey | f | 外键
|
||
```
|
||
|
||
**如果少于 4 行,立即修复**:
|
||
|
||
```sql
|
||
-- 修复 1:添加 queue_pkey(如果缺失)
|
||
-- 先清理重复数据
|
||
DELETE FROM platform_schema.queue a
|
||
USING platform_schema.queue b
|
||
WHERE a.name = b.name
|
||
AND a.created_on < b.created_on;
|
||
-- 添加主键
|
||
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
|
||
|
||
-- 修复 2:添加 queue_dead_letter_fkey(如果缺失)
|
||
ALTER TABLE platform_schema.queue
|
||
ADD CONSTRAINT queue_dead_letter_fkey
|
||
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
|
||
|
||
-- 修复 3:添加 schedule_name_fkey(如果缺失)
|
||
ALTER TABLE platform_schema.schedule
|
||
ADD CONSTRAINT schedule_name_fkey
|
||
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
|
||
|
||
-- 修复 4:添加 subscription_name_fkey(如果缺失)
|
||
ALTER TABLE platform_schema.subscription
|
||
ADD CONSTRAINT subscription_name_fkey
|
||
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
|
||
```
|
||
|
||
---
|
||
|
||
### 2. 检查重复队列定义
|
||
|
||
```sql
|
||
-- 检查是否有重复队列定义
|
||
SELECT name, COUNT(*) as cnt
|
||
FROM platform_schema.queue
|
||
GROUP BY name
|
||
HAVING COUNT(*) > 1;
|
||
|
||
-- 预期结果:无返回(0 行)
|
||
```
|
||
|
||
---
|
||
|
||
### 3. 全局约束数量对比
|
||
|
||
```sql
|
||
-- 检查各 schema 约束数量(与本地对比)
|
||
SELECT n.nspname as schema, COUNT(*) as constraint_count
|
||
FROM pg_constraint c
|
||
JOIN pg_namespace n ON c.connamespace = n.oid
|
||
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
|
||
GROUP BY n.nspname
|
||
ORDER BY n.nspname;
|
||
|
||
-- 🔴 platform_schema 约束数量必须是 33
|
||
-- 如果少于 33,说明有约束丢失
|
||
```
|
||
|
||
---
|
||
|
||
### 4. 检查所有 pg-boss 表索引
|
||
|
||
```sql
|
||
-- 列出 platform_schema 中的所有表和索引
|
||
SELECT
|
||
t.tablename,
|
||
COUNT(i.indexname) as index_count
|
||
FROM pg_tables t
|
||
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND t.schemaname = i.schemaname
|
||
WHERE t.schemaname = 'platform_schema'
|
||
GROUP BY t.tablename
|
||
ORDER BY t.tablename;
|
||
|
||
-- 关键表应该有索引:
|
||
-- queue: 1 (queue_pkey)
|
||
-- job_common: 多个索引
|
||
-- schedule: 索引
|
||
```
|
||
|
||
---
|
||
|
||
## 📋 数据库迁移完整检查清单
|
||
|
||
### 迁移前
|
||
|
||
- [ ] 备份源数据库
|
||
- [ ] 记录源数据库的表结构和索引
|
||
- [ ] 确认 pg_dump 参数包含约束和索引
|
||
|
||
### 迁移中
|
||
|
||
```bash
|
||
# 推荐的 pg_dump 参数(包含完整结构)
|
||
pg_dump -h localhost -U postgres -d ai_clinical_research \
|
||
--no-owner \
|
||
--no-acl \
|
||
--format=plain \
|
||
--encoding=UTF8 \
|
||
> backup.sql
|
||
|
||
# 不要使用 --data-only(会丢失约束)
|
||
```
|
||
|
||
### 迁移后(🔴 关键检查)
|
||
|
||
| 检查项 | SQL | 预期 |
|
||
|--------|-----|------|
|
||
| queue 主键 | `SELECT indexname FROM pg_indexes WHERE tablename='queue';` | `queue_pkey` |
|
||
| 重复队列 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
|
||
| 表数量 | `SELECT COUNT(*) FROM pg_tables WHERE schemaname='platform_schema';` | 与源一致 |
|
||
| 索引数量 | `SELECT COUNT(*) FROM pg_indexes WHERE schemaname='platform_schema';` | 与源一致 |
|
||
|
||
---
|
||
|
||
## 🛠️ 常见问题修复
|
||
|
||
### 问题 1:pg-boss 表约束丢失(2026-01-27 故障)
|
||
|
||
**症状**:
|
||
- 每次应用启动都创建新的队列定义
|
||
- 同一任务被重复处理多次
|
||
- RVW 审稿任务无限循环
|
||
|
||
**丢失的约束(共 4 个)**:
|
||
|
||
| 约束 | 类型 | 影响 |
|
||
|------|------|------|
|
||
| `queue_pkey` | 主键 | 🔴 导致队列定义重复 |
|
||
| `queue_dead_letter_fkey` | 外键 | 死信队列引用失效 |
|
||
| `schedule_name_fkey` | 外键 | 定时任务引用失效 |
|
||
| `subscription_name_fkey` | 外键 | 订阅引用失效 |
|
||
|
||
**一键修复脚本**:
|
||
|
||
```sql
|
||
-- 🔴 执行前先检查哪些约束缺失
|
||
SELECT conname FROM pg_constraint
|
||
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
|
||
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
|
||
|
||
-- 修复步骤:
|
||
|
||
-- 1. 清理重复队列定义(如果有)
|
||
DELETE FROM platform_schema.queue a
|
||
USING platform_schema.queue b
|
||
WHERE a.name = b.name AND a.created_on < b.created_on;
|
||
|
||
-- 2. 添加主键(如果缺失)
|
||
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
|
||
|
||
-- 3. 添加外键约束(如果缺失)
|
||
ALTER TABLE platform_schema.queue
|
||
ADD CONSTRAINT queue_dead_letter_fkey
|
||
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
|
||
|
||
ALTER TABLE platform_schema.schedule
|
||
ADD CONSTRAINT schedule_name_fkey
|
||
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
|
||
|
||
ALTER TABLE platform_schema.subscription
|
||
ADD CONSTRAINT subscription_name_fkey
|
||
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
|
||
|
||
-- 4. 验证(必须返回 4 行)
|
||
SELECT conname FROM pg_constraint
|
||
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
|
||
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
|
||
```
|
||
|
||
---
|
||
|
||
### 问题 2:迁移后中文乱码
|
||
|
||
**症状**:用户名、租户名显示为 `????`
|
||
|
||
**原因**:PowerShell 默认编码不是 UTF-8
|
||
|
||
**修复**:在 Docker 容器内执行导入导出
|
||
|
||
```bash
|
||
# 在容器内导出(绕过 PowerShell 编码问题)
|
||
docker exec -it postgres-container bash -c "pg_dump ... > /tmp/backup.sql"
|
||
|
||
# 在容器内导入
|
||
docker exec -i postgres-container psql ... < backup.sql
|
||
```
|
||
|
||
详见:[05-部署文档/0126部署/08-部署完成总结.md](../05-部署文档/0126部署/08-部署完成总结.md)
|
||
|
||
---
|
||
|
||
### 问题 3:迁移后外键约束失败
|
||
|
||
**症状**:`ERROR: insert or update violates foreign key constraint`
|
||
|
||
**原因**:表导入顺序错误
|
||
|
||
**修复**:使用事务或正确的依赖顺序
|
||
|
||
```sql
|
||
-- 临时禁用外键检查(谨慎使用)
|
||
SET session_replication_role = replica;
|
||
|
||
-- 导入数据...
|
||
|
||
-- 恢复外键检查
|
||
SET session_replication_role = DEFAULT;
|
||
```
|
||
|
||
---
|
||
|
||
## 📊 迁移对比脚本
|
||
|
||
迁移后执行此脚本,对比源和目标数据库:
|
||
|
||
```sql
|
||
-- 🔴 一键健康检查(迁移后必须执行)
|
||
|
||
-- 1. 全局统计(与本地对比)
|
||
SELECT
|
||
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
|
||
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
|
||
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
|
||
|
||
-- 预期值(2026-01-27):tables=64, indexes=236, constraints=108
|
||
|
||
-- 2. platform_schema 约束数量
|
||
SELECT COUNT(*) as platform_constraints
|
||
FROM pg_constraint c
|
||
JOIN pg_namespace n ON c.connamespace = n.oid
|
||
WHERE n.nspname = 'platform_schema';
|
||
|
||
-- 预期值:33
|
||
|
||
-- 3. 🔴 pg-boss 4 个关键约束
|
||
SELECT conname FROM pg_constraint
|
||
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
|
||
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
|
||
|
||
-- 预期:4 行
|
||
|
||
-- 4. 重复队列检查
|
||
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
|
||
|
||
-- 预期:0 行
|
||
```
|
||
|
||
### 本地 vs RDS 快速对比命令
|
||
|
||
```bash
|
||
# 本地
|
||
docker exec ai-clinical-postgres psql "postgresql://postgres:postgres123@localhost:5432/ai_clinical_research" -c "
|
||
SELECT
|
||
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
|
||
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
|
||
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
|
||
"
|
||
|
||
# RDS
|
||
docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "
|
||
SELECT
|
||
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
|
||
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
|
||
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
|
||
"
|
||
```
|
||
|
||
---
|
||
|
||
## 📚 相关文档
|
||
|
||
- [01-PgBoss队列监控与维护](./01-PgBoss队列监控与维护.md) - 队列日常监控
|
||
- [02-故障预防检查清单](./02-故障预防检查清单.md) - 部署检查清单
|
||
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md) - 原始故障分析
|
||
|
||
---
|
||
|
||
## 📝 变更记录
|
||
|
||
| 日期 | 版本 | 变更内容 |
|
||
|------|------|---------|
|
||
| 2026-01-27 | v1.1 | 补充 3 个外键约束丢失分析,添加"双重管理者"问题说明 |
|
||
| 2026-01-27 | v1.0 | 初始版本,基于 RVW 任务无限循环故障 |
|