Files
AIclinicalresearch/docs/07-运维文档/03-数据库迁移注意事项.md
HaHafeng bbf98c4d5c fix(backend): Resolve PgBoss infinite loop issue and cleanup unused files
Backend fixes:
- Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints)
- Add singletonKey to prevent duplicate job enqueueing
- Add idempotency check in reviewWorker (skip completed tasks)
- Add optimistic locking in reviewService (atomic status update)

Frontend fixes:
- Add isSubmitting state to prevent duplicate submissions in RVW Dashboard
- Fix API baseURL in knowledgeBaseApi (relative path)

Cleanup (removed):
- Old frontend/ directory (migrated to frontend-v2)
- python-microservice/ (unused, replaced by extraction_service)
- Root package.json and node_modules (accidentally created)
- redcap-docker-dev/ (external dependency)
- Various temporary files and outdated docs in root

New documentation:
- docs/07-运维文档/01-PgBoss队列监控与维护.md
- docs/07-运维文档/02-故障预防检查清单.md
- docs/07-运维文档/03-数据库迁移注意事项.md

Database fix applied to RDS:
- Added PRIMARY KEY to platform_schema.queue
- Added 3 missing foreign key constraints

Tested: Local build passed, RDS constraints verified
2026-01-27 18:16:22 +08:00

380 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 数据库迁移注意事项
> **文档版本**v1.1
> **创建日期**2026-01-27
> **最后更新**2026-01-27
> **基于故障**2026-01-27 RVW 任务无限循环故障根因pg-boss 表约束丢失)
---
## 📋 背景
### 故障复盘
**现象**RVW 审稿模块任务完成后继续无限循环执行,同一 taskId 被处理 7 次
**根因**:数据库从本地迁移到 RDS 时,`platform_schema`**pg-boss 表的约束丢失**
| 丢失的约束 | 类型 | 影响 |
|-----------|------|------|
| `queue_pkey` | 主键 | 🔴 **导致队列定义重复** |
| `queue_dead_letter_fkey` | 外键 | 死信队列引用完整性丢失 |
| `schedule_name_fkey` | 外键 | 定时任务引用完整性丢失 |
| `subscription_name_fkey` | 外键 | 订阅引用完整性丢失 |
**影响**:每次 SAE 重启都会创建新的队列定义,导致同一任务被入队多次
---
## 🔍 为什么只有 pg-boss 相关约束丢失?
### pg-boss 表的"双重管理者"问题
| 类别 | 表示例 | 管理方式 | 约束创建 |
|------|--------|---------|---------|
| **业务表** | `users`, `tenants`, `ReviewTask` | Prisma Migration | ✅ 迁移文件明确包含所有约束 |
| **pg-boss 表** | `queue`, `job`, `schedule` | pg-boss 运行时自动创建 | ⚠️ 仅在首次启动时创建 |
**关键点**
1. **业务表由 Prisma 管理**:通过 `prisma migrate deploy` 创建,约束完整
2. **pg-boss 表在 Prisma Schema 中定义**:但只是通过 `db pull` 拉取的Prisma 不负责创建
3. **pg-boss 表实际由 `boss.start()` 创建**如果表已存在pg-boss 跳过创建步骤
### 迁移时发生了什么?
```
本地数据库 pg_dump RDS
┌─────────────────┐ ────────► ┌─────────────────┐
│ queue 表 │ │ queue 表 │
│ ✅ queue_pkey │ │ ❌ queue_pkey │ ← 丢失!
│ ✅ 3个外键约束 │ │ ❌ 3个外键约束 │ ← 丢失!
└─────────────────┘ └─────────────────┘
原因:
1. pg_dump 导出时外键延迟处理
2. queue 表有自引用外键 (dead_letter → name)
3. 导入时约束创建顺序问题导致失败
4. pg-boss 检测到表已存在,跳过约束创建
```
### 自引用外键是罪魁祸首
```sql
-- queue 表的自引用外键(容易在迁移时丢失)
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name)
```
---
## 🔴 迁移后必须检查项
### 1. pg-boss 4 个关键约束检查(最重要!)
```sql
-- 🔴 一键检查 pg-boss 关键约束(必须返回 4 行)
SELECT conname, contype,
CASE contype WHEN 'p' THEN '主键' WHEN 'f' THEN '外键' END as type_desc
FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN (
'queue_pkey', -- 主键(防止重复队列)
'queue_dead_letter_fkey', -- 自引用外键
'schedule_name_fkey', -- schedule → queue 外键
'subscription_name_fkey' -- subscription → queue 外键
)
ORDER BY conname;
-- 预期结果(必须 4 行):
-- conname | contype | type_desc
-- -------------------------+---------+-----------
-- queue_dead_letter_fkey | f | 外键
-- queue_pkey | p | 主键
-- schedule_name_fkey | f | 外键
-- subscription_name_fkey | f | 外键
```
**如果少于 4 行,立即修复**
```sql
-- 修复 1添加 queue_pkey如果缺失
-- 先清理重复数据
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name
AND a.created_on < b.created_on;
-- 添加主键
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 修复 2添加 queue_dead_letter_fkey如果缺失
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
-- 修复 3添加 schedule_name_fkey如果缺失
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 修复 4添加 subscription_name_fkey如果缺失
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
```
---
### 2. 检查重复队列定义
```sql
-- 检查是否有重复队列定义
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- 预期结果无返回0 行)
```
---
### 3. 全局约束数量对比
```sql
-- 检查各 schema 约束数量(与本地对比)
SELECT n.nspname as schema, COUNT(*) as constraint_count
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY n.nspname;
-- 🔴 platform_schema 约束数量必须是 33
-- 如果少于 33说明有约束丢失
```
---
### 4. 检查所有 pg-boss 表索引
```sql
-- 列出 platform_schema 中的所有表和索引
SELECT
t.tablename,
COUNT(i.indexname) as index_count
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND t.schemaname = i.schemaname
WHERE t.schemaname = 'platform_schema'
GROUP BY t.tablename
ORDER BY t.tablename;
-- 关键表应该有索引:
-- queue: 1 (queue_pkey)
-- job_common: 多个索引
-- schedule: 索引
```
---
## 📋 数据库迁移完整检查清单
### 迁移前
- [ ] 备份源数据库
- [ ] 记录源数据库的表结构和索引
- [ ] 确认 pg_dump 参数包含约束和索引
### 迁移中
```bash
# 推荐的 pg_dump 参数(包含完整结构)
pg_dump -h localhost -U postgres -d ai_clinical_research \
--no-owner \
--no-acl \
--format=plain \
--encoding=UTF8 \
> backup.sql
# 不要使用 --data-only会丢失约束
```
### 迁移后(🔴 关键检查)
| 检查项 | SQL | 预期 |
|--------|-----|------|
| queue 主键 | `SELECT indexname FROM pg_indexes WHERE tablename='queue';` | `queue_pkey` |
| 重复队列 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
| 表数量 | `SELECT COUNT(*) FROM pg_tables WHERE schemaname='platform_schema';` | 与源一致 |
| 索引数量 | `SELECT COUNT(*) FROM pg_indexes WHERE schemaname='platform_schema';` | 与源一致 |
---
## 🛠️ 常见问题修复
### 问题 1pg-boss 表约束丢失2026-01-27 故障)
**症状**
- 每次应用启动都创建新的队列定义
- 同一任务被重复处理多次
- RVW 审稿任务无限循环
**丢失的约束(共 4 个)**
| 约束 | 类型 | 影响 |
|------|------|------|
| `queue_pkey` | 主键 | 🔴 导致队列定义重复 |
| `queue_dead_letter_fkey` | 外键 | 死信队列引用失效 |
| `schedule_name_fkey` | 外键 | 定时任务引用失效 |
| `subscription_name_fkey` | 外键 | 订阅引用失效 |
**一键修复脚本**
```sql
-- 🔴 执行前先检查哪些约束缺失
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 修复步骤:
-- 1. 清理重复队列定义(如果有)
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name AND a.created_on < b.created_on;
-- 2. 添加主键(如果缺失)
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 3. 添加外键约束(如果缺失)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 4. 验证(必须返回 4 行)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
```
---
### 问题 2迁移后中文乱码
**症状**:用户名、租户名显示为 `????`
**原因**PowerShell 默认编码不是 UTF-8
**修复**:在 Docker 容器内执行导入导出
```bash
# 在容器内导出(绕过 PowerShell 编码问题)
docker exec -it postgres-container bash -c "pg_dump ... > /tmp/backup.sql"
# 在容器内导入
docker exec -i postgres-container psql ... < backup.sql
```
详见:[05-部署文档/0126部署/08-部署完成总结.md](../05-部署文档/0126部署/08-部署完成总结.md)
---
### 问题 3迁移后外键约束失败
**症状**`ERROR: insert or update violates foreign key constraint`
**原因**:表导入顺序错误
**修复**:使用事务或正确的依赖顺序
```sql
-- 临时禁用外键检查(谨慎使用)
SET session_replication_role = replica;
-- 导入数据...
-- 恢复外键检查
SET session_replication_role = DEFAULT;
```
---
## 📊 迁移对比脚本
迁移后执行此脚本,对比源和目标数据库:
```sql
-- 🔴 一键健康检查(迁移后必须执行)
-- 1. 全局统计(与本地对比)
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
-- 预期值2026-01-27tables=64, indexes=236, constraints=108
-- 2. platform_schema 约束数量
SELECT COUNT(*) as platform_constraints
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname = 'platform_schema';
-- 预期值33
-- 3. 🔴 pg-boss 4 个关键约束
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 预期4 行
-- 4. 重复队列检查
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
-- 预期0 行
```
### 本地 vs RDS 快速对比命令
```bash
# 本地
docker exec ai-clinical-postgres psql "postgresql://postgres:postgres123@localhost:5432/ai_clinical_research" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
# RDS
docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
```
---
## 📚 相关文档
- [01-PgBoss队列监控与维护](./01-PgBoss队列监控与维护.md) - 队列日常监控
- [02-故障预防检查清单](./02-故障预防检查清单.md) - 部署检查清单
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md) - 原始故障分析
---
## 📝 变更记录
| 日期 | 版本 | 变更内容 |
|------|------|---------|
| 2026-01-27 | v1.1 | 补充 3 个外键约束丢失分析,添加"双重管理者"问题说明 |
| 2026-01-27 | v1.0 | 初始版本,基于 RVW 任务无限循环故障 |