fix(backend): Resolve PgBoss infinite loop issue and cleanup unused files

Backend fixes:
- Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints)
- Add singletonKey to prevent duplicate job enqueueing
- Add idempotency check in reviewWorker (skip completed tasks)
- Add optimistic locking in reviewService (atomic status update)

Frontend fixes:
- Add isSubmitting state to prevent duplicate submissions in RVW Dashboard
- Fix API baseURL in knowledgeBaseApi (relative path)

Cleanup (removed):
- Old frontend/ directory (migrated to frontend-v2)
- python-microservice/ (unused, replaced by extraction_service)
- Root package.json and node_modules (accidentally created)
- redcap-docker-dev/ (external dependency)
- Various temporary files and outdated docs in root

New documentation:
- docs/07-运维文档/01-PgBoss队列监控与维护.md
- docs/07-运维文档/02-故障预防检查清单.md
- docs/07-运维文档/03-数据库迁移注意事项.md

Database fix applied to RDS:
- Added PRIMARY KEY to platform_schema.queue
- Added 3 missing foreign key constraints

Tested: Local build passed, RDS constraints verified
This commit is contained in:
2026-01-27 18:16:22 +08:00
parent 2481b786d8
commit bbf98c4d5c
214 changed files with 4318 additions and 44920 deletions

View File

@@ -0,0 +1,379 @@
# 数据库迁移注意事项
> **文档版本**v1.1
> **创建日期**2026-01-27
> **最后更新**2026-01-27
> **基于故障**2026-01-27 RVW 任务无限循环故障根因pg-boss 表约束丢失)
---
## 📋 背景
### 故障复盘
**现象**RVW 审稿模块任务完成后继续无限循环执行,同一 taskId 被处理 7 次
**根因**:数据库从本地迁移到 RDS 时,`platform_schema`**pg-boss 表的约束丢失**
| 丢失的约束 | 类型 | 影响 |
|-----------|------|------|
| `queue_pkey` | 主键 | 🔴 **导致队列定义重复** |
| `queue_dead_letter_fkey` | 外键 | 死信队列引用完整性丢失 |
| `schedule_name_fkey` | 外键 | 定时任务引用完整性丢失 |
| `subscription_name_fkey` | 外键 | 订阅引用完整性丢失 |
**影响**:每次 SAE 重启都会创建新的队列定义,导致同一任务被入队多次
---
## 🔍 为什么只有 pg-boss 相关约束丢失?
### pg-boss 表的"双重管理者"问题
| 类别 | 表示例 | 管理方式 | 约束创建 |
|------|--------|---------|---------|
| **业务表** | `users`, `tenants`, `ReviewTask` | Prisma Migration | ✅ 迁移文件明确包含所有约束 |
| **pg-boss 表** | `queue`, `job`, `schedule` | pg-boss 运行时自动创建 | ⚠️ 仅在首次启动时创建 |
**关键点**
1. **业务表由 Prisma 管理**:通过 `prisma migrate deploy` 创建,约束完整
2. **pg-boss 表在 Prisma Schema 中定义**:但只是通过 `db pull` 拉取的Prisma 不负责创建
3. **pg-boss 表实际由 `boss.start()` 创建**如果表已存在pg-boss 跳过创建步骤
### 迁移时发生了什么?
```
本地数据库 pg_dump RDS
┌─────────────────┐ ────────► ┌─────────────────┐
│ queue 表 │ │ queue 表 │
│ ✅ queue_pkey │ │ ❌ queue_pkey │ ← 丢失!
│ ✅ 3个外键约束 │ │ ❌ 3个外键约束 │ ← 丢失!
└─────────────────┘ └─────────────────┘
原因:
1. pg_dump 导出时外键延迟处理
2. queue 表有自引用外键 (dead_letter → name)
3. 导入时约束创建顺序问题导致失败
4. pg-boss 检测到表已存在,跳过约束创建
```
### 自引用外键是罪魁祸首
```sql
-- queue 表的自引用外键(容易在迁移时丢失)
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name)
```
---
## 🔴 迁移后必须检查项
### 1. pg-boss 4 个关键约束检查(最重要!)
```sql
-- 🔴 一键检查 pg-boss 关键约束(必须返回 4 行)
SELECT conname, contype,
CASE contype WHEN 'p' THEN '主键' WHEN 'f' THEN '外键' END as type_desc
FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN (
'queue_pkey', -- 主键(防止重复队列)
'queue_dead_letter_fkey', -- 自引用外键
'schedule_name_fkey', -- schedule → queue 外键
'subscription_name_fkey' -- subscription → queue 外键
)
ORDER BY conname;
-- 预期结果(必须 4 行):
-- conname | contype | type_desc
-- -------------------------+---------+-----------
-- queue_dead_letter_fkey | f | 外键
-- queue_pkey | p | 主键
-- schedule_name_fkey | f | 外键
-- subscription_name_fkey | f | 外键
```
**如果少于 4 行,立即修复**
```sql
-- 修复 1添加 queue_pkey如果缺失
-- 先清理重复数据
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name
AND a.created_on < b.created_on;
-- 添加主键
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 修复 2添加 queue_dead_letter_fkey如果缺失
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
-- 修复 3添加 schedule_name_fkey如果缺失
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 修复 4添加 subscription_name_fkey如果缺失
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
```
---
### 2. 检查重复队列定义
```sql
-- 检查是否有重复队列定义
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- 预期结果无返回0 行)
```
---
### 3. 全局约束数量对比
```sql
-- 检查各 schema 约束数量(与本地对比)
SELECT n.nspname as schema, COUNT(*) as constraint_count
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY n.nspname;
-- 🔴 platform_schema 约束数量必须是 33
-- 如果少于 33说明有约束丢失
```
---
### 4. 检查所有 pg-boss 表索引
```sql
-- 列出 platform_schema 中的所有表和索引
SELECT
t.tablename,
COUNT(i.indexname) as index_count
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND t.schemaname = i.schemaname
WHERE t.schemaname = 'platform_schema'
GROUP BY t.tablename
ORDER BY t.tablename;
-- 关键表应该有索引:
-- queue: 1 (queue_pkey)
-- job_common: 多个索引
-- schedule: 索引
```
---
## 📋 数据库迁移完整检查清单
### 迁移前
- [ ] 备份源数据库
- [ ] 记录源数据库的表结构和索引
- [ ] 确认 pg_dump 参数包含约束和索引
### 迁移中
```bash
# 推荐的 pg_dump 参数(包含完整结构)
pg_dump -h localhost -U postgres -d ai_clinical_research \
--no-owner \
--no-acl \
--format=plain \
--encoding=UTF8 \
> backup.sql
# 不要使用 --data-only会丢失约束
```
### 迁移后(🔴 关键检查)
| 检查项 | SQL | 预期 |
|--------|-----|------|
| queue 主键 | `SELECT indexname FROM pg_indexes WHERE tablename='queue';` | `queue_pkey` |
| 重复队列 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
| 表数量 | `SELECT COUNT(*) FROM pg_tables WHERE schemaname='platform_schema';` | 与源一致 |
| 索引数量 | `SELECT COUNT(*) FROM pg_indexes WHERE schemaname='platform_schema';` | 与源一致 |
---
## 🛠️ 常见问题修复
### 问题 1pg-boss 表约束丢失2026-01-27 故障)
**症状**
- 每次应用启动都创建新的队列定义
- 同一任务被重复处理多次
- RVW 审稿任务无限循环
**丢失的约束(共 4 个)**
| 约束 | 类型 | 影响 |
|------|------|------|
| `queue_pkey` | 主键 | 🔴 导致队列定义重复 |
| `queue_dead_letter_fkey` | 外键 | 死信队列引用失效 |
| `schedule_name_fkey` | 外键 | 定时任务引用失效 |
| `subscription_name_fkey` | 外键 | 订阅引用失效 |
**一键修复脚本**
```sql
-- 🔴 执行前先检查哪些约束缺失
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 修复步骤:
-- 1. 清理重复队列定义(如果有)
DELETE FROM platform_schema.queue a
USING platform_schema.queue b
WHERE a.name = b.name AND a.created_on < b.created_on;
-- 2. 添加主键(如果缺失)
ALTER TABLE platform_schema.queue ADD PRIMARY KEY (name);
-- 3. 添加外键约束(如果缺失)
ALTER TABLE platform_schema.queue
ADD CONSTRAINT queue_dead_letter_fkey
FOREIGN KEY (dead_letter) REFERENCES platform_schema.queue(name);
ALTER TABLE platform_schema.schedule
ADD CONSTRAINT schedule_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
ALTER TABLE platform_schema.subscription
ADD CONSTRAINT subscription_name_fkey
FOREIGN KEY (name) REFERENCES platform_schema.queue(name) ON DELETE CASCADE;
-- 4. 验证(必须返回 4 行)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
```
---
### 问题 2迁移后中文乱码
**症状**:用户名、租户名显示为 `????`
**原因**PowerShell 默认编码不是 UTF-8
**修复**:在 Docker 容器内执行导入导出
```bash
# 在容器内导出(绕过 PowerShell 编码问题)
docker exec -it postgres-container bash -c "pg_dump ... > /tmp/backup.sql"
# 在容器内导入
docker exec -i postgres-container psql ... < backup.sql
```
详见:[05-部署文档/0126部署/08-部署完成总结.md](../05-部署文档/0126部署/08-部署完成总结.md)
---
### 问题 3迁移后外键约束失败
**症状**`ERROR: insert or update violates foreign key constraint`
**原因**:表导入顺序错误
**修复**:使用事务或正确的依赖顺序
```sql
-- 临时禁用外键检查(谨慎使用)
SET session_replication_role = replica;
-- 导入数据...
-- 恢复外键检查
SET session_replication_role = DEFAULT;
```
---
## 📊 迁移对比脚本
迁移后执行此脚本,对比源和目标数据库:
```sql
-- 🔴 一键健康检查(迁移后必须执行)
-- 1. 全局统计(与本地对比)
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
-- 预期值2026-01-27tables=64, indexes=236, constraints=108
-- 2. platform_schema 约束数量
SELECT COUNT(*) as platform_constraints
FROM pg_constraint c
JOIN pg_namespace n ON c.connamespace = n.oid
WHERE n.nspname = 'platform_schema';
-- 预期值33
-- 3. 🔴 pg-boss 4 个关键约束
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 预期4 行
-- 4. 重复队列检查
SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;
-- 预期0 行
```
### 本地 vs RDS 快速对比命令
```bash
# 本地
docker exec ai-clinical-postgres psql "postgresql://postgres:postgres123@localhost:5432/ai_clinical_research" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
# RDS
docker exec ai-clinical-postgres psql "postgresql://airesearch:Xibahe%40fengzhibo117@pgm-2zex1m2y3r23hdn5xo.pg.rds.aliyuncs.com:5432/ai_clinical_research_test" -c "
SELECT
(SELECT COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as tables,
(SELECT COUNT(*) FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema')) as indexes,
(SELECT COUNT(*) FROM pg_constraint c JOIN pg_namespace n ON c.connamespace = n.oid WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')) as constraints;
"
```
---
## 📚 相关文档
- [01-PgBoss队列监控与维护](./01-PgBoss队列监控与维护.md) - 队列日常监控
- [02-故障预防检查清单](./02-故障预防检查清单.md) - 部署检查清单
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md) - 原始故障分析
---
## 📝 变更记录
| 日期 | 版本 | 变更内容 |
|------|------|---------|
| 2026-01-27 | v1.1 | 补充 3 个外键约束丢失分析,添加"双重管理者"问题说明 |
| 2026-01-27 | v1.0 | 初始版本,基于 RVW 任务无限循环故障 |