fix(backend): Resolve PgBoss infinite loop issue and cleanup unused files

Backend fixes:
- Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints)
- Add singletonKey to prevent duplicate job enqueueing
- Add idempotency check in reviewWorker (skip completed tasks)
- Add optimistic locking in reviewService (atomic status update)

Frontend fixes:
- Add isSubmitting state to prevent duplicate submissions in RVW Dashboard
- Fix API baseURL in knowledgeBaseApi (relative path)

Cleanup (removed):
- Old frontend/ directory (migrated to frontend-v2)
- python-microservice/ (unused, replaced by extraction_service)
- Root package.json and node_modules (accidentally created)
- redcap-docker-dev/ (external dependency)
- Various temporary files and outdated docs in root

New documentation:
- docs/07-运维文档/01-PgBoss队列监控与维护.md
- docs/07-运维文档/02-故障预防检查清单.md
- docs/07-运维文档/03-数据库迁移注意事项.md

Database fix applied to RDS:
- Added PRIMARY KEY to platform_schema.queue
- Added 3 missing foreign key constraints

Tested: Local build passed, RDS constraints verified
This commit is contained in:
2026-01-27 18:16:22 +08:00
parent 2481b786d8
commit bbf98c4d5c
214 changed files with 4318 additions and 44920 deletions

View File

@@ -0,0 +1,200 @@
# 故障预防检查清单
> **文档版本**v1.0
> **创建日期**2026-01-27
> **适用场景**:部署前后检查、日常巡检、故障预防
---
## 📋 部署前检查
### 代码检查
- [ ] **前端防重复提交**
- [ ] 异步操作添加 `isSubmitting``loading` 状态
- [ ] 提交按钮在请求期间 disabled
- [ ] 使用 `finally` 确保状态解锁
- [ ] **后端 API 幂等性**
- [ ] 状态更新使用 `updateMany` + WHERE 条件
- [ ] 检查更新数量,为 0 时返回错误
- [ ] 避免 `update` 后再检查状态(非原子操作)
- [ ] **队列任务**
- [ ] 使用 `singletonKey` 防止重复入队
- [ ] Worker 处理前检查业务状态
- [ ] 不在 `push()` 中调用 `createQueue()`
### 依赖检查
- [ ] 数据库连接正常
- [ ] Redis 连接正常(如使用)
- [ ] 第三方 API 可访问
- [ ] 环境变量配置完整
---
## 📋 部署后检查
### 立即检查(部署后 5 分钟内)
```sql
-- 0. 🔴 检查 pg-boss 4 个关键约束(数据库迁移后必查!)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 预期4 行
-- 如果少于 4 行,参考 03-数据库迁移注意事项.md 修复
-- 1. 检查重复队列定义
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- 预期:无返回
```
```sql
-- 2. 检查 Worker 是否注册(查看日志)
-- SAE 日志搜索关键字:
-- "Worker registered" 或 "Handler registered"
```
```bash
# 3. 健康检查接口
curl https://your-domain/api/health
# 预期:{"status":"ok"}
```
### 功能验证(部署后 30 分钟内)
- [ ] 用户登录正常
- [ ] 核心功能可用
- [ ] 异步任务执行正常
- [ ] 文件上传/下载正常
---
## 📋 日常巡检清单
### 每日必查09:00
| 检查项 | SQL/命令 | 预期结果 |
|--------|---------|---------|
| 队列重复定义 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
| 卡住的任务 | `SELECT COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes';` | 0 |
| 数据库连接数 | `SELECT count(*) FROM pg_stat_activity;` | < 80 |
### 一键健康检查
```sql
SELECT
'queue_duplicates' as check_type,
CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END as status,
COUNT(*) as count
FROM (
SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
) t
UNION ALL
SELECT
'stuck_active_jobs',
CASE WHEN COUNT(*) > 0 THEN '⚠️ 警告' ELSE '✅ 正常' END,
COUNT(*)
FROM platform_schema.job_common
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
UNION ALL
SELECT
'db_connections',
CASE WHEN COUNT(*) > 80 THEN '⚠️ 警告' ELSE '✅ 正常' END,
COUNT(*)
FROM pg_stat_activity;
```
---
## 📋 代码审查清单PR Review
### 前端
- [ ] 表单提交有 loading 状态
- [ ] 按钮防抖/禁用处理
- [ ] 错误处理完整try-catch-finally
- [ ] 异步请求有超时设置
### 后端 API
- [ ] 使用事务或原子操作
- [ ] 状态检查在更新之前
- [ ] 错误信息明确
- [ ] 日志记录关键步骤
### 队列任务
- [ ]`singletonKey`
- [ ] Worker 有幂等性检查
- [ ] 超时和重试配置合理
- [ ] 失败处理完整
---
## 📋 故障响应流程
### 1. 发现问题
- 用户反馈
- 监控告警
- 日志异常
### 2. 初步诊断5 分钟内)
```sql
-- 快速健康检查
SELECT
'queue_duplicates' as t, COUNT(*) FROM (SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1) x
UNION ALL
SELECT 'stuck_jobs', COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
UNION ALL
SELECT 'db_connections', COUNT(*) FROM pg_stat_activity;
```
### 3. 紧急修复
| 问题类型 | 紧急措施 |
|---------|---------|
| 任务无限循环 | 清理重复队列定义,重启服务 |
| 数据库连接满 | 强制断开空闲连接,重启服务 |
| 服务不可用 | 重启 SAE 应用 |
### 4. 根因分析
- 收集日志
- 分析数据库状态
- 复盘代码逻辑
### 5. 长期修复
- 提交代码修复 PR
- 更新文档
- 添加预防措施
---
## 📋 历史故障记录
| 日期 | 故障类型 | 影响范围 | 根因 | 修复措施 |
|------|---------|---------|------|---------|
| 2026-01-27 | 任务无限循环 | RVW 模块 | **数据库迁移丢失主键约束** → 队列定义重复 | 添加主键 + 四层防御 |
| 2026-01-27 | 中文乱码 | 全系统 | PowerShell 编码问题 | Docker 内执行迁移 |
| 2026-01-11 | 数据库事故 | 全系统 | 误操作 | 备份恢复流程 |
---
## 📚 相关文档
- [PgBoss 队列监控与维护](./01-PgBoss队列监控与维护.md)
- [数据库运维手册](./03-数据库运维手册.md)
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md)