Files
AIclinicalresearch/docs/07-运维文档/02-故障预防检查清单.md
HaHafeng bbf98c4d5c fix(backend): Resolve PgBoss infinite loop issue and cleanup unused files
Backend fixes:
- Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints)
- Add singletonKey to prevent duplicate job enqueueing
- Add idempotency check in reviewWorker (skip completed tasks)
- Add optimistic locking in reviewService (atomic status update)

Frontend fixes:
- Add isSubmitting state to prevent duplicate submissions in RVW Dashboard
- Fix API baseURL in knowledgeBaseApi (relative path)

Cleanup (removed):
- Old frontend/ directory (migrated to frontend-v2)
- python-microservice/ (unused, replaced by extraction_service)
- Root package.json and node_modules (accidentally created)
- redcap-docker-dev/ (external dependency)
- Various temporary files and outdated docs in root

New documentation:
- docs/07-运维文档/01-PgBoss队列监控与维护.md
- docs/07-运维文档/02-故障预防检查清单.md
- docs/07-运维文档/03-数据库迁移注意事项.md

Database fix applied to RDS:
- Added PRIMARY KEY to platform_schema.queue
- Added 3 missing foreign key constraints

Tested: Local build passed, RDS constraints verified
2026-01-27 18:16:22 +08:00

201 lines
5.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 故障预防检查清单
> **文档版本**v1.0
> **创建日期**2026-01-27
> **适用场景**:部署前后检查、日常巡检、故障预防
---
## 📋 部署前检查
### 代码检查
- [ ] **前端防重复提交**
- [ ] 异步操作添加 `isSubmitting``loading` 状态
- [ ] 提交按钮在请求期间 disabled
- [ ] 使用 `finally` 确保状态解锁
- [ ] **后端 API 幂等性**
- [ ] 状态更新使用 `updateMany` + WHERE 条件
- [ ] 检查更新数量,为 0 时返回错误
- [ ] 避免 `update` 后再检查状态(非原子操作)
- [ ] **队列任务**
- [ ] 使用 `singletonKey` 防止重复入队
- [ ] Worker 处理前检查业务状态
- [ ] 不在 `push()` 中调用 `createQueue()`
### 依赖检查
- [ ] 数据库连接正常
- [ ] Redis 连接正常(如使用)
- [ ] 第三方 API 可访问
- [ ] 环境变量配置完整
---
## 📋 部署后检查
### 立即检查(部署后 5 分钟内)
```sql
-- 0. 🔴 检查 pg-boss 4 个关键约束(数据库迁移后必查!)
SELECT conname FROM pg_constraint
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
-- 预期4 行
-- 如果少于 4 行,参考 03-数据库迁移注意事项.md 修复
-- 1. 检查重复队列定义
SELECT name, COUNT(*) as cnt
FROM platform_schema.queue
GROUP BY name
HAVING COUNT(*) > 1;
-- 预期:无返回
```
```sql
-- 2. 检查 Worker 是否注册(查看日志)
-- SAE 日志搜索关键字:
-- "Worker registered" 或 "Handler registered"
```
```bash
# 3. 健康检查接口
curl https://your-domain/api/health
# 预期:{"status":"ok"}
```
### 功能验证(部署后 30 分钟内)
- [ ] 用户登录正常
- [ ] 核心功能可用
- [ ] 异步任务执行正常
- [ ] 文件上传/下载正常
---
## 📋 日常巡检清单
### 每日必查09:00
| 检查项 | SQL/命令 | 预期结果 |
|--------|---------|---------|
| 队列重复定义 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
| 卡住的任务 | `SELECT COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes';` | 0 |
| 数据库连接数 | `SELECT count(*) FROM pg_stat_activity;` | < 80 |
### 一键健康检查
```sql
SELECT
'queue_duplicates' as check_type,
CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END as status,
COUNT(*) as count
FROM (
SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
) t
UNION ALL
SELECT
'stuck_active_jobs',
CASE WHEN COUNT(*) > 0 THEN '⚠️ 警告' ELSE '✅ 正常' END,
COUNT(*)
FROM platform_schema.job_common
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
UNION ALL
SELECT
'db_connections',
CASE WHEN COUNT(*) > 80 THEN '⚠️ 警告' ELSE '✅ 正常' END,
COUNT(*)
FROM pg_stat_activity;
```
---
## 📋 代码审查清单PR Review
### 前端
- [ ] 表单提交有 loading 状态
- [ ] 按钮防抖/禁用处理
- [ ] 错误处理完整try-catch-finally
- [ ] 异步请求有超时设置
### 后端 API
- [ ] 使用事务或原子操作
- [ ] 状态检查在更新之前
- [ ] 错误信息明确
- [ ] 日志记录关键步骤
### 队列任务
- [ ]`singletonKey`
- [ ] Worker 有幂等性检查
- [ ] 超时和重试配置合理
- [ ] 失败处理完整
---
## 📋 故障响应流程
### 1. 发现问题
- 用户反馈
- 监控告警
- 日志异常
### 2. 初步诊断5 分钟内)
```sql
-- 快速健康检查
SELECT
'queue_duplicates' as t, COUNT(*) FROM (SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1) x
UNION ALL
SELECT 'stuck_jobs', COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
UNION ALL
SELECT 'db_connections', COUNT(*) FROM pg_stat_activity;
```
### 3. 紧急修复
| 问题类型 | 紧急措施 |
|---------|---------|
| 任务无限循环 | 清理重复队列定义,重启服务 |
| 数据库连接满 | 强制断开空闲连接,重启服务 |
| 服务不可用 | 重启 SAE 应用 |
### 4. 根因分析
- 收集日志
- 分析数据库状态
- 复盘代码逻辑
### 5. 长期修复
- 提交代码修复 PR
- 更新文档
- 添加预防措施
---
## 📋 历史故障记录
| 日期 | 故障类型 | 影响范围 | 根因 | 修复措施 |
|------|---------|---------|------|---------|
| 2026-01-27 | 任务无限循环 | RVW 模块 | **数据库迁移丢失主键约束** → 队列定义重复 | 添加主键 + 四层防御 |
| 2026-01-27 | 中文乱码 | 全系统 | PowerShell 编码问题 | Docker 内执行迁移 |
| 2026-01-11 | 数据库事故 | 全系统 | 误操作 | 备份恢复流程 |
---
## 📚 相关文档
- [PgBoss 队列监控与维护](./01-PgBoss队列监控与维护.md)
- [数据库运维手册](./03-数据库运维手册.md)
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md)