Backend fixes: - Fix PgBoss task infinite loop on SAE (root cause: missing queue table constraints) - Add singletonKey to prevent duplicate job enqueueing - Add idempotency check in reviewWorker (skip completed tasks) - Add optimistic locking in reviewService (atomic status update) Frontend fixes: - Add isSubmitting state to prevent duplicate submissions in RVW Dashboard - Fix API baseURL in knowledgeBaseApi (relative path) Cleanup (removed): - Old frontend/ directory (migrated to frontend-v2) - python-microservice/ (unused, replaced by extraction_service) - Root package.json and node_modules (accidentally created) - redcap-docker-dev/ (external dependency) - Various temporary files and outdated docs in root New documentation: - docs/07-运维文档/01-PgBoss队列监控与维护.md - docs/07-运维文档/02-故障预防检查清单.md - docs/07-运维文档/03-数据库迁移注意事项.md Database fix applied to RDS: - Added PRIMARY KEY to platform_schema.queue - Added 3 missing foreign key constraints Tested: Local build passed, RDS constraints verified
201 lines
5.0 KiB
Markdown
201 lines
5.0 KiB
Markdown
# 故障预防检查清单
|
||
|
||
> **文档版本**:v1.0
|
||
> **创建日期**:2026-01-27
|
||
> **适用场景**:部署前后检查、日常巡检、故障预防
|
||
|
||
---
|
||
|
||
## 📋 部署前检查
|
||
|
||
### 代码检查
|
||
|
||
- [ ] **前端防重复提交**
|
||
- [ ] 异步操作添加 `isSubmitting` 或 `loading` 状态
|
||
- [ ] 提交按钮在请求期间 disabled
|
||
- [ ] 使用 `finally` 确保状态解锁
|
||
|
||
- [ ] **后端 API 幂等性**
|
||
- [ ] 状态更新使用 `updateMany` + WHERE 条件
|
||
- [ ] 检查更新数量,为 0 时返回错误
|
||
- [ ] 避免 `update` 后再检查状态(非原子操作)
|
||
|
||
- [ ] **队列任务**
|
||
- [ ] 使用 `singletonKey` 防止重复入队
|
||
- [ ] Worker 处理前检查业务状态
|
||
- [ ] 不在 `push()` 中调用 `createQueue()`
|
||
|
||
### 依赖检查
|
||
|
||
- [ ] 数据库连接正常
|
||
- [ ] Redis 连接正常(如使用)
|
||
- [ ] 第三方 API 可访问
|
||
- [ ] 环境变量配置完整
|
||
|
||
---
|
||
|
||
## 📋 部署后检查
|
||
|
||
### 立即检查(部署后 5 分钟内)
|
||
|
||
```sql
|
||
-- 0. 🔴 检查 pg-boss 4 个关键约束(数据库迁移后必查!)
|
||
SELECT conname FROM pg_constraint
|
||
WHERE connamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'platform_schema')
|
||
AND conname IN ('queue_pkey', 'queue_dead_letter_fkey', 'schedule_name_fkey', 'subscription_name_fkey');
|
||
-- 预期:4 行
|
||
-- 如果少于 4 行,参考 03-数据库迁移注意事项.md 修复
|
||
|
||
-- 1. 检查重复队列定义
|
||
SELECT name, COUNT(*) as cnt
|
||
FROM platform_schema.queue
|
||
GROUP BY name
|
||
HAVING COUNT(*) > 1;
|
||
-- 预期:无返回
|
||
```
|
||
|
||
```sql
|
||
-- 2. 检查 Worker 是否注册(查看日志)
|
||
-- SAE 日志搜索关键字:
|
||
-- "Worker registered" 或 "Handler registered"
|
||
```
|
||
|
||
```bash
|
||
# 3. 健康检查接口
|
||
curl https://your-domain/api/health
|
||
# 预期:{"status":"ok"}
|
||
```
|
||
|
||
### 功能验证(部署后 30 分钟内)
|
||
|
||
- [ ] 用户登录正常
|
||
- [ ] 核心功能可用
|
||
- [ ] 异步任务执行正常
|
||
- [ ] 文件上传/下载正常
|
||
|
||
---
|
||
|
||
## 📋 日常巡检清单
|
||
|
||
### 每日必查(09:00)
|
||
|
||
| 检查项 | SQL/命令 | 预期结果 |
|
||
|--------|---------|---------|
|
||
| 队列重复定义 | `SELECT name, COUNT(*) FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1;` | 无返回 |
|
||
| 卡住的任务 | `SELECT COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes';` | 0 |
|
||
| 数据库连接数 | `SELECT count(*) FROM pg_stat_activity;` | < 80 |
|
||
|
||
### 一键健康检查
|
||
|
||
```sql
|
||
SELECT
|
||
'queue_duplicates' as check_type,
|
||
CASE WHEN COUNT(*) > 0 THEN '❌ 异常' ELSE '✅ 正常' END as status,
|
||
COUNT(*) as count
|
||
FROM (
|
||
SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1
|
||
) t
|
||
|
||
UNION ALL
|
||
|
||
SELECT
|
||
'stuck_active_jobs',
|
||
CASE WHEN COUNT(*) > 0 THEN '⚠️ 警告' ELSE '✅ 正常' END,
|
||
COUNT(*)
|
||
FROM platform_schema.job_common
|
||
WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
|
||
|
||
UNION ALL
|
||
|
||
SELECT
|
||
'db_connections',
|
||
CASE WHEN COUNT(*) > 80 THEN '⚠️ 警告' ELSE '✅ 正常' END,
|
||
COUNT(*)
|
||
FROM pg_stat_activity;
|
||
```
|
||
|
||
---
|
||
|
||
## 📋 代码审查清单(PR Review)
|
||
|
||
### 前端
|
||
|
||
- [ ] 表单提交有 loading 状态
|
||
- [ ] 按钮防抖/禁用处理
|
||
- [ ] 错误处理完整(try-catch-finally)
|
||
- [ ] 异步请求有超时设置
|
||
|
||
### 后端 API
|
||
|
||
- [ ] 使用事务或原子操作
|
||
- [ ] 状态检查在更新之前
|
||
- [ ] 错误信息明确
|
||
- [ ] 日志记录关键步骤
|
||
|
||
### 队列任务
|
||
|
||
- [ ] 有 `singletonKey`
|
||
- [ ] Worker 有幂等性检查
|
||
- [ ] 超时和重试配置合理
|
||
- [ ] 失败处理完整
|
||
|
||
---
|
||
|
||
## 📋 故障响应流程
|
||
|
||
### 1. 发现问题
|
||
|
||
- 用户反馈
|
||
- 监控告警
|
||
- 日志异常
|
||
|
||
### 2. 初步诊断(5 分钟内)
|
||
|
||
```sql
|
||
-- 快速健康检查
|
||
SELECT
|
||
'queue_duplicates' as t, COUNT(*) FROM (SELECT name FROM platform_schema.queue GROUP BY name HAVING COUNT(*) > 1) x
|
||
UNION ALL
|
||
SELECT 'stuck_jobs', COUNT(*) FROM platform_schema.job_common WHERE state = 'active' AND started_on < NOW() - INTERVAL '30 minutes'
|
||
UNION ALL
|
||
SELECT 'db_connections', COUNT(*) FROM pg_stat_activity;
|
||
```
|
||
|
||
### 3. 紧急修复
|
||
|
||
| 问题类型 | 紧急措施 |
|
||
|---------|---------|
|
||
| 任务无限循环 | 清理重复队列定义,重启服务 |
|
||
| 数据库连接满 | 强制断开空闲连接,重启服务 |
|
||
| 服务不可用 | 重启 SAE 应用 |
|
||
|
||
### 4. 根因分析
|
||
|
||
- 收集日志
|
||
- 分析数据库状态
|
||
- 复盘代码逻辑
|
||
|
||
### 5. 长期修复
|
||
|
||
- 提交代码修复 PR
|
||
- 更新文档
|
||
- 添加预防措施
|
||
|
||
---
|
||
|
||
## 📋 历史故障记录
|
||
|
||
| 日期 | 故障类型 | 影响范围 | 根因 | 修复措施 |
|
||
|------|---------|---------|------|---------|
|
||
| 2026-01-27 | 任务无限循环 | RVW 模块 | **数据库迁移丢失主键约束** → 队列定义重复 | 添加主键 + 四层防御 |
|
||
| 2026-01-27 | 中文乱码 | 全系统 | PowerShell 编码问题 | Docker 内执行迁移 |
|
||
| 2026-01-11 | 数据库事故 | 全系统 | 误操作 | 备份恢复流程 |
|
||
|
||
---
|
||
|
||
## 📚 相关文档
|
||
|
||
- [PgBoss 队列监控与维护](./01-PgBoss队列监控与维护.md)
|
||
- [数据库运维手册](./03-数据库运维手册.md)
|
||
- [故障分析报告](../06-测试文档/故障分析报告%20(1).md)
|