feat(aia): Implement Protocol Agent MVP with reusable Agent framework

Sprint 1-3 Completed (Backend + Frontend):

Backend (Sprint 1-2):
- Implement 5-layer Agent framework (Query->Planner->Executor->Tools->Reflection)
- Create agent_schema with 6 tables (agent_definitions, stages, prompts, sessions, traces, reflexion_rules)
- Create protocol_schema with 2 tables (protocol_contexts, protocol_generations)
- Implement Protocol Agent core services (Orchestrator, ContextService, PromptBuilder)
- Integrate LLM service adapter (DeepSeek/Qwen/GPT-5/Claude)
- 6 API endpoints with full authentication
- 10/10 API tests passed

Frontend (Sprint 3):
- Add Protocol Agent entry in AgentHub (indigo theme card)
- Implement ProtocolAgentPage with 3-column layout
- Collapsible sidebar (Gemini style, 48px <-> 280px)
- StatePanel with 5 stage cards (scientific_question, pico, study_design, sample_size, endpoints)
- ChatArea with sync button and action cards integration
- 100% prototype design restoration (608 lines CSS)
- Detailed endpoints structure: baseline, exposure, outcomes, confounders

Features:
- 5-stage dialogue flow for research protocol design
- Conversation-driven interaction with sync-to-protocol button
- Real-time context state management
- One-click protocol generation button (UI ready, backend pending)

Database:
- agent_schema: 6 tables for reusable Agent framework
- protocol_schema: 2 tables for Protocol Agent
- Seed data: 1 agent + 5 stages + 9 prompts + 4 reflexion rules

Code Stats:
- Backend: 13 files, 4338 lines
- Frontend: 14 files, 2071 lines
- Total: 27 files, 6409 lines

Status: MVP core functionality completed, pending frontend-backend integration testing

Next: Sprint 4 - One-click protocol generation + Word export
2026-01-24 17:29:24 +08:00
parent 61cdc97eeb
commit 96290d2f76
345 changed files with 13945 additions and 47 deletions
--- a/docs/07-运维文档/全自动巡检系统设计方案.md
+++ b/docs/07-运维文档/全自动巡检系统设计方案.md
@@ -0,0 +1,193 @@
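+# **Fully Automated Health Check System Design (Synthetic Monitoring)**
+
+**Goal:** Every day at 06:00, automatically verify the core functionality of the system's 7 major modules and confirm that the system is healthy. If anything is abnormal, push an alert to WeCom (企业微信) immediately.
+
+**Target environment:** Alibaba Cloud SAE (Job)
+
+**Executor:** automated health check script (HealthCheck Bot)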
+
+## **1\. Architecture**
+
+### **How It Works**
+
+graph LR  
+    A\[⏰ SAE scheduled job\<br\>06:00 AM\] \--\>|start container| B\[🩺 Health check script\<br\>HealthCheck Bot\]  
+    B \--\>|1. HTTP requests| C\[🌐 Frontend/backend API\]  
+    B \--\>|2. Database queries| D\[🐘 PostgreSQL\]  
+    B \--\>|3. Model calls| E\[🤖 LLM service\]  
+      
+    B \--\>|✅ All checks pass| F\[✅ Write log, no notification\]  
+    B \--\>|❌ Anomaly found| G\[🚨 WeCom alert\]
+
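+### **Why an SAE Job?**
+
+* **No extra server**: there is no need to buy a separate ECS instance just to run one script.  
+* **Shared network**: the script runs inside the VPC, so it can connect directly to the RDS database to verify data consistency and can call the Python microservice over its private IP. **No traffic crosses the public internet, which keeps latency low.**  
+* **Simple configuration**: the setup is identical to deploying the backend application, except that the start command runs the script once.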
+
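+## **2\. Health Check Script Logic (TypeScript Pseudocode)**
+
+Create a new directory scripts/health-check/ in the backend project so that the script can reuse the existing ORM and services.
+
+### **Core Checks (Covering the 7 Major Modules)**
+
+| Module | Check Point | Expected Result |
+| :---- | :---- | :---- |
+| **Foundation** | **Database connection** | prisma.$queryRaw\`SELECT 1\` returns 1 in \< 100 ms |
+| **Foundation** | **External API connectivity** | ping api.deepseek.com succeeds (verifies that the NAT gateway works) |
+| **AIA** | **AI Q\&A response** | Send "Hi" to DeepSeek and receive a reply within 5 s (verifies the LLM path) |
+| **PKB** | **Vector retrieval (RAG)** | Upload a test text and retrieve it by keyword 1 s later (verifies pgvector and embedding) |
+| **ASL** | **Python microservice** | Call the Python service's /health endpoint and get 200 (verifies that the Python container is alive) |
+| **DC** | **Data cleaning** | Send a simple JSON payload to the Tool C endpoint and verify that a cleaned result is returned |
+| **OSS** | **File storage** | Upload health\_check.txt to OSS, download it, and confirm the content matches (verifies storage) |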
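The time budgets in the table ("< 100 ms", "within 5 s") could be enforced by a small wrapper around each check. The following is a minimal sketch, not part of the project: `timedCheck` and `CheckResult` are hypothetical names, and the wrapper only stops waiting for a slow check, it does not cancel it.

```typescript
// Hypothetical helper (not from the project): runs one check under a time budget.
interface CheckResult {
  name: string;
  ok: boolean;
  elapsedMs: number;
  error?: string;
}

function timedCheck(
  name: string,
  budgetMs: number,
  check: () => Promise<void>,
): Promise<CheckResult> {
  const start = Date.now();
  return new Promise<CheckResult>((resolve) => {
    // If the budget runs out first, report a failure. The check keeps running
    // in the background; later resolve() calls are ignored by the Promise.
    const timer = setTimeout(() => {
      resolve({
        name,
        ok: false,
        elapsedMs: Date.now() - start,
        error: `exceeded ${budgetMs}ms budget`,
      });
    }, budgetMs);

    check()
      .then(
        () => resolve({ name, ok: true, elapsedMs: Date.now() - start }),
        (e: unknown) =>
          resolve({
            name,
            ok: false,
            elapsedMs: Date.now() - start,
            error: e instanceof Error ? e.message : String(e),
          }),
      )
      .finally(() => clearTimeout(timer));
  });
}
```

For example, `await timedCheck('Database', 100, checkDatabase)` would flag a database round trip that takes longer than the 100 ms listed in the table, even when the query itself eventually succeeds.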
+
+### **Code Example**
+
+// backend/scripts/health-check/run.ts
+
+import { PrismaClient } from '@prisma/client';  
+import axios from 'axios';  
+import { sendWecomAlert } from './wecom'; // reuses the WeCom notification code from the IIT module
+
+const prisma \= new PrismaClient();  
+const REPORT \= {  
+  success: true,  
+  modules: \[\] as string\[\],  
+  errors: \[\] as string\[\]  
+};
+
+async function checkDatabase() {  
+  const start \= Date.now();  
+  try {  
+    await prisma.$queryRaw\`SELECT 1\`;  
+    REPORT.modules.push(\`✅ Database (${Date.now() \- start}ms)\`);  
+  } catch (e) {  
+    throw new Error(\`Database connection failed: ${(e as Error).message}\`);  
+  }  
+}
+
+async function checkPythonService() {  
+  // Use the private (VPC) address, taken from the existing environment variable  
+  const url \= process.env.EXTRACTION\_SERVICE\_URL \+ '/health';   
+  try {  
+    const res \= await axios.get(url, { timeout: 2000 });  
+    // axios already rejects non-2xx responses; treat any 2xx other than 200 as a failure too  
+    if (res.status \!== 200\) throw new Error(\`unexpected status ${res.status}\`);  
+    REPORT.modules.push(\`✅ Python Service\`);  
+  } catch (e) {  
+    throw new Error(\`Python Service unreachable: ${(e as Error).message}\`);  
+  }  
+}
+
+async function checkLLM() {  
+  // Test by calling a simple chat endpoint  
+  try {  
+    // Placeholder: send one short AI chat request here...  
+    REPORT.modules.push(\`✅ LLM Gateway\`);  
+  } catch (e) {  
+    throw new Error(\`LLM API failed: ${(e as Error).message}\`);  
+  }  
+}
+
+// ... more check functions ...
+
+async function main() {  
+  console.log('🚀 Starting Daily Health Check...');  
+    
+  try {  
+    await checkDatabase();  
+    await checkPythonService();  
+    await checkLLM();  
+    // await checkOSS();  
+    // await checkRAG();  
+      
+    console.log('🎉 All systems healthy\!');  
+    // Optional: also send a short notification on success, so a green check mark is waiting each morning  
+    // await sendWecomAlert('🟢 Daily health check passed: system is running normally');   
+      
+  } catch (error: any) {  
+    console.error('🔥 Health Check Failed:', error);  
+    REPORT.success \= false;  
+      
+    // 🚨 A check failed: push an alert immediately  
+    await sendWecomAlert(\`🔴 \*\*Production Environment Alert\*\* 🔴\\n\\nChecked at: ${new Date().toLocaleString()}\\nFailed check: ${error.message}\\n\\nPlease check the SAE console as soon as possible.\`);  
+      
+    // Set the exit code instead of calling process.exit(1), which would skip the finally block  
+    process.exitCode \= 1; // makes SAE mark the job run as failed  
+  } finally {  
+    await prisma.$disconnect();  
+  }  
+}
+
+main();
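The `main()` above stops at the first failing check, so an alert can name only one failure. If a report listing every check is wanted, one option is to run all checks and collect their outcomes. This is a sketch under assumptions: `Check`, `CheckOutcome`, and `runAllChecks` are hypothetical names, and the checks are assumed to be independent so they can run concurrently.

```typescript
// Hypothetical types and runner (illustrative only, not from the project).
type Check = { name: string; run: () => Promise<void> };
type CheckOutcome = { name: string; ok: boolean; detail: string };

async function runAllChecks(checks: Check[]): Promise<CheckOutcome[]> {
  // Each check gets its own try/catch, so one failure cannot hide the others.
  return Promise.all(
    checks.map(async ({ name, run }): Promise<CheckOutcome> => {
      try {
        await run();
        return { name, ok: true, detail: 'OK' };
      } catch (e) {
        return { name, ok: false, detail: e instanceof Error ? e.message : String(e) };
      }
    }),
  );
}

// Usage sketch inside main():
//   const outcomes = await runAllChecks([
//     { name: 'Database', run: checkDatabase },
//     { name: 'Python Service', run: checkPythonService },
//   ]);
//   if (outcomes.some((o) => !o.ok)) { /* send one alert that lists every outcome */ }
```

The trade-off is that a broken dependency is no longer short-circuited: every check runs even when the first one has already failed, which costs a few extra seconds but gives a complete picture in a single alert.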
+
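+## **3\. Deploying on Alibaba Cloud SAE**
+
+### **Step 1: Build the Image**
+
+Because the script lives in the backend project, you can **reuse the Node.js backend image directly**.
+
+No dedicated image is needed: the backend image already contains the Node runtime, Prisma Client, and all dependencies.
+
+### **Step 2: Create an SAE Job**
+
+1. Log in to the SAE console \-\> Job list \-\> Create Job.  
+2. **Application name**: clinical-health-check  
+3. **Image**: select your backend image (e.g. backend:latest).  
+4. **Command**:  
+   * This command overrides the Dockerfile CMD.  
+   * Enter: npx tsx scripts/health-check/run.ts (assuming you run it with tsx)  
+5. **Scheduling**:  
+   * Concurrency policy: Forbid (no concurrent runs; a new run is skipped while the previous one is still running)  
+   * Schedule (Crontab): 0 6 \* \* \* (runs every day at 06:00)  
+6. **Environment variables**:  
+   * **Copy** all environment variables from the production backend application (DATABASE\_URL, WECHAT\_KEY, etc.).  
+7. **Network**:  
+   * Select the same VPC and VSwitch as production (so that the job can reach RDS).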
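Since the job depends on environment variables copied by hand from the backend application, a forgotten variable would otherwise surface as a confusing failure (for example, a request to "undefined/health"). A fail-fast guard might look like the sketch below. `missingEnvVars` is a hypothetical helper, and the list of required names is an assumption built only from the variables mentioned in this document.

```typescript
// Names taken from this document; the complete list for a real deployment is an assumption.
const REQUIRED_ENV = ['DATABASE_URL', 'WECHAT_KEY', 'EXTRACTION_SERVICE_URL'];

// Returns the names of required variables that are unset or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_ENV,
): string[] {
  return required.filter((key) => (env[key] ?? '').trim() === '');
}

// Usage sketch at the top of main():
//   const missing = missingEnvVars(process.env);
//   if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(', ')}`);
```

Running this guard before any check means a misconfigured Job fails with one clear message on its first manual trigger.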
+
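+## **4\. Alert Notification Template (WeCom)**
+
+When the script catches an exception, it sends the following Markdown message to your development group:
+
+🔴 \*\*Critical: production health check failed\*\*
+
+\> \*\*Checked at\*\*: 2026-01-24 06:00:05  
+\> \*\*Environment\*\*: Aliyun Production
+
+\*\*Details\*\*:  
+❌ \*\*\[Python Service\]\*\* Connection refused (172.16.x.x:8000)  
+✅ \*\*\[Database\]\*\* OK  
+✅ \*\*\[OSS\]\*\* OK
+
+\*\*Suggested actions\*\*:  
+1\. Check whether the Python microservice container has restarted  
+2\. Check SAE memory monitoring  
+3\. \[Open the SAE console\](https://sae.console.aliyun.com/...)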
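A message in the layout of the template above could be assembled from per-check results roughly as follows. This is a sketch, not the project's `sendWecomAlert`: `Outcome`, `renderAlert`, and `toWecomPayload` are hypothetical names, and the payload shape assumes the alert goes through a WeCom group-robot webhook, which the document does not confirm.

```typescript
// Hypothetical shape; mirrors the per-check records a check runner might produce.
interface Outcome {
  name: string;
  ok: boolean;
  detail: string;
}

// Builds the Markdown body in the layout of the alert template.
function renderAlert(outcomes: Outcome[], checkedAt: string, env: string): string {
  const lines = outcomes
    .slice() // copy, so the caller's array is not reordered
    .sort((a, b) => Number(a.ok) - Number(b.ok)) // failures first
    .map((o) => `${o.ok ? '✅' : '❌'} **[${o.name}]** ${o.detail}`);
  return [
    '🔴 **Critical: production health check failed**',
    '',
    `> **Checked at**: ${checkedAt}`,
    `> **Environment**: ${env}`,
    '',
    '**Details**:',
    ...lines,
  ].join('\n');
}

// Request body for a markdown message sent to a WeCom group-robot webhook.
function toWecomPayload(content: string) {
  return { msgtype: 'markdown', markdown: { content } };
}
```

Under the webhook assumption, the payload would be sent with something like `axios.post(webhookUrl, toWecomPayload(markdown))`, where `webhookUrl` is the robot's webhook address including its key.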
+
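+## **5\. Advanced: Alibaba Cloud Native Monitoring (Fallback)**
+
+Besides the custom script, Alibaba Cloud offers a **free, easy-to-use** feature that is worth enabling alongside it as a **second layer of protection**.
+
+### **Alibaba Cloud CloudMonitor \-\> Site Monitor**
+
+1. **No code required**.  
+2. In the CloudMonitor console, create a "Site Monitor" task.  
+3. **Monitored address**: enter your public address http://8.140.53.236/.  
+4. **Frequency**: once per minute.  
+5. **Alerting**: if the HTTP status code \!= 200 or the response time \> 5 s, send an SMS to your phones.
+
+**Comparison:**
+
+* **CloudMonitor**: only tells you whether the site is reachable (a ping-level check).  
+* **Custom script (this design)**: tells you whether **features actually work**. For example, if the site loads but the AI cannot reply, CloudMonitor will not notice, but the script will.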
+
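+## **6\. Implementation Roadmap**
+
+1. **Day 1 (local development)**:  
+   * Write run.ts in the backend project.  
+   * Run npx tsx scripts/health-check/run.ts locally and make sure every check completes.  
+   * Test that the WeCom push works.  
+2. **Day 2 (SAE deployment)**:  
+   * No separate release is needed, provided the deployed backend image already contains the script file (confirm that the image includes the scripts/ directory, not only src).  
+   * Create the Job in SAE, trigger it manually once, and check the logs.  
+3. **Day 3 (sleep well)**:  
+   * Enable the schedule and receive the green ✅ message every morning (this requires turning on the optional success notification in main()).
+
+With this in place, the first look at your phone each morning tells you the state of the system. If you see "✅ Daily health check passed", you can get on with your morning routine. If nothing arrives, or a red alert does, you can start arranging a fix on the commute instead of scrambling after users complain.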