# SSA-Pro Backend Development Guide

> **Document version:** v1.7
> **Created:** 2026-02-18
> **Last updated:** 2026-02-20 (adds the Prompt system, multi-tool workflow planning, and the data quality report)
> **Audience:** Node.js backend engineers

---

## 1. Module Directory Structure (Concepts Made Explicit)

```
backend/src/modules/ssa/
├── index.ts                          # Module entry point, registers routes
├── routes/
│   ├── session.routes.ts             # Session management routes
│   ├── analysis.routes.ts            # Analysis execution routes
│   ├── consult.routes.ts             # 🆕 Consultation mode routes
│   └── config.routes.ts              # 🆕 Configuration center routes
│
├── planner/                          # 🆕 Planner responsibilities (the brain)
│   ├── DataParserService.ts          # Data parsing + schema extraction
│   ├── DecisionTableService.ts       # 🆕 Decision table matching (Goal, Y, X, Design)
│   ├── ToolRetrievalService.ts       # RAG tool retrieval (auxiliary)
│   ├── PlannerService.ts             # AI planning (with data)
│   ├── ConsultService.ts             # Consultation without data
│   ├── SAPGeneratorService.ts        # SAP document generation
│   └── CriticService.ts              # Result interpretation (streaming)
│
├── executor/                         # 🆕 Executor responsibilities (the limbs)
│   ├── RClientService.ts             # R service calls
│   └── InterpretationService.ts      # 🆕 Result interpretation (configured templates)
│
├── config/                           # 🆕 Configuration center
│   ├── DecisionTableLoader.ts        # 🆕 Statistical decision table loader
│   ├── RCodeLibraryService.ts        # 🆕 R code library management
│   ├── ParamMappingService.ts        # 🆕 Parameter mapping configuration
│   ├── GuardrailConfigService.ts     # 🆕 Guardrail rule chain
│   ├── ConfigValidatorService.ts     # Configuration validation
│   └── ConfigCacheService.ts         # Configuration cache
│
├── validators/
│   └── planSchema.ts                 # 📌 Zod schema definitions
├── dto/
│   ├── CreateSessionDto.ts
│   ├── UploadDataDto.ts
│   └── ExecuteAnalysisDto.ts
└── types/
    └── index.ts                      # Type definitions
```

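### 1.1 🆕 Design Principles (Five Core Principles)

| Principle | Description |
|------|------|
| **Planner/Executor separation** | Do not mix planning logic and execution logic in a single class |
| **No-data mode support** | The Planner can work on its own without any data |
| **Decision table first** | 🆕 Tool selection matches against the decision table first, with RAG as the fallback |
| **Externalized configuration** | Tool definitions and guardrail rules are loaded from Excel or the database, never hard-coded |
| **Unified entry function** | 🆕 Every R script exposes the same `run_analysis()` entry point |

---
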
## 1.2 🆕 Prompt System and Expert Configuration Boundaries

> **Core idea: separate the skeleton from the flesh**
>
> Detailed design: `06-开发记录/SSA-Pro Prompt体系与专家配置边界梳理.md`

### 1.2.1 Dynamic Prompt Injection Pattern

```
┌─────────────────────────────────────────────────────────────────┐
│              Dynamic Prompt Injection Architecture              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Maintained by AI engineers     Maintained by statisticians     │
│  ┌─────────────────┐            ┌────────────────────────────┐  │
│  │ Prompt template │            │ Config center (Excel)      │  │
│  │ (skeleton)      │            │ - Decision table           │  │
│  │                 │            │ - Usage rules              │  │
│  │ {{placeholder}} │ ◀───────── │ - Interpretation templates │  │
│  │                 │   inject   │ - Banned-word list         │  │
│  └─────────────────┘            └────────────────────────────┘  │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Full prompt = skeleton + flesh → sent to the LLM            ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
```

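The injection step in the diagram can be sketched as a plain template-filling function. This is a minimal illustration, not the project's implementation: `renderPrompt`, the `{{toolDefinitions}}` and `{{usageRules}}` slot names, and the sample skeleton text are all assumptions made for the example.

```typescript
// Hypothetical helper: fill {{placeholder}} slots in a prompt skeleton
// with expert-maintained content loaded from the configuration center.
type PromptValues = Record<string, string>;

function renderPrompt(template: string, values: PromptValues): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (_match: string, key: string): string => {
      if (!Object.prototype.hasOwnProperty.call(values, key)) {
        // Fail loudly so a missing expert configuration never reaches the LLM silently.
        throw new Error(`Missing prompt placeholder value: ${key}`);
      }
      return String(values[key]);
    }
  );
}

// Skeleton (maintained by AI engineers) with two assumed slot names.
const skeleton =
  'You are a clinical data scientist.\nTools:\n{{toolDefinitions}}\nRules:\n{{usageRules}}';

// Flesh (maintained by statisticians), here hard-coded for illustration only.
const fullPrompt = renderPrompt(skeleton, {
  toolDefinitions: '- ST_T_TEST_IND: independent-samples t-test',
  usageRules: '- Always run a data quality check first',
});
```

Throwing on a missing slot is one possible design choice; it keeps a half-filled skeleton from being sent to the LLM when an Excel sheet is incomplete.
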
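### 1.2.2 Three Core Prompts

| Stage | Prompt name | Purpose | Expert-injected content |
|------|------------|------|-------------|
| **Intent rewriting** | `SSA_QUERY_REWRITER` | Translates clinicians' everyday language into statistical terminology | Synonym dictionary |
| **Intelligent planning** | `SSA_PLANNER` | Selects tools and generates parameter mappings | Tool definitions + usage rules |
| **Result interpretation** | `SSA_CRITIC` | Produces publication-grade conclusions | Interpretation templates + banned words |

### 1.2.3 Responsibility Boundaries

| Asset type | Written by | Stored in | Executed by |
|---------|---------|-----------|-----------|
| System Prompt template (skeleton) | AI engineers | `prompt_templates` table | Passed to the LLM |
| Tool applicability conditions / data requirements | Statisticians | Configuration center Excel | Injected into the prompt → LLM |
| Statistical guardrail rules | Statisticians | Configuration center Excel | **Passed to the R service and enforced by R** |
| R code templates | Statisticians | Configuration center Excel | Passed to the R service |
| Reporting conventions for paper conclusions | Statisticians | Configuration center Excel | Injected into the Critic prompt → LLM |

> ⚠️ **Key principle**: Statistical guardrail rules (for example, downgrading the method when a normality test gives P<0.05) must **never** be placed in a prompt for the LLM to judge. They must be enforced as hard logic in R code.

---
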
## 1.3 🆕 Multi-Tool Workflow Planning Design

> **Vision**: "Plan the workflow, not just run a method"
>
> **MVP goal**: the LLM can plan a chained workflow of 2-7 tools

### 1.3.1 Workflow Planning Architecture

```
User: "Compare blood pressure between two groups of patients"
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                         Multi-Tool Workflow Planning                          │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  Step 1: Intent parsing                                                       │
│  ┌───────────────────────────────────────────────────────────────────────────┐│
│  │ Goal: difference comparison | Variable: continuous                        ││
│  │ Grouping: binary | Design: independent                                    ││
│  └───────────────────────────────────────────────────────────────────────────┘│
│                              │                                                │
│                              ▼                                                │
│  Step 2: Workflow planning (LLM output)                                       │
│  ┌───────────────────────────────────────────────────────────────────────────┐│
│  │ {                                                                         ││
│  │   "workflow": [                                                           ││
│  │     { "step": 1, "tool": "ST_DATA_CHECK", "name": "Data validation" },    ││
│  │     { "step": 2, "tool": "ST_QUALITY_REPORT", "name": "Quality check" },  ││
│  │     { "step": 3, "tool": "ST_NORMALITY_TEST", "name": "Normality test" }, ││
│  │     { "step": 4, "tool": "ST_LEVENE_TEST", "name": "Equal variances" },   ││
│  │     { "step": 5, "tool": "ST_T_TEST_IND", "name": "t-test" },             ││
│  │     { "step": 6, "tool": "ST_EFFECT_SIZE", "name": "Effect size" },       ││
│  │     { "step": 7, "tool": "ST_CONCLUSION", "name": "Conclusion" }          ││
│  │   ],                                                                      ││
│  │   "reasoning": "Two independent samples; check assumptions first..."      ││
│  │ }                                                                         ││
│  └───────────────────────────────────────────────────────────────────────────┘│
│                              │                                                │
│                              ▼                                                │
│  Step 3: Chained execution                                                    │
│  ┌───────────────────────────────────────────────────────────────────────────┐│
│  │ Tool 1   → Tool 2   → Tool 3   → ... → Tool N                             ││
│  │   ↓          ↓          ↓                ↓                                ││
│  │ Result 1 → Result 2 → Result 3 → ... → Final report                       ││
│  └───────────────────────────────────────────────────────────────────────────┘│
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
```

### 1.3.2 WorkflowPlannerService Design

```typescript
// planner/WorkflowPlannerService.ts

interface WorkflowStep {
  step: number;
  toolCode: string;
  toolName: string;
  params?: Record<string, any>;
  dependsOn?: number[];  // Steps that must finish before this one
}

interface WorkflowPlan {
  workflow: WorkflowStep[];
  reasoning: string;
  estimatedTime: number;  // Estimated duration in seconds
}

export class WorkflowPlannerService {

  /**
   * Generate a multi-tool execution workflow
   */
  async generateWorkflow(
    sessionId: string,
    userQuery: string,
    dataSchema: object
  ): Promise<WorkflowPlan> {

    // 1. Fetch candidate tools
    const tools = await this.toolRetrieval.retrieveTools(userQuery, dataSchema, 10);

    // 2. Build the prompt (including the workflow planning instructions)
    const prompt = this.buildWorkflowPrompt(userQuery, dataSchema, tools);

    // 3. Ask the LLM to generate the workflow
    const llm = LLMFactory.getAdapter('deepseek-v3');
    const response = await llm.chat([
      { role: 'system', content: prompt },
      { role: 'user', content: userQuery }
    ]);

    // 4. Parse + validate
    return this.parseWorkflowPlan(response, tools);
  }

  /**
   * Build the workflow planning prompt
   */
  private buildWorkflowPrompt(query: string, schema: object, tools: any[]): string {
    return `
You are a top-tier clinical data scientist. You have a code library of ${tools.length} expert-grade statistical tools.

User data structure:
${JSON.stringify(schema, null, 2)}

Available tools:
${tools.map(t => `- ${t.toolCode}: ${t.name} - ${t.description}`).join('\n')}

Based on the user's request, plan a complete statistical analysis workflow. The workflow should include:
1. Data quality check (required)
2. Assumption checks (if applicable)
3. Core statistical analysis
4. Effect size calculation (if applicable)
5. Conclusion generation

Output JSON in this format:
{
  "workflow": [
    { "step": 1, "toolCode": "tool code", "toolName": "tool name" },
    ...
  ],
  "reasoning": "why this plan was chosen"
}

Output JSON only, nothing else.
`.trim();
  }
}
```

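The "parse + validate" step (step 4 above) is not shown in the class. The sketch below is one possible shape for it, under stated assumptions: the function name mirrors `parseWorkflowPlan` but takes a plain list of known tool codes, it uses hand-written checks instead of the Zod schema in `validators/planSchema.ts`, and it renumbers steps itself rather than trusting the LLM's numbering.

```typescript
// Hypothetical sketch of workflow plan parsing; not the project's actual implementation.
interface PlannedStep {
  step: number;
  toolCode: string;
  toolName: string;
}

interface ParsedPlan {
  workflow: PlannedStep[];
  reasoning: string;
}

function parseWorkflowPlan(raw: string, knownToolCodes: string[]): ParsedPlan {
  // LLMs often wrap JSON in prose or code fences; keep only the outermost object.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in LLM response');
  }
  const data = JSON.parse(raw.slice(start, end + 1));

  if (!data || !Array.isArray(data.workflow) || data.workflow.length === 0) {
    throw new Error('workflow must be a non-empty array');
  }

  const workflow: PlannedStep[] = [];
  for (const item of data.workflow) {
    // Reject tools the retrieval step never offered, so the LLM cannot invent one.
    if (!item || typeof item.toolCode !== 'string' || knownToolCodes.indexOf(item.toolCode) === -1) {
      throw new Error(`Unknown tool in workflow: ${String(item && item.toolCode)}`);
    }
    workflow.push({
      step: workflow.length + 1, // renumber sequentially
      toolCode: item.toolCode,
      toolName: typeof item.toolName === 'string' ? item.toolName : item.toolCode,
    });
  }

  return {
    workflow,
    reasoning: typeof data.reasoning === 'string' ? data.reasoning : '',
  };
}
```

Rejecting unknown tool codes here, before execution, keeps a hallucinated tool name from ever reaching the R service.
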
### 1.3.3 WorkflowExecutorService Design

```typescript
// executor/WorkflowExecutorService.ts

interface StepResult {
  step: number;
  toolCode: string;
  status: 'success' | 'warning' | 'error';
  result?: any;
  error?: string;
  executionMs: number;
}

export class WorkflowExecutorService {

  /**
   * Execute multiple tools in sequence
   */
  async executeWorkflow(
    sessionId: string,
    workflow: WorkflowStep[],
    onStepComplete?: (stepResult: StepResult) => void  // Real-time callback
  ): Promise<StepResult[]> {

    const results: StepResult[] = [];
    let previousResult: any = null;

    for (const step of workflow) {
      const startTime = Date.now();

      try {
        // Build this step's input (it may depend on the output of earlier steps)
        const input = this.buildStepInput(step, previousResult, sessionId);

        // Call the R service
        const result = await this.rClient.execute(sessionId, {
          tool_code: step.toolCode,
          params: input
        });

        const stepResult: StepResult = {
          step: step.step,
          toolCode: step.toolCode,
          // Any non-success status returned by R is surfaced as a warning
          status: result.status === 'success' ? 'success' : 'warning',
          result: result,
          executionMs: Date.now() - startTime
        };

        results.push(stepResult);
        previousResult = result;

        // Notify the frontend in real time
        onStepComplete?.(stepResult);

      } catch (error: any) {
        const stepResult: StepResult = {
          step: step.step,
          toolCode: step.toolCode,
          status: 'error',
          error: error.message,
          executionMs: Date.now() - startTime
        };

        results.push(stepResult);
        onStepComplete?.(stepResult);

        // Decide whether to continue with the remaining steps
        if (this.isCriticalError(error)) {
          break;  // Critical error: abort the workflow
        }
        // Non-critical error: continue
      }
    }

    return results;
  }
}
```

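`buildStepInput` is referenced above but not defined, and `WorkflowStep` declares a `dependsOn` field that the loop does not yet use. The sketch below shows one way a step's input could be assembled from its declared dependencies. It is an assumption-heavy illustration: `buildStepInputFromDeps`, the `StepOutputs` map, and the `upstream` / `step_N` key names are hypothetical and differ from the `(step, previousResult, sessionId)` signature used in the executor.

```typescript
// Hypothetical variant of buildStepInput that resolves inputs via dependsOn.
interface StepSpec {
  step: number;
  params?: Record<string, unknown>;
  dependsOn?: number[];
}

// Results of already-executed steps, keyed by step number.
type StepOutputs = Record<number, unknown>;

function buildStepInputFromDeps(step: StepSpec, outputs: StepOutputs): Record<string, unknown> {
  const upstream: Record<string, unknown> = {};

  for (const dep of step.dependsOn || []) {
    if (!Object.prototype.hasOwnProperty.call(outputs, dep)) {
      // A missing dependency usually means an earlier step failed or was skipped.
      throw new Error(`Step ${step.step} depends on step ${dep}, which has no result`);
    }
    upstream[`step_${dep}`] = outputs[dep];
  }

  // Copy the step's own params, then attach upstream results under one key.
  const input: Record<string, unknown> = {};
  const params = step.params || {};
  for (const key of Object.keys(params)) {
    input[key] = params[key];
  }
  input['upstream'] = upstream;
  return input;
}
```

With an explicit dependency map, a step such as the t-test could read the normality result from step 3 even when step 4 ran in between, which a single `previousResult` variable cannot express.
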
---

## 1.4 🆕 Data Quality Report Design

> **Vision**: automatically generate a "data health check report" that proactively tells users what is wrong with their data
>
> **MVP goal**: generate the data quality report before any analysis runs

### 1.4.1 Report Structure

```typescript
// types/DataQualityReport.ts

interface DataQualityReport {
  // Basic statistics
  summary: {
    totalRows: number;
    totalColumns: number;
    numericColumns: number;
    categoricalColumns: number;
  };

  // Missing value analysis
  missingAnalysis: {
    totalMissing: number;
    missingRate: number;  // Overall missing rate
    columns: Array<{
      name: string;
      missingCount: number;
      missingRate: number;
      suggestion: string;  // Handling suggestion
    }>;
  };

  // Outlier detection
  outlierAnalysis: {
    columns: Array<{
      name: string;
      outlierCount: number;
      outlierValues: any[];
      method: 'IQR' | 'ZScore';
      suggestion: string;
    }>;
  };

  // Distribution tests (numeric variables)
  distributionAnalysis: {
    columns: Array<{
      name: string;
      shapiroP: number;
      isNormal: boolean;
      skewness: number;
      kurtosis: number;
      suggestion: string;
    }>;
  };

  // Group balance (when a grouping variable exists)
  groupBalance?: {
    groupColumn: string;
    groups: Array<{
      name: string;
      count: number;
      percentage: number;
    }>;
    isBalanced: boolean;
    suggestion: string;
  };

  // Overall assessment
  overallAssessment: {
    qualityScore: number;  // 0-100
    level: 'good' | 'acceptable' | 'poor';
    warnings: string[];
    recommendations: string[];
  };
}
```

### 1.4.2 DataQualityService Design

```typescript
// planner/DataQualityService.ts

export class DataQualityService {

  /**
   * Generate the data quality report.
   * This is an R service call, not an LLM call.
   */
  async generateReport(sessionId: string): Promise<DataQualityReport> {

    // Call the data quality tool in the R service
    const result = await this.rClient.execute(sessionId, {
      tool_code: 'ST_QUALITY_REPORT',
      params: {
        check_missing: true,
        check_outliers: true,
        check_distribution: true,
        check_balance: true
      }
    });

    return this.transformToReport(result);
  }

  /**
   * Generate a user-friendly summary (optionally enhanced by the LLM)
   */
  async generateSummary(report: DataQualityReport): Promise<string> {
    const llm = LLMFactory.getAdapter('deepseek-v3');

    const prompt = `
You are a data analysis expert. Based on the data quality results below, write a concise summary (3-5 sentences)
that tells the user how good the data is overall, what the main problems are, and whether the analysis can proceed.

Quality check results:
${JSON.stringify(report, null, 2)}

Output the summary text directly.
`;

    return await llm.chat([{ role: 'user', content: prompt }]);
  }
}
```

### 1.4.3 R Service Implementation (ST_QUALITY_REPORT)

```r
# tools/quality_report.R

#' @tool_code ST_QUALITY_REPORT
#' @name Data quality report
#' @description Generates a comprehensive data quality assessment report

run_analysis <- function(input) {
  # Load the data
  df <- load_data(input)

  report <- list()

  # 1. Basic statistics
  report$summary <- list(
    totalRows = nrow(df),
    totalColumns = ncol(df),
    numericColumns = sum(sapply(df, is.numeric)),
    categoricalColumns = sum(sapply(df, is.character) | sapply(df, is.factor))
  )

  # 2. Missing value analysis
  report$missingAnalysis <- analyze_missing(df)

  # 3. Outlier detection (IQR method)
  report$outlierAnalysis <- analyze_outliers(df)

  # 4. Distribution tests
  report$distributionAnalysis <- analyze_distribution(df)

  # 5. Group balance
  if (!is.null(input$group_var)) {
    report$groupBalance <- analyze_balance(df, input$group_var)
  }

  # 6. Overall assessment
  report$overallAssessment <- calculate_quality_score(report)

  return(list(
    status = "success",
    report = report
  ))
}

# Compute the overall quality score
calculate_quality_score <- function(report) {
  score <- 100
  warnings <- c()
  recommendations <- c()

  # Deduct points for missing values
  if (report$missingAnalysis$missingRate > 0.1) {
    score <- score - 20
    warnings <- c(warnings, "Missing value rate exceeds 10%")
    recommendations <- c(recommendations, "Handle missing values before running the analysis")
  } else if (report$missingAnalysis$missingRate > 0.05) {
    score <- score - 10
    warnings <- c(warnings, "A noticeable share of values is missing")
  }

  # Deduct points for outliers
  # vapply (not sapply) keeps the result logical even when the column list is empty
  outlier_cols <- sum(vapply(report$outlierAnalysis$columns,
                             function(x) x$outlierCount > 0, logical(1)))
  if (outlier_cols > 0) {
    score <- score - 5 * outlier_cols
    warnings <- c(warnings, paste0(outlier_cols, " variable(s) contain outliers"))
  }

  # Non-normality (advisory only, no points deducted)
  non_normal <- sum(!vapply(report$distributionAnalysis$columns,
                            function(x) isTRUE(x$isNormal), logical(1)))
  if (non_normal > 0) {
    recommendations <- c(recommendations,
      paste0(non_normal, " variable(s) are not normally distributed; the system will switch to non-parametric methods automatically"))
  }

  # Determine the level
  level <- if (score >= 80) "good" else if (score >= 60) "acceptable" else "poor"

  return(list(
    qualityScore = max(0, score),
    level = level,
    warnings = warnings,
    recommendations = recommendations
  ))
}
```

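To make the scoring rules concrete, here is a worked example that mirrors the arithmetic of `calculate_quality_score` in TypeScript. It is only a sketch for checking expectations on the Node.js side (for example in unit tests of report rendering); the `scoreQuality` name and its simplified input shape are assumptions, and the authoritative score remains the one computed in R.

```typescript
// Hypothetical TypeScript mirror of the R scoring rules, for illustration only.
interface ScoreInput {
  missingRate: number;      // overall missing rate, 0..1
  outlierCounts: number[];  // outlier count per analyzed column
}

interface Assessment {
  qualityScore: number;
  level: 'good' | 'acceptable' | 'poor';
}

function scoreQuality(input: ScoreInput): Assessment {
  let score = 100;

  // Missing values: -20 above 10%, -10 above 5%.
  if (input.missingRate > 0.1) {
    score -= 20;
  } else if (input.missingRate > 0.05) {
    score -= 10;
  }

  // Outliers: -5 for each column that has at least one outlier.
  const outlierCols = input.outlierCounts.filter((count) => count > 0).length;
  score -= 5 * outlierCols;

  // The level is derived from the raw score; the reported score is floored at 0.
  const level: Assessment['level'] =
    score >= 80 ? 'good' : score >= 60 ? 'acceptable' : 'poor';

  return { qualityScore: Math.max(0, score), level };
}

// 6% missing (-10) and one column with outliers (-5): 100 - 10 - 5 = 85, level "good".
const example = scoreQuality({ missingRate: 0.06, outlierCounts: [2, 0, 0] });
```

Non-normal variables do not change the score under these rules; they only add a recommendation.
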
### 1.4.4 Frontend Display (Quality Report Card)

```
┌─────────────────────────────────────────────────────────────────┐
│  📊 Data Quality Report                                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  📈 Data overview                                               │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Sample size: 200 rows × 15 columns                          ││
│  │ Numeric variables: 8 | Categorical variables: 7             ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                 │
│  ⚠️ Issues found                                                │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ • Missing values: blood pressure has 12 missing cases (6%)  ││
│  │ • Outliers: 2 readings > 300 mmHg (likely recording errors) ││
│  │ • Normality: treatment-group blood pressure is not normal   ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                 │
│  💡 System recommendations                                      │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ 1. Handle the 2 outliers before running the analysis        ││
│  │ 2. Normality fails; a non-parametric method will be used    ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                 │
│  🎯 Overall assessment: good (82/100)                           │
│                                                                 │
│  [Continue analysis]  [Download report]                         │
└─────────────────────────────────────────────────────────────────┘
```

---

## 2. Database Schema (Prisma)

```prisma
// schema.prisma - SSA module section

// Analysis session
model SsaSession {
  id            String   @id @default(uuid())
  userId        String   @map("user_id")
  title         String?
  dataSchema    Json?    @map("data_schema")    // Data structure (visible to the LLM)
  dataPayload   Json?    @map("data_payload")   // Real data (visible to R only)
  status        String   @default("active")
  createdAt     DateTime @default(now()) @map("created_at")
  updatedAt     DateTime @updatedAt @map("updated_at")

  messages      SsaMessage[]

  @@map("ssa_sessions")
  @@schema("ssa_schema")
}

// Message history
model SsaMessage {
  id            String   @id @default(uuid())
  sessionId     String   @map("session_id")
  role          String   // user | assistant | system
  contentType   String   @map("content_type")  // text | plan | result
  content       Json
  createdAt     DateTime @default(now()) @map("created_at")

  session       SsaSession @relation(fields: [sessionId], references: [id])

  @@map("ssa_messages")
  @@schema("ssa_schema")
}

// Tool library
model SsaTool {
  id            String   @id @default(uuid())
  toolCode      String   @unique @map("tool_code")
  name          String
  version       String   @default("1.0.0")
  description   String
  usageContext  String?  @map("usage_context")
  paramsSchema  Json     @map("params_schema")
  guardrails    Json?
  searchText    String   @map("search_text")
  embedding     Unsupported("vector(1024)")?
  isActive      Boolean  @default(true) @map("is_active")
  createdAt     DateTime @default(now()) @map("created_at")
  updatedAt     DateTime @updatedAt @map("updated_at")

  @@map("tools_library")
  @@schema("ssa_schema")
}

// Execution log
model SsaExecutionLog {
  id            String   @id @default(uuid())
  sessionId     String   @map("session_id")
  messageId     String?  @map("message_id")
  toolCode      String   @map("tool_code")
  inputParams   Json     @map("input_params")
  outputStatus  String   @map("output_status")
  outputResult  Json?    @map("output_result")
  traceLog      String[] @map("trace_log")
  executionMs   Int?     @map("execution_ms")
  createdAt     DateTime @default(now()) @map("created_at")

  @@map("execution_logs")
  @@schema("ssa_schema")
}

// 🆕 Statistical decision table
model SsaDecisionTable {
  id            String   @id @default(uuid())
  goalType      String   @map("goal_type")       // Analysis goal: group difference, correlation, distribution description
  yType         String   @map("y_type")          // Dependent variable type: continuous, categorical, count
  xType         String?  @map("x_type")          // Independent variable type: optional
  designType    String   @map("design_type")     // Design type: independent, paired, repeated measures
  toolCode      String   @map("tool_code")       // Recommended tool
  altToolCode   String?  @map("alt_tool_code")   // Alternative tool (downgrade)
  priority      Int      @default(0)             // Priority
  isActive      Boolean  @default(true) @map("is_active")
  createdAt     DateTime @default(now()) @map("created_at")

  @@unique([goalType, yType, xType, designType])
  @@map("decision_table")
  @@schema("ssa_schema")
}

// 🆕 R code library
model SsaRCodeLibrary {
  id            String   @id @default(uuid())
  toolCode      String   @map("tool_code")       // Associated tool code
  version       String   @default("1.0.0")
  fileName      String   @map("file_name")       // R script file name
  codeContent   String   @map("code_content")    // R code content
  entryFunc     String   @default("run_analysis") @map("entry_func")  // Entry function
  description   String?
  dependencies  String[] @default([])            // List of package dependencies
  isActive      Boolean  @default(true) @map("is_active")
  createdAt     DateTime @default(now()) @map("created_at")
  updatedAt     DateTime @updatedAt @map("updated_at")

  @@map("r_code_library")
  @@schema("ssa_schema")
}

// 🆕 Parameter mapping configuration
model SsaParamMapping {
  id             String   @id @default(uuid())
  toolCode       String   @map("tool_code")
  jsonKey        String   @map("json_key")         // JSON key sent by the frontend
  rParamName     String   @map("r_param_name")     // R function parameter name
  dataType       String   @map("data_type")        // string | number | boolean
  isRequired     Boolean  @default(false) @map("is_required")
  defaultValue   String?  @map("default_value")
  validationRule String?  @map("validation_rule")  // Validation rule
  description    String?

  @@unique([toolCode, jsonKey])
  @@map("param_mapping")
  @@schema("ssa_schema")
}

// 🆕 Guardrail rule configuration
model SsaGuardrailConfig {
  id            String   @id @default(uuid())
  toolCode      String   @map("tool_code")
  checkName     String   @map("check_name")      // Check name: normality, homogeneity of variance
  checkOrder    Int      @default(0) @map("check_order")  // Execution order
  checkCode     String   @map("check_code")      // R function name
  threshold     String?                          // Threshold condition: p < 0.05
  actionType    String   @map("action_type")     // Block | Warn | Switch
  actionTarget  String?  @map("action_target")   // Target tool when the action is Switch
  isEnabled     Boolean  @default(true) @map("is_enabled")

  @@map("guardrail_config")
  @@schema("ssa_schema")
}

// 🆕 Result interpretation templates
model SsaInterpretation {
  id            String   @id @default(uuid())
  toolCode      String   @map("tool_code")
  scenarioKey   String   @map("scenario_key")    // Scenario: significant | not_significant
  template      String                           // Interpretation template (with placeholders)
  placeholders  String[] @default([])            // List of placeholders

  @@unique([toolCode, scenarioKey])
  @@map("interpretation_templates")
  @@schema("ssa_schema")
}
```

---

## 3. API Route Design

### 3.1 Route Registration

```typescript
// index.ts
import { FastifyInstance } from 'fastify';
import sessionRoutes from './routes/session.routes';
import analysisRoutes from './routes/analysis.routes';
import consultRoutes from './routes/consult.routes';  // 🆕
import configRoutes from './routes/config.routes';    // 🆕

export default async function ssaModule(app: FastifyInstance) {
  // Register the authentication hook
  app.addHook('preHandler', app.authenticate);

  // Register sub-routes
  app.register(sessionRoutes, { prefix: '/sessions' });
  app.register(analysisRoutes, { prefix: '/sessions' });
  app.register(consultRoutes, { prefix: '/consult' });  // 🆕 Consultation mode
  app.register(configRoutes, { prefix: '/config' });    // 🆕 Configuration center
}
```

### 3.2 Session Routes

```typescript
// routes/session.routes.ts
import { FastifyInstance } from 'fastify';
import { SessionService } from '../services/SessionService';

export default async function sessionRoutes(app: FastifyInstance) {
  const sessionService = new SessionService();

  // Create a session
  app.post('/', async (req, reply) => {
    const userId = req.user.id;
    const session = await sessionService.create(userId);
    return reply.send(session);
  });

  // List sessions
  app.get('/', async (req, reply) => {
    const userId = req.user.id;
    const sessions = await sessionService.listByUser(userId);
    return reply.send(sessions);
  });

  // Get a single session (including message history)
  app.get('/:id', async (req, reply) => {
    const { id } = req.params as { id: string };
    const session = await sessionService.getById(id, req.user.id);
    return reply.send(session);
  });

  // Upload data
  app.post('/:id/upload', async (req, reply) => {
    const { id } = req.params as { id: string };
    // Parse the Excel/CSV file and extract the schema and data
    const result = await sessionService.uploadData(id, req);
    return reply.send(result);
  });
}
```

### 3.3 Analysis Routes

```typescript
// routes/analysis.routes.ts
import { FastifyInstance } from 'fastify';
// Import paths follow the directory layout in section 1
import { PlannerService } from '../planner/PlannerService';
import { CriticService } from '../planner/CriticService';
import { RClientService } from '../executor/RClientService';
import { SessionService } from '../services/SessionService';

export default async function analysisRoutes(app: FastifyInstance) {
  const plannerService = new PlannerService();
  const rClientService = new RClientService();
  const criticService = new CriticService();
  const sessionService = new SessionService();  // Needed by the download-code route

  // Generate an analysis plan (without executing it)
  app.post('/:id/plan', async (req, reply) => {
    const { id } = req.params as { id: string };
    const { query } = req.body as { query: string };

    // 1. Retrieve tools via RAG
    // 2. Generate the plan with the LLM
    const plan = await plannerService.generatePlan(id, query);

    return reply.send({
      type: 'plan',
      plan
    });
  });

  // Confirm and execute
  app.post('/:id/execute', async (req, reply) => {
    const { id } = req.params as { id: string };
    // Reuse the plan type expected by RClientService.execute
    const { plan } = req.body as { plan: Parameters<RClientService['execute']>[1] };

    // 1. Call the R service
    const result = await rClientService.execute(id, plan);

    // 2. Save the execution log
    // 3. Save the result as a message

    return reply.send({
      type: 'result',
      result
    });
  });

  // Get the result interpretation (streaming)
  app.get('/:id/interpret/:messageId', async (req, reply) => {
    const { id, messageId } = req.params as { id: string; messageId: string };

    // Stream the Critic interpretation
    reply.raw.setHeader('Content-Type', 'text/event-stream');

    await criticService.streamInterpret(id, messageId, reply.raw);
  });

  // Download code
  app.get('/:id/download-code/:messageId', async (req, reply) => {
    const { id, messageId } = req.params as { id: string; messageId: string };

    const code = await sessionService.getReproducibleCode(messageId);

    reply.header('Content-Type', 'text/plain');
    reply.header('Content-Disposition', 'attachment; filename="analysis.R"');
    return reply.send(code);
  });
}
```

---

## 4. Core Service Implementation

### 4.1 RClientService (Calling the R Service)

```typescript
// executor/RClientService.ts
import axios, { AxiosInstance } from 'axios';
import { prisma } from '@/common/db';
import { logger } from '@/common/logging';

export class RClientService {
  private client: AxiosInstance;

  constructor() {
    this.client = axios.create({
      baseURL: process.env.R_SERVICE_URL || 'http://localhost:8080',
      timeout: 120000,  // 📌 120 s timeout (to cope with heavy computations)
      headers: { 'Content-Type': 'application/json' }
    });
  }

  async execute(sessionId: string, plan: {
    tool_code: string;
    params: Record<string, any>;
    guardrails?: Record<string, boolean>;  // Optional: workflow steps may omit guardrails
  }) {
    const startTime = Date.now();

    // 1. Load the session's real data
    const session = await prisma.ssaSession.findUniqueOrThrow({
      where: { id: sessionId }
    });

    // 🆕 2. Build the R service request (hybrid data protocol)
    const dataSource = this.buildDataSource(session);
    const requestBody = {
      data_source: dataSource,  // 🆕 Unified data source field
      params: plan.params,
      guardrails: plan.guardrails
    };

    // 3. Call the R service
    try {
      const response = await this.client.post(
        `/api/v1/skills/${plan.tool_code}`,
        requestBody
      );

      const executionMs = Date.now() - startTime;

      // 4. Record the execution log (without the real data)
      await prisma.ssaExecutionLog.create({
        data: {
          sessionId,
          toolCode: plan.tool_code,
          inputParams: plan.params,  // Log parameters only, never the data
          outputStatus: response.data.status,
          outputResult: response.data.results,
          traceLog: response.data.trace_log || [],
          executionMs
        }
      });

      return response.data;

    } catch (error: any) {
      logger.error('R service call failed', { sessionId, toolCode: plan.tool_code, error });

      // 🆕 Special handling for 502/504 (R service crashed or timed out)
      const statusCode = error.response?.status;
      if (statusCode === 502 || statusCode === 504) {
        throw new Error('The statistics service is busy or the data is abnormal. Please try again later.');
      }

      // 🆕 Surface the user-friendly hint returned by the R service
      const userHint = error.response?.data?.user_hint;
      if (userHint) {
        throw new Error(userHint);
      }

      throw new Error(`R service error: ${error.message}`);
    }
  }

  /**
   * 🆕 Choose the transfer mode by data size
   * - < 2MB: inline JSON
   * - >= 2MB: OSS key
   */
  private buildDataSource(session: any): { type: string; data?: any; oss_key?: string } {
    const payload = session.dataPayload;
    // Measure UTF-8 bytes; string.length would count UTF-16 code units instead
    const payloadSize = Buffer.byteLength(JSON.stringify(payload ?? null), 'utf8');

    const SIZE_THRESHOLD = 2 * 1024 * 1024;  // 2MB

    if (payloadSize < SIZE_THRESHOLD) {
      // Small data: inline it directly
      return {
        type: 'inline',
        data: payload
      };
    } else {
      // Large data: stored in OSS, pass the key
      // Note: this assumes the data was uploaded to OSS when the session was created
      const ossKey = session.dataOssKey || `sessions/${session.id}/data.json`;
      return {
        type: 'oss',
        oss_key: ossKey
      };
    }
  }

  async healthCheck(): Promise<boolean> {
    try {
      const res = await this.client.get('/health');
      return res.data.status === 'ok';
    } catch {
      return false;
    }
  }
}
```

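The error handling in `execute` applies three rules in a fixed order: gateway failures first, then the hint supplied by the R service, then a generic message. The sketch below isolates that ordering as a pure function so it can be unit-tested without axios. `toUserMessage` and the `RServiceFailure` shape are hypothetical names introduced for this example.

```typescript
// Hypothetical pure helper mirroring the catch-block ordering in RClientService.execute.
interface RServiceFailure {
  status?: number;    // HTTP status from the R service or the gateway, if any
  userHint?: string;  // user_hint field returned by the R service, if any
  message: string;    // low-level error message
}

function toUserMessage(failure: RServiceFailure): string {
  // 1. Gateway errors mean R crashed or timed out; any hint would be unreliable.
  if (failure.status === 502 || failure.status === 504) {
    return 'The statistics service is busy or the data is abnormal. Please try again later.';
  }
  // 2. Prefer the friendly hint written by the R tool author.
  if (failure.userHint) {
    return failure.userHint;
  }
  // 3. Fall back to the raw error.
  return `R service error: ${failure.message}`;
}

const timeoutMessage = toUserMessage({ status: 504, userHint: 'ignored', message: 'timeout' });
const hintMessage = toUserMessage({ status: 400, userHint: 'Group variable must have exactly 2 levels', message: 'Bad Request' });
const rawMessage = toUserMessage({ message: 'socket hang up' });
```
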
### 4.2 🆕 DecisionTableService (Decision Table Matching, Preferred)

```typescript
// planner/DecisionTableService.ts
import { jsonrepair } from 'jsonrepair';
import { prisma } from '@/common/db';
import { logger } from '@/common/logging';
import { LLMFactory } from '@/common/llm/adapters/LLMFactory';

interface AnalysisIntent {
  goalType: string;      // group difference | correlation | distribution description | predictive modeling
  yType: string;         // continuous | categorical | count
  xType?: string;        // continuous | categorical | none
  designType: string;    // independent | paired | repeated measures
}

export class DecisionTableService {

  /**
   * 🆕 Match a tool precisely from the decision table based on the analysis intent
   * Priority: decision table match > RAG retrieval
   */
  async matchTool(intent: AnalysisIntent): Promise<string | null> {
    const result = await prisma.ssaDecisionTable.findFirst({
      where: {
        goalType: intent.goalType,
        yType: intent.yType,
        xType: intent.xType || null,
        designType: intent.designType,
        isActive: true
      },
      orderBy: { priority: 'desc' }
    });

    if (result) {
      logger.info('Decision table matched', {
        intent,
        toolCode: result.toolCode
      });
      return result.toolCode;
    }

    return null;
  }

  /**
   * 🆕 Get the downgrade tool (used when a guardrail fires)
   */
  async getAlternativeTool(toolCode: string): Promise<string | null> {
    const entry = await prisma.ssaDecisionTable.findFirst({
      where: { toolCode, isActive: true }
    });
    return entry?.altToolCode || null;
  }

  /**
   * 🆕 Extract the analysis intent with the LLM (structured)
   */
  async extractIntent(userQuery: string, dataSchema: object): Promise<AnalysisIntent> {
    const llm = LLMFactory.getAdapter('deepseek-v3');

    const prompt = `
Analyze the user's statistical request and extract the following four dimensions:

User request: ${userQuery}
Data structure: ${JSON.stringify(dataSchema, null, 2)}

Return JSON in this format:
{
  "goalType": "group difference | correlation | distribution description | predictive modeling",
  "yType": "continuous | categorical | count",
  "xType": "continuous | categorical | none",
  "designType": "independent | paired | repeated measures"
}

Return JSON only, nothing else.
`.trim();

    const response = await llm.chat([{ role: 'user', content: prompt }]);

    try {
      // jsonrepair fixes common LLM JSON defects (trailing commas, missing quotes)
      return JSON.parse(jsonrepair(response));
    } catch {
      // Fallback defaults
      return {
        goalType: 'group difference',
        yType: 'continuous',
        xType: 'categorical',
        designType: 'independent'
      };
    }
  }
}
```

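The matching rule in `matchTool` (exact match on all four dimensions, active rows only, highest priority wins) can be illustrated without a database. The sketch below is an in-memory stand-in, not the Prisma query itself: `matchDecisionRow`, the sample rows, and the `ST_MANN_WHITNEY` and `ST_WELCH_T_TEST` tool codes are assumptions made for the example.

```typescript
// Hypothetical in-memory equivalent of the decision table lookup.
interface DecisionRow {
  goalType: string;
  yType: string;
  xType: string | null;
  designType: string;
  toolCode: string;
  altToolCode: string | null;
  priority: number;
  isActive: boolean;
}

interface Intent {
  goalType: string;
  yType: string;
  xType?: string;
  designType: string;
}

function matchDecisionRow(rows: DecisionRow[], intent: Intent): DecisionRow | null {
  const wantedX = intent.xType || null; // a missing xType matches rows whose xType is null
  let best: DecisionRow | null = null;

  for (const row of rows) {
    const matches =
      row.isActive &&
      row.goalType === intent.goalType &&
      row.yType === intent.yType &&
      row.xType === wantedX &&
      row.designType === intent.designType;

    // Keep the highest-priority match, like orderBy: { priority: 'desc' }.
    if (matches && (best === null || row.priority > best.priority)) {
      best = row;
    }
  }
  return best;
}

const sampleRows: DecisionRow[] = [
  { goalType: 'group difference', yType: 'continuous', xType: 'categorical', designType: 'independent',
    toolCode: 'ST_WELCH_T_TEST', altToolCode: null, priority: 0, isActive: true },
  { goalType: 'group difference', yType: 'continuous', xType: 'categorical', designType: 'independent',
    toolCode: 'ST_T_TEST_IND', altToolCode: 'ST_MANN_WHITNEY', priority: 10, isActive: true },
  { goalType: 'group difference', yType: 'continuous', xType: 'categorical', designType: 'paired',
    toolCode: 'ST_T_TEST_PAIRED', altToolCode: null, priority: 5, isActive: false },
];

const matched = matchDecisionRow(sampleRows, {
  goalType: 'group difference', yType: 'continuous', xType: 'categorical', designType: 'independent',
});
```

When a guardrail such as the normality check fires, the `altToolCode` of the matched row gives the downgrade target without a second LLM call.
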
### 4.3 ToolRetrievalService (RAG Retrieval, Fallback)

```typescript
// planner/ToolRetrievalService.ts
import { VectorSearchService } from '@/common/rag';
import { LLMFactory } from '@/common/llm/adapters/LLMFactory';
import { prisma } from '@/common/db';
import { DecisionTableService } from './DecisionTableService';

export class ToolRetrievalService {
  private vectorSearch: VectorSearchService;
  private decisionTable: DecisionTableService;

  constructor() {
    this.vectorSearch = new VectorSearchService({
      schema: 'ssa_schema',
      table: 'tools_library',
      embeddingColumn: 'embedding',
      textColumn: 'search_text'
    });
    this.decisionTable = new DecisionTableService();
  }

  /**
   * 🆕 Tool selection strategy: decision table first, RAG as the fallback
   */
  async selectTool(query: string, dataSchema: object): Promise<any> {
    // 1. Try an exact decision table match
    const intent = await this.decisionTable.extractIntent(query, dataSchema);
    const matchedCode = await this.decisionTable.matchTool(intent);

    if (matchedCode) {
      const tool = await prisma.ssaTool.findUnique({
        where: { toolCode: matchedCode }
      });
      if (tool) {
        return { ...tool, matchMethod: 'decision_table' };
      }
    }

    // 2. No decision table hit: fall back to RAG retrieval
    const ragResults = await this.retrieveTools(query, dataSchema, 1);
    if (ragResults.length > 0) {
      return { ...ragResults[0], matchMethod: 'rag' };
    }

    return null;
  }

  async retrieveTools(query: string, dataSchema: object, topK = 5) {
    // 1. Query rewrite (optional, improves recall)
    const rewriter = LLMFactory.getAdapter('deepseek-v3');
    const rewritePrompt = `
Rewrite the user's statistical analysis request into a query better suited to retrieving statistical tools:
User request: ${query}
Data structure: ${JSON.stringify(dataSchema)}

Output the rewritten query (one sentence):
`.trim();

    const rewrittenQuery = await rewriter.chat([
      { role: 'user', content: rewritePrompt }
    ]);

    // 2. Vector search
    const vectorResults = await this.vectorSearch.search(rewrittenQuery, topK);

    // 3. Keyword search (pg_bigm)
    const keywordResults = await prisma.$queryRaw<any[]>`
      SELECT id, tool_code, name, description, params_schema, guardrails
      FROM ssa_schema.tools_library
      WHERE search_text LIKE '%' || ${query} || '%'
      AND is_active = true
      LIMIT 5
    `;

    // 4. RRF fusion
    const merged = this.rrfMerge(vectorResults, keywordResults);

    // 5. Rerank (optional)
    // const reranked = await this.rerank(merged, query);

    return merged.slice(0, topK);
  }

  private rrfMerge(vectorResults: any[], keywordResults: any[], k = 60) {
    const scores = new Map<string, number>();

    // Each list contributes 1 / (k + rank) per item, with rank starting at 1
    vectorResults.forEach((item, idx) => {
      const rrf = 1 / (k + idx + 1);
      scores.set(item.id, (scores.get(item.id) || 0) + rrf);
    });

    keywordResults.forEach((item, idx) => {
      const rrf = 1 / (k + idx + 1);
      scores.set(item.id, (scores.get(item.id) || 0) + rrf);
    });

    // Merge, deduplicate by id, and sort by fused score
    const allItems = [...vectorResults, ...keywordResults];
    const unique = [...new Map(allItems.map(i => [i.id, i])).values()];

    return unique.sort((a, b) =>
      (scores.get(b.id) || 0) - (scores.get(a.id) || 0)
    );
  }
}
```

### 4.3 PlannerService(AI 规划 + JSON 容错)
|
||
|
||
#### 🆕 Prompt 世界观设计(智能化演进关键)
|
||
|
||
> **重要**:Prompt 的"世界观"设计直接影响 LLM 的推理质量和未来演进能力。
|
||
>
|
||
> 详细背景参考:`04-开发计划/06-智能化演进共识与MVP执行计划.md`
|
||
|
||
**错误的世界观(接线员思维):**
|
||
```
|
||
你是一个工具选择器,从以下列表中选择合适的工具...
|
||
```
|
||
|
||
**正确的世界观(数据科学家思维):**
|
||
```
|
||
你是一位顶尖的临床数据科学家,拥有以下能力:
|
||
|
||
1. 深刻理解医学研究的统计学需求
|
||
2. 精通各类统计方法的适用场景和前提条件
|
||
3. 能够诊断数据特征并选择最优分析路径
|
||
|
||
你现在拥有一个包含 100+ 专家级统计算法的代码库。
|
||
每个算法都经过统计学专家的严格验证,确保结果的权威性。
|
||
|
||
请理解医生的研究意图,诊断数据特征,从代码库中选择最合适的工具组合,
|
||
并制定完整的分析计划。你的目标是帮助医生产出可以直接用于 SCI 论文的统计结果。
|
||
```
|
||
|
||
**为什么这很重要?**
|
||
|
||
| 维度 | 接线员思维 | 数据科学家思维 |
|
||
|------|-----------|---------------|
|
||
| LLM 自我认知 | 被动的工具选择器 | 主动的分析规划师 |
|
||
| 推理深度 | 简单匹配 | 深度分析数据特征 |
|
||
| 输出质量 | 可能选错工具 | 更准确的工具选择 |
|
||
| Phase 3 演进 | 难以扩展到代码修改 | 自然过渡到代码理解和修改 |
|
||
|
||
> **MVP 阶段行动**:更新 `SSA_PLANNER` Prompt 模板,使用"数据科学家"世界观。
|
||
|
||
---
|
||
|
||
```typescript
|
||
// services/PlannerService.ts
|
||
import { LLMFactory } from '@/common/llm/adapters/LLMFactory';
|
||
import { PromptService } from '@/common/prompts';
|
||
import { ToolRetrievalService } from './ToolRetrievalService';
|
||
import { prisma } from '@/common/db';
|
||
import { jsonrepair } from 'jsonrepair'; // 📌 JSON 修复库
|
||
import { planSchema } from '../validators/planSchema'; // 📌 Zod Schema
|
||
|
||
export class PlannerService {
|
||
private retrieval: ToolRetrievalService;
|
||
|
||
constructor() {
|
||
this.retrieval = new ToolRetrievalService();
|
||
}
|
||
|
||
async generatePlan(sessionId: string, userQuery: string) {
|
||
// 1. 获取会话的数据 Schema(不含真实数据)
|
||
const session = await prisma.ssaSession.findUniqueOrThrow({
|
||
where: { id: sessionId },
|
||
select: { dataSchema: true }
|
||
});
|
||
|
||
// 2. RAG 检索候选工具
|
||
const candidateTools = await this.retrieval.retrieveTools(
|
||
userQuery,
|
||
session.dataSchema,
|
||
5
|
||
);
|
||
|
||
// 3. 获取 Planner Prompt
|
||
const promptTemplate = await PromptService.get('SSA_PLANNER');
|
||
|
||
// 4. 构造 Prompt
|
||
const systemPrompt = promptTemplate
|
||
.replace('{{data_schema_json}}', JSON.stringify(session.dataSchema, null, 2))
|
||
.replace('{{candidate_tools_json}}', JSON.stringify(candidateTools, null, 2));
|
||
|
||
// 5. 调用 LLM
|
||
const llm = LLMFactory.getAdapter('deepseek-v3');
|
||
const response = await llm.chat([
|
||
{ role: 'system', content: systemPrompt },
|
||
{ role: 'user', content: userQuery }
|
||
]);
|
||
|
||
// 6. 📌 解析 + 修复 + 校验 JSON
|
||
const plan = this.parseAndValidateJson(response, candidateTools);
|
||
|
||
// 7. 保存用户消息和计划消息
|
||
await prisma.ssaMessage.createMany({
|
||
data: [
|
||
{
|
||
sessionId,
|
||
role: 'user',
|
||
contentType: 'text',
|
||
content: { text: userQuery }
|
||
},
|
||
{
|
||
sessionId,
|
||
role: 'assistant',
|
||
contentType: 'plan',
|
||
content: plan
|
||
}
|
||
]
|
||
});
|
||
|
||
return plan;
|
||
}
|
||
|
||
// 📌 增强的 JSON 解析(含修复和校验)
|
||
private parseAndValidateJson(text: string, candidateTools: any[]): object {
|
||
// Step 1: 提取 JSON 块
|
||
const jsonMatch = text.match(/```json\n?([\s\S]*?)\n?```/) ||
|
||
text.match(/\{[\s\S]*\}/);
|
||
|
||
if (!jsonMatch) {
|
||
throw new Error('LLM response does not contain valid JSON');
|
||
}
|
||
|
||
let jsonStr = jsonMatch[1] || jsonMatch[0];
|
||
|
||
// Step 2: 使用 jsonrepair 修复常见问题(末尾逗号、缺少引号等)
|
||
try {
|
||
jsonStr = jsonrepair(jsonStr);
|
||
} catch (repairError) {
|
||
// 修复失败,继续尝试原始解析
|
||
}
|
||
|
||
// Step 3: 解析 JSON
|
||
let parsed: any;
|
||
try {
|
||
parsed = JSON.parse(jsonStr);
|
||
} catch (parseError) {
|
||
throw new Error(`JSON parse failed: ${parseError.message}`);
|
||
}
|
||
|
||
// Step 4: 使用 Zod 校验结构
|
||
const validatedPlan = planSchema.safeParse(parsed);
|
||
|
||
if (!validatedPlan.success) {
|
||
throw new Error(`Plan validation failed: ${validatedPlan.error.message}`);
|
||
}
|
||
|
||
// Step 5: 校验 tool_code 是否在候选列表中
|
||
const validToolCodes = candidateTools.map(t => t.tool_code);
|
||
if (!validToolCodes.includes(validatedPlan.data.tool_code)) {
|
||
throw new Error(`Invalid tool_code: ${validatedPlan.data.tool_code}`);
|
||
}
|
||
|
||
return validatedPlan.data;
|
||
}
|
||
}
|
||
```
|
||
|
||
### 4.4 Zod Schema 定义
|
||
|
||
```typescript
|
||
// validators/planSchema.ts
|
||
import { z } from 'zod';
|
||
|
||
export const planSchema = z.object({
|
||
tool_code: z.string().min(1),
|
||
reasoning: z.string().optional(),
|
||
params: z.record(z.any()),
|
||
guardrails: z.object({
|
||
check_normality: z.boolean().optional(),
|
||
check_homogeneity: z.boolean().optional(),
|
||
auto_fix: z.boolean().optional()
|
||
}).optional()
|
||
});
|
||
|
||
export type PlanType = z.infer<typeof planSchema>;
|
||
```
|
||
|
||
### 4.5 🆕 ConsultService(无数据咨询)
|
||
|
||
```typescript
|
||
// planner/ConsultService.ts
|
||
import { LLMFactory } from '@/common/llm/adapters/LLMFactory';
|
||
import { PromptService } from '@/common/prompts';
|
||
import { ToolRetrievalService } from './ToolRetrievalService';
|
||
import { prisma } from '@/common/db';
|
||
|
||
export class ConsultService {
|
||
private retrieval: ToolRetrievalService;
|
||
|
||
constructor() {
|
||
this.retrieval = new ToolRetrievalService();
|
||
}
|
||
|
||
/**
|
||
* 🆕 无数据咨询对话
|
||
* - 用户只描述研究设计、变量类型等
|
||
* - 系统推理适合的统计方法
|
||
*/
|
||
async chat(sessionId: string, userMessage: string) {
|
||
// 1. 获取会话历史
|
||
const history = await prisma.ssaMessage.findMany({
|
||
where: { sessionId },
|
||
orderBy: { createdAt: 'asc' }
|
||
});
|
||
|
||
// 2. 获取咨询专用 Prompt
|
||
const systemPrompt = await PromptService.get('SSA_CONSULT');
|
||
|
||
// 3. 构造消息列表
|
||
const messages = [
|
||
{ role: 'system' as const, content: systemPrompt },
|
||
...history.map(m => ({
|
||
role: m.role as 'user' | 'assistant',
|
||
content: typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
|
||
})),
|
||
{ role: 'user' as const, content: userMessage }
|
||
];
|
||
|
||
// 4. 调用 LLM
|
||
const llm = LLMFactory.getAdapter('deepseek-v3');
|
||
const response = await llm.chat(messages);
|
||
|
||
// 5. 保存消息
|
||
await prisma.ssaMessage.createMany({
|
||
data: [
|
||
{ sessionId, role: 'user', contentType: 'text', content: { text: userMessage } },
|
||
{ sessionId, role: 'assistant', contentType: 'text', content: { text: response } }
|
||
]
|
||
});
|
||
|
||
return response;
|
||
}
|
||
|
||
/**
|
||
* 🆕 生成 SAP 文档
|
||
* - 基于对话历史生成结构化的统计分析计划
|
||
*/
|
||
async generateSAP(sessionId: string): Promise<{
|
||
title: string;
|
||
sections: Array<{
|
||
heading: string;
|
||
content: string;
|
||
}>;
|
||
recommendedTools: string[];
|
||
}> {
|
||
const history = await prisma.ssaMessage.findMany({
|
||
where: { sessionId },
|
||
orderBy: { createdAt: 'asc' }
|
||
});
|
||
|
||
const sapPrompt = await PromptService.get('SSA_SAP_GENERATOR');
|
||
|
||
const llm = LLMFactory.getAdapter('deepseek-v3');
|
||
const response = await llm.chat([
|
||
{ role: 'system', content: sapPrompt },
|
||
{ role: 'user', content: `基于以下对话生成统计分析计划:\n${JSON.stringify(history)}` }
|
||
]);
|
||
|
||
// 解析 JSON 响应
|
||
const sap = JSON.parse(response);
|
||
|
||
return sap;
|
||
}
|
||
}
|
||
```
|
||
|
||
### 4.6 🆕 SAPGeneratorService(SAP 文档导出)
|
||
|
||
```typescript
|
||
// planner/SAPGeneratorService.ts
|
||
import { Document, Packer, Paragraph, HeadingLevel, Table, TableRow, TableCell } from 'docx';
|
||
|
||
interface SAPDocument {
|
||
title: string;
|
||
sections: Array<{
|
||
heading: string;
|
||
content: string;
|
||
}>;
|
||
recommendedTools: string[];
|
||
}
|
||
|
||
export class SAPGeneratorService {
|
||
|
||
/**
|
||
* 🆕 生成 Word 文档
|
||
*/
|
||
async generateWord(sap: SAPDocument): Promise<Buffer> {
|
||
const doc = new Document({
|
||
sections: [{
|
||
children: [
|
||
new Paragraph({
|
||
text: sap.title,
|
||
heading: HeadingLevel.TITLE
|
||
}),
|
||
...sap.sections.flatMap(section => [
|
||
new Paragraph({
|
||
text: section.heading,
|
||
heading: HeadingLevel.HEADING_1
|
||
}),
|
||
new Paragraph({ text: section.content })
|
||
]),
|
||
new Paragraph({
|
||
text: '推荐统计方法',
|
||
heading: HeadingLevel.HEADING_1
|
||
}),
|
||
...sap.recommendedTools.map(tool =>
|
||
new Paragraph({ text: `• ${tool}` })
|
||
)
|
||
]
|
||
}]
|
||
});
|
||
|
||
return await Packer.toBuffer(doc);
|
||
}
|
||
|
||
/**
|
||
* 🆕 生成 Markdown
|
||
*/
|
||
generateMarkdown(sap: SAPDocument): string {
|
||
let md = `# ${sap.title}\n\n`;
|
||
|
||
for (const section of sap.sections) {
|
||
md += `## ${section.heading}\n\n${section.content}\n\n`;
|
||
}
|
||
|
||
md += `## 推荐统计方法\n\n`;
|
||
for (const tool of sap.recommendedTools) {
|
||
md += `- ${tool}\n`;
|
||
}
|
||
|
||
return md;
|
||
}
|
||
}
|
||
```
|
||
|
||
### 4.7 🆕 ConfigLoaderService(配置中台)
|
||
|
||
```typescript
|
||
// config/ConfigLoaderService.ts
|
||
import * as XLSX from 'xlsx';
|
||
import { ConfigValidatorService } from './ConfigValidatorService';
|
||
import { logger } from '@/common/logging';
|
||
|
||
interface ToolConfig {
|
||
tool_code: string;
|
||
name: string;
|
||
description: string;
|
||
params_schema: Record<string, any>;
|
||
guardrails: Record<string, any>;
|
||
search_text: string;
|
||
}
|
||
|
||
interface GuardrailConfig {
|
||
guardrail_code: string;
|
||
description: string;
|
||
threshold: number;
|
||
auto_fix_action: string;
|
||
}
|
||
|
||
export class ConfigLoaderService {
|
||
private static instance: ConfigLoaderService;
|
||
private toolsCache: Map<string, ToolConfig> = new Map();
|
||
private guardrailsCache: Map<string, GuardrailConfig> = new Map();
|
||
private lastLoadTime: Date | null = null;
|
||
|
||
static getInstance() {
|
||
if (!this.instance) {
|
||
this.instance = new ConfigLoaderService();
|
||
}
|
||
return this.instance;
|
||
}
|
||
|
||
/**
|
||
* 🆕 从 Excel 加载配置
|
||
*/
|
||
async loadFromExcel(buffer: Buffer): Promise<{
|
||
tools: number;
|
||
guardrails: number;
|
||
errors: string[];
|
||
}> {
|
||
const workbook = XLSX.read(buffer, { type: 'buffer' });
|
||
const errors: string[] = [];
|
||
|
||
// Sheet 1: 工具定义
|
||
const toolsSheet = workbook.Sheets['Tools'];
|
||
if (toolsSheet) {
|
||
const toolsData = XLSX.utils.sheet_to_json<ToolConfig>(toolsSheet);
|
||
|
||
for (const tool of toolsData) {
|
||
// 校验
|
||
const validation = ConfigValidatorService.validateTool(tool);
|
||
if (validation.valid) {
|
||
this.toolsCache.set(tool.tool_code, tool);
|
||
} else {
|
||
errors.push(`Tool ${tool.tool_code}: ${validation.error}`);
|
||
}
|
||
}
|
||
}
|
||
|
||
// Sheet 2: 护栏规则
|
||
const guardrailsSheet = workbook.Sheets['Guardrails'];
|
||
if (guardrailsSheet) {
|
||
const guardrailsData = XLSX.utils.sheet_to_json<GuardrailConfig>(guardrailsSheet);
|
||
|
||
for (const gr of guardrailsData) {
|
||
const validation = ConfigValidatorService.validateGuardrail(gr);
|
||
if (validation.valid) {
|
||
this.guardrailsCache.set(gr.guardrail_code, gr);
|
||
} else {
|
||
errors.push(`Guardrail ${gr.guardrail_code}: ${validation.error}`);
|
||
}
|
||
}
|
||
}
|
||
|
||
this.lastLoadTime = new Date();
|
||
logger.info('Config loaded', {
|
||
tools: this.toolsCache.size,
|
||
guardrails: this.guardrailsCache.size
|
||
});
|
||
|
||
return {
|
||
tools: this.toolsCache.size,
|
||
guardrails: this.guardrailsCache.size,
|
||
errors
|
||
};
|
||
}
|
||
|
||
/**
|
||
* 🆕 热加载(清空缓存并重新加载)
|
||
*/
|
||
async reload(): Promise<void> {
|
||
// 从数据库或默认 Excel 重新加载
|
||
this.toolsCache.clear();
|
||
this.guardrailsCache.clear();
|
||
// ... 重新加载逻辑
|
||
logger.info('Config reloaded');
|
||
}
|
||
|
||
getTool(toolCode: string): ToolConfig | undefined {
|
||
return this.toolsCache.get(toolCode);
|
||
}
|
||
|
||
getAllTools(): ToolConfig[] {
|
||
return Array.from(this.toolsCache.values());
|
||
}
|
||
|
||
getGuardrail(code: string): GuardrailConfig | undefined {
|
||
return this.guardrailsCache.get(code);
|
||
}
|
||
}
|
||
```
|
||
|
||
### 4.8 🆕 ConfigValidatorService(配置校验)
|
||
|
||
```typescript
|
||
// config/ConfigValidatorService.ts
|
||
|
||
interface ValidationResult {
|
||
valid: boolean;
|
||
error?: string;
|
||
}
|
||
|
||
export class ConfigValidatorService {
|
||
|
||
/**
|
||
* 🆕 校验工具配置
|
||
*/
|
||
static validateTool(tool: any): ValidationResult {
|
||
// 必填校验
|
||
if (!tool.tool_code) {
|
||
return { valid: false, error: 'tool_code is required' };
|
||
}
|
||
if (!tool.name) {
|
||
return { valid: false, error: 'name is required' };
|
||
}
|
||
|
||
// 格式校验
|
||
if (!/^ST_[A-Z_]+$/.test(tool.tool_code)) {
|
||
return { valid: false, error: 'tool_code must match ST_XXX pattern' };
|
||
}
|
||
|
||
// params_schema 校验
|
||
if (tool.params_schema) {
|
||
try {
|
||
if (typeof tool.params_schema === 'string') {
|
||
JSON.parse(tool.params_schema);
|
||
}
|
||
} catch {
|
||
return { valid: false, error: 'params_schema is not valid JSON' };
|
||
}
|
||
}
|
||
|
||
return { valid: true };
|
||
}
|
||
|
||
/**
|
||
* 🆕 校验护栏配置
|
||
*/
|
||
static validateGuardrail(gr: any): ValidationResult {
|
||
if (!gr.guardrail_code) {
|
||
return { valid: false, error: 'guardrail_code is required' };
|
||
}
|
||
|
||
if (typeof gr.threshold !== 'number') {
|
||
return { valid: false, error: 'threshold must be a number' };
|
||
}
|
||
|
||
if (gr.threshold < 0 || gr.threshold > 1) {
|
||
return { valid: false, error: 'threshold must be between 0 and 1' };
|
||
}
|
||
|
||
return { valid: true };
|
||
}
|
||
}
|
||
```
|
||
|
||
### 4.9 🆕 配置中台路由
|
||
|
||
```typescript
|
||
// routes/config.routes.ts
|
||
import { FastifyInstance } from 'fastify';
|
||
import { ConfigLoaderService } from '../config/ConfigLoaderService';
|
||
|
||
export default async function configRoutes(app: FastifyInstance) {
|
||
const configService = ConfigLoaderService.getInstance();
|
||
|
||
// 导入 Excel 配置
|
||
app.post('/import', async (req, reply) => {
|
||
const data = await req.file();
|
||
if (!data) {
|
||
return reply.status(400).send({ error: 'No file uploaded' });
|
||
}
|
||
|
||
const buffer = await data.toBuffer();
|
||
const result = await configService.loadFromExcel(buffer);
|
||
|
||
return reply.send(result);
|
||
});
|
||
|
||
// 🆕 热加载配置(Admin API)
|
||
app.post('/reload', async (req, reply) => {
|
||
await configService.reload();
|
||
return reply.send({ success: true, timestamp: new Date().toISOString() });
|
||
});
|
||
|
||
// 获取工具列表
|
||
app.get('/tools', async (req, reply) => {
|
||
const tools = configService.getAllTools();
|
||
return reply.send(tools);
|
||
});
|
||
|
||
// 校验配置文件(不导入)
|
||
app.post('/validate', async (req, reply) => {
|
||
const data = await req.file();
|
||
if (!data) {
|
||
return reply.status(400).send({ error: 'No file uploaded' });
|
||
}
|
||
|
||
// 仅校验,不加载到缓存
|
||
// ...
|
||
return reply.send({ valid: true });
|
||
});
|
||
}
|
||
```
|
||
|
||
### 4.10 🆕 咨询模式路由
|
||
|
||
```typescript
|
||
// routes/consult.routes.ts
|
||
import { FastifyInstance } from 'fastify';
|
||
import { ConsultService } from '../planner/ConsultService';
|
||
import { SAPGeneratorService } from '../planner/SAPGeneratorService';
|
||
import { prisma } from '@/common/db';
|
||
|
||
export default async function consultRoutes(app: FastifyInstance) {
|
||
const consultService = new ConsultService();
|
||
const sapGenerator = new SAPGeneratorService();
|
||
|
||
// 创建咨询会话(无数据)
|
||
app.post('/', async (req, reply) => {
|
||
const userId = req.user.id;
|
||
|
||
const session = await prisma.ssaSession.create({
|
||
data: {
|
||
userId,
|
||
title: '统计咨询',
|
||
status: 'consult' // 🆕 标记为咨询模式
|
||
}
|
||
});
|
||
|
||
return reply.send(session);
|
||
});
|
||
|
||
// 咨询对话
|
||
app.post('/:id/chat', async (req, reply) => {
|
||
const { id } = req.params as { id: string };
|
||
const { message } = req.body as { message: string };
|
||
|
||
const response = await consultService.chat(id, message);
|
||
|
||
return reply.send({ response });
|
||
});
|
||
|
||
// 生成 SAP 文档
|
||
app.post('/:id/generate-sap', async (req, reply) => {
|
||
const { id } = req.params as { id: string };
|
||
|
||
const sap = await consultService.generateSAP(id);
|
||
|
||
// 保存到会话
|
||
await prisma.ssaMessage.create({
|
||
data: {
|
||
sessionId: id,
|
||
role: 'assistant',
|
||
contentType: 'sap',
|
||
content: sap
|
||
}
|
||
});
|
||
|
||
return reply.send(sap);
|
||
});
|
||
|
||
// 下载 SAP(Word/Markdown)
|
||
app.get('/:id/download-sap', async (req, reply) => {
|
||
const { id } = req.params as { id: string };
|
||
const { format = 'word' } = req.query as { format?: 'word' | 'markdown' };
|
||
|
||
// 获取最新的 SAP
|
||
const sapMessage = await prisma.ssaMessage.findFirst({
|
||
where: { sessionId: id, contentType: 'sap' },
|
||
orderBy: { createdAt: 'desc' }
|
||
});
|
||
|
||
if (!sapMessage) {
|
||
return reply.status(404).send({ error: 'SAP not found' });
|
||
}
|
||
|
||
const sap = sapMessage.content as any;
|
||
|
||
if (format === 'markdown') {
|
||
const md = sapGenerator.generateMarkdown(sap);
|
||
reply.header('Content-Type', 'text/markdown');
|
||
reply.header('Content-Disposition', 'attachment; filename="SAP.md"');
|
||
return reply.send(md);
|
||
} else {
|
||
const buffer = await sapGenerator.generateWord(sap);
|
||
reply.header('Content-Type', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
|
||
reply.header('Content-Disposition', 'attachment; filename="SAP.docx"');
|
||
return reply.send(buffer);
|
||
}
|
||
});
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 5. Brain-Hand 数据隔离
|
||
|
||
**核心原则:LLM 只看 Schema,R 服务处理真实数据**
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ 数据上传流程 │
|
||
│ │
|
||
│ Excel/CSV ──────┬────────────────────────────────────────│
|
||
│ │ │
|
||
│ ┌──────▼──────┐ │
|
||
│ │ 数据解析器 │ │
|
||
│ └──────┬──────┘ │
|
||
│ │ │
|
||
│ ┌─────────┴─────────┐ │
|
||
│ │ │ │
|
||
│ dataSchema dataPayload │
|
||
│ (结构/类型/统计) (真实数据) │
|
||
│ │ │ │
|
||
│ ▼ ▼ │
|
||
│ LLM (Planner) R (Executor) │
|
||
│ │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
### 5.1 数据解析实现
|
||
|
||
```typescript
|
||
// services/DataParserService.ts
|
||
import * as XLSX from 'xlsx';
|
||
|
||
export class DataParserService {
|
||
|
||
static parse(buffer: Buffer, filename: string) {
|
||
const workbook = XLSX.read(buffer, { type: 'buffer' });
|
||
const sheetName = workbook.SheetNames[0];
|
||
const sheet = workbook.Sheets[sheetName];
|
||
|
||
// 转为 JSON 数组
|
||
const data = XLSX.utils.sheet_to_json(sheet);
|
||
|
||
// 提取 Schema
|
||
const schema = this.extractSchema(data);
|
||
|
||
return {
|
||
dataSchema: schema, // 给 LLM
|
||
dataPayload: data // 给 R
|
||
};
|
||
}
|
||
|
||
private static extractSchema(data: any[]) {
|
||
if (data.length === 0) return { columns: [], rowCount: 0 };
|
||
|
||
const columns = Object.keys(data[0]).map(colName => {
|
||
const values = data.map(row => row[colName]).filter(v => v != null);
|
||
const type = this.inferType(values);
|
||
|
||
return {
|
||
name: colName,
|
||
type,
|
||
...this.computeStats(values, type, data.length) // 📌 传入行数用于隐私保护
|
||
};
|
||
});
|
||
|
||
return {
|
||
rowCount: data.length,
|
||
columns
|
||
};
|
||
}
|
||
|
||
private static inferType(values: any[]): 'numeric' | 'categorical' | 'datetime' {
|
||
const sample = values.slice(0, 100);
|
||
const numericCount = sample.filter(v => typeof v === 'number' || !isNaN(Number(v))).length;
|
||
|
||
if (numericCount / sample.length > 0.9) return 'numeric';
|
||
return 'categorical';
|
||
}
|
||
|
||
private static computeStats(values: any[], type: string, rowCount: number) {
|
||
if (type === 'numeric') {
|
||
const nums = values.map(Number).filter(n => !isNaN(n));
|
||
let min = Math.min(...nums);
|
||
let max = Math.max(...nums);
|
||
|
||
// 📌 小样本隐私保护:N < 10 时模糊化极值
|
||
if (rowCount < 10) {
|
||
min = Math.floor(min / 10) * 10; // 向下取整到十位
|
||
max = Math.ceil(max / 10) * 10; // 向上取整到十位
|
||
}
|
||
|
||
return {
|
||
min,
|
||
max,
|
||
mean: nums.reduce((a, b) => a + b, 0) / nums.length,
|
||
missing: values.length - nums.length,
|
||
privacyProtected: rowCount < 10 // 📌 标记是否已模糊化
|
||
};
|
||
}
|
||
|
||
// categorical
|
||
const counts = new Map<string, number>();
|
||
values.forEach(v => {
|
||
const key = String(v);
|
||
counts.set(key, (counts.get(key) || 0) + 1);
|
||
});
|
||
|
||
// 🆕 分类变量隐私保护:
|
||
// 如果某个取值的计数 < 5 且总行数 > 10,则隐藏具体值
|
||
const uniqueValues: string[] = [];
|
||
let maskedCount = 0;
|
||
|
||
for (const [value, count] of counts.entries()) {
|
||
if (count < 5 && rowCount > 10) {
|
||
maskedCount++;
|
||
} else {
|
||
uniqueValues.push(value);
|
||
}
|
||
}
|
||
|
||
// 最多展示 10 个非敏感值
|
||
const safeValues = uniqueValues.slice(0, 10);
|
||
if (maskedCount > 0) {
|
||
safeValues.push(`[${maskedCount} 个稀有值已隐藏]`);
|
||
}
|
||
|
||
return {
|
||
uniqueValues: safeValues,
|
||
uniqueCount: counts.size,
|
||
missing: values.filter(v => v == null || v === '').length,
|
||
privacyProtected: maskedCount > 0 // 🆕 标记
|
||
};
|
||
}
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 6. Prompt 注册
|
||
|
||
```sql
|
||
-- 注册 Planner Prompt
|
||
INSERT INTO capability_schema.prompt_templates (code, name, content, model, temperature)
|
||
VALUES (
|
||
'SSA_PLANNER',
|
||
'SSA 统计规划器',
|
||
'你是一名资深的生物统计学家。你面前有一份数据摘要(Metadata)和一组可用的统计工具箱。
|
||
请根据用户的需求,选择最合适的一个工具,并生成详细的执行计划(SAP)。
|
||
|
||
### 数据摘要
|
||
{{data_schema_json}}
|
||
|
||
### 可用工具箱 (Candidates)
|
||
{{candidate_tools_json}}
|
||
|
||
### 决策规则 (Guardrails)
|
||
1. **类型匹配**:严格检查变量类型。不要把分类变量填入要求数值型的参数中。
|
||
2. **工具匹配**:如果用户要做 "预测",优先选 "回归" 类工具;如果做 "差异",选 "检验" 类工具。
|
||
3. **护栏配置**:对于 T 检验、ANOVA 等参数检验,必须开启 check_normality。
|
||
|
||
### 输出要求
|
||
请先在 <thinking> 标签中进行推理,分析变量类型和工具适用性。
|
||
然后输出纯 JSON,格式如下:
|
||
{
|
||
"tool_code": "选中工具的CODE",
|
||
"reasoning": "一句话解释为什么选这个工具",
|
||
"params": { ...根据工具定义的 params_schema 填写... },
|
||
"guardrails": { "check_normality": true, "auto_fix": true }
|
||
}',
|
||
'deepseek-v3',
|
||
0.3
|
||
);
|
||
```
|
||
|
||
---
|
||
|
||
## 7. 与主应用集成
|
||
|
||
```typescript
|
||
// backend/src/index.ts
|
||
import ssaModule from './modules/ssa';
|
||
|
||
// 在 Fastify 注册
|
||
app.register(ssaModule, { prefix: '/api/v1/ssa' });
|
||
```
|
||
|
||
---
|
||
|
||
## 8. 环境变量
|
||
|
||
```env
|
||
# .env
|
||
|
||
# R 服务配置
|
||
R_SERVICE_URL=http://ssa-r-service:8080 # SAE VPC 内网地址
|
||
R_SERVICE_TIMEOUT=120000 # 📌 超时 120s
|
||
|
||
# 📌 OSS 配置(必须使用 VPC 内网 Endpoint)
|
||
OSS_ENDPOINT=oss-cn-beijing-internal.aliyuncs.com # 内网地址
|
||
OSS_BUCKET=ssa-data-bucket
|
||
OSS_ACCESS_KEY_ID=your-access-key
|
||
OSS_ACCESS_KEY_SECRET=your-secret
|
||
|
||
# LLM 配置
|
||
LLM_DEFAULT_MODEL=deepseek-v3
|
||
```
|
||
|
||
> **重要**:OSS Endpoint 必须使用 `-internal` 后缀的 VPC 内网地址,否则 R 服务的网络隔离策略会导致文件下载失败。
|
||
|
||
---
|
||
|
||
## 9. 测试检查清单
|
||
|
||
| 测试场景 | 预期结果 |
|
||
|----------|---------|
|
||
| POST /sessions 创建会话 | 返回 sessionId |
|
||
| POST /sessions/:id/upload (CSV) | 返回 dataSchema |
|
||
| POST /sessions/:id/upload (N<10) | dataSchema.privacyProtected = true |
|
||
| POST /sessions/:id/plan (T检验意图) | 返回包含 tool_code 的 plan |
|
||
| POST /sessions/:id/plan (LLM 返回格式错误 JSON) | json-repair 修复成功 |
|
||
| POST /sessions/:id/plan (参数不合法) | Zod 校验失败,返回错误 |
|
||
| POST /sessions/:id/execute | R 服务返回 success |
|
||
| POST /sessions/:id/execute (超过 60s) | 不超时,等待 120s |
|
||
| GET /sessions/:id/download-code | 下载 .R 文件 |
|
||
| R 服务宕机时 execute | 返回友好错误 |
|
||
|
||
---
|
||
|
||
## 10. 依赖包清单
|
||
|
||
```json
|
||
{
|
||
"dependencies": {
|
||
"jsonrepair": "^3.6.0",
|
||
"zod": "^3.22.4",
|
||
"xlsx": "^0.18.5",
|
||
"axios": "^1.6.0",
|
||
"docx": "^8.5.0" // 🆕 Word 文档生成(SAP 导出)
|
||
}
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 11. 🆕 配置中台 Excel 模板规范
|
||
|
||
> **核心理念**:统计学专家通过 Excel + R 脚本配置系统行为,无需修改代码。
|
||
|
||
### 11.0 🆕 决策表 Excel (decision_table.xlsx)
|
||
|
||
> 用于 Planner 工具选择,四维精准匹配
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| goal_type | string | ✅ | 分析目标:组间差异 / 相关性 / 分布描述 / 预测建模 |
|
||
| y_type | string | ✅ | 因变量类型:连续 / 分类 / 计数 |
|
||
| x_type | string | | 自变量类型:连续 / 分类 / 无 |
|
||
| design_type | string | ✅ | 设计类型:独立 / 配对 / 重复测量 |
|
||
| tool_code | string | ✅ | 推荐工具代码 |
|
||
| alt_tool_code | string | | 备选工具(护栏降级) |
|
||
| priority | number | | 优先级(数字越大越优先) |
|
||
|
||
**示例数据**:
|
||
| goal_type | y_type | x_type | design_type | tool_code | alt_tool_code |
|
||
|-----------|--------|--------|-------------|-----------|---------------|
|
||
| 组间差异 | 连续 | 分类 | 独立 | ST_T_TEST_IND | ST_MANN_WHITNEY |
|
||
| 组间差异 | 连续 | 分类 | 配对 | ST_T_TEST_PAIRED | ST_WILCOXON |
|
||
| 相关性 | 连续 | 连续 | 独立 | ST_CORRELATION | ST_CORRELATION |
|
||
|
||
### 11.1 Sheet 1: Metadata(工具元数据)
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| tool_code | string | ✅ | 工具代码,格式 ST_XXX |
|
||
| name | string | ✅ | 工具名称 |
|
||
| version | string | | 版本号,默认 1.0.0 |
|
||
| r_script_file | string | ✅ | 🆕 R 脚本文件名(如 t_test_ind.R) |
|
||
| description | string | ✅ | 工具描述 |
|
||
| usage_context | string | | 适用场景 |
|
||
| search_text | string | | RAG 搜索关键词 |
|
||
|
||
### 11.2 Sheet 2: ParamMapping(参数映射)
|
||
|
||
> 🆕 JSON Key → R 参数名映射
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| tool_code | string | ✅ | 工具代码 |
|
||
| json_key | string | ✅ | 前端传入的 JSON 字段名 |
|
||
| r_param_name | string | ✅ | R 函数参数名 |
|
||
| data_type | string | ✅ | 数据类型:string / number / boolean |
|
||
| is_required | boolean | | 是否必填 |
|
||
| default_value | string | | 默认值 |
|
||
| validation_rule | string | | 校验规则(正则或条件) |
|
||
| description | string | | 参数说明 |
|
||
|
||
**示例数据**:
|
||
| tool_code | json_key | r_param_name | data_type | is_required |
|
||
|-----------|----------|--------------|-----------|-------------|
|
||
| ST_T_TEST_IND | group_variable | group_var | string | TRUE |
|
||
| ST_T_TEST_IND | value_variable | value_var | string | TRUE |
|
||
| ST_T_TEST_IND | confidence_level | conf_level | number | FALSE |
|
||
|
||
### 11.3 Sheet 3: Guardrails(护栏规则链)
|
||
|
||
> 🆕 支持 Block / Warn / Switch 三种 Action
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| tool_code | string | ✅ | 工具代码 |
|
||
| check_name | string | ✅ | 检查名称:正态性检验 / 方差齐性 / 样本量 |
|
||
| check_order | number | | 执行顺序(数字越小越先) |
|
||
| check_code | string | ✅ | R 函数名(如 check_normality) |
|
||
| threshold | string | | 阈值条件:p < 0.05 |
|
||
| action_type | string | ✅ | 🆕 **Block** / **Warn** / **Switch** |
|
||
| action_target | string | | Switch 时的目标工具代码 |
|
||
| is_enabled | boolean | | 是否启用 |
|
||
|
||
**Action 类型说明**:
|
||
- **Block**: 阻止执行,返回错误
|
||
- **Warn**: 警告但继续执行
|
||
- **Switch**: 🆕 自动切换到备选方法
|
||
|
||
**示例数据**:
|
||
| tool_code | check_name | check_code | threshold | action_type | action_target |
|
||
|-----------|------------|------------|-----------|-------------|---------------|
|
||
| ST_T_TEST_IND | 正态性检验 | check_normality | p < 0.05 | Switch | ST_MANN_WHITNEY |
|
||
| ST_T_TEST_IND | 样本量检查 | check_sample_size | n < 3 | Block | |
|
||
| ST_ANOVA_ONE | 方差齐性 | check_homogeneity | p < 0.05 | Warn | |
|
||
|
||
### 11.4 Sheet 4: OutputDef(输出字段定义)
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| tool_code | string | ✅ | 工具代码 |
|
||
| field_name | string | ✅ | 字段名 |
|
||
| display_name | string | ✅ | 展示名称 |
|
||
| data_type | string | | 数据类型 |
|
||
| format_rule | string | | 格式化规则(如 %.3f) |
|
||
|
||
### 11.5 Sheet 5: Interpretation(结果解读模板)
|
||
|
||
> 🆕 "填空题"式的论文级结论生成
|
||
|
||
| 列名 | 类型 | 必填 | 说明 |
|
||
|------|------|------|------|
|
||
| tool_code | string | ✅ | 工具代码 |
|
||
| scenario_key | string | ✅ | 场景:significant / not_significant / warning |
|
||
| template | text | ✅ | 解读模板(含占位符) |
|
||
| placeholders | text | | 占位符列表(JSON 数组) |
|
||
|
||
**示例模板**:
|
||
```
|
||
场景: significant
|
||
模板: "采用 {method} 进行分析,结果表明两组之间存在统计学显著差异(t = {statistic}, p {p_value_fmt}, 95% CI [{ci_lower}, {ci_upper}])。{group1} 组均值为 {mean1} ± {sd1},{group2} 组均值为 {mean2} ± {sd2}。"
|
||
|
||
场景: not_significant
|
||
模板: "采用 {method} 进行分析,结果表明两组之间差异无统计学意义(t = {statistic}, p = {p_value_fmt})。"
|
||
```
|
||
|
||
### 11.6 🆕 R 脚本规范
|
||
|
||
> 专家上传的 R 脚本必须遵循以下规范
|
||
|
||
```r
|
||
# 文件名: t_test_ind.R
|
||
# 工具代码: ST_T_TEST_IND
|
||
# 版本: 1.0.0
|
||
|
||
#' @title 独立样本 T 检验
|
||
#' @description 比较两组独立样本的均值差异
|
||
#' @param input List 包含 data_source, params, guardrails
|
||
#' @return List 包含 status, results, plots, trace_log, reproducible_code
|
||
|
||
# 📌 所有脚本必须使用统一入口函数
|
||
run_analysis <- function(input) {
|
||
# 1. 数据加载
|
||
df <- load_input_data(input)
|
||
|
||
# 2. 参数提取(根据 ParamMapping 配置)
|
||
group_var <- input$params$group_var
|
||
value_var <- input$params$value_var
|
||
|
||
# 3. 护栏检查(根据 Guardrails 配置)
|
||
# ... 护栏检查代码 ...
|
||
|
||
# 4. 核心计算
|
||
result <- t.test(...)
|
||
|
||
# 5. 返回标准格式
|
||
return(list(
|
||
status = "success",
|
||
results = list(...),
|
||
plots = list(...),
|
||
trace_log = logs,
|
||
reproducible_code = code
|
||
))
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 12. 🆕 SAP 文档规范
|
||
|
||
### 12.1 SAP 结构定义
|
||
|
||
```typescript
|
||
interface SAPDocument {
|
||
title: string; // 统计分析计划标题
|
||
sections: Array<{
|
||
heading: string; // 章节标题
|
||
content: string; // 章节内容
|
||
}>;
|
||
recommendedTools: string[]; // 推荐的统计方法列表
|
||
metadata: {
|
||
generatedAt: string; // 生成时间
|
||
sessionId: string; // 关联会话
|
||
version: string; // 版本号
|
||
};
|
||
}
|
||
```
|
||
|
||
### 12.2 标准章节
|
||
|
||
1. **研究背景** - 研究目的、设计类型
|
||
2. **数据描述** - 样本量、变量类型、缺失情况
|
||
3. **统计假设** - 原假设、备择假设
|
||
4. **分析方法** - 具体统计方法及选择理由
|
||
5. **结果解读指南** - 如何解读统计结果
|
||
6. **注意事项** - 方法局限性、前提条件
|