# SSA-Pro R 服务开发指南 > **文档版本:** v1.6 > **创建日期:** 2026-02-18 > **最后更新:** 2026-02-20(纳入智能化演进共识 + 错误捕获管道设计) > **目标读者:** R 统计工程师 --- ## 1. 项目结构 ``` r-statistics-service/ ├── Dockerfile ├── renv.lock # 📌 包版本锁定文件 ├── .Rprofile # renv 初始化 ├── plumber.R # Plumber 入口 ├── tools/ # 🆕 专家上传的 R 脚本(统一 run_analysis 入口) │ ├── t_test_ind.R # 独立样本 T 检验 │ ├── t_test_paired.R # 配对样本 T 检验 │ ├── anova_one.R # 单因素方差分析 │ ├── wilcoxon.R # Wilcoxon 秩和检验 │ └── ... # 📌 约 100 个成熟脚本 ├── templates/ # 📌 代码交付模板(glue 格式) │ ├── t_test.R.template │ ├── anova.R.template │ └── ... ├── utils/ │ ├── data_loader.R # 混合数据协议加载器 │ ├── guardrails.R # 🆕 护栏函数库(支持 Block/Warn/Switch) │ ├── code_generator.R # 代码生成工具(使用 glue) │ ├── result_formatter.R # 结果格式化(p_value_fmt) │ ├── interpretation.R # 🆕 结果解读(基于配置模板) │ └── error_codes.R # 📌 错误码定义 ├── metadata/ # 工具元数据(由配置中台管理) │ └── tools.yaml # 备用配置 └── tests/ ├── test_tools.R # 单元测试 └── fixtures/ # 标准测试数据集 ├── normal_data.csv ├── skewed_data.csv └── missing_data.csv ``` ### 1.1 🆕 专家 R 脚本规范 > **核心要求**:所有脚本必须使用统一入口函数 `run_analysis(input)` ```r # 文件头部注释(必填) #' @tool_code ST_T_TEST_IND #' @name 独立样本 T 检验 #' @version 1.0.0 #' @description 比较两组独立样本的均值差异 #' @author 统计学专家团队 # 📌 统一入口函数(所有脚本必须实现) run_analysis <- function(input) { # ... 实现逻辑 ... } ``` --- ## 2. Dockerfile 模板 ```dockerfile FROM rocker/r-ver:4.3 LABEL maintainer="your-team@example.com" LABEL version="1.0.0" LABEL description="SSA-Pro R Statistics Service" # 安装系统依赖 RUN apt-get update && apt-get install -y \ libcurl4-openssl-dev \ libssl-dev \ libxml2-dev \ && rm -rf /var/lib/apt/lists/* # 📌 安装 renv(包管理工具) RUN R -e "install.packages('renv', repos='https://cloud.r-project.org/')" WORKDIR /app # 📌 先复制 renv.lock,利用 Docker 缓存 COPY renv.lock renv.lock COPY .Rprofile .Rprofile # 📌 使用 renv 恢复依赖(版本锁定) RUN R -e "renv::restore()" # 复制应用代码 COPY . . EXPOSE 8080 # 🆕 OSS 配置通过环境变量注入(开发/生产环境不同) ENV OSS_ENDPOINT="" ENV OSS_ACCESS_KEY_ID="" ENV OSS_ACCESS_KEY_SECRET="" ENV OSS_BUCKET="" # 📌 启动前清理临时文件 CMD ["R", "-e", "unlink(list.files('/tmp', full.names=TRUE), recursive=TRUE); plumber::plumb('plumber.R')$run(host='0.0.0.0', port=8080)"] ``` ### 2.2 环境变量配置(🆕 开发/生产差异) ```yaml # docker-compose.yml (本地开发) services: ssa-r-service: build: . ports: - "8080:8080" environment: - OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com # 公网 - OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} - OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} - OSS_BUCKET=ssa-data-bucket ``` ```yaml # SAE 环境变量 (生产) OSS_ENDPOINT: oss-cn-beijing-internal.aliyuncs.com # 🆕 VPC 内网 OSS_ACCESS_KEY_ID: ****** OSS_ACCESS_KEY_SECRET: ****** OSS_BUCKET: ssa-data-bucket ``` > **重要**:OSS Endpoint 绝不能硬编码,必须通过环境变量注入。本地开发用公网,SAE 生产用内网。 ### 2.1 renv.lock 示例 ```json { "R": { "Version": "4.3.0", "Repositories": [ { "Name": "CRAN", "URL": "https://cloud.r-project.org" } ] }, "Packages": { "plumber": { "Package": "plumber", "Version": "1.2.1", "Source": "Repository" }, "jsonlite": { "Package": "jsonlite", "Version": "1.8.8", "Source": "Repository" }, "ggplot2": { "Package": "ggplot2", "Version": "3.4.4", "Source": "Repository" }, "glue": { "Package": "glue", "Version": "1.7.0", "Source": "Repository" }, "styler": { "Package": "styler", "Version": "1.10.2", "Source": "Repository" }, "dplyr": { "Package": "dplyr", "Version": "1.1.4", "Source": "Repository" }, "tidyr": { "Package": "tidyr", "Version": "1.3.0", "Source": "Repository" }, "base64enc": { "Package": "base64enc", "Version": "0.1-3", "Source": "Repository" }, "yaml": { "Package": "yaml", "Version": "2.3.8", "Source": "Repository" }, "car": { "Package": "car", "Version": "3.1-2", "Source": "Repository" } } } ``` > **重要**:每次添加新依赖后,执行 `renv::snapshot()` 更新 lock 文件。 --- ## 3. Plumber 入口文件 ```r # plumber.R library(plumber) library(jsonlite) # 加载工具模块 tools_dir <- "tools" tool_files <- list.files(tools_dir, pattern = "\\.R$", full.names = TRUE) for (f in tool_files) source(f) # 加载公共函数 source("utils/data_loader.R") # 🆕 混合数据协议 source("utils/guardrails.R") source("utils/code_generator.R") source("utils/result_formatter.R") # 🆕 结果格式化 #* @apiTitle SSA-Pro R Statistics Service #* @apiDescription 严谨型统计分析 R 引擎 #* 健康检查 #* @get /health function() { list( status = "ok", timestamp = Sys.time(), version = "1.0.0" ) } #* 执行统计工具 #* @post /api/v1/skills/ #* @param tool_code:str 工具代码 #* @serializer unboxedJSON function(req, tool_code) { tryCatch({ # 解析请求体 input <- jsonlite::fromJSON(req$postBody, simplifyVector = FALSE) # 🆕 Debug 模式:保留临时文件用于排查 debug_mode <- isTRUE(input$debug) # 动态调用工具函数 func_name <- paste0("run_", tolower(tool_code)) if (!exists(func_name, mode = "function")) { return(list( status = "error", message = paste("Unknown tool:", tool_code) )) } result <- do.call(func_name, list(input)) # 🆕 Debug 模式:附加临时文件路径 if (debug_mode && !is.null(result$tmp_files)) { result$debug_files <- result$tmp_files message("[DEBUG] 临时文件已保留: ", paste(result$tmp_files, collapse = ", ")) } return(result) }, error = function(e) { # 🆕 使用友好错误映射 return(map_r_error(e$message)) }) } ``` --- ## 4. 错误码定义 ### 🆕 4.0 错误捕获管道设计(为 Phase 3 靶向修改铺路) > **重要**:结构化的错误捕获是 Phase 3 "靶向代码修改"能力的基础。 > > 详细背景参考:`04-开发计划/06-智能化演进共识与MVP执行计划.md` **Phase 3 的工作流程:** ``` 工具执行报错 → 错误捕获管道 → 结构化 JSON → 反馈给 LLM → LLM 靶向修改代码 ``` **错误 JSON 结构要求(便于 LLM 理解):** ```json { "status": "error", "error_code": "E002", "error_type": "business", "message": "列 'blood_pressure' 类型应为 numeric,实际为 character", "user_hint": "该列包含非数值数据,请检查数据格式", "context": { "tool_code": "ST_T_TEST_IND", "problematic_column": "blood_pressure", "expected_type": "numeric", "actual_type": "character", "sample_values": ["120", "130", "未知", "125"], "line_number": 45 } } ``` **关键字段说明:** | 字段 | MVP 用途 | Phase 3 用途 | |------|---------|-------------| | `error_code` | 日志分类 | LLM 识别错误类型 | | `error_type` | 区分业务/系统错误 | LLM 判断是否可自愈 | | `message` | 开发调试 | LLM 理解错误原因 | | `user_hint` | 前端展示 | 保留 | | `context` | 🆕 可选 | LLM 靶向修改的关键信息 | > **MVP 阶段行动**:错误响应中尽量包含 `context` 信息,为 Phase 3 积累"黄金数据集"。 --- ```r # utils/error_codes.R # 📌 结构化错误码,便于 LLM 自愈 ERROR_CODES <- list( # 业务错误(可被 Planner 修复) E001_COLUMN_NOT_FOUND = list( code = "E001", type = "business", message_template = "列名 '{col}' 在数据中不存在", user_hint = "请检查变量名是否拼写正确" # 🆕 用户友好提示 ), E002_TYPE_MISMATCH = list( code = "E002", type = "business", message_template = "列 '{col}' 类型应为 {expected},实际为 {actual}", user_hint = "该列包含非数值数据,请检查数据格式" ), E003_INSUFFICIENT_GROUPS = list( code = "E003", type = "business", message_template = "分组变量 '{col}' 应有 {expected} 个水平,实际有 {actual} 个", user_hint = "分组变量的取值个数不符合要求" ), E004_SAMPLE_TOO_SMALL = list( code = "E004", type = "business", message_template = "样本量 {n} 不足,至少需要 {min_required}", user_hint = "数据量太少,无法进行统计分析" ), # 🆕 统计计算错误(用户友好映射) E005_SINGULAR_MATRIX = list( code = "E005", type = "business", message_template = "矩阵计算异常: {details}", user_hint = "变量之间可能存在多重共线性,建议移除高度相关的变量" ), E006_CONVERGENCE_FAILED = list( code = "E006", type = "business", message_template = "模型未能收敛: {details}", user_hint = "算法未能找到稳定解,可能需要调整参数或检查数据" ), E007_VARIANCE_ZERO = list( code = "E007", type = "business", message_template = "变量 '{col}' 方差为零", user_hint = "该列的所有值都相同,无法进行比较" ), # 系统错误(需人工介入) E100_INTERNAL_ERROR = list( code = "E100", type = "system", message_template = "内部错误: {details}", user_hint = "系统繁忙,请稍后重试" ), E101_PACKAGE_MISSING = list( code = "E101", type = "system", message_template = "缺少依赖包: {package}", user_hint = "请联系管理员" ) ) # 🆕 R 原始错误到错误码的映射字典 R_ERROR_MAPPING <- list( "system is computationally singular" = "E005_SINGULAR_MATRIX", "did not converge" = "E006_CONVERGENCE_FAILED", "constant" = "E007_VARIANCE_ZERO" ) # 构造错误响应(🆕 含用户友好提示) make_error <- function(error_def, ...) { params <- list(...) msg <- error_def$message_template for (name in names(params)) { msg <- gsub(paste0("\\{", name, "\\}"), as.character(params[[name]]), msg) } return(list( status = "error", error_code = error_def$code, error_type = error_def$type, message = msg, user_hint = error_def$user_hint # 🆕 前端可直接展示 )) } # 🆕 尝试将 R 原始错误映射为友好错误码 map_r_error <- function(raw_error_msg) { for (pattern in names(R_ERROR_MAPPING)) { if (grepl(pattern, raw_error_msg, ignore.case = TRUE)) { error_key <- R_ERROR_MAPPING[[pattern]] return(make_error(ERROR_CODES[[error_key]], details = raw_error_msg)) } } # 无法映射,返回通用内部错误 return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = raw_error_msg)) } ``` --- ## 5. 🆕 混合数据协议加载器 > **核心问题**:Node.js 发送的 Payload 可能是 inline JSON(小数据)或 OSS Key(大数据),R 服务必须统一处理。 ```r # utils/data_loader.R # 🆕 混合数据协议:自动识别 inline 数据 vs OSS key library(httr) library(jsonlite) # 🆕 开发模式开关(本地无法访问 OSS 时启用) DEV_MODE <- Sys.getenv("DEV_MODE", "false") == "true" # 从环境变量获取 OSS 配置(开发/生产差异化) get_oss_config <- function() { list( endpoint = Sys.getenv("OSS_ENDPOINT", ""), access_key_id = Sys.getenv("OSS_ACCESS_KEY_ID", ""), access_key_secret = Sys.getenv("OSS_ACCESS_KEY_SECRET", ""), bucket = Sys.getenv("OSS_BUCKET", ""), mock_data_dir = Sys.getenv("OSS_MOCK_DIR", "tests/fixtures") # 🆕 Mock 目录 ) } # 统一数据加载入口 load_input_data <- function(input) { # 检查输入结构 if (is.null(input$data_source)) { stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = "请求缺少 data_source 字段")) } source_type <- input$data_source$type # "inline" | "oss" if (source_type == "inline") { # 📌 方式1:内联 JSON 数据(< 2MB) return(as.data.frame(input$data_source$data)) } else if (source_type == "oss") { # 📌 方式2:从 OSS 下载(2MB - 20MB) return(load_from_oss(input$data_source$oss_key)) } else { stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = paste("未知的 data_source.type:", source_type))) } } # 从 OSS 下载数据(🆕 支持 DEV_MODE Mock) load_from_oss <- function(oss_key) { config <- get_oss_config() # 🆕 开发模式:从本地 fixtures 读取 Mock 数据 if (DEV_MODE) { mock_file <- file.path(config$mock_data_dir, basename(oss_key)) if (file.exists(mock_file)) { message(glue("[DEV_MODE] 使用本地 Mock 文件: {mock_file}")) return(read.csv(mock_file, stringsAsFactors = FALSE)) } else { # 回退到 normal_data.csv fallback <- file.path(config$mock_data_dir, "normal_data.csv") message(glue("[DEV_MODE] Mock 文件不存在,使用默认: {fallback}")) return(read.csv(fallback, stringsAsFactors = FALSE)) } } if (config$endpoint == "") { stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = "OSS_ENDPOINT 环境变量未配置")) } # 构造签名 URL(简化版,生产应使用 SDK) url <- sprintf("https://%s.%s/%s", config$bucket, config$endpoint, oss_key) # 下载到临时文件 temp_file <- tempfile(fileext = ".csv") on.exit(unlink(temp_file)) # 确保清理 tryCatch({ response <- GET(url, add_headers( Authorization = generate_oss_signature(config, "GET", oss_key) ), write_disk(temp_file, overwrite = TRUE)) if (status_code(response) != 200) { stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = paste("OSS 下载失败:", status_code(response)))) } return(read.csv(temp_file, stringsAsFactors = FALSE)) }, error = function(e) { stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = paste("OSS 网络错误:", e$message))) }) } # OSS 签名生成(简化版) generate_oss_signature <- function(config, method, object_key) { # TODO: 完整 OSS V4 签名实现 # MVP 阶段可使用阿里云 R SDK 或预签名 URL paste0("OSS ", config$access_key_id, ":", "SIGNATURE_PLACEHOLDER") } ``` ### 5.1 后端 Payload 格式规范 Node.js `RClientService` 发送给 R 的 Payload 格式: ```typescript // 小数据(< 2MB):inline 模式 { "data_source": { "type": "inline", "data": [ { "group": "A", "value": 10.5 }, { "group": "B", "value": 12.3 } ] }, "params": { "group_var": "group", "value_var": "value" } } // 大数据(2MB - 20MB):OSS 模式 { "data_source": { "type": "oss", "oss_key": "sessions/abc123/data.csv" }, "params": { "group_var": "group", "value_var": "value" } } ``` --- ## 6. 🆕 结果格式化工具 ```r # utils/result_formatter.R # 🆕 统计结果格式化,确保 p 值显示规范 # 格式化 p 值(符合 APA 规范) format_p_value <- function(p) { if (is.na(p)) return(NA) if (p < 0.001) { return("< 0.001") } else { return(sprintf("%.3f", p)) } } # 构建标准化结果(包含 p_value_fmt) make_result <- function(p_value, statistic, method, ...) { list( p_value = p_value, p_value_fmt = format_p_value(p_value), # 🆕 前端直接展示 statistic = statistic, method = method, ... ) } # 格式化置信区间 format_ci <- function(lower, upper, digits = 2) { sprintf("[%.${digits}f, %.${digits}f]", lower, upper) } # 格式化效应量 format_effect_size <- function(value, type = "d") { interpretation <- "" if (type == "d") { # Cohen's d if (abs(value) < 0.2) interpretation <- "微小" else if (abs(value) < 0.5) interpretation <- "小" else if (abs(value) < 0.8) interpretation <- "中等" else interpretation <- "大" } list( value = round(value, 3), interpretation = interpretation ) } ``` --- ## 7. 工具 Wrapper 标准模板(使用 glue) ### 5.1 代码模板文件 ```r # templates/t_test.R.template # SSA-Pro 自动生成代码 # 工具: {tool_name} # 时间: {timestamp} # ================================ # 🆕 自动安装依赖(用户本地运行时自动检测) required_packages <- c("ggplot2", "car") new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])] if(length(new_packages)) {{ message("正在安装缺失的依赖包: ", paste(new_packages, collapse = ", ")) install.packages(new_packages, repos = "https://cloud.r-project.org") }} library(ggplot2) # 数据准备 df <- read.csv("your_data.csv") group_var <- "{group_var}" value_var <- "{value_var}" # 正态性检验 {normality_code} # {method_name} result <- {test_code} print(result) # 可视化 ggplot(df, aes(x = {group_var}, y = {value_var})) + geom_boxplot(fill = "#3b82f6", alpha = 0.6) + theme_minimal() + labs(title = "Distribution of {value_var} by {group_var}") ``` ### 5.2 Wrapper 实现(使用 glue) ```r # tools/ST_T_TEST_IND.R # 独立样本 T 检验 library(glue) source("utils/error_codes.R") run_st_t_test_ind <- function(input) { # ===== 初始化 ===== logs <- c() log_add <- function(msg) { logs <<- c(logs, paste0("[", Sys.time(), "] ", msg)) } tmp_files <- c() # 📌 追踪临时文件 # 📌 确保退出时清理临时文件 on.exit({ if (length(tmp_files) > 0) { unlink(tmp_files) } }, add = TRUE) # ===== 🆕 数据加载(混合协议) ===== log_add("开始加载输入数据") df <- tryCatch( load_input_data(input), # 🆕 统一入口,自动处理 inline/OSS error = function(e) { log_add(paste("数据加载失败:", e$message)) return(NULL) } ) if (is.null(df)) { return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = "数据加载失败")) } log_add(glue("数据加载成功: {nrow(df)} 行, {ncol(df)} 列")) p <- input$params guardrails_cfg <- input$guardrails group_var <- p$group_variable value_var <- p$value_variable # 📌 业务错误检查 if (!(group_var %in% names(df))) { return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = group_var)) } if (!(value_var %in% names(df))) { return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = value_var)) } groups <- unique(df[[group_var]]) if (length(groups) != 2) { return(make_error(ERROR_CODES$E003_INSUFFICIENT_GROUPS, col = group_var, expected = 2, actual = length(groups))) } # ===== 护栏检查 ===== normality_code <- "" method_used <- "t.test" if (isTRUE(guardrails_cfg$check_normality)) { log_add("执行正态性检验") use_nonparam <- FALSE for (g in groups) { vals <- df[df[[group_var]] == g, value_var] if (length(vals) >= 3 && length(vals) <= 5000) { sw_test <- shapiro.test(vals) normality_code <- paste0(normality_code, glue("shapiro.test(df[df${group_var} == '{g}', '{value_var}'])\n")) if (sw_test$p.value < 0.05) { use_nonparam <- TRUE log_add(glue("组[{g}] Shapiro-Wilk P = {round(sw_test$p.value, 4)} < 0.05, 拒绝正态性")) } } } if (use_nonparam && isTRUE(guardrails_cfg$auto_fix)) { log_add("触发降级: T-Test -> Wilcoxon") return(run_st_wilcoxon(input)) } } # ===== 核心计算 ===== log_add("执行 T 检验") g1_vals <- df[df[[group_var]] == groups[1], value_var] g2_vals <- df[df[[group_var]] == groups[2], value_var] result <- t.test(g1_vals, g2_vals, var.equal = FALSE) # ===== 生成图表 ===== log_add("生成箱线图") plot_base64 <- generate_boxplot(df, group_var, value_var, tmp_files) # ===== 📌 使用 glue 生成代码 ===== template <- readLines("templates/t_test.R.template", warn = FALSE) template_str <- paste(template, collapse = "\n") reproducible_code <- glue(template_str, tool_name = "独立样本 T 检验", timestamp = Sys.time(), group_var = group_var, value_var = value_var, normality_code = if (nchar(normality_code) > 0) normality_code else "# 未执行正态性检验", method_name = result$method, test_code = glue("t.test(df[df${group_var} == '{groups[1]}', '{value_var}'], df[df${group_var} == '{groups[2]}', '{value_var}'], var.equal = FALSE)") ) # 📌 使用 styler 格式化代码(可选) # reproducible_code <- styler::style_text(reproducible_code) # ===== 返回结果 ===== log_add("分析完成") return(list( status = "success", message = "分析完成", results = list( method = result$method, statistic = unbox(as.numeric(result$statistic)), p_value = unbox(as.numeric(result$p.value)), p_value_fmt = format_p_value(result$p.value), # 🆕 格式化 p 值 conf_int = as.numeric(result$conf.int), estimate = as.numeric(result$estimate), group_stats = list( list(group = groups[1], n = length(g1_vals), mean = mean(g1_vals), sd = sd(g1_vals)), list(group = groups[2], n = length(g2_vals), mean = mean(g2_vals), sd = sd(g2_vals)) ) ), plots = list(plot_base64), trace_log = logs, reproducible_code = as.character(reproducible_code) )) } # 📌 辅助函数(带临时文件追踪) generate_boxplot <- function(df, group_var, value_var, tmp_files_ref) { library(ggplot2) library(base64enc) p <- ggplot(df, aes_string(x = group_var, y = value_var)) + geom_boxplot(fill = "#3b82f6", alpha = 0.6) + theme_minimal() + labs(title = paste("Distribution of", value_var, "by", group_var)) tmp_file <- tempfile(fileext = ".png") tmp_files_ref <- c(tmp_files_ref, tmp_file) # 追踪 ggsave(tmp_file, p, width = 6, height = 4, dpi = 100) base64_str <- base64encode(tmp_file) unlink(tmp_file) # 立即清理 return(paste0("data:image/png;base64,", base64_str)) } ``` --- ## 8. 护栏函数库 ```r # utils/guardrails.R # 🆕 大样本优化阈值 LARGE_SAMPLE_THRESHOLD <- 5000 # 🆕 护栏 Action 类型 ACTION_BLOCK <- "Block" # 阻止执行 ACTION_WARN <- "Warn" # 警告但继续 ACTION_SWITCH <- "Switch" # 切换到备选方法 # 🆕 护栏检查结果结构 # list( # passed = TRUE/FALSE, # action = "Block" | "Warn" | "Switch", # action_target = "ST_XXX" | NULL, # p_value = 0.05, # reason = "描述" # ) # 正态性检验(🆕 支持三种 Action) check_normality <- function(values, alpha = 0.05, action = ACTION_SWITCH, action_target = NULL) { n <- length(values) # 样本量过小 if (n < 3) { return(list( passed = TRUE, action = NULL, action_target = NULL, reason = "样本量过小,跳过正态性检验", skipped = TRUE )) } # 🆕 大样本优化:N > 5000 时使用抽样检验 if (n > LARGE_SAMPLE_THRESHOLD) { set.seed(42) sampled_values <- sample(values, 1000) test <- shapiro.test(sampled_values) passed <- test$p.value >= alpha return(list( passed = passed, action = if (passed) NULL else action, action_target = if (passed) NULL else action_target, p_value = test$p.value, reason = glue("大样本(N={n})抽样检验,{if (passed) '满足正态性' else '不满足正态性'}"), sampled = TRUE, sample_size = 1000 )) } # 常规检验 test <- shapiro.test(values) passed <- test$p.value >= alpha return(list( passed = passed, action = if (passed) NULL else action, action_target = if (passed) NULL else action_target, p_value = test$p.value, reason = if (passed) "满足正态性" else "不满足正态性", sampled = FALSE )) } # 方差齐性检验 (Levene) check_homogeneity <- function(df, group_var, value_var, alpha = 0.05, action = ACTION_WARN) { library(car) formula <- as.formula(paste(value_var, "~", group_var)) test <- leveneTest(formula, data = df) p_val <- test$`Pr(>F)`[1] passed <- p_val >= alpha return(list( passed = passed, action = if (passed) NULL else action, p_value = p_val, reason = if (passed) "方差齐性满足" else "方差不齐性" )) } # 样本量检验 check_sample_size <- function(n, min_required = 30, action = ACTION_BLOCK) { passed <- n >= min_required return(list( passed = passed, action = if (passed) NULL else action, n = n, reason = if (passed) "样本量充足" else paste0("样本量不足, 需要至少 ", min_required) )) } # 🆕 执行护栏链(按 check_order 顺序执行) run_guardrail_chain <- function(input, guardrail_configs) { for (config in guardrail_configs) { check_func <- get(config$check_code) result <- do.call(check_func, list( input, action = config$action_type, action_target = config$action_target )) if (!result$passed) { if (result$action == ACTION_BLOCK) { return(list( status = "blocked", reason = result$reason )) } else if (result$action == ACTION_SWITCH) { return(list( status = "switch", target_tool = result$action_target, reason = result$reason )) } # WARN: 记录警告但继续 } } return(list(status = "passed")) } ``` --- ## 9. API 请求/响应规范 ### 9.1 🆕 请求格式(混合协议) **方式 1:Inline 数据(< 2MB)** ```json { "data_source": { "type": "inline", "data": [ {"Gender": "Male", "GLU": 5.8, "Age": 45}, {"Gender": "Female", "GLU": 5.1, "Age": 38} ] }, "params": { "group_variable": "Gender", "value_variable": "GLU" }, "guardrails": { "check_normality": true, "auto_fix": true } } ``` **方式 2:OSS 数据(2MB - 20MB)** ```json { "data_source": { "type": "oss", "oss_key": "sessions/abc123/data.csv" }, "params": { "group_variable": "Gender", "value_variable": "GLU" }, "guardrails": { "check_normality": true, "auto_fix": true } } ``` ### 9.2 成功响应(🆕 含 p_value_fmt) ```json { "status": "success", "message": "分析完成", "results": { "method": "Welch Two Sample t-test", "statistic": 2.345, "p_value": 0.021, "p_value_fmt": "0.021", "conf_int": [0.12, 1.28], "group_stats": [ {"group": "Male", "n": 78, "mean": 5.8, "sd": 0.9}, {"group": "Female", "n": 72, "mean": 5.1, "sd": 0.7} ] }, "plots": ["data:image/png;base64,..."], "trace_log": [ "[2026-02-18 10:30:01] 数据加载成功: 150 行, 5 列", "[2026-02-18 10:30:01] 执行正态性检验", "[2026-02-18 10:30:02] 执行 T 检验", "[2026-02-18 10:30:02] 分析完成" ], "reproducible_code": "# SSA-Pro 自动生成代码\n..." } ``` > **p_value_fmt 说明**: > - p >= 0.001: 保留 3 位小数,如 "0.021" > - p < 0.001: 显示 "< 0.001" > - 前端应直接使用 `p_value_fmt` 展示,避免重复格式化 ### 7.3 错误响应(📌 含结构化错误码) ```json { "status": "error", "error_code": "E001", "error_type": "business", "message": "列名 'invalid_col' 在数据中不存在", "trace_log": [ "[2026-02-18 10:30:01] 开始解析输入数据", "[2026-02-18 10:30:01] 错误: 列名 'invalid_col' 在数据中不存在" ] } ``` > **错误类型说明**: > - `business`:业务错误,Planner 可尝试自动修复参数后重试 > - `system`:系统错误,需人工介入 --- ## 8. MVP 10 个工具清单 | 序号 | 工具代码 | 文件名 | 主要函数 | 护栏 | |------|---------|--------|---------|------| | 1 | ST_T_TEST_IND | ST_T_TEST_IND.R | `run_st_t_test_ind()` | 正态性 | | 2 | ST_T_TEST_PAIRED | ST_T_TEST_PAIRED.R | `run_st_t_test_paired()` | 正态性 | | 3 | ST_ANOVA_ONE | ST_ANOVA_ONE.R | `run_st_anova_one()` | 正态性+方差齐性 | | 4 | ST_CHI_SQUARE | ST_CHI_SQUARE.R | `run_st_chi_square()` | 期望频数 | | 5 | ST_FISHER | ST_FISHER.R | `run_st_fisher()` | 无 | | 6 | ST_WILCOXON | ST_WILCOXON.R | `run_st_wilcoxon()` | 无 | | 7 | ST_MANN_WHITNEY | ST_MANN_WHITNEY.R | `run_st_mann_whitney()` | 无 | | 8 | ST_CORRELATION | ST_CORRELATION.R | `run_st_correlation()` | 正态性(决定Pearson/Spearman) | | 9 | ST_LINEAR_REG | ST_LINEAR_REG.R | `run_st_linear_reg()` | 残差正态性 | | 10 | ST_DESCRIPTIVE | ST_DESCRIPTIVE.R | `run_st_descriptive()` | 无 | --- ## 9. 本地开发流程 ### 8.1 构建镜像 ```bash cd r-statistics-service docker build -t ssa-r-service:dev . ``` ### 8.2 运行容器 ```bash docker run -d -p 8080:8080 --name ssa-r-dev ssa-r-service:dev ``` ### 8.3 测试健康检查 ```bash curl http://localhost:8080/health ``` ### 8.4 测试工具调用 ```bash curl -X POST http://localhost:8080/api/v1/skills/ST_T_TEST_IND \ -H "Content-Type: application/json" \ -d '{ "data": [ {"Gender": "Male", "GLU": 5.8}, {"Gender": "Male", "GLU": 6.1}, {"Gender": "Female", "GLU": 5.0}, {"Gender": "Female", "GLU": 5.2} ], "params": { "group_variable": "Gender", "value_variable": "GLU" }, "guardrails": { "check_normality": true, "auto_fix": true } }' ``` --- ## 10. 工具元数据格式 ```yaml # metadata/tools.yaml tools: - code: ST_T_TEST_IND name: 独立样本 T 检验 version: "1.0.0" category: 假设检验 description: | 用于比较两个独立组的均值是否存在显著差异。 适用场景:比较男性vs女性的血糖水平、实验组vs对照组的疗效等。 usage_context: | - 两组独立样本比较 - 连续型数值变量 - 样本量建议 >= 30 params_schema: type: object required: - group_variable - value_variable properties: group_variable: type: string description: 分组变量名(应为分类变量,仅含两个水平) value_variable: type: string description: 检验变量名(应为数值型) guardrails: - check_normality - check_homogeneity ``` --- ## 11. 常见问题 ### Q1: 护栏检查失败后如何处理? 如果 `auto_fix = true`,R 服务会自动降级到适当的非参数方法。如果 `auto_fix = false`,则返回警告但仍执行原方法。 ### Q2: 如何添加新工具? 1. 在 `tools/` 目录创建 `ST_NEW_TOOL.R` 2. 实现 `run_st_new_tool(input)` 函数 3. 在 `metadata/tools.yaml` 添加元数据 4. 执行后端脚本导入到 pgvector ### Q3: 图表生成失败怎么办? 检查 `plots` 字段是否为空数组。R 服务不会因图表失败而中断整个分析,但会在 `trace_log` 中记录错误。 ### Q4: 如何添加新的 R 包依赖? 1. 在 R 控制台执行 `install.packages("new_package")` 2. 执行 `renv::snapshot()` 更新 `renv.lock` 3. 提交 `renv.lock` 到版本控制 4. 重新构建 Docker 镜像 ### Q5: 临时文件清理策略是什么? - **代码层面**:使用 `on.exit(unlink(tmp_files))` 确保函数退出时清理 - **容器层面**:Docker 启动时清理 `/tmp` - **运维层面**:SAE 配置定时任务,每日清理 24 小时前的临时文件