Files
AIclinicalresearch/docs/03-业务模块/SSA-智能统计分析/04-开发计划/02-R服务开发指南.md

1094 lines
30 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SSA-Pro R 服务开发指南
> **文档版本:** v1.5
> **创建日期:** 2026-02-18
> **最后更新:** 2026-02-18纳入专家配置体系 + 统一入口函数)
> **目标读者:** R 统计工程师
---
## 1. 项目结构
```
r-statistics-service/
├── Dockerfile
├── renv.lock # 📌 包版本锁定文件
├── .Rprofile # renv 初始化
├── plumber.R # Plumber 入口
├── tools/ # 🆕 专家上传的 R 脚本(统一 run_analysis 入口)
│ ├── t_test_ind.R # 独立样本 T 检验
│ ├── t_test_paired.R # 配对样本 T 检验
│ ├── anova_one.R # 单因素方差分析
│ ├── wilcoxon.R # Wilcoxon 秩和检验
│ └── ... # 📌 约 100 个成熟脚本
├── templates/ # 📌 代码交付模板glue 格式)
│ ├── t_test.R.template
│ ├── anova.R.template
│ └── ...
├── utils/
│ ├── data_loader.R # 混合数据协议加载器
│ ├── guardrails.R # 🆕 护栏函数库(支持 Block/Warn/Switch
│ ├── code_generator.R # 代码生成工具(使用 glue
│ ├── result_formatter.R # 结果格式化p_value_fmt
│ ├── interpretation.R # 🆕 结果解读(基于配置模板)
│ └── error_codes.R # 📌 错误码定义
├── metadata/ # 工具元数据(由配置中台管理)
│ └── tools.yaml # 备用配置
└── tests/
├── test_tools.R # 单元测试
└── fixtures/ # 标准测试数据集
├── normal_data.csv
├── skewed_data.csv
└── missing_data.csv
```
### 1.1 🆕 专家 R 脚本规范
> **核心要求**:所有脚本必须使用统一入口函数 `run_analysis(input)`
```r
# 文件头部注释(必填)
#' @tool_code ST_T_TEST_IND
#' @name 独立样本 T 检验
#' @version 1.0.0
#' @description 比较两组独立样本的均值差异
#' @author 统计学专家团队
# 📌 统一入口函数(所有脚本必须实现)
run_analysis <- function(input) {
# ... 实现逻辑 ...
}
```
---
## 2. Dockerfile 模板
```dockerfile
FROM rocker/r-ver:4.3
LABEL maintainer="your-team@example.com"
LABEL version="1.0.0"
LABEL description="SSA-Pro R Statistics Service"
# 安装系统依赖
RUN apt-get update && apt-get install -y \
libcurl4-openssl-dev \
libssl-dev \
libxml2-dev \
&& rm -rf /var/lib/apt/lists/*
# 📌 安装 renv包管理工具
RUN R -e "install.packages('renv', repos='https://cloud.r-project.org/')"
WORKDIR /app
# 📌 先复制 renv.lock利用 Docker 缓存
COPY renv.lock renv.lock
COPY .Rprofile .Rprofile
# 📌 使用 renv 恢复依赖(版本锁定)
RUN R -e "renv::restore()"
# 复制应用代码
COPY . .
EXPOSE 8080
# 🆕 OSS 配置通过环境变量注入(开发/生产环境不同)
ENV OSS_ENDPOINT=""
ENV OSS_ACCESS_KEY_ID=""
ENV OSS_ACCESS_KEY_SECRET=""
ENV OSS_BUCKET=""
# 📌 启动前清理临时文件
CMD ["R", "-e", "unlink(list.files('/tmp', full.names=TRUE), recursive=TRUE); plumber::plumb('plumber.R')$run(host='0.0.0.0', port=8080)"]
```
### 2.2 环境变量配置(🆕 开发/生产差异)
```yaml
# docker-compose.yml (本地开发)
services:
ssa-r-service:
build: .
ports:
- "8080:8080"
environment:
- OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com # 公网
- OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID}
- OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET}
- OSS_BUCKET=ssa-data-bucket
```
```yaml
# SAE 环境变量 (生产)
OSS_ENDPOINT: oss-cn-beijing-internal.aliyuncs.com # 🆕 VPC 内网
OSS_ACCESS_KEY_ID: ******
OSS_ACCESS_KEY_SECRET: ******
OSS_BUCKET: ssa-data-bucket
```
> **重要**OSS Endpoint 绝不能硬编码必须通过环境变量注入。本地开发用公网SAE 生产用内网。
### 2.1 renv.lock 示例
```json
{
"R": {
"Version": "4.3.0",
"Repositories": [
{
"Name": "CRAN",
"URL": "https://cloud.r-project.org"
}
]
},
"Packages": {
"plumber": { "Package": "plumber", "Version": "1.2.1", "Source": "Repository" },
"jsonlite": { "Package": "jsonlite", "Version": "1.8.8", "Source": "Repository" },
"ggplot2": { "Package": "ggplot2", "Version": "3.4.4", "Source": "Repository" },
"glue": { "Package": "glue", "Version": "1.7.0", "Source": "Repository" },
"styler": { "Package": "styler", "Version": "1.10.2", "Source": "Repository" },
"dplyr": { "Package": "dplyr", "Version": "1.1.4", "Source": "Repository" },
"tidyr": { "Package": "tidyr", "Version": "1.3.0", "Source": "Repository" },
"base64enc": { "Package": "base64enc", "Version": "0.1-3", "Source": "Repository" },
"yaml": { "Package": "yaml", "Version": "2.3.8", "Source": "Repository" },
"car": { "Package": "car", "Version": "3.1-2", "Source": "Repository" }
}
}
```
> **重要**:每次添加新依赖后,执行 `renv::snapshot()` 更新 lock 文件。
---
## 3. Plumber 入口文件
```r
# plumber.R
library(plumber)
library(jsonlite)
# 加载工具模块
tools_dir <- "tools"
tool_files <- list.files(tools_dir, pattern = "\\.R$", full.names = TRUE)
for (f in tool_files) source(f)
# 加载公共函数
source("utils/data_loader.R") # 🆕 混合数据协议
source("utils/guardrails.R")
source("utils/code_generator.R")
source("utils/result_formatter.R") # 🆕 结果格式化
#* @apiTitle SSA-Pro R Statistics Service
#* @apiDescription 严谨型统计分析 R 引擎
#* 健康检查
#* @get /health
function() {
list(
status = "ok",
timestamp = Sys.time(),
version = "1.0.0"
)
}
#* 执行统计工具
#* @post /api/v1/skills/<tool_code>
#* @param tool_code:str 工具代码
#* @serializer unboxedJSON
function(req, tool_code) {
tryCatch({
# 解析请求体
input <- jsonlite::fromJSON(req$postBody, simplifyVector = FALSE)
# 🆕 Debug 模式:保留临时文件用于排查
debug_mode <- isTRUE(input$debug)
# 动态调用工具函数
func_name <- paste0("run_", tolower(tool_code))
if (!exists(func_name, mode = "function")) {
return(list(
status = "error",
message = paste("Unknown tool:", tool_code)
))
}
result <- do.call(func_name, list(input))
# 🆕 Debug 模式:附加临时文件路径
if (debug_mode && !is.null(result$tmp_files)) {
result$debug_files <- result$tmp_files
message("[DEBUG] 临时文件已保留: ", paste(result$tmp_files, collapse = ", "))
}
return(result)
}, error = function(e) {
# 🆕 使用友好错误映射
return(map_r_error(e$message))
})
}
```
---
## 4. 错误码定义
```r
# utils/error_codes.R
# 📌 结构化错误码,便于 LLM 自愈
ERROR_CODES <- list(
# 业务错误(可被 Planner 修复)
E001_COLUMN_NOT_FOUND = list(
code = "E001",
type = "business",
message_template = "列名 '{col}' 在数据中不存在",
user_hint = "请检查变量名是否拼写正确" # 🆕 用户友好提示
),
E002_TYPE_MISMATCH = list(
code = "E002",
type = "business",
message_template = "列 '{col}' 类型应为 {expected},实际为 {actual}",
user_hint = "该列包含非数值数据,请检查数据格式"
),
E003_INSUFFICIENT_GROUPS = list(
code = "E003",
type = "business",
message_template = "分组变量 '{col}' 应有 {expected} 个水平,实际有 {actual} 个",
user_hint = "分组变量的取值个数不符合要求"
),
E004_SAMPLE_TOO_SMALL = list(
code = "E004",
type = "business",
message_template = "样本量 {n} 不足,至少需要 {min_required}",
user_hint = "数据量太少,无法进行统计分析"
),
# 🆕 统计计算错误(用户友好映射)
E005_SINGULAR_MATRIX = list(
code = "E005",
type = "business",
message_template = "矩阵计算异常: {details}",
user_hint = "变量之间可能存在多重共线性,建议移除高度相关的变量"
),
E006_CONVERGENCE_FAILED = list(
code = "E006",
type = "business",
message_template = "模型未能收敛: {details}",
user_hint = "算法未能找到稳定解,可能需要调整参数或检查数据"
),
E007_VARIANCE_ZERO = list(
code = "E007",
type = "business",
message_template = "变量 '{col}' 方差为零",
user_hint = "该列的所有值都相同,无法进行比较"
),
# 系统错误(需人工介入)
E100_INTERNAL_ERROR = list(
code = "E100",
type = "system",
message_template = "内部错误: {details}",
user_hint = "系统繁忙,请稍后重试"
),
E101_PACKAGE_MISSING = list(
code = "E101",
type = "system",
message_template = "缺少依赖包: {package}",
user_hint = "请联系管理员"
)
)
# 🆕 R 原始错误到错误码的映射字典
R_ERROR_MAPPING <- list(
"system is computationally singular" = "E005_SINGULAR_MATRIX",
"did not converge" = "E006_CONVERGENCE_FAILED",
"constant" = "E007_VARIANCE_ZERO"
)
# 构造错误响应(🆕 含用户友好提示)
make_error <- function(error_def, ...) {
params <- list(...)
msg <- error_def$message_template
for (name in names(params)) {
msg <- gsub(paste0("\\{", name, "\\}"), as.character(params[[name]]), msg)
}
return(list(
status = "error",
error_code = error_def$code,
error_type = error_def$type,
message = msg,
user_hint = error_def$user_hint # 🆕 前端可直接展示
))
}
# 🆕 尝试将 R 原始错误映射为友好错误码
map_r_error <- function(raw_error_msg) {
for (pattern in names(R_ERROR_MAPPING)) {
if (grepl(pattern, raw_error_msg, ignore.case = TRUE)) {
error_key <- R_ERROR_MAPPING[[pattern]]
return(make_error(ERROR_CODES[[error_key]], details = raw_error_msg))
}
}
# 无法映射,返回通用内部错误
return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = raw_error_msg))
}
```
---
## 5. 🆕 混合数据协议加载器
> **核心问题**Node.js 发送的 Payload 可能是 inline JSON小数据或 OSS Key大数据R 服务必须统一处理。
```r
# utils/data_loader.R
# 🆕 混合数据协议:自动识别 inline 数据 vs OSS key
library(httr)
library(jsonlite)
# 🆕 开发模式开关(本地无法访问 OSS 时启用)
DEV_MODE <- Sys.getenv("DEV_MODE", "false") == "true"
# 从环境变量获取 OSS 配置(开发/生产差异化)
get_oss_config <- function() {
list(
endpoint = Sys.getenv("OSS_ENDPOINT", ""),
access_key_id = Sys.getenv("OSS_ACCESS_KEY_ID", ""),
access_key_secret = Sys.getenv("OSS_ACCESS_KEY_SECRET", ""),
bucket = Sys.getenv("OSS_BUCKET", ""),
mock_data_dir = Sys.getenv("OSS_MOCK_DIR", "tests/fixtures") # 🆕 Mock 目录
)
}
# 统一数据加载入口
load_input_data <- function(input) {
# 检查输入结构
if (is.null(input$data_source)) {
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
details = "请求缺少 data_source 字段"))
}
source_type <- input$data_source$type # "inline" | "oss"
if (source_type == "inline") {
# 📌 方式1内联 JSON 数据(< 2MB
return(as.data.frame(input$data_source$data))
} else if (source_type == "oss") {
# 📌 方式2从 OSS 下载2MB - 20MB
return(load_from_oss(input$data_source$oss_key))
} else {
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
details = paste("未知的 data_source.type:", source_type)))
}
}
# 从 OSS 下载数据(🆕 支持 DEV_MODE Mock
load_from_oss <- function(oss_key) {
config <- get_oss_config()
# 🆕 开发模式:从本地 fixtures 读取 Mock 数据
if (DEV_MODE) {
mock_file <- file.path(config$mock_data_dir, basename(oss_key))
if (file.exists(mock_file)) {
message(glue("[DEV_MODE] 使用本地 Mock 文件: {mock_file}"))
return(read.csv(mock_file, stringsAsFactors = FALSE))
} else {
# 回退到 normal_data.csv
fallback <- file.path(config$mock_data_dir, "normal_data.csv")
message(glue("[DEV_MODE] Mock 文件不存在,使用默认: {fallback}"))
return(read.csv(fallback, stringsAsFactors = FALSE))
}
}
if (config$endpoint == "") {
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
details = "OSS_ENDPOINT 环境变量未配置"))
}
# 构造签名 URL简化版生产应使用 SDK
url <- sprintf("https://%s.%s/%s",
config$bucket, config$endpoint, oss_key)
# 下载到临时文件
temp_file <- tempfile(fileext = ".csv")
on.exit(unlink(temp_file)) # 确保清理
tryCatch({
response <- GET(url,
add_headers(
Authorization = generate_oss_signature(config, "GET", oss_key)
),
write_disk(temp_file, overwrite = TRUE))
if (status_code(response) != 200) {
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
details = paste("OSS 下载失败:", status_code(response))))
}
return(read.csv(temp_file, stringsAsFactors = FALSE))
}, error = function(e) {
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
details = paste("OSS 网络错误:", e$message)))
})
}
# OSS 签名生成(简化版)
generate_oss_signature <- function(config, method, object_key) {
# TODO: 完整 OSS V4 签名实现
# MVP 阶段可使用阿里云 R SDK 或预签名 URL
paste0("OSS ", config$access_key_id, ":", "SIGNATURE_PLACEHOLDER")
}
```
### 5.1 后端 Payload 格式规范
Node.js `RClientService` 发送给 R 的 Payload 格式:
```typescript
// 小数据(< 2MBinline 模式
{
"data_source": {
"type": "inline",
"data": [
{ "group": "A", "value": 10.5 },
{ "group": "B", "value": 12.3 }
]
},
"params": {
"group_var": "group",
"value_var": "value"
}
}
// 大数据2MB - 20MBOSS 模式
{
"data_source": {
"type": "oss",
"oss_key": "sessions/abc123/data.csv"
},
"params": {
"group_var": "group",
"value_var": "value"
}
}
```
---
## 6. 🆕 结果格式化工具
```r
# utils/result_formatter.R
# 🆕 统计结果格式化,确保 p 值显示规范
# 格式化 p 值(符合 APA 规范)
format_p_value <- function(p) {
if (is.na(p)) return(NA)
if (p < 0.001) {
return("< 0.001")
} else {
return(sprintf("%.3f", p))
}
}
# 构建标准化结果(包含 p_value_fmt
make_result <- function(p_value, statistic, method, ...) {
list(
p_value = p_value,
p_value_fmt = format_p_value(p_value), # 🆕 前端直接展示
statistic = statistic,
method = method,
...
)
}
# 格式化置信区间
format_ci <- function(lower, upper, digits = 2) {
sprintf("[%.${digits}f, %.${digits}f]", lower, upper)
}
# 格式化效应量
format_effect_size <- function(value, type = "d") {
interpretation <- ""
if (type == "d") { # Cohen's d
if (abs(value) < 0.2) interpretation <- "微小"
else if (abs(value) < 0.5) interpretation <- "小"
else if (abs(value) < 0.8) interpretation <- "中等"
else interpretation <- "大"
}
list(
value = round(value, 3),
interpretation = interpretation
)
}
```
---
## 7. 工具 Wrapper 标准模板(使用 glue
### 5.1 代码模板文件
```r
# templates/t_test.R.template
# SSA-Pro 自动生成代码
# 工具: {tool_name}
# 时间: {timestamp}
# ================================
# 🆕 自动安装依赖(用户本地运行时自动检测)
required_packages <- c("ggplot2", "car")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) {{
message("正在安装缺失的依赖包: ", paste(new_packages, collapse = ", "))
install.packages(new_packages, repos = "https://cloud.r-project.org")
}}
library(ggplot2)
# 数据准备
df <- read.csv("your_data.csv")
group_var <- "{group_var}"
value_var <- "{value_var}"
# 正态性检验
{normality_code}
# {method_name}
result <- {test_code}
print(result)
# 可视化
ggplot(df, aes(x = {group_var}, y = {value_var})) +
geom_boxplot(fill = "#3b82f6", alpha = 0.6) +
theme_minimal() +
labs(title = "Distribution of {value_var} by {group_var}")
```
### 5.2 Wrapper 实现(使用 glue
```r
# tools/ST_T_TEST_IND.R
# 独立样本 T 检验
library(glue)
source("utils/error_codes.R")
run_st_t_test_ind <- function(input) {
# ===== 初始化 =====
logs <- c()
log_add <- function(msg) { logs <<- c(logs, paste0("[", Sys.time(), "] ", msg)) }
tmp_files <- c() # 📌 追踪临时文件
# 📌 确保退出时清理临时文件
on.exit({
if (length(tmp_files) > 0) {
unlink(tmp_files)
}
}, add = TRUE)
# ===== 🆕 数据加载(混合协议) =====
log_add("开始加载输入数据")
df <- tryCatch(
load_input_data(input), # 🆕 统一入口,自动处理 inline/OSS
error = function(e) {
log_add(paste("数据加载失败:", e$message))
return(NULL)
}
)
if (is.null(df)) {
return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = "数据加载失败"))
}
log_add(glue("数据加载成功: {nrow(df)} 行, {ncol(df)} 列"))
p <- input$params
guardrails_cfg <- input$guardrails
group_var <- p$group_variable
value_var <- p$value_variable
# 📌 业务错误检查
if (!(group_var %in% names(df))) {
return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = group_var))
}
if (!(value_var %in% names(df))) {
return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = value_var))
}
groups <- unique(df[[group_var]])
if (length(groups) != 2) {
return(make_error(ERROR_CODES$E003_INSUFFICIENT_GROUPS,
col = group_var, expected = 2, actual = length(groups)))
}
# ===== 护栏检查 =====
normality_code <- ""
method_used <- "t.test"
if (isTRUE(guardrails_cfg$check_normality)) {
log_add("执行正态性检验")
use_nonparam <- FALSE
for (g in groups) {
vals <- df[df[[group_var]] == g, value_var]
if (length(vals) >= 3 && length(vals) <= 5000) {
sw_test <- shapiro.test(vals)
normality_code <- paste0(normality_code,
glue("shapiro.test(df[df${group_var} == '{g}', '{value_var}'])\n"))
if (sw_test$p.value < 0.05) {
use_nonparam <- TRUE
log_add(glue("组[{g}] Shapiro-Wilk P = {round(sw_test$p.value, 4)} < 0.05, 拒绝正态性"))
}
}
}
if (use_nonparam && isTRUE(guardrails_cfg$auto_fix)) {
log_add("触发降级: T-Test -> Wilcoxon")
return(run_st_wilcoxon(input))
}
}
# ===== 核心计算 =====
log_add("执行 T 检验")
g1_vals <- df[df[[group_var]] == groups[1], value_var]
g2_vals <- df[df[[group_var]] == groups[2], value_var]
result <- t.test(g1_vals, g2_vals, var.equal = FALSE)
# ===== 生成图表 =====
log_add("生成箱线图")
plot_base64 <- generate_boxplot(df, group_var, value_var, tmp_files)
# ===== 📌 使用 glue 生成代码 =====
template <- readLines("templates/t_test.R.template", warn = FALSE)
template_str <- paste(template, collapse = "\n")
reproducible_code <- glue(template_str,
tool_name = "独立样本 T 检验",
timestamp = Sys.time(),
group_var = group_var,
value_var = value_var,
normality_code = if (nchar(normality_code) > 0) normality_code else "# 未执行正态性检验",
method_name = result$method,
test_code = glue("t.test(df[df${group_var} == '{groups[1]}', '{value_var}'],
df[df${group_var} == '{groups[2]}', '{value_var}'],
var.equal = FALSE)")
)
# 📌 使用 styler 格式化代码(可选)
# reproducible_code <- styler::style_text(reproducible_code)
# ===== 返回结果 =====
log_add("分析完成")
return(list(
status = "success",
message = "分析完成",
results = list(
method = result$method,
statistic = unbox(as.numeric(result$statistic)),
p_value = unbox(as.numeric(result$p.value)),
p_value_fmt = format_p_value(result$p.value), # 🆕 格式化 p 值
conf_int = as.numeric(result$conf.int),
estimate = as.numeric(result$estimate),
group_stats = list(
list(group = groups[1], n = length(g1_vals), mean = mean(g1_vals), sd = sd(g1_vals)),
list(group = groups[2], n = length(g2_vals), mean = mean(g2_vals), sd = sd(g2_vals))
)
),
plots = list(plot_base64),
trace_log = logs,
reproducible_code = as.character(reproducible_code)
))
}
# 📌 辅助函数(带临时文件追踪)
generate_boxplot <- function(df, group_var, value_var, tmp_files_ref) {
library(ggplot2)
library(base64enc)
p <- ggplot(df, aes_string(x = group_var, y = value_var)) +
geom_boxplot(fill = "#3b82f6", alpha = 0.6) +
theme_minimal() +
labs(title = paste("Distribution of", value_var, "by", group_var))
tmp_file <- tempfile(fileext = ".png")
tmp_files_ref <- c(tmp_files_ref, tmp_file) # 追踪
ggsave(tmp_file, p, width = 6, height = 4, dpi = 100)
base64_str <- base64encode(tmp_file)
unlink(tmp_file) # 立即清理
return(paste0("data:image/png;base64,", base64_str))
}
```
---
## 8. 护栏函数库
```r
# utils/guardrails.R
# 🆕 大样本优化阈值
LARGE_SAMPLE_THRESHOLD <- 5000
# 🆕 护栏 Action 类型
ACTION_BLOCK <- "Block" # 阻止执行
ACTION_WARN <- "Warn" # 警告但继续
ACTION_SWITCH <- "Switch" # 切换到备选方法
# 🆕 护栏检查结果结构
# list(
# passed = TRUE/FALSE,
# action = "Block" | "Warn" | "Switch",
# action_target = "ST_XXX" | NULL,
# p_value = 0.05,
# reason = "描述"
# )
# 正态性检验(🆕 支持三种 Action
check_normality <- function(values, alpha = 0.05, action = ACTION_SWITCH, action_target = NULL) {
n <- length(values)
# 样本量过小
if (n < 3) {
return(list(
passed = TRUE,
action = NULL,
action_target = NULL,
reason = "样本量过小,跳过正态性检验",
skipped = TRUE
))
}
# 🆕 大样本优化N > 5000 时使用抽样检验
if (n > LARGE_SAMPLE_THRESHOLD) {
set.seed(42)
sampled_values <- sample(values, 1000)
test <- shapiro.test(sampled_values)
passed <- test$p.value >= alpha
return(list(
passed = passed,
action = if (passed) NULL else action,
action_target = if (passed) NULL else action_target,
p_value = test$p.value,
reason = glue("大样本(N={n})抽样检验,{if (passed) '满足正态性' else '不满足正态性'}"),
sampled = TRUE,
sample_size = 1000
))
}
# 常规检验
test <- shapiro.test(values)
passed <- test$p.value >= alpha
return(list(
passed = passed,
action = if (passed) NULL else action,
action_target = if (passed) NULL else action_target,
p_value = test$p.value,
reason = if (passed) "满足正态性" else "不满足正态性",
sampled = FALSE
))
}
# 方差齐性检验 (Levene)
check_homogeneity <- function(df, group_var, value_var, alpha = 0.05, action = ACTION_WARN) {
library(car)
formula <- as.formula(paste(value_var, "~", group_var))
test <- leveneTest(formula, data = df)
p_val <- test$`Pr(>F)`[1]
passed <- p_val >= alpha
return(list(
passed = passed,
action = if (passed) NULL else action,
p_value = p_val,
reason = if (passed) "方差齐性满足" else "方差不齐性"
))
}
# 样本量检验
check_sample_size <- function(n, min_required = 30, action = ACTION_BLOCK) {
passed <- n >= min_required
return(list(
passed = passed,
action = if (passed) NULL else action,
n = n,
reason = if (passed) "样本量充足" else paste0("样本量不足, 需要至少 ", min_required)
))
}
# 🆕 执行护栏链(按 check_order 顺序执行)
run_guardrail_chain <- function(input, guardrail_configs) {
for (config in guardrail_configs) {
check_func <- get(config$check_code)
result <- do.call(check_func, list(
input,
action = config$action_type,
action_target = config$action_target
))
if (!result$passed) {
if (result$action == ACTION_BLOCK) {
return(list(
status = "blocked",
reason = result$reason
))
} else if (result$action == ACTION_SWITCH) {
return(list(
status = "switch",
target_tool = result$action_target,
reason = result$reason
))
}
# WARN: 记录警告但继续
}
}
return(list(status = "passed"))
}
```
---
## 9. API 请求/响应规范
### 9.1 🆕 请求格式(混合协议)
**方式 1Inline 数据(< 2MB**
```json
{
"data_source": {
"type": "inline",
"data": [
{"Gender": "Male", "GLU": 5.8, "Age": 45},
{"Gender": "Female", "GLU": 5.1, "Age": 38}
]
},
"params": {
"group_variable": "Gender",
"value_variable": "GLU"
},
"guardrails": {
"check_normality": true,
"auto_fix": true
}
}
```
**方式 2OSS 数据2MB - 20MB**
```json
{
"data_source": {
"type": "oss",
"oss_key": "sessions/abc123/data.csv"
},
"params": {
"group_variable": "Gender",
"value_variable": "GLU"
},
"guardrails": {
"check_normality": true,
"auto_fix": true
}
}
```
### 9.2 成功响应(🆕 含 p_value_fmt
```json
{
"status": "success",
"message": "分析完成",
"results": {
"method": "Welch Two Sample t-test",
"statistic": 2.345,
"p_value": 0.021,
"p_value_fmt": "0.021",
"conf_int": [0.12, 1.28],
"group_stats": [
{"group": "Male", "n": 78, "mean": 5.8, "sd": 0.9},
{"group": "Female", "n": 72, "mean": 5.1, "sd": 0.7}
]
},
"plots": ["data:image/png;base64,..."],
"trace_log": [
"[2026-02-18 10:30:01] 数据加载成功: 150 行, 5 列",
"[2026-02-18 10:30:01] 执行正态性检验",
"[2026-02-18 10:30:02] 执行 T 检验",
"[2026-02-18 10:30:02] 分析完成"
],
"reproducible_code": "# SSA-Pro 自动生成代码\n..."
}
```
> **p_value_fmt 说明**
> - p >= 0.001: 保留 3 位小数,如 "0.021"
> - p < 0.001: 显示 "< 0.001"
> - 前端应直接使用 `p_value_fmt` 展示,避免重复格式化
### 7.3 错误响应(📌 含结构化错误码)
```json
{
"status": "error",
"error_code": "E001",
"error_type": "business",
"message": "列名 'invalid_col' 在数据中不存在",
"trace_log": [
"[2026-02-18 10:30:01] 开始解析输入数据",
"[2026-02-18 10:30:01] 错误: 列名 'invalid_col' 在数据中不存在"
]
}
```
> **错误类型说明**
> - `business`业务错误Planner 可尝试自动修复参数后重试
> - `system`:系统错误,需人工介入
---
## 8. MVP 10 个工具清单
| 序号 | 工具代码 | 文件名 | 主要函数 | 护栏 |
|------|---------|--------|---------|------|
| 1 | ST_T_TEST_IND | ST_T_TEST_IND.R | `run_st_t_test_ind()` | 正态性 |
| 2 | ST_T_TEST_PAIRED | ST_T_TEST_PAIRED.R | `run_st_t_test_paired()` | 正态性 |
| 3 | ST_ANOVA_ONE | ST_ANOVA_ONE.R | `run_st_anova_one()` | 正态性+方差齐性 |
| 4 | ST_CHI_SQUARE | ST_CHI_SQUARE.R | `run_st_chi_square()` | 期望频数 |
| 5 | ST_FISHER | ST_FISHER.R | `run_st_fisher()` | 无 |
| 6 | ST_WILCOXON | ST_WILCOXON.R | `run_st_wilcoxon()` | 无 |
| 7 | ST_MANN_WHITNEY | ST_MANN_WHITNEY.R | `run_st_mann_whitney()` | 无 |
| 8 | ST_CORRELATION | ST_CORRELATION.R | `run_st_correlation()` | 正态性决定Pearson/Spearman |
| 9 | ST_LINEAR_REG | ST_LINEAR_REG.R | `run_st_linear_reg()` | 残差正态性 |
| 10 | ST_DESCRIPTIVE | ST_DESCRIPTIVE.R | `run_st_descriptive()` | 无 |
---
## 9. 本地开发流程
### 8.1 构建镜像
```bash
cd r-statistics-service
docker build -t ssa-r-service:dev .
```
### 8.2 运行容器
```bash
docker run -d -p 8080:8080 --name ssa-r-dev ssa-r-service:dev
```
### 8.3 测试健康检查
```bash
curl http://localhost:8080/health
```
### 8.4 测试工具调用
```bash
curl -X POST http://localhost:8080/api/v1/skills/ST_T_TEST_IND \
-H "Content-Type: application/json" \
-d '{
"data": [
{"Gender": "Male", "GLU": 5.8},
{"Gender": "Male", "GLU": 6.1},
{"Gender": "Female", "GLU": 5.0},
{"Gender": "Female", "GLU": 5.2}
],
"params": {
"group_variable": "Gender",
"value_variable": "GLU"
},
"guardrails": {
"check_normality": true,
"auto_fix": true
}
}'
```
---
## 10. 工具元数据格式
```yaml
# metadata/tools.yaml
tools:
- code: ST_T_TEST_IND
name: 独立样本 T 检验
version: "1.0.0"
category: 假设检验
description: |
用于比较两个独立组的均值是否存在显著差异。
适用场景比较男性vs女性的血糖水平、实验组vs对照组的疗效等。
usage_context: |
- 两组独立样本比较
- 连续型数值变量
- 样本量建议 >= 30
params_schema:
type: object
required:
- group_variable
- value_variable
properties:
group_variable:
type: string
description: 分组变量名(应为分类变量,仅含两个水平)
value_variable:
type: string
description: 检验变量名(应为数值型)
guardrails:
- check_normality
- check_homogeneity
```
---
## 11. 常见问题
### Q1: 护栏检查失败后如何处理?
如果 `auto_fix = true`R 服务会自动降级到适当的非参数方法。如果 `auto_fix = false`,则返回警告但仍执行原方法。
### Q2: 如何添加新工具?
1.`tools/` 目录创建 `ST_NEW_TOOL.R`
2. 实现 `run_st_new_tool(input)` 函数
3.`metadata/tools.yaml` 添加元数据
4. 执行后端脚本导入到 pgvector
### Q3: 图表生成失败怎么办?
检查 `plots` 字段是否为空数组。R 服务不会因图表失败而中断整个分析,但会在 `trace_log` 中记录错误。
### Q4: 如何添加新的 R 包依赖?
1. 在 R 控制台执行 `install.packages("new_package")`
2. 执行 `renv::snapshot()` 更新 `renv.lock`
3. 提交 `renv.lock` 到版本控制
4. 重新构建 Docker 镜像
### Q5: 临时文件清理策略是什么?
- **代码层面**:使用 `on.exit(unlink(tmp_files))` 确保函数退出时清理
- **容器层面**Docker 启动时清理 `/tmp`
- **运维层面**SAE 配置定时任务,每日清理 24 小时前的临时文件