Phase 2A: WorkflowPlannerService, WorkflowExecutorService, Python data quality, 6 bug fixes, DescriptiveResultView, multi-step R code/Word export, MVP UI reuse. V11 UI: Gemini-style, multi-task, single-page scroll, Word export. Architecture: Block-based rendering consensus (4 block types). New R tools: chi_square, correlation, descriptive, logistic_binary, mann_whitney, t_test_paired. Docs: dev summary, block-based plan, status updates, task list v2.0. Co-authored-by: Cursor <cursoragent@cursor.com>
1138 lines
32 KiB
Markdown
1138 lines
32 KiB
Markdown
# SSA-Pro R 服务开发指南
|
||
|
||
> **文档版本:** v1.6
|
||
> **创建日期:** 2026-02-18
|
||
> **最后更新:** 2026-02-20(纳入智能化演进共识 + 错误捕获管道设计)
|
||
> **目标读者:** R 统计工程师
|
||
|
||
---
|
||
|
||
## 1. 项目结构
|
||
|
||
```
|
||
r-statistics-service/
|
||
├── Dockerfile
|
||
├── renv.lock # 📌 包版本锁定文件
|
||
├── .Rprofile # renv 初始化
|
||
├── plumber.R # Plumber 入口
|
||
├── tools/ # 🆕 专家上传的 R 脚本(统一 run_analysis 入口)
|
||
│ ├── t_test_ind.R # 独立样本 T 检验
|
||
│ ├── t_test_paired.R # 配对样本 T 检验
|
||
│ ├── anova_one.R # 单因素方差分析
|
||
│ ├── wilcoxon.R # Wilcoxon 秩和检验
|
||
│ └── ... # 📌 约 100 个成熟脚本
|
||
├── templates/ # 📌 代码交付模板(glue 格式)
|
||
│ ├── t_test.R.template
|
||
│ ├── anova.R.template
|
||
│ └── ...
|
||
├── utils/
|
||
│ ├── data_loader.R # 混合数据协议加载器
|
||
│ ├── guardrails.R # 🆕 护栏函数库(支持 Block/Warn/Switch)
|
||
│ ├── code_generator.R # 代码生成工具(使用 glue)
|
||
│ ├── result_formatter.R # 结果格式化(p_value_fmt)
|
||
│ ├── interpretation.R # 🆕 结果解读(基于配置模板)
|
||
│ └── error_codes.R # 📌 错误码定义
|
||
├── metadata/ # 工具元数据(由配置中台管理)
|
||
│ └── tools.yaml # 备用配置
|
||
└── tests/
|
||
├── test_tools.R # 单元测试
|
||
└── fixtures/ # 标准测试数据集
|
||
├── normal_data.csv
|
||
├── skewed_data.csv
|
||
└── missing_data.csv
|
||
```
|
||
|
||
### 1.1 🆕 专家 R 脚本规范
|
||
|
||
> **核心要求**:所有脚本必须使用统一入口函数 `run_analysis(input)`
|
||
|
||
```r
|
||
# 文件头部注释(必填)
|
||
#' @tool_code ST_T_TEST_IND
|
||
#' @name 独立样本 T 检验
|
||
#' @version 1.0.0
|
||
#' @description 比较两组独立样本的均值差异
|
||
#' @author 统计学专家团队
|
||
|
||
# 📌 统一入口函数(所有脚本必须实现)
|
||
run_analysis <- function(input) {
|
||
# ... 实现逻辑 ...
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 2. Dockerfile 模板
|
||
|
||
```dockerfile
|
||
FROM rocker/r-ver:4.3
|
||
|
||
LABEL maintainer="your-team@example.com"
|
||
LABEL version="1.0.0"
|
||
LABEL description="SSA-Pro R Statistics Service"
|
||
|
||
# 安装系统依赖
|
||
RUN apt-get update && apt-get install -y \
|
||
libcurl4-openssl-dev \
|
||
libssl-dev \
|
||
libxml2-dev \
|
||
&& rm -rf /var/lib/apt/lists/*
|
||
|
||
# 📌 安装 renv(包管理工具)
|
||
RUN R -e "install.packages('renv', repos='https://cloud.r-project.org/')"
|
||
|
||
WORKDIR /app
|
||
|
||
# 📌 先复制 renv.lock,利用 Docker 缓存
|
||
COPY renv.lock renv.lock
|
||
COPY .Rprofile .Rprofile
|
||
|
||
# 📌 使用 renv 恢复依赖(版本锁定)
|
||
RUN R -e "renv::restore()"
|
||
|
||
# 复制应用代码
|
||
COPY . .
|
||
|
||
EXPOSE 8080
|
||
|
||
# 🆕 OSS 配置通过环境变量注入(开发/生产环境不同)
|
||
ENV OSS_ENDPOINT=""
|
||
ENV OSS_ACCESS_KEY_ID=""
|
||
ENV OSS_ACCESS_KEY_SECRET=""
|
||
ENV OSS_BUCKET=""
|
||
|
||
# 📌 启动前清理临时文件
|
||
CMD ["R", "-e", "unlink(list.files('/tmp', full.names=TRUE), recursive=TRUE); plumber::plumb('plumber.R')$run(host='0.0.0.0', port=8080)"]
|
||
```
|
||
|
||
### 2.2 环境变量配置(🆕 开发/生产差异)
|
||
|
||
```yaml
|
||
# docker-compose.yml (本地开发)
|
||
services:
|
||
ssa-r-service:
|
||
build: .
|
||
ports:
|
||
- "8080:8080"
|
||
environment:
|
||
- OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com # 公网
|
||
- OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID}
|
||
- OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET}
|
||
- OSS_BUCKET=ssa-data-bucket
|
||
```
|
||
|
||
```yaml
|
||
# SAE 环境变量 (生产)
|
||
OSS_ENDPOINT: oss-cn-beijing-internal.aliyuncs.com # 🆕 VPC 内网
|
||
OSS_ACCESS_KEY_ID: ******
|
||
OSS_ACCESS_KEY_SECRET: ******
|
||
OSS_BUCKET: ssa-data-bucket
|
||
```
|
||
|
||
> **重要**:OSS Endpoint 绝不能硬编码,必须通过环境变量注入。本地开发用公网,SAE 生产用内网。
|
||
|
||
### 2.1 renv.lock 示例
|
||
|
||
```json
|
||
{
|
||
"R": {
|
||
"Version": "4.3.0",
|
||
"Repositories": [
|
||
{
|
||
"Name": "CRAN",
|
||
"URL": "https://cloud.r-project.org"
|
||
}
|
||
]
|
||
},
|
||
"Packages": {
|
||
"plumber": { "Package": "plumber", "Version": "1.2.1", "Source": "Repository" },
|
||
"jsonlite": { "Package": "jsonlite", "Version": "1.8.8", "Source": "Repository" },
|
||
"ggplot2": { "Package": "ggplot2", "Version": "3.4.4", "Source": "Repository" },
|
||
"glue": { "Package": "glue", "Version": "1.7.0", "Source": "Repository" },
|
||
"styler": { "Package": "styler", "Version": "1.10.2", "Source": "Repository" },
|
||
"dplyr": { "Package": "dplyr", "Version": "1.1.4", "Source": "Repository" },
|
||
"tidyr": { "Package": "tidyr", "Version": "1.3.0", "Source": "Repository" },
|
||
"base64enc": { "Package": "base64enc", "Version": "0.1-3", "Source": "Repository" },
|
||
"yaml": { "Package": "yaml", "Version": "2.3.8", "Source": "Repository" },
|
||
"car": { "Package": "car", "Version": "3.1-2", "Source": "Repository" }
|
||
}
|
||
}
|
||
```
|
||
|
||
> **重要**:每次添加新依赖后,执行 `renv::snapshot()` 更新 lock 文件。
|
||
|
||
---
|
||
|
||
## 3. Plumber 入口文件
|
||
|
||
```r
|
||
# plumber.R
|
||
library(plumber)
|
||
library(jsonlite)
|
||
|
||
# 加载工具模块
|
||
tools_dir <- "tools"
|
||
tool_files <- list.files(tools_dir, pattern = "\\.R$", full.names = TRUE)
|
||
for (f in tool_files) source(f)
|
||
|
||
# 加载公共函数
|
||
source("utils/data_loader.R") # 🆕 混合数据协议
|
||
source("utils/guardrails.R")
|
||
source("utils/code_generator.R")
|
||
source("utils/result_formatter.R") # 🆕 结果格式化
|
||
|
||
#* @apiTitle SSA-Pro R Statistics Service
|
||
#* @apiDescription 严谨型统计分析 R 引擎
|
||
|
||
#* 健康检查
|
||
#* @get /health
|
||
function() {
|
||
list(
|
||
status = "ok",
|
||
timestamp = Sys.time(),
|
||
version = "1.0.0"
|
||
)
|
||
}
|
||
|
||
#* 执行统计工具
|
||
#* @post /api/v1/skills/<tool_code>
|
||
#* @param tool_code:str 工具代码
|
||
#* @serializer unboxedJSON
|
||
function(req, tool_code) {
|
||
tryCatch({
|
||
# 解析请求体
|
||
input <- jsonlite::fromJSON(req$postBody, simplifyVector = FALSE)
|
||
|
||
# 🆕 Debug 模式:保留临时文件用于排查
|
||
debug_mode <- isTRUE(input$debug)
|
||
|
||
# 动态调用工具函数
|
||
func_name <- paste0("run_", tolower(tool_code))
|
||
|
||
if (!exists(func_name, mode = "function")) {
|
||
return(list(
|
||
status = "error",
|
||
message = paste("Unknown tool:", tool_code)
|
||
))
|
||
}
|
||
|
||
result <- do.call(func_name, list(input))
|
||
|
||
# 🆕 Debug 模式:附加临时文件路径
|
||
if (debug_mode && !is.null(result$tmp_files)) {
|
||
result$debug_files <- result$tmp_files
|
||
message("[DEBUG] 临时文件已保留: ", paste(result$tmp_files, collapse = ", "))
|
||
}
|
||
|
||
return(result)
|
||
|
||
}, error = function(e) {
|
||
# 🆕 使用友好错误映射
|
||
return(map_r_error(e$message))
|
||
})
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 4. 错误码定义
|
||
|
||
### 🆕 4.0 错误捕获管道设计(为 Phase 3 靶向修改铺路)
|
||
|
||
> **重要**:结构化的错误捕获是 Phase 3 "靶向代码修改"能力的基础。
|
||
>
|
||
> 详细背景参考:`04-开发计划/06-智能化演进共识与MVP执行计划.md`
|
||
|
||
**Phase 3 的工作流程:**
|
||
```
|
||
工具执行报错 → 错误捕获管道 → 结构化 JSON → 反馈给 LLM → LLM 靶向修改代码
|
||
```
|
||
|
||
**错误 JSON 结构要求(便于 LLM 理解):**
|
||
```json
|
||
{
|
||
"status": "error",
|
||
"error_code": "E002",
|
||
"error_type": "business",
|
||
"message": "列 'blood_pressure' 类型应为 numeric,实际为 character",
|
||
"user_hint": "该列包含非数值数据,请检查数据格式",
|
||
"context": {
|
||
"tool_code": "ST_T_TEST_IND",
|
||
"problematic_column": "blood_pressure",
|
||
"expected_type": "numeric",
|
||
"actual_type": "character",
|
||
"sample_values": ["120", "130", "未知", "125"],
|
||
"line_number": 45
|
||
}
|
||
}
|
||
```
|
||
|
||
**关键字段说明:**
|
||
|
||
| 字段 | MVP 用途 | Phase 3 用途 |
|
||
|------|---------|-------------|
|
||
| `error_code` | 日志分类 | LLM 识别错误类型 |
|
||
| `error_type` | 区分业务/系统错误 | LLM 判断是否可自愈 |
|
||
| `message` | 开发调试 | LLM 理解错误原因 |
|
||
| `user_hint` | 前端展示 | 保留 |
|
||
| `context` | 🆕 可选 | LLM 靶向修改的关键信息 |
|
||
|
||
> **MVP 阶段行动**:错误响应中尽量包含 `context` 信息,为 Phase 3 积累"黄金数据集"。
|
||
|
||
---
|
||
|
||
```r
|
||
# utils/error_codes.R
|
||
# 📌 结构化错误码,便于 LLM 自愈
|
||
|
||
ERROR_CODES <- list(
|
||
# 业务错误(可被 Planner 修复)
|
||
E001_COLUMN_NOT_FOUND = list(
|
||
code = "E001",
|
||
type = "business",
|
||
message_template = "列名 '{col}' 在数据中不存在",
|
||
user_hint = "请检查变量名是否拼写正确" # 🆕 用户友好提示
|
||
),
|
||
E002_TYPE_MISMATCH = list(
|
||
code = "E002",
|
||
type = "business",
|
||
message_template = "列 '{col}' 类型应为 {expected},实际为 {actual}",
|
||
user_hint = "该列包含非数值数据,请检查数据格式"
|
||
),
|
||
E003_INSUFFICIENT_GROUPS = list(
|
||
code = "E003",
|
||
type = "business",
|
||
message_template = "分组变量 '{col}' 应有 {expected} 个水平,实际有 {actual} 个",
|
||
user_hint = "分组变量的取值个数不符合要求"
|
||
),
|
||
E004_SAMPLE_TOO_SMALL = list(
|
||
code = "E004",
|
||
type = "business",
|
||
message_template = "样本量 {n} 不足,至少需要 {min_required}",
|
||
user_hint = "数据量太少,无法进行统计分析"
|
||
),
|
||
|
||
# 🆕 统计计算错误(用户友好映射)
|
||
E005_SINGULAR_MATRIX = list(
|
||
code = "E005",
|
||
type = "business",
|
||
message_template = "矩阵计算异常: {details}",
|
||
user_hint = "变量之间可能存在多重共线性,建议移除高度相关的变量"
|
||
),
|
||
E006_CONVERGENCE_FAILED = list(
|
||
code = "E006",
|
||
type = "business",
|
||
message_template = "模型未能收敛: {details}",
|
||
user_hint = "算法未能找到稳定解,可能需要调整参数或检查数据"
|
||
),
|
||
E007_VARIANCE_ZERO = list(
|
||
code = "E007",
|
||
type = "business",
|
||
message_template = "变量 '{col}' 方差为零",
|
||
user_hint = "该列的所有值都相同,无法进行比较"
|
||
),
|
||
|
||
# 系统错误(需人工介入)
|
||
E100_INTERNAL_ERROR = list(
|
||
code = "E100",
|
||
type = "system",
|
||
message_template = "内部错误: {details}",
|
||
user_hint = "系统繁忙,请稍后重试"
|
||
),
|
||
E101_PACKAGE_MISSING = list(
|
||
code = "E101",
|
||
type = "system",
|
||
message_template = "缺少依赖包: {package}",
|
||
user_hint = "请联系管理员"
|
||
)
|
||
)
|
||
|
||
# 🆕 R 原始错误到错误码的映射字典
|
||
R_ERROR_MAPPING <- list(
|
||
"system is computationally singular" = "E005_SINGULAR_MATRIX",
|
||
"did not converge" = "E006_CONVERGENCE_FAILED",
|
||
"constant" = "E007_VARIANCE_ZERO"
|
||
)
|
||
|
||
# 构造错误响应(🆕 含用户友好提示)
|
||
make_error <- function(error_def, ...) {
|
||
params <- list(...)
|
||
msg <- error_def$message_template
|
||
for (name in names(params)) {
|
||
msg <- gsub(paste0("\\{", name, "\\}"), as.character(params[[name]]), msg)
|
||
}
|
||
return(list(
|
||
status = "error",
|
||
error_code = error_def$code,
|
||
error_type = error_def$type,
|
||
message = msg,
|
||
user_hint = error_def$user_hint # 🆕 前端可直接展示
|
||
))
|
||
}
|
||
|
||
# 🆕 尝试将 R 原始错误映射为友好错误码
|
||
map_r_error <- function(raw_error_msg) {
|
||
for (pattern in names(R_ERROR_MAPPING)) {
|
||
if (grepl(pattern, raw_error_msg, ignore.case = TRUE)) {
|
||
error_key <- R_ERROR_MAPPING[[pattern]]
|
||
return(make_error(ERROR_CODES[[error_key]], details = raw_error_msg))
|
||
}
|
||
}
|
||
# 无法映射,返回通用内部错误
|
||
return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = raw_error_msg))
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 5. 🆕 混合数据协议加载器
|
||
|
||
> **核心问题**:Node.js 发送的 Payload 可能是 inline JSON(小数据)或 OSS Key(大数据),R 服务必须统一处理。
|
||
|
||
```r
|
||
# utils/data_loader.R
|
||
# 🆕 混合数据协议:自动识别 inline 数据 vs OSS key
|
||
|
||
library(httr)
|
||
library(jsonlite)
|
||
|
||
# 🆕 开发模式开关(本地无法访问 OSS 时启用)
|
||
DEV_MODE <- Sys.getenv("DEV_MODE", "false") == "true"
|
||
|
||
# 从环境变量获取 OSS 配置(开发/生产差异化)
|
||
get_oss_config <- function() {
|
||
list(
|
||
endpoint = Sys.getenv("OSS_ENDPOINT", ""),
|
||
access_key_id = Sys.getenv("OSS_ACCESS_KEY_ID", ""),
|
||
access_key_secret = Sys.getenv("OSS_ACCESS_KEY_SECRET", ""),
|
||
bucket = Sys.getenv("OSS_BUCKET", ""),
|
||
mock_data_dir = Sys.getenv("OSS_MOCK_DIR", "tests/fixtures") # 🆕 Mock 目录
|
||
)
|
||
}
|
||
|
||
# 统一数据加载入口
|
||
load_input_data <- function(input) {
|
||
# 检查输入结构
|
||
if (is.null(input$data_source)) {
|
||
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
|
||
details = "请求缺少 data_source 字段"))
|
||
}
|
||
|
||
source_type <- input$data_source$type # "inline" | "oss"
|
||
|
||
if (source_type == "inline") {
|
||
# 📌 方式1:内联 JSON 数据(< 2MB)
|
||
return(as.data.frame(input$data_source$data))
|
||
|
||
} else if (source_type == "oss") {
|
||
# 📌 方式2:从 OSS 下载(2MB - 20MB)
|
||
return(load_from_oss(input$data_source$oss_key))
|
||
|
||
} else {
|
||
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
|
||
details = paste("未知的 data_source.type:", source_type)))
|
||
}
|
||
}
|
||
|
||
# 从 OSS 下载数据(🆕 支持 DEV_MODE Mock)
|
||
load_from_oss <- function(oss_key) {
|
||
config <- get_oss_config()
|
||
|
||
# 🆕 开发模式:从本地 fixtures 读取 Mock 数据
|
||
if (DEV_MODE) {
|
||
mock_file <- file.path(config$mock_data_dir, basename(oss_key))
|
||
if (file.exists(mock_file)) {
|
||
message(glue("[DEV_MODE] 使用本地 Mock 文件: {mock_file}"))
|
||
return(read.csv(mock_file, stringsAsFactors = FALSE))
|
||
} else {
|
||
# 回退到 normal_data.csv
|
||
fallback <- file.path(config$mock_data_dir, "normal_data.csv")
|
||
message(glue("[DEV_MODE] Mock 文件不存在,使用默认: {fallback}"))
|
||
return(read.csv(fallback, stringsAsFactors = FALSE))
|
||
}
|
||
}
|
||
|
||
if (config$endpoint == "") {
|
||
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
|
||
details = "OSS_ENDPOINT 环境变量未配置"))
|
||
}
|
||
|
||
# 构造签名 URL(简化版,生产应使用 SDK)
|
||
url <- sprintf("https://%s.%s/%s",
|
||
config$bucket, config$endpoint, oss_key)
|
||
|
||
# 下载到临时文件
|
||
temp_file <- tempfile(fileext = ".csv")
|
||
on.exit(unlink(temp_file)) # 确保清理
|
||
|
||
tryCatch({
|
||
response <- GET(url,
|
||
add_headers(
|
||
Authorization = generate_oss_signature(config, "GET", oss_key)
|
||
),
|
||
write_disk(temp_file, overwrite = TRUE))
|
||
|
||
if (status_code(response) != 200) {
|
||
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
|
||
details = paste("OSS 下载失败:", status_code(response))))
|
||
}
|
||
|
||
return(read.csv(temp_file, stringsAsFactors = FALSE))
|
||
|
||
}, error = function(e) {
|
||
stop(make_error(ERROR_CODES$E100_INTERNAL_ERROR,
|
||
details = paste("OSS 网络错误:", e$message)))
|
||
})
|
||
}
|
||
|
||
# OSS 签名生成(简化版)
|
||
generate_oss_signature <- function(config, method, object_key) {
|
||
# TODO: 完整 OSS V4 签名实现
|
||
# MVP 阶段可使用阿里云 R SDK 或预签名 URL
|
||
paste0("OSS ", config$access_key_id, ":", "SIGNATURE_PLACEHOLDER")
|
||
}
|
||
```
|
||
|
||
### 5.1 后端 Payload 格式规范
|
||
|
||
Node.js `RClientService` 发送给 R 的 Payload 格式:
|
||
|
||
```typescript
|
||
// 小数据(< 2MB):inline 模式
|
||
{
|
||
"data_source": {
|
||
"type": "inline",
|
||
"data": [
|
||
{ "group": "A", "value": 10.5 },
|
||
{ "group": "B", "value": 12.3 }
|
||
]
|
||
},
|
||
"params": {
|
||
"group_var": "group",
|
||
"value_var": "value"
|
||
}
|
||
}
|
||
|
||
// 大数据(2MB - 20MB):OSS 模式
|
||
{
|
||
"data_source": {
|
||
"type": "oss",
|
||
"oss_key": "sessions/abc123/data.csv"
|
||
},
|
||
"params": {
|
||
"group_var": "group",
|
||
"value_var": "value"
|
||
}
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 6. 🆕 结果格式化工具
|
||
|
||
```r
|
||
# utils/result_formatter.R
|
||
# 🆕 统计结果格式化,确保 p 值显示规范
|
||
|
||
# 格式化 p 值(符合 APA 规范)
|
||
format_p_value <- function(p) {
|
||
if (is.na(p)) return(NA)
|
||
|
||
if (p < 0.001) {
|
||
return("< 0.001")
|
||
} else {
|
||
return(sprintf("%.3f", p))
|
||
}
|
||
}
|
||
|
||
# 构建标准化结果(包含 p_value_fmt)
|
||
make_result <- function(p_value, statistic, method, ...) {
|
||
list(
|
||
p_value = p_value,
|
||
p_value_fmt = format_p_value(p_value), # 🆕 前端直接展示
|
||
statistic = statistic,
|
||
method = method,
|
||
...
|
||
)
|
||
}
|
||
|
||
# 格式化置信区间
|
||
format_ci <- function(lower, upper, digits = 2) {
|
||
sprintf("[%.${digits}f, %.${digits}f]", lower, upper)
|
||
}
|
||
|
||
# 格式化效应量
|
||
format_effect_size <- function(value, type = "d") {
|
||
interpretation <- ""
|
||
if (type == "d") { # Cohen's d
|
||
if (abs(value) < 0.2) interpretation <- "微小"
|
||
else if (abs(value) < 0.5) interpretation <- "小"
|
||
else if (abs(value) < 0.8) interpretation <- "中等"
|
||
else interpretation <- "大"
|
||
}
|
||
list(
|
||
value = round(value, 3),
|
||
interpretation = interpretation
|
||
)
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 7. 工具 Wrapper 标准模板(使用 glue)
|
||
|
||
### 5.1 代码模板文件
|
||
|
||
```r
|
||
# templates/t_test.R.template
|
||
# SSA-Pro 自动生成代码
|
||
# 工具: {tool_name}
|
||
# 时间: {timestamp}
|
||
# ================================
|
||
|
||
# 🆕 自动安装依赖(用户本地运行时自动检测)
|
||
required_packages <- c("ggplot2", "car")
|
||
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
|
||
if(length(new_packages)) {{
|
||
message("正在安装缺失的依赖包: ", paste(new_packages, collapse = ", "))
|
||
install.packages(new_packages, repos = "https://cloud.r-project.org")
|
||
}}
|
||
|
||
library(ggplot2)
|
||
|
||
# 数据准备
|
||
df <- read.csv("your_data.csv")
|
||
group_var <- "{group_var}"
|
||
value_var <- "{value_var}"
|
||
|
||
# 正态性检验
|
||
{normality_code}
|
||
|
||
# {method_name}
|
||
result <- {test_code}
|
||
print(result)
|
||
|
||
# 可视化
|
||
ggplot(df, aes(x = {group_var}, y = {value_var})) +
|
||
geom_boxplot(fill = "#3b82f6", alpha = 0.6) +
|
||
theme_minimal() +
|
||
labs(title = "Distribution of {value_var} by {group_var}")
|
||
```
|
||
|
||
### 5.2 Wrapper 实现(使用 glue)
|
||
|
||
```r
|
||
# tools/ST_T_TEST_IND.R
|
||
# 独立样本 T 检验
|
||
|
||
library(glue)
|
||
source("utils/error_codes.R")
|
||
|
||
run_st_t_test_ind <- function(input) {
|
||
# ===== 初始化 =====
|
||
logs <- c()
|
||
log_add <- function(msg) { logs <<- c(logs, paste0("[", Sys.time(), "] ", msg)) }
|
||
tmp_files <- c() # 📌 追踪临时文件
|
||
|
||
# 📌 确保退出时清理临时文件
|
||
on.exit({
|
||
if (length(tmp_files) > 0) {
|
||
unlink(tmp_files)
|
||
}
|
||
}, add = TRUE)
|
||
|
||
# ===== 🆕 数据加载(混合协议) =====
|
||
log_add("开始加载输入数据")
|
||
df <- tryCatch(
|
||
load_input_data(input), # 🆕 统一入口,自动处理 inline/OSS
|
||
error = function(e) {
|
||
log_add(paste("数据加载失败:", e$message))
|
||
return(NULL)
|
||
}
|
||
)
|
||
|
||
if (is.null(df)) {
|
||
return(make_error(ERROR_CODES$E100_INTERNAL_ERROR, details = "数据加载失败"))
|
||
}
|
||
log_add(glue("数据加载成功: {nrow(df)} 行, {ncol(df)} 列"))
|
||
|
||
p <- input$params
|
||
guardrails_cfg <- input$guardrails
|
||
|
||
group_var <- p$group_variable
|
||
value_var <- p$value_variable
|
||
|
||
# 📌 业务错误检查
|
||
if (!(group_var %in% names(df))) {
|
||
return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = group_var))
|
||
}
|
||
if (!(value_var %in% names(df))) {
|
||
return(make_error(ERROR_CODES$E001_COLUMN_NOT_FOUND, col = value_var))
|
||
}
|
||
|
||
groups <- unique(df[[group_var]])
|
||
if (length(groups) != 2) {
|
||
return(make_error(ERROR_CODES$E003_INSUFFICIENT_GROUPS,
|
||
col = group_var, expected = 2, actual = length(groups)))
|
||
}
|
||
|
||
# ===== 护栏检查 =====
|
||
normality_code <- ""
|
||
method_used <- "t.test"
|
||
|
||
if (isTRUE(guardrails_cfg$check_normality)) {
|
||
log_add("执行正态性检验")
|
||
use_nonparam <- FALSE
|
||
|
||
for (g in groups) {
|
||
vals <- df[df[[group_var]] == g, value_var]
|
||
if (length(vals) >= 3 && length(vals) <= 5000) {
|
||
sw_test <- shapiro.test(vals)
|
||
normality_code <- paste0(normality_code,
|
||
glue("shapiro.test(df[df${group_var} == '{g}', '{value_var}'])\n"))
|
||
if (sw_test$p.value < 0.05) {
|
||
use_nonparam <- TRUE
|
||
log_add(glue("组[{g}] Shapiro-Wilk P = {round(sw_test$p.value, 4)} < 0.05, 拒绝正态性"))
|
||
}
|
||
}
|
||
}
|
||
|
||
if (use_nonparam && isTRUE(guardrails_cfg$auto_fix)) {
|
||
log_add("触发降级: T-Test -> Wilcoxon")
|
||
return(run_st_wilcoxon(input))
|
||
}
|
||
}
|
||
|
||
# ===== 核心计算 =====
|
||
log_add("执行 T 检验")
|
||
g1_vals <- df[df[[group_var]] == groups[1], value_var]
|
||
g2_vals <- df[df[[group_var]] == groups[2], value_var]
|
||
|
||
result <- t.test(g1_vals, g2_vals, var.equal = FALSE)
|
||
|
||
# ===== 生成图表 =====
|
||
log_add("生成箱线图")
|
||
plot_base64 <- generate_boxplot(df, group_var, value_var, tmp_files)
|
||
|
||
# ===== 📌 使用 glue 生成代码 =====
|
||
template <- readLines("templates/t_test.R.template", warn = FALSE)
|
||
template_str <- paste(template, collapse = "\n")
|
||
|
||
reproducible_code <- glue(template_str,
|
||
tool_name = "独立样本 T 检验",
|
||
timestamp = Sys.time(),
|
||
group_var = group_var,
|
||
value_var = value_var,
|
||
normality_code = if (nchar(normality_code) > 0) normality_code else "# 未执行正态性检验",
|
||
method_name = result$method,
|
||
test_code = glue("t.test(df[df${group_var} == '{groups[1]}', '{value_var}'],
|
||
df[df${group_var} == '{groups[2]}', '{value_var}'],
|
||
var.equal = FALSE)")
|
||
)
|
||
|
||
# 📌 使用 styler 格式化代码(可选)
|
||
# reproducible_code <- styler::style_text(reproducible_code)
|
||
|
||
# ===== 返回结果 =====
|
||
log_add("分析完成")
|
||
|
||
return(list(
|
||
status = "success",
|
||
message = "分析完成",
|
||
results = list(
|
||
method = result$method,
|
||
statistic = unbox(as.numeric(result$statistic)),
|
||
p_value = unbox(as.numeric(result$p.value)),
|
||
p_value_fmt = format_p_value(result$p.value), # 🆕 格式化 p 值
|
||
conf_int = as.numeric(result$conf.int),
|
||
estimate = as.numeric(result$estimate),
|
||
group_stats = list(
|
||
list(group = groups[1], n = length(g1_vals), mean = mean(g1_vals), sd = sd(g1_vals)),
|
||
list(group = groups[2], n = length(g2_vals), mean = mean(g2_vals), sd = sd(g2_vals))
|
||
)
|
||
),
|
||
plots = list(plot_base64),
|
||
trace_log = logs,
|
||
reproducible_code = as.character(reproducible_code)
|
||
))
|
||
}
|
||
|
||
# 📌 辅助函数(带临时文件追踪)
|
||
generate_boxplot <- function(df, group_var, value_var, tmp_files_ref) {
|
||
library(ggplot2)
|
||
library(base64enc)
|
||
|
||
p <- ggplot(df, aes_string(x = group_var, y = value_var)) +
|
||
geom_boxplot(fill = "#3b82f6", alpha = 0.6) +
|
||
theme_minimal() +
|
||
labs(title = paste("Distribution of", value_var, "by", group_var))
|
||
|
||
tmp_file <- tempfile(fileext = ".png")
|
||
tmp_files_ref <- c(tmp_files_ref, tmp_file) # 追踪
|
||
|
||
ggsave(tmp_file, p, width = 6, height = 4, dpi = 100)
|
||
|
||
base64_str <- base64encode(tmp_file)
|
||
unlink(tmp_file) # 立即清理
|
||
|
||
return(paste0("data:image/png;base64,", base64_str))
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 8. 护栏函数库
|
||
|
||
```r
|
||
# utils/guardrails.R
|
||
|
||
# 🆕 大样本优化阈值
|
||
LARGE_SAMPLE_THRESHOLD <- 5000
|
||
|
||
# 🆕 护栏 Action 类型
|
||
ACTION_BLOCK <- "Block" # 阻止执行
|
||
ACTION_WARN <- "Warn" # 警告但继续
|
||
ACTION_SWITCH <- "Switch" # 切换到备选方法
|
||
|
||
# 🆕 护栏检查结果结构
|
||
# list(
|
||
# passed = TRUE/FALSE,
|
||
# action = "Block" | "Warn" | "Switch",
|
||
# action_target = "ST_XXX" | NULL,
|
||
# p_value = 0.05,
|
||
# reason = "描述"
|
||
# )
|
||
|
||
# 正态性检验(🆕 支持三种 Action)
|
||
check_normality <- function(values, alpha = 0.05, action = ACTION_SWITCH, action_target = NULL) {
|
||
n <- length(values)
|
||
|
||
# 样本量过小
|
||
if (n < 3) {
|
||
return(list(
|
||
passed = TRUE,
|
||
action = NULL,
|
||
action_target = NULL,
|
||
reason = "样本量过小,跳过正态性检验",
|
||
skipped = TRUE
|
||
))
|
||
}
|
||
|
||
# 🆕 大样本优化:N > 5000 时使用抽样检验
|
||
if (n > LARGE_SAMPLE_THRESHOLD) {
|
||
set.seed(42)
|
||
sampled_values <- sample(values, 1000)
|
||
test <- shapiro.test(sampled_values)
|
||
passed <- test$p.value >= alpha
|
||
|
||
return(list(
|
||
passed = passed,
|
||
action = if (passed) NULL else action,
|
||
action_target = if (passed) NULL else action_target,
|
||
p_value = test$p.value,
|
||
reason = glue("大样本(N={n})抽样检验,{if (passed) '满足正态性' else '不满足正态性'}"),
|
||
sampled = TRUE,
|
||
sample_size = 1000
|
||
))
|
||
}
|
||
|
||
# 常规检验
|
||
test <- shapiro.test(values)
|
||
passed <- test$p.value >= alpha
|
||
|
||
return(list(
|
||
passed = passed,
|
||
action = if (passed) NULL else action,
|
||
action_target = if (passed) NULL else action_target,
|
||
p_value = test$p.value,
|
||
reason = if (passed) "满足正态性" else "不满足正态性",
|
||
sampled = FALSE
|
||
))
|
||
}
|
||
|
||
# 方差齐性检验 (Levene)
|
||
check_homogeneity <- function(df, group_var, value_var, alpha = 0.05, action = ACTION_WARN) {
|
||
library(car)
|
||
|
||
formula <- as.formula(paste(value_var, "~", group_var))
|
||
test <- leveneTest(formula, data = df)
|
||
p_val <- test$`Pr(>F)`[1]
|
||
passed <- p_val >= alpha
|
||
|
||
return(list(
|
||
passed = passed,
|
||
action = if (passed) NULL else action,
|
||
p_value = p_val,
|
||
reason = if (passed) "方差齐性满足" else "方差不齐性"
|
||
))
|
||
}
|
||
|
||
# 样本量检验
|
||
check_sample_size <- function(n, min_required = 30, action = ACTION_BLOCK) {
|
||
passed <- n >= min_required
|
||
return(list(
|
||
passed = passed,
|
||
action = if (passed) NULL else action,
|
||
n = n,
|
||
reason = if (passed) "样本量充足" else paste0("样本量不足, 需要至少 ", min_required)
|
||
))
|
||
}
|
||
|
||
# 🆕 执行护栏链(按 check_order 顺序执行)
|
||
run_guardrail_chain <- function(input, guardrail_configs) {
|
||
for (config in guardrail_configs) {
|
||
check_func <- get(config$check_code)
|
||
result <- do.call(check_func, list(
|
||
input,
|
||
action = config$action_type,
|
||
action_target = config$action_target
|
||
))
|
||
|
||
if (!result$passed) {
|
||
if (result$action == ACTION_BLOCK) {
|
||
return(list(
|
||
status = "blocked",
|
||
reason = result$reason
|
||
))
|
||
} else if (result$action == ACTION_SWITCH) {
|
||
return(list(
|
||
status = "switch",
|
||
target_tool = result$action_target,
|
||
reason = result$reason
|
||
))
|
||
}
|
||
# WARN: 记录警告但继续
|
||
}
|
||
}
|
||
|
||
return(list(status = "passed"))
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 9. API 请求/响应规范
|
||
|
||
### 9.1 🆕 请求格式(混合协议)
|
||
|
||
**方式 1:Inline 数据(< 2MB)**
|
||
```json
|
||
{
|
||
"data_source": {
|
||
"type": "inline",
|
||
"data": [
|
||
{"Gender": "Male", "GLU": 5.8, "Age": 45},
|
||
{"Gender": "Female", "GLU": 5.1, "Age": 38}
|
||
]
|
||
},
|
||
"params": {
|
||
"group_variable": "Gender",
|
||
"value_variable": "GLU"
|
||
},
|
||
"guardrails": {
|
||
"check_normality": true,
|
||
"auto_fix": true
|
||
}
|
||
}
|
||
```
|
||
|
||
**方式 2:OSS 数据(2MB - 20MB)**
|
||
```json
|
||
{
|
||
"data_source": {
|
||
"type": "oss",
|
||
"oss_key": "sessions/abc123/data.csv"
|
||
},
|
||
"params": {
|
||
"group_variable": "Gender",
|
||
"value_variable": "GLU"
|
||
},
|
||
"guardrails": {
|
||
"check_normality": true,
|
||
"auto_fix": true
|
||
}
|
||
}
|
||
```
|
||
|
||
### 9.2 成功响应(🆕 含 p_value_fmt)
|
||
|
||
```json
|
||
{
|
||
"status": "success",
|
||
"message": "分析完成",
|
||
"results": {
|
||
"method": "Welch Two Sample t-test",
|
||
"statistic": 2.345,
|
||
"p_value": 0.021,
|
||
"p_value_fmt": "0.021",
|
||
"conf_int": [0.12, 1.28],
|
||
"group_stats": [
|
||
{"group": "Male", "n": 78, "mean": 5.8, "sd": 0.9},
|
||
{"group": "Female", "n": 72, "mean": 5.1, "sd": 0.7}
|
||
]
|
||
},
|
||
"plots": ["data:image/png;base64,..."],
|
||
"trace_log": [
|
||
"[2026-02-18 10:30:01] 数据加载成功: 150 行, 5 列",
|
||
"[2026-02-18 10:30:01] 执行正态性检验",
|
||
"[2026-02-18 10:30:02] 执行 T 检验",
|
||
"[2026-02-18 10:30:02] 分析完成"
|
||
],
|
||
"reproducible_code": "# SSA-Pro 自动生成代码\n..."
|
||
}
|
||
```
|
||
|
||
> **p_value_fmt 说明**:
|
||
> - p >= 0.001: 保留 3 位小数,如 "0.021"
|
||
> - p < 0.001: 显示 "< 0.001"
|
||
> - 前端应直接使用 `p_value_fmt` 展示,避免重复格式化
|
||
|
||
### 7.3 错误响应(📌 含结构化错误码)
|
||
|
||
```json
|
||
{
|
||
"status": "error",
|
||
"error_code": "E001",
|
||
"error_type": "business",
|
||
"message": "列名 'invalid_col' 在数据中不存在",
|
||
"trace_log": [
|
||
"[2026-02-18 10:30:01] 开始解析输入数据",
|
||
"[2026-02-18 10:30:01] 错误: 列名 'invalid_col' 在数据中不存在"
|
||
]
|
||
}
|
||
```
|
||
|
||
> **错误类型说明**:
|
||
> - `business`:业务错误,Planner 可尝试自动修复参数后重试
|
||
> - `system`:系统错误,需人工介入
|
||
|
||
---
|
||
|
||
## 8. MVP 10 个工具清单
|
||
|
||
| 序号 | 工具代码 | 文件名 | 主要函数 | 护栏 |
|
||
|------|---------|--------|---------|------|
|
||
| 1 | ST_T_TEST_IND | ST_T_TEST_IND.R | `run_st_t_test_ind()` | 正态性 |
|
||
| 2 | ST_T_TEST_PAIRED | ST_T_TEST_PAIRED.R | `run_st_t_test_paired()` | 正态性 |
|
||
| 3 | ST_ANOVA_ONE | ST_ANOVA_ONE.R | `run_st_anova_one()` | 正态性+方差齐性 |
|
||
| 4 | ST_CHI_SQUARE | ST_CHI_SQUARE.R | `run_st_chi_square()` | 期望频数 |
|
||
| 5 | ST_FISHER | ST_FISHER.R | `run_st_fisher()` | 无 |
|
||
| 6 | ST_WILCOXON | ST_WILCOXON.R | `run_st_wilcoxon()` | 无 |
|
||
| 7 | ST_MANN_WHITNEY | ST_MANN_WHITNEY.R | `run_st_mann_whitney()` | 无 |
|
||
| 8 | ST_CORRELATION | ST_CORRELATION.R | `run_st_correlation()` | 正态性(决定Pearson/Spearman) |
|
||
| 9 | ST_LINEAR_REG | ST_LINEAR_REG.R | `run_st_linear_reg()` | 残差正态性 |
|
||
| 10 | ST_DESCRIPTIVE | ST_DESCRIPTIVE.R | `run_st_descriptive()` | 无 |
|
||
|
||
---
|
||
|
||
## 9. 本地开发流程
|
||
|
||
### 8.1 构建镜像
|
||
|
||
```bash
|
||
cd r-statistics-service
|
||
docker build -t ssa-r-service:dev .
|
||
```
|
||
|
||
### 8.2 运行容器
|
||
|
||
```bash
|
||
docker run -d -p 8080:8080 --name ssa-r-dev ssa-r-service:dev
|
||
```
|
||
|
||
### 8.3 测试健康检查
|
||
|
||
```bash
|
||
curl http://localhost:8080/health
|
||
```
|
||
|
||
### 8.4 测试工具调用
|
||
|
||
```bash
|
||
curl -X POST http://localhost:8080/api/v1/skills/ST_T_TEST_IND \
|
||
-H "Content-Type: application/json" \
|
||
-d '{
|
||
"data": [
|
||
{"Gender": "Male", "GLU": 5.8},
|
||
{"Gender": "Male", "GLU": 6.1},
|
||
{"Gender": "Female", "GLU": 5.0},
|
||
{"Gender": "Female", "GLU": 5.2}
|
||
],
|
||
"params": {
|
||
"group_variable": "Gender",
|
||
"value_variable": "GLU"
|
||
},
|
||
"guardrails": {
|
||
"check_normality": true,
|
||
"auto_fix": true
|
||
}
|
||
}'
|
||
```
|
||
|
||
---
|
||
|
||
## 10. 工具元数据格式
|
||
|
||
```yaml
|
||
# metadata/tools.yaml
|
||
tools:
|
||
- code: ST_T_TEST_IND
|
||
name: 独立样本 T 检验
|
||
version: "1.0.0"
|
||
category: 假设检验
|
||
description: |
|
||
用于比较两个独立组的均值是否存在显著差异。
|
||
适用场景:比较男性vs女性的血糖水平、实验组vs对照组的疗效等。
|
||
usage_context: |
|
||
- 两组独立样本比较
|
||
- 连续型数值变量
|
||
- 样本量建议 >= 30
|
||
params_schema:
|
||
type: object
|
||
required:
|
||
- group_variable
|
||
- value_variable
|
||
properties:
|
||
group_variable:
|
||
type: string
|
||
description: 分组变量名(应为分类变量,仅含两个水平)
|
||
value_variable:
|
||
type: string
|
||
description: 检验变量名(应为数值型)
|
||
guardrails:
|
||
- check_normality
|
||
- check_homogeneity
|
||
```
|
||
|
||
---
|
||
|
||
## 11. 常见问题
|
||
|
||
### Q1: 护栏检查失败后如何处理?
|
||
|
||
如果 `auto_fix = true`,R 服务会自动降级到适当的非参数方法。如果 `auto_fix = false`,则返回警告但仍执行原方法。
|
||
|
||
### Q2: 如何添加新工具?
|
||
|
||
1. 在 `tools/` 目录创建 `ST_NEW_TOOL.R`
|
||
2. 实现 `run_st_new_tool(input)` 函数
|
||
3. 在 `metadata/tools.yaml` 添加元数据
|
||
4. 执行后端脚本导入到 pgvector
|
||
|
||
### Q3: 图表生成失败怎么办?
|
||
|
||
检查 `plots` 字段是否为空数组。R 服务不会因图表失败而中断整个分析,但会在 `trace_log` 中记录错误。
|
||
|
||
### Q4: 如何添加新的 R 包依赖?
|
||
|
||
1. 在 R 控制台执行 `install.packages("new_package")`
|
||
2. 执行 `renv::snapshot()` 更新 `renv.lock`
|
||
3. 提交 `renv.lock` 到版本控制
|
||
4. 重新构建 Docker 镜像
|
||
|
||
### Q5: 临时文件清理策略是什么?
|
||
|
||
- **代码层面**:使用 `on.exit(unlink(tmp_files))` 确保函数退出时清理
|
||
- **容器层面**:Docker 启动时清理 `/tmp`
|
||
- **运维层面**:SAE 配置定时任务,每日清理 24 小时前的临时文件
|