


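White Paper on Extracting Special Symbols in Clinical Statistics

Purpose: to guide character cleaning and normalization when Python (python-docx) extracts tables from Word documents.

Core pain point: the same mathematical meaning can be encoded in several different ways.

1. Greek Letters

This is the area where garbled or misrecognized characters occur most often.

⚠️ Extraction trap: many older Word documents (especially submissions to Chinese journals) use the Symbol font. When python-docx extracts the text, you may read a plain Latin letter c, while what the user sees on screen is χ.

Solution: check run.font.name. If the font is Symbol, build a mapping table (c -> χ, a -> α).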
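
The mapping idea above can be sketched as follows. This is a minimal illustration, not the project's actual extractor: the names SYMBOL_FONT_MAP, run_text_with_symbols and paragraph_text_with_symbols are hypothetical, and the table covers only a few letters. It relies only on attributes that python-docx runs expose (run.text, run.font.name), so the demo uses stand-in objects instead of a real document.

```python
from types import SimpleNamespace

# Partial Symbol-font mapping (Latin key -> Greek glyph). Hypothetical table;
# extend it with the letters that actually occur in your documents.
SYMBOL_FONT_MAP = {
    "a": "\u03b1",  # alpha
    "b": "\u03b2",  # beta
    "c": "\u03c7",  # chi
    "d": "\u03b4",  # delta
    "m": "\u03bc",  # mu
    "s": "\u03c3",  # sigma
}


def run_text_with_symbols(run):
    """Return run.text, translating letters only when the run uses the Symbol font."""
    font_name = run.font.name or ""  # font.name is None when the font is inherited
    if font_name.lower() != "symbol":
        return run.text
    return "".join(SYMBOL_FONT_MAP.get(ch, ch) for ch in run.text)


def paragraph_text_with_symbols(paragraph):
    """Join all runs of a paragraph, applying the Symbol-font mapping per run."""
    return "".join(run_text_with_symbols(run) for run in paragraph.runs)


# Stand-in objects that mimic the python-docx attributes used above.
demo_paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="c", font=SimpleNamespace(name="Symbol")),
    SimpleNamespace(text="2 = 5.3", font=SimpleNamespace(name="Times New Roman")),
])
print(paragraph_text_with_symbols(demo_paragraph))
```

Working per run rather than per cell matters here: cell.text has already discarded the font information, so the check has to happen before the runs are joined.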

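2. Mathematical Operators

| Symbol | Meaning | Unicode | Common variants | Recommended handling |
|---|---|---|---|---|
| ± | Plus-minus / standard deviation | \u00b1 | +/-, + / - | Normalize to \u00b1 |
| ≤ | Less than or equal to | \u2264 | <=, =< | Normalize to <= |
| ≥ | Greater than or equal to | \u2265 | >= | Normalize to >= |
| ≠ | Not equal to | \u2260 | !=, <>, /= | Normalize to != |
| ≈ | Approximately equal to | \u2248 | ~, = | Normalize to ~= |
| × | Multiplication / interaction term | \u00d7 | x, X, * | Normalize to x |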
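
The table above can be turned into a small rule list. The sketch below is one possible reading of it, with hypothetical names (OPERATOR_RULES, normalize_operators). It deliberately leaves the ambiguous ASCII variants (~, =, x, X, *) untouched, because rewriting them blindly would damage ordinary text.

```python
import re

# (pattern, replacement) pairs derived from the operator table.
# Hypothetical rule list; ambiguous ASCII variants are intentionally skipped.
OPERATOR_RULES = [
    (re.compile(r"\+\s*/\s*-"), "\u00b1"),   # "+/-" or "+ / -" -> plus-minus sign
    (re.compile(r"\u2264|=<"), "<="),        # less than or equal to
    (re.compile(r"\u2265"), ">="),           # greater than or equal to
    (re.compile(r"\u2260|<>|/="), "!="),     # not equal to
    (re.compile(r"\u2248"), "~="),           # approximately equal to
    (re.compile(r"\u00d7"), "x"),            # multiplication sign
]


def normalize_operators(text):
    """Apply every operator rule in order and return the normalized text."""
    for pattern, replacement in OPERATOR_RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize_operators("12.3 +/- 1.2, P \u2264 0.05"))
```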

⚠️ Extraction trap: the minus sign is the biggest pitfall in data cleaning. Word automatically converts a hyphen into a dash or a mathematical minus sign.

Python code: value.replace('\u2212', '-').replace('\u2013', '-')
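
A slightly fuller version of that one-liner is sketched below. The names normalize_minus and safe_float are illustrative, and the variant list is an assumption: beyond the minus sign and en dash named above, it adds the em dash, the Unicode hyphen, the non-breaking hyphen and the fullwidth hyphen-minus. Trim or extend it to match what your documents contain.

```python
# Characters treated as a minus sign (assumed list, see lead-in):
# U+2212 minus, U+2013 en dash, U+2014 em dash,
# U+2010 hyphen, U+2011 non-breaking hyphen, U+FF0D fullwidth hyphen-minus.
MINUS_VARIANTS = "\u2212\u2013\u2014\u2010\u2011\uff0d"
_MINUS_TABLE = str.maketrans({ch: "-" for ch in MINUS_VARIANTS})


def normalize_minus(text):
    """Replace every minus look-alike with the ASCII hyphen-minus."""
    return text.translate(_MINUS_TABLE)


def safe_float(text):
    """Parse a number after minus normalization; return None instead of raising."""
    if text is None:
        return None
    try:
        return float(normalize_minus(text).strip())
    except ValueError:
        return None


print(safe_float("\u22120.35"), safe_float("n/a"))
```

Returning None rather than raising lets the caller decide whether an unparseable cell is an error or simply a non-numeric cell such as a label.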

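3. Statistical Notations

| Symbol | Meaning | Form | Extraction difficulty |
|---|---|---|---|
| p̂ | Sample proportion | p with a hat (circumflex) above | Same as above. |

⚠️ Extraction trap: for characters that carry a modifier like this, python-docx may extract only the base character, such as x.

Strategy: for data forensics we usually care about the keywords Mean or Average in the table header rather than the symbol. If the header contains only the symbol (for example x̄), the meaning may have to be inferred from context.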
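
The keyword-first strategy can be sketched like this. It assumes the modifier survives extraction as a Unicode combining mark (macron, overline or circumflex); if the symbol was typed in an equation editor, it may not appear in the text at all and this check will not help. The function name classify_header and the returned labels are hypothetical.

```python
import re
import unicodedata

MEAN_KEYWORDS = re.compile(r"\b(mean|average)\b", re.IGNORECASE)
BAR_MARKS = ("\u0304", "\u0305")  # combining macron, combining overline
HAT_MARK = "\u0302"               # combining circumflex


def classify_header(header):
    """Return 'mean', 'sample_proportion', or None for a table header string."""
    # 1. Keywords are the most reliable signal.
    if MEAN_KEYWORDS.search(header):
        return "mean"
    # 2. Fall back to base letter + combining mark (NFD splits composed forms).
    decomposed = unicodedata.normalize("NFD", header).lower()
    if any("x" + mark in decomposed for mark in BAR_MARKS):
        return "mean"
    if "p" + HAT_MARK in decomposed:
        return "sample_proportion"
    # 3. A bare base character is ambiguous; leave it to context.
    return None


print(classify_header("Mean (SD)"), classify_header("x"))
```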

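4. Special Meanings of Latin Letters (Latin Context)

These are ordinary letters, but in a statistical context they carry special meanings and usually appear in italics.

| Symbol | Meaning | Easily confused with |
|---|---|---|
| t | t-test statistic | Time unit t (time) or ton |
| F | F-test statistic | Female |
| Z | Z-test statistic | - |
| P | P value (probability) | Phosphorus |
| N | Sample size | Newton |
| r | Correlation coefficient | Radius |
| | Regression coefficient | - |
| OR | Odds ratio | Operating Room, or the word "or" |
| HR | Hazard ratio | Heart Rate |
| CI | Confidence interval | Cardiac Index |

⚠️ Extraction strategy: do not look at the character alone; look at the combination.

P appears on its own and the value lies between 0 and 1 -> P value.
t appears on its own and the value is > 0 -> t value.
CI is followed by parentheses, such as (1.2-3.4) -> confidence interval.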
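
The three combination rules above can be written as a small classifier. This is a sketch under stated assumptions: the header label and the cell text are passed in separately, minus signs have already been normalized, and the name classify_stat_cell and its return labels are invented for illustration.

```python
import re

# "CI" followed by a bracketed range such as (1.2-3.4) or (1.2, 3.4).
CI_PATTERN = re.compile(
    r"CI\s*[\(\[]\s*-?\d+(?:\.\d+)?\s*[-,]\s*-?\d+(?:\.\d+)?\s*[\)\]]"
)


def classify_stat_cell(label, cell):
    """Classify a (header label, cell text) pair using the combination rules."""
    # Rule 3: CI followed by a parenthesized range.
    if CI_PATTERN.search(f"{label} {cell}"):
        return "confidence_interval"
    try:
        value = float(cell)
    except ValueError:
        return None
    name = label.strip()
    # Rule 1: P on its own with a value between 0 and 1.
    if name in ("P", "p") and 0.0 <= value <= 1.0:
        return "p_value"
    # Rule 2: t on its own with a positive value.
    if name == "t" and value > 0:
        return "t_value"
    return None


print(classify_stat_cell("P", "0.032"), classify_stat_cell("95% CI", "(1.2-3.4)"))
```

Requiring the label to be exactly P or t is what keeps headers such as "Weight (t)" from being misread as a test statistic.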

5. Python String-Cleaning Toolbox (Cleaner Utils)

We recommend integrating the following cleaning function into DocxTableExtractor:

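    import re

    def clean_statistical_text(text):
        """Normalize statistical symbols in text extracted from a Word table cell."""
        if not text:
            return ""

        # 1. Normalize minus signs (CRITICAL): minus sign, en dash, em dash -> '-'
        text = text.replace('\u2212', '-').replace('\u2013', '-').replace('\u2014', '-')

        # 2. Normalize chi-square
        # A Symbol-font 'c'2 needs a run.font check; only Unicode forms are handled here
        text = text.replace('\u03c72', 'chi-square')
        text = text.replace('\u03c7\u00b2', 'chi-square')
        # Regex for common variants (X2, x^2, ...); the lookarounds avoid
        # rewriting fragments of longer tokens such as "2x2" or "x20"
        text = re.sub(r'(?<![A-Za-z0-9])[Xxχ]\^?2(?![0-9])', 'chi-square', text)

        # 3. Normalize the plus-minus sign
        text = text.replace('\u00b1', '+/-')

        # 4. Normalize comparison operators
        text = text.replace('≤', '<=').replace('≥', '>=')

        # 5. Remove invisible characters (zero-width space, etc.)
        text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)

        return text.strip()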
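
One way to wire a cleaner into table extraction is sketched below. It uses only attributes that python-docx tables expose (table.rows, row.cells, cell.text), so the demo runs on stand-in objects; with a real file the table would come from docx.Document(path).tables. The helper name table_to_rows and its cleaner parameter are hypothetical, and how DocxTableExtractor is actually structured is not shown in this document.

```python
from types import SimpleNamespace


def table_to_rows(table, cleaner=str.strip):
    """Return the table as a list of rows, each a list of cleaned cell strings.

    `cleaner` is any str -> str function; pass the cleaning function from
    this section in place of the default str.strip.
    """
    return [[cleaner(cell.text) for cell in row.cells] for row in table.rows]


# Stand-in for a python-docx table with one header row and one data row.
demo_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[SimpleNamespace(text=" t "), SimpleNamespace(text=" P ")]),
    SimpleNamespace(cells=[SimpleNamespace(text="\u22122.31"), SimpleNamespace(text="0.03\n")]),
])
print(table_to_rows(demo_table))
```

Passing the cleaner as a parameter keeps extraction and normalization separate, which makes each easier to test on its own.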

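6. Summary

In Word extraction, the biggest "gremlins" are not the complicated cases but these three:

Fake minus signs (they make float() raise an error).
The Symbol font (it makes α come out as a).
Multi-paragraph line breaks (already addressed in the previous section).

Once these three issues are handled, 99% of statistical tables can be parsed correctly.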
