feat(dc/tool-c): Add missing value imputation feature with 6 methods and MICE

Major features:

1. Missing value imputation (6 simple methods + MICE):
   - Mean/Median/Mode/Constant imputation
   - Forward fill (ffill) and Backward fill (bfill) for time series
   - MICE multivariate imputation (in progress, shape issue to fix)
2. Auto precision detection:
   - Automatically match decimal places of original data
   - Prevent false precision (e.g. 13.57 instead of 13.566716417910449)
3. Categorical variable detection:
   - Auto-detect and skip categorical columns in MICE
   - Show warnings for unsuitable columns
   - Suggest mode imputation for categorical data
4. UI improvements:
   - Rename button: "Delete Missing" to "Missing Value Handling"
   - Remove standalone "Dedup" and "MICE" buttons
   - 3-tab dialog: Delete / Fill / Advanced Fill
   - Display column statistics and recommended methods
   - Extended warning messages (8 seconds for skipped columns)
5. Bug fixes:
   - Fix sessionService.updateSessionData -> saveProcessedData
   - Fix OperationResult interface (add message and stats)
   - Fix Toolbar button labels and removal

Modified files:
- Python: operations/fillna.py (new, 556 lines), main.py (3 new endpoints)
- Backend: QuickActionService.ts, QuickActionController.ts, routes/index.ts
- Frontend: MissingValueDialog.tsx (new, 437 lines), Toolbar.tsx, index.tsx
- Tests: test_fillna_operations.py (774 lines), test scripts and docs
- Docs: 5 documentation files updated

Known issues:
- MICE imputation has DataFrame shape mismatch issue (under debugging)
- Workaround: Use 6 simple imputation methods first

Status: Development complete, MICE debugging in progress
Lines added: ~2000 lines across 3 tiers
2025-12-10 13:06:00 +08:00
parent f4f1d09837
commit 74cf346453
102 changed files with 3806 additions and 181 deletions
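
For orientation, the six simple methods named in the commit message can be sketched in pandas as below. This is a minimal illustration under stated assumptions, not the code in `operations/fillna.py`: the function name `fillna_simple_sketch`, the `Method` alias, and the demo column names are hypothetical, and unlike the documented `fillna_simple` (which returns a dict) this sketch returns a DataFrame. It uses `Series.ffill()` and `Series.bfill()` because `fillna(method=...)` has been deprecated since pandas 2.1.

```python
from typing import Any, Literal

import pandas as pd

# Hypothetical alias mirroring the six method names listed in the plan.
Method = Literal["mean", "median", "mode", "constant", "ffill", "bfill"]


def fillna_simple_sketch(
    df: pd.DataFrame,
    column: str,
    new_column_name: str,
    method: Method,
    fill_value: Any = None,
) -> pd.DataFrame:
    """Return a copy of df with an imputed column placed right after `column`."""
    series = df[column]
    if method == "mean":
        filled = series.fillna(series.mean())
    elif method == "median":
        filled = series.fillna(series.median())
    elif method == "mode":
        modes = series.mode(dropna=True)
        # An all-missing column has no mode; leave it unchanged in that case.
        filled = series.fillna(modes.iloc[0]) if not modes.empty else series.copy()
    elif method == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required when method='constant'")
        filled = series.fillna(fill_value)
    elif method == "ffill":
        filled = series.ffill()  # carry the previous non-missing value forward
    elif method == "bfill":
        filled = series.bfill()  # pull the next non-missing value backward
    else:
        raise ValueError(f"unknown method: {method}")

    result = df.copy()
    # Insert the new column immediately after the original for comparison.
    position = result.columns.get_loc(column) + 1
    result.insert(position, new_column_name, filled)
    return result


# Demo data matching the [10, NaN, NaN, 20] example in the plan.
demo = pd.DataFrame({"visit": [1, 2, 3, 4], "sbp": [10.0, None, None, 20.0]})
forward = fillna_simple_sketch(demo, "sbp", "sbp_ffill", "ffill")
backward = fillna_simple_sketch(demo, "sbp", "sbp_bfill", "bfill")
```

Leading missing values stay missing under forward fill, and trailing ones stay missing under backward fill, so a caller may still need a fallback for those rows.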
--- a/docs/03-业务模块/DC-数据清洗整理/04-开发计划/工具C_缺失值处理功能开发计划.md
+++ b/docs/03-业务模块/DC-数据清洗整理/04-开发计划/工具C_缺失值处理功能开发计划.md
@@ -26,7 +26,7 @@
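 #### Tab 2: Fill Missing Values ⭐ New
 1. **Mean Imputation**
   - Suitable for: numeric variables, normally distributed
-   - Implementation: `df[column].fillna(df[column].mean())`
+   - Implementation: create a new column filled with the mean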

 2. **Median Imputation**
   - Suitable for: numeric variables, skewed distribution
@@ -40,6 +40,16 @@
   - Suitable for: any type, user-specified value
   - Implementation: create a new column filled with the specified value

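+5. **Forward Fill**
+   - Suitable for: time series data, ordered observational data
+   - Implementation: `df[column].fillna(method='ffill')`, fills with the previous non-missing value
+   - Example: [10, NaN, NaN, 20] → [10, 10, 10, 20]
+
+6. **Backward Fill**
+   - Suitable for: time series data, ordered observational data
+   - Implementation: `df[column].fillna(method='bfill')`, fills with the next non-missing value
+   - Example: [10, NaN, NaN, 20] → [10, 20, 20, 20]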
+
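 **Note**: every imputation method creates a new column (e.g. `体重_填补`, i.e. "weight_imputed"), placed immediately after the original column for easy comparison and verification.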

 #### Tab 3: Advanced Imputation ⭐ New
@@ -48,10 +58,10 @@
   - Implementation: uses `sklearn.impute.IterativeImputer`

 ### Phase 2: Future Extensions (not developed in this iteration)
-- Forward/Backward Fill
 - Grouped Imputation
 - Linear Interpolation
 - KNN Imputation
+- Combined imputation (apply different imputation methods depending on conditions)

 ---

@@ -107,6 +117,8 @@
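 │  ⚪ Median imputation (skewed numeric variables) ⭐    │
 │  ⚪ Mode imputation (categorical or discrete numeric)  │
 │  ⚪ Constant imputation: [_______] ← user input        │
+│  ⚪ Forward fill (previous value; suits time series)   │
+│  ⚪ Backward fill (next value; suits time series)      │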
 │                                                        │
 │  📈 Imputation preview:                                │
 │  ┌──────────────────────────────────────────────┐   │
@@ -195,7 +207,7 @@ def fillna_simple(
    df: pd.DataFrame,
    column: str,
    new_column_name: str,
-    method: Literal['mean', 'median', 'mode', 'constant'],
+    method: Literal['mean', 'median', 'mode', 'constant', 'ffill', 'bfill'],
    fill_value: Any = None
 ) -> dict:
    """
@@ -210,6 +222,8 @@ def fillna_simple(
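            - 'median': median imputation
            - 'mode': mode imputation
            - 'constant': constant-value imputation
+            - 'ffill': forward fill (uses the previous non-missing value)
+            - 'bfill': backward fill (uses the next non-missing value)
        fill_value: the constant value (required when method='constant')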
    
    Returns:
@@ -324,7 +338,7 @@ async def operation_fillna_mice(request: FillnaMiceRequest):
 async executeFillnaSimple(params: {
  sessionId: string;
  column: string;
-  method: 'mean' | 'median' | 'mode' | 'constant';
+  method: 'mean' | 'median' | 'mode' | 'constant' | 'ffill' | 'bfill';
  fillValue?: any;
 }): Promise<any>

@@ -369,7 +383,7 @@ interface MissingValueDialogProps {
 // New state
 const [activeTab, setActiveTab] = useState<'delete' | 'fill' | 'mice'>('fill');
 const [selectedColumn, setSelectedColumn] = useState<string>('');
-const [fillMethod, setFillMethod] = useState<'mean' | 'median' | 'mode' | 'constant'>('median');
+const [fillMethod, setFillMethod] = useState<'mean' | 'median' | 'mode' | 'constant' | 'ffill' | 'bfill'>('median');
 const [fillValue, setFillValue] = useState<any>(null);
 const [columnStats, setColumnStats] = useState<any>(null);

@@ -501,6 +515,7 @@ const actionButtons = [
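 - Numeric column (skewed distribution): weight (20% missing)
 - Categorical column: marital status (10% missing)
 - Multiple columns with missing values: systolic blood pressure (15%) + diastolic blood pressure (12%)
+- Time series column: follow-up blood pressure (ordered, 18% missing), used to test forward/backward fill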
 ```

 #### Test Cases
@@ -670,6 +685,30 @@ scikit-learn >= 1.2.0  # ← required by MICE

 ## 📝 Change Log

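+### 2025-12-10 Update (user request)
+
+**New features**:
+1. ✅ **Forward/backward fill added to the current development scope** (originally planned for Phase 2)
+   - Forward Fill: fills with the previous non-missing value
+   - Backward Fill: fills with the next non-missing value
+   - Use cases: time series data, ordered observational data
+
+**Impact**:
+- Tab 2 gains 2 imputation options (6 methods in total)
+- The `method` parameter of the Python function `fillna_simple` gains `'ffill'` and `'bfill'`
+- Test cases increase from 14 to 18
+- Development time increases from 5-6 hours to 6-7 hours
+
+**Use case notes**:
+- Mean/median: suited to numeric variables with independent observations
+- Mode: suited to categorical variables
+- Constant value: user-defined scenarios
+- **Forward fill**: follow-up data (e.g. repeated measurements, filled with the previous value)
+- **Backward fill**: predictive data (filled with a known later value)
+- MICE: high-quality imputation that accounts for relationships between variables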
+
+---
+
 ### 2025-12-09 Update (per user confirmation)

 **Core changes**: