feat(dc/tool-c): Add missing value imputation feature with 6 methods and MICE
Major features:
1. Missing value imputation (6 simple methods + MICE):
   - Mean/Median/Mode/Constant imputation
   - Forward fill (ffill) and Backward fill (bfill) for time series
   - MICE multivariate imputation (in progress, shape issue to fix)
2. Auto precision detection:
   - Automatically match decimal places of original data
   - Prevent false precision (e.g. 13.57 instead of 13.566716417910449)
3. Categorical variable detection:
   - Auto-detect and skip categorical columns in MICE
   - Show warnings for unsuitable columns
   - Suggest mode imputation for categorical data
4. UI improvements:
   - Rename button: "Delete Missing" to "Missing Value Handling"
   - Remove standalone "Dedup" and "MICE" buttons
   - 3-tab dialog: Delete / Fill / Advanced Fill
   - Display column statistics and recommended methods
   - Extended warning messages (8 seconds for skipped columns)
5. Bug fixes:
   - Fix sessionService.updateSessionData -> saveProcessedData
   - Fix OperationResult interface (add message and stats)
   - Fix Toolbar button labels and removal

Modified files:
- Python: operations/fillna.py (new, 556 lines), main.py (3 new endpoints)
- Backend: QuickActionService.ts, QuickActionController.ts, routes/index.ts
- Frontend: MissingValueDialog.tsx (new, 437 lines), Toolbar.tsx, index.tsx
- Tests: test_fillna_operations.py (774 lines), test scripts and docs
- Docs: 5 documentation files updated

Known issues:
- MICE imputation has DataFrame shape mismatch issue (under debugging)
- Workaround: Use 6 simple imputation methods first

Status: Development complete, MICE debugging in progress
Lines added: ~2000 lines across 3 tiers
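The "auto precision detection" item can be illustrated with a small standard-library sketch. This is not the code in `operations/fillna.py`; the helper names `detect_decimal_places` and `round_like_source` are hypothetical, and the rule used here (take the largest number of decimal places among the observed values) is an assumption about how matching could work.

```python
import math
from decimal import Decimal


def detect_decimal_places(values):
    """Return the largest number of decimal places among the observed values."""
    places = 0
    for v in values:
        # Skip missing entries (None, NaN) and infinities.
        if v is None or (isinstance(v, float) and not math.isfinite(v)):
            continue
        # str() yields the shortest round-tripping text, e.g. 14.25 -> "14.25".
        exponent = Decimal(str(v)).as_tuple().exponent
        if exponent < 0:
            places = max(places, -exponent)
    return places


def round_like_source(fill_value, values):
    """Round a computed fill value to the precision seen in the source data."""
    return round(fill_value, detect_decimal_places(values))


# Hypothetical column recorded with at most two decimal places.
observed = [12.5, 14.25, None]
print(round_like_source(13.566716417910449, observed))  # 13.57
```

One caveat of this approach: a float such as `12.0` is reported as having one decimal place, so integer-like data stored as floats would be rounded to one decimal rather than zero.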
@@ -26,7 +26,7 @@
 #### Tab 2: Fill Missing Values ⭐ New
 1. **Mean Imputation**
 - Applies to: numeric variables with a normal distribution
-- Implementation: `df[column].fillna(df[column].mean())`
+- Implementation: create a new column filled with the mean

 2. **Median Imputation**
 - Applies to: numeric variables with a skewed distribution
@@ -40,6 +40,16 @@
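 - Applies to: any type, with a user-specified value
 - Implementation: create a new column filled with the specified value

+5. **Forward Fill**
+- Applies to: time series data and other ordered observations
+- Implementation: `df[column].fillna(method='ffill')`, which fills with the previous non-missing value
+- Example: [10, NaN, NaN, 20] → [10, 10, 10, 20]
+
+6. **Backward Fill**
+- Applies to: time series data and other ordered observations
+- Implementation: `df[column].fillna(method='bfill')`, which fills with the next non-missing value
+- Example: [10, NaN, NaN, 20] → [10, 20, 20, 20]
+
 **Note**: Every fill method creates a new column (e.g. `体重_填补`), placed right next to the original column for easy comparison and verification.

 #### Tab 3: Advanced Fill ⭐ New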
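The six Tab 2 methods and the "new column next to the original" rule from the hunk above can be sketched as follows. This is an illustration only, not the `fillna_simple` implementation from the commit: the function name `fill_simple_sketch` and the `_filled` suffix are hypothetical. It uses `Series.ffill()` / `Series.bfill()` because recent pandas releases deprecate the `method=` argument of `fillna`.

```python
import pandas as pd


def fill_simple_sketch(df, column, method, fill_value=None, suffix="_filled"):
    """Fill `column` with one of six simple methods into a new adjacent column."""
    s = df[column]
    if method == "mean":
        filled = s.fillna(s.mean())
    elif method == "median":
        filled = s.fillna(s.median())
    elif method == "mode":
        modes = s.mode(dropna=True)
        # An all-missing column has no mode; leave it unchanged in that case.
        filled = s.fillna(modes.iloc[0]) if not modes.empty else s.copy()
    elif method == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required when method='constant'")
        filled = s.fillna(fill_value)
    elif method == "ffill":
        filled = s.ffill()  # previous non-missing value
    elif method == "bfill":
        filled = s.bfill()  # next non-missing value
    else:
        raise ValueError(f"unknown method: {method}")

    out = df.copy()
    # Insert the new column directly to the right of the original one.
    out.insert(out.columns.get_loc(column) + 1, column + suffix, filled)
    return out


df = pd.DataFrame({"bp": [10, None, None, 20], "id": [1, 2, 3, 4]})
print(fill_simple_sketch(df, "bp", "ffill")["bp_filled"].tolist())  # [10.0, 10.0, 10.0, 20.0]
print(fill_simple_sketch(df, "bp", "bfill")["bp_filled"].tolist())  # [10.0, 20.0, 20.0, 20.0]
```

Note that forward fill leaves a leading missing value untouched and backward fill leaves a trailing one untouched, since there is no neighbour on that side to copy from.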
@@ -48,10 +58,10 @@
 - Implementation: uses `sklearn.impute.IterativeImputer`

 ### Phase 2: Future Extensions (not developed in this iteration)
-- Forward/Backward Fill
 - Grouped Imputation
 - Linear Interpolation
 - KNN Imputation
+- Combined imputation (apply different fill methods depending on conditions)

 ---

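The Tab 3 line in the hunk above names `sklearn.impute.IterativeImputer`. A minimal sketch of MICE-style imputation that skips non-numeric columns, as the commit message describes, might look like the following. The function name `mice_sketch`, the `_mice` suffix, and the sample data are hypothetical. The explicit shape check reflects a general property of scikit-learn imputers, which by default drop columns that are entirely missing at fit time; whether that is related to the shape mismatch listed under known issues is not stated in the commit.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer


def mice_sketch(df, columns, max_iter=10, random_state=0):
    """Impute numeric columns jointly; return the new frame and the skipped columns."""
    numeric = [c for c in columns if pd.api.types.is_numeric_dtype(df[c])]
    skipped = [c for c in columns if c not in numeric]
    if not numeric:
        raise ValueError("no numeric columns to impute")

    imputer = IterativeImputer(max_iter=max_iter, random_state=random_state)
    values = imputer.fit_transform(df[numeric])

    # Guard against the output having fewer columns than the input.
    if values.shape[1] != len(numeric):
        raise ValueError("imputer output shape does not match the selected columns")

    imputed = pd.DataFrame(values, columns=numeric, index=df.index)
    out = df.copy()
    for c in numeric:
        out[c + "_mice"] = imputed[c]
    return out, skipped


df = pd.DataFrame({
    "sbp": [120, 130, np.nan, 140, 125, 135],
    "dbp": [80, 85, 82, np.nan, 79, 88],
    "age": [40, 50, 45, 60, 42, 55],
    "sex": ["F", "M", None, "F", "M", "F"],
})
result, skipped = mice_sketch(df, ["sbp", "dbp", "age", "sex"])
print(skipped)  # ['sex']
```

Rebuilding the `DataFrame` with the original index and the same column list keeps row alignment explicit when the imputed values are written back.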
@@ -107,6 +117,8 @@
 │ ⚪ Median imputation (suits skewed numeric variables) ⭐ │
 │ ⚪ Mode imputation (suits categorical or discrete numeric variables) │
 │ ⚪ Constant imputation: [_______] ← user input │
+│ ⚪ Forward fill (uses the previous value; suits time series) │
+│ ⚪ Backward fill (uses the next value; suits time series) │
 │ │
 │ 📈 Fill preview: │
 │ ┌──────────────────────────────────────────────┐ │
@@ -195,7 +207,7 @@ def fillna_simple(
     df: pd.DataFrame,
     column: str,
     new_column_name: str,
-    method: Literal['mean', 'median', 'mode', 'constant'],
+    method: Literal['mean', 'median', 'mode', 'constant', 'ffill', 'bfill'],
     fill_value: Any = None
 ) -> dict:
     """
@@ -210,6 +222,8 @@ def fillna_simple(
     - 'median': median imputation
     - 'mode': mode imputation
     - 'constant': constant-value imputation
+    - 'ffill': forward fill (uses the previous non-missing value)
+    - 'bfill': backward fill (uses the next non-missing value)
     fill_value: the constant to fill with (required when method='constant')

     Returns:
@@ -324,7 +338,7 @@ async def operation_fillna_mice(request: FillnaMiceRequest):
 async executeFillnaSimple(params: {
   sessionId: string;
   column: string;
-  method: 'mean' | 'median' | 'mode' | 'constant';
+  method: 'mean' | 'median' | 'mode' | 'constant' | 'ffill' | 'bfill';
   fillValue?: any;
 }): Promise<any>

@@ -369,7 +383,7 @@ interface MissingValueDialogProps {
 // New state
 const [activeTab, setActiveTab] = useState<'delete' | 'fill' | 'mice'>('fill');
 const [selectedColumn, setSelectedColumn] = useState<string>('');
-const [fillMethod, setFillMethod] = useState<'mean' | 'median' | 'mode' | 'constant'>('median');
+const [fillMethod, setFillMethod] = useState<'mean' | 'median' | 'mode' | 'constant' | 'ffill' | 'bfill'>('median');
 const [fillValue, setFillValue] = useState<any>(null);
 const [columnStats, setColumnStats] = useState<any>(null);

@@ -501,6 +515,7 @@ const actionButtons = [
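 - Numeric column (skewed distribution): weight (20% missing)
 - Categorical column: marital status (10% missing)
 - Multiple columns with missing values: systolic BP (15%) + diastolic BP (12%)
+- Time series column: follow-up blood pressure (ordered, 18% missing), used to test forward/backward fill
 ```

 #### Test Cases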
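The test columns listed in the hunk above could be generated synthetically along these lines. This is a sketch, not the fixture from `test_fillna_operations.py`: the function name `make_test_frame`, the English column names, the distributions, and the row count are all assumptions; only the missing-value rates come from the list.

```python
import numpy as np
import pandas as pd


def make_test_frame(n=200, seed=42):
    """Build a synthetic frame with the missing rates from the test data list."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        # Log-normal values give a right-skewed numeric column.
        "weight": rng.lognormal(mean=4.2, sigma=0.25, size=n).round(1),
        "marital_status": rng.choice(["single", "married", "divorced"], size=n),
        "sbp": rng.normal(125, 15, size=n).round(0),
        "dbp": rng.normal(80, 10, size=n).round(0),
        # A random walk stands in for ordered follow-up measurements.
        "followup_bp": (120 + np.cumsum(rng.normal(0, 2, size=n))).round(0),
    })
    rates = {
        "weight": 0.20,
        "marital_status": 0.10,
        "sbp": 0.15,
        "dbp": 0.12,
        "followup_bp": 0.18,
    }
    for col, rate in rates.items():
        # Blank out an exact number of randomly chosen rows per column.
        idx = rng.choice(n, size=int(round(n * rate)), replace=False)
        df.loc[idx, col] = np.nan
    return df


test_df = make_test_frame()
print(test_df.isna().mean())
```

Choosing an exact count of missing rows per column, rather than sampling each cell independently, keeps the missing rates stable across seeds, which makes assertions in tests simpler.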
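@@ -670,6 +685,30 @@ scikit-learn >= 1.2.0 # ← required by MICE

 ## 📝 Changelog

+### 2025-12-10 Update (user request)
+
+**New features**:
+1. ✅ **Forward/backward fill added to this iteration** (originally planned for Phase 2)
+- Forward Fill: fills with the previous non-missing value
+- Backward Fill: fills with the next non-missing value
+- Use cases: time series data and other ordered observations
+
+**Impact**:
+- Tab 2 gains 2 fill options (6 methods in total)
+- The `method` parameter of the Python function `fillna_simple` now also accepts `'ffill'` and `'bfill'`
+- Test cases increase from 14 to 18
+- Development time increases from 5-6 hours to 6-7 hours
+
+**When to use each method**:
+- Mean/median: numeric variables with independent observations
+- Mode: categorical variables
+- Constant: user-defined scenarios
+- **Forward fill**: follow-up data (e.g. repeated measurements, filled with the previous value)
+- **Backward fill**: predictive data (filled with a known future value)
+- MICE: high-quality imputation that accounts for relationships between variables
+
+---
+
 ### 2025-12-09 Update (per user confirmation)

 **Core changes**: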
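The "when to use each method" guidance in the 2025-12-10 entry above could be turned into a simple recommendation rule, in the spirit of the "recommended methods" shown in the dialog. The sketch below is an assumption about how such a rule might look, not the logic shipped in the commit: the function name `recommend_method`, the `ordered` flag, and the skewness threshold of 1.0 are all hypothetical.

```python
import pandas as pd


def recommend_method(series, ordered=False, skew_threshold=1.0):
    """Suggest a simple fill method from the column type and distribution."""
    if ordered:
        # Ordered follow-up data: carry the previous observation forward.
        return "ffill"
    if not pd.api.types.is_numeric_dtype(series):
        # Categorical or text data: use the most frequent value.
        return "mode"
    skew = series.dropna().skew()
    if pd.notna(skew) and abs(skew) > skew_threshold:
        # Strongly skewed numeric data: the median is less sensitive to outliers.
        return "median"
    return "mean"


print(recommend_method(pd.Series(["a", "b", None])))              # mode
print(recommend_method(pd.Series([1, 1, 1, 2, 2, 3, 50, None])))  # median
print(recommend_method(pd.Series([1, 2, 3, 4, 5, None])))         # mean
```

Whether a column is ordered cannot be inferred from its values alone, which is why the sketch takes it as an explicit flag rather than trying to detect it.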