feat(admin): Add user management and upgrade to module permission system

Features - User Management (Phase 4.1):
- Database: Add user_modules table for fine-grained module permissions
- Database: Add 4 user permissions (view/create/edit/delete) to role_permissions
- Backend: UserService (780 lines) - CRUD with tenant isolation
- Backend: UserController + UserRoutes (648 lines) - 13 API endpoints
- Backend: Batch import users from Excel
- Frontend: UserListPage (412 lines) - list/filter/search/pagination
- Frontend: UserFormPage (341 lines) - create/edit with module config
- Frontend: UserDetailPage (393 lines) - details/tenant/module management
- Frontend: 3 modal components (592 lines) - import/assign/configure
- API: GET/POST/PUT/DELETE /api/admin/users/* endpoints

Architecture Upgrade - Module Permission System:
- Backend: Add getUserModules() method in auth.service
- Backend: Login API returns modules array in user object
- Frontend: AuthContext adds hasModule() method
- Frontend: Navigation filters modules based on user.modules
- Frontend: RouteGuard checks requiredModule instead of requiredVersion
- Frontend: Remove deprecated version-based permission system
- UX: Only show accessible modules in navigation (clean UI)
- UX: Smart redirect after login (avoid 403 for regular users)

Fixes:
- Fix UTF-8 encoding corruption in ~100 docs files
- Fix pageSize type conversion in userService (String to Number)
- Fix authUser undefined error in TopNavigation
- Fix login redirect logic with role-based access check
- Update Git commit guidelines v1.2 with UTF-8 safety rules

Database Changes:
- CREATE TABLE user_modules (user_id, tenant_id, module_code, is_enabled)
- ADD UNIQUE CONSTRAINT (user_id, tenant_id, module_code)
- INSERT 4 permissions + role assignments
- UPDATE PUBLIC tenant with 8 module subscriptions

Technical:
- Backend: 5 new files (~2400 lines)
- Frontend: 10 new files (~2500 lines)
- Docs: 1 development record + 2 status updates + 1 guideline update
- Total: ~4900 lines of code

Status: User management 100% complete, module permission system operational
@@ -1,99 +1,114 @@

# **Tool C: Tiered Checklist of AI-Assisted Medical Data Cleaning Scenarios**

This checklist is ordered from simple to complex by **technical implementation difficulty** and **business logic complexity**. All scenarios assume the data has already been loaded into a Pandas DataFrame (`df`).

## **Level 1: Basic Cleanup (Data Hygiene)**

*Goal: turn "dirty" data into "readable" data. Excel can do this too, but Python is faster and more accurate.*

### **1.1 Variable Name Standardization (Rename)**

* **Scenario:** The original headers are in Chinese or contain special characters (`年龄(岁)` for age in years, `性别/Gender` for sex, `入院_日期` for admission date), which makes SPSS raise errors.
* **User instruction:** "Convert all column names to plain lowercase English and remove the parentheses."
* **Python logic:** Rename the columns with a mapping dictionary or a regex replacement.
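
The renaming step can be sketched roughly as follows. This is a minimal illustration, not a fixed recipe: the English target names in `rename_map` and the `normalize` helper are assumptions made up for the example. Headers that need a real translation go through the dictionary, and the regex pass then tidies whatever is left.

```python
import re

import pandas as pd

# Raw headers taken from the scenario; the English names are assumed.
df = pd.DataFrame(columns=["年龄(岁)", "性别/Gender", "入院_日期"])
rename_map = {"年龄(岁)": "age", "性别/Gender": "sex", "入院_日期": "admission_date"}
df = df.rename(columns=rename_map)


def normalize(name: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics into '_', trim edge underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


# Hypothetical helper applied to every header after the explicit mapping.
df.columns = [normalize(c) for c in df.columns]
print(list(df.columns))
```

Keeping the dictionary explicit also leaves an audit trail of which original header became which variable.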

### **1.2 "Detoxing" Numeric Columns (Clean Numeric)**

* **Scenario:** Data exported from the clinical laboratory has symbols mixed into numeric columns (`>100`, `<0.1`, `12.5+`, `未查` meaning "not tested").
* **User instruction:** "Strip the non-numeric symbols from the 'creatinine' column, treat '<0.1' as '0.05', and convert the column to float."
* **Python logic:** `str.replace` + regex extraction + `pd.to_numeric(errors='coerce')`.
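
One way to combine those three steps is sketched below. The column name `creatinine` and the sample values are assumptions for illustration. The special-case rule is applied before the generic extraction so that `<0.1` does not silently become `0.1`.

```python
import pandas as pd

# Hypothetical lab export; 'creatinine' is an assumed column name.
df = pd.DataFrame({"creatinine": [">100", "<0.1", "12.5+", "未查", "88"]})

s = df["creatinine"].astype(str).str.strip()
# Special case first: a whole-cell value of "<0.1" is treated as 0.05.
s = s.replace({"<0.1": "0.05"})
# Keep the first numeric token; cells without one become NaN.
num = s.str.extract(r"(\d+\.?\d*)", expand=False)
df["creatinine_clean"] = pd.to_numeric(num, errors="coerce")
print(df)
```

Writing to a new column keeps the raw text available, which helps when censored values such as `>100` need a different rule later.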

### **1.3 Unify Missing Values (Standardize Nulls)**

* **Scenario:** The data mixes many tokens that stand for "empty": `NA`, `N/A`, `-`, `\`, `不详` ("unknown").
* **User instruction:** "Replace every token that means 'none' with the standard null value."
* **Python logic:** `df.replace(['NA', 'N/A', '-', '\\', '不详'], np.nan, inplace=True)`.
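
A small sketch of the replacement on made-up data follows. The column names and values are assumptions. `DataFrame.replace` with a list matches whole cell values, so a `-` inside a longer string is left alone.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names are assumed.
df = pd.DataFrame({
    "smoking": ["Yes", "NA", "-", "不详"],
    "bmi": ["22.1", "N/A", "\\", "30"],
})

# Tokens listed in the scenario; extend the list as new ones turn up.
null_tokens = ["NA", "N/A", "-", "\\", "不详"]
df = df.replace(null_tokens, np.nan)
print(df.isna().sum())
```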

## **Level 2: Variable Standardization and Recoding (Recode & Standardization)**

*Goal: prepare categorical variables for statistical analysis.*

### **2.1 Text-to-Numeric Mapping (Map Categorical)**

* **Scenario:** The sex column holds Male/Female and the smoking-history column holds Yes/No.
* **User instruction:** "Recode sex as 1 (male) / 0 (female), and smoking history as 1/0."
* **Python logic:** `df['sex'].map({'Male': 1, 'Female': 0})`.
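
The mapping can be made slightly more defensive, as in the sketch below. The sample values, the `smoking` column name, and the `_code` suffix are assumptions. `Series.map` returns NaN for any value missing from the dictionary, so it is worth listing those values rather than letting them disappear.

```python
import pandas as pd

# Illustrative data with one messy entry and one missing value.
df = pd.DataFrame({
    "sex": ["Male", "Female", "male ", None],
    "smoking": ["Yes", "No", "No", "Yes"],
})

sex_map = {"male": 1, "female": 0}
smoke_map = {"yes": 1, "no": 0}

# Normalize case and whitespace first so that "male " still matches.
df["sex_code"] = df["sex"].str.strip().str.lower().map(sex_map)
df["smoking_code"] = df["smoking"].str.strip().str.lower().map(smoke_map)

# Non-missing values that the mapping did not cover, for manual review.
unmapped = df.loc[df["sex_code"].isna() & df["sex"].notna(), "sex"].unique()
print(df, unmapped)
```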

### **2.2 Binning Continuous Variables (Binning)**

* **Scenario:** Age must be grouped for a chi-square test.
* **User instruction:** "Split age into three groups, 0-18, 19-60, and 60+, labeled 'minor', 'adult', and 'elderly'."
* **Python logic:** the `pd.cut()` function.
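
A sketch of the binning with `pd.cut` is shown below, on made-up ages. The instruction leaves age 60 ambiguous (it appears in both "19-60" and "60+"). This example assumes 60 belongs to the adult group, which is what right-closed intervals give.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 18, 19, 60, 61, 90]})

# Intervals: [0, 18], (18, 60], (60, inf). include_lowest keeps age 0 in the first bin.
bins = [0, 18, 60, np.inf]
labels = ["minor", "adult", "elderly"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, include_lowest=True)
print(df)
```

The result is an ordered categorical, which `pd.crosstab` can consume directly when building the chi-square contingency table.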

### **2.3 Complex Date Calculations (Date Logic)**

* **Scenario:** Compute overall survival (OS) time. Excel often gets leap years or month lengths wrong.
* **User instruction:** "Compute survival in months from 'diagnosis date' and 'follow-up date', rounded to 1 decimal place."
* **Python logic:** `(df['end_date'] - df['start_date']).dt.days / 30.4`.
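
The calculation might look like the sketch below. The column names `diagnosis_date` and `followup_date` and the sample rows are assumptions. The divisor `365.25 / 12` (about 30.44) is a common convention that stands in for the 30.4 quoted above. Unparseable dates are coerced to `NaT` so they surface as missing survival times instead of raising.

```python
import pandas as pd

# Illustrative data; column names are assumed.
df = pd.DataFrame({
    "diagnosis_date": ["2020-01-15", "2019-02-28", "not recorded"],
    "followup_date": ["2021-01-15", "2020-02-29", "2021-06-01"],
})

start = pd.to_datetime(df["diagnosis_date"], errors="coerce")
end = pd.to_datetime(df["followup_date"], errors="coerce")

DAYS_PER_MONTH = 365.25 / 12  # average month length, about 30.44 days
df["os_months"] = ((end - start).dt.days / DAYS_PER_MONTH).round(1)
print(df)
```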

## **Level 3: Clinical-Logic Feature Engineering (Feature Engineering)**

*Goal: derive new analysis metrics from medical knowledge.*

### **3.1 Composite Formula Calculation (Complex Formula)**

* **Scenario:** Compute eGFR (estimated glomerular filtration rate) or BMI.
* **User instruction:** "Compute BMI for me. If BMI > 28, add a new column flagged as 'obese'."
* **Python logic:** vectorized calculation `df['weight'] / (df['height']/100)**2` + conditional assignment with `np.where`.
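
A sketch of the BMI step follows, assuming weight is in kilograms and height in centimetres (the text does not state units, though the `/100` implies centimetres). A missing BMI compares as `False` against 28, so the example keeps it missing instead of labelling the patient "not obese".

```python
import numpy as np
import pandas as pd

# Illustrative data: weight in kg, height in cm (assumed units).
df = pd.DataFrame({
    "weight": [60.0, 95.0, np.nan],
    "height": [170.0, 175.0, 160.0],
})

df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# 1.0 = obese, 0.0 = not obese, NaN = BMI unavailable.
df["obese"] = np.where(df["bmi"].isna(), np.nan, (df["bmi"] > 28).astype(float))
print(df)
```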

### **3.2 Applying Inclusion/Exclusion Criteria (Cohort Selection)**

* **Scenario:** Select the population that meets the enrollment criteria.
* **User instruction:** "Select patients diagnosed with lung adenocarcinoma who are older than 18 and have no history of hypertension."
* **Python logic:** `df.query("diagnosis == 'Lung Adenocarcinoma' & age > 18 & hypertension == 0")`.
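
The selection can be tried on a toy table, as in the sketch below. The rows are invented for illustration. It uses `and` instead of `&` inside the query string purely for readability.

```python
import pandas as pd

# Illustrative data; hypertension is assumed to be coded 0/1.
df = pd.DataFrame({
    "diagnosis": [
        "Lung Adenocarcinoma",
        "Lung Adenocarcinoma",
        "Squamous Cell Carcinoma",
        "Lung Adenocarcinoma",
    ],
    "age": [65, 17, 70, 54],
    "hypertension": [0, 0, 0, 1],
})

cohort = df.query(
    "diagnosis == 'Lung Adenocarcinoma' and age > 18 and hypertension == 0"
)
print(f"{len(cohort)} of {len(df)} patients meet the criteria")
```

Reporting the before and after counts makes it easier to fill in a study flow diagram later.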

### **3.3 Dummy Variable Generation (One-Hot Encoding)**

* **Scenario:** Preparing a logistic regression with an unordered multi-category variable, "blood type (A, B, AB, O)".
* **User instruction:** "Generate dummy variables for blood type."
* **Python logic:** `pd.get_dummies(df['blood_type'], prefix='blood')`.
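
For regression use, one level is usually dropped as the reference category. The sketch below shows that variant on made-up data. `drop_first=True` and `dtype=int` are choices added for this example, not part of the call quoted above.

```python
import pandas as pd

df = pd.DataFrame({"blood_type": ["A", "B", "AB", "O", "A"]})

# drop_first=True removes the first level ("A") so it acts as the reference
# category, which avoids perfect collinearity with the model intercept.
dummies = pd.get_dummies(df["blood_type"], prefix="blood", drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```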

## **Level 4: Structural Reshaping and Advanced Governance (Reshaping & Governance)**

*Goal: change the table structure to fit a specific statistical model, or perform advanced data repair.*
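
### **4.1 Long-to-Wide Conversion (Pivot/Melt): Excel's Nightmare**

* **Scenario:** The data is currently "multiple rows per person" (Zhang San, lab test 1; Zhang San, lab test 2). A repeated-measures analysis needs "one row per person" (Zhang San, test 1, test 2).
* **User instruction:** "Convert the table from long to wide, indexed by patient ID, using 'visit order' as the suffix and spreading out the 'white blood cell' column."
* **Python logic:** `df.pivot(index='id', columns='visit', values='wbc')`.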
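
A sketch of the reshape on toy data follows. The patient IDs and values are invented. `DataFrame.pivot` raises an error when an `(id, visit)` pair occurs more than once, so the example checks for that first and then flattens the column labels into the `wbc_<visit>` form the instruction asks for.

```python
import pandas as pd

# Illustrative long-format data.
df = pd.DataFrame({
    "id": ["P01", "P01", "P02", "P02"],
    "visit": [1, 2, 1, 2],
    "wbc": [5.2, 6.1, 7.4, 6.9],
})

# pivot cannot resolve duplicate (id, visit) pairs, so fail early with a clear message.
if df.duplicated(["id", "visit"]).any():
    raise ValueError("duplicate (id, visit) rows; resolve them before pivoting")

wide = df.pivot(index="id", columns="visit", values="wbc")
wide.columns = [f"wbc_{v}" for v in wide.columns]  # visit number becomes the suffix
wide = wide.reset_index()
print(wide)
```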

### **4.2 Smart Deduplication**

* **Scenario:** The same patient has two records, one complete and one with gaps.
* **User instruction:** "Deduplicate by patient ID. When duplicates exist, keep the record with the most recent 'exam date'; if the dates tie, keep the one with the highest 'data completeness'."
* **Python logic:** `df.sort_values(['date', 'completeness']).drop_duplicates(subset=['id'], keep='last')`.
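
The text does not say where the `completeness` column comes from. The sketch below assumes it is the share of non-missing values across the clinical columns. That definition and the sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data: P01 has two dates, P02 has two records on the same date.
df = pd.DataFrame({
    "id": ["P01", "P01", "P02", "P02"],
    "date": ["2023-01-01", "2023-03-01", "2023-02-01", "2023-02-01"],
    "bmi": [22.0, np.nan, np.nan, 25.0],
    "creatinine": [80.0, 75.0, np.nan, 90.0],
})
df["date"] = pd.to_datetime(df["date"])

# Assumed definition: fraction of clinical fields that are filled in.
df["completeness"] = df[["bmi", "creatinine"]].notna().mean(axis=1)

# Ascending sort puts the newest and most complete record last within each id.
dedup = (
    df.sort_values(["id", "date", "completeness"])
      .drop_duplicates(subset=["id"], keep="last")
)
print(dedup)
```

Note that the date takes priority: for P01 the later but less complete record wins, which is what the instruction asks for.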

### **4.3 Cross-Column Logic Validation (Cross-Check)**

* **Scenario:** Detect dirty data.
* **User instruction:** "Check for erroneous records where sex is 'male' but 'pregnancy count > 0', and flag them."
* **Python logic:** `df.loc[(df['sex']=='男') & (df['preg_count']>0), 'error_flag'] = 1` (here `'男'` is the stored value for "male").
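
When more than one consistency rule is needed, the same idea can be organised as a dictionary of boolean masks. The sketch below is one possible layout. The rule name and sample rows are assumptions, and only the male/pregnancy rule comes from the text.

```python
import pandas as pd

# Illustrative data; '男' = male, '女' = female.
df = pd.DataFrame({"sex": ["男", "女", "男"], "preg_count": [0, 2, 1]})

# Each rule is a mask that is True where a record is internally inconsistent.
rules = {
    "male_with_pregnancy": (df["sex"] == "男") & (df["preg_count"] > 0),
}

df["error_flag"] = 0  # start from "no error" so unflagged rows are 0, not NaN
for name, mask in rules.items():
    df.loc[mask, "error_flag"] = 1
    print(f"{name}: {int(mask.sum())} record(s)")
```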

### **4.4 Multiple Imputation: Advanced Statistical Imputation**

* **Scenario:** The dataset has missing values (for example, missing BMI), and filling with the plain mean would distort the data distribution. The fill should instead be predicted from correlations with other variables (such as age, sex, creatinine).
* **User instruction:** "Use multiple imputation (MICE) to fill the missing values in the 'BMI' and 'age' columns."
* **Python logic:**

  ```python
  # The experimental flag must be imported before IterativeImputer.
  from sklearn.experimental import enable_iterative_imputer  # noqa: F401
  from sklearn.impute import IterativeImputer

  # Impute numeric columns only.
  cols = ['bmi', 'age', 'creatinine']
  # With default settings this produces a single completed dataset; repeating the
  # fit with sample_posterior=True and different seeds gives multiple imputations.
  imp = IterativeImputer(max_iter=10, random_state=0)
  df[cols] = imp.fit_transform(df[cols])
  ```

## **Level 5: Unstructured Text Mining (Text Mining): Python's Undisputed Territory**

*Goal: "dig" data out of notes or report text. Excel simply cannot do this.*

### **5.1 Regular Expression Extraction (Regex Extraction)**

* **Scenario:** There is a single free-text column, "pathology diagnosis", with content such as "(左肺上叶)浸润性腺癌,大小3.5*2cm" ("(left upper lobe) invasive adenocarcinoma, size 3.5*2cm"). The tumor size needs to be extracted.
* **User instruction:** "Extract the tumor's long diameter (the largest number) from 'pathology diagnosis'."
* **Python logic:** `df['text'].str.extract(r'(\d+\.?\d*)\s*[\*xX]\s*(\d+\.?\d*)')`, then take the maximum.
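
The extraction and the "take the maximum" step can be sketched end to end as below. The second and third sample rows are invented, and the output column name `long_diameter_cm` assumes all sizes are reported in centimetres.

```python
import pandas as pd

# First row is the example from the scenario; the others are illustrative.
df = pd.DataFrame({
    "text": [
        "(左肺上叶)浸润性腺癌,大小3.5*2cm",
        "腺癌,大小 1.2 x 2.8 cm",
        "未见肿瘤",
    ]
})

# Two numbers separated by *, x, or X; each number may have a decimal part.
pattern = r"(\d+\.?\d*)\s*[*xX]\s*(\d+\.?\d*)"
dims = df["text"].str.extract(pattern).astype(float)

# The long diameter is the larger of the two captured values; no match gives NaN.
df["long_diameter_cm"] = dims.max(axis=1)
print(df)
```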
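
### **5.2 Fuzzy String Matching (Fuzzy Matching)**

* **Scenario:** Hospital names were entered inconsistently: "协和医院", "北京协和", "协和" (all variants of Peking Union Medical College Hospital). They need to be unified.
* **User instruction:** "In the 'hospital name' column, change every value containing '协和' to 'PUMCH'."
* **Python logic:** `df.loc[df['hospital'].str.contains('协和', na=False), 'hospital'] = 'PUMCH'` (`na=False` keeps missing values out of the boolean mask, which `df.loc` would otherwise reject).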
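
A keyword-rule version of this unification is sketched below. The `rules` dictionary layout, the `hospital_std` output column, and the extra sample rows are assumptions. Only the `协和` to `PUMCH` rule comes from the text. Writing to a separate column keeps the raw entry available for review.

```python
import pandas as pd

# Illustrative data, including an unrelated hospital and a missing value.
df = pd.DataFrame({"hospital": ["协和医院", "北京协和", "协和", "Other Hospital", None]})

# keyword -> canonical name; add further rules here as needed.
rules = {"协和": "PUMCH"}

df["hospital_std"] = df["hospital"]
for keyword, canonical in rules.items():
    # regex=False treats the keyword as plain text; na=False skips missing values.
    mask = df["hospital"].str.contains(keyword, na=False, regex=False)
    df.loc[mask, "hospital_std"] = canonical
print(df)
```

Substring rules like this are only as safe as the keyword: any other hospital whose name contains the same characters would be merged too, so the unmatched and matched values are worth a quick manual check.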