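Introduction to the skrub library

python

Data analysis

Machine learning

Data cleaning

skrub is a Python library that simplifies preprocessing and feature engineering for machine learning on tabular data.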

Author

不止BI

Published

April 13, 2025

skrub provides the employee_salaries dataset as example data; it is loaded with fetch_employee_salaries.

Code

from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees_df, salaries = dataset.X, dataset.y

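Data overview

Cleaner cleans a dataframe: it parses null values and dates, and drops columns that contain too many null values.
TableReport generates a report that summarizes a dataframe: a sample of rows, per-column statistics, and associations between columns.

Code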

from skrub import Cleaner, TableReport

employees_df = Cleaner().fit_transform(employees_df)
TableReport(employees_df)

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	1986-09-22 00:00:00	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	1988-09-12 00:00:00	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	1989-11-19 00:00:00	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	2014-05-05 00:00:00	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	2007-03-05 00:00:00	2007

9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	2015-11-03 00:00:00	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	1988-11-28 00:00:00	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	2001-04-30 00:00:00	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	2006-09-05 00:00:00	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	2012-01-30 00:00:00	2012

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	0 (0.0%)	443 (4.8%)
6	date_first_hired	DateTime64DType	0 (0.0%)	2264 (24.5%)			1965-09-30T00:00:00		2016-12-27T00:00:00
7	year_first_hired	Int64DType	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
date_first_hired	year_first_hired	0.918
division	assignment_category	0.618
assignment_category	employee_position_title	0.500
division	employee_position_title	0.422
department	employee_position_title	0.415
department_name	employee_position_title	0.415
department	assignment_category	0.413
department_name	assignment_category	0.413
gender	department_name	0.379
gender	department	0.379
department	division	0.365
department_name	division	0.365
gender	employee_position_title	0.265
gender	division	0.246
gender	assignment_category	0.246
employee_position_title	year_first_hired	0.143
employee_position_title	date_first_hired	0.142
department	date_first_hired	0.0870
department_name	date_first_hired	0.0870

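Interpreting Cramér's V

Cramér's V measures the strength of association between two categorical (nominal) variables, with values ranging from 0 to 1. Introduced by the Swedish statistician Harald Cramér, it is derived from the chi-squared statistic and can be used for contingency tables larger than 2×2, which makes it a useful complement to the chi-squared test.

0: no association between the variables (complete independence)
1: perfect association between the variables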
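
To make the statistic concrete, here is a minimal sketch that computes Cramér's V from two lists of category labels with the textbook formula V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of distinct categories. The function name `cramers_v` and the toy data are illustrative assumptions; skrub's own computation may differ in details such as how it handles numeric or datetime columns.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V for two equally long sequences of category labels."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant column carries no association
    return sqrt(chi2 / (n * (k - 1)))


# A code and its full name determine each other, like department / department_name.
codes = ["POL", "POL", "HHS", "HHS", "COR", "COR"]
names = ["Police", "Police", "Health", "Health", "Correction", "Correction"]
print(cramers_v(codes, names))  # approximately 1.0

# Two labels that are independent in this sample.
a = ["x", "x", "y", "y"]
b = ["p", "q", "p", "q"]
print(cramers_v(a, b))  # 0.0
```

This is why the department / department_name pair scores 1.00 in the report: each code maps to exactly one name.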

Code

from skrub import patch_display, unpatch_display

patch_display()

employees_df

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	1986-09-22 00:00:00	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	1988-09-12 00:00:00	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	1989-11-19 00:00:00	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	2014-05-05 00:00:00	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	2007-03-05 00:00:00	2007

9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	2015-11-03 00:00:00	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	1988-11-28 00:00:00	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	2001-04-30 00:00:00	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	2006-09-05 00:00:00	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	2012-01-30 00:00:00	2012

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	0 (0.0%)	443 (4.8%)
6	date_first_hired	DateTime64DType	0 (0.0%)	2264 (24.5%)			1965-09-30T00:00:00		2016-12-27T00:00:00
7	year_first_hired	Int64DType	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
date_first_hired	year_first_hired	0.914
division	assignment_category	0.607
assignment_category	employee_position_title	0.492
division	employee_position_title	0.426
department	employee_position_title	0.409
department_name	employee_position_title	0.409
department	assignment_category	0.397
department_name	assignment_category	0.397
gender	department_name	0.380
gender	department	0.380
department	division	0.371
department_name	division	0.371
gender	employee_position_title	0.265
gender	assignment_category	0.257
gender	division	0.246
employee_position_title	date_first_hired	0.148
employee_position_title	year_first_hired	0.146
department	date_first_hired	0.0958
department_name	date_first_hired	0.0958

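patch_display replaces the default display of pandas and polars dataframes with the more readable report format shown above; unpatch_display restores the default display.

Code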

unpatch_display()

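Data cleaning

Joiner performs fuzzy joins between two tables: each row of the main table is augmented with the values of the best-matching row in the auxiliary table. The max_dist parameter sets the maximum distance at which a fuzzy match is still accepted.

Code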

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)

capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)

	airport_id	airport_name	city	capital	country
0	1	Charles de Gaulle	Paris	Paris	France
1	2	Aeroporto Leonardo da Vinci	Roma	Rome	Italy
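
The idea behind a fuzzy join (pick the closest auxiliary row, then accept it only if it is close enough) can be sketched with the standard library alone. The sketch below uses `difflib.SequenceMatcher` as a stand-in string distance; this is an assumption made for illustration, not the distance skrub uses, so the threshold here is not comparable with Joiner's max_dist. The function `fuzzy_left_join` and the third airport row are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(main_rows, aux_rows, main_key, aux_key, max_dist):
    """Augment each main row with the closest aux row, if it is close enough."""
    aux_columns = list(aux_rows[0].keys()) if aux_rows else []
    joined = []
    for row in main_rows:
        best_row, best_dist = None, float("inf")
        for candidate in aux_rows:
            ratio = SequenceMatcher(None, row[main_key], candidate[aux_key]).ratio()
            dist = 1.0 - ratio  # 0.0 means identical strings
            if dist < best_dist:
                best_row, best_dist = candidate, dist
        if best_row is not None and best_dist <= max_dist:
            match = best_row
        else:
            # No acceptable match: fill the auxiliary columns with None.
            match = {col: None for col in aux_columns}
        joined.append({**row, **match})
    return joined


airports = [
    {"airport_name": "Charles de Gaulle", "city": "Paris"},
    {"airport_name": "Aeroporto Leonardo da Vinci", "city": "Roma"},
    {"airport_name": "Hypothetical Field", "city": "Xyzzy"},  # made-up row
]
capitals = [
    {"capital": "Berlin", "country": "Germany"},
    {"capital": "Paris", "country": "France"},
    {"capital": "Rome", "country": "Italy"},
]

result = fuzzy_left_join(airports, capitals, "city", "capital", max_dist=0.5)
for row in result:
    print(row["city"], "->", row["capital"], row["country"])
```

"Roma" is matched to "Rome" despite the spelling difference, while the made-up city has no capital within the threshold and keeps empty auxiliary columns.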

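AggJoiner aggregates an auxiliary table and joins the result onto the main table, for example aggregating flight records into the airport table.

Code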

from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)

	airport_id	airport_name	city	total_passengers_mean	total_passengers_std
0	1	Charles de Gaulle	Paris	103.333333	15.275252
1	2	Aeroporto Leonardo da Vinci	Roma	80.000000	10.000000
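
Conceptually, this is a group-by aggregation followed by a left join. The following plain-Python sketch reproduces the two aggregated columns from the flights data above; it is meant to illustrate the steps, not to describe how AggJoiner is implemented. The variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

airports = [
    {"airport_id": 1, "city": "Paris"},
    {"airport_id": 2, "city": "Roma"},
]
flights = [
    {"from_airport": 1, "total_passengers": 90},
    {"from_airport": 1, "total_passengers": 120},
    {"from_airport": 1, "total_passengers": 100},
    {"from_airport": 2, "total_passengers": 70},
    {"from_airport": 2, "total_passengers": 80},
    {"from_airport": 2, "total_passengers": 90},
]

# Step 1: group the auxiliary rows by the join key.
passengers_by_airport = defaultdict(list)
for flight in flights:
    passengers_by_airport[flight["from_airport"]].append(flight["total_passengers"])

# Step 2: reduce each group to one row of summary statistics.
# stdev is the sample standard deviation (n - 1 in the denominator).
aggregated = {
    key: {"total_passengers_mean": mean(vals), "total_passengers_std": stdev(vals)}
    for key, vals in passengers_by_airport.items()
}

# Step 3: left-join the aggregated rows onto the main table.
joined = [
    {**airport, **aggregated.get(airport["airport_id"], {})} for airport in airports
]
for row in joined:
    print(row)
```

The main table keeps one row per airport, which is the point of aggregating before joining: a plain join on flights would repeat each airport once per flight.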

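MultiAggJoiner does the same for several auxiliary tables at once: each one is aggregated and joined onto the main table, for example aggregating hospitalization, medication, and glucose records into the patient table.

Code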

import pandas as pd
from skrub import MultiAggJoiner

patients = pd.DataFrame({
    "patient_id": [1, 2],
    "age": ["72", "45"],
})

hospitalizations = pd.DataFrame({
    "visit_id": range(1, 7),
    "patient_id": [1, 1, 1, 1, 2, 2],
    "days_of_stay": [2, 4, 1, 1, 3, 12],
    "hospital": ["Cochin", "Bichat", "Cochin", "Necker", "Bichat", "Bichat"],
})

medications = pd.DataFrame({
    "medication_id": range(1, 6),
    "patient_id": [1, 1, 1, 1, 2],
    "medication": ["ozempic", "ozempic", "electrolytes", "ozempic", "morphine"],
})

glucose = pd.DataFrame({
    "biology_id": range(1, 7),
    "patientID": [1, 1, 1, 1, 2, 2],
    "value": [1.4, 3.4, 1.0, 0.8, 3.1, 6.5],
})

multi_agg_joiner = MultiAggJoiner(
    aux_tables=[hospitalizations, medications, glucose],
    main_keys=[["patient_id"], ["patient_id"], ["patient_id"]],
    aux_keys=[["patient_id"], ["patient_id"], ["patientID"]],
    cols=[["days_of_stay"], ["medication"], ["value"]],
    operations=[["max"], ["mode"], ["mean", "std"]],
    suffixes=["", "", "_glucose"],
)

multi_agg_joiner.fit_transform(patients)

	patient_id	age	days_of_stay_max	medication_mode	value_mean_glucose	value_std_glucose
0	1	72	4	ozempic	1.65	1.193035
1	2	45	12	morphine	4.80	2.404163

Data encoding

Code

from skrub import GapEncoder

X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)
enc.fit(X)

GapEncoder(n_components=2, random_state=0)


Code

encoded = enc.fit_transform(X).assign(original=X)
encoded

	city: madrid, spain, sp	city: italia, italy, romq	original
0	0.052257	13.547743	Rome, Italy
1	0.050202	3.049798	Rome
2	0.063282	15.036718	Roma, Italia
3	12.047028	0.052972	Madrid, SP
4	16.547818	0.052182	Madrid, spain
5	6.048861	0.051139	Madrid
6	0.050019	3.049981	Romq
7	0.053193	9.046807	Rome, It
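
GapEncoder works from substring (character n-gram) counts, which is why a typo such as "Romq" still lands in the same component as "Rome". The sketch below does not implement GapEncoder; it only illustrates, under that assumption, how overlapping character 3-grams make misspelled strings look similar. The helper names `char_ngrams` and `jaccard` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams, padded with one space at each end."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Share of n-grams that two strings have in common."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(sorted(char_ngrams("Romq")))
print(jaccard("Romq", "Rome"))    # shares " ro" and "rom"
print(jaccard("Romq", "Madrid"))  # 0.0, no 3-gram in common
```

An exact-match encoding such as one-hot would treat "Romq" and "Rome" as unrelated categories; a substring-based representation keeps them close.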

Machine learning

The tabular_learner function in skrub offers a convenient way to build a simple but reliable machine-learning model for tabular data.

Code

from sklearn.model_selection import cross_validate

from skrub import tabular_learner

model = tabular_learner("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]

array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
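
Because no scoring argument is passed, cross_validate falls back to the estimator's default score, which for scikit-learn regressors is the coefficient of determination R², computed on each of the 5 default folds. As a reading aid, here is a minimal plain-Python sketch of R² together with a summary of the five fold scores reported above; the function name `r_squared` is illustrative.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_mean = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0, perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0, no better than predicting the mean

# Fold scores copied from the output above.
fold_scores = [0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666]
print(round(mean(fold_scores), 3))  # 0.911
```

An average R² of about 0.91 means the default model explains roughly 91% of the variance in salaries on held-out folds.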
