Introduction to the skrub library

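python
data analysis
machine learning
data cleaning
skrub is a Python library that simplifies preprocessing and feature engineering for machine learning on tabular data.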
Author

不止BI

Published

April 13, 2025

skrub provides employee_salaries as an example dataset, loaded with fetch_employee_salaries.

Code
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees_df, salaries = dataset.X, dataset.y

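Data overview

  • Cleaner cleans a dataframe: it parses null values and dates, and drops columns that contain too many null values.

  • TableReport generates an interactive report that summarizes the dataframe.

Code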
from skrub import Cleaner, TableReport

employees_df = Cleaner().fit_transform(employees_df)
TableReport(employees_df)
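To make the clean-up steps listed above more concrete, here is a minimal pandas-only sketch of the same kind of processing. It is not skrub's implementation: the toy table, the NULL_STRINGS list, and the threshold of 1.0 (drop only fully empty columns) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): a date column stored as text, a null marker
# written as a string, and a column that contains no values at all.
raw = pd.DataFrame(
    {
        "date_first_hired": ["2020-01-15", "2021-06-30", "N/A"],
        "department": ["POL", "FRS", "HHS"],
        "notes": [None, None, None],
    }
)

# Assumed list of strings to treat as missing values.
NULL_STRINGS = ["N/A", "", "?"]

# 1. Turn null-like strings into real missing values.
cleaned = raw.mask(raw.isin(NULL_STRINGS))

# 2. Drop columns whose fraction of missing values reaches the threshold
#    (here 1.0, i.e. only columns that are entirely empty).
null_fraction = cleaned.isna().mean()
cleaned = cleaned.loc[:, null_fraction < 1.0].copy()

# 3. Parse the text dates into a proper datetime column.
cleaned["date_first_hired"] = pd.to_datetime(
    cleaned["date_first_hired"], format="%Y-%m-%d"
)

print(cleaned.dtypes)
```

With Cleaner, these steps are applied in a single fit_transform call without naming individual columns, which is convenient on a table with many columns.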
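Interpreting Cramér’s V

TableReport uses Cramér’s V to report associations between columns. Cramér’s V is a statistic that measures the strength of association between two categorical (nominal) variables; its value lies between 0 and 1. It was introduced by the Swedish statistician Harald Cramér. It can be used with contingency tables larger than 2×2 and complements the chi-squared test: the test indicates whether an association exists, while V indicates how strong it is.

  • 0: no association between the variables (complete independence)
  • 1: perfect association between the variables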
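The two extreme values can be checked by hand. The sketch below computes the standard, uncorrected form V = sqrt(chi2 / (n * min(r - 1, c - 1))) with NumPy. It is an illustration only, not necessarily how skrub computes the statistic; the function name cramers_v and the two toy tables are assumptions, and every row and column is assumed to contain at least one observation.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V for a contingency table of counts (no bias correction)."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Each row category maps to exactly one column category: perfect association.
perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
# Every row has the same distribution over the columns: independence.
independent = [[10, 20], [10, 20], [10, 20]]

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)
print(v_perfect, v_independent)  # approximately 1.0 and 0.0
```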
Code
from skrub import patch_display, unpatch_display

patch_display()

employees_df

patch_display changes the default display of pandas and polars dataframes to a more readable format; unpatch_display restores the default display.

Code
unpatch_display()

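Data cleaning

Joiner performs fuzzy joins between tables: each row of the main table is augmented with the values of its best match in the auxiliary table. The max_dist parameter sets the maximum distance at which two rows still count as a match.

Code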
import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)

capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
   airport_id                 airport_name   city  capital  country
0           1            Charles de Gaulle  Paris    Paris   France
1           2  Aeroporto Leonardo da Vinci   Roma     Rome    Italy
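The idea of "best match within a maximum distance" can be sketched with the standard library alone. This is only an analogy: it uses difflib's similarity ratio, which is not the distance Joiner computes, so the threshold below is not comparable to the max_dist=0.8 used above. The helper best_match and the extra city "Tokyo" are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(value, candidates, max_dist):
    """Return the closest candidate, or None if it is farther than max_dist."""
    scored = [
        (1.0 - SequenceMatcher(None, value.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    dist, candidate = min(scored)
    return candidate if dist <= max_dist else None


capital_names = ["Berlin", "Paris", "Rome"]
matches = {
    city: best_match(city, capital_names, max_dist=0.5)
    for city in ["Paris", "Roma", "Tokyo"]
}
print(matches)  # {'Paris': 'Paris', 'Roma': 'Rome', 'Tokyo': None}
```

"Roma" is matched to "Rome" despite the spelling difference, while "Tokyo" has no candidate close enough and is left unmatched, which mirrors the role of the distance threshold in a fuzzy join.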

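AggJoiner aggregates an auxiliary table and joins the result onto the main table, for example aggregating flight information onto the airports table.

Code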
from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)
   airport_id                 airport_name   city  total_passengers_mean  total_passengers_std
0           1            Charles de Gaulle  Paris             103.333333             15.275252
1           2  Aeroporto Leonardo da Vinci   Roma              80.000000             10.000000
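For comparison, the same result can be reproduced in plain pandas with a group-by followed by a left join. This is a sketch of the equivalent operation, not of AggJoiner's internals; the variable names passenger_stats and manual are assumptions.

```python
import pandas as pd

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)

# Step 1: aggregate the auxiliary table per join key.
passenger_stats = (
    flights.groupby("from_airport")["total_passengers"]
    .agg(["mean", "std"])
    .add_prefix("total_passengers_")
    .reset_index()
)

# Step 2: left-join the aggregates onto the main table.
manual = airports.merge(
    passenger_stats, left_on="airport_id", right_on="from_airport", how="left"
).drop(columns="from_airport")
print(manual)
```

The benefit of AggJoiner over this manual version is that it is a scikit-learn transformer, so the aggregation can be placed inside a pipeline and cross-validated with the rest of the model.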

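MultiAggJoiner aggregates several auxiliary tables and joins each result onto the main table, for example aggregating medical records onto a patient table.

Code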
import pandas as pd
from skrub import MultiAggJoiner

patients = pd.DataFrame({
    "patient_id": [1, 2],
    "age": ["72", "45"],
})

hospitalizations = pd.DataFrame({
    "visit_id": range(1, 7),
    "patient_id": [1, 1, 1, 1, 2, 2],
    "days_of_stay": [2, 4, 1, 1, 3, 12],
    "hospital": ["Cochin", "Bichat", "Cochin", "Necker", "Bichat", "Bichat"],
})

medications = pd.DataFrame({
    "medication_id": range(1, 6),
    "patient_id": [1, 1, 1, 1, 2],
    "medication": ["ozempic", "ozempic", "electrolytes", "ozempic", "morphine"],
})

glucose = pd.DataFrame({
    "biology_id": range(1, 7),
    "patientID": [1, 1, 1, 1, 2, 2],
    "value": [1.4, 3.4, 1.0, 0.8, 3.1, 6.5],
})

multi_agg_joiner = MultiAggJoiner(
    aux_tables=[hospitalizations, medications, glucose],
    main_keys=[["patient_id"], ["patient_id"], ["patient_id"]],
    aux_keys=[["patient_id"], ["patient_id"], ["patientID"]],
    cols=[["days_of_stay"], ["medication"], ["value"]],
    operations=[["max"], ["mode"], ["mean", "std"]],
    suffixes=["", "", "_glucose"],
)

multi_agg_joiner.fit_transform(patients)
   patient_id  age  days_of_stay_max  medication_mode  value_mean_glucose  value_std_glucose
0           1   72                 4          ozempic                1.65           1.193035
1           2   45                12         morphine                4.80           2.404163

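Data encoding

GapEncoder encodes noisy categorical strings as activations of latent topics inferred from the substrings they share.

Code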
from skrub import GapEncoder

X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)
enc.fit(X)
GapEncoder(n_components=2, random_state=0)
Code
encoded = enc.fit_transform(X).assign(original=X)
encoded
   city: madrid, spain, sp  city: italia, italy, romq       original
0                 0.052257                  13.547743    Rome, Italy
1                 0.050202                   3.049798           Rome
2                 0.063282                  15.036718   Roma, Italia
3                12.047028                   0.052972     Madrid, SP
4                16.547818                   0.052182  Madrid, spain
5                 6.048861                   0.051139         Madrid
6                 0.050019                   3.049981           Romq
7                 0.053193                   9.046807       Rome, It
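The typo "Romq" lands in the same topic as the Rome variants because it shares substrings with them. The sketch below illustrates only that overlap, using character 3-grams and a Jaccard score; it does not reproduce the factorization GapEncoder performs. The helpers char_ngrams and jaccard and the choice of n=3 are assumptions, and the inputs are assumed to be at least three characters long.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Share of n-grams that two strings have in common."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


sim_rome = jaccard("Romq", "Rome")      # shares the 3-gram "rom"
sim_madrid = jaccard("Romq", "Madrid")  # shares no 3-gram
print(sim_rome, sim_madrid)
```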

Machine learning

The tabular_learner function in skrub offers a convenient way to build a simple but reliable machine learning model.

Code
from sklearn.model_selection import cross_validate

from skrub import tabular_learner

model = tabular_learner("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
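The five values are one score per fold: cross_validate uses 5-fold cross-validation by default, and with no scoring argument it calls the estimator's own score method, which for scikit-learn regressors is R². A short sketch of how the folds might be summarized, with the scores copied from the output above:

```python
import numpy as np

# Fold scores copied from the output above.
test_scores = np.array(
    [0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666]
)

mean_score = test_scores.mean()
std_score = test_scores.std()
print(f"R2 = {mean_score:.3f} +/- {std_score:.3f} over {len(test_scores)} folds")
```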
