Data Manipulation Syntax Comparison Tutorial

DuckDB vs R (Tidyverse & data.table) vs Python (Pandas)

1 Installing Packages (Libraries)

When you open this page, installation of dplyr, tidyr, and data.table begins automatically. How long it takes depends on your network connection; please be patient and do not refresh the page, or the installation will start over. Other packages can be installed with webr::install and then loaded with library.

webr::install("ggplot2")

When you open this page, commonly used data analysis libraries such as pandas and matplotlib are installed automatically. Libraries that are not pre-installed can be installed with micropip and then loaded with import. For example, to install pandas:

import micropip
# micropip.install is asynchronous, so await it (top-level await works in Pyodide)
await micropip.install('pandas')
import pandas as pd

2 The Dataset

The penguins dataset is a public dataset of biometric measurements for three penguin species from Antarctica. All examples are based on this dataset, so run the code blocks below first to load it into the environment.

  • species: penguin species (Adelie, Chinstrap, Gentoo)
  • island: island of residence (Biscoe, Dream, Torgersen)
  • sex: sex (male, female)
  • bill_length_mm: bill length (mm)
  • bill_depth_mm: bill depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)

First, we load the CSV data into DuckDB's in-memory database, creating a table named penguins. All subsequent SQL operations run against this table.

-- Load CSV data from a URL and create a table named 'penguins'
CREATE TABLE penguins AS SELECT * FROM 'penguins.csv';

-- Show the first 5 rows to confirm the load succeeded
SELECT * FROM penguins LIMIT 5;
library(dplyr)

tb_penguins = read.csv(
  "https://cdn.jsdelivr.net/gh/mwaskom/seaborn-data/penguins.csv"
) %>%
  as_tibble()
head(tb_penguins)
library(data.table)

dt_penguins = as.data.table(
  read.csv(
    "https://cdn.jsdelivr.net/gh/mwaskom/seaborn-data/penguins.csv"
  )
)
head(dt_penguins)
import pandas as pd

df_penguins = pd.read_csv("https://cdn.jsdelivr.net/gh/mwaskom/seaborn-data/penguins.csv")
df_penguins.head(5)

3 Chaining Operations

SQL implements chain-like logic mainly through nested queries or common table expressions (CTEs). A CTE (the WITH clause) improves readability.


-- Chaining with a CTE
WITH filtered_penguins AS (
  SELECT * FROM penguins WHERE species != 'Adelie'
)
SELECT species, COUNT(*) AS N FROM filtered_penguins GROUP BY species;

You can chain with %>% or the native |> pipe; the RStudio keyboard shortcut is Ctrl+Shift+M.

library(dplyr)

tb_penguins %>%
  group_by(species) %>%
  summarise(Ncount = n())

Chain with square brackets, in the form DT[ ... ][ ... ][ ... ]; for readability the chain can also be laid out vertically.

library(data.table)

dt_penguins[species != 'Adelie'][, .N, by = species]

Python chains with the . operator; for readability, split a long chain across lines with backslashes or wrap the whole chain in parentheses.

# Split with backslashes
df_penguins \
    .groupby('species') \
    .size() \
    .reset_index(name='count')

4 Filtering Rows

Filter rows on conditions with the WHERE clause.


SELECT *
FROM penguins
WHERE species = 'Adelie'
  AND bill_length_mm >= 40
  AND island = 'Torgersen';
  • Use slice for positional row selection; slice_head and slice_tail quickly take rows from the top or bottom
  • Use filter for conditional filtering
  • Use slice_max or slice_min to quickly take the rows with the largest or smallest values of a column

# Conditional filtering
tb_penguins %>%
  filter(species == 'Adelie', bill_length_mm >= 40, island == 'Torgersen')

data.table's general data-wrangling form is DT[i, j, by], where the i position filters and orders rows.

library(data.table)

# Conditional filtering
dt_penguins[species == 'Adelie' & bill_length_mm >= 40 & island == 'Torgersen']

Python selects rows conditionally with the loc accessor or plain []; iloc selects by position. The query method is also very convenient.

# Using []
df_penguins[(df_penguins['species'] == 'Adelie') &
            (df_penguins['bill_length_mm'] >= 40) &
            (df_penguins['island'] == 'Torgersen')]
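The query method mentioned above packs the whole condition into one string, which often reads more cleanly than chained boolean masks. A minimal sketch on a small hand-made frame (the toy data below is illustrative, not the real penguins table):

```python
import pandas as pd

# Tiny illustrative frame (stands in for df_penguins)
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'bill_length_mm': [39.1, 41.2, 47.5],
    'island': ['Torgersen', 'Torgersen', 'Biscoe'],
})

# query() references columns by bare name inside one condition string
result = df.query("species == 'Adelie' and bill_length_mm >= 40 and island == 'Torgersen'")
```

The same condition string can also reference local Python variables with the @name prefix.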

5 Selecting Columns

List the columns you need right after SELECT; * selects all columns. DuckDB also provides the COLUMNS function, which matches columns conveniently with a regular expression.


-- Select by name
SELECT species, island, bill_length_mm, bill_depth_mm
FROM penguins
LIMIT 5;

-- Select the columns starting with 'bill_' via a pattern
SELECT COLUMNS('bill_.*')
FROM penguins
LIMIT 5;
  • Use select to choose columns.
  • Inside select, powerful helper functions match columns, e.g. starts_with(), ends_with(), contains(), and the regex-aware matches().

# Select by column name
tb_penguins %>% select(species, island, bill_length_mm, bill_depth_mm)

# Using helper functions
tb_penguins %>% select(starts_with('bill'))

Select columns in the j position; wrapping them in .() or list() ensures a data.table is returned. The .SD / .SDcols combination provides even more powerful column selection, especially for pattern matching.

# Select columns with .()
dt_penguins[, .(species, island, bill_length_mm, bill_depth_mm)]

# Using .SDcols with pattern matching
dt_penguins[, .SD, .SDcols = names(dt_penguins) %like% '^bill']

Select multiple columns by passing a list of column names to []. The .filter() method offers more flexible selection, for example via regular expressions.

# Select specific columns
df_penguins[['species', 'island', 'bill_length_mm', 'bill_depth_mm']]

# Using filter with a regular expression
df_penguins.filter(regex='^bill_')

6 Adding/Modifying Columns

Within a SELECT statement you can compute directly and name the new column with the AS keyword.


-- Compute the ratio of bill length to bill depth
SELECT
  *,
  bill_length_mm / bill_depth_mm AS bill_ratio
FROM penguins
LIMIT 5;

Create or modify columns with dplyr's mutate function; it fits naturally into chained operations.

tb_penguins %>%
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    bill_area = bill_length_mm * bill_depth_mm
  ) %>%
  head(5)

data.table uses its distinctive := operator, which adds or modifies columns "in place" very efficiently, without copying the whole dataset. This makes it extremely fast on large data.

# Use `:=` to add a column directly on the original table.
# Note: this modifies dt_penguins in place
dt_penguins[, bill_ratio := bill_length_mm / bill_depth_mm]
head(dt_penguins, 5)

# Several columns can also be created at once
dt_penguins[,
  c('bill_ratio_2', 'bill_area') := .(
    bill_length_mm / bill_depth_mm,
    bill_length_mm * bill_depth_mm
  )
]

You can create a new column directly with df['new_column'] = .... In chained operations the .assign() method is more convenient: it returns a new DataFrame containing the new columns without modifying the original data.

# Chaining with .assign()
df_penguins.assign(
    bill_ratio = lambda df: df['bill_length_mm'] / df['bill_depth_mm'],
    bill_area = lambda df: df['bill_length_mm'] * df['bill_depth_mm']
).head(5)
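For contrast with .assign(), the direct-assignment form mentioned above mutates the frame it is called on; a minimal sketch with toy numbers (illustrative, not the penguins data):

```python
import pandas as pd

df = pd.DataFrame({'bill_length_mm': [40.0, 50.0],
                   'bill_depth_mm': [20.0, 25.0]})

# Direct assignment adds the column to df itself (no new copy is returned)
df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']
```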

7 Grouping and Aggregation

Group with the GROUP BY clause, then compute aggregates in the SELECT with functions such as COUNT, SUM, and AVG.


-- Group by species; compute count, mean bill length, and max body mass per group
SELECT
  species,
  COUNT(*) AS count,
  AVG(bill_length_mm) AS avg_bill_length,
  MAX(body_mass_g) AS max_body_mass
FROM penguins
GROUP BY species;

Group with group_by(), then aggregate with summarise() (or summarize()).

tb_penguins %>%
  group_by(species) %>%
  summarise(
    count = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    max_body_mass = max(body_mass_g, na.rm = TRUE)
  )

Aggregate in the j position with .() or list(), and specify grouping variables with the by argument.

dt_penguins[,
  .(
    count = .N,
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    max_body_mass = max(body_mass_g, na.rm = TRUE)
  ),
  by = species
]

Group with the groupby() method, then apply aggregation methods such as size(), mean(), and max().

df_penguins.groupby('species').agg(
    count=('species', 'size'),
    avg_bill_length=('bill_length_mm', 'mean'),
    max_body_mass=('body_mass_g', 'max')
).reset_index()

8 Sorting

Sort with the ORDER BY clause; ASC is ascending (the default) and DESC is descending.


-- Sort by body mass descending, then by bill length ascending
SELECT *
FROM penguins
ORDER BY body_mass_g DESC, bill_length_mm ASC
LIMIT 10;

Sort with arrange(); the default is ascending, and wrapping a variable in desc() sorts it descending.

tb_penguins %>%
  arrange(desc(body_mass_g), bill_length_mm) %>%
  head(10)

Sort with order(); the default is ascending, and prefixing a variable with - sorts it descending.

dt_penguins[order(-body_mass_g, bill_length_mm)] %>%
  head(10)

Sort with sort_values(); the ascending parameter controls the direction.

df_penguins.sort_values(
    by=['body_mass_g', 'bill_length_mm'],
    ascending=[False, True]
).head(10)

9 Handling Missing Values

Use the WHERE clause with IS NULL or IS NOT NULL to filter missing or non-missing values.


-- Find rows where bill_length_mm is missing
SELECT *
FROM penguins
WHERE bill_length_mm IS NULL
LIMIT 5;

-- Drop all rows that contain missing values
SELECT *
FROM penguins
WHERE bill_length_mm IS NOT NULL
  AND bill_depth_mm IS NOT NULL
  AND flipper_length_mm IS NOT NULL
  AND body_mass_g IS NOT NULL
  AND sex IS NOT NULL;

Detect missing values with is.na(), complete rows with complete.cases(), and drop rows containing missing values quickly with drop_na().

# Inspect missing values in bill_length_mm
tb_penguins %>%
  filter(is.na(bill_length_mm)) %>%
  head(5)

# Drop all rows that contain missing values
tb_penguins %>%
  drop_na()

# Drop rows with missing values in specific columns
tb_penguins %>%
  drop_na(bill_length_mm, bill_depth_mm, sex)

Filter with is.na() in the i position; .N quickly counts rows.

# Inspect missing values in bill_length_mm
dt_penguins[is.na(bill_length_mm)] %>%
  head(5)

# Drop all rows that contain missing values
dt_penguins[complete.cases(dt_penguins)]

# Count missing values per column
dt_penguins[, lapply(.SD, function(x) sum(is.na(x)))]

Detect missing values with isnull() or isna(), and drop rows that contain them with dropna().

# Inspect missing values in bill_length_mm
df_penguins[df_penguins['bill_length_mm'].isnull()].head(5)

# Drop all rows that contain missing values
df_penguins.dropna()

# Drop rows with missing values in specific columns
df_penguins.dropna(subset=['bill_length_mm', 'bill_depth_mm', 'sex'])

# Count missing values per column
df_penguins.isnull().sum()
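Besides detecting and dropping, missing values can also be filled with fillna; a minimal sketch on a toy Series (filling with the mean is just one common choice, not part of the original examples):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Replace NaN with the mean of the non-missing values
filled = s.fillna(s.mean())
```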

10 Joining/Merging Data

Join data with standard SQL JOIN syntax: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, and so on.


-- Create two example tables to demonstrate joins
CREATE TEMPORARY TABLE species_info AS
SELECT DISTINCT species, island
FROM penguins;

CREATE TEMPORARY TABLE measurements AS
SELECT species, bill_length_mm, bill_depth_mm, body_mass_g
FROM penguins
WHERE species = 'Adelie'
LIMIT 3;

-- Inner join
SELECT *
FROM species_info s
INNER JOIN measurements m ON s.species = m.species;

-- Left join
SELECT *
FROM species_info s
LEFT JOIN measurements m ON s.species = m.species;

Use the join functions from dplyr: inner_join(), left_join(), right_join(), full_join(), and so on.

# Create two example data frames to demonstrate joins
species_info <- tb_penguins %>%
  distinct(species, island)

measurements <- tb_penguins %>%
  filter(species == 'Adelie') %>%
  select(species, bill_length_mm, bill_depth_mm, body_mass_g) %>%
  head(3)

# Inner join
inner_join(species_info, measurements, by = "species")

# Left join
left_join(species_info, measurements, by = "species")

Join with the merge() function; the example tables are restructured so the join key is unique and no Cartesian product arises.

# Create two example tables, structured so the join key is unique
# Table 1: species only (including island would produce a Cartesian product)
species_info_dt <- unique(dt_penguins[, .(species)])

# Table 2: measurements, first 3 rows of one species
measurements_dt <- dt_penguins[
  species == 'Adelie',
  .(species, bill_length_mm, bill_depth_mm, body_mass_g)
][1:3]

cat("Table 1: species info\n")
print(species_info_dt)

cat("\nTable 2: measurements\n")
print(measurements_dt)

# Inner join
cat("\nInner join result:\n")
inner_result <- merge(species_info_dt, measurements_dt, by = "species")
print(inner_result)

# Left join
cat("\nLeft join result:\n")
left_result <- merge(
  species_info_dt,
  measurements_dt,
  by = "species",
  all.x = TRUE
)
print(left_result)

# A join that keeps island information (one island per species)
cat("\nJoin with island information:\n")
species_island_dt <- dt_penguins[, .SD[1], by = species, .SDcols = "island"]
merge(species_island_dt, measurements_dt, by = "species")

Join with the merge() function or the join() method; all the usual join types are supported.

# Create two example DataFrames
species_info_df = df_penguins[['species', 'island']].drop_duplicates()
measurements_df = df_penguins[df_penguins['species'] == 'Adelie'][
    ['species', 'bill_length_mm', 'bill_depth_mm', 'body_mass_g']
].head(3)

# Inner join with merge()
pd.merge(species_info_df, measurements_df, on='species')

# Left join with merge()
pd.merge(species_info_df, measurements_df, on='species', how='left')

# Using the join() method
species_info_df.set_index('species').join(
    measurements_df.set_index('species'), how='inner'
).reset_index()
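merge() also accepts an indicator argument that records, for each row, which side it came from, which is handy for auditing a join. A minimal sketch on toy tables (the frames below are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'species': ['Adelie', 'Gentoo']})
right = pd.DataFrame({'species': ['Adelie'], 'body_mass_g': [3750]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
out = pd.merge(left, right, on='species', how='left', indicator=True)
```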

11 Reshaping Data

Use UNPIVOT to reshape wide to long and PIVOT to reshape long to wide.


-- Wide to long: turn the measurement columns into long format
SELECT
  species,
  island,
  measurement_type,
  measurement_value
FROM penguins
UNPIVOT (measurement_value FOR measurement_type IN
  (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))
LIMIT 10;

-- Create an example long table for the long-to-wide demo
CREATE TEMPORARY TABLE long_penguins AS
SELECT
  species,
  island,
  measurement_type,
  measurement_value
FROM penguins
UNPIVOT (measurement_value FOR measurement_type IN
  (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g));

-- Long to wide
PIVOT long_penguins
ON measurement_type
USING AVG(measurement_value)
GROUP BY species, island;

Use pivot_longer() to reshape wide to long and pivot_wider() to reshape long to wide.

library(tidyr)

# Wide to long: turn the measurement columns into long format
tb_penguins %>%
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "measurement_type",
    values_to = "measurement_value"
  ) %>%
  head(10)

# Long to wide
long_data <- tb_penguins %>%
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "measurement_type",
    values_to = "measurement_value"
  )

long_data %>%
  pivot_wider(
    names_from = measurement_type,
    values_from = measurement_value,
    values_fn = mean
  )

Use melt() to reshape wide to long and dcast() to reshape long to wide.

# Wide to long
melt(
  dt_penguins,
  measure.vars = c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
  ),
  variable.name = "measurement_type",
  value.name = "measurement_value"
) %>%
  head(10)

# Long to wide
long_dt <- melt(
  dt_penguins,
  measure.vars = c(
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
  ),
  variable.name = "measurement_type",
  value.name = "measurement_value"
)

dcast(
  long_dt,
  species + island ~ measurement_type,
  fun.aggregate = mean,
  value.var = "measurement_value"
)

Use melt() to reshape wide to long, and pivot_table() or pivot() to reshape long to wide.

# Wide to long
long_df = df_penguins.melt(
    id_vars=['species', 'island', 'sex'],
    value_vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'],
    var_name='measurement_type',
    value_name='measurement_value'
)
long_df.head(10)

# Long to wide
long_df.pivot_table(
    index=['species', 'island'],
    columns='measurement_type',
    values='measurement_value',
    aggfunc='mean'
).reset_index()
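pivot() (mentioned above as the non-aggregating alternative to pivot_table()) requires each index/column pair to be unique, otherwise it raises an error. A minimal sketch on toy long-format data (the frame below is illustrative):

```python
import pandas as pd

# Toy long table: exactly one value per (id, measurement_type) pair
long = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'measurement_type': ['bill', 'mass', 'bill', 'mass'],
    'measurement_value': [40.0, 3750.0, 47.0, 5000.0],
})

# pivot() reshapes without aggregating
wide = long.pivot(index='id', columns='measurement_type', values='measurement_value')
```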

12 Window Functions

Use window functions with the OVER() clause, which supports PARTITION BY and ORDER BY.


-- Rank by body mass within each species
SELECT
  species,
  island,
  body_mass_g,
  ROW_NUMBER() OVER (PARTITION BY species ORDER BY body_mass_g DESC) as rank_in_species,
  AVG(body_mass_g) OVER (PARTITION BY species) as avg_species_mass,
  body_mass_g - AVG(body_mass_g) OVER (PARTITION BY species) as mass_diff_from_avg
FROM penguins
ORDER BY species, rank_in_species
LIMIT 10;

Use dplyr's window functions such as row_number(), rank(), lead(), and lag(), together with group_by().

# Rank by body mass within each species
tb_penguins %>%
  group_by(species) %>%
  mutate(
    rank_in_species = row_number(desc(body_mass_g)),
    avg_species_mass = mean(body_mass_g, na.rm = TRUE),
    mass_diff_from_avg = body_mass_g - avg_species_mass
  ) %>%
  ungroup() %>%
  select(
    species,
    island,
    body_mass_g,
    rank_in_species,
    avg_species_mass,
    mass_diff_from_avg
  ) %>%
  arrange(species, rank_in_species) %>%
  head(10)

Use := with the by argument, together with helpers such as frank() and the special symbols .N, .I, and .SD, to reproduce window-function behavior.

# Rank by body mass within each species
dt_penguins[,
  `:=`(
    rank_in_species = frank(-body_mass_g, ties.method = "first"),
    avg_species_mass = mean(body_mass_g, na.rm = TRUE)
  ),
  by = species
][,
  mass_diff_from_avg := body_mass_g - avg_species_mass
][
  order(species, rank_in_species),
  .(
    species,
    island,
    body_mass_g,
    rank_in_species,
    avg_species_mass,
    mass_diff_from_avg
  )
][1:10]

Use groupby() with transform(), or methods such as rank(), to implement window functions.

# Rank by body mass within each species
df_result = df_penguins.copy()
df_result['rank_in_species'] = df_result.groupby('species')['body_mass_g'] \
    .rank(method='first', ascending=False)
df_result['avg_species_mass'] = df_result.groupby('species')['body_mass_g'] \
    .transform('mean')
df_result['mass_diff_from_avg'] = df_result['body_mass_g'] - df_result['avg_species_mass']

df_result[['species', 'island', 'body_mass_g', 'rank_in_species',
           'avg_species_mass', 'mass_diff_from_avg']] \
    .sort_values(['species', 'rank_in_species']) \
    .head(10)

13 String Processing

Use built-in string functions such as LENGTH, UPPER, LOWER, SUBSTRING, and REPLACE.


-- String processing examples
SELECT
  species,
  -- Basic case conversion
  UPPER(species) as species_upper,
  LOWER(species) as species_lower,
  -- String length
  LENGTH(species) as species_length,
  -- Substring extraction
  SUBSTRING(species, 1, 3) as species_prefix,
  -- String replacement
  REPLACE(species, 'e', 'E') as species_replace_e,
  -- String concatenation
  species || ' Penguin' as species_full_name
FROM penguins
GROUP BY species
LIMIT 5;

Use the string functions from the stringr package, such as str_to_upper, str_to_lower, str_length, str_sub, and str_replace.

library(stringr)

# String processing examples
tb_penguins %>%
  distinct(species) %>%
  mutate(
    # Basic case conversion
    species_upper = str_to_upper(species),
    species_lower = str_to_lower(species),
    # String length
    species_length = str_length(species),
    # Substring extraction
    species_prefix = str_sub(species, 1, 3),
    # String replacement
    species_replace_e = str_replace_all(species, "e", "E"),
    # String concatenation
    species_full_name = paste(species, "Penguin")
  )

Use base R string functions (or their stringr equivalents) for string processing.

# String processing examples (base R functions)
dt_penguins[, .(species = unique(species))][,
  `:=`(
    # Basic case conversion
    species_upper = toupper(species),
    species_lower = tolower(species),
    # String length
    species_length = nchar(species),
    # Substring extraction
    species_prefix = substr(species, 1, 3),
    # String replacement
    species_replace_e = gsub("e", "E", species),
    # String concatenation
    species_full_name = paste(species, "Penguin")
  )
]

Use string methods through the .str accessor: .upper(), .lower(), .len(), slicing, .replace(), and so on.

# String processing examples
df_penguins[['species']].drop_duplicates().assign(
    # Basic case conversion
    species_upper = lambda df: df['species'].str.upper(),
    species_lower = lambda df: df['species'].str.lower(),
    # String length
    species_length = lambda df: df['species'].str.len(),
    # Substring extraction
    species_prefix = lambda df: df['species'].str[:3],
    # String replacement
    species_replace_e = lambda df: df['species'].str.replace('e', 'E'),
    # String concatenation
    species_full_name = lambda df: df['species'] + ' Penguin'
)

14 Conditional Logic

Use CASE WHEN for conditional logic, including multi-condition tests.


-- Conditional logic: classify penguins by body mass
SELECT
  species,
  body_mass_g,
  -- Simple conditions
  CASE
    WHEN body_mass_g > 5000 THEN 'Heavy'
    WHEN body_mass_g > 4000 THEN 'Medium'
    ELSE 'Light'
  END as weight_category,
  -- Combined conditions
  CASE
    WHEN species = 'Adelie' AND body_mass_g > 4000 THEN 'Large Adelie'
    WHEN species = 'Gentoo' AND body_mass_g > 5500 THEN 'Large Gentoo'
    ELSE 'Standard'
  END as size_category,
  -- Using the IN operator
  CASE
    WHEN island IN ('Torgersen', 'Dream') THEN 'Group A'
    WHEN island = 'Biscoe' THEN 'Group B'
    ELSE 'Other'
  END as island_group
FROM penguins
LIMIT 10;

Use case_when() for multi-condition tests, or if_else() for simple ones.

# Conditional logic: classify penguins by body mass
tb_penguins %>%
  head(10) %>%
  mutate(
    # Simple conditions
    weight_category = case_when(
      body_mass_g > 5000 ~ 'Heavy',
      body_mass_g > 4000 ~ 'Medium',
      TRUE ~ 'Light'
    ),
    # Combined conditions
    size_category = case_when(
      species == 'Adelie' & body_mass_g > 4000 ~ 'Large Adelie',
      species == 'Gentoo' & body_mass_g > 5500 ~ 'Large Gentoo',
      TRUE ~ 'Standard'
    ),
    # Using the %in% operator
    island_group = case_when(
      island %in% c('Torgersen', 'Dream') ~ 'Group A',
      island == 'Biscoe' ~ 'Group B',
      TRUE ~ 'Other'
    )
  ) %>%
  select(
    species,
    island,
    body_mass_g,
    weight_category,
    size_category,
    island_group
  )

Use fcase() (data.table's fast case) for multi-condition tests, or the base ifelse() function.

# Conditional logic: classify penguins by body mass
dt_penguins[
  1:10,
  .(
    species,
    island,
    body_mass_g,
    # Multi-condition tests with fcase
    weight_category = fcase(
      body_mass_g > 5000, 'Heavy',
      body_mass_g > 4000, 'Medium',
      default = 'Light'
    ),
    # Combined conditions
    size_category = fcase(
      species == 'Adelie' & body_mass_g > 4000, 'Large Adelie',
      species == 'Gentoo' & body_mass_g > 5500, 'Large Gentoo',
      default = 'Standard'
    ),
    # Using the %in% operator
    island_group = fcase(
      island %in% c('Torgersen', 'Dream'), 'Group A',
      island == 'Biscoe', 'Group B',
      default = 'Other'
    )
  )
]

Use np.where() for simple conditions, or dictionary mapping and apply() for more complex logic.

import numpy as np

# Conditional logic: classify penguins by body mass
df_penguins.head(10).assign(
    # Simple conditions
    weight_category = lambda df: np.where(
        df['body_mass_g'] > 5000, 'Heavy',
        np.where(df['body_mass_g'] > 4000, 'Medium', 'Light')),
    # Combined conditions
    size_category = lambda df: np.where(
        (df['species'] == 'Adelie') & (df['body_mass_g'] > 4000), 'Large Adelie',
        np.where((df['species'] == 'Gentoo') & (df['body_mass_g'] > 5500),
                 'Large Gentoo', 'Standard')),
    # Using the isin method
    island_group = lambda df: np.where(
        df['island'].isin(['Torgersen', 'Dream']), 'Group A',
        np.where(df['island'] == 'Biscoe', 'Group B', 'Other'))
)[['species', 'island', 'body_mass_g', 'weight_category', 'size_category', 'island_group']]
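The dictionary-mapping approach mentioned above goes through Series.map; unmatched keys become NaN, so fillna supplies the fallback category. A minimal sketch (the tiny Series is illustrative; the group labels mirror the example above):

```python
import pandas as pd

islands = pd.Series(['Torgersen', 'Dream', 'Biscoe', 'Unknown'])

# Map known islands to group labels; anything unmapped falls back to 'Other'
island_group = islands.map({
    'Torgersen': 'Group A',
    'Dream': 'Group A',
    'Biscoe': 'Group B',
}).fillna('Other')
```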

15 Sampling and Randomization

Use USING SAMPLE or ORDER BY RANDOM() for random sampling.


-- Simple random sampling: roughly 10% of the rows
SELECT *
FROM penguins
USING SAMPLE 10 PERCENT (bernoulli);

-- Sample 10 random rows
SELECT *
FROM penguins
ORDER BY RANDOM()
LIMIT 10;

-- Stratified sampling: 5 penguins per species
WITH ranked_penguins AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY species ORDER BY RANDOM()) as rn
  FROM penguins
)
SELECT *
FROM ranked_penguins
WHERE rn <= 5;

-- Sample an exact number of rows (reservoir sampling, without replacement)
SELECT *
FROM penguins
USING SAMPLE 20 ROWS (reservoir);

Use sample_n(), sample_frac(), and slice_sample() for random sampling.

# Simple random sampling: 10% of the rows
tb_penguins %>%
  sample_frac(0.1)

# Sample 10 random rows
tb_penguins %>%
  sample_n(10)

# Using slice_sample() - the recommended approach
tb_penguins %>%
  slice_sample(n = 10)

# Stratified sampling: 5 penguins per species
tb_penguins %>%
  group_by(species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Set a random seed for reproducibility
set.seed(123)
tb_penguins %>%
  slice_sample(n = 5)

# Sampling with replacement
tb_penguins %>%
  slice_sample(n = 20, replace = TRUE)

Use .N together with the sample() function for random sampling.

# Simple random sampling: 10% of the rows
dt_penguins[sample(.N, round(.N * 0.1))]

# Sample 10 random rows
dt_penguins[sample(.N, 10)]

# Stratified sampling: 5 penguins per species
dt_penguins[, .SD[sample(.N, min(5, .N))], by = species]

# Set a random seed for reproducibility
set.seed(123)
dt_penguins[sample(.N, 5)]

# Sampling with replacement
dt_penguins[sample(.N, 20, replace = TRUE)]

Use the sample() method for random sampling.

# Simple random sampling: 10% of the rows
df_penguins.sample(frac=0.1)

# Sample 10 random rows
df_penguins.sample(n=10)

# Stratified sampling: 5 penguins per species
df_penguins.groupby('species').apply(
    lambda x: x.sample(n=min(5, len(x)))
).reset_index(drop=True)

# Set a random seed for reproducibility
df_penguins.sample(n=5, random_state=123)

# Sampling with replacement
df_penguins.sample(n=20, replace=True)
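In pandas 1.1 and later, groupby(...).sample() draws per-group samples directly, avoiding the apply() round-trip used above; a minimal sketch on toy data (the frame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'species': ['A'] * 10 + ['B'] * 10,
                   'x': range(20)})

# Draw 3 rows from each group, reproducibly via random_state
strat = df.groupby('species').sample(n=3, random_state=123)
```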

16 Descriptive Statistics

Use aggregate functions with GROUP BY for descriptive statistics.


-- Basic descriptive statistics
SELECT
  COUNT(*) as count,
  AVG(bill_length_mm) as mean_bill_length,
  STDDEV(bill_length_mm) as std_bill_length,
  MIN(bill_length_mm) as min_bill_length,
  MAX(bill_length_mm) as max_bill_length,
  MEDIAN(bill_length_mm) as median_bill_length,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY bill_length_mm) as q1_bill_length,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY bill_length_mm) as q3_bill_length
FROM penguins;

-- Statistics grouped by species
SELECT
  species,
  COUNT(*) as count,
  ROUND(AVG(body_mass_g), 2) as avg_mass,
  ROUND(STDDEV(body_mass_g), 2) as std_mass,
  MIN(body_mass_g) as min_mass,
  MAX(body_mass_g) as max_mass
FROM penguins
GROUP BY species;

-- Frequency counts
SELECT
  species,
  sex,
  COUNT(*) as count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM penguins
WHERE sex IS NOT NULL
GROUP BY species, sex
ORDER BY species, sex;

Use summarise() with statistical functions for descriptive statistics.

# Basic descriptive statistics
tb_penguins %>%
  summarise(
    count = n(),
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    std_bill_length = sd(bill_length_mm, na.rm = TRUE),
    min_bill_length = min(bill_length_mm, na.rm = TRUE),
    max_bill_length = max(bill_length_mm, na.rm = TRUE),
    median_bill_length = median(bill_length_mm, na.rm = TRUE),
    q1_bill_length = quantile(bill_length_mm, 0.25, na.rm = TRUE),
    q3_bill_length = quantile(bill_length_mm, 0.75, na.rm = TRUE)
  )

# Statistics grouped by species
tb_penguins %>%
  group_by(species) %>%
  summarise(
    count = n(),
    avg_mass = round(mean(body_mass_g, na.rm = TRUE), 2),
    std_mass = round(sd(body_mass_g, na.rm = TRUE), 2),
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE)
  )

# Frequency counts
tb_penguins %>%
  filter(!is.na(sex)) %>%
  count(species, sex) %>%
  mutate(percentage = round(n * 100 / sum(n), 2)) %>%
  arrange(species, sex)

Compute descriptive statistics with statistical functions in the j position.

# Basic descriptive statistics
dt_penguins[, .(
  count = .N,
  mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
  std_bill_length = sd(bill_length_mm, na.rm = TRUE),
  min_bill_length = min(bill_length_mm, na.rm = TRUE),
  max_bill_length = max(bill_length_mm, na.rm = TRUE),
  median_bill_length = median(bill_length_mm, na.rm = TRUE),
  q1_bill_length = quantile(bill_length_mm, 0.25, na.rm = TRUE),
  q3_bill_length = quantile(bill_length_mm, 0.75, na.rm = TRUE)
)]

# Statistics grouped by species
dt_penguins[,
  .(
    count = .N,
    avg_mass = round(mean(body_mass_g, na.rm = TRUE), 2),
    std_mass = round(sd(body_mass_g, na.rm = TRUE), 2),
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE)
  ),
  by = species
]

# Frequency counts
dt_penguins[!is.na(sex), .(count = .N), by = .(species, sex)][,
  percentage := round(count * 100 / sum(count), 2)
][order(species, sex)]

Use the describe() method and aggregation functions for descriptive statistics.

# Basic descriptive statistics with describe()
df_penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].describe()

# Custom descriptive statistics
df_penguins.agg({
    'bill_length_mm': ['count', 'mean', 'std', 'min', 'max', 'median']
})

# Statistics grouped by species
df_penguins.groupby('species').agg({
    'body_mass_g': ['count', 'mean', 'std', 'min', 'max']
}).round(2)

# Frequency counts
freq_table = df_penguins[df_penguins['sex'].notna()].groupby(['species', 'sex']).size().reset_index(name='count')
freq_table['percentage'] = round(freq_table['count'] * 100 / freq_table['count'].sum(), 2)
freq_table.sort_values(['species', 'sex'])