R语言时序分析

R语言

时间序列

数据分析

数据清洗

机器学习

使用R语言进行时间序列分析，timetk包时序数据处理及modeltime包时序建模

作者

不止BI

发布于

2024年7月30日

数据准备

历史数据

代码

library(tidyquant)
library(tidyverse)
library(timetk)
tb_stocks <- tq_get(c("000001.ss", "600036.ss"), get = "stock.prices", from = " 2019-01-01")


tb_stocks %>%
  group_by(symbol) %>%
  plot_time_series(.date_var = date, .value = close, .facet_ncol = 2)

常用数据操作

时间汇总统计

代码

tb_stocks %>%
  group_by(symbol) %>%
  summarise_by_time(
    .date_var = date,
    .by = "month",
    volume = sum(volume)
  ) %>%
  plot_time_series(.date_var = date, .value = volume, .facet_ncol = 2)

时间序列填充

时间序列数据可能存在某段时间中断的情况，可以使用pad_by_time补全时间

代码

tb_stocks %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "auto")

# A tibble: 4,072 × 8
# Groups:   symbol [2]
   symbol    date        open  high   low close volume adjusted
   <chr>     <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
 1 000001.ss 2019-01-02 2498. 2500. 2456. 2465. 109900    2465.
 2 000001.ss 2019-01-03 2462. 2488. 2456. 2464. 124400    2464.
 3 000001.ss 2019-01-04 2446. 2515. 2441. 2515. 168900    2515.
 4 000001.ss 2019-01-05   NA    NA    NA    NA      NA      NA 
 5 000001.ss 2019-01-06   NA    NA    NA    NA      NA      NA 
 6 000001.ss 2019-01-07 2529. 2537. 2516. 2533. 177300    2533.
 7 000001.ss 2019-01-08 2530. 2531. 2520. 2526. 158100    2526.
 8 000001.ss 2019-01-09 2536. 2574. 2536. 2544. 191900    2544.
 9 000001.ss 2019-01-10 2544. 2552. 2532. 2535. 159900    2535.
10 000001.ss 2019-01-11 2540. 2555. 2533. 2554. 149400    2554.
# ℹ 4,062 more rows

时间窗口

窗口计算

代码

roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE)

tb_stocks %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
  tidyr::pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
  plot_time_series(date, value,
    .color_var = name,
    .facet_ncol = 2, .smooth = FALSE,
    .interactive = T
  )

窗口模型

代码

# Rolling regressions are easy to implement using `.unlist = FALSE`
lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3),
  .period = 90,
  .unlist = FALSE, .align = "right"
)


tb_stocks %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # Apply rolling regression
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm))

# A tibble: 2,522 × 6
# Groups:   symbol [2]
   symbol    date       adjusted volume numeric_date rolling_lm
   <chr>     <date>        <dbl>  <dbl>        <dbl> <list>    
 1 000001.ss 2019-05-22    2892. 200000        18038 <lm>      
 2 000001.ss 2019-05-23    2853. 197300        18039 <lm>      
 3 000001.ss 2019-05-24    2853. 167200        18040 <lm>      
 4 000001.ss 2019-05-27    2892. 196700        18043 <lm>      
 5 000001.ss 2019-05-28    2910. 223300        18044 <lm>      
 6 000001.ss 2019-05-29    2915. 199000        18045 <lm>      
 7 000001.ss 2019-05-30    2906. 205800        18046 <lm>      
 8 000001.ss 2019-05-31    2899. 195200        18047 <lm>      
 9 000001.ss 2019-06-03    2890. 215900        18050 <lm>      
10 000001.ss 2019-06-04    2862. 188500        18051 <lm>      
# ℹ 2,512 more rows

可视化

ACF、PACF和CCF图

ACF、PACF和CCF

ACF、PACF和CCF都是用于分析时间序列数据相关性的统计工具。它们可以帮助识别序列中的模式和结构，并为模型选择提供依据。

自相关函数（ACF）衡量的是时间序列中同一变量在不同时间点之间的相关性。它计算的是滞后 k 阶的自协方差函数（autocovariance function，ACVF）与方差之比。ACF 值在 -1 到 1 之间，其中 1 表示完全相关，0 表示不相关，-1 表示完全负相关。

偏自相关函数（PACF）衡量的是时间序列中同一变量在剔除掉中间滞后的情况下，相邻时间点之间的相关性。它计算的是条件自相关函数（conditional autocorrelation function），即在控制了其他滞后变量的情况下，某一滞后变量与目标变量之间的相关性。PACF 值也在 -1 到 1 之间。

互相关函数（CCF）衡量的是两个不同时间序列之间在不同时间点之间的相关性。它计算的是两个时间序列的滞后 k 阶互协方差函数（cross-covariance function，CCVF）与各自方差的乘积之比。CCF 值也在 -1 到 1 之间。

以下是一些ACF、PACF和CCF的应用示例：

识别时间序列中的季节性：ACF 和 PACF 图中的尖峰可以指示时间序列中存在季节性。例如，如果 ACF 图在每个 12 个时间步长处都有一个尖峰，则表明时间序列存在年季节性。
确定模型的阶数：ACF 和 PACF 图中的尾部衰减模式可以帮助确定 ARIMA 或 SARIMA 模型的阶数。例如，如果 ACF 和 PACF 图中的尾部迅速衰减到零，则表明模型的阶数可能较低。
评估预测模型的性能：ACF 和 PACF 图可以用于评估预测模型的残差是否具有自相关性。如果残差具有自相关性，则表明模型可能存在缺陷。

代码

tb_stocks %>%
  group_by(symbol) %>%
  plot_acf_diagnostics(
    date,
    .value = close,
    .lags = "7 days",
    .ccf_vars = c(volume, open), # CCFs
    .interactive = T
  )

季节诊断

代码

tb_stocks %>%
  group_by(symbol) %>%
  plot_seasonal_diagnostics(date, .value = close, .interactive = T)

STL分解

代码

tb_stocks %>%
  group_by(symbol) %>%
  na.omit() %>%
  plot_stl_diagnostics(
    date, close,
    .frequency = "auto", .trend = "auto",
    .feature_set = c("observed", "season", "trend", "remainder"),
    .interactive = T
  )

异常值检测

代码

anomalize_tbl <- wikipedia_traffic_daily %>%
  group_by(Page) %>%
  anomalize(
      .date_var      = date, 
      .value         = value,
      .iqr_alpha     = 0.05,
      .max_anomalies = 0.20,
      .message       = FALSE
  )

anomalize_tbl %>% glimpse()

Rows: 5,500
Columns: 13
Groups: Page [10]
$ Page              <chr> "Death_of_Freddie_Gray_en.wikipedia.org_mobile-web_a…
$ date              <date> 2015-07-01, 2015-07-02, 2015-07-03, 2015-07-04, 201…
$ observed          <dbl> 791, 704, 903, 732, 558, 504, 543, 1156, 1196, 701, …
$ season            <dbl> 9.2554027, 13.1448419, -20.5421912, -24.0966839, 0.4…
$ trend             <dbl> 748.1907, 744.3101, 740.4296, 736.5490, 732.6684, 72…
$ remainder         <dbl> 33.553930, -53.454953, 183.112637, 19.547687, -175.1…
$ seasadj           <dbl> 781.7446, 690.8552, 923.5422, 756.0967, 557.5091, 50…
$ anomaly           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No"…
$ anomaly_direction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ anomaly_score     <dbl> 83.50187, 170.51076, 66.05683, 97.50812, 292.21517, …
$ recomposed_l1     <dbl> -328.3498, -328.3409, -365.9085, -373.3435, -352.636…
$ recomposed_l2     <dbl> 2077.354, 2077.362, 2039.795, 2032.360, 2053.067, 20…
$ observed_clean    <dbl> 791.000, 704.000, 903.000, 732.000, 558.000, 504.000…

原始分组和日期时间列。
季节分解： observed 、 seasonal 、 seasadj 、 trend和remainder 。目的是消除趋势和季节性，使其余部分保持稳定并代表正常变化和异常变化。
异常识别和评分： anomaly 、 anomaly_score 、 anomaly_direction 。这些识别异常决策（是/否），根据距中心线的距离对异常进行评分，并标记方向（-1（向下）、零（非异常）、+1（向上））。
重组： recomposed_l1和recomposed_l2 。将它们视为下限和上限。任何低于 l1 或高于 l2 的观测数据都是异常的。
清理后的数据： observed_clean 。自动提供清理后的数据，其中异常值替换为重组 l1/l2 边界内的数据。话虽如此，在简单地删除异常值并使用清理后的数据之前，您应该始终首先尝试了解为什么数据被视为异常。

绘制异常点

代码

anomalize_tbl %>%
    group_by(Page) %>%
    plot_anomalies(
        date,
        .facet_ncol = 2
    )

异常清理

代码

anomalize_tbl %>%
    group_by(Page) %>%
    plot_anomalies_cleaned(
        date,
        .facet_ncol = 2
    )

时序聚类

通过时间序列的特征（如趋势、频率、季节分解等），将时间序列数据进行聚类划分

计算特征

代码

# 自定义计算时间序列特征的函数
my_mean <- function(x, na.rm=TRUE) {
  mean(x, na.rm = na.rm)
}

df_stocks_cluster = tq_get(c("000001.ss","600036.ss","600035.ss","600030.ss","600037.ss"),
                         get = "stock.prices",
                         from = " 2019-01-01")

tsfeature_tbl <-  df_stocks_cluster %>%
    group_by(symbol) %>%
    tk_tsfeatures(
      .date_var = date,
      .value    = close,
      .period   = 52,
      .features = c("frequency", "stl_features", "entropy", "acf_features", "my_mean"),
      .scale    = TRUE,
      .prefix   = "ts_"
    ) %>%
    ungroup()

tsfeature_tbl

# A tibble: 5 × 22
  symbol    ts_frequency ts_nperiods ts_seasonal_period ts_trend      ts_spike
  <chr>            <dbl>       <dbl>              <dbl>    <dbl>         <dbl>
1 000001.ss           52           1                 52    0.941 0.00000000586
2 600036.ss           52           1                 52    0.969 0.00000000147
3 600035.ss           52           1                 52    0.943 0.00000000484
4 600030.ss           52           1                 52    0.898 0.0000000213 
5 600037.ss           52           1                 52    0.934 0.0000000123 
# ℹ 16 more variables: ts_linearity <dbl>, ts_curvature <dbl>, ts_e_acf1 <dbl>,
#   ts_e_acf10 <dbl>, ts_seasonal_strength <dbl>, ts_peak <dbl>,
#   ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>, ts_x_acf10 <dbl>,
#   ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>, ts_diff2_acf1 <dbl>,
#   ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>, ts_my_mean <dbl>

聚类

代码

set.seed(123)


cluster_tbl <- tibble(
    cluster = tsfeature_tbl %>% 
        select(-symbol) %>%
        as.matrix() %>%
        kmeans(centers = 3, nstart = 100) %>%
        pluck("cluster")
) %>%
    bind_cols(
        tsfeature_tbl
    )

cluster_tbl

# A tibble: 5 × 23
  cluster symbol   ts_frequency ts_nperiods ts_seasonal_period ts_trend ts_spike
    <int> <chr>           <dbl>       <dbl>              <dbl>    <dbl>    <dbl>
1       2 000001.…           52           1                 52    0.941  5.86e-9
2       2 600036.…           52           1                 52    0.969  1.47e-9
3       1 600035.…           52           1                 52    0.943  4.84e-9
4       2 600030.…           52           1                 52    0.898  2.13e-8
5       3 600037.…           52           1                 52    0.934  1.23e-8
# ℹ 16 more variables: ts_linearity <dbl>, ts_curvature <dbl>, ts_e_acf1 <dbl>,
#   ts_e_acf10 <dbl>, ts_seasonal_strength <dbl>, ts_peak <dbl>,
#   ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>, ts_x_acf10 <dbl>,
#   ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>, ts_diff2_acf1 <dbl>,
#   ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>, ts_my_mean <dbl>

可视化

代码

cluster_tbl %>%
    select(cluster, symbol) %>%
    right_join(df_stocks_cluster, by = "symbol") %>%
    group_by(symbol) %>%
    plot_time_series(
      date, close, 
      .color_var   = cluster, 
      .facet_ncol  = 2, 
      .interactive = T
    )

时序模型基础

数据划分

代码

library(xgboost)
library(tidymodels)
library(modeltime)

# 使用最近三个月数据作为测试集，cumulative = TRUE使用先前所有的数据作为训练集合 
tb_stocks_1 = tb_stocks %>% filter(symbol == '000001.ss') %>% select(date,close)
splits <- tb_stocks_1 %>% 
  time_series_split(assess = "3 months", cumulative = T)
# splits <- tb_stocks_1 %>% initial_time_split(prop = 0.9)
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, close, .interactive = T)

数据预处理

代码

library(tidymodels)
recipe_spec_timeseries_lm <- recipe(close ~ ., data = training(splits)) %>%
    step_timeseries_signature(date)  %>%
    step_fourier(date, period = 365, K = 5) %>%
    step_rm(date) %>%
    step_rm(contains("iso"), contains("minute"), contains("hour"),
            contains("am.pm"), contains("xts")) %>%
    step_normalize(contains("index.num"), date_year) %>%
    step_dummy(contains("lbl"), one_hot = TRUE) 

recipe_spec_timeseries <- recipe(close ~ date, data = training(splits))


recipe_spec_mars <- recipe(close ~ date, data = training(splits)) %>%
    step_date(date, features = "month", ordinal = FALSE) %>%
    step_mutate(date_num = as.numeric(date)) %>%
    step_normalize(date_num) %>%
    step_rm(date)
# 查看预处理后数据
juice(prep(recipe_spec_timeseries_lm)) %>% head()

# A tibble: 6 × 47
  close date_index.num date_year date_half date_quarter date_month date_day
  <dbl>          <dbl>     <dbl>     <int>        <int>      <int>    <int>
1 2465.          -1.74     -1.41         1            1          1        2
2 2464.          -1.73     -1.41         1            1          1        3
3 2515.          -1.73     -1.41         1            1          1        4
4 2533.          -1.73     -1.41         1            1          1        7
5 2526.          -1.72     -1.41         1            1          1        8
6 2544.          -1.72     -1.41         1            1          1        9
# ℹ 40 more variables: date_second <int>, date_wday <int>, date_mday <int>,
#   date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
#   date_week2 <int>, date_week3 <int>, date_week4 <int>, date_mday7 <int>,
#   date_sin365_K1 <dbl>, date_cos365_K1 <dbl>, date_sin365_K2 <dbl>,
#   date_cos365_K2 <dbl>, date_sin365_K3 <dbl>, date_cos365_K3 <dbl>,
#   date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, date_sin365_K5 <dbl>,
#   date_cos365_K5 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>, …

代码

# 将预处理 运用在新数据上

bake(prep(recipe_spec_timeseries_lm),new_data = testing(splits)) %>% head()

# A tibble: 6 × 47
  close date_index.num date_year date_half date_quarter date_month date_day
  <dbl>          <dbl>     <dbl>     <int>        <int>      <int>    <int>
1 3148.           1.74      1.82         1            2          5        7
2 3128.           1.74      1.82         1            2          5        8
3 3154.           1.75      1.82         1            2          5        9
4 3155.           1.75      1.82         1            2          5       10
5 3148.           1.75      1.82         1            2          5       13
6 3146.           1.76      1.82         1            2          5       14
# ℹ 40 more variables: date_second <int>, date_wday <int>, date_mday <int>,
#   date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
#   date_week2 <int>, date_week3 <int>, date_week4 <int>, date_mday7 <int>,
#   date_sin365_K1 <dbl>, date_cos365_K1 <dbl>, date_sin365_K2 <dbl>,
#   date_cos365_K2 <dbl>, date_sin365_K3 <dbl>, date_cos365_K3 <dbl>,
#   date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, date_sin365_K5 <dbl>,
#   date_cos365_K5 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>, …

定义模型

代码

model_spec_lm <- linear_reg(
    mode = "regression", 
    penalty = 0.1
) %>%
    set_engine("glmnet")

model_arima_no_boost <- arima_reg() %>%
    set_engine(engine = "auto_arima")


model_spec_mars <- mars(mode = "regression") %>%
    set_engine("earth")

合成工作流

代码

workflow_lm <- workflow() %>%
    add_recipe(recipe_spec_timeseries_lm) %>%
    add_model(model_spec_lm)

workflow_arima_no_boost = workflow() %>%
    add_recipe(recipe_spec_timeseries) %>%
    add_model(model_arima_no_boost)

拟合

代码

workflow_fit_lm <- workflow_lm %>% fit(data = training(splits))
workflow_fit_arima_no_boost = workflow_arima_no_boost %>% fit(data = training(splits))


model_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>%
    fit(close ~ date + as.numeric(date) + factor(month(date, label = TRUE), ordered = F),
        data = training(splits))


model_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(close ~ date, data = training(splits))

model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(close ~ date, data = training(splits))


  
wflw_fit_mars <- workflow() %>%
    add_recipe(recipe_spec_mars) %>%
    add_model(model_spec_mars) %>%
    fit(training(splits))

wflw_xgb <- workflow() %>%
    add_model(
        boost_tree("regression") %>% set_engine("xgboost")
    ) %>%
    add_recipe(recipe_spec_timeseries_lm) %>%
    fit(training(splits))

模型验证

创建模型表，方便对比不同模型效果

代码

library(modeltime)
model_table <- modeltime_table(
  workflow_fit_lm,
  workflow_fit_arima_no_boost,
  model_fit_ets,
  model_fit_prophet,
  wflw_fit_mars,
  wflw_xgb
) 

model_table

# Modeltime Table
# A tibble: 6 × 3
  .model_id .model     .model_desc 
      <int> <list>     <chr>       
1         1 <workflow> GLMNET      
2         2 <workflow> ARIMA(0,1,0)
3         3 <fit[+]>   ETS(A,N,N)  
4         4 <fit[+]>   PROPHET     
5         5 <workflow> EARTH       
6         6 <workflow> XGBOOST

计算测试集

代码

calibration_table <- model_table %>%
  modeltime::modeltime_calibrate(testing(splits))

calibration_table

# Modeltime Table
# A tibble: 6 × 5
  .model_id .model     .model_desc  .type .calibration_data
      <int> <list>     <chr>        <chr> <list>           
1         1 <workflow> GLMNET       Test  <tibble [59 × 4]>
2         2 <workflow> ARIMA(0,1,0) Test  <tibble [59 × 4]>
3         3 <fit[+]>   ETS(A,N,N)   Test  <tibble [59 × 4]>
4         4 <fit[+]>   PROPHET      Test  <tibble [59 × 4]>
5         5 <workflow> EARTH        Test  <tibble [59 × 4]>
6         6 <workflow> XGBOOST      Test  <tibble [59 × 4]>

代码

calibration_table %>%
  modeltime::modeltime_forecast(actual_data = tb_stocks_1,new_data = testing(splits)) %>%
  modeltime::plot_modeltime_forecast(.interactive = T)

模型评价

代码

calibration_table %>%
  modeltime::modeltime_accuracy() %>%
  modeltime::table_modeltime_accuracy()

预测未来数据

预测未来数据之前需要将模型用完整数据先重新拟合

代码

calibration_table %>%
  modeltime::modeltime_refit(tb_stocks_1) %>%
  modeltime::modeltime_forecast(h = "12 months", actual_data = tb_stocks_1) %>%
  modeltime::plot_modeltime_forecast()

时序模型进阶

同时对大量不同的时间序列进行预测，称之为大规模预测。一般来说，有两种大规模预测方法（一种是慢而准确的方法，另一种是快速而好的方法）：

全局建模（快速且良好）：使用面板数据结构，只制了一个模型来生成所有时间序列的预测，可以轻松扩展到10,000+时间序列。
迭代预测（缓慢而准确）：针对每个时间序列分别建模，准确性最佳，但比全局模型花费更长的时间及更多内存开销

选取沃尔玛产品销量数据作为示例数据，其中ID用于分隔时间序列组

代码

data_tbl <- walmart_sales_weekly %>%
    select(id, date = Date, value = Weekly_Sales)

data_tbl %>% head()

# A tibble: 6 × 3
  id    date        value
  <fct> <date>      <dbl>
1 1_1   2010-02-05 24924.
2 1_1   2010-02-12 46039.
3 1_1   2010-02-19 41596.
4 1_1   2010-02-26 19404.
5 1_1   2010-03-05 21828.
6 1_1   2010-03-12 21043.

可视化

代码

data_tbl %>%
  group_by(id) %>%
  plot_time_series(
    date, value, .interactive = T, .facet_ncol = 2
  )

数据拆分

代码

splits <- data_tbl %>% 
  time_series_split(
    assess     = "3 months", 
    cumulative = TRUE
  )

splits

<Analysis/Assess/Total>
<917/84/1001>

全局建模

特征工程

代码

rec_obj <- recipe(value ~ ., training(splits)) %>%
  # 这里ID是factor类型，用droplevels删除分类 ID 列中多余的未使用level
    step_mutate_at(id, fn = droplevels) %>%
  # 添加日期特征
    step_timeseries_signature(date) %>%
  # 删除日期列。像 XGBoost 这样的机器学习算法不能很好地处理日期列
    step_rm(date) %>%
  # 移除任何零方差特征。没有方差的特征不会向我们的模型添加有用的信息。
    step_zv(all_predictors()) %>%
  # 将分类特征转换为二进制独热编码特征，XGBoost 等算法可以更有效地建模。
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

summary(prep(rec_obj))

# A tibble: 38 × 4
   variable       type      role      source  
   <chr>          <list>    <chr>     <chr>   
 1 value          <chr [2]> outcome   original
 2 date_index.num <chr [2]> predictor derived 
 3 date_year      <chr [2]> predictor derived 
 4 date_year.iso  <chr [2]> predictor derived 
 5 date_half      <chr [2]> predictor derived 
 6 date_quarter   <chr [2]> predictor derived 
 7 date_month     <chr [2]> predictor derived 
 8 date_month.xts <chr [2]> predictor derived 
 9 date_day       <chr [2]> predictor derived 
10 date_mday      <chr [2]> predictor derived 
# ℹ 28 more rows

代码

juice(prep(rec_obj)) %>% head()

# A tibble: 6 × 38
    value date_index.num date_year date_year.iso date_half date_quarter
    <dbl>          <dbl>     <int>         <int>     <int>        <int>
1  24924.     1265328000      2010          2010         1            1
2  13740.     1265328000      2010          2010         1            1
3  40129.     1265328000      2010          2010         1            1
4  41969.     1265328000      2010          2010         1            1
5 115564.     1265328000      2010          2010         1            1
6  64495.     1265328000      2010          2010         1            1
# ℹ 32 more variables: date_month <int>, date_month.xts <int>, date_day <int>,
#   date_mday <int>, date_qday <int>, date_yday <int>, date_mweek <int>,
#   date_week <int>, date_week.iso <int>, date_week2 <int>, date_week3 <int>,
#   date_week4 <int>, date_mday7 <int>, id_X1_1 <dbl>, id_X1_3 <dbl>,
#   id_X1_8 <dbl>, id_X1_13 <dbl>, id_X1_38 <dbl>, id_X1_93 <dbl>,
#   id_X1_95 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>,
#   date_month.lbl_03 <dbl>, date_month.lbl_04 <dbl>, …

xgboost建模

代码

wflw_xgb <- workflow() %>%
    add_model(
        boost_tree("regression") %>% set_engine("xgboost")
    ) %>%
    add_recipe(rec_obj) %>%
    fit(training(splits))

wflw_xgb

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_mutate_at()
• step_timeseries_signature()
• step_rm()
• step_zv()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 46.6 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 37 
niter: 15
nfeatures : 37 
evaluation_log:
     iter training_rmse
    <num>         <num>
        1     46315.130
        2     33003.670
---                    
       14      3554.521
       15      3423.419

创建模型表

代码

model_tbl <- modeltime_table(
    wflw_xgb
)

model_tbl

# Modeltime Table
# A tibble: 1 × 3
  .model_id .model     .model_desc
      <int> <list>     <chr>      
1         1 <workflow> XGBOOST

模型校准

代码

calib_tbl <- model_tbl %>%
    modeltime_calibrate(
      new_data = testing(splits), 
      id       = "id"
    )

calib_tbl

# Modeltime Table
# A tibble: 1 × 5
  .model_id .model     .model_desc .type .calibration_data
      <int> <list>     <chr>       <chr> <list>           
1         1 <workflow> XGBOOST     Test  <tibble [84 × 5]>

模型全局准确性

代码

calib_tbl %>% 
    modeltime_accuracy(acc_by_id = FALSE) %>% 
    table_modeltime_accuracy(.interactive = FALSE)

Accuracy Table
.model_id	.model_desc	.type	mae	mape	mase	smape	rmse	rsq
1	XGBOOST	Test	3698.69	8.3	0.11	8.17	5059.45	0.98

不同时间序列模型准确性

代码

calib_tbl %>% 
    modeltime_accuracy(acc_by_id = TRUE) %>% 
    table_modeltime_accuracy(.interactive = FALSE)

Accuracy Table
.model_id	.model_desc	.type	id	mae	mape	mase	smape	rmse	rsq
1	XGBOOST	Test	1_1	2125.99	10.83	1.59	10.30	2596.97	0.50
1	XGBOOST	Test	1_3	3397.33	18.67	0.57	17.60	4048.49	0.91
1	XGBOOST	Test	1_8	2086.84	5.45	0.95	5.62	2283.70	0.67
1	XGBOOST	Test	1_13	1596.47	3.86	0.70	3.96	1913.82	0.53
1	XGBOOST	Test	1_38	8345.09	10.77	1.03	10.99	9685.67	0.01
1	XGBOOST	Test	1_93	4278.75	5.37	0.42	5.57	5440.12	0.75
1	XGBOOST	Test	1_95	4060.36	3.14	0.51	3.18	4875.10	0.65

预测测试数据

代码

calib_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = data_tbl,
        conf_by_id  = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 3,
        .interactive = FALSE
    )

重新拟合模型

代码

refit_tbl <- calib_tbl %>%
  modeltime_refit(data = data_tbl)

refit_tbl

# Modeltime Table
# A tibble: 1 × 5
  .model_id .model     .model_desc .type .calibration_data
      <int> <list>     <chr>       <chr> <list>           
1         1 <workflow> XGBOOST     Test  <tibble [84 × 5]>

预测未来数据

代码

future_tbl <- data_tbl %>%
  group_by(id) %>%
  future_frame(.length_out = 52, .bind_data = FALSE)
refit_tbl %>%
  modeltime_forecast(
    new_data    = future_tbl,
    actual_data = data_tbl, 
    conf_by_id  = TRUE
  ) %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .interactive = F,
    .facet_ncol  = 2
  )

嵌套预测

代码

nested_data_tbl <- data_tbl %>%
    
    # 1. 扩展原始数据，增加未来52周
    extend_timeseries(
        .id_var        = id,
        .date_var      = date,
        .length_future = 52
    ) %>%
    
    # 2. 嵌套数据: 按ID分组生成用于训练模型的actual_data和用于预测未来的未来future_data

    nest_timeseries(
        .id_var        = id,
        .length_future = 52,
        .length_actual = 52*2
    ) %>%
    
   # 3. 数据划分: 将actual_data中划分出一部分的数据作为测试集
    split_nested_timeseries(
        .length_test = 12
    )

nested_data_tbl

# A tibble: 7 × 4
  id    .actual_data       .future_data      .splits        
  <fct> <list>             <list>            <list>         
1 1_1   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
2 1_3   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
3 1_8   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
4 1_13  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
5 1_38  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
6 1_93  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
7 1_95  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>

分别使用prophet和xgboost模型

代码

rec_prophet <- recipe(value ~ date, extract_nested_train_split(nested_data_tbl)) 

wflw_prophet <- workflow() %>%
    add_model(
      prophet_reg("regression", seasonality_yearly = TRUE) %>% 
        set_engine("prophet")
    ) %>%
    add_recipe(rec_prophet)

rec_xgb <- recipe(value ~ ., extract_nested_train_split(nested_data_tbl)) %>%
    step_timeseries_signature(date) %>%
    step_rm(date) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

wflw_xgb <- workflow() %>%
    add_model(boost_tree("regression") %>% set_engine("xgboost")) %>%
    add_recipe(rec_xgb)

模型表

代码

nested_modeltime_tbl <- modeltime_nested_fit(
  nested_data = nested_data_tbl,
  wflw_prophet,
  wflw_xgb
)

nested_modeltime_tbl

# Nested Modeltime Table
  # A tibble: 7 × 5
  id    .actual_data       .future_data      .splits         .modeltime_tables 
  <fct> <list>             <list>            <list>          <list>            
1 1_1   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
2 1_3   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
3 1_8   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
4 1_13  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
5 1_38  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
6 1_93  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
7 1_95  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>

提取模型信息

模型评价

代码

nested_modeltime_tbl %>% 
  extract_nested_test_accuracy() %>%
  table_modeltime_accuracy(.interactive = T)

测试预测

代码

nested_modeltime_tbl %>% 
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 2,
    .interactive = T
  )

提取错误

预测过程中可能会出现错误，可以通过extract_nested_error_report查看错误

代码

nested_modeltime_tbl %>% 
  extract_nested_error_report()

# A tibble: 0 × 4
# ℹ 4 variables: id <fct>, .model_id <int>, .model_desc <chr>,
#   .error_desc <chr>

提取最佳

代码

best_nested_modeltime_tbl <- nested_modeltime_tbl %>%
    modeltime_nested_select_best(
      metric                = "rmse", 
      minimize              = TRUE, 
      filter_test_forecasts = TRUE
    )
best_nested_modeltime_tbl %>%
  extract_nested_best_model_report()

# Nested Modeltime Table
  # A tibble: 7 × 10
  id    .model_id .model_desc .type   mae  mape  mase smape  rmse   rsq
  <fct>     <int> <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1_1           2 XGBOOST     Test  1610.  7.76 1.21   7.93 2062. 0.896
2 1_3           2 XGBOOST     Test  4671. 34.4  0.786 27.0  5252. 0.938
3 1_8           2 XGBOOST     Test  1157.  3.00 0.526  3.06 1540. 0.547
4 1_13          2 XGBOOST     Test  1809.  4.45 0.796  4.59 2129. 0.734
5 1_38          1 PROPHET     Test  6958.  9.44 0.860  9.09 8585. 0.189
6 1_93          2 XGBOOST     Test  4164.  5.33 0.409  5.57 5357. 0.803
7 1_95          2 XGBOOST     Test  4363.  3.43 0.551  3.41 4984. 0.629

代码

best_nested_modeltime_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 2,
    .interactive = T
  )

重新拟合

代码

nested_modeltime_refit_tbl <- best_nested_modeltime_tbl %>%
    modeltime_nested_refit(
        control = control_nested_refit(verbose = TRUE)
    )

预测未来

代码

nested_modeltime_refit_tbl %>%
  extract_nested_future_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .interactive = T,
    .facet_ncol  = 2
  )

并行调参

parallel_start开始并行并设置核心数

代码

parallel_start(2, .method = "parallel")

代码

model_tbl <- tibble(
  learn_rate = c(0.001, 0.010, 0.100, 0.350, 0.500, 0.650)
) %>%
  create_model_grid(
    f_model_spec = boost_tree,
    engine_name  = "xgboost",
    mode         = "regression"
  )

model_tbl

# A tibble: 6 × 2
  learn_rate .models  
       <dbl> <list>   
1      0.001 <spec[+]>
2      0.01  <spec[+]>
3      0.1   <spec[+]>
4      0.35  <spec[+]>
5      0.5   <spec[+]>
6      0.65  <spec[+]>

生成`workflow_set`

代码

dataset_tbl <- walmart_sales_weekly %>%
  select(id, Date, Weekly_Sales)
splits <- time_series_split(
  dataset_tbl, 
  assess     = "6 months", 
  cumulative = TRUE
)


model_list <- model_tbl$.models

recipe_spec_xgboost <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
  step_timeseries_signature(Date) %>%
  step_rm(Date) %>%
  step_normalize(Date_index.num) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

model_wfset <- workflow_set(
  preproc = list(
    recipe_spec_xgboost
  ),
  models = model_list, 
  cross = TRUE
)

model_wfset

# A workflow set/tibble: 6 × 4
  wflow_id            info             option    result    
  <chr>               <list>           <list>    <list>    
1 recipe_boost_tree_1 <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_boost_tree_2 <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_boost_tree_3 <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_boost_tree_4 <tibble [1 × 4]> <opts[0]> <list [0]>
5 recipe_boost_tree_5 <tibble [1 × 4]> <opts[0]> <list [0]>
6 recipe_boost_tree_6 <tibble [1 × 4]> <opts[0]> <list [0]>

拟合

并行拟合与顺序拟合速度对比

并行拟合
顺序拟合

代码

model_parallel_tbl <- model_wfset %>%
  modeltime_fit_workflowset(
    data    = training(splits),
    control = control_fit_workflowset(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )

代码

model_sequential_tbl <- model_wfset %>%
  modeltime_fit_workflowset(
    data    = training(splits),
    control = control_fit_workflowset(
      verbose   = TRUE,
      allow_par = FALSE
    )
  )

模型评估

代码

model_parallel_tbl %>%
  modeltime_calibrate(testing(splits)) %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = FALSE)

Accuracy Table
.model_id	.model_desc	.type	mae	mape	mase	smape	rmse	rsq
1	RECIPE_BOOST_TREE_1	Test	55572.50	98.52	1.63	194.17	66953.92	0.96
2	RECIPE_BOOST_TREE_2	Test	48819.23	86.15	1.43	151.49	58992.30	0.96
3	RECIPE_BOOST_TREE_3	Test	13286.28	21.41	0.39	24.70	17271.50	0.98
4	RECIPE_BOOST_TREE_4	Test	3802.13	8.11	0.11	8.05	5558.56	0.98
5	RECIPE_BOOST_TREE_5	Test	3361.45	7.01	0.10	7.08	5233.28	0.98
6	RECIPE_BOOST_TREE_6	Test	3645.01	8.76	0.11	9.20	5420.95	0.98

可视化

代码

model_parallel_tbl %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dataset_tbl,
    keep_data   = TRUE
  ) %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 3,
    .interactive = FALSE
  )

关闭集群

代码

parallel_stop()

回到顶部

数据准备

历史数据

常用数据操作

时间汇总统计

时间序列填充

时间窗口

窗口计算

窗口模型

可视化

ACF、PACF和CCF图

季节诊断

STL分解

异常值检测

绘制异常点

异常清理

时序聚类

时序模型基础

数据划分

数据预处理

定义模型

合成工作流

拟合

模型验证

模型评价

预测未来数据

时序模型进阶

全局建模

嵌套预测

提取模型信息

测试预测

提取错误

提取最佳

重新拟合

预测未来

并行调参

生成workflow_set

拟合

模型评估

生成`workflow_set`