R语言时序分析

R语言
时间序列
数据分析
数据清洗
机器学习
使用R语言进行时间序列分析,timetk包时序数据处理及modeltime包时序建模
作者

不止BI

发布于

2024年7月30日

数据准备

历史数据

代码
library(tidyquant)
library(tidyverse)
library(timetk)
tb_stocks <- tq_get(c("000001.ss", "600036.ss"), get = "stock.prices", from = " 2019-01-01")


tb_stocks %>%
  group_by(symbol) %>%
  plot_time_series(.date_var = date, .value = close, .facet_ncol = 2)

常用数据操作

时间汇总统计

代码
tb_stocks %>%
  group_by(symbol) %>%
  summarise_by_time(
    .date_var = date,
    .by = "month",
    volume = sum(volume)
  ) %>%
  plot_time_series(.date_var = date, .value = volume, .facet_ncol = 2)

时间序列填充

时间序列数据可能存在某段时间中断的情况,可以使用pad_by_time补全时间

代码
tb_stocks %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "auto")
# A tibble: 4,072 × 8
# Groups:   symbol [2]
   symbol    date        open  high   low close volume adjusted
   <chr>     <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
 1 000001.ss 2019-01-02 2498. 2500. 2456. 2465. 109900    2465.
 2 000001.ss 2019-01-03 2462. 2488. 2456. 2464. 124400    2464.
 3 000001.ss 2019-01-04 2446. 2515. 2441. 2515. 168900    2515.
 4 000001.ss 2019-01-05   NA    NA    NA    NA      NA      NA 
 5 000001.ss 2019-01-06   NA    NA    NA    NA      NA      NA 
 6 000001.ss 2019-01-07 2529. 2537. 2516. 2533. 177300    2533.
 7 000001.ss 2019-01-08 2530. 2531. 2520. 2526. 158100    2526.
 8 000001.ss 2019-01-09 2536. 2574. 2536. 2544. 191900    2544.
 9 000001.ss 2019-01-10 2544. 2552. 2532. 2535. 159900    2535.
10 000001.ss 2019-01-11 2540. 2555. 2533. 2554. 149400    2554.
# ℹ 4,062 more rows

时间窗口

窗口计算

代码
roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE)

tb_stocks %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
  tidyr::pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
  plot_time_series(date, value,
    .color_var = name,
    .facet_ncol = 2, .smooth = FALSE,
    .interactive = T
  )

窗口模型

代码
# Rolling regressions are easy to implement using `.unlist = FALSE`
lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3),
  .period = 90,
  .unlist = FALSE, .align = "right"
)


tb_stocks %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # Apply rolling regression
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm))
# A tibble: 2,522 × 6
# Groups:   symbol [2]
   symbol    date       adjusted volume numeric_date rolling_lm
   <chr>     <date>        <dbl>  <dbl>        <dbl> <list>    
 1 000001.ss 2019-05-22    2892. 200000        18038 <lm>      
 2 000001.ss 2019-05-23    2853. 197300        18039 <lm>      
 3 000001.ss 2019-05-24    2853. 167200        18040 <lm>      
 4 000001.ss 2019-05-27    2892. 196700        18043 <lm>      
 5 000001.ss 2019-05-28    2910. 223300        18044 <lm>      
 6 000001.ss 2019-05-29    2915. 199000        18045 <lm>      
 7 000001.ss 2019-05-30    2906. 205800        18046 <lm>      
 8 000001.ss 2019-05-31    2899. 195200        18047 <lm>      
 9 000001.ss 2019-06-03    2890. 215900        18050 <lm>      
10 000001.ss 2019-06-04    2862. 188500        18051 <lm>      
# ℹ 2,512 more rows

可视化

ACF、PACF和CCF图

ACF、PACF和CCF都是用于分析时间序列数据相关性的统计工具。它们可以帮助识别序列中的模式和结构,并为模型选择提供依据。

自相关函数(ACF)衡量的是时间序列中同一变量在不同时间点之间的相关性。它计算的是滞后 k 阶的自协方差函数(autocovariance function,ACVF)与方差之比。ACF 值在 -1 到 1 之间,其中 1 表示完全相关,0 表示不相关,-1 表示完全负相关。

偏自相关函数(PACF)衡量的是时间序列中同一变量在剔除掉中间滞后的情况下,相邻时间点之间的相关性。它计算的是条件自相关函数(conditional autocorrelation function),即在控制了其他滞后变量的情况下,某一滞后变量与目标变量之间的相关性。PACF 值也 在 -1 到 1 之间。

互相关函数(CCF)衡量的是两个不同时间序列之间在不同时间点之间的相关性。它计算的是两个时间序列的滞后 k 阶互协方差函数(cross-covariance function,CCVF)与各自方差的乘积之比。CCF 值也在 -1 到 1 之间。

以下是一些ACF、PACF和CCF的应用示例:

  • 识别时间序列中的季节性:ACF 和 PACF 图中的尖峰可以指示时间序列中存在季节性。例如,如果 ACF 图在每个 12 个时间步长处都有一个尖峰,则表明时间序列存在年季节性。

  • 确定模型的阶数:ACF 和 PACF 图中的尾部衰减模式可以帮助确定 ARIMA 或 SARIMA 模型的阶数。例如,如果 ACF 和 PACF 图中的尾部迅速衰减到零,则表明模型的阶数可能较低。

  • 评估预测模型的性能:ACF 和 PACF 图可以用于评估预测模型的残差是否具有自相关性。如果残差具有自相关性,则表明模型可能存在缺陷。

代码
tb_stocks %>%
  group_by(symbol) %>%
  plot_acf_diagnostics(
    date,
    .value = close,
    .lags = "7 days",
    .ccf_vars = c(volume, open), # CCFs
    .interactive = T
  )

季节诊断

代码
tb_stocks %>%
  group_by(symbol) %>%
  plot_seasonal_diagnostics(date, .value = close, .interactive = T)

STL分解

代码
tb_stocks %>%
  group_by(symbol) %>%
  na.omit() %>%
  plot_stl_diagnostics(
    date, close,
    .frequency = "auto", .trend = "auto",
    .feature_set = c("observed", "season", "trend", "remainder"),
    .interactive = T
  )

异常值检测

代码
anomalize_tbl <- wikipedia_traffic_daily %>%
  group_by(Page) %>%
  anomalize(
      .date_var      = date, 
      .value         = value,
      .iqr_alpha     = 0.05,
      .max_anomalies = 0.20,
      .message       = FALSE
  )

anomalize_tbl %>% glimpse()
Rows: 5,500
Columns: 13
Groups: Page [10]
$ Page              <chr> "Death_of_Freddie_Gray_en.wikipedia.org_mobile-web_a…
$ date              <date> 2015-07-01, 2015-07-02, 2015-07-03, 2015-07-04, 201…
$ observed          <dbl> 791, 704, 903, 732, 558, 504, 543, 1156, 1196, 701, …
$ season            <dbl> 9.2554027, 13.1448419, -20.5421912, -24.0966839, 0.4…
$ trend             <dbl> 748.1907, 744.3101, 740.4296, 736.5490, 732.6684, 72…
$ remainder         <dbl> 33.553930, -53.454953, 183.112637, 19.547687, -175.1…
$ seasadj           <dbl> 781.7446, 690.8552, 923.5422, 756.0967, 557.5091, 50…
$ anomaly           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No"…
$ anomaly_direction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ anomaly_score     <dbl> 83.50187, 170.51076, 66.05683, 97.50812, 292.21517, …
$ recomposed_l1     <dbl> -328.3498, -328.3409, -365.9085, -373.3435, -352.636…
$ recomposed_l2     <dbl> 2077.354, 2077.362, 2039.795, 2032.360, 2053.067, 20…
$ observed_clean    <dbl> 791.000, 704.000, 903.000, 732.000, 558.000, 504.000…
  1. 原始分组和日期时间列。

  2. 季节分解: observedseasonalseasadjtrendremainder 。目的是消除趋势和季节性,使其余部分保持稳定并代表正常变化和异常变化。

  3. 异常识别和评分: anomalyanomaly_scoreanomaly_direction 。这些识别异常决策(是/否),根据距中心线的距离对异常进行评分,并标记方向(-1(向下)、零(非异常)、+1(向上))。

  4. 重组: recomposed_l1recomposed_l2 。将它们视为下限和上限。任何低于 l1 或高于 l2 的观测数据都是异常的。

  5. 清理后的数据: observed_clean 。自动提供清理后的数据,其中异常值替换为重组 l1/l2 边界内的数据。话虽如此,在简单地删除异常值并使用清理后的数据之前,您应该始终首先尝试了解为什么数据被视为异常。

绘制异常点

代码
anomalize_tbl %>%
    group_by(Page) %>%
    plot_anomalies(
        date,
        .facet_ncol = 2
    )

异常清理

代码
anomalize_tbl %>%
    group_by(Page) %>%
    plot_anomalies_cleaned(
        date,
        .facet_ncol = 2
    )

时序聚类

通过时间序列的特征(如趋势、频率、季节分解等),将时间序列数据进行聚类划分

计算特征

代码
# 自定义计算时间序列特征的函数
my_mean <- function(x, na.rm=TRUE) {
  mean(x, na.rm = na.rm)
}

df_stocks_cluster = tq_get(c("000001.ss","600036.ss","600035.ss","600030.ss","600037.ss"),
                         get = "stock.prices",
                         from = " 2019-01-01")

tsfeature_tbl <-  df_stocks_cluster %>%
    group_by(symbol) %>%
    tk_tsfeatures(
      .date_var = date,
      .value    = close,
      .period   = 52,
      .features = c("frequency", "stl_features", "entropy", "acf_features", "my_mean"),
      .scale    = TRUE,
      .prefix   = "ts_"
    ) %>%
    ungroup()

tsfeature_tbl
# A tibble: 5 × 22
  symbol    ts_frequency ts_nperiods ts_seasonal_period ts_trend      ts_spike
  <chr>            <dbl>       <dbl>              <dbl>    <dbl>         <dbl>
1 000001.ss           52           1                 52    0.941 0.00000000586
2 600036.ss           52           1                 52    0.969 0.00000000147
3 600035.ss           52           1                 52    0.943 0.00000000484
4 600030.ss           52           1                 52    0.898 0.0000000213 
5 600037.ss           52           1                 52    0.934 0.0000000123 
# ℹ 16 more variables: ts_linearity <dbl>, ts_curvature <dbl>, ts_e_acf1 <dbl>,
#   ts_e_acf10 <dbl>, ts_seasonal_strength <dbl>, ts_peak <dbl>,
#   ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>, ts_x_acf10 <dbl>,
#   ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>, ts_diff2_acf1 <dbl>,
#   ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>, ts_my_mean <dbl>

聚类

代码
set.seed(123)


cluster_tbl <- tibble(
    cluster = tsfeature_tbl %>% 
        select(-symbol) %>%
        as.matrix() %>%
        kmeans(centers = 3, nstart = 100) %>%
        pluck("cluster")
) %>%
    bind_cols(
        tsfeature_tbl
    )

cluster_tbl
# A tibble: 5 × 23
  cluster symbol   ts_frequency ts_nperiods ts_seasonal_period ts_trend ts_spike
    <int> <chr>           <dbl>       <dbl>              <dbl>    <dbl>    <dbl>
1       2 000001.…           52           1                 52    0.941  5.86e-9
2       2 600036.…           52           1                 52    0.969  1.47e-9
3       1 600035.…           52           1                 52    0.943  4.84e-9
4       2 600030.…           52           1                 52    0.898  2.13e-8
5       3 600037.…           52           1                 52    0.934  1.23e-8
# ℹ 16 more variables: ts_linearity <dbl>, ts_curvature <dbl>, ts_e_acf1 <dbl>,
#   ts_e_acf10 <dbl>, ts_seasonal_strength <dbl>, ts_peak <dbl>,
#   ts_trough <dbl>, ts_entropy <dbl>, ts_x_acf1 <dbl>, ts_x_acf10 <dbl>,
#   ts_diff1_acf1 <dbl>, ts_diff1_acf10 <dbl>, ts_diff2_acf1 <dbl>,
#   ts_diff2_acf10 <dbl>, ts_seas_acf1 <dbl>, ts_my_mean <dbl>

可视化

代码
cluster_tbl %>%
    select(cluster, symbol) %>%
    right_join(df_stocks_cluster, by = "symbol") %>%
    group_by(symbol) %>%
    plot_time_series(
      date, close, 
      .color_var   = cluster, 
      .facet_ncol  = 2, 
      .interactive = T
    )

时序模型基础

数据划分

代码
library(xgboost)
library(tidymodels)
library(modeltime)

# 使用最近三个月数据作为测试集,cumulative = TRUE使用先前所有的数据作为训练集合 
tb_stocks_1 = tb_stocks %>% filter(symbol == '000001.ss') %>% select(date,close)
splits <- tb_stocks_1 %>% 
  time_series_split(assess = "3 months", cumulative = T)
# splits <- tb_stocks_1 %>% initial_time_split(prop = 0.9)
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, close, .interactive = T)

数据预处理

代码
library(tidymodels)
recipe_spec_timeseries_lm <- recipe(close ~ ., data = training(splits)) %>%
    step_timeseries_signature(date)  %>%
    step_fourier(date, period = 365, K = 5) %>%
    step_rm(date) %>%
    step_rm(contains("iso"), contains("minute"), contains("hour"),
            contains("am.pm"), contains("xts")) %>%
    step_normalize(contains("index.num"), date_year) %>%
    step_dummy(contains("lbl"), one_hot = TRUE) 

recipe_spec_timeseries <- recipe(close ~ date, data = training(splits))


recipe_spec_mars <- recipe(close ~ date, data = training(splits)) %>%
    step_date(date, features = "month", ordinal = FALSE) %>%
    step_mutate(date_num = as.numeric(date)) %>%
    step_normalize(date_num) %>%
    step_rm(date)
# 查看预处理后数据
juice(prep(recipe_spec_timeseries_lm)) %>% head()
# A tibble: 6 × 47
  close date_index.num date_year date_half date_quarter date_month date_day
  <dbl>          <dbl>     <dbl>     <int>        <int>      <int>    <int>
1 2465.          -1.74     -1.41         1            1          1        2
2 2464.          -1.73     -1.41         1            1          1        3
3 2515.          -1.73     -1.41         1            1          1        4
4 2533.          -1.73     -1.41         1            1          1        7
5 2526.          -1.72     -1.41         1            1          1        8
6 2544.          -1.72     -1.41         1            1          1        9
# ℹ 40 more variables: date_second <int>, date_wday <int>, date_mday <int>,
#   date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
#   date_week2 <int>, date_week3 <int>, date_week4 <int>, date_mday7 <int>,
#   date_sin365_K1 <dbl>, date_cos365_K1 <dbl>, date_sin365_K2 <dbl>,
#   date_cos365_K2 <dbl>, date_sin365_K3 <dbl>, date_cos365_K3 <dbl>,
#   date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, date_sin365_K5 <dbl>,
#   date_cos365_K5 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>, …
代码
# 将预处理 运用在新数据上

bake(prep(recipe_spec_timeseries_lm),new_data = testing(splits)) %>% head()
# A tibble: 6 × 47
  close date_index.num date_year date_half date_quarter date_month date_day
  <dbl>          <dbl>     <dbl>     <int>        <int>      <int>    <int>
1 3148.           1.74      1.82         1            2          5        7
2 3128.           1.74      1.82         1            2          5        8
3 3154.           1.75      1.82         1            2          5        9
4 3155.           1.75      1.82         1            2          5       10
5 3148.           1.75      1.82         1            2          5       13
6 3146.           1.76      1.82         1            2          5       14
# ℹ 40 more variables: date_second <int>, date_wday <int>, date_mday <int>,
#   date_qday <int>, date_yday <int>, date_mweek <int>, date_week <int>,
#   date_week2 <int>, date_week3 <int>, date_week4 <int>, date_mday7 <int>,
#   date_sin365_K1 <dbl>, date_cos365_K1 <dbl>, date_sin365_K2 <dbl>,
#   date_cos365_K2 <dbl>, date_sin365_K3 <dbl>, date_cos365_K3 <dbl>,
#   date_sin365_K4 <dbl>, date_cos365_K4 <dbl>, date_sin365_K5 <dbl>,
#   date_cos365_K5 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>, …

定义模型

代码
model_spec_lm <- linear_reg(
    mode = "regression", 
    penalty = 0.1
) %>%
    set_engine("glmnet")

model_arima_no_boost <- arima_reg() %>%
    set_engine(engine = "auto_arima")


model_spec_mars <- mars(mode = "regression") %>%
    set_engine("earth") 

合成工作流

代码
workflow_lm <- workflow() %>%
    add_recipe(recipe_spec_timeseries_lm) %>%
    add_model(model_spec_lm)

workflow_arima_no_boost = workflow() %>%
    add_recipe(recipe_spec_timeseries) %>%
    add_model(model_arima_no_boost)

拟合

代码
workflow_fit_lm <- workflow_lm %>% fit(data = training(splits))
workflow_fit_arima_no_boost = workflow_arima_no_boost %>% fit(data = training(splits))


model_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>%
    fit(close ~ date + as.numeric(date) + factor(month(date, label = TRUE), ordered = F),
        data = training(splits))


model_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(close ~ date, data = training(splits))

model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(close ~ date, data = training(splits))


  
wflw_fit_mars <- workflow() %>%
    add_recipe(recipe_spec_mars) %>%
    add_model(model_spec_mars) %>%
    fit(training(splits))

wflw_xgb <- workflow() %>%
    add_model(
        boost_tree("regression") %>% set_engine("xgboost")
    ) %>%
    add_recipe(recipe_spec_timeseries_lm) %>%
    fit(training(splits))

模型验证

创建模型表,方便对比不同模型效果

代码
library(modeltime)
model_table <- modeltime_table(
  workflow_fit_lm,
  workflow_fit_arima_no_boost,
  model_fit_ets,
  model_fit_prophet,
  wflw_fit_mars,
  wflw_xgb
) 

model_table
# Modeltime Table
# A tibble: 6 × 3
  .model_id .model     .model_desc 
      <int> <list>     <chr>       
1         1 <workflow> GLMNET      
2         2 <workflow> ARIMA(0,1,0)
3         3 <fit[+]>   ETS(A,N,N)  
4         4 <fit[+]>   PROPHET     
5         5 <workflow> EARTH       
6         6 <workflow> XGBOOST     

计算测试集

代码
calibration_table <- model_table %>%
  modeltime::modeltime_calibrate(testing(splits))

calibration_table
# Modeltime Table
# A tibble: 6 × 5
  .model_id .model     .model_desc  .type .calibration_data
      <int> <list>     <chr>        <chr> <list>           
1         1 <workflow> GLMNET       Test  <tibble [59 × 4]>
2         2 <workflow> ARIMA(0,1,0) Test  <tibble [59 × 4]>
3         3 <fit[+]>   ETS(A,N,N)   Test  <tibble [59 × 4]>
4         4 <fit[+]>   PROPHET      Test  <tibble [59 × 4]>
5         5 <workflow> EARTH        Test  <tibble [59 × 4]>
6         6 <workflow> XGBOOST      Test  <tibble [59 × 4]>
代码
calibration_table %>%
  modeltime::modeltime_forecast(actual_data = tb_stocks_1,new_data = testing(splits)) %>%
  modeltime::plot_modeltime_forecast(.interactive = T)

模型评价

代码
calibration_table %>%
  modeltime::modeltime_accuracy() %>%
  modeltime::table_modeltime_accuracy()

预测未来数据

预测未来数据之前需要将模型用完整数据先重新拟合

代码
calibration_table %>%
  modeltime::modeltime_refit(tb_stocks_1) %>%
  modeltime::modeltime_forecast(h = "12 months", actual_data = tb_stocks_1) %>%
  modeltime::plot_modeltime_forecast()

时序模型进阶

同时对大量不同的时间序列进行预测,称之为大规模预测。一般来说,有两种大规模预测方法(一种是慢而准确的方法,另一种是快速而好的方法):

  1. 全局建模(快速且良好):使用面板数据结构,只制了一个模型来生成所有时间序列的预测,可以轻松扩展到10,000+时间序列。

  2. 迭代预测(缓慢而准确):针对每个时间序列分别建模,准确性最佳,但比全局模型花费更长的时间及更多内存开销

选取沃尔玛产品销量数据作为示例数据,其中ID用于分隔时间序列组

代码
data_tbl <- walmart_sales_weekly %>%
    select(id, date = Date, value = Weekly_Sales)

data_tbl %>% head()
# A tibble: 6 × 3
  id    date        value
  <fct> <date>      <dbl>
1 1_1   2010-02-05 24924.
2 1_1   2010-02-12 46039.
3 1_1   2010-02-19 41596.
4 1_1   2010-02-26 19404.
5 1_1   2010-03-05 21828.
6 1_1   2010-03-12 21043.

可视化

代码
data_tbl %>%
  group_by(id) %>%
  plot_time_series(
    date, value, .interactive = T, .facet_ncol = 2
  )

数据拆分

代码
splits <- data_tbl %>% 
  time_series_split(
    assess     = "3 months", 
    cumulative = TRUE
  )

splits
<Analysis/Assess/Total>
<917/84/1001>

全局建模

特征工程

代码
rec_obj <- recipe(value ~ ., training(splits)) %>%
  # 这里ID是factor类型,用droplevels删除分类 ID 列中多余的未使用level
    step_mutate_at(id, fn = droplevels) %>%
  # 添加日期特征
    step_timeseries_signature(date) %>%
  # 删除日期列。像 XGBoost 这样的机器学习算法不能很好地处理日期列
    step_rm(date) %>%
  # 移除任何零方差特征。没有方差的特征不会向我们的模型添加有用的信息。
    step_zv(all_predictors()) %>%
  # 将分类特征转换为二进制独热编码特征,XGBoost 等算法可以更有效地建模。
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

summary(prep(rec_obj))
# A tibble: 38 × 4
   variable       type      role      source  
   <chr>          <list>    <chr>     <chr>   
 1 value          <chr [2]> outcome   original
 2 date_index.num <chr [2]> predictor derived 
 3 date_year      <chr [2]> predictor derived 
 4 date_year.iso  <chr [2]> predictor derived 
 5 date_half      <chr [2]> predictor derived 
 6 date_quarter   <chr [2]> predictor derived 
 7 date_month     <chr [2]> predictor derived 
 8 date_month.xts <chr [2]> predictor derived 
 9 date_day       <chr [2]> predictor derived 
10 date_mday      <chr [2]> predictor derived 
# ℹ 28 more rows
代码
juice(prep(rec_obj)) %>% head()
# A tibble: 6 × 38
    value date_index.num date_year date_year.iso date_half date_quarter
    <dbl>          <dbl>     <int>         <int>     <int>        <int>
1  24924.     1265328000      2010          2010         1            1
2  13740.     1265328000      2010          2010         1            1
3  40129.     1265328000      2010          2010         1            1
4  41969.     1265328000      2010          2010         1            1
5 115564.     1265328000      2010          2010         1            1
6  64495.     1265328000      2010          2010         1            1
# ℹ 32 more variables: date_month <int>, date_month.xts <int>, date_day <int>,
#   date_mday <int>, date_qday <int>, date_yday <int>, date_mweek <int>,
#   date_week <int>, date_week.iso <int>, date_week2 <int>, date_week3 <int>,
#   date_week4 <int>, date_mday7 <int>, id_X1_1 <dbl>, id_X1_3 <dbl>,
#   id_X1_8 <dbl>, id_X1_13 <dbl>, id_X1_38 <dbl>, id_X1_93 <dbl>,
#   id_X1_95 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>,
#   date_month.lbl_03 <dbl>, date_month.lbl_04 <dbl>, …

xgboost建模

代码
wflw_xgb <- workflow() %>%
    add_model(
        boost_tree("regression") %>% set_engine("xgboost")
    ) %>%
    add_recipe(rec_obj) %>%
    fit(training(splits))

wflw_xgb
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_mutate_at()
• step_timeseries_signature()
• step_rm()
• step_zv()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 46.6 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 37 
niter: 15
nfeatures : 37 
evaluation_log:
     iter training_rmse
    <num>         <num>
        1     46315.130
        2     33003.670
---                    
       14      3554.521
       15      3423.419

创建模型表

代码
model_tbl <- modeltime_table(
    wflw_xgb
)

model_tbl
# Modeltime Table
# A tibble: 1 × 3
  .model_id .model     .model_desc
      <int> <list>     <chr>      
1         1 <workflow> XGBOOST    

模型校准

代码
calib_tbl <- model_tbl %>%
    modeltime_calibrate(
      new_data = testing(splits), 
      id       = "id"
    )

calib_tbl
# Modeltime Table
# A tibble: 1 × 5
  .model_id .model     .model_desc .type .calibration_data
      <int> <list>     <chr>       <chr> <list>           
1         1 <workflow> XGBOOST     Test  <tibble [84 × 5]>

模型全局准确性

代码
calib_tbl %>% 
    modeltime_accuracy(acc_by_id = FALSE) %>% 
    table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 XGBOOST Test 3698.69 8.3 0.11 8.17 5059.45 0.98

不同时间序列模型准确性

代码
calib_tbl %>% 
    modeltime_accuracy(acc_by_id = TRUE) %>% 
    table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table
.model_id .model_desc .type id mae mape mase smape rmse rsq
1 XGBOOST Test 1_1 2125.99 10.83 1.59 10.30 2596.97 0.50
1 XGBOOST Test 1_3 3397.33 18.67 0.57 17.60 4048.49 0.91
1 XGBOOST Test 1_8 2086.84 5.45 0.95 5.62 2283.70 0.67
1 XGBOOST Test 1_13 1596.47 3.86 0.70 3.96 1913.82 0.53
1 XGBOOST Test 1_38 8345.09 10.77 1.03 10.99 9685.67 0.01
1 XGBOOST Test 1_93 4278.75 5.37 0.42 5.57 5440.12 0.75
1 XGBOOST Test 1_95 4060.36 3.14 0.51 3.18 4875.10 0.65

预测测试数据

代码
calib_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = data_tbl,
        conf_by_id  = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 3,
        .interactive = FALSE
    )

重新拟合模型

代码
refit_tbl <- calib_tbl %>%
  modeltime_refit(data = data_tbl)

refit_tbl
# Modeltime Table
# A tibble: 1 × 5
  .model_id .model     .model_desc .type .calibration_data
      <int> <list>     <chr>       <chr> <list>           
1         1 <workflow> XGBOOST     Test  <tibble [84 × 5]>

预测未来数据

代码
future_tbl <- data_tbl %>%
  group_by(id) %>%
  future_frame(.length_out = 52, .bind_data = FALSE)
refit_tbl %>%
  modeltime_forecast(
    new_data    = future_tbl,
    actual_data = data_tbl, 
    conf_by_id  = TRUE
  ) %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .interactive = F,
    .facet_ncol  = 2
  )

嵌套预测

代码
nested_data_tbl <- data_tbl %>%
    
    # 1. 扩展原始数据,增加未来52周
    extend_timeseries(
        .id_var        = id,
        .date_var      = date,
        .length_future = 52
    ) %>%
    
    # 2. 嵌套数据: 按ID分组生成用于训练模型的actual_data和用于预测未来的未来future_data

    nest_timeseries(
        .id_var        = id,
        .length_future = 52,
        .length_actual = 52*2
    ) %>%
    
   # 3. 数据划分: 将actual_data中划分出一部分的数据作为测试集
    split_nested_timeseries(
        .length_test = 12
    )

nested_data_tbl
# A tibble: 7 × 4
  id    .actual_data       .future_data      .splits        
  <fct> <list>             <list>            <list>         
1 1_1   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
2 1_3   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
3 1_8   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
4 1_13  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
5 1_38  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
6 1_93  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>
7 1_95  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]>

分别使用prophet和xgboost模型

代码
rec_prophet <- recipe(value ~ date, extract_nested_train_split(nested_data_tbl)) 

wflw_prophet <- workflow() %>%
    add_model(
      prophet_reg("regression", seasonality_yearly = TRUE) %>% 
        set_engine("prophet")
    ) %>%
    add_recipe(rec_prophet)

rec_xgb <- recipe(value ~ ., extract_nested_train_split(nested_data_tbl)) %>%
    step_timeseries_signature(date) %>%
    step_rm(date) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

wflw_xgb <- workflow() %>%
    add_model(boost_tree("regression") %>% set_engine("xgboost")) %>%
    add_recipe(rec_xgb)

模型表

代码
nested_modeltime_tbl <- modeltime_nested_fit(
  nested_data = nested_data_tbl,
  wflw_prophet,
  wflw_xgb
)

nested_modeltime_tbl
# Nested Modeltime Table
  # A tibble: 7 × 5
  id    .actual_data       .future_data      .splits         .modeltime_tables 
  <fct> <list>             <list>            <list>          <list>            
1 1_1   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
2 1_3   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
3 1_8   <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
4 1_13  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
5 1_38  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
6 1_93  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>
7 1_95  <tibble [104 × 2]> <tibble [52 × 2]> <split [92|12]> <mdl_tm_t [2 × 5]>

提取模型信息

模型评价

代码
nested_modeltime_tbl %>% 
  extract_nested_test_accuracy() %>%
  table_modeltime_accuracy(.interactive = T)

测试预测

代码
nested_modeltime_tbl %>% 
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 2,
    .interactive = T
  )

提取错误

预测过程中可能会出现错误,可以通过extract_nested_error_report查看错误

代码
nested_modeltime_tbl %>% 
  extract_nested_error_report()
# A tibble: 0 × 4
# ℹ 4 variables: id <fct>, .model_id <int>, .model_desc <chr>,
#   .error_desc <chr>

提取最佳

代码
best_nested_modeltime_tbl <- nested_modeltime_tbl %>%
    modeltime_nested_select_best(
      metric                = "rmse", 
      minimize              = TRUE, 
      filter_test_forecasts = TRUE
    )
best_nested_modeltime_tbl %>%
  extract_nested_best_model_report()
# Nested Modeltime Table
  # A tibble: 7 × 10
  id    .model_id .model_desc .type   mae  mape  mase smape  rmse   rsq
  <fct>     <int> <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1_1           2 XGBOOST     Test  1610.  7.76 1.21   7.93 2062. 0.896
2 1_3           2 XGBOOST     Test  4671. 34.4  0.786 27.0  5252. 0.938
3 1_8           2 XGBOOST     Test  1157.  3.00 0.526  3.06 1540. 0.547
4 1_13          2 XGBOOST     Test  1809.  4.45 0.796  4.59 2129. 0.734
5 1_38          1 PROPHET     Test  6958.  9.44 0.860  9.09 8585. 0.189
6 1_93          2 XGBOOST     Test  4164.  5.33 0.409  5.57 5357. 0.803
7 1_95          2 XGBOOST     Test  4363.  3.43 0.551  3.41 4984. 0.629
代码
best_nested_modeltime_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 2,
    .interactive = T
  )

重新拟合

代码
nested_modeltime_refit_tbl <- best_nested_modeltime_tbl %>%
    modeltime_nested_refit(
        control = control_nested_refit(verbose = TRUE)
    )

预测未来

代码
nested_modeltime_refit_tbl %>%
  extract_nested_future_forecast() %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .interactive = T,
    .facet_ncol  = 2
  )

并行调参

parallel_start开始并行并设置核心数

代码
parallel_start(2, .method = "parallel")
代码
model_tbl <- tibble(
  learn_rate = c(0.001, 0.010, 0.100, 0.350, 0.500, 0.650)
) %>%
  create_model_grid(
    f_model_spec = boost_tree,
    engine_name  = "xgboost",
    mode         = "regression"
  )

model_tbl
# A tibble: 6 × 2
  learn_rate .models  
       <dbl> <list>   
1      0.001 <spec[+]>
2      0.01  <spec[+]>
3      0.1   <spec[+]>
4      0.35  <spec[+]>
5      0.5   <spec[+]>
6      0.65  <spec[+]>
生成workflow_set
代码
dataset_tbl <- walmart_sales_weekly %>%
  select(id, Date, Weekly_Sales)
splits <- time_series_split(
  dataset_tbl, 
  assess     = "6 months", 
  cumulative = TRUE
)


model_list <- model_tbl$.models

recipe_spec_xgboost <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
  step_timeseries_signature(Date) %>%
  step_rm(Date) %>%
  step_normalize(Date_index.num) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

model_wfset <- workflow_set(
  preproc = list(
    recipe_spec_xgboost
  ),
  models = model_list, 
  cross = TRUE
)

model_wfset
# A workflow set/tibble: 6 × 4
  wflow_id            info             option    result    
  <chr>               <list>           <list>    <list>    
1 recipe_boost_tree_1 <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_boost_tree_2 <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_boost_tree_3 <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_boost_tree_4 <tibble [1 × 4]> <opts[0]> <list [0]>
5 recipe_boost_tree_5 <tibble [1 × 4]> <opts[0]> <list [0]>
6 recipe_boost_tree_6 <tibble [1 × 4]> <opts[0]> <list [0]>
拟合

并行拟合与顺序拟合速度对比

代码
model_parallel_tbl <- model_wfset %>%
  modeltime_fit_workflowset(
    data    = training(splits),
    control = control_fit_workflowset(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
代码
model_sequential_tbl <- model_wfset %>%
  modeltime_fit_workflowset(
    data    = training(splits),
    control = control_fit_workflowset(
      verbose   = TRUE,
      allow_par = FALSE
    )
  )
模型评估
代码
model_parallel_tbl %>%
  modeltime_calibrate(testing(splits)) %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 RECIPE_BOOST_TREE_1 Test 55572.50 98.52 1.63 194.17 66953.92 0.96
2 RECIPE_BOOST_TREE_2 Test 48819.23 86.15 1.43 151.49 58992.30 0.96
3 RECIPE_BOOST_TREE_3 Test 13286.28 21.41 0.39 24.70 17271.50 0.98
4 RECIPE_BOOST_TREE_4 Test 3802.13 8.11 0.11 8.05 5558.56 0.98
5 RECIPE_BOOST_TREE_5 Test 3361.45 7.01 0.10 7.08 5233.28 0.98
6 RECIPE_BOOST_TREE_6 Test 3645.01 8.76 0.11 9.20 5420.95 0.98

可视化

代码
model_parallel_tbl %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dataset_tbl,
    keep_data   = TRUE
  ) %>%
  group_by(id) %>%
  plot_modeltime_forecast(
    .facet_ncol  = 3,
    .interactive = FALSE
  )

关闭集群

代码
parallel_stop()
回到顶部