Time Series Analysis in Python

python
time series
data analysis
data cleaning
machine learning
Time series analysis, plotting, and modeling with pytimetk and other libraries
Author

不止BI

Published

June 26, 2024

pytimetk

Data Preparation

Historical Stock Data

Code
import adata
import numpy as np
import pandas as pd
# k_type: K-line interval: 1 = daily; 2 = weekly; 3 = monthly (default: 1, daily)
df_stock_byday = pd.concat([
    adata.stock.market.get_market(stock_code='002241', k_type=1, start_date='2021-01-01').assign(stock_code='002241'),
    adata.stock.market.get_market(stock_code='000001', k_type=1, start_date='2021-01-01').assign(stock_code='000001'),
])


df_stock_byday[['open', 'close', 'volume', 'high', 'low']] = df_stock_byday[['open', 'close', 'volume', 'high', 'low']].astype(np.float64)

Common Data Operations

Common Date Classes

Getting the current date

Code
from datetime import date
today = date.today()
print(today)
type(today)
2025-05-17
datetime.date

Formatting

Converting a date object to a string

Code
import datetime
today_str = datetime.datetime.strftime(today, '%m/%d/%Y')
print(today_str)
type(today_str)
05/17/2025
str

Converting a string to a date object

Code
datetime.datetime.strptime('2024-12-01',"%Y-%m-%d")
datetime.datetime(2024, 12, 1, 0, 0)
Code
from dateutil import parser

date_string = "2024-03-17 12:00 AM"
dt = parser.parse(date_string)
print(dt)  # Output: 2024-03-17 00:00:00
2024-03-17 00:00:00

Computing Date Deltas

Date difference

Code
from dateutil import relativedelta

from_date = datetime.datetime(2021, 1, 1)
to_date = datetime.datetime(2021, 3, 1)

difference = relativedelta.relativedelta(to_date, from_date)
print(difference)
relativedelta(months=+2)

Adding to a date

Code
new_date = from_date + relativedelta.relativedelta(months=1)
print(new_date)
2021-02-01 00:00:00
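Beyond single-unit shifts, relativedelta handles calendar-aware arithmetic that datetime.timedelta cannot. A minimal sketch (the dates here are arbitrary examples):

```python
from datetime import datetime
from dateutil import relativedelta

# subtracting a month from March 31 clamps to the end of February,
# something a fixed-length timedelta cannot express
d = datetime(2021, 3, 31) + relativedelta.relativedelta(months=-1)
print(d)  # 2021-02-28 00:00:00

# several increments can be combined in a single relativedelta
d2 = datetime(2021, 1, 1) + relativedelta.relativedelta(years=1, months=2, days=3)
print(d2)  # 2022-03-04 00:00:00
```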

Time Zones

Code
from dateutil import tz
from datetime import datetime

utc_time = datetime.now(tz.tzutc())
print(utc_time)
2025-05-17 10:58:42.964511+00:00

Converting time zones

Code
local_tz = tz.gettz('America/New_York')  # get the target time zone
local_time = utc_time.astimezone(local_tz)
print(local_time)  # print the converted local time
2025-05-17 06:58:42.964511-04:00
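As an aside, the standard library's zoneinfo module (Python 3.9+) performs the same conversion without dateutil; a minimal sketch using an arbitrary example timestamp:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# the same UTC -> New York conversion with only the standard library
utc_now = datetime(2025, 5, 17, 10, 58, tzinfo=timezone.utc)
ny_time = utc_now.astimezone(ZoneInfo("America/New_York"))
print(ny_time)  # 2025-05-17 06:58:00-04:00 (EDT in May)
```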

Generating Date Sequences

Code
from dateutil import rrule
from datetime import datetime

start_date = datetime(2021, 1, 1)
end_date = datetime(2021, 12, 31)

# generate a sequence of the first day of each month
for dt in rrule.rrule(rrule.MONTHLY, dtstart=start_date, until=end_date):
    print(dt)
2021-01-01 00:00:00
2021-02-01 00:00:00
2021-03-01 00:00:00
2021-04-01 00:00:00
2021-05-01 00:00:00
2021-06-01 00:00:00
2021-07-01 00:00:00
2021-08-01 00:00:00
2021-09-01 00:00:00
2021-10-01 00:00:00
2021-11-01 00:00:00
2021-12-01 00:00:00
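rrule supports much richer recurrence rules than plain monthly steps. An illustrative sketch: every Monday within a window.

```python
from dateutil import rrule
from datetime import datetime

# every Monday in January 2021
mondays = list(rrule.rrule(
    rrule.WEEKLY,
    byweekday=rrule.MO,
    dtstart=datetime(2021, 1, 1),
    until=datetime(2021, 1, 31),
))
for dt in mondays:
    print(dt)  # Jan 4, 11, 18, 25
```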

Inspecting the Data Structure

Code
import pytimetk as tk

df_stock_byday.glimpse()
<class 'pandas.core.frame.DataFrame'>: 2112 rows of 13 columns
stock_code:      object            ['002241', '002241', '002241', '00224 ...
trade_time:      object            ['2021-01-04 00:00:00', '2021-01-05 0 ...
trade_date:      object            ['2021-01-04', '2021-01-05', '2021-01 ...
open:            float64           [36.65, 35.79, 37.65, 37.14, 38.75, 3 ...
close:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, 40 ...
high:            float64           [36.65, 38.2, 38.05, 38.8, 39.15, 40. ...
low:             float64           [35.7, 35.49, 36.7, 36.9, 37.58, 38.0 ...
volume:          float64           [128543900.0, 167528000.0, 95322100.0 ...
amount:          float64           [4707993856.0, 6236502016.0, 36116663 ...
change_pct:      float64           [-1.58, 4.83, -2.12, 3.57, 0.63, 5.6, ...
change:          float64           [-0.58, 1.74, -0.8, 1.32, 0.24, 2.16, ...
turnover_ratio:  float64           [4.61, 6.01, 3.42, 4.89, 3.62, 6.23,  ...
pre_close:       float64           [36.62, 36.04, 37.78, 36.98, 38.3, 38 ...

Converting to Datetime

Code
df_stock_byday['trade_date'] = pd.to_datetime(df_stock_byday['trade_date'], format='%Y-%m-%d')
df_stock_byday['trade_time'] = pd.to_datetime(df_stock_byday['trade_time'])  # the format is inferred automatically; infer_datetime_format is deprecated in pandas 2.x

Adding a Frequency

Pandas provides a variety of frequency strings (also known as offset aliases) to define the frequency of a time series. Some common frequency strings used in pandas:

  1. 'B': business day

  2. 'D': calendar day

  3. 'W': weekly

  4. 'M': month end

  5. 'BM': business month end

  6. 'MS': month start

  7. 'BMS': business month start

  8. 'Q': quarter end

  9. 'BQ': business quarter end

  10. 'QS': quarter start

  11. 'BQS': business quarter start

  12. 'A' or 'Y': year end

  13. 'BA' or 'BY': business year end

  14. 'AS' or 'YS': year start

  15. 'BAS' or 'BYS': business year start

  16. 'H': hourly

  17. 'T' or 'min': minutely

  18. 'S': secondly

  19. 'L' or 'ms': millisecond

  20. 'U': microsecond

  21. 'N': nanosecond

Custom frequencies:

  • You can also build custom frequencies by combining base frequencies, for example:

    • '2D': every 2 days

    • '3W': every 3 weeks

    • '4H': every 4 hours

    • '1H30T': every 1 hour 30 minutes

Compound frequencies:

  • Multiple frequencies can be added together:

    • '1D1H': 1 day and 1 hour

    • '1H30T': 1 hour and 30 minutes

Code
date_range_two_days = pd.date_range(start='2023-01-01', end='2023-01-10', freq='1H30T')

date_range_two_days
DatetimeIndex(['2023-01-01 00:00:00', '2023-01-01 01:30:00',
               '2023-01-01 03:00:00', '2023-01-01 04:30:00',
               '2023-01-01 06:00:00', '2023-01-01 07:30:00',
               '2023-01-01 09:00:00', '2023-01-01 10:30:00',
               '2023-01-01 12:00:00', '2023-01-01 13:30:00',
               ...
               '2023-01-09 10:30:00', '2023-01-09 12:00:00',
               '2023-01-09 13:30:00', '2023-01-09 15:00:00',
               '2023-01-09 16:30:00', '2023-01-09 18:00:00',
               '2023-01-09 19:30:00', '2023-01-09 21:00:00',
               '2023-01-09 22:30:00', '2023-01-10 00:00:00'],
              dtype='datetime64[ns]', length=145, freq='90min')
Code
df_stock_byday = df_stock_byday.pad_by_time('trade_date', freq = 'B')
df_stock_byday.set_index(['trade_date'], inplace = True)
df_stock_byday.index.freq = 'B'  # the frequency can only be set once the DatetimeIndex exists
df_stock_byday.tail()
stock_code trade_time open close high low volume amount change_pct change turnover_ratio pre_close
trade_date
2025-05-14 002241 2025-05-14 22.84 22.69 23.00 22.36 94170600.0 2.130960e+09 -0.79 -0.18 3.05 22.87
2025-05-15 000001 2025-05-15 11.43 11.39 11.50 11.38 103546000.0 1.183733e+09 -0.35 -0.04 0.53 11.43
2025-05-15 002241 2025-05-15 22.67 22.04 22.67 22.00 61836500.0 1.371791e+09 -2.86 -0.65 2.01 22.69
2025-05-16 002241 2025-05-16 21.90 22.00 22.20 21.81 44581800.0 9.825537e+08 -0.18 -0.04 1.45 22.04
2025-05-16 000001 2025-05-16 11.37 11.38 11.43 11.31 91474400.0 1.039224e+09 -0.09 -0.01 0.47 11.39
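For readers without pytimetk, the business-day padding that pad_by_time performs can be sketched in plain pandas with asfreq (a toy frame with made-up prices; pad_by_time's exact semantics may differ in edge cases):

```python
import pandas as pd

# two trading days with a weekend gap in between (2021-01-08 is a Friday)
df = pd.DataFrame(
    {"close": [10.0, 11.0]},
    index=pd.to_datetime(["2021-01-08", "2021-01-12"]),
)
padded = df.asfreq("B")  # insert the missing business day (Mon 2021-01-11) as a NaN row
print(padded)
```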

Summarizing by Frequency

Long format

Code
summary_stock_code_df = df_stock_byday \
    .groupby("stock_code") \
    .summarize_by_time(
        date_column  = 'trade_time', 
        value_column = 'close',
        freq         = "MS", 
        agg_func = ['mean', 'median', 'min', 'max'],
        wide_format  = False
    )


summary_stock_code_df.head()
stock_code trade_time close_mean close_median close_min close_max
0 000001 2021-01-01 19.587000 19.855 16.51 21.43
1 000001 2021-02-01 21.913333 22.190 19.72 23.29
2 000001 2021-03-01 19.829565 19.830 18.74 21.35
3 000001 2021-04-01 20.266667 20.020 18.60 21.93
4 000001 2021-05-01 22.341667 22.160 21.41 23.53

Wide format

Code
df_stock_byday \
    .groupby("stock_code") \
    .summarize_by_time(
        date_column  = 'trade_time', 
        value_column = 'close',
        freq         = "MS",
        agg_func     = 'mean',
        wide_format  = True
    ) \
    .head()
trade_time close_000001 close_002241
0 2021-01-01 19.587000 39.263500
1 2021-02-01 21.913333 32.644000
2 2021-03-01 19.829565 28.349565
3 2021-04-01 20.266667 31.550952
4 2021-05-01 22.341667 36.576111
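The same wide reshape can be reproduced in plain pandas with Grouper and unstack; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "stock_code": ["000001", "000001", "002241", "002241"],
    "trade_time": pd.to_datetime(["2021-01-05", "2021-01-20",
                                  "2021-01-05", "2021-01-20"]),
    "close": [19.0, 20.0, 36.0, 38.0],
})

# monthly ("MS") mean close, one column per stock
wide = (
    df.groupby(["stock_code", pd.Grouper(key="trade_time", freq="MS")])["close"]
      .mean()
      .unstack("stock_code")
)
print(wide)
```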

Generating Time Features

Code
df_stock_byday_with_sig = df_stock_byday.augment_timeseries_signature(date_column = 'trade_time')
df_stock_byday_with_sig.head()
stock_code trade_time open close high low volume amount change_pct change ... trade_time_mday trade_time_qday trade_time_yday trade_time_weekend trade_time_hour trade_time_minute trade_time_second trade_time_msecond trade_time_nsecond trade_time_am_pm
trade_date
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 ... 4.0 4.0 4.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 ... 4.0 4.0 4.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 ... 5.0 5.0 5.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 ... 5.0 5.0 5.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-06 002241 2021-01-06 37.65 36.98 38.05 36.70 95322100.0 3.611666e+09 -2.12 -0.80 ... 6.0 6.0 6.0 0 0.0 0.0 0.0 0.0 0.0 am

5 rows × 41 columns

Generating Lag Features

Code
df_stock_byday \
  .groupby('stock_code') \
  .augment_lags(date_column = 'trade_time',value_column = 'close',lags = (1, 7)) \
  .head()
stock_code trade_time open close high low volume amount change_pct change turnover_ratio pre_close close_lag_1 close_lag_2 close_lag_3 close_lag_4 close_lag_5 close_lag_6 close_lag_7
trade_date
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 0.80 17.68 NaN NaN NaN NaN NaN NaN NaN
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 4.61 36.62 NaN NaN NaN NaN NaN NaN NaN
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 0.94 16.94 16.94 NaN NaN NaN NaN NaN NaN
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 6.01 36.04 36.04 NaN NaN NaN NaN NaN NaN
2021-01-06 000001 2021-01-06 16.42 17.90 17.90 16.34 193494500.0 3.648522e+09 8.42 1.39 1.00 16.51 16.51 16.94 NaN NaN NaN NaN NaN
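augment_lags corresponds to a per-group shift in plain pandas; a minimal sketch on toy data shows why the grouping matters (lags must not leak across stocks):

```python
import pandas as pd

df = pd.DataFrame({
    "stock_code": ["A", "A", "A", "B", "B", "B"],
    "close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# lag computed within each group: the first row of each stock gets NaN
df["close_lag_1"] = df.groupby("stock_code")["close"].shift(1)
print(df)
```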

Generating Rolling Window Features

Code
df_stock_byday \
  .groupby('stock_code') \
  .augment_rolling(
                date_column = 'trade_time',
                value_column = 'close',
                window = [2,7],
                window_func = ['mean', ('std', lambda x: x.std())]
            )
stock_code trade_time open close high low volume amount change_pct change turnover_ratio pre_close close_rolling_mean_win_2 close_rolling_std_win_2 close_rolling_mean_win_7 close_rolling_std_win_7
trade_date
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 0.80 17.68 NaN NaN NaN NaN
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 4.61 36.62 NaN NaN NaN NaN
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 6.01 36.04 36.910 0.870 36.910000 0.870000
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 0.94 16.94 16.725 0.215 16.725000 0.215000
2021-01-06 000001 2021-01-06 16.42 17.90 17.90 16.34 193494500.0 3.648522e+09 8.42 1.39 1.00 16.51 17.205 0.695 17.116667 0.581053
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-05-14 000001 2025-05-14 11.29 11.43 11.47 11.25 168351400.0 1.912995e+09 1.24 0.14 0.87 11.29 11.360 0.070 11.157143 0.147911
2025-05-15 002241 2025-05-15 22.67 22.04 22.67 22.00 61836500.0 1.371791e+09 -2.86 -0.65 2.01 22.69 22.365 0.325 22.382857 0.494302
2025-05-15 000001 2025-05-15 11.43 11.39 11.50 11.38 103546000.0 1.183733e+09 -0.35 -0.04 0.53 11.43 11.410 0.020 11.218571 0.142471
2025-05-16 000001 2025-05-16 11.37 11.38 11.43 11.31 91474400.0 1.039224e+09 -0.09 -0.01 0.47 11.39 11.385 0.005 11.268571 0.128222
2025-05-16 002241 2025-05-16 21.90 22.00 22.20 21.81 44581800.0 9.825537e+08 -0.18 -0.04 1.45 22.04 22.020 0.020 22.422857 0.448226

2112 rows × 16 columns
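Likewise, augment_rolling corresponds to a grouped rolling window in plain pandas; a minimal sketch on toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "stock_code": ["A", "A", "A", "B", "B", "B"],
    "close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# 2-row rolling mean computed per group; the extra index level produced by
# groupby-rolling is dropped so the result aligns with the original rows
df["close_roll_mean_2"] = (
    df.groupby("stock_code")["close"]
      .rolling(window=2).mean()
      .reset_index(level=0, drop=True)
)
print(df)
```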

Generating Future Dates

From a DataFrame

Code
df_stock_byday \
        .groupby('stock_code') \
        .future_frame('trade_time', length_out = 365) \
        .augment_timeseries_signature('trade_time') \
        .query("close.isna()") \
        .tail() 
stock_code trade_time open close high low volume amount change_pct change ... trade_time_mday trade_time_qday trade_time_yday trade_time_weekend trade_time_hour trade_time_minute trade_time_second trade_time_msecond trade_time_nsecond trade_time_am_pm
2921 002241 2026-05-12 NaN NaN NaN NaN NaN NaN NaN NaN ... 12.0 42.0 132.0 0 0.0 0.0 0.0 0.0 0.0 am
2922 002241 2026-05-13 NaN NaN NaN NaN NaN NaN NaN NaN ... 13.0 43.0 133.0 0 0.0 0.0 0.0 0.0 0.0 am
2923 002241 2026-05-14 NaN NaN NaN NaN NaN NaN NaN NaN ... 14.0 44.0 134.0 0 0.0 0.0 0.0 0.0 0.0 am
2924 002241 2026-05-15 NaN NaN NaN NaN NaN NaN NaN NaN ... 15.0 45.0 135.0 0 0.0 0.0 0.0 0.0 0.0 am
2925 002241 2026-05-16 NaN NaN NaN NaN NaN NaN NaN NaN ... 16.0 46.0 136.0 0 0.0 0.0 0.0 0.0 0.0 am

5 rows × 41 columns

From a Series

Code
pd.Series(pd.date_range("2023", "2024", freq = "D")) \
  .make_future_timeseries(12) \
  .get_timeseries_signature()
idx idx_index_num idx_year idx_year_iso idx_yearstart idx_yearend idx_leapyear idx_half idx_quarter idx_quarteryear ... idx_mday idx_qday idx_yday idx_weekend idx_hour idx_minute idx_second idx_msecond idx_nsecond idx_am_pm
0 2024-01-02 1704153600 2024 2024 0 0 1 1 1 2024Q1 ... 2 2 2 0 0 0 0 0 0 am
1 2024-01-03 1704240000 2024 2024 0 0 1 1 1 2024Q1 ... 3 3 3 0 0 0 0 0 0 am
2 2024-01-04 1704326400 2024 2024 0 0 1 1 1 2024Q1 ... 4 4 4 0 0 0 0 0 0 am
3 2024-01-05 1704412800 2024 2024 0 0 1 1 1 2024Q1 ... 5 5 5 0 0 0 0 0 0 am
4 2024-01-06 1704499200 2024 2024 0 0 1 1 1 2024Q1 ... 6 6 6 0 0 0 0 0 0 am
5 2024-01-07 1704585600 2024 2024 0 0 1 1 1 2024Q1 ... 7 7 7 1 0 0 0 0 0 am
6 2024-01-08 1704672000 2024 2024 0 0 1 1 1 2024Q1 ... 8 8 8 0 0 0 0 0 0 am
7 2024-01-09 1704758400 2024 2024 0 0 1 1 1 2024Q1 ... 9 9 9 0 0 0 0 0 0 am
8 2024-01-10 1704844800 2024 2024 0 0 1 1 1 2024Q1 ... 10 10 10 0 0 0 0 0 0 am
9 2024-01-11 1704931200 2024 2024 0 0 1 1 1 2024Q1 ... 11 11 11 0 0 0 0 0 0 am
10 2024-01-12 1705017600 2024 2024 0 0 1 1 1 2024Q1 ... 12 12 12 0 0 0 0 0 0 am
11 2024-01-13 1705104000 2024 2024 0 0 1 1 1 2024Q1 ... 13 13 13 0 0 0 0 0 0 am

12 rows × 30 columns

Exploring the Data

Plotting a line chart to observe the daily closing price trend

Code
df_stock_byday['year'] = pd.to_datetime(df_stock_byday['trade_time']).dt.year
df_stock_byday. \
  reset_index(). \
  dropna(). \
  groupby("stock_code"). \
  plot_timeseries(date_column  = 'trade_time',
  facet_ncol = 2, 
  color_column = 'year',
  facet_scales = "free",
  value_column = 'close')
(Figure: faceted time series plot of daily close for 002241 and 000001, colored by year, 2021-2025)

For stock time series data, a candlestick (K-line) chart can also be plotted

Code
from datetime import datetime

import vectorbt as vbt

df_stock_byday = adata.stock.market.get_market(stock_code='002241', k_type=1, start_date='2021-01-01').assign(stock_code='002241')

df_stock_byday[['open', 'close', 'volume', 'high', 'low']] = df_stock_byday[['open', 'close', 'volume', 'high', 'low']].astype(np.float64)

df_stock_byday['trade_date'] = pd.to_datetime(df_stock_byday['trade_date'], format='%Y-%m-%d')
df_stock_byday['trade_time'] = pd.to_datetime(df_stock_byday['trade_time'])  # infer_datetime_format is deprecated in pandas 2.x
plot_Candlestick = df_stock_byday.vbt.ohlcv.plot(plot_type='Candlestick')
plot_Candlestick.update_layout(height=None,width=None)
plot_Candlestick.show()

Anomaly Detection

Code
df_stock_byday  = df_stock_byday.loc[df_stock_byday.stock_code == '002241']
# df_stock_byday = df_stock_byday.pad_by_time('trade_time', freq = 'B')  # resample by business day; missing dates are filled with NA
# df_stock_byday.set_index(['trade_time'],inplace = True)
# df_stock_byday.index.freq = 'B'

anomalize_df = tk.anomalize(
    data          = df_stock_byday,
    date_column   = 'trade_time',
    value_column  = 'close',
    period        = 7,
    iqr_alpha     = 0.05, # using the default
    clean_alpha   = 0.75, # using the default
    clean         = "min_max"
)

anomalize_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 1056 rows of 12 columns
trade_time:         datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), ...
observed:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, ...
seasonal:           float64           [4.79967331128657, -2.865434819973 ...
seasadj:            float64           [31.24032668871343, 40.64543481997 ...
trend:              float64           [40.98373500443075, 40.73328348123 ...
remainder:          float64           [-9.743408315717318, -0.0878486612 ...
anomaly:            object            ['Yes', 'No', 'No', 'No', 'No', 'N ...
anomaly_score:      float64           [10.49732636731087, 0.841766712858 ...
anomaly_direction:  int32             [-1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
recomposed_l1:      float64           [41.975004554364844, 34.0594448999 ...
recomposed_l2:      float64           [51.0996481802569, 43.184088525804 ...
observed_clean:     float64           [43.115585007601354, 37.78, 36.98, ...

Plotting Anomalies

Code
# Plot anomalies
tk.plot_anomalies(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'Anomalies'
)
(Figure: observed close series with detected anomalies highlighted, Jan 2021 to Jul 2025)

After Cleaning Anomalies

Code
tk.plot_anomalies_cleaned(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'After cleaning anomalies'
)
(Figure: observed vs. observed_clean series after anomaly cleaning)

Plotting the Seasonal Decomposition

Time series decomposition is a statistical method that splits a time series into components, each representing one of the basic categories of patterns. These components typically include:

  • Trend (T): the long-term upward or downward movement

  • Seasonality (S): periodic fluctuations that repeat within a year or less

  • Cyclicality (C): periodic fluctuations with a period longer than the seasonal one

  • Irregularity (I): random, unpredictable variation
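The components above can be sketched with a classical moving-average decomposition in plain pandas (a toy weekly-seasonal series; pytimetk's anomalize uses its own decomposition internally, so this is only an illustration):

```python
import numpy as np
import pandas as pd

# toy series: linear trend plus a repeating weekly pattern
idx = pd.date_range("2021-01-01", periods=56, freq="D")
y = pd.Series(np.arange(56) * 0.5 + np.tile([0.0, 1, 2, 3, 2, 1, 0], 8), index=idx)

trend = y.rolling(window=7, center=True).mean()                 # long-run level
detrended = y - trend
seasonal = detrended.groupby(idx.dayofweek).transform("mean")   # weekly pattern
remainder = y - trend - seasonal                                # irregular part

decomp = pd.DataFrame({"observed": y, "trend": trend,
                       "seasonal": seasonal, "remainder": remainder})
print(decomp.head(10))
```

On this noise-free toy series the remainder is essentially zero away from the window edges, which is exactly what a good decomposition should leave behind.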

Code
tk.plot_anomalies_decomp(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'Time series decomposition'
)
(Figure: decomposition panels for observed, seasonal, trend, and remainder)

Time Series Models

Univariate time series models | Multivariate time series models
Use only one variable | Use multiple variables
Cannot use external data | Can use external data
Based only on the relationship between past and present | Based on the past-present relationship and on relationships between variables
Predict the future value of that one variable | Predict future values of one or more variables
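The contrast in the table can be made concrete with a tiny least-squares sketch (synthetic data, numpy only; the coefficients 0.6 and 0.3 are made up for illustration). A univariate design uses only the series' own past, while the multivariate design adds a second variable and fits better when that variable genuinely drives the target:

```python
import numpy as np

rng = np.random.default_rng(0)
vol = rng.normal(size=101)
close = np.zeros(101)
for t in range(1, 101):
    # close depends on its own past AND on yesterday's volume
    close[t] = 0.6 * close[t - 1] + 0.3 * vol[t - 1] + 0.01 * rng.normal()

y = close[1:]
X_uni = close[:-1].reshape(-1, 1)                  # univariate: own lag only
X_multi = np.column_stack([close[:-1], vol[:-1]])  # multivariate: adds volume

# ordinary least squares for both designs
stds = {}
for name, X in [("univariate", X_uni), ("multivariate", X_multi)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    stds[name] = (y - X @ beta).std()
    print(f"{name}: residual std = {stds[name]:.4f}")
```

Because the univariate design is nested inside the multivariate one, the multivariate in-sample residual can never be larger, and here it is much smaller since volume really does drive the target.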

Random Forest

Code
import pandas as pd
import numpy as np
import pytimetk as tk

from sklearn.ensemble import RandomForestRegressor


df_stock_byday.glimpse()
dset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])

dset = dset.drop(columns=[
    'id', # This column can be removed as it is equivalent to 'Dept'
    'Store', # This column has only one possible value
    'Type', # This column has only one possible value
    'Size', # This column has only one possible value
    'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',
    'IsHoliday', 'Temperature', 'Fuel_Price', 'CPI',
       'Unemployment'])

dset.head()
sales_df = dset
sales_df_with_futureframe = sales_df \
    .groupby('Dept') \
    .future_frame(
        date_column = 'Date',
        length_out  = 5
    )
    
sales_df_dates = sales_df_with_futureframe.augment_timeseries_signature(date_column = 'Date')
sales_df_dates.head(10)


df_with_lags = sales_df_dates \
    .groupby('Dept') \
    .augment_lags(
        date_column  = 'Date',
        value_column = 'Weekly_Sales',
        lags         = [5,6,7,8,9]
    )
    
lag_columns = [col for col in df_with_lags.columns if 'lag' in col]

df_with_rolling = df_with_lags \
    .groupby('Dept') \
    .augment_rolling(
        date_column  = 'Date',
        value_column = lag_columns,
        window  = 4,
        window_func = 'mean',
        threads = 1 # Change to -1 to use all available cores
    ) 
df_with_rolling[df_with_rolling.Dept ==1].head(10)
df_with_lags.head(5)

all_lag_columns = [col for col in df_with_rolling.columns if 'lag' in col]

df_no_nas = df_with_rolling \
    .dropna(subset=all_lag_columns, inplace=False)

df_no_nas.head()

future = df_no_nas[df_no_nas.Weekly_Sales.isnull()].copy()  # copies avoid SettingWithCopyWarning
train = df_no_nas[df_no_nas.Weekly_Sales.notnull()].copy()  # when columns are assigned below


train_columns = [ 
    'Dept'
    , 'Date_year'
    , 'Date_month'
    , 'Date_yweek'
    , 'Date_mweek'
    , 'Weekly_Sales_lag_5'
    , 'Weekly_Sales_lag_6'
    , 'Weekly_Sales_lag_7'
    , 'Weekly_Sales_lag_8'
    , 'Weekly_Sales_lag_5_rolling_mean_win_4'
    , 'Weekly_Sales_lag_6_rolling_mean_win_4'
    , 'Weekly_Sales_lag_7_rolling_mean_win_4'
    , 'Weekly_Sales_lag_8_rolling_mean_win_4'
    ]

X = train[train_columns]
y = train[['Weekly_Sales']]

model = RandomForestRegressor(random_state=123)
model = model.fit(X, y)

predicted_values = model.predict(future[train_columns])
future['y_pred'] = predicted_values

future.head(10)

train['type'] = 'actuals'
future['type'] = 'prediction'

full_df = pd.concat([train, future])

full_df.head(10)

full_df['Weekly_Sales'] = np.where(full_df.type =='actuals', full_df.Weekly_Sales, full_df.y_pred)

full_df \
    .groupby('Dept') \
    .plot_timeseries(
        date_column = 'Date',
        value_column = 'Weekly_Sales',
        color_column = 'type',
        smooth = False,
        smooth_alpha = 0,
        facet_ncol = 2,
        facet_scales = "free",
        y_intercept_color = tk.palette_timetk()['steel_blue'],
        width = 800,
        height = 600,
        engine = 'plotly'
    )
<class 'pandas.core.frame.DataFrame'>: 1056 rows of 13 columns
stock_code:      object            ['002241', '002241', '002241', '00224 ...
trade_time:      datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), Ti ...
trade_date:      datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), Ti ...
open:            float64           [36.65, 35.79, 37.65, 37.14, 38.75, 3 ...
close:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, 40 ...
high:            float64           [36.65, 38.2, 38.05, 38.8, 39.15, 40. ...
low:             float64           [35.7, 35.49, 36.7, 36.9, 37.58, 38.0 ...
volume:          float64           [128543900.0, 167528000.0, 95322100.0 ...
amount:          float64           [4707993856.0, 6236502016.0, 36116663 ...
change_pct:      float64           [-1.58, 4.83, -2.12, 3.57, 0.63, 5.6, ...
change:          float64           [-0.58, 1.74, -0.8, 1.32, 0.24, 2.16, ...
turnover_ratio:  float64           [4.61, 6.01, 3.42, 4.89, 3.62, 6.23,  ...
pre_close:       float64           [36.62, 36.04, 37.78, 36.98, 38.3, 38 ...
(Figure: Weekly_Sales actuals and predictions by Dept, faceted time series plot)
Code
# !pip install git+https://github.com/business-science/pymodeltime.git
# 
# !pip install autogluon
# 
# 
# !pip install h2o

pytorch-forecasting

pytorch-forecasting is a PyTorch-based time series forecasting library designed for complex time series data. It provides a range of advanced deep learning models for forecasting and supports custom data processing and model extensions.

Model highlights:

  • Temporal Fusion Transformer (TFT): a Transformer-based forecasting model that handles multivariate time series and uses attention to focus dynamically on the most relevant time steps and features. It performs well in many real-world applications, especially on long sequences and multivariate data.

  • N-Beats: a neural-network forecasting model that stacks simple network blocks to capture both short- and long-term patterns in a series. It is known for strong interpretability and fast training.

  • DeepAR and DeepVAR: recurrent neural network (RNN) based forecasting models that handle multivariate time series and support autoregressive modeling.

Time Series Data Processing

Code
import warnings
warnings.filterwarnings("ignore")  # avoid printing out absolute paths
import copy
from pathlib import Path
import warnings
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
import numpy as np
import pandas as pd
import torch

from pytorch_forecasting import Baseline, TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import MAE, SMAPE, PoissonLoss, QuantileLoss
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import (
    optimize_hyperparameters,
)

Preprocess the stallion dataset, which records the sales of various beverages

Code
from pytorch_forecasting.data.examples import get_stallion_data

data = get_stallion_data()

# add a time index that increments by 1 per step
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()

# add a month feature as a categorical
data["month"] = data.date.dt.month.astype(str).astype(
    "category"
) 
# add a tiny constant (1e-8) to volume so that zero values do not break the log transform
data["log_volume"] = np.log(data.volume + 1e-8)
# group by time_idx and sku. observed=True keeps only category combinations
# actually observed in the data, which matters for categorical columns: with
# False (the default), absent combinations are also included in the groupby,
# which can produce incorrect results.
data["avg_volume_by_sku"] = data.groupby(
    ["time_idx", "sku"], observed=True
).volume.transform("mean")

data["avg_volume_by_agency"] = data.groupby(
    ["time_idx", "agency"], observed=True
).volume.transform("mean")

# convert the one-hot encoded holiday columns into categorical features
special_days = [
    "easter_day",
    "good_friday",
    "new_year",
    "christmas",
    "labor_day",
    "independence_day",
    "revolution_day_memorial",
    "regional_games",
    "fifa_u_17_world_cup",
    "football_gold_cup",
    "beer_capital",
    "music_fest",
]
data[special_days] = (
    data[special_days].apply(lambda x: x.map({0: "-", 1: x.name})).astype("category")
)
data.sample(10, random_state=521)
agency sku volume date industry_volume soda_volume avg_max_temp price_regular price_actual discount ... football_gold_cup beer_capital music_fest discount_in_percent timeseries time_idx month log_volume avg_volume_by_sku avg_volume_by_agency
291 Agency_25 SKU_03 0.5076 2013-01-01 492612703 718394219 25.845238 1264.162234 1152.473405 111.688829 ... - - - 8.835008 228 0 1 -0.678062 1225.306376 99.650400
871 Agency_29 SKU_02 8.7480 2015-01-01 498567142 762225057 27.584615 1316.098485 1296.804924 19.293561 ... - - - 1.465966 177 24 1 2.168825 1634.434615 11.397086
19532 Agency_47 SKU_01 4.9680 2013-09-01 454252482 789624076 30.665957 1269.250000 1266.490490 2.759510 ... - - - 0.217413 322 8 9 1.603017 2625.472644 48.295650
2089 Agency_53 SKU_07 21.6825 2013-10-01 480693900 791658684 29.197727 1193.842373 1128.124395 65.717978 ... - beer_capital - 5.504745 240 9 10 3.076505 38.529107 2511.035175
9755 Agency_17 SKU_02 960.5520 2015-03-01 515468092 871204688 23.608120 1338.334248 1232.128069 106.206179 ... - - music_fest 7.935699 259 26 3 6.867508 2143.677462 396.022140
7561 Agency_05 SKU_03 1184.6535 2014-02-01 425528909 734443953 28.668254 1369.556376 1161.135214 208.421162 ... - - - 15.218151 21 13 2 7.077206 1566.643589 1881.866367
19204 Agency_11 SKU_05 5.5593 2017-08-01 623319783 1049868815 31.915385 1922.486644 1651.307674 271.178970 ... - - - 14.105636 17 55 8 1.715472 1385.225478 109.699200
8781 Agency_48 SKU_04 4275.1605 2013-03-01 509281531 892192092 26.767857 1761.258209 1546.059670 215.198539 ... - - music_fest 12.218455 151 2 3 8.360577 1757.950603 1925.272108
2540 Agency_07 SKU_21 0.0000 2015-10-01 544203593 761469815 28.987755 0.000000 0.000000 0.000000 ... - - - 0.000000 300 33 10 -18.420681 0.000000 2418.719550
12084 Agency_21 SKU_03 46.3608 2017-04-01 589969396 940912941 32.478910 1675.922116 1413.571789 262.350327 ... - - - 15.654088 181 51 4 3.836454 2034.293024 109.381800

10 rows × 31 columns

Code
data.describe()
volume date industry_volume soda_volume avg_max_temp price_regular price_actual discount avg_population_2017 avg_yearly_household_income_2017 discount_in_percent timeseries time_idx log_volume avg_volume_by_sku avg_volume_by_agency
count 21000.000000 21000 2.100000e+04 2.100000e+04 21000.000000 21000.000000 21000.000000 21000.000000 2.100000e+04 21000.000000 21000.000000 21000.00000 21000.000000 21000.000000 21000.000000 21000.000000
mean 1492.403982 2015-06-16 20:48:00 5.439214e+08 8.512000e+08 28.612404 1451.536344 1267.347450 184.374146 1.045065e+06 151073.494286 10.574884 174.50000 29.500000 2.464118 1492.403982 1492.403982
min 0.000000 2013-01-01 00:00:00 4.130518e+08 6.964015e+08 16.731034 0.000000 -3121.690141 0.000000 1.227100e+04 90240.000000 0.000000 0.00000 0.000000 -18.420681 0.000000 0.000000
25% 8.272388 2014-03-24 06:00:00 5.090553e+08 7.890880e+08 25.374816 1311.547158 1178.365653 54.935108 6.018900e+04 110057.000000 3.749628 87.00000 14.750000 2.112923 932.285496 113.420250
50% 158.436000 2015-06-16 00:00:00 5.512000e+08 8.649196e+08 28.479272 1495.174592 1324.695705 138.307225 1.232242e+06 131411.000000 8.948990 174.50000 29.500000 5.065351 1402.305264 1730.529771
75% 1774.793475 2016-09-08 12:00:00 5.893715e+08 9.005551e+08 31.568405 1725.652080 1517.311427 272.298630 1.729177e+06 206553.000000 15.647058 262.00000 44.250000 7.481439 2195.362302 2595.316500
max 22526.610000 2017-12-01 00:00:00 6.700157e+08 1.049869e+09 45.290476 19166.625000 4925.404000 19166.625000 3.137874e+06 247220.000000 226.740147 349.00000 59.000000 10.022453 4332.363750 5884.717375
std 2711.496882 NaN 6.288022e+07 7.824340e+07 3.972833 683.362417 587.757323 257.469968 9.291926e+05 50409.593114 9.590813 101.03829 17.318515 8.178218 1051.790829 1328.239698

Creating the Dataset and Data Loaders

Code
# forecast 6 months ahead
max_prediction_length = 6
# maximum encoder length: the longest history window the dataset will use
max_encoder_length = 24
# training cutoff: maximum time index minus the prediction length
training_cutoff = data["time_idx"].max() - max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    min_encoder_length=max_encoder_length
    // 2,  # keep encoder length long (as it is in the validation set)
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["agency", "sku"],
    static_reals=["avg_population_2017", "avg_yearly_household_income_2017"],
    time_varying_known_categoricals=["special_days", "month"],
    variable_groups={
        "special_days": special_days
    },  # categorical variables within a group are treated as a single variable
    time_varying_known_reals=["time_idx", "price_regular", "discount_in_percent"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[
        "volume",
        "log_volume",
        "industry_volume",
        "soda_volume",
        "avg_max_temp",
        "avg_volume_by_agency",
        "avg_volume_by_sku",
    ],
    target_normalizer=GroupNormalizer(
        groups=["agency", "sku"], transformation="softplus"
    ),  # use softplus and normalize by group
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

# configure prediction of the last max_prediction_length time points of each series
validation = TimeSeriesDataSet.from_dataset(
    training, data, predict=True, stop_randomization=True
)

# create data loaders for the training and validation sets
batch_size = 128  # between 32 and 128
train_dataloader = training.to_dataloader(
    train=True, batch_size=batch_size, num_workers=0
)
val_dataloader = validation.to_dataloader(
    train=False, batch_size=batch_size * 10, num_workers=0
)

Training the Model

Creating a baseline model

Code
baseline_predictions = Baseline().predict(val_dataloader, return_y=True)
MAE()(baseline_predictions.output, baseline_predictions.y)
tensor(293.0088)

Finding the Learning Rate

Code
# create a Trainer object to manage the training process
pl.seed_everything(42)
trainer = pl.Trainer(
    accelerator="cpu",
    # gradient clipping guards against exploding gradients, especially in RNNs;
    # gradients above this value are rescaled to keep training numerically stable
    gradient_clip_val=0.1,
)


tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.03,
    hidden_size=8,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    loss=QuantileLoss(),
    optimizer="ranger",
    # reduce learning rate if no improvement in validation loss after x epochs
    # reduce_on_plateau_patience=1000,
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")
# find optimal learning rate
# from lightning.pytorch.tuner import Tuner

# res = Tuner(trainer).lr_find(
#     tft,
#     train_dataloaders=train_dataloader,
#     val_dataloaders=val_dataloader,
#     max_lr=10.0,
#     min_lr=1e-6
# )

# print(f"suggested learning rate: {res.suggestion()}")
# fig = res.plot(show=True, suggest=True)
# fig.show()
Number of parameters in network: 13.5k
Code
# configure network and trainer
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

trainer = pl.Trainer(
    max_epochs=50,
    accelerator="cpu",
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=50,  # limits each epoch to 50 training batches; remove for full training
    # fast_dev_run=True,  # comment in to check that the network and dataset have no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=2,
    dropout=0.1,
    hidden_continuous_size=8,
    loss=QuantileLoss(),
    log_interval=10,  # log every 10 batches
    optimizer="ranger",
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")
Number of parameters in network: 29.4k
Code
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)