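Time Series Analysis in Python

python
time series
data analysis
data cleaning
machine learning
Time series analysis, plotting, and modeling with pytimetk and related tools
Author

不止BI

Published

June 26, 2024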

Data Preparation

Historical Stock Data

Code
import adata
import numpy as np
import pandas as pd
# k_type: candlestick interval: 1 = daily, 2 = weekly, 3 = monthly (default: 1, daily)
stock_codes = ['002241', '000001']
df_stock_byday = pd.concat([
    adata.stock.market.get_market(stock_code=code, k_type=1, start_date='2021-01-01')
         .assign(stock_code=code)
    for code in stock_codes
])

# Make sure the price and volume columns are numeric
price_cols = ['open', 'close', 'volume', 'high', 'low']
df_stock_byday[price_cols] = df_stock_byday[price_cols].astype(np.float64)

Common Data Operations

Common Date Classes

Getting the Current Date

Code
from datetime import date
today = date.today()
print(today)
type(today)
2025-04-12
datetime.date

Formatting

Converting a Date to a String

Code
import datetime
today_str = today.strftime('%m/%d/%Y')
print(today_str)
type(today_str)
04/12/2025
str
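
As a small aside, the same formatting can be written with `isoformat()` or an f-string format spec. The sketch below uses a fixed date (an assumption, chosen to match the sample output above) so that the results are reproducible.

```python
from datetime import date

# Hypothetical fixed date, matching the sample output above
fixed_day = date(2025, 4, 12)

us_style = fixed_day.strftime('%m/%d/%Y')  # month/day/year
iso_style = fixed_day.isoformat()          # ISO 8601: YYYY-MM-DD
compact = f"{fixed_day:%Y%m%d}"            # f-strings accept the same format codes

print(us_style, iso_style, compact)
```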

Converting a String to a Date

Code
datetime.datetime.strptime('2024-12-01',"%Y-%m-%d")
datetime.datetime(2024, 12, 1, 0, 0)
Code
from dateutil import parser

date_string = "2024-03-17 12:00 AM"
dt = parser.parse(date_string)
print(dt)  # Output: 2024-03-17 00:00:00
2024-03-17 00:00:00
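
When the incoming format is known to be one of a few candidates, a strict alternative to `parser.parse` is to try each `strptime` format in turn. The helper below is a minimal sketch; `parse_with_formats` and its default format list are hypothetical names chosen for illustration.

```python
from datetime import datetime

def parse_with_formats(text, formats=("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %I:%M %p")):
    """Return the first successful strptime parse, or raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"no known format matches {text!r}")

iso_parsed = parse_with_formats("2024-12-01")
us_parsed = parse_with_formats("04/12/2025")
midnight = parse_with_formats("2024-03-17 12:00 AM")
```

Unlike `parser.parse`, this approach rejects anything outside the listed formats, which can be preferable when silently misreading day/month order would be costly.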

Date Arithmetic

Date Differences

Code
from dateutil import relativedelta

from_date = datetime.datetime(2021, 1, 1)
to_date = datetime.datetime(2021, 3, 1)

difference = relativedelta.relativedelta(to_date, from_date)
print(difference)
relativedelta(months=+2)

Adding to a Date

Code
new_date = from_date + relativedelta.relativedelta(months=1)
print(new_date)
2021-02-01 00:00:00
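
To make the difference between calendar-aware and fixed-length arithmetic concrete, the sketch below compares `relativedelta` with a plain `datetime` subtraction and shows how adding a month to a month-end date is clipped to the last valid day. The dates are arbitrary examples.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

start = datetime(2021, 1, 1)
end = datetime(2021, 3, 15)

calendar_gap = relativedelta(end, start)  # calendar-aware: months and days
elapsed_days = (end - start).days         # plain timedelta: total days

# Adding one month to January 31 lands on the last day of February
month_end = datetime(2021, 1, 31) + relativedelta(months=1)

print(calendar_gap, elapsed_days, month_end)
```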

Time Zones

Code
from dateutil import tz
from datetime import datetime

utc_time = datetime.now(tz.tzutc())
print(utc_time)
2025-04-12 13:46:40.263500+00:00

Converting Time Zones

Code
local_tz = tz.gettz('America/New_York')  # Get the target time zone
local_time = utc_time.astimezone(local_tz)
print(local_time)  # Print the converted local time
2025-04-12 09:46:40.263500-04:00
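
The -04:00 offset above reflects daylight saving time. As a brief sketch with two arbitrary fixed instants, the same conversion gives a different offset in winter:

```python
from datetime import datetime, timedelta
from dateutil import tz

new_york = tz.gettz('America/New_York')

# Two fixed UTC instants (arbitrary example dates), one in winter and one in summer
winter_utc = datetime(2025, 1, 15, 12, 0, tzinfo=tz.tzutc())
summer_utc = datetime(2025, 7, 15, 12, 0, tzinfo=tz.tzutc())

winter_local = winter_utc.astimezone(new_york)  # standard time
summer_local = summer_utc.astimezone(new_york)  # daylight saving time

print(winter_local, summer_local)
```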

Generating Date Sequences

Code
from dateutil import rrule
from datetime import datetime

start_date = datetime(2021, 1, 1)
end_date = datetime(2021, 12, 31)

# Generate the first day of each month in the range
for dt in rrule.rrule(rrule.MONTHLY, dtstart=start_date, until=end_date):
    print(dt)
2021-01-01 00:00:00
2021-02-01 00:00:00
2021-03-01 00:00:00
2021-04-01 00:00:00
2021-05-01 00:00:00
2021-06-01 00:00:00
2021-07-01 00:00:00
2021-08-01 00:00:00
2021-09-01 00:00:00
2021-10-01 00:00:00
2021-11-01 00:00:00
2021-12-01 00:00:00
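
If the dates are headed for a pandas object anyway, the same sequence can be produced directly with `pd.date_range` and the month-start alias. A minimal sketch:

```python
import pandas as pd

# 'MS' is the month-start frequency alias
month_starts = pd.date_range(start='2021-01-01', end='2021-12-31', freq='MS')

print(month_starts)
```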

Inspecting the Data Structure

Code
import pytimetk as tk

df_stock_byday.glimpse()
<class 'pandas.core.frame.DataFrame'>: 2068 rows of 13 columns
stock_code:      object            ['002241', '002241', '002241', '00224 ...
trade_time:      object            ['2021-01-04 00:00:00', '2021-01-05 0 ...
trade_date:      object            ['2021-01-04', '2021-01-05', '2021-01 ...
open:            float64           [36.65, 35.79, 37.65, 37.14, 38.75, 3 ...
close:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, 40 ...
high:            float64           [36.65, 38.2, 38.05, 38.8, 39.15, 40. ...
low:             float64           [35.7, 35.49, 36.7, 36.9, 37.58, 38.0 ...
volume:          float64           [128543900.0, 167528000.0, 95322100.0 ...
amount:          float64           [4707993856.0, 6236502016.0, 36116663 ...
change_pct:      float64           [-1.58, 4.83, -2.12, 3.57, 0.63, 5.6, ...
change:          float64           [-0.58, 1.74, -0.8, 1.32, 0.24, 2.16, ...
turnover_ratio:  float64           [4.61, 6.01, 3.42, 4.89, 3.62, 6.23,  ...
pre_close:       float64           [36.62, 36.04, 37.78, 36.98, 38.3, 38 ...

Converting to Datetime

Code
df_stock_byday['trade_date'] = pd.to_datetime(df_stock_byday['trade_date'], format='%Y-%m-%d')
# infer_datetime_format is deprecated since pandas 2.0; format inference is now the default
df_stock_byday['trade_time'] = pd.to_datetime(df_stock_byday['trade_time'])
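
Real-world date columns often contain a few malformed entries. As a hedged sketch on made-up values, `errors='coerce'` turns anything that does not match the format into `NaT` instead of raising:

```python
import pandas as pd

# Made-up sample with one malformed entry
raw = pd.Series(['2021-01-04', '2021-01-05', 'not a date'])

# Values that do not match the format become NaT rather than raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum(), 'value(s) could not be parsed')
```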

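Adding a Frequency

pandas provides a set of frequency strings (also called offset aliases) for defining the frequency of a time series. Some common ones are listed below. pandas 2.2 and later deprecate several of the older spellings in favor of 'ME', 'BME', 'QE', 'BQE', 'YE', 'BYE', 'h', 'min', 's', 'ms', 'us', and 'ns', so the older forms may emit a FutureWarning.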

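  1. ‘B’: business day

  2. ‘D’: calendar day

  3. ‘W’: weekly

  4. ‘M’: month end

  5. ‘BM’: business month end

  6. ‘MS’: month start

  7. ‘BMS’: business month start

  8. ‘Q’: quarter end

  9. ‘BQ’: business quarter end

  10. ‘QS’: quarter start

  11. ‘BQS’: business quarter start

  12. ‘A’ or ‘Y’: year end

  13. ‘BA’ or ‘BY’: business year end

  14. ‘AS’ or ‘YS’: year start

  15. ‘BAS’ or ‘BYS’: business year start

  16. ‘H’: hourly

  17. ‘T’ or ‘min’: minutely

  18. ‘S’: secondly

  19. ‘L’ or ‘ms’: millisecond

  20. ‘U’: microsecond

  21. ‘N’: nanosecond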

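Custom frequencies:

  • You can also build custom frequencies by prefixing a base frequency with a multiple, for example:

    • ‘2D’: every 2 days

    • ‘3W’: every 3 weeks

    • ‘4H’: every 4 hours

    • ‘1H30T’: every 1 hour 30 minutes

Compound frequencies:

  • You can chain several frequencies together; their durations are added.

    • ‘1D1H’: 1 day 1 hour

    • ‘1H30T’: 1 hour 30 minutes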
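
A few of the aliases above can be tried out directly with `pd.date_range`. The sketch below sticks to spellings that work across pandas versions ('B', 'MS', '2D', and '90min', which is the normalized form of 1 hour 30 minutes).

```python
import pandas as pd

# 2023-01-01 is a Sunday, so the business-day range starts on Monday 2023-01-02
business_days = pd.date_range('2023-01-01', periods=5, freq='B')

# First day of each month
month_starts = pd.date_range('2023-01-01', periods=3, freq='MS')

# A multiple of a base frequency
every_two_days = pd.date_range('2023-01-01', '2023-01-10', freq='2D')

# 1 hour 30 minutes written as a single multiple of minutes
ninety_minutes = pd.date_range('2023-01-01', '2023-01-10', freq='90min')

print(business_days[0], len(every_two_days), len(ninety_minutes))
```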

Code
date_range_90min = pd.date_range(start='2023-01-01', end='2023-01-10', freq='1H30T')

date_range_90min
DatetimeIndex(['2023-01-01 00:00:00', '2023-01-01 01:30:00',
               '2023-01-01 03:00:00', '2023-01-01 04:30:00',
               '2023-01-01 06:00:00', '2023-01-01 07:30:00',
               '2023-01-01 09:00:00', '2023-01-01 10:30:00',
               '2023-01-01 12:00:00', '2023-01-01 13:30:00',
               ...
               '2023-01-09 10:30:00', '2023-01-09 12:00:00',
               '2023-01-09 13:30:00', '2023-01-09 15:00:00',
               '2023-01-09 16:30:00', '2023-01-09 18:00:00',
               '2023-01-09 19:30:00', '2023-01-09 21:00:00',
               '2023-01-09 22:30:00', '2023-01-10 00:00:00'],
              dtype='datetime64[ns]', length=145, freq='90min')
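Code
# Pad to business-day frequency; dates without trading data are filled with NA
df_stock_byday = df_stock_byday.pad_by_time('trade_date', freq = 'B')
df_stock_byday.set_index(['trade_date'],inplace = True)
# Each date appears once per stock, so the index is not unique and cannot carry
# a single 'B' frequency; assigning index.freq is therefore omitted here
df_stock_byday.tail()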
            stock_code  trade_time   open  close   high    low       volume        amount  change_pct  change  turnover_ratio  pre_close
trade_date
2025-04-09      002241  2025-04-09  17.40  18.88  19.02  17.11  282500900.0  5.082247e+09       -0.37   -0.07            9.16      18.95
2025-04-10      000001  2025-04-10  10.86  10.90  10.93  10.82   86654800.0  9.433523e+08        0.93    0.10            0.45      10.80
2025-04-10      002241  2025-04-10  20.76  20.71  20.77  19.90  281173100.0  5.758097e+09        9.69    1.83            9.12      18.88
2025-04-11      002241  2025-04-11  20.21  21.08  21.65  19.80  223356400.0  4.598416e+09        1.79    0.37            7.24      20.71
2025-04-11      000001  2025-04-11  10.87  10.89  10.90  10.83   58387900.0  6.343118e+08       -0.09   -0.01            0.30      10.90
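
For readers who want to see what padding to a business-day frequency means without pytimetk, here is a rough plain-pandas sketch on a tiny made-up frame. It illustrates the idea only and is not a description of how `pad_by_time` is implemented.

```python
import pandas as pd

# Tiny made-up frame with gaps: stock A skips Jan 5, stock B skips Jan 5 and 6
prices = pd.DataFrame({
    'stock_code': ['A', 'A', 'B', 'B'],
    'trade_date': pd.to_datetime(['2021-01-04', '2021-01-06', '2021-01-04', '2021-01-07']),
    'close': [10.0, 10.5, 20.0, 21.0],
})

# Reindex each stock onto a business-day calendar; missing days become NaN
padded = (
    prices
    .set_index('trade_date')
    .groupby('stock_code')[['close']]
    .apply(lambda g: g.asfreq('B'))
    .reset_index()
)

print(padded)
```

Padding per group keeps each stock on its own regular calendar, which matters for lag and rolling features computed later.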

Summarizing by Frequency

Long Format

Code
summary_stock_code_df = df_stock_byday \
    .groupby("stock_code") \
    .summarize_by_time(
        date_column  = 'trade_time', 
        value_column = 'close',
        freq         = "MS", 
        agg_func = ['mean', 'median', 'min', 'max'],
        wide_format  = False
    )


summary_stock_code_df.head()
   stock_code  trade_time  close_mean  close_median  close_min  close_max
0      000001  2021-01-01   19.587000        19.855      16.51      21.43
1      000001  2021-02-01   21.913333        22.190      19.72      23.29
2      000001  2021-03-01   19.829565        19.830      18.74      21.35
3      000001  2021-04-01   20.266667        20.020      18.60      21.93
4      000001  2021-05-01   22.341667        22.160      21.41      23.53

Wide Format

Code
df_stock_byday \
    .groupby("stock_code") \
    .summarize_by_time(
        date_column  = 'trade_time', 
        value_column = 'close',
        freq         = "MS",
        agg_func     = 'mean',
        wide_format  = True
    ). \
    head()
   trade_time  close_000001  close_002241
0  2021-01-01     19.587000     39.263500
1  2021-02-01     21.913333     32.644000
2  2021-03-01     19.829565     28.349565
3  2021-04-01     20.266667     31.550952
4  2021-05-01     22.341667     36.576111
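
The long and wide summaries correspond roughly to a grouped aggregation with `pd.Grouper` in plain pandas. The sketch below uses a small made-up frame; the column naming is chosen to resemble the output above and is an assumption, not pytimetk's guaranteed behavior.

```python
import pandas as pd

# Small made-up frame: two stocks, two trading days in each of two months
prices = pd.DataFrame({
    'stock_code': ['A'] * 4 + ['B'] * 4,
    'trade_time': pd.to_datetime(['2021-01-04', '2021-01-05', '2021-02-01', '2021-02-02'] * 2),
    'close': [10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 24.0, 26.0],
})

monthly = prices.groupby(['stock_code', pd.Grouper(key='trade_time', freq='MS')])['close']

# Long format: one row per stock and month, one column per statistic
long_summary = (
    monthly.agg(['mean', 'median', 'min', 'max'])
    .add_prefix('close_')
    .reset_index()
)

# Wide format: one row per month, one column per stock
wide_summary = (
    monthly.mean()
    .unstack('stock_code')
    .add_prefix('close_')
    .reset_index()
)

print(long_summary)
print(wide_summary)
```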

Generating Time-Based Features

Code
df_stock_byday_with_sig = df_stock_byday.augment_timeseries_signature(date_column = 'trade_time')
df_stock_byday_with_sig.head()
stock_code trade_time open close high low volume amount change_pct change ... trade_time_mday trade_time_qday trade_time_yday trade_time_weekend trade_time_hour trade_time_minute trade_time_second trade_time_msecond trade_time_nsecond trade_time_am_pm
trade_date
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 ... 4.0 4.0 4.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 ... 4.0 4.0 4.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 ... 5.0 5.0 5.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 ... 5.0 5.0 5.0 0 0.0 0.0 0.0 0.0 0.0 am
2021-01-06 000001 2021-01-06 16.42 17.90 17.90 16.34 193494500.0 3.648522e+09 8.42 1.39 ... 6.0 6.0 6.0 0 0.0 0.0 0.0 0.0 0.0 am

5 rows × 41 columns
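
A handful of the calendar features in the signature can be reproduced with the pandas `.dt` accessor. This is only a sketch with hypothetical column names; the full signature above contains many more columns.

```python
import pandas as pd

# Three arbitrary dates: a Monday, a Saturday, and the last day of the year
dates = pd.DataFrame({'trade_time': pd.to_datetime(['2021-01-04', '2021-01-09', '2021-12-31'])})
dt = dates['trade_time'].dt

features = dates.assign(
    year=dt.year,
    quarter=dt.quarter,
    month=dt.month,
    day_of_month=dt.day,
    day_of_year=dt.dayofyear,
    day_of_week=dt.dayofweek,                    # Monday = 0
    is_weekend=(dt.dayofweek >= 5).astype(int),  # Saturday or Sunday
)

print(features)
```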

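Generating Lag Features

Code
# The tuple (1, 7) is expanded to every lag from 1 through 7, as the output columns show
df_stock_byday \
  .groupby('stock_code') \
  .augment_lags(date_column = 'trade_time',value_column = 'close',lags = (1, 7)) \
  .head()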
stock_code trade_time open close high low volume amount change_pct change turnover_ratio pre_close close_lag_1 close_lag_2 close_lag_3 close_lag_4 close_lag_5 close_lag_6 close_lag_7
trade_date
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 0.80 17.68 NaN NaN NaN NaN NaN NaN NaN
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 4.61 36.62 NaN NaN NaN NaN NaN NaN NaN
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 0.94 16.94 16.94 NaN NaN NaN NaN NaN NaN
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 6.01 36.04 36.04 NaN NaN NaN NaN NaN NaN
2021-01-06 000001 2021-01-06 16.42 17.90 17.90 16.34 193494500.0 3.648522e+09 8.42 1.39 1.00 16.51 16.51 16.94 NaN NaN NaN NaN NaN
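
Conceptually, a lag feature is a per-group shift of the value column. A minimal pandas sketch on made-up data, with column names chosen to mirror the output above:

```python
import pandas as pd

# Made-up prices for two stocks over three days
prices = pd.DataFrame({
    'stock_code': ['A', 'B', 'A', 'B', 'A', 'B'],
    'trade_time': pd.to_datetime(['2021-01-04'] * 2 + ['2021-01-05'] * 2 + ['2021-01-06'] * 2),
    'close': [10.0, 20.0, 11.0, 21.0, 12.0, 22.0],
}).sort_values(['stock_code', 'trade_time'])

# Shift within each stock so that values never leak across stocks
for lag in range(1, 3):
    prices[f'close_lag_{lag}'] = prices.groupby('stock_code')['close'].shift(lag)

print(prices)
```

Sorting by date inside each group before shifting is what makes "lag 1" mean "previous trading day".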

Generating Rolling-Window Features

Code
df_stock_byday \
  .groupby('stock_code') \
  .augment_rolling(
                date_column = 'trade_time',
                value_column = 'close',
                window = [2,7],
                window_func = ['mean', ('std', lambda x: x.std())]
            )
stock_code trade_time open close high low volume amount change_pct change turnover_ratio pre_close close_rolling_mean_win_2 close_rolling_std_win_2 close_rolling_mean_win_7 close_rolling_std_win_7
trade_date
2021-01-04 000001 2021-01-04 17.44 16.94 17.44 16.78 155421600.0 2.891682e+09 -4.19 -0.74 0.80 17.68 NaN NaN NaN NaN
2021-01-04 002241 2021-01-04 36.65 36.04 36.65 35.70 128543900.0 4.707994e+09 -1.58 -0.58 4.61 36.62 NaN NaN NaN NaN
2021-01-05 002241 2021-01-05 35.79 37.78 38.20 35.49 167528000.0 6.236502e+09 4.83 1.74 6.01 36.04 36.910 0.870 36.910000 0.870000
2021-01-05 000001 2021-01-05 16.74 16.51 16.82 16.14 182135200.0 3.284607e+09 -2.54 -0.43 0.94 16.94 16.725 0.215 16.725000 0.215000
2021-01-06 002241 2021-01-06 37.65 36.98 38.05 36.70 95322100.0 3.611666e+09 -2.12 -0.80 3.42 37.78 37.380 0.400 36.933333 0.711118
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-04-09 000001 2025-04-09 10.73 10.80 10.83 10.66 106439400.0 1.145239e+09 -0.18 -0.02 0.55 10.82 10.810 0.010 11.080000 0.270079
2025-04-10 002241 2025-04-10 20.76 20.71 20.77 19.90 281173100.0 5.758097e+09 9.69 1.83 9.12 18.88 19.795 0.915 22.138571 2.811235
2025-04-10 000001 2025-04-10 10.86 10.90 10.93 10.82 86654800.0 9.433523e+08 0.93 0.10 0.45 10.80 10.850 0.050 11.028571 0.265138
2025-04-11 000001 2025-04-11 10.87 10.89 10.90 10.83 58387900.0 6.343118e+08 -0.09 -0.01 0.30 10.90 10.895 0.005 10.974286 0.248530
2025-04-11 002241 2025-04-11 20.21 21.08 21.65 19.80 223356400.0 4.598416e+09 1.79 0.37 7.24 20.71 20.895 0.185 21.435714 2.332172

2068 rows × 16 columns
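
In the output above, the two-row window over 36.04 and 37.78 reports a standard deviation of 0.870, which matches the population formula (ddof=0) rather than the pandas default sample formula (about 1.230). A plain-pandas sketch that reproduces those first values, assuming this reading is right, could look like this:

```python
import pandas as pd

# First three closes of 002241, taken from the output above; the single-stock frame is illustrative
prices = pd.DataFrame({
    'stock_code': ['002241'] * 3,
    'trade_time': pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),
    'close': [36.04, 37.78, 36.98],
})

grouped_close = prices.groupby('stock_code')['close']

# Rolling mean and population standard deviation over a 2-row window, per stock
prices['close_rolling_mean_win_2'] = grouped_close.transform(lambda s: s.rolling(2).mean())
prices['close_rolling_std_win_2'] = grouped_close.transform(lambda s: s.rolling(2).std(ddof=0))

print(prices)
```

If the sample standard deviation is wanted instead, dropping `ddof=0` restores the pandas default.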

Generating Future Dates

From a DataFrame

Code
df_stock_byday \
        .groupby('stock_code') \
        .future_frame('trade_time', length_out = 365) \
        .augment_timeseries_signature('trade_time') \
        .query("close.isna()") \
        .tail() 
stock_code trade_time open close high low volume amount change_pct change ... trade_time_mday trade_time_qday trade_time_yday trade_time_weekend trade_time_hour trade_time_minute trade_time_second trade_time_msecond trade_time_nsecond trade_time_am_pm
2874 002241 2026-04-07 NaN NaN NaN NaN NaN NaN NaN NaN ... 7.0 7.0 97.0 0 0.0 0.0 0.0 0.0 0.0 am
2875 002241 2026-04-08 NaN NaN NaN NaN NaN NaN NaN NaN ... 8.0 8.0 98.0 0 0.0 0.0 0.0 0.0 0.0 am
2876 002241 2026-04-09 NaN NaN NaN NaN NaN NaN NaN NaN ... 9.0 9.0 99.0 0 0.0 0.0 0.0 0.0 0.0 am
2877 002241 2026-04-10 NaN NaN NaN NaN NaN NaN NaN NaN ... 10.0 10.0 100.0 0 0.0 0.0 0.0 0.0 0.0 am
2878 002241 2026-04-11 NaN NaN NaN NaN NaN NaN NaN NaN ... 11.0 11.0 101.0 0 0.0 0.0 0.0 0.0 0.0 am

5 rows × 41 columns

From a Series

Code
pd.Series(pd.date_range("2023", "2024", freq = "D")) \
  .make_future_timeseries(12) \
  .get_timeseries_signature()
idx idx_index_num idx_year idx_year_iso idx_yearstart idx_yearend idx_leapyear idx_half idx_quarter idx_quarteryear ... idx_mday idx_qday idx_yday idx_weekend idx_hour idx_minute idx_second idx_msecond idx_nsecond idx_am_pm
0 2024-01-02 1704153600 2024 2024 0 0 1 1 1 2024Q1 ... 2 2 2 0 0 0 0 0 0 am
1 2024-01-03 1704240000 2024 2024 0 0 1 1 1 2024Q1 ... 3 3 3 0 0 0 0 0 0 am
2 2024-01-04 1704326400 2024 2024 0 0 1 1 1 2024Q1 ... 4 4 4 0 0 0 0 0 0 am
3 2024-01-05 1704412800 2024 2024 0 0 1 1 1 2024Q1 ... 5 5 5 0 0 0 0 0 0 am
4 2024-01-06 1704499200 2024 2024 0 0 1 1 1 2024Q1 ... 6 6 6 0 0 0 0 0 0 am
5 2024-01-07 1704585600 2024 2024 0 0 1 1 1 2024Q1 ... 7 7 7 1 0 0 0 0 0 am
6 2024-01-08 1704672000 2024 2024 0 0 1 1 1 2024Q1 ... 8 8 8 0 0 0 0 0 0 am
7 2024-01-09 1704758400 2024 2024 0 0 1 1 1 2024Q1 ... 9 9 9 0 0 0 0 0 0 am
8 2024-01-10 1704844800 2024 2024 0 0 1 1 1 2024Q1 ... 10 10 10 0 0 0 0 0 0 am
9 2024-01-11 1704931200 2024 2024 0 0 1 1 1 2024Q1 ... 11 11 11 0 0 0 0 0 0 am
10 2024-01-12 1705017600 2024 2024 0 0 1 1 1 2024Q1 ... 12 12 12 0 0 0 0 0 0 am
11 2024-01-13 1705104000 2024 2024 0 0 1 1 1 2024Q1 ... 13 13 13 0 0 0 0 0 0 am

12 rows × 30 columns
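
The future dates themselves can also be derived in plain pandas by inferring the step from the history and continuing from the last observed date. A minimal sketch of that idea, not a description of pytimetk's internals:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Same daily history as above: 2023-01-01 through 2024-01-01
history = pd.Series(pd.date_range("2023", "2024", freq="D"))

# Infer the step from the history, then continue for 12 more periods
step = to_offset(pd.infer_freq(pd.DatetimeIndex(history)))
future = pd.date_range(start=history.iloc[-1] + step, periods=12, freq=step)

print(future)
```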

Exploring the Data

Plot a line chart to see the trend in daily closing prices

Code
df_stock_byday['year'] = pd.to_datetime(df_stock_byday['trade_time']).dt.year
df_stock_byday. \
  reset_index(). \
  dropna(). \
  groupby("stock_code"). \
  plot_timeseries(date_column  = 'trade_time',
  facet_ncol = 2, 
  color_column = 'year',
  facet_scales = "free",
  value_column = 'close')
[Figure: Time Series Plot. Daily closing price for 002241 and 000001 in separate panels, colored by year (2021 to 2025).]

Stock time series can also be drawn as candlestick charts

Code
from datetime import datetime

import vectorbt as vbt

df_stock_byday = adata.stock.market.get_market(stock_code='002241', k_type=1, start_date='2021-01-01').assign(stock_code='002241')

df_stock_byday[['open', 'close', 'volume', 'high', 'low']] = df_stock_byday[['open', 'close', 'volume', 'high', 'low']].astype(np.float64)

df_stock_byday['trade_date'] = pd.to_datetime(df_stock_byday['trade_date'], format='%Y-%m-%d')
df_stock_byday['trade_time'] = pd.to_datetime(df_stock_byday['trade_time'])  # infer_datetime_format is deprecated since pandas 2.0
plot_Candlestick = df_stock_byday.vbt.ohlcv.plot(plot_type='Candlestick')
plot_Candlestick.update_layout(height=None,width=None)
plot_Candlestick.show()

Anomaly Detection

Code
df_stock_byday  = df_stock_byday.loc[df_stock_byday.stock_code == '002241']
# df_stock_byday = df_stock_byday.pad_by_time('trade_time', freq = 'B')  # Resample to business days; values for missing dates are filled with NA
# df_stock_byday.set_index(['trade_time'],inplace = True)
# df_stock_byday.index.freq = 'B'

anomalize_df = tk.anomalize(
    data          = df_stock_byday,
    date_column   = 'trade_time',
    value_column  = 'close',
    period        = 7,
    iqr_alpha     = 0.05, # using the default
    clean_alpha   = 0.75, # using the default
    clean         = "min_max"
)

anomalize_df.glimpse()
<class 'pandas.core.frame.DataFrame'>: 1034 rows of 12 columns
trade_time:         datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), ...
observed:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, ...
seasonal:           float64           [4.791440340959438, -2.93423223153 ...
seasadj:            float64           [31.24855965904056, 40.71423223153 ...
trend:              float64           [40.9763326233664, 40.727817349658 ...
remainder:          float64           [-9.727772964325844, -0.0135851181 ...
anomaly:            object            ['Yes', 'No', 'No', 'No', 'No', 'N ...
anomaly_score:      float64           [10.458157338228853, 0.74396949203 ...
anomaly_direction:  int32             [-1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ...
recomposed_l1:      float64           [42.09159541096675, 34.11740756476 ...
recomposed_l2:      float64           [50.904719265490954, 42.9305314192 ...
observed_clean:     float64           [43.19323589278227, 37.78, 36.98,  ...

Plotting Anomalies

Code
# Plot anomalies
tk.plot_anomalies(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'Anomalies'
)
[Figure: Anomalies. Observed closing price with flagged anomalies, Jan 2021 to Jul 2025.]

After Cleaning Anomalies

Code
tk.plot_anomalies_cleaned(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'After Cleaning Anomalies'
)
[Figure: After Cleaning Anomalies. Series shown: observed and observed_clean, Jul 2021 to Jan 2025.]

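Plotting the Seasonal Decomposition

Time series decomposition is a statistical method that splits a time series into several components, each representing one basic category of pattern. The components usually include:

  • Trend (T): the long-term upward or downward movement

  • Seasonality (S): periodic fluctuations that repeat within a year or a shorter span

  • Cyclicality (C): periodic fluctuations with a longer period than seasonality

  • Irregularity (I): random, unpredictable variation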
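
The components listed above can be illustrated with a classical additive decomposition built from a centered moving average. The sketch below is a simplified textbook version on synthetic data, assuming an odd period; `decompose_additive` is a hypothetical helper and is not the method pytimetk uses. Cyclicality is absorbed into the trend here.

```python
import numpy as np
import pandas as pd

def decompose_additive(series, period):
    """Split a series into trend + seasonal + remainder (period assumed odd)."""
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    position = np.arange(len(series)) % period
    # Average the detrended values at each position within the cycle
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()  # center around zero
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=series.index)
    remainder = series - trend - seasonal
    return pd.DataFrame({'observed': series, 'trend': trend,
                         'seasonal': seasonal, 'remainder': remainder})

# Synthetic data: linear trend plus a weekly pattern that sums to zero
pattern = np.array([1.0, -1.0, 0.5, -0.5, 2.0, -2.0, 0.0])
index = pd.date_range('2021-01-04', periods=70, freq='D')
observed = pd.Series(0.1 * np.arange(70) + np.tile(pattern, 10), index=index)

parts = decompose_additive(observed, period=7)
print(parts.dropna().head(7))
```

Because the synthetic series has no noise, the remainder comes out as essentially zero wherever the trend is defined; on real prices it would carry the irregular component.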

Code
tk.plot_anomalies_decomp(
    data        = anomalize_df,
    date_column = 'trade_time',
    engine      = 'plotly',
    title       = 'Time Series Decomposition'
)
[Figure: Time Series Decomposition. Four panels: observed, seasonal, trend, remainder.]

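Time Series Models

Univariate time series models                     Multivariate time series models
Use a single variable                             Use multiple variables
Cannot use external data                          Can use external data
Based only on how past relates to present         Based on how past relates to present and on relationships between variables
Predict that variable's value at a future time    Predict the value of one or more variables at a future time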

Random Forest

Code
import pandas as pd
import numpy as np
import pytimetk as tk

from sklearn.ensemble import RandomForestRegressor


df_stock_byday.glimpse()
dset = tk.load_dataset('walmart_sales_weekly', parse_dates = ['Date'])

dset = dset.drop(columns=[
    'id', # This column can be removed as it is equivalent to 'Dept'
    'Store', # This column has only one possible value
    'Type', # This column has only one possible value
    'Size', # This column has only one possible value
    'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',
    'IsHoliday', 'Temperature', 'Fuel_Price', 'CPI',
       'Unemployment'])

dset.head()
sales_df = dset
sales_df_with_futureframe = sales_df \
    .groupby('Dept') \
    .future_frame(
        date_column = 'Date',
        length_out  = 5
    )
    
sales_df_dates = sales_df_with_futureframe.augment_timeseries_signature(date_column = 'Date')
sales_df_dates.head(10)


df_with_lags = sales_df_dates \
    .groupby('Dept') \
    .augment_lags(
        date_column  = 'Date',
        value_column = 'Weekly_Sales',
        lags         = [5,6,7,8,9]
    )
    
lag_columns = [col for col in df_with_lags.columns if 'lag' in col]

df_with_rolling = df_with_lags \
    .groupby('Dept') \
    .augment_rolling(
        date_column  = 'Date',
        value_column = lag_columns,
        window  = 4,
        window_func = 'mean',
        threads = 1 # Change to -1 to use all available cores
    ) 
df_with_rolling[df_with_rolling.Dept ==1].head(10)
df_with_lags.head(5)

all_lag_columns = [col for col in df_with_rolling.columns if 'lag' in col]

df_no_nas = df_with_rolling \
    .dropna(subset=all_lag_columns, inplace=False)

df_no_nas.head()

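# Copy the slices so that columns can be added later without a SettingWithCopyWarning
future = df_no_nas[df_no_nas.Weekly_Sales.isnull()].copy()
train = df_no_nas[df_no_nas.Weekly_Sales.notnull()].copy()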


train_columns = [ 
    'Dept'
    , 'Date_year'
    , 'Date_month'
    , 'Date_yweek'
    , 'Date_mweek'
    , 'Weekly_Sales_lag_5'
    , 'Weekly_Sales_lag_6'
    , 'Weekly_Sales_lag_7'
    , 'Weekly_Sales_lag_8'
    , 'Weekly_Sales_lag_5_rolling_mean_win_4'
    , 'Weekly_Sales_lag_6_rolling_mean_win_4'
    , 'Weekly_Sales_lag_7_rolling_mean_win_4'
    , 'Weekly_Sales_lag_8_rolling_mean_win_4'
    ]

X = train[train_columns]
y = train['Weekly_Sales']  # 1-D target, as scikit-learn regressors expect

model = RandomForestRegressor(random_state=123)
model = model.fit(X, y)

predicted_values = model.predict(future[train_columns])
future['y_pred'] = predicted_values

future.head(10)

train['type'] = 'actuals'
future['type'] = 'prediction'

full_df = pd.concat([train, future])

full_df.head(10)

full_df['Weekly_Sales'] = np.where(full_df.type =='actuals', full_df.Weekly_Sales, full_df.y_pred)

full_df \
    .groupby('Dept') \
    .plot_timeseries(
        date_column = 'Date',
        value_column = 'Weekly_Sales',
        color_column = 'type',
        smooth = False,
        smooth_alpha = 0,
        facet_ncol = 2,
        facet_scales = "free",
        y_intercept_color = tk.palette_timetk()['steel_blue'],
        width = 800,
        height = 600,
        engine = 'plotly'
    )
<class 'pandas.core.frame.DataFrame'>: 1034 rows of 13 columns
stock_code:      object            ['002241', '002241', '002241', '00224 ...
trade_time:      datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), Ti ...
trade_date:      datetime64[ns]    [Timestamp('2021-01-04 00:00:00'), Ti ...
open:            float64           [36.65, 35.79, 37.65, 37.14, 38.75, 3 ...
close:           float64           [36.04, 37.78, 36.98, 38.3, 38.54, 40 ...
high:            float64           [36.65, 38.2, 38.05, 38.8, 39.15, 40. ...
low:             float64           [35.7, 35.49, 36.7, 36.9, 37.58, 38.0 ...
volume:          float64           [128543900.0, 167528000.0, 95322100.0 ...
amount:          float64           [4707993856.0, 6236502016.0, 36116663 ...
change_pct:      float64           [-1.58, 4.83, -2.12, 3.57, 0.63, 5.6, ...
change:          float64           [-0.58, 1.74, -0.8, 1.32, 0.24, 2.16, ...
turnover_ratio:  float64           [4.61, 6.01, 3.42, 4.89, 3.62, 6.23,  ...
pre_close:       float64           [36.62, 36.04, 37.78, 36.98, 38.3, 38 ...
[Figure: Time Series Plot. Weekly_Sales by Dept, one panel per department, with actuals and prediction shown in different colors.]
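
The workflow above forecasts 5 periods ahead and uses lags starting at 5. One way to see why is that a lag shorter than the horizon is not fully known for the future rows. The sketch below demonstrates this on a made-up weekly series; the dates and values are illustrative only.

```python
import numpy as np
import pandas as pd

horizon = 5

# Made-up weekly history with 10 observations
history = pd.DataFrame({
    'Date': pd.date_range('2012-01-06', periods=10, freq='W-FRI'),
    'Weekly_Sales': np.arange(10, dtype=float),
})

# Append `horizon` future dates whose sales are still unknown (NaN)
future_dates = pd.date_range(history['Date'].iloc[-1], periods=horizon + 1, freq='W-FRI')[1:]
frame = pd.concat([history, pd.DataFrame({'Date': future_dates})], ignore_index=True)

frame['lag_1'] = frame['Weekly_Sales'].shift(1)
frame['lag_5'] = frame['Weekly_Sales'].shift(horizon)

future_rows = frame[frame['Weekly_Sales'].isna()]
print(future_rows)
```

Only the first future row has a known `lag_1`, whereas `lag_5` is available for all five, so a model trained on lags of at least the horizon can predict every future row in one pass.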
Code
# !pip install git+https://github.com/business-science/pymodeltime.git
# 
# !pip install autogluon
# 
# 
# !pip install h2o