ML4T - 第7章第4节 线性回归统计 Linear Regression for Statistics

目录

[一、Load Data 加载数据](#一、Load Data 加载数据)

[二、Explore Data 探索数据](#二、Explore Data 探索数据)

[三、OLS analysis 普通最小二乘分析](#三、OLS analysis 普通最小二乘分析)

[1-Day Returns](#1-Day Returns)

[5-Day Returns](#5-Day Returns)

[四、Obtain the residuals 计算残差](#四、Obtain the residuals 计算残差)

[residuals analysis 残差分析](#residuals analysis 残差分析)

[10-Day Returns](#10-Day Returns)

[Monthly Returns](#Monthly Returns)


参考:https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/07_linear_models/04_statistical_inference_of_stock_returns_with_statsmodels.ipynb

Statistical inference of stock returns with linear regression

这些需要先从0703中构建数据,然后再这一篇。

一、Load Data 加载数据

python 复制代码
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

import pandas as pd

from statsmodels.api import OLS, add_constant, graphics
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import norm

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
idx = pd.IndexSlice
python 复制代码
# Load Data
with pd.HDFStore('data.h5') as store:
    data = (store['model_data']
            .dropna()
            .drop(['open', 'close', 'low', 'high'], axis=1))

# Select Investment Universe
data = data[data.dollar_vol_rank<100]
data.info(show_counts=True)

输出:

python 复制代码
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 109434 entries, ('AAL', Timestamp('2013-07-26 00:00:00')) to ('YUM', Timestamp('2015-11-03 00:00:00'))
Data columns (total 65 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   volume                 109434 non-null  float64
 1   dollar_vol             109434 non-null  float64
 2   dollar_vol_1m          109434 non-null  float64
 3   dollar_vol_rank        109434 non-null  float64
 4   rsi                    109434 non-null  float64
 5   bb_high                109434 non-null  float64
 6   bb_low                 109434 non-null  float64
 7   atr                    109434 non-null  float64
 8   macd                   109434 non-null  float64
 9   return_1d              109434 non-null  float64
 10  return_5d              109434 non-null  float64
 11  return_10d             109434 non-null  float64
 12  return_21d             109434 non-null  float64
 13  return_42d             109434 non-null  float64
 14  return_63d             109434 non-null  float64
 15  return_1d_lag1         109434 non-null  float64
 16  return_5d_lag1         109434 non-null  float64
 17  return_10d_lag1        109434 non-null  float64
 18  return_21d_lag1        109434 non-null  float64
 19  return_1d_lag2         109434 non-null  float64
 20  return_5d_lag2         109434 non-null  float64
 21  return_10d_lag2        109434 non-null  float64
 22  return_21d_lag2        109434 non-null  float64
 23  return_1d_lag3         109434 non-null  float64
 24  return_5d_lag3         109434 non-null  float64
 25  return_10d_lag3        109434 non-null  float64
 26  return_21d_lag3        109434 non-null  float64
 27  return_1d_lag4         109434 non-null  float64
 28  return_5d_lag4         109434 non-null  float64
 29  return_10d_lag4        109434 non-null  float64
 30  return_21d_lag4        109434 non-null  float64
 31  return_1d_lag5         109434 non-null  float64
 32  return_5d_lag5         109434 non-null  float64
 33  return_10d_lag5        109434 non-null  float64
 34  return_21d_lag5        109434 non-null  float64
 35  target_1d              109434 non-null  float64
 36  target_5d              109434 non-null  float64
 37  target_10d             109434 non-null  float64
 38  target_21d             109434 non-null  float64
 39  year_2014              109434 non-null  bool   
 40  year_2015              109434 non-null  bool   
 41  year_2016              109434 non-null  bool   
 42  year_2017              109434 non-null  bool   
 43  month_2                109434 non-null  bool   
 44  month_3                109434 non-null  bool   
 45  month_4                109434 non-null  bool   
 46  month_5                109434 non-null  bool   
 47  month_6                109434 non-null  bool   
 48  month_7                109434 non-null  bool   
 49  month_8                109434 non-null  bool   
 50  month_9                109434 non-null  bool   
 51  month_10               109434 non-null  bool   
 52  month_11               109434 non-null  bool   
 53  month_12               109434 non-null  bool   
 54  capital_goods          109434 non-null  bool   
 55  consumer_durables      109434 non-null  bool   
 56  consumer_non-durables  109434 non-null  bool   
 57  consumer_services      109434 non-null  bool   
 58  energy                 109434 non-null  bool   
 59  finance                109434 non-null  bool   
 60  health_care            109434 non-null  bool   
 61  miscellaneous          109434 non-null  bool   
 62  public_utilities       109434 non-null  bool   
 63  technology             109434 non-null  bool   
 64  transportation         109434 non-null  bool   
dtypes: bool(26), float64(39)
memory usage: 36.5+ MB

Create Model Data

python 复制代码
y = data.filter(like='target')
X = data.drop(y.columns, axis=1)
X = X.drop(['dollar_vol', 'dollar_vol_rank', 'volume', 'consumer_durables'], axis=1)

如果这里报错的话,要修改上一节0704的代码:

python 复制代码
prices['sector'] = prices['sector'].str.lower().str.replace(' ', '_')  # add code;增加的代码

prices.assign(sector=pd.factorize(prices.sector, sort=True)[0]).to_hdf('data.h5', 'model_data/no_dummies')
prices = pd.get_dummies(prices,
                        columns=['year', 'month', 'sector'],
                        prefix=['year', 'month', ''],
                        prefix_sep=['_', '_', ''],
                        drop_first=True)
prices.info(show_counts=True)

二、Explore Data 探索数据

python 复制代码
sns.clustermap(y.corr(), cmap=sns.diverging_palette(h_neg=20, h_pos=220), center=0, annot=True, fmt='.2%');
python 复制代码
sns.clustermap(X.corr(), cmap=sns.diverging_palette(h_neg=20, h_pos=220), center=0);
plt.gcf().set_size_inches((14, 14))

热力图结论:

热力图显示了特征之间的相关性。

相关系数的范围在-1到1之间,值越接近1或-1,说明特征之间的相关性越强。

相关系数为0说明特征之间没有相关性。

Conclusion of heat map:

The heatmap shows the correlation between features.

The range of the correlation coefficient is between -1 and 1. The closer the value is to 1 or -1, the stronger the correlation between the features.

A correlation coefficient of 0 indicates that there is no correlation between the features.

python 复制代码
corr_mat = X.corr().stack().reset_index()
corr_mat.columns=['var1', 'var2', 'corr']
corr_mat = corr_mat[corr_mat.var1!=corr_mat.var2].sort_values(by='corr', ascending=False)

# corr_mat.head().append(corr_mat.tail())  old code is bad; 作者原始代码已经不行了
pd.concat([corr_mat.head(), corr_mat.tail()])  # pandas ≥ 2.0 good

三、OLS analysis 普通最小二乘分析

Linear Regression for Statistical Inference: OLS with statsmodels

用statsmodels库进行线性回归分析,OLS线性回归

Ticker-wise standardization

statsmodels warns of high design matrix condition numbers. This can arise when the variables are not standardized and the Eigenvalues differ due to scaling. The following step avoids this warning.

python 复制代码
sectors = X.iloc[:, -10:]
X = (X.drop(sectors.columns, axis=1)
     .groupby(level='ticker')
     .transform(lambda x: (x - x.mean()) / x.std())
    .join(sectors)
    .fillna(0))

1-Day Returns

python 复制代码
target = 'target_1d'
# this is MultiIndex
# model = OLS(endog=y[target], exog=add_constant(X))      # old code is bad; 老代码失效
X = X.astype(float)
model = OLS(endog=y[target], exog=add_constant(X))    # this is good.
trained_model = model.fit()
print(trained_model.summary())

# R-squared is only 0.009, less than 0.5; bad

输出:

(R方值比较小的,低于0.5,说明拟合不好)

python 复制代码
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              target_1d   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     18.36
Date:                Sun, 28 Sep 2025   Prob (F-statistic):          5.52e-181
Time:                        14:20:47   Log-Likelihood:             2.8809e+05
No. Observations:              109434   AIC:                        -5.761e+05
Df Residuals:                  109376   BIC:                        -5.755e+05
Df Model:                          57                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                  8.573e-05      0.000      0.543      0.587      -0.000       0.000
dollar_vol_1m            -0.0004   6.78e-05     -5.392      0.000      -0.000      -0.000
rsi                       0.0001      0.000      0.758      0.449      -0.000       0.001
bb_high                   0.0002      0.000      0.759      0.448      -0.000       0.001
bb_low                    0.0006      0.000      2.717      0.007       0.000       0.001
atr                    2.421e-06   7.35e-05      0.033      0.974      -0.000       0.000
macd                     -0.0004      0.000     -1.764      0.078      -0.001    4.68e-05
return_1d                 0.0024      0.000      8.693      0.000       0.002       0.003
return_5d                -0.0010      0.001     -1.151      0.250      -0.003       0.001
return_10d               -0.0061      0.001     -6.641      0.000      -0.008      -0.004
return_21d                0.0024      0.000      5.552      0.000       0.002       0.003
return_42d               -0.0035      0.001     -6.605      0.000      -0.005      -0.002
return_63d               -0.0015      0.000     -3.711      0.000      -0.002      -0.001
return_1d_lag1            0.0022      0.000      7.945      0.000       0.002       0.003
return_5d_lag1            0.0047      0.001      7.420      0.000       0.003       0.006
return_10d_lag1          -0.0007      0.001     -0.639      0.523      -0.003       0.001
return_21d_lag1           0.0025      0.000      6.952      0.000       0.002       0.003
return_1d_lag2            0.0023      0.000      8.058      0.000       0.002       0.003
return_5d_lag2            0.0007      0.001      1.022      0.307      -0.001       0.002
return_10d_lag2          -0.0002      0.001     -0.193      0.847      -0.002       0.001
return_21d_lag2       -6.892e-05      0.000     -0.250      0.802      -0.001       0.000
return_1d_lag3            0.0022      0.000      7.804      0.000       0.002       0.003
return_5d_lag3            0.0013      0.001      1.807      0.071      -0.000       0.003
return_10d_lag3           0.0006      0.000      4.431      0.000       0.000       0.001
return_21d_lag3          -0.0003   5.57e-05     -4.564      0.000      -0.000      -0.000
return_1d_lag4            0.0025      0.000      8.871      0.000       0.002       0.003
return_5d_lag4            0.0009      0.001      1.511      0.131      -0.000       0.002
return_10d_lag4           0.0005      0.000      4.674      0.000       0.000       0.001
return_21d_lag4       -8.646e-07   5.51e-05     -0.016      0.987      -0.000       0.000
return_1d_lag5        -9.493e-05   6.16e-05     -1.540      0.123      -0.000    2.59e-05
return_5d_lag5            0.0004      0.001      0.692      0.489      -0.001       0.001
return_10d_lag5           0.0008   9.86e-05      7.792      0.000       0.001       0.001
return_21d_lag5           0.0001   5.48e-05      2.023      0.043    3.44e-06       0.000
year_2014                -0.0003   8.28e-05     -4.111      0.000      -0.001      -0.000
year_2015                -0.0006   9.04e-05     -7.115      0.000      -0.001      -0.000
year_2016                -0.0004   8.84e-05     -5.039      0.000      -0.001      -0.000
year_2017                -0.0002   8.81e-05     -2.175      0.030      -0.000   -1.89e-05
month_2                   0.0010   7.23e-05     14.348      0.000       0.001       0.001
month_3                   0.0004   7.44e-05      4.885      0.000       0.000       0.001
month_4                   0.0005   7.29e-05      6.433      0.000       0.000       0.001
month_5                   0.0005   7.22e-05      6.911      0.000       0.000       0.001
month_6                   0.0004   7.34e-05      5.687      0.000       0.000       0.001
month_7                   0.0007    7.6e-05      8.693      0.000       0.001       0.001
month_8                 4.93e-05   7.66e-05      0.644      0.520      -0.000       0.000
month_9                   0.0004   7.53e-05      4.810      0.000       0.000       0.001
month_10                  0.0006   7.77e-05      8.279      0.000       0.000       0.001
month_11                  0.0006   7.57e-05      7.491      0.000       0.000       0.001
month_12                  0.0004   7.38e-05      5.094      0.000       0.000       0.001
capital_goods             0.0006      0.000      2.428      0.015       0.000       0.001
consumer_non-durables     0.0004      0.000      1.469      0.142      -0.000       0.001
consumer_services         0.0004      0.000      2.130      0.033     3.6e-05       0.001
energy                    0.0001      0.000      0.495      0.621      -0.000       0.001
finance                   0.0006      0.000      2.573      0.010       0.000       0.001
health_care               0.0003      0.000      1.733      0.083   -4.55e-05       0.001
miscellaneous             0.0009      0.000      2.445      0.014       0.000       0.002
public_utilities       5.961e-06      0.000      0.018      0.986      -0.001       0.001
technology                0.0008      0.000      3.945      0.000       0.000       0.001
transportation            0.0006      0.000      1.825      0.068   -4.21e-05       0.001
==============================================================================
Omnibus:                    32445.335   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4401670.731
Skew:                          -0.255   Prob(JB):                         0.00
Kurtosis:                      34.066   Cond. No.                         77.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

5-Day Returns

python 复制代码
target = 'target_5d'
model = OLS(endog=y[target], exog=add_constant(X))
trained_model = model.fit()
print(trained_model.summary())

R方值比1-Day Returns要高一点

四、Obtain the residuals 计算残差

residuals analysis 残差分析

python 复制代码
preds = trained_model.predict(add_constant(X))
residuals = y[target] - preds
fig, axes = plt.subplots(ncols=2, figsize=(14,4))
sns.distplot(residuals, fit=norm, ax=axes[0], axlabel='Residuals', label='Residuals')
axes[0].set_title('Residual Distribution')
axes[0].legend()
plot_acf(residuals, lags=10, zero=False, ax=axes[1], title='Residual Autocorrelation')
axes[1].set_xlabel('Lags')
sns.despine()
fig.tight_layout();

The residuals are normally distributed, it's good.

残差的分布是正态分布,这是好的。

But the autocorrelation is not good.

残差的自相关图不是很好(前几个Lags自相关系数在0.2以上),说明残差之间不是独立的。

10-Day Returns

python 复制代码
target = 'target_10d'
model = OLS(endog=y[target], exog=add_constant(X))
trained_model = model.fit()
print(trained_model.summary())

Monthly Returns

python 复制代码
target = 'target_21d'
model = OLS(endog=y[target], exog=add_constant(X))
trained_model = model.fit()
print(trained_model.summary())

my conclusion 我的结论

days increase, R-squared increase. model is more accurate.

天数增加,R-squared增加,模型更准确。

相关推荐
曾经的三心草3 小时前
Python14-SVM⽀持向量机
人工智能·机器学习·支持向量机
西猫雷婶4 小时前
pytorch基本运算-torch.normal()函数输出多维数据时,如何绘制正态分布函数图
人工智能·pytorch·python·深度学习·神经网络·机器学习·线性回归
敲代码的嘎仔4 小时前
牛客算法基础noob56 BFS
java·开发语言·数据结构·程序人生·算法·宽度优先
补三补四4 小时前
卡尔曼滤波
python·算法·机器学习·数据挖掘
wa的一声哭了5 小时前
Stanford CS336 Lecture3 | Architectures, hyperparameters
人工智能·pytorch·python·深度学习·机器学习·语言模型·自然语言处理
WaWaJie_Ngen5 小时前
LevOJ P2080 炼金铺 II [矩阵解法]
c++·线性代数·算法·矩阵
大有数据可视化6 小时前
机器学习+数字孪生:从诊断到自主决策的跨越
人工智能·机器学习
今后1236 小时前
【数据结构】堆、计数、桶、基数排序的实现
数据结构·算法·堆排序·计数排序·桶排序·基数排序
敲代码的嘎仔6 小时前
牛客算法基础noob59 简写单词
java·开发语言·数据结构·程序人生·算法·leetcode·学习方法