1. Introduction to LSTM
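Long short-term memory (LSTM) is a special kind of RNN, designed mainly to mitigate the vanishing-gradient and exploding-gradient problems that arise when training on long sequences. Put simply, an LSTM performs better than a plain RNN on longer sequences. Its structure is shown in the figure below:
[Figure: LSTM structure diagram]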
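For reference, the LSTM cell update is commonly written as follows. This is the standard textbook formulation rather than anything specific to this post; the symbols ($x_t$ for the input, $h_t$ for the hidden state, $c_t$ for the cell state, and $W$, $U$, $b$ for the learned parameters) are introduced here only for illustration.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive form of the $c_t$ update is what allows information, and gradients, to persist across many time steps.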
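LSTM performs well on sequence problems, so here we use an LSTM model for regression forecasting.
The model is built on the LSTM layer provided by PaddlePaddle. First, import the required packages.
2. Importing the required packages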
In [1]
import math
import numpy as np
from numpy import concatenate
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
import paddle
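3. Preprocessing the dataset
The dataset is the hourly data for monitoring point A from Problem B of the 2021 graduate mathematical modeling contest. Apart from the time column, it contains eleven features. We use the ten features other than PM2.5 to build an hourly forecasting model for PM2.5. (PM2.5 could also be modeled on its own as a univariate series, but since the other features influence PM2.5 to some degree, we build a multi-feature model here. Note that the training code later passes n_vars=11, so the previous hour's value of the first data column, which the code treats as the target, is also fed in alongside the other ten features.)
The raw dataset contains missing values that must be handled. In addition, neural-network training usually normalizes the data so that differences in scale between features do not dominate.
We therefore apply the following steps to the dataset:
- convert the series into a supervised-learning dataset;
- fill in the missing values;
- normalize the data.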
3.1 Defining the series-to-supervised function
In [2]
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Number of variables (columns) in the input
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = [], []
    # Input sequence: lagged copies (t-n_in, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # Output sequence: (t, t+1, ..., t+n_out-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # Join all shifted copies side by side
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Drop the rows that contain NaN introduced by shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg
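Parameters: data is the dataset itself; n_in is the number of past time steps (lags) placed in each input row; n_out is the number of time steps, starting at t, appended as outputs. Each row of the result therefore holds the variables at t-n_in, ..., t-1 followed by those at t, ..., t+n_out-1.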
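To make the effect of n_in and n_out concrete, here is a small sketch that runs the function on a toy array. The function body is a compact copy of the definition above so that the snippet runs on its own; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Compact copy of the function defined above."""
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        suffix = '(t)' if i == 0 else '(t+%d)' % i
        names += ['var%d%s' % (j + 1, suffix) for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg


# Four time steps of two variables (toy values)
toy = np.array([[1, 10], [2, 20], [3, 30], [4, 40]], dtype="float32")
framed = series_to_supervised(toy, n_in=1, n_out=1)
print(framed)
```

With n_in=1 and n_out=1, each remaining row pairs the values at t-1 with the values at t. The first time step is dropped because it has no predecessor.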
3.2 Defining the missing-value filling and normalization function
In [3]
def prepare_data(filepath, n_in, n_out=1, n_vars=6):
    # Read the dataset
    dataset = pd.read_csv(filepath, engine='python')
    # Fill missing values with the previous valid observation (forward fill)
    dataset.ffill(inplace=True)
    # Use the timestamp as the index
    dataset['date'] = pd.to_datetime(dataset['date'])
    dataset.set_index("date", inplace=True)
    values = dataset.values
    # Make sure all data is float32
    values = values.astype('float32')
    # Scale every variable to [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)
    # Convert the time series into a supervised-learning problem
    reframed = series_to_supervised(scaled, n_in, n_out)
    # Select the variables to keep; var1 (the first data column) is the target
    contain_vars = []
    for i in range(n_in, 0, -1):
        contain_vars += [('var%d(t-%d)' % (j, i)) for j in range(1, n_vars + 1)]
    data = reframed[contain_vars + ['var1(t)'] + [('var1(t+%d)' % (j)) for j in range(1, n_out)]]
    # Rename the columns
    col_names = ['Y', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11']
    contain_vars = []
    for i in range(n_in, 0, -1):
        contain_vars += [('%s(t-%d)' % (col_names[j], i)) for j in range(0, n_vars)]
    data.columns = contain_vars + ['Y(t)'] + [('Y(t+%d)' % (j)) for j in range(1, n_out)]
    return data
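Parameters: those with the same names as before have the same meaning, and n_vars is the number of variables to keep. Missing values are filled with pandas forward fill ("ffill"), after which the dataset is scaled to the range [0, 1] with MinMaxScaler.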
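The two preprocessing steps can be seen in isolation on a toy table. The sketch below is an illustration under stated assumptions, not the code used in this post: the column names and values are made up, and the min-max formula is written out by hand in NumPy, which matches what MinMaxScaler computes for feature_range=(0, 1). Because prepare_data does not return its scaler, the sketch also keeps the per-column minimum and maximum so that scaled values can be mapped back to the original units.

```python
import numpy as np
import pandas as pd

# Toy hourly readings with gaps (hypothetical columns and values)
raw = pd.DataFrame({
    "pm25": [10.0, np.nan, 30.0, 50.0],
    "temp": [20.0, 22.0, np.nan, 24.0],
})

# Forward fill: each gap takes the last valid value above it
filled = raw.ffill()
values = filled.to_numpy(dtype="float32")

# Min-max scaling per column: (x - min) / (max - min)
col_min = values.min(axis=0)
col_max = values.max(axis=0)
scaled = (values - col_min) / (col_max - col_min)


def inverse_scale(scaled_col, col_index):
    """Map a scaled column back to its original units."""
    span = col_max[col_index] - col_min[col_index]
    return scaled_col * span + col_min[col_index]


restored_pm25 = inverse_scale(scaled[:, 0], 0)
print(filled)
print(scaled)
print(restored_pm25)
```

The same inverse mapping could be applied to model predictions to report them in the original units instead of on the [0, 1] scale.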
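4. Building the LSTM model
Besides building the LSTM model, this step also defines a dataset class so that the features can be fed into the model for training.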
In [4]
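class LSTM(paddle.nn.Layer):
    def __init__(self, input_size=6, hidden_size=50):
        super().__init__()
        self.rnn = paddle.nn.LSTM(input_size=input_size, hidden_size=hidden_size)
        self.linear = paddle.nn.Linear(hidden_size, 1)

    def forward(self, inputs):
        # inputs: [batch, time_steps, input_size]
        y, (hidden, cell) = self.rnn(inputs)
        # hidden: [num_layers, batch, hidden_size]; use the last layer's final state
        output = self.linear(hidden[-1])
        # [batch, 1] -> [batch]; squeezing only the last axis keeps the batch
        # axis even when a batch holds a single sample
        output = paddle.squeeze(output, axis=-1)
        return output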
In [5]
class MyDataset(paddle.io.Dataset):
    def __init__(self, data, n_in=1, num_features=6, num_labels=1):
        super(MyDataset, self).__init__()
        self.data = data
        self.num_features = num_features
        self.num_labels = num_labels
        data = data.values
        # The first n_in * num_features columns are inputs, the next one is the label
        x = data[:, :n_in * num_features]
        self.y = data[:, n_in * num_features]
        # Reshape the inputs to [samples, time_steps, features]
        self.x = x.reshape((data.shape[0], n_in, num_features))
        self.x = np.array(self.x, dtype='float32')
        self.y = np.array(self.y, dtype='float32')
        self.num_sample = len(x)

    def __getitem__(self, index):
        data = self.x[index]
        label = self.y[index]
        return data, label

    def __len__(self):
        return self.num_sample
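The reshape in MyDataset is easier to follow on a tiny example. The sketch below uses NumPy only; the sizes (n_in=2, num_features=3) and the values are hypothetical and chosen so that the layout is visible: each flat row holds the features of the oldest time step first, then the next time step, and finally the label.

```python
import numpy as np

n_in, num_features = 2, 3  # hypothetical sizes, for illustration only

# One flat row: n_in * num_features inputs followed by one label -> [0, 1, ..., 6]
row = np.arange(n_in * num_features + 1, dtype="float32")
table = np.stack([row, row + 10])  # two samples

# Same slicing and reshaping as in MyDataset.__init__
x = table[:, :n_in * num_features].reshape((table.shape[0], n_in, num_features))
y = table[:, n_in * num_features]

print(x.shape)  # (samples, time_steps, features)
print(x[0])     # first sample: one row per time step
print(y)        # one label per sample
```

The resulting x has the [batch, time_steps, features] layout that paddle.nn.LSTM expects by default.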
5. Training the model
In [6]
data = prepare_data('hour.csv', 1, n_out=1, n_vars=11)  # load the data; 11 variables are kept
n_train_samples = int(len(data) * 0.85)  # the first 85% of the samples form the training set
train_data = data[:n_train_samples]
test_data = data[n_train_samples:]
In [7]
train_dataset = MyDataset(train_data,n_in=1,num_features=11)
test_dataset = MyDataset(test_data,n_in=1,num_features=11)
train_loader = paddle.io.DataLoader(train_dataset, batch_size=10, shuffle=False, drop_last=False)
In [8]
# Run on the first GPU when one is available
use_gpu = paddle.get_device().startswith("gpu")
if use_gpu:
    paddle.set_device('gpu:0')
In [9]
model = paddle.Model(LSTM(input_size=11, hidden_size=50))
model.prepare(optimizer=paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
              loss=paddle.nn.MSELoss(reduction='mean'))
model.fit(train_dataset,
          epochs=30,
          batch_size=20,
          shuffle=False,
          drop_last=True,
          verbose=1)
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/30
step 50/825 [>.............................] - loss: 0.0086 - ETA: 2s - 4ms/step
step 825/825 [==============================] - loss: 6.2217e-04 - 3ms/step
Epoch 2/30
step 825/825 [==============================] - loss: 7.0187e-04 - 3ms/step
Epoch 3/30
step 825/825 [==============================] - loss: 7.5661e-04 - 3ms/step
Epoch 4/30
step 825/825 [==============================] - loss: 7.9002e-04 - 3ms/step
Epoch 5/30
step 825/825 [==============================] - loss: 8.0973e-04 - 3ms/step
Epoch 6/30
step 825/825 [==============================] - loss: 8.2120e-04 - 3ms/step
Epoch 7/30
step 825/825 [==============================] - loss: 8.2733e-04 - 3ms/step
Epoch 8/30
step 825/825 [==============================] - loss: 8.2982e-04 - 3ms/step
Epoch 9/30
step 825/825 [==============================] - loss: 8.2988e-04 - 3ms/step
Epoch 10/30
step 825/825 [==============================] - loss: 8.2849e-04 - 3ms/step
Epoch 11/30
step 825/825 [==============================] - loss: 8.2655e-04 - 3ms/step
Epoch 12/30
step 825/825 [==============================] - loss: 8.2418e-04 - 3ms/step
Epoch 13/30
step 825/825 [==============================] - loss: 8.2167e-04 - 3ms/step
Epoch 14/30
step 825/825 [==============================] - loss: 8.1937e-04 - 3ms/step
Epoch 15/30
step 825/825 [==============================] - loss: 8.1763e-04 - 3ms/step
Epoch 16/30
step 825/825 [==============================] - loss: 8.1659e-04 - 3ms/step
Epoch 17/30
step 825/825 [==============================] - loss: 8.1617e-04 - 3ms/step
Epoch 18/30
step 825/825 [==============================] - loss: 8.1629e-04 - 3ms/step
Epoch 19/30
step 825/825 [==============================] - loss: 8.1685e-04 - 3ms/step
Epoch 20/30
step 825/825 [==============================] - loss: 8.1782e-04 - 3ms/step
Epoch 21/30
step 825/825 [==============================] - loss: 8.1926e-04 - 3ms/step
Epoch 22/30
step 825/825 [==============================] - loss: 8.2092e-04 - 3ms/step
Epoch 23/30
step 825/825 [==============================] - loss: 8.2267e-04 - 3ms/step
Epoch 24/30
step 825/825 [==============================] - loss: 8.2448e-04 - 3ms/step
Epoch 25/30
step 825/825 [==============================] - loss: 8.2632e-04 - 3ms/step
Epoch 26/30
step 825/825 [==============================] - loss: 8.2816e-04 - 3ms/step
Epoch 27/30
step 825/825 [==============================] - loss: 8.2998e-04 - 3ms/step
Epoch 28/30
step 825/825 [==============================] - loss: 8.3177e-04 - 3ms/step
Epoch 29/30
step 825/825 [==============================] - loss: 8.3355e-04 - 3ms/step
Epoch 30/30
step 825/825 [==============================] - loss: 8.3533e-04 - 3ms/step
6. Prediction
In [10]
# Evaluate on the held-out test set
y_pred = model.predict(test_dataset)
plt.plot(test_dataset.y, label='actual')
plt.plot(np.concatenate(y_pred).ravel(), label='predicted')
plt.legend()
plt.show()
Predict begin...
step 2915/2915 [==============================] - 2ms/step
Predict samples: 2915
<Figure size 432x288 with 1 Axes>
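A plot gives only a visual impression, so it can help to put numbers on the fit as well. The sketch below is a minimal illustration, not part of the original notebook: the helper name regression_errors is hypothetical, and the sample values are made up to stand in for test_dataset.y and the model output.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return (RMSE, MAE) for two equally long sequences."""
    y_true = np.asarray(y_true, dtype="float64").ravel()
    y_pred = np.asarray(y_pred, dtype="float64").ravel()
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae


# Made-up values standing in for the real targets and predictions
actual = [0.10, 0.20, 0.30, 0.40]
predicted = [0.10, 0.25, 0.25, 0.50]
rmse, mae = regression_errors(actual, predicted)
print("RMSE:", rmse)
print("MAE:", mae)
```

In the notebook, the same helper could be called as regression_errors(test_dataset.y, np.concatenate(y_pred)). The resulting errors would be on the normalized [0, 1] scale, since both the targets and the predictions are scaled.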
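7. Summary
The experiment above is a multi-feature forecast; an LSTM can also be used for univariate regression forecasting. For time-series problems in which the additional features are available only in the training set and not for the period to be predicted, we can first forecast the future feature values with univariate models and then build the multi-feature LSTM model on top of them.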