House Price Prediction | PyTorch

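I. Data Preprocessing

Z-score normalization, also called standardization, is a common preprocessing technique that rescales each feature to zero mean and unit standard deviation: $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the feature's mean and standard deviation. It changes the scale of a feature but not the shape of its distribution, so the result is standard normal only if the feature was normally distributed to begin with.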
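As a small illustration of the formula, the sketch below standardizes a made-up column with pandas. The column name and values are hypothetical. Because the standardized column has mean 0, filling a missing entry with 0 amounts to imputing the mean.

```python
import pandas as pd

# Hypothetical feature column with one missing value.
area = pd.Series([50.0, 80.0, None, 110.0])

# pandas skips NaN when computing mean() and std(); std() uses the sample
# standard deviation (ddof=1) by default.
z = (area - area.mean()) / area.std()

# The standardized column has mean 0, so filling with 0 imputes the mean.
z = z.fillna(0)

print(z.tolist())  # [-1.0, 0.0, 0.0, 1.0]
```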

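One-hot encoding is a common way to handle categorical features: it turns a categorical variable with k distinct values into k binary indicator columns, exactly one of which is set in each row, so that a model can consume non-numeric data without assuming an artificial ordering among the categories.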
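A minimal sketch with pandas, using a made-up column (the name and values are placeholders, not taken from the dataset). Passing dummy_na=True adds one more indicator column for missing values.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
df = pd.DataFrame({"Street": ["Pave", "Grvl", None, "Pave"]})

# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(df, dummy_na=True)

print(list(encoded.columns))
# Every row has exactly one indicator set.
print(encoded.astype(int).sum(axis=1).tolist())
```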

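II. Neural Network

This project uses a fully connected network, in which every neuron in one layer is connected to every neuron in the next layer.

1. Input layer

Receives the raw data and passes it on without performing any computation.

Number of neurons = dimensionality of the input (the number of features in the dataset).

2. Hidden layer

Sits between the input and output layers and extracts features from the data.

This project uses a single hidden layer with 100 neurons.

3. Output layer

Produces the model's final result (the predicted value in a regression task).

Number of neurons = dimensionality of the target (1 for this regression task).

4. Activation function

Activation functions are what allow a neural network to fit complex patterns: they introduce a non-linear transformation between layers. Without one, a stack of linear layers collapses into a single linear map.

The activation function used in this project is ReLU: $\mathrm{ReLU}(x) = \max(0, x)$.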
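The layer structure described above can be sketched in PyTorch as follows. The input width of 8 is an arbitrary placeholder; in practice it equals the number of features after preprocessing. The hidden width of 100 follows the text.

```python
import torch
from torch import nn

# Assumed sizes: 8 input features is a placeholder; 100 hidden units matches the text.
num_features, num_hidden = 8, 100

net = nn.Sequential(
    nn.Linear(num_features, num_hidden),  # input -> hidden
    nn.ReLU(),                            # element-wise max(0, x)
    nn.Linear(num_hidden, 1),             # hidden -> single regression output
)

x = torch.randn(4, num_features)  # a batch of 4 random samples
print(net(x).shape)               # torch.Size([4, 1])
print(torch.relu(torch.tensor([-2.0, 0.0, 3.0])))  # tensor([0., 0., 3.])
```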

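5. Loss calculation

A loss function measures the gap between the predictions and the true labels.

This project trains with mean squared error (MSE) and evaluates with the root mean squared error between the logarithms of the predicted and true prices (log RMSE), so that errors are measured in relative terms and expensive and cheap houses are weighted alike.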
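To make the evaluation metric concrete, here is a dependency-free sketch of log RMSE. The prices are invented for illustration. Each prediction is off by a factor of e, so every log-difference is 1 and the metric comes out at about 1 regardless of the price level.

```python
import math

def log_rmse(preds, labels):
    """Root mean squared error between log-predictions and log-labels."""
    # Clamp predictions to at least 1 so the logarithm is never taken of a value <= 0.
    clipped = [max(p, 1.0) for p in preds]
    squared = [(math.log(p) - math.log(y)) ** 2 for p, y in zip(clipped, labels)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices: each prediction is off by a factor of e.
labels = [100000.0, 200000.0]
preds = [y * math.e for y in labels]
print(log_rmse(preds, labels))  # approximately 1.0
```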

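6. Regularization

L2 regularization (the penalty used in ridge regression) pushes parameter values toward zero.

It adds the sum of squared parameters to the loss as a penalty term: $\text{loss} = \text{original loss} + \lambda \sum_i w_i^2$, where $\lambda$ controls the strength of the penalty.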
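A small sketch of the penalty and its gradient in plain Python. The function names and numbers are illustrative only.

```python
def l2_penalized_loss(base_loss, weights, lam):
    """Return base_loss + lam * sum(w^2) over all weights."""
    return base_loss + lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty term with respect to each weight: 2 * lam * w."""
    return [2 * lam * w for w in weights]

weights = [3.0, -4.0]
print(l2_penalized_loss(1.0, weights, lam=0.1))  # 1.0 + 0.1 * (9 + 16)
print(l2_penalty_grad(weights, lam=0.1))         # each weight is pulled toward zero
```

In PyTorch the same effect is usually obtained through the optimizer's weight_decay argument, which adds weight_decay * w to each gradient. That corresponds to $\lambda = \text{weight\_decay} / 2$ in the formula above.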

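7. Optimizer

Adam (Adaptive Moment Estimation) is one of the most widely used optimizers in deep learning. It combines the strengths of AdaGrad and RMSProp and adapts the learning rate of each parameter individually.

The raw gradient $g_t$ is first transformed into bias-corrected moment estimates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

($t$ is the iteration number; $\beta_1$ and $\beta_2$ are the decay rates of the two moment estimates.)

The parameters are then updated as follows, where $\eta$ is the learning rate and $\epsilon$ is a small constant for numerical stability:

$$\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$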
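The standard Adam update can be sketched for a single scalar parameter in plain Python. The default values for beta1, beta2, and eps below are the commonly used ones, and the quadratic objective is a toy example chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)  # much closer to the minimum at 0 than the starting value 5.0
```

On the very first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly the learning rate regardless of the gradient's scale.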

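8. Cross-validation

Cross-validation serves three purposes:

  • Assessing model stability: a single train/test split can make the evaluation result fluctuate considerably.

  • Making full use of the data: when data is limited, no samples are permanently set aside and wasted as a held-out set.

  • Detecting overfitting: the model's performance is checked more thoroughly across different subsets of the data.

K-fold cross-validation (K-Fold CV)

  • Steps
    1. Split the dataset evenly into K subsets (folds).
    2. In turn, use one subset as the validation set and the remaining K-1 subsets as the training set.
    3. Repeat K times and average the validation scores.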
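The three steps can be sketched on plain index lists. The helper names k_fold_indices and cross_validate are hypothetical, and score_fn stands in for whatever trains a model on the training indices and returns its validation score.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    fold_size = n_samples // k  # any remainder samples are left out, for simplicity
    for i in range(k):
        valid = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(k * fold_size) if j not in valid]
        yield train, valid

def cross_validate(n_samples, k, score_fn):
    """Average score_fn(train, valid) over the k folds."""
    scores = [score_fn(train, valid) for train, valid in k_fold_indices(n_samples, k)]
    return sum(scores) / k

for train, valid in k_fold_indices(6, 3):
    print(train, valid)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```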

III. Code

Imports

python
import pandas as pd
import torch
from d2l import torch as d2l
from torch import nn
from torch.utils.tensorboard import SummaryWriter

Load the data

python
train_data = pd.read_csv('data/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('data/house-prices-advanced-regression-techniques/test.csv')

Inspect the datasets: print their shapes and the first four rows of a few columns (the first four and the last three).

python
print(train_data.shape)
print(test_data.shape)
print(train_data.iloc[0:4,[0,1,2,3,-3,-2,-1]])
print(test_data.iloc[0:4,[0,1,2,3,-3,-2,-1]])

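Data preprocessing

The 'Id' column carries no information useful for training, so it is dropped. The label column (SalePrice) is also separated from the training features, and the training and test features are then concatenated so that both are preprocessed identically.

python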
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

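Select the numeric feature columns and standardize them. After standardization every column has mean 0, so filling missing values with 0 is equivalent to imputing the mean.

python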
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
all_features[numeric_features] = all_features[numeric_features].fillna(0)

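One-hot encode the non-numeric features. Setting dummy_na=True treats a missing value as a category of its own.

python
# dtype=float keeps the indicator columns numeric; pandas >= 2.0 would otherwise emit bool
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)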

Split the combined features back into training and test sets and extract the training labels, ready for training.

python
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)

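Define the loss function. MSELoss is the mean squared error, i.e. $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$.

The clamp call restricts predictions to the range [1, ∞), so the logarithm is never taken of zero or a negative value.

python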
loss = nn.MSELoss()
def log_rmse(net, features, labels):
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()

Define the network architecture

python
in_features = train_features.shape[1]
temp_features=100
def get_net():
    net = nn.Sequential(
        nn.Linear(in_features,temp_features),
        nn.ReLU(),
        nn.Linear(temp_features,1)
    )
    return net

Build the data iterator, set up the Adam optimizer, and log the training curves to TensorBoard for visualization

python
writer=SummaryWriter("./log/houseprices")

def train(net,i, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay,
          batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for x, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(x), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        writer.add_scalar("{}_fold/train".format(i+1),train_ls[epoch],epoch)
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
            writer.add_scalar("{}_fold/test".format(i + 1), test_ls[epoch], epoch)
    return train_ls, test_ls

Split the training set into k folds; return the i-th fold as the validation set and the rest as the training set

python
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

K-fold cross-validation

python
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net,i, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

Train on the full training set, then feed the test set through the trained network to obtain predictions and write them to a submission file

python
def train_and_pred(train_features,test_features,train_labels,test_data,num_epochs,lr,weight_decay,batch_size):
    net=get_net()
    train_ls,_=train(net,5,train_features,train_labels,None,None,num_epochs,lr,weight_decay,batch_size)
    print(f'train log rmse {float(train_ls[-1]):f}')
    preds=net(test_features).detach().numpy()
    test_data['SalePrice']=pd.Series(preds.reshape(1,-1)[0])
    submission=pd.concat([test_data['Id'],test_data['SalePrice']],axis=1)
    submission.to_csv('submission.csv',index=False)

Complete code (the hyperparameters are arbitrary placeholders and need to be tuned)

python
import pandas as pd
import torch
from d2l import torch as d2l
from torch import nn

from torch.utils.tensorboard import SummaryWriter

# Load the data

train_data = pd.read_csv('data/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('data/house-prices-advanced-regression-techniques/test.csv')

# Inspect the data
# print(train_data.shape)
# print(test_data.shape)
# print(train_data.iloc[0:4,[0,1,2,3,-3,-2,-1]])
# print(test_data.iloc[0:4,[0,1,2,3,-3,-2,-1]])

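# Drop the Id column (and the SalePrice label), then concatenate train and test features
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

# Data preprocessing
# Select the numeric features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# Standardize (z-score)
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
# Fill missing values with 0, i.e. the mean after standardization
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# One-hot encode the categorical features
# dtype=float keeps the indicator columns numeric; pandas >= 2.0 would otherwise emit bool
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)
# Build the training and test tensors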
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)

# Loss function
loss = nn.MSELoss()
def log_rmse(net, features, labels):
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()

# Network architecture
in_features = train_features.shape[1]
temp_features=100
def get_net():
    net = nn.Sequential(
        nn.Linear(in_features,temp_features),
        nn.ReLU(),
        nn.Linear(temp_features,1)
    )
    return net

writer=SummaryWriter("./log/houseprices")

def train(net,i, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay,
          batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for x, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(x), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        writer.add_scalar("{}_fold/train".format(i+1),train_ls[epoch],epoch)
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
            writer.add_scalar("{}_fold/test".format(i + 1), test_ls[epoch], epoch)
    return train_ls, test_ls


def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid


def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net,i, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

def train_and_pred(train_features,test_features,train_labels,test_data,num_epochs,lr,weight_decay,batch_size):
    net=get_net()
    train_ls,_=train(net,5,train_features,train_labels,None,None,num_epochs,lr,weight_decay,batch_size)
    print(f'train log rmse {float(train_ls[-1]):f}')
    preds=net(test_features).detach().numpy()
    test_data['SalePrice']=pd.Series(preds.reshape(1,-1)[0])
    submission=pd.concat([test_data['Id'],test_data['SalePrice']],axis=1)
    submission.to_csv('submission.csv',index=False)

import os  # used only to create the log directory below

# Hyperparameters are arbitrary placeholders and should be tuned
k, num_epochs, lr, weight_decay, batch_size = 5,100,5,0,64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)
os.makedirs('./weight', exist_ok=True)  # open() fails if the directory is missing
with open('./weight/houseprice.txt', 'a', encoding='utf-8') as file:
    file.write('-'*50)
    file.write('\n')
    line = 'k:{},num_epochs:{},lr:{},weight_decay:{},batch_size:{},temp_features:{}\n'.format(k,num_epochs,lr,weight_decay,batch_size,temp_features)
    file.write(line)
    line=f'{k}-fold validation: mean train log rmse {float(train_l):f}, mean valid log rmse {float(valid_l):f}\n'
    file.write(line)
print(f'{k}-fold validation: mean train log rmse {float(train_l):f}, '
      f'mean valid log rmse {float(valid_l):f}')
train_and_pred(train_features,test_features,train_labels,test_data,num_epochs,lr,weight_decay,batch_size)
writer.close()  # flush pending TensorBoard events