Multilayer Perceptron (MLP) for a Binary Attendance-Prediction Task (sklearn)


'''
1. Basic usage:
https://blog.csdn.net/qq_36158230/article/details/118670801
(Multilayer perceptron (MLP) for a binary attendance-prediction task, sklearn)
2. Classifier parameters: https://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html
3. Loss function: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
Note: with the example data and parameters, the trained model does not seem to perform well.
'''

import csv
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
 
# Load the dataset
def load_dataset(path):
    vector_x = []   # feature vectors
    y = []          # label of each sample
    # Open the file in a with-block so it is closed after reading
    with open(path, newline='') as f:
        dataset_file = csv.reader(f)
        for content in dataset_file:
            # Skip the header row
            if dataset_file.line_num != 1:
                # Convert the row to a list of floats
                content = list(map(float, content))
                if len(content) != 0:
                    vector_x.append(content[1:12])  # column 0 is the id; columns 1-11 are the 11 features
                    y.append(content[-1])           # the last column is the label
                    # print(content, len(content))  # [2.0, 1.0, 1.0, 2.0, 3.0, 1.0, 4.0, 1.0, 3.0, 2.0, 1.0, 3.0, 1.0] 13
    return vector_x, y  # features and labels
 
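# Train the model
def mlp_cls(vector_x_train, y_train):
    # input layer -> hidden layer 1 -> hidden layer 2 -> output layer
    #     11              30                20               1        # number of nodes
    # (the input layer has 11 nodes because load_dataset keeps 11 feature columns)
    # For details on the MLPClassifier parameters see https://www.cnblogs.com/-X-peng/p/14225973.html
    mlp = MLPClassifier(solver='adam', alpha=0, hidden_layer_sizes=(30, 20), random_state=1)
    mlp.fit(vector_x_train, y_train)        # train
    return mlp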
 
# Predict with the model and report the errors
# Note: the printed labels say "Test set" even when this is called on the training set
def mlp_cls_predict(mlp, vector_x_test, y_test):
    # Predict
    y_predict = mlp.predict(vector_x_test)
    n = 3
    print("Predicted:", y_predict[:n], ", true:", y_test[:n])
    print(y_predict[0]==y_test[0])
    print("Test set size:", len(y_test), len(y_predict))
    label_1 = []      # samples labelled 1
    label_fu1 = []    # samples labelled -1 ("fu" is pinyin for "negative")
    for p in y_test:
        if p==1:
            label_1.append(p)
            #print("label: 1")
        if p==-1:
            label_fu1.append(-1)
    print('Test set: total',len(label_1)+len(label_fu1),', len(label_1)', len(label_1), ', len(label_fu1)',len(label_fu1))
    error_n = 0
    for i in range(len(y_predict)):
        if y_predict[i] != y_test[i]:
            print('Wrong prediction:', y_predict[i], ', true:', y_test[i])
            error_n +=1
    print('Number of wrong predictions:', error_n)
    # Print the accuracy; accuracy_score expects (y_true, y_pred)
    print(accuracy_score(y_test, y_predict))
 
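# Experiment
if __name__ == '__main__':
    # 1. Load the dataset

    vector_x, y = load_dataset("dataset.csv") # If this raises an error, the likely cause is that the ipynb was created in a different directory from the csv file
    print('Dataset size (expected 161):', len(vector_x), len(y))
    count_fu1 = []
    for p in y:
        if p==-1:
            count_fu1.append(p)
    print('Total negative samples:', len(count_fu1))

    print(vector_x[:3], y[:3])
    # Note: the scaler is fitted on the full dataset, before the train/test split
    scalar = StandardScaler()               # standardization transformer
    scalar.fit(vector_x)                    # learn the per-feature mean and standard deviation
    vector_x = scalar.transform(vector_x)   # transform the dataset
    print(vector_x[:3], y[:3])
    print("Mean of each feature:", scalar.mean_, "number of features:", len(scalar.mean_))
    print("Standard deviation of each feature:", scalar.scale_, "number of features:", len(scalar.scale_))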
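    '''
    Standardization is a common preprocessing step that rescales each feature to mean 0 and standard deviation 1. This keeps the features on comparable scales, so that no single feature has an outsized influence on the model.
    1) Compute the mean and standard deviation of each feature.
    2) Subtract the mean from every value of the feature, then divide by the standard deviation.

    preprocessing.scale(data) is a function in scikit-learn's preprocessing module for standardizing data quickly. It standardizes the input per feature, so that each feature has mean 0 and standard deviation 1.
    It suits cases where the data only needs to be standardized once, but unlike StandardScaler it does not keep the mean and standard deviation for later use (for example, to transform new data).
    '''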
    # 2. Split into training and test sets
    # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    vector_x_train, vector_x_test, y_train, y_test = train_test_split(vector_x, y, test_size=0.2, random_state=0)
    # vector_x_train, vector_x_test, y_train, y_test = train_test_split(vector_x, y, test_size=0.2, random_state=10)
    label_1 = []
    label_fu1 = []
    for p in y_train:
        if p==1:
            label_1.append(p)
            #print("label: 1")
        if p==-1:
            label_fu1.append(-1)
    print('Training set: total',len(label_1)+len(label_fu1),', len(label_1)', len(label_1), ', len(label_fu1)',len(label_fu1))
    # 3. Train
    mlp = mlp_cls(vector_x_train, y_train)
    # 4. Predict on the test set
    mlp_cls_predict(mlp, vector_x_test, y_test)

    print('[The model seems to predict the positive class 1 for everything; check the accuracy on the training set]')
    mlp_cls_predict(mlp, vector_x_train, y_train)
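
The script above fits StandardScaler on all 161 rows before splitting, so the test rows influence the scaling statistics. A possible variant, sketched below, splits first and lets a Pipeline fit the scaler on the training rows only. Everything here beyond the original calls is an assumption for illustration: the data is a synthetic stand-in for dataset.csv (random integer features, 18 negative labels), `stratify=y` is added so that both splits keep some negative samples, and `max_iter=1000` is an arbitrary increase over the default 200 in response to the ConvergenceWarning. It also checks the statement from the note on standardization that preprocessing.scale and StandardScaler produce the same values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, scale

# Synthetic stand-in for dataset.csv (an assumption, not the real data):
# 161 rows, 11 integer-valued features in 1..4, labels in {-1, 1} with 18 negatives.
rng = np.random.default_rng(0)
X = rng.integers(1, 5, size=(161, 11)).astype(float)
y = np.array([-1.0] * 18 + [1.0] * 143)
rng.shuffle(y)

# scale() and StandardScaler give the same standardized values on the data
# they are fitted to; only StandardScaler remembers mean_ and scale_.
assert np.allclose(scale(X), StandardScaler().fit_transform(X))

# Split first; stratify=y keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The pipeline fits the scaler on the training rows only, then reuses the
# stored mean and standard deviation when scoring the test rows.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='adam', alpha=0, hidden_layer_sizes=(30, 20),
                  max_iter=1000, random_state=1),
)
model.fit(X_train, y_train)
print("negatives in test set:", int((y_test == -1).sum()))
print("test accuracy:", model.score(X_test, y_test))
```

Because the labels here are random, the printed accuracy says nothing about the real task; the point is only the order of operations (split, then fit the scaler inside the pipeline).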

Output of the script that loads dataset.csv:


Dataset size (expected 161): 161 161
Total negative samples: 18
[[1.0, 1.0, 2.0, 3.0, 1.0, 4.0, 1.0, 3.0, 2.0, 1.0, 3.0], [1.0, 2.0, 2.0, 3.0, 1.0, 4.0, 1.0, 4.0, 2.0, 2.0, 3.0], [2.0, 2.0, 2.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 2.0, 3.0]] [1.0, 1.0, 1.0]
[[-0.41854806 -1.31484355  0.36579067  0.3130227  -0.80178373  0.49383162
  -0.48832524  0.23600208  0.06141296 -2.9104275  -0.39633848]
 [-0.41854806  0.76054676  0.36579067  0.3130227  -0.80178373  0.49383162
  -0.48832524  1.23590565  0.06141296  0.34359214 -0.39633848]
 [ 2.38921186  0.76054676  0.36579067  0.3130227  -0.80178373 -0.71081824
  -0.48832524  0.23600208 -1.35108513  0.34359214 -0.39633848]] [1.0, 1.0, 1.0]
Mean of each feature: [1.14906832 1.63354037 1.88198758 2.74534161 1.39130435 3.59006211
 1.19254658 2.76397516 1.95652174 1.89440994 3.31677019] number of features: 11
Standard deviation of each feature: [0.35615581 0.48183708 0.32262283 0.81354605 0.48804227 0.83011673
 0.39429988 1.00009644 0.70796556 0.30731222 0.79924157] number of features: 11
Training set: total 128 , len(label_1) 111 , len(label_fu1) 17
Predicted: [1. 1. 1.] , true: [1.0, 1.0, 1.0]
True
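Test set size: 33 33
Test set: total 33 , len(label_1) 32 , len(label_fu1) 1
Wrong prediction: 1.0 , true: -1.0
Number of wrong predictions: 1
0.9696969696969697
[The model seems to predict the positive class 1 for everything; check the accuracy on the training set]
Predicted: [1. 1. 1.] , true: [1.0, -1.0, 1.0]
True
Test set size: 128 128
Test set: total 128 , len(label_1) 111 , len(label_fu1) 17
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Wrong prediction: 1.0 , true: -1.0
Number of wrong predictions: 13
0.8984375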
/base/envs/py36/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
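
With only 18 negatives among 161 samples, accuracy alone hides how the model treats the minority class. The sketch below rebuilds the training-set evaluation from the counts printed above (all 111 positives predicted correctly, 13 of the 17 negatives predicted as 1) and reports a confusion matrix, the recall on class -1, and balanced accuracy next to a predict-everything-as-1 baseline. Choosing these particular metrics is a suggestion, not part of the original script, and the array order is arbitrary since only the counts are known.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Rebuilt from the printed training-set results (order of samples is arbitrary):
# 111 true positives, 13 negatives predicted as 1, 4 negatives predicted as -1.
y_true = np.array([1.0] * 111 + [-1.0] * 17)
y_pred = np.array([1.0] * 111 + [1.0] * 13 + [-1.0] * 4)

# Baseline that always predicts the majority class 1.
baseline = np.ones_like(y_true)

print("model accuracy:   ", accuracy_score(y_true, y_pred))    # 0.8984375
print("baseline accuracy:", accuracy_score(y_true, baseline))  # 0.8671875

# Rows are true labels, columns are predicted labels, both ordered [-1, 1].
print(confusion_matrix(y_true, y_pred, labels=[-1.0, 1.0]))

# Fraction of true negatives that the model actually finds (4 of 17).
print("recall on class -1:", recall_score(y_true, y_pred, pos_label=-1.0))

# Mean of the per-class recalls, so both classes count equally.
print("balanced accuracy: ", balanced_accuracy_score(y_true, y_pred))
```

The model is only about three percentage points above the always-1 baseline on the training set, and it recovers 4 of the 17 negatives, which is consistent with the impression that nearly everything is predicted as the positive class.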