Week 22: Machine Learning

Abstract

This week we continued with Experiment 3 of Andrew Ng's Machine Learning course: handwritten digit recognition, a multi-class classification problem. A logistic regression model and a feed-forward neural network are used to predict the class, and the models are then evaluated and analyzed with common metrics, which reinforces the logistic regression material from the previous two weeks. We also worked hands-on through two PyTorch basics: calculus and automatic differentiation.

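I. Andrew Ng Machine Learning exp3: Multi-class Classification

Case: handwritten digit classification

Goal: there are 10 digit classes, but logistic regression only solves binary problems. So we first build one binary classifier that separates a single digit from all the others, then train k such binary classifiers, one per digit, to classify k different digits.

Loading the data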

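import numpy as np
import scipy.io as scio

data = scio.loadmat('ex3data1.mat')
x = data.get('X')
y = data.get('y')
# The file stores digit 0 as label 10; map it back to 0 and keep y as a column vector
y = np.expand_dims(np.where(y[:,0]==10,0,y[:,0]), axis=1)
x.shape, y.shape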

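X is the feature matrix of the image data: each row is one image and each column is one pixel feature, so all columns of a row together make up one image. y holds the labels, again one row per image.

Note: in ex3data1.mat the label "10" stands for the digit "0", so every 10 in the labels is replaced with 0.

The shapes of the image data and the labels:

Displaying images from the dataset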

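import matplotlib
import matplotlib.pyplot as plt

def plot_image(img):
    # Randomly pick 25 rows of img, so sample_images has shape (25, 400)
    sample_idx = np.random.choice(np.arange(img.shape[0]), 25)
    sample_images = img[sample_idx, :]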

    fig, ax_array = plt.subplots(nrows=5, ncols=5, sharey=True, sharex=True, figsize=(5, 5))

    for r in range(5):
        for c in range(5):
            ax_array[r, c].matshow(sample_images[5 * r + c].reshape((20, 20)).T,
                                   cmap=matplotlib.cm.binary)
            plt.xticks(np.array([]))
            plt.yticks(np.array([]))

plot_image(x)

25 randomly sampled images are shown below:

One-hot encoding the labels

Call the onehot_encode function from Network.py

from Network import onehot_encode
y_onehot, cls = onehot_encode(y)
y_onehot

All labels in ex3data1.mat are converted to one-hot encoding, as follows:
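The onehot_encode function lives in the author's Network.py and is not shown in these notes. As a rough idea of what such a function could do, here is a dependency-free sketch; the name onehot_encode_sketch and its return convention (encoded rows plus the sorted class list) are assumptions chosen to mirror the `y_onehot, cls = onehot_encode(y)` call above, not the actual implementation.

```python
def onehot_encode_sketch(labels):
    """Return (one-hot rows, sorted class list) for a flat list of labels.

    Hypothetical stand-in for Network.onehot_encode, for illustration only.
    """
    classes = sorted(set(labels))
    column = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1  # mark the column that belongs to this sample's class
        rows.append(row)
    return rows, classes


y_demo = [0, 2, 1, 2]
y_demo_onehot, cls_demo = onehot_encode_sketch(y_demo)
print(y_demo_onehot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
print(cls_demo)       # [0, 1, 2]
```

Each column of the encoded matrix is then a ready-made 0/1 target for one binary classifier, which is how the columns are used in the one-vs-rest training later on.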

Splitting into training and validation sets

from sklearn.model_selection import train_test_split
train_x, val_x, train_y, val_y = train_test_split(x, y_onehot, test_size=0.2)
print("Total train samples: {}\n"
      "Total val samples: {}".format(train_y.shape[0],val_y.shape[0]))
train_x.shape, train_y.shape

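The original dataset X is split at random into two parts: a larger part (80%) is the training set, used to fit the model; a smaller part (20%) is the validation set, used to evaluate the model on data it has not seen and so to detect overfitting. The result of the split: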
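To make the 80/20 split concrete, the following sketch shuffles sample indices and cuts them into two lists using only the standard library. The helper name split_indices, the seed, and the sample count of 5000 are assumptions for illustration; the notes themselves rely on sklearn's train_test_split.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into (train, val) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


# 5000 is an assumed sample count, used here only to show the proportions
train_idx, val_idx = split_indices(5000, val_fraction=0.2)
print(len(train_idx), len(val_idx))  # 4000 1000
```

With train_test_split, passing a fixed random_state plays the same role as the seed here and makes the split repeatable between runs.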

Shape of the training set:

Counting samples per class

for cls_idx in cls:
    train_sample_n = np.where(train_y[:,cls_idx]==1)[0].shape[0]
    val_sample_n = np.where(val_y[:,cls_idx]==1)[0].shape[0]
    print("Class {}:\t{} train samples\t{} val samples".format(cls_idx, train_sample_n, val_sample_n))
# Insert a middle axis so that [:, :, k] selects the 0/1 labels of class k as a column vector
train_y_ex = np.expand_dims(train_y,axis=1)
val_y_ex = np.expand_dims(val_y,axis=1)

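From the labels we get the true class distribution of the training and validation sets:

1. Single binary classifier (one digit vs. the rest)

Binary classification with logistic regression

Taking "0" as the example, we train a binary classifier for 0 versus the rest using the custom LogisticRegression class. Its arguments are the training samples and labels, the validation samples and labels, the number of epochs, the learning rate, and the regularization and normalization settings. train() returns the parameters, the training loss, and the validation loss; the validation accuracy is computed further below with test().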

from LogisticRegression import LogisticRegression
epochs = 200
alpha = 0.1
scale = 10
regularize = "L2"
normalize = False
logistic_reg = LogisticRegression(x=train_x,y=train_y_ex[:,:,0],val_x=val_x,val_y=val_y_ex[:,:,0],epoch=epochs,lr=alpha,scale=scale,normalize=normalize, regularize=regularize)
theta, train_loss, val_loss = logistic_reg.train()
theta.shape

Part of the training output:

Both the training loss and the validation loss decrease steadily.

Plotting the loss during training

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(1,epochs+1), train_loss, 'r', label="Train Loss")
ax.plot(np.arange(1,epochs+1), val_loss, 'b', label="Val Loss")
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Train Curve')
plt.legend(loc=2)
plt.show()

The training and validation losses as a function of the epoch:

Binary classification performance

Inspect the performance metrics after training

from LogisticRegression import bce_loss
from sklearn.metrics import f1_score

# logistic_reg is the trained LogisticRegression instance
pred_prob = logistic_reg.get_prob(x=val_x)  # predicted probabilities on the validation set
loss_val = bce_loss(pred=pred_prob, target=val_y_ex[:,:,0])  # BCE loss on the validation set (bce_loss is imported from LogisticRegression.py)
pred = logistic_reg.predict(x=val_x)  # predicted class labels
logistic_f1 = f1_score(val_y_ex[:,:,0],pred)   # F1 score from the true labels and the predictions
acc = logistic_reg.test(x=val_x,y=val_y_ex[:,:,0])   # accuracy on the validation set
print("Accuracy on Val set: {:.2f}%\n"
      "Val Loss on Val set: {:.4f}\n"
      "F1 Score on Val set: {:.4f}".format(acc * 100, loss_val, logistic_f1))

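The binary classification metrics are accuracy, validation loss, and F1 score. The results:

Validation accuracy:

Accuracy is the most intuitive metric: the fraction of all samples that the model predicts correctly. For binary classification it is

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of true positives, TN true negatives, FP false positives, and FN false negatives.

High accuracy means most predictions are correct, but the number can be misleading when the classes are imbalanced. In a digit-versus-rest task with 10 roughly equally frequent digits, a model that always answers "not this digit" is already right for about nine samples out of ten.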

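Validation loss:

The loss quantifies the gap between the model's predictions and the true values; logistic regression typically uses the binary cross-entropy (BCE) loss.

The lower the loss, the closer the predictions are to the true values and the better the model; training keeps adjusting the parameters to make the loss as small as possible.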
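The bce_loss function used above comes from the author's LogisticRegression.py and is not shown. The standard mean BCE is $-\frac{1}{n}\sum_i \left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$, and a minimal sketch of it could look like this; the name bce_loss_sketch and the clipping constant eps are assumptions, not the author's code.

```python
import math


def bce_loss_sketch(pred, target, eps=1e-12):
    """Mean binary cross-entropy for probabilities `pred` and 0/1 labels `target`."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


# Confident, correct predictions give a lower loss than hesitant ones
confident = bce_loss_sketch([0.9, 0.1, 0.8], [1, 0, 1])
unsure = bce_loss_sketch([0.6, 0.4, 0.5], [1, 0, 1])
print(confident < unsure)  # True
```

A prediction of exactly 0.5 costs log(2) per sample regardless of the label, which is a handy reference value when reading a loss curve.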

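F1 score:

The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$

Here TP, FP, and FN are the counts of true positives, false positives, and false negatives. Because it combines precision and recall, the F1 score is especially useful on class-imbalanced datasets. A higher F1 means a better balance between precision and recall.

In summary: the higher val_acc, the better the model performs overall; the lower val_loss, the closer the predictions are to the true values; the higher the F1 score, the better the model balances precision and recall.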
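The interplay between accuracy and F1 on imbalanced data is easier to see with numbers. The sketch below computes the metrics from confusion counts on a toy label set with one positive in ten; the function name binary_metrics and the toy data are illustrative assumptions, and undefined ratios are reported as 0.0 by choice.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Ratios with an empty denominator are reported as 0.0 in this sketch
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# One positive sample out of ten, and a model that always predicts "negative"
y_true_demo = [1] + [0] * 9
always_negative = [0] * 10
print(binary_metrics(y_true_demo, always_negative))  # (0.9, 0.0, 0.0, 0.0)
```

The always-negative model reaches 90% accuracy while its F1 score is 0, which is why the F1 score is reported next to accuracy for the single-digit classifier.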

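2. One-vs-all classification with k classifiers

Training k classifiers

Create an empty list and append the classifiers for 0, 1, 2, ..., 9 to it in turn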

classifier_list = []
for cls_idx in cls:
    classifier = LogisticRegression(x=train_x,y=train_y_ex[:,:,cls_idx], val_x=val_x, val_y=val_y_ex[:,:,cls_idx], epoch=epochs,lr=alpha,normalize=normalize, regularize="L2", scale=2, show=False)
    classifier.train()
    classifier_list.append(classifier)

print("Total Classifiers: {}".format(len(classifier_list)))

Classifying

# Collect each classifier's predicted probabilities on val_x via get_prob
prob_list = [classifier_i.get_prob(val_x) for classifier_i in classifier_list]
# Rearrange so that each column holds the probabilities from one classifier
prob_arr = np.array(prob_list).squeeze().T
# The index of the largest probability in each row is the predicted class
multi_pred = np.argmax(prob_arr,axis=1)
multi_pred[:5]

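For each image, the index of the classifier with the highest probability is the predicted class; the first five predicted classes are printed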
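The transpose-then-argmax step can be traced by hand on a tiny case. The sketch below uses three hypothetical classifiers and three samples with made-up probabilities, written with plain lists so that every intermediate value is visible.

```python
# prob_by_classifier[k][i]: probability from classifier k that sample i belongs to class k
# (made-up numbers, for illustration only)
prob_by_classifier = [
    [0.9, 0.2, 0.1],  # classifier for class 0
    [0.3, 0.7, 0.2],  # classifier for class 1
    [0.1, 0.4, 0.8],  # classifier for class 2
]

# Transpose to one row per sample and one column per classifier
prob_by_sample = [list(col) for col in zip(*prob_by_classifier)]


def argmax(values):
    """Index of the largest element in a list."""
    return max(range(len(values)), key=values.__getitem__)


multi_pred_demo = [argmax(row) for row in prob_by_sample]
print(prob_by_sample[0])  # [0.9, 0.3, 0.1]
print(multi_pred_demo)    # [0, 1, 2]
```

The probabilities in a row need not sum to 1, because each classifier is trained independently; only their relative order matters for the final decision.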

Measuring classification performance with sklearn

from sklearn.metrics import classification_report
report = classification_report(np.argmax(val_y, axis=1), multi_pred, digits=4)  # argument order is (y_true, y_pred)
print(report)

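The classification_report function from scikit-learn generates a report with a detailed evaluation of the predictions: precision, recall, F1 score, and support for each class.

3. Forward propagation

Network structure

The feed-forward computation is defined in advance in Network.py, and the model is used directly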

from Network import PytorchForward
model = PytorchForward()
for name, parameters in model.named_parameters():
    print(name,':',parameters.size())

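The parameter sizes of each layer:

Inspecting the parameter values

In this two-layer network the dimensions of the inputs, weights, and outputs change from layer to layer

Layer1:

Input: shape (n, d+1), where n is the number of samples and d+1 is the number of features including a bias term.

Weights: shape (d+1, hidden_size), where hidden_size is the size of the hidden layer.

Output: shape (n, hidden_size), obtained by multiplying the input by the weight matrix and applying the activation function.

Layer2:

Input: shape (n, hidden_size+1), i.e. the hidden layer output plus a bias term.

Weights: shape (hidden_size+1, cls_n), where cls_n is the size of the output layer, i.e. the number of classes.

Output: shape (n, cls_n), obtained by multiplying the input by the weight matrix and applying the activation function.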
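The shape bookkeeping above can be checked with a tiny forward pass. This is a sketch under assumptions: sigmoid as the activation, the bias added as a leading column of ones, and toy sizes (n=2, d=3, hidden_size=2, cls_n=2). The author's ForwardModel in Network.py is not shown and may differ in these details.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def add_bias(rows):
    """Prepend a constant 1 to every row (the bias feature)."""
    return [[1.0] + list(row) for row in rows]


def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def forward(x, theta1, theta2):
    a1 = add_bias(x)                                                  # (n, d+1)
    a2 = [[sigmoid(z) for z in row] for row in matmul(a1, theta1)]    # (n, hidden_size)
    a2 = add_bias(a2)                                                 # (n, hidden_size+1)
    return [[sigmoid(z) for z in row] for row in matmul(a2, theta2)]  # (n, cls_n)


# Toy sizes: n=2 samples, d=3 features, hidden_size=2, cls_n=2
x_toy = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
theta1_toy = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]  # (d+1, hidden_size)
theta2_toy = [[0.2, -0.1], [0.4, 0.3], [-0.6, 0.5]]               # (hidden_size+1, cls_n)

out_toy = forward(x_toy, theta1_toy, theta2_toy)
print(len(out_toy), len(out_toy[0]))  # 2 2
```

The output has one row per sample and one column per class, matching the (n, cls_n) shape listed for Layer2.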

import scipy.io as scio
import numpy as np
data = scio.loadmat('ex3weights.mat')
theta1 = data.get('Theta1').T
theta2 = data.get('Theta2').T
theta1.shape, theta2.shape

This gives the sizes of the weight matrices

Loading the data

data = scio.loadmat('ex3data1.mat')
x = data.get('X')
y = data.get('y')
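# Labels are left as 1..10 here (10 means digit 0), matching the +1 offset applied to the predictions below
# y = np.where(y[:,0]==10,0,y[:,0])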
x.shape, y.shape

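This matches the sizes above: each image is flattened into 400 pixels, so the number of features is d = 400

Running the algorithm

Run forward propagation with the loaded weights to predict the classes (no training happens in this step), then print the classification report (again with classification_report from sklearn.metrics)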

from Network import ForwardModel
model = ForwardModel()
model.load_parameters([theta1, theta2])
pred_prob = model(x)
# pred_prob = np.concatenate([np.expand_dims(pred_prob[:,-1],axis=1), pred_prob[:,:-1]], axis=1)

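# Output column i corresponds to label i+1 (labels run from 1 to 10, with 10 meaning digit 0)
pred = np.argmax(pred_prob,axis=1) + 1
from sklearn.metrics import classification_report
report = classification_report(y.ravel(), pred, digits=4)  # argument order is (y_true, y_pred)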
print(report)

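II. Dive into Deep Learning with PyTorch: Calculus and Automatic Differentiation

1. Calculus

Definition of the derivative

Define the difference quotient and let h approach 0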

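def f(x):
    # Assumption: f is not defined in the original notes; this is the example function from the d2l calculus chapter
    return 3 * x ** 2 - 4 * x

def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h

h = 0.1
for i in range(5):
    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
    h *= 0.1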
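The loop above uses the forward difference quotient, whose error shrinks in proportion to h. As a comparison, the sketch below adds the central difference, which is not part of the original notes; the test function f_demo (3x² − 4x, with derivative 2 at x = 1) is an assumed example.

```python
def f_demo(x):
    # Assumed example function; its exact derivative is 6x - 4, which equals 2 at x = 1
    return 3 * x ** 2 - 4 * x


def forward_diff(f, x, h):
    """Forward difference quotient (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def central_diff(f, x, h):
    """Central difference quotient (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for h in (0.1, 0.01, 0.001):
    fwd_err = abs(forward_diff(f_demo, 1.0, h) - 2.0)
    ctr_err = abs(central_diff(f_demo, 1.0, h) - 2.0)
    print(f"h={h}: forward error={fwd_err:.2e}, central error={ctr_err:.2e}")
```

For this quadratic the forward quotient equals 2 + 3h, so its error is 3h, while the central quotient is exact up to floating-point rounding. For general smooth functions the central error shrinks like h² instead of h.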

Visualizing the derivative

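from matplotlib_inline import backend_inline
from d2l import torch as d2l

def use_svg_display():
    """Display plots in Jupyter using the SVG format."""
    backend_inline.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize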

def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""
    if legend is None:
        legend = []

    set_figsize(figsize)
    axes = axes if axes else d2l.plt.gca()

    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)

2. Automatic differentiation

Before computing the gradient of y with respect to x, we need a place to store the gradient

import torch
x = torch.arange(4.0)
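x.requires_grad_(True)  # allocate storage for the gradient; equivalent to x = torch.arange(4.0, requires_grad=True)
x.grad  # None until backward() has been called
y = 2 * torch.dot(x, x)   # y = 2 * (x · x), a scalar function of x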

Calling the backward() function computes the gradient of y with respect to each component of x automatically

y.backward()
x.grad
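Since y = 2 · (x · x), the gradient is 4x, so x.grad should come out as tensor([0., 4., 8., 12.]). A result from automatic differentiation can be sanity-checked against a numerical gradient; the sketch below does this with plain Python lists instead of tensors, and the helper names y_of and numerical_grad are assumptions for illustration.

```python
def y_of(x):
    """y = 2 * dot(x, x) for a list of floats."""
    return 2.0 * sum(v * v for v in x)


def numerical_grad(fn, x, h=1e-5):
    """Central-difference estimate of the gradient of fn at the point x."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((fn(plus) - fn(minus)) / (2 * h))
    return grad


x_vals = [0.0, 1.0, 2.0, 3.0]        # same values as torch.arange(4.0)
analytic = [4.0 * v for v in x_vals]  # gradient of 2 * (x · x) is 4x
numeric = numerical_grad(y_of, x_vals)
print(analytic)  # [0.0, 4.0, 8.0, 12.0]
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

One detail to keep in mind when continuing in PyTorch: gradients accumulate in x.grad by default, so x.grad.zero_() is needed before calling backward() on a different function of x.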

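Summary

Next week the plan is to keep reviewing the basic models through Andrew Ng's machine learning experiments, to look in detail at the functions inside each model, and to continue with the basic PyTorch operations.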