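Preface

After finishing the theory lectures I went looking for hands-on practice and found the assignments from Hung-yi Lee's (李宏毅) course. They are taught in Keras. I could follow the code line by line, but I could not write any of it myself, so I went to Bilibili and found a video to fill in some practical knowledge.

This article breaks down the template code provided for Homework 2 and Homework 3 and extracts the operations commonly used when training a model, including CSV file input and output, NumPy array operations, and the steps for training a model with Keras.

The assignments come from Hung-yi Lee's machine learning course: 李宏毅机器学习中文课程 - 网易云课堂 (163.com)

Keras hands-on video: 深度学习框架【Keras项目实战】

Kaggle links for the assignments:

Income prediction: ML2019SPRING-hw2 | Kaggle
Image emotion classification: ML2019SPRING-hw3 | Kaggle

Personal blog page: 深入分解机器学习实战作业模板代码------二分类、卷积神经网络 | Andrew的个人博客 (andreww1219.github.io)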

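I. Problem Description

1. Income prediction

Given a person's information, predict whether their income exceeds 50K.

The dataset X_train contains personal information for many people, and Y_train records whether each person's annual income exceeds 50K. Train a binary classifier and make predictions for X_test.

Template code:

Probabilistic Generative Model: ProbabilisticGenerativeModel (ntumlta2019.github.io)

Logistic Regression: LogisticRegression (ntumlta2019.github.io)

2. Image emotion classification

Given a 48×48-pixel image, determine the emotion it expresses (0: angry, 1: disgust, 2: fear, 3: happy, 4: sad, 5: surprise, 6: neutral).

Each row of the training set x_train.csv has two columns: the first, label, is the emotion of the image, and the second holds the 48×48 pixel values, each in the range 0 to 255. Train a convolutional neural network and make predictions for x_test.csv.

Template code: 2019 Spring ML HW3 - 手把手教學 - HackMD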

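II. Handling Input and Output

1. Reading CSV files

Option 1: use np.genfromtxt() with the argument delimiter=','. It returns an ndarray (an array of strings here, because dtype=str is passed):

raw_train = np.genfromtxt(path, delimiter=',', dtype=str, skip_header=1)  # skip_header=1 skips the header row

Option 2: use Python's built-in csv module. csv.reader yields lists of strings rather than an ndarray, so the result is converted with np.array:

import csv

with open(path, newline='') as csvfile:
    raw_train = np.array(list(csv.reader(csvfile))[1:], dtype=float)  # [1:] drops the header row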
2. Normalization / standardization / one-hot encoding

Normalization: $x = \frac{x - x_{min}}{x_{max} - x_{min}}$

Standardization: $x = \frac{x - \mu}{\sigma}$

One-hot encoding (called "discretization" in my notes): convert the labels 1, 2, ..., n into [1, 0, ..., 0], [0, 1, ..., 0], ..., [0, 0, ..., 1]

2.1 Manual processing

Normalization

self.min = np.min(rows, axis=0).reshape(1, -1)
self.max = np.max(rows, axis=0).reshape(1, -1)
self.theta = np.ones((rows.shape[1] + 1, 1), dtype=float)
# min-max normalization divides by the range (max - min), not by the standard deviation
for i in range(rows.shape[0]):
    rows[i, :] = (rows[i, :] - self.min) / (self.max - self.min)

Standardization

self.mean = np.mean(rows, axis=0).reshape(1, -1)
self.std = np.std(rows, axis=0).reshape(1, -1)
self.theta = np.ones((rows.shape[1] + 1, 1), dtype=float)
for i in range(rows.shape[0]):
    rows[i, :] = (rows[i, :] - self.mean) / self.std

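Understanding axis: with axis=i, the operation runs along the direction in which the i-th index varies, so that dimension is the one that gets collapsed.

Reference: Numpy:对Axis的理解 - 知乎 (zhihu.com)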
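
To make the axis rule concrete, here is a small sketch on a 2×3 array (the values are made up for illustration). With axis=0 the rows are collapsed and one value per column remains; with axis=1 the columns are collapsed and one value per row remains.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3): index 0 runs over rows, index 1 over columns

col_mean = np.mean(a, axis=0)  # move along index 0 -> one mean per column, shape (3,)
row_mean = np.mean(a, axis=1)  # move along index 1 -> one mean per row, shape (2,)

print(col_mean)  # [2.5 3.5 4.5]
print(row_mean)  # [2. 5.]
```

This is why the normalization code above uses axis=0: each column is one feature, and the statistics are taken over all rows.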

2.2 Using sklearn

Normalization: MinMaxScaler

import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([[-1, -2, -3, -4, -5],
                 [0, 0, 0, 0, 0],
                 [1, 2, 3, 4, 5]])
minMaxScaler = MinMaxScaler()

# use fit_transform on the training data
data = minMaxScaler.fit_transform(data)
print(data)
"""
Output
[[0.  0.  0.  0.  0. ]
 [0.5 0.5 0.5 0.5 0.5]
 [1.  1.  1.  1.  1. ]]
"""

test_data = np.array([[-2, -3, -4, -5, -6],
                      [0, 0, 0, 0, 0],
                      [2, 3, 4, 5, 6]])
# use transform on the test data so it is scaled with the training min and max;
# values outside the training range end up outside [0, 1]
test_data = minMaxScaler.transform(test_data)

Standardization: StandardScaler

import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[-1, -2, -3, -4, -5],
                 [0, 0, 0, 0, 0],
                 [1, 2, 3, 4, 5]])
stdScaler = StandardScaler()
# use fit_transform on the training data
data = stdScaler.fit_transform(data)
print(data)
"""
Output
[[-1.22474487 -1.22474487 -1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487  1.22474487  1.22474487]]
"""

# use transform on the test data so it is scaled with the training mean and std
test_data = np.array([[-2, -3, -4, -5, -6],
                      [0, 0, 0, 0, 0],
                      [2, 3, 4, 5, 6]])
test_data = stdScaler.transform(test_data)

One-hot encoding: LabelBinarizer

import numpy as np
from sklearn.preprocessing import LabelBinarizer

train_label = np.array([0, 1, 2, 3, 4, 5])

lb = LabelBinarizer()
# use fit_transform on the training labels
train_label = lb.fit_transform(train_label)
print(train_label)
"""
Output
[[1 0 0 0 0 0]
 [0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 [0 0 0 0 0 1]]
"""

test_label = np.array([1, 2, 3, 4, 5, 0])
# use transform on the test labels
test_label = lb.transform(test_label)
print(test_label)
"""
Output
[[0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 [0 0 0 0 0 1]
 [1 0 0 0 0 0]]
"""
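
Going the other way is usually needed after prediction: a model outputs one row of scores per sample, and each row has to be mapped back to a label. A minimal sketch, assuming made-up scores in place of real model output. For fractional inputs, lb.inverse_transform picks the class with the largest value in each row, which matches np.argmax when the classes are 0..n-1.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(np.array([0, 1, 2, 3, 4, 5]))

# Hypothetical scores for 3 samples over the 6 classes (not real model output)
scores = np.array([[0.1, 0.6, 0.1, 0.1, 0.05, 0.05],
                   [0.7, 0.1, 0.05, 0.05, 0.05, 0.05],
                   [0.0, 0.1, 0.1, 0.1, 0.1, 0.6]])

labels_sklearn = lb.inverse_transform(scores)  # class with the largest score per row
labels_argmax = np.argmax(scores, axis=1)      # same result when classes are 0..n-1

print(labels_sklearn)  # [1 0 5]
print(labels_argmax)   # [1 0 5]
```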

3. Splitting the data

The raw data is as follows

import numpy as np

x_train = np.array([[1, 2, 3, 4]
                   for i in range(10000)])
y_train = np.array([i for i in range(10000)])
print(x_train.shape[0], y_train.shape[0])
"""
10000 10000
"""

3.1 Manual split

Split by a fixed ratio: every proportion-th sample goes to the validation set

def segmentation(x_train, y_train, proportion):
    train_data = []
    train_label = []
    val_data = []
    val_label = []
    for i in range(x_train.shape[0]):
        if i % proportion == 0:
            val_data.append(x_train[i])
            val_label.append(y_train[i])
        else:
            train_data.append(x_train[i])
            train_label.append(y_train[i])
    train_data = np.array(train_data, dtype=float)
    train_label = np.array(train_label, dtype=float)
    val_data = np.array(val_data, dtype=float)
    val_label = np.array(val_label, dtype=float)
    return train_data, train_label, val_data, val_label

train_data, train_label, val_data, val_label = segmentation(x_train, y_train, 10)
print(len(train_data), len(train_label), len(val_data), len(val_label))

"""
9000 9000 1000 1000
"""
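
The function above always sends every proportion-th row to the validation set, which can bias the split if the file happens to be ordered. A shuffled variant is sketched below. The helper name shuffled_split and its val_ratio and seed parameters are my own and do not come from the template.

```python
import numpy as np

def shuffled_split(x, y, val_ratio=0.1, seed=0):
    """Randomly split x and y into train and validation parts (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])        # shuffled row indices
    n_val = int(x.shape[0] * val_ratio)        # number of validation rows
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

x_train = np.array([[1, 2, 3, 4] for i in range(10000)])
y_train = np.arange(10000)

train_data, train_label, val_data, val_label = shuffled_split(x_train, y_train, val_ratio=0.1)
print(len(train_data), len(train_label), len(val_data), len(val_label))  # 9000 9000 1000 1000
```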

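3.2 Using sklearn

from sklearn.model_selection import train_test_split

SEED = 12   # fix the random seed for reproducibility
# train_test_split returns train data, validation data, train labels, validation labels, in that order
train_data, val_data, train_label, val_label = (
    train_test_split(x_train, y_train, test_size=0.2, random_state=SEED))

print(len(train_data), len(train_label), len(val_data), len(val_label))
"""
8000 8000 2000 2000
"""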

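III. Building the Models

1. Probabilistic generative model

The probabilistic generative model first requires splitting the dataset into two parts, one per class: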

class_0_id = []
class_1_id = []
for i in range(self.data['Y_train'].shape[0]):
    if self.data['Y_train'][i][0] == 0:
        class_0_id.append(i)
    else:
        class_1_id.append(i)

class_0 = self.data['X_train'][class_0_id]
class_1 = self.data['X_train'][class_1_id]

This uses advanced indexing. Reference: NumPy 高级索引 | 菜鸟教程 (runoob.com)
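
A boolean mask is another form of advanced indexing and avoids the Python loop used to build the index lists. A sketch on a tiny made-up dataset, assuming Y_train has shape (m, 1) as in the loop above:

```python
import numpy as np

# Made-up data: 5 samples with 2 features each, labels stored as a column vector
X_train = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.], [9., 10.]])
Y_train = np.array([[0], [1], [0], [1], [1]])

mask_0 = (Y_train[:, 0] == 0)   # boolean vector, True where the label is 0

class_0 = X_train[mask_0]       # rows whose label is 0
class_1 = X_train[~mask_0]      # all remaining rows

print(class_0.shape, class_1.shape)  # (2, 2) (3, 2)
```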

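Compute the mean and covariance matrix of each part separately. The shared covariance matrix is the weighted average of the two, weighted by the number of samples in each class: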

mean_0 = np.mean(class_0, axis=0)
mean_1 = np.mean(class_1, axis=0)

n = class_0.shape[1]
cov_0 = np.zeros((n, n))
cov_1 = np.zeros((n, n))

for i in range(class_0.shape[0]):
    cov_0 += np.dot(np.transpose([class_0[i] - mean_0]), [(class_0[i] - mean_0)]) / class_0.shape[0]

for i in range(class_1.shape[0]):
    cov_1 += np.dot(np.transpose([class_1[i] - mean_1]), [(class_1[i] - mean_1)]) / class_1.shape[0]

cov = (cov_0 * class_0.shape[0] + cov_1 * class_1.shape[0]) / (class_0.shape[0] + class_1.shape[0])
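
The per-class loops compute the covariance with a 1/N normalization. np.cov reproduces this when called with rowvar=False and bias=True. A sketch on random data (made up for illustration) that checks the two against each other:

```python
import numpy as np

rng = np.random.default_rng(0)
class_0 = rng.normal(size=(50, 3))   # made-up samples: 50 rows, 3 features

# Loop version, mirroring the template code above
mean_0 = np.mean(class_0, axis=0)
cov_loop = np.zeros((3, 3))
for i in range(class_0.shape[0]):
    d = class_0[i] - mean_0
    cov_loop += np.outer(d, d) / class_0.shape[0]

# Vectorized version: rows are samples (rowvar=False), normalize by N (bias=True)
cov_np = np.cov(class_0, rowvar=False, bias=True)

print(np.allclose(cov_loop, cov_np))  # True
```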

The parameters of the probabilistic generative model are:

$$\omega = (\mu^0 - \mu^1)^T \Sigma^{-1}$$

$$b = -\frac{1}{2} (\mu^0)^T \Sigma^{-1} \mu^0 + \frac{1}{2} (\mu^1)^T \Sigma^{-1} \mu^1 + \ln{\frac{m_0}{m_1}}$$

where $m_0$ and $m_1$ are the numbers of samples in class 0 and class 1. This gives

from numpy.linalg import inv

cov_inv = inv(cov)  # invert the shared covariance matrix once
self.w = np.transpose((mean_0 - mean_1).dot(cov_inv))
self.b = (-0.5) * mean_0.dot(cov_inv).dot(mean_0) \
         + 0.5 * mean_1.dot(cov_inv).dot(mean_1) \
         + np.log(float(class_0.shape[0]) / class_1.shape[0])
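
With w and b defined this way, sigmoid(w·x + b) is the posterior probability of class 0, so a sample is assigned to class 1 when that probability falls below 0.5. The sketch below runs the whole pipeline on made-up, well-separated 2-D data. The predict function and the synthetic data are my own illustration and are not part of the template.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
# Made-up, well-separated 2-D data for the two classes
class_0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(200, 2))
class_1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))

mean_0, mean_1 = class_0.mean(axis=0), class_1.mean(axis=0)
m0, m1 = class_0.shape[0], class_1.shape[0]
# Shared covariance: sample-count-weighted average of the per-class covariances
cov = (np.cov(class_0, rowvar=False, bias=True) * m0
       + np.cov(class_1, rowvar=False, bias=True) * m1) / (m0 + m1)

cov_inv = inv(cov)
w = (mean_0 - mean_1).dot(cov_inv)
b = (-0.5 * mean_0.dot(cov_inv).dot(mean_0)
     + 0.5 * mean_1.dot(cov_inv).dot(mean_1)
     + np.log(m0 / m1))

def predict(X):
    """Return 0 or 1 per row; sigmoid(z) is the posterior probability of class 0."""
    z = X.dot(w) + b
    p_class_0 = 1.0 / (1.0 + np.exp(-z))
    return np.where(p_class_0 > 0.5, 0, 1)

X_all = np.vstack([class_0, class_1])
y_all = np.concatenate([np.zeros(m0), np.ones(m1)])
accuracy = np.mean(predict(X_all) == y_all)
print(accuracy)
```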

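2. Logistic regression

Shuffling the dataset: use advanced indexing to shuffle the features and the labels together so that they still correspond to each other: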

import numpy as np
from numpy.random import shuffle

def _shuffle(X, Y):
    randomize = np.arange(X.shape[0])
    shuffle(randomize)
    return X[randomize], Y[randomize]


X = np.array([[1, 2],
              [3, 4],
              [5, 6],
              [7, 8],
              [9, 10]])
Y = np.array([0, 1, 2, 3, 4])

X, Y = _shuffle(X, Y)

print(X)
print(Y)

One possible output (the order is random):

[[ 5  6]
 [ 9 10]
 [ 1  2]
 [ 7  8]
 [ 3  4]]
[2 4 0 3 1]

Splitting into batches:

for idx in range(int(np.floor(len(Y_train) / batch_size))):
    X = X_train[idx*batch_size:(idx+1)*batch_size]
    Y = Y_train[idx*batch_size:(idx+1)*batch_size]
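
Because of the floor, the loop above skips the last len(Y_train) % batch_size samples in each pass. If those samples should be used too, the slice bounds can be generated with a step instead. A sketch with made-up array sizes:

```python
import numpy as np

X_train = np.arange(20).reshape(10, 2)   # made-up data: 10 samples
Y_train = np.arange(10)
batch_size = 4

batch_sizes = []
for start in range(0, len(Y_train), batch_size):
    X = X_train[start:start + batch_size]   # slicing past the end is safe in NumPy
    Y = Y_train[start:start + batch_size]
    batch_sizes.append(len(Y))

print(batch_sizes)  # [4, 4, 2]
```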

Computing the gradient: from the gradient formula $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \theta_j \right]$ we get the code below (the template adds lamda*w outside the mean, which only rescales $\lambda$ by a factor of $m$):

def _gradient_regularization(X, Y_label, w, b, lamda):
    # return the mean of the gradient
    y_pred = get_prob(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.mean(np.multiply(pred_error.T, X.T), 1)+lamda*w
    b_grad = -np.mean(pred_error)
    return w_grad, b_grad

Details: $w$ and $b$ together make up $\theta$ in the formula, an $(n + 1)$-dimensional column vector, and w_grad and b_grad are the corresponding parts of its gradient.

pred_error is a length-$m$ vector and X is an $m \times n$ matrix. They are multiplied element-wise with broadcasting (not as an inner product), and the result is averaged over the dimension of size $m$. This can be written in either of two ways:

# Transpose to get an n*m matrix, then average over the second dimension, i.e. axis=1
w_grad = -np.mean(np.multiply(pred_error.T, X.T), axis=1) + lamda*w
# Reshape pred_error to m*1 so it broadcasts against the m*n matrix, then average over the first dimension, i.e. axis=0
w_grad = -np.mean(np.multiply(pred_error.reshape(-1, 1), X), axis=0) + lamda*w
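
Putting shuffling, batching and the gradient together gives a complete mini-batch training loop. The sketch below is only an illustration under my own assumptions: get_prob is taken to be the sigmoid of X·w + b (its definition in the template is not shown here), and the data, learning rate and epoch count are made up.

```python
import numpy as np

def get_prob(X, w, b):
    # Assumed definition: sigmoid of the linear score
    return 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))

def gradient_regularization(X, Y_label, w, b, lamda):
    pred_error = Y_label - get_prob(X, w, b)                    # shape (m,)
    w_grad = -np.mean(pred_error.reshape(-1, 1) * X, axis=0) + lamda * w
    b_grad = -np.mean(pred_error)
    return w_grad, b_grad

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))                             # made-up features
Y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(float)     # made-up labels

w, b = np.zeros(2), 0.0
learning_rate, lamda, batch_size = 0.5, 0.001, 32

for epoch in range(30):
    order = rng.permutation(len(Y_train))                       # reshuffle every epoch
    X_shuf, Y_shuf = X_train[order], Y_train[order]
    for idx in range(len(Y_train) // batch_size):
        X = X_shuf[idx*batch_size:(idx+1)*batch_size]
        Y = Y_shuf[idx*batch_size:(idx+1)*batch_size]
        w_grad, b_grad = gradient_regularization(X, Y, w, b, lamda)
        w = w - learning_rate * w_grad                          # gradient descent step
        b = b - learning_rate * b_grad

accuracy = np.mean((get_prob(X_train, w, b) > 0.5) == Y_train)
print(accuracy)
```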

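3. Convolutional neural network

Building a network follows the three steps Hung-yi Lee describes!

Open the fridge: define the network structure, i.e. choose a set of candidate functions
Put the elephant in the fridge: choose the loss function and the optimization method, i.e. define how to judge whether a function is good
Close the fridge door: fit the dataset and find the best function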
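
The three steps can be sketched end to end as follows. This is a deliberately tiny network trained for one epoch on random data, just to show where each step goes; the layer sizes and the data are made up and bear no relation to a model that would do well on the assignment.

```python
import numpy as np
import keras
from keras import layers

NUM_CLASSES = 7

# Step 1: define the network structure
model = keras.Sequential([
    keras.Input(shape=(48, 48, 1)),
    layers.Conv2D(filters=8, kernel_size=(3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(units=NUM_CLASSES, activation="softmax"),
])

# Step 2: choose the loss function, optimizer and metrics
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy"])

# Step 3: fit the data (random images and random one-hot labels as placeholders)
rng = np.random.default_rng(0)
x = rng.random((32, 48, 48, 1)).astype("float32")
y = np.eye(NUM_CLASSES, dtype="float32")[rng.integers(0, NUM_CLASSES, size=32)]
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)

probs = model.predict(x, verbose=0)
print(probs.shape)  # (32, 7)
```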

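It is worth consulting the API documentation often: Keras API文档

3.1 Defining the network structure: model.add

3.1.1 Convolutional layer

A convolutional layer is often followed by BatchNormalization, which can speed up training and improve performance (I have not verified this myself).

In torch: torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros'). Example: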

nn.Conv2d(1, 64, 4, 2, 1)
nn.BatchNorm2d(64)

In Keras: keras.layers.Conv2D( filters, kernel_size, strides=(1, 1), padding="valid", kernel_initializer="glorot_uniform", bias_initializer="zeros", ... )

model.add(Conv2D(input_shape=(48, 48, 1), filters=64, kernel_size=(4, 4), strides=2, padding='same',
                 kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=SEED)))
model.add(BatchNormalization())
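
Both snippets map a 48×48 input to a 24×24 feature map, which can be checked by hand. With explicit padding p (the torch style) the output size is floor((n + 2p - k) / s) + 1, and with Keras padding='same' it is ceil(n / s). A small sketch; the helper names are my own:

```python
import math

def conv_out_explicit(n, kernel, stride, padding):
    """Output size with explicit zero padding on both sides (torch-style arguments)."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_out_same(n, stride):
    """Output size with Keras padding='same'."""
    return math.ceil(n / stride)

# nn.Conv2d(1, 64, 4, 2, 1) on a 48x48 input
print(conv_out_explicit(48, kernel=4, stride=2, padding=1))  # 24
# Conv2D(kernel_size=(4, 4), strides=2, padding='same') on a 48x48 input
print(conv_out_same(48, stride=2))                           # 24
```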

3.1.2 Activation layer

In torch:

nn.LeakyReLU(0.2)

In Keras: this part confused me. Some activations are imported from layers, some from activations, and an activation can also be specified through the activation argument of a layer.

model.add(LeakyReLU(alpha=0.2))

model.add(layers.Activation(activations.relu))

3.1.3 Pooling layer

In torch: torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)

nn.MaxPool2d(2, 2, 0)  # kernel_size=2, stride=2, padding=0

In Keras: keras.layers.MaxPooling2D( pool_size=(2, 2), strides=None, padding="valid", ... )

model.add(MaxPooling2D((2, 2)))
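
What a 2×2 max pooling layer with stride 2 computes can be reproduced in a few lines of NumPy. This is only a sketch for a single-channel image with even height and width, not how either framework implements it:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single-channel image of even height and width."""
    h, w = x.shape
    # Split the image into 2x2 blocks, then take the maximum inside each block
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 8, 1, 0],
              [7, 6, 3, 2]])
pooled = max_pool_2x2(x)
print(pooled)
# [[4 8]
#  [9 3]]
```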

3.1.4 Fully connected layer

In torch: torch.nn.Linear(in_features, out_features, bias=True)

nn.Linear(256*3*3, 1024)

In Keras: keras.layers.Dense( units, activation=None, use_bias=True, kernel_initializer="glorot_uniform", bias_initializer="zeros", ... )

model.add(Dense(units=1024, activation='relu'))

3.1.5 Some optimizations

1. Setting the kernel initializer:

In torch:

def gaussian_weights_init(m):
    classname = m.__class__.__name__
    if classname.startswith('Conv'):
        m.weight.data.normal_(0.0, 0.02)


self.cnn = nn.Sequential(
           # a lot of code omitted
        )
self.fc = nn.Sequential(
           # a lot of code omitted
        )

self.cnn.apply(gaussian_weights_init)
self.fc.apply(gaussian_weights_init)

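In Keras: pass the kernel_initializer argument to every layer that has weights (rather tedious).

I asked GPT, which suggested the following. I have not tried it. Note that assigning kernel_initializer only affects weights that have not been created yet, so it would have to run before the layers are built:

initializer = RandomNormal(mean=0.0, stddev=0.05)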

for layer in model.layers: 
    if hasattr(layer, 'kernel_initializer'): 
        layer.kernel_initializer = initializer

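2. Adding a Dropout layer:

Randomly drops part of the layer's inputs during training to reduce overfitting (it seems it does not work well together with BatchNormalization)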

model.add(Dropout(rate=0.5))
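
The idea behind the layer can be sketched in NumPy as "inverted dropout": during training a fraction rate of the inputs is set to zero and the rest are scaled by 1 / (1 - rate) so the expected value stays the same, and at inference time the input passes through unchanged. The function below is my own illustration, not the Keras source.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the inputs and rescale the rest."""
    if not training:
        return x                              # inference: pass through unchanged
    keep_mask = rng.random(x.shape) >= rate   # True for the units that are kept
    return x * keep_mask / (1.0 - rate)       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out_train = dropout(x, rate=0.5, training=True, rng=rng)
out_test = dropout(x, rate=0.5, training=False, rng=rng)

print(np.mean(out_train == 0))  # close to 0.5
print(np.mean(out_train))       # close to 1.0
```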

3.2 Choosing the optimization method: model.compile

The signature is:

Model.compile(
    optimizer="rmsprop",
    loss=None,
    loss_weights=None,
    metrics=None,
    weighted_metrics=None,
    run_eagerly=False,
    steps_per_execution=1,
    jit_compile="auto",
    auto_scale_loss=True,
)

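optimizer: the optimizer, e.g. SGD, RMSprop, Adam

metrics: evaluation metrics, e.g. accuracy, binary_accuracy (binary classification), categorical_accuracy (multi-class classification)

loss: the loss function, e.g. mse, binary_crossentropy, categorical_crossentropy

verbose (an argument of model.fit rather than model.compile): logging mode. verbose=0 prints nothing, verbose=1 shows a progress bar for each epoch, verbose=2 prints one line per epoch

Example: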

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=[keras.metrics.CategoricalAccuracy()])
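
To see what this loss and metric measure, both can be written out in NumPy. This is a sketch of the definitions, with made-up labels and predictions, not the Keras implementation:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose arg-max prediction matches the one-hot label."""
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))

# Two made-up samples over three classes
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)   # (-ln 0.7 - ln 0.3) / 2
acc = categorical_accuracy(y_true, y_pred)
print(loss)
print(acc)  # 0.5
```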

3.3 Fitting the dataset: model.fit

The parameters of fit are:

model.fit(x, y, batch_size, epochs, verbose, validation_split, validation_data, validation_freq)

fit returns a History object that records the loss and metrics of every epoch and can be used for plotting

H = model.fit(x_data, x_label,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              validation_data=(val_data, val_label),
              verbose=1)

import matplotlib.pyplot as plt

# plot
N = np.arange(0, EPOCHS)

plt.figure()
plt.plot(N, H.history["loss"], label="train loss")
plt.plot(N, H.history["val_loss"], label="val loss")
# the history keys follow the metric name; CategoricalAccuracy() is logged as "categorical_accuracy"
plt.plot(N, H.history["categorical_accuracy"], label="train_acc")
plt.plot(N, H.history["val_categorical_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy (Simple NN)")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()

After fitting, the model can make predictions on the test data

pred_raw = model.predict(test_data)
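
pred_raw holds one row of class probabilities per image, so the predicted label is the index of the largest value in each row. The sketch below uses a made-up array in place of real model output and writes the labels to a CSV file. The output path and the id,label header are assumptions on my part; the required submission format should be checked on the competition page.

```python
import csv
import os
import tempfile
import numpy as np

# Stand-in for model.predict(test_data): 3 samples, 7 emotion classes (made-up numbers)
pred_raw = np.array([[0.1, 0.0, 0.1, 0.6, 0.1, 0.1, 0.0],
                     [0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.0],
                     [0.0, 0.0, 0.1, 0.1, 0.1, 0.1, 0.6]])

pred_label = np.argmax(pred_raw, axis=1)   # most probable class per row

# Hypothetical output path and header
out_path = os.path.join(tempfile.gettempdir(), 'submission_example.csv')
with open(out_path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'label'])
    for i, label in enumerate(pred_label):
        writer.writerow([i, int(label)])

print(pred_label)  # [3 0 6]
```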

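Save the model promptly. I do not want a whole night of training to go to waste.

# save the model (Keras 3 requires a .keras or .h5 extension in the file name)
save_path = './model_0120.keras'
model.save(save_path)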

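I have also seen fit_generator used and will look into it later. (In recent TensorFlow/Keras versions it is deprecated, and model.fit accepts generators directly.)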
