Neural Networks from Scratch in Python (Part 16): Binary Logistic Regression

Binary Logistic Regression

Introduction
[1. Sigmoid Activation Function](#1. Sigmoid Activation Function)
[2. Sigmoid Function Derivative](#2. Sigmoid Function Derivative)
[3. Sigmoid Function Code](#3. Sigmoid Function Code)
[4. Binary Cross-Entropy Loss](#4. Binary Cross-Entropy Loss)
[5. Binary Cross-Entropy Loss Derivative](#5. Binary Cross-Entropy Loss Derivative)
[6. Binary Cross-Entropy Code](#6. Binary Cross-Entropy Code)
[7. Implementing Binary Logistic Regression and Binary Cross-Entropy Loss](#7. Implementing Binary Logistic Regression and Binary Cross-Entropy Loss)
Full code up to this point:

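Introduction

Now that we have learned how to create and train a neural network, let's consider an alternative output layer. Until now, our output layer has produced a probability distribution: every value represents the confidence that a particular class is the correct one, and these confidences sum to 1. We will now cover a different output-layer option, in which each neuron separately represents two classes: 0 for one class and 1 for the other. A model with this type of output layer is called binary logistic regression. A single neuron of this kind can distinguish between two classes, such as cat vs. dog, but also cat vs. not cat, or any other pairing of two classes, and you can even have several such neurons. For example, a model might have two binary output neurons: one that separates "person/not person" and another that separates "indoors/outdoors". Binary logistic regression is a regressor type of algorithm; it differs from what we have built so far in that the output layer uses the sigmoid activation function instead of softmax, and the loss is calculated with binary cross-entropy rather than categorical cross-entropy.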
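To make the difference between the two output-layer styles concrete, here is a minimal sketch that applies softmax and sigmoid to the same raw layer outputs. The input numbers are made up for illustration, and the variable names are assumptions of this sketch rather than part of the code developed in this series.

```python
import numpy as np

# Hypothetical raw outputs of an output layer with two neurons
# for a batch of three samples (made-up numbers)
logits = np.array([[ 2.0, -1.0],
                   [ 0.5,  0.5],
                   [-3.0,  4.0]])

# Softmax: one distribution per sample, the values compete with each other
exp_values = np.exp(logits - np.max(logits, axis=1, keepdims=True))
softmax_out = exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Sigmoid: each neuron is squashed independently of the others
sigmoid_out = 1 / (1 + np.exp(-logits))

print(softmax_out.sum(axis=1))  # every row sums to 1
print(sigmoid_out.sum(axis=1))  # rows are not constrained to sum to 1
```

With sigmoid outputs, each neuron can be read as its own yes/no decision, which is what allows one model to answer "person/not person" and "indoors/outdoors" at the same time.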

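1. Sigmoid Activation Function

The sigmoid activation function, used with regressors, "squashes" a range of outputs from negative infinity to positive infinity into the range between 0 and 1. The bounds 0 and 1 represent the two possible classes. The equation of the sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

For the purposes of neural networks, we will use our common notation:

$$\sigma_{i,j} = \frac{1}{1 + e^{-z_{i,j}}}$$

The denominator of the sigmoid function contains $e$ raised to the power of $-z_{i,j}$, where $z$ is a single output value of the layer that this activation function takes as input. The index $i$ denotes the current sample and the index $j$ denotes the current output in this sample.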

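Plotted, the sigmoid function is an S-shaped curve.

Note that the output of this function is centered on 0.5, since $\sigma(0) = 0.5$, and that the curve "flattens out" into a horizontal line as it approaches 0 or 1. The sigmoid function approaches its maximum and minimum exponentially fast. For example, for an input of 2 the output is about 0.88, already quite close to 1, and for an input of 3 the output is about 0.95, and so on. The same holds for negative values: $\sigma(-2) \approx 0.12$ and $\sigma(-3) \approx 0.05$. This property makes the sigmoid activation function a good candidate for the final-layer output of a binary logistic regression model.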
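The values quoted above are easy to reproduce. The following sketch evaluates the sigmoid at those inputs; the helper name `sigmoid` is an assumption made for this snippet only.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid as defined in the text: 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

# Inputs mentioned in the text, plus 0 as the midpoint
inputs = np.array([-3.0, -2.0, 0.0, 2.0, 3.0])
outputs = sigmoid(inputs)

for x, s in zip(inputs, outputs):
    print(f"sigmoid({x:+.0f}) = {s:.2f}")
```

Rounded to two decimals, the results match the figures given in the paragraph above, with 0.5 at the midpoint.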

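For commonly used functions such as the sigmoid, the derivative is almost always publicly available knowledge. Unless you are inventing a new function, you usually do not need to work the derivative out by hand, but it is still a good exercise. The derivative of the sigmoid function solves to $\sigma_{i,j}(1 - \sigma_{i,j})$. If you would rather take this result as given and skip the mathematical derivation, feel free to jump to the next section.

2. Sigmoid Function Derivative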

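Let's define the derivative of the sigmoid function with respect to its input:

$$\sigma'_{i,j} = \frac{d}{dz_{i,j}} \sigma_{i,j} = \frac{d}{dz_{i,j}} \left[ \frac{1}{1 + e^{-z_{i,j}}} \right]$$

At this point we could start computing the derivative of a division, but since the numerator is just the value 1, the whole fraction is simply the reciprocal of its denominator and can be written as a negative power:

$$\frac{1}{1 + e^{-z_{i,j}}} = \left(1 + e^{-z_{i,j}}\right)^{-1}$$

The derivative of a power is easier to compute than the derivative of a division, so let's update the equation accordingly:

$$\sigma'_{i,j} = \frac{d}{dz_{i,j}} \left(1 + e^{-z_{i,j}}\right)^{-1}$$

Now we can compute the derivative of the expression raised to the power of $-1$: it equals the exponent multiplied by the expression itself with its power lowered by 1. Then, following the chain rule, we multiply by the derivative of the expression itself:

$$= -1 \cdot \left(1 + e^{-z_{i,j}}\right)^{-1-1} \cdot \frac{d}{dz_{i,j}} \left(1 + e^{-z_{i,j}}\right)$$

As we have already learned, the derivative of a sum is the sum of the derivatives:

$$= -\left(1 + e^{-z_{i,j}}\right)^{-2} \cdot \left( \frac{d}{dz_{i,j}} 1 + \frac{d}{dz_{i,j}} e^{-z_{i,j}} \right)$$

The derivative of $1$ with respect to $z_{i,j}$ is $0$, because the derivative of a constant is always $0$. The derivative of the constant $e$ raised to the power of $-z_{i,j}$ equals that same value multiplied by the derivative of the exponent:

$$= -\left(1 + e^{-z_{i,j}}\right)^{-2} \cdot \left( 0 + e^{-z_{i,j}} \cdot \frac{d}{dz_{i,j}} \left[ -z_{i,j} \right] \right)$$

The derivative of $-z_{i,j}$ with respect to $z_{i,j}$ equals $-1$: the factor $-1$ is a constant and can be moved outside the derivative, which leaves the derivative of $z_{i,j}$ with respect to $z_{i,j}$, and we know that this equals $1$:

$$= -\left(1 + e^{-z_{i,j}}\right)^{-2} \cdot \left( e^{-z_{i,j}} \cdot (-1) \right)$$

Now we can move the minus sign outside the parentheses, where it cancels the other minus sign:

$$= \left(1 + e^{-z_{i,j}}\right)^{-2} \cdot e^{-z_{i,j}}$$

Let's rewrite the resulting equation: the expression raised to the power of $-2$ becomes the reciprocal of its square, and the multiplier (the value we multiply by) becomes the numerator of the resulting fraction:

$$= \frac{e^{-z_{i,j}}}{\left(1 + e^{-z_{i,j}}\right)^{2}}$$

The denominator of this fraction can be written as the expression multiplied by itself instead of raised to the power of 2:

$$= \frac{e^{-z_{i,j}}}{\left(1 + e^{-z_{i,j}}\right) \left(1 + e^{-z_{i,j}}\right)}$$

Now we can split this fraction into two separate fractions, one with $1$ in the numerator and the other with $e^{-z_{i,j}}$, each taking one of the factors of the denominator. We can do this because the two fractions are joined by multiplication:

$$= \frac{1}{1 + e^{-z_{i,j}}} \cdot \frac{e^{-z_{i,j}}}{1 + e^{-z_{i,j}}}$$

If you remember the equation of the sigmoid function, you might already see where this is going: the multiplicand (the value being multiplied by the multiplier) is exactly the sigmoid function's equation. Let's keep working on it. Ideally, the numerator of the multiplier could also be expressed in a form that contains the sigmoid function's equation. We can get there by adding 1 and subtracting 1, since that does not change the value:

$$= \frac{1}{1 + e^{-z_{i,j}}} \cdot \frac{1 + e^{-z_{i,j}} - 1}{1 + e^{-z_{i,j}}}$$

This lets us split the multiplier into two separate fractions at its minus sign:

$$= \frac{1}{1 + e^{-z_{i,j}}} \cdot \left( \frac{1 + e^{-z_{i,j}}}{1 + e^{-z_{i,j}}} - \frac{1}{1 + e^{-z_{i,j}}} \right)$$

The minuend of the multiplier (the value we subtract from) equals 1, because its numerator and denominator are equal, and the subtrahend (the value subtracted from the minuend) is in fact the sigmoid function's equation as well:

$$= \sigma_{i,j} \cdot \left( 1 - \sigma_{i,j} \right)$$

It turns out that the derivative of the sigmoid function equals the function itself multiplied by the difference between 1 and the function. This lets us write the derivative in code very easily.

Full solution:

$$\sigma'_{i,j} = \frac{d}{dz_{i,j}} \left[ \frac{1}{1 + e^{-z_{i,j}}} \right] = \frac{e^{-z_{i,j}}}{\left(1 + e^{-z_{i,j}}\right)^{2}} = \frac{1}{1 + e^{-z_{i,j}}} \cdot \left( 1 - \frac{1}{1 + e^{-z_{i,j}}} \right) = \sigma_{i,j} \cdot \left( 1 - \sigma_{i,j} \right)$$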
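One way to gain confidence in the result $\sigma(1 - \sigma)$ is to compare it with a numerical derivative. The sketch below uses central finite differences; the helper names, the test points and the step size `h` are assumptions chosen for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # Analytical result of the derivation: sigma * (1 - sigma)
    s = sigmoid(z)
    return s * (1 - s)

# Arbitrary test points (an assumption of this sketch)
z = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
h = 1e-5  # small step for the central difference

numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytical = sigmoid_derivative(z)

# Largest disagreement between the two approaches
print(np.max(np.abs(numerical - analytical)))
```

The difference should be tiny compared with the derivative values themselves, which peak at 0.25 for an input of 0.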

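3. Sigmoid Function Code

As with the other activation functions, we will write a forward pass method and a backward pass method. In the forward pass, we apply the sigmoid function to the inputs. In the backward pass, we use the derivative of the sigmoid function which, as we found when deriving it, equals the sigmoid output from the forward pass multiplied by the difference between $1$ and that output. Following the chain rule, we multiply this by the incoming gradient.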

import numpy as np

# Sigmoid activation
class Activation_Sigmoid:
    # Forward pass
    def forward(self, inputs):
        # Save input and calculate/save output
        # of the sigmoid function
        self.inputs = inputs
        self.output = 1 / (1 + np.exp(-inputs))
        
    # Backward pass
    def backward(self, dvalues):
        # Derivative - calculates from output of the sigmoid function
        self.dinputs = dvalues * (1 - self.output) * self.output
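As a quick usage sketch, the class can be exercised on a small array. The class body is repeated from the text so the snippet runs on its own; the input values and the all-ones incoming gradient are made up for illustration.

```python
import numpy as np

# Same class as in the text, repeated so this snippet is self-contained
class Activation_Sigmoid:
    def forward(self, inputs):
        self.inputs = inputs
        self.output = 1 / (1 + np.exp(-inputs))

    def backward(self, dvalues):
        self.dinputs = dvalues * (1 - self.output) * self.output

# Made-up layer outputs: 2 samples, 2 output neurons each
z = np.array([[0.0, 2.0],
              [-2.0, 5.0]])

activation = Activation_Sigmoid()
activation.forward(z)
print(activation.output)

# Pretend the gradient arriving from the loss is all ones, so that
# dinputs equals the local derivative of the sigmoid
activation.backward(np.ones_like(z))
print(activation.dinputs)
```

Note how the gradient is largest (0.25) where the input is 0 and shrinks quickly as the output saturates toward 0 or 1.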

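Now that we have the new activation function, we need to code the new loss calculation: binary cross-entropy.

4. Binary Cross-Entropy Loss

To calculate binary cross-entropy loss, we keep using the negative log concept from categorical cross-entropy loss. Rather than calculating it only on the target class, we sum the log-likelihoods of the correct and incorrect classes for each neuron separately. Since the class values are either 0 or 1, we can simplify the incorrect class to $1 - \text{correct class}$, which inverts the value. We can then calculate the negative log-likelihood of the correct and incorrect classes and add them together. We present the formula in two forms: the first follows the description above directly, and the second, optimized version moves the minus signs and drops the redundant parentheses:

$$L_{i,j} = (y_{i,j})(-\log(\hat{y}_{i,j})) + (1 - y_{i,j})(-\log(1 - \hat{y}_{i,j}))$$

$$L_{i,j} = -y_{i,j} \cdot \log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot \log(1 - \hat{y}_{i,j})$$

In code, this will start as follows (it will be modified shortly, so do not commit it to your codebase yet):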

sample_losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

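Because a model can contain multiple binary outputs and, unlike with categorical cross-entropy loss, each of them produces its own prediction, the loss calculated for a single sample is a vector of losses containing one value per output. What we need is a single loss per sample, and to get it we take the mean of the losses of all outputs of that sample: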

sample_losses = np.mean(sample_losses, axis=-1)

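The last parameter, axis=-1, tells NumPy to calculate the mean along the last dimension. To make this easier to visualize, let's take a simple example. Assume this is the output of a model whose output layer has 3 neurons, after it has already been passed through the binary cross-entropy loss function: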

outputs = np.array([[1, 2, 3],
                    [2, 4, 6],
                    [0, 5, 10],
                    [11, 12, 13],
                    [5, 10, 15]])

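These numbers are completely made up for this example. We want to take each output vector, for example [1, 2, 3], calculate the mean of its numbers, and put the result into the output vector. We then want to repeat this for the other vectors and return the resulting vector, which will be a one-dimensional array. With NumPy: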

np.mean(outputs, axis=-1)

>>>
array([ 2.,  4.,  5., 12., 10.])

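The mean of the first output vector is indeed 2, the mean of the second is indeed 4, and so on.

We will also inherit from the Loss class, so the overall loss calculation is handled by the calculate method that we already created for the categorical cross-entropy loss class.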
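Putting the two steps together, the following sketch computes the per-output losses and then the per-sample losses for a tiny batch. The targets and predictions are made up, and clipping is deliberately left out here because none of the predictions is exactly 0 or 1.

```python
import numpy as np

# Made-up targets and predictions: 2 samples, 2 binary outputs each
y_true = np.array([[1, 0],
                   [0, 1]])
y_pred = np.array([[0.9, 0.2],
                   [0.4, 0.5]])

# Per-output losses, shape (2, 2)
output_losses = -(y_true * np.log(y_pred) +
                  (1 - y_true) * np.log(1 - y_pred))

# Per-sample losses: mean over the outputs of each sample, shape (2,)
sample_losses = np.mean(output_losses, axis=-1)

print(output_losses)
print(sample_losses)
```

For a target of 1 only the $-\log(\hat{y})$ term contributes, and for a target of 0 only the $-\log(1 - \hat{y})$ term does, so a confident correct prediction such as 0.9 for a target of 1 yields a small loss.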

5. Binary Cross-Entropy Loss Derivative

To calculate the gradient from here, we already know that the derivative of the natural logarithm is $1/x$ and that the derivative of $1 - x$ is $-1$. In simplified form, this gives us $-(y_{\text{true}} / \hat{y} + (1 - y_{\text{true}}) / (1 - \hat{y}) \cdot (-1))$, which equals $-(y_{\text{true}} / \hat{y} - (1 - y_{\text{true}}) / (1 - \hat{y}))$.

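To calculate the partial derivative of this loss function with respect to the predicted input, we will use the latter version of the loss equation. It does not really matter which version we use in this case:

$$\frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}} = \frac{\partial}{\partial \hat{y}_{i,j}} \left[ -y_{i,j} \cdot \log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot \log(1 - \hat{y}_{i,j}) \right]$$

The expression we are differentiating consists of two sub-expressions that are the components of a sum. We can write it as a sum of derivatives:

$$= \frac{\partial}{\partial \hat{y}_{i,j}} \left[ -y_{i,j} \cdot \log(\hat{y}_{i,j}) \right] + \frac{\partial}{\partial \hat{y}_{i,j}} \left[ -(1 - y_{i,j}) \cdot \log(1 - \hat{y}_{i,j}) \right]$$

Both components contain $y_{i,j}$ (the target value), which is a constant when we differentiate with respect to $\hat{y}_{i,j}$ (the predicted value, a different variable), so we can move it outside the derivatives together with the other constants and the minus signs:

$$= -y_{i,j} \cdot \frac{\partial}{\partial \hat{y}_{i,j}} \log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot \frac{\partial}{\partial \hat{y}_{i,j}} \log(1 - \hat{y}_{i,j})$$

Now, as with the derivative of categorical cross-entropy loss, we need the derivative of the logarithm, which equals the reciprocal of its argument multiplied (by the chain rule) by the derivative of that argument. Let's apply this to both partial derivatives:

$$= -y_{i,j} \cdot \frac{1}{\hat{y}_{i,j}} \cdot \frac{\partial}{\partial \hat{y}_{i,j}} \hat{y}_{i,j} - (1 - y_{i,j}) \cdot \frac{1}{1 - \hat{y}_{i,j}} \cdot \frac{\partial}{\partial \hat{y}_{i,j}} \left( 1 - \hat{y}_{i,j} \right)$$

The first partial derivative equals 1, because the value being differentiated is the same as the variable we differentiate with respect to. The second partial derivative can be written as a difference of derivatives:

$$= -y_{i,j} \cdot \frac{1}{\hat{y}_{i,j}} \cdot 1 - (1 - y_{i,j}) \cdot \frac{1}{1 - \hat{y}_{i,j}} \cdot \left( \frac{\partial}{\partial \hat{y}_{i,j}} 1 - \frac{\partial}{\partial \hat{y}_{i,j}} \hat{y}_{i,j} \right)$$

Of these two new derivatives, the first equals 0, because the derivative of a constant is always 0, and the second equals 1, because the value being differentiated is the same as the variable we differentiate with respect to:

$$= -\frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1 - y_{i,j}}{1 - \hat{y}_{i,j}} \cdot (0 - 1)$$

Finally, we can clean up to get the resulting equation:

$$= -\frac{y_{i,j}}{\hat{y}_{i,j}} + \frac{1 - y_{i,j}}{1 - \hat{y}_{i,j}} = -\left( \frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1 - y_{i,j}}{1 - \hat{y}_{i,j}} \right)$$

The partial derivative of binary cross-entropy loss solves to a fairly simple equation that is easy to implement in code.

Full solution:

$$\frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}} = \frac{\partial}{\partial \hat{y}_{i,j}} \left[ -y_{i,j} \cdot \log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot \log(1 - \hat{y}_{i,j}) \right] = -\left( \frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1 - y_{i,j}}{1 - \hat{y}_{i,j}} \right)$$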

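This partial derivative is the derivative of a single output's loss. With any type of output, we always need the derivative with respect to the sample loss rather than an atomic output loss, because in the forward pass we calculate the mean of all output losses to form the sample loss:

$$L_i = \frac{1}{J} \sum_{j} L_{i,j}$$

For backpropagation, we have to calculate the partial derivative of the sample loss with respect to each input of the loss, that is, each prediction:

$$\frac{\partial L_i}{\partial \hat{y}_{i,j}} = \frac{\partial L_i}{\partial L_{i,j}} \cdot \frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}}$$

We have just calculated the partial derivative of a single output's loss with respect to the related prediction. Now we need the partial derivative of the sample loss with respect to a single output's loss:

$$\frac{\partial L_i}{\partial L_{i,j}} = \frac{\partial}{\partial L_{i,j}} \left[ \frac{1}{J} \sum_{j} L_{i,j} \right]$$

$1/J$, where $J$ is the number of outputs, is a constant and can be moved outside the derivative. Since we are calculating the derivative with respect to a given output $j$, only one element of the sum depends on it, so the sum reduces to that element:

$$= \frac{1}{J} \cdot \frac{\partial}{\partial L_{i,j}} L_{i,j}$$

The remaining derivative equals 1, because the derivative of a variable with respect to itself is 1:

$$= \frac{1}{J} \cdot 1 = \frac{1}{J}$$

Full solution:

$$\frac{\partial L_i}{\partial L_{i,j}} = \frac{\partial}{\partial L_{i,j}} \left[ \frac{1}{J} \sum_{j} L_{i,j} \right] = \frac{1}{J} \cdot \frac{\partial}{\partial L_{i,j}} L_{i,j} = \frac{1}{J}$$

Now we can apply the chain rule and update the equation for the partial derivative of the sample loss with respect to a single prediction:

$$\frac{\partial L_i}{\partial \hat{y}_{i,j}} = \frac{\partial L_i}{\partial L_{i,j}} \cdot \frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}} = -\frac{1}{J} \cdot \left( \frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1 - y_{i,j}}{1 - \hat{y}_{i,j}} \right)$$

We have to perform this normalization because each output returns its own derivative. Without it, each additional output would raise the gradients and require changes to other hyperparameters, including the learning rate.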
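The normalized gradient can be checked numerically in the same spirit as any other derivative. The sketch below compares the analytical expression, including the division by the number of outputs $J$, against central finite differences for a single sample; the function names, sample values and step size are assumptions of this sketch.

```python
import numpy as np

def sample_loss(y_pred, y_true):
    # Mean binary cross-entropy over the J outputs of one sample
    losses = -(y_true * np.log(y_pred) +
               (1 - y_true) * np.log(1 - y_pred))
    return np.mean(losses)

def analytical_gradient(y_pred, y_true):
    # -(1/J) * (y / y_hat - (1 - y) / (1 - y_hat))
    J = y_pred.shape[-1]
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / J

# One made-up sample with J = 3 binary outputs
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.7, 0.2, 0.4])

# Central finite differences, one prediction at a time
h = 1e-6
numerical = np.zeros_like(y_pred)
for j in range(len(y_pred)):
    step = np.zeros_like(y_pred)
    step[j] = h
    numerical[j] = (sample_loss(y_pred + step, y_true) -
                    sample_loss(y_pred - step, y_true)) / (2 * h)

print(numerical)
print(analytical_gradient(y_pred, y_true))
```

If the division by `J` were dropped from `analytical_gradient`, the two results would differ by a factor of 3 in this example, which is a simple way to see why the normalization belongs in the backward pass.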

6. Binary Cross-Entropy Code

In our code, this will be:

# Number of samples
samples = len(dvalues)
# Number of outputs in every sample
# We'll use the first sample to count them
outputs = len(dvalues[0])
# Calculate gradient
self.dinputs = -(y_true / dvalues - (1 - y_true) / (1 - dvalues)) / outputs

As we did with categorical cross-entropy loss, we also need to normalize the gradient so that it is invariant to the number of samples we calculate it on:

# Normalize gradient
self.dinputs = self.dinputs / samples

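Finally, we need to address the numerical instability of the logarithm function. Mathematically, the sigmoid output lies strictly between 0 and 1, but in floating-point arithmetic it can saturate to exactly 0 or 1, and $\log(0)$ presents a slight issue: because of how it is calculated, it returns negative infinity. That alone might not be a serious problem, but the mean of a list of otherwise finite values that contains $-\infty$ is $-\infty$, just as the mean of such a list that contains positive infinity is positive infinity.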

import numpy as np

np.log(0)

>>>
__main__:1: RuntimeWarning: divide by zero encountered in log
-inf

print(np.mean([5, 2, 4, np.log(0)]))

>>>
-inf

This is a similar issue to the one we discussed for categorical cross-entropy loss in chapter 5. To avoid it, we add clipping on the batch of values:

# Clip data to prevent division by 0
# Clip both sides to not drag mean towards any value
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

Now we use these clipped values, instead of the original ones, in the forward pass:

# Calculate sample-wise loss
sample_losses = -(y_true * np.log(y_pred_clipped) + (1 - y_true) * np.log(1 - y_pred_clipped))
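To see the effect of clipping, the following sketch feeds the loss calculation predictions that sit exactly on the problematic extremes. The prediction and target values are made up; the clipping bounds are the ones used in the text.

```python
import numpy as np

# Made-up predictions that include the problematic extremes 0 and 1
y_pred = np.array([[0.0], [1.0], [0.5]])
y_true = np.array([[1], [0], [1]])

# Same clipping bounds as in the text
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

sample_losses = -(y_true * np.log(y_pred_clipped) +
                  (1 - y_true) * np.log(1 - y_pred_clipped))
sample_losses = np.mean(sample_losses, axis=-1)

print(sample_losses)           # large but finite for the first two samples
print(np.mean(sample_losses))  # the batch mean stays finite as well
```

The first two samples are confidently wrong, so their losses are large, but they no longer turn the mean of the whole batch into infinity.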

When we perform the division during the derivative calculation, the values passed in as dvalues (the model's predictions) may contain 0 or 1. Either value causes a problem, in the $y_{\text{true}} / \text{dvalues}$ part or the $(1 - y_{\text{true}}) / (1 - \text{dvalues})$ part respectively: a 0 in the former, or $1 - 1 = 0$ in the latter, results in division by 0. We therefore need to clip these values as well:

# Clip data to prevent division by 0
# Clip both sides to not drag mean towards any value
clipped_dvalues = np.clip(dvalues, 1e-7, 1 - 1e-7)

Now, as in the forward pass, we can use these clipped values:

# Calculate gradient
self.dinputs = -(y_true / clipped_dvalues - (1 - y_true) / (1 - clipped_dvalues)) / outputs

The full code for binary cross-entropy:

# Binary cross-entropy loss
class Loss_BinaryCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Calculate sample-wise loss
        sample_losses = -(y_true * np.log(y_pred_clipped) +
        (1 - y_true) * np.log(1 - y_pred_clipped))
        sample_losses = np.mean(sample_losses, axis=-1)
        # Return losses
        return sample_losses
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of outputs in every sample
        # We'll use the first sample to count them
        outputs = len(dvalues[0])
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        clipped_dvalues = np.clip(dvalues, 1e-7, 1 - 1e-7)
        # Calculate gradient
        self.dinputs = -(y_true / clipped_dvalues -
        (1 - y_true) / (1 - clipped_dvalues)) / outputs
        # Normalize gradient
        self.dinputs = self.dinputs / samples

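Now that we have the new activation function and loss calculation, we will edit our existing softmax classifier to implement a binary logistic regression model.

7. Implementing Binary Logistic Regression and Binary Cross-Entropy Loss

With these new classes in place, our code changes are in the execution of the actual code rather than in the classes themselves. The first change is to make the spiral_data function output 2 classes instead of 3: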

# Create dataset
X, y = spiral_data(samples=100, classes=2)

Next, we reshape our labels, as they are not sparse anymore. They are binary, 0 or 1:

# Reshape labels to be a list of lists
# Inner list contains one output (either 0 or 1)
# per each output neuron, 1 in this case
y = y.reshape(-1, 1)

Consider the difference here. Initially, the y output of the spiral_data function looks like this:

X, y = spiral_data(samples=100, classes=2)
print(y[:5])

>>>
[0 0 0 0 0]

Then we reshape it here for binary logistic regression:

y = y.reshape(-1, 1)
print(y[:5])

>>>
[[0]
 [0]
 [0]
 [0]
 [0]]

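Why do we do this? Originally, with the softmax classifier, the values from spiral_data could be used directly as target labels, because they contain the correct class labels in numerical form: the index of the correct class, where each neuron of the output layer is a separate class, for example [0, 1, 1, 0, 1]. Here, however, we are trying to represent binary outputs, where each neuron on its own represents 2 possible classes. In our current example we have a single output neuron, so the output of the network should be a tensor (array) containing one value per sample, with target values of either 0 or 1, for example [[0], [1], [1], [0], [1]].

The call .reshape(-1, 1) reshapes the data into 2 dimensions, where the second dimension contains a single element and the size of the first dimension is calculated from the remaining data (-1 marks the dimension to be inferred). In NumPy, -1 may be used only once in a shape. Thanks to this, we do not need the same number of samples every time; NumPy handles the calculation for us. In the example above, all values are 0 because the spiral_data function generates the data one class at a time, starting with 0. We also need to reshape the $y$ values of the test data in the same way.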
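A short sketch of how the inferred dimension behaves, using made-up label arrays of two different lengths (the names `y_small` and `y_large` are assumptions of this snippet):

```python
import numpy as np

# Made-up sparse labels of two different lengths
y_small = np.array([0, 1, 1, 0, 1])
y_large = np.arange(200) % 2

# -1 lets NumPy infer the first dimension from the array size
print(y_small.reshape(-1, 1).shape)
print(y_large.reshape(-1, 1).shape)
print(y_small.reshape(-1, 1).tolist())

# Using -1 for more than one dimension is ambiguous and raises an error
try:
    y_small.reshape(-1, -1)
except ValueError as error:
    print("ValueError:", error)
```

The same call works for both arrays without knowing the number of samples in advance, which is why the training and test labels can be reshaped with identical code.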

Let's create our layers and use the appropriate activation functions:

# Create dataset
X, y = spiral_data(100, 2)

# Reshape labels to be a list of lists
# Inner list contains one output (either 0 or 1)
# per each output neuron, 1 in this case
y = y.reshape(-1, 1)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=5e-4, bias_regularizer_l2=5e-4)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()

# Create second Dense layer with 64 input features (as we take output
# of previous layer here) and 1 output value
dense2 = Layer_Dense(64, 1)

# Create Sigmoid activation:
activation2 = Activation_Sigmoid()
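To make the layer shapes concrete, here is a plain-NumPy sketch of the same forward pass with randomly initialized parameters. It does not use the Layer_Dense or activation classes; the input data, the random generator and the variable names are assumptions made so that the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five made-up samples with 2 features each, like the spiral data points
X = rng.standard_normal((5, 2))

# Randomly initialized parameters with the same shapes as dense1 and dense2
w1, b1 = 0.01 * rng.standard_normal((2, 64)), np.zeros((1, 64))
w2, b2 = 0.01 * rng.standard_normal((64, 1)), np.zeros((1, 1))

# dense1 followed by ReLU gives one row of 64 values per sample
hidden = np.maximum(0, X @ w1 + b1)
# dense2 followed by sigmoid gives a single value per sample
output = 1 / (1 + np.exp(-(hidden @ w2 + b2)))

print(hidden.shape, output.shape)
```

Each sample ends up with exactly one value between 0 and 1, which is why the targets need the matching shape of one value per sample.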

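Note that we still use the ReLU activation on the hidden layer. The hidden-layer activation function does not need to change, even though we are effectively building a different type of classifier. You should also notice that, because this is a binary classifier, the dense2 object has just 1 output. Its output represents exactly 2 classes (0 or 1) mapped to a single neuron. We can now select a loss function and an optimizer. For the Adam optimizer, we will use the default learning rate and a decay of 5e-7: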

# Create loss function
loss_function = Loss_BinaryCrossentropy()

# Create optimizer
optimizer = Optimizer_Adam(decay=5e-7)

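Although the loss requires a different calculation (because we use a different activation function for the output layer), we can still use the same optimizer as with the softmax classifier. Another small change is how we measure predictions. With a probability distribution, we use argmax to determine which index holds the largest value and take that as the classification result. With a binary classifier, we instead determine whether the output is closer to 0 or to 1. To do this, we simplify the output to: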

predictions = (activation2.output > 0.5) * 1

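This evaluates the statement "is the output above 0.5?" to True or False. When treated as numbers, True and False are 1 and 0 respectively. For example, int(True) returns 1 and int(False) returns 0. If we want to convert a list of True/False boolean values to numbers, we cannot simply wrap the list in int(). We can, however, perform math operations directly on an array of boolean values and get back the arithmetic result. For example, we can run: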

import numpy as np

a = np.array([True, False, True])
print(a)

>>>
[ True False  True]

Then:

b = a*1
print(b)

>>>
[1 0 1]

Thus, to evaluate predictive accuracy, we can do the following in our code:

predictions = (activation2.output > 0.5) * 1
accuracy = np.mean(predictions==y_test)

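The multiplication by 1 turns the array of boolean True/False values into numerical 1/0 values. We need to implement this accuracy calculation for the validation data as well.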
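The sketch below runs the accuracy calculation on a tiny made-up batch and also shows what happens if the labels are not reshaped to match the predictions. The arrays are illustrative assumptions, not output from the trained model.

```python
import numpy as np

# Made-up sigmoid outputs and targets for 4 samples, both shaped (4, 1)
output = np.array([[0.9], [0.2], [0.6], [0.4]])
y = np.array([[1], [0], [0], [0]])

predictions = (output > 0.5) * 1
accuracy = np.mean(predictions == y)
print(accuracy)  # 3 of the 4 predictions match

# If the labels were left 1-D, the comparison would broadcast
# (4, 1) against (4,) into a (4, 4) matrix and the mean would be wrong
y_flat = y.reshape(-1)
print((predictions == y_flat).shape)
```

This is one practical reason for reshaping the labels with .reshape(-1, 1) before training and validation: the element-wise comparison only means what we intend when both arrays have the same shape.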

Full code up to this point:

import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

# Dense layer
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons,
                 weight_regularizer_l1=0, weight_regularizer_l2=0,
                 bias_regularizer_l1=0, bias_regularizer_l2=0):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        # Set regularization strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2
    
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases
        
    # Backward pass
    def backward(self, dvalues):
        # Gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradients on regularization
        # L1 on weights
        if self.weight_regularizer_l1 > 0:
            dL1 = np.ones_like(self.weights)
            dL1[self.weights < 0] = -1
            self.dweights += self.weight_regularizer_l1 * dL1
        # L2 on weights
        if self.weight_regularizer_l2 > 0:
            self.dweights += 2 * self.weight_regularizer_l2 * self.weights
        # L1 on biases
        if self.bias_regularizer_l1 > 0:
            dL1 = np.ones_like(self.biases)
            dL1[self.biases < 0] = -1
            self.dbiases += self.bias_regularizer_l1 * dL1
        # L2 on biases
        if self.bias_regularizer_l2 > 0:
            self.dbiases += 2 * self.bias_regularizer_l2 * self.biases
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)
        
        
# Dropout
class Layer_Dropout:        
    # Init
    def __init__(self, rate):
        # Store rate, we invert it as for example for dropout
        # of 0.1 we need success rate of 0.9
        self.rate = 1 - rate
        
    # Forward pass
    def forward(self, inputs):
        # Save input values
        self.inputs = inputs
        # Generate and save scaled mask
        self.binary_mask = np.random.binomial(1, self.rate, size=inputs.shape) / self.rate
        # Apply mask to output values
        self.output = inputs * self.binary_mask
        
    # Backward pass
    def backward(self, dvalues):
        # Gradient on values
        self.dinputs = dvalues * self.binary_mask
        
        
# ReLU activation
class Activation_ReLU:  
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)
        
    # Backward pass
    def backward(self, dvalues):
        # Since we need to modify original variable,
        # let's make a copy of values first
        self.dinputs = dvalues.copy()
        # Zero gradient where input values were negative
        self.dinputs[self.inputs <= 0] = 0
        
        
# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Remember input values
        self.inputs = inputs
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities
        
    # Backward pass
    def backward(self, dvalues):
        # Create uninitialized array
        self.dinputs = np.empty_like(dvalues)
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            # Flatten output array
            single_output = single_output.reshape(-1, 1)
            # Calculate Jacobian matrix of the output and
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient
            # and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
        
      
# Sigmoid activation
class Activation_Sigmoid:
    # Forward pass
    def forward(self, inputs):
        # Save input and calculate/save output
        # of the sigmoid function
        self.inputs = inputs
        self.output = 1 / (1 + np.exp(-inputs))
        
    # Backward pass
    def backward(self, dvalues):
        # Derivative - calculates from output of the sigmoid function
        self.dinputs = dvalues * (1 - self.output) * self.output
        
        
# SGD optimizer
class Optimizer_SGD:
    # Initialize optimizer - set settings,
    # learning rate of 1. is default for this optimizer
    def __init__(self, learning_rate=1., decay=0., momentum=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum
        
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))
    
    # Update parameters
    def update_params(self, layer):
        # If we use momentum
        if self.momentum:
            # If layer does not contain momentum arrays, create them
            # filled with zeros
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                # If there is no momentum array for weights
                # The array doesn't exist for biases yet either.
                layer.bias_momentums = np.zeros_like(layer.biases)
            # Build weight updates with momentum - take previous
            # updates multiplied by retain factor and update with
            # current gradients
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates
            
            # Build bias updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        # Vanilla SGD updates (as before momentum update)
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases
        # Update weights and biases using either
        # vanilla or momentum updates
        layer.weights += weight_updates
        layer.biases += bias_updates
                
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1        


# Adagrad optimizer
class Optimizer_Adagrad:
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))
    
    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays,
        # create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2
            
        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)
    
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
            
            
# RMSprop optimizer
class Optimizer_RMSprop:            
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho
    
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))
    
    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays,
        # create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
        # Update cache with squared current gradients
        layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2
        
        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)
    
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
            

# Adam optimizer
class Optimizer_Adam:
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2
    
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))        

    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays,
        # create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)
        # Update momentum with current gradients
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases
        # Get corrected momentum
        # self.iteration is 0 at first pass
        # and we need to start with 1 here
        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        # Update cache with squared current gradients
        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2
        # Get corrected cache
        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))
        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)
                    
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
            
        
# Common loss class
class Loss:
    # Regularization loss calculation
    def regularization_loss(self, layer):        
        # 0 by default
        regularization_loss = 0
        # L1 regularization - weights
        # calculate only when factor greater than 0
        if layer.weight_regularizer_l1 > 0:
            regularization_loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))
        # L2 regularization - weights
        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * np.sum(layer.weights * layer.weights)
        # L1 regularization - biases
        # calculate only when factor greater than 0
        if layer.bias_regularizer_l1 > 0:
            regularization_loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))
        # L2 regularization - biases
        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * np.sum(layer.biases * layer.biases)
        return regularization_loss
        
    # Calculates the data and regularization losses
    # given model output and ground truth values
    def calculate(self, output, y):
        # Calculate sample losses
        sample_losses = self.forward(output, y)
        # Calculate mean loss
        data_loss = np.mean(sample_losses)
        # Return loss
        return data_loss
        

# Cross-entropy loss
class Loss_CategoricalCrossentropy(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Probabilities for target values -
        # only if categorical labels
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of labels in every sample
        # We'll use the first sample to count them
        labels = len(dvalues[0])
        # If labels are sparse, turn them into one-hot vector
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
        # Calculate gradient
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples
        
        
# Softmax classifier - combined Softmax activation
# and cross-entropy loss for faster backward step
class Activation_Softmax_Loss_CategoricalCrossentropy():  
    # Creates activation and loss function objects
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()
    # Forward pass
    def forward(self, inputs, y_true):
        # Output layer's activation function
        self.activation.forward(inputs)
        # Set the output
        self.output = self.activation.output
        # Calculate and return loss value
        return self.loss.calculate(self.output, y_true)
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)     
        # If labels are one-hot encoded,
        # turn them into discrete values
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
        # Copy so we can safely modify
        self.dinputs = dvalues.copy()
        # Calculate gradient
        self.dinputs[range(samples), y_true] -= 1
        # Normalize gradient
        self.dinputs = self.dinputs / samples
        
        
# Binary cross-entropy loss
class Loss_BinaryCrossentropy(Loss): 
    # Forward pass
    def forward(self, y_pred, y_true):
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Calculate sample-wise loss
        sample_losses = -(y_true * np.log(y_pred_clipped) + (1 - y_true) * np.log(1 - y_pred_clipped))
        sample_losses = np.mean(sample_losses, axis=-1)
        # Return losses
        return sample_losses       
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of outputs in every sample
        # We'll use the first sample to count them
        outputs = len(dvalues[0])
        # Clip data to prevent division by 0
        # Clip both sides to not drag mean towards any value
        clipped_dvalues = np.clip(dvalues, 1e-7, 1 - 1e-7)
        # Calculate gradient
        self.dinputs = -(y_true / clipped_dvalues - (1 - y_true) / (1 - clipped_dvalues)) / outputs
        # Normalize gradient
        self.dinputs = self.dinputs / samples
        
        
# Create dataset
X, y = spiral_data(samples=100, classes=2)

# Reshape labels to be a list of lists
# Inner list contains one output (either 0 or 1)
# per each output neuron, 1 in this case
y = y.reshape(-1, 1)

# Create Dense layer with 2 input features and 64 output values
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=5e-4, bias_regularizer_l2=5e-4)

# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 64 input features (as we take output
# of previous layer here) and 1 output value
dense2 = Layer_Dense(64, 1)

# Create Sigmoid activation:
activation2 = Activation_Sigmoid()

# Create loss function
loss_function = Loss_BinaryCrossentropy()

# Create optimizer
optimizer = Optimizer_Adam(decay=5e-7)

# Train in loop
for epoch in range(10001):
    # Perform a forward pass of our training data through this layer
    dense1.forward(X)
    
    # Perform a forward pass through activation function
    # takes the output of first dense layer here
    activation1.forward(dense1.output)
    
    # Perform a forward pass through second Dense layer
    # takes outputs of activation function
    # of first layer as inputs
    dense2.forward(activation1.output)
    
    # Perform a forward pass through activation function
    # takes the output of second dense layer here
    activation2.forward(dense2.output)
    
    # Calculate the data loss
    data_loss = loss_function.calculate(activation2.output, y)
    # Calculate regularization penalty
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)
    
    # Calculate overall loss
    loss = data_loss + regularization_loss
    
    # Calculate accuracy from output of activation2 and targets
    # Part in the brackets returns a binary mask - array consisting
    # of True/False values, multiplying it by 1 changes it into array
    # of 1s and 0s
    predictions = (activation2.output > 0.5) * 1
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, '+
              f'loss: {loss:.3f} (' +
              f'data_loss: {data_loss:.3f}, ' +
              f'reg_loss: {regularization_loss:.3f}), ' +
              f'lr: {optimizer.current_learning_rate}')
    
    # Backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dinputs)
    dense2.backward(activation2.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    # Update weights and biases
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()

# Validate the model
# Create test dataset
X_test, y_test = spiral_data(samples=100, classes=2)

# Reshape labels to be a list of lists
# Inner list contains one output (either 0 or 1)
# per each output neuron, 1 in this case
y_test = y_test.reshape(-1, 1)

# Perform a forward pass of our testing data through this layer
dense1.forward(X_test)

# Perform a forward pass through activation function
# takes the output of first dense layer here
activation1.forward(dense1.output)

# Perform a forward pass through second Dense layer
# takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Perform a forward pass through activation function
# takes the output of second dense layer here
activation2.forward(dense2.output)

# Calculate the data loss
loss = loss_function.calculate(activation2.output, y_test)

# Calculate accuracy from output of activation2 and targets
# Part in the brackets returns a binary mask - array consisting of
# True/False values, multiplying it by 1 changes it into array
# of 1s and 0s
predictions = (activation2.output > 0.5) * 1
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')


>>>
epoch: 0, acc: 0.500, loss: 0.693 (data_loss: 0.693, reg_loss: 0.000), lr: 0.001
epoch: 100, acc: 0.630, loss: 0.674 (data_loss: 0.673, reg_loss: 0.001), lr: 0.0009999505024501287
epoch: 200, acc: 0.625, loss: 0.669 (data_loss: 0.668, reg_loss: 0.001), lr: 0.0009999005098992651
epoch: 300, acc: 0.645, loss: 0.665 (data_loss: 0.663, reg_loss: 0.002), lr: 0.000999850522346909
epoch: 400, acc: 0.650, loss: 0.659 (data_loss: 0.657, reg_loss: 0.002), lr: 0.0009998005397923115
epoch: 500, acc: 0.675, loss: 0.648 (data_loss: 0.644, reg_loss: 0.004), lr: 0.0009997505622347225
epoch: 600, acc: 0.720, loss: 0.632 (data_loss: 0.626, reg_loss: 0.006), lr: 0.0009997005896733929
...
epoch: 1500, acc: 0.810, loss: 0.502 (data_loss: 0.464, reg_loss: 0.038), lr: 0.0009992510613295335
...
epoch: 2500, acc: 0.855, loss: 0.433 (data_loss: 0.380, reg_loss: 0.053), lr: 0.0009987520593019025
...
epoch: 4500, acc: 0.905, loss: 0.363 (data_loss: 0.305, reg_loss: 0.058), lr: 0.0009977555488927658
epoch: 4600, acc: 0.905, loss: 0.361 (data_loss: 0.303, reg_loss: 0.058), lr: 0.000997705775569079
epoch: 4700, acc: 0.905, loss: 0.358 (data_loss: 0.300, reg_loss: 0.058), lr: 0.0009976560072110577
epoch: 4800, acc: 0.910, loss: 0.354 (data_loss: 0.296, reg_loss: 0.058), lr: 0.0009976062438179587
...
epoch: 6100, acc: 0.915, loss: 0.324 (data_loss: 0.262, reg_loss: 0.062), lr: 0.0009969597711777935
...
epoch: 6600, acc: 0.935, loss: 0.307 (data_loss: 0.245, reg_loss: 0.062), lr: 0.000996711350897713
epoch: 6700, acc: 0.935, loss: 0.304 (data_loss: 0.243, reg_loss: 0.062), lr: 0.0009966616816971556
epoch: 6800, acc: 0.935, loss: 0.303 (data_loss: 0.241, reg_loss: 0.062), lr: 0.00099661201744669
epoch: 6900, acc: 0.935, loss: 0.301 (data_loss: 0.239, reg_loss: 0.062), lr: 0.0009965623581455767
...
epoch: 9800, acc: 0.945, loss: 0.262 (data_loss: 0.205, reg_loss: 0.057), lr: 0.0009951243880606966
epoch: 9900, acc: 0.945, loss: 0.261 (data_loss: 0.204, reg_loss: 0.057), lr: 0.0009950748768967994
epoch: 10000, acc: 0.945, loss: 0.260 (data_loss: 0.204, reg_loss: 0.056), lr: 0.0009950253706593885
validation, acc: 0.920, loss: 0.237
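
Two numbers in this log can be checked by hand. The sketch below is only a sanity check, not part of the model code. It assumes that the untrained network outputs values close to 0.5 for every sample, and that the optimizer decays the learning rate as `initial_lr / (1 + decay * iterations)`, with the iteration counter incremented after each update. The optimizer is defined outside this section, so the helper `decayed_lr` is a hypothetical stand-in for it.

```python
import math

# Epoch 0: an untrained network outputs roughly 0.5 for every sample,
# so the binary cross-entropy per sample is -ln(0.5) = ln(2).
initial_loss = -math.log(0.5)
print(f'{initial_loss:.3f}')  # 0.693


# Assumed decay rule (hypothetical helper, not the optimizer's own code):
# lr = initial_lr / (1 + decay * iterations)
def decayed_lr(initial_lr, decay, iterations):
    return initial_lr / (1.0 + decay * iterations)


# The value printed at epoch N was computed during epoch N - 1,
# when the iteration counter was N - 1 (assumption about update order).
lr_100 = decayed_lr(0.001, 5e-7, 99)
lr_10000 = decayed_lr(0.001, 5e-7, 9999)
print(lr_100, lr_10000)
```

Under these assumptions both values agree with the `lr` column at epochs 100 and 10000, which is why the learning rate barely moves over the run: with `decay=5e-7`, ten thousand steps reduce it by only about 0.5%.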

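The model performs quite well here! You should now have some intuition for how to adapt the output layer to the problem you are trying to solve while keeping the hidden layers largely unchanged. In the next chapter, we will turn to regression, where the target is no longer a class but a scalar value to predict, such as the price of a house.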
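
One detail of switching to a single sigmoid output neuron is easy to miss: the labels have to match the `(samples, 1)` shape of the output, which is why the code above calls `y.reshape(-1, 1)`. The following minimal sketch uses made-up output values (the arrays are illustrative, not taken from the training run) to show the thresholding step and what NumPy broadcasting does if the reshape is skipped.

```python
import numpy as np

# Hypothetical sigmoid outputs for 4 samples and 1 output neuron: shape (4, 1)
outputs = np.array([[0.9], [0.2], [0.6], [0.4]])

# Labels as a flat vector, shape (4,), then reshaped to (4, 1)
y_flat = np.array([1, 0, 0, 0])
y = y_flat.reshape(-1, 1)

# Threshold at 0.5; multiplying the boolean mask by 1 gives 0s and 1s
predictions = (outputs > 0.5) * 1

# Shapes match, so the comparison is element-wise per sample
accuracy = np.mean(predictions == y)
print(accuracy)  # 0.75

# Without the reshape, (4, 1) == (4,) broadcasts to a (4, 4) matrix,
# comparing every prediction with every label
wrong = predictions == y_flat
print(wrong.shape)     # (4, 4)
print(np.mean(wrong))  # 0.5
```

The second result raises no error, it just silently reports a meaningless accuracy, so checking the shapes of `predictions` and the labels is a cheap safeguard.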

Chapter code, further resources, and errata for this chapter: https://nnfs.io/ch16