WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
This is part of an ongoing series of study notes sharing the key content of *Deep Learning with Python*.
This second installment solves a binary classification problem with a Keras model, using the IMDB dataset built into Keras.

- The last layer of a binary classifier uses sigmoid as its activation function
- The loss is binary_crossentropy (binary cross-entropy); see the small numeric sketch below

Runtime environment: Python 3.9.13 + Keras 2.12.0 + TensorFlow 2.12.0
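Before any modeling, here is a minimal numeric sketch (my own illustration, not from the book) of those two choices: sigmoid squashes a raw score into (0, 1), and binary cross-entropy penalizes confident wrong predictions heavily.
python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    # loss for a single prediction p = P(label == 1)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

p = sigmoid(2.0)                  # ~0.88
print(binary_crossentropy(1, p))  # ~0.13, low loss: confident and correct
print(binary_crossentropy(0, p))  # ~2.13, high loss: confident and wrong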
In [1]:
python
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.datasets import imdb  # built-in dataset
from keras import models
from keras import layers
from keras import optimizers  # optimizers
from tensorflow.keras.utils import to_categorical  # one-hot encoding helper
# from tensorflow.keras import optimizers
# change 1 (not needed in this environment):
# from tensorflow.python.keras.optimizers import rmsprop_v2
Loading the IMDB data
The IMDB dataset used here is a classic benchmark for sentiment analysis. It contains 50,000 highly polarized movie reviews collected from the Internet Movie Database (IMDB): 25,000 for training and 25,000 for testing, with each split half positive and half negative.
The task is binary sentiment classification: given the text of a review, predict whether it is positive or negative. That makes the dataset a natural playground for text vectorization and simple neural classifiers, which is exactly how it is used in this post.
The IMDB dataset ships with the Keras library:
In [2]:
python
from keras.datasets import imdb
In [3]:
python
# keep only the 10,000 most frequent words in the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
In [4]:
python
train_data[:2]
Out[4]:
python
array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])],
dtype=object)
Both label arrays hold binary 0/1 labels: 0 means negative (neg), 1 means positive (pos).
In [5]:
python
train_labels[:3]
Out[5]:
python
array([1, 0, 0], dtype=int64)
In [6]:
python
test_labels[:3]
Out[6]:
python
array([0, 1, 1], dtype=int64)
Keeping only the top 10,000 words means no word index exceeds 9,999:
In [7]:
python
max([max(sequence) for sequence in train_data])
Out[7]:
python
9999
Converting between words and indices:
In [8]:
python
word_index = imdb.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}  # invert the word->index mapping into index->word
reverse_word_index
# result (truncated)
{34701: 'fawn',
52006: 'tsukino',
52007: 'nunnery',
16816: 'sonja',
63951: 'vani',
1408: 'woods',
16115: 'spiders',
2345: 'hanging',
2289: 'woody',
52008: 'trawling',
52009: "hold's",
11307: 'comically',
40830: 'localized'
.......
}
Decoding a review back into English words (indices 0, 1 and 2 are reserved for padding, start-of-sequence and unknown tokens, hence the i-3 offset):
In [9]:
python
decoded_review = ' '.join([reverse_word_index.get(i-3, "?") for i in train_data[0]])
decoded_review
Out[9]:
python
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Encoding the integer sequences
Encode the integer sequences as a binary (multi-hot) matrix:
In [10]:
python
import numpy as np

def vectorize_sequences(seq, dim=10000):
    """
    seq: list of integer sequences
    dim: output dimension, 10000
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions of the words that occur to 1; absent words stay 0
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
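A quick sanity check of the encoding (a hypothetical toy input of my own, not from the book): a sequence containing word indices 3 and 5 becomes a 10,000-dimensional vector with ones at positions 3 and 5.
python
demo = vectorize_sequences([[3, 5]])  # one sequence with word indices 3 and 5
print(demo.shape)    # (1, 10000)
print(demo[0][:8])   # [0. 0. 0. 1. 0. 1. 0. 0.]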
In [11]:
python
X_train[0]
Out[11]:
python
array([0., 1., 1., ..., 0., 0., 0.])
Vectorizing the labels
In [12]:
python
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")
With the training and test sets fully prepared, the data can now be fed into a neural network.
Building the network
In [13]:
python
from keras import models
from keras import layers
In [14]:
X_train.shape
Out[14]:
python
(25000, 10000)
Why does deep learning need activation functions?
- Activation functions are a core component of neural networks: they introduce non-linearity, which is what allows a network to learn non-linear patterns.
- Without them, a stack of Dense layers can only represent linear transformations, which severely limits what the network can express. A non-linear activation lets each neuron respond to its inputs non-linearly, so the network can fit complex data distributions (see the sketch after this list).
- Activation functions also shape how gradients flow: in backpropagation, their derivatives modulate the gradient at each layer, which affects how well the network learns and optimizes.
- Finally, some activation functions bound the output range: sigmoid, for example, maps any input into (0, 1), which is convenient for probabilities; keeping activations in a sensible range also helps mitigate vanishing or exploding gradients.
In short, activation functions introduce non-linearity, shape gradient flow, and control output ranges, all of which matter for network performance.
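A small numpy sketch (my own illustration, not from the book) of why non-linearity matters: stacking two linear layers collapses into a single linear map, while inserting a ReLU between them does not.
python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)

# Two linear layers are equivalent to one layer with weights W2 @ W1.
linear_stack = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(linear_stack, collapsed))     # True

# With a ReLU in between, no single matrix reproduces the mapping.
relu = lambda z: np.maximum(z, 0)
nonlinear_stack = W2 @ relu(W1 @ x)
print(np.allclose(nonlinear_stack, collapsed))  # False (in general)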
In [15]:
python
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
Compiling the network
Configure the optimizer and loss, then compile the network:
In [16]:
python
# Version 1
model.compile(optimizer='rmsprop',          # optimizer
              loss='binary_crossentropy',   # binary cross-entropy
              metrics=['accuracy']          # evaluation metric
              )
In [17]:
python
# Version 2: modified from the book for Keras/TF 2.12
model.compile(
    # original book code:
    # optimizer=optimizers.RMSprop(lr=0.001),
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),  # add the tf. prefix; lr is now learning_rate
    loss='binary_crossentropy',  # cross-entropy
    metrics=['accuracy']         # use the full name
)
Training the model with fit
In [18]:
python
# Hold out a validation set from the training data
x_val = X_train[:10000]            # first 10,000 samples: validation set
partial_x_train = X_train[10000:]  # the rest: the actual training set
y_val = y_train[:10000]
partial_y_train = y_train[10000:]  # the actual training labels
In [19]:
python
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/20
30/30 [==============================] - 1s 23ms/step - loss: 0.5108 - accuracy: 0.7746 - val_loss: 0.3802 - val_accuracy: 0.8653
Epoch 2/20
30/30 [==============================] - 0s 13ms/step - loss: 0.3102 - accuracy: 0.8959 - val_loss: 0.3066 - val_accuracy: 0.8850
Epoch 3/20
30/30 [==============================] - 0s 14ms/step - loss: 0.2343 - accuracy: 0.9200 - val_loss: 0.2997 - val_accuracy: 0.8815
Epoch 4/20
30/30 [==============================] - 0s 14ms/step - loss: 0.1912 - accuracy: 0.9371 - val_loss: 0.2921 - val_accuracy: 0.8828
......
Epoch 19/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0164 - accuracy: 0.9975 - val_loss: 0.5394 - val_accuracy: 0.8723
Epoch 20/20
30/30 [==============================] - 0s 11ms/step - loss: 0.0165 - accuracy: 0.9971 - val_loss: 0.5563 - val_accuracy: 0.8714
About the History object:
In [20]:
python
his_dict = history.history  # a plain Python dict
In [21]:
python
his_dict.keys()
Out[21]:
python
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
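One handy use of this dict (a small sketch of my own, not from the book): locate the epoch where validation loss bottoms out, which hints at where overfitting begins.
python
best_epoch = int(np.argmin(his_dict["val_loss"])) + 1  # epochs are 1-indexed
print(best_epoch)  # around 3-4 for the run above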
Model overview with summary
In [22]:
python
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 160016
dense_1 (Dense) (None, 16) 272
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
Evaluating the model
In [23]:
python
model.evaluate(X_test, y_test)
782/782 [==============================] - 1s 880us/step - loss: 0.6017 - accuracy: 0.8582
Out[23]:
python
[0.601686954498291, 0.8581600189208984]
Visualizing the metrics
In [24]:
python
import matplotlib.pyplot as plt
loss = his_dict["loss"]
val_loss = his_dict["val_loss"]
acc = his_dict["accuracy"]
val_acc = his_dict["val_accuracy"]
In [25]:
python
epochs = range(1, len(loss) + 1)  # x-axis values
In [26]:
python
# 1. training and validation loss
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss")
plt.show()

python
# 2. training and validation accuracy
plt.clf()  # clear the previous figure
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.xlabel("Epochs")
plt.ylabel("Acc")
plt.legend()
plt.title("Training and Validation Acc")
plt.show()

Retraining
As training progresses, the loss on the training set keeps falling and the accuracy keeps rising, but the validation metrics do not follow.
In other words, the model fits the training data well but generalizes poorly to the validation data: it is overfitting.
We therefore retrain a fresh model for only 4 epochs (epochs=4).
Training for a fixed number of epochs
In [28]:
python
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))  # the book hard-codes (10000,)
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

# Compile the model (the book uses metrics=["acc"]; "accuracy" is the full name)
model.compile(optimizer='rmsprop',          # optimizer
              loss='binary_crossentropy',   # binary cross-entropy
              metrics=['accuracy']          # evaluation metric
              )

# Train on the full training set.
# Note: x_val is a slice of X_train, so the validation metrics below are
# optimistic; the unbiased check is the test-set evaluation further down.
history = model.fit(X_train,
                    y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/4
49/49 [==============================] - 1s 16ms/step - loss: 0.4807 - accuracy: 0.8072 - val_loss: 0.3134 - val_accuracy: 0.9003
Epoch 2/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2772 - accuracy: 0.9021 - val_loss: 0.2212 - val_accuracy: 0.9259
Epoch 3/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2173 - accuracy: 0.9200 - val_loss: 0.1930 - val_accuracy: 0.9283
Epoch 4/4
49/49 [==============================] - 0s 10ms/step - loss: 0.1840 - accuracy: 0.9324 - val_loss: 0.1472 - val_accuracy: 0.9544
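As an alternative to hand-picking the epoch count, Keras offers an EarlyStopping callback that halts training once val_loss stops improving. A minimal sketch (my own addition, not from the book):
python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",        # watch validation loss
                           patience=2,                # tolerate 2 epochs without improvement
                           restore_best_weights=True  # roll back to the best epoch
                           )
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])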
Prediction results and visualization
Predictions from the final model:
In [29]:
python
results = model.predict(X_test)
results
782/782 [==============================] - 1s 790us/step
Out[29]:
python
array([[0.19428788],
[0.9998849 ],
[0.8095433 ],
...,
[0.1104579 ],
[0.07548532],
[0.65479356]], dtype=float32)
The network is very confident about some samples, for example probabilities near 0.9998 (read as 1) or 0.075 (read as 0); others, such as 0.65, are ambiguous.
In [30]:
python
results.flatten()  # flatten the 2-D array into 1-D
Out[30]:
python
array([0.19428788, 0.9998849 , 0.8095433 , ..., 0.1104579 , 0.07548532,
0.65479356], dtype=float32)
Use np.round to turn the probabilities into 0/1 class labels (threshold 0.5):
In [31]:
python
y_predict = np.round(results.flatten())
y_predict
Out[31]:
python
array([0., 1., 1., ..., 0., 0., 1.], dtype=float32)
In [32]:
python
y_test
Out[32]:
python
array([0., 1., 1., ..., 0., 0., 0.], dtype=float32)
In [33]:
python
from sklearn.metrics import classification_report, confusion_matrix  # only these two are used below
In [34]:
python
confusion_matrix(y_predict, y_test)  # confusion matrix; note sklearn's signature is (y_true, y_pred),
                                     # so with the arguments reversed here, rows are predicted labels
Out[34]:
python
array([[11169, 1512],
[ 1331, 10988]], dtype=int64)
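For the conventional orientation (rows = true labels, columns = predicted labels), pass the arguments in sklearn's documented (y_true, y_pred) order; this yields the transpose of the matrix above:
python
confusion_matrix(y_test, y_predict)  # rows: true labels, columns: predicted labels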
In [35]:
python
print(classification_report(y_predict, y_test))  # arguments reversed here too, so precision and recall are swapped relative to the usual convention
              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89     12681
         1.0       0.88      0.89      0.89     12319

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000
In [36]:
python
import seaborn as sns

sns.heatmap(confusion_matrix(y_predict, y_test),  # confusion matrix
            annot=True,  # show the counts in each cell
            # cmap=plt.cm.Blues,
            fmt='.0f'    # integer format
            )
plt.show()
