WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
This is part of an ongoing series of study notes sharing the key content of *Deep Learning with Python*.
This second installment solves a binary classification problem with a Keras model, using the IMDB dataset built into Keras.

- The last layer of a binary classifier uses sigmoid as its activation function
- The loss is binary_crossentropy (binary cross-entropy); see the small numeric sketch below

Runtime environment: Python 3.9.13 + Keras 2.12.0 + TensorFlow 2.12.0
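Before any modeling, here is a minimal numeric sketch (my own illustration, not from the book) of those two choices: sigmoid squashes a raw score into (0, 1), and binary cross-entropy penalizes confident wrong predictions heavily.
python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    # loss for a single prediction p = P(label == 1)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

p = sigmoid(2.0)                  # ~0.88
print(binary_crossentropy(1, p))  # ~0.13, low loss: confident and correct
print(binary_crossentropy(0, p))  # ~2.13, high loss: confident and wrong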
In [1]:
python
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.datasets import imdb  # built-in dataset
from keras import models
from keras import layers
from keras import optimizers  # optimizers
from tensorflow.keras.utils import to_categorical  # one-hot encoding helper
# from tensorflow.keras import optimizers
# change 1 (not needed in this environment):
# from tensorflow.python.keras.optimizers import rmsprop_v2
Loading the IMDB data
The IMDB dataset used here is a classic benchmark for sentiment analysis. It contains 50,000 highly polarized movie reviews collected from the Internet Movie Database (IMDB): 25,000 for training and 25,000 for testing, with each split half positive and half negative.
The task is binary sentiment classification: given the text of a review, predict whether it is positive or negative. That makes the dataset a natural playground for text vectorization and simple neural classifiers, which is exactly how it is used in this post.
The IMDB dataset ships with the Keras library:
In [2]:
python
from keras.datasets import imdb
In [3]:
python
# keep only the 10,000 most frequent words in the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
In [4]:
python
train_data[:2]
Out[4]:
python
array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])],
dtype=object)
Both label arrays hold binary 0/1 labels: 0 means negative (neg), 1 means positive (pos).
In [5]:
python
train_labels[:3]
Out[5]:
python
array([1, 0, 0], dtype=int64)
In [6]:
python
test_labels[:3]
Out[6]:
python
array([0, 1, 1], dtype=int64)
Keeping only the top 10,000 words means no word index exceeds 9,999:
In [7]:
python
max([max(sequence) for sequence in train_data])
Out[7]:
python
9999
Converting between words and indices:
In [8]:
python
word_index = imdb.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}  # invert the word->index mapping into index->word
reverse_word_index
# result (truncated)
{34701: 'fawn',
52006: 'tsukino',
52007: 'nunnery',
16816: 'sonja',
63951: 'vani',
1408: 'woods',
16115: 'spiders',
2345: 'hanging',
2289: 'woody',
52008: 'trawling',
52009: "hold's",
11307: 'comically',
40830: 'localized'
.......
}
Decoding a review back into English words (indices 0, 1 and 2 are reserved for padding, start-of-sequence and unknown tokens, hence the i-3 offset):
In [9]:
python
decoded_review = ' '.join([reverse_word_index.get(i-3, "?") for i in train_data[0]])
decoded_review
Out[9]:
python
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Encoding the integer sequences
Encode the integer sequences as a binary (multi-hot) matrix:
In [10]:
python
import numpy as np

def vectorize_sequences(seq, dim=10000):
    """
    seq: list of integer sequences
    dim: output dimension, 10000
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions of the words that occur to 1; absent words stay 0
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
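A quick sanity check of the encoding (a hypothetical toy input of my own, not from the book): a sequence containing word indices 3 and 5 becomes a 10,000-dimensional vector with ones at positions 3 and 5.
python
demo = vectorize_sequences([[3, 5]])  # one sequence with word indices 3 and 5
print(demo.shape)    # (1, 10000)
print(demo[0][:8])   # [0. 0. 0. 1. 0. 1. 0. 0.]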
In [11]:
python
X_train[0]
Out[11]:
python
array([0., 1., 1., ..., 0., 0., 0.])
Vectorizing the labels
In [12]:
python
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")
With the training and test sets fully prepared, the data can now be fed into a neural network.
Building the network
In [13]:
python
from keras import models
from keras import layers
In [14]:
X_train.shape
Out[14]:
python
(25000, 10000)
Why does deep learning need activation functions?
- Activation functions are a core component of neural networks: they introduce non-linearity, which is what allows a network to learn non-linear patterns.
- Without them, a stack of Dense layers can only represent linear transformations, which severely limits what the network can express. A non-linear activation lets each neuron respond to its inputs non-linearly, so the network can fit complex data distributions (see the sketch after this list).
- Activation functions also shape how gradients flow: in backpropagation, their derivatives modulate the gradient at each layer, which affects how well the network learns and optimizes.
- Finally, some activation functions bound the output range: sigmoid, for example, maps any input into (0, 1), which is convenient for probabilities; keeping activations in a sensible range also helps mitigate vanishing or exploding gradients.
In short, activation functions introduce non-linearity, shape gradient flow, and control output ranges, all of which matter for network performance.
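A small numpy sketch (my own illustration, not from the book) of why non-linearity matters: stacking two linear layers collapses into a single linear map, while inserting a ReLU between them does not.
python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)

# Two linear layers are equivalent to one layer with weights W2 @ W1.
linear_stack = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(linear_stack, collapsed))     # True

# With a ReLU in between, no single matrix reproduces the mapping.
relu = lambda z: np.maximum(z, 0)
nonlinear_stack = W2 @ relu(W1 @ x)
print(np.allclose(nonlinear_stack, collapsed))  # False (in general)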
In [15]:
python
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
Compiling the network
Configure the optimizer and loss, then compile the network:
In [16]:
python
# Version 1
model.compile(optimizer='rmsprop',          # optimizer
              loss='binary_crossentropy',   # binary cross-entropy
              metrics=['accuracy']          # evaluation metric
              )
In [17]:
python
# Version 2: modified from the book for Keras/TF 2.12
model.compile(
    # original book code:
    # optimizer=optimizers.RMSprop(lr=0.001),
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),  # add the tf. prefix; lr is now learning_rate
    loss='binary_crossentropy',  # cross-entropy
    metrics=['accuracy']         # use the full name
)
Training the model with fit
In [18]:
python
# Hold out a validation set from the training data
x_val = X_train[:10000]            # first 10,000 samples: validation set
partial_x_train = X_train[10000:]  # the rest: the actual training set
y_val = y_train[:10000]
partial_y_train = y_train[10000:]  # the actual training labels
In [19]:
python
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/20
30/30 [==============================] - 1s 23ms/step - loss: 0.5108 - accuracy: 0.7746 - val_loss: 0.3802 - val_accuracy: 0.8653
Epoch 2/20
30/30 [==============================] - 0s 13ms/step - loss: 0.3102 - accuracy: 0.8959 - val_loss: 0.3066 - val_accuracy: 0.8850
Epoch 3/20
30/30 [==============================] - 0s 14ms/step - loss: 0.2343 - accuracy: 0.9200 - val_loss: 0.2997 - val_accuracy: 0.8815
Epoch 4/20
30/30 [==============================] - 0s 14ms/step - loss: 0.1912 - accuracy: 0.9371 - val_loss: 0.2921 - val_accuracy: 0.8828
......
Epoch 19/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0164 - accuracy: 0.9975 - val_loss: 0.5394 - val_accuracy: 0.8723
Epoch 20/20
30/30 [==============================] - 0s 11ms/step - loss: 0.0165 - accuracy: 0.9971 - val_loss: 0.5563 - val_accuracy: 0.8714
About the History object:
In [20]:
python
his_dict = history.history  # a plain Python dict
In [21]:
python
his_dict.keys()
Out[21]:
python
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
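One handy use of this dict (a small sketch of my own, not from the book): locate the epoch where validation loss bottoms out, which hints at where overfitting begins.
python
best_epoch = int(np.argmin(his_dict["val_loss"])) + 1  # epochs are 1-indexed
print(best_epoch)  # around 3-4 for the run above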
Model overview with summary
In [22]:
python
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 160016
dense_1 (Dense) (None, 16) 272
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
Evaluating the model
In [23]:
python
model.evaluate(X_test, y_test)
782/782 [==============================] - 1s 880us/step - loss: 0.6017 - accuracy: 0.8582
Out[23]:
python
[0.601686954498291, 0.8581600189208984]
Visualizing the metrics
In [24]:
python
import matplotlib.pyplot as plt
loss = his_dict["loss"]
val_loss = his_dict["val_loss"]
acc = his_dict["accuracy"]
val_acc = his_dict["val_accuracy"]
In [25]:
python
epochs = range(1, len(loss) + 1)  # x-axis values
In [26]:
python
# 1. training and validation loss
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss")
plt.show()

python
# 2. training and validation accuracy
plt.clf()  # clear the previous figure
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.xlabel("Epochs")
plt.ylabel("Acc")
plt.legend()
plt.title("Training and Validation Acc")
plt.show()

Retraining
As training progresses, the loss on the training set keeps falling and the accuracy keeps rising, but the validation metrics do not follow.
In other words, the model fits the training data well but generalizes poorly to the validation data: it is overfitting.
We therefore retrain a fresh model for only 4 epochs (epochs=4).
Training for a fixed number of epochs
In [28]:
python
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))  # the book hard-codes (10000,)
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

# Compile the model (the book uses metrics=["acc"]; "accuracy" is the full name)
model.compile(optimizer='rmsprop',          # optimizer
              loss='binary_crossentropy',   # binary cross-entropy
              metrics=['accuracy']          # evaluation metric
              )

# Train on the full training set.
# Note: x_val is a slice of X_train, so the validation metrics below are
# optimistic; the unbiased check is the test-set evaluation further down.
history = model.fit(X_train,
                    y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/4
49/49 [==============================] - 1s 16ms/step - loss: 0.4807 - accuracy: 0.8072 - val_loss: 0.3134 - val_accuracy: 0.9003
Epoch 2/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2772 - accuracy: 0.9021 - val_loss: 0.2212 - val_accuracy: 0.9259
Epoch 3/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2173 - accuracy: 0.9200 - val_loss: 0.1930 - val_accuracy: 0.9283
Epoch 4/4
49/49 [==============================] - 0s 10ms/step - loss: 0.1840 - accuracy: 0.9324 - val_loss: 0.1472 - val_accuracy: 0.9544
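As an alternative to hand-picking the epoch count, Keras offers an EarlyStopping callback that halts training once val_loss stops improving. A minimal sketch (my own addition, not from the book):
python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",        # watch validation loss
                           patience=2,                # tolerate 2 epochs without improvement
                           restore_best_weights=True  # roll back to the best epoch
                           )
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])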
Prediction results and visualization
Predictions from the final model:
In [29]:
python
results = model.predict(X_test)
results
782/782 [==============================] - 1s 790us/step
Out[29]:
python
array([[0.19428788],
[0.9998849 ],
[0.8095433 ],
...,
[0.1104579 ],
[0.07548532],
[0.65479356]], dtype=float32)
The network is very confident about some samples, for example probabilities near 0.9998 (read as 1) or 0.075 (read as 0); others, such as 0.65, are ambiguous.
In [30]:
python
results.flatten()  # flatten the 2-D array into 1-D
Out[30]:
python
array([0.19428788, 0.9998849 , 0.8095433 , ..., 0.1104579 , 0.07548532,
0.65479356], dtype=float32)
Use np.round to turn the probabilities into 0/1 class labels (threshold 0.5):
In [31]:
python
y_predict = np.round(results.flatten())
y_predict
Out[31]:
python
array([0., 1., 1., ..., 0., 0., 1.], dtype=float32)
In [32]:
python
y_test
Out[32]:
python
array([0., 1., 1., ..., 0., 0., 0.], dtype=float32)
In [33]:
python
from sklearn.metrics import classification_report, confusion_matrix  # only these two are used below
In [34]:
python
confusion_matrix(y_predict, y_test)  # confusion matrix; note sklearn's signature is (y_true, y_pred),
                                     # so with the arguments reversed here, rows are predicted labels
Out[34]:
python
array([[11169, 1512],
[ 1331, 10988]], dtype=int64)
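For the conventional orientation (rows = true labels, columns = predicted labels), pass the arguments in sklearn's documented (y_true, y_pred) order; this yields the transpose of the matrix above:
python
confusion_matrix(y_test, y_predict)  # rows: true labels, columns: predicted labels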
In [35]:
python
print(classification_report(y_predict, y_test))  # arguments reversed here too, so precision and recall are swapped relative to the usual convention
              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89     12681
         1.0       0.88      0.89      0.89     12319

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000
In [36]:
python
import seaborn as sns

sns.heatmap(confusion_matrix(y_predict, y_test),  # confusion matrix
            annot=True,  # show the counts in each cell
            # cmap=plt.cm.Blues,
            fmt='.0f'    # integer format
            )
plt.show()
