用keras对电影评论进行情感分析

文章目录

GITHUB地址https://github.com/fz861062923/Keras

下载IMDb数据

python 复制代码
#下载网站http://ai.stanford.edu/~amaas/data/sentiment/

读取IMDb数据

python 复制代码
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
C:\Users\admin\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
python 复制代码
#因为数据也是从网络上爬取的,所以还需要用正则表达式去除HTML标签
import re
def remove_html(text):
    r=re.compile(r'<[^>]+>')
    return r.sub('',text)
python 复制代码
#观察IMDB文件目录结构,用函数进行读取
import os
def read_file(filetype):
    path='./aclImdb/'
    file_list=[]
    positive=path+filetype+'/pos/'
    for f in os.listdir(positive):
        file_list+=[positive+f]
    negative=path+filetype+'/neg/'
    for f in os.listdir(negative):
        file_list+=[negative+f]
    print('filetype:',filetype,'file_length:',len(file_list))
    label=([1]*12500+[0]*12500)#train数据和test数据中positive都是12500,negative都是12500
    text=[]
    for f_ in file_list:
        with open(f_,encoding='utf8') as f:
            text+=[remove_html(''.join(f.readlines()))]
    return label,text
python 复制代码
#用x表示label,y表示text里面的内容
x_train,y_train=read_file('train')
filetype: train file_length: 25000
python 复制代码
x_test,y_test=read_file('test')
filetype: test file_length: 25000
python 复制代码
y_train[0]
'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

建立分词器

具体用法可以参看官网https://keras.io/preprocessing/text/

python 复制代码
token=Tokenizer(num_words=2000)#建立一个有2000单词的字典
token.fit_on_texts(y_train)#读取所有的训练数据评论,按照单词在评论中出现的次数进行排序,前2000名会列入字典
python 复制代码
#查看token读取多少文章
token.document_count
25000

将评论数据转化为数字列表

python 复制代码
train_seq=token.texts_to_sequences(y_train)
test_seq=token.texts_to_sequences(y_test)
python 复制代码
print(y_train[0])
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
python 复制代码
print(train_seq[0])
[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]

让转换后的数字长度相同

python 复制代码
#截长补短,让每一个数字列表长度都为100
_train=sequence.pad_sequences(train_seq,maxlen=100)
_test=sequence.pad_sequences(test_seq,maxlen=100)
python 复制代码
print(train_seq[0])
[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]
python 复制代码
print(_train[0])
[  29    1  168   54   13   45   81   40  391  109  137   13   57  149
    7    1  481   68    5  260   11    6   72    5  631   70    6    1
    5    1 1530   33   66   63  204  139   64 1229    1    4    1  222
  899   28   68    4    1    9  693    2   64 1530   50    9  215    1
  386    7   59    3 1470  798    5  176    1  391    9 1235   29  308
    3  352  343  142  129    5   27    4  125 1470    5  308    9  532
   11  107 1466    4   57  554  100   11  308    6  226   47    3   11
    8  214]
python 复制代码
_train.shape
(25000, 100)

加入嵌入层

将数字列表转化为向量列表(为什么转化,建议大家都思考一哈)

python 复制代码
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import Embedding
python 复制代码
model=Sequential()
python 复制代码
model.add(Embedding(output_dim=32,#将数字列表转换为32维的向量
                   input_dim=2000,#输入数据的维度是2000,因为之前建立的字典有2000个单词
                   input_length=100))#数字列表的长度为100
model.add(Dropout(0.25))

建立多层感知机模型

加入平坦层
python 复制代码
model.add(Flatten())
加入隐藏层
python 复制代码
model.add(Dense(units=256,
               activation='relu'))
model.add(Dropout(0.35))
加入输出层
python 复制代码
model.add(Dense(units=1,#输出层只有一个神经元,输出1表示正面评价,输出0表示负面评价
               activation='sigmoid'))
查看模型摘要
python 复制代码
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               819456    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________

训练模型

python 复制代码
model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])
python 复制代码
train_history=model.fit(_train,x_train,batch_size=100,
                       epochs=10,verbose=2,
                       validation_split=0.2)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 21s - loss: 0.4851 - acc: 0.7521 - val_loss: 0.4491 - val_acc: 0.7894
Epoch 2/10
 - 20s - loss: 0.2817 - acc: 0.8829 - val_loss: 0.6735 - val_acc: 0.6892
Epoch 3/10
 - 13s - loss: 0.1901 - acc: 0.9285 - val_loss: 0.5907 - val_acc: 0.7632
Epoch 4/10
 - 12s - loss: 0.1066 - acc: 0.9622 - val_loss: 0.7522 - val_acc: 0.7528
Epoch 5/10
 - 13s - loss: 0.0681 - acc: 0.9765 - val_loss: 0.9863 - val_acc: 0.7404
Epoch 6/10
 - 13s - loss: 0.0486 - acc: 0.9827 - val_loss: 1.0818 - val_acc: 0.7506
Epoch 7/10
 - 14s - loss: 0.0380 - acc: 0.9859 - val_loss: 0.9823 - val_acc: 0.7780
Epoch 8/10
 - 17s - loss: 0.0360 - acc: 0.9860 - val_loss: 1.1297 - val_acc: 0.7634
Epoch 9/10
 - 13s - loss: 0.0321 - acc: 0.9891 - val_loss: 1.2459 - val_acc: 0.7480
Epoch 10/10
 - 14s - loss: 0.0281 - acc: 0.9899 - val_loss: 1.4111 - val_acc: 0.7304

评估模型准确率

python 复制代码
scores=model.evaluate(_test,x_test)#第一个参数为feature,第二个参数为label
25000/25000 [==============================] - 4s 148us/step
python 复制代码
scores[1]
0.80972

进行预测

python 复制代码
predict=model.predict_classes(_test)
python 复制代码
predict[:10]
array([[1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])
python 复制代码
#转换成一维数组
predict=predict.reshape(-1)
predict[:10]
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

查看测试数据预测结果

python 复制代码
_dict={1:'正面的评论',0:'负面的评论'}
def display(i):
    print(y_test[i])
    print('label真实值为:',_dict[x_test[i]],
         '预测结果为:',_dict[predict[i]])
python 复制代码
display(0)
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
label真实值为: 正面的评论 预测结果为: 正面的评论

完整函数

python 复制代码
def review(input_text):
    input_seq=token.texts_to_sequences([input_text])
    pad_input_seq=sequence.pad_sequences(input_seq,maxlen=100)
    predict_result=model.predict_classes(pad_input_seq)
    print(_dict[predict_result[0][0]])
python 复制代码
#IMDB上面找的一段评论,进行预测
review('''
Going into this movie, I had low expectations. I'd seen poor reviews, and I also kind of hate the idea of remaking animated films for no reason other than to make them live action, as if that's supposed to make them better some how. This movie pleasantly surprised me!

Beauty and the Beast is a fun, charming movie, that is a blast in many ways. The film very easy on the eyes! Every shot is colourful and beautifully crafted. The acting is also excellent. Dan Stevens is excellent. You can see him if you look closely at The Beast, but not so clearly that it pulls you out of the film. His performance is suitably over the top in anger, but also very charming. Emma Watson was fine, but to be honest, she was basically just playing Hermione, and I didn't get much of a character from her. She likes books, and she's feisty. That's basically all I got. For me, the one saving grace for her character, is you can see how much fun Emma Watson is having. I've heard interviews in which she's expressed how much she's always loved Belle as a character, and it shows.

The stand out for me was Lumieré, voiced by Ewan McGregor. He was hilarious, and over the top, and always fun! He lit up the screen (no pun intended) every time he showed up!

The only real gripes I have with the film are some questionable CGI with the Wolves and with a couple of The Beast's scenes, and some pacing issues. The film flows really well, to such an extent that in some scenes, the camera will dolly away from the character it's focusing on, and will pan across the countryside, and track to another, far away, with out cutting. This works really well, but a couple times, the film will just fade to black, and it's quite jarring. It happens like 3 or 4 times, but it's really noticeable, and took me out of the experience. Also, they added some stuff to the story that I don't want to spoil, but I don't think it worked on any level, story wise, or logically.

Overall, it's a fun movie! I would recommend it to any fan of the original, but those who didn't like the animated classic, or who hate musicals might be better off staying aw''')
正面的评论
python 复制代码
review('''
This is a horrible Disney piece of crap full of completely lame singsongs, a script so wrong it is an insult to other scripts to even call it a script. The only way I could enjoy this is after eating two complete space cakes, and even then I would prefer analysing our wallpaper!''')
负面的评论
  • 到这里用在keras中用多层感知机进行情感预测就结束了,反思实验可以改进的地方,除了神经元的个数,还可以将字典的单词个数设置大一些(原来是2000),数字列表的长度maxlen也可以设置长一些

用RNN模型进行IMDb情感分析

  • 使用RNN的好处也可以思考一哈
python 复制代码
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
python 复制代码
model_rnn=Sequential()
python 复制代码
model_rnn.add(Embedding(output_dim=32,
                   input_dim=2000,
                   input_length=100))
model_rnn.add(Dropout(0.25))
python 复制代码
model_rnn.add(SimpleRNN(units=16))#RNN层有16个神经元
python 复制代码
model_rnn.add(Dense(units=256,activation='relu'))
model_rnn.add(Dropout(0.25))
model_rnn.add(Dense(units=1,activation='sigmoid'))
python 复制代码
model_rnn.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 16)                784       
_________________________________________________________________
dense_1 (Dense)              (None, 256)               4352      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 69,393
Trainable params: 69,393
Non-trainable params: 0
_________________________________________________________________
python 复制代码
model_rnn.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])
train_history=model_rnn.fit(_train,x_train,batch_size=100,
                       epochs=10,verbose=2,
                       validation_split=0.2)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 13s - loss: 0.5200 - acc: 0.7319 - val_loss: 0.6095 - val_acc: 0.6960
Epoch 2/10
 - 12s - loss: 0.3485 - acc: 0.8506 - val_loss: 0.4994 - val_acc: 0.7766
Epoch 3/10
 - 12s - loss: 0.3109 - acc: 0.8710 - val_loss: 0.5842 - val_acc: 0.7598
Epoch 4/10
 - 13s - loss: 0.2874 - acc: 0.8833 - val_loss: 0.4420 - val_acc: 0.8136
Epoch 5/10
 - 12s - loss: 0.2649 - acc: 0.8929 - val_loss: 0.6818 - val_acc: 0.7270
Epoch 6/10
 - 14s - loss: 0.2402 - acc: 0.9035 - val_loss: 0.5634 - val_acc: 0.7984
Epoch 7/10
 - 16s - loss: 0.2084 - acc: 0.9190 - val_loss: 0.6392 - val_acc: 0.7694
Epoch 8/10
 - 16s - loss: 0.1855 - acc: 0.9289 - val_loss: 0.6388 - val_acc: 0.7650
Epoch 9/10
 - 14s - loss: 0.1641 - acc: 0.9367 - val_loss: 0.8356 - val_acc: 0.7592
Epoch 10/10
 - 19s - loss: 0.1430 - acc: 0.9451 - val_loss: 0.7365 - val_acc: 0.7766
python 复制代码
scores=model_rnn.evaluate(_test,x_test)
25000/25000 [==============================] - 14s 567us/step
python 复制代码
scores[1]#提高了大概两个百分点
0.82084

用LSTM模型进行IMDb情感分析

python 复制代码
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
python 复制代码
model_lstm=Sequential()
python 复制代码
model_lstm.add(Embedding(output_dim=32,
                   input_dim=2000,
                   input_length=100))
model_lstm.add(Dropout(0.25))
python 复制代码
model_lstm.add(LSTM(32))
python 复制代码
model_lstm.add(Dense(units=256,activation='relu'))
model_lstm.add(Dropout(0.25))
model_lstm.add(Dense(units=1,activation='sigmoid'))
python 复制代码
model_lstm.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_3 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_3 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
=================================================================
Total params: 81,025
Trainable params: 81,025
Non-trainable params: 0
_________________________________________________________________
python 复制代码
scores=model_rnn.evaluate(_test,x_test)
25000/25000 [==============================] - 13s 522us/step
python 复制代码
scores[1]
0.82084

可以看出和RMNN差不多,这可能因为事评论数据的时间间隔不大,不能充分体现LSTM的优越性

相关推荐
泰迪智能科技012 小时前
高校深度学习视觉应用平台产品介绍
人工智能·深度学习
盛派网络小助手2 小时前
微信 SDK 更新 Sample,NCF 文档和模板更新,更多更新日志,欢迎解锁
开发语言·人工智能·后端·架构·c#
Eric.Lee20212 小时前
Paddle OCR 中英文检测识别 - python 实现
人工智能·opencv·计算机视觉·ocr检测
cd_farsight3 小时前
nlp初学者怎么入门?需要学习哪些?
人工智能·自然语言处理
AI明说3 小时前
评估大语言模型在药物基因组学问答任务中的表现:PGxQA
人工智能·语言模型·自然语言处理·数智药师·数智药学
Focus_Liu3 小时前
NLP-UIE(Universal Information Extraction)
人工智能·自然语言处理
PowerBI学谦3 小时前
使用copilot轻松将电子邮件转为高效会议
人工智能·copilot
audyxiao0013 小时前
AI一周重要会议和活动概览
人工智能·计算机视觉·数据挖掘·多模态
Jeremy_lf4 小时前
【生成模型之三】ControlNet & Latent Diffusion Models论文详解
人工智能·深度学习·stable diffusion·aigc·扩散模型