1. NLP: Natural Language Processing
2. Converting Text into Data
A word such as LISTEN can be represented as a string of numbers; the most common character encoding is ASCII.
e.g. L I S T E N --> 076 073 083 084 069 078
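We can verify those codes with Python's built-in ord():
python
# Each character's ASCII code, via the built-in ord()
print([ord(c) for c in "LISTEN"])  # [76, 73, 83, 84, 69, 78]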
Compared with encoding individual letters, encoding whole words can be simpler.
e.g. suppose 'I love my dog' is represented as 1 2 3 4;
then 'I love my cat' can be represented as 1 2 3 5.
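As a rough plain-Python sketch of this idea (illustration only; the encode helper here is made up):
python
index = {}  # word -> number, filled in as new words appear
def encode(sentence):
    # assign each unseen word the next unused number; reuse known numbers
    return [index.setdefault(word, len(index) + 1) for word in sentence.split()]
print(encode('I love my dog'))  # [1, 2, 3, 4]
print(encode('I love my cat'))  # [1, 2, 3, 5]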
How can we implement this in code? Here we use TensorFlow:
python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

# num_words caps the vocabulary at the 100 most frequent words;
# oov_token stands in for any word not seen during fitting
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(word_index)
print(sequences)
print(padded)
The output:
python
D:\Python\python.exe E:/PyCharmWorkspace/RequirementAnalysis_AI/test_tensorflow.py
{'<oov>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 5 3 2 4 0 0 0]
[ 5 3 2 7 0 0 0]
[ 6 3 2 4 0 0 0]
[ 8 6 9 2 4 10 11]]
Process finished with exit code 0
To keep new, unseen words from becoming unrecognizable:
1. Use a large amount of training data, covering as many words as possible.
2. Use the oov_token argument: set it to a placeholder for out-of-vocabulary content, and the tokenizer will create a token that stands in for every word it does not recognize (see the sketch after the truncation example below).
3. Use padding. By default, pad_sequences pads sentences with 0 at the front so they all share the same length; if you would rather pad at the end instead, set the padding argument to 'post'.
If you don't want the padded length to match the longest sentence, use the maxlen argument to specify the length you need:
python
padded = pad_sequences(sequences, padding='post', maxlen=5)
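For the four example sentences this produces the following (shown as a comment; note that the last sentence loses its first two words, because truncating defaults to 'pre'):
python
print(padded)
# [[ 5  3  2  4  0]
#  [ 5  3  2  7  0]
#  [ 6  3  2  4  0]
#  [ 9  2  4 10 11]]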
If a sentence is longer than maxlen, you can control how it is cut: setting truncating='post' removes words from the end instead of the beginning:
python
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
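To see the oov_token from point 2 in action, tokenize a sentence containing words the tokenizer has never seen: they all fall back to index 1, the '<oov>' entry in the word_index printed earlier ('manatee' here is just a made-up test word):
python
# 'really' and 'manatee' are not in the corpus, so both map to <oov> (index 1)
test_seq = tokenizer.texts_to_sequences(['i really love my manatee'])
print(test_seq)  # [[5, 1, 3, 2, 1]]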
3. Building a Sentiment Analysis Model
First, prepare a JSON test file:
[{"article_link":"https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline": "boehner just wants wife to listen, not come upwith alternative debt-reductionideas","is_sarcastic":1},
{"article_link":"https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne'revival catches up to ourthorny political mood, for better and worse","is_sarcastic":0},
{"article_link": "https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697", "headline": "mom starting to fear son's web series closestthing she will have to grandchild","is_sarcastic":1}]
Use Python to read this JSON file and preprocess it:
python
import json
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Use a raw string so the backslashes in the Windows path are not
# treated as escape sequences
with open(r"E:\PyCharmWorkspace\RequirementAnalysis_AI\sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
The tokenizer.fit_on_texts method creates a token for every word in the corpus;
texts_to_sequences then converts each sentence into a sequence of those tokens.
python
D:\Python\python.exe E:/PyCharmWorkspace/RequirementAnalysis_AI/testcase/tensorflow_BuildEmotionAnalysisModel.py
[ 3 4 5 6 2 7 8 9 10 11 12 13 0]
(3, 13)
Process finished with exit code 0
The first line is a sentence converted into a token sequence;
the second shows that there are 3 sequences in total, each padded to 13 tokens.
For now the output carries no sentiment at all, because we have not yet trained a model on the data.
Next we use the front part of the JSON data to train a neural network and hold out the rest for testing (a training_size of 10 assumes a dataset larger than the three-headline sample above):
python
training_size = 10  # assumes the full sarcasm dataset, not just the 3-headline sample
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Fit the tokenizer on the TRAINING sentences only, so the test set
# behaves like genuinely unseen data
tokenizer = Tokenizer(num_words=13, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

max_length = 13
padding_type = 'post'
trunc_type = 'post'
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length,
                                padding=padding_type, truncating=trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding=padding_type, truncating=trunc_type)
At this point we have turned sentences into numbers, where each number is a token standing for a word. But how do we get from tokens to the sentiment of a sentence? This is where embedding comes in.
Suppose 'good' is represented as 1 and 'bad' as -1;
then 'not bad' could sit around 0.5, and 'not good' around -0.5.
Extending this idea: as more and more sentences are fed in during training, the neural network learns a vector for each word, and summing (or averaging) those vectors yields a sentiment judgment for the whole sentence. This learned word-to-vector mapping is the embedding.
python
vocab_size = 13      # must cover the tokenizer's num_words above
embedding_dim = 16   # not set in the original; 16 is a typical small choice

model = tf.keras.Sequential([
    # input_length must match the padded sequence length (max_length),
    # otherwise fit() will reject the (·, 13) input
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
This is the neural network itself. The top layer is an embedding, in which the sentiment of every word is learned; the GlobalAveragePooling1D layer then pools the word vectors by averaging them and feeds the result forward through the dense layers.
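A minimal sketch of what these two layers do to the tensor shapes (the dimensions here are toy values for illustration, not the model's real ones):
python
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=13, output_dim=4)  # toy sizes
tokens = tf.constant([[5, 3, 2, 4]])            # one tokenized sentence
vectors = emb(tokens)                           # (1, 4, 4): one 4-d vector per token
pooled = tf.keras.layers.GlobalAveragePooling1D()(vectors)  # (1, 4): vectors averaged
print(vectors.shape, pooled.shape)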
python
import numpy as np  # labels are Python lists; some TF versions require arrays

num_epochs = 30
history = model.fit(training_padded, np.array(training_labels), epochs=num_epochs,
                    validation_data=(testing_padded, np.array(testing_labels)),
                    verbose=2)
Training is now as simple as calling model.fit with the training data and labels, and passing testing_padded and testing_labels as the validation data.
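If you want to inspect how training went, the history object returned by model.fit records every metric per epoch (the key names below follow the metrics chosen in compile()):
python
print(history.history.keys())               # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
print(history.history['val_accuracy'][-1])  # validation accuracy after the final epoch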
Honestly, I don't fully understand the training code itself yet; I'll keep moving and come back to it.
python
sentence = [
    "granny starting to fear spiders in the garden might be real",
    "the weather today is bright and sunny"
]
sequences = tokenizer.texts_to_sequences(sentence)
# pad to the same length and with the same settings as the training data
padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
print(model.predict(padded))
So how do we use this network to judge a sentence's sentiment?
With the tokenizer created earlier, convert the sentence into a token sequence, so the test sentences share the same tokens as the training set;
then pad the sequences to the same dimensions as the training set with the same padding_type, and the model can make its sentiment prediction.
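For reference, model.predict returns one sigmoid score per sentence: values near 1 mean the model considers the headline sarcastic, values near 0 the opposite. A hypothetical way to read the scores (0.5 is a common but arbitrary threshold):
python
scores = model.predict(padded)  # shape (2, 1), one score per input sentence
for s, p in zip(sentence, scores):
    print(f"{p[0]:.3f}  {'sarcastic' if p[0] > 0.5 else 'not sarcastic'}  {s}")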