1. NLP: Natural Language Processing
2. Converting Text into Data
A word such as LISTEN can be represented as a string of numbers; the most common character encoding is ASCII.
e.g. L I S T E N --> 076 073 083 084 069 078
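We can verify those codes with Python's built-in ord():
python
# Each character's ASCII code, via the built-in ord()
print([ord(c) for c in "LISTEN"])  # [76, 73, 83, 84, 69, 78]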
Compared with encoding individual letters, encoding whole words can be simpler.
e.g. suppose 'I love my dog' is represented as 1 2 3 4;
then 'I love my cat' can be represented as 1 2 3 5.
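As a rough plain-Python sketch of this idea (illustration only; the encode helper here is made up):
python
index = {}  # word -> number, filled in as new words appear
def encode(sentence):
    # assign each unseen word the next unused number; reuse known numbers
    return [index.setdefault(word, len(index) + 1) for word in sentence.split()]
print(encode('I love my dog'))  # [1, 2, 3, 4]
print(encode('I love my cat'))  # [1, 2, 3, 5]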
How can we implement this in code? Here we use TensorFlow:
python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

# num_words caps the vocabulary at the 100 most frequent words;
# oov_token stands in for any word not seen during fitting
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(word_index)
print(sequences)
print(padded)
The output:
python
D:\Python\python.exe E:/PyCharmWorkspace/RequirementAnalysis_AI/test_tensorflow.py
{'<oov>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 5 3 2 4 0 0 0]
[ 5 3 2 7 0 0 0]
[ 6 3 2 4 0 0 0]
[ 8 6 9 2 4 10 11]]
Process finished with exit code 0
To keep new, unseen words from becoming unrecognizable:
1. Use a large amount of training data, covering as many words as possible.
2. Use the oov_token argument: set it to a placeholder for out-of-vocabulary content, and the tokenizer will create a token that stands in for every word it does not recognize (see the sketch after the truncation example below).
3. Use padding. By default, pad_sequences pads sentences with 0 at the front so they all share the same length; if you would rather pad at the end instead, set the padding argument to 'post'.
If you don't want the padded length to match the longest sentence, use the maxlen argument to specify the length you need:
python
padded = pad_sequences(sequences, padding='post', maxlen=5)
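For the four example sentences this produces the following (shown as a comment; note that the last sentence loses its first two words, because truncating defaults to 'pre'):
python
print(padded)
# [[ 5  3  2  4  0]
#  [ 5  3  2  7  0]
#  [ 6  3  2  4  0]
#  [ 9  2  4 10 11]]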
If a sentence is longer than maxlen, you can control how it is cut: setting truncating='post' removes words from the end instead of the beginning:
python
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
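To see the oov_token from point 2 in action, tokenize a sentence containing words the tokenizer has never seen: they all fall back to index 1, the '<oov>' entry in the word_index printed earlier ('manatee' here is just a made-up test word):
python
# 'really' and 'manatee' are not in the corpus, so both map to <oov> (index 1)
test_seq = tokenizer.texts_to_sequences(['i really love my manatee'])
print(test_seq)  # [[5, 1, 3, 2, 1]]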
3. Building a Sentiment Analysis Model
First, prepare a JSON test file:
[{"article_link":"https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline": "boehner just wants wife to listen, not come upwith alternative debt-reductionideas","is_sarcastic":1},
{"article_link":"https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne'revival catches up to ourthorny political mood, for better and worse","is_sarcastic":0},
{"article_link": "https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697", "headline": "mom starting to fear son's web series closestthing she will have to grandchild","is_sarcastic":1}]
Use Python to read this JSON file and preprocess it:
python
import json
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Use a raw string so the backslashes in the Windows path are not
# treated as escape sequences
with open(r"E:\PyCharmWorkspace\RequirementAnalysis_AI\sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
The tokenizer.fit_on_texts method creates a token for every word in the corpus;
texts_to_sequences then converts each sentence into a sequence of those tokens.
python
D:\Python\python.exe E:/PyCharmWorkspace/RequirementAnalysis_AI/testcase/tensorflow_BuildEmotionAnalysisModel.py
[ 3 4 5 6 2 7 8 9 10 11 12 13 0]
(3, 13)
Process finished with exit code 0
The first line is a sentence converted into a token sequence;
the second shows that there are 3 sequences in total, each padded to 13 tokens.
For now the output carries no sentiment at all, because we have not yet trained a model on the data.
Next we use the front part of the JSON data to train a neural network and hold out the rest for testing (a training_size of 10 assumes a dataset larger than the three-headline sample above):
python
training_size = 10  # assumes the full sarcasm dataset, not just the 3-headline sample
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Fit the tokenizer on the TRAINING sentences only, so the test set
# behaves like genuinely unseen data
tokenizer = Tokenizer(num_words=13, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

max_length = 13
padding_type = 'post'
trunc_type = 'post'
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length,
                                padding=padding_type, truncating=trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding=padding_type, truncating=trunc_type)
At this point we have turned sentences into numbers, where each number is a token standing for a word. But how do we get from tokens to the sentiment of a sentence? This is where embedding comes in.
Suppose 'good' is represented as 1 and 'bad' as -1;
then 'not bad' could sit around 0.5, and 'not good' around -0.5.
Extending this idea: as more and more sentences are fed in during training, the neural network learns a vector for each word, and summing (or averaging) those vectors yields a sentiment judgment for the whole sentence. This learned word-to-vector mapping is the embedding.
python
vocab_size = 13      # must cover the tokenizer's num_words above
embedding_dim = 16   # not set in the original; 16 is a typical small choice

model = tf.keras.Sequential([
    # input_length must match the padded sequence length (max_length),
    # otherwise fit() will reject the (·, 13) input
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
This is the neural network itself. The top layer is an embedding, in which the sentiment of every word is learned; the GlobalAveragePooling1D layer then pools the word vectors by averaging them and feeds the result forward through the dense layers.
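A minimal sketch of what these two layers do to the tensor shapes (the dimensions here are toy values for illustration, not the model's real ones):
python
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=13, output_dim=4)  # toy sizes
tokens = tf.constant([[5, 3, 2, 4]])            # one tokenized sentence
vectors = emb(tokens)                           # (1, 4, 4): one 4-d vector per token
pooled = tf.keras.layers.GlobalAveragePooling1D()(vectors)  # (1, 4): vectors averaged
print(vectors.shape, pooled.shape)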
python
import numpy as np  # labels are Python lists; some TF versions require arrays

num_epochs = 30
history = model.fit(training_padded, np.array(training_labels), epochs=num_epochs,
                    validation_data=(testing_padded, np.array(testing_labels)),
                    verbose=2)
Training is now as simple as calling model.fit with the training data and labels, and passing testing_padded and testing_labels as the validation data.
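If you want to inspect how training went, the history object returned by model.fit records every metric per epoch (the key names below follow the metrics chosen in compile()):
python
print(history.history.keys())               # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
print(history.history['val_accuracy'][-1])  # validation accuracy after the final epoch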
Honestly, I don't fully understand the training code itself yet; I'll keep moving and come back to it.
python
sentence = [
    "granny starting to fear spiders in the garden might be real",
    "the weather today is bright and sunny"
]
sequences = tokenizer.texts_to_sequences(sentence)
# pad to the same length and with the same settings as the training data
padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
print(model.predict(padded))
So how do we use this network to judge a sentence's sentiment?
With the tokenizer created earlier, convert the sentence into a token sequence, so the test sentences share the same tokens as the training set;
then pad the sequences to the same dimensions as the training set with the same padding_type, and the model can make its sentiment prediction.
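For reference, model.predict returns one sigmoid score per sentence: values near 1 mean the model considers the headline sarcastic, values near 0 the opposite. A hypothetical way to read the scores (0.5 is a common but arbitrary threshold):
python
scores = model.predict(padded)  # shape (2, 1), one score per input sentence
for s, p in zip(sentence, scores):
    print(f"{p[0]:.3f}  {'sarcastic' if p[0] > 0.5 else 'not sarcastic'}  {s}")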