Embedding: What Exactly Gets Embedded?

"Embedding" sounds baffling at first. Embed? Embed what into what, and where? So what is an embedding, really? No rush; let's work through it step by step.

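Starting with a Recommender System

Suppose you are a developer at a travel website and you receive a requirement: given the hotel a traveler has selected, recommend similar hotels. You figure the product manager must have lost it: you have never been to these hotels, so how would you know which ones are alike? Still, complaining doesn't get the work done.

After racking your brain, you spot a clue in the hotel records in the database: if you could pull out the price, address, and service items from each hotel's information and compare them, couldn't you find similar hotels?

Hotel information has explicit fields in the database, so fetching it is easy. But how do you compare it? Matching field by field? Definitely not; the system would crawl. So you come up with a plan: tokenize the hotel information, count how often each word occurs, and then judge whether two hotels are similar by checking whether the word frequencies of the two records are close.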

Cosine Similarity

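Here is an example:

Text A: 我不爱吃蔬菜，我喜欢吃肉 (I don't like eating vegetables; I like eating meat)
Text B: 我不爱吃肉，我爱吃蔬菜 (I don't like eating meat; I love eating vegetables)

Step 1: tokenize
Text A: 我/不爱/吃/蔬菜，我/喜欢/吃/肉
Text B: 我/不爱/吃/肉，我/爱/吃/蔬菜

Step 2: list all the words
我，不爱，爱，吃，肉，蔬菜，喜欢

Step 3: count word frequencies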
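Text A: 我 2, 不爱 1, 爱 0, 吃 2, 肉 1, 蔬菜 1, 喜欢 1
Text B: 我 2, 不爱 1, 爱 1, 吃 2, 肉 1, 蔬菜 1, 喜欢 0

This gives us two arrays, or in other words two vectors:
- A: [2,1,0,2,1,1,1]
- B: [2,1,1,2,1,1,0]

So how do we compute the similarity of two vectors? We can hardly eyeball it. Then you recall some math you once learned: the cosine. We can judge whether two vectors are similar by computing the cosine of the angle between them:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Does that look like a headache? No problem; we can compute it with the numpy package.

import numpy as np

# Define the two 7-dimensional word-frequency vectors
vector_a = np.array([2, 1, 0, 2, 1, 1, 1])
vector_b = np.array([2, 1, 1, 2, 1, 1, 0])

# Dot product
dot_product = np.dot(vector_a, vector_b)

# Euclidean (L2) norm of each vector
norm_a = np.linalg.norm(vector_a)
norm_b = np.linalg.norm(vector_b)

# Cosine similarity
cosine_similarity = dot_product / (norm_a * norm_b)

print("Cosine similarity:", cosine_similarity)

Final result (rounded):

Cosine similarity: 0.9167

Cosine similarity ranges over [-1, 1], so a score of about 0.92 says these two sentences are highly similar.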
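
The hand counting in steps 1 to 3 can also be scripted. The following is a minimal sketch, assuming the token lists are already given (they are copied from step 1 rather than produced by a tokenizer); `to_count_vector` and `cosine` are illustrative helper names, not part of any library.

```python
from collections import Counter
import math

# Token lists copied from the segmentation in step 1 (punctuation dropped).
tokens_a = ["我", "不爱", "吃", "蔬菜", "我", "喜欢", "吃", "肉"]
tokens_b = ["我", "不爱", "吃", "肉", "我", "爱", "吃", "蔬菜"]

# Shared vocabulary, in the same order as step 2.
vocab = ["我", "不爱", "爱", "吃", "肉", "蔬菜", "喜欢"]

def to_count_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

vec_a = to_count_vector(tokens_a, vocab)
vec_b = to_count_vector(tokens_b, vocab)
print("A:", vec_a)
print("B:", vec_b)
print("cosine:", round(cosine(vec_a, vec_b), 4))
```

Because the vectors are built from the tokens rather than typed by hand, a miscount in any single cell cannot slip in unnoticed.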

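But do you see the problem? Semantically, the two sentences say completely different things: one loves meat, the other loves vegetables. How are they similar?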

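N-Gram

Thinking it over, you realize that counting word frequencies only tallies how many times each word occurs and ignores context. So how do we take context into account? If a single word cannot express context, why not widen the token boundary? For example:

Unigrams (one word): A/B/C/D/E/F
Bigrams (two words): AB, BC, CD, DE, EF
Trigrams (three words): ABC, BCD, CDE, DEF
N words per unit, and so on.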

import numpy as np
from collections import Counter

# Character-level N-gram tokenizer: slide a window of n characters over the text
# (punctuation is kept, so it becomes part of some n-grams)
def n_gram_tokenize(text, n=3):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

# Cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# The two sentences
sentence_a = "我不爱吃蔬菜，我喜欢吃肉"
sentence_b = "我不爱吃肉，我爱吃蔬菜"

# Split both sentences into trigrams
tokens_a = n_gram_tokenize(sentence_a, n=3)
tokens_b = n_gram_tokenize(sentence_b, n=3)

# Build the shared vocabulary
vocab = list(set(tokens_a + tokens_b))

# Convert a sentence into a count vector over the vocabulary
def sentence_to_vector(tokens, vocab):
    counter = Counter(tokens)
    return [counter.get(word, 0) for word in vocab]

vector_a = sentence_to_vector(tokens_a, vocab)
vector_b = sentence_to_vector(tokens_b, vocab)

# Compute the cosine similarity
similarity = cosine_similarity(vector_a, vector_b)
print(f"Cosine similarity: {similarity:.4f}")

Result:

Cosine similarity: 0.4216

That is a clear improvement: the two sentences share only 4 trigrams (out of 10 and 9 respectively), so they no longer look so alike.

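This is the so-called N-Gram model. N-grams have the following characteristics:

- They rest on one assumption: the occurrence of the n-th word depends only on the preceding n-1 words and on no other words.
- N=1 is a unigram, N=2 a bigram, and N=3 a trigram.
- Given a piece of text, an N-gram is a sequence of N consecutive items in it.
- They help when first-order features are not enough. For example, when processing text, each keyword is one feature, but sometimes that is not very useful; with N-grams you can capture the combined feature of adjacent keywords.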
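
The `n_gram_tokenize` function above slides over characters. When the text has already been split into words, the same idea applies to the token list instead. A minimal sketch, where `word_ngrams` is an illustrative name and the tokens are the placeholder letters from the A/B/C/D/E/F illustration:

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["A", "B", "C", "D", "E", "F"]
for n in (1, 2, 3):
    # Join each tuple only for display; keep tuples when counting features.
    print(n, ["".join(gram) for gram in word_ngrams(tokens, n)])
```

Tuples are used so that an n-gram stays hashable and can be counted with `Counter` without the ambiguity that joining words into one string would introduce.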

Next, let's put this into practice with a hotel recommender system.

Dataset used: github.com/susanli2016...

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
pd.options.display.max_columns = 30
import matplotlib.pyplot as plt
# Font setting so that matplotlib can render Chinese labels
plt.rcParams['font.sans-serif'] = ['SimHei']
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
# Explore the data
print(df.head())
print('Number of hotels in the dataset:', len(df))

The dataset contains 152 hotels in total. We pick the hotel at index 10.

def print_description(index):
    # Fetch the description and name of the hotel at the given row index
    example = df[df.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])
print('Description of the hotel at index 10:')
print_description(10)

Next we clean the raw data. English contains many words that recur constantly but carry little discriminative power, so we define a stop-word list.

# English stop-word list
ENGLISH_STOPWORDS = {
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 
    'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', 
    "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 
    'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 
    'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 
    'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 
    'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 
    'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 
    "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', 
    "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', 
    "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 
    'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
}
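# Text preprocessing
# Brackets and the pipe must be escaped inside the character class
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Use the custom English stop-word list instead of nltk's stopwords
STOPWORDS = ENGLISH_STOPWORDS
# Clean one piece of text
def clean_text(text):
    # Lowercase everything
    text = text.lower()
    # Replace certain special symbols (such as punctuation) with spaces
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # Remove every character matched by BAD_SYMBOLS_RE
    text = BAD_SYMBOLS_RE.sub('', text)
    # Drop stop words
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text
# Clean the desc column; apply runs the function on every value in the column
df['desc_clean'] = df['desc'].apply(clean_text)
print(df['desc_clean'])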

Next we start modeling the dataset.

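This step brings in another concept: TF-IDF. We said earlier that N-grams can capture some of the context around a word, but a problem remains. If an n-gram feature appears frequently in one document, we can call it a prominent feature of that document. But if it appears frequently in every document, it does little to tell them apart. So n-grams alone are still rather crude, and TF-IDF is used to refine them: the term frequency (TF) rewards features that are frequent within a document, while the inverse document frequency (IDF) penalizes features that are common across documents.

In other words, the larger the TF-IDF value, the more prominent and the more discriminative the feature is.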
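
To make the weighting concrete, here is a minimal sketch of the textbook formula, tf × log(N / df), on a made-up three-document corpus. The corpus and the `tf_idf` helper are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made-up data for illustration).
docs = [
    ["free", "wifi", "downtown", "hotel"],
    ["airport", "shuttle", "hotel"],
    ["downtown", "hotel", "pool", "pool"],
]

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

for i, row in enumerate(tf_idf(docs)):
    print(i, {term: round(score, 3) for term, score in row.items()})
```

"hotel" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, while "pool" is both frequent in its document and absent elsewhere, so it scores highest there.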

# Modeling
df.set_index('name', inplace=True)
# Extract text features with TF-IDF (1- to 3-grams), using the custom stop-word list
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.01, stop_words=list(ENGLISH_STOPWORDS))
# Compute TF-IDF over the desc_clean column
tfidf_matrix = tf.fit_transform(df['desc_clean'])
print('Number of TF-IDF features:')
#print(tf.get_feature_names_out())
print(len(tf.get_feature_names_out()))
#print('tfidf_matrix:')
#print(tfidf_matrix)
#print(tfidf_matrix.shape)

We can see that 3347 features were extracted in total. Next, we compute the similarity matrix of the hotel descriptions.

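# Cosine similarity between hotels. TfidfVectorizer L2-normalizes each row by default,
# so the linear kernel (plain dot product) already equals the cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
#print(cosine_similarities)
print(cosine_similarities.shape)
indices = pd.Series(df.index)  # df.index holds the hotel names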
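
Why a plain dot product is enough here: once every vector has been scaled to unit length, the denominator of the cosine formula is 1. The sketch below checks this on two made-up vectors using only the standard library; `l2_normalize`, `dot`, and `cosine` are illustrative helpers.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def l2_normalize(v):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up feature vectors for illustration.
u = [0.2, 0.0, 1.3, 0.7]
v = [0.9, 0.4, 0.1, 0.0]

# The two printed numbers agree up to floating-point rounding.
print(round(dot(l2_normalize(u), l2_normalize(v)), 6), round(cosine(u, v), 6))
```

Normalizing once up front and then using dot products is the cheaper choice when the same vectors are compared many times, as in an all-pairs similarity matrix.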

Then, based on the similarity matrix from the previous step, we try recommending hotels.

def recommendations(name, cosine_similarities=cosine_similarities):
    recommended_hotels = []
    # Find the row index of the hotel we are querying
    idx = indices[indices == name].index[0]
    print('idx=', idx)
    # Sort this hotel's cosine similarities in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    # Take the 10 highest scores, skipping the first entry (expected to be the hotel itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    # Map the row indices back to hotel names
    for i in top_10_indexes:
        recommended_hotels.append(df.index[i])
    return recommended_hotels
print(recommendations('Hilton Seattle Airport & Conference Center'))

As you can see, the results are decent.
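
One detail in `recommendations` is worth a second look: it drops the first entry of the sorted scores on the assumption that this entry is the queried hotel. If another hotel had an identical description, the two would tie and either could come first. A minimal alternative sketch excludes the query by its index instead; `top_n_similar` and the sample similarity row are made up for illustration.

```python
def top_n_similar(sim_row, self_idx, n=10):
    """Return the indices of the n most similar items, excluding self_idx."""
    candidates = [i for i in range(len(sim_row)) if i != self_idx]
    candidates.sort(key=lambda i: sim_row[i], reverse=True)
    return candidates[:n]

# Made-up similarity row for item 1; item 3 has an identical description.
sim_row = [0.30, 1.00, 0.75, 1.00, 0.10]
print(top_n_similar(sim_row, self_idx=1, n=3))
```

With this approach the duplicate (item 3) is still recommended, and the queried item can never appear in its own result list.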

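But this is still not ideal. Why? The feature extraction above produced 3347 features, and that is with fairly short descriptions in the original dataset. Feed in an entire novel and the dimensionality explodes.

This is where today's topic, embedding, comes in.

Embedding

We just saw that the extracted feature matrix has too many dimensions, so the computation can become too heavy. What to do? If there are too many dimensions, reduce them. We can embed the high-dimensional features into a space of fixed dimension, so that every item becomes a vector of the same length. That is embedding: stuffing a pile of things into a box of fixed size.

With the dimensionality reduced, we can still compute the similarity between vectors from the cosine of the angle between them.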

Word2Vec

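Word2Vec is one way to produce embeddings. There are many other embedding models out there; it is used here only as an example to make the principle easier to understand.

The figure shows a simplified Word2Vec neural network with three layers: Input, Hidden, and Output. Suppose we hand Word2Vec a copy of Romance of the Three Kingdoms (《三国演义》). Word2Vec compresses the whole book into vectors while capturing features of the context each word appears in. For example, Zhang Fei (张飞) belongs to Shu (蜀国) and Liu Bei (刘备) belongs to Shu, so Zhang Fei, Liu Bei, and Shu end up semantically close, whereas Cao Cao (曹操) and Shu end up farther apart. As a result, the hidden layer forms a matrix of size n*m, where n is the number of words in the vocabulary and m is the number of dimensions per word.

This gives us a lookup table.

Given a word, we find its vector in the hidden-layer matrix and output it. We can then take this vector, which carries the word's contextual features, on to downstream tasks.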
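
The lookup itself is just row indexing into the n*m matrix. Here is a minimal sketch with a four-word vocabulary and hand-written 3-dimensional vectors; the numbers are made up for illustration and are not the output of any trained model.

```python
import math

# Hypothetical vocabulary (n = 4) and embedding matrix (m = 3); values are made up.
vocab = ["张飞", "刘备", "蜀国", "曹操"]
word_to_row = {word: i for i, word in enumerate(vocab)}
embedding_matrix = [
    [0.9, 0.8, 0.1],  # 张飞
    [0.8, 0.9, 0.2],  # 刘备
    [0.7, 0.9, 0.1],  # 蜀国
    [0.1, 0.2, 0.9],  # 曹操
]

def lookup(word):
    """Return the embedding row for a word."""
    return embedding_matrix[word_to_row[word]]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

print("刘备 vs 蜀国:", round(cosine(lookup("刘备"), lookup("蜀国")), 3))
print("曹操 vs 蜀国:", round(cosine(lookup("曹操"), lookup("蜀国")), 3))
```

Every word costs the same m numbers no matter how large the corpus is, which is what keeps the dimensionality fixed.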

The Two Modes of Word2Vec

1. Skip-Gram: given the input word, predict its context.

2. CBOW: given the context, predict the corresponding word.
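
The difference between the two modes is easiest to see in the training examples each one is built from. The sketch below only generates those (input, target) pairs from a token list. It is not a training loop, and real implementations add further steps such as subsampling and negative sampling. The function names and the sample sentence are assumptions for illustration.

```python
def context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    lo, hi = max(0, i - window), min(length, i + window + 1)
    return [j for j in range(lo, hi) if j != i]

def skip_gram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input, one neighbor the target."""
    return [
        (center, tokens[j])
        for i, center in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]

def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: all neighbors are the input, the center the target."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], center)
        for i, center in enumerate(tokens)
    ]

tokens = ["刘备", "张飞", "属于", "蜀国"]
print(skip_gram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-Gram yields one example per (word, neighbor) pair, while CBOW yields one example per position with all neighbors pooled together.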

That is roughly the principle. Now let's see it in practice, using the Gensim toolkit. Install it with:

pip install gensim

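gensim is an open-source toolkit that learns latent topic vector representations from unstructured text in an unsupervised way. Each vector transformation corresponds to a topic model, and it supports a range of algorithms including TF-IDF, LDA, LSA, and word2vec.

Key parameters
- window: the maximum distance between the current word and the predicted word within a sentence
- min_count: the minimum number of occurrences a word needs in order to be trained on; default 5
- vector_size (named size before Gensim 4.0): the vector dimensionality; default 100
- workers: the number of worker threads used for training; default 3

We use Journey to the West (西游记) as the corpus and compute the similarity between characters in the novel.

First, we segment the novel into words with the jieba library.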

# -*- coding: utf-8 -*-
# Segment the Chinese text in the .txt files into words
import jieba
import os
# Note: utils.files_processing is a local helper module that is not shown in this article
from utils import files_processing

# Directory of the source files and directory for the segmented output
source_folder = './journey_to_the_west/source'
segment_folder = './journey_to_the_west/segment'

# Segment the full content of each file into space-separated words
def segment_lines(file_list, segment_out_dir, stopwords=()):
    # Make sure the output directory exists
    os.makedirs(segment_out_dir, exist_ok=True)
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'rb') as f:
            document = f.read()
            document_cut = jieba.cut(document)
            sentence_segment = []
            for word in document_cut:
                if word not in stopwords:
                    sentence_segment.append(word)
            result = ' '.join(sentence_segment)
            result = result.encode('utf-8')
            with open(segment_out_name, 'wb') as f2:
                f2.write(result)

# Segment the .txt files under source and write the results to the segment directory
file_list = files_processing.get_files_list(source_folder, postfix='*.txt')
segment_lines(file_list, segment_folder)

Then we turn the processed files into a sentence iterator.

from gensim.models import word2vec
import multiprocessing

# PathLineSentences handles a directory that contains multiple files
segment_folder = './journey_to_the_west/segment'
sentences = word2vec.PathLineSentences(segment_folder)

Finally, we train with word2vec.

# Set the model parameters and train
model = word2vec.Word2Vec(sentences, vector_size=100, window=3, min_count=1)
print(model.wv.similarity('孙悟空', '猪八戒'))  # similarity between Sun Wukong and Zhu Bajie
print(model.wv.similarity('孙悟空', '孙行者'))  # similarity between Sun Wukong and Sun Xingzhe
print(model.wv.most_similar(positive=['孙悟空', '唐僧'], negative=['孙行者']))  # 孙悟空 + 唐僧 - 孙行者 = ?

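Let's look at the output.

The results don't look great, so let's adjust the parameters and try again.

import os

# Set the model parameters and train
model2 = word2vec.Word2Vec(sentences, vector_size=128, window=5, min_count=5, workers=multiprocessing.cpu_count())
# Save the model (create the target directory first if it is missing)
os.makedirs('./models', exist_ok=True)
model2.save('./models/word2Vec.model')
print(model2.wv.similarity('孙悟空', '猪八戒'))
print(model2.wv.similarity('孙悟空', '孙行者'))
print(model2.wv.most_similar(positive=['孙悟空', '唐僧'], negative=['孙行者']))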

That is a bit better, but still not ideal, because word2vec is still a relatively crude approach.
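
The `most_similar(positive=..., negative=...)` calls above amount to vector arithmetic followed by a cosine ranking. The sketch below illustrates that idea in plain Python on made-up 2-dimensional vectors. It is written in the spirit of the gensim call, not copied from its implementation, and the vectors and the extra word 唐三藏 are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The input words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up vectors: axis 0 loosely "monkey-like", axis 1 loosely "monk-like".
vectors = {
    "孙悟空": [1.0, 0.1],
    "孙行者": [0.9, 0.2],
    "唐僧": [0.1, 1.0],
    "唐三藏": [0.2, 0.9],
    "猪八戒": [0.7, 0.6],
}
print(most_similar(vectors, positive=["孙悟空", "唐僧"], negative=["孙行者"]))
```

Here 孙悟空 and 孙行者 nearly cancel out, so the query ends up close to 唐僧 and the nearest remaining word is 唐三藏.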

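Still, this principle offers a glimpse into one corner of recommender systems. Swap the text features for the features of products, videos, novels, and so on: we extract those features, and once a user has bought a product or watched a video, we match the items whose features are most similar and return the top-n by similarity.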