Word Vectors and Word Embeddings

Word2Vec

Principle

Continuous Bag-of-Words (CBOW)

Skip-Gram

GloVe

FastText

Doc2Vec

The gensim Library

Word2Vec

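Model background: Word2Vec was proposed in 2013. It trains a neural network to produce word vectors, addressing two weaknesses of one-hot encoding: the curse of dimensionality and the inability to measure similarity between words. It has two training modes, CBOW (continuous bag-of-words) and Skip-gram. CBOW predicts the center word from its context words; it is computationally efficient and suits frequent words. Skip-gram predicts the context words from the center word; it costs more to train but does better on rare words.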

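Principle

Word2Vec rests on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. It learns word vectors from each word's context, using one of two training modes: the continuous bag-of-words model (CBOW) or the Skip-Gram model.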

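Continuous Bag-of-Words (CBOW)

Description: CBOW predicts the center word from its context words. Given a context window of size C, the model takes all words inside the window (excluding the center word) as input and passes them through the neural network to predict the center word.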
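To make the CBOW input/output relationship concrete, here is a minimal sketch of how training examples could be assembled from a tokenized sentence. The function name `cbow_pairs` and the toy sentence are illustrative assumptions, and the sketch assumes the window size counts words on each side of the center word. It only builds the examples; it does not train a network.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, center_word) examples for CBOW.

    Assumption: `window` is the number of words taken on each side.
    """
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            pairs.append((context, center))
    return pairs


# Hypothetical toy sentence, already tokenized
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for context, center in cbow_pairs(sentence, window=1):
    print(context, "->", center)
```

Each example maps a group of context words to a single target, which is why CBOW produces one prediction per position in the sentence.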

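Skip-Gram

Description: Skip-Gram is the reverse of CBOW: it predicts the context words from the center word. The center word is the input, and the model tries to predict each word inside the context window.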
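The Skip-Gram direction can be sketched the same way: each center word is paired with every word inside its window, one pair at a time. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions, and the window is assumed to count words on each side of the center word.

```python
def skipgram_pairs(tokens, window):
    """Return (center_word, context_word) examples for Skip-Gram.

    Assumption: `window` is the number of words taken on each side.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the center word does not predict itself
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenized
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

Because every (center, context) combination becomes its own example, Skip-Gram generates more training examples than a per-position scheme would for the same text, which is consistent with its higher training cost.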

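In practice: In natural language processing, Word2Vec vectors are used for semantic similarity, text classification, information retrieval, machine translation, named entity recognition, and generation tasks. In recommender systems, the similarity between the word vectors of product descriptions can be used to recommend similar products to a user.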
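The recommendation use case can be sketched with cosine similarity over averaged word vectors. Everything concrete here is an assumption made for illustration: the 3-dimensional vectors are invented, the product names are hypothetical, and averaging word vectors is just one simple way to turn a description into a single vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def description_vector(words, word_vectors):
    """Average the vectors of the known words in a product description."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_similar(target, products, word_vectors):
    """Rank all other products by similarity to the target product."""
    target_vec = description_vector(products[target], word_vectors)
    scores = [
        (name, cosine_similarity(target_vec, description_vector(words, word_vectors)))
        for name, words in products.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented word vectors; a trained model would supply real ones
toy_word_vectors = {
    "phone": [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "novel": [0.0, 0.1, 0.9],
}

# Hypothetical product descriptions, already tokenized
products = {
    "phone_case": ["phone"],
    "usb_charger": ["phone", "charger"],
    "paperback": ["novel"],
}

for name, score in rank_similar("phone_case", products, toy_word_vectors):
    print(f"{name}: {score:.3f}")
```

With a trained model, the toy dictionary would be replaced by lookups into the model's vectors; the ranking logic stays the same.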

Code example: training a Word2Vec model with the gensim library.

Note: a Three Kingdoms text file (sanguo.txt) is required as the training data.

```python
import re

import jieba
from gensim.models import Word2Vec

# Whitespace and punctuation (ASCII and full-width) to strip from each token
PUNCT = re.compile(r"""[\s+\-.!/_,$%^*()"'?~@#&:;“”‘’《》（）！，。？、：；¥￥…\u2014]+""")

# Read the text and segment each line into words
lines = []
with open("sanguo.txt", "r", encoding="utf-8") as f:
    for line in f:
        words = []
        for token in jieba.lcut(line):
            token = PUNCT.sub("", token)
            if len(token) > 0:
                words.append(token)
        if len(words) > 0:
            lines.append(words)

# Train the Word2Vec model (sg=1 selects Skip-gram)
model = Word2Vec(lines, vector_size=20, window=2, min_count=3, epochs=7, negative=10, sg=1)

# Look up a word vector
print("Word vector for 孔明:\n", model.wv.get_vector("孔明"))
# Find the most similar words
print("\nTop 20 words most similar to 孔明:")
print(model.wv.most_similar("孔明", topn=20))
```

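GloVe

Model background: GloVe was proposed in 2014. It combines global corpus statistics with low-dimensional word vectors, capturing semantic relations between words from how often they co-occur across the whole corpus; its core data structure is the global co-occurrence matrix. The training objective is: center word vector · context word vector + bias terms ≈ log(co-occurrence count).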
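The two ingredients described above, a global co-occurrence count and a per-pair fitting error, can be sketched in plain Python. The function names are illustrative assumptions, and the counting is deliberately simple (every pair inside the window counts as 1; real implementations often down-weight distant pairs). The loss follows the original GloVe formulation, which fits the logarithm of the count and weights each pair by f(x) = (x / x_max)^alpha, capped at 1, with x_max = 100 and alpha = 0.75 as the published defaults.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window):
    """Count symmetric co-occurrences of word pairs within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), i):
                counts[(center, tokens[j])] += 1
                counts[(tokens[j], center)] += 1
    return counts


def glove_pair_loss(w_center, w_context, b_center, b_context, x_ij,
                    x_max=100.0, alpha=0.75):
    """Weighted squared error for one word pair with co-occurrence count x_ij > 0."""
    weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    dot = sum(a * b for a, b in zip(w_center, w_context))
    residual = dot + b_center + b_context - math.log(x_ij)
    return weight * residual ** 2


# Hypothetical toy corpus, already tokenized
toy_sentences = [["a", "b", "a"], ["b", "c"]]
counts = cooccurrence_counts(toy_sentences, window=1)
print(counts[("a", "b")], counts[("b", "c")])
```

Training then amounts to adjusting the vectors and biases so that the sum of `glove_pair_loss` over all pairs with a nonzero count becomes small.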

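In practice: GloVe is widely used in text mining and information retrieval, where it helps models capture semantic relations. In document classification, it provides more semantically representative vectors for the words in each document, which supports the classification decision.

Code example: running GloVe with the glove-python package requires a working C++ build environment, and the package supports Python versions only up to 3.8. The example here uses a Python 3.6 environment.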

```python
# Environment setup (shell commands, run before the Python code):
#   conda create -n py_glove python=3.6   # create the Python environment
#   conda activate py_glove               # switch to it
#   cd glove-python-master                # enter the prepared glove-python source directory
#   pip install scipy
#   python setup.py install               # build and install glove-python
#   pip install jieba                     # install the jieba tokenizer

import re

import jieba
from glove import Corpus, Glove

# Whitespace and punctuation (ASCII and full-width) to strip from each token
PUNCT = re.compile(r"""[\s+\-.!/_,$%^*()"'?~@#&:;“”‘’《》（）！，。？、：；¥￥…\u2014]+""")

# Open the file and segment each line into words
lines = []
with open("sanguo.txt", "r", encoding="utf-8") as f:
    for line in f:
        words = []
        for token in jieba.lcut(line):
            token = PUNCT.sub("", token)
            if len(token) > 0:
                words.append(token)
        if len(words) > 0:
            lines.append(words)

# Build the co-occurrence matrix
corpus = Corpus()
corpus.fit(lines, window=10)
# Inspect the vocabulary size
print("Vocabulary size:", len(corpus.dictionary))
# Inspect the co-occurrence matrix
print(corpus.matrix)

# Create and train the GloVe model
glove = Glove(no_components=20, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

# Inspect a word vector and the most similar words
print(glove.word_vectors[glove.dictionary["刘备"]])
print(glove.most_similar("主公", number=10))
```

FastText

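Model background: FastText was developed by the Facebook AI Research team as an extension of Word2Vec. Its core innovation is the subword mechanism: each word is split into subwords (character n-grams) that contribute to its vector, which lets the model capture morphology, handle out-of-vocabulary words, and generalize better. It supports both CBOW and Skip-gram training.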
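The subword mechanism can be illustrated with a short sketch that lists the character n-grams of a word. The function name `char_ngrams` is an illustrative assumption; wrapping the word in `<` and `>` follows the boundary-marker convention FastText uses, and the `min_n` / `max_n` bounds play the same role as the parameters of the same names in gensim's `FastText`. A real implementation additionally hashes the n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=2, max_n=4):
    """List the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("荆州", min_n=2, max_n=3))
print(char_ngrams("where", min_n=3, max_n=3))
```

A word that never appeared in training still shares n-grams with known words, so a vector for it can be assembled from the vectors of those n-grams. This is what allows out-of-vocabulary words to be handled.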

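In practice: FastText performs well in text classification and word-vector generation, and it is especially suited to text with many rare or new words, such as social media text.

Code example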

```python
import re

import jieba
from gensim.models import FastText

# Whitespace and punctuation (ASCII and full-width) to strip from each token
PUNCT = re.compile(r"""[\s+\-.!/_,$%^*()"'?~@#&:;“”‘’《》（）！，。？、：；¥￥…\u2014]+""")

# Read the text and segment each line into words
lines = []
with open("sanguo.txt", "r", encoding="utf-8") as f:
    for line in f:
        words = []
        for token in jieba.lcut(line):
            token = PUNCT.sub("", token)
            if len(token) > 0:
                words.append(token)
        if len(words) > 0:
            lines.append(words)

# Train the FastText model (sg=1 selects Skip-gram; min_n/max_n bound the subword length)
model = FastText(
    sentences=lines,
    vector_size=20,
    window=5,
    min_count=3,
    sg=1,
    epochs=10,
    workers=4,
    min_n=2,
    max_n=4,
)

# Look up a word vector and the most similar words
print("Word vector for 主公:\n", model.wv.get_vector("主公"))
print("Words most similar to 主公:")
print(model.wv.most_similar("主公"))
print("Words most similar to 荆州:")
print(model.wv.most_similar("荆州"))
```

Doc2Vec

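Model background: Doc2Vec was proposed by Google in 2014 as an extension of the Word2Vec idea; its goal is to produce vector representations of sentences, paragraphs, or documents. It has two training modes: DBOW (similar to Skip-gram; the document vector directly predicts words in the document) and DM (similar to CBOW; it predicts the center word, with the document vector joining the context to capture global semantics).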
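The difference between the two modes is easiest to see in the shape of their training examples. The sketch below only builds those examples for one document; the function names, the tag `doc0`, and the toy tokens are illustrative assumptions, and the window is assumed to count words on each side. In gensim's `Doc2Vec`, `dm=1` selects DM and `dm=0` selects DBOW.

```python
def dm_examples(doc_tag, tokens, window):
    """DM: the document tag plus the context words predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(([doc_tag] + context, center))
    return examples


def dbow_examples(doc_tag, tokens):
    """DBOW: the document tag alone predicts each word in the document."""
    return [(doc_tag, word) for word in tokens]


# Hypothetical document, already tokenized
tokens = ["a", "b", "c"]
print(dm_examples("doc0", tokens, window=1))
print(dbow_examples("doc0", tokens))
```

In DM the document tag acts like an extra context word shared by every position in the document, while DBOW ignores word order and context entirely.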

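In practice: Doc2Vec is mainly used for document similarity, document clustering, text classification, information retrieval, and recommender systems. In a news recommender, for example, the articles a user has read are converted to document vectors, and articles on similar topics are recommended.

Code example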

```python
import re

import jieba
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Whitespace and punctuation (ASCII and full-width) to strip from each token
PUNCT = re.compile(r"""[\s+\-.!/_,$%^*()"'?~@#&:;“”‘’《》（）！，。？、：；¥￥…\u2014]+""")

# Read the text and segment each line into words
lines = []
with open("sanguo.txt", "r", encoding="utf-8") as f:
    for line in f:
        words = []
        for token in jieba.lcut(line):
            token = PUNCT.sub("", token)
            if len(token) > 0:
                words.append(token)
        if len(words) > 0:
            lines.append(words)

# Wrap each paragraph as a TaggedDocument, tagged with its index
documents = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(lines)]

# Configure and train the Doc2Vec model (dm=1 selects the DM mode)
model = Doc2Vec(
    vector_size=20,
    window=2,
    min_count=3,
    workers=4,
    dm=1,
)
model.build_vocab(documents)
model.train(
    documents,
    total_examples=model.corpus_count,
    epochs=40,
)

# Inspect word vectors and document vectors
print("Word vector for 荆州:\n", model.wv.get_vector("荆州"))
print("Top 20 words most similar to 荆州:")
print(model.wv.most_similar("荆州", topn=20))
print(documents[2])
print(model.dv["2"])
similar_docs = model.dv.most_similar("2", topn=5)
print("Paragraphs closest to the original document:")
for doc, similarity in similar_docs:
    print(f"Document {doc} similarity: {similarity}")
    print(f"Document {doc} content: {documents[int(doc)]}")
```

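The gensim Library

Overview: gensim is a Python library for natural language processing. It provides a range of tools and algorithms covering the full workflow, from text preprocessing to model training, evaluation, and application.

In practice: The Word2Vec, FastText, and Doc2Vec examples above use gensim to build and train their models; the GloVe example relies on the separate glove-python package. gensim can also be used for text classification, sentiment analysis, topic modeling, and similar tasks. In a news classification task, for example, the news text is preprocessed first and then a gensim model is trained on it to assign categories automatically.

Code example: a TF-IDF model, used here to rank documents by similarity to a query (a building block for text classification).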

```python
import nltk
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

# Download the NLTK stop-word data
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))


# Text preprocessing: lowercase, tokenize, and drop stop words
def preprocess_text(text):
    result = []
    for token in simple_preprocess(text):
        if token not in stop_words:
            result.append(token)
    return result


# Sample texts
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Preprocess the texts
processed_docs = [preprocess_text(doc) for doc in documents]

# Build the dictionary
dictionary = Dictionary(processed_docs)

# Convert each document to a bag-of-words vector
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train the TF-IDF model
tfidf = TfidfModel(corpus)

# Compute similarities between a query and every document
index = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
query_document = "This is a query document."
query_bow = dictionary.doc2bow(preprocess_text(query_document))
query_tfidf = tfidf[query_bow]
sims = index[query_tfidf]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

for doc_position, doc_score in sims:
    print(f"Document {doc_position}: {doc_score}")
```