Word2Vec: Word Embeddings

Word2Vec

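Word2Vec is an NLP tool that Google released in 2013. It represents every word as a numeric vector, so the relationships between words can be measured quantitatively (for example, by how similar their vectors are) and associations between words can be mined from a corpus.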
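
As a rough illustration of what "measuring relationships quantitatively" can mean, the sketch below computes cosine similarity between vectors using only the standard library. The three-dimensional vectors are made up for this example and are not taken from any trained model; `cosine_similarity` and `toy_vectors` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", invented for illustration only.
toy_vectors = {
    "monday": [0.9, 0.1, 0.0],
    "tuesday": [0.8, 0.2, 0.1],
    "company": [0.0, 0.3, 0.9],
}

sim_days = cosine_similarity(toy_vectors["monday"], toy_vectors["tuesday"])
sim_mixed = cosine_similarity(toy_vectors["monday"], toy_vectors["company"])

# Vectors that point in similar directions score close to 1.
print(f"monday vs tuesday: {sim_days:.3f}")
print(f"monday vs company: {sim_mixed:.3f}")
```

With real Word2Vec vectors the same measure is what similarity queries are typically based on; only the vectors themselves come from training.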

Paper: word2Vec.pdf

Link: https://pan.baidu.com/s/1JegdOm2V20v9leTroxnZzQ (extraction code: dykp)

Introduction to `gensim.models.word2vec`

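import csv
import re

import numpy as np
from gensim.models import word2vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# If the stop-word corpus is missing, run once: import nltk; nltk.download("stopwords")
stoplist = set(stopwords.words("english"))  # a set makes membership tests fast
stemmer = PorterStemmer()                   # create once and reuse for every word


def text_clear(text):
    """Clean one raw string and return it as a list of stemmed tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)  # keep only lowercase letters and digits
    tokens = text.split()                    # split on whitespace; drops empty strings
    tokens = [word for word in tokens if word not in stoplist]
    tokens = [stemmer.stem(word) for word in tokens]
    # Mark sentence boundaries with "bos" (beginning) and "eos" (end) tokens.
    return ["bos"] + tokens + ["eos"]


agnews_label = []
agnews_title = []
agnews_text = []

# Each row of train.csv is read as: label, title, text.
with open("./dataset/train.csv", "r", encoding="utf-8", newline="") as f:
    for line in csv.reader(f):
        agnews_label.append(np.float32(line[0]))
        agnews_title.append(text_clear(line[1]))
        agnews_text.append(text_clear(line[2]))

print(agnews_text[0:2])

# Train 64-dimensional vectors on the cleaned article texts.
# min_count=0 keeps every word, including words that occur only once.
model = word2vec.Word2Vec(agnews_text, vector_size=64, min_count=0, window=5, epochs=128)
model_name = "corpusWord2Vec.model"
model.save(model_name)

print("Finish")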

Data loading and preprocessing

The post "Text Topic Extraction: Based on TF-IDF" describes in detail how the AG News dataset is read and cleaned.
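
To make the cleaning steps concrete without needing NLTK, here is a simplified sketch using only the standard library. It is an approximation, not the post's actual function: `TOY_STOPWORDS` is a tiny hand-written stand-in for NLTK's English stop-word list, stemming is left out, and the sample sentence is invented.

```python
import re

# Assumption: a tiny hand-written stop-word set standing in for NLTK's English list.
TOY_STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "to"}


def clean_text_sketch(text):
    """Lowercase, strip non-alphanumerics, drop stop words, add boundary tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)  # punctuation becomes whitespace
    tokens = text.split()                    # split on any run of whitespace
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return ["bos"] + tokens + ["eos"]       # sentence-boundary markers


sample = "Oil prices rise in the U.S. on Monday!"  # invented sample sentence
print(clean_text_sketch(sample))
# -> ['bos', 'oil', 'prices', 'rise', 'u', 's', 'monday', 'eos']
```

Note how "U.S." falls apart into "u" and "s": replacing every non-alphanumeric character with a space also splits abbreviations and any markup left in the raw text.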

Model training and saving

model = word2vec.Word2Vec(agnews_text,vector_size=64,min_count=0, window=5,epochs=128)
model_name = "corpusWord2Vec.model"
model.save(model_name)

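| Parameter | Typical value | Purpose and effect | Tuning advice |
|---|---|---|---|
| `vector_size` | 100, 200, 300 | Dimensionality of the word vectors. | Higher dimensions give more representational capacity, but overfit more easily and cost more to compute. |
| `window` | 5 | Context window size: how many words before and after the center word count as its context. | A larger window captures broader, topic-level semantics; a smaller window captures more syntactic relationships. Typically 5-10. |
| `min_count` | 5 | Frequency threshold. Words that occur fewer times than this are ignored. | Filtering out very rare noise words (such as typos) improves model stability and training speed. For a small corpus, 1 or 2 is reasonable. |
| `workers` | 4 | Number of threads used for training. | Uses multiple CPU cores to speed up training. Usually set to the number of CPU cores. |
| `sg` | 0 or 1 | Training algorithm: 0 is CBOW, 1 is Skip-gram. | CBOW generally trains faster; Skip-gram is slower but tends to represent rare words better. |
| `hs` | 0 | Whether to use hierarchical softmax. With 0 (and `negative` > 0), negative sampling is used instead. | For large vocabularies, negative sampling (`hs=0`) is usually much more efficient than hierarchical softmax. Normally leave it at 0. |
| `negative` | 5 | Number of negative samples ("noise words") drawn per positive example. Negative sampling is used when this is greater than 0. | Larger values make training more robust but slower. Typically 5-20; very large datasets can use smaller values such as 2-5. |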

Model loading

# Assign the result; load() returns the restored model.
model = word2vec.Word2Vec.load("corpusWord2Vec.model")

Getting the word-vector matrix

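import itertools

# Get the word-vector matrix: one row per vocabulary word.
word_vectors_matrix = model.wv.vectors

# Inspect its shape: (vocabulary size, vector_size).
print("Word-vector matrix shape:", word_vectors_matrix.shape)

# The full vocabulary is the dict model.wv.key_to_index (word -> row index).
# print("Vocabulary:", model.wv.key_to_index)
# Print only its first 20 entries.
for key, value in itertools.islice(model.wv.key_to_index.items(), 20):
    print(f"Key: {key}, Value: {value}")

Output:

Word-vector matrix shape: (43525, 64)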
Key: eos, Value: 0
Key: bos, Value: 1
Key: 39, Value: 2
Key: said, Value: 3
Key: new, Value: 4
Key: reuter, Value: 5
Key: year, Value: 6
Key: quot, Value: 7
Key: compani, Value: 8
Key: two, Value: 9
Key: us, Value: 10
Key: first, Value: 11
Key: ap, Value: 12
Key: gt, Value: 13
Key: lt, Value: 14
Key: world, Value: 15
Key: monday, Value: 16
Key: one, Value: 17
Key: wednesday, Value: 18
Key: tuesday, Value: 19

Getting the vector for a single word

# Get the vector for a single word
print("year:", model.wv["year"])

Output:

year: [ -4.5941806    1.7358713    1.329131     1.945462     5.435929
  -3.1395907    4.220834    -0.5986781    1.8163828   -0.23765224
   2.537547     1.4427937   -0.6865506    0.62047076  -1.4648733
  -2.2761319   -2.2882795   -0.56683517  -2.488293     4.761698
   4.301814     1.9047298    3.7248683    1.1285942   -1.5330548
  -0.29018068  -1.7294165   -0.04464156   3.8446014   -0.5445558
  -8.661683     1.1196393    0.35982367  -1.2469587    4.4957056
  -0.51467353  -2.4929457   -2.4596636    5.699205     2.6921985
   0.31560746  -2.7784114   -0.13437042  -1.9150872   -7.094548
   0.8324861    7.189384    -1.117163   -10.094558    -4.1156693
   1.4288932    2.7343435   -3.2910051   -1.9463073    0.41674006
   1.3934506   -4.7457247   -1.6112362   -6.155947     3.937971
   4.773978    -0.9590569   -0.4999122   -3.6928618 ]
