[AI Large Model Development] Word Vector Analysis of Classical Chinese Novels with Word2Vec: A Hands-On Guide

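I. Project Overview

This project uses the Gensim library to build word embeddings for classical Chinese novels. The Word2Vec algorithm maps each word in the text to a vector, which makes word similarity calculation and analogy reasoning possible. Two worked cases are included, Journey to the West (《西游记》) and Romance of the Three Kingdoms (《三国演义》), and the project is suitable for newcomers to natural language processing who want something to study and practice on.

Core Features

Chinese word segmentation
Word2Vec model training and saving
Word similarity calculation
Similar-word retrieval
Word analogy reasoning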

II. Project Structure

word2vec/
├── 20260117_three_kingdoms_embedding/  # Romance of the Three Kingdoms analysis module
│   ├── requirements.txt                 # dependency list
│   └── three_kingdoms_embedding.py      # main program for the Three Kingdoms analysis
├── journey_to_the_west/                 # Journey to the West dataset
│   ├── segment/                         # segmentation output
│   │   └── segment_0.txt
│   └── source/                          # raw text
│       └── journey_to_the_west.txt
├── models/                              # saved models
│   └── word2Vec.model
├── three_kingdoms/                      # Romance of the Three Kingdoms dataset
│   └── source/
│       └── three_kingdoms.txt
├── utils/                               # utility functions
│   ├── __init__.py
│   ├── create_batch_data.py
│   ├── create_word2vec.py
│   ├── files_processing.py
│   └── segment.py
├── requirements.txt                     # project dependencies
├── word_seg.py                          # segmentation script
├── word_seg.ipynb                       # segmentation notebook
├── word_similarity.py                   # similarity analysis script
└── word_similarity.ipynb                # similarity analysis notebook

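Main Files

word_seg.py: Chinese word segmentation script, built on jieba
word_similarity.py: main program for the Journey to the West Word2Vec analysis
three_kingdoms_embedding.py: main program for the Romance of the Three Kingdoms Word2Vec analysis
utils/segment.py: segmentation helper functions
utils/files_processing.py: file handling helper functions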
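
The source of word_seg.py is not reproduced in this article. As a rough sketch of what a segmentation step has to produce (one sentence per line, tokens separated by spaces, which is the format PathLineSentences reads), the hypothetical helper below takes the tokenizer as an argument. The function name `segment_lines` and the sample sentences are assumptions made for illustration. In the real project the tokenizer would be `jieba.lcut`; a plain whitespace splitter stands in here so the sketch runs without jieba.

```python
from typing import Callable, Iterable, List, Set


def segment_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    stopwords: Set[str],
) -> List[str]:
    """Turn raw text lines into space-separated token lines."""
    segmented = []
    for line in lines:
        # Drop stopwords and whitespace-only tokens.
        tokens = [t for t in tokenize(line) if t.strip() and t not in stopwords]
        if tokens:  # skip lines that end up empty
            segmented.append(" ".join(tokens))
    return segmented


# Stand-in tokenizer: str.split. With jieba installed you would pass
# tokenize=jieba.lcut and feed it unsegmented text instead.
demo = segment_lines(
    ["悟空 道 ： 师父 放心", "", "八戒 笑 道"],
    tokenize=str.split,
    stopwords={"道", "："},
)
print(demo)  # ['悟空 师父 放心', '八戒 笑']
```

Passing the tokenizer in keeps the filtering logic testable on its own and makes it easy to swap segmentation modes later.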

III. Environment Setup and Dependencies

1. Python Requirements

Python 3.7+
Anaconda or a virtual environment is recommended

2. Installing Dependencies

# Install the base dependencies
pip install gensim jieba numpy scipy

# Or use the requirements.txt shipped with the project
pip install -r requirements.txt
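
The code in this article passes `vector_size` to Word2Vec, which is the gensim 4.x parameter name (gensim 3.x called it `size`), so it can be worth confirming which versions actually got installed. The helper below is a small sketch using only the standard library; the function name `installed_versions` is an assumption, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["gensim", "jieba", "numpy", "scipy"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```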

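IV. Journey to the West: Word Embedding Analysis in Practice

1. Data Preparation

The project ships with preprocessed Journey to the West text at journey_to_the_west/source/journey_to_the_west.txt. The training code below reads the segmented files under journey_to_the_west/segment.

2. Code Walkthrough

Core code of word_similarity.py: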

# -*- coding: utf-8 -*-
from gensim.models import word2vec
import multiprocessing
import os

# Absolute path of the directory that contains this script
script_dir = os.path.dirname(os.path.abspath(__file__))

# Directory of segmented sentences (one sentence per line, tokens separated by spaces)
segment_folder = os.path.join(script_dir, 'journey_to_the_west', 'segment')
sentences = word2vec.PathLineSentences(segment_folder)

# Train a first model with explicit parameters
model = word2vec.Word2Vec(sentences, vector_size=100, window=3, min_count=1)

# Word similarity (cosine similarity of the two word vectors)
print(model.wv.similarity('孙悟空', '猪八戒'))
print(model.wv.similarity('孙悟空', '孙行者'))

# Word analogy: 孙悟空 + 唐僧 - 孙行者
print(model.wv.most_similar(positive=['孙悟空', '唐僧'], negative=['孙行者']))

# Train a second model with adjusted parameters
model2 = word2vec.Word2Vec(sentences, vector_size=128, window=5, min_count=5, workers=multiprocessing.cpu_count())

# Save the model
model_save_path = os.path.join(script_dir, 'models', 'word2Vec.model')
model2.save(model_save_path)
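
The similarity score printed above is the cosine of the angle between two word vectors. The following sketch computes the same quantity in plain Python on made-up 3-dimensional vectors; the function name `cosine` and the zero-vector convention are assumptions of this example, not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only.
parallel = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # same direction, close to 1
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated directions, 0
print(parallel, orthogonal)
```

Because only the direction of the vectors matters, two words can score close to 1 even when their vectors have very different lengths.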

3. Running the Program

python word_similarity.py

4. Interpreting the Output

0.9699761            # similarity between 孙悟空 (Sun Wukong) and 猪八戒 (Zhu Bajie)
0.9831075            # similarity between 孙悟空 (Sun Wukong) and 孙行者 (Pilgrim Sun)
[('大王', 0.98641425), ('我儿', 0.97925192), ...]  # analogy result
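
The analogy query adds the vectors of the positive words, subtracts the vectors of the negative words, and ranks every other word by cosine similarity to the result. The sketch below reproduces that idea in plain Python. It is a simplified illustration of the technique, not gensim's code, and the 2-D toy vectors (axis 0 loosely "royalty", axis 1 loosely "gender") are invented for the example.

```python
import math


def unit(v):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    inputs = set(positive) | set(negative)
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in inputs  # the query words themselves are excluded
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


# Invented toy vectors, for illustration only.
toy = {
    "king": [1.0, -1.0],
    "queen": [1.0, 1.0],
    "man": [0.2, -1.0],
    "woman": [0.2, 1.0],
    "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"], topn=1)
print(result[0][0])  # queen
```

On a small corpus almost all similarities tend to be high, as in the output above, so the ranking is usually more informative than the absolute scores.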

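V. Romance of the Three Kingdoms: Word Embedding Analysis in Practice

1. Data Preparation

The Romance of the Three Kingdoms text is located at three_kingdoms/source/three_kingdoms.txt.

2. Code Walkthrough

Core functions of three_kingdoms_embedding.py: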

import multiprocessing

import jieba
from gensim.models import word2vec

# data_path, segment_path, model_path and get_stopwords() are assumed to be
# defined elsewhere in the script; they are not shown in this excerpt.

# 1. Segment the text
def segment_text():
    stopwords = get_stopwords()
    with open(data_path, 'r', encoding='utf-8') as f:
        content = f.read()
    seg_list = jieba.cut(content)
    filtered_words = [word for word in seg_list if word not in stopwords]
    # Save the result: one sentence per line, tokens separated by spaces
    with open(segment_path, 'w', encoding='utf-8') as f:
        sentences = ' '.join(filtered_words).split('\n')
        for sentence in sentences:
            if sentence.strip():
                f.write(sentence.strip() + '\n')

# 2. Train the Word2Vec model
def train_word2vec():
    sentences = word2vec.PathLineSentences(segment_path)
    model = word2vec.Word2Vec(
        sentences,
        vector_size=100,
        window=5,
        min_count=5,
        workers=multiprocessing.cpu_count()
    )
    model.save(model_path)
    return model

# 3. Print the words most similar to a target word
def analyze_similar_words(model, target_word='曹操', topn=10):
    similar_words = model.wv.most_similar(target_word, topn=topn)
    for word, similarity in similar_words:
        print(f"{word}: {similarity:.4f}")

# 4. Analogy reasoning (positive words are added, negative words subtracted)
def analogical_reasoning(model, positive_words, negative_words, topn=10):
    result = model.wv.most_similar(positive=positive_words, negative=negative_words, topn=topn)
    for word, similarity in result:
        print(f"{word}: {similarity:.4f}")
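
One practical caveat: with min_count=5, any word that occurs fewer than five times is dropped from the vocabulary, and `most_similar` raises a KeyError when asked about a word it does not know. A small guard avoids that. The helper name `most_similar_or_empty` is hypothetical, and the stub class with its scores is made up so that the sketch runs without a trained model; in real use you would pass `model.wv`.

```python
def most_similar_or_empty(keyed_vectors, word, topn=10):
    """Return nearest neighbours, or [] when the word is out of vocabulary."""
    if word not in keyed_vectors:
        return []
    return keyed_vectors.most_similar(word, topn=topn)


class _StubVectors:
    """Tiny stand-in for model.wv (membership test plus most_similar)."""

    def __init__(self, neighbours):
        self._neighbours = neighbours

    def __contains__(self, word):
        return word in self._neighbours

    def most_similar(self, word, topn=10):
        return self._neighbours[word][:topn]


# Made-up neighbours and scores, for illustration only.
wv = _StubVectors({"曹操": [("刘备", 0.9), ("孙权", 0.8)]})
known = most_similar_or_empty(wv, "曹操", topn=1)
unknown = most_similar_or_empty(wv, "貂蝉")
print(known)    # [('刘备', 0.9)]
print(unknown)  # []
```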

3. Running the Program

cd 20260117_three_kingdoms_embedding
pip install -r requirements.txt
python three_kingdoms_embedding.py

4. Expected Output

=== Words most similar to '曹操' ===
刘备: 0.8923
孙权: 0.8756
诸葛亮: 0.8542
...

=== Analogy: 曹操 + 刘备 - 张飞 ===
孙权: 0.8321
周瑜: 0.8154
...

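VI. Word2Vec Parameter Tuning Guide

Core Parameters

vector_size: dimensionality of the word vectors, default 100. Higher dimensions give the model more expressive power but also raise the computational cost
window: maximum distance between the current word and a context word, default 5. A larger window brings more context into each training example
min_count: frequency threshold, default 5. Words that occur fewer times than this are ignored
workers: number of worker threads used for training, default 3. Setting it to the number of CPU cores can speed up training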
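
To make the effect of window concrete, the sketch below lists the (center, context) pairs that a fixed window produces for one segmented sentence. It is a simplification: gensim additionally shrinks the window at random for each target word and may subsample very frequent words, neither of which is modeled here. The function name `context_pairs` and the sample sentence are assumptions of this example.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs within `window` positions of each token."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]


sentence = ["行者", "举", "棒", "打", "妖精"]
narrow = list(context_pairs(sentence, window=1))
wide = list(context_pairs(sentence, window=3))
print(len(narrow), len(wide))  # 8 18
```

With window=1 the word 行者 is only paired with its immediate neighbour; with window=3 it is also paired with 棒 and 打, which is why larger windows tend to pull in broader, more topical associations.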

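Tuning Suggestions

The values below are rules of thumb and should be checked against your own corpus.

For small corpora (<100MB), vector_size=100-200 is a reasonable starting point
For large corpora (>1GB), try vector_size=300-500
Short texts suit a smaller window (2-3); long texts suit a larger window (5-10)
Adjust min_count to the corpus size; for small corpora it can be set to 1-2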
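
Before settling on min_count, it helps to see how much of the corpus a given threshold would keep. The following sketch counts how many distinct words survive and what fraction of all tokens they cover; a word is kept when its count is at least the threshold. The function name `vocab_stats` and the three-sentence toy corpus are assumptions made for illustration.

```python
from collections import Counter


def vocab_stats(sentences, min_count):
    """Return (kept word types, fraction of tokens kept) for a threshold."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    total_tokens = sum(counts.values())
    kept_tokens = sum(kept.values())
    coverage = kept_tokens / total_tokens if total_tokens else 0.0
    return len(kept), coverage


corpus = [
    ["行者", "道", "师父"],
    ["八戒", "道", "师父"],
    ["行者", "笑", "道"],
]
for threshold in (1, 2, 3):
    types, coverage = vocab_stats(corpus, threshold)
    print(threshold, types, round(coverage, 2))
# 1 5 1.0
# 2 3 0.78
# 3 1 0.33
```

If raising the threshold by one step removes a large share of character names or other words you care about, that is a sign the corpus is too small for that setting.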

VII. Extended Applications

1. Text Classification

Use the trained word vectors as features for a text classification task:

import jieba
import numpy as np
from gensim.models import Word2Vec

# Load the model
model = Word2Vec.load('models/word2Vec.model')

# Represent a text as the average of its in-vocabulary word vectors
def get_text_vector(text, model):
    words = jieba.cut(text)
    vector = np.zeros(model.vector_size)
    count = 0
    for word in words:
        if word in model.wv:
            vector += model.wv[word]
            count += 1
    if count > 0:
        vector /= count
    return vector
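
The function above only produces features; a classifier still has to be trained on them. As one simple possibility, the sketch below implements a nearest-centroid classifier: it averages the text vectors of each class and assigns a new text to the class whose centroid is most similar by cosine. The function names, labels, and 2-D vectors are all invented for illustration; in practice each vector would come from get_text_vector, and any standard classifier could be used instead.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_centroids(text_vectors, labels):
    """Average the text vectors of each label into one centroid per class."""
    grouped = {}
    for vector, label in zip(text_vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: mean_vector(vs) for label, vs in grouped.items()}


def predict(centroids, vector):
    """Pick the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Made-up 2-D "text vectors" and labels, for illustration only.
train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = ["battle", "battle", "travel", "travel"]
centroids = fit_centroids(train, labels)
print(predict(centroids, [0.7, 0.3]))  # battle
```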

2. Keyword Extraction

Keyword extraction based on word vectors:

def extract_keywords(text, model, topn=10):
    # Requires `import jieba`; `model` is a trained gensim Word2Vec model
    # Keep each distinct in-vocabulary word once
    candidates = [w for w in set(jieba.cut(text)) if w in model.wv]
    # Score each word by its average similarity to the other words in the text
    word_scores = {}
    for word in candidates:
        others = [w for w in candidates if w != word]
        if others:
            total = sum(model.wv.similarity(word, other) for other in others)
            word_scores[word] = total / len(others)
    # Sort and return the top-n keywords
    return sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:topn]

3. Text Summarization

Extractive summarization built on word vectors:

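import re

import networkx as nx  # extra dependency, not in the pip list above
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity  # extra dependency (scikit-learn)

def generate_summary(text, model, num_sentences=3):
    # Split the text into sentences
    sentences = re.split(r'[。！？]', text)
    sentences = [s for s in sentences if s.strip()]

    # Vector for each sentence (get_text_vector comes from the text classification example)
    sentence_vectors = []
    for sentence in sentences:
        vector = get_text_vector(sentence, model)
        sentence_vectors.append(vector)

    # Pairwise sentence similarity matrix
    similarity_matrix = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = cosine_similarity(
                    sentence_vectors[i].reshape(1, -1),
                    sentence_vectors[j].reshape(1, -1)
                )[0, 0]

    # Rank sentences with PageRank; negative similarities are clipped to 0
    # because PageRank expects non-negative edge weights
    graph = nx.from_numpy_array(np.clip(similarity_matrix, 0, None))
    scores = nx.pagerank(graph)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Join the top num_sentences sentences into the summary
    summary = '。'.join([s[1] for s in ranked_sentences[:num_sentences]]) + '。'
    return summary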
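
The ranking step treats the similarity matrix as a weighted graph and scores each sentence by how strongly the other sentences point to it. For readers curious what PageRank computes, here is a minimal power-iteration sketch in plain Python. It is a simplified illustration under stated assumptions, not the implementation of any particular library: the function name `pagerank`, the damping factor of 0.85, and the toy matrix are choices made for this example, and negative weights are treated as 0.

```python
def pagerank(matrix, damping=0.85, max_iter=100, tol=1e-8):
    """Power-iteration PageRank over a square, non-negative weight matrix."""
    n = len(matrix)
    if n == 0:
        return []
    # Negative weights are clipped to 0; row sums are the outgoing weights.
    weights = [[max(float(matrix[i][j]), 0.0) for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Nodes without outgoing weight spread their score evenly.
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0.0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0.0
            )
            new_scores.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy similarity matrix: sentence 0 is similar to both other sentences.
toy_matrix = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank(toy_matrix)
print(toy_scores.index(max(toy_scores)))  # 0
```

The scores always sum to 1, so they can be read as the share of "attention" each sentence receives from the rest of the text.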

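VIII. Common Problems and Solutions

1. Inaccurate Chinese Word Segmentation

Solutions:

Use jieba's custom dictionary feature
Extend the stopword list
Switch segmentation mode (accurate mode, full mode, search engine mode)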
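
To see why a custom dictionary matters for names in classical novels, consider the toy segmenter below. It uses forward maximum matching, which is much simpler than jieba's actual algorithm and is shown only to illustrate the effect of adding a word to the dictionary. The function name, the tiny dictionary, and the sample sentence are assumptions of this example. With jieba itself, custom words are added through `jieba.add_word` or `jieba.load_userdict`.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Single characters are always accepted as a fallback.
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


base_dict = {"行者", "大王"}
before = forward_max_match("孙行者见大王", base_dict)
after = forward_max_match("孙行者见大王", base_dict | {"孙行者"})
print(before)  # ['孙', '行者', '见', '大王']
print(after)   # ['孙行者', '见', '大王']
```

Without the dictionary entry the name is split into two tokens, so the model never learns a vector for 孙行者 as a whole.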

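2. Slow Model Training

Solutions:

Raise workers (up to the number of CPU cores)
Raise min_count to filter out low-frequency words
Lower vector_size

3. Unreasonable Similarity Results

Solutions:

Enlarge the corpus
Adjust the window parameter
Train for more epochs (the epochs parameter)
Filter out noisy data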

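IX. Summary and Outlook

This project applies the Word2Vec algorithm to word vector analysis of classical Chinese novels and shows how natural language processing can be used on Chinese text. After working through it, readers should be comfortable with:

Preprocessing Chinese text
The principles and use of Word2Vec models
Basic word vector operations (similarity calculation, analogy reasoning)
Tuning model parameters

Directions for further work:

Use pretrained models such as BERT for deeper text analysis
Add visualization to show how the word vectors are distributed in space
Combine with knowledge graph techniques to build richer semantic representations
Apply the approach to more types of Chinese text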
