Python Machine Learning in Practice: Key Points on Sentiment Analysis with Machine Learning

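1. Chapter Overview

This chapter focuses on sentiment analysis, a natural language processing (NLP) task. Using the IMDb movie review dataset as a hands-on case study, it walks through the full workflow from text preprocessing to model training:

| Module | Core techniques | Purpose |
| --- | --- | --- |
| Text preprocessing | Regex cleaning, tokenization, stemming | Data cleaning and normalization |
| Feature engineering | Bag-of-words, TF-IDF, n-grams | Vector representation of text |
| Model training | Logistic regression, GridSearchCV tuning | Binary sentiment prediction |
| Large-scale data | HashingVectorizer + SGD | Out-of-core learning |
| Unsupervised learning | LDA topic modeling | Discovering topics in documents |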

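2. Dataset Preparation: IMDb Movie Reviews

2.1 About the dataset

The IMDb movie review dataset was collected by Andrew Maas and colleagues. It contains:

Size: 50,000 movie reviews
Labels: positive (rated more than 6 stars, label=1) / negative (rated fewer than 5 stars, label=0)
Split: 25,000 training reviews + 25,000 test reviews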

┌─────────────────────────────────────────┐
│           IMDb Dataset Structure        │
├─────────────────────────────────────────┤
│  aclImdb/                               │
│  ├── train/                             │
│  │   ├── pos/          # 12500 positive │
│  │   └── neg/          # 12500 negative │
│  └── test/                              │
│      ├── pos/          # 12500 positive │
│      └── neg/          # 12500 negative │
└─────────────────────────────────────────┘

2.2 Loading and preparing the data

import os
import sys

import numpy as np
import pandas as pd
import pyprind

# Dataset location and the label assigned to each subdirectory
basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}

# Progress bar over all 50,000 review files
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
rows = []

# Read every review file. Collecting rows in a list and building the
# DataFrame once avoids the cost of calling pd.concat for each file.
for split in ('test', 'train'):
    for label_dir in ('pos', 'neg'):
        path = os.path.join(basepath, split, label_dir)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[label_dir]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

# Shuffle the rows and save everything to a single CSV file
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

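3. Text Feature Engineering: From Bag-of-Words to TF-IDF

3.1 The bag-of-words model

Core idea: ignore word order and represent each document as a vector of term counts.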

┌─────────────────────────────────────────────────────────┐
│              Bag-of-Words workflow                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Raw text                   Vocabulary                  │
│  ┌─────────────┐            ┌─────────────────┐         │
│  │ The sun is  │            │ and: 0          │         │
│  │ shining     │    ───►    │ is: 1           │         │
│  │             │   Tokenize │ one: 2          │         │
│  │ The weather │            │ shining: 3      │         │
│  │ is sweet    │            │ sun: 4          │         │
│  │             │            │ sweet: 5        │         │
│  │ The sun is  │            │ the: 6          │         │
│  │ shining...  │            │ two: 7          │         │
│  └─────────────┘            │ weather: 8      │         │
│                             └─────────────────┘         │
│                                      │                  │
│                                      ▼                  │
│                             ┌─────────────────┐         │
│                             │ Doc 3 counts    │         │
│                             │ [2, 3, 2, 1, 1, │         │
│                             │  1, 2, 1, 1]    │         │
│                             └─────────────────┘         │
│                                                         │
└─────────────────────────────────────────────────────────┘

Implementation:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet', 
    'The sun is shining, the weather is sweet, and one and one is two'
])

# Learn the vocabulary and build the sparse count matrix
bag = count.fit_transform(docs)

# Inspect the word-to-column mapping
print(count.vocabulary_)
# {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 
#  'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

# Inspect the count vectors
print(bag.toarray())
# [[0 1 0 1 1 0 1 0 0]   <- sentence 1
#  [0 1 0 0 0 1 1 0 1]   <- sentence 2
#  [2 3 2 1 1 1 2 1 1]]  <- sentence 3
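
To make explicit what the count matrix above contains, here is a minimal pure-Python sketch of the same idea. The helper name `bag_of_words` is hypothetical, and the tokenization (lowercasing plus runs of two or more word characters) is only meant to approximate `CountVectorizer`'s defaults; the real class returns a sparse matrix and offers many more options.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def bag_of_words(documents):
    """Return (vocabulary, count_rows) for a list of strings."""
    # Lowercase, then keep runs of two or more word characters
    tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in documents]
    # Assign column indices in alphabetical order
    words = sorted({w for tokens in tokenized for w in tokens})
    vocab = {w: i for i, w in enumerate(words)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word, in column order
        rows.append([counts.get(w, 0) for w in words])
    return vocab, rows

vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

The third row comes out as `[2, 3, 2, 1, 1, 1, 2, 1, 1]`: "is" appears three times, "and", "one" and "the" twice each, and every other word once.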

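3.2 The n-gram model

N-grams extend the bag-of-words model with local word-order information: each token is a run of n consecutive words.

| N-gram type | Example | Typical use |
| --- | --- | --- |
| Unigram (1-gram) | "the", "sun", "is" | Baseline features |
| Bigram (2-gram) | "the sun", "sun is" | Capturing local collocations |
| Trigram (3-gram) | "the sun is", "sun is shining" | Phrase recognition |

# Use 2-grams only
bigram = CountVectorizer(ngram_range=(2, 2))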
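
The table can be reproduced with a few lines of plain Python. The function name `ngrams` is an illustrative assumption, not part of scikit-learn; it simply slides a window of length n over the token list.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
print(ngrams(tokens, 1))  # ['the', 'sun', 'is', 'shining']
print(ngrams(tokens, 2))  # ['the sun', 'sun is', 'is shining']
print(ngrams(tokens, 3))  # ['the sun is', 'sun is shining']
```

In `CountVectorizer`, `ngram_range=(1, 2)` would keep unigrams and bigrams together, at the cost of a much larger vocabulary.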

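3.3 TF-IDF weighting (the key refinement)

Problem: very frequent words such as "the" and "is" carry little information, so their weight should be reduced.

The TF-IDF formula:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

where:

$\text{tf}(t,d)$: the term frequency, i.e. the number of times term $t$ occurs in document $d$
$\text{idf}(t,d) = \log\frac{1+n_d}{1+\text{df}(d,t)}$: the inverse document frequency, where $n_d$ is the total number of documents and $\text{df}(d,t)$ is the number of documents that contain $t$

With `smooth_idf=True`, scikit-learn adds 1 to this idf before multiplying, so it actually computes $\text{tf}(t,d) \times (\text{idf}(t,d) + 1)$, and with `norm='l2'` it then scales each document vector to unit length.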

┌─────────────────────────────────────────────────────────┐
│              TF-IDF computation                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Raw term counts           After TF-IDF transform       │
│  ┌─────────┐               ┌────────────────┐           │
│  │ 0  1  0 │               │ 0    0.43 0    │           │
│  │ 0  1  0 │    ───►       │ 0    0.43 0    │           │
│  │ 2  3  2 │               │ 0.50 0.45 0.50 │           │
│  └─────────┘               └────────────────┘           │
│  Columns shown: "and", "is", "one" (first 3 of 9)       │
│                                                         │
│  Observations:                                          │
│  • "is" is in every document, so it is down-weighted:   │
│    count 3 in doc 3, but weight only 0.45               │
│  • "and" is only in doc 3, so its weight is higher:     │
│    count 2, weight 0.50                                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

Implementation:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)

# Convert raw counts to TF-IDF weights (reuses `count` and `docs` from 3.1)
tfidf_matrix = tfidf.fit_transform(count.fit_transform(docs))
print(tfidf_matrix.toarray())
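
As a worked check of the weights, the sketch below recomputes them by hand for the three example sentences, assuming the smoothed variant that scikit-learn applies with `smooth_idf=True` (idf plus 1, then L2 normalization). The variable and function names are illustrative only.

```python
import math

# Term counts for the three example sentences; columns follow the
# alphabetical vocabulary: and, is, one, shining, sun, sweet, the, two, weather
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf plus 1
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def tfidf_row(row):
    """Weight one row of counts and scale it to unit L2 length."""
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

weights = [tfidf_row(row) for row in counts]
print([round(v, 2) for v in weights[2]])
# [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]
```

In the third sentence "is" has the highest raw count (3) but an idf term of exactly 1 because it occurs in all three documents, so after normalization it ends up below "and" and "one".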

4. The Full Text Preprocessing Workflow

4.1 Text cleaning pipeline

┌─────────────────────────────────────────────────────────┐
│              Text preprocessing workflow                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Raw HTML text                                          │
│  "<br />This :) is :( a test :-)!"                      │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐     Strip HTML tags                    │
│  │ Regex clean │ ──► re.sub(r'<[^>]*>', '', text)       │
│  └─────────────┘                                        │
│         │                                               │
│         ▼                                               │
│  "This :) is :( a test :-)!"                            │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐     Extract emoticons                  │
│  │ Keep        │ ──► re.findall: (?::|;|=)(?:-)?        │
│  │ emoticons   │     (?:\)|\(|D|P)                      │
│  └─────────────┘                                        │
│         │                                               │
│         ▼                                               │
│  emoticons = [':)', ':(', ':-)']                        │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐     Lowercase, drop non-word chars     │
│  │ Normalize   │ ──► re.sub(r'[\W]+', ' ', text.lower())│
│  └─────────────┘                                        │
│         │                                               │
│         ▼                                               │
│  "this is a test " + " ".join(emoticons)                │
│      .replace("-", "")   # noses removed: ":-)" → ":)"  │
│         │                                               │
│         ▼                                               │
│  Final output: "this is a test :) :( :)"                │
│                                                         │
└─────────────────────────────────────────────────────────┘

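The complete cleaning function:

import re

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons with their "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) + 
            ' '.join(emoticons).replace('-', ''))
    return text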
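
A short usage sketch of the cleaning step follows. It repeats `preprocessor` so that it runs on its own, and adds a whitespace `tokenizer`; that tokenizer is an assumption for illustration, since this summary refers to a `tokenizer` function later without showing its definition for this section.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

def tokenizer(text):
    # Assumed baseline tokenizer: split on whitespace only
    return text.split()

cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(repr(cleaned))       # 'this is a test :) :( :)'
print(tokenizer(cleaned))  # ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

Appending the emoticons at the end changes their position in the text, which does not matter for a bag-of-words representation because word order is discarded anyway.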

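4.3 Stop-word removal

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
stop = stopwords.words('english')

# tokenizer_porter is used below but not defined in this summary; this
# definition assumes a whitespace split followed by Porter stemming
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Filter out stop words after stemming
[w for w in tokenizer_porter('a runner likes running and runs a lot') 
 if w not in stop]
# Output: ['runner', 'like', 'run', 'run', 'lot']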

5. Model Training: Text Classification with Logistic Regression

5.1 The full training pipeline

┌─────────────────────────────────────────────────────────┐
│          Text classification training workflow          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. Data preparation                                    │
│     ┌──────────────┐                                    │
│     │ 25,000 train │                                    │
│     │ 25,000 test  │                                    │
│     └──────┬───────┘                                    │
│            │                                            │
│            ▼                                            │
│  2. Build the Pipeline                                  │
│     ┌─────────────────────────────────────────┐         │
│     │  TfidfVectorizer  →  LogisticRegression │         │
│     │  (vectorizer)        (classifier)       │         │
│     └─────────────────────────────────────────┘         │
│            │                                            │
│            ▼                                            │
│  3. Hyperparameter search (GridSearchCV)                │
│     ┌─────────────────────────────────────────┐         │
│     │ Parameter grid:                         │         │
│     │ • vect__ngram_range: [(1,1)]            │         │
│     │ • vect__stop_words: [None, stop]        │         │
│     │ • vect__tokenizer: [tokenizer,          │         │
│     │                     tokenizer_porter]   │         │
│     │ • clf__C: [1.0, 10.0]                   │         │
│     │ • clf__penalty: ['l2']                  │         │
│     │ • grid 2 only: use_idf=False, norm=None │         │
│     └─────────────────────────────────────────┘         │
│            │                                            │
│            ▼                                            │
│  4. Training with 5-fold cross-validation               │
│     ┌─────────────┐                                     │
│     │ Best params │ C=10.0, tokenizer=tokenizer         │
│     │ Best score  │ CV Accuracy: 0.897                  │
│     └──────┬──────┘                                     │
│            │                                            │
│            ▼                                            │
│  5. Evaluation                                          │
│     ┌──────────────────────────────┐                    │
│     │ Test Accuracy: 0.899 (89.9%) │                    │
│     └──────────────────────────────┘                    │
│                                                         │
└─────────────────────────────────────────────────────────┘

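Complete code:

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Reload the shuffled reviews and (assumed step) clean them once with
# preprocessor from 4.1, since the vectorizer below does no cleaning itself
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df['review'] = df['review'].apply(preprocessor)

# Split by position. The label-based df.loc[:25000] is inclusive, so it
# would place row 25000 in both the training and the test set.
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]

def tokenizer(text):
    # Assumed baseline (not defined in this summary): whitespace split
    return text.split()

# Vectorizer with lowercasing and preprocessing switched off
tfidf = TfidfVectorizer(strip_accents=None, 
                        lowercase=False, 
                        preprocessor=None)

# tokenizer_porter and stop come from section 4.3
param_grid = [
    {   # Grid 1: TF-IDF features, with and without stemming
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokenizer, tokenizer_porter],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {   # Grid 2: raw term counts (idf and normalization disabled)
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [False],
        'vect__norm': [None],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    }
]

lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])

# Grid search with 5-fold cross-validation on all CPU cores
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy', cv=5, 
                           verbose=2, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

# Report the best configuration
print(f'Best parameters: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

# Evaluate the refitted best model on the test set
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')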
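
The keys in `param_grid` follow scikit-learn's `<step name>__<parameter name>` convention for pipelines. The small sketch below, which needs no training data, shows where those names come from; the step names `vect` and `clf` match the pipeline above, and the default values printed are those of a freshly constructed estimator.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear')),
])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
params = pipe.get_params()
print(params['vect__ngram_range'])  # (1, 1)
print(params['clf__C'])             # 1.0

# set_params accepts the same keys, which is how a grid search can
# configure each candidate before fitting it
pipe.set_params(clf__C=10.0, vect__use_idf=False)
print(pipe.named_steps['clf'].C)         # 10.0
print(pipe.named_steps['vect'].use_idf)  # False
```

A misspelled key such as `clf_C` (single underscore) is rejected by `set_params`, which is a quick way to catch typos in a parameter grid before starting a long search.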

6. Out-of-Core Learning

6.1 Background

When a dataset is too large to load into memory all at once, out-of-core learning processes it in small batches instead:

┌─────────────────────────────────────────────────────────┐
│           Out-of-core vs. in-memory learning            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  In-memory approach (fails when RAM runs out)           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │ Load all    │───►│ Extract     │───►│ Train       │  │
│  │ 50K docs    │    │ 50K × |V|   │    │ out of      │  │
│  └─────────────┘    │ matrix      │    │ memory ❌   │  │
│                     └─────────────┘    └─────────────┘  │
│                                                         │
│  Out-of-core learning (streaming)                       │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │ Stream      │───►│ Vectorize   │───►│ Incremental │  │
│  │ 1000/batch  │    │ 1000 × N    │    │ partial_fit │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│         ↑                                     │         │
│         └─────────────────────────────────────┘         │
│              Loop 45 times, learning incrementally      │
│                                                         │
└─────────────────────────────────────────────────────────┘

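6.2 Core components

| Component | Role | Key property |
| --- | --- | --- |
| `HashingVectorizer` | Text vectorization | Stateless: stores no vocabulary and has a fixed feature dimension |
| `SGDClassifier` | Incrementally trained classifier | `partial_fit()` supports online learning |
| Generator functions | Controlling the data stream | Read the CSV piece by piece to keep memory use low |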
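
To illustrate why a hashing vectorizer needs no vocabulary, here is a minimal sketch of the hashing trick in plain Python. It is not scikit-learn's implementation: `hashed_counts` is a hypothetical helper, and `hashlib.md5` stands in for the MurmurHash3 function that `HashingVectorizer` uses; the sign trick and normalization that scikit-learn applies by default are left out.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # md5 is a stand-in here for a fast hash such as MurmurHash3
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        index = int.from_bytes(digest[:8], 'big') % n_features
        vec[index] += 1
    return vec

a = hashed_counts('the sun is shining'.split())
b = hashed_counts('the sun is shining'.split())
print(a == b)   # True: no fitted state, same input gives same vector
print(len(a))   # 16, regardless of how many distinct words were seen
print(sum(a))   # 4, one increment per token
```

The trade-off is that different words can collide in the same column and the mapping cannot be inverted to recover words; a large `n_features` such as `2**21` keeps collisions rare.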

6.3 Complete implementation

import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import pyprind

stop = stopwords.words('english')

def tokenizer(text):
    """Clean the text, split it into tokens, and drop stop words."""
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + 
            ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    """Generator that yields one (text, label) pair per CSV line."""
    with open(path, 'r', encoding='utf-8') as csv_file:
        next(csv_file)  # skip the header row
        for line in csv_file:
            # The line ends with ",<label>\n": the label is the
            # second-to-last character, the text is everything before ","
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    """Return `size` documents and labels, or (None, None) at the end."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Set up the vectorizer and the classifier
vect = HashingVectorizer(
    decode_error='ignore',
    n_features=2**21,        # 2,097,152 feature columns
    preprocessor=None,
    tokenizer=tokenizer
)

clf = SGDClassifier(
    loss='log_loss',         # logistic regression loss
    random_state=1
)

# Training loop
doc_stream = stream_docs(path='movie_data.csv')
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

# 45 iterations of 1,000 documents each (45,000 in total)
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

# Evaluate on the remaining 5,000 documents
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')
# Output: Accuracy: 0.868

# Finally, update the model with the test documents as well
clf.partial_fit(X_test, y_test)
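
The streaming helpers above rely on two simplifications: the label is read by slicing fixed character positions, and a batch that hits the end of the file is discarded. As an alternative sketch, not the approach used above, the standard `csv` module and `itertools.islice` handle quoted commas and a short final batch. The names `stream_rows` and `minibatches` and the sample rows are made up for illustration.

```python
import csv
import io
from itertools import islice

def stream_rows(handle):
    """Yield (text, label) pairs from an open CSV with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header
    for text, label in reader:
        yield text, int(label)

def minibatches(rows, size):
    """Yield lists of up to `size` rows, including a final short batch."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch

# In-memory stand-in for movie_data.csv (made-up rows)
sample = io.StringIO(
    'review,sentiment\n'
    '"Great, really great",1\n'
    '"Dull and slow",0\n'
    '"Loved it",1\n'
)

for batch in minibatches(stream_rows(sample), size=2):
    texts, labels = zip(*batch)
    print(labels)
```

With three rows and `size=2`, this prints `(1, 0)` followed by `(1,)`. Each batch could then be passed through `vect.transform` and `clf.partial_fit` exactly as in the loop above.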

7. LDA Topic Modeling (Unsupervised Learning)

7.1 Core concepts

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that discovers latent topics in a collection of documents.

┌─────────────────────────────────────────────────────────┐
│              How LDA topic modeling works               │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Input: bag-of-words matrix (documents × vocabulary)    │
│  ┌─────────────────────────────────────────┐            │
│  │ Doc   │ word1 │ word2 │ word3 │ ...     │            │
│  │ doc1  │ 2     │ 0     │ 1     │         │            │
│  │ doc2  │ 0     │ 3     │ 2     │         │            │
│  │ doc3  │ 1     │ 1     │ 0     │         │            │
│  └─────────────────────────────────────────┘            │
│            │                                            │
│            ▼  matrix decomposition                      │
│  ┌─────────────────┐     ┌─────────────────┐            │
│  │ Document-topic  │  ×  │ Topic-word      │            │
│  │ matrix (N×K)    │     │ matrix (K×M)    │            │
│  └─────────────────┘     └─────────────────┘            │
│            │                                            │
│            ▼                                            │
│  Output: a topic distribution per document and          │
│          a word distribution per topic                  │
│                                                         │
│  Hyperparameter: n_components = K (number of topics,    │
│  must be chosen in advance)                             │
│                                                         │
└─────────────────────────────────────────────────────────┘

7.2 Code walkthrough

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the data
df = pd.read_csv('movie_data.csv', encoding='utf-8')

# Bag-of-words model with limits on common words and vocabulary size
count = CountVectorizer(
    stop_words='english',
    max_df=0.1,        # ignore words that occur in more than 10% of documents
    max_features=5000  # keep only the 5,000 most frequent words
)
X = count.fit_transform(df['review'].values)

# LDA model
lda = LatentDirichletAllocation(
    n_components=10,        # 10 topics
    random_state=123,
    learning_method='batch' # slower than 'online', but can be more accurate
)
X_topics = lda.fit_transform(X)

# Print the 5 highest-weighted words of each topic
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx + 1}:')
    print(' '.join([
        feature_names[i] 
        for i in topic.argsort()[:-n_top_words - 1:-1]
    ]))
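
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the loop above is compact but easy to misread. The toy example below, with a made-up six-word vocabulary and made-up weights, shows what it selects.

```python
import numpy as np

# Made-up weights of one topic over a six-word vocabulary
feature_names = np.array(['plot', 'war', 'actor', 'horror', 'family', 'music'])
topic = np.array([0.2, 3.1, 0.7, 5.4, 1.9, 0.1])

n_top_words = 3

# argsort returns indices in ascending order of weight; the slice
# [:-n - 1:-1] walks backwards from the end and takes the last n
top = topic.argsort()[:-n_top_words - 1:-1]
print(top)                 # [3 1 4]
print(feature_names[top])  # ['horror' 'war' 'family']
```

An equivalent and perhaps more readable form is `topic.argsort()[::-1][:n_top_words]`, which reverses the full ordering first and then takes the first n indices.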

7.3 Example topics

| Topic | Top words | Interpretation |
| --- | --- | --- |
| Topic 1 | worst, minutes, awful, script, stupid | Bad movies in general |
| Topic 2 | family, mother, father, children, girl | Movies about families |
| Topic 3 | american, war, dvd, music, tv | War movies |
| Topic 4 | human, audience, cinema, art, sense | Art movies |
| Topic 5 | police, guy, car, dead, murder | Crime movies |
| Topic 6 | horror, house, sex, girl, woman | Horror movies |
| Topic 7 | role, performance, comedy, actor, performances | Comedies |
| Topic 8 | series, episode, war, episodes, tv | TV-related movies |
| Topic 9 | book, version, original, read, novel | Book adaptations |
| Topic 10 | action, fight, guy, guys, cool | Action movies |

8. Chapter Knowledge Map

┌─────────────────────────────────────────────────────────┐
│        Chapter 8: sentiment analysis at a glance        │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐                                        │
│  │ Data        │  IMDb 50K movie reviews                │
│  │ layer       │  clean → tokenize → stem               │
│  └──────┬──────┘                                        │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐                                        │
│  │ Feature     │  CountVectorizer (bag-of-words)        │
│  │ layer       │  TfidfVectorizer (TF-IDF)              │
│  │             │  HashingVectorizer (out-of-core)       │
│  └──────┬──────┘                                        │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐        ┌─────────────┐                 │
│  │ Model       │◄──────►│ Unsupervised│                 │
│  │ layer       │  LDA   │ topic model │                 │
│  │             │        └─────────────┘                 │
│  │ • LogisticRegression (supervised classification)     │
│  │ • SGDClassifier (online learning)                    │
│  └──────┬──────┘                                        │
│         │                                               │
│         ▼                                               │
│  ┌─────────────┐                                        │
│  │ Evaluation  │  Accuracy                              │
│  │ layer       │  Cross-validation, grid search         │
│  │             │  Held-out test set                     │
│  └─────────────┘                                        │
│                                                         │
└─────────────────────────────────────────────────────────┘

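9. Key Takeaways

9.1 Technique checklist

| Technique | Use case | Code pointer |
| --- | --- | --- |
| Regex cleaning | Removing HTML tags | `re.sub(r'<[^>]*>', '', text)` |
| TF-IDF weighting | Down-weighting very frequent words | `TfidfVectorizer(use_idf=True)` |
| Porter stemming | Normalizing word forms | `PorterStemmer().stem(word)` |
| GridSearchCV | Hyperparameter tuning | `cv=5`, `n_jobs=-1` |
| HashingVectorizer | Large-scale data | `n_features=2**21` |
| partial_fit | Online learning | Updates model parameters incrementally |
| LDA | Topic discovery | `n_components=10` |

9.2 Performance comparison

| Method | Memory use | Training time (approx.) | Accuracy | Best suited for |
| --- | --- | --- | --- | --- |
| Standard TF-IDF + LR | High | 5-10 min | 89.9% | Small to medium datasets |
| Hashing + SGD | Very low | <1 min | 86.8% | Large-scale / streaming data |
| LDA topic modeling | Medium | 5+ min | - | Unsupervised exploration |

The two accuracy figures are measured on different test sets: 25,000 reviews for TF-IDF + LR, and the last 5,000 reviews of the shuffled file for Hashing + SGD.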

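These notes are based on my study and understanding of the Chinese edition of Sebastian Raschka's Machine Learning with PyTorch and Scikit-Learn (《Python机器学习：基于PyTorch和Scikit-Learn》, 智能系统与技术丛书). They are for study purposes only; please do not use them commercially.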
