Chinese Spam SMS Classification: Experiment Report
1. Experiment Overview
1.1 Background
With the rapid development of mobile communication technology, spam SMS has become an increasingly serious problem. This project uses deep learning to build an effective Chinese spam SMS classification system that automatically identifies and filters spam messages, improving the user experience.
1.2 Objectives
- Automatically classify Chinese SMS messages (normal vs. spam)
- Achieve high classification accuracy and recall
- Build a complete text classification pipeline
- Provide interpretable prediction results
1.3 Technology Stack
- Deep learning framework: PaddlePaddle 2.4
- Text processing: jieba word segmentation
- Data preprocessing: pandas, numpy
- Visualization: matplotlib, seaborn
- Evaluation metrics: accuracy, precision, recall, F1 score
- Dataset
  - https://wwvt.lanzoum.com/b0ult16qj password: fuob
  - Download stopword.txt and my_chinese_sms.txt to run the code in this experiment
  - The other two files contain 1,000,000 messages and can be used for further improvements
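Before running the code in the following sections, a quick pre-flight check along these lines can confirm that the packages listed above are installed and the two data files are in place. This is a sketch for convenience, not part of the original pipeline:
# Illustrative pre-flight check: verify key dependencies and the two data files
import importlib.util
import os

for pkg in ['paddle', 'jieba', 'pandas', 'numpy', 'matplotlib', 'seaborn', 'sklearn']:
    assert importlib.util.find_spec(pkg) is not None, f'missing dependency: {pkg}'
for path in ['./stopword.txt', './my_chinese_sms.txt']:
    assert os.path.exists(path), f'missing data file: {path}'
print('environment and data files look OK')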
2. Data Processing Module
2.1 Data Loading and Preprocessing
#!/usr/bin/env python
# coding: utf-8
# ===================== Data processing module =====================
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset, DataLoader
import pandas as pd
import numpy as np
import re
import jieba
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight
import os
from collections import Counter

print("🚀 Starting the Chinese spam SMS classification experiment...")

# ===================== 1. Data preparation =====================
# Load the stop-word list (the file is GBK-encoded)
stopwords_path = './stopword.txt'
stop_words = set()
with open(stopwords_path, 'r', encoding='gbk') as f:
    for line in f:
        stop_word = line.strip()
        if stop_word:
            stop_words.add(stop_word)
print(f"✅ Loaded {len(stop_words)} stop words")

# Load the SMS data
def load_sms_data(file_path):
    labels = []
    texts = []
    encodings = ['utf-8', 'gbk', 'gb2312']
    lines = []
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                lines = f.readlines()
            print(f"✅ File read successfully with {encoding} encoding")
            break
        except UnicodeDecodeError:
            continue
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        # Strip an optional leading '|', then take the first character as the label
        if line.startswith('|'):
            line = line[1:]
        label = line[0]
        text = line[1:].strip()
        if label not in ['0', '1']:
            continue
        labels.append(int(label))
        texts.append(text)
    df = pd.DataFrame({'label': labels, 'text': texts})
    df = df[df['text'].notna() & (df['text'] != '')]
    return df

df = load_sms_data('./my_chinese_sms.txt')
print(f"✅ Loaded {len(df)} SMS messages")

# Class distribution analysis
normal_count = len(df[df['label'] == 0])
spam_count = len(df[df['label'] == 1])
print(f"📊 Class distribution: {normal_count} normal messages, {spam_count} spam messages")

# Visualize the class distribution
plt.figure(figsize=(8, 6))
labels = ['Normal SMS', 'Spam SMS']
sizes = [normal_count, spam_count]
colors = ['#66b3ff', '#ff9999']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.title('SMS Class Distribution')
plt.axis('equal')
plt.savefig('./data_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
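For reference, load_sms_data above expects one message per line, with a 0/1 label as the very first character (an optional leading '|' is stripped) and the message text following it. Two made-up lines in that format, not taken from the real dataset, would look like "0您好,您的包裹已到前台,请及时领取" (normal) and "1恭喜您中奖,点击链接立即领取" (spam).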
2.2 Text Preprocessing Pipeline
Text preprocessing is the foundation of the whole pipeline and consists of the following steps:
- Text cleaning: remove special characters, keeping Chinese characters, digits, and letters
- Chinese word segmentation: jieba in precise mode
- Stop-word filtering: remove uninformative stop words
- Sequence encoding: convert the tokens into integer index sequences
# ===================== 2. Text preprocessing =====================
def simple_preprocess(text):
    """Simplified preprocessing pipeline"""
    # 1. Text cleaning: keep Chinese characters, digits, and letters
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # 2. Chinese word segmentation
    words = jieba.lcut(text, cut_all=False)
    # 3. Stop-word filtering
    valid_words = [word for word in words if word not in stop_words and len(word) > 1]
    return valid_words

print("🔄 Starting text preprocessing...")
df['words'] = df['text'].apply(simple_preprocess)
df['word_count'] = df['words'].apply(len)

# Build the vocabulary
def build_vocab(texts, min_freq=2):
    """Build the vocabulary"""
    word_counter = Counter()
    for words in texts:
        word_counter.update(words)
    # Filter out low-frequency words
    vocab = [word for word, count in word_counter.items() if count >= min_freq]
    # Map words to indices
    word2idx = {'<PAD>': 0, '<UNK>': 1}
    for idx, word in enumerate(vocab, start=2):
        word2idx[word] = idx
    idx2word = {idx: word for word, idx in word2idx.items()}
    print(f"✅ Built vocabulary: {len(word2idx)} words")
    print(f"📊 Most frequent words: {list(word_counter.most_common(10))}")
    return word2idx, idx2word

word2idx, idx2word = build_vocab(df['words'])

def words_to_sequence(words, word2idx, max_len=50):
    """Convert a list of words into an index sequence"""
    word_indices = []
    for word in words[:max_len]:
        word_indices.append(word2idx.get(word, word2idx['<UNK>']))
    # Pad or truncate to max_len
    if len(word_indices) < max_len:
        word_indices.extend([word2idx['<PAD>']] * (max_len - len(word_indices)))
    else:
        word_indices = word_indices[:max_len]
    return word_indices

# Analyze the distribution of token-sequence lengths
word_seq_lengths = df['word_count']
max_word_len = min(int(np.percentile(word_seq_lengths, 95)), 50)
print(f"📊 Token-sequence length analysis: 95th percentile = {max_word_len}")

# Visualize the length distribution
plt.figure(figsize=(10, 6))
plt.hist(word_seq_lengths, bins=50, alpha=0.7, color='skyblue')
plt.axvline(max_word_len, color='red', linestyle='--', label=f'95th percentile: {max_word_len}')
plt.xlabel('Text length (number of tokens)')
plt.ylabel('Frequency')
plt.title('Text Length Distribution')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig('./text_length_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

# Convert texts into token index sequences
df['word_sequence'] = df['words'].apply(lambda x: words_to_sequence(x, word2idx, max_word_len))
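As a quick illustration of the two steps above (not part of the original script), a single spam-like sentence can be pushed through simple_preprocess and words_to_sequence; the tokens shown in the comment are indicative only:
# Illustrative example: preprocess one message and encode it as an index sequence
demo_text = "恭喜您获得100元话费奖励!请点击链接领取"
demo_words = simple_preprocess(demo_text)  # e.g. ['恭喜', '获得', '100', '话费', '奖励', '点击', '链接', '领取']
demo_seq = words_to_sequence(demo_words, word2idx, max_word_len)
print(demo_words)
print(demo_seq[:10])  # first 10 token indices; short messages are padded with 0 (<PAD>)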
2.3 Dataset Splitting
# ===================== 3. Dataset construction =====================
class SimpleSmsDataset(Dataset):
    """SMS dataset"""
    def __init__(self, df, max_word_len):
        self.labels = df['label'].values
        self.word_sequences = df['word_sequence'].values
        self.max_word_len = max_word_len

    def __getitem__(self, idx):
        word_seq = self.word_sequences[idx]
        label = self.labels[idx]
        return (paddle.to_tensor(word_seq, dtype=paddle.int64),
                paddle.to_tensor(label, dtype=paddle.int64))

    def __len__(self):
        return len(self.labels)

# Split the data (stratified, preserving the original class distribution)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2024, stratify=df['label'])
train_df, val_df = train_test_split(train_df, test_size=0.125, random_state=2024, stratify=train_df['label'])
print("📊 Dataset split:")
print(f"Train: {len(train_df)} (normal {len(train_df[train_df['label']==0])}, spam {len(train_df[train_df['label']==1])})")
print(f"Validation: {len(val_df)} (normal {len(val_df[val_df['label']==0])}, spam {len(val_df[val_df['label']==1])})")
print(f"Test: {len(test_df)} (normal {len(test_df[test_df['label']==0])}, spam {len(test_df[test_df['label']==1])})")

# Create DataLoaders
batch_size = 32
train_dataset = SimpleSmsDataset(train_df, max_word_len)
val_dataset = SimpleSmsDataset(val_df, max_word_len)
test_dataset = SimpleSmsDataset(test_df, max_word_len)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print("✅ Data processing module finished")
3. Model Training and Prediction Module
3.1 Model Architecture
This experiment uses a multi-scale convolutional neural network (multi-scale CNN) for text classification. Convolution kernels of different widths capture n-gram patterns of different lengths, which makes the model effective on variable-length text sequences.
# ===================== Model training and prediction module =====================
# ===================== 4. Model design =====================
class SimpleCNNModel(nn.Layer):
    """Multi-scale CNN text classifier"""
    def __init__(self, vocab_size, embed_dim=128, num_filters=128, dropout=0.3):
        super().__init__()
        # Embedding layer: map discrete word indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Multi-scale convolutions: capture text patterns of different lengths
        self.conv1 = nn.Conv1D(embed_dim, num_filters, kernel_size=2, padding='same')  # 2-gram features
        self.conv2 = nn.Conv1D(embed_dim, num_filters, kernel_size=3, padding='same')  # 3-gram features
        self.conv3 = nn.Conv1D(embed_dim, num_filters, kernel_size=4, padding='same')  # 4-gram features
        # Batch normalization: faster convergence, more stable training
        self.bn1 = nn.BatchNorm1D(num_filters)
        self.bn2 = nn.BatchNorm1D(num_filters)
        self.bn3 = nn.BatchNorm1D(num_filters)
        # Global pooling: turn variable-length sequences into fixed-length representations
        self.global_pool = nn.AdaptiveMaxPool1D(1)
        # Classifier: map the fused features to class logits
        self.classifier = nn.Sequential(
            nn.Linear(num_filters * 3, 128),  # fuse multi-scale features
            nn.ReLU(),
            nn.Dropout(dropout),  # regularization against overfitting
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 2)  # binary classification output
        )

    def forward(self, x):
        # Embedding: [batch, seq_len] -> [batch, seq_len, embed_dim]
        embedded = self.embedding(x)
        embedded = embedded.transpose([0, 2, 1])  # [batch, embed_dim, seq_len]
        # Multi-scale convolution + activation
        conv1 = F.relu(self.bn1(self.conv1(embedded)))  # [batch, num_filters, seq_len]
        conv2 = F.relu(self.bn2(self.conv2(embedded)))
        conv3 = F.relu(self.bn3(self.conv3(embedded)))
        # Global max pooling: [batch, num_filters, seq_len] -> [batch, num_filters]
        pool1 = self.global_pool(conv1).squeeze(-1)
        pool2 = self.global_pool(conv2).squeeze(-1)
        pool3 = self.global_pool(conv3).squeeze(-1)
        # Concatenate the multi-scale features
        combined = paddle.concat([pool1, pool2, pool3], axis=1)  # [batch, num_filters*3]
        # Classification
        logits = self.classifier(combined)  # [batch, 2]
        return logits

# Model visualization (schematic plots, not actual model internals)
def visualize_model_architecture():
    """Visualize the model architecture with schematic plots"""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    # Input representation
    axes[0, 0].imshow(np.random.rand(10, 128), cmap='viridis', aspect='auto')
    axes[0, 0].set_title('Word embeddings\n(sequence length × embedding dim)')
    axes[0, 0].set_xlabel('Embedding dimension')
    axes[0, 0].set_ylabel('Sequence position')
    # Multi-scale convolution kernels
    for i, kernel_size in enumerate([2, 3, 4]):
        axes[0, 1].plot([0, kernel_size], [i, i], 'o-', linewidth=3,
                        label=f'{kernel_size}-gram kernel')
    axes[0, 1].set_title('Multi-scale convolution kernels')
    axes[0, 1].set_xlabel('Kernel size')
    axes[0, 1].set_ylabel('Scale')
    axes[0, 1].legend()
    axes[0, 1].grid(True)
    # Feature fusion
    features = np.random.rand(3, 10)
    axes[1, 0].imshow(features, cmap='plasma', aspect='auto')
    axes[1, 0].set_title('Multi-scale feature fusion')
    axes[1, 0].set_xlabel('Feature dimension')
    axes[1, 0].set_ylabel('Scale')
    # Classification decision
    x = np.linspace(0, 1, 100)
    axes[1, 1].plot(x, 1 / (1 + np.exp(-10 * (x - 0.5))), 'r-', label='Spam probability')
    axes[1, 1].plot(x, 1 - 1 / (1 + np.exp(-10 * (x - 0.5))), 'b-', label='Normal probability')
    axes[1, 1].set_title('Classification decision')
    axes[1, 1].set_xlabel('Feature strength')
    axes[1, 1].set_ylabel('Class probability')
    axes[1, 1].legend()
    axes[1, 1].grid(True)
    plt.tight_layout()
    plt.savefig('./model_architecture.png', dpi=300, bbox_inches='tight')
    plt.show()

visualize_model_architecture()

# Initialize the model
vocab_size = len(word2idx)
model = SimpleCNNModel(vocab_size=vocab_size)
print(f"✅ Model initialized, vocabulary size: {vocab_size}")

# Weighted loss to handle class imbalance
class_weights = compute_class_weight(
    'balanced',
    classes=np.array([0, 1]),
    y=train_df['label'].values
)
class_weights = paddle.to_tensor(class_weights, dtype=paddle.float32)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
print(f"📊 Class weights: {class_weights.numpy()}")
3.2 Model Training
# ===================== 5. Training and evaluation functions =====================
def train_epoch(model, dataloader, criterion, optimizer):
    """Train for one epoch"""
    model.train()
    total_loss, total_correct, total_samples = 0, 0, 0
    for batch_word, batch_y in dataloader:
        # Forward pass
        logits = model(batch_word)
        loss = criterion(logits, batch_y)
        # Backward pass
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        # Accumulate metrics
        batch_size = batch_word.shape[0]
        total_loss += loss.item() * batch_size
        pred_labels = paddle.argmax(logits, axis=1)
        total_correct += (pred_labels == batch_y).sum().item()
        total_samples += batch_size
    return total_loss / total_samples, total_correct / total_samples

def evaluate_model(model, dataloader):
    """Evaluate the model"""
    model.eval()
    all_preds, all_labels = [], []
    with paddle.no_grad():
        for batch_word, batch_y in dataloader:
            logits = model(batch_word)
            pred_labels = paddle.argmax(logits, axis=1)
            all_preds.extend(pred_labels.numpy())
            all_labels.extend(batch_y.numpy())
    accuracy = accuracy_score(all_labels, all_preds)
    precision = precision_score(all_labels, all_preds, average='binary', pos_label=1, zero_division=0)
    recall = recall_score(all_labels, all_preds, average='binary', pos_label=1, zero_division=0)
    f1 = f1_score(all_labels, all_preds, average='binary', pos_label=1, zero_division=0)
    return accuracy, precision, recall, f1, all_labels, all_preds

# ===================== 6. Model training =====================
epochs = 8
train_losses, train_accs = [], []
val_accs, val_f1s = [], []
best_val_f1 = 0.0
print(f"🚀 Starting training ({epochs} epochs)...")
for epoch in range(epochs):
    # Train
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer)
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    # Validate
    val_acc, val_precision, val_recall, val_f1, _, _ = evaluate_model(model, val_loader)
    val_accs.append(val_acc)
    val_f1s.append(val_f1)
    print(f"Epoch [{epoch+1:2d}/{epochs}] | "
          f"train loss: {train_loss:.4f} | train acc: {train_acc:.4f} | "
          f"val acc: {val_acc:.4f} | val F1: {val_f1:.4f}")
    # Save the best checkpoint by validation F1
    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        paddle.save(model.state_dict(), './best_final_model.pdparams')
        print(f"💾 Saved best model, validation F1: {val_f1:.4f}")

# Save the final model
paddle.save(model.state_dict(), './final_final_model.pdparams')
print("🎉 Training finished!")

# Visualize the training process
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.plot(train_losses, 'r-o', linewidth=2, markersize=6, label='Train loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.legend()
plt.grid(alpha=0.3)
plt.subplot(1, 3, 2)
plt.plot(train_accs, 'b-s', linewidth=2, markersize=6, label='Train accuracy')
plt.plot(val_accs, 'g-^', linewidth=2, markersize=6, label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Accuracy Curves')
plt.legend()
plt.grid(alpha=0.3)
plt.subplot(1, 3, 3)
plt.plot(val_f1s, 'm-d', linewidth=2, markersize=6, label='Validation F1')
plt.xlabel('Epoch')
plt.ylabel('F1 score')
plt.title('F1 Score Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('./training_curves.png', dpi=300, bbox_inches='tight')
plt.show()
3.3 Model Evaluation and Result Analysis
# ===================== 7. Model evaluation =====================
# Load the best checkpoint
model.set_state_dict(paddle.load('./best_final_model.pdparams'))
print("✅ Loaded the best model for testing")

# Evaluate on the test set
test_acc, test_precision, test_recall, test_f1, test_labels, test_preds = evaluate_model(model, test_loader)
print("\n📊 Final test-set results:")
print(f"Accuracy: {test_acc:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")

# Detailed classification report
print("\n📋 Classification report:")
print(classification_report(test_labels, test_preds, target_names=['Normal SMS', 'Spam SMS']))

# Confusion matrix
conf_matrix = confusion_matrix(test_labels, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Normal SMS', 'Spam SMS'],
            yticklabels=['Normal SMS', 'Spam SMS'])
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix - Test Set')
plt.tight_layout()
plt.savefig('./final_confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Radar chart of the performance metrics
categories = ['Accuracy', 'Precision', 'Recall', 'F1 score']
values = [test_acc, test_precision, test_recall, test_f1]
values += values[:1]  # close the polygon
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(projection='polar'))
ax.plot(angles, values, 'o-', linewidth=2, label='Model performance')
ax.fill(angles, values, alpha=0.25)
ax.set_thetagrids(np.degrees(angles[:-1]), categories)
ax.set_ylim(0, 1)
ax.set_title('Performance Radar Chart')
ax.grid(True)
plt.savefig('./performance_radar.png', dpi=300, bbox_inches='tight')
plt.show()
3.4 Prediction and Case Analysis
# ===================== 8. Prediction =====================
def predict_sms_final(text, model, word2idx, max_len=50):
    """Predict a single SMS message"""
    model.eval()
    # Preprocess
    words = simple_preprocess(text)
    word_seq = words_to_sequence(words, word2idx, max_len)
    # Convert to a tensor with a batch dimension of 1
    word_tensor = paddle.to_tensor([word_seq], dtype=paddle.int64)
    with paddle.no_grad():
        logits = model(word_tensor)
        probs = F.softmax(logits, axis=1)
        pred_label = paddle.argmax(probs, axis=1).item()
        confidence = probs[0][pred_label].item()
        normal_prob = probs[0][0].item()
        spam_prob = probs[0][1].item()
    return pred_label, confidence, words, normal_prob, spam_prob

# Try the prediction function on hand-written examples
print("\n🔍 Final prediction examples:")
test_samples = [
    "您好,您的快递已到达小区门口,请及时领取",
    "恭喜您获得100元话费奖励!请点击链接领取",
    "今晚一起吃饭吗?老地方见",
    "您的账户存在风险,请立即验证身份信息",
    "周末有空吗?想约你去看电影",
    "免费领取iPhone14,点击链接立即获取",
    "明天开会时间改为下午三点,请准时参加",
    "银行通知:您的信用卡已逾期,请尽快还款",
    "妈妈,我晚上不回家吃饭了",
    "双十一大促,全场五折起,限时抢购"
]
# Expected labels based on common sense (0 = normal, 1 = spam)
expected_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print("=" * 80)
correct_predictions = 0
total_predictions = len(test_samples)
results = []
for i, (sample, expected) in enumerate(zip(test_samples, expected_labels), 1):
    pred, conf, words, normal_prob, spam_prob = predict_sms_final(
        sample, model, word2idx, max_word_len
    )
    label_str = "spam" if pred == 1 else "normal"
    expected_str = "spam" if expected == 1 else "normal"
    is_correct = "✅" if pred == expected else "❌"
    if pred == expected:
        correct_predictions += 1
    results.append({
        'text': sample,
        'words': words,
        'predicted': pred,
        'expected': expected,
        'normal_prob': normal_prob,
        'spam_prob': spam_prob,
        'correct': pred == expected
    })
    print(f"Sample {i}: {sample}")
    print(f"Tokens: {words}")
    print(f"Probabilities: normal {normal_prob:.4f} vs spam {spam_prob:.4f}")
    print(f"Prediction: {label_str} (expected: {expected_str}) {is_correct} (confidence: {conf:.4f})")
    print("-" * 60)
accuracy = correct_predictions / total_predictions
print(f"📊 Manual-check accuracy: {accuracy:.2%} ({correct_predictions}/{total_predictions})")

# Visualize the prediction results
fig, axes = plt.subplots(2, 1, figsize=(12, 10))
# Per-sample probability distribution
normal_probs = [r['normal_prob'] for r in results]
spam_probs = [r['spam_prob'] for r in results]
x = range(len(results))
axes[0].bar(x, normal_probs, width=0.4, label='Normal probability', alpha=0.7)
axes[0].bar([i + 0.4 for i in x], spam_probs, width=0.4, label='Spam probability', alpha=0.7)
axes[0].set_xlabel('Sample index')
axes[0].set_ylabel('Probability')
axes[0].set_title('Predicted Probability Distribution')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Cumulative accuracy over the samples
correct_counts = [1 if r['correct'] else 0 for r in results]
cumulative_accuracy = [sum(correct_counts[:i+1]) / (i+1) for i in range(len(correct_counts))]
axes[1].plot(range(1, len(results)+1), cumulative_accuracy, 'g-o', linewidth=2)
axes[1].axhline(y=accuracy, color='r', linestyle='--', label=f'Final accuracy: {accuracy:.2%}')
axes[1].set_xlabel('Number of samples')
axes[1].set_ylabel('Cumulative accuracy')
axes[1].set_title('Cumulative Accuracy')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.savefig('./prediction_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
print("=" * 80)
print("🎯 Chinese spam SMS classification experiment finished!")
print("=" * 80)

# Save the prediction analysis
prediction_analysis = pd.DataFrame(results)
prediction_analysis.to_csv('./final_prediction_analysis.csv', index=False, encoding='utf-8')
print("💾 Prediction analysis saved to final_prediction_analysis.csv")
4. Results and Analysis
4.1 Performance Summary
After 8 epochs of training, the model performs well on the test set:
- Accuracy: 96.02%
- Precision: 96.52%
- Recall: 97.28%
- F1 score: 96.90%
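As a consistency check, the F1 score follows from the precision and recall above: F1 = 2PR / (P + R) = 2 × 0.9652 × 0.9728 / (0.9652 + 0.9728) ≈ 0.9690, which matches the reported 96.90%.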
4.2 Training Process Analysis
The training curves show:
- Loss convergence: training loss drops quickly from 0.3000 to 0.0244, indicating that the model learns effectively
- Accuracy improvement: training accuracy rises from 87.67% to 99.29%, while validation accuracy stays above 96%
- No obvious overfitting: training and validation performance improve largely in step
4.3 Prediction Case Analysis
Of the 10 hand-written test samples, the model classified 6 correctly (60% accuracy). The main error types were:
- False negatives: account-risk alerts and bank notices misclassified as normal messages
- False positives: meeting reminders and family messages misclassified as spam
4.4 Technical Highlights
- Multi-scale feature extraction: 2-, 3-, and 4-gram convolution kernels capture text patterns of different lengths
- Class-imbalance handling: a weighted loss function mitigates the imbalance in the data
- End-to-end pipeline: complete preprocessing, training, evaluation, and prediction workflow
- Interpretability: probability distributions and tokenization results help explain the model's decisions
5. Conclusions and Future Work
5.1 Conclusions
This experiment built a Chinese spam SMS classification system based on a multi-scale CNN, reaching 96.02% accuracy on the test set. The model distinguishes normal messages from spam effectively and has practical value.
5.2 Technical Contributions
- Demonstrated the effectiveness of a word-level multi-scale CNN for Chinese text classification
- Provided a complete preprocessing and feature-engineering scheme
- Implemented a high-performing and interpretable spam SMS detection system
5.3 Future Work
- Model improvements: try pretrained language models such as BERT or ERNIE (a minimal sketch follows this list)
- Data augmentation: bring in more diverse training data
- Real-time detection: add online learning and real-time filtering
- Multi-language support: extend to spam detection in other languages
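As a pointer for the pretrained-model direction, the sketch below shows how the same binary task could be set up with PaddleNLP's ERNIE. It assumes paddlenlp is installed and that the checkpoint name 'ernie-3.0-medium-zh' is available, and it only runs a single forward pass rather than the full fine-tuning loop:
# Minimal ERNIE sketch (assumptions: paddlenlp installed, 'ernie-3.0-medium-zh' available)
import paddle
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('ernie-3.0-medium-zh')
ernie = ErnieForSequenceClassification.from_pretrained('ernie-3.0-medium-zh', num_classes=2)

encoded = tokenizer("恭喜您获得100元话费奖励!请点击链接领取", max_length=64, truncation=True)
input_ids = paddle.to_tensor([encoded['input_ids']])
token_type_ids = paddle.to_tensor([encoded['token_type_ids']])

ernie.eval()
with paddle.no_grad():
    logits = ernie(input_ids, token_type_ids=token_type_ids)  # [1, 2] logits for normal vs. spam
print(paddle.nn.functional.softmax(logits, axis=1))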
This experiment provides a reliable technical approach and a complete reference implementation for Chinese spam SMS detection.