Machine Learning: Training a Sentiment Analysis Model with LSTM

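Background: an e-commerce platform receives a large volume of product reviews. It needs to classify the sentiment of those reviews automatically and raise alerts and reminders in real time.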

Data format:

The two most important columns in the data are Review Text and Rating: the first holds the review content, the second the score (1-5).

Step 1: preprocess the data:

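import pandas as pd
from sklearn.preprocessing import LabelEncoder
# The original omits its imports; these tensorflow.keras paths are an assumption.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


def load_and_preprocess_data(filepath):
    """Load the CSV and turn it into padded sequences and encoded labels."""
    df = pd.read_csv(filepath)
    # Drop rows with a missing review or rating; the tokenizer cannot handle NaN
    df = df.dropna(subset=['Review Text', 'Rating'])
    texts = df['Review Text'].astype(str).values
    labels = df['Rating'].values

    # Encode labels (1-5 -> 0-4)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    # Convert text to integer sequences
    tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    # Pad at the front (the default) or truncate at the end to 200 tokens
    padded_sequences = pad_sequences(sequences, maxlen=200, truncating='post')

    return padded_sequences, labels, tokenizer, le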

This loads the data and converts each review into a fixed-length sequence of integer token IDs.
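
To make the serialization step concrete, the sketch below mimics in plain Python what `Tokenizer` and `pad_sequences` do with the settings used above. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index`, `to_sequence`, and `pad` are hypothetical, and tokenization here is just lowercasing plus whitespace splitting.

```python
from collections import Counter

OOV_INDEX = 1  # with oov_token set, index 1 is the OOV token; 0 is reserved for padding


def build_word_index(texts, num_words):
    """Assign indices by descending frequency, starting at 2 (0 = padding, 1 = OOV)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable: ties keep first-seen order
    index = {word: i for i, (word, _) in enumerate(ranked, start=2)}
    # Only words whose index is below num_words are kept; the rest map to OOV
    return {w: i for w, i in index.items() if i < num_words}


def to_sequence(text, word_index):
    """Replace each word with its index, or with OOV_INDEX if it is unknown."""
    return [word_index.get(w, OOV_INDEX) for w in text.lower().split()]


def pad(seq, maxlen):
    """Truncate at the end (truncating='post'), then pad zeros at the front (padding='pre')."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq


# Hypothetical mini-corpus, plus one text containing a word never seen during fitting
texts = ["great dress great fit", "poor fit"]
word_index = build_word_index(texts, num_words=20000)
padded = [pad(to_sequence(t, word_index), maxlen=6) for t in texts + ["great unknown"]]
```

Because padding goes at the front, the actual tokens sit at the end of each row, which is where the LSTM finishes reading.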

Step 2: define the model architecture with LSTM layers and set the embedding dimension:

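# The original omits its imports; these tensorflow.keras paths are an assumption.
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, Input, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


def build_model(vocab_size, max_len, embedding_dim=128):
    """Build the LSTM model."""
    model = Sequential([
        # Explicit input shape; replaces Embedding's input_length, which Keras 3 deprecates
        Input(shape=(max_len,)),
        Embedding(vocab_size, embedding_dim),
        Bidirectional(LSTM(64, return_sequences=True)),
        Dropout(0.5),
        LSTM(32),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dropout(0.5),
        # One output per rating class (0-4)
        Dense(5, activation='softmax')
    ])

    model.compile(
        optimizer=Adam(learning_rate=0.001),
        # Sparse variant: labels are integer class indices, not one-hot vectors
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model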
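
The last layer and the loss function fit together as follows: softmax turns five raw scores into probabilities, and sparse categorical cross-entropy is the negative log of the probability assigned to the true class index. The sketch below works through that arithmetic in plain Python. The logits are made-up numbers, and Keras additionally clips probabilities for numerical safety, which is left out here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log-probability of the true class index."""
    return -math.log(probs[label])


# Hypothetical raw outputs for the five rating classes (indices 0-4)
logits = [0.1, 0.2, 0.5, 1.5, 3.0]
probs = softmax(logits)

loss_if_5_stars = sparse_categorical_crossentropy(probs, label=4)  # true class is the likeliest
loss_if_1_star = sparse_categorical_crossentropy(probs, label=0)   # true class is unlikely
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability shrinks, which is why the labels only need to be integers in 0-4.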

Step 3: train and save the model:

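import pickle

from sklearn.model_selection import train_test_split
# The original omits its imports; this tensorflow.keras path is an assumption.
from tensorflow.keras.callbacks import EarlyStopping


def main():
    # Load data
    X, y, tokenizer, le = load_and_preprocess_data('Clothing_Reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Build the model
    model = build_model(vocab_size=20000, max_len=200)
    model.summary()

    # Train the model. Note: the test split doubles as the validation set here,
    # so the accuracy reported below is an optimistic estimate.
    history = model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        epochs=10,
        batch_size=64,
        callbacks=[EarlyStopping(monitor='val_loss', patience=3)],
        verbose=1
    )
    # history is a History object; the per-epoch metrics live in history.history
    print(f"history: {history.history}")

    # Evaluate the model
    # plot_history(history)
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f'Test accuracy: {accuracy:.2f}')

    # Save the model, tokenizer, and label encoder
    model.save('sentiment_lstm_5class.h5')
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open('label_encoder.pickle', 'wb') as handle:
        pickle.dump(le, handle, protocol=pickle.HIGHEST_PROTOCOL)
    print("Model, tokenizer, and label encoder saved")


if __name__ == '__main__':
    main()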
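
The `EarlyStopping(monitor='val_loss', patience=3)` callback used during training stops once the validation loss has gone three epochs in a row without beating its best value. The sketch below reproduces that rule in plain Python. The function name `early_stopping_epoch` and the loss values are hypothetical, and the sketch ignores options such as `min_delta`.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch
losses = [0.90, 0.80, 0.85, 0.82, 0.81, 0.70]
stopped_at = early_stopping_epoch(losses, patience=3)
```

With these numbers training halts before the final 0.70 epoch is ever reached, which is the trade-off of a small patience. Note also that Keras keeps the weights from the last epoch rather than the best one unless `restore_best_weights=True` is passed to `EarlyStopping`.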

At this point the model trained on the training set has been saved. To run predictions with the saved model, create a separate prediction class:

import pickle

import numpy as np
# Public import paths; the original imported from the private keras.src.* modules
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences


class SentimentPredictor:
    def __init__(self, model_path, tokenizer_path, label_encoder_path, max_len=200):
        """Load the model, tokenizer, and label encoder."""
        self.model = load_model(model_path)
        with open(tokenizer_path, 'rb') as handle:
            self.tokenizer = pickle.load(handle)
        # The label encoder is required: it maps class indices back to ratings
        with open(label_encoder_path, 'rb') as handle:
            self.label_encoder = pickle.load(handle)
        self.max_len = max_len

    def preprocess_text(self, text):
        """Convert one review into a padded sequence, as during training."""
        sequence = self.tokenizer.texts_to_sequences([text])
        padded = pad_sequences(sequence, maxlen=self.max_len, truncating='post')
        return padded

    def predict_sentiment(self, text, verbose=False):
        """Predict the sentiment score (1-5) and return it with the class probabilities."""
        # Preprocess
        padded_sequence = self.preprocess_text(text)

        # Predict
        prediction = self.model.predict(padded_sequence, verbose=0)
        predicted_class = np.argmax(prediction, axis=1)

        # Map back to the original label (1-5)
        predicted_score = self.label_encoder.inverse_transform(predicted_class)[0]

        if verbose:
            print(f"Review: {text}")
            print(f"Predicted sentiment score: {predicted_score}")
            print("Class probabilities:")
            # Use the encoder's own labels instead of assuming index + 1
            for label, prob in zip(self.label_encoder.classes_, prediction[0]):
                print(f"{label} stars: {prob:.4f}")
            print("-" * 50)

        return predicted_score, prediction[0]

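This class wraps loading the saved model and running predictions with it. To use it, pass in the paths of the saved model, tokenizer, and label encoder. Example usage: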
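
The core of `predict_sentiment` is two lookups: pick the class with the highest probability, then map that class index back to the original rating. The sketch below shows the same logic without Keras or scikit-learn. The `classes` list stands in for `LabelEncoder.classes_`, and the probabilities are made up.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Stand-in for LabelEncoder.classes_: position = encoded class, value = original rating
classes = [1, 2, 3, 4, 5]

# Hypothetical softmax output for one review
probs = [0.02, 0.03, 0.10, 0.25, 0.60]

predicted_class = argmax(probs)             # encoded label in 0-4
predicted_score = classes[predicted_class]  # original rating in 1-5
confidence = probs[predicted_class]         # probability of the chosen class, not an accuracy
```

The highest probability measures how sure the model is about one prediction. It says nothing about how often the model is right, which is what the test-set accuracy measures.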

# Assumes SentimentPredictor and numpy (as np) from the previous listing are in scope

# Initialize the predictor
predictor = SentimentPredictor(
    model_path='sentiment_lstm_5class.h5',
    tokenizer_path='tokenizer.pickle',
    label_encoder_path='label_encoder.pickle'
)

# Test reviews
test_reviews = [
    "This product is absolutely amazing! Best purchase ever!",
    "The item was okay, but not worth the price.",
    "Terrible quality. Would not recommend to anyone.",
    "It's decent for the price, though it has some flaws.",
    "I'm completely satisfied with this purchase. It exceeded all my expectations!"
]

# Predict each review in turn
for review in test_reviews:
    score, probs = predictor.predict_sentiment(review, verbose=True)
    # np.max(probs) is the model's confidence in this prediction, not an accuracy
    print(f"{review}: predicted sentiment score: {score}, confidence: {np.max(probs) * 100:.2f}%")

# Predict a single review
sample_review = "The product was good but the delivery took too long."
score, probs = predictor.predict_sentiment(sample_review, verbose=True)
print(f"{sample_review}: predicted sentiment score: {score}, confidence: {np.max(probs) * 100:.2f}%")
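
The background section mentions real-time alerts, which the code above does not implement. One possible way to build them on top of the `(score, probs)` pair returned by `predict_sentiment` is a simple threshold rule, sketched below. The function name `should_alert` and both thresholds are hypothetical and would need tuning on real data.

```python
def should_alert(score, probs, max_score=2, min_confidence=0.6):
    """Flag a review when the predicted rating is low and the model is fairly sure.

    max_score and min_confidence are illustrative thresholds, not values from the article.
    """
    return score <= max_score and max(probs) >= min_confidence


# Hypothetical (score, probabilities) pairs in the shape predict_sentiment returns
predictions = [
    (1, [0.80, 0.10, 0.05, 0.03, 0.02]),  # confident 1-star: alert
    (2, [0.30, 0.35, 0.20, 0.10, 0.05]),  # low score but uncertain: no alert
    (5, [0.01, 0.01, 0.03, 0.15, 0.80]),  # positive review: no alert
]
alerts = [should_alert(score, probs) for score, probs in predictions]
```

Requiring a minimum confidence keeps borderline predictions from triggering alerts, at the cost of missing some genuinely negative reviews.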