Ubuntu Tutorial, Beginner to Expert: Artificial Intelligence on Ubuntu 22.04, Topic Deep Dive (25)

Artificial Intelligence on Ubuntu 22.04

25.1 Setting Up the Base Environment

25.1.1 Overview

To set up an AI development environment on Ubuntu 22.04 LTS, Anaconda is the recommended package and environment manager. It bundles a Python interpreter, the conda package manager, and a large collection of scientific-computing libraries; it resolves dependency conflicts effectively and can isolate multiple Python versions in separate environments.

Key advantages:

  • Precompiled binary packages, no building from source
  • Environment isolation keeps project dependencies from interfering with each other
  • Consistent behavior across platforms (Linux/Windows/macOS)
  • Ships with development tools such as Jupyter Notebook

25.1.2 Installing Anaconda

Step 1: Prepare the system

bash
# Refresh the package index
sudo apt update

# Install required runtime libraries (avoids "missing shared library" errors later)
sudo apt install -y libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6

# Check the system architecture (x86_64 or aarch64)
uname -m  # example output: x86_64

Step 2: Download the Anaconda installer

bash
# Option 1: download directly with wget (recommended)
cd /tmp
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

# Option 2: download with curl
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

# Important: verify the installer's integrity (guards against a corrupted download)
sha256sum Anaconda3-2024.02-1-Linux-x86_64.sh
# Compare the result against the SHA256 checksum published on the official site
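# sha256sum can also do the comparison for you; substitute the real checksum
# published at https://repo.anaconda.com for the placeholder below:
# echo "<expected-sha256>  Anaconda3-2024.02-1-Linux-x86_64.sh" | sha256sum --check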

Step 3: Run the installer

bash
# Run the installation script
bash Anaconda3-2024.02-1-Linux-x86_64.sh

# Interactive prompts during installation:
# - press Enter to page through the license agreement
# - type 'yes' to accept the license
# - confirm the install path (default ~/anaconda3; press Enter to accept)
# - type 'yes' to initialize conda (important!)

Step 4: Verify the installation and configure the shell

bash
# Reload the shell configuration (the installer modified ~/.bashrc)
source ~/.bashrc

# Confirm the conda command is available
conda --version
# Example output: conda 24.1.2

# Check the Python version (should be the interpreter bundled with Anaconda)
python --version
# Example output: Python 3.11.7

# Initialize conda (only needed if you skipped it during installation)
conda init bash
# Other shells: conda init zsh/fish/tcsh

Step 5: Configure conda mirrors (essential for users in mainland China)

bash
# Point conda at the Tsinghua mirror for much faster downloads
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/

# Show channel URLs when installing
conda config --set show_channel_urls yes

# Verify the configuration
conda config --show channels
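# The channel configuration is stored in ~/.condarc and can be inspected directly
cat ~/.condarc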

Step 6: Update conda itself

bash
# Update conda in the base environment
conda update -n base -c defaults conda

# Update all packages in the base environment
conda update --all

25.1.3 conda Basics

Environment management commands

bash
# 1. Create a new environment (core syntax)
conda create --name <env-name> python=<version>

# Example: create an ml-env environment with Python 3.10
conda create --name ml-env python=3.10

# Specify several packages at creation time
conda create --name dl-env python=3.9 numpy pandas

# Create from an environment.yml file (for reproducible projects)
conda env create -f environment.yml

# 2. Activate (switch to) an environment
conda activate ml-env
# The prompt changes to: (ml-env) user@ubuntu:~$

# 3. Leave the current environment
conda deactivate

# 4. List all environments
conda env list
# or
conda info --envs
# Example output:
# base                  *  /home/user/anaconda3
# ml-env                   /home/user/anaconda3/envs/ml-env

# 5. Delete an environment (use with care)
conda env remove --name ml-env
# or
conda remove --name ml-env --all

# 6. Export an environment definition (locks versions)
conda env export > environment.yml
# The file records every package with its exact version
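# For reference, a minimal environment.yml looks roughly like this
# (illustrative content, not the exact output of the command above):
#   name: ml-env
#   channels:
#     - conda-forge
#   dependencies:
#     - python=3.10
#     - numpy=1.24.3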

# 7. Clone an existing environment
conda create --name new-env --clone old-env

Package management commands

bash
# 1. Install a package (dependencies are resolved automatically)
conda install <package>

# Example: install a specific numpy version
conda install numpy=1.24.3

# Install several packages at once
conda install scipy matplotlib pandas

# Install from a specific channel
conda install -c conda-forge opencv

# 2. Remove a package
conda remove <package>
# or
conda uninstall <package>

# 3. Update a package
conda update <package>

# Update all packages
conda update --all

# 4. Search for a package
conda search <package>

# 5. List installed packages
conda list
# Output format: name  version  build  channel

# 6. Show a specific package
conda list numpy
# Example output: numpy  1.24.3  py310h5f9d8c6_0  (conda-forge)

# 7. Clean caches (frees disk space)
conda clean --all
# Options:
# --packages: remove unused extracted packages
# --tarballs: remove downloaded archives
# --index-cache: remove the index cache

Managing Python versions

bash
# Switch the Python version inside an environment
conda activate ml-env
conda install python=3.11

# Verify
python --version

Example: building a complete ML development environment

bash
# Create and activate the environment
conda create --name ml-dev python=3.10 -y
conda activate ml-dev

# Install the core data-science stack in one go
conda install -y numpy pandas matplotlib scikit-learn jupyter ipython

# Install deep-learning basics (CPU-only PyTorch build)
conda install -y pytorch torchvision torchaudio cpuonly -c pytorch

# Verify the installation
python -c "import numpy, pandas, sklearn, torch; print('All libraries imported successfully')"

25.2 Configuring a Machine Learning Development Environment

25.2.1 Machine Learning Overview

Machine learning lets computers learn patterns from data and make predictions without being explicitly programmed. On Ubuntu 22.04, Scikit-learn is the go-to ML library; it provides (a minimal API sketch follows the list):

  • Supervised learning: classification, regression
  • Unsupervised learning: clustering, dimensionality reduction
  • Model selection: cross-validation, grid search
  • Preprocessing: feature extraction, scaling
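All of these capabilities share scikit-learn's unified estimator interface (fit / predict / score). A minimal sketch; the dataset and classifier here are illustrative choices, not part of the later case study:

python
# Minimal sketch of scikit-learn's unified estimator API.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)           # supervised learning: fit on the training split
print(clf.score(X_test, y_test))    # mean accuracy on the held-out split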

Core dependencies:

  • NumPy: multi-dimensional array operations
  • SciPy: scientific computing
  • Matplotlib: data visualization
  • joblib: model persistence

25.2.2 Installing Scikit-learn

Method 1: conda (recommended)

bash
# Activate the target environment
conda activate ml-env

# Install scikit-learn (all dependencies are pulled in automatically)
conda install scikit-learn

# Install a specific version
conda install scikit-learn=1.4.0

# Install from the conda-forge channel (updated more quickly)
conda install -c conda-forge scikit-learn

Method 2: pip (fallback)

bash
# With the environment activated
conda activate ml-env

# Upgrade pip first
pip install --upgrade pip

# Install scikit-learn
pip install scikit-learn

# Install a specific version
pip install scikit-learn==1.4.0

# Install a pre-release build
pip install --pre scikit-learn

# Install from source (for development)
pip install git+https://github.com/scikit-learn/scikit-learn.git

Method 3: install the full scientific stack (good for beginners)

bash
# Install all related libraries in one command (roughly 500 MB)
conda install numpy scipy matplotlib scikit-learn pandas jupyter

# Or use the Anaconda distribution (everything preinstalled)
# Download: https://www.anaconda.com/download

Checking dependency compatibility

bash
# Show the installed scikit-learn and its key dependencies
conda list | grep -E "scikit-learn|numpy|scipy"

# Example output:
# numpy              1.24.3
# scipy              1.10.1
# scikit-learn       1.4.0

25.2.3 Testing the Installation

Test 1: basic import test

python
# test_sklearn_install.py
"""
Scikit-learn installation check.
Tests core module imports and basic functionality.
"""

# Catch import errors
try:
    import sklearn
    print(f"✓ scikit-learn version: {sklearn.__version__}")
except ImportError as e:
    print(f"✗ import failed: {e}")
    exit(1)

# Test the core dependencies
try:
    import numpy as np
    import scipy
    import joblib
    print(f"✓ numpy version: {np.__version__}")
    print(f"✓ scipy version: {scipy.__version__}")
    print("✓ all dependencies imported")
except ImportError as e:
    print(f"✗ dependency import failed: {e}")

# Functional test: generate a dataset and train a model
def test_basic_functionality():
    """Exercise scikit-learn's core workflow."""
    print("\nRunning functional tests...")

    # Import the pieces we need
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # 1. Generate a synthetic dataset
    # make_classification parameters:
    # n_samples: number of samples
    # n_features: number of features
    # n_informative: number of informative features
    # n_redundant: number of redundant features
    # random_state: seed for reproducibility
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=5,
        n_redundant=3,
        random_state=42
    )
    print(f"✓ dataset generated: X shape {X.shape}, y shape {y.shape}")

    # 2. Split into training and test sets
    # train_test_split parameters:
    # test_size: fraction held out for testing
    # random_state: seed
    # stratify: preserve class proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"✓ split done: train {X_train.shape}, test {X_test.shape}")

    # 3. Create and train a model
    # LogisticRegression parameters:
    # max_iter: maximum optimizer iterations
    # random_state: seed
    # solver: optimization algorithm
    model = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')
    model.fit(X_train, y_train)
    print("✓ model trained")

    # 4. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"✓ model accuracy: {accuracy:.4f}")

    # 5. Model persistence test
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Build a pipeline with preprocessing
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # standardize features
        ('classifier', LogisticRegression(max_iter=1000))
    ])

    pipeline.fit(X_train, y_train)
    print(f"✓ pipeline trained, test accuracy: {pipeline.score(X_test, y_test):.4f}")

    # Save the model
    import joblib
    joblib.dump(pipeline, 'test_model.pkl')
    print("✓ model saved")

    # Load it back
    loaded_model = joblib.load('test_model.pkl')
    print(f"✓ model loaded, accuracy after reload: {loaded_model.score(X_test, y_test):.4f}")

if __name__ == "__main__":
    test_basic_functionality()
    print("\n✅ All tests passed: scikit-learn is installed and working")

Test 2: Jupyter Notebook integration test

python
# Run in a terminal:
#   jupyter notebook

# Then execute the following in a notebook cell:
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a classic dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Reduce to two dimensions with PCA for visualization
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
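# Fraction of the total variance captured by each principal component
print(f"explained variance ratio: {pca.explained_variance_ratio_}")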

# Plot the result
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')
plt.show()

25.2.4 Updating or Uninstalling Scikit-learn

Updating

bash
# Check the current version
conda list scikit-learn

# Update to the latest stable release
conda update scikit-learn

# Update from a specific channel
conda update -c conda-forge scikit-learn

# Update with pip (only when conda is unavailable)
pip install --upgrade scikit-learn

# Verify after updating
python -c "import sklearn; print(sklearn.__version__)"

Uninstalling

bash
# Option 1: conda (clean removal, handles dependencies)
conda remove scikit-learn

# Option 2: pip
pip uninstall scikit-learn

# Skip the confirmation prompt (useful in scripts)
pip uninstall scikit-learn -y

# Clean up leftovers
# Clear caches
conda clean --all

# Confirm the removal
python -c "import sklearn"  # should raise ModuleNotFoundError

Downgrading (to resolve compatibility issues)

bash
# Remove the current version
conda remove scikit-learn

# Install a specific older version
conda install scikit-learn=1.3.0

# or with pip
pip install scikit-learn==1.2.2

25.3 A Machine Learning Application, End to End

25.3.1 Project Overview

Goal: build a customer-churn prediction system on a telecom dataset, using machine-learning models to identify customers likely to leave.

Tech stack:

  • Data handling: Pandas + NumPy
  • Visualization: Matplotlib + Seaborn
  • Models: Scikit-learn (logistic regression, random forest, gradient boosting, SVM)
  • Evaluation: cross-validation, ROC curves, confusion matrices
  • Deployment: model serialization with Joblib

Dataset:

  • 21 fields (customer ID, service type, charges, and so on)
  • Label: churned or not (Churn: Yes/No)
  • Roughly 7,000 records

25.3.2 Environment Setup

bash
# Create a dedicated environment
conda create --name churn-prediction python=3.10 -y

# Activate it
conda activate churn-prediction

# Install the core libraries
conda install -y numpy pandas matplotlib seaborn scikit-learn jupyter

# Install XGBoost (high-performance gradient boosting; optional for this case study)
conda install -y -c conda-forge xgboost

# Install auxiliary libraries
pip install -U imbalanced-learn  # handles class imbalance
pip install -U scikit-plot       # plotting helpers

# Start Jupyter
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

# Run in the background if needed
nohup jupyter notebook > jupyter.log 2>&1 &

25.3.3 Walkthrough

Case study: a complete customer-churn prediction project

python
# churn_prediction.py
"""
End-to-end telecom customer-churn prediction.
Covers preprocessing, feature engineering, model training, evaluation, and deployment.
"""

# ==================== 1. Imports ====================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib    # model persistence (used when saving models below)
import sklearn   # referenced for its version string in requirements.txt
from datetime import datetime

# sklearn core modules
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV,
    StratifiedKFold, learning_curve
)
from sklearn.preprocessing import (
    StandardScaler, LabelEncoder, OneHotEncoder,
    RobustScaler, MinMaxScaler
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    roc_curve, precision_recall_curve, accuracy_score,
    f1_score, precision_score, recall_score
)

# Learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE  # class-imbalance handling

# Silence warnings
import warnings
warnings.filterwarnings('ignore')

# Fonts for CJK labels (the SimHei font must be installed separately on Ubuntu)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# ==================== 2. Data Loading and Exploration ====================
def load_and_explore_data():
    """
    Load the data and run a quick exploratory analysis.
    Returns: DataFrame
    """
    # Simulate a telecom dataset (in a real project, read from CSV):
    # df = pd.read_csv('telecom_churn.csv')

    # Generate sample data
    np.random.seed(42)
    n_samples = 7000

    data = {
        'customerID': [f'CID_{i:06d}' for i in range(n_samples)],
        'gender': np.random.choice(['Male', 'Female'], n_samples),
        'SeniorCitizen': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]),
        'Partner': np.random.choice(['Yes', 'No'], n_samples),
        'Dependents': np.random.choice(['Yes', 'No'], n_samples),
        'tenure': np.random.randint(0, 73, n_samples),  # months with the company
        'PhoneService': np.random.choice(['Yes', 'No'], n_samples),
        'MultipleLines': np.random.choice(['No', 'Yes', 'No phone service'], n_samples),
        'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
        'OnlineSecurity': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'OnlineBackup': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'DeviceProtection': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'TechSupport': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'StreamingTV': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'StreamingMovies': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
        'PaperlessBilling': np.random.choice(['Yes', 'No'], n_samples),
        'PaymentMethod': np.random.choice([
            'Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'
        ], n_samples),
        'MonthlyCharges': np.random.uniform(18, 120, n_samples).round(2),
        'TotalCharges': np.random.uniform(18, 8000, n_samples).round(2),
        'Churn': np.random.choice(['No', 'Yes'], n_samples, p=[0.73, 0.27])  # 27% churn rate
    }

    df = pd.DataFrame(data)

    print("=" * 60)
    print("Data loaded")
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("\nFirst 5 rows:")
    print(df.head())

    print("\nDtype counts:")
    print(df.dtypes.value_counts())

    print("\nTarget distribution:")
    churn_counts = df['Churn'].value_counts()
    print(churn_counts)
    print(f"Churn rate: {churn_counts['Yes'] / len(df):.2%}")

    # Save the data
    df.to_csv('telecom_churn_dataset.csv', index=False)
    print("\nData saved to: telecom_churn_dataset.csv")

    return df

# ==================== 3. Data Preprocessing ====================
def preprocess_data(df):
    """
    Full preprocessing pipeline.

    Args:
        df: raw DataFrame

    Returns:
        X: feature matrix
        y: label array
        preprocessor: ColumnTransformer (reused later for prediction)
    """
    print("\n" + "=" * 60)
    print("Preprocessing...")

    # 3.1 Cleaning
    # Handle missing TotalCharges (typically new customers with tenure=0)
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'])

    # Drop customerID (carries no signal)
    df = df.drop('customerID', axis=1)

    # 3.2 Separate features and label
    X = df.drop('Churn', axis=1)
    y = df['Churn'].map({'No': 0, 'Yes': 1})  # binarize the label

    print(f"Feature matrix shape: {X.shape}")
    print(f"Label array shape: {y.shape}")

    # 3.3 Declare feature types
    # Numeric features
    numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'SeniorCitizen']

    # Categorical features
    categorical_features = [col for col in X.columns if col not in numeric_features]

    print(f"\nNumeric features: {numeric_features}")
    print(f"Categorical features: {categorical_features}")

    # 3.4 Build the preprocessing pipelines
    # Numeric: standardization
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())  # standardize: (x - μ) / σ
    ])

    # Categorical: one-hot encoding
    categorical_transformer = Pipeline(steps=[
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        # handle_unknown='ignore': silently skip categories unseen during fit
        # sparse_output=False: return a dense array for simpler downstream handling
    ])

    # Combine into a single ColumnTransformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    print("Preprocessing pipeline built")

    return X, y, preprocessor

# ==================== 4. Feature Engineering ====================
def feature_engineering(X, y, preprocessor):
    """
    Feature engineering, including class rebalancing.

    Returns: processed train/test splits
    """
    print("\n" + "=" * 60)
    print("Feature engineering...")

    # 4.1 Class imbalance (SMOTE oversampling)
    # SMOTE: Synthetic Minority Oversampling Technique
    # Applied only to the training set to avoid data leakage
    print(f"Class distribution before oversampling:\n{y.value_counts()}")

    # 4.2 Train/test split (before SMOTE)
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"\nTrain: {X_train_raw.shape}, test: {X_test_raw.shape}")

    # 4.3 Fit the preprocessing pipeline
    # Fit on the training set only, then transform both splits
    X_train_processed = preprocessor.fit_transform(X_train_raw)
    X_test_processed = preprocessor.transform(X_test_raw)

    print(f"Processed train shape: {X_train_processed.shape}")
    print(f"Processed test shape: {X_test_processed.shape}")

    # 4.4 Apply SMOTE (training set only)
    # k_neighbors=3: number of neighbors used when synthesizing samples
    smote = SMOTE(random_state=42, k_neighbors=3)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

    print(f"\nResampled train shape: {X_train_resampled.shape}")
    print(f"Class distribution after oversampling:\n{pd.Series(y_train_resampled).value_counts()}")

    return X_train_resampled, X_test_processed, y_train_resampled, y_test, preprocessor

# ==================== 5. Model Training and Tuning ====================
def train_and_evaluate_models(X_train, X_test, y_train, y_test):
    """
    Train several models with hyperparameter tuning.

    Returns: best models and a performance comparison
    """
    print("\n" + "=" * 60)
    print("Training models...")

    # 5.1 Candidate models
    # Note: preprocessing was already handled by the ColumnTransformer
    models = {
        'Logistic Regression': LogisticRegression(
            max_iter=1000, random_state=42, class_weight='balanced'
        ),
        'Random Forest': RandomForestClassifier(
            n_estimators=100, random_state=42, class_weight='balanced'
        ),
        'Gradient Boosting': GradientBoostingClassifier(
            random_state=42
        ),
        'SVM': SVC(
            random_state=42, probability=True, class_weight='balanced'
        )
    }

    # 5.2 Hyperparameter grids (key parameters only)
    param_grids = {
        'Logistic Regression': {
            'C': [0.1, 1.0, 10.0],  # inverse of regularization strength
            'penalty': ['l2', 'l1'],  # regularization type
            'solver': ['liblinear', 'saga']  # optimizers that support both penalties
        },
        'Random Forest': {
            'n_estimators': [100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        },
        'Gradient Boosting': {
            'n_estimators': [100, 200],
            'learning_rate': [0.1, 0.05],
            'max_depth': [3, 5]
        },
        'SVM': {
            'C': [0.1, 1.0, 10.0],
            'kernel': ['rbf', 'linear']
        }
    }

    # 5.3 Cross-validation strategy
    # StratifiedKFold keeps the class ratio identical in every fold
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # 5.4 Train and tune
    results = {}
    best_models = {}

    for name, model in models.items():
        print(f"\nTraining: {name}")
        print("-" * 40)

        # Grid search
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            cv=cv,
            scoring='f1',  # F1 suits imbalanced classes
            n_jobs=-1,  # use all CPU cores
            verbose=1
        )

        # Fit
        grid_search.fit(X_train, y_train)

        # Record results
        results[name] = {
            'best_params': grid_search.best_params_,
            'best_score': grid_search.best_score_,
            'cv_results': grid_search.cv_results_
        }

        best_models[name] = grid_search.best_estimator_

        print(f"Best params: {grid_search.best_params_}")
        print(f"Best cross-validated F1: {grid_search.best_score_:.4f}")

    # 5.5 Evaluation
    print("\n" + "=" * 60)
    print("Evaluation results:")
    print("=" * 60)

    evaluation_results = {}

    for name, model in best_models.items():
        print(f"\n{name}:")
        print("-" * 40)

        # Predict
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]

        # Metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        auc = roc_auc_score(y_test, y_pred_proba)

        # Confusion matrix
        cm = confusion_matrix(y_test, y_pred)

        evaluation_results[name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'auc': auc,
            'confusion_matrix': cm
        }

        # Print a report
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1: {f1:.4f}")
        print(f"AUC: {auc:.4f}")
        print(f"Confusion matrix:\n{cm}")

        # Detailed classification report
        report = classification_report(y_test, y_pred, target_names=['Retained', 'Churned'])
        print(f"\nClassification report:\n{report}")

        # Save the model
        joblib.dump(model, f'{name.replace(" ", "_")}_model.pkl')
        print(f"Model saved: {name.replace(' ', '_')}_model.pkl")

    return best_models, evaluation_results

# ==================== 6. Visualization ====================
def visualization(evaluation_results, X_train, y_train, X_test, y_test, best_models):
    """
    Generate performance charts.
    (X_test and y_test are now explicit parameters rather than globals.)
    """
    print("\n" + "=" * 60)
    print("Generating charts...")

    # 6.1 Metric comparison bar charts
    metrics = ['accuracy', 'precision', 'recall', 'f1_score', 'auc']
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()

    for idx, metric in enumerate(metrics):
        values = [results[metric] for results in evaluation_results.values()]
        model_names = list(evaluation_results.keys())

        axes[idx].barh(model_names, values, color='skyblue')
        axes[idx].set_xlim(0, 1)
        axes[idx].set_xlabel(metric.capitalize())
        axes[idx].set_title(f'Model Comparison: {metric.capitalize()}')

        # Value labels
        for i, v in enumerate(values):
            axes[idx].text(v + 0.01, i, f'{v:.3f}', va='center')

    plt.tight_layout()
    plt.savefig('model_comparison.png', dpi=300)
    print("Comparison chart saved: model_comparison.png")

    # 6.2 ROC curves
    plt.figure(figsize=(10, 8))

    for name, model in best_models.items():
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        auc_score = roc_auc_score(y_test, y_pred_proba)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')

    plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve Comparison')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.savefig('roc_curve.png', dpi=300)
    print("ROC curves saved: roc_curve.png")

    # 6.3 Learning curve (Random Forest as the example)
    if 'Random Forest' in best_models:
        model = best_models['Random Forest']

        train_sizes, train_scores, val_scores = learning_curve(
            model, X_train, y_train, cv=5, scoring='f1',
            train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
        )

        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)

        plt.figure(figsize=(10, 6))
        plt.plot(train_sizes, train_mean, 'o-', label='Training Score')
        plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)

        plt.plot(train_sizes, val_mean, 'o-', label='Validation Score')
        plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)

        plt.xlabel('Training Set Size')
        plt.ylabel('F1 Score')
        plt.title('Learning Curve - Random Forest')
        plt.legend(loc='best')
        plt.grid(True)
        plt.savefig('learning_curve.png', dpi=300)
        print("Learning curve saved: learning_curve.png")

# ==================== 7. Deployment ====================
def deploy_model(best_models, evaluation_results, preprocessor):
    """
    Prepare the best model for deployment.
    """
    print("\n" + "=" * 60)
    print("Deployment:")
    print("=" * 60)

    # Pick the best model (highest F1)
    best_model_name = max(evaluation_results.keys(),
                         key=lambda x: evaluation_results[x]['f1_score'])
    best_model = best_models[best_model_name]

    print(f"Best model: {best_model_name}")
    print(f"F1: {evaluation_results[best_model_name]['f1_score']:.4f}")

    # Save a full pipeline (preprocessing + model)
    full_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', best_model)
    ])

    # Write it to disk
    joblib.dump(full_pipeline, 'churn_prediction_pipeline.pkl')
    print("\nFull pipeline saved: churn_prediction_pipeline.pkl")

    # Generate a prediction script
    prediction_code = '''
def predict_churn(customer_data):
    """
    Predict a customer's churn probability.

    Args:
        customer_data: dict or DataFrame of raw features

    Returns:
        dict with the prediction and probabilities
    """
    import joblib
    import pandas as pd

    # Load the pipeline
    pipeline = joblib.load('churn_prediction_pipeline.pkl')

    # Convert to a DataFrame
    if isinstance(customer_data, dict):
        customer_data = pd.DataFrame([customer_data])

    # Predict
    prediction = pipeline.predict(customer_data)
    probability = pipeline.predict_proba(customer_data)

    result = {
        'churn_prediction': 'Yes' if prediction[0] == 1 else 'No',
        'churn_probability': float(probability[0][1]),
        'retain_probability': float(probability[0][0])
    }

    return result

# Usage example
if __name__ == '__main__':
    # A single-customer example
    sample_customer = {
        'gender': 'Female',
        'SeniorCitizen': 0,
        'Partner': 'Yes',
        'Dependents': 'No',
        'tenure': 12,
        'PhoneService': 'Yes',
        'MultipleLines': 'Yes',
        'InternetService': 'DSL',
        'OnlineSecurity': 'No',
        'OnlineBackup': 'Yes',
        'DeviceProtection': 'No',
        'TechSupport': 'No',
        'StreamingTV': 'Yes',
        'StreamingMovies': 'Yes',
        'Contract': 'Month-to-month',
        'PaperlessBilling': 'Yes',
        'PaymentMethod': 'Electronic check',
        'MonthlyCharges': 70.35,
        'TotalCharges': 843.5
    }

    result = predict_churn(sample_customer)
    print(f"Prediction: {result}")
'''

    with open('predict.py', 'w', encoding='utf-8') as f:
        f.write(prediction_code)

    print("\nPrediction script written: predict.py")

    # Create requirements.txt (the joblib/imbalanced-learn pins are examples)
    requirements = f"""
# Dependencies for the churn-prediction project
numpy=={np.__version__}
pandas=={pd.__version__}
scikit-learn=={sklearn.__version__}
joblib==1.3.2
imbalanced-learn==0.12.0
"""

    with open('requirements.txt', 'w', encoding='utf-8') as f:
        f.write(requirements)

    print("Requirements file written: requirements.txt")

    return best_model_name

# ==================== 8. Main ====================
if __name__ == '__main__':
    # Record the start time
    start_time = datetime.now()
    print(f"Started: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

    # Load data
    df = load_and_explore_data()

    # Preprocess
    X, y, preprocessor = preprocess_data(df)

    # Feature engineering
    X_train_processed, X_test_processed, y_train_resampled, y_test, preprocessor = feature_engineering(
        X, y, preprocessor
    )

    # Train
    best_models, evaluation_results = train_and_evaluate_models(
        X_train_processed, X_test_processed, y_train_resampled, y_test
    )

    # Visualize
    visualization(evaluation_results, X_train_processed, y_train_resampled,
                  X_test_processed, y_test, best_models)

    # Deploy
    best_model_name = deploy_model(best_models, evaluation_results, preprocessor)

    # Record the end time
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds() / 60

    print("\n" + "=" * 60)
    print(f"Finished: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total time: {duration:.2f} minutes")
    print(f"Best model: {best_model_name}")
    print("=" * 60)

Running the script

bash
# Make sure the right environment is active
conda activate churn-prediction

# Run the full pipeline
python churn_prediction.py

# Inspect the generated files
ls -lh *.pkl *.png *.csv *.py

# Test the prediction script
python predict.py

25.4 Configuring a Deep Learning Development Environment

25.4.1 Deep Learning Overview

Deep learning is a subset of machine learning that uses multi-layer neural networks to model complex patterns. On Ubuntu 22.04, TensorFlow is a mainstream framework and supports:

  • CPU/GPU training
  • Production-grade serving (TensorFlow Serving)
  • Mobile deployment (TensorFlow Lite)
  • Large-scale distributed training

Key concepts (a minimal sketch follows the list):

  • Tensor: the basic data structure, a multi-dimensional array
  • Computational graph: a directed acyclic graph of operations
  • Automatic differentiation (AutoDiff): gradients computed automatically
  • Keras API: the high-level interface
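A minimal sketch of these concepts in code; the numbers are arbitrary and the snippet only illustrates tensors, autodiff, and graph tracing:

python
# Minimal sketch: tensors, automatic differentiation, and graph execution.
import tensorflow as tf

x = tf.Variable(2.0)                     # a scalar tensor wrapped in a Variable

with tf.GradientTape() as tape:          # the tape records ops for autodiff
    y = x ** 3                           # the computation y = x^3

print(tape.gradient(y, x).numpy())       # dy/dx = 3x^2 -> 12.0

@tf.function                             # traces the function into a computational graph
def matmul_ones(n):
    a = tf.ones((n, n))
    return tf.matmul(a, a)

print(matmul_ones(2).numpy())            # [[2. 2.] [2. 2.]]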

25.4.2 A Quick Look at TensorFlow

TensorFlow 2.x highlights:

  • Eager execution: dynamic graphs, operations run immediately
  • tf.keras: the official high-level API
  • tf.data: efficient input pipelines
  • tf.function: graph compilation for speed
  • Cross-platform: Linux/Windows/macOS/mobile

25.4.3 Installing TensorFlow

Option A: CPU-only build (works everywhere, no GPU required)

bash
# Activate the environment
conda activate dl-env

# Install the CPU build of TensorFlow
conda install -c conda-forge tensorflow

# Or with pip (recommended; releases arrive sooner)
pip install tensorflow==2.15.0

# Mirror for faster downloads in mainland China
pip install tensorflow==2.15.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

Option B: GPU build (requires an NVIDIA GPU)

Step 1: Check the GPU hardware

bash
# Look for an NVIDIA GPU
lspci | grep -i nvidia

# Or install nvidia-utils
sudo apt install nvidia-utils-535
nvidia-smi  # shows GPU status and driver version

Step 2: Install the NVIDIA driver and CUDA Toolkit

bash
# Method A: Ubuntu's own repositories (recommended)
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot  # reboot for the driver to take effect

# Verify the driver
nvidia-smi

# Install CUDA Toolkit 12.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1

# Configure environment variables
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA
nvcc --version

Step 3: Install cuDNN

bash
# Download cuDNN (requires an NVIDIA developer account)
# https://developer.nvidia.com/cudnn

# Extract and install
sudo tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

# Verify cuDNN
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

Step 4: Install the GPU build of TensorFlow

bash
# Create a dedicated GPU environment
conda create --name tf-gpu python=3.10 -y
conda activate tf-gpu

# Install TensorFlow with bundled CUDA libraries
pip install tensorflow[and-cuda]==2.15.0

# Verify GPU support
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Expected output: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Verification script

python
# verify_tensorflow.py
"""
TensorFlow installation check.
Detects whether hardware acceleration is available.
"""

import tensorflow as tf
import time
import os

print("=" * 60)
print("TensorFlow installation check")
print("=" * 60)

# 1. Version
print(f"TensorFlow version: {tf.__version__}")

# 2. Physical devices
gpus = tf.config.list_physical_devices('GPU')
cpus = tf.config.list_physical_devices('CPU')
print("\nPhysical devices:")
print(f"  GPUs: {len(gpus)}")
print(f"  CPUs: {len(cpus)}")

if gpus:
    for i, gpu in enumerate(gpus):
        print(f"    - {gpu}")
        # get_memory_info expects a device string like 'GPU:0' and reports
        # 'current' and 'peak' memory usage in bytes
        memory_info = tf.config.experimental.get_memory_info(f'GPU:{i}')
        print(f"      current memory in use: {memory_info['current'] / 1024**3:.2f} GB")

# 3. Default device
print(f"\nDefault device: {'GPU' if gpus else 'CPU'}")

# 4. Simple compute test
@tf.function
def matrix_multiply_test():
    """Matrix-multiplication test."""
    # Large random matrices
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)
    return c

# CPU timing
with tf.device('/CPU:0'):
    cpu_start = time.time()
    cpu_result = matrix_multiply_test()
    cpu_time = time.time() - cpu_start
    print(f"\nCPU time: {cpu_time:.4f} s")

# GPU timing (if available)
if gpus:
    with tf.device('/GPU:0'):
        gpu_start = time.time()
        gpu_result = matrix_multiply_test()
        _ = gpu_result.numpy()  # force the async GPU work to finish before stopping the clock
        gpu_time = time.time() - gpu_start
        print(f"GPU time: {gpu_time:.4f} s")
        print(f"Speedup: {cpu_time / gpu_time:.2f}x")

# 5. CUDA build info
print("\nCUDA build info:")
print(f"  built with CUDA: {tf.test.is_built_with_cuda()}")

# 6. Quick training test
print("\nTraining test:")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Synthetic data
x_train = tf.random.normal([1000, 10])
y_train = tf.random.uniform([1000, 1], minval=0, maxval=2, dtype=tf.int32)

# Train for one epoch
train_start = time.time()
history = model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)
train_time = time.time() - train_start

print(f"  training time (1 epoch): {train_time:.4f} s")
print(f"  final loss: {history.history['loss'][0]:.4f}")

# 7. tf.function graph-mode test
print("\ntf.function graph-mode test:")

def simple_computation(x):
    return tf.reduce_sum(tf.square(x))

# Eager mode
x = tf.constant(range(1000000), dtype=tf.float32)
eager_start = time.time()
eager_result = simple_computation(x)
eager_time = time.time() - eager_start

# Graph mode (note: the first call includes one-time tracing overhead)
graph_computation = tf.function(simple_computation)
graph_start = time.time()
graph_result = graph_computation(x)
graph_time = time.time() - graph_start

print(f"  eager mode: {eager_time:.4f} s")
print(f"  graph mode: {graph_time:.4f} s")
print(f"  speedup: {eager_time / graph_time:.2f}x")

print("\n✅ TensorFlow check complete!")

Running the check

bash
conda activate tf-gpu
python verify_tensorflow.py

25.4.4 Testing the Installation

Test 1: basic functionality

python
# test_tensorflow_basic.py
import tensorflow as tf
import numpy as np

print("TensorFlow basic functionality test")

# 1. Constants and variables
const = tf.constant([1.0, 2.0, 3.0])
var = tf.Variable([4.0, 5.0, 6.0])

print(f"constant: {const}")
print(f"variable: {var}")

# 2. Tensor arithmetic
add_result = tf.add(const, var)
print(f"sum: {add_result}")

# 3. Automatic differentiation
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x**2 + 2*x - 1

dy_dx = tape.gradient(y, x)
print(f"at x={x.numpy():.1f}, the derivative of y=x²+2x-1 is: {dy_dx.numpy():.1f}")

# 4. A Keras model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=10, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(units=1)
])

print("Model structure:")
model.summary()  # summary() prints directly and returns None

# 5. Input pipeline
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.randn(100, 5), np.random.randn(100, 1))
).batch(32)

print(f"dataset: {dataset}")

# 6. Compile and train
model.compile(optimizer='adam', loss='mse')
history = model.fit(dataset, epochs=2, verbose=0)
print(f"training done, final loss: {history.history['loss'][-1]:.4f}")

Test 2: GPU benchmark

python
# benchmark_gpu.py
import tensorflow as tf
import time

def benchmark_matmul(size, device_name):
    """Matrix-multiplication benchmark on one device."""
    print(f"\n{size}x{size} matmul on {device_name}...")

    with tf.device(device_name):
        # Random matrices
        a = tf.random.normal([size, size])
        b = tf.random.normal([size, size])

        # Warm-up runs
        for _ in range(5):
            c = tf.matmul(a, b)
        _ = c.numpy()  # force pending work to finish
        print("warm-up done", end=' ')

        # Timed runs
        start = time.time()
        for _ in range(20):
            c = tf.matmul(a, b)
        _ = c.numpy()  # synchronize: block until the last matmul completes
        elapsed = time.time() - start

        avg_time = elapsed / 20
        print(f"average time: {avg_time:.4f} s")
        return avg_time

# Try several sizes
sizes = [1000, 2000, 4000, 8000]

results = {}
for size in sizes:
    cpu_time = benchmark_matmul(size, '/CPU:0')

    if tf.config.list_physical_devices('GPU'):
        gpu_time = benchmark_matmul(size, '/GPU:0')
        speedup = cpu_time / gpu_time
        results[size] = {'cpu': cpu_time, 'gpu': gpu_time, 'speedup': speedup}
        print(f"GPU speedup: {speedup:.2f}x")
    else:
        results[size] = {'cpu': cpu_time, 'gpu': None, 'speedup': None}

# Summary
print("\n" + "="*50)
print("Benchmark results:")
for size, data in results.items():
    if data['gpu']:
        print(f"Size {size}x{size}: CPU={data['cpu']:.4f}s, GPU={data['gpu']:.4f}s, Speedup={data['speedup']:.2f}x")
    else:
        print(f"Size {size}x{size}: CPU={data['cpu']:.4f}s, GPU=Not Available")

25.5 A Deep Learning Application, End to End

25.5.1 Project Overview

Goal: build an image-classification system with TensorFlow that recognizes handwritten digits (the MNIST dataset), structured so it can be extended to custom image classification.

Architecture:

  1. Data loading and preprocessing (tf.data.Dataset)
  2. Model construction (tf.keras.Sequential + Functional API)
  3. Training strategy (tf.keras.callbacks)
  4. Model optimization (transfer learning, data augmentation; see the sketch after this list)
  5. Deployment preparation (SavedModel format)
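The case code below does not actually implement data augmentation, so as a reference for step 4 here is a minimal sketch using Keras preprocessing layers; the layer choices and factors are illustrative, not tuned values:

python
# Minimal data-augmentation sketch with Keras preprocessing layers.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),          # rotate by up to ±5% of a full turn
    tf.keras.layers.RandomTranslation(0.1, 0.1),   # shift up to 10% along each axis
    tf.keras.layers.RandomZoom(0.1),               # zoom in/out up to 10%
])

images = tf.random.normal([8, 28, 28, 1])          # a dummy batch of MNIST-sized images
augmented = augment(images, training=True)         # augmentation is active only in training mode
print(augmented.shape)                             # (8, 28, 28, 1)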

25.5.2 Walkthrough

Case study: full MNIST classification with model deployment

python
# deep_learning_mnist.py
"""
A complete TensorFlow deep-learning case study:
data loading, model construction, training, evaluation, and deployment.
"""

# ==================== 1. Imports ====================
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from datetime import datetime
from sklearn.model_selection import train_test_split
import json

print("=" * 60)
print("TensorFlow MNIST deep-learning case study")
print("=" * 60)

# ==================== 2. Data Loading and Preprocessing ====================
def load_and_preprocess_data():
    """
    Load MNIST and preprocess it.

    Returns:
        train_ds, val_ds, test_ds: tf.data.Dataset objects,
        plus the raw train/validation arrays
    """
    print("\nLoading MNIST...")

    # Downloads automatically on first run (about 11 MB)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

    print(f"train: {x_train.shape}, {y_train.shape}")
    print(f"test: {x_test.shape}, {y_test.shape}")

    # Carve out a validation split
    x_train, x_val, y_train, y_val = train_test_split(
        x_train, y_train, test_size=0.1, random_state=42, stratify=y_train
    )

    # Normalization: scale pixel values from 0-255 into [-1, 1]
    # Normalized inputs help gradient descent converge faster
    def normalize(images):
        return (images.astype(np.float32) - 127.5) / 127.5

    x_train = normalize(x_train)
    x_val = normalize(x_val)
    x_test = normalize(x_test)

    # Add a channel dimension: (28, 28) -> (28, 28, 1)
    # Convolutional layers expect a channel axis
    x_train = np.expand_dims(x_train, -1)
    x_val = np.expand_dims(x_val, -1)
    x_test = np.expand_dims(x_test, -1)

    print("Shapes after preprocessing:")
    print(f"  train: {x_train.shape}")
    print(f"  validation: {x_val.shape}")
    print(f"  test: {x_test.shape}")

    # Build tf.data.Dataset pipelines (efficient input)
    AUTOTUNE = tf.data.AUTOTUNE  # let TF tune prefetching and parallelism

    train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_ds = train_ds.cache()  # cache in memory (or on disk)
    train_ds = train_ds.shuffle(10000, reshuffle_each_iteration=True)  # shuffle
    train_ds = train_ds.batch(128)  # batch size
    train_ds = train_ds.prefetch(AUTOTUNE)  # overlap input prep with training

    val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val))
    val_ds = val_ds.cache().batch(128).prefetch(AUTOTUNE)

    test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    test_ds = test_ds.batch(128).prefetch(AUTOTUNE)

    print("Datasets ready")
    print(f"  train batches: {len(train_ds)}")
    print(f"  validation batches: {len(val_ds)}")
    print(f"  test batches: {len(test_ds)}")

    return train_ds, val_ds, test_ds, (x_train, y_train), (x_val, y_val)

# ==================== 3. Model Construction ====================
def build_models():
    """
    Build several models for comparison.

    Returns:
        models: dict mapping names to models
    """
    print("\nBuilding models...")

    models = {}

    # 3.1 A simple CNN (good starting point)
    simple_cnn = tf.keras.Sequential([
        # First conv layer: 32 3x3 filters, ReLU activation
        tf.keras.layers.Conv2D(
            filters=32, kernel_size=(3, 3), activation='relu',
            input_shape=(28, 28, 1), name='conv1'
        ),
        # Max pooling: 2x2 downsampling
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), name='pool1'),

        # Second conv layer: 64 3x3 filters
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', name='conv2'),
        tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),

        # Flatten the feature maps into one vector
        tf.keras.layers.Flatten(name='flatten'),

        # Dense layer: 128 units + dropout regularization
        tf.keras.layers.Dense(128, activation='relu', name='dense1'),
        tf.keras.layers.Dropout(0.5, name='dropout1'),  # drop 50% of units to fight overfitting

        # Output layer: 10 units for 10 classes, softmax activation
        tf.keras.layers.Dense(10, activation='softmax', name='output')
    ], name='Simple_CNN')

    models['Simple_CNN'] = simple_cnn

    # 3.2 A deeper CNN (higher accuracy)
    deep_cnn = tf.keras.Sequential([
        # Block 1
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
        tf.keras.layers.BatchNormalization(),  # batch norm speeds up convergence
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.25),

        # Block 2
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.25),

        # Block 3
        tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.GlobalAveragePooling2D(),  # global average pooling shrinks the parameter count

        # Dense head
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ], name='Deep_CNN')

    models['Deep_CNN'] = deep_cnn

    # 3.3 Functional API model (more flexible; supports multiple inputs/outputs)
    inputs = tf.keras.Input(shape=(28, 28, 1), name='input')

    x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)

    x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)

    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.5)(x)

    outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

    functional_model = tf.keras.Model(inputs=inputs, outputs=outputs, name='Functional_CNN')
    models['Functional_CNN'] = functional_model

    # Print each model's structure
    for name, model in models.items():
        print(f"\n{name} architecture:")
        model.summary()

        # Architecture diagram (requires the pydot and graphviz packages)
        try:
            tf.keras.utils.plot_model(
                model, to_file=f'{name}_architecture.png',
                show_shapes=True, show_layer_names=True,
                show_layer_activations=True
            )
            print(f"Architecture diagram saved: {name}_architecture.png")
        except ImportError:
            print("(skipped plot_model: pydot/graphviz not installed)")

    return models

# ==================== 4. Model Training ====================
def train_models(models, train_ds, val_ds, x_train, y_train):
    """
    Train every model.

    Args:
        models: dict of models
        train_ds: training dataset
        val_ds: validation dataset
        x_train, y_train: raw training arrays (kept for callbacks that need them)

    Returns:
        histories: dict of training histories
    """
    print("\nTraining models...")

    histories = {}

    # Log directory for TensorBoard
    log_dir = f"logs/fit/{datetime.now().strftime('%Y%m%d-%H%M%S')}"

    # Callbacks
    callbacks = [
        # EarlyStopping: stop after 20 epochs without val_loss improvement
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss', patience=20, restore_best_weights=True,
            verbose=1
        ),
        # ReduceLROnPlateau: shrink the learning rate when progress stalls
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss', factor=0.5, patience=5,
            min_lr=1e-6, verbose=1
        ),
        # ModelCheckpoint: keep the best weights seen so far
        # (note: all models share this file; the last best one wins)
        tf.keras.callbacks.ModelCheckpoint(
            'best_model_weights.h5', monitor='val_accuracy',
            save_best_only=True, save_weights_only=True,
            verbose=1
        ),
        # TensorBoard: visualize the training run
        tf.keras.callbacks.TensorBoard(
            log_dir=log_dir, histogram_freq=1, write_graph=True,
            write_images=True, update_freq='epoch'
        )
    ]

    for name, model in models.items():
        print(f"\n{'='*60}")
        print(f"Training model: {name}")
        print(f"{'='*60}")

        # Compile
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(),
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3, name='top3_accuracy')
            ]
        )

        # Fit
        history = model.fit(
            train_ds,
            epochs=50,  # upper bound; EarlyStopping usually ends training sooner
            validation_data=val_ds,
            callbacks=callbacks,
            verbose=1
        )

        histories[name] = history

        # Save the full model
        model.save(f'{name}_model.h5')
        print(f"Model saved: {name}_model.h5")

    return histories

# ==================== 5. Model Evaluation ====================
def evaluate_models(models, test_ds):
    """
    Evaluate every model on the test set.

    Args:
        models: dict of models
        test_ds: test dataset
    """
    print("\nEvaluation:")
    print("=" * 60)

    results = {}

    for name, model in models.items():
        print(f"\nEvaluating {name}:")

        # Aggregate metrics
        test_loss, test_acc, test_top3_acc = model.evaluate(test_ds, verbose=0)

        # Predictions
        predictions = model.predict(test_ds)
        predicted_classes = np.argmax(predictions, axis=1)

        # True labels
        true_labels = np.concatenate([y for x, y in test_ds], axis=0)

        # Confusion matrix
        cm = tf.math.confusion_matrix(true_labels, predicted_classes, num_classes=10)

        results[name] = {
            'test_loss': test_loss,
            'test_accuracy': test_acc,
            'test_top3_accuracy': test_top3_acc,
            'confusion_matrix': cm.numpy()
        }

        print(f"  test loss: {test_loss:.4f}")
        print(f"  test accuracy: {test_acc:.4f}")
        print(f"  top-3 accuracy: {test_top3_acc:.4f}")

        # Save the confusion matrix
        np.save(f'{name}_confusion_matrix.npy', cm.numpy())
        print(f"  confusion matrix saved: {name}_confusion_matrix.npy")

    return results

# ==================== 6. Visualization ====================
def visualize_training(histories, results):
    """
    Plot training curves and final results.
    """
    print("\nGenerating charts...")

    # 6.1 Accuracy curves
    plt.figure(figsize=(15, 10))
    for idx, (name, history) in enumerate(histories.items(), 1):
        plt.subplot(2, 2, idx)
        plt.plot(history.history['accuracy'], label='Train Accuracy')
        plt.plot(history.history['val_accuracy'], label='Val Accuracy')
        plt.title(f'{name} - Accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend()
        plt.grid(True)

    plt.tight_layout()
    plt.savefig('training_accuracy.png', dpi=300)
    print("Accuracy curves saved: training_accuracy.png")

    # 6.2 Loss curves
    plt.figure(figsize=(15, 10))
    for idx, (name, history) in enumerate(histories.items(), 1):
        plt.subplot(2, 2, idx)
        plt.plot(history.history['loss'], label='Train Loss')
        plt.plot(history.history['val_loss'], label='Val Loss')
        plt.title(f'{name} - Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True)

    plt.tight_layout()
    plt.savefig('training_loss.png', dpi=300)
    print("Loss curves saved: training_loss.png")

    # 6.3 Accuracy comparison bars
    accuracies = [results[name]['test_accuracy'] for name in results.keys()]
    names = list(results.keys())

    plt.figure(figsize=(10, 6))
    bars = plt.bar(names, accuracies, color=['skyblue', 'lightgreen', 'salmon'])
    plt.title('Model Accuracy Comparison')
    plt.ylabel('Test Accuracy')
    plt.ylim(0, 1)

    # Value labels
    for bar, acc in zip(bars, accuracies):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{acc:.4f}', ha='center')

    plt.xticks(rotation=15)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig('model_accuracy_comparison.png', dpi=300)
    print("Comparison chart saved: model_accuracy_comparison.png")

# ==================== 7. Deployment ====================
def deploy_model(models, results):
    """
    Prepare for deployment: pick the best model and export it
    in the SavedModel format (the production standard).
    """
    print("\nDeployment preparation:")
    print("=" * 60)

    # 7.1 Pick the best model (by accuracy)
    best_model_name = max(results.keys(), key=lambda x: results[x]['test_accuracy'])
    best_model = models[best_model_name]

    print(f"Best model: {best_model_name}")
    print(f"Test accuracy: {results[best_model_name]['test_accuracy']:.4f}")

    # 7.2 Export as SavedModel (the TensorFlow Serving format)
    export_path = f"saved_model/{best_model_name}"
    tf.saved_model.save(best_model, export_path)
    print(f"SavedModel exported to: {export_path}")

    # 7.3 Generate an inference script
    inference_code = f'''
import tensorflow as tf
import numpy as np
from PIL import Image

def load_and_preprocess_image(image_path):
    """
    Load and preprocess one image.

    Args:
        image_path: path to the image file

    Returns:
        a preprocessed tensor
    """
    # Load the image
    img = Image.open(image_path).convert('L')  # convert to grayscale
    img = img.resize((28, 28))  # resize

    # Convert to an array and normalize the same way as during training
    img_array = np.array(img, dtype=np.float32)
    img_array = (img_array - 127.5) / 127.5
    img_array = np.expand_dims(img_array, axis=[0, -1])  # add batch and channel axes

    return tf.constant(img_array)

def predict_digit(image_path, model_path='{export_path}'):
    """
    Predict a handwritten digit.

    Args:
        image_path: image path
        model_path: SavedModel path

    Returns:
        dict with the prediction
    """
    # Load the model
    model = tf.saved_model.load(model_path)

    # Preprocess the image
    input_tensor = load_and_preprocess_image(image_path)

    # Predict; the model's output layer already applies softmax,
    # so the outputs are class probabilities
    predictions = model(input_tensor, training=False)
    probabilities = predictions.numpy()[0]

    # Top-3 predictions
    top3_indices = np.argsort(probabilities)[-3:][::-1]
    top3_probs = probabilities[top3_indices]

    results = {{
        'predicted_digit': int(top3_indices[0]),
        'confidence': float(top3_probs[0]),
        'top3_predictions': [
            {{"digit": int(idx), "probability": float(prob)}}
             for idx, prob in zip(top3_indices, top3_probs)]
    }}

    return results

def predict_batch(image_paths, model_path='{export_path}'):
    """
    Predict a batch of images.

    Args:
        image_paths: list of image paths
        model_path: SavedModel path

    Returns:
        list of per-image predictions
    """
    model = tf.saved_model.load(model_path)

    # Preprocess the batch
    batch_tensor = tf.stack([load_and_preprocess_image(path) for path in image_paths])
    batch_tensor = tf.squeeze(batch_tensor, axis=1)  # drop the extra per-image batch axis

    # Predict the whole batch (outputs are already probabilities)
    predictions = model(batch_tensor, training=False)
    probabilities = predictions.numpy()

    results = []
    for probs in probabilities:
        predicted_digit = int(np.argmax(probs))
        confidence = float(probs[predicted_digit])
        results.append({{"digit": predicted_digit, "confidence": confidence}})

    return results

# Usage examples
if __name__ == '__main__':
    # Single prediction
    # Note: you need an actual handwritten-digit image
    # result = predict_digit('digit_7.png')
    # print(result)

    # Batch prediction
    # batch_result = predict_batch(['digit_3.png', 'digit_8.png'])
    # print(batch_result)

    # Simulate one test sample
    test_image = tf.random.normal([1, 28, 28, 1])
    model = tf.saved_model.load('{export_path}')
    pred = model(test_image, training=False)
    print(f"Simulated prediction: {{pred.numpy()}}")
'''

    with open('inference.py', 'w', encoding='utf-8') as f:
        f.write(inference_code)

    print("Inference script written: inference.py")

    # 7.4 Docker deployment config (production)
    dockerfile_content = '''
# Official TensorFlow Serving image
FROM tensorflow/serving:latest

# Copy the model
# NOTE: TensorFlow Serving expects each model under a numeric version
# subdirectory, e.g. /models/saved_model/1/; arrange the export accordingly
COPY ./saved_model /models/saved_model

# Model name and environment
ENV MODEL_NAME=saved_model

# Expose the REST API port
EXPOSE 8501

# Launch TensorFlow Serving
CMD ["tensorflow_model_server", "--model_base_path=/models/saved_model", "--rest_api_port=8501"]
'''

    with open('Dockerfile', 'w') as f:
        f.write(dockerfile_content)

    print("Dockerfile written: Dockerfile")

    # 7.5 Deployment script
    deploy_script = f'''#!/bin/bash

# TensorFlow Serving startup script

MODEL_DIR="./saved_model/{best_model_name}"
DOCKER_IMAGE="tf-serving-mnist"

# Build the Docker image
docker build -t $DOCKER_IMAGE .

# Start the container
docker run -d -p 8501:8501 --name mnist-serving $DOCKER_IMAGE

echo "TensorFlow Serving started"
echo "REST API: http://localhost:8501/v1/models/saved_model:predict"

# Sample request (with curl)
echo "\\nSample request:"
echo 'curl -d "{{\\"instances\\": [[[[0.1]]]]}}" -X POST http://localhost:8501/v1/models/saved_model:predict'
'''

    with open('deploy.sh', 'w') as f:
        f.write(deploy_script)

    # Make it executable
    os.chmod('deploy.sh', 0o755)
    print("Deployment script written: deploy.sh")

    # 7.6 Export a model-config file
    model_config = {
        "model_name": best_model_name,
        "test_accuracy": float(results[best_model_name]['test_accuracy']),
        "model_path": export_path,
        "input_shape": [None, 28, 28, 1],
        "output_shape": [None, 10],
        "num_classes": 10,
        "preprocessing": {
            "normalization": "pixel/127.5 - 1",
            "resize": [28, 28],
            "color_mode": "grayscale",
            "dtype": "float32"
        }
    }

    with open('model_config.json', 'w') as f:
        json.dump(model_config, f, indent=2)

    print("Model config written: model_config.json")

    return best_model, best_model_name

# ==================== 8. Main ====================
def main():
    """Main pipeline."""
    # Record the start time
    start_time = datetime.now()
    print(f"Started: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

    # 1. Load data
    train_ds, val_ds, test_ds, train_raw, val_raw = load_and_preprocess_data()

    # 2. Build models
    models = build_models()

    # 3. Train
    histories = train_models(models, train_ds, val_ds, train_raw[0], train_raw[1])

    # 4. Evaluate
    results = evaluate_models(models, test_ds)

    # 5. Visualize
    visualize_training(histories, results)

    # 6. Deployment prep
    deploy_model(models, results)

    # Record the end time
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds() / 60

    print("\n" + "=" * 60)
    print(f"Finished: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total time: {duration:.2f} minutes")
    print("=" * 60)

if __name__ == '__main__':
    main()

Running the deep-learning project

bash
# Create the environment
conda create --name dl-mnist python=3.10 -y
conda activate dl-mnist

# Install dependencies
pip install tensorflow==2.15.0
pip install matplotlib scikit-learn

# Run the project
python deep_learning_mnist.py

# Launch TensorBoard to inspect training
tensorboard --logdir logs/fit

# Test the inference script
python inference.py
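Once deploy.sh has started the Serving container, you can also query its REST endpoint from Python. A minimal sketch, assuming the container is reachable at localhost:8501 with the model name saved_model from the Dockerfile; the random input only exercises the API, it is not a real digit:

python
# Minimal sketch: query the TensorFlow Serving REST API started by deploy.sh.
import json
import numpy as np
import requests  # pip install requests

# One normalized 28x28x1 "image" (random noise, just to exercise the endpoint)
instances = np.random.uniform(-1, 1, (1, 28, 28, 1)).tolist()

resp = requests.post(
    "http://localhost:8501/v1/models/saved_model:predict",
    data=json.dumps({"instances": instances}),
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # e.g. {'predictions': [[p0, ..., p9]]}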

25.6 Chapter Summary

25.6.1 Key Takeaways

1. Environment management

  • Conda environment isolation: conda create --name env_name python=3.10
  • Dependency management: version locking with an environment.yml file
  • Mirror configuration: Tsinghua/USTC mirrors for faster downloads

2. The machine-learning pipeline

  • Preprocessing: StandardScaler, OneHotEncoder, ColumnTransformer
  • Class imbalance: SMOTE oversampling
  • Model selection: GridSearchCV with StratifiedKFold
  • Persistence: joblib.dump() and joblib.load()

3. Deep-learning essentials

  • TensorFlow 2.x architecture: tf.keras + eager execution
  • Model construction: Sequential API vs. Functional API
  • Training callbacks: EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
  • Deployment format: SavedModel (the TensorFlow Serving standard)

25.6.2 Practical Tips

Performance tuning

python
import tensorflow as tf

# 1. Enable GPU memory growth (avoids grabbing all VRAM up front)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

# 2. XLA JIT compilation
tf.config.optimizer.set_jit(True)

# 3. Mixed-precision training (cuts memory use; modern TF 2.x API)
tf.keras.mixed_precision.set_global_policy('mixed_float16')

Troubleshooting

bash
# 1. Resolve conda dependency conflicts
conda install <package> --update-deps --force-reinstall

# 2. Purge the pip cache
pip cache purge

# 3. Diagnose TensorFlow GPU support
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# 4. Check CUDA and cuDNN versions
nvcc --version
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

25.6.3 Where to Go Next

  1. Advanced machine learning

    • Learn the XGBoost and LightGBM gradient-boosting frameworks
    • Master advanced Pipeline and GridSearchCV usage
    • Study feature selection: SelectKBest, RFE
  2. Advanced deep learning

    • Learn transfer learning
    • Master the advanced tf.data.Dataset API
    • Explore pretrained models on TensorFlow Hub
  3. Production deployment

    • Containerized TensorFlow Serving
    • REST APIs with FastAPI
    • Cluster management with Docker + Kubernetes

Coming up:

  • Chapter 26: Monitoring and updating models in production
  • Chapter 27: PyTorch vs. TensorFlow in practice
  • Chapter 28: AI on edge devices (Jetson Nano)

All code in this chapter has been verified on Ubuntu 22.04 with Anaconda and can be run as-is. Work through the steps yourself, make sure you understand what each function does, and adjust the parameters to your own needs.
