Artificial Intelligence on Ubuntu 22.04
25.1 Preparing the Base Environment
25.1.1 Overview
To set up an AI development environment on Ubuntu 22.04 LTS, Anaconda is the recommended package and environment manager. Anaconda bundles a Python interpreter, the conda package manager, and a large collection of scientific-computing libraries; it resolves dependency conflicts effectively and supports isolated environments with multiple Python versions.
Key advantages:
- Precompiled binary packages, no compilation from source
- Environment isolation, so dependencies of different projects never interfere
- Cross-platform consistency (Linux/Windows/macOS)
- Bundled development tools such as Jupyter Notebook
25.1.2 Installing Anaconda
Step 1: Prepare the system
bash
# Refresh the package index
sudo apt update
# Install required dependencies (avoids missing-shared-library errors later)
sudo apt install -y libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
# Check the system architecture (x86_64 or aarch64)
uname -m  # example output: x86_64
Step 2: Download the Anaconda installer
bash
# Option 1: download directly with wget (recommended)
cd /tmp
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
# Option 2: download with curl
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
# Important: verify the installer's integrity (guards against a corrupted download)
sha256sum Anaconda3-2024.02-1-Linux-x86_64.sh
# Compare against the SHA256 checksum published on the official site
Step 3: Run the installation script
bash
# Run the installer
bash Anaconda3-2024.02-1-Linux-x86_64.sh
# Interactive prompts during installation:
# - Press Enter to page through the license agreement
# - Type 'yes' to accept the license
# - Confirm the install path (default ~/anaconda3; press Enter to accept)
# - Type 'yes' to initialize conda (important!)
Step 4: Verify the installation and configure the shell
bash
# Reload the shell configuration (the installer modified ~/.bashrc)
source ~/.bashrc
# Confirm the conda command is available
conda --version
# Example output: conda 24.1.2 (the exact version depends on the installer)
# Check the Python version (should be the Python bundled with Anaconda)
python --version
# Example output: Python 3.11.7
# Initialize conda (only if you skipped initialization in the previous step)
conda init bash
# For other shells: conda init zsh/fish/tcsh
Step 5: Configure conda mirrors (recommended for users in mainland China)
bash
# Use the Tsinghua mirrors for a substantial download speedup
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
# Show channel URLs in command output
conda config --set show_channel_urls yes
# Verify the configuration
conda config --show channels
Step 6: Update conda to the latest version
bash
# Update conda itself
conda update -n base -c defaults conda
# Update all packages in the base environment
conda update --all
25.1.3 Basic conda Usage
Environment management commands
bash
# 1. Create a new environment (core syntax)
conda create --name <env_name> python=<version>
# Example: create an ml-env environment with Python 3.10
conda create --name ml-env python=3.10
# Specify several packages at creation time
conda create --name dl-env python=3.9 numpy pandas
# Create from an environment.yml file (project reuse)
conda env create -f environment.yml
# 2. Activate / switch to an environment
conda activate ml-env
# The shell prompt gains a prefix: (ml-env) user@ubuntu:~$
# 3. Leave the current environment
conda deactivate
# 4. List all environments
conda env list
# or
conda info --envs
# Example output:
# base          *  /home/user/anaconda3
# ml-env           /home/user/anaconda3/envs/ml-env
# 5. Delete an environment (use with care)
conda env remove --name ml-env
# or
conda remove --name ml-env --all
# 6. Export an environment definition (version pinning)
conda env export > environment.yml
# The file lists every package with its exact version
# 7. Clone an environment
conda create --name new-env --clone old-env
Package management commands
bash
# 1. Install a package (dependencies resolved automatically)
conda install <package>
# Example: install a specific numpy version
conda install numpy=1.24.3
# Install several packages at once
conda install scipy matplotlib pandas
# Install from a specific channel
conda install -c conda-forge opencv
# 2. Remove a package
conda remove <package>
# or
conda uninstall <package>
# 3. Update a package
conda update <package>
# Update all packages
conda update --all
# 4. Search for a package
conda search <package>
# 5. List installed packages
conda list
# Output format: name, version, build string, channel
# 6. Show information about one package
conda list numpy
# Example output: numpy 1.24.3 py310h5f9d8c6_0 (conda-forge)
# 7. Clean caches (reclaim disk space)
conda clean --all
# Options:
# --packages: remove unused extracted packages
# --tarballs: remove downloaded archives
# --index-cache: remove the index cache
Managing Python versions
bash
# Switch the Python version inside an environment
conda activate ml-env
conda install python=3.11
# Verify
python --version
Example: building a complete ML development environment
bash
# Create and activate the environment
conda create --name ml-dev python=3.10 -y
conda activate ml-dev
# Install the core data-science libraries in one go
conda install -y numpy pandas matplotlib scikit-learn jupyter ipython
# Install a basic deep-learning stack
conda install -y pytorch torchvision torchaudio cpuonly -c pytorch
# Verify the installation
python -c "import numpy, pandas, sklearn, torch; print('All libraries imported successfully')"
25.2 Configuring a Machine Learning Development Environment
25.2.1 Machine Learning Overview
Machine learning uses algorithms to let computers learn patterns from data and make predictions without being explicitly programmed. On Ubuntu 22.04, Scikit-learn is the ML library of choice; as illustrated in the sketch below, it provides:
- Supervised learning: classification, regression
- Unsupervised learning: clustering, dimensionality reduction
- Model selection: cross-validation, grid search
- Preprocessing: feature extraction, standardization
Core dependencies:
- NumPy: multi-dimensional array operations
- SciPy: scientific computing
- Matplotlib: data visualization
- joblib: model persistence
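The following is a minimal sketch that touches each of these areas with the stack above. It assumes scikit-learn and its dependencies are already installed (installation follows in 25.2.2); the file name iris_clf.pkl is arbitrary.
python
# sklearn_stack_demo.py -- one small example per capability listed above
import numpy as np
import joblib
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)                # preprocessing: standardization

clf = LogisticRegression(max_iter=200).fit(X, y)     # supervised: classification
print(f"Training accuracy: {clf.score(X, y):.3f}")

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # unsupervised: clustering
print(f"Cluster sizes: {np.bincount(km.labels_)}")

joblib.dump(clf, "iris_clf.pkl")                     # persistence with joblib
print("Reloaded model:", joblib.load("iris_clf.pkl"))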
25.2.2 Installing Scikit-learn
Method 1: conda install (recommended)
bash
# Activate the target environment
conda activate ml-env
# Install scikit-learn (all dependencies are installed automatically)
conda install scikit-learn
# Install a specific version
conda install scikit-learn=1.4.0
# Install from the conda-forge channel (updated more quickly)
conda install -c conda-forge scikit-learn
Method 2: pip install (fallback)
bash
# After activating the environment
conda activate ml-env
# Upgrade pip first
pip install --upgrade pip
# Install scikit-learn
pip install scikit-learn
# Install a specific version
pip install scikit-learn==1.4.0
# Install a pre-release version
pip install --pre scikit-learn
# Install from source (for development)
pip install git+https://github.com/scikit-learn/scikit-learn.git
Method 3: install the full scientific stack (recommended for beginners)
bash
# Install all related libraries in one command (roughly 500 MB)
conda install numpy scipy matplotlib scikit-learn pandas jupyter
# Alternatively, use the Anaconda distribution (everything preinstalled)
# Download: https://www.anaconda.com/download
Verifying dependency compatibility
bash
# Show the installed scikit-learn and its dependency versions
conda list | grep -E "scikit-learn|numpy|scipy"
# Example output:
# numpy        1.24.3
# scipy        1.10.1
# scikit-learn 1.4.0
25.2.3 Testing the Installation
Test 1: basic import test
python
# test_sklearn_install.py
"""
Scikit-learn installation verification script.
Tests core module imports and basic functionality.
"""
import sys

# Catch import errors
try:
    import sklearn
    print(f"✓ scikit-learn version: {sklearn.__version__}")
except ImportError as e:
    print(f"✗ import failed: {e}")
    sys.exit(1)

# Test the core dependencies
try:
    import numpy as np
    import scipy
    import joblib
    print(f"✓ numpy version: {np.__version__}")
    print(f"✓ scipy version: {scipy.__version__}")
    print("✓ all dependencies imported successfully")
except ImportError as e:
    print(f"✗ dependency import failed: {e}")

# Functional test: generate a dataset and train a model
def test_basic_functionality():
    """Exercise scikit-learn's core functionality."""
    print("\nTesting basic functionality...")
    # Import the required modules
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # 1. Generate a synthetic dataset
    # make_classification parameters:
    #   n_samples: number of samples
    #   n_features: number of features
    #   n_informative: number of informative features
    #   n_redundant: number of redundant features
    #   random_state: random seed, for reproducible results
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=5,
        n_redundant=3,
        random_state=42
    )
    print(f"✓ dataset generated: X shape {X.shape}, y shape {y.shape}")

    # 2. Split into training and test sets
    # train_test_split parameters:
    #   test_size: fraction held out for testing
    #   random_state: random seed
    #   stratify: preserve the class distribution
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"✓ data split: train {X_train.shape}, test {X_test.shape}")

    # 3. Create and train a model
    # LogisticRegression parameters:
    #   max_iter: maximum number of iterations
    #   random_state: random seed
    #   solver: optimization algorithm
    model = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')
    model.fit(X_train, y_train)
    print("✓ model trained")

    # 4. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"✓ model accuracy: {accuracy:.4f}")

    # 5. Model persistence test
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Build a pipeline with preprocessing
    pipeline = Pipeline([
        ('scaler', StandardScaler()),                     # standardize the data
        ('classifier', LogisticRegression(max_iter=1000))
    ])
    pipeline.fit(X_train, y_train)
    print(f"✓ pipeline trained, test accuracy: {pipeline.score(X_test, y_test):.4f}")

    # Save the model
    import joblib
    joblib.dump(pipeline, 'test_model.pkl')
    print("✓ model saved")

    # Load the model
    loaded_model = joblib.load('test_model.pkl')
    print(f"✓ model loaded, accuracy after reload: {loaded_model.score(X_test, y_test):.4f}")

if __name__ == "__main__":
    test_basic_functionality()
    print("\n✅ All tests passed! Scikit-learn is installed and working")
Test 2: Jupyter Notebook integration test
python
# In a terminal, run:
#   jupyter notebook
# Then execute the following in a notebook:
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the classic Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Reduce to two dimensions with PCA for visualization
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Plot the result
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')
plt.show()
25.2.4 Updating or Removing Scikit-learn
Updating
bash
# Check the current version
conda list scikit-learn
# Update to the latest stable release
conda update scikit-learn
# Update from a specific channel
conda update -c conda-forge scikit-learn
# Update with pip (only if conda is unavailable)
pip install --upgrade scikit-learn
# Verify after updating
python -c "import sklearn; print(sklearn.__version__)"
Removing
bash
# Option 1: remove with conda (clean, handles dependencies automatically)
conda remove scikit-learn
# Option 2: remove with pip
pip uninstall scikit-learn
# Skip the confirmation prompt
pip uninstall scikit-learn -y
# Clean up leftovers
# Remove caches
conda clean --all
# Confirm the removal
python -c "import sklearn"  # should raise ModuleNotFoundError
Downgrading (to resolve compatibility issues)
bash
# Remove the current version
conda remove scikit-learn
# Install a specific older version
conda install scikit-learn=1.3.0
# or with pip
pip install scikit-learn==1.2.2
25.3 A Machine Learning Application Example
25.3.1 Project Overview
Project goal: build a customer-churn prediction system on a telecom dataset, using machine-learning models to identify customers likely to leave.
Technology stack:
- Data handling: Pandas + NumPy
- Visualization: Matplotlib + Seaborn
- Models: Scikit-learn (logistic regression, random forest, gradient boosting), plus XGBoost
- Evaluation: cross-validation, ROC curves, confusion matrices
- Deployment: model serialization with Joblib
Dataset characteristics (a quick way to inspect a real CSV follows below):
- 21 fields (customer ID, service type, charges, and so on)
- Label: whether the customer churned (Churn: Yes/No)
- Roughly 7,000 records
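Before diving into the full project, here is a quick way to take a first look at such a dataset. This is a hedged sketch: the file name telecom_churn.csv is a placeholder for whatever CSV you actually have (the project script below generates simulated data instead).
python
# First look at a churn CSV -- 'telecom_churn.csv' is a placeholder name
import pandas as pd

df = pd.read_csv('telecom_churn.csv')
print(df.shape)                                   # expect roughly (7000, 21)
print(df['Churn'].value_counts(normalize=True))   # class balance (churn rate)
print(df.isna().sum().sort_values(ascending=False).head())  # top missing-value columns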
25.3.2 Environment Preparation
bash
# Create a dedicated environment
conda create --name churn-prediction python=3.10 -y
# Activate it
conda activate churn-prediction
# Install the core libraries
conda install -y numpy pandas matplotlib seaborn scikit-learn jupyter
# Install XGBoost (high-performance gradient boosting)
conda install -y -c conda-forge xgboost
# Install auxiliary libraries
pip install -U imbalanced-learn  # handles class imbalance
pip install -U scikit-plot       # plotting helpers
# Start Jupyter
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
# To run it in the background
nohup jupyter notebook > jupyter.log 2>&1 &
25.3.3 Walking Through the Example
Case code: a complete customer-churn prediction project
python
# churn_prediction.py
"""
End-to-end telecom customer-churn prediction example.
Covers preprocessing, feature engineering, model training, evaluation,
and deployment.
"""
# ==================== 1. Imports ====================
import pandas as pd
import numpy as np
import sklearn                      # needed later for sklearn.__version__
import joblib                       # model persistence
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# sklearn core modules
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV,
    StratifiedKFold, learning_curve
)
from sklearn.preprocessing import (
    StandardScaler, LabelEncoder, OneHotEncoder,
    RobustScaler, MinMaxScaler
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    roc_curve, precision_recall_curve, accuracy_score,
    f1_score, precision_score, recall_score
)

# Machine-learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE  # handles class imbalance

# Silence warnings
import warnings
warnings.filterwarnings('ignore')

# Font setup (only needed if plots contain Chinese text; the SimHei font
# must be installed separately on Ubuntu)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
# ==================== 2. Data loading and exploration ====================
def load_and_explore_data():
    """
    Load the data and run a first exploratory pass.
    Returns: DataFrame
    """
    # Simulate a telecom dataset (in a real project, read it from CSV):
    # df = pd.read_csv('telecom_churn.csv')

    # Generate example data
    np.random.seed(42)
    n_samples = 7000
    data = {
        'customerID': [f'CID_{i:06d}' for i in range(n_samples)],
        'gender': np.random.choice(['Male', 'Female'], n_samples),
        'SeniorCitizen': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]),
        'Partner': np.random.choice(['Yes', 'No'], n_samples),
        'Dependents': np.random.choice(['Yes', 'No'], n_samples),
        'tenure': np.random.randint(0, 73, n_samples),  # months with the company
        'PhoneService': np.random.choice(['Yes', 'No'], n_samples),
        'MultipleLines': np.random.choice(['No', 'Yes', 'No phone service'], n_samples),
        'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
        'OnlineSecurity': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'OnlineBackup': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'DeviceProtection': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'TechSupport': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'StreamingTV': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'StreamingMovies': np.random.choice(['No', 'Yes', 'No internet service'], n_samples),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
        'PaperlessBilling': np.random.choice(['Yes', 'No'], n_samples),
        'PaymentMethod': np.random.choice([
            'Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'
        ], n_samples),
        'MonthlyCharges': np.random.uniform(18, 120, n_samples).round(2),
        'TotalCharges': np.random.uniform(18, 8000, n_samples).round(2),
        'Churn': np.random.choice(['No', 'Yes'], n_samples, p=[0.73, 0.27])  # 27% churn rate
    }
    df = pd.DataFrame(data)

    print("=" * 60)
    print("Data loaded")
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nColumn dtype counts:")
    print(df.dtypes.value_counts())
    print("\nTarget distribution:")
    churn_counts = df['Churn'].value_counts()
    print(churn_counts)
    print(f"Churn rate: {churn_counts['Yes'] / len(df):.2%}")

    # Save the data
    df.to_csv('telecom_churn_dataset.csv', index=False)
    print("\nData saved to: telecom_churn_dataset.csv")
    return df
# ==================== 3. Data preprocessing ====================
def preprocess_data(df):
    """
    Full preprocessing pass.
    Args:
        df: raw DataFrame
    Returns:
        X: feature matrix
        y: label array
        preprocessor: ColumnTransformer (reused later for prediction)
    """
    print("\n" + "=" * 60)
    print("Starting preprocessing...")

    # 3.1 Data cleaning
    # Handle missing TotalCharges (typically new customers with tenure=0)
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    df['TotalCharges'] = df['TotalCharges'].fillna(df['MonthlyCharges'])

    # Drop the customerID column (carries no signal)
    df = df.drop('customerID', axis=1)

    # 3.2 Separate features and label
    X = df.drop('Churn', axis=1)
    y = df['Churn'].map({'No': 0, 'Yes': 1})  # binarize the label
    print(f"Feature matrix shape: {X.shape}")
    print(f"Label array shape: {y.shape}")

    # 3.3 Identify feature types
    # Numeric features
    numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'SeniorCitizen']
    # Categorical features
    categorical_features = [col for col in X.columns if col not in numeric_features]
    print(f"\nNumeric features: {numeric_features}")
    print(f"Categorical features: {categorical_features}")

    # 3.4 Build the preprocessing pipelines
    # Numeric: standardization
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())  # standardize: (x - μ) / σ
    ])
    # Categorical: one-hot encoding
    categorical_transformer = Pipeline(steps=[
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        # handle_unknown='ignore': ignore categories unseen at fit time
        # sparse_output=False: return a dense array
    ])
    # Combine the two
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    print("Preprocessing pipeline built")
    return X, y, preprocessor
# ==================== 4. Feature engineering ====================
def feature_engineering(X, y, preprocessor):
    """
    Feature engineering and resampling.
    Returns: processed train/test splits and the fitted preprocessor.
    """
    print("\n" + "=" * 60)
    print("Starting feature engineering...")

    # 4.1 Handle class imbalance with SMOTE oversampling
    # SMOTE: Synthetic Minority Oversampling Technique
    # Applied to the training set only, to avoid data leakage
    print(f"Class distribution before oversampling:\n{y.value_counts()}")

    # 4.2 Split the data (before SMOTE)
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"\nTrain: {X_train_raw.shape}, test: {X_test_raw.shape}")

    # 4.3 Fit the preprocessing pipeline
    # fit on the training set only, then transform both splits
    X_train_processed = preprocessor.fit_transform(X_train_raw)
    X_test_processed = preprocessor.transform(X_test_raw)
    print(f"Processed train shape: {X_train_processed.shape}")
    print(f"Processed test shape: {X_test_processed.shape}")

    # 4.4 Apply SMOTE (training set only)
    # k_neighbors=3: number of neighbors used when synthesizing samples
    smote = SMOTE(random_state=42, k_neighbors=3)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)
    print(f"\nTrain shape after oversampling: {X_train_resampled.shape}")
    print(f"Class distribution after oversampling:\n{pd.Series(y_train_resampled).value_counts()}")

    return X_train_resampled, X_test_processed, y_train_resampled, y_test, preprocessor
# ==================== 5. Model training and tuning ====================
def train_and_evaluate_models(X_train, X_test, y_train, y_test):
    """
    Train several models with hyperparameter tuning.
    Returns: the best models and a performance comparison.
    """
    print("\n" + "=" * 60)
    print("Starting model training...")

    # 5.1 Candidate models
    # Note: preprocessing was already handled by the ColumnTransformer
    models = {
        'Logistic Regression': LogisticRegression(
            max_iter=1000, random_state=42, class_weight='balanced'
        ),
        'Random Forest': RandomForestClassifier(
            n_estimators=100, random_state=42, class_weight='balanced'
        ),
        'Gradient Boosting': GradientBoostingClassifier(
            random_state=42
        ),
        'SVM': SVC(
            random_state=42, probability=True, class_weight='balanced'
        )
    }

    # 5.2 Hyperparameter grids (key parameters only)
    param_grids = {
        'Logistic Regression': {
            'C': [0.1, 1.0, 10.0],            # inverse regularization strength
            'penalty': ['l2', 'l1'],          # regularization type
            'solver': ['liblinear', 'saga']   # optimization algorithm
        },
        'Random Forest': {
            'n_estimators': [100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        },
        'Gradient Boosting': {
            'n_estimators': [100, 200],
            'learning_rate': [0.1, 0.05],
            'max_depth': [3, 5]
        },
        'SVM': {
            'C': [0.1, 1.0, 10.0],
            'kernel': ['rbf', 'linear']
        }
    }

    # 5.3 Cross-validation strategy
    # StratifiedKFold: keep the class distribution in every fold
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # 5.4 Train and tune
    results = {}
    best_models = {}
    for name, model in models.items():
        print(f"\nTraining model: {name}")
        print("-" * 40)
        # Grid search
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            cv=cv,
            scoring='f1',   # F1 score suits the imbalanced setting
            n_jobs=-1,      # use every CPU core
            verbose=1
        )
        # Fit
        grid_search.fit(X_train, y_train)
        # Record the outcome
        results[name] = {
            'best_params': grid_search.best_params_,
            'best_score': grid_search.best_score_,
            'cv_results': grid_search.cv_results_
        }
        best_models[name] = grid_search.best_estimator_
        print(f"Best parameters: {grid_search.best_params_}")
        print(f"Best cross-validated F1: {grid_search.best_score_:.4f}")
    # 5.5 Model evaluation
    print("\n" + "=" * 60)
    print("Evaluation results:")
    print("=" * 60)
    evaluation_results = {}
    for name, model in best_models.items():
        print(f"\n{name}:")
        print("-" * 40)
        # Predict
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        # Compute the metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        auc = roc_auc_score(y_test, y_pred_proba)
        # Confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        evaluation_results[name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'auc': auc,
            'confusion_matrix': cm
        }
        # Print a report
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 score: {f1:.4f}")
        print(f"AUC: {auc:.4f}")
        print(f"Confusion matrix:\n{cm}")
        # Detailed classification report
        report = classification_report(y_test, y_pred, target_names=['Retained', 'Churned'])
        print(f"\nClassification report:\n{report}")
        # Save the model
        joblib.dump(model, f'{name.replace(" ", "_")}_model.pkl')
        print(f"Model saved: {name.replace(' ', '_')}_model.pkl")
    return best_models, evaluation_results
# ==================== 6. Visualization ====================
def visualization(evaluation_results, X_train, y_train, X_test, y_test, best_models):
    """
    Produce model-performance charts.
    (X_test/y_test are passed in explicitly; an earlier draft referenced
    them as globals, which breaks when the function is reused elsewhere.)
    """
    print("\n" + "=" * 60)
    print("Generating charts...")

    # 6.1 Bar charts comparing model metrics
    metrics = ['accuracy', 'precision', 'recall', 'f1_score', 'auc']
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()
    for idx, metric in enumerate(metrics):
        values = [results[metric] for results in evaluation_results.values()]
        model_names = list(evaluation_results.keys())
        axes[idx].barh(model_names, values, color='skyblue')
        axes[idx].set_xlim(0, 1)
        axes[idx].set_xlabel(metric.capitalize())
        axes[idx].set_title(f'Model Comparison: {metric.capitalize()}')
        # Annotate the bars with their values
        for i, v in enumerate(values):
            axes[idx].text(v + 0.01, i, f'{v:.3f}', va='center')
    axes[-1].axis('off')  # five metrics, six panels: hide the unused one
    plt.tight_layout()
    plt.savefig('model_comparison.png', dpi=300)
    print("Comparison chart saved: model_comparison.png")

    # 6.2 ROC curves
    plt.figure(figsize=(10, 8))
    for name, model in best_models.items():
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        auc_score = roc_auc_score(y_test, y_pred_proba)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')
    plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve Comparison')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.savefig('roc_curve.png', dpi=300)
    print("ROC curves saved: roc_curve.png")

    # 6.3 Learning curve (Random Forest as the example)
    if 'Random Forest' in best_models:
        model = best_models['Random Forest']
        train_sizes, train_scores, val_scores = learning_curve(
            model, X_train, y_train, cv=5, scoring='f1',
            train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
        )
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)
        plt.figure(figsize=(10, 6))
        plt.plot(train_sizes, train_mean, 'o-', label='Training Score')
        plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)
        plt.plot(train_sizes, val_mean, 'o-', label='Validation Score')
        plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)
        plt.xlabel('Training Set Size')
        plt.ylabel('F1 Score')
        plt.title('Learning Curve - Random Forest')
        plt.legend(loc='best')
        plt.grid(True)
        plt.savefig('learning_curve.png', dpi=300)
        print("Learning curve saved: learning_curve.png")
# ==================== 7. Model deployment ====================
def deploy_model(best_models, evaluation_results, preprocessor):
    """
    Prepare the model for deployment.
    """
    print("\n" + "=" * 60)
    print("Model deployment:")
    print("=" * 60)

    # Pick the best model (highest F1 score)
    best_model_name = max(evaluation_results.keys(),
                          key=lambda x: evaluation_results[x]['f1_score'])
    best_model = best_models[best_model_name]
    print(f"Selected model: {best_model_name}")
    print(f"F1 score: {evaluation_results[best_model_name]['f1_score']:.4f}")

    # Save the full pipeline (preprocessing + model)
    full_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', best_model)
    ])
    # Write it to disk
    joblib.dump(full_pipeline, 'churn_prediction_pipeline.pkl')
    print("\nFull pipeline saved: churn_prediction_pipeline.pkl")

    # Generate a prediction script
    prediction_code = '''
def predict_churn(customer_data):
    """
    Predict a customer's churn probability.
    Args:
        customer_data: dict or DataFrame with the raw feature columns
    Returns:
        dict with the prediction and probabilities
    """
    import joblib
    import pandas as pd
    # Load the pipeline
    pipeline = joblib.load('churn_prediction_pipeline.pkl')
    # Convert to a DataFrame
    if isinstance(customer_data, dict):
        customer_data = pd.DataFrame([customer_data])
    # Predict
    prediction = pipeline.predict(customer_data)
    probability = pipeline.predict_proba(customer_data)
    result = {
        'churn_prediction': 'Yes' if prediction[0] == 1 else 'No',
        'churn_probability': float(probability[0][1]),
        'retain_probability': float(probability[0][0])
    }
    return result

# Usage example
if __name__ == '__main__':
    # A single customer
    sample_customer = {
        'gender': 'Female',
        'SeniorCitizen': 0,
        'Partner': 'Yes',
        'Dependents': 'No',
        'tenure': 12,
        'PhoneService': 'Yes',
        'MultipleLines': 'Yes',
        'InternetService': 'DSL',
        'OnlineSecurity': 'No',
        'OnlineBackup': 'Yes',
        'DeviceProtection': 'No',
        'TechSupport': 'No',
        'StreamingTV': 'Yes',
        'StreamingMovies': 'Yes',
        'Contract': 'Month-to-month',
        'PaperlessBilling': 'Yes',
        'PaymentMethod': 'Electronic check',
        'MonthlyCharges': 70.35,
        'TotalCharges': 843.5
    }
    result = predict_churn(sample_customer)
    print(f"Prediction: {result}")
'''
    with open('predict.py', 'w', encoding='utf-8') as f:
        f.write(prediction_code)
    print("\nPrediction script written: predict.py")

    # Create requirements.txt (versions read from the running environment)
    import imblearn
    requirements = f"""
# Churn-prediction project dependencies
numpy=={np.__version__}
pandas=={pd.__version__}
scikit-learn=={sklearn.__version__}
joblib=={joblib.__version__}
imbalanced-learn=={imblearn.__version__}
"""
    with open('requirements.txt', 'w', encoding='utf-8') as f:
        f.write(requirements)
    print("Requirements file written: requirements.txt")

    return best_model_name
# ==================== 8. Main flow ====================
if __name__ == '__main__':
    # Record the start time
    start_time = datetime.now()
    print(f"Project started: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

    # Load data
    df = load_and_explore_data()
    # Preprocess
    X, y, preprocessor = preprocess_data(df)
    # Feature engineering
    X_train_processed, X_test_processed, y_train_resampled, y_test, preprocessor = feature_engineering(
        X, y, preprocessor
    )
    # Train the models
    best_models, evaluation_results = train_and_evaluate_models(
        X_train_processed, X_test_processed, y_train_resampled, y_test
    )
    # Visualize
    visualization(evaluation_results, X_train_processed, y_train_resampled,
                  X_test_processed, y_test, best_models)
    # Deploy
    best_model_name = deploy_model(best_models, evaluation_results, preprocessor)

    # Record the end time
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds() / 60
    print("\n" + "=" * 60)
    print(f"Project finished: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total time: {duration:.2f} minutes")
    print(f"Best model: {best_model_name}")
    print("=" * 60)
Running the script
bash
# Make sure the right environment is active
conda activate churn-prediction
# Run the full workflow
python churn_prediction.py
# Inspect the generated files
ls -lh *.pkl *.png *.csv *.py
# Test the prediction script
python predict.py
25.4 Configuring a Deep Learning Development Environment
25.4.1 Deep Learning Overview
Deep learning, a subfield of machine learning, uses multi-layer neural networks to model complex patterns. On Ubuntu 22.04, TensorFlow is a mainstream framework that supports:
- CPU and GPU training
- Production-grade model deployment (TensorFlow Serving)
- Mobile deployment (TensorFlow Lite)
- Large-scale distributed training
Key concepts (a minimal illustration follows this list):
- Tensor: a multi-dimensional array, the basic data structure
- Computational graph: a directed acyclic graph of operations
- Automatic differentiation (AutoDiff): gradients computed automatically
- Keras API: the high-level abstraction layer
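A few lines of TensorFlow make these concepts concrete. The following is a minimal sketch; any TensorFlow 2.x release should behave the same way.
python
import tensorflow as tf

# Tensor: a typed multi-dimensional array
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(t.shape, t.dtype)  # (2, 2) float32

# AutoDiff: GradientTape records operations and differentiates through them
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 3
print(tape.gradient(y, x).numpy())  # dy/dx = 3x^2 = 12.0

# Keras API: a single Dense layer as the smallest possible model
layer = tf.keras.layers.Dense(1)
print(layer(tf.ones([1, 4])).shape)  # (1, 1)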
25.4.2 TensorFlow at a Glance
Core features of TensorFlow 2.x:
- Eager Execution: dynamic graphs, executed immediately
- tf.keras: the official high-level API
- tf.data: efficient input pipelines (see the sketch below)
- tf.function: graph-mode acceleration
- Cross-platform: Linux/Windows/macOS/mobile
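As a small illustration of the tf.data feature listed above, the following sketch builds a toy input pipeline; the transformations are chained lazily and only evaluated when the dataset is iterated.
python
import tensorflow as tf

# A toy tf.data pipeline: shuffle -> batch -> prefetch
ds = tf.data.Dataset.range(10)
ds = ds.shuffle(10).batch(4).prefetch(tf.data.AUTOTUNE)
for batch in ds:
    print(batch.numpy())  # e.g. [7 2 5 0] then [9 1 4 3] then [6 8]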
25.4.3 Installing TensorFlow
Option 1: CPU build (works everywhere, no GPU required)
bash
# Activate the environment
conda activate dl-env
# Install the TensorFlow CPU build
conda install -c conda-forge tensorflow
# Or install with pip (recommended; more current releases)
pip install tensorflow==2.15.0
# Faster downloads via a mirror in mainland China
pip install tensorflow==2.15.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
Option 2: GPU build (requires an NVIDIA GPU)
Step 1: check the GPU hardware
bash
# Look for an NVIDIA GPU
lspci | grep -i nvidia
# Or install nvidia-utils
sudo apt install nvidia-utils-535
nvidia-smi  # shows GPU status and driver version
Step 2: install the NVIDIA driver and CUDA Toolkit
bash
# Method A: use the Ubuntu repositories (recommended)
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot  # reboot for the driver to take effect
# Verify the driver
nvidia-smi
# Install CUDA Toolkit 12.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-1
# Configure environment variables
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify CUDA
nvcc --version
Step 3: install cuDNN
bash
# Download cuDNN (requires an NVIDIA developer account)
# https://developer.nvidia.com/cudnn
# Extract and install
sudo tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
# Verify cuDNN
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
Step 4: install the TensorFlow GPU build
bash
# Create a dedicated GPU environment
conda create --name tf-gpu python=3.10 -y
conda activate tf-gpu
# Install TensorFlow with GPU support
# (the [and-cuda] extra pulls in CUDA libraries as pip packages, so the
# system-wide CUDA/cuDNN install above is optional for TensorFlow itself)
pip install tensorflow[and-cuda]==2.15.0
# Verify GPU support
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected output: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Verification script
python
# verify_tensorflow.py
"""
TensorFlow installation verification script.
Checks whether hardware acceleration is available.
"""
import tensorflow as tf
import time

print("=" * 60)
print("TensorFlow installation check")
print("=" * 60)

# 1. Version info
print(f"TensorFlow version: {tf.__version__}")

# 2. Physical devices
gpus = tf.config.list_physical_devices('GPU')
cpus = tf.config.list_physical_devices('CPU')
print(f"\nPhysical devices:")
print(f"  GPUs: {len(gpus)}")
print(f"  CPUs: {len(cpus)}")
if gpus:
    for gpu in gpus:
        print(f"  - {gpu}")
    # Memory info (get_memory_info expects 'GPU:0', not the physical-device name)
    memory_info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"  Memory in use: {memory_info['current'] / 1024**3:.2f} GB")

# 3. Default device policy
print(f"\nDefault device: {'GPU' if gpus else 'CPU'}")

# 4. A simple computation test
@tf.function
def matrix_multiply_test():
    """Matrix-multiplication micro-benchmark."""
    # Create large random matrices
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)
    return c

# CPU test
with tf.device('/CPU:0'):
    cpu_start = time.time()
    cpu_result = matrix_multiply_test().numpy()  # .numpy() forces completion
    cpu_time = time.time() - cpu_start
print(f"\nCPU time: {cpu_time:.4f} s")

# GPU test (if available)
if gpus:
    with tf.device('/GPU:0'):
        gpu_start = time.time()
        gpu_result = matrix_multiply_test().numpy()
        gpu_time = time.time() - gpu_start
    print(f"GPU time: {gpu_time:.4f} s")
    print(f"Speedup: {cpu_time / gpu_time:.2f}x")

# 5. CUDA / cuDNN build flags
print("\nCUDA build info:")
print(f"  Built with CUDA: {tf.test.is_built_with_cuda()}")

# 6. A small model-training test
print("\nModel-training test:")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Generate fake data
x_train = tf.random.normal([1000, 10])
y_train = tf.random.uniform([1000, 1], minval=0, maxval=2, dtype=tf.int32)
# Train for one epoch
train_start = time.time()
history = model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)
train_time = time.time() - train_start
print(f"  Training time (1 epoch): {train_time:.4f} s")
print(f"  Final loss: {history.history['loss'][0]:.4f}")

# 7. tf.function graph-mode test
print("\ntf.function graph-mode test:")
def simple_computation(x):
    return tf.reduce_sum(tf.square(x))

x = tf.constant(range(1000000), dtype=tf.float32)
# Eager mode
eager_start = time.time()
eager_result = simple_computation(x)
eager_time = time.time() - eager_start
# Graph mode (warm up once so tracing time is not measured)
graph_computation = tf.function(simple_computation)
graph_computation(x)
graph_start = time.time()
graph_result = graph_computation(x)
graph_time = time.time() - graph_start
print(f"  Eager mode: {eager_time:.4f} s")
print(f"  Graph mode: {graph_time:.4f} s")
print(f"  Speedup: {eager_time / graph_time:.2f}x")

print("\n✅ TensorFlow verification complete!")
Running the verification
bash
conda activate tf-gpu
python verify_tensorflow.py
25.4.4 Testing the Installation
Test 1: basic functionality
python
# test_tensorflow_basic.py
import tensorflow as tf
import numpy as np

print("TensorFlow basic functionality test")

# 1. Constants and variables
const = tf.constant([1.0, 2.0, 3.0])
var = tf.Variable([4.0, 5.0, 6.0])
print(f"Constant: {const}")
print(f"Variable: {var}")

# 2. Tensor arithmetic
add_result = tf.add(const, var)
print(f"Addition result: {add_result}")

# 3. Automatic differentiation
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2 + 2*x - 1
dy_dx = tape.gradient(y, x)
print(f"At x={x.numpy():.1f}, the derivative of y=x²+2x-1 is {dy_dx.numpy():.1f}")

# 4. A Keras model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=10, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(units=1)
])
model.summary()  # summary() prints the architecture itself (it returns None)

# 5. A data pipeline
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.randn(100, 5), np.random.randn(100, 1))
).batch(32)
print(f"Dataset: {dataset}")

# 6. Compile and train
model.compile(optimizer='adam', loss='mse')
history = model.fit(dataset, epochs=2, verbose=0)
print(f"Training done, final loss: {history.history['loss'][-1]:.4f}")
Test 2: GPU benchmark
python
# benchmark_gpu.py
import tensorflow as tf
import time

def benchmark_matmul(size, device_name):
    """Matrix-multiplication benchmark on one device."""
    print(f"\n{size}x{size} matmul on {device_name}...")
    with tf.device(device_name):
        # Create random matrices
        a = tf.random.normal([size, size])
        b = tf.random.normal([size, size])
        # Warm-up runs
        for _ in range(5):
            c = tf.matmul(a, b)
        _ = c.numpy()  # block until the device finishes the warm-up
        print("warm-up done", end=' ')
        # Timed runs
        start = time.time()
        for _ in range(20):
            c = tf.matmul(a, b)
        _ = c.numpy()  # force synchronization before stopping the clock
        elapsed = time.time() - start
    avg_time = elapsed / 20
    print(f"average time: {avg_time:.4f} s")
    return avg_time

# Benchmark several sizes
sizes = [1000, 2000, 4000, 8000]
results = {}
for size in sizes:
    cpu_time = benchmark_matmul(size, '/CPU:0')
    if tf.config.list_physical_devices('GPU'):
        gpu_time = benchmark_matmul(size, '/GPU:0')
        speedup = cpu_time / gpu_time
        results[size] = {'cpu': cpu_time, 'gpu': gpu_time, 'speedup': speedup}
        print(f"GPU speedup: {speedup:.2f}x")
    else:
        results[size] = {'cpu': cpu_time, 'gpu': None, 'speedup': None}

# Summary
print("\n" + "="*50)
print("Benchmark results:")
for size, data in results.items():
    if data['gpu']:
        print(f"Size {size}x{size}: CPU={data['cpu']:.4f}s, GPU={data['gpu']:.4f}s, Speedup={data['speedup']:.2f}x")
    else:
        print(f"Size {size}x{size}: CPU={data['cpu']:.4f}s, GPU=Not Available")
25.5 A Deep Learning Application Example
25.5.1 Project Overview
Project goal: use TensorFlow to build an image-classification system that recognizes handwritten digits (the MNIST dataset), as a basis for extending to custom image classification.
Technical architecture:
- Data loading and preprocessing (tf.data.Dataset)
- Model construction (tf.keras.Sequential plus the Functional API)
- Training strategy (tf.keras.callbacks)
- Model optimization (transfer learning, data augmentation)
- Deployment preparation (SavedModel format)
25.5.2 Walking Through the Example
Case code: full MNIST classification with model deployment
python
# deep_learning_mnist.py
"""
Complete TensorFlow deep-learning example.
Covers data loading, model construction, training, evaluation,
and deployment.
"""
# ==================== 1. Imports ====================
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from datetime import datetime
from sklearn.model_selection import train_test_split
import json

print("=" * 60)
print("TensorFlow MNIST deep-learning example")
print("=" * 60)
# ==================== 2. Data loading and preprocessing ====================
def load_and_preprocess_data():
    """
    Load MNIST and preprocess it.
    Returns:
        train_ds, val_ds, test_ds: tf.data.Dataset objects,
        plus the raw train and validation arrays.
    """
    print("\nLoading the MNIST dataset...")
    # Load the data (downloaded automatically on first run, about 11 MB)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    print(f"Train: {x_train.shape}, {y_train.shape}")
    print(f"Test: {x_test.shape}, {y_test.shape}")

    # Carve out a validation set
    x_train, x_val, y_train, y_val = train_test_split(
        x_train, y_train, test_size=0.1, random_state=42, stratify=y_train
    )

    # Normalization: rescale pixel values from 0-255 to the range -1..1.
    # Normalized inputs help gradient descent converge faster.
    def normalize(images):
        return (images.astype(np.float32) - 127.5) / 127.5

    x_train = normalize(x_train)
    x_val = normalize(x_val)
    x_test = normalize(x_test)

    # Add a channel dimension: (28, 28) -> (28, 28, 1).
    # Convolutional layers expect a channel axis.
    x_train = np.expand_dims(x_train, -1)
    x_val = np.expand_dims(x_val, -1)
    x_test = np.expand_dims(x_test, -1)

    print(f"Shapes after preprocessing:")
    print(f"  Train: {x_train.shape}")
    print(f"  Validation: {x_val.shape}")
    print(f"  Test: {x_test.shape}")

    # Build tf.data.Dataset pipelines (efficient input pipelines)
    AUTOTUNE = tf.data.AUTOTUNE  # let TF tune prefetching and parallelism
    train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_ds = train_ds.cache()                                        # cache in memory/disk
    train_ds = train_ds.shuffle(10000, reshuffle_each_iteration=True)  # shuffle
    train_ds = train_ds.batch(128)                                     # batch size
    train_ds = train_ds.prefetch(AUTOTUNE)                             # prefetch

    val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val))
    val_ds = val_ds.cache().batch(128).prefetch(AUTOTUNE)

    test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    test_ds = test_ds.batch(128).prefetch(AUTOTUNE)

    print(f"Datasets created")
    print(f"  Training batches: {len(train_ds)}")
    print(f"  Validation batches: {len(val_ds)}")
    print(f"  Test batches: {len(test_ds)}")
    return train_ds, val_ds, test_ds, (x_train, y_train), (x_val, y_val)
# ==================== 3. Model construction ====================
def build_models():
    """
    Build several models for comparison.
    Returns:
        models: dict of architectures
    """
    print("\nBuilding the models...")
    models = {}

    # 3.1 A simple CNN (good starting point)
    simple_cnn = tf.keras.Sequential([
        # First convolution: 32 3x3 filters, ReLU activation
        tf.keras.layers.Conv2D(
            filters=32, kernel_size=(3, 3), activation='relu',
            input_shape=(28, 28, 1), name='conv1'
        ),
        # Max pooling: 2x2 downsampling
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), name='pool1'),
        # Second convolution: 64 3x3 filters
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', name='conv2'),
        tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),
        # Flatten the feature maps into a vector
        tf.keras.layers.Flatten(name='flatten'),
        # Dense layer: 128 units + Dropout regularization
        tf.keras.layers.Dense(128, activation='relu', name='dense1'),
        tf.keras.layers.Dropout(0.5, name='dropout1'),  # drop 50% of units to curb overfitting
        # Output layer: 10 units for 10 classes, softmax activation
        tf.keras.layers.Dense(10, activation='softmax', name='output')
    ], name='Simple_CNN')
    models['Simple_CNN'] = simple_cnn

    # 3.2 A deeper CNN (higher accuracy)
    deep_cnn = tf.keras.Sequential([
        # Block 1
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
        tf.keras.layers.BatchNormalization(),  # batch norm speeds up convergence
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.25),
        # Block 2
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Dropout(0.25),
        # Block 3
        tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.GlobalAveragePooling2D(),  # global average pooling saves parameters
        # Dense head
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ], name='Deep_CNN')
    models['Deep_CNN'] = deep_cnn

    # 3.3 Functional API model (more flexible; supports multiple inputs/outputs)
    inputs = tf.keras.Input(shape=(28, 28, 1), name='input')
    x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
    functional_model = tf.keras.Model(inputs=inputs, outputs=outputs, name='Functional_CNN')
    models['Functional_CNN'] = functional_model

    # Print the architectures
    for name, model in models.items():
        print(f"\n{name} architecture:")
        model.summary()
        # Render the architecture diagram (requires pydot and graphviz;
        # skipped gracefully if they are missing)
        try:
            tf.keras.utils.plot_model(
                model, to_file=f'{name}_architecture.png',
                show_shapes=True, show_layer_names=True,
                show_layer_activations=True
            )
            print(f"Architecture diagram saved: {name}_architecture.png")
        except ImportError as e:
            print(f"plot_model skipped (install pydot + graphviz): {e}")
    return models
# ==================== 4. Model training ====================
def train_models(models, train_ds, val_ds):
    """
    Train every model.
    Args:
        models: dict of models
        train_ds: training dataset
        val_ds: validation dataset
    Returns:
        histories: dict of training histories
    """
    print("\nStarting training...")
    histories = {}

    for name, model in models.items():
        print(f"\n{'='*60}")
        print(f"Training model: {name}")
        print(f"{'='*60}")

        # Callbacks are created per model so that checkpoints and logs
        # do not overwrite each other
        log_dir = f"logs/fit/{name}-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
        callbacks = [
            # EarlyStopping: stop after 20 epochs without val_loss improvement
            tf.keras.callbacks.EarlyStopping(
                monitor='val_loss', patience=20, restore_best_weights=True,
                verbose=1
            ),
            # ReduceLROnPlateau: adapt the learning rate
            tf.keras.callbacks.ReduceLROnPlateau(
                monitor='val_loss', factor=0.5, patience=5,
                min_lr=1e-6, verbose=1
            ),
            # ModelCheckpoint: keep the best weights
            tf.keras.callbacks.ModelCheckpoint(
                f'{name}_best_weights.h5', monitor='val_accuracy',
                save_best_only=True, save_weights_only=True,
                verbose=1
            ),
            # TensorBoard: visualize the training run
            tf.keras.callbacks.TensorBoard(
                log_dir=log_dir, histogram_freq=1, write_graph=True,
                write_images=True, update_freq='epoch'
            )
        ]

        # Compile
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(),
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3, name='top3_accuracy')
            ]
        )
        # Fit
        history = model.fit(
            train_ds,
            epochs=50,  # upper bound; EarlyStopping usually ends training sooner
            validation_data=val_ds,
            callbacks=callbacks,
            verbose=1
        )
        histories[name] = history
        # Save the full model
        model.save(f'{name}_model.h5')
        print(f"Model saved: {name}_model.h5")
    return histories
# ==================== 5. Model evaluation ====================
def evaluate_models(models, test_ds):
    """
    Evaluate every model on the test set.
    Args:
        models: dict of models
        test_ds: test dataset
    """
    print("\nModel evaluation:")
    print("=" * 60)
    results = {}
    for name, model in models.items():
        print(f"\nEvaluating {name}:")
        # Evaluate
        test_loss, test_acc, test_top3_acc = model.evaluate(test_ds, verbose=0)
        # Predict
        predictions = model.predict(test_ds)
        predicted_classes = np.argmax(predictions, axis=1)
        # Collect the true labels
        true_labels = np.concatenate([y for x, y in test_ds], axis=0)
        # Confusion matrix
        cm = tf.math.confusion_matrix(true_labels, predicted_classes, num_classes=10)
        results[name] = {
            'test_loss': test_loss,
            'test_accuracy': test_acc,
            'test_top3_accuracy': test_top3_acc,
            'confusion_matrix': cm.numpy()
        }
        print(f"  Test loss: {test_loss:.4f}")
        print(f"  Test accuracy: {test_acc:.4f}")
        print(f"  Top-3 accuracy: {test_top3_acc:.4f}")
        # Save the confusion matrix
        np.save(f'{name}_confusion_matrix.npy', cm.numpy())
        print(f"  Confusion matrix saved: {name}_confusion_matrix.npy")
    return results
# ==================== 6. Visualization ====================
def visualize_training(histories, results):
    """
    Plot the training process and the results.
    """
    print("\nGenerating charts...")

    # 6.1 Accuracy curves
    plt.figure(figsize=(15, 10))
    for idx, (name, history) in enumerate(histories.items(), 1):
        plt.subplot(2, 2, idx)
        plt.plot(history.history['accuracy'], label='Train Accuracy')
        plt.plot(history.history['val_accuracy'], label='Val Accuracy')
        plt.title(f'{name} - Accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend()
        plt.grid(True)
    plt.tight_layout()
    plt.savefig('training_accuracy.png', dpi=300)
    print("Accuracy curves saved: training_accuracy.png")

    # 6.2 Loss curves
    plt.figure(figsize=(15, 10))
    for idx, (name, history) in enumerate(histories.items(), 1):
        plt.subplot(2, 2, idx)
        plt.plot(history.history['loss'], label='Train Loss')
        plt.plot(history.history['val_loss'], label='Val Loss')
        plt.title(f'{name} - Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True)
    plt.tight_layout()
    plt.savefig('training_loss.png', dpi=300)
    print("Loss curves saved: training_loss.png")

    # 6.3 Accuracy comparison bar chart
    accuracies = [results[name]['test_accuracy'] for name in results.keys()]
    names = list(results.keys())
    plt.figure(figsize=(10, 6))
    bars = plt.bar(names, accuracies, color=['skyblue', 'lightgreen', 'salmon'])
    plt.title('Model Accuracy Comparison')
    plt.ylabel('Test Accuracy')
    plt.ylim(0, 1)
    # Annotate the bars
    for bar, acc in zip(bars, accuracies):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                 f'{acc:.4f}', ha='center')
    plt.xticks(rotation=15)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig('model_accuracy_comparison.png', dpi=300)
    print("Comparison chart saved: model_accuracy_comparison.png")
# ==================== 7. Model deployment ====================
def deploy_model(models, results):
    """
    Prepare deployment: pick the best model and export it in the
    SavedModel format (the production standard).
    """
    print("\nDeployment preparation:")
    print("=" * 60)

    # 7.1 Pick the best model (by accuracy)
    best_model_name = max(results.keys(), key=lambda x: results[x]['test_accuracy'])
    best_model = models[best_model_name]
    print(f"Selected model: {best_model_name}")
    print(f"Test accuracy: {results[best_model_name]['test_accuracy']:.4f}")

    # 7.2 Export to SavedModel (TensorFlow Serving expects a numeric
    # version directory, hence the trailing /1)
    export_path = f"saved_model/{best_model_name}/1"
    tf.saved_model.save(best_model, export_path)
    print(f"SavedModel exported to: {export_path}")

    # 7.3 Generate an inference script
    inference_code = f'''
import tensorflow as tf
import numpy as np
from PIL import Image

def load_and_preprocess_image(image_path):
    """
    Load and preprocess one image.
    Args:
        image_path: path to an image file
    Returns:
        a preprocessed tensor
    """
    # Load the image
    img = Image.open(image_path).convert('L')  # convert to grayscale
    img = img.resize((28, 28))                 # resize
    # Convert to an array and normalize
    img_array = np.array(img, dtype=np.float32)
    img_array = (img_array - 127.5) / 127.5
    img_array = np.expand_dims(img_array, axis=(0, -1))  # add batch and channel axes
    return tf.constant(img_array)

def predict_digit(image_path, model_path='{export_path}'):
    """
    Predict a handwritten digit.
    Args:
        image_path: image path
        model_path: SavedModel path
    Returns:
        dict with the prediction
    """
    # Load the model
    model = tf.saved_model.load(model_path)
    # Preprocess the image
    input_tensor = load_and_preprocess_image(image_path)
    # Predict; the model's final layer is already softmax,
    # so the outputs are probabilities
    probabilities = model(input_tensor, training=False).numpy()[0]
    # Top-3 predictions
    top3_indices = np.argsort(probabilities)[-3:][::-1]
    top3_probs = probabilities[top3_indices]
    results = {{
        'predicted_digit': int(top3_indices[0]),
        'confidence': float(top3_probs[0]),
        'top3_predictions': [
            {{'digit': int(idx), 'probability': float(prob)}}
            for idx, prob in zip(top3_indices, top3_probs)]
    }}
    return results

def predict_batch(image_paths, model_path='{export_path}'):
    """
    Batch prediction.
    Args:
        image_paths: list of image paths
        model_path: SavedModel path
    Returns:
        list of predictions
    """
    model = tf.saved_model.load(model_path)
    # Batch preprocessing
    batch_tensor = tf.stack([load_and_preprocess_image(path) for path in image_paths])
    batch_tensor = tf.squeeze(batch_tensor, axis=1)  # drop the extra axis
    # Batch prediction (outputs are already probabilities)
    probabilities = model(batch_tensor, training=False).numpy()
    results = []
    for probs in probabilities:
        predicted_digit = int(np.argmax(probs))
        confidence = float(probs[predicted_digit])
        results.append({{'digit': predicted_digit, 'confidence': confidence}})
    return results

# Usage example
if __name__ == '__main__':
    # Single prediction (requires a handwritten-digit image, e.g.:)
    # result = predict_digit('digit_7.png')
    # print(result)
    # Batch prediction:
    # batch_result = predict_batch(['digit_3.png', 'digit_8.png'])
    # print(batch_result)
    # Smoke test with a random sample
    test_image = tf.random.normal([1, 28, 28, 1])
    model = tf.saved_model.load('{export_path}')
    pred = model(test_image, training=False)
    print(f"Smoke-test prediction: {{pred.numpy()}}")
'''
    with open('inference.py', 'w', encoding='utf-8') as f:
        f.write(inference_code)
    print("Inference script written: inference.py")

    # 7.4 Docker deployment config (production)
    dockerfile_content = f'''
# Official TensorFlow Serving image
FROM tensorflow/serving:latest
# Copy the model (version directory 1 included)
COPY ./saved_model/{best_model_name} /models/saved_model
# Model name
ENV MODEL_NAME=saved_model
# Expose the REST API port
EXPOSE 8501
# Start TensorFlow Serving
CMD ["tensorflow_model_server", "--model_base_path=/models/saved_model", "--rest_api_port=8501"]
'''
    with open('Dockerfile', 'w') as f:
        f.write(dockerfile_content)
    print("Dockerfile written: Dockerfile")

    # 7.5 Deployment script
    deploy_script = '''#!/bin/bash
# TensorFlow Serving startup script
DOCKER_IMAGE="tf-serving-mnist"
# Build the Docker image
docker build -t $DOCKER_IMAGE .
# Start the container
docker run -d -p 8501:8501 --name mnist-serving $DOCKER_IMAGE
echo "TensorFlow Serving is up"
echo "REST API: http://localhost:8501/v1/models/saved_model:predict"
# Example request (input shape must be [1, 28, 28, 1]; placeholder shown):
echo 'curl -d "{\\"instances\\": [[[[0.1]]]]}" -X POST http://localhost:8501/v1/models/saved_model:predict'
'''
    with open('deploy.sh', 'w') as f:
        f.write(deploy_script)
    # Make it executable
    os.chmod('deploy.sh', 0o755)
    print("Deployment script written: deploy.sh")

    # 7.6 Export the model configuration
    model_config = {
        "model_name": best_model_name,
        "test_accuracy": float(results[best_model_name]['test_accuracy']),
        "model_path": export_path,
        "input_shape": [None, 28, 28, 1],
        "output_shape": [None, 10],
        "num_classes": 10,
        "preprocessing": {
            "normalization": "pixel/127.5 - 1",
            "resize": [28, 28],
            "color_mode": "grayscale",
            "dtype": "float32"
        }
    }
    with open('model_config.json', 'w') as f:
        json.dump(model_config, f, indent=2)
    print("Model config written: model_config.json")
    return best_model, best_model_name
# ==================== 8. Main flow ====================
def main():
    """Main pipeline."""
    # Record the start time
    start_time = datetime.now()
    print(f"Project started: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

    # 1. Load the data (raw arrays are returned for optional inspection)
    train_ds, val_ds, test_ds, train_raw, val_raw = load_and_preprocess_data()
    # 2. Build the models
    models = build_models()
    # 3. Train them
    histories = train_models(models, train_ds, val_ds)
    # 4. Evaluate them
    results = evaluate_models(models, test_ds)
    # 5. Visualize
    visualize_training(histories, results)
    # 6. Prepare deployment
    deploy_model(models, results)

    # Record the end time
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds() / 60
    print("\n" + "=" * 60)
    print(f"Project finished: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total time: {duration:.2f} minutes")
    print("=" * 60)

if __name__ == '__main__':
    main()
Running the deep-learning project
bash
# Create the environment
conda create --name dl-mnist python=3.10 -y
conda activate dl-mnist
# Install dependencies
pip install tensorflow==2.15.0
pip install matplotlib scikit-learn
# Run the project
python deep_learning_mnist.py
# Launch TensorBoard to inspect training
tensorboard --logdir logs/fit
# Test the inference script
python inference.py
25.6 Chapter Summary
25.6.1 Key Points Reviewed
1. Environment management
- Conda environment isolation: conda create --name env_name python=3.10
- Dependency management: version pinning with an environment.yml file
- Mirror configuration: Tsinghua/USTC mirrors for faster downloads
2. The machine-learning pipeline
- Preprocessing: StandardScaler, OneHotEncoder, ColumnTransformer
- Class imbalance: SMOTE oversampling
- Model selection and evaluation: GridSearchCV, StratifiedKFold
- Persistence: joblib.dump() and joblib.load()
3. Deep-learning essentials
- TensorFlow 2.x architecture: tf.keras + Eager Execution
- Model construction: Sequential API vs. Functional API
- Training callbacks: EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
- Deployment format: SavedModel (the TensorFlow Serving standard)
25.6.2 Practical Tips
Performance tuning
python
import tensorflow as tf
# 1. Let TensorFlow grow GPU memory on demand (avoids exhausting VRAM)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
# 2. XLA JIT compilation
tf.config.optimizer.set_jit(True)
# 3. Mixed-precision training (reduces VRAM usage); the old
#    mixed_precision.experimental import was removed in recent TF 2.x releases
tf.keras.mixed_precision.set_global_policy('mixed_float16')
Troubleshooting
bash
# 1. Resolve conda dependency conflicts
conda install <package> --update-deps --force-reinstall
# 2. Purge the pip cache
pip cache purge
# 3. TensorFlow GPU diagnostics
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# 4. Check the CUDA and cuDNN versions
nvcc --version
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
25.6.3 Where to Go Next
1. Advanced machine learning:
- Learn the XGBoost and LightGBM gradient-boosting frameworks
- Master advanced Pipeline and GridSearchCV usage
- Study feature selection: SelectKBest, RFE
2. Advanced deep learning:
- Learn transfer learning
- Master the advanced tf.data.Dataset API
- Explore pretrained models on TensorFlow Hub
3. Production deployment (a FastAPI sketch follows below):
- Containerized deployment with TensorFlow Serving
- REST APIs with FastAPI
- Cluster management with Docker + Kubernetes
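To give a taste of the FastAPI route, here is a minimal sketch that wraps the churn pipeline from 25.3 in a REST endpoint. It is a sketch, not a production server: it assumes pip install fastapi uvicorn, and that churn_prediction_pipeline.pkl from 25.3 exists in the working directory.
python
# serve_churn.py -- minimal FastAPI sketch around the saved churn pipeline
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
pipeline = joblib.load('churn_prediction_pipeline.pkl')  # preprocessing + model

@app.post('/predict')
def predict(customer: dict):
    """Accepts one customer record as JSON with the raw feature columns."""
    X = pd.DataFrame([customer])
    proba = float(pipeline.predict_proba(X)[0][1])
    return {'churn_probability': proba, 'churn': proba >= 0.5}

# Run with: uvicorn serve_churn:app --host 0.0.0.0 --port 8000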
Coming up in later chapters:
- Chapter 26: monitoring and updating models in production
- Chapter 27: PyTorch vs. TensorFlow in practice
- Chapter 28: AI deployment on edge devices (Jetson Nano)
All code in this chapter has been verified on Ubuntu 22.04 with Anaconda and can be run as-is. Readers are encouraged to work through the steps, understand what each function does, and adjust the parameters to their own needs.