Core Foundations of Artificial Intelligence: Machine Learning, Chapter 17: A Complete Guide to Scikit-learn



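17.1 Scikit-learn: Introduction and Installation

📌 What is it?

Python's mainstream library for classical machine learning
A consistent interface: every estimator is trained with fit(); predictors add predict() and score(), and transformers add transform()
End-to-end coverage: data preprocessing → model training → evaluation → tuning

✅ Key strengths

Concise and consistent: switching algorithms usually means changing a single line of code
Well documented: scikit-learn.org
Large community: most common questions already have answers on Stack Overflow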
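
As a rough illustration of the "change a single line" claim, the sketch below trains three different classifiers on Iris through identical `fit()` and `score()` calls. The choice of models, the stratified split, and the hyperparameters are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Swapping algorithms only changes the constructor; the calls stay the same.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2%}")
```

Only the dictionary entries differ between algorithms, which is what makes quick side-by-side comparisons cheap.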

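🔧 Installation (a virtual environment is recommended)

# Basic install (pulls in NumPy, SciPy, joblib and threadpoolctl;
# Matplotlib is not a dependency and must be installed separately)
pip install scikit-learn

# Or via Anaconda (recommended)
conda install scikit-learn

💡 Verify the installation:

import sklearn
print(sklearn.__version__)  # should be >= 1.0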

17.2 Using the Core Scikit-learn API

Scikit-learn's design philosophy: a unified Estimator interface

🧩 Four core steps (a general-purpose template)

# 1. Load the data
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

# 2. Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # <- every estimator is trained with .fit()

# 4. Evaluate the model
y_pred = model.predict(X_test)           # predictions
accuracy = model.score(X_test, y_test)   # mean accuracy on the test set
print(f"Accuracy: {accuracy:.2%}")

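🔍 API breakdown

1. Data loading module `sklearn.datasets`

Function	Purpose	Example
`load_iris()`	Small classic bundled dataset	Introductory classification
`fetch_openml(name)`	Download a dataset from OpenML	`fetch_openml('mnist_784')`
`make_classification()`	Generate synthetic data	Quick algorithm tests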
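
A minimal sketch of two of the loaders from the table, avoiding any download. The sample counts, feature counts, and seed passed to `make_classification` are arbitrary values picked for illustration.

```python
from sklearn.datasets import load_iris, make_classification

# Bundled toy dataset: 150 iris flowers, 4 features, 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)

# Synthetic data: sizes and seed are illustration values, not recommendations.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=2, random_state=0
)
print(X_syn.shape, sorted(set(y_syn.tolist())))
```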

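2. Model training: the methods Estimators share

.fit(X, y): train (supervised) or fit (unsupervised, where y is omitted)
.predict(X): predict labels or target values
.predict_proba(X): predict class probabilities (classifiers that support it)
.transform(X): transform data (e.g. PCA, standardization)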
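
The four methods listed above can be seen together in a short sketch. It pairs a transformer (`StandardScaler`) with a classifier (`LogisticRegression`); that pairing is simply an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns column means/variances, transform() applies them.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Classifier: fit() learns parameters, predict()/predict_proba() use them.
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
labels = clf.predict(X_scaled[:3])
proba = clf.predict_proba(X_scaled[:3])

print(labels)
print(np.round(proba, 3))  # one row per sample, one column per class
```

Each row of `proba` sums to 1, and `predict()` returns the class with the highest probability in that row.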

3. Model evaluation `sklearn.metrics`

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

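4. Hyperparameter tuning `sklearn.model_selection`

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try every combination in the grid with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)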
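
Beyond `best_params_`, a fitted search also exposes the best cross-validated score and a refitted model, and `cross_val_score` covers the simpler case of scoring one fixed configuration. The sketch below shows both; the reduced grid is an assumption chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

# Plain 5-fold cross-validation for one fixed configuration.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=5
)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Small grid search; the grid values are arbitrary illustration choices.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV score:", round(grid.best_score_, 3))

# With the default refit=True, best_estimator_ is retrained on all of X, y.
best_model = grid.best_estimator_
```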

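17.3 Scikit-learn Implementations of the Core Algorithms

Covers supervised, unsupervised, semi-supervised and (simplified) self-supervised learning

📊 I. Supervised learning algorithms

1. Linear models

from sklearn.datasets import make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification (X_train, y_train come from the Iris split in 17.2)
lr = LogisticRegression(max_iter=1000)  # a higher max_iter avoids convergence warnings
lr.fit(X_train, y_train)

# Regression (synthetic data as an illustrative stand-in for a real regression set)
X_train_reg, y_train_reg = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
reg = LinearRegression()
reg.fit(X_train_reg, y_train_reg)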

2. Tree-based models

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

dt = DecisionTreeClassifier(max_depth=5)           # a single decision tree
rf = RandomForestClassifier(n_estimators=100)      # bagging of 100 trees
gb = GradientBoostingClassifier(n_estimators=100)  # boosting with 100 stages
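
The snippet above only constructs the estimators. As a small follow-up sketch, fitted tree ensembles expose `feature_importances_`, which gives a quick view of which inputs the model relies on. Using Iris and a fixed seed here is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = rf.feature_importances_
ranked = sorted(
    zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True
)
for name, value in ranked:
    print(f"{name:20s} {value:.3f}")
```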

3. Support vector machines (SVM)

from sklearn.svm import SVC, SVR

svc = SVC(kernel='rbf', C=1.0)  # classification
svr = SVR(kernel='rbf')         # regression
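
RBF-kernel SVMs depend on distances between samples, so they are sensitive to feature scale. The sketch below compares the same `SVC` with and without standardization; the Wine dataset is an assumed example, chosen because its features have very different ranges.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

raw_svc = SVC(kernel="rbf", C=1.0)
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validated accuracy for each variant.
raw_acc = cross_val_score(raw_svc, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_svc, X, y, cv=5).mean()
print(f"unscaled: {raw_acc:.3f}  scaled: {scaled_acc:.3f}")
```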

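🔍 II. Unsupervised learning algorithms

1. Clustering

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

kmeans = KMeans(n_clusters=3, random_state=42)  # centroid-based; seeded for reproducibility
dbscan = DBSCAN(eps=0.5, min_samples=5)         # density-based; noise points get label -1
hac = AgglomerativeClustering(n_clusters=3, linkage='ward')  # hierarchical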
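
To show these clusterers in use, here is a minimal sketch that runs all three on synthetic blobs and scores the partitions with the silhouette coefficient. The blob parameters and `n_init=10` are illustration choices, not tuned values.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs; all values are illustration choices.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hac_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # -1 marks noise

print("KMeans silhouette:", round(silhouette_score(X, kmeans_labels), 3))
print("HAC silhouette:", round(silhouette_score(X, hac_labels), 3))
print("DBSCAN clusters found:", len(set(dbscan_labels.tolist()) - {-1}))
```

Clusterers use `fit_predict()` because there is no target: the labels are an output of fitting, not an input.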

2. Dimensionality reduction

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)  # note: TSNE has no .transform(), so new data cannot be projected
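
The practical difference between the two reducers can be sketched as follows: a fitted PCA can project unseen rows, while `TSNE` offers no `transform()` at all. The train/new split is an assumption made to have some "unseen" data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)
X_train, X_new = train_test_split(X, test_size=0.2, random_state=42)

# PCA learns a linear projection, so it can be reused on unseen rows.
pca = PCA(n_components=2).fit(X_train)
X_new_2d = pca.transform(X_new)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("projected shape:", X_new_2d.shape)

# TSNE exposes fit/fit_transform only; there is no transform() method.
print("TSNE has transform():", hasattr(TSNE, "transform"))
```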

3. Anomaly detection

from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

iso = IsolationForest(contamination=0.1)    # expected fraction of outliers
ocsvm = OneClassSVM(nu=0.1, gamma='scale')  # nu upper-bounds the fraction of training errors
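
A minimal usage sketch for `IsolationForest`, on synthetic data where the outliers are known by construction. The data-generating values and seeds are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(180, 2))
outliers = rng.uniform(low=-8.0, high=8.0, size=(20, 2))
X = np.vstack([inliers, outliers])

iso = IsolationForest(contamination=0.1, random_state=42)
pred = iso.fit_predict(X)          # +1 = inlier, -1 = outlier
scores = iso.decision_function(X)  # lower means more anomalous

n_flagged = int((pred == -1).sum())
print(f"flagged {n_flagged} of {len(X)} points as outliers")
```

`contamination` sets the score threshold, so roughly that fraction of the training points is flagged regardless of whether they are true outliers.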

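🤝 III. Semi-supervised learning algorithms (key topic!)

Scikit-learn ships two graph-based label propagation algorithms

1. Label Propagation

Characteristics: hard clamping; the original labels stay fixed during propagation
Best for: low-noise data with clear structure

2. Label Spreading

Characteristics: soft clamping; the original labels may be adjusted slightly → more robust to label noise, so prefer it by default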

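from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.metrics import accuracy_score
import numpy as np

# Build semi-supervised labels: the true value where known, -1 where unknown
# (X, y are the Iris data loaded in 17.2)
n_samples = len(y)
n_labeled = 30  # only 30 labeled samples

# Randomly pick the labeled samples (seeded for reproducibility)
rng = np.random.default_rng(42)
labeled_idx = rng.choice(n_samples, size=n_labeled, replace=False)
y_semi = np.full(n_samples, -1)
y_semi[labeled_idx] = y[labeled_idx]

# Method 1: Label Propagation
lp = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=100)
lp.fit(X, y_semi)
y_pred_lp = lp.predict(X)

# Method 2: Label Spreading (recommended)
ls = LabelSpreading(kernel='knn', n_neighbors=7, alpha=0.8, max_iter=100)
ls.fit(X, y_semi)
y_pred_ls = ls.predict(X)

# Evaluate on the unlabeled samples only (possible here because all true labels are known)
unlabeled = y_semi == -1
print("Label Propagation accuracy:", accuracy_score(y[unlabeled], y_pred_lp[unlabeled]))
print("Label Spreading accuracy:", accuracy_score(y[unlabeled], y_pred_ls[unlabeled]))

⚠️ Notes:

Unlabeled samples in y_semi must be marked with -1

kernel can be 'knn', 'rbf', or a callable

alpha (LabelSpreading) is the clamping factor in the open interval (0, 1): values near 0 keep the original labels almost fixed, values near 1 let the neighbors overwrite them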
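
After fitting, both estimators also store the propagated labels in `transduction_` and the per-class probabilities in `label_distributions_`. The sketch below reads them for the unlabeled samples; keeping 10 labels per class and using `alpha=0.2` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Keep 10 labels per class (stratified by hand) and hide the rest as -1.
rng = np.random.default_rng(0)
y_semi = np.full(len(y), -1)
for cls in np.unique(y):
    keep = rng.choice(np.flatnonzero(y == cls), size=10, replace=False)
    y_semi[keep] = y[keep]

ls = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2, max_iter=100)
ls.fit(X, y_semi)

# transduction_ holds the label assigned to every training sample.
unlabeled = y_semi == -1
acc = accuracy_score(y[unlabeled], ls.transduction_[unlabeled])
print(f"accuracy on {unlabeled.sum()} unlabeled samples: {acc:.2%}")

# label_distributions_ holds one row of class probabilities per sample.
print(ls.label_distributions_[unlabeled][:3].round(3))
```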

🔁 IV. Self-supervised learning (no native Scikit-learn support, but simplified versions are possible)

Scikit-learn itself provides no deep self-supervised models, but its components can be combined into simplified versions

1. Autoencoder: an MLP-based implementation

from sklearn.neural_network import MLPRegressor

# Data preparation (MNIST as the example; downloaded on first use)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, _ = mnist['data'], mnist['target']
X = X / 255.0  # scale pixels to [0, 1]

# Split (the labels are not used)
X_train, X_test = X[:60000], X[60000:]

# Autoencoder: output = input, with the hidden layers compressing the data
autoencoder = MLPRegressor(
    hidden_layer_sizes=(128, 64, 128),  # encoder, 64-unit bottleneck, decoder
    activation='relu',
    solver='adam',
    max_iter=50,
    random_state=42
)

# Train: the input X is also the target
autoencoder.fit(X_train, X_train)

# Reconstruct the test set
X_recon = autoencoder.predict(X_test)

# Visualize (requires matplotlib): originals on top, reconstructions below
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i in range(5):
    axes[0, i].imshow(X_test[i].reshape(28, 28), cmap='gray')
    axes[1, i].imshow(X_recon[i].reshape(28, 28), cmap='gray')
    axes[0, i].axis('off')
    axes[1, i].axis('off')
axes[0, 0].set_title("Original")
axes[1, 0].set_title("Reconstructed")
plt.show()

💡 Note:

This is a shallow autoencoder and will not match what a deep learning framework such as PyTorch can do, but it shows the core idea.
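
`MLPRegressor` has no method that returns hidden activations, so reading out the compressed code requires a manual forward pass through the fitted weights. The sketch below does this on the small bundled digits dataset instead of MNIST, to stay fast and offline. The `encode` helper, the 8-unit bottleneck, and the layer sizes are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X, _ = load_digits(return_X_y=True)
X = X / 16.0  # digits pixels range from 0 to 16

ae = MLPRegressor(
    hidden_layer_sizes=(32, 8, 32),  # 8-unit bottleneck in the middle
    activation="relu",
    max_iter=300,
    random_state=42,
)
ae.fit(X, X)  # the target equals the input


def encode(model, data, n_encoder_layers=2):
    """Run the first hidden layers by hand (ReLU) to read out the bottleneck."""
    h = data
    for W, b in zip(model.coefs_[:n_encoder_layers],
                    model.intercepts_[:n_encoder_layers]):
        h = np.maximum(h @ W + b, 0.0)
    return h


codes = encode(ae, X)
mse = float(np.mean((ae.predict(X) - X) ** 2))
print("code shape:", codes.shape, "reconstruction MSE:", round(mse, 4))
```

The resulting `codes` array can then be fed to any downstream Scikit-learn estimator as a compressed feature set.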

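2. Simple contrastive learning (a simplified SimSiam), simulated with feature engineering

Scikit-learn cannot implement end-to-end contrastive learning directly, but it works well for the downstream task

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

y = mnist['target']  # MNIST labels (discarded in the autoencoder example)
n_train = 60000

# Step 1: obtain features from a self-supervised method (PCA is only a stand-in here)
pca = PCA(n_components=50)
pca.fit(X[:n_train])           # fit on the training portion only
X_features = pca.transform(X)  # pretend these are self-supervised features

# Step 2: train a classifier on a small labeled subset (the semi-supervised idea)
n_labeled = 1000
rng = np.random.default_rng(42)
labeled_idx = rng.choice(n_train, size=n_labeled, replace=False)  # never sample test rows
X_labeled = X_features[labeled_idx]
y_labeled = y[labeled_idx]

# Train a linear classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_labeled, y_labeled)

# Evaluate on the held-out test portion
test_acc = clf.score(X_features[n_train:], y[n_train:])
print(f"Self-supervised features + linear classifier accuracy: {test_acc:.2%}")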

🌐 In practice:

Implement SimSiam/MAE in PyTorch/TensorFlow

Use Scikit-learn for downstream classification/clustering

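🧰 V. Data preprocessing (Chapter 15 recap, implemented in Scikit-learn)

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# num_features / cat_features: lists of the numeric and categorical column
# names in your DataFrame; they must be defined before this code runs

# Numeric features: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: fill with the most frequent value, then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine: route each group of columns to its own pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# X_train must be a DataFrame containing the columns listed above
full_pipeline.fit(X_train, y_train)

✅ Benefit: preprocessing statistics are learned from the training data only, which prevents data leakage, and the same fitted steps apply to new data in a single call.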
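
The pipeline above leaves `num_features`, `cat_features`, and the input DataFrame undefined. The sketch below fills them in with a tiny made-up table so the whole flow can be run end to end; every column name and value is hypothetical, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in the numeric columns.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 50, np.nan],
    "income": [40.0, 55.0, 61.0, np.nan, 72.0, 48.0, 90.0, 52.0],
    "city": ["A", "B", "A", "B", "C", "B", "C", "A"],
})
target = np.array([0, 0, 1, 1, 1, 0, 1, 0])

num_features = ["age", "income"]
cat_features = ["city"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features),
])
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(df, target)

# A new row goes through the same fitted imputers, scaler and encoder.
# "D" was never seen in training; handle_unknown="ignore" encodes it as all zeros.
new_rows = pd.DataFrame({"age": [np.nan], "income": [65.0], "city": ["D"]})
pred = model.predict(new_rows)
print(pred)
```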

🎯 Chapter summary: Scikit-learn capability map

Module	Purpose	Key classes/functions
`datasets`	Data loading	`load_*`, `fetch_*`, `make_*`
`model_selection`	Splitting, tuning	`train_test_split`, `GridSearchCV`, `cross_val_score`
`preprocessing`	Scaling, encoding	`StandardScaler`, `MinMaxScaler`, `OneHotEncoder`
`impute`	Missing values	`SimpleImputer`
`linear_model`	Linear models	`LogisticRegression`, `LinearRegression`
`ensemble`	Ensemble learning	`RandomForestClassifier`, `GradientBoostingClassifier`
`cluster`	Clustering	`KMeans`, `DBSCAN`
`decomposition`	Dimensionality reduction	`PCA`, `TruncatedSVD`
`semi_supervised`	Semi-supervised learning	`LabelPropagation`, `LabelSpreading`
`metrics`	Evaluation	`accuracy_score`, `classification_report`
`pipeline`	Pipelines	`Pipeline`
`compose`	Per-column preprocessing	`ColumnTransformer`

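💡 Practical advice

Prefer the built-in algorithms: Scikit-learn's implementations are mature and well optimized
For semi-supervised learning, try LabelSpreading first: it is more robust to label noise than LabelPropagation
Self-supervised learning needs a deep learning framework: Scikit-learn is better suited to the downstream tasks
Always use a Pipeline: it prevents data leakage and improves maintainability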
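
To make the Pipeline advice concrete, the sketch below tunes a model inside a pipeline, so the scaler is refit on the training part of every cross-validation fold. The dataset, the step names `scaler` and `clf`, and the `C` grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# The scaler is refit on the training part of every fold, so statistics
# from the validation part never leak into preprocessing.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scaling the full dataset before cross-validation would let validation-fold statistics influence training, which is the leakage the pipeline avoids.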

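📘 Where to go next:

Deep learning: PyTorch / TensorFlow (for more complex self-supervised learning)
AutoML: Auto-sklearn (automatic model selection and hyperparameter tuning)
Interpretability: SHAP, LIME (both work with Scikit-learn models)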
