Scikit-learn from Scratch: From Installation to Hands-On Machine Learning Models

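Scikit-learn (sklearn for short) is the classic machine learning library for Python. It bundles tools for classification, regression, clustering, feature engineering, and more, it is easy to pick up, and its documentation is thorough, which makes it a common first choice for newcomers to machine learning. This article walks through the core Scikit-learn workflow from a practical angle. The examples are meant to be run in order, top to bottom, so that you can quickly build your first machine learning model.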

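I. Installing Scikit-learn and Preparing the Environment

1. Installation

Scikit-learn itself depends on NumPy and SciPy, which pip pulls in automatically. The examples in this article also use Pandas, so it is convenient to install these base libraries first and then sklearn: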

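# Install the base libraries (Matplotlib is only needed for the plot in the clustering example)
pip install numpy pandas scipy matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
# Install scikit-learn (the -i flag points pip at the Tsinghua mirror, which is faster inside mainland China)
pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple

Verify the installation: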

import sklearn
print("Scikit-learn version:", sklearn.__version__)  # prints the version number (1.2+ recommended)

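2. Core Modules

Scikit-learn is organized into modules by function. Beginners should focus on the following:

sklearn.datasets: built-in datasets (for practice);
sklearn.model_selection: data splitting and model evaluation (for example, train/test splits and cross-validation);
sklearn.preprocessing: data preprocessing (feature scaling, encoding, and so on);
sklearn.linear_model: linear models (such as linear regression and logistic regression);
sklearn.tree / sklearn.ensemble: tree models and ensemble models (such as random forests);
sklearn.metrics: evaluation metrics (such as accuracy and mean squared error).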

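II. The Core Scikit-learn Workflow (Five Steps)

Nearly every Scikit-learn model is used in the same way, and the workflow can be summarized in five steps:

Prepare the data (load and preprocess);
Split it into a training set and a test set;
Choose a model and initialize it;
Train the model (fit it to the data);
Predict and evaluate.

The classic introductory case, iris classification, is used below to demonstrate the whole workflow.

Example: Iris Classification (Supervised Learning, Classification)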

# Step 1: load the data and inspect it
from sklearn.datasets import load_iris
import pandas as pd

# Load the built-in iris dataset
iris = load_iris()
# Convert to a DataFrame for easier viewing (beginner-friendly)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["label"] = iris.target  # label column (0/1/2 map to the three iris species)
print("First 5 rows of the dataset:")
print(df.head())
print("\nBasic information:")
print(f"Number of features: {iris.data.shape[1]}, number of samples: {iris.data.shape[0]}")
print(f"Classes: {iris.target_names}")

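# Step 2: split into training and test sets (essential: a held-out test set is how overfitting gets detected)
from sklearn.model_selection import train_test_split

# Feature matrix X, target vector y
X = iris.data
y = iris.target
# Split: 70% training, 30% test, with a fixed random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # stratify keeps the class proportions the same in both sets
)
print(f"\nTraining samples: {X_train.shape[0]}, test samples: {X_test.shape[0]}")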

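# Step 3: choose a model and initialize it (logistic regression here)
from sklearn.linear_model import LogisticRegression

# Initialize the model; random_state only matters for solvers that use randomness,
# and max_iter is raised above the default of 100 to avoid a convergence warning
model = LogisticRegression(random_state=42, max_iter=200)

# Step 4: train the model (fit it to the training data)
model.fit(X_train, y_train)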

# Step 5: predict and evaluate
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = model.predict(X_test)
# Compute accuracy (the headline metric for a balanced classification task)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest-set accuracy: {accuracy:.2f}")
# Detailed report (precision, recall, F1 score)
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Output (key part):

Test-set accuracy: 0.98
Classification report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.94      1.00      0.97        15
   virginica       1.00      0.93      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
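
A confusion matrix is a handy cross-check for a report like the one above, because it shows exactly which classes get confused with each other. The sketch below repeats the same split and model so that it runs on its own; the exact counts may differ slightly between Scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Per-class recall = diagonal / row sum; precision = diagonal / column sum
recall = cm.diagonal() / cm.sum(axis=1)
precision = cm.diagonal() / cm.sum(axis=0)
for name, p, r in zip(iris.target_names, precision, recall):
    print(f"{name:>10}: precision={p:.2f}, recall={r:.2f}")
```

Each row sums to the "support" column of the report, so the two views can be compared line by line.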

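III. Data Preprocessing (the First Step in Machine Learning)

Raw data usually cannot be fed straight into a model. Preprocessing is often key to model quality, and Scikit-learn provides a complete set of preprocessing tools.

1. Feature Standardization / Normalization

Numeric features measured on different scales (for example, height 180 cm and weight 70 kg) can distort linear models, especially regularized ones, as well as distance-based methods, so they should be brought onto a common scale: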

from sklearn.preprocessing import StandardScaler, MinMaxScaler

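# 1. Standardization (zero mean, unit variance; the most common choice)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set, then transform it
X_test_scaled = scaler.transform(X_test)        # transform the test set only (avoids data leakage)

# 2. Normalization (rescale to the range 0 to 1)
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)  # test values can fall slightly outside [0, 1]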
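
The "fit on the training set, transform the test set" rule is easy to get wrong by hand. One common way to enforce it is to chain the scaler and the model in a pipeline. The following is a minimal sketch, assuming the same iris split as in the five-step example; the pipeline itself is an addition for illustration, not something the steps above require.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaling statistics from the training data only;
# score() and predict() reuse them on the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_train, y_train)
print(f"Test accuracy with scaling: {pipe.score(X_test, y_test):.2f}")

# The fitted scaler is available under its auto-generated step name
fitted_scaler = pipe.named_steps["standardscaler"]
print("Training means learned by the scaler:", fitted_scaler.mean_.round(2))
```

Because the pipeline behaves like a single estimator, it can also be passed to cross-validation or grid search, where the scaler is then refit inside every fold.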

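2. Encoding Categorical Features

Categorical features stored as strings (such as "male/female" or "Beijing/Shanghai") must be converted to numbers: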

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

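# Example data
data = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "city": ["Beijing", "Shanghai", "Guangzhou", "Beijing"]
})

# 1. Label encoding (codes follow sorted order, so this suits binary columns or target labels;
#    for ordered categories such as "low/medium/high", use OrdinalEncoder with an explicit order)
le = LabelEncoder()
data["gender_code"] = le.fit_transform(data["gender"])  # female=0, male=1

# 2. One-hot encoding (for unordered categories such as "Beijing/Shanghai/Guangzhou")
ohe = OneHotEncoder(sparse_output=False, drop="first")  # drop="first" avoids perfect multicollinearity; sparse_output needs 1.2+
city_encoded = ohe.fit_transform(data[["city"]])
# Convert to a DataFrame for easier viewing
city_df = pd.DataFrame(city_encoded, columns=ohe.get_feature_names_out(["city"]))
data = pd.concat([data, city_df], axis=1)
print("Encoded data:")
print(data)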
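
For categories that do have a natural order, LabelEncoder is a poor fit because it assigns codes in sorted (alphabetical) order, not in the order you mean. A small sketch with OrdinalEncoder follows; the "level" column and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example column with a natural order
df_levels = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df_levels["level_code"] = encoder.fit_transform(df_levels[["level"]]).ravel()
print(df_levels)
```

With the explicit order, "low" maps to 0, "medium" to 1, and "high" to 2, so the numeric codes preserve the ranking.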

3. Handling Missing Values

from sklearn.impute import SimpleImputer

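# Fill missing numeric values with the column mean
imputer = SimpleImputer(strategy="mean")  # alternatives: "median", "most_frequent" (the mode)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # reuse the means learned from the training set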
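
The iris data has no missing values, so the effect of the imputer is invisible there. The sketch below uses a tiny made-up array with np.nan entries to show what actually gets filled in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test data; np.nan marks the missing entries
X_train_demo = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, np.nan]])
X_test_demo = np.array([[np.nan, 5.0]])

imputer_demo = SimpleImputer(strategy="mean")
X_train_filled = imputer_demo.fit_transform(X_train_demo)  # learns column means 4.0 and 3.0
X_test_filled = imputer_demo.transform(X_test_demo)        # reuses the training means

print(X_train_filled)
print(X_test_filled)
```

The missing test value is filled with 4.0, the mean of the first training column, which is the same leakage-avoiding pattern used for scaling.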

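IV. Common Machine Learning Models in Practice

Scikit-learn wraps the mainstream machine learning models behind one consistent interface, so switching models usually means swapping only the model class.

1. Regression Models (Predicting Continuous Values, Such as House Prices or Sales)

The built-in diabetes dataset is used here as the regression example (the Boston housing dataset was removed from Scikit-learn in version 1.2):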

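from sklearn.datasets import load_diabetes  # diabetes dataset (regression)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the data (separate variable names keep the iris split from earlier intact)
diabetes = load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target

# Split the dataset
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Train a linear regression model
lr = LinearRegression()
lr.fit(X_reg_train, y_reg_train)

# Predict and evaluate
y_reg_pred = lr.predict(X_reg_test)
print(f"Mean squared error (MSE): {mean_squared_error(y_reg_test, y_reg_pred):.2f}")
print(f"Coefficient of determination (R²): {r2_score(y_reg_test, y_reg_pred):.2f}")  # the closer R² is to 1, the better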
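
To make the two regression metrics less of a black box, they can be recomputed by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below repeats the diabetes fit under its own variable names and compares the manual values with the library functions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.3, random_state=42)
pred_d = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# MSE: average squared residual
mse_manual = np.mean((y_te - pred_d) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_te - pred_d) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE  manual={mse_manual:.2f}  sklearn={mean_squared_error(y_te, pred_d):.2f}")
print(f"R^2  manual={r2_manual:.4f}  sklearn={r2_score(y_te, pred_d):.4f}")
```

Seen this way, an R² of 0 means the model does no better than always predicting the mean of the test targets, and a negative value means it does worse.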

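2. Classification Models (Predicting Categories, Such as Disease Status or Customer Churn)

Besides logistic regression, decision trees and random forests are commonly used: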

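from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rebuild the iris split from the five-step example so the classifiers train on classification data
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 1. Decision tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)  # limiting the depth reduces overfitting
dt.fit(X_train, y_train)
dt_accuracy = accuracy_score(y_test, dt.predict(X_test))
print(f"Decision tree accuracy: {dt_accuracy:.2f}")

# 2. Random forest (an ensemble of trees, usually more robust than a single tree)
rf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees
rf.fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
print(f"Random forest accuracy: {rf_accuracy:.2f}")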
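
Tree ensembles also report how much each feature contributed to their splits, which is a quick way to see what the model relies on. The sketch below is an illustrative addition that fits a forest on the full iris data purely for inspection, not for evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# Impurity-based importances; they are non-negative and sum to 1
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

These importances are computed from the training data and can favor features with many distinct values, so they are best read as a rough ranking.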

3. Unsupervised Learning (Clustering Unlabeled Data)

K-Means clustering is used as the example, applied to the iris data without its labels:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Initialize K-Means (3 clusters, matching the 3 iris species; n_init is set explicitly
# because its default changed across Scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(iris.data)

# Visualize the clusters (using the first two features)
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=cluster_labels, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker="*", s=200, c="red")
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("K-Means clustering result")
plt.show()
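
A scatter plot gives only a visual impression of cluster quality. Two numeric checks are sketched below: the silhouette score, which needs no labels, and the adjusted Rand index, which compares the clusters with the true species and is therefore only usable when labels happen to exist, as they do for iris.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(iris.data)

# Needs no ground truth: ranges from -1 to 1, higher means tighter, better-separated clusters
sil = silhouette_score(iris.data, labels)
print(f"Silhouette score: {sil:.2f}")

# Uses the true species only for checking; 1.0 means a perfect match up to relabeling
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

Cluster IDs are arbitrary, so cluster 0 need not correspond to class 0. That is why a relabeling-invariant measure is used here instead of plain accuracy.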

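V. Model Evaluation and Tuning

1. Cross-Validation (a More Reliable Estimate)

The result of a single train/test split depends partly on chance. Cross-validation averages over several splits and also shows how stable the model is: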

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full iris data
scores = cross_val_score(rf, iris.data, iris.target, cv=5, scoring="accuracy")
print(f"5-fold cross-validation accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
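
Passing cv=5 with a classifier already uses stratified folds, but without shuffling. When the rows of a dataset are sorted by class or by time of collection, an explicit splitter gives more control. The following is a sketch with StratifiedKFold; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# Shuffle before splitting and keep class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

fold_scores = cross_val_score(clf, X_iris, y_iris, cv=cv, scoring="accuracy")
for i, score in enumerate(fold_scores, start=1):
    print(f"Fold {i}: {score:.2f}")
print(f"Mean: {fold_scores.mean():.2f}, std: {fold_scores.std():.2f}")
```

Looking at the individual fold scores, not just the mean, shows whether one unlucky split is dragging the average down.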

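2. Hyperparameter Tuning (Improving Model Performance)

Grid search tries every combination in a parameter grid and picks the best one by cross-validation: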

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],  # number of trees
    "max_depth": [3, 5, None]        # maximum tree depth (None means unlimited)
}

# Grid search (5-fold cross-validation)
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2f}")

# Predict with the best model (refit on the whole training set by default)
best_model = grid_search.best_estimator_
final_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test-set accuracy of the best model: {final_accuracy:.2f}")
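
The best parameters alone do not say how close the runners-up were. The fitted search object keeps every result in cv_results_, which loads neatly into a DataFrame. The sketch below uses a deliberately small grid, different from the one above, so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_iris, y_iris = load_iris(return_X_y=True)

# A deliberately small grid so the sketch runs quickly
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_iris, y_iris)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["param_n_estimators", "param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

If several combinations score within one standard deviation of each other, the simpler one (fewer or shallower trees) is often a reasonable pick.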

VI. Saving and Loading Models

A trained model can be saved to a file and loaded later, with no need to retrain:

import joblib

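# Save the model
joblib.dump(best_model, "iris_rf_model.pkl")

# Load the model (only load files you trust: joblib uses pickle under the hood)
loaded_model = joblib.load("iris_rf_model.pkl")

# Predict with the loaded model
new_data = [[5.1, 3.5, 1.4, 0.2]]  # features of a new iris sample
pred = loaded_model.predict(new_data)
print(f"Predicted class: {iris.target_names[pred[0]]}")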
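
A quick way to gain confidence in a saved model is a round-trip check: dump it, load it back, and confirm the predictions match. The sketch below trains its own small forest and writes to a temporary directory; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
original = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_iris, y_iris)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(original, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same = bool((original.predict(X_iris) == restored.predict(X_iris)).all())
print("Predictions identical after reload:", same)

# predict_proba gives class probabilities for a new sample
sample = [[5.1, 3.5, 1.4, 0.2]]
print("Class probabilities:", restored.predict_proba(sample).round(2))
```

Saved models are tied to the Scikit-learn version that produced them, so it helps to record that version alongside the file.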