Notes / Data Splitting Methods in sklearn

Contents

- I. Introduction
- II. Data Splitting Methods
  - 1. Hold-out
  - 2. K-Fold Cross-Validation
  - 3. Leave-One-Out
- III. Summary

I. Introduction

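Data splitting separates the samples used to fit a model from the samples used to evaluate it, so that the reported score reflects performance on data the model has not seen. This note briefly covers three splitting methods from sklearn.model_selection.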

II. Data Splitting Methods

1. Hold-out

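Use train_test_split to split the data into a training set and a test set. By default the samples are shuffled before the split.
Code snippet: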

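import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption, so the snippet runs on its own): 10 samples, 2 features
X, y = np.arange(20).reshape(10, 2), np.arange(10)
# Hold out 40% of the samples for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print('Train obs: ', len(X_train))
print('Test obs: ', len(X_test))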
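
To show what the hold-out split is for, here is a minimal sketch that fits a model on the training part and scores it on the held-out part. The iris dataset and LogisticRegression are assumptions chosen only for illustration; any estimator and dataset would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dataset (assumption): 150 samples, 4 features
X, y = load_iris(return_X_y=True)

# Same split settings as above: 40% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit on the training set only, then evaluate on the unseen test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print('Train obs: ', len(X_train))
print('Test obs: ', len(X_test))
print('Test accuracy: ', round(accuracy, 3))
```

The score depends on which samples happen to land in the test set, which is the main weakness of a single hold-out split on small datasets.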

2. K-Fold Cross-Validation

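Use KFold to split the data into several folds, then loop over them so that each fold serves as the test set once while the remaining folds are used for training.
Code snippet: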

import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(20, 1)
# Create a KFold object with 5 folds; shuffle=True shuffles the samples before splitting
# A fixed random_state makes the shuffle identical on every run
kf = KFold(n_splits=5, shuffle=True, random_state=10)
# kf.get_n_splits(X)  # returns the number of folds (5)
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
# Create a KFold object with 5 folds, without shuffling:
# each test fold is then a contiguous block of indices
kf = KFold(n_splits=5, shuffle=False)
for train_index, test_index in kf.split(X):
    print(train_index, test_index)

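Note: Suppose there are N samples in total. K-fold cross-validation splits the data into K folds of nearly equal size. In each split, test_index holds about N/K samples (if N is not divisible by K, the first N % K folds get one extra sample). The remaining samples form the training set, so train_index holds N minus the number of test samples. In this example, each test_index holds 20/5 = 4 samples and each train_index holds 16.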
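
The fold sizes for a non-divisible case can be checked directly. The sketch below uses N = 22 and K = 5, values chosen only for illustration, and collects the sizes of the test and training index arrays for each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# N = 22 samples is not divisible by K = 5 (illustrative values)
X = np.zeros((22, 1))
kf = KFold(n_splits=5)

# Record the size of each test fold and of the matching training set
test_sizes = []
train_sizes = []
for train_index, test_index in kf.split(X):
    test_sizes.append(len(test_index))
    train_sizes.append(len(train_index))

print(test_sizes)   # [5, 5, 4, 4, 4]
print(train_sizes)  # [17, 17, 18, 18, 18]
```

The first 22 % 5 = 2 folds receive one extra test sample, and every training set has 22 minus the size of its test fold.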

3. Leave-One-Out

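Each split holds out a single sample for testing and trains on all the others, so N samples produce N splits.
Code snippet: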

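import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.random.randn(20, 1)  # same shape as in the K-Fold example
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # number of splits equals the number of samples (20)
for train_index, test_index in loo.split(X):
    print(train_index, test_index)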
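
Leave-One-Out can be seen as the limiting case of K-fold with K = N. The following sketch compares the index splits produced by LeaveOneOut with those of an unshuffled KFold whose n_splits equals the number of samples; the 20-sample array is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.random.randn(20, 1)  # illustrative data, N = 20

# Collect (train, test) index lists from both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X)).split(X)]

print(len(loo_splits))          # one split per sample
print(loo_splits == kf_splits)  # whether the two splitters agree
```

Because every sample needs its own model fit, the cost of Leave-One-Out grows linearly with N, which is why it is usually reserved for small datasets.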

III. Summary

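| Method | Main idea | sklearn implementation | Training set size | Test set size | Use cases and characteristics |
|---|---|---|---|---|---|
| Hold-out | Randomly assign one part of the data to training and the rest to testing | `train_test_split` | Chosen fraction (e.g. 60%) | Chosen fraction (e.g. 40%) | Simple and efficient; suits large datasets |
| K-Fold cross-validation | Split the data into K roughly equal folds; each fold serves as the test set once | `KFold` | About N - N/K | About N/K | More stable estimate; suits small to medium datasets |
| Leave-One-Out | Hold out one sample for testing each time and train on the rest | `LeaveOneOut` | N - 1 | 1 | Suits small sample sizes |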

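Notes:

- Training and test set sizes are given either as a fraction of the total sample count or as a number of samples, where N is the total number of samples.
- K-fold and Leave-One-Out are both forms of cross-validation. Every sample is used for testing exactly once, so they give a more complete picture of model performance than a single split.
- Hold-out is simple to implement and suits quick experiments on large datasets.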
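
As a rough illustration of how these splitters feed into model evaluation, the sketch below passes a KFold and a LeaveOneOut object to cross_val_score. The iris dataset and LogisticRegression are assumptions made for the example and are not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Illustrative dataset and model (assumptions)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: 5 fits, one accuracy score per fold
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Leave-One-Out: one fit per sample, each score is 0 or 1 (one test sample)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(len(kfold_scores), kfold_scores.mean())
print(len(loo_scores), loo_scores.mean())
```

Averaging the per-split scores gives a single estimate; the Leave-One-Out run performs 150 fits here, compared with 5 for K-fold.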

Reference: https://scikit-learn.org/stable/api/sklearn.model_selection.html

If you spot any mistakes in this post, corrections are welcome.