Introduction to Machine Learning: Underfitting and Overfitting in Decision Trees

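Goal

Learn how to tune a decision tree model so that it neither underfits nor overfits, and find the model complexity that works best on unseen data.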

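What are underfitting and overfitting?

Underfitting: the model is too simple to capture the important patterns in the data.
Overfitting: the model is too complex; it memorizes noise in the training data and performs poorly on new data.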
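
To make the two failure modes concrete, here is a minimal sketch on synthetic data (not the Iowa housing data used later). The noisy sine curve, the noise level, and all variable names are assumptions chosen only for illustration. It compares training and validation MAE for a two-leaf tree and for an unrestricted tree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, made-up data: a noisy sine curve (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

demo_results = {}
for label, leaves in [("2 leaves", 2), ("unrestricted", None)]:
    # max_leaf_nodes=None means the tree may grow without a leaf limit
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    demo_results[label] = (train_mae, val_mae)
    print(f"{label:>12}: train MAE = {train_mae:.3f}, validation MAE = {val_mae:.3f}")
```

The two-leaf tree has a sizeable error even on the data it was trained on, which is what underfitting looks like. The unrestricted tree reproduces its training data essentially perfectly but still makes errors on the validation rows; that gap between training and validation error is the signature of overfitting.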

1. Data preparation

# Import the required libraries
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the Iowa housing data
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)

# Create the target variable y
y = home_data.SalePrice

# Select the features to build X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
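
The split above does not pass test_size, so train_test_split falls back to its default and holds out 25% of the rows for validation. The sketch below shows this on 100 made-up rows (the arrays and variable names are placeholders, not the housing data) and also checks that a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 2 features each (illustration only)
X_split = np.arange(200).reshape(100, 2)
y_split = np.arange(100)

# No test_size given: the default holds out 25% of the rows
a_X, b_X, a_y, b_y = train_test_split(X_split, y_split, random_state=1)
print(len(a_X), len(b_X))

# Same random_state, same split
a_X2, b_X2, _, _ = train_test_split(X_split, y_split, random_state=1)
print(np.array_equal(b_X, b_X2))
```

Fixing random_state matters here because every model compared later should be scored on the same validation rows.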

2. Baseline model performance

Let's first see how a decision tree with no restrictions performs:

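# Create a decision tree model with no restrictions
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

# Predict on the validation set and compute the MAE
# (scikit-learn's convention is true values first, predictions second)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))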
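
For readers who have not met MAE (mean absolute error) before, the following sketch computes it by hand. The three prices are hypothetical numbers made up for illustration; they are not taken from the Iowa data.

```python
# Hypothetical sale prices, made up for illustration
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

# MAE is the average of |actual - predicted| over all rows
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae_by_hand = sum(abs_errors) / len(abs_errors)

print("Absolute errors:", abs_errors)
print("MAE: {:,.0f}".format(mae_by_hand))
```

The absolute errors are 10,000, 10,000 and 20,000, so the MAE prints as 13,333: on average the predictions are off by about that many dollars.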

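3. Controlling model complexity

We use the max_leaf_nodes parameter to control the complexity of the decision tree. It caps the number of leaves the tree may grow: the fewer leaf nodes, the simpler the model.

Create an evaluation function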
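
As a quick check of what the parameter does, the sketch below fits trees with different caps on made-up data and reports how many leaves each one ends up with. The data and the variable names are assumptions for illustration; get_n_leaves() and get_depth() are methods of the fitted scikit-learn tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data for illustration only
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = X_toy[:, 0] + 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

leaf_counts = {}
for cap in [5, 50, None]:  # None means no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0).fit(X_toy, y_toy)
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}: {leaf_counts[cap]} leaves, depth {tree.get_depth()}")
```

The capped trees never exceed their limit, while the uncapped tree keeps splitting until it has far more leaves than either cap allows.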

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Return the validation MAE of a tree fitted with the given max_leaf_nodes."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

4. Finding the best parameter value

Method 1: explicit loop

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = []

# Try each candidate value of max_leaf_nodes
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    scores.append(mae)

# Find the parameter value with the lowest MAE
best_score = min(scores)
best_index = scores.index(best_score)
best_tree_size = candidate_max_leaf_nodes[best_index]

print(f"Best max_leaf_nodes: {best_tree_size}")
print(f"Corresponding MAE: {best_score:,.0f}")

Method 2: dict comprehension (concise version)

# The same search written as a dict comprehension: max_leaf_nodes -> MAE
scores = {
    leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
    for leaf_size in candidate_max_leaf_nodes
}

# The key whose value (MAE) is smallest
best_tree_size = min(scores, key=scores.get)

print(f"Best max_leaf_nodes: {best_tree_size}")
print(f"Corresponding MAE: {scores[best_tree_size]:,.0f}")

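5. Training the final model

Once the best parameter value has been chosen, the validation split has done its job, so the final model can be trained on all of the data: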

# Create the final model with the best parameter value
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Train on all of the data
final_model.fit(X, y)
print("Final model trained!")

6. Analyzing the results

Performance across parameter values

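# Plot validation error for each parameter value
# (assumes `scores` is the dict from Method 2: max_leaf_nodes -> MAE)
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(list(scores.keys()), list(scores.values()), 'bo-')
plt.xlabel('Max Leaf Nodes')
plt.ylabel('Mean Absolute Error')
plt.title('Model complexity vs. validation error')
plt.grid(True)
plt.show()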

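Key concepts

Bias-variance tradeoff

Low max_leaf_nodes (simple model): high bias, low variance → underfitting
High max_leaf_nodes (complex model): low bias, high variance → overfitting
Moderate max_leaf_nodes: a good balance between bias and variance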

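Best practices

Use a validation set: never use test data to choose parameters
Search systematically: try several parameter values and keep the one with the lowest validation error
Final training: train the final model on all available data
Cross-validation: for more reliable results, consider using cross-validation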
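
The cross-validation suggestion can be sketched with scikit-learn's cross_val_score. This is one possible approach rather than part of the walkthrough above: the data is made up for illustration, and the names X_cv, y_cv and cv_mae are placeholders. With the housing data you would pass X and y instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up data standing in for X and y (illustration only)
rng = np.random.RandomState(0)
X_cv = rng.uniform(0, 10, size=(300, 3))
y_cv = X_cv[:, 0] * 2 + np.sin(X_cv[:, 1]) + rng.normal(scale=0.5, size=300)

cv_mae = {}
for leaves in [5, 25, 50, 100]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    # scikit-learn scorers follow "higher is better", so MAE comes back negated
    neg_scores = cross_val_score(model, X_cv, y_cv, cv=5,
                                 scoring="neg_mean_absolute_error")
    cv_mae[leaves] = -neg_scores.mean()

best_leaves = min(cv_mae, key=cv_mae.get)
print({k: round(float(v), 3) for k, v in cv_mae.items()})
print("Best max_leaf_nodes by 5-fold CV:", best_leaves)
```

Each candidate is scored on five different held-out folds instead of a single validation split, so the comparison depends less on which rows happened to land in the validation set.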

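Next steps

The same approach works for other hyperparameters:

min_samples_split: the minimum number of samples required to split an internal node
min_samples_leaf: the minimum number of samples required at a leaf node
max_depth: the maximum depth of the tree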
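
When several of these hyperparameters are tuned together, looping by hand gets unwieldy. One option, shown here as a hedged sketch rather than a recommended recipe, is scikit-learn's GridSearchCV. The data and the candidate values in param_grid are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Made-up data for illustration only
rng = np.random.RandomState(0)
X_gs = rng.uniform(0, 10, size=(300, 3))
y_gs = X_gs[:, 0] * 2 + np.sin(X_gs[:, 1]) + rng.normal(scale=0.5, size=300)

# Candidate values are arbitrary examples, not tuned recommendations
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

# Every combination is scored with 5-fold cross-validated (negated) MAE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print("Cross-validated MAE: {:.3f}".format(-search.best_score_))
```

The grid grows multiplicatively with each added hyperparameter (here 4 × 3 × 3 = 36 combinations), so it pays to keep the candidate lists short.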