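Polynomial Regression Explained with Worked Examples: A Beginner-Friendly Guide

**In one sentence**: when your data follows a curve rather than a straight line, use polynomial regression.

📌 Contents

| # | Topic |
|---|-------|
| 1 | Why do we need polynomial regression? |
| 2 | What is polynomial regression? |
| 3 | The formula and how it works |
| 4 | Core concept: feature transformation |
| 5 | Worked examples (with detailed comments) |
| 6 | The key parameter: choosing degree |
| 7 | Underfitting vs. overfitting |
| 8 | Code comparison: linear vs. polynomial |
| 9 | Summary |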

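1️⃣ Why do we need polynomial regression?

❌ The limitation of linear regression

```text
The data looks like this (a curve):
  |        *
  |      *   *
  |    *       *
  |  *           *
  |*_______________*________
        Linear regression can only draw a straight line, so the fit is poor!
```

| Model | Fits straight lines | Fits curves |
|-------|---------------------|-------------|
| Linear regression | ✅ | ❌ |
| **Polynomial regression** | ✅ | ✅ |

🎯 **Core idea**: expand the input into higher-degree features, then fit those features with ordinary linear regression.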

2️⃣ What is polynomial regression?

📖 Definition

| Name | Formula | Shape |
|------|---------|-------|
| **Degree 1 (linear)** | $y = w_0 + w_1 x$ | Straight line 📏 |
| **Degree 2 (quadratic)** | $y = w_0 + w_1 x + w_2 x^2$ | Parabola 🔄 |
| **Degree 3 (cubic)** | $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ | S-shaped curve 🌊 |
| **Degree n** | $y = w_0 + w_1 x + w_2 x^2 + \dots + w_n x^n$ | Flexible curve with up to n−1 turning points 🎨 |

🔑 The key sentence

**Polynomial regression = feature transformation + linear regression**

Turn $x$ into $x, x^2, x^3, \dots$, then run linear regression on these new features.

3️⃣ The formula and how it works

📐 Formula

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \dots + w_d x^d$$

| Symbol | Meaning |
|--------|---------|
| $d$ | **Degree**: 2 means quadratic, 3 means cubic |
| $w_0$ | Intercept (bias) |
| $w_1, w_2, \dots$ | Weight of each feature |
| $x, x^2, x^3$ | Transformed features |
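The formula can also be written in matrix form, which makes it easier to see why an ordinary linear solver is enough. This is the standard least-squares formulation rather than something the article spells out; the symbols $\Phi$ (the matrix of transformed features), $n$ (the number of samples) and $\mathbf{w}^\star$ (the fitted weights) are introduced here purely for illustration.

$$
\hat{\mathbf{y}} = \Phi\,\mathbf{w},
\qquad
\Phi =
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^d \\
1 & x_2 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^d
\end{bmatrix},
\qquad
\mathbf{w} = (w_0, w_1, \dots, w_d)^\top
$$

$$
\mathbf{w}^\star = \arg\min_{\mathbf{w}} \; \lVert \mathbf{y} - \Phi\,\mathbf{w} \rVert_2^2
$$

The prediction depends on the weights only through the product $\Phi\,\mathbf{w}$, so fitting is a linear least-squares problem no matter how curved the result is as a function of $x$.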

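🔧 How it works (3 steps)

```text
Raw data: x = [1, 2, 3]

Step 1: feature transformation (the key step!)
  x      → [x,  x²,  x³]
  1      → [1,   1,   1]
  2      → [2,   4,   8]
  3      → [3,   9,  27]

Step 2: run linear regression on the new features
  y = w₀ + w₁·x + w₂·x² + w₃·x³

Step 3: to predict, transform the input first, then plug it into the formula
  new x=4 → [4, 16, 64] → y = w₀ + w₁·4 + w₂·16 + w₃·64
```

📌 **In essence**: it is still linear regression. Only the features have changed: the model is linear in the weights even though it is curved in x.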
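The three steps can be sketched by hand with NumPy alone, which shows that nothing beyond least squares is involved. The data and the helper name `expand` are made up for this illustration, and the sketch adds a leading column of 1s so that the intercept w₀ is fitted together with the other weights.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x - x^2 + 0.5x^3
true_w = np.array([1.0, 2.0, -1.0, 0.5])
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = true_w[0] + true_w[1] * x + true_w[2] * x**2 + true_w[3] * x**3


def expand(x, degree):
    """Step 1: turn each x into the row [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)


# Step 2: ordinary least squares on the expanded features
Phi = expand(x, 3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Step 3: transform the new input the same way, then apply the weights
x_new = np.array([4.0])
y_new = expand(x_new, 3) @ w

print("recovered weights:", np.round(w, 6))
print("prediction at x=4:", np.round(y_new, 6))
```

Because the data here contains no noise, the recovered weights match the ones used to generate it. With noisy data they would only be close.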

4️⃣ Core concept: feature transformation

This is the heart of polynomial regression.

🧪 Example

| Raw feature $x$ | Degree 1: $x$ | Degree 2: $x^2$ | Degree 3: $x^3$ |
|-----------------|---------------|-----------------|-----------------|
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 8 |
| 3 | 3 | 9 | 27 |
| 4 | 4 | 16 | 64 |

After the transformation, the original **one-dimensional** data has become **three-dimensional**.

```python
# sklearn performs this transformation for you
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3]])        # one feature, three samples

poly = PolynomialFeatures(degree=2)  # quadratic
X_new = poly.fit_transform(X)

# Original X:  [[1], [2], [3]]
# Transformed: [[1, 1, 1],    ← [1, x, x²]
#               [1, 2, 4],    ← [1, x, x²]
#               [1, 3, 9]]    ← [1, x, x²]
```

| Before | After |
|--------|-------|
| `[1]` | `[1, 1, 1]` |
| `[2]` | `[1, 2, 4]` |
| `[3]` | `[1, 3, 9]` |

🎯 The first column is all 1s: it is the feature that pairs with the intercept $w_0$.

5️⃣ Worked examples (with detailed comments) ⭐⭐⭐

📊 Example 1: fitting curved data

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures  # feature transformation
from sklearn.linear_model import LinearRegression     # linear regression model
from sklearn.pipeline import make_pipeline            # chains transformation + regression
from sklearn.metrics import r2_score

# ========== 1. Generate curved data ==========
np.random.seed(42)  # fix the seed so the results are reproducible
X = np.linspace(-3, 3, 100).reshape(-1, 1)  # 100 points in [-3, 3], shape (100, 1)
y = 0.5 * X**2 + X + 2 + np.random.normal(0, 1, X.shape)  # y = 0.5x² + x + 2 + noise
# The true relationship is quadratic, but the noise keeps it from being a perfect parabola

print(f"X shape: {X.shape}")  # (100, 1)
print(f"y shape: {y.shape}")  # (100, 1)

# ========== 2. Train three models to compare ==========
# Model 1: linear regression (degree 1)
model_linear = LinearRegression()
model_linear.fit(X, y)
y_pred_linear = model_linear.predict(X)

# Model 2: quadratic polynomial regression (degree=2)
model_quad = make_pipeline(
    PolynomialFeatures(degree=2),  # first transform the features: [x] → [1, x, x²]
    LinearRegression()             # then run linear regression
)
model_quad.fit(X, y)
y_pred_quad = model_quad.predict(X)

# Model 3: degree-15 polynomial regression → overfits
model_high = make_pipeline(
    PolynomialFeatures(degree=15),  # expand up to x¹⁵
    LinearRegression()
)
model_high.fit(X, y)
y_pred_high = model_high.predict(X)

# ========== 3. Evaluate (on the training data) ==========
print("\n===== Model evaluation =====")
print(f"Linear     R² = {r2_score(y, y_pred_linear):.4f}")  # poor fit
print(f"Quadratic  R² = {r2_score(y, y_pred_quad):.4f}")    # good fit
print(f"Degree 15  R² = {r2_score(y, y_pred_high):.4f}")    # highest here, but it is fitting noise

# ========== 4. Plot the three fits side by side ==========
panels = [
    ("Linear regression (degree=1)", y_pred_linear, "red"),
    ("Quadratic polynomial (degree=2)", y_pred_quad, "green"),
    ("Degree-15 polynomial (overfits)", y_pred_high, "orange"),
]
plt.figure(figsize=(14, 5))
for i, (title, y_pred, color) in enumerate(panels, start=1):
    plt.subplot(1, 3, i)
    plt.scatter(X, y, alpha=0.5, label="Observed data")
    plt.plot(X, y_pred, color=color, linewidth=2, label="Fit")
    plt.title(f"{title}\nR² = {r2_score(y, y_pred):.4f}")
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

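📊 Results

Illustrative figures; a real run will differ with the noise draw. All three scores are measured on the training data, where raising the degree can only keep R² the same or push it up, so the degree-15 score signals overfitting rather than a better model.

| Model | R² | Shape | Verdict |
|-------|----|-------|---------|
| Linear regression (degree=1) | 0.45 | 📏 Straight line | ❌ Poor fit |
| Quadratic polynomial (degree=2) | 0.89 | 🔄 Parabola | ✅ **Fits the curve well** |
| Degree-15 polynomial (degree=15) | 0.99 | 🌊 Wild oscillation | ❌ **Overfits** |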

📊 Example 2: predicting house prices (a business-style scenario on synthetic data)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# ========== 1. Generate house-price data ==========
# Assumption: the larger the home, the lower the price per square metre (diminishing returns)
np.random.seed(42)
area = np.linspace(30, 200, 200)  # area: 30 to 200 m²
# Noise std lowered from 50000 to 5000: at 50000 the noise swamps the curvature
# and all three models score about the same
price = 50000 + 1500 * area - 5 * area**2 + np.random.normal(0, 5000, area.shape)
# True relationship: price = 50000 + 1500·area - 5·area² + noise
# Total price is an inverted U: it peaks at 150 m², after which the
# falling unit price outweighs the extra area

X = area.reshape(-1, 1)  # reshape to (200, 1)
y = price.reshape(-1, 1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ========== 2. Train the models ==========
# Linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Quadratic polynomial regression
poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2.fit(X_train, y_train)
y_pred_poly2 = poly2.predict(X_test)

# Cubic polynomial regression
poly3 = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly3.fit(X_train, y_train)
y_pred_poly3 = poly3.predict(X_test)

# ========== 3. Evaluate ==========
print("===== Test-set evaluation =====")
print(f"{'Model':<20} {'RMSE':<12} {'R²':<10}")
print("-" * 42)
print(f"{'Linear':<20} {np.sqrt(mean_squared_error(y_test, y_pred_lr)):<12.2f} {r2_score(y_test, y_pred_lr):<10.4f}")
print(f"{'Quadratic':<20} {np.sqrt(mean_squared_error(y_test, y_pred_poly2)):<12.2f} {r2_score(y_test, y_pred_poly2):<10.4f}")
print(f"{'Cubic':<20} {np.sqrt(mean_squared_error(y_test, y_pred_poly3)):<12.2f} {r2_score(y_test, y_pred_poly3):<10.4f}")

# ========== 4. Plot ==========
plt.figure(figsize=(12, 5))

# Scatter of the test points plus the three fitted curves
plt.scatter(X_test, y_test, alpha=0.5, label='Observed price', color='black')

# Draw the curves on a sorted, evenly spaced grid (unsorted x values would make the lines zigzag)
X_plot = np.linspace(30, 200, 300).reshape(-1, 1)
plt.plot(X_plot, lr.predict(X_plot), label='Linear', linewidth=2)
plt.plot(X_plot, poly2.predict(X_plot), label='Quadratic', linewidth=2)
plt.plot(X_plot, poly3.predict(X_plot), label='Cubic', linewidth=2)

plt.xlabel('Area (m²)', fontsize=12)
plt.ylabel('Total price (yuan)', fontsize=12)  # the generated values are raw yuan, not units of 10,000
plt.title('House prices: linear vs. polynomial regression', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

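📊 Results

Illustrative figures; a real run will differ. The code prints RMSE in the same unit as `price`, and for a well-specified model the RMSE approaches the standard deviation of the noise used in the data generator.

| Model | RMSE (×10⁴) | R² | Verdict |
|-------|-------------|----|---------|
| Linear regression | 85.32 | 0.65 | ⚠️ Underfits |
| **Quadratic polynomial** | **42.18** | **0.92** | ✅ **Best** |
| Cubic polynomial | 43.56 | 0.91 | ⚠️ Slightly overfits |

| Conclusion | Explanation |
|------------|-------------|
| Price is not linear in area | The larger the area, the lower the unit price (diminishing returns) |
| The quadratic model is best | It captures the inverted-U relationship |
| The cubic model is unnecessary | It adds complexity without improving the fit |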
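To see what "capturing the inverted U" means in terms of the learned weights, one can read them back out of the fitted pipeline. The sketch below is an assumption-laden illustration: it uses a noise-free copy of the price formula so that the recovered weights are easy to check, and the variable names (`reg`, `peak_area`) are invented here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free version of the price curve: price = 50000 + 1500·area - 5·area²
area = np.linspace(30, 200, 50)
price = 50000 + 1500 * area - 5 * area**2

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(area.reshape(-1, 1), price)

# make_pipeline names each step after its lowercased class name
reg = model.named_steps["linearregression"]
w0 = reg.intercept_
_, w1, w2 = reg.coef_          # the first entry belongs to the constant column

# A parabola w0 + w1·a + w2·a² with w2 < 0 peaks at a = -w1 / (2·w2)
peak_area = -w1 / (2 * w2)

print(f"w0={w0:.1f}  w1={w1:.2f}  w2={w2:.3f}  peak at {peak_area:.1f} m²")
```

A negative weight on the squared term is what bends the curve downward. On noisy data the recovered weights, and therefore the estimated peak, will only approximate the generating values.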

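6️⃣ The key parameter: how to choose degree

| degree | Fitting power | Risk | When to use |
|--------|---------------|------|-------------|
| 1 | Weak 📏 | Underfitting | The data really is linear |
| 2~3 | Moderate 🔄 | Low ✅ | **The most common choice; suits most curved data** |
| 4~5 | Strong 🌊 | Medium | Complex curves |
| 10+ | Very strong 💥 | **Overfitting** ❌ | Only with a very large dataset |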

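🎯 How to choose degree

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Reuses X_train and y_train from Example 2. The test set stays untouched:
# it is only for a final check once the degree has been chosen.
degrees = [1, 2, 3, 4, 5, 10, 15]
train_scores = []
val_scores = []

for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())

    # 5-fold cross-validation on the training data only
    cv = cross_validate(model, X_train, y_train, cv=5, scoring='r2',
                        return_train_score=True)
    train_scores.append(cv['train_score'].mean())  # score on the folds used for fitting
    val_scores.append(cv['test_score'].mean())     # score on the held-out fold
    print(f"degree={d:2d}  train R²={train_scores[-1]:.4f}  validation R²={val_scores[-1]:.4f}")

# Plot both curves to pick the best degree
plt.plot(degrees, train_scores, label='Training folds', marker='o')
plt.plot(degrees, val_scores, label='Validation folds', marker='s')
plt.xlabel('degree')
plt.ylabel('R²')
plt.legend()
plt.title('Choosing degree: look for the peak of the validation R²')
plt.grid(True, alpha=0.3)
plt.show()
```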

Illustrative figures; a real run will differ.

| degree | Train R² | Held-out R² | Note |
|--------|----------|-------------|------|
| 1 | 0.65 | 0.63 | Underfits |
| 2 | 0.90 | 0.88 | ✅ Best |
| 3 | 0.92 | 0.87 | Slightly overfits |
| 5 | 0.95 | 0.82 | Overfits |
| 10 | 0.99 | 0.70 | ❌ Severely overfits |
| 15 | 1.00 | 0.55 | ❌ Completely overfits |

🎯 **Choose degree=2**: it has the highest held-out R² (0.88).
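The same selection can be left to scikit-learn's `GridSearchCV`, which tries each degree with cross-validation and keeps the best one. The sketch below runs on its own synthetic quadratic data rather than the house-price data; the dataset, the noise level and the candidate list are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = 0.5x² + x + 2 plus mild noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.normal(0, 0.3, 200)

pipeline = make_pipeline(PolynomialFeatures(), LinearRegression())

search = GridSearchCV(
    pipeline,
    # "<step name>__<parameter>": make_pipeline names steps after the lowercased class
    param_grid={"polynomialfeatures__degree": [1, 2, 3, 4, 5]},
    # X is sorted, so shuffle before splitting into folds
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)

best_degree = search.best_params_["polynomialfeatures__degree"]
print("best degree:", best_degree)
print("mean validation R²:", round(search.best_score_, 3))
```

When several degrees score almost the same, the search may pick one slightly above 2. In that situation the simpler model is usually the safer choice.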

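7️⃣ Underfitting vs. overfitting

```text
         True curve (inverted U)
              *
            *   *
          *       *
        *           *

Underfitting (degree=1):     Overfitting (degree=15):
     ____/                    * * * * * * * *
    /                        *               *
   /                        *   *     *   *   *
  /                        *       *       *
 /
(a straight line that        (wild oscillation: the model
 misses the curve)            has fitted the noise)
```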
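The picture can be backed with numbers by scoring each degree on data it was trained on and on data it has never seen. This is a minimal sketch on made-up data (a noisy quadratic with only 30 training points so that overfitting has room to show); the dataset and names are assumptions, not the article's own experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = 0.5x² + x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.normal(0, 1, 60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}  # degree -> (train R², test R²)
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )
    print(f"degree={degree:2d}  train R²={scores[degree][0]:.3f}  "
          f"test R²={scores[degree][1]:.3f}")
```

The thing to look at is the gap between the two columns: an underfitting model is weak on both, while an overfitting model typically scores high on the training data and noticeably lower on the test data.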

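| | Underfitting | Good fit | Overfitting |
|---|--------------|----------|-------------|
| degree | 1 | 2~3 | 15 |
| Train R² | Low | High | Very high (≈1.0) |
| Test R² | Low | **Highest** | Low |
| Behaviour | Has not learned the pattern | ✅ Has learned the pattern | Has learned the noise |
| Remedy | **Increase degree** | - | **Lower degree / add regularization** |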
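The table names regularization as a remedy without saying which kind. One common option is ridge (L2) regularization, sketched below as an assumption rather than as the article's method: the same degree-15 features are fitted once with plain least squares and once with `Ridge`. The data, the helper `degree15` and the value `alpha=1.0` are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a noisy quadratic with only 30 points
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.normal(0, 1, 30)


def degree15(regressor):
    """Degree-15 features, standardized, followed by the given regressor."""
    return make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        StandardScaler(),  # puts x, x², ..., x¹⁵ on a comparable scale
        regressor,
    )


ols = degree15(LinearRegression()).fit(X, y)
ridge = degree15(Ridge(alpha=1.0)).fit(X, y)  # alpha=1.0 is an arbitrary illustrative value

# The last pipeline step is the fitted regressor
ols_norm = np.linalg.norm(ols[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)

print(f"least-squares weight norm: {ols_norm:.2f}")
print(f"ridge weight norm:         {ridge_norm:.2f}")
```

Ridge penalizes large weights, so its weight vector comes out smaller and the fitted curve oscillates less. A suitable `alpha` would normally be chosen by cross-validation, just like `degree`.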

8️⃣ Code comparison: linear vs. polynomial (full version)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# ========== Data: y = 3x² - 2x + 1 + noise ==========
np.random.seed(42)
X = np.linspace(-2, 2, 100).reshape(-1, 1)
y = 3 * X**2 - 2 * X + 1 + np.random.normal(0, 0.5, X.shape)

# Split the data. X is sorted, so slicing off the last 20 points would test
# only on the right-hand end of the range; a shuffled split avoids that.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ========== Model 1: linear regression ==========
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# ========== Model 2: quadratic polynomial ==========
poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2.fit(X_train, y_train)
y_pred_poly2 = poly2.predict(X_test)

# ========== Compare the metrics ==========
print("=" * 50)
print(f"{'Model':<15} {'R²':<10} {'RMSE':<10}")
print("=" * 50)
print(f"{'Linear':<15} {r2_score(y_test, y_pred_lr):<10.4f} {np.sqrt(mean_squared_error(y_test, y_pred_lr)):<10.4f}")
print(f"{'Quadratic':<15} {r2_score(y_test, y_pred_poly2):<10.4f} {np.sqrt(mean_squared_error(y_test, y_pred_poly2)):<10.4f}")
print("=" * 50)

# ========== Plot ==========
X_plot = np.linspace(-2, 2, 300).reshape(-1, 1)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_test, y_test, color='black', alpha=0.6, label='Observed')
plt.plot(X_plot, lr.predict(X_plot), color='red', linewidth=2, label='Linear')
plt.plot(X_plot, poly2.predict(X_plot), color='green', linewidth=2, label='Quadratic')
plt.title('Fit comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Residuals (prediction errors)
residuals_lr = y_test.flatten() - y_pred_lr.flatten()
residuals_poly2 = y_test.flatten() - y_pred_poly2.flatten()
plt.scatter(y_pred_lr.flatten(), residuals_lr, alpha=0.5, label='Linear residuals')
plt.scatter(y_pred_poly2.flatten(), residuals_poly2, alpha=0.5, label='Quadratic residuals')
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual (observed - predicted)')
plt.title('Residual plot (closer to 0 is better)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

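📊 Results

The output has the following layout. The figures are illustrative and a real run will differ; with noise of standard deviation 0.5, the quadratic model's RMSE should land close to 0.5.

```text
==================================================
Model           R²         RMSE
==================================================
Linear          0.4523     1.2345
Quadratic       0.9187     0.4567
==================================================
```

| Metric | Linear | Polynomial | Change |
|--------|--------|------------|--------|
| R² | 0.45 | **0.92** | ⬆️ Roughly doubled |
| RMSE | 1.23 | **0.46** | ⬇️ Down 63% |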
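9️⃣ Summary

| Question | Answer |
|----------|--------|
| **What is polynomial regression?** | Feature transformation + linear regression |
| **When should I use it?** | When the data follows a curve, not a straight line |
| **What is the key parameter?** | `degree`; 2~3 is the usual choice |
| **How do I choose degree?** | Cross-validation: pick the degree with the highest validation R² |
| **What is the biggest risk?** | Setting degree too high → overfitting |
| **How do I use it in sklearn?** | `make_pipeline(PolynomialFeatures(degree=2), LinearRegression())` |

🎓 The cheat sheet

| degree | Fitting power | Risk | In a nutshell |
|--------|---------------|------|---------------|
| 1 | 📏 Straight line | Underfitting | Use only when the data is linear |
| 2~3 | 🔄 Curve | ✅ Best balance | **The most common choice; try it first** |
| 5+ | 🌊 Complex curve | Overfitting | Only with plenty of data |
| 10+ | 💥 Wild oscillation | ❌ Severe overfitting | **Avoid** |

**One-sentence summary**:

🎯 **Polynomial regression = turn x into $x, x^2, x^3, \dots$, then run linear regression.**

🎯 **A degree of 2~3 is the safest starting point; use cross-validation to find the best value.**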