Author's note: Over the previous seven articles we covered Python basics, data processing, and data visualization. Starting today, we move on to machine learning algorithms! As the opening article on supervised learning, this post walks through the two most fundamental and practical algorithms, linear regression and logistic regression, implementing both from scratch so you fully understand how they work.
1. Supervised Learning Overview
1.1 What Is Supervised Learning?
**Supervised learning** is the most fundamental and widely used family of machine learning methods. The core idea: use labeled training data (input-output pairs) to learn a mapping function from inputs to outputs.
The basic setup:
- Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
- Goal: learn a function $f$ such that $f(x) \approx y$
- Two flavors: regression problems (continuous outputs) vs. classification problems (discrete outputs)
1.2 Types of Supervised Learning
| Type | Output | Typical Algorithms | Example Applications |
|---|---|---|---|
| Regression | Continuous values | Linear regression, Ridge, Lasso | House price prediction, sales forecasting |
| Classification | Discrete classes | Logistic regression, SVM, decision trees | Spam detection, disease diagnosis |
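To make the distinction concrete, here is a minimal sketch (the study-hours numbers are made up for illustration): the regressor returns a continuous value, while the classifier returns a discrete label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: hours studied -> exam score (regression) or pass/fail (classification)
X = np.array([[1], [2], [3], [4], [5]])
y_score = np.array([52, 58, 65, 71, 80])  # continuous target
y_pass = np.array([0, 0, 1, 1, 1])        # discrete target

reg = LinearRegression().fit(X, y_score)
clf = LogisticRegression().fit(X, y_pass)

print(reg.predict([[3.5]]))  # a continuous value, about 68.6
print(clf.predict([[3.5]]))  # a discrete label, 0 or 1
```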
2. Linear Regression: From Theory to Implementation
2.1 What Is Linear Regression?
**Linear regression** assumes a linear relationship between the target variable and the features:
$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$$
where:
- $y$: the target variable (dependent variable)
- $\mathbf{x}$: the feature vector (independent variables)
- $\mathbf{w}$: the weight vector (parameters to learn)
- $b$: the bias term (intercept)
2.2 Loss Function: Mean Squared Error
To find the optimal parameters, we need a **loss function** that measures the gap between predictions and true values. Linear regression uses the mean squared error (MSE):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\mathbf{w}^T\mathbf{x}_i + b)\right)^2$$
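As a quick sanity check, here is a tiny worked example (all numbers made up) computing predictions and the MSE by hand:

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5        # example parameters (made up)
X = np.array([[1.0, 2.0], [3.0, 0.0]])   # two samples, two features
y = np.array([0.8, 6.0])                 # true targets

y_hat = X @ w + b                        # predictions: [0.5, 6.5]
mse = np.mean((y - y_hat) ** 2)          # (0.3² + 0.5²) / 2 = 0.17
print(y_hat, mse)
```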
2.3 Solving for the Parameters: Gradient Descent
**Gradient descent** is the standard way to minimize the loss function: repeatedly step the parameters in the direction that decreases the loss, with the step size controlled by the learning rate α.
```
# Gradient descent pseudocode
initialize parameters w, b
repeat until convergence:
    compute gradients: ∂L/∂w, ∂L/∂b
    update parameters: w = w - α * ∂L/∂w
                       b = b - α * ∂L/∂b
```
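For the MSE loss defined in 2.2, the gradients have a simple closed form; these are exactly what the `dw` and `db` lines in the implementation below compute:
$$\frac{\partial L}{\partial \mathbf{w}} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\,\mathbf{x}_i = -\frac{2}{n}X^T(\mathbf{y} - \hat{\mathbf{y}}), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$$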
2.4 Python Implementation from Scratch
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
X, y = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent loop
        for i in range(self.n_iter):
            # Forward pass
            y_pred = self.predict(X)
            # Track the MSE loss
            loss = np.mean((y - y_pred) ** 2)
            self.losses.append(loss)
            # Gradients of the MSE loss
            dw = -(2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = -(2 / n_samples) * np.sum(y - y_pred)
            # Parameter update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        # R² = 1 - SS_res / SS_tot
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Train the model
model = LinearRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X_train_scaled, y_train)

# Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f'\nTrain R²: {train_score:.4f}')
print(f'Test R²: {test_score:.4f}')
```
2.5 Visualizing the Training Process
```python
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(model.losses)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Training Loss Curve')
axes[0].grid(True, alpha=0.3)

# Fitted line against the test data
axes[1].scatter(X_test, y_test, alpha=0.5, label='Actual')
X_line = np.linspace(X_test.min(), X_test.max(), 100).reshape(-1, 1)
X_line_scaled = scaler.transform(X_line)
y_line = model.predict(X_line_scaled)
axes[1].plot(X_line, y_line, 'r-', linewidth=2, label='Predicted')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].set_title('Linear Regression Fit')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
2.6 Multiple Linear Regression
The same class handles any number of features; below we fit it on five-feature data.
```python
# Generate multivariate data
X_multi, y_multi = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Standardize
scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

# Train
model_multi = LinearRegression(learning_rate=0.01, n_iterations=2000)
model_multi.fit(X_train_m_scaled, y_train_m)

# Inspect the learned parameters
print('Learned weights:', model_multi.weights)
print('Bias:', model_multi.bias)
print(f'R² Score: {model_multi.score(X_test_m_scaled, y_test_m):.4f}')
```
2.7 Implementation with sklearn
```python
from sklearn.linear_model import LinearRegression as SklearnLR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create and fit the model
sk_model = SklearnLR()
sk_model.fit(X_train_scaled, y_train)

# Predict
y_pred = sk_model.predict(X_test_scaled)

# Evaluation metrics
print(f'MSE: {mean_squared_error(y_test, y_pred):.4f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.4f}')
print(f'R²: {r2_score(y_test, y_pred):.4f}')
print(f'Coefficients: {sk_model.coef_}')
print(f'Intercept: {sk_model.intercept_:.4f}')
```
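Note that sklearn's `LinearRegression` does not use gradient descent at all: it solves the least-squares problem directly. A minimal sketch of the equivalent closed-form solution with NumPy (reusing `X_train_scaled` and `y_train` from above):

```python
# Append a column of ones so the bias is learned as one more weight
X_aug = np.hstack([X_train_scaled, np.ones((X_train_scaled.shape[0], 1))])

# Solve min ||X_aug @ theta - y||²; lstsq is more stable than inverting XᵀX
theta, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)
w_closed, b_closed = theta[:-1], theta[-1]
print(w_closed, b_closed)  # should closely match sk_model.coef_ and sk_model.intercept_
```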
3. Logistic Regression: The Foundation of Classification
3.1 Why Logistic Regression?
Linear regression outputs continuous values, but classification needs discrete outputs (e.g., 0/1). **Logistic regression** applies the Sigmoid function to the linear output, mapping it into the interval (0, 1) so it can be interpreted as a probability.
3.2 The Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = \mathbf{w}^T\mathbf{x} + b$.
Properties of the Sigmoid function:
- Output range: (0, 1)
- At $z = 0$, $\sigma(z) = 0.5$
- As $z \to +\infty$, $\sigma(z) \to 1$
- As $z \to -\infty$, $\sigma(z) \to 0$
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
s = sigmoid(z)

plt.figure(figsize=(10, 6))
plt.plot(z, s, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.show()
```
3.3 Loss Function: Cross-Entropy
For binary classification we use the **binary cross-entropy** loss:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
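A detail worth knowing: combined with the Sigmoid, the cross-entropy gradient collapses to the same simple form as in linear regression (up to the factor of 2), which is why the `dw`/`db` lines in the implementation below look so clean:
$$\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{n}X^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$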
3.4 Python Implementation from Scratch
```python
from sklearn.datasets import make_classification

# Generate binary classification data
X_cls, y_cls = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                                   n_informative=2, n_clusters_per_class=1,
                                   random_state=42)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42)

# Standardize
scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def sigmoid(self, z):
        # Numerically stable sigmoid: pick the form whose exp argument is non-positive
        return np.where(z >= 0,
                        1 / (1 + np.exp(-z)),
                        np.exp(z) / (1 + np.exp(z)))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent loop
        for i in range(self.n_iter):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)
            # Cross-entropy loss; clip to avoid log(0)
            epsilon = 1e-15
            y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
            loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
            self.losses.append(loss)
            # Gradients of the cross-entropy loss
            dw = np.dot(X.T, (y_pred - y)) / n_samples
            db = np.sum(y_pred - y) / n_samples
            # Parameter update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')

    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Train the model
lr_model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
lr_model.fit(X_train_c_scaled, y_train_c)

# Evaluate
print(f'\nTrain accuracy: {lr_model.score(X_train_c_scaled, y_train_c):.4f}')
print(f'Test accuracy: {lr_model.score(X_test_c_scaled, y_test_c):.4f}')
```
3.5 Visualizing the Decision Boundary
```python
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Loss curve
axes[0].plot(lr_model.losses)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cross-Entropy Loss')
axes[0].set_title('Training Loss Curve')
axes[0].grid(True, alpha=0.3)

# Decision boundary: build a dense grid over the feature space
h = 0.02
x_min, x_max = X_test_c[:, 0].min() - 1, X_test_c[:, 0].max() + 1
y_min, y_max = X_test_c[:, 1].min() - 1, X_test_c[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict probabilities for every grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
grid_points_scaled = scaler_c.transform(grid_points)
Z = lr_model.predict_proba(grid_points_scaled)
Z = Z.reshape(xx.shape)

# Plot
axes[1].contourf(xx, yy, Z, levels=50, alpha=0.8, cmap='RdYlBu')
scatter = axes[1].scatter(X_test_c[:, 0], X_test_c[:, 1], c=y_test_c,
                          cmap='RdYlBu', edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Decision Boundary')
plt.colorbar(scatter, ax=axes[1])

plt.tight_layout()
plt.show()
```
3.6 Multiclass Logistic Regression: Softmax
For multiclass problems, the Sigmoid is replaced by the **Softmax** function, which turns $K$ linear scores into a probability distribution over $K$ classes:
$$P(y=j\mid\mathbf{x}) = \frac{e^{\mathbf{w}_j^T\mathbf{x} + b_j}}{\sum_{k=1}^{K}e^{\mathbf{w}_k^T\mathbf{x} + b_k}}$$
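The formula maps almost directly to NumPy. A minimal sketch (with the standard max-subtraction trick to avoid overflow in `exp`):

```python
import numpy as np

def softmax(z):
    # Softmax is shift-invariant, so subtract the row max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])  # one sample, three class scores
print(softmax(logits))                # ≈ [[0.659, 0.242, 0.099]]
```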
In practice, sklearn's `LogisticRegression` handles the multiclass case automatically; here we apply it to our binary dataset and evaluate it with a classification report and a confusion matrix.
```python
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Fit the sklearn implementation
sk_lr = SklearnLogReg(max_iter=1000, random_state=42)
sk_lr.fit(X_train_c_scaled, y_train_c)

# Predict
y_pred_c = sk_lr.predict(X_test_c_scaled)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test_c, y_pred_c):.4f}')
print('\nClassification report:')
print(classification_report(y_test_c, y_pred_c))

# Confusion matrix
cm = confusion_matrix(y_test_c, y_pred_c)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
```
4. Hands-On Cases: House Price Prediction and Disease Diagnosis
4.1 Case One: California Housing Price Prediction (Linear Regression)
```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
import pandas as pd

# Load the data
housing = fetch_california_housing()
X_h = housing.data
y_h = housing.target
print(f'Feature names: {housing.feature_names}')
print(f'Data shape: {X_h.shape}')

# Split
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42)

# Standardize
scaler_h = StandardScaler()
X_train_h_scaled = scaler_h.fit_transform(X_train_h)
X_test_h_scaled = scaler_h.transform(X_test_h)

# Train a Ridge model (linear regression with L2 regularization)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_h_scaled, y_train_h)

# Evaluate
y_pred_h = ridge_model.predict(X_test_h_scaled)
print(f'\nMSE: {mean_squared_error(y_test_h, y_pred_h):.4f}')
print(f'R²: {r2_score(y_test_h, y_pred_h):.4f}')

# Coefficient magnitudes as a rough measure of feature importance
feature_importance = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Feature Importance in House Price Prediction')
plt.tight_layout()
plt.show()
```
4.2 Case Two: Breast Cancer Diagnosis (Logistic Regression)
```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc

# Load the data
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
print(f'Feature names: {cancer.feature_names[:5]}...')
print(f'Classes: {cancer.target_names}')
print(f'Data shape: {X_cancer.shape}')

# Split
X_train_can, X_test_can, y_train_can, y_test_can = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42)

# Standardize
scaler_can = StandardScaler()
X_train_can_scaled = scaler_can.fit_transform(X_train_can)
X_test_can_scaled = scaler_can.transform(X_test_can)

# Train logistic regression
lr_cancer = SklearnLogReg(max_iter=1000, random_state=42)
lr_cancer.fit(X_train_can_scaled, y_train_can)

# Predicted probability of the positive class
y_prob_can = lr_cancer.predict_proba(X_test_can_scaled)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test_can, y_prob_can)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Breast Cancer Diagnosis')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

print(f'\nAUC Score: {roc_auc:.4f}')
print(f'Test accuracy: {lr_cancer.score(X_test_can_scaled, y_test_can):.4f}')
```
5. Evaluation Metrics in Detail
5.1 Regression Metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Mean squared error; sensitive to large errors |
| RMSE | $\sqrt{MSE}$ | Root mean squared error; same units as the target |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Mean absolute error; more robust to outliers |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Coefficient of determination; closer to 1 is better |
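All four metrics take only a line each in NumPy; a quick sketch reusing `y_test` and `y_pred` from section 2.7:

```python
mse = np.mean((y_test - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_test - y_pred))
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
print(f'MSE={mse:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}, R²={r2:.4f}')
```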
5.2 Classification Metrics
| Metric | Formula | Notes |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Fraction of all predictions that are correct |
| Precision | $\frac{TP}{TP+FP}$ | Fraction of predicted positives that are actually positive |
| Recall | $\frac{TP}{TP+FN}$ | Fraction of actual positives that are correctly identified |
| F1 Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Harmonic mean of precision and recall |
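sklearn exposes each of these as a one-liner; a quick sketch reusing `y_test_c` and `y_pred_c` from section 3.6:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f'Accuracy:  {accuracy_score(y_test_c, y_pred_c):.4f}')
print(f'Precision: {precision_score(y_test_c, y_pred_c):.4f}')
print(f'Recall:    {recall_score(y_test_c, y_pred_c):.4f}')
print(f'F1:        {f1_score(y_test_c, y_pred_c):.4f}')
```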
6. Common Problems and Solutions
6.1 Overfitting and Underfitting
**Overfitting**: the model performs well on the training set but poorly on the test set.
**Underfitting**: the model performs poorly on both the training and test sets.
Remedies:
- Overfitting: more data, regularization, fewer features, a simpler model (see the diagnostic sketch below)
- Underfitting: more features, a more expressive model, less regularization
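The most direct diagnostic is the gap between train and test scores. A minimal sketch using `Ridge` at several regularization strengths on the multivariate data from section 2.6 (on this easy synthetic data the gap is small, but on noisy real-world data a large gap flags overfitting):

```python
from sklearn.linear_model import Ridge

for alpha in [0.01, 1.0, 100.0]:
    r = Ridge(alpha=alpha).fit(X_train_m_scaled, y_train_m)
    print(f'alpha={alpha:>6}: '
          f'train R²={r.score(X_train_m_scaled, y_train_m):.4f}, '
          f'test R²={r.score(X_test_m_scaled, y_test_m):.4f}')
```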
6.2 Regularization
L1 regularization (Lasso): $L = Loss + \lambda\sum\lvert w_i\rvert$
L2 regularization (Ridge): $L = Loss + \lambda\sum w_i^2$
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge regression (L2 penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Lasso regression (L1 penalty)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Elastic Net (mix of L1 and L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)
```
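The practical difference shows up in the coefficients: the L1 penalty can drive some weights exactly to zero (implicit feature selection), while L2 only shrinks them. A quick way to see this, assuming the five-feature data from section 2.6 is still in scope:

```python
lasso_m = Lasso(alpha=1.0).fit(X_train_m_scaled, y_train_m)
ridge_m = Ridge(alpha=1.0).fit(X_train_m_scaled, y_train_m)
print('Lasso coefficients:', lasso_m.coef_)  # some entries may be exactly 0
print('Ridge coefficients:', ridge_m.coef_)  # shrunk, but typically all nonzero
```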
7. Recommended Learning Resources
- *Statistical Learning Methods* by Li Hang: a classic machine learning textbook
- The Coursera Machine Learning course, taught by Andrew Ng
- The official Scikit-learn documentation: the most practical reference manual
- Kaggle beginner competitions: consolidate your knowledge through practice
Next up: Part 9, Decision Trees and Random Forests, an introduction to ensemble learning.
This is the 8th article in the series, a systematic walk through the two foundational supervised learning algorithms. Questions are welcome in the comments!
Tags: #LinearRegression #LogisticRegression #SupervisedLearning #MachineLearning #Python #ArtificialIntelligence #Tutorial