Author's note: Over the previous seven articles we covered Python basics, data processing, and data visualization. Starting today, we move on to machine learning algorithms! As the opening article on supervised learning, this post walks through the two most fundamental and practical algorithms, linear regression and logistic regression, implementing both from scratch so you fully understand how they work.
1. Supervised Learning Overview
1.1 What Is Supervised Learning?
**Supervised learning** is the most fundamental and widely used family of machine learning methods. The core idea: use labeled training data (input-output pairs) to learn a mapping function from inputs to outputs.
The basic setup:
- Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
- Goal: learn a function $f$ such that $f(x) \approx y$
- Two flavors: regression problems (continuous outputs) vs. classification problems (discrete outputs)
1.2 Types of Supervised Learning
| Type | Output | Typical Algorithms | Example Applications |
|---|---|---|---|
| Regression | Continuous values | Linear regression, Ridge, Lasso | House price prediction, sales forecasting |
| Classification | Discrete classes | Logistic regression, SVM, decision trees | Spam detection, disease diagnosis |
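To make the distinction concrete, here is a minimal sketch (the study-hours numbers are made up for illustration): the regressor returns a continuous value, while the classifier returns a discrete label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: hours studied -> exam score (regression) or pass/fail (classification)
X = np.array([[1], [2], [3], [4], [5]])
y_score = np.array([52, 58, 65, 71, 80])  # continuous target
y_pass = np.array([0, 0, 1, 1, 1])        # discrete target

reg = LinearRegression().fit(X, y_score)
clf = LogisticRegression().fit(X, y_pass)

print(reg.predict([[3.5]]))  # a continuous value, about 68.6
print(clf.predict([[3.5]]))  # a discrete label, 0 or 1
```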
2. Linear Regression: From Theory to Implementation
2.1 What Is Linear Regression?
**Linear regression** assumes a linear relationship between the target variable and the features:
$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$$
where:
- $y$: the target variable (dependent variable)
- $\mathbf{x}$: the feature vector (independent variables)
- $\mathbf{w}$: the weight vector (parameters to learn)
- $b$: the bias term (intercept)
2.2 Loss Function: Mean Squared Error
To find the optimal parameters, we need a **loss function** that measures the gap between predictions and true values. Linear regression uses the mean squared error (MSE):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\mathbf{w}^T\mathbf{x}_i + b)\right)^2$$
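As a quick sanity check, here is a tiny worked example (all numbers made up) computing predictions and the MSE by hand:

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5        # example parameters (made up)
X = np.array([[1.0, 2.0], [3.0, 0.0]])   # two samples, two features
y = np.array([0.8, 6.0])                 # true targets

y_hat = X @ w + b                        # predictions: [0.5, 6.5]
mse = np.mean((y - y_hat) ** 2)          # (0.3² + 0.5²) / 2 = 0.17
print(y_hat, mse)
```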
2.3 Solving for the Parameters: Gradient Descent
**Gradient descent** is the standard way to minimize the loss function: repeatedly step the parameters in the direction that decreases the loss, with the step size controlled by the learning rate α.
```
# Gradient descent pseudocode
initialize parameters w, b
repeat until convergence:
    compute gradients: ∂L/∂w, ∂L/∂b
    update parameters: w = w - α * ∂L/∂w
                       b = b - α * ∂L/∂b
```
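For the MSE loss defined in 2.2, the gradients have a simple closed form; these are exactly what the `dw` and `db` lines in the implementation below compute:
$$\frac{\partial L}{\partial \mathbf{w}} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\,\mathbf{x}_i = -\frac{2}{n}X^T(\mathbf{y} - \hat{\mathbf{y}}), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$$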
2.4 Python Implementation from Scratch
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
X, y = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent loop
        for i in range(self.n_iter):
            # Forward pass
            y_pred = self.predict(X)
            # Track the MSE loss
            loss = np.mean((y - y_pred) ** 2)
            self.losses.append(loss)
            # Gradients of the MSE loss
            dw = -(2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = -(2 / n_samples) * np.sum(y - y_pred)
            # Parameter update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        # R² = 1 - SS_res / SS_tot
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Train the model
model = LinearRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X_train_scaled, y_train)

# Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f'\nTrain R²: {train_score:.4f}')
print(f'Test R²: {test_score:.4f}')
```
2.5 Visualizing the Training Process
```python
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(model.losses)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Training Loss Curve')
axes[0].grid(True, alpha=0.3)

# Fitted line against the test data
axes[1].scatter(X_test, y_test, alpha=0.5, label='Actual')
X_line = np.linspace(X_test.min(), X_test.max(), 100).reshape(-1, 1)
X_line_scaled = scaler.transform(X_line)
y_line = model.predict(X_line_scaled)
axes[1].plot(X_line, y_line, 'r-', linewidth=2, label='Predicted')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].set_title('Linear Regression Fit')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
2.6 Multiple Linear Regression
The same class handles any number of features; below we fit it on five-feature data.
```python
# Generate multivariate data
X_multi, y_multi = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Standardize
scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

# Train
model_multi = LinearRegression(learning_rate=0.01, n_iterations=2000)
model_multi.fit(X_train_m_scaled, y_train_m)

# Inspect the learned parameters
print('Learned weights:', model_multi.weights)
print('Bias:', model_multi.bias)
print(f'R² Score: {model_multi.score(X_test_m_scaled, y_test_m):.4f}')
```
2.7 Implementation with sklearn
```python
from sklearn.linear_model import LinearRegression as SklearnLR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create and fit the model
sk_model = SklearnLR()
sk_model.fit(X_train_scaled, y_train)

# Predict
y_pred = sk_model.predict(X_test_scaled)

# Evaluation metrics
print(f'MSE: {mean_squared_error(y_test, y_pred):.4f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.4f}')
print(f'R²: {r2_score(y_test, y_pred):.4f}')
print(f'Coefficients: {sk_model.coef_}')
print(f'Intercept: {sk_model.intercept_:.4f}')
```
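Note that sklearn's `LinearRegression` does not use gradient descent at all: it solves the least-squares problem directly. A minimal sketch of the equivalent closed-form solution with NumPy (reusing `X_train_scaled` and `y_train` from above):

```python
# Append a column of ones so the bias is learned as one more weight
X_aug = np.hstack([X_train_scaled, np.ones((X_train_scaled.shape[0], 1))])

# Solve min ||X_aug @ theta - y||²; lstsq is more stable than inverting XᵀX
theta, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)
w_closed, b_closed = theta[:-1], theta[-1]
print(w_closed, b_closed)  # should closely match sk_model.coef_ and sk_model.intercept_
```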
3. Logistic Regression: The Foundation of Classification
3.1 Why Logistic Regression?
Linear regression outputs continuous values, but classification needs discrete outputs (e.g., 0/1). **Logistic regression** applies the Sigmoid function to the linear output, mapping it into the interval (0, 1) so it can be interpreted as a probability.
3.2 The Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = \mathbf{w}^T\mathbf{x} + b$.
Properties of the Sigmoid function:
- Output range: (0, 1)
- At $z = 0$, $\sigma(z) = 0.5$
- As $z \to +\infty$, $\sigma(z) \to 1$
- As $z \to -\infty$, $\sigma(z) \to 0$
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
s = sigmoid(z)

plt.figure(figsize=(10, 6))
plt.plot(z, s, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.show()
```
3.3 Loss Function: Cross-Entropy
For binary classification we use the **binary cross-entropy** loss:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
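A detail worth knowing: combined with the Sigmoid, the cross-entropy gradient collapses to the same simple form as in linear regression (up to the factor of 2), which is why the `dw`/`db` lines in the implementation below look so clean:
$$\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{n}X^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$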
3.4 Python Implementation from Scratch
```python
from sklearn.datasets import make_classification

# Generate binary classification data
X_cls, y_cls = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                                   n_informative=2, n_clusters_per_class=1,
                                   random_state=42)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42)

# Standardize
scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def sigmoid(self, z):
        # Numerically stable sigmoid: pick the form whose exp argument is non-positive
        return np.where(z >= 0,
                        1 / (1 + np.exp(-z)),
                        np.exp(z) / (1 + np.exp(z)))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent loop
        for i in range(self.n_iter):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)
            # Cross-entropy loss; clip to avoid log(0)
            epsilon = 1e-15
            y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
            loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
            self.losses.append(loss)
            # Gradients of the cross-entropy loss
            dw = np.dot(X.T, (y_pred - y)) / n_samples
            db = np.sum(y_pred - y) / n_samples
            # Parameter update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')

    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Train the model
lr_model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
lr_model.fit(X_train_c_scaled, y_train_c)

# Evaluate
print(f'\nTrain accuracy: {lr_model.score(X_train_c_scaled, y_train_c):.4f}')
print(f'Test accuracy: {lr_model.score(X_test_c_scaled, y_test_c):.4f}')
```
3.5 Visualizing the Decision Boundary
```python
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Loss curve
axes[0].plot(lr_model.losses)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cross-Entropy Loss')
axes[0].set_title('Training Loss Curve')
axes[0].grid(True, alpha=0.3)

# Decision boundary: build a dense grid over the feature space
h = 0.02
x_min, x_max = X_test_c[:, 0].min() - 1, X_test_c[:, 0].max() + 1
y_min, y_max = X_test_c[:, 1].min() - 1, X_test_c[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict probabilities for every grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
grid_points_scaled = scaler_c.transform(grid_points)
Z = lr_model.predict_proba(grid_points_scaled)
Z = Z.reshape(xx.shape)

# Plot
axes[1].contourf(xx, yy, Z, levels=50, alpha=0.8, cmap='RdYlBu')
scatter = axes[1].scatter(X_test_c[:, 0], X_test_c[:, 1], c=y_test_c,
                          cmap='RdYlBu', edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Decision Boundary')
plt.colorbar(scatter, ax=axes[1])

plt.tight_layout()
plt.show()
```
3.6 Multiclass Logistic Regression: Softmax
For multiclass problems, the Sigmoid is replaced by the **Softmax** function, which turns $K$ linear scores into a probability distribution over $K$ classes:
$$P(y=j\mid\mathbf{x}) = \frac{e^{\mathbf{w}_j^T\mathbf{x} + b_j}}{\sum_{k=1}^{K}e^{\mathbf{w}_k^T\mathbf{x} + b_k}}$$
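The formula maps almost directly to NumPy. A minimal sketch (with the standard max-subtraction trick to avoid overflow in `exp`):

```python
import numpy as np

def softmax(z):
    # Softmax is shift-invariant, so subtract the row max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])  # one sample, three class scores
print(softmax(logits))                # ≈ [[0.659, 0.242, 0.099]]
```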
In practice, sklearn's `LogisticRegression` handles the multiclass case automatically; here we apply it to our binary dataset and evaluate it with a classification report and a confusion matrix.
```python
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Fit the sklearn implementation
sk_lr = SklearnLogReg(max_iter=1000, random_state=42)
sk_lr.fit(X_train_c_scaled, y_train_c)

# Predict
y_pred_c = sk_lr.predict(X_test_c_scaled)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test_c, y_pred_c):.4f}')
print('\nClassification report:')
print(classification_report(y_test_c, y_pred_c))

# Confusion matrix
cm = confusion_matrix(y_test_c, y_pred_c)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
```
4. Hands-On Cases: House Price Prediction and Disease Diagnosis
4.1 Case One: California Housing Price Prediction (Linear Regression)
```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
import pandas as pd

# Load the data
housing = fetch_california_housing()
X_h = housing.data
y_h = housing.target
print(f'Feature names: {housing.feature_names}')
print(f'Data shape: {X_h.shape}')

# Split
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42)

# Standardize
scaler_h = StandardScaler()
X_train_h_scaled = scaler_h.fit_transform(X_train_h)
X_test_h_scaled = scaler_h.transform(X_test_h)

# Train a Ridge model (linear regression with L2 regularization)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_h_scaled, y_train_h)

# Evaluate
y_pred_h = ridge_model.predict(X_test_h_scaled)
print(f'\nMSE: {mean_squared_error(y_test_h, y_pred_h):.4f}')
print(f'R²: {r2_score(y_test_h, y_pred_h):.4f}')

# Coefficient magnitudes as a rough measure of feature importance
feature_importance = pd.DataFrame({
    'Feature': housing.feature_names,
    'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Feature Importance in House Price Prediction')
plt.tight_layout()
plt.show()
```
4.2 Case Two: Breast Cancer Diagnosis (Logistic Regression)
```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc

# Load the data
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
print(f'Feature names: {cancer.feature_names[:5]}...')
print(f'Classes: {cancer.target_names}')
print(f'Data shape: {X_cancer.shape}')

# Split
X_train_can, X_test_can, y_train_can, y_test_can = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42)

# Standardize
scaler_can = StandardScaler()
X_train_can_scaled = scaler_can.fit_transform(X_train_can)
X_test_can_scaled = scaler_can.transform(X_test_can)

# Train logistic regression
lr_cancer = SklearnLogReg(max_iter=1000, random_state=42)
lr_cancer.fit(X_train_can_scaled, y_train_can)

# Predicted probability of the positive class
y_prob_can = lr_cancer.predict_proba(X_test_can_scaled)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test_can, y_prob_can)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Breast Cancer Diagnosis')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

print(f'\nAUC Score: {roc_auc:.4f}')
print(f'Test accuracy: {lr_cancer.score(X_test_can_scaled, y_test_can):.4f}')
```
5. Evaluation Metrics in Detail
5.1 Regression Metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Mean squared error; sensitive to large errors |
| RMSE | $\sqrt{MSE}$ | Root mean squared error; same units as the target |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Mean absolute error; more robust to outliers |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Coefficient of determination; closer to 1 is better |
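All four metrics take only a line each in NumPy; a quick sketch reusing `y_test` and `y_pred` from section 2.7:

```python
mse = np.mean((y_test - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_test - y_pred))
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
print(f'MSE={mse:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}, R²={r2:.4f}')
```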
5.2 Classification Metrics
| Metric | Formula | Notes |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Fraction of all predictions that are correct |
| Precision | $\frac{TP}{TP+FP}$ | Fraction of predicted positives that are actually positive |
| Recall | $\frac{TP}{TP+FN}$ | Fraction of actual positives that are correctly identified |
| F1 Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Harmonic mean of precision and recall |
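sklearn exposes each of these as a one-liner; a quick sketch reusing `y_test_c` and `y_pred_c` from section 3.6:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f'Accuracy:  {accuracy_score(y_test_c, y_pred_c):.4f}')
print(f'Precision: {precision_score(y_test_c, y_pred_c):.4f}')
print(f'Recall:    {recall_score(y_test_c, y_pred_c):.4f}')
print(f'F1:        {f1_score(y_test_c, y_pred_c):.4f}')
```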
6. Common Problems and Solutions
6.1 Overfitting and Underfitting
**Overfitting**: the model performs well on the training set but poorly on the test set.
**Underfitting**: the model performs poorly on both the training and test sets.
Remedies:
- Overfitting: more data, regularization, fewer features, a simpler model (see the diagnostic sketch below)
- Underfitting: more features, a more expressive model, less regularization
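The most direct diagnostic is the gap between train and test scores. A minimal sketch using `Ridge` at several regularization strengths on the multivariate data from section 2.6 (on this easy synthetic data the gap is small, but on noisy real-world data a large gap flags overfitting):

```python
from sklearn.linear_model import Ridge

for alpha in [0.01, 1.0, 100.0]:
    r = Ridge(alpha=alpha).fit(X_train_m_scaled, y_train_m)
    print(f'alpha={alpha:>6}: '
          f'train R²={r.score(X_train_m_scaled, y_train_m):.4f}, '
          f'test R²={r.score(X_test_m_scaled, y_test_m):.4f}')
```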
6.2 Regularization
L1 regularization (Lasso): $L = Loss + \lambda\sum\lvert w_i\rvert$
L2 regularization (Ridge): $L = Loss + \lambda\sum w_i^2$
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge regression (L2 penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Lasso regression (L1 penalty)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Elastic Net (mix of L1 and L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)
```
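The practical difference shows up in the coefficients: the L1 penalty can drive some weights exactly to zero (implicit feature selection), while L2 only shrinks them. A quick way to see this, assuming the five-feature data from section 2.6 is still in scope:

```python
lasso_m = Lasso(alpha=1.0).fit(X_train_m_scaled, y_train_m)
ridge_m = Ridge(alpha=1.0).fit(X_train_m_scaled, y_train_m)
print('Lasso coefficients:', lasso_m.coef_)  # some entries may be exactly 0
print('Ridge coefficients:', ridge_m.coef_)  # shrunk, but typically all nonzero
```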
7. Recommended Learning Resources
- *Statistical Learning Methods* by Li Hang: a classic machine learning textbook
- The Coursera Machine Learning course, taught by Andrew Ng
- The official Scikit-learn documentation: the most practical reference manual
- Kaggle beginner competitions: consolidate your knowledge through practice
Next up: Part 9, Decision Trees and Random Forests, an introduction to ensemble learning.
This is the 8th article in the series, a systematic walk through the two foundational supervised learning algorithms. Questions are welcome in the comments!
Tags: #LinearRegression #LogisticRegression #SupervisedLearning #MachineLearning #Python #ArtificialIntelligence #Tutorial