Machine Learning Basics: Understanding Core Concepts and Algorithm Categories from Scratch

1. What Is Machine Learning

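Machine Learning:
├── Definition: letting computers learn patterns from data automatically, without being explicitly programmed
├── Core idea: data → model → prediction
├── How it differs from traditional programming:
│   ├── Traditional programming: rules + data → results
│   └── Machine learning: data + results → rules
└── Applications: image recognition, speech recognition, recommender systems, autonomous driving, medical diagnosis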
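
The contrast between "rules + data → results" and "data + results → rules" can be made concrete with a toy example. The sketch below is only an illustration: the data, the 30-character cutoff, and the function name `is_long_handwritten` are made up, and a depth-1 decision tree stands in for "a model that learns one rule".

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): message length -> is the message "long"?
lengths = [[3], [8], [12], [25], [40], [55], [70], [90]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Traditional programming: a human writes the rule by hand.
def is_long_handwritten(length):
    return int(length > 30)

# Machine learning: the rule (a threshold) is inferred from data + results.
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(lengths, labels)
learned_threshold = clf.tree_.threshold[0]  # split point chosen by the tree

print(f"Learned threshold: {learned_threshold}")
print("Handwritten rule:", [is_long_handwritten(n) for n in (10, 60)])
print("Learned rule:    ", clf.predict([[10], [60]]).tolist())
```

Both versions classify the two new lengths the same way; the difference is that the learned threshold comes from the examples rather than from a person.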

2. Types of Learning

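The three main paradigms of machine learning:
├── Supervised Learning
│   ├── Labeled data
│   ├── Classification: discrete output (cat/dog)
│   └── Regression: continuous output (house price/temperature)
├── Unsupervised Learning
│   ├── Unlabeled data
│   ├── Clustering: K-Means, DBSCAN
│   └── Dimensionality reduction: PCA, t-SNE
└── Reinforcement Learning
    ├── An agent interacts with an environment
    ├── Driven by reward signals
    └── Applications: game AI, robot control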
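
To make the first two paradigms tangible, here is a minimal sketch that runs a supervised classifier and an unsupervised clustering algorithm on the same dataset. The choice of the Iris dataset, `LogisticRegression`, and the adjusted Rand index as a clustering check are assumptions made for illustration, not part of the outline above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
supervised_acc = clf.score(X_test, y_test)

# Unsupervised: only X is used for fitting; y is consulted afterwards
# purely to check how well the clusters line up with the real classes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
ari = adjusted_rand_score(y, cluster_ids)

print(f"Classification accuracy: {supervised_acc:.3f}")
print(f"Clustering adjusted Rand index: {ari:.3f}")
```

The classifier can name the species because it saw labels; K-Means can only say which samples belong together.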

3. Algorithm Quick Reference

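Algorithm categories:
├── Supervised learning
│   ├── Classification
│   │   ├── Logistic Regression
│   │   ├── Support Vector Machine (SVM)
│   │   ├── Decision Tree
│   │   ├── Random Forest
│   │   ├── K-Nearest Neighbors (KNN)
│   │   ├── Naive Bayes
│   │   └── XGBoost / LightGBM (gradient-boosted trees; also usable for regression)
│   └── Regression
│       ├── Linear Regression
│       ├── Ridge Regression
│       ├── Lasso Regression
│       └── Decision Tree Regression
├── Unsupervised learning
│   ├── Clustering
│   │   ├── K-Means
│   │   ├── DBSCAN
│   │   └── Hierarchical clustering
│   └── Dimensionality reduction
│       ├── PCA
│       ├── t-SNE
│       └── UMAP
└── Deep learning (a model family that can be applied under any of the paradigms above)
    ├── CNN (images)
    ├── RNN/LSTM (sequences)
    ├── Transformer (NLP)
    └── GAN (generation)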
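
Several of the classifiers listed above share the same scikit-learn interface, so they can be compared in a single loop. The following is a rough sketch, assuming the Iris dataset and default hyperparameters; the scaling step in front of the distance- and margin-based models is a common practice added here, not something the list prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate classifiers; scale-sensitive models get a StandardScaler in front.
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
}

# 5-fold cross-validated accuracy for each candidate.
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.3f} ± {scores.std():.3f}")
```

On a dataset this small and clean the scores tend to be close together, so the ranking here should not be read as a general statement about these algorithms.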

4. Core Concepts

4.1 Bias-Variance Tradeoff

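Bias: the systematic gap between the model's average prediction and the true value
├── High bias: underfitting (the model is too simple)
└── Remedy: increase model complexity

Variance: how sensitive the model is to changes in the training data
├── High variance: overfitting (the model is too complex)
└── Remedy: regularization, more training data

Bias-variance tradeoff:
├── Expected squared error = Bias² + Variance + irreducible noise
├── Goal: find the balance point between bias and variance
└── Method: use cross-validation to choose the best model complexity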
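
The decomposition above can be estimated numerically by refitting the same model on many noisy versions of a training set. The sketch below is one way to do that under stated assumptions: the true function `sin(2πx)`, the noise level, the fixed training grid, and the polynomial degrees are all chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth function for the simulation.
    return np.sin(2 * np.pi * x)

noise_sd = 0.3                         # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 20)    # fixed training inputs
x_eval = np.linspace(0.05, 0.95, 50)   # points where the error is decomposed
n_trials = 200                         # number of resampled training sets

def bias_variance(degree):
    """Estimate bias^2 and variance of a polynomial fit by resampling the noise."""
    preds = np.empty((n_trials, x_eval.size))
    for t in range(n_trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, x_train.size)
        fitted = Polynomial.fit(x_train, y_train, degree)
        preds[t] = fitted(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the straight line has the largest bias and the smallest variance, while the degree-9 polynomial shows the opposite pattern.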

4.2 Overfitting and Underfitting

# Underfitting vs. overfitting: three models of increasing complexity
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Underfitting: a linear model applied to nonlinear data
model_underfit = LinearRegression()

# Overfitting: a high-degree polynomial fitted to a small dataset
model_overfit = make_pipeline(
    PolynomialFeatures(degree=15),
    LinearRegression()
)

# Good fit: moderate complexity
model_good = make_pipeline(
    PolynomialFeatures(degree=3),
    LinearRegression()
)
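
The three models above are only defined, not trained. As a minimal sketch of how their behavior could be compared, the code below fits polynomials of degree 1, 3, and 15 to a small synthetic dataset. The data-generating function `sin(πx)`, the noise level, and the sample size are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data (assumed for illustration): y = sin(pi * x) + noise
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(np.pi * X[:, 0]) + rng.normal(0.0, 0.2, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit each degree and record (train R^2, test R^2).
scores = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"degree={degree:>2}: train R^2={scores[degree][0]:.3f}, "
          f"test R^2={scores[degree][1]:.3f}")
```

The training score can only go up as the degree grows. A large gap between the training and test scores is the usual sign of overfitting, and a low score on both is the usual sign of underfitting.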

4.3 Cross-Validation

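from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Example data and classifier (assumed here so the snippet runs; substitute your own)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold cross-validation (K = 5); 'accuracy' requires a classifier
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")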
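
To see what `cross_val_score` does internally, the same procedure can be written as an explicit loop over the folds. This is a sketch under assumptions: the Iris dataset and a `KNeighborsClassifier` are stand-ins chosen so the example is deterministic and self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                  # assumed example data
estimator = KNeighborsClassifier(n_neighbors=5)    # assumed example classifier
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(estimator)                  # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])     # train on the other K-1 folds
    # Accuracy on the held-out fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

library_scores = cross_val_score(estimator, X, y, cv=kfold, scoring="accuracy")

print("Manual loop:    ", np.round(manual_scores, 4))
print("cross_val_score:", np.round(library_scores, 4))
```

Every sample is used for validation exactly once, which is why the averaged score is a steadier estimate of generalization than a single train/test split.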

5. Data Preprocessing Basics

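import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Load data ('data.csv' is a placeholder path)
data = pd.read_csv('data.csv')

# Handle missing values: fill numeric columns with their column means
data = data.fillna(data.mean(numeric_only=True))

# Encode a categorical column as integers
# (LabelEncoder is intended for targets; for nominal input features consider OneHotEncoder)
le = LabelEncoder()
data['category'] = le.fit_transform(data['category'])

# Separate features and target ('target' is an assumed column name)
X = data.drop(columns=['target'])
y = data['target']

# Split first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)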
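
The same steps (imputation, encoding, scaling, splitting) can also be bundled into a single transformer, which makes it harder to leak test-set statistics into training by accident. The sketch below is one possible arrangement, not the only one: the small in-memory table stands in for `data.csv`, and all column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up table standing in for data.csv; all column names are assumptions.
data = pd.DataFrame({
    "age": [25, np.nan, 47, 35, 52, 29, 41, 38],
    "income": [3200, 4100, np.nan, 5000, 6100, 3900, 4500, 4800],
    "category": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})
X = data.drop(columns=["target"])
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

numeric_cols = ["age", "income"]
categorical_cols = ["category"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric columns: mean imputation, then standardization
        ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numeric_cols),
        # Categorical column: one-hot encoding; unseen categories become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics are learned from the training split only, then reused on the test split.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Because `fit_transform` is called only on the training split, the means, standard deviations, and category list used on the test split all come from training data.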

6. Your First Machine Learning Project

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
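
Beyond a single accuracy number, it often helps to look at which classes get confused and which features the model relies on. The following sketch repeats the setup of the project above so that it runs on its own; adding a confusion matrix and feature importances is a suggested follow-up, not part of the original project.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as the project above.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Impurity-based importance of each input feature (the values sum to 1).
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name:<20} {importance:.3f}")
```

Off-diagonal entries in the matrix show which species are mistaken for one another, which the overall accuracy hides.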

Summary

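Concept	Description
Supervised learning	Labeled data; classification/regression
Unsupervised learning	Unlabeled data; clustering/dimensionality reduction
Bias-variance	Underfitting vs. overfitting
Cross-validation	Estimates how well a model generalizes
Feature engineering	Data preprocessing is key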