Deep Learning|03 Overview of Machine Learning

In layman's terms, machine learning (ML) is the process of enabling computers to automatically learn and acquire knowledge from large amounts of data. In its early days in engineering, ML was usually called pattern recognition, and it focused on application tasks that are easy for humans but hard to specify with explicit rules.

Fundamental Concepts

There are a few basic concepts to learn first:

  • Feature: an attribute of an object. Generally, we use $\boldsymbol{x} \in \mathbb{R}^n$ to represent the feature vector of an object.

  • Label: generally, we use $y \in \mathbb{R}$ to represent the label of an object.

  • Sample: also called an instance.

  • Data Set: a collection of samples, typically split into a training set and a test set.

Suppose the training set $\mathcal{D}$ consists of $N$ samples, where the samples are independent and identically distributed (IID), that is, independently drawn from the same data distribution. It is denoted as:
$$\mathcal{D} = \{(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \dots, (\boldsymbol{x}^{(N)}, y^{(N)})\}.$$

Given the training set $\mathcal{D}$, we hope the computer can automatically find an "optimal" function $f^*(\boldsymbol{x})$ from a function set $\mathcal{F} = \{f_1(\boldsymbol{x}), f_2(\boldsymbol{x}), \dots\}$ to approximate the true mapping between the feature vector $\boldsymbol{x}$ and the label $y$ of each sample. For a sample $\boldsymbol{x}$, we can use $f^*(\boldsymbol{x})$ to predict the value of its label:
$$\hat{y} = f^*(\boldsymbol{x})$$

or to predict the conditional probability of the label:
$$\hat{p}(y \mid \boldsymbol{x}) = f_y^*(\boldsymbol{x})$$

How to find this "optimal" function $f^*(\boldsymbol{x})$ is the key problem of machine learning, which is generally accomplished through a learning algorithm $\mathcal{A}$. This search process is usually called the learning or training process.

Take predicting the quality of mangoes from their features as an example. The next time we buy mangoes from the market (test samples), we can predict their quality using the learned function $f^*(\boldsymbol{x})$. For fairness of evaluation, we draw another set of mangoes independently and identically as the test set $\mathcal{D}'$, test all mangoes in it, and calculate the accuracy of the predictions:
$$\text{Acc}(f^*) = \frac{1}{|\mathcal{D}'|} \sum_{(\boldsymbol{x}, y) \in \mathcal{D}'} \mathrm{I}\left(f^*(\boldsymbol{x}) = y\right)$$

where $\mathrm{I}(\cdot)$ is the indicator function, and $|\mathcal{D}'|$ is the size of the test set.
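
As a quick illustration, here is a minimal NumPy sketch of this accuracy computation; the classifier `f` and the arrays `X_test`, `y_test` are hypothetical placeholders:

```python
import numpy as np

def accuracy(f, X_test, y_test):
    """Fraction of test samples whose prediction equals the label:
    Acc(f*) = (1/|D'|) * sum over D' of I(f*(x) == y)."""
    y_pred = np.array([f(x) for x in X_test])  # predictions f*(x)
    return np.mean(y_pred == y_test)           # mean of the indicator values

# Toy usage: a "classifier" that thresholds the first feature.
f = lambda x: int(x[0] > 0.5)
X_test = np.array([[0.9], [0.2], [0.7], [0.1]])
y_test = np.array([1, 0, 0, 0])
print(accuracy(f, X_test, y_test))  # 0.75
```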

Three Fundamental Elements of ML

For an ML task, the input space $\mathcal{X}$ and output space $\mathcal{Y}$ form a sample space, from which a data set is generated: $(\boldsymbol{x}, y) \in \mathcal{X} \times \mathcal{Y}$. The goal of ML is to find a model that approximates the mapping from $\boldsymbol{x}$ to $y$, either as a functional mapping or as a conditional probability distribution.

Since we do not know the real mapping, we assume a function set $\mathcal{F}$ (called the hypothesis space) and find the best element $f^* \in \mathcal{F}$ to stand in for the real mapping. Generally, the hypothesis space $\mathcal{F}$ is a parameterized function family:
$$\mathcal{F} = \{f(\boldsymbol{x}; \boldsymbol{\theta}) \mid \boldsymbol{\theta} \in \mathbb{R}^m\}$$

where $f(\boldsymbol{x}; \boldsymbol{\theta})$ is a function with parameter vector $\boldsymbol{\theta}$, and $m$ is the number of parameters.

Common hypothesis spaces can be divided into linear and nonlinear types, corresponding to linear models and nonlinear models.

Linear Model

The hypothesis space of a linear model is a parameterized family of linear functions, i.e.:
$$f(\boldsymbol{x}; \boldsymbol{\theta}) = \boldsymbol{w}^\top \boldsymbol{x} + b$$

where the parameter $\boldsymbol{\theta}$ consists of the weight vector $\boldsymbol{w}$ and the bias $b$.

Nonlinear Model

A generalized nonlinear model can be written as a linear combination of multiple nonlinear basis functions $\phi(\boldsymbol{x})$:
$$f(\boldsymbol{x}; \boldsymbol{\theta}) = \boldsymbol{w}^\top \phi(\boldsymbol{x}) + b$$

where $\phi(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \phi_2(\boldsymbol{x}), \dots, \phi_K(\boldsymbol{x})]^\top$ is a vector composed of $K$ nonlinear basis functions, and the parameter $\boldsymbol{\theta}$ consists of the weight vector $\boldsymbol{w}$ and the bias $b$.

If $\phi(\boldsymbol{x})$ itself is a learnable basis function, e.g.:
$$\phi_k(\boldsymbol{x}) = h\left(\boldsymbol{w}_k^\top \phi'(\boldsymbol{x}) + b_k\right), \quad \forall\, 1 \leq k \leq K$$

where $h(\cdot)$ is a nonlinear function, $\phi'(\boldsymbol{x})$ is another set of basis functions, and $\boldsymbol{w}_k$ and $b_k$ are learnable parameters, then $f(\boldsymbol{x}; \boldsymbol{\theta})$ is equivalent to a neural network model.
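
To make the distinction concrete, here is a minimal NumPy sketch (all names and dimensions are illustrative): with a fixed basis the model stays linear in $(\boldsymbol{w}, b)$, while making the basis parameters learnable yields a one-hidden-layer neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 3, 8                     # input dimension, number of basis functions
x = rng.normal(size=n)

# Fixed nonlinear basis: phi(x) is chosen by hand (here, squared features).
def phi_fixed(x):
    return np.concatenate([x, x**2])       # 2n fixed basis functions

w_fixed = rng.normal(size=2 * n)
y_fixed = w_fixed @ phi_fixed(x) + 0.0     # linear in (w, b)

# Learnable basis: phi_k(x) = h(w_k^T x + b_k) with h = tanh.
# Stacking all K basis functions gives one hidden layer of a neural net.
W1 = rng.normal(size=(K, n))    # rows are the learnable w_k
b1 = np.zeros(K)                # learnable b_k
w2 = rng.normal(size=K)
hidden = np.tanh(W1 @ x + b1)   # phi(x): K learnable basis functions
y_nn = w2 @ hidden + 0.0        # f(x; theta) = w^T phi(x) + b
```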

Learning Principle

A good model $f(\boldsymbol{x}; \boldsymbol{\theta}^*)$ should be as consistent as possible with the true labels on $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^N$, i.e.:
$$\left|f(\boldsymbol{x}^{(n)}; \boldsymbol{\theta}^*) - y^{(n)}\right| < \varepsilon, \quad \forall\, (\boldsymbol{x}^{(n)}, y^{(n)}) \in \mathcal{D}$$

or consistent with the real conditional probability distribution $p_r(y \mid \boldsymbol{x})$, i.e.:
$$\left|f_{y^{(n)}}(\boldsymbol{x}^{(n)}; \boldsymbol{\theta}^*) - p_r(y^{(n)} \mid \boldsymbol{x}^{(n)})\right| < \varepsilon, \quad \forall\, (\boldsymbol{x}^{(n)}, y^{(n)}) \in \mathcal{D}$$

where $\varepsilon$ is a small positive constant.

The quality of a model can be evaluated by the expected risk $\mathcal{R}(\boldsymbol{\theta})$, defined as:
$$\mathcal{R}(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, y) \sim p_r(\boldsymbol{x}, y)} \left[\mathcal{L}(y, f(\boldsymbol{x}; \boldsymbol{\theta}))\right]$$

where $p_r(\boldsymbol{x}, y)$ is the real distribution of the data, and $\mathcal{L}(y, f(\boldsymbol{x}; \boldsymbol{\theta}))$ (called the loss function) measures the difference between the true label $y$ and the predicted value $f(\boldsymbol{x}; \boldsymbol{\theta})$.

Common loss functions are as follows:

  1. 0-1 Loss Function
     $$\mathcal{L}(y, f(\boldsymbol{x};\boldsymbol{\theta})) = \begin{cases} 0, & y = f(\boldsymbol{x};\boldsymbol{\theta}) \\ 1, & y \neq f(\boldsymbol{x};\boldsymbol{\theta}) \end{cases}$$

  2. Quadratic Loss Function
     $$\mathcal{L}(y, f(\boldsymbol{x};\boldsymbol{\theta})) = \frac{1}{2}\left[y - f(\boldsymbol{x};\boldsymbol{\theta})\right]^2$$

  3. Cross-Entropy Loss Function

     The cross-entropy loss function is generally used for classification problems. Assume the label $y \in \{1, \dots, C\}$ of a sample is a discrete category, and the model output $f(\boldsymbol{x}; \boldsymbol{\theta}) \in [0,1]^C$ is the conditional probability distribution of the category labels, i.e.:
     $$p(y = c \mid \boldsymbol{x}; \boldsymbol{\theta}) = f_c(\boldsymbol{x}; \boldsymbol{\theta})$$

     and it satisfies:
     $$f_c(\boldsymbol{x}; \boldsymbol{\theta}) \in [0,1], \quad \sum_{c=1}^C f_c(\boldsymbol{x}; \boldsymbol{\theta}) = 1.$$

We can use a $C$-dimensional one-hot vector $\boldsymbol{y}$ to represent the sample label: if the label of a sample is $k$, then the $k$-th dimension of $\boldsymbol{y}$ is 1 and all other dimensions are 0. The label vector $\boldsymbol{y}$ can be regarded as the true conditional probability distribution $p_r(\boldsymbol{y} \mid \boldsymbol{x})$ of the sample label: its $c$-th dimension $y_c$ ($1 \leq c \leq C$) is the true conditional probability of category $c$. That is, a sample of category $k$ belongs to category $k$ with probability 1 and to every other category with probability 0.

For two probability distributions, cross-entropy is generally used to measure their difference. The cross-entropy between the true label distribution $\boldsymbol{y}$ and the predicted distribution $f(\boldsymbol{x}; \boldsymbol{\theta})$ of the model is:
$$\mathcal{L}(\boldsymbol{y}, f(\boldsymbol{x}; \boldsymbol{\theta})) = -\boldsymbol{y}^\top \log f(\boldsymbol{x}; \boldsymbol{\theta}) = -\sum_{c=1}^C y_c \log f_c(\boldsymbol{x}; \boldsymbol{\theta})$$

For a one-hot label vector with true category $y$, the formula simplifies to:
$$\mathcal{L}(\boldsymbol{y}, f(\boldsymbol{x}; \boldsymbol{\theta})) = -\log f_y(\boldsymbol{x}; \boldsymbol{\theta})$$

It can be observed that the cross-entropy loss function is the negative log-likelihood of the true label under the model.

  4. Hinge Loss Function

     For a sample $(\boldsymbol{x}, y)$, where $y \in \{-1, 1\}$ is the class label and $f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b$ is the model's output, the hinge loss is defined as:
     $$\mathcal{L}(y, f(\boldsymbol{x})) = \max(0, 1 - y \cdot f(\boldsymbol{x}))$$
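
All four losses are easy to write down directly; below is a minimal NumPy sketch (the function and variable names are my own, not from any particular library):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return float(y != y_pred)

def quadratic_loss(y, f_x):
    """Quadratic loss: (1/2) * (y - f(x))^2."""
    return 0.5 * (y - f_x) ** 2

def cross_entropy_loss(y_onehot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(y_onehot * np.log(probs + eps))  # eps avoids log(0)

def hinge_loss(y, f_x):
    """Hinge loss for y in {-1, +1}: max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * f_x)

# Toy usage:
print(zero_one_loss(1, 0))                              # 1.0
print(quadratic_loss(2.0, 1.5))                         # 0.125
print(cross_entropy_loss(np.array([0, 1, 0]),
                         np.array([0.2, 0.7, 0.1])))    # ~0.357 = -log(0.7)
print(hinge_loss(-1, 0.3))                              # 1.3
```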

The target of ML is to minimize the empirical risk on $\mathcal{D}$:
$$\mathcal{R}_{\mathcal{D}}^{\text{emp}}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^N \mathcal{L}\left(y^{(n)}, f(\boldsymbol{x}^{(n)}; \boldsymbol{\theta})\right)$$

Thus, the optimal parameter $\boldsymbol{\theta}^*$ can be calculated as:
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{R}_{\mathcal{D}}^{\text{emp}}(\boldsymbol{\theta})$$

Optimization Algorithm

Hyper-parameters need to be set manually, while parameters can be learned through certain optimization algorithms.

Gradient Descent

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \cdot \nabla_{\boldsymbol{\theta}} \mathcal{R}_{\mathcal{D}}(\boldsymbol{\theta}_t)$$

where $\eta$ is the learning rate.

Based on the amount of data used in each iteration, gradient descent is mainly divided into three types (a runnable sketch follows the list):

  • Batch Gradient Descent (BGD): Uses all training data to calculate the gradient in each iteration. Its advantage is stable gradients, while its disadvantage is low computational efficiency when the data volume is large.
  • Stochastic Gradient Descent (SGD): Uses only a single training sample to calculate the gradient in each iteration. Its advantages are fast updates and gradient noise that can help escape poor local optima; its disadvantage is large gradient fluctuations.
  • Mini-Batch Gradient Descent (MBGD): Uses a small batch of training samples (e.g., 32, 64, or 128 samples) to calculate the gradient in each iteration. It balances the stability of BGD and the efficiency of SGD, and is currently the most commonly used form in deep learning.
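
Here is a minimal NumPy sketch of mini-batch gradient descent for linear regression with the quadratic loss (all names are illustrative). Setting `batch_size=1` recovers SGD, and `batch_size=N` recovers BGD:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression.

    Minimizes the empirical risk (1/N) * sum of (1/2) * (w^T x_n + b - y_n)^2.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = rng.permutation(N)                 # reshuffle each epoch
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            err = Xb @ w + b - yb                # prediction errors on the batch
            grad_w = Xb.T @ err / len(batch)     # batch gradient w.r.t. w
            grad_b = err.mean()                  # batch gradient w.r.t. b
            w -= lr * grad_w                     # theta_{t+1} = theta_t - eta * grad
            b -= lr * grad_b
    return w, b

# Toy usage: recover w = [2, -1], b = 0.5 from noisy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=500)
print(minibatch_gd(X, y))
```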

Linear Regression

From the perspective of ML, in linear regression the independent variable $\boldsymbol{x} \in \mathbb{R}^n$ is the feature vector of the sample, and the dependent variable is the label $y \in \mathbb{R}$ (a continuous value). The model is defined as:
$$f(\boldsymbol{x};\boldsymbol{w},b) = \boldsymbol{w}^\top \boldsymbol{x} + b$$

where $\boldsymbol{w}$ and $b$ are learnable parameters.

Define $\oplus$ as the vector concatenation operator. The linear regression model can be rewritten as:
$$f(\hat{\boldsymbol{x}};\hat{\boldsymbol{w}}) = \hat{\boldsymbol{w}}^\top \hat{\boldsymbol{x}}$$

where $\hat{\boldsymbol{x}} = \boldsymbol{x} \oplus 1$ and $\hat{\boldsymbol{w}} = \boldsymbol{w} \oplus b$.

It is easy to show that, with the quadratic loss, the empirical risk function of linear regression is:
$$\mathcal{R}(\hat{\boldsymbol{w}}) = \frac{1}{2}\left\|\boldsymbol{y} - X^\top \hat{\boldsymbol{w}}\right\|^2$$

where $X = [\hat{\boldsymbol{x}}^{(1)}, \dots, \hat{\boldsymbol{x}}^{(N)}]$ is the matrix whose columns are the augmented feature vectors of all samples in the training set $\mathcal{D}$, and $\boldsymbol{y} = [y^{(1)}, \dots, y^{(N)}]^\top$ is the vector of all labels.

Setting the gradient of $\mathcal{R}(\hat{\boldsymbol{w}})$ to zero yields the optimal parameter $\hat{\boldsymbol{w}}^*$ in closed form:
$$\hat{\boldsymbol{w}}^* = (XX^\top)^{-1}X\boldsymbol{y}$$
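
A minimal NumPy check of this closed-form solution, using the augmentation trick $\hat{\boldsymbol{x}} = \boldsymbol{x} \oplus 1$ (the data and true parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 3
X_raw = rng.normal(size=(N, n))                  # N samples, n features
w_true, b_true = np.array([1.0, -2.0, 0.5]), 3.0
y = X_raw @ w_true + b_true + 0.01 * rng.normal(size=N)

# Augment: x_hat = x ⊕ 1, so the bias is absorbed into the weights.
# X holds one column per sample, matching w_hat* = (X X^T)^{-1} X y.
X = np.vstack([X_raw.T, np.ones(N)])             # shape (n + 1, N)

# Solve the normal equations; solve() is more stable than an explicit inverse.
w_hat = np.linalg.solve(X @ X.T, X @ y)
print(w_hat)  # approximately [1.0, -2.0, 0.5, 3.0] = w_true ⊕ b_true
```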

Bias-Variance Decomposition

The generalization error of a model can be decomposed into three parts: bias, variance, and irreducible noise:
$$\underbrace{\mathbb{E}_{\mathcal{D}}\left[(y - f(\boldsymbol{x}))^2\right]}_{\text{Generalization Error}} = \underbrace{\left(f^*(\boldsymbol{x}) - \mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x})]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(f(\boldsymbol{x}) - \mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{Irreducible Noise}}$$

where the expectation $\mathbb{E}_{\mathcal{D}}$ is taken over different training sets $\mathcal{D}$, $f^*(\boldsymbol{x})$ is the optimal model, and $\epsilon$ is the label noise.

  • Bias: Refers to the difference between the average performance of a model on different training sets and the optimal model. It measures the fitting ability of the model (a high bias indicates underfitting).
  • Variance: Refers to the variation in model predictions across different training sets. It measures whether the model is prone to overfitting (a high variance indicates overfitting).
  • Irreducible Noise: The inherent noise in the data, which cannot be eliminated by any model.
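
The decomposition can be estimated empirically by refitting the same model class on many freshly drawn training sets. Below is a minimal Monte Carlo sketch (the true function, noise level, and polynomial degrees are illustrative): low degrees show high bias (underfitting), while high degrees show high variance (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin                       # true function f*(x)
sigma = 0.3                           # std of the irreducible noise epsilon
x0 = np.array([1.0])                  # test point at which we decompose

def fit_predict(degree, n_train=30):
    """Fit a degree-d polynomial on a fresh training set; predict at x0."""
    x = rng.uniform(-3, 3, n_train)
    y = f_star(x) + sigma * rng.normal(size=n_train)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x0)[0]

for degree in (1, 3, 9):
    preds = np.array([fit_predict(degree) for _ in range(2000)])
    bias2 = (f_star(x0)[0] - preds.mean()) ** 2  # (f*(x0) - E_D[f(x0)])^2
    var = preds.var()                            # E_D[(f(x0) - E_D[f(x0)])^2]
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={var:.4f}")
```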

Feature Engineering

Different types of data have different spaces of raw features, which requires us to select or extract informative features to use as model input.

Feature Selection

Feature selection is a critical preprocessing step in machine learning and data mining. Its core goal is to select a subset of the most relevant features from the original feature set, while removing irrelevant, redundant, or noisy features. This process simplifies models, reduces computational costs, mitigates overfitting, and improves interpretability---all without significantly sacrificing predictive performance.
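
As one common approach (a filter-style method), here is a minimal scikit-learn sketch; the dataset and the choice of `k=2` are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 4 original features
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 best by ANOVA F-score
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)                  # (150, 4) -> (150, 2)
print(selector.get_support())                      # boolean mask of kept features
```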

Feature Extraction

The core goal of feature extraction is to transform raw high-dimensional data into a lower-dimensional feature space by capturing the most informative patterns, while discarding noise, redundancy, or irrelevant details. Unlike feature selection (which selects a subset of original features), feature extraction creates new features that better represent the underlying structure of the data.
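
A classic example is principal component analysis (PCA), which builds new features as linear combinations of the original ones. A minimal scikit-learn sketch (the choice of two components is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)                  # extract 2 new features
X_new = pca.fit_transform(X)               # each combines all 4 original features

print(X.shape, "->", X_new.shape)          # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)       # variance captured by each component
```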
