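The relationship between data distribution and the choice of objective (loss) function

Note: this post quotes heavily from other sources.

I ran into questions about objective functions in a real project, so I looked into them and wrote up these notes.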

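I. Deriving the objective function by maximum likelihood

Using a particular loss function presupposes an assumption about how the labels are distributed. Under that assumption we write down the likelihood of all the samples, and then maximize it with an optimization method (convex optimization where the problem is convex), such as the familiar gradient descent.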

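1. Binary classification

The usual assumption is that the label follows a Bernoulli distribution, also called the two-point distribution.
The Bernoulli distribution is a discrete probability distribution with two outcomes: the random variable takes the value 1 if the trial succeeds and 0 if it fails. The success probability is $p$ and the failure probability is $q = 1-p$, so a single trial has mean $p$ and variance $p(1-p)$. Over $N$ independent trials the number of successes follows a binomial distribution, with mean $Np$ and variance $Np(1-p)$.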

Suppose we observe the data $D_1, D_2, D_3, \ldots, D_N$ in a binary classification problem. The maximum likelihood objective is:
$$\max P(D_1, D_2, D_3, \ldots, D_N)$$

The joint distribution is hard to compute, so we assume the samples are independent and identically distributed (i.i.d.), and the objective becomes:
$$\max \prod_{i=1}^{N} P(D_i) \tag{1}$$

Taking the logarithm does not move the maximizer, because $\log$ is strictly increasing, so the objective becomes:
$$\max \sum_{i=1}^{N} \log P(D_i) \tag{2}$$

Maximum likelihood for a Bernoulli random variable, with $P(X=1)=p$ and $P(X=0)=1-p$:
$$P(X)=p^{X}(1-p)^{1-X}, \qquad D = \{D_1, D_2, D_3, \ldots, D_N\} \tag{3}$$
$$\begin{aligned} \max_{p} \log P(D) &= \max_{p} \log \prod_{i=1}^{N} P(D_i) \\ &= \max_{p} \sum_{i=1}^{N} \log P(D_i) \\ &= \max_{p} \sum_{i=1}^{N} \left[ D_i \log p + (1-D_i)\log(1-p) \right] \end{aligned} \tag{4}$$

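Negating (4) turns the maximization into a minimization, and the result is the binary cross-entropy, that is, log loss in the special case of two classes.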
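To make the equivalence concrete, here is a minimal sketch (the label list and the grid of candidate values are made up for illustration). It evaluates the Bernoulli log-likelihood with a single shared parameter $p$ on a grid and checks that the maximizer coincides with the minimizer of the mean binary cross-entropy, and with the sample mean of the labels.

```python
import math

def bernoulli_log_likelihood(p, labels):
    """Log-likelihood of i.i.d. Bernoulli labels for one shared parameter p."""
    return sum(d * math.log(p) + (1 - d) * math.log(1 - p) for d in labels)

def binary_cross_entropy(p, labels):
    """Mean binary cross-entropy: the negated log-likelihood divided by N."""
    return -bernoulli_log_likelihood(p, labels) / len(labels)

labels = [1, 0, 1, 1, 0, 1, 0, 1]          # illustrative data: 5 ones out of 8
grid = [k / 1000 for k in range(1, 1000)]  # candidate values of p inside (0, 1)

p_max_ll = max(grid, key=lambda p: bernoulli_log_likelihood(p, labels))
p_min_bce = min(grid, key=lambda p: binary_cross_entropy(p, labels))
sample_mean = sum(labels) / len(labels)
print(p_max_ll, p_min_bce, sample_mean)
```

In a real classifier $p$ is not shared: each sample gets its own predicted probability, but the per-sample term of the loss is the same.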

2. Linear regression

The loss function for linear regression is the MSE. In its unnormalized form (the sum of squared errors) it reads:
$$C=\sum_{i=1}^{n}(y_i-\hat y_i)^2 \tag{5}$$

Other loss functions can be used for regression as well, such as MAE, Poisson, or Tweedie loss, although the result is then perhaps not so "linear" any more (...)

From a statistical point of view:
$$y^{(i)}=\theta^{T}x^{(i)}+\varepsilon^{(i)} \tag{6}$$

Assume the error term of the linear regression is normally distributed with mean zero:
$$p(\varepsilon^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varepsilon^{(i)})^{2}}{2\sigma^2}\right) \tag{7}$$

Substituting $\varepsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$ gives a normal density whose unknown parameter is the regression parameter $\theta$:
$$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^{2}}{2\sigma^2}\right) \tag{8}$$

Taking the log of the likelihood and simplifying:
$$\log L(\theta)=\log\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^{2}}{2\sigma^2}\right) \tag{9}$$
$$\sum_{i=1}^{m}\log\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^{2}}{2\sigma^2}\right)\right] = m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^T x^{(i)})^{2} \tag{10}$$

The first term on the right-hand side is a constant that does not depend on $\theta$. To maximize the likelihood, the sum in the second term must therefore be as small as possible, which gives:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^T x^{(i)})^{2} \tag{11}$$

This is the MSE up to a constant factor (the sum of squared errors scaled by 1/2).
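The derivation can be checked numerically. The sketch below is an illustration under stated assumptions: a one-feature model $y = ax + b$, made-up data, and a fixed $\sigma = 1$. It fits the line by closed-form least squares and then confirms that nudging the fitted parameters only increases the Gaussian negative log-likelihood.

```python
import math

def fit_least_squares(xs, ys):
    """Closed-form simple linear regression y = a*x + b (minimizes squared error)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

def gaussian_nll(a, b, xs, ys, sigma=1.0):
    """Negative Gaussian log-likelihood of the model y = a*x + b + noise."""
    m = len(xs)
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return m * math.log(math.sqrt(2 * math.pi) * sigma) + sse / (2 * sigma ** 2)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # made-up data, roughly y = 2x + 1
a_hat, b_hat = fit_least_squares(xs, ys)
best = gaussian_nll(a_hat, b_hat, xs, ys)

# Perturbing the least-squares solution can only increase the Gaussian NLL.
perturbed = [gaussian_nll(a_hat + da, b_hat + db, xs, ys)
             for da in (-0.1, 0.0, 0.1) for db in (-0.1, 0.0, 0.1)]
print(a_hat, b_hat, best <= min(perturbed))
```

Changing `sigma` rescales the squared-error term and shifts the constant, but it does not change which `(a, b)` is best.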

The errors must be independent and normally distributed, and the independent variables must be independent of the errors. In other words, the MSE loss of linear regression is the expression that falls out of asking: for some $(\mu,\sigma^2)$, which parameters make the observed residuals $\varepsilon_i$ most probable when $\varepsilon$ is normally distributed?

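MSE decomposes into variance plus squared bias, so minimizing MSE optimizes both at once, which is one reason it is so commonly used.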
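A small numeric check of that decomposition, as a sketch: the list of estimates and the true value are hypothetical, and the identity shown is the standard one for repeated estimates of a single fixed quantity, MSE = variance + bias².

```python
def mse_decomposition(estimates, true_value):
    """Return (mse, variance, squared_bias) for repeated estimates of one value."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    squared_bias = (mean_est - true_value) ** 2
    return mse, variance, squared_bias

# Hypothetical estimates of a quantity whose true value is 10.
mse, variance, squared_bias = mse_decomposition([9.0, 11.0, 12.0, 12.0], 10.0)
print(mse, variance + squared_bias)
```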

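With batch gradient descent the loss should be divided by the number of samples, so that its scale (and the scale of its gradient) stays comparable across batch sizes. The coefficient in front of the sum of squares is then 1/N.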
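A minimal sketch of full-batch gradient descent with the 1/N scaling, assuming a one-feature model $y = ax + b$; the learning rate, step count, and data are illustrative choices, not values from the referenced sources.

```python
def batch_gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = a*x + b by full-batch gradient descent on (1/N) * sum of squared errors."""
    a, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        residuals = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Dividing by n keeps the gradient scale independent of the batch size,
        # so the same learning rate works for 5 samples or 5000.
        grad_a = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
a, b = batch_gradient_descent(xs, ys)
print(round(a, 4), round(b, 4))
```

Without the division by `n`, the gradient would grow with the number of samples and the learning rate would have to shrink to compensate.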

Reference: 线性回归相关知识点 (key points on linear regression)

The basic assumption when using linear regression is that the noise is normally distributed. When the noise follows $N(0,\sigma^2)$, the dependent variable follows $N(ax^{(i)}+b,\sigma^2)$, where the prediction function is $y=ax^{(i)}+b$. This follows from the normal probability density function: if the noise is normal, then the dependent variable, conditional on $x^{(i)}$, is necessarily normal as well. So when we use MSE we are in effect assuming that $y$ given $x$ is normally distributed.

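II. In practice, what should we do when the data in a regression problem are not normally distributed (for example, long-tailed)?

Reference: 根据标签分布来选择损失函数 (choosing a loss function according to the label distribution)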

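1. Classification

In classification, equation (4), $\sum_{i=1}^{N} [D_i \log p+(1-D_i)\log(1-p)]$, makes no assumption about class proportions, so the class distribution does not enter the derivation of log loss. In practice, however, class imbalance does affect model performance, because it affects how well the model learns each class. The effect is most pronounced when there are few samples. With a large sample the model sees enough examples of every class to learn its characteristics, and imbalance handling is usually unnecessary.

Imbalanced learning is therefore mostly about "absolute imbalance": a class whose absolute number of samples is very small, rather than one whose proportion is small.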

2. Regression

In regression, continuous labels have no notion of "imbalance"; the issue is instead that the label distribution is not Gaussian. When the labels are not Gaussian, a model trained with MSE ($\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^T x^{(i)})^{2}$) as its loss tends to fit poorly. There are two common remedies:

1. Transform the labels, using a log transform or a Box-Cox transform, so that they are as close to Gaussian as possible;
2. Use a different loss function, such as Tweedie loss or Poisson loss.
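The first remedy can be sketched as follows. The labels are made up, the `skewness` helper is written here only for illustration, and the log transform uses `log1p` so that zero labels are handled; Box-Cox is not shown. The idea is to train on the transformed labels and invert the transform on the predictions.

```python
import math

def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up long-tailed labels: many small values and a few very large ones.
labels = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150, 900]
log_labels = [math.log1p(y) for y in labels]    # train on log(1 + y)
restored = [math.expm1(z) for z in log_labels]  # invert predictions with exp(z) - 1

print(round(skewness(labels), 2), round(skewness(log_labels), 2))
```

The transformed labels are still right-skewed in this toy example, only much less so; whether that is close enough to Gaussian has to be judged on the real data.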

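Tweedie loss

[Figure not preserved: a classic example of the Tweedie distribution.]
References: Predictive Modeling with the Tweedie Distribution; tweedie loss保姆级推导过程 (a step-by-step derivation of Tweedie loss); Tweedie 回归在信贷模型中的应用 (Tweedie regression in credit models)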

In some settings, such as insurance pricing, the labels are far from normally distributed: about 90% of the values are 0 and fewer than 10% are nonzero, which gives a long-tailed distribution. It has two characteristics:

right-skewed: most samples sit on the left and very few on the right, which produces the long tail;
zero-inflated: a large share of the samples is concentrated at a label value of exactly 0.

Insurance pricing models usually denote the number of claims by the random variable $N$ and assume it is Poisson distributed, $N \sim \mathrm{Pois}(\lambda)$. The amount of the $i$-th claim is denoted $Z_i$ and assumed to be Gamma distributed, $Z_i \sim \mathrm{Gamma}(\alpha,\gamma)$. The total claim amount $Y$ is then:

$$Y = \sum_{i=1}^{N}Z_i$$
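A quick simulation shows how this construction produces zero-inflated, long-tailed totals. This is a sketch under assumptions: the parameter values are illustrative, the second Gamma parameter is treated as a scale (the text does not say whether it is a rate or a scale), and the Poisson sampler is a simple hand-written one.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_total_claim(lam, shape, scale, rng):
    """Y = Z_1 + ... + Z_N with N ~ Pois(lam) and Z_i ~ Gamma(shape, scale)."""
    n_claims = sample_poisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_claims))

rng = random.Random(0)
lam, shape, scale = 0.1, 2.0, 500.0  # illustrative values, not from the source
totals = [sample_total_claim(lam, shape, scale, rng) for _ in range(20000)]
zero_share = sum(1 for y in totals if y == 0) / len(totals)
print(round(zero_share, 3), round(math.exp(-lam), 3))
```

The share of exact zeros should be close to $P(N=0) = e^{-\lambda}$, which is about 0.905 for $\lambda = 0.1$ and matches the "about 90% zeros" situation described above.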

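In probability theory the distribution of $Y$ is called the compound Poisson-Gamma distribution. The Tweedie loss function can be derived from its probability density function by maximum likelihood estimation; put differently, the compound Poisson-Gamma distribution is the special case of the Tweedie family with power parameter $1 < \rho < 2$. A comparison of the usual loss functions with Tweedie:

MSE is derived by assuming the true labels are normally distributed; it is sensitive to outliers and pulls predictions towards mean(y).
MAE pulls predictions towards the median.
Tweedie penalizes predictions below the true value more heavily than predictions above it (useful for insurance and other loss-estimation settings).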

The Tweedie loss function has the form:
$$\mathrm{Tweedie} = \frac{1}{N} \sum_{i=1}^{N} -\left(y_i \frac{\mu_{i}^{1-\rho}}{1-\rho}-\frac{\mu_{i}^{2-\rho}}{2-\rho}\right)$$

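In practical regression settings (such as forecasting), it can avoid the tendency of MAE or MSE to produce nearly identical predictions for every sample.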
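Below is a minimal sketch of the Tweedie loss for $1 < \rho < 2$, where $y_i$ is the label and $\mu_i > 0$ the predicted mean. The value $\rho = 1.5$ and the toy labels are assumptions for illustration. It checks two properties: the best constant prediction is the label mean even when most labels are zero, and under-prediction costs more than over-prediction by the same amount.

```python
def tweedie_loss(y_true, mu_pred, rho=1.5):
    """Mean Tweedie negative log-likelihood, dropping terms that do not depend on mu.

    Valid for 1 < rho < 2; predictions mu must be strictly positive.
    """
    total = 0.0
    for y, mu in zip(y_true, mu_pred):
        total += -(y * mu ** (1 - rho) / (1 - rho) - mu ** (2 - rho) / (2 - rho))
    return total / len(y_true)

y_true = [0.0, 0.0, 0.0, 0.0, 10.0]             # zero-inflated toy labels, mean 2.0
candidates = [k / 100 for k in range(1, 1001)]  # constant predictions 0.01 .. 10.00
best_constant = min(candidates,
                    key=lambda mu: tweedie_loss(y_true, [mu] * len(y_true)))

exact = tweedie_loss([10.0], [10.0])
under = tweedie_loss([10.0], [5.0])   # prediction 5 below the label
over = tweedie_loss([10.0], [15.0])   # prediction 5 above the label
print(best_constant, under - exact, over - exact)
```

In practice the model usually predicts $\log \mu$ and exponentiates, which keeps $\mu$ positive; that detail is left out of the sketch.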

III. Classic loss functions

Reference: 经典损失函数 (classic loss functions)

Mean squared error (MSE)
Cross-entropy loss
Log loss
Smooth L1 loss
Smooth L2 loss
Log cosine loss
Margin loss
Log sigmoid loss
Hinge loss
Triplet loss

1. Regression losses

(1) Mean squared error (MSE)

$$MSE=\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}-\hat y^{(i)})^{2}$$

$y^{(i)}$ is the true value, $\hat{y}^{(i)}$ is the predicted value, and $m$ is the number of samples.

(2) Smooth L1 loss (the special case of the Huber function with δ=1)

Smooth L1 loss is a hybrid loss function for regression problems: quadratic for small errors and linear for large ones. Written in terms of the residual $r = y - \hat{y}$, it is:
$$L_{SmoothL1}(y, \hat{y}) = \begin{cases} 0.5\,r^2 & \text{if } |r| \leq c \\ c|r| - 0.5\,c^2 & \text{if } |r| > c \end{cases}, \qquad r = y - \hat{y}$$

Here $y$ is the true value, $\hat{y}$ is the predicted value, and $c$ is a positive threshold constant. Setting $c = 1$ gives the Smooth L1 loss named in the heading; a small value such as 0.01 makes the loss linear for almost every residual.

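Huber loss (smooth mean absolute error). The Huber function combines MAE and MSE, and unlike MAE it is differentiable where the error is 0. It has a hyperparameter δ whose value decides whether Huber leans towards MSE or MAE:

as δ → 0, Huber loss approaches MAE (scaled by δ);

as δ → ∞, Huber loss approaches MSE.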

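Why use Huber loss? A major problem with training neural networks on MAE is that its gradient has the same magnitude everywhere, so it stays large even close to the minimum, and gradient descent may overshoot the minimum at the end of training. With MSE the gradient shrinks as the loss approaches its minimum, which makes the final result more precise.

Huber loss can be very useful here, because it curves around the minimum and so reduces the gradient there. It is also more robust to outliers than MSE. It therefore combines the good properties of MSE and MAE.

The problem with Huber loss, however, is that the hyperparameter δ may have to be tuned iteratively.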
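A minimal sketch of the Huber loss and its gradient for a single residual, written from the standard piecewise definition (quadratic for $|r| \le \delta$, linear beyond). The sample residuals and δ values are illustrative. The last lines probe the two limits of δ.

```python
def huber(residual, delta=1.0):
    """Huber loss of one residual: quadratic inside [-delta, delta], linear outside."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * r - 0.5 * delta * delta

def huber_grad(residual, delta=1.0):
    """Derivative with respect to the residual: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))

for r in (0.1, 1.0, 10.0):
    print(r, huber(r), huber_grad(r))

# Large delta: every residual is in the quadratic branch, so the loss is 0.5 * r^2 (MSE-like).
# Small delta: almost every residual is in the linear branch, so loss / delta is about |r| (MAE-like).
print(huber(3.0, delta=100.0), huber(3.0, delta=1e-3) / 1e-3)
```

The clipped gradient is what gives Huber both properties at once: it shrinks near the minimum like MSE and is bounded for outliers like MAE.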

(3) Smooth L2 loss

Smooth L2 loss is a hybrid loss function for regression problems: a squared error plus a quadratic penalty on the prediction. Its formula is:
$$L_{SmoothL2}(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 + \frac{\lambda}{2}\hat{y}^2$$

Here $y$ is the true value, $\hat{y}$ is the predicted value, and $\lambda$ is a positive constant that controls the strength of the smoothing term.

(4) Mean absolute error (MAE) and L1 error (nn.L1Loss)

Both MAE and the L1 error are computed from the distance between the model prediction $f(x)$, written $y_i^{p}$ below, and the true value $y$. They differ only in whether the sum is averaged:
$$MAE = \frac{1}{n} \sum_{i=1}^{n}|y_i-y_{i}^{p}|$$
$$L1 = \sum_{i=1}^{n}|y_i-y_{i}^{p}|$$
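The two formulas differ only by the factor 1/n, as this small sketch with made-up values shows. (In PyTorch, `nn.L1Loss` averages by default, so its default output corresponds to MAE; the summed form corresponds to `reduction='sum'`.)

```python
def l1_sum(y_true, y_pred):
    """Summed absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: the L1 sum divided by the number of samples."""
    return l1_sum(y_true, y_pred) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(l1_sum(y_true, y_pred), mae(y_true, y_pred))
```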

2. Classification losses

(1) Cross-entropy loss

For binary labels, the cross-entropy between the label distribution and the predicted distribution is:
$$H(p, q) = -\sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$

Here $y_i$ is the true label (0 or 1), $\hat{y}_i$ is the predicted probability, and $n$ is the number of samples.

(2) Log loss

Log loss is the special case of cross-entropy loss for binary classification, averaged over the samples. Its formula is:
$$LogLoss = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$
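A direct implementation needs one practical safeguard: a predicted probability of exactly 0 or 1 would make a logarithm blow up. The sketch below clips probabilities into $[\epsilon, 1-\epsilon]$; the value of `eps` and the example inputs are illustrative assumptions.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # confident and mostly right
print(log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # uninformative predictions
print(log_loss([1], [0.0]))                          # finite only because of clipping
```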

(3) Log cosine loss

Log cosine loss is a loss function for angular-similarity problems. As given in the referenced source, its formula is:
$$LogCosineLoss = -\frac{1}{2} \log(2 - 2 \cos(\theta))$$

Here $\theta$ is the angle between the true direction and the predicted direction. Note that $2 - 2\cos(\theta)$ is the squared Euclidean distance between two unit vectors separated by the angle $\theta$.

(4) Margin loss

Margin loss is a classification loss that penalizes predictions which are not on the correct side of the decision boundary by a margin of at least 1. In the form given here it is:
$$MarginLoss = \max(0, 1 - y \cdot \hat{y})$$

Here $y$ is the true label, encoded as $-1$ or $+1$, and $\hat{y}$ is the model's raw score (not a probability).

(5) Log sigmoid loss

Log sigmoid loss is a loss function for binary classification. It is the log loss applied to a probability obtained by passing the model's raw output $z_i$ through the sigmoid, $\hat{y}_i = \sigma(z_i) = 1/(1+e^{-z_i})$:
$$LogSigmoidLoss = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$
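When the probability comes from a sigmoid, the loss can be computed directly from the raw score $z$, which avoids taking the log of a value that has rounded to 0 or 1. The sketch below uses the identity $-[y\log\sigma(z) + (1-y)\log(1-\sigma(z))] = \mathrm{softplus}(z) - yz$; the function names and example scores are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_sigmoid_loss(y_true, logits):
    """Mean binary log loss computed from raw scores: softplus(z) - y * z."""
    return sum(softplus(z) - y * z for y, z in zip(y_true, logits)) / len(y_true)

y_true = [1, 0, 1]
logits = [2.0, -1.0, 0.5]
stable = log_sigmoid_loss(y_true, logits)
# The same quantity via explicit probabilities, for comparison on moderate scores.
naive = -sum(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z))
             for y, z in zip(y_true, logits)) / len(y_true)
print(stable, naive)
```

For an extreme score such as `z = 1000.0` with label 0, the stable form still returns a finite loss, whereas the probability-based form would evaluate `log(1 - 1.0)`.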

(6) Hinge loss

Hinge loss is the standard form of margin loss, best known from support vector machines. It is zero once a sample is classified correctly with a margin of at least 1, and grows linearly otherwise:
$$HingeLoss = \max(0, 1 - y \cdot \hat{y})$$

Here $y$ is the true label, encoded as $-1$ or $+1$, and $\hat{y}$ is the model's raw score (not a probability).
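A minimal sketch of the hinge formula, assuming labels encoded as $-1$ / $+1$ and unbounded raw scores; the example values are made up.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; labels must be -1 or +1, scores are raw model outputs."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)

y_true = [1, -1, 1, -1]
scores = [2.0, -0.5, 0.3, 1.0]
# Per-sample terms: 0 (correct, margin met), 0.5 and 0.7 (correct side, margin
# not met), 2.0 (wrong side of the boundary).
loss = hinge_loss(y_true, scores)
print(loss)
```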

(7) Triplet loss

Triplet loss is a loss function for metric learning, computed over triplets of samples. Its formula is:
$$TripletLoss = \max(0, d(a, p) - d(a, n) + m)$$

Here $a$ is the anchor (query) sample, $p$ is a positive sample, $n$ is a negative sample, $d(a, p)$ is the distance between the anchor and the positive, $d(a, n)$ is the distance between the anchor and the negative, and $m$ is a positive constant: the margin by which the negative must be farther from the anchor than the positive.
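A minimal sketch of the triplet formula, assuming Euclidean distance for $d$ and 2-D points as stand-ins for embeddings (both are illustrative choices).

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + m) with Euclidean distance d."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [3.0, 4.0]  # d(a, p) = 5
near_negative = [0.0, 5.5]                 # d(a, n) = 5.5: farther, but inside the margin
far_negative = [0.0, 10.0]                 # d(a, n) = 10: margin satisfied

print(triplet_loss(anchor, positive, near_negative))
print(triplet_loss(anchor, positive, far_negative))
```

Only triplets that violate the margin contribute a nonzero loss, which is why the choice of triplets matters so much when training with this loss.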

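How to choose a suitable loss function in industrial applications

Choosing a suitable loss function

1. MAE, MSE, and Huber

In terms of error: MSE measures how much the data vary, while MAE better reflects the actual size of the prediction errors.
In terms of outliers: if the outliers are only corrupted values from data extraction or sampling mistakes made during cleaning, they deserve little attention and we should choose MAE. If the outliers are real data, or important anomalies that need to be detected, we should choose MSE.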

Because MSE squares the error $e$ (where $e = y - y_{predicted}$), the penalty grows rapidly once $|e| > 1$. If the data contain an outlier, $e$ is large and $e^2$ is far larger than $|e|$. Compared with a model trained on MAE, a model trained on MSE therefore gives outliers much more weight.

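If we may give only a single prediction for all observations and want to minimize MSE, that prediction should be the mean of all target values. If we want to minimize MAE instead, it should be the median of all target values. The median is more robust to outliers than the mean, which makes MAE more robust than MSE.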
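This claim is easy to verify by brute force. The sketch below uses made-up targets with one outlier and searches a grid of constant predictions for the minimizer of each loss.

```python
import statistics

targets = [1.0, 2.0, 3.0, 4.0, 100.0]          # made-up targets with one outlier
candidates = [k / 10 for k in range(0, 1001)]  # constant predictions 0.0 .. 100.0

def mse(c):
    return sum((t - c) ** 2 for t in targets) / len(targets)

def mae(c):
    return sum(abs(t - c) for t in targets) / len(targets)

best_mse = min(candidates, key=mse)
best_mae = min(candidates, key=mae)
print(best_mse, statistics.mean(targets))    # MSE minimizer vs. mean
print(best_mae, statistics.median(targets))  # MAE minimizer vs. median
```

The single outlier drags the MSE-optimal constant far away from the bulk of the data, while the MAE-optimal constant stays with it.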

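A major problem with MAE loss (especially for neural networks) is that its gradient has the same magnitude everywhere, so the gradient is still large when the loss is small. That is bad for learning. One fix is a dynamic learning rate that decreases as the minimum is approached. MSE behaves well here and converges even with a fixed learning rate: its gradient is large when the loss is high and shrinks as the loss approaches 0, which makes it more precise at the end of training.
For example, a model trained with MAE might predict 150 for every observation and ignore the 10% of outlying cases, because it tries to get close to the median.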
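The difference in gradient behaviour shows up even in a one-parameter toy problem. In this sketch (the target, learning rate, and step count are arbitrary illustrative choices) plain gradient descent with a fixed learning rate settles on the target under squared error but keeps bouncing around it under absolute error.

```python
def sign(v):
    return (v > 0) - (v < 0)

def descend(grad, start=0.0, lr=0.3, steps=50):
    """Plain gradient descent on one scalar parameter with a fixed learning rate."""
    w = start
    for _ in range(steps):
        w -= lr * grad(w)
    return w

target = 1.0

def mse_grad(w):
    return 2.0 * (w - target)       # d/dw (w - target)^2: shrinks near the target

def mae_grad(w):
    return float(sign(w - target))  # d/dw |w - target|: magnitude 1 away from the target

w_mse = descend(mse_grad)
w_mae = descend(mae_grad)
print(abs(w_mse - target), abs(w_mae - target))
```

With the absolute-error gradient every step has length `lr`, so the parameter can get no closer than a fraction of `lr` unless the learning rate is decayed.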

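In terms of convergence speed: MSE > Huber > MAE.
In terms of gradient computation: MSE is preferable to MAE; its gradient changes with the error, so MSE converges quickly and accurately.
In terms of the model: for most CNNs we generally use MSE rather than MAE, because training speed matters a great deal for CNNs. For bounding-box regression a squared loss is also a common choice, but its drawback is that when outliers are present they make up most of the loss. For object detection, Fast R-CNN uses a somewhat gentler absolute loss (smooth L1 loss), which grows linearly with the error rather than quadratically.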

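Reference: 5 Regression Loss Functions All Machine Learners Should Know

[Figure not preserved. Continuous loss functions: (A) MSE loss; (B) MAE loss; (C) Huber loss; (D) Quantile loss. Example of fitting a smoothed GBM to noisy sinc(x) data: (E) original sinc(x) function; (F) smoothed GBM fitted with MSE and MAE loss; (G) smoothed GBM fitted with Huber loss, δ = {4, 2, 1}; (H) smoothed GBM fitted with Quantile loss.]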