Mean Absolute Error (MAE) Loss Function - Derivatives and Gradients (导数和梯度)


  • 1. torch.nn.L1Loss
    • 1.1. Parameters
    • 1.2. Shape
    • 1.3. `torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')`
    • 1.4. `torch.nn.L1Loss(size_average=None, reduce=None, reduction='sum')`
  • 2. sign function or signum function
    • 2.1. Basic properties
    • 2.2. numpy.sign
  • 3. Mean Absolute Error (MAE) Loss Function - Derivatives and Gradients (导数和梯度)
  • 4. Mean Absolute Error (MAE) Loss Function - Derivatives and Gradients (导数和梯度)
    • 4.1. PyTorch `torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')`
    • 4.2. PyTorch `torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')`
    • 4.3. Python `torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')`
    • 4.4. Python `torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')`

1. torch.nn.L1Loss

https://docs.pytorch.org/docs/stable/generated/torch.nn.L1Loss.html
https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/loss.py

class torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

Creates a criterion that measures the mean absolute error (MAE) between each element in the input $x$ and target $y$.

The unreduced (i.e. with reduction set to 'none') loss can be described as:

$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = \left| x_n - y_n \right|,$$

where $N$ is the batch size.

If reduction is not 'none' (default 'mean'), then:

$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';} \\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$

$x$ and $y$ are tensors of arbitrary shapes with a total of $N$ elements each.

The sum operation still operates over all the elements, and divides by $N$.

The division by $N$ can be avoided if one sets reduction = 'sum'.

Supports real-valued and complex-valued inputs.
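
As a quick check of the formulas above, here is a minimal sketch (with illustrative tensor values) comparing the three reduction modes: 'mean' averages the element-wise losses over the $N$ elements, and 'sum' skips the division by $N$.

import torch
import torch.nn as nn

x = torch.tensor([1.0, 2.0, 3.0, 4.0])  # input
y = torch.tensor([0.0, 2.0, 4.0, 2.0])  # target

l_none = nn.L1Loss(reduction='none')(x, y)  # element-wise losses l_n = |x_n - y_n| -> tensor([1., 0., 1., 2.])
l_mean = nn.L1Loss(reduction='mean')(x, y)  # sum(L) / N -> tensor(1.)
l_sum = nn.L1Loss(reduction='sum')(x, y)    # sum(L), no division by N -> tensor(4.)

assert torch.isclose(l_mean, l_none.mean())
assert torch.isclose(l_sum, l_none.sum())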

1.1. Parameters

  • size_average (bool, optional): Deprecated (see reduction). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there are multiple elements per sample. If the field size_average is set to False, the losses are instead summed for each minibatch. Ignored when reduce is False. Default: True

  • reduce (bool, optional): Deprecated (see reduction). By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per batch element instead and ignores size_average. Default: True

  • reduction (str, optional): Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Note: size_average and reduce are in the process of being deprecated, and in the meantime, specifying either of those two args will override reduction. Default: 'mean'

1.2. Shape

  • Input : (*), where * means any number of dimensions.

  • Target : (*), same shape as the input.

  • Output : scalar. If reduction is 'none', then (*), same shape as the input.

1.3. torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

y_true: Ground truth values with shape = [d1, d2, ..., dN].
y_pred: The predicted values with shape = [d1, d2, ..., dN]. (Tensor of the same shape as y_true)

MAELoss (mean) = np.mean(np.abs(y_pred - y_true))

Mean Absolute Error (MAE):

$$\text{MAE (mean)} = \frac{1}{d_1 \times d_2 \times \dots \times d_N} \sum_{i=1}^{d_1 \times d_2 \times \dots \times d_N} \left| \boldsymbol{y\_pred}_{i} - \boldsymbol{y\_true}_{i} \right|$$

The MAE function gradient (Tensor of the same shape as y_pred):

$$\boldsymbol{J}(\text{mean}) = \frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \frac{1}{d_1 \times d_2 \times \dots \times d_N} \, \text{sgn}(\boldsymbol{y\_pred} - \boldsymbol{y\_true})$$
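
A minimal NumPy sketch of these two formulas, using illustrative y_true / y_pred arrays (the PyTorch examples in Section 4 confirm the same values via autograd):

import numpy as np

# Illustrative values; any arrays with matching shape [d1, d2, ..., dN] work the same way.
y_true = np.array([[0.0, 2.0], [4.0, 2.0]])
y_pred = np.array([[1.0, 2.0], [3.0, 4.0]])

n = y_pred.size                          # d1 * d2 * ... * dN
loss = np.mean(np.abs(y_pred - y_true))  # MAE (mean) = 1.0
grad = np.sign(y_pred - y_true) / n      # gradient, same shape as y_pred

print(loss)
print(grad)  # [[ 0.25  0.  ]
             #  [-0.25  0.25]]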

1.4. torch.nn.L1Loss(size_average=None, reduce=None, reduction='sum')

y_true: Ground truth values with shape = [d1, d2, ..., dN].
y_pred: The predicted values with shape = [d1, d2, ..., dN]. (Tensor of the same shape as y_true)

MAELoss (sum) = np.sum(np.abs(y_pred - y_true))

Mean Absolute Error (MAE):

$$\text{MAE (sum)} = \sum_{i=1}^{d_1 \times d_2 \times \dots \times d_N} \left| \boldsymbol{y\_pred}_{i} - \boldsymbol{y\_true}_{i} \right|$$

The MAE function gradient (Tensor of the same shape as y_pred):

$$\boldsymbol{J}(\text{sum}) = \frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \text{sgn}(\boldsymbol{y\_pred} - \boldsymbol{y\_true})$$
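
A minimal NumPy sketch for the 'sum' reduction, with the same illustrative arrays as above; the gradient is just the element-wise sign, without the division by the element count:

import numpy as np

y_true = np.array([[0.0, 2.0], [4.0, 2.0]])
y_pred = np.array([[1.0, 2.0], [3.0, 4.0]])

loss = np.sum(np.abs(y_pred - y_true))  # MAE (sum) = 4.0
grad = np.sign(y_pred - y_true)         # gradient = sgn(y_pred - y_true), same shape as y_pred

print(loss)
print(grad)  # [[ 1.  0.]
             #  [-1.  1.]]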

2. sign function or signum function

In mathematics, the sign function or signum function is a function that returns -1, +1, or 0 according to whether a given real number is negative, positive, or zero.

In mathematical notation the sign function is often represented as $\text{sgn}\ x$ or $\text{sgn}(x)$.

$$\text{sgn}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$

For example:

$$\begin{aligned} \text{sgn}(2) &= +1 \\ \text{sgn}(\pi) &= +1 \\ \text{sgn}(-8) &= -1 \\ \text{sgn}\left(-\tfrac{1}{2}\right) &= -1 \\ \text{sgn}(0) &= 0 \end{aligned}$$

2.1. Basic properties

Any real number can be expressed as the product of its absolute value and its sign:
$$x = |x| \, \text{sgn}(x)$$

It follows that whenever $x$ is not equal to 0 we have
$$\text{sgn}(x) = \frac{x}{|x|} = \frac{|x|}{x}$$

Similarly, for any real number $x$,
$$|x| = x \, \text{sgn}(x)$$

2.2. numpy.sign

https://numpy.org/doc/stable/reference/generated/numpy.sign.html

Returns an element-wise indication of the sign of a number.

The sign function returns -1 if x < 0, 0 if x==0, 1 if x > 0. nan is returned for nan inputs.

Notes

There is more than one definition of sign in common use for complex numbers. The definition used here, $x/|x|$, is the more common and useful one, but it differs from the one used in NumPy prior to version 2.0, $x/\sqrt{x \times x}$, which is equivalent to sign(x.real) + 0j if x.real != 0 else sign(x.imag) + 0j.
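
A small NumPy sketch of the behaviour described above, with illustrative real-valued inputs:

import numpy as np

x = np.array([2.0, np.pi, -8.0, -0.5, 0.0, np.nan])
print(np.sign(x))  # [ 1.  1. -1. -1.  0. nan]

# The identities from Section 2.1 hold element-wise for x != 0:
x = np.array([2.0, np.pi, -8.0, -0.5])
assert np.allclose(np.sign(x), x / np.abs(x))
assert np.allclose(np.abs(x), x * np.sign(x))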

3. Mean Absolute Error (MAE) Loss Function - Derivatives and Gradients (导数和梯度)

Predicted values:

$$\boldsymbol{y\_pred} = \boldsymbol{\hat{y}} = \left[ \hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{N} \right]$$

True values:

$$\boldsymbol{y\_true} = \boldsymbol{y} = \left[ y_{1}, y_{2}, \dots, y_{N} \right]$$

Mean Absolute Error (MAE):

$$\begin{aligned} \text{MAE} = L &= \frac{1}{N} \sum_{i=1}^{N} \left| \boldsymbol{y\_pred}_{i} - \boldsymbol{y\_true}_{i} \right| \\ &= \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{N} \left( |\hat{y}_{1} - y_{1}| + |\hat{y}_{2} - y_{2}| + \ldots + |\hat{y}_{N} - y_{N}| \right) \end{aligned}$$

where $\text{MAE} = L$ is a scalar and $N$ is a constant.

The partial derivative with respect to $\hat{y}_{j}$ is:

$$\begin{aligned} \frac{\partial L}{\partial \hat{y}_{j}} &= \frac{\partial}{\partial \hat{y}_{j}} \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{N} \frac{\partial}{\partial \hat{y}_{j}} \sum_{i=1}^{N} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \hat{y}_{j}} \left| \hat{y}_{i} - y_{i} \right| \end{aligned}$$

If $i \neq j$, then $|\hat{y}_{i} - y_{i}|$ is not a function of $\hat{y}_{j}$, so $\frac{\partial}{\partial \hat{y}_{j}} |\hat{y}_{i} - y_{i}| = 0$. We can therefore drop the summation, because every term with $i \neq j$ contributes 0, and only the term with $i = j$ remains:

$$\begin{aligned} \frac{\partial L}{\partial \hat{y}_{j}} &= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \hat{y}_{j}} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{N} \frac{\partial}{\partial \hat{y}_{j}} \left| \hat{y}_{j} - y_{j} \right| \\ &= \frac{1}{N} \, \text{sgn}(\hat{y}_{j} - y_{j}) \frac{\partial}{\partial \hat{y}_{j}} (\hat{y}_{j} - y_{j}) \\ &= \frac{1}{N} \, \text{sgn}(\hat{y}_{j} - y_{j}) \\ &= \frac{1}{N} \frac{\hat{y}_{j} - y_{j}}{|\hat{y}_{j} - y_{j}|} \\ &= \frac{1}{N} \frac{|\hat{y}_{j} - y_{j}|}{\hat{y}_{j} - y_{j}} \end{aligned}$$

The last two forms assume $\hat{y}_{j} \neq y_{j}$; at $\hat{y}_{j} = y_{j}$ the absolute value is not differentiable, and the convention $\text{sgn}(0) = 0$ is used as a subgradient, which matches the zero entries in the gradients computed in Section 4.

Finally, the gradient $\frac{\partial L}{\partial \boldsymbol{\hat{y}}}$ is the vector whose $j$-th component is $\frac{\partial L}{\partial \hat{y}_{j}}$, so we have $\frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \frac{1}{N} \, \text{sgn}(\boldsymbol{\hat{y}} - \boldsymbol{y})$, with $\text{sgn}$ applied element-wise.

The MAE function gradient (Jacobian for MAE):

$$\text{MAE} = L = f(\hat{y}_{1}, \hat{y}_{2}, \ldots, \hat{y}_{N})$$

$$\begin{aligned} \boldsymbol{J} &= \frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \frac{\partial L}{\partial (\hat{y}_{1}, \hat{y}_{2}, \ldots, \hat{y}_{N})} \\ &= \left[ \frac{\partial L}{\partial \hat{y}_{1}}, \frac{\partial L}{\partial \hat{y}_{2}}, \ldots, \frac{\partial L}{\partial \hat{y}_{N}} \right] \\ &= \left[ \frac{1}{N} \, \text{sgn}(\hat{y}_{1} - y_{1}), \frac{1}{N} \, \text{sgn}(\hat{y}_{2} - y_{2}), \ldots, \frac{1}{N} \, \text{sgn}(\hat{y}_{N} - y_{N}) \right] \\ &= \frac{1}{N} \left[ \text{sgn}(\hat{y}_{1} - y_{1}), \text{sgn}(\hat{y}_{2} - y_{2}), \ldots, \text{sgn}(\hat{y}_{N} - y_{N}) \right] \\ &= \frac{1}{N} \, \text{sgn}(\boldsymbol{\hat{y}} - \boldsymbol{y}) = \frac{1}{N} \, \text{sgn}(\boldsymbol{y\_pred} - \boldsymbol{y\_true}) \\ &= \frac{1}{N} \frac{\boldsymbol{\hat{y}} - \boldsymbol{y}}{|\boldsymbol{\hat{y}} - \boldsymbol{y}|} = \frac{1}{N} \frac{|\boldsymbol{\hat{y}} - \boldsymbol{y}|}{\boldsymbol{\hat{y}} - \boldsymbol{y}} \end{aligned}$$
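
The derivation can be spot-checked numerically. The sketch below (illustrative random data) compares the analytic gradient $\frac{1}{N} \, \text{sgn}(\boldsymbol{\hat{y}} - \boldsymbol{y})$ with central finite differences; they agree as long as no component sits exactly at the kink $\hat{y}_{j} = y_{j}$.

import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.normal(size=5)
y_true = rng.normal(size=5)

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

# Analytic gradient: (1 / N) * sgn(y_pred - y_true)
grad_analytic = np.sign(y_pred - y_true) / y_pred.size

# Central finite differences (valid away from points where y_pred[j] == y_true[j])
eps = 1e-6
grad_numeric = np.zeros_like(y_pred)
for j in range(y_pred.size):
    e = np.zeros_like(y_pred)
    e[j] = eps
    grad_numeric[j] = (mae(y_pred + e, y_true) - mae(y_pred - e, y_true)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric))  # expected: True (no component at the kink)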

4. Mean Absolute Error (MAE) Loss Function - Derivatives and Gradients (导数和梯度)

4.1. PyTorch torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

#!/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

predictions = torch.tensor([1, 2, 3, 4], dtype=torch.float, requires_grad=True)
targets = torch.tensor([0, 2, 4, 2], dtype=torch.float)

print(f"predictions.requires_grad: {predictions.requires_grad}, targets.requires_grad: {targets.requires_grad}")
print(f"predictions.shape: {predictions.shape}, targets.shape: {targets.shape}")

# Compute MAE loss
mae_loss = nn.L1Loss(size_average=None, reduce=None, reduction='mean')
loss = mae_loss(predictions, targets)
print(f"MAE loss: {loss}")

# Compute gradients
loss.backward()

# Access the gradients of 'predictions'
print(f"\ngradients of predictions:\n{predictions.grad}")
print(f"predictions.shape: {predictions.shape}")

Predicted values:

$$\begin{aligned} \boldsymbol{y\_pred} = \boldsymbol{\hat{y}} &= \left[ \hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{N} \right] \\ &= \left[ 1, 2, 3, 4 \right] \end{aligned}$$

True values:

$$\begin{aligned} \boldsymbol{y\_true} = \boldsymbol{y} &= \left[ y_{1}, y_{2}, \dots, y_{N} \right] \\ &= \left[ 0, 2, 4, 2 \right] \end{aligned}$$

Mean Absolute Error (MAE):

$$\begin{aligned} \text{MAE} = L &= \frac{1}{N} \sum_{i=1}^{N} \left| \boldsymbol{y\_pred}_{i} - \boldsymbol{y\_true}_{i} \right| \\ &= \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{N} \left( |\hat{y}_{1} - y_{1}| + |\hat{y}_{2} - y_{2}| + \ldots + |\hat{y}_{N} - y_{N}| \right) \\ &= \frac{1}{4} \left( |1 - 0| + |2 - 2| + |3 - 4| + |4 - 2| \right) \\ &= 1.0 \end{aligned}$$

The MAE function gradient (Jacobian for MAE):

$$\begin{aligned} \boldsymbol{J} &= \frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \left[ \frac{\partial L}{\partial \hat{y}_{1}}, \frac{\partial L}{\partial \hat{y}_{2}}, \ldots, \frac{\partial L}{\partial \hat{y}_{N}} \right] \\ &= \left[ \frac{1}{N} \, \text{sgn}(\hat{y}_{1} - y_{1}), \frac{1}{N} \, \text{sgn}(\hat{y}_{2} - y_{2}), \ldots, \frac{1}{N} \, \text{sgn}(\hat{y}_{N} - y_{N}) \right] \\ &= \frac{1}{4} \left[ \text{sgn}(1 - 0), \text{sgn}(2 - 2), \text{sgn}(3 - 4), \text{sgn}(4 - 2) \right] \\ &= \frac{1}{4} \left[ 1, 0, -1, 1 \right] \\ &= \left[ 0.25, 0, -0.25, 0.25 \right] \end{aligned}$$

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/mae.py 
predictions.requires_grad: True, targets.requires_grad: False
predictions.shape: torch.Size([4]), targets.shape: torch.Size([4])
MAE loss: 1.0

gradients of predictions:
tensor([ 0.2500,  0.0000, -0.2500,  0.2500])
predictions.shape: torch.Size([4])

Process finished with exit code 0

4.2. PyTorch torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

#!/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

predictions = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=torch.float, requires_grad=True)
targets = torch.tensor([[2, 2, 2], [6, 5, 4], [9, 9, 9]], dtype=torch.float)

print(f"predictions.requires_grad: {predictions.requires_grad}, targets.requires_grad: {targets.requires_grad}")
print(f"predictions.shape: {predictions.shape}, targets.shape: {targets.shape}")

# Compute MAE loss
mae_loss = nn.L1Loss(size_average=None, reduce=None, reduction='mean')
loss = mae_loss(predictions, targets)
print(f"MAE loss: {loss}")

# Compute gradients
loss.backward()

# Access the gradients of 'predictions'
print(f"\ngradients of predictions:\n{predictions.grad}")
print(f"predictions.shape: {predictions.shape}")

Predicted values:

$$\boldsymbol{y\_pred} = \boldsymbol{\hat{y}} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

True values:

$$\boldsymbol{y\_true} = \boldsymbol{y} = \begin{bmatrix} 2 & 2 & 2 \\ 6 & 5 & 4 \\ 9 & 9 & 9 \end{bmatrix}$$

Mean Absolute Error (MAE):

$$\begin{aligned} \text{MAE} = L &= \frac{1}{N} \sum_{i=1}^{N} \left| \boldsymbol{y\_pred}_{i} - \boldsymbol{y\_true}_{i} \right| \\ &= \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_{i} - y_{i} \right| \\ &= \frac{1}{3 \times 3} \left( |1 - 2| + |2 - 2| + |3 - 2| + |4 - 6| + |5 - 5| + |6 - 4| + |7 - 9| + |8 - 9| + |9 - 9| \right) \\ &= \frac{1}{9} \times 9 \\ &= 1.0 \end{aligned}$$

The MAE function gradient (Jacobian for MAE):

$$\begin{aligned} \boldsymbol{J} &= \frac{\partial L}{\partial \boldsymbol{\hat{y}}} = \frac{1}{N} \, \text{sgn}(\boldsymbol{\hat{y}} - \boldsymbol{y}) \\ &= \frac{1}{3 \times 3} \begin{bmatrix} \text{sgn}(1 - 2) & \text{sgn}(2 - 2) & \text{sgn}(3 - 2) \\ \text{sgn}(4 - 6) & \text{sgn}(5 - 5) & \text{sgn}(6 - 4) \\ \text{sgn}(7 - 9) & \text{sgn}(8 - 9) & \text{sgn}(9 - 9) \end{bmatrix} \\ &= \frac{1}{9} \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{bmatrix} \\ &= \begin{bmatrix} -0.1111 & 0.0000 & 0.1111 \\ -0.1111 & 0.0000 & 0.1111 \\ -0.1111 & -0.1111 & 0.0000 \end{bmatrix} \end{aligned}$$

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/mae.py 
predictions.requires_grad: True, targets.requires_grad: False
predictions.shape: torch.Size([3, 3]), targets.shape: torch.Size([3, 3])
MAE loss: 1.0

gradients of predictions:
tensor([[-0.1111,  0.0000,  0.1111],
        [-0.1111,  0.0000,  0.1111],
        [-0.1111, -0.1111,  0.0000]])
predictions.shape: torch.Size([3, 3])

Process finished with exit code 0

PyTorch Chinese documentation:
https://pytorch-cn.readthedocs.io/zh/latest/package_references/torch-nn/

torch.Tensor defaults to requires_grad=False, Variable (deprecated) defaults to requires_grad=False, and torch.nn.parameter.Parameter defaults to requires_grad=True.

class torch.nn.parameter.Parameter(data=None, requires_grad=True)
https://docs.pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html

Variable (deprecated)
https://docs.pytorch.org/docs/stable/autograd.html#variable-deprecated
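
A minimal sketch of the defaults described above (Variable is deprecated, so only torch.Tensor and torch.nn.Parameter are shown):

import torch
import torch.nn as nn

t = torch.tensor([1.0, 2.0, 3.0])                 # plain tensor
p = nn.Parameter(torch.tensor([1.0, 2.0, 3.0]))   # module parameter

print(t.requires_grad)  # False: plain tensors do not track gradients by default
print(p.requires_grad)  # True: Parameters are meant to be trained, so they track gradients by default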

4.3. Python torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

#!/usr/bin/env python
# coding=utf-8

import numpy as np

predictions = np.array([1, 2, 3, 4], dtype=np.float32)
targets = np.array([0, 2, 4, 2], dtype=np.float32)

print(f"predictions.shape: {predictions.shape}, targets.shape: {targets.shape}")

assert (predictions.shape == targets.shape)


# Compute MAE loss
def mae_loss(predictions, targets):
    # Calculates the Mean Absolute Error loss
    return np.mean(np.abs(predictions - targets))


loss = mae_loss(predictions, targets)
print(f"MAE loss: {loss}")


# Compute gradients
def mae_grad(predictions, targets):
    # Calculates the subgradient of the MAE loss with respect to predictions
    return np.sign(predictions - targets) / np.prod(predictions.shape)


gradients = mae_grad(predictions, targets)

# Access the gradients of 'predictions'
print(f"\ngradients of predictions:\n{gradients}")
print(f"gradients.shape: {gradients.shape}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/mae.py 
predictions.shape: (4,), targets.shape: (4,)
MAE loss: 1.0

gradients of predictions:
[ 0.25  0.   -0.25  0.25]
gradients.shape: (4,)

Process finished with exit code 0

4.4. Python torch.nn.L1Loss(size_average=None, reduce=None, reduction='mean')

#!/usr/bin/env python
# coding=utf-8

import numpy as np

predictions = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
targets = np.array([[2, 2, 2], [6, 5, 4], [9, 9, 9]], dtype=np.float32)

print(f"predictions.shape: {predictions.shape}, targets.shape: {targets.shape}")

assert (predictions.shape == targets.shape)


# Compute MAE loss
def mae_loss(predictions, targets):
    # Calculates the Mean Absolute Error loss
    return np.mean(np.abs(predictions - targets))


loss = mae_loss(predictions, targets)
print(f"MAE loss: {loss}")


# Compute gradients
def mae_grad(predictions, targets):
    # Calculates the subgradient of the MAE loss with respect to predictions
    return np.sign(predictions - targets) / np.prod(predictions.shape)


gradients = mae_grad(predictions, targets)

# Access the gradients of 'predictions'
print(f"\ngradients of predictions:\n{gradients}")
print(f"gradients.shape: {gradients.shape}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/mae.py 
predictions.shape: (3, 3), targets.shape: (3, 3)
MAE loss: 1.0

gradients of predictions:
[[-0.11111111  0.          0.11111111]
 [-0.11111111  0.          0.11111111]
 [-0.11111111 -0.11111111  0.        ]]
gradients.shape: (3, 3)

Process finished with exit code 0

