Softmax Function - Derivatives and Gradients
- [1. Softmax Function](#1. Softmax Function)
  - [1.1. Shape](#1.1. Shape)
  - [1.2. Parameters](#1.2. Parameters)
- [2. Softmax Function - Derivatives and Gradients](#2. Softmax Function - Derivatives and Gradients)
  - [2.1. PyTorch `torch.nn.functional.softmax(input, dim=0)`](#2.1. PyTorch torch.nn.functional.softmax(input, dim=0))
  - [2.2. PyTorch `torch.nn.functional.softmax(input, dim=0)`](#2.2. PyTorch torch.nn.functional.softmax(input, dim=0))
  - [2.3. Python Softmax Function](#2.3. Python Softmax Function)
  - [2.4. Python Softmax Function](#2.4. Python Softmax Function)
1. Softmax Function
class torch.nn.Softmax(dim=None)
https://docs.pytorch.org/docs/stable/generated/torch.nn.Softmax.html
torch.nn.functional.softmax(input, dim=None, _stacklevel=3, dtype=None)
https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html
https://github.com/pytorch/pytorch/blob/v2.9.1/torch/nn/modules/activation.py
class torch.nn.Softmax(dim=None)
Applies the Softmax function to an n-dimensional input Tensor.
It is applied to all slices along dim, and will re-scale them so that the elements lie in the range [0, 1] and sum to 1.
The definition of the Softmax function:
$$
\begin{aligned}
\text{Softmax}(x_{i}) &= \frac{\exp(x_i)}{\sum_{k=1}^{N} \exp(x_k)} \quad \forall \ i \in 1, 2, \dots, N \\
&= \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \quad \forall \ i \in 1, 2, \dots, N \\
\end{aligned}
$$
When the input Tensor is a sparse tensor then the unspecified values are treated as -inf.
This module doesn't work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use LogSoftmax instead (it's faster and has better numerical properties).
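For context, a minimal sketch of that pairing (the tensor shapes and values here are illustrative only): `LogSoftmax` followed by `NLLLoss` computes the same loss as `CrossEntropyLoss` applied directly to the raw logits.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3, requires_grad=True)  # batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])            # class indices

# LogSoftmax + NLLLoss ...
log_probs = nn.LogSoftmax(dim=1)(logits)
loss = nn.NLLLoss()(log_probs, targets)

# ... is equivalent to CrossEntropyLoss on the raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)
print(loss.item(), loss_ce.item())  # the two values match
```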
1.1. Shape
- Input: `(*)`, where `*` means any number of additional dimensions.
- Output: `(*)`, same shape as the input.
Returns a Tensor of the same dimension and shape as the input with values in the range [0, 1].
1.2. Parameters
- `dim` (int): A dimension along which Softmax will be computed (so every slice along `dim` will sum to 1).
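A quick illustration of the effect of `dim` (the values below are arbitrary): `dim=0` normalizes each column of a 2-D tensor, while `dim=1` normalizes each row.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])

print(F.softmax(x, dim=0))             # each column sums to 1 (all entries 0.5 here)
print(F.softmax(x, dim=1))             # each row sums to 1
print(F.softmax(x, dim=1).sum(dim=1))  # tensor([1., 1.])
```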
2. Softmax Function - Derivatives and Gradients
Notes
- Element-wise Multiplication (Hadamard Product) (`*` operator or `numpy.multiply()`): multiplies corresponding elements of two arrays, which must have the same shape (or be broadcastable to a common shape).
- Matrix Multiplication (Dot Product) (`@` operator, `numpy.matmul()`, or `numpy.dot()`): performs the standard linear algebra operation, which requires specific dimension compatibility (e.g., the number of columns in the first array must match the number of rows in the second); see the sketch after this list.
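A minimal NumPy sketch of the difference (array values chosen arbitrarily):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[10.0, 20.0],
              [30.0, 40.0]])

# Element-wise (Hadamard) product: same shape in, same shape out
print(a * b)              # [[ 10.  40.], [ 90. 160.]]
print(np.multiply(a, b))  # identical result

# Matrix multiplication: (2, 2) @ (2, 2) -> (2, 2), rows dot columns
print(a @ b)              # [[ 70. 100.], [150. 220.]]
print(np.matmul(a, b))    # identical result
```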
The definition of the Softmax function:
$$
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}
$$

$$
y_{i} = \text{Softmax}(x_{i}) = \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \quad \forall \ i \in 1, 2, \dots, N
$$

$$
\begin{aligned}
\mathbf{y} &= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \\
&= f(\mathbf{x}) = \text{Softmax}(\mathbf{x}) \\
&= \begin{bmatrix} \frac{e^{x_1}}{\sum_{k=1}^{N} e^{x_k}} \\[1.2ex] \frac{e^{x_2}}{\sum_{k=1}^{N} e^{x_k}} \\[1.2ex] \vdots \\[1.2ex] \frac{e^{x_N}}{\sum_{k=1}^{N} e^{x_k}} \end{bmatrix} \\
&= \begin{bmatrix} \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex] \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex] \vdots \\[1.2ex] \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \end{bmatrix} \\
&= \begin{bmatrix} e^{x_1} \\[1.2ex] e^{x_2} \\[1.2ex] \vdots \\[1.2ex] e^{x_N} \end{bmatrix} / \left( e^{x_1} + e^{x_2} + \cdots + e^{x_N} \right) \\
&= \begin{bmatrix} f_{1}(x_{1}, x_{2}, \cdots, x_{N}) \\[1.0ex] f_{2}(x_{1}, x_{2}, \cdots, x_{N}) \\[1.0ex] \vdots \\[1.0ex] f_{N}(x_{1}, x_{2}, \cdots, x_{N}) \end{bmatrix}
\end{aligned}
$$

$$
\begin{cases}
y_1 = f_{1}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
y_2 = f_{2}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
y_3 = f_{3}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
\quad \vdots \\[1.2ex]
y_N = f_{N}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\
\end{cases}
$$
Quotient Rule:
$$
\left( \frac{u}{v} \right)' = \frac{u'v - uv'}{v^{2}}
$$
$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{1}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{1}} \\[1.2ex]
&= \frac{e^{x_1} * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_1})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{e^{x_1} * e^{x_1} + e^{x_1} * (e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * e^{x_1}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{e^{x_1} * (e^{x_2} + \cdots + e^{x_N})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2} + \cdots + e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_1} + e^{x_2} + \cdots + e^{x_N} - e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( 1 - \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= y_1 * (1 - y_1)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{2}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{2}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_2})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_2}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_2
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{3}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{3}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_3})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_3}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_3 \\
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{N}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{N}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_N})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_N}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_N \\
\end{aligned}
$$
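A quick numerical check of these closed forms (a minimal sketch; the input values and step size are arbitrary, and $y_1$ in the math corresponds to index 0 in the code):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -2.0, 0.0])
y = softmax(x)
eps = 1e-6

# Finite-difference approximation of dy1/dx1 vs. the closed form y1 * (1 - y1)
x_plus = x.copy(); x_plus[0] += eps
print((softmax(x_plus)[0] - y[0]) / eps, y[0] * (1 - y[0]))

# Finite-difference approximation of dy1/dx2 vs. the closed form -y1 * y2
x_plus = x.copy(); x_plus[1] += eps
print((softmax(x_plus)[0] - y[0]) / eps, -y[0] * y[1])
```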
The Jacobian (in numerator layout notation) for the Softmax function:
$$
\begin{aligned}
J &= \frac{\partial (y_{1}, y_{2}, \cdots, y_{N})}{\partial (x_{1}, x_{2}, \cdots, x_{N})} \\[1.2ex]
&= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_N} \\[1.2ex]
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots \\[1.2ex]
\frac{\partial y_N}{\partial x_1} & \frac{\partial y_N}{\partial x_2} & \cdots & \frac{\partial y_N}{\partial x_N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_{N-1}} & \frac{\partial y_1}{\partial x_N} \\[1.2ex]
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_{N-1}} & \frac{\partial y_2}{\partial x_N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
\frac{\partial y_{N-1}}{\partial x_1} & \frac{\partial y_{N-1}}{\partial x_2} & \cdots & \frac{\partial y_{N-1}}{\partial x_{N-1}} & \frac{\partial y_{N-1}}{\partial x_N} \\[1.2ex]
\frac{\partial y_N}{\partial x_1} & \frac{\partial y_N}{\partial x_2} & \cdots & \frac{\partial y_N}{\partial x_{N-1}} & \frac{\partial y_N}{\partial x_N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} * (1 - y_{1}) & -y_{1} * y_{2} & \cdots & -y_{1} * y_{N-1} & -y_{1} * y_{N} \\[1.2ex]
-y_{2} * y_{1} & y_{2} * (1 - y_{2}) & \cdots & -y_{2} * y_{N-1} & -y_{2} * y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{N-1} * y_{1} & -y_{N-1} * y_{2} & \cdots & y_{N-1} * (1 - y_{N-1}) & -y_{N-1} * y_{N} \\[1.2ex]
-y_{N} * y_{1} & -y_{N} * y_{2} & \cdots & -y_{N} * y_{N-1} & y_{N} * (1 - y_{N}) \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} & y_{1} & \cdots & y_{1} & y_{1} \\[1.2ex]
y_{2} & y_{2} & \cdots & y_{2} & y_{2} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
y_{N-1} & y_{N-1} & \cdots & y_{N-1} & y_{N-1} \\[1.2ex]
y_{N} & y_{N} & \cdots & y_{N} & y_{N} \\[1.2ex]
\end{bmatrix} * \begin{bmatrix}
1-y_{1} & -y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & 1-y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{1} & -y_{2} & \cdots & 1-y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & -y_{2} & \cdots & -y_{N-1} & 1-y_{N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \begin{bmatrix}
1-y_{1} & -y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & 1-y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{1} & -y_{2} & \cdots & 1-y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & -y_{2} & \cdots & -y_{N-1} & 1-y_{N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \begin{bmatrix}
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\end{bmatrix} \right) \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \begin{bmatrix}
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\end{bmatrix} \right) \\[1.2ex]
&= \text{Softmax}(\mathbf{x}) * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \text{Softmax}(\mathbf{x})^{T} \right) \\[1.2ex]
&= \text{Softmax}(\mathbf{x}) * \left( \mathbf{I} - \text{Softmax}(\mathbf{x})^{T} \right) \\[1.2ex]
\end{aligned}
$$
The diagonal terms $y_{i} * (1 - y_{i})$ look just like the derivative of the Sigmoid function, but the off-diagonal terms $-y_{i} * y_{j}$ make the Softmax Jacobian different, so the two are similar but not exactly the same.
Notes
- All multiplications that appear in the formulas above are Element-wise Multiplication (Hadamard Product) (`*` operator or `numpy.multiply()`); the $N \times 1$ column vector and the $1 \times N$ row vector are broadcast against the $N \times N$ matrices.
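A minimal NumPy sketch of this broadcast form (the input values are arbitrary), checked against the equivalent closed form $\text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^{T}$:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -2.0, 0.0])
y = softmax(x)  # shape (3,)

# Broadcast form from the derivation: column(y) * (I - row(y))
J_broadcast = y[:, None] * (np.eye(len(y)) - y[None, :])

# Equivalent closed form: diag(y) - outer(y, y)
J_closed = np.diag(y) - np.outer(y, y)

print(np.allclose(J_broadcast, J_closed))  # True
print(J_broadcast.sum(axis=0))             # each column sums to 0
```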
2.1. PyTorch `torch.nn.functional.softmax(input, dim=0)`
```python
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([1, -2, 0], dtype=torch.float, requires_grad=True)
print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

forward_output = torch.nn.functional.softmax(input, dim=0)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)
print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")
```
```
/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/softmax.py
input.requires_grad: True, input.shape: torch.Size([3])

forward_output.shape: torch.Size([3])
Forward Pass Output:
tensor([0.705385, 0.035119, 0.259496], grad_fn=<SoftmaxBackward0>)

backward_output.shape: torch.Size([3])
Backward Pass Output:
tensor([0., 0., 0.])

Process finished with exit code 0
```
If you backpropagate a vector of ones through the Softmax output, as above (equivalent to calling `forward_output.sum().backward()`), the resulting gradient for the input is always 0.

By definition, the Softmax function outputs a probability distribution whose elements sum to exactly 1.0, regardless of the input values, and the derivative of a constant with respect to its inputs is 0. No matter how you change the input $\mathbf{x}$, `sum(softmax(x))` stays at 1.0, so there is no rate of change and the gradient is 0. Equivalently, every row and column of the Jacobian $J$ derived above sums to 0, so $J^{T} \mathbf{1} = \mathbf{0}$.
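For contrast, a minimal sketch (the upstream gradient values here are chosen arbitrarily) showing that a non-uniform upstream gradient does produce a non-zero input gradient, matching the first row of the Jacobian:

```python
import torch

x = torch.tensor([1.0, -2.0, 0.0], requires_grad=True)
y = torch.nn.functional.softmax(x, dim=0)

# Upstream gradient dL/dy that is not the all-ones vector
upstream = torch.tensor([1.0, 0.0, 0.0])
y.backward(upstream)
print(x.grad)  # non-zero this time

# Manual check against the first row of the Jacobian: [y0*(1-y0), -y0*y1, -y0*y2]
y0, y1, y2 = y.detach()
print(torch.stack([y0 * (1 - y0), -y0 * y1, -y0 * y2]))
```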
2.2. PyTorch `torch.nn.functional.softmax(input, dim=0)`
```python
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([-2], dtype=torch.float, requires_grad=True)
print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

forward_output = torch.nn.functional.softmax(input, dim=0)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)
print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")
```
```
/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/softmax.py
input.requires_grad: True, input.shape: torch.Size([1])

forward_output.shape: torch.Size([1])
Forward Pass Output:
tensor([1.], grad_fn=<SoftmaxBackward0>)

backward_output.shape: torch.Size([1])
Backward Pass Output:
tensor([0.])

Process finished with exit code 0
```
If you apply Softmax to a tensor with only one element, the output is always 1.0. Because the output is the constant 1.0 no matter what the input value is, the gradient with respect to that input is always 0.
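This is consistent with the per-element derivative derived in Section 2: for $N = 1$,

$$
y_1 = \frac{e^{x_1}}{e^{x_1}} = 1, \qquad \frac{\partial y_1}{\partial x_1} = y_1 * (1 - y_1) = 1 * (1 - 1) = 0
$$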
2.3. Python Softmax Function
A NumPy implementation of the Softmax layer for a 1-D input; the backward pass applies the Jacobian $J = \text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^{T}$ derived above via matrix multiplication.

```python
# !/usr/bin/env python
# coding=utf-8

import numpy as np

# numpy.matmul (@ operator): matrix product of two arrays
# numpy.outer: outer product of two vectors


class SoftmaxLayer:
    """
    A class to represent the Softmax layer for a neural network.
    """

    def __init__(self):
        # Cache the output for the backward pass
        self.output = None

    def forward(self, input):
        """
        Forward Pass: y_i = exp(x_i) / sum_k exp(x_k)
        Maps input values to the range [0, 1] so that they sum to 1
        Subtracts the maximum for numerical stability (the result is unchanged)
        """
        exp = np.exp(input - np.max(input))
        self.output = exp / np.sum(exp)
        return self.output

    def backward(self, upstream_gradient):
        """
        Backward Pass (Backpropagation): J = diag(y) - y y^T
        The downstream gradient is the matrix product of the (symmetric)
        Jacobian and the upstream gradient.
        """
        y = self.output
        softmax_jacobian = np.diag(y) - np.outer(y, y)
        print(f"softmax_jacobian.shape: {softmax_jacobian.shape}")
        print(f"Softmax Jacobian:\n{softmax_jacobian}")
        # upstream_gradient: the gradient of the loss with respect to the output
        # Computes the gradient of the loss with respect to the input (dL/dx)
        # Apply the chain rule: dL/dx = J^T @ dL/dy = J @ upstream_gradient
        downstream_gradient = softmax_jacobian @ upstream_gradient
        return downstream_gradient


layer = SoftmaxLayer()
input = np.array([-1.5, 0.0, 1.5, 0.5, -2.0, 3.0], dtype=np.float32)

# Forward pass
forward_output = layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
# Note: a uniform upstream gradient gives a numerically (close to) zero
# downstream gradient, as explained in Section 2.1
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")
```
2.4. Python Softmax Function
A batched NumPy implementation that applies Softmax along the last axis (one row per sample); the backward pass uses the equivalent element-wise form $\partial L/\partial x_j = y_j * \left( \partial L/\partial y_j - \sum_k (\partial L/\partial y_k) * y_k \right)$, which follows from multiplying the Jacobian by the upstream gradient.

```python
# !/usr/bin/env python
# coding=utf-8

import numpy as np

# numpy.multiply:
# Multiply arguments element-wise
# Equivalent to x1 * x2 in terms of array broadcasting


class SoftmaxLayer:
    """
    A class to represent a batched Softmax layer for a neural network.
    Softmax is computed row by row, along the last axis.
    """

    def __init__(self):
        # Cache the output for the backward pass
        self.output = None

    def forward(self, input):
        """
        Forward Pass: y_i = exp(x_i) / sum_k exp(x_k), applied to each row
        Maps each row to the range [0, 1] so that every row sums to 1
        Subtracts the row-wise maximum for numerical stability
        """
        exp = np.exp(input - np.max(input, axis=-1, keepdims=True))
        self.output = exp / np.sum(exp, axis=-1, keepdims=True)
        return self.output

    def backward(self, upstream_gradient):
        """
        Backward Pass (Backpropagation), row by row:
        dL/dx_j = y_j * (dL/dy_j - sum_k dL/dy_k * y_k)
        which is the element-wise form of J^T @ dL/dy with J = diag(y) - y y^T.
        """
        y = self.output
        # Row-wise inner product <upstream, y>, kept as a column for broadcasting
        inner = np.sum(upstream_gradient * y, axis=-1, keepdims=True)
        print(f"inner.shape: {inner.shape}")
        # upstream_gradient: the gradient of the loss with respect to the output
        # Computes the gradient of the loss with respect to the input (dL/dx)
        downstream_gradient = y * (upstream_gradient - inner)
        return downstream_gradient


layer = SoftmaxLayer()
input = np.array([[-1.5, 0.0, 1.5], [0.5, -2.0, 3.0]], dtype=np.float32)

# Forward pass
forward_output = layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
# Note: a uniform upstream gradient again gives a numerically (close to) zero
# result, because each row of the output sums to the constant 1.0
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")
```