Softmax Function - Derivatives and Gradients
- [1. Softmax Function](#1. Softmax Function)
  - [1.1. Shape](#1.1. Shape)
  - [1.2. Parameters](#1.2. Parameters)
- [2. Softmax Function - Derivatives and Gradients](#2. Softmax Function - Derivatives and Gradients)
  - [2.1. PyTorch `torch.nn.functional.softmax(input, dim=0)`](#2.1. PyTorch torch.nn.functional.softmax(input, dim=0))
  - [2.2. PyTorch `torch.nn.functional.softmax(input, dim=0)`](#2.2. PyTorch torch.nn.functional.softmax(input, dim=0))
  - [2.3. Python Softmax Function](#2.3. Python Softmax Function)
  - [2.4. Python Softmax Function](#2.4. Python Softmax Function)
1. Softmax Function
class torch.nn.Softmax(dim=None)
https://docs.pytorch.org/docs/stable/generated/torch.nn.Softmax.html
torch.nn.functional.softmax(input, dim=None, _stacklevel=3, dtype=None)
https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html
https://github.com/pytorch/pytorch/blob/v2.9.1/torch/nn/modules/activation.py
class torch.nn.Softmax(dim=None)
Applies the Softmax function to an n-dimensional input Tensor.
It is applied to all slices along dim, and will re-scale them so that the elements lie in the range [0, 1] and sum to 1.
The definition of the Softmax function:
$$
\begin{aligned}
\text{Softmax}(x_{i}) &= \frac{\exp(x_i)}{\sum_{k=1}^{N} \exp(x_k)} \quad \forall \ i \in 1, 2, \dots, N \\
&= \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \quad \forall \ i \in 1, 2, \dots, N \\
\end{aligned}
$$
When the input Tensor is a sparse tensor then the unspecified values are treated as -inf.
This module doesn't work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use LogSoftmax instead (it's faster and has better numerical properties).
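For context, a minimal sketch of that pairing (the tensor shapes and values here are illustrative only): `LogSoftmax` followed by `NLLLoss` computes the same loss as `CrossEntropyLoss` applied directly to the raw logits.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3, requires_grad=True)  # batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])            # class indices

# LogSoftmax + NLLLoss ...
log_probs = nn.LogSoftmax(dim=1)(logits)
loss = nn.NLLLoss()(log_probs, targets)

# ... is equivalent to CrossEntropyLoss on the raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)
print(loss.item(), loss_ce.item())  # the two values match
```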
1.1. Shape
- Input: `(*)`, where `*` means any number of additional dimensions.
- Output: `(*)`, same shape as the input.
Returns a Tensor of the same dimension and shape as the input with values in the range [0, 1].
1.2. Parameters
- `dim` (int): A dimension along which Softmax will be computed (so every slice along `dim` will sum to 1).
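A quick illustration of the effect of `dim` (the values below are arbitrary): `dim=0` normalizes each column of a 2-D tensor, while `dim=1` normalizes each row.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])

print(F.softmax(x, dim=0))             # each column sums to 1 (all entries 0.5 here)
print(F.softmax(x, dim=1))             # each row sums to 1
print(F.softmax(x, dim=1).sum(dim=1))  # tensor([1., 1.])
```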
2. Softmax Function - Derivatives and Gradients
Notes
- Element-wise Multiplication (Hadamard Product) (`*` operator or `numpy.multiply()`): multiplies corresponding elements of two arrays, which must have the same shape (or be broadcastable to a common shape).
- Matrix Multiplication (Dot Product) (`@` operator, `numpy.matmul()`, or `numpy.dot()`): performs the standard linear algebra operation, which requires specific dimension compatibility (e.g., the number of columns in the first array must match the number of rows in the second); see the sketch after this list.
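A minimal NumPy sketch of the difference (array values chosen arbitrarily):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[10.0, 20.0],
              [30.0, 40.0]])

# Element-wise (Hadamard) product: same shape in, same shape out
print(a * b)              # [[ 10.  40.], [ 90. 160.]]
print(np.multiply(a, b))  # identical result

# Matrix multiplication: (2, 2) @ (2, 2) -> (2, 2), rows dot columns
print(a @ b)              # [[ 70. 100.], [150. 220.]]
print(np.matmul(a, b))    # identical result
```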
The definition of the Softmax function:
$$
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}
$$

$$
y_{i} = \text{Softmax}(x_{i}) = \frac{e^{x_i}}{\sum_{k=1}^{N} e^{x_k}} \quad \forall \ i \in 1, 2, \dots, N
$$

$$
\begin{aligned}
\mathbf{y} &= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \\
&= f(\mathbf{x}) = \text{Softmax}(\mathbf{x}) \\
&= \begin{bmatrix} \frac{e^{x_1}}{\sum_{k=1}^{N} e^{x_k}} \\[1.2ex] \frac{e^{x_2}}{\sum_{k=1}^{N} e^{x_k}} \\[1.2ex] \vdots \\[1.2ex] \frac{e^{x_N}}{\sum_{k=1}^{N} e^{x_k}} \end{bmatrix} \\
&= \begin{bmatrix} \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex] \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex] \vdots \\[1.2ex] \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \end{bmatrix} \\
&= \begin{bmatrix} e^{x_1} \\[1.2ex] e^{x_2} \\[1.2ex] \vdots \\[1.2ex] e^{x_N} \end{bmatrix} / \left( e^{x_1} + e^{x_2} + \cdots + e^{x_N} \right) \\
&= \begin{bmatrix} f_{1}(x_{1}, x_{2}, \cdots, x_{N}) \\[1.0ex] f_{2}(x_{1}, x_{2}, \cdots, x_{N}) \\[1.0ex] \vdots \\[1.0ex] f_{N}(x_{1}, x_{2}, \cdots, x_{N}) \end{bmatrix}
\end{aligned}
$$

$$
\begin{cases}
y_1 = f_{1}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
y_2 = f_{2}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
y_3 = f_{3}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\[1.2ex]
\quad \vdots \\[1.2ex]
y_N = f_{N}(x_{1}, x_{2}, \cdots, x_{N}) = \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \\
\end{cases}
$$
Quotient Rule:
$$
\left( \frac{u}{v} \right)' = \frac{u'v - uv'}{v^{2}}
$$
$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{1}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{1}} \\[1.2ex]
&= \frac{e^{x_1} * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_1})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{e^{x_1} * e^{x_1} + e^{x_1} * (e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * e^{x_1}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{e^{x_1} * (e^{x_2} + \cdots + e^{x_N})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2} + \cdots + e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_1} + e^{x_2} + \cdots + e^{x_N} - e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( 1 - \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= y_1 * (1 - y_1)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{2}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{2}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_2})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_2}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_2
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{3}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{3}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_3})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_3}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_3}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_3 \\
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial y_{1}}{\partial x_{N}} &= \frac{\partial \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right)}{\partial x_{N}} \\[1.2ex]
&= \frac{0 * (e^{x_1} + e^{x_2} + \cdots + e^{x_N}) - e^{x_1} * (e^{x_N})}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \frac{- e^{x_1} * e^{x_N}}{(e^{x_1} + e^{x_2} + \cdots + e^{x_N})^{2}} \\[1.2ex]
&= \left( \frac{-e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -\left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) * \left( \frac{e^{x_N}}{e^{x_1} + e^{x_2} + \cdots + e^{x_N}} \right) \\[1.2ex]
&= -y_1 * y_N \\
\end{aligned}
$$
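A quick numerical check of these closed forms (a minimal sketch; the input values and step size are arbitrary, and $y_1$ in the math corresponds to index 0 in the code):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -2.0, 0.0])
y = softmax(x)
eps = 1e-6

# Finite-difference approximation of dy1/dx1 vs. the closed form y1 * (1 - y1)
x_plus = x.copy(); x_plus[0] += eps
print((softmax(x_plus)[0] - y[0]) / eps, y[0] * (1 - y[0]))

# Finite-difference approximation of dy1/dx2 vs. the closed form -y1 * y2
x_plus = x.copy(); x_plus[1] += eps
print((softmax(x_plus)[0] - y[0]) / eps, -y[0] * y[1])
```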
The Jacobian (in numerator layout notation) for the Softmax function:
$$
\begin{aligned}
J &= \frac{\partial (y_{1}, y_{2}, \cdots, y_{N})}{\partial (x_{1}, x_{2}, \cdots, x_{N})} \\[1.2ex]
&= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_N} \\[1.2ex]
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots \\[1.2ex]
\frac{\partial y_N}{\partial x_1} & \frac{\partial y_N}{\partial x_2} & \cdots & \frac{\partial y_N}{\partial x_N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_{N-1}} & \frac{\partial y_1}{\partial x_N} \\[1.2ex]
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_{N-1}} & \frac{\partial y_2}{\partial x_N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
\frac{\partial y_{N-1}}{\partial x_1} & \frac{\partial y_{N-1}}{\partial x_2} & \cdots & \frac{\partial y_{N-1}}{\partial x_{N-1}} & \frac{\partial y_{N-1}}{\partial x_N} \\[1.2ex]
\frac{\partial y_N}{\partial x_1} & \frac{\partial y_N}{\partial x_2} & \cdots & \frac{\partial y_N}{\partial x_{N-1}} & \frac{\partial y_N}{\partial x_N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} * (1 - y_{1}) & -y_{1} * y_{2} & \cdots & -y_{1} * y_{N-1} & -y_{1} * y_{N} \\[1.2ex]
-y_{2} * y_{1} & y_{2} * (1 - y_{2}) & \cdots & -y_{2} * y_{N-1} & -y_{2} * y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{N-1} * y_{1} & -y_{N-1} * y_{2} & \cdots & y_{N-1} * (1 - y_{N-1}) & -y_{N-1} * y_{N} \\[1.2ex]
-y_{N} * y_{1} & -y_{N} * y_{2} & \cdots & -y_{N} * y_{N-1} & y_{N} * (1 - y_{N}) \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} & y_{1} & \cdots & y_{1} & y_{1} \\[1.2ex]
y_{2} & y_{2} & \cdots & y_{2} & y_{2} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
y_{N-1} & y_{N-1} & \cdots & y_{N-1} & y_{N-1} \\[1.2ex]
y_{N} & y_{N} & \cdots & y_{N} & y_{N} \\[1.2ex]
\end{bmatrix} * \begin{bmatrix}
1-y_{1} & -y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & 1-y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{1} & -y_{2} & \cdots & 1-y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & -y_{2} & \cdots & -y_{N-1} & 1-y_{N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \begin{bmatrix}
1-y_{1} & -y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & 1-y_{2} & \cdots & -y_{N-1} & -y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
-y_{1} & -y_{2} & \cdots & 1-y_{N-1} & -y_{N} \\[1.2ex]
-y_{1} & -y_{2} & \cdots & -y_{N-1} & 1-y_{N} \\[1.2ex]
\end{bmatrix} \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \begin{bmatrix}
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\end{bmatrix} \right) \\[1.2ex]
&= \begin{bmatrix}
y_{1} \\[1.2ex] y_{2} \\[1.2ex] \vdots \\[1.2ex] y_{N-1} \\[1.2ex] y_{N} \\[1.2ex]
\end{bmatrix} * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \begin{bmatrix}
y_{1} & y_{2} & \cdots & y_{N-1} & y_{N} \\[1.2ex]
\end{bmatrix} \right) \\[1.2ex]
&= \text{Softmax}(\mathbf{x}) * \left( \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\[1.2ex]
0 & 1 & \cdots & 0 & 0 \\[1.2ex]
\vdots & \vdots & \ddots & \vdots & \vdots \\[1.2ex]
0 & 0 & \cdots & 1 & 0 \\[1.2ex]
0 & 0 & \cdots & 0 & 1 \\[1.2ex]
\end{bmatrix} - \text{Softmax}(\mathbf{x})^{T} \right) \\[1.2ex]
&= \text{Softmax}(\mathbf{x}) * \left( \mathbf{I} - \text{Softmax}(\mathbf{x})^{T} \right) \\[1.2ex]
\end{aligned}
$$
The diagonal terms $y_{i} * (1 - y_{i})$ look just like the derivative of the Sigmoid function, but the off-diagonal terms $-y_{i} * y_{j}$ make the Softmax Jacobian different, so the two are similar but not exactly the same.
Notes
- All multiplications that appear in the formulas above are Element-wise Multiplication (Hadamard Product) (`*` operator or `numpy.multiply()`); the $N \times 1$ column vector and the $1 \times N$ row vector are broadcast against the $N \times N$ matrices.
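A minimal NumPy sketch of this broadcast form (the input values are arbitrary), checked against the equivalent closed form $\text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^{T}$:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -2.0, 0.0])
y = softmax(x)  # shape (3,)

# Broadcast form from the derivation: column(y) * (I - row(y))
J_broadcast = y[:, None] * (np.eye(len(y)) - y[None, :])

# Equivalent closed form: diag(y) - outer(y, y)
J_closed = np.diag(y) - np.outer(y, y)

print(np.allclose(J_broadcast, J_closed))  # True
print(J_broadcast.sum(axis=0))             # each column sums to 0
```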
2.1. PyTorch `torch.nn.functional.softmax(input, dim=0)`
```python
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([1, -2, 0], dtype=torch.float, requires_grad=True)
print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

forward_output = torch.nn.functional.softmax(input, dim=0)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)
print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")
```
```
/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/softmax.py
input.requires_grad: True, input.shape: torch.Size([3])

forward_output.shape: torch.Size([3])
Forward Pass Output:
tensor([0.705385, 0.035119, 0.259496], grad_fn=<SoftmaxBackward0>)

backward_output.shape: torch.Size([3])
Backward Pass Output:
tensor([0., 0., 0.])

Process finished with exit code 0
```
If you backpropagate a vector of ones through the Softmax output, as above (equivalent to calling `forward_output.sum().backward()`), the resulting gradient for the input is always 0.

By definition, the Softmax function outputs a probability distribution whose elements sum to exactly 1.0, regardless of the input values, and the derivative of a constant with respect to its inputs is 0. No matter how you change the input $\mathbf{x}$, `sum(softmax(x))` stays at 1.0, so there is no rate of change and the gradient is 0. Equivalently, every row and column of the Jacobian $J$ derived above sums to 0, so $J^{T} \mathbf{1} = \mathbf{0}$.
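For contrast, a minimal sketch (the upstream gradient values here are chosen arbitrarily) showing that a non-uniform upstream gradient does produce a non-zero input gradient, matching the first row of the Jacobian:

```python
import torch

x = torch.tensor([1.0, -2.0, 0.0], requires_grad=True)
y = torch.nn.functional.softmax(x, dim=0)

# Upstream gradient dL/dy that is not the all-ones vector
upstream = torch.tensor([1.0, 0.0, 0.0])
y.backward(upstream)
print(x.grad)  # non-zero this time

# Manual check against the first row of the Jacobian: [y0*(1-y0), -y0*y1, -y0*y2]
y0, y1, y2 = y.detach()
print(torch.stack([y0 * (1 - y0), -y0 * y1, -y0 * y2]))
```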
2.2. PyTorch `torch.nn.functional.softmax(input, dim=0)`
```python
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([-2], dtype=torch.float, requires_grad=True)
print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

forward_output = torch.nn.functional.softmax(input, dim=0)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)
print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")
```
```
/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/softmax.py
input.requires_grad: True, input.shape: torch.Size([1])

forward_output.shape: torch.Size([1])
Forward Pass Output:
tensor([1.], grad_fn=<SoftmaxBackward0>)

backward_output.shape: torch.Size([1])
Backward Pass Output:
tensor([0.])

Process finished with exit code 0
```
If you apply Softmax to a tensor with only one element, the output is always 1.0. Because the output is the constant 1.0 no matter what the input value is, the gradient with respect to that input is always 0.
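This is consistent with the per-element derivative derived in Section 2: for $N = 1$,

$$
y_1 = \frac{e^{x_1}}{e^{x_1}} = 1, \qquad \frac{\partial y_1}{\partial x_1} = y_1 * (1 - y_1) = 1 * (1 - 1) = 0
$$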
2.3. Python Softmax Function
A NumPy implementation of the Softmax layer for a 1-D input; the backward pass applies the Jacobian $J = \text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^{T}$ derived above via matrix multiplication.

```python
# !/usr/bin/env python
# coding=utf-8

import numpy as np

# numpy.matmul (@ operator): matrix product of two arrays
# numpy.outer: outer product of two vectors


class SoftmaxLayer:
    """
    A class to represent the Softmax layer for a neural network.
    """

    def __init__(self):
        # Cache the output for the backward pass
        self.output = None

    def forward(self, input):
        """
        Forward Pass: y_i = exp(x_i) / sum_k exp(x_k)
        Maps input values to the range [0, 1] so that they sum to 1
        Subtracts the maximum for numerical stability (the result is unchanged)
        """
        exp = np.exp(input - np.max(input))
        self.output = exp / np.sum(exp)
        return self.output

    def backward(self, upstream_gradient):
        """
        Backward Pass (Backpropagation): J = diag(y) - y y^T
        The downstream gradient is the matrix product of the (symmetric)
        Jacobian and the upstream gradient.
        """
        y = self.output
        softmax_jacobian = np.diag(y) - np.outer(y, y)
        print(f"softmax_jacobian.shape: {softmax_jacobian.shape}")
        print(f"Softmax Jacobian:\n{softmax_jacobian}")
        # upstream_gradient: the gradient of the loss with respect to the output
        # Computes the gradient of the loss with respect to the input (dL/dx)
        # Apply the chain rule: dL/dx = J^T @ dL/dy = J @ upstream_gradient
        downstream_gradient = softmax_jacobian @ upstream_gradient
        return downstream_gradient


layer = SoftmaxLayer()
input = np.array([-1.5, 0.0, 1.5, 0.5, -2.0, 3.0], dtype=np.float32)

# Forward pass
forward_output = layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
# Note: a uniform upstream gradient gives a numerically (close to) zero
# downstream gradient, as explained in Section 2.1
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")
```
2.4. Python Softmax Function
A batched NumPy implementation that applies Softmax along the last axis (one row per sample); the backward pass uses the equivalent element-wise form $\partial L/\partial x_j = y_j * \left( \partial L/\partial y_j - \sum_k (\partial L/\partial y_k) * y_k \right)$, which follows from multiplying the Jacobian by the upstream gradient.

```python
# !/usr/bin/env python
# coding=utf-8

import numpy as np

# numpy.multiply:
# Multiply arguments element-wise
# Equivalent to x1 * x2 in terms of array broadcasting


class SoftmaxLayer:
    """
    A class to represent a batched Softmax layer for a neural network.
    Softmax is computed row by row, along the last axis.
    """

    def __init__(self):
        # Cache the output for the backward pass
        self.output = None

    def forward(self, input):
        """
        Forward Pass: y_i = exp(x_i) / sum_k exp(x_k), applied to each row
        Maps each row to the range [0, 1] so that every row sums to 1
        Subtracts the row-wise maximum for numerical stability
        """
        exp = np.exp(input - np.max(input, axis=-1, keepdims=True))
        self.output = exp / np.sum(exp, axis=-1, keepdims=True)
        return self.output

    def backward(self, upstream_gradient):
        """
        Backward Pass (Backpropagation), row by row:
        dL/dx_j = y_j * (dL/dy_j - sum_k dL/dy_k * y_k)
        which is the element-wise form of J^T @ dL/dy with J = diag(y) - y y^T.
        """
        y = self.output
        # Row-wise inner product <upstream, y>, kept as a column for broadcasting
        inner = np.sum(upstream_gradient * y, axis=-1, keepdims=True)
        print(f"inner.shape: {inner.shape}")
        # upstream_gradient: the gradient of the loss with respect to the output
        # Computes the gradient of the loss with respect to the input (dL/dx)
        downstream_gradient = y * (upstream_gradient - inner)
        return downstream_gradient


layer = SoftmaxLayer()
input = np.array([[-1.5, 0.0, 1.5], [0.5, -2.0, 3.0]], dtype=np.float32)

# Forward pass
forward_output = layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
# Note: a uniform upstream gradient again gives a numerically (close to) zero
# result, because each row of the output sums to the constant 1.0
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")
```