Softmax Derivative
PyTorch carries out backpropagation automatically, so the derivation here is only to satisfy my own need for completeness.
A
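Part A proves the softmax derivative and the backpropagation step through softmax.
I originally wanted to type the formulas out by hand but decided against it; the reference section gives them in LaTeX.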
A.1
Softmax derivative
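As a hedged illustration: combining (B.15) with derivations (1) and (2) from the reference section, taking y = 1 / (1ᵀ exp(x)) and z = exp(x), leads to the standard closed form diag(s) − s sᵀ for the Jacobian of s = softmax(x). The sketch below compares that closed form against central differences. The helper names `softmax_jacobian` and `numeric_jacobian` and the test point are assumptions made for this example only.

```python
import math


def softmax(x):
    """Numerically stable softmax of a list of floats."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def softmax_jacobian(x):
    """Closed form diag(s) - s s^T; entry [i][j] is d s_j / d x_i."""
    s = softmax(x)
    n = len(s)
    return [[(s[i] if i == j else 0.0) - s[i] * s[j] for j in range(n)]
            for i in range(n)]


def numeric_jacobian(f, x, eps=1e-6):
    """Central differences; row i holds d f(x) / d x_i (denominator layout)."""
    rows = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(hi), f(lo))])
    return rows


x = [0.5, -1.0, 2.0]  # arbitrary test point (assumption)
analytic = softmax_jacobian(x)
numeric = numeric_jacobian(softmax, x)
max_err = max(abs(a - b)
              for ra, rb in zip(analytic, numeric)
              for a, b in zip(ra, rb))
print(max_err)  # expected to be tiny if the closed form is right
```

Each row of the Jacobian sums to zero because the softmax outputs always sum to one, which makes a quick sanity check.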

A.2
Softmax gradient descent
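A minimal sketch of one gradient descent step, under an assumption this section does not state: the loss is the cross-entropy between softmax(z) and a one-hot target. In that setting the gradient with respect to the logits z is softmax(z) minus the one-hot vector. The logits, target class, and learning rate below are hypothetical values, and the step updates the logits directly rather than the weights of any model.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def cross_entropy(z, target):
    """Loss -log softmax(z)[target] for an integer class index."""
    return -math.log(softmax(z)[target])


def grad_logits(z, target):
    """Gradient of the loss w.r.t. the logits: softmax(z) - one_hot(target)."""
    s = softmax(z)
    return [s_k - (1.0 if k == target else 0.0) for k, s_k in enumerate(s)]


z = [0.2, -0.4, 1.0]  # hypothetical logits
target = 1            # hypothetical true class
lr = 0.5              # hypothetical learning rate

loss_before = cross_entropy(z, target)
g = grad_logits(z, target)
z_new = [v - lr * g_k for v, g_k in zip(z, g)]  # one gradient descent step
loss_after = cross_entropy(z_new, target)
print(loss_before, loss_after)  # the loss should decrease after the step
```

In PyTorch the same gradient would come from autograd; writing it out by hand is only useful for checking the derivation.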

B
Almost everything here is borrowed from others' work; the citations and references are given below.
References:
(Citing theorems B.15 and B.16)
(B.15)
\[
\begin{aligned}
& \vec{x} \in R^{M \times 1},\ y \in R,\ \vec{z} \in R^{N \times 1}, \quad \text{then:} \\
& \frac{\partial y \vec{z}}{\partial \vec{x}} = y \frac{\partial \vec{z}}{\partial \vec{x}} + \frac{\partial y}{\partial \vec{x}} \cdot \vec{z}^{\top} \in R^{M \times N}
\end{aligned}
\]
\[
\begin{aligned}
& \text{Proof:} \\
& d(y \vec{z}) \\
& = dy \cdot \vec{z} + y \cdot d\vec{z} \\
& = \vec{z} \cdot dy + y \cdot d\vec{z} \\
& = \vec{z} \cdot \left(\frac{\partial y}{\partial \vec{x}}\right)^{\top} d\vec{x} + y \cdot \left(\frac{\partial \vec{z}}{\partial \vec{x}}\right)^{\top} d\vec{x} \\
& \therefore \frac{\partial y \vec{z}}{\partial \vec{x}} = y \cdot \frac{\partial \vec{z}}{\partial \vec{x}} + \frac{\partial y}{\partial \vec{x}} \cdot \vec{z}^{\top}
\end{aligned}
\]
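The product rule (B.15) can be spot-checked numerically. The sketch below picks a hypothetical scalar function y(x) = x·x and a hypothetical vector function z(x) = [x₀x₁, sin x₂] with M = 3 and N = 2, then compares the right-hand side of (B.15) with a central-difference Jacobian of y(x)·z(x). All function names and the test point are assumptions made for the example.

```python
import math


def y(x):
    """Scalar function y(x) = x . x."""
    return sum(v * v for v in x)


def dy_dx(x):
    """Gradient of y, shape M x 1 (stored as a flat list of length M)."""
    return [2.0 * v for v in x]


def z(x):
    """Vector function z(x) in R^2."""
    return [x[0] * x[1], math.sin(x[2])]


def dz_dx(x):
    """Denominator-layout Jacobian of z, shape M x N: entry [i][j] = d z_j / d x_i."""
    return [[x[1], 0.0],
            [x[0], 0.0],
            [0.0, math.cos(x[2])]]


def product_rule(x):
    """Right-hand side of (B.15): y * dz/dx + (dy/dx) z^T."""
    yv, zv, gy, jz = y(x), z(x), dy_dx(x), dz_dx(x)
    return [[yv * jz[i][j] + gy[i] * zv[j] for j in range(len(zv))]
            for i in range(len(x))]


def numeric(x, eps=1e-6):
    """Central-difference Jacobian of y(x) * z(x) in the same M x N layout."""
    def yz(p):
        return [y(p) * c for c in z(p)]
    rows = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(yz(hi), yz(lo))])
    return rows


x0 = [0.3, -1.2, 0.7]  # arbitrary test point (assumption)
lhs = numeric(x0)
rhs = product_rule(x0)
max_err = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
print(len(rhs), len(rhs[0]), max_err)  # shape is M x N = 3 x 2
```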
(B.26)
\[
\begin{aligned}
& \vec{x} \in R^n, \quad \vec{f}(\vec{x}) = \left[f\left(x_1\right), f\left(x_2\right), \ldots, f\left(x_n\right)\right] \in R^n, \quad \text{then} \\
& \frac{\partial \vec{f}(\vec{x})}{\partial \vec{x}} = \operatorname{diag}\left(\vec{f}^{\prime}(\vec{x})\right)
\end{aligned}
\]
\[
\begin{aligned}
& \text{Proof: } \frac{\partial \vec{f}(\vec{x})}{\partial \vec{x}}
= \left[\begin{array}{cccc}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_1} \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_n}{\partial x_n}
\end{array}\right]
= \left[\begin{array}{cccc}
f^{\prime}\left(x_1\right) & & & \\
& f^{\prime}\left(x_2\right) & & \\
& & \ddots & \\
& & & f^{\prime}\left(x_n\right)
\end{array}\right]
= \operatorname{diag}\left(\vec{f}^{\prime}(\vec{x})\right)
\end{aligned}
\]
Here f_i = f(x_i) depends only on x_i, so every off-diagonal entry is zero.
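To see (B.26) in action, the sketch below builds the Jacobian of an element-wise function by central differences and compares it with diag(f′(x)). It uses tanh (whose derivative is 1 − tanh²) and exp as examples; the choice of functions, the helper names, and the test point are assumptions for illustration.

```python
import math


def elementwise_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of [f(x_1), ..., f(x_n)].

    Entry [i][j] approximates d f(x_j) / d x_i (denominator layout).
    """
    n = len(x)
    rows = []
    for i in range(n):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        rows.append([(f(hi[j]) - f(lo[j])) / (2 * eps) for j in range(n)])
    return rows


def diag(values):
    """Square matrix with `values` on the diagonal and zeros elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def max_abs_diff(a, b):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x = [0.1, -0.8, 1.5]  # arbitrary test point (assumption)
tanh_err = max_abs_diff(elementwise_jacobian(math.tanh, x),
                        diag([1.0 - math.tanh(v) ** 2 for v in x]))
exp_err = max_abs_diff(elementwise_jacobian(math.exp, x),
                       diag([math.exp(v) for v in x]))
print(tanh_err, exp_err)  # both should be close to zero
```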
(Two derivations that Part A relies on and that must be stated:)
(1)
\[
\begin{aligned}
& \vec{x} \in R^n, \quad \exp(\vec{x}) = \left[\begin{array}{c}
\exp\left(x_1\right) \\
\vdots \\
\exp\left(x_n\right)
\end{array}\right] \in R^n \\
& \text{so the partial derivative is: } \frac{\partial \exp(\vec{x})}{\partial \vec{x}}
= \left[\begin{array}{ccc}
\frac{\partial \exp\left(x_1\right)}{\partial x_1} & \cdots & \frac{\partial \exp\left(x_n\right)}{\partial x_1} \\
\vdots & & \vdots \\
\frac{\partial \exp\left(x_1\right)}{\partial x_n} & \cdots & \frac{\partial \exp\left(x_n\right)}{\partial x_n}
\end{array}\right]
= \operatorname{diag}(\exp(\vec{x}))
\end{aligned}
\]
(2)
\[
\begin{aligned}
& d\left(\vec{1}^{\top} \exp(\vec{x})\right) \\
& = \vec{1}^{\top} d\exp(\vec{x}) \\
& = \vec{1}^{\top}\left(\exp^{\prime}(\vec{x}) \odot d\vec{x}\right) \\
& = \left(\vec{1} \odot \exp^{\prime}(\vec{x})\right)^{\top} d\vec{x} \\
& \text{Hence: } \frac{\partial \vec{1}^{\top} \exp(\vec{x})}{\partial \vec{x}} = \vec{1} \odot \exp^{\prime}(\vec{x}) = \exp^{\prime}(\vec{x}) = \exp(\vec{x})
\end{aligned}
\]
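Derivation (2) says that the gradient of the scalar 1ᵀ exp(x) is exp(x) itself. A small numerical check, with hypothetical helper names and an arbitrary test point, might look like this:

```python
import math


def sum_exp(x):
    """Scalar 1^T exp(x), the softmax normaliser."""
    return sum(math.exp(v) for v in x)


def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


x = [0.4, -1.1, 0.9, 2.0]               # arbitrary test point (assumption)
analytic = [math.exp(v) for v in x]     # claimed gradient: exp(x)
numeric = numeric_grad(sum_exp, x)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_err)  # should be close to zero
```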
C
My understanding may be one-sided or incomplete.