Prerequisite: Derivation of the gradient formulas for neural network convolutional layers
Code wrapping: NumPy reference implementations of common neural network layers (6): Convolutional layer
Average Pooling Layer
Data Structure
Assume the input image data passed to the pooling layer has already been padded; its structure is as follows:
$$
\mathbf{X}_{(H \times W)} = \begin{bmatrix}
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(W)} \\
x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(W)} \\
\cdots & \cdots & \cdots & \cdots \\
x_H^{(1)} & x_H^{(2)} & \cdots & x_H^{(W)}
\end{bmatrix}
$$
The output shape is computed with the following formulas:
$$
\begin{aligned}
H_{out} &= \left\lfloor \frac{H + 2 \times \mathrm{padding} - H_k}{\mathrm{stride}} \right\rfloor + 1 \\
W_{out} &= \left\lfloor \frac{W + 2 \times \mathrm{padding} - W_k}{\mathrm{stride}} \right\rfloor + 1
\end{aligned}
$$
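The floor in these formulas maps directly onto integer division. A minimal sketch (the helper name `pool_output_shape` is ours, not from the original code):

```python
def pool_output_shape(H, W, Hk, Wk, stride=1, padding=0):
    # Integer division implements the floor in the output-shape formula.
    H_out = (H + 2 * padding - Hk) // stride + 1
    W_out = (W + 2 * padding - Wk) // stride + 1
    return H_out, W_out
```

For example, a 4x4 input with a 2x2 kernel and stride 2 yields a 2x2 output.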
The col matrix produced by applying the im2col algorithm to $\mathbf{X}_{(H \times W)}$ has the following shape:
$$
{\mathbf{X_{col}}}_{(H_{out}W_{out} \times H_kW_k)} = \begin{bmatrix}
\hat{x}_1^{(1)} & \hat{x}_1^{(2)} & \cdots & \hat{x}_1^{(H_kW_k)} \\
\hat{x}_2^{(1)} & \hat{x}_2^{(2)} & \cdots & \hat{x}_2^{(H_kW_k)} \\
\cdots & \cdots & \cdots & \cdots \\
\hat{x}_{H_{out}W_{out}}^{(1)} & \hat{x}_{H_{out}W_{out}}^{(2)} & \cdots & \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}
\end{bmatrix}
$$
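As a sketch of this transformation, here is a naive single-channel im2col (the function name `im2col_2d` and its loop-based layout are our own illustration; a production version would be vectorized):

```python
import numpy as np

def im2col_2d(X, Hk, Wk, stride=1):
    """Turn each Hk x Wk window of X (assumed already padded)
    into one row of the col matrix, shape (H_out*W_out, Hk*Wk)."""
    H, W = X.shape
    H_out = (H - Hk) // stride + 1
    W_out = (W - Wk) // stride + 1
    cols = np.empty((H_out * W_out, Hk * Wk))
    for i in range(H_out):
        for j in range(W_out):
            patch = X[i*stride : i*stride+Hk, j*stride : j*stride+Wk]
            cols[i * W_out + j] = patch.ravel()  # one window per row
    return cols
```

For a 4x4 input with a 2x2 kernel and stride 2, the result has shape (4, 4): four windows, each flattened to four elements.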
Forward Pass
Simply take the mean over the last dimension of ${\mathbf{X_{col}}}_{(H_{out}W_{out} \times H_kW_k)}$ (when emitting the output, reshape ${\mathbf{Y_{flat}}}_{(H_{out}W_{out} \times 1)}$ to $H_{out} \times W_{out}$):
$$
{\mathbf{Y_{flat}}}_{(H_{out}W_{out} \times 1)} = \mathbf{X_{col}}.\mathrm{mean}(axis=-1) = \frac{1}{H_kW_k} \times \begin{bmatrix}
\sum_i^{H_kW_k} \hat{x}_1^{(i)} \\
\sum_i^{H_kW_k} \hat{x}_2^{(i)} \\
\cdots \\
\sum_i^{H_kW_k} \hat{x}_{H_{out}W_{out}}^{(i)}
\end{bmatrix} = \begin{bmatrix}
y_1 \\ y_2 \\ \cdots \\ y_{H_{out}W_{out}}
\end{bmatrix}, \quad y_{H_{out}W_{out}} = \frac{1}{H_kW_k} \times \sum_i^{H_kW_k} \hat{x}_{H_{out}W_{out}}^{(i)}
$$
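On a concrete example (values chosen by us for illustration), the whole forward pass is one `mean` call plus a reshape:

```python
import numpy as np

# A 4x4 input pooled with a 2x2 kernel, stride 2, no padding: H_out = W_out = 2.
X = np.arange(16, dtype=float).reshape(4, 4)

# Build X_col by hand for this small case: one row per 2x2 window.
X_col = np.array([X[i:i+2, j:j+2].ravel()
                  for i in (0, 2) for j in (0, 2)])   # shape (4, 4)

# Average pooling: mean over the last axis, then reshape to (H_out, W_out).
Y = X_col.mean(axis=-1).reshape(2, 2)
print(Y)  # [[ 2.5  4.5] [10.5 12.5]]
```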
The output then produces the loss:
$$
loss = L\left( {\mathbf{Y_{flat}}}_{(H_{out}W_{out} \times 1)} \right)
$$
The gradient passed into this layer has the following structure:
$$
{\mathbf{Grad_{flat}}}_{(H_{out}W_{out} \times 1)} = \frac{\partial L(\mathbf{Y_{flat}})}{\partial \mathbf{Y_{flat}}} = \begin{bmatrix}
\frac{\partial L(\mathbf{Y_{flat}})}{\partial y_1} \\
\frac{\partial L(\mathbf{Y_{flat}})}{\partial y_2} \\
\cdots \\
\frac{\partial L(\mathbf{Y_{flat}})}{\partial y_{H_{out}W_{out}}}
\end{bmatrix} = \begin{bmatrix}
g_1 \\ g_2 \\ \cdots \\ g_{H_{out}W_{out}}
\end{bmatrix}, \quad g_{H_{out}W_{out}} = \frac{\partial L(\mathbf{Y_{flat}})}{\partial y_{H_{out}W_{out}}}
$$
Backward Pass
We take the partial derivative of the loss with respect to the last element of the input data, which gives the general term of the output gradient:
$$
\frac{\partial loss}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}} = \frac{\partial L(\mathbf{Y_{flat}})}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}} = g_{H_{out}W_{out}} \cdot \frac{\partial \frac{1}{H_kW_k} \times \sum_i^{H_kW_k} \hat{x}_{H_{out}W_{out}}^{(i)}}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}} = \frac{1}{H_kW_k} \times g_{H_{out}W_{out}}
$$
Assembling the full gradient matrix:
$$
{\mathbf{dX_{col}}}_{(H_{out}W_{out} \times H_kW_k)} = \frac{\partial loss}{\partial \mathbf{X_{col}}} = \frac{1}{H_kW_k} \times \begin{bmatrix}
g_1 & g_1 & \cdots & g_1 \\
g_2 & g_2 & \cdots & g_2 \\
\cdots & \cdots & \cdots & \cdots \\
g_{H_{out}W_{out}} & g_{H_{out}W_{out}} & \cdots & g_{H_{out}W_{out}}
\end{bmatrix} = \frac{1}{H_kW_k} \times \left( \mathbf{0}_{(H_{out}W_{out} \times H_kW_k)} + {\mathbf{Grad_{flat}}}_{(H_{out}W_{out} \times 1)} \right)
$$
Clearly, to obtain $\mathbf{dX_{col}}$ we only need to flatten the incoming gradient $\mathbf{Grad}_{(H_{out} \times W_{out})}$ into ${\mathbf{Grad_{flat}}}_{(H_{out}W_{out} \times 1)}$, broadcast it to $\mathbf{GRAD}_{(H_{out}W_{out} \times H_kW_k)}$, and divide by $H_k \times W_k$.
Finally, restoring $\mathbf{dX_{col}}$ with the col2im algorithm and accumulating overlapping positions yields the output gradient:
$$
{\mathbf{GRAD'}}_{(H \times W)} = \sum \mathrm{col2im}\left( {\mathbf{dX_{col}}}_{(H_{out}W_{out} \times H_kW_k)} \right)
$$
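A sketch of this backward step, with a naive loop-based col2im (the helper name `col2im_2d` and the concrete numbers are our own illustration):

```python
import numpy as np

def col2im_2d(d_cols, H, W, Hk, Wk, stride=1):
    """Scatter-add each row of d_cols back into its window of an H x W map."""
    H_out = (H - Hk) // stride + 1
    W_out = (W - Wk) // stride + 1
    dX = np.zeros((H, W))
    for i in range(H_out):
        for j in range(W_out):
            dX[i*stride:i*stride+Hk, j*stride:j*stride+Wk] += \
                d_cols[i * W_out + j].reshape(Hk, Wk)
    return dX

# Average-pool backward: broadcast each g over its window, divide by Hk*Wk.
grad = np.array([[1., 2.], [3., 4.]])                     # upstream (H_out, W_out)
d_cols = np.repeat(grad.reshape(-1, 1), 4, axis=1) / 4    # (H_out*W_out, Hk*Wk)
dX = col2im_2d(d_cols, H=4, W=4, Hk=2, Wk=2, stride=2)
```

With stride equal to the kernel size the windows do not overlap, so each 2x2 block of `dX` simply holds its $g / (H_kW_k)$; the gradient mass is conserved (`dX.sum() == grad.sum()`).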
Extending to batched data, the formulas are unchanged:
$$
\begin{aligned}
{\mathbf{Y_{flat}}}_{(N \times C \times H_{out}W_{out} \times 1)} &= {\mathbf{X_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)}.\mathrm{mean}(axis=-1) \\
{\mathbf{dX_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)} &= \frac{1}{H_kW_k} \times \left( \mathbf{0}_{(N \times C \times H_{out}W_{out} \times H_kW_k)} + {\mathbf{Grad_{flat}}}_{(N \times C \times H_{out}W_{out} \times 1)} \right) \\
{\mathbf{GRAD'}}_{(N \times C \times H \times W)} &= \sum \mathrm{col2im}\left( {\mathbf{dX_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)} \right)
\end{aligned}
$$
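In NumPy the "zeros plus $\mathbf{Grad_{flat}}$" construction is just broadcasting, and it works unchanged on the batched 4-D layout. A small sketch with hypothetical shapes and random data:

```python
import numpy as np

# Hypothetical small batched shapes: N images, C channels.
N, C, H_out, W_out, Hk, Wk = 2, 3, 2, 2, 2, 2
rng = np.random.default_rng(0)
X_col = rng.standard_normal((N, C, H_out * W_out, Hk * Wk))

# Forward: mean over the last axis, exactly as in the single-image case.
Y_flat = X_col.mean(axis=-1)                      # (N, C, H_out*W_out)

# Backward: broadcasting stands in for "0 + Grad_flat" in the formula.
grad_flat = rng.standard_normal(Y_flat.shape)
dX_col = np.broadcast_to(grad_flat[..., None] / (Hk * Wk), X_col.shape)
```

Summing `dX_col` back over the window axis recovers `grad_flat`, confirming that each output gradient is spread evenly over its $H_kW_k$ inputs.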
Max Pooling Layer
Forward Pass
By analogy with the average pooling layer, the forward formula of the max pooling layer becomes:
$$
{\mathbf{Y_{flat}}}_{(H_{out}W_{out} \times 1)} = \mathbf{X_{col}}.\mathrm{max}(axis=-1) = \begin{bmatrix}
\max\left( \hat{x}_1^{(1)}, \hat{x}_1^{(2)}, \cdots, \hat{x}_1^{(H_kW_k)} \right) \\
\max\left( \hat{x}_2^{(1)}, \hat{x}_2^{(2)}, \cdots, \hat{x}_2^{(H_kW_k)} \right) \\
\cdots \\
\max\left( \hat{x}_{H_{out}W_{out}}^{(1)}, \hat{x}_{H_{out}W_{out}}^{(2)}, \cdots, \hat{x}_{H_{out}W_{out}}^{(H_kW_k)} \right)
\end{bmatrix} = \begin{bmatrix}
y_1 \\ y_2 \\ \cdots \\ y_{H_{out}W_{out}}
\end{bmatrix}
$$
For each output element:
$$
y_{H_{out}W_{out}} = \max\left( \hat{x}_{H_{out}W_{out}}^{(1)}, \hat{x}_{H_{out}W_{out}}^{(2)}, \cdots, \hat{x}_{H_{out}W_{out}}^{(H_kW_k)} \right)
$$
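On a concrete example (our own hand-picked values), the forward pass differs from average pooling only in replacing `mean` with `max`:

```python
import numpy as np

# A 4x4 input, 2x2 kernel, stride 2 (values chosen by hand for illustration).
X = np.array([[1., 3., 2., 0.],
              [5., 4., 1., 2.],
              [0., 2., 9., 6.],
              [1., 8., 7., 3.]])

# One row of X_col per 2x2 window, as produced by im2col.
X_col = np.array([X[i:i+2, j:j+2].ravel()
                  for i in (0, 2) for j in (0, 2)])

# Max pooling: maximum over the last axis, reshaped to (H_out, W_out).
Y = X_col.max(axis=-1).reshape(2, 2)
```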
Backward Pass
Similarly, we take the partial derivative of the loss with respect to the last input element:
$$
\frac{\partial loss}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}} = \frac{\partial L(\mathbf{Y_{flat}})}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}} = g_{H_{out}W_{out}} \cdot \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}}
$$
Combining:
$$
{\mathbf{dX_{col}}}_{(H_{out}W_{out} \times H_kW_k)} = \frac{\partial loss}{\partial \mathbf{X_{col}}} = \begin{bmatrix}
g_1 \cdot \frac{\partial y_1}{\partial \hat{x}_1^{(1)}} & g_1 \cdot \frac{\partial y_1}{\partial \hat{x}_1^{(2)}} & \cdots & g_1 \cdot \frac{\partial y_1}{\partial \hat{x}_1^{(H_kW_k)}} \\
g_2 \cdot \frac{\partial y_2}{\partial \hat{x}_2^{(1)}} & g_2 \cdot \frac{\partial y_2}{\partial \hat{x}_2^{(2)}} & \cdots & g_2 \cdot \frac{\partial y_2}{\partial \hat{x}_2^{(H_kW_k)}} \\
\cdots & \cdots & \cdots & \cdots \\
g_{H_{out}W_{out}} \cdot \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(1)}} & g_{H_{out}W_{out}} \cdot \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(2)}} & \cdots & g_{H_{out}W_{out}} \cdot \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}}
\end{bmatrix} = {\mathbf{Grad_{flat}}}_{(H_{out}W_{out} \times 1)} \times \frac{\partial \mathbf{Y_{flat}}}{\partial \mathbf{X_{col}}}
$$
Clearly the factor $\frac{\partial \mathbf{Y_{flat}}}{\partial \mathbf{X_{col}}}$ is a Jacobian-style matrix of partial derivatives; each row is 1 at the position of that window's maximum and 0 elsewhere:
$$
\frac{\partial \mathbf{Y_{flat}}}{\partial \mathbf{X_{col}}} = \begin{bmatrix}
\frac{\partial y_1}{\partial \hat{x}_1^{(1)}} & \frac{\partial y_1}{\partial \hat{x}_1^{(2)}} & \cdots & \frac{\partial y_1}{\partial \hat{x}_1^{(H_kW_k)}} \\
\frac{\partial y_2}{\partial \hat{x}_2^{(1)}} & \frac{\partial y_2}{\partial \hat{x}_2^{(2)}} & \cdots & \frac{\partial y_2}{\partial \hat{x}_2^{(H_kW_k)}} \\
\cdots & \cdots & \cdots & \cdots \\
\frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(1)}} & \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(2)}} & \cdots & \frac{\partial y_{H_{out}W_{out}}}{\partial \hat{x}_{H_{out}W_{out}}^{(H_kW_k)}}
\end{bmatrix}_{(H_{out}W_{out} \times H_kW_k)} = \mathbf{X_{col}}.\mathrm{argmax}(axis=-1).\mathrm{eye}(axis=-1)
$$
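The pseudo-notation `.argmax(axis=-1).eye(axis=-1)` corresponds in NumPy to indexing an identity matrix with the argmax indices, which yields exactly these one-hot rows (example values are ours):

```python
import numpy as np

# Two example rows of an X_col matrix (kernel size Hk*Wk = 4).
X_col = np.array([[1., 3., 5., 4.],
                  [2., 0., 1., 2.]])

# np.eye(K)[indices] turns each argmax index into a one-hot row:
# the nonzero pattern of the matrix of partial derivatives above.
# (np.argmax breaks ties by taking the first maximum.)
mask = np.eye(X_col.shape[-1])[X_col.argmax(axis=-1)]
```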
Finally, $\mathbf{dX_{col}}$ is likewise restored and accumulated with the col2im algorithm. Applied to batched data, the formulas are again unchanged:
$$
\begin{aligned}
{\mathbf{Y_{flat}}}_{(N \times C \times H_{out}W_{out} \times 1)} &= {\mathbf{X_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)}.\mathrm{max}(axis=-1) \\
{\mathbf{dX_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)} &= {\mathbf{Grad_{flat}}}_{(N \times C \times H_{out}W_{out} \times 1)} \times \mathbf{X_{col}}.\mathrm{argmax}(axis=-1).\mathrm{eye}(axis=-1) \\
{\mathbf{GRAD'}}_{(N \times C \times H \times W)} &= \sum \mathrm{col2im}\left( {\mathbf{dX_{col}}}_{(N \times C \times H_{out}W_{out} \times H_kW_k)} \right)
\end{aligned}
$$
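An end-to-end sketch of the max-pool backward pass on a single image, combining the one-hot mask with a loop-based col2im (all names and numbers are our own illustration, not the original wrapped code):

```python
import numpy as np

# Single 4x4 image, 2x2 kernel, stride 2 (hand-picked values).
X = np.array([[1., 3., 2., 0.],
              [5., 4., 1., 2.],
              [0., 2., 9., 6.],
              [1., 8., 7., 3.]])
Hk = Wk = 2
H_out = W_out = 2

X_col = np.array([X[i:i+2, j:j+2].ravel()
                  for i in (0, 2) for j in (0, 2)])

grad = np.array([[1., 2.], [3., 4.]])             # upstream gradient (H_out, W_out)
mask = np.eye(Hk * Wk)[X_col.argmax(axis=-1)]     # one-hot positions of each max
d_cols = grad.reshape(-1, 1) * mask               # route each g to its max entry

# col2im: scatter-add each row back into its window (no overlap at stride 2).
dX = np.zeros_like(X)
for idx in range(H_out * W_out):
    i, j = divmod(idx, W_out)
    dX[2*i:2*i+2, 2*j:2*j+2] += d_cols[idx].reshape(Hk, Wk)
```

Each upstream gradient lands exactly on the input position that produced the window's maximum; all other positions receive zero, and the total gradient mass is preserved.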