# Backpropagation

These are my notes on Lecture 3 (Part I) and Lecture 3 (Part II) - “Manual” Neural Networks.

## The gradients of a two-layer network

### Two-layer network

In practice, a linear hypothesis class cannot classify all cases.


Hence one usually adopts an MLP, i.e., nested linear and nonlinear layers. The simplest example is a two-layer network,

$$\sigma(XW_1)W_2$$


Here $\sigma$ denotes an elementwise nonlinear transformation (e.g., the common ReLU or tanh), and the dimensions are

$$X\in\mathbb R^{m\times n},\quad W_1\in\mathbb R^{n\times d},\quad W_2\in\mathbb R^{d\times k}$$
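
To make the shapes concrete, here is a minimal numpy sketch of this forward pass, assuming ReLU for $\sigma$; the sizes `m, n, d, k` and the random data are made up for illustration:

```python
import numpy as np

m, n, d, k = 32, 784, 128, 10            # batch size, input dim, hidden dim, classes
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
W1 = rng.standard_normal((n, d)) * 0.01
W2 = rng.standard_normal((d, k)) * 0.01

Z1 = np.maximum(X @ W1, 0)               # sigma(X W1), shape (m, d)
logits = Z1 @ W2                         # sigma(X W1) W2, shape (m, k)
```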

### Gradients

$X$ denotes the input in batch matrix form.

Goal: compute $\nabla_{\{W_1,W_2\}}\ell_{ce}(\sigma(XW_1)W_2,\ y)$.

#### $W_2$

As before, we “treat everything as a scalar” and apply the chain rule directly:

$$
\begin{aligned}
\frac{\partial \ell_{ce}(\sigma(XW_1)W_2,y)}{\partial W_2}
&= \frac{\partial \ell_{ce}(\sigma(XW_1)W_2,y)}{\partial \sigma(XW_1)W_2}\cdot \frac{\partial \sigma(XW_1)W_2}{\partial W_2}\\
&= (S - I_y)\cdot \sigma(XW_1)\qquad \bigl(S=\mathrm{normalize}(\exp(\sigma(XW_1)W_2))\bigr)
\end{aligned}
$$

Then we match dimensions: $(S-I_y)\in\mathbb R^{m\times k}$ and $\sigma(XW_1)\in\mathbb R^{m\times d}$, while we are differentiating with respect to $W_2\in\mathbb R^{d\times k}$, so

$$\nabla_{W_2}\ell_{ce}(\sigma(XW_1)W_2,y) = (\sigma(XW_1))^T\cdot (S-I_y)$$
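
Continuing the sketch above, the $W_2$ gradient in numpy (the labels `y` are made up; add a $1/m$ factor if $\ell_{ce}$ is averaged over the batch rather than summed):

```python
y = rng.integers(0, k, size=m)           # hypothetical labels
S = np.exp(logits - logits.max(axis=1, keepdims=True))   # shift for stability
S /= S.sum(axis=1, keepdims=True)        # normalize(exp(.)), shape (m, k)
Iy = np.zeros((m, k))
Iy[np.arange(m), y] = 1                  # one-hot labels I_y, shape (m, k)
grad_W2 = Z1.T @ (S - Iy)                # (d, m) @ (m, k) -> (d, k), matches W2
```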

#### $W_1$

Next, take the partial derivative with respect to $W_1$:

$$
\begin{aligned}
\frac{\partial \ell_{ce}(\sigma(XW_1)W_2,y)}{\partial W_1}
&= \frac{\partial \ell_{ce}(\sigma(XW_1)W_2,y)}{\partial \sigma(XW_1)W_2}\cdot \frac{\partial \sigma(XW_1)W_2}{\partial \sigma(XW_1)}\cdot \frac{\partial \sigma(XW_1)}{\partial XW_1}\cdot \frac{\partial XW_1}{\partial W_1}\\
&= (S-I_y)\cdot W_2\cdot \sigma'(XW_1)\cdot X
\end{aligned}
$$

Here $\sigma$ is a scalar function, so $\sigma'$ is simply its scalar derivative; for example, the derivative of ReLU is the piecewise function $\sigma'(x)=0$ for $x\le 0$ and $\sigma'(x)=1$ for $x>0$.

For this chain of products, list the dimensions: $(S-I_y)\in\mathbb R^{m\times k}$, $W_2\in\mathbb R^{d\times k}$, $\sigma'(XW_1)\in\mathbb R^{m\times d}$, $X\in\mathbb R^{m\times n}$, while $W_1\in\mathbb R^{n\times d}$; therefore

$$\nabla_{W_1}\ell_{ce}(\sigma(XW_1)W_2,y) = X^T\bigl(\sigma'(XW_1) \circ ((S-I_y)W_2^T)\bigr)$$

Here $\circ$ denotes the Hadamard product, i.e., elementwise multiplication.
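
In the same sketch, the $W_1$ gradient takes one extra line, with $\sigma'(XW_1)$ implemented as a 0/1 mask and `*` playing the role of $\circ$:

```python
relu_grad = (Z1 > 0).astype(X.dtype)     # sigma'(X W1): 1 where X W1 > 0, else 0
grad_W1 = X.T @ (relu_grad * ((S - Iy) @ W2.T))   # (n, d), matches W1
```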

The takeaway from all of this: stop computing these gradients by hand.

## Backpropagation “in general”

Now suppose we have a general fully connected network $Z_{i+1} = \sigma_i(Z_iW_i),\ i=1,\ldots,L$. To get the partial derivative of the loss at the final layer with respect to some $W_i$, we have

$$
\frac{\partial \ell(Z_{L+1},y)}{\partial W_i} = \frac{\partial \ell}{\partial Z_{L+1}}\cdot \frac{\partial Z_{L+1}}{\partial Z_L}\cdots \frac{\partial Z_{i+2}}{\partial Z_{i+1}}\cdot \frac{\partial Z_{i+1}}{\partial W_i}
$$

Notice that most factors in this chain are shared across different $i$, so define

$$
G_{i+1} = \frac{\partial \ell}{\partial Z_{L+1}}\cdot \frac{\partial Z_{L+1}}{\partial Z_L}\cdots \frac{\partial Z_{i+2}}{\partial Z_{i+1}}
$$

This yields a simple recurrence:

$$
G_i = G_{i+1}\cdot \frac{\partial Z_{i+1}}{\partial Z_i} = G_{i+1}\cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i}\cdot\frac{\partial Z_iW_i}{\partial Z_i} = G_{i+1}\cdot \sigma_i'(Z_iW_i)\cdot W_i
$$

Here $G_i = \nabla_{Z_i}\ell(Z_{L+1},y)\in\mathbb R^{m\times n_i}$, $Z_i\in\mathbb R^{m\times n_i}$, $W_i\in\mathbb R^{n_i\times n_{i+1}}$.

Therefore, matching the dimensions of $G_{i+1}\cdot \sigma_i'(Z_iW_i)\cdot W_i$ gives

$$
G_i = \nabla_{Z_i}\ell(Z_{L+1},y) = \bigl(G_{i+1}\circ \sigma_i'(Z_iW_i)\bigr)\, W_i^T
$$

Finally we can compute the gradient we actually need, $\nabla_{W_i}\ell(Z_{L+1},y)$:

$$
\begin{aligned}
\frac{\partial \ell(Z_{L+1},y)}{\partial W_i}
&= G_{i+1}\cdot \frac{\partial Z_{i+1}}{\partial W_i}\\
&= G_{i+1}\cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial W_i}\\
&= G_{i+1}\cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i}\cdot \frac{\partial Z_iW_i}{\partial W_i}\\
&= G_{i+1}\cdot \sigma_i'(Z_iW_i) \cdot Z_i
\end{aligned}
$$

Adjusting dimensions one last time gives

$$
\nabla_{W_i}\ell(Z_{L+1},y) = Z_i^T \bigl(G_{i+1}\circ \sigma_i'(Z_iW_i)\bigr)
$$
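
These matched-dimension formulas are easy to get subtly wrong, so a numerical sanity check is worthwhile. A finite-difference sketch continuing the two-layer example above (the helper `ce_loss`, the probe entry `(0, 0)`, and the step `eps` are my arbitrary choices, not from the lecture):

```python
def ce_loss(W1_, W2_):
    """Summed (not averaged) softmax cross-entropy of the two-layer net."""
    h = np.maximum(X @ W1_, 0) @ W2_
    h = h - h.max(axis=1, keepdims=True)            # stability shift cancels out
    return (np.log(np.exp(h).sum(axis=1)) - h[np.arange(m), y]).sum()

eps, i, j = 1e-5, 0, 0                              # arbitrary probe entry of W1
W1p, W1m = W1.copy(), W1.copy()
W1p[i, j] += eps
W1m[i, j] -= eps
numeric = (ce_loss(W1p, W2) - ce_loss(W1m, W2)) / (2 * eps)
assert np.isclose(numeric, grad_W1[i, j], rtol=1e-3)
```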

## Backpropagation: Forward and backward passes

1. Forward pass
   - Initialize: $Z_1 = X$
   - Iterate: $Z_{i+1} = \sigma_i(Z_iW_i),\ i=1,\ldots,L$
2. Backward pass
   - Initialize: $G_{L+1} = \nabla_{Z_{L+1}}\ell(Z_{L+1},y) = S-I_y$ (for the softmax cross-entropy loss above)
   - Iterate: $G_i = (G_{i+1}\circ \sigma_i'(Z_iW_i))\, W_i^T,\ i=L,\ldots,1$
   - Compute gradients: $\nabla_{W_i}\ell(Z_{L+1},y) = Z_i^T (G_{i+1}\circ \sigma_i'(Z_iW_i))$
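
The two passes translate almost line-for-line into numpy. A compact sketch, assuming ReLU for every hidden $\sigma_i$ and an identity $\sigma_L$ at the last layer so that $Z_{L+1}$ are the logits (the function name and conventions are mine, not the lecture's):

```python
def backprop(X, y, weights):
    """Forward + backward passes for an MLP; returns [grad W_1, ..., grad W_L]."""
    m, L = X.shape[0], len(weights)
    # Forward pass: Z_1 = X, Z_{i+1} = sigma_i(Z_i W_i); cache every Z_i.
    Zs = [X]
    for i, W in enumerate(weights):
        pre = Zs[-1] @ W
        Zs.append(pre if i == L - 1 else np.maximum(pre, 0))
    # Initialize: G_{L+1} = S - I_y (softmax cross-entropy at the logits).
    S = np.exp(Zs[-1] - Zs[-1].max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)
    G = S
    G[np.arange(m), y] -= 1.0
    # Backward pass: mask by sigma_i', take the W_i gradient, propagate G down.
    grads = [None] * L
    for i in reversed(range(L)):
        if i < L - 1:                     # sigma_i' = relu' on hidden layers only
            G = G * (Zs[i + 1] > 0)       # Z_{i+1} > 0  <=>  Z_i W_i > 0
        grads[i] = Zs[i].T @ G            # Z_i^T (G_{i+1} o sigma_i'(Z_i W_i))
        G = G @ weights[i].T              # G_i = (...) W_i^T; unused when i == 0
    return grads

# Applied to the two-layer example above, this reproduces the direct formulas:
gW1, gW2 = backprop(X, y, [W1, W2])
assert np.allclose(gW1, grad_W1) and np.allclose(gW2, grad_W2)
```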