Understanding Gradient Calculation through Backpropagation

When learning the basics of neural networks, gradient calculation through backpropagation can be challenging for beginners. This article explains how gradients are computed through backpropagation in an intuitive, easy-to-understand way.

Mathematical Method for Calculating Gradients

The reason we compute gradients is to update the weights and biases in the direction that minimizes the value of the loss function.

Let's start with a simple example. Suppose we have the following linear function:

F(x, y) = 3x + 4y

According to calculus, the formula for calculating the gradient is:

\nabla F(x, y) = \left( \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y} \right)

It is easy to find that the gradient is (3, 4). To relate this to a more realistic scenario, let's look at a slightly more complex expression:
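As a quick sanity check, and only as a minimal sketch that is not part of the original derivation, we can approximate both partial derivatives with central differences at an arbitrary point:

# Sketch: numerically confirm that grad F = (3, 4) for F(x, y) = 3x + 4y.
def F(x, y):
    return 3 * x + 4 * y

h = 1e-4
x, y = 1.0, 2.0  # any point works, since the function is linear
dF_dx = (F(x + h, y) - F(x - h, y)) / (2 * h)
dF_dy = (F(x, y + h) - F(x, y - h)) / (2 * h)
print(dF_dx, dF_dy)  # approximately 3.0 and 4.0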

F_1(x) = w_1 x + b_1

F_2(x) = w_2 x + b_2

F(x) = F_2(F_1(x))

We can interpret this with neural network concepts: the expression is equivalent to two affine (fully connected) layers joined by a linear activation, whose derivative is 1 and whose gradient can therefore be ignored for now. Our goal is to find the gradient of w_1. According to the chain rule in calculus, we have:

\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial F_1} \cdot \frac{\partial F_1}{\partial w_1}

It is easy to work out that this gradient equals w_2 x, since \partial y / \partial F_1 = w_2 and \partial F_1 / \partial w_1 = x.
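To make this concrete, here is a small numeric check with arbitrary example values (a sketch, not part of the original text): perturbing w_1 and observing the change in y should match w_2 \cdot x.

# Sketch: check dy/dw1 = w2 * x numerically for y = w2*(w1*x + b1) + b2.
w1, b1, w2, b2, x = 0.5, 0.1, 2.0, -0.3, 3.0  # arbitrary example values

def y(w1_val):
    return w2 * (w1_val * x + b1) + b2

h = 1e-4
numerical = (y(w1 + h) - y(w1 - h)) / (2 * h)
analytical = w2 * x
print(numerical, analytical)  # both approximately 6.0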

The general method for calculating gradients is numerical differentiation, which looks like this:

import numpy as np

def numerical_gradient(f, x):
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)

    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)

        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)

        x[idx] = tmp_val  # Restore value
        it.iternext()

    return grad
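For example, applying this function to the linear function from earlier should recover the gradient (3, 4); the point chosen below is arbitrary, and this is only a usage sketch:

# Usage sketch: the gradient of F(x, y) = 3x + 4y is (3, 4) at any point.
f = lambda p: 3 * p[0] + 4 * p[1]
point = np.array([1.0, 2.0])  # use a float array so the gradient is stored as floats
print(numerical_gradient(f, point))  # approximately [3. 4.]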

This approach is sometimes described as calculating gradients by forward propagation: for every single parameter, the entire network function must be evaluated (twice, in fact, for the central difference). Because this consumes a lot of resources, we introduce the method of calculating gradients through backpropagation.

Backpropagation for Gradient Calculation

First, let's turn the functions used above into code. In the backward method of the Affine layer below, we compute three quantities:

  • dx is the signal that is passed further back for backpropagation
  • dW is the derivative with respect to this layer's weights
  • db is the derivative with respect to this layer's bias

class Affine:
    def __init__(self, W, b) -> None:
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        dot = np.dot(self.x, self.W)
        out = dot + self.b  # Broadcasting...

        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)       # A: gradient passed back to the previous layer
        self.dW = np.dot(self.x.T, dout)  # B: gradient with respect to the weights
        self.db = np.sum(dout, axis=0)    # gradient with respect to the bias

        return dx
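As a quick shape check (a sketch with arbitrarily chosen dimensions), a batch of 4 inputs with 2 features passing through an Affine layer with 3 output units behaves as follows:

# Usage sketch: forward and backward pass through one Affine layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # 2 input features, 3 output units
b = np.zeros(3)
x = rng.standard_normal((4, 2))   # batch of 4 samples

layer = Affine(W, b)
out = layer.forward(x)            # shape (4, 3)
dout = np.ones_like(out)          # pretend gradient arriving from the next layer
dx = layer.backward(dout)         # shape (4, 2), same as x
print(out.shape, dx.shape, layer.dW.shape, layer.db.shape)
# (4, 3) (4, 2) (2, 3) (3,)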

So how does backpropagation compute these gradients? It can be understood this way: we first calculate the derivative of the loss E (defined below) with respect to its input, the network output; we then backpropagate that to F_2 and obtain the gradient of the loss with respect to w_2, and so on. With the numerical method above, we essentially compute from left to right (input toward output), whereas backpropagation computes from right to left (output toward input).

Let's deepen our understanding with a more practical example.

Let

E = \frac{(y - t)^2}{2}

be the loss function, where t is the teacher label and y is the output of the network (the last layer). According to the chain rule, the gradient of w_1 is:

\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial F_1} \cdot \frac{\partial F_1}{\partial w_1}

In backpropagation, we first calculate the derivative of the loss E at the output layer, then backpropagate it to the previous layer (F_2). Thus, writing x_2 = F_1(x) for the input to F_2, we first obtain the gradient of E with respect to w_2 as

\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial y} \cdot x_2

This expression corresponds to the comment B in the code above.
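With the squared-error loss above, \partial E / \partial y = y - t, so \partial E / \partial w_2 = (y - t) \cdot x_2. Here is a quick scalar check with made-up numbers (a sketch, not part of the original text):

# Sketch: verify dE/dw2 = (y - t) * x2 for E = (y - t)^2 / 2, y = w2*x2 + b2.
x2, w2, b2, t = 3.0, 2.0, -0.3, 1.0  # arbitrary example values

def loss(w2_val):
    y = w2_val * x2 + b2
    return (y - t) ** 2 / 2

h = 1e-4
numerical = (loss(w2 + h) - loss(w2 - h)) / (2 * h)
y = w2 * x2 + b2
analytical = (y - t) * x2
print(numerical, analytical)  # both approximately 14.1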

After that, this gradient can be combined with any of the gradient descent algorithms to update the weights. Next, we also need the gradient of E with respect to x_2, so that we can keep propagating back toward the earlier layers; it is easily obtained as

\frac{\partial E}{\partial x_2} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial x_2}

That is:

\frac{\partial E}{\partial x_2} = \frac{\partial E}{\partial y} \cdot w_2

This expression corresponds to the comment A in the code above.
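We can tie A and B back to the matrix code by comparing the analytic gradients from Affine.backward with the numerical_gradient function defined earlier. This is only a sketch: the shapes are arbitrary, and the loss is the squared error defined above, summed over a batch.

# Sketch: gradient check for one Affine layer with E = sum((y - t)^2) / 2.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 2))
W = rng.standard_normal((2, 3))
b = rng.standard_normal(3)
t = rng.standard_normal((4, 3))

layer = Affine(W, b)
y = layer.forward(x)
dout = y - t                      # dE/dy for the squared-error loss
dx = layer.backward(dout)         # analytic gradients (lines A and B)

def loss_given_W(W_val):
    return np.sum((np.dot(x, W_val) + b - t) ** 2) / 2

def loss_given_x(x_val):
    return np.sum((np.dot(x_val, W) + b - t) ** 2) / 2

print(np.allclose(layer.dW, numerical_gradient(loss_given_W, W.copy())))  # True (line B)
print(np.allclose(dx, numerical_gradient(loss_given_x, x.copy())))        # True (line A)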

In other words, in the backpropagation method at each layer, we do two things:

  1. Calculate the gradient of the loss function with respect to the current layer's weights (computed from the gradient handed back by the layer behind it, but ultimately still a derivative of the loss) in order to update the current layer's weights.
  2. Calculate the gradient of the loss function with respect to the current layer's input x (which is the previous layer's output), so that the chain rule can carry the gradient back to the previous layer.

Both of these depend on dout, the dx that was computed by the layer behind this one during the backward pass, because that layer's input x is exactly this layer's output.

It is worth mentioning that the order of the factors (and the transposes) in the two dot products differs from the scalar formulas because the actual computation uses matrix operations, whose shapes must line up for the products to be defined.
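To make the shape requirements concrete, here is a minimal two-layer sketch (identity activation, squared-error loss, all sizes chosen arbitrarily) that performs one full forward and backward pass and then a plain gradient-descent update:

# Sketch: two Affine layers with identity activation and E = sum((y - t)^2) / 2.
rng = np.random.default_rng(2)
x = rng.standard_normal((4, 2))                  # batch of 4, 2 input features
layer1 = Affine(rng.standard_normal((2, 5)), np.zeros(5))
layer2 = Affine(rng.standard_normal((5, 3)), np.zeros(3))
t = rng.standard_normal((4, 3))                  # teacher labels

# Forward pass: left to right.
y = layer2.forward(layer1.forward(x))

# Backward pass: right to left. Each layer stores dW/db for its own update
# and returns dx, which becomes dout for the layer before it.
dout = y - t                                     # dE/dy for the squared-error loss
dout = layer2.backward(dout)                     # fills layer2.dW, layer2.db
dout = layer1.backward(dout)                     # fills layer1.dW, layer1.db

# Update the weights with the stored gradients.
lr = 0.1
for layer in (layer1, layer2):
    layer.W -= lr * layer.dW
    layer.b -= lr * layer.db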