# Backpropagation simply explained

During my studies, I attended a few classes in which the backpropagation algorithm was explained. Unfortunately, the explanations were not very clear: the notation and vocabulary were messy and confusing. As a result, many students came away convinced that it is a complicated algorithm. In fact, it is pretty simple, which is all the more surprising given that it now fuels so many real-world applications. This article aims to explain backpropagation as simply as possible, using only the minimum background required to understand it.

Let’s consider a multi-layer perceptron modelled by a function \(\nn_{\vec{\theta}}\), where \(\vec{\theta}\) denotes all the parameters of the network (weights and biases). Training this network consists in learning from a training dataset a set of parameters \(\vec{\theta}\) such that the resulting network has the desired behaviour. The training dataset is composed of pairs \(\left\lbrace (\vec{x}^{(i)}, \vec{y}^{(i)})~\vert 1 \le i \le n \right\rbrace\), where for each \(i\), \(\vec{y}^{(i)}\) is the known desired output of input \(\vec{x}^{(i)}\). The performance of the network is evaluated with an error/cost function defined as

\[E: \vec{\theta} \mapsto \frac{1}{n} \sum_{i=1}^n L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)}),\]where \(L\) is the loss function such that \(L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)})\) measures the discrepancy between the desired output \(\vec{y}^{(i)}\) and the actual output \(\nn_{\vec{\theta}}(\vec{x}^{(i)})\) computed by the neural network \(\nn_{\vec{\theta}}\). Common loss functions are the squared error (for regression) and the negative log-likelihood (for classification). The training is then cast into the minimisation of \(E\), which is usually carried out by a variant of the **gradient descent algorithm**. Let’s consider the three layers displayed in the following figure.
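As a minimal illustration of the cost \(E\), here is the squared-error case on a hypothetical toy setup: a one-parameter linear "network" \(\nn_\theta(x) = \theta x\) and a made-up dataset (both are assumptions for this sketch, not part of the article's setting):

```python
import numpy as np

# Toy one-parameter "network" (hypothetical, for illustration):
# nn_theta(x) = theta * x.
def nn(theta, x):
    return theta * x

# Training pairs (x^(i), y^(i)): here y = 2x, so theta = 2 is optimal.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

def E(theta):
    # Cost: mean of the squared-error losses over the dataset.
    return np.mean((nn(theta, xs) - ys) ** 2)

print(E(2.0))  # 0.0: theta = 2 fits the data exactly
print(E(1.0))  # (1 + 4 + 9) / 3
```

Minimising \(E\) over \(\theta\) is exactly the training problem described above, just in one dimension.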

The output of neuron \(j\) of layer \(l\) is given by

\[h_j^l = \varphi(z_j^l) = \varphi \left( \sum_i w_{ij}^l h_i^{l-1} + b_j^l \right),\]where \(\varphi\) is the activation function of the neurons.
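Written out element-wise, the output of a single neuron is just a dot product, a bias, and an activation. The numbers below are made up for illustration, and \(\varphi = \tanh\) is one common choice of activation:

```python
import numpy as np

h_prev = np.array([0.5, -0.2, 0.1])   # h^{l-1}: outputs of layer l-1
w_j = np.array([0.4, 0.3, -0.6])      # w_{ij}^l: weights into neuron j
b_j = 0.05                            # b_j^l: bias of neuron j

z_j = np.dot(w_j, h_prev) + b_j       # pre-activation z_j^l = 0.13
h_j = np.tanh(z_j)                    # output h_j^l = phi(z_j^l)
print(z_j, h_j)
```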

The vanilla gradient descent algorithm specifies that the update rule for the weight \(w_{ij}^{l}\) connecting neuron \(i\) of layer \(l-1\) to neuron \(j\) of layer \(l\) is given by

\[\Delta w_{ij}^{l} = -\alpha \pder{E}{w_{ij}^l},\]where \(\alpha\) is a scalar parameter called the learning rate. So far we have only spoken of gradient descent; it is now that we introduce the term **backpropagation** of the gradient, which is simply an algorithm to compute the gradients \(\pder{E}{w_{ij}^l}\). It is based on the chain rule, in both its univariate and multivariate versions.
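The update rule itself can be sketched on a one-dimensional toy objective, before any backpropagation is involved. The objective \((\theta - 3)^2\) and the learning rate are hypothetical choices for illustration:

```python
alpha = 0.1  # learning rate (hypothetical value)

def grad_E(theta):
    # Gradient of the toy objective E(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0
for _ in range(100):
    theta -= alpha * grad_E(theta)  # Delta theta = -alpha * dE/dtheta
print(theta)  # approaches the minimiser theta = 3
```

Each step moves \(\theta\) against the gradient; with a suitable learning rate, the iterates converge to the minimiser.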

\[\pder{E}{w_{ij}^l} = \pder{E}{z_{j}^l} \pder{z_{j}^l}{w_{ij}^l} = \delta_{j}^l \, h_i^{l-1},\]where \(\delta_{j}^l = \pder{E}{z_{j}^l}\) is known as the error of neuron \(j\) of layer \(l\) and is computed by applying the multivariate chain rule:

\[\begin{align*} \delta_{j}^l = \pder{E}{z_{j}^l} &= \sum_{k} \pder{E}{z_{k}^{l+1}} \pder{z_{k}^{l+1}}{z_{j}^l} && \text{multivariate chain rule}\\ & = \sum_{k} \delta_k^{l+1} \pder{z_{k}^{l+1}}{h_{j}^{l}} \pder{h_{j}^{l}}{z_{j}^l} && \text{chain rule}\\ & = \sum_{k} \delta_k^{l+1} w_{jk}^{l+1} \varphi'(z_{j}^l) && \text{as }z_k^{l+1} = \sum_j w_{jk}^{l+1} h_j^{l} + b_k^{l+1} \text{ and } h_j^l = \varphi(z_j^l)\\ & = \varphi'(z_{j}^l) \sum_{k} \delta_k^{l+1} w_{jk}^{l+1} \end{align*}\]Therefore, by computing \(\vec{\delta}^{l+1}\) at layer \(l+1\), we can compute \(\vec{\delta}^{l}\) at the previous layer \(l\). Starting from the output layer, the process is repeated until the input layer is reached. This iterative procedure for computing the gradient of \(E\) with respect to all the weights is known as the backpropagation algorithm. Since \(E\) is non-convex (except when there is no hidden layer), gradient descent will likely end up in a local minimum. The important question is to find a suitable local minimum, one where the resulting network generalises well to new examples.

Note that gradient descent is an optimisation algorithm for minimising a differentiable function, while backpropagation is the procedure for computing the gradient of the error with respect to the weights of a neural network. However, by abuse of language, it is common to refer to the whole process (gradient descent + backpropagation) as backpropagation.

### Matrix form

When implementing backpropagation, it is important to write the computations in matrix form so that efficient matrix-multiplication routines can be used. Here \(\vec{h}^{l}\) and \(\vec{\delta}^l\) are row vectors and \(\vec{W}^{l}\) is a matrix. We then have:

\[\Delta \vec{W}^{l} = -\alpha \left(\vec{h}^{l-1}\right)^T \vec{\delta}^l,\]with

\[\vec{h}^l = \varphi \left( \vec{h}^{l-1} \vec{W}^l + \vec{b}^l \right),\]and

\[\vec{\delta}^l = \varphi'(\vec{z}^l) \odot \vec{\delta}^{l+1} \left(\vec{W}^{l+1}\right)^T.\]
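Putting the three matrix-form equations together, here is a minimal NumPy sketch of one training loop, assuming a two-layer network with a \(\tanh\) hidden layer, a linear output layer, and the squared-error loss on a single training pair (all sizes, initial values and the learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
W1 = 0.1 * rng.normal(size=(3, 4)); b1 = np.zeros((1, 4))
W2 = 0.1 * rng.normal(size=(4, 2)); b2 = np.zeros((1, 2))

phi = np.tanh                     # hidden activation
def dphi(z):                      # its derivative
    return 1.0 - np.tanh(z) ** 2

x = rng.normal(size=(1, 3))       # one input, as a row vector
y = rng.normal(size=(1, 2))       # its desired output
alpha = 0.1                       # learning rate

for _ in range(200):
    # Forward pass: h^l = phi(h^{l-1} W^l + b^l), linear output layer.
    z1 = x @ W1 + b1
    h1 = phi(z1)
    z2 = h1 @ W2 + b2
    h2 = z2

    # Backward pass. For L = 1/2 ||h2 - y||^2 and a linear output,
    # the output-layer error is delta^2 = dL/dz^2 = h2 - y.
    d2 = h2 - y
    # delta^1 = phi'(z^1) ⊙ (delta^2 (W^2)^T)
    d1 = dphi(z1) * (d2 @ W2.T)

    # Updates: Delta W^l = -alpha (h^{l-1})^T delta^l
    W2 -= alpha * h1.T @ d2
    b2 -= alpha * d2
    W1 -= alpha * x.T @ d1
    b1 -= alpha * d1

print(0.5 * np.sum((h2 - y) ** 2))  # the loss shrinks towards 0
```

The backward pass reuses the quantities \(\vec{z}^l\) and \(\vec{h}^l\) saved during the forward pass, which is why implementations cache the intermediate activations.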