# Backpropagation simply explained

During my studies, I attended a few classes in which the backpropagation algorithm was explained. Unfortunately, the explanations were not very clear: the notation and vocabulary were messy and confusing. As a result, many students came away convinced that it is a complicated algorithm. In fact, it is pretty simple, which is all the more surprising given that it now fuels so many real-world applications. This article aims to explain backpropagation as simply as possible, using only the minimum background required to understand it.

Let’s consider a multi-layer perceptron modelled by a function \(\nn_{\vec{\theta}}\), where \(\vec{\theta}\) denotes all the parameters of the network (weights and biases). Training this network consists in learning from a training dataset a set of parameters \(\vec{\theta}\) such that the resulting network has the desired behaviour. The training dataset is composed of pairs \(\left\lbrace (\vec{x}^{(i)}, \vec{y}^{(i)})~\vert 1 \le i \le n \right\rbrace\), where for each \(i\), \(\vec{y}^{(i)}\) is the known desired output of input \(\vec{x}^{(i)}\). The performance of the network is evaluated with an error/cost function defined as

\[E: \vec{\theta} \mapsto \frac{1}{n} \sum_{i=1}^n L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)}),\]where \(L\) is the loss function such that \(L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)})\) measures the discrepancy between the desired output \(\vec{y}^{(i)}\) and the actual output \(\nn_{\vec{\theta}}(\vec{x}^{(i)})\) computed by the neural network \(\nn_{\vec{\theta}}\). Common loss functions are the squared error (for regression) and the negative log-likelihood (for classification). The training is then cast into the minimisation of \(E\), which is usually carried out by a variant of the **gradient descent algorithm**. Let’s consider the three layers displayed in the following figure.
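As a minimal illustration of the cost \(E\), here is the squared-error case on a hypothetical toy setup: a one-parameter linear "network" \(\nn_\theta(x) = \theta x\) and a made-up dataset (both are assumptions for this sketch, not part of the article's setting):

```python
import numpy as np

# Toy one-parameter "network" (hypothetical, for illustration):
# nn_theta(x) = theta * x.
def nn(theta, x):
    return theta * x

# Training pairs (x^(i), y^(i)): here y = 2x, so theta = 2 is optimal.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

def E(theta):
    # Cost: mean of the squared-error losses over the dataset.
    return np.mean((nn(theta, xs) - ys) ** 2)

print(E(2.0))  # 0.0: theta = 2 fits the data exactly
print(E(1.0))  # (1 + 4 + 9) / 3
```

Minimising \(E\) over \(\theta\) is exactly the training problem described above, just in one dimension.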

The output of neuron \(j\) of layer \(l\) is given by

\[h_j^l = \varphi(z_j^l) = \varphi \left( \sum_i w_{ij}^l h_i^{l-1} + b_j^l \right),\]where \(\varphi\) is the activation function of the neurons.
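Written out element-wise, the output of a single neuron is just a dot product, a bias, and an activation. The numbers below are made up for illustration, and \(\varphi = \tanh\) is one common choice of activation:

```python
import numpy as np

h_prev = np.array([0.5, -0.2, 0.1])   # h^{l-1}: outputs of layer l-1
w_j = np.array([0.4, 0.3, -0.6])      # w_{ij}^l: weights into neuron j
b_j = 0.05                            # b_j^l: bias of neuron j

z_j = np.dot(w_j, h_prev) + b_j       # pre-activation z_j^l = 0.13
h_j = np.tanh(z_j)                    # output h_j^l = phi(z_j^l)
print(z_j, h_j)
```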

The vanilla gradient descent algorithm specifies that the update rule for the weight \(w_{ij}^{l}\) connecting neuron \(i\) of layer \(l-1\) to neuron \(j\) of layer \(l\) is given by

\[\Delta w_{ij}^{l} = -\alpha \pder{E}{w_{ij}^l},\]where \(\alpha\) is a scalar parameter called the learning rate. So far we have only spoken of gradient descent; it is now that we introduce the term **backpropagation** of the gradient, which is simply an algorithm to compute the gradients \(\pder{E}{w_{ij}^l}\). It is based on the chain rule, in both its univariate and multivariate versions.
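The update rule itself can be sketched on a one-dimensional toy objective, before any backpropagation is involved. The objective \((\theta - 3)^2\) and the learning rate are hypothetical choices for illustration:

```python
alpha = 0.1  # learning rate (hypothetical value)

def grad_E(theta):
    # Gradient of the toy objective E(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0
for _ in range(100):
    theta -= alpha * grad_E(theta)  # Delta theta = -alpha * dE/dtheta
print(theta)  # approaches the minimiser theta = 3
```

Each step moves \(\theta\) against the gradient; with a suitable learning rate, the iterates converge to the minimiser.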

\[\pder{E}{w_{ij}^l} = \pder{E}{z_{j}^l} \pder{z_{j}^l}{w_{ij}^l} = \delta_{j}^l \, h_i^{l-1},\]where \(\delta_{j}^l = \pder{E}{z_{j}^l}\) is known as the error of neuron \(j\) of layer \(l\) and is computed by applying the multivariate chain rule:

\[\begin{align*} \delta_{j}^l = \pder{E}{z_{j}^l} &= \sum_{k} \pder{E}{z_{k}^{l+1}} \pder{z_{k}^{l+1}}{z_{j}^l} && \text{multivariate chain rule}\\ & = \sum_{k} \delta_k^{l+1} \pder{z_{k}^{l+1}}{h_{j}^{l}} \pder{h_{j}^{l}}{z_{j}^l} && \text{chain rule}\\ & = \sum_{k} \delta_k^{l+1} w_{jk}^{l+1} \varphi'(z_{j}^l) && \text{as }z_k^{l+1} = \sum_j w_{jk}^{l+1} h_j^{l} + b_k^{l+1} \text{ and } h_j^l = \varphi(z_j^l)\\ & = \varphi'(z_{j}^l) \sum_{k} \delta_k^{l+1} w_{jk}^{l+1} \end{align*}\]Therefore, by computing \(\vec{\delta}^{l+1}\) at layer \(l+1\), we can compute \(\vec{\delta}^{l}\) at the previous layer \(l\). Starting from the output layer, the process is repeated until the input layer is reached. This iterative procedure for computing the gradient of \(E\) with respect to all the weights is known as the backpropagation algorithm. Since \(E\) is non-convex (except when there is no hidden layer), gradient descent will likely end up in a local minimum. The important question is to find a suitable local minimum, one where the resulting network generalises well to new examples.

Note that gradient descent is an optimisation algorithm for minimising a differentiable function, while backpropagation is the procedure for computing the gradient of the error with respect to the weights of a neural network. However, by abuse of language, it is common to refer to the whole process (gradient descent + backpropagation) as backpropagation.

### Matrix form

When implementing backpropagation, it is important to write the computations in matrix form so that efficient matrix-multiplication routines can be used. Here \(\vec{h}^{l}\) and \(\vec{\delta}^l\) are row vectors and \(\vec{W}^{l}\) is a matrix. We then have:

\[\Delta \vec{W}^{l} = -\alpha \left(\vec{h}^{l-1}\right)^T \vec{\delta}^l,\]with

\[\vec{h}^l = \varphi \left( \vec{h}^{l-1} \vec{W}^l + \vec{b}^l \right),\]and

\[\vec{\delta}^l = \varphi'(\vec{z}^l) \odot \vec{\delta}^{l+1} \left(\vec{W}^{l+1}\right)^T.\]
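Putting the three matrix-form equations together, here is a minimal NumPy sketch of one training loop, assuming a two-layer network with a \(\tanh\) hidden layer, a linear output layer, and the squared-error loss on a single training pair (all sizes, initial values and the learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
W1 = 0.1 * rng.normal(size=(3, 4)); b1 = np.zeros((1, 4))
W2 = 0.1 * rng.normal(size=(4, 2)); b2 = np.zeros((1, 2))

phi = np.tanh                     # hidden activation
def dphi(z):                      # its derivative
    return 1.0 - np.tanh(z) ** 2

x = rng.normal(size=(1, 3))       # one input, as a row vector
y = rng.normal(size=(1, 2))       # its desired output
alpha = 0.1                       # learning rate

for _ in range(200):
    # Forward pass: h^l = phi(h^{l-1} W^l + b^l), linear output layer.
    z1 = x @ W1 + b1
    h1 = phi(z1)
    z2 = h1 @ W2 + b2
    h2 = z2

    # Backward pass. For L = 1/2 ||h2 - y||^2 and a linear output,
    # the output-layer error is delta^2 = dL/dz^2 = h2 - y.
    d2 = h2 - y
    # delta^1 = phi'(z^1) ⊙ (delta^2 (W^2)^T)
    d1 = dphi(z1) * (d2 @ W2.T)

    # Updates: Delta W^l = -alpha (h^{l-1})^T delta^l
    W2 -= alpha * h1.T @ d2
    b2 -= alpha * d2
    W1 -= alpha * x.T @ d1
    b1 -= alpha * d1

print(0.5 * np.sum((h2 - y) ** 2))  # the loss shrinks towards 0
```

The backward pass reuses the quantities \(\vec{z}^l\) and \(\vec{h}^l\) saved during the forward pass, which is why implementations cache the intermediate activations.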