A problem with gradient descent parametrization

Despite being widely used, gradient descent has a flaw: it is very sensitive to the parametrization of the model.

To see this, consider a parameter $w$ of a model, for example a neural network. Say that a gradient descent iteration updates $w$ by $\Delta_\lambda w$, where $\lambda$ denotes the learning rate. Now re-parametrize $w$ by introducing $w_1$ and $w_2$ such that $w = w_1 + w_2$. By the chain rule, $w_1$ and $w_2$ each receive the same gradient as $w$, so a gradient descent iteration with learning rate $\lambda$ updates $w$ by $2 \times \Delta_\lambda w$, i.e. twice as much as before!
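
To make this concrete, here is a minimal numerical sketch. It assumes a toy quadratic loss $L(w) = (w - 3)^2$ (my own choice, not part of the argument above) and compares one gradient descent step on $w$ with one step on the split $w = w_1 + w_2$.

```python
def grad_L(w):
    """Gradient of the toy loss L(w) = (w - 3)**2 with respect to w."""
    return 2.0 * (w - 3.0)

lr = 0.1   # learning rate lambda
w = 1.0    # original parameter

# Plain gradient descent on w: Delta_lambda w = -lr * dL/dw
delta_w = -lr * grad_L(w)

# Reparametrize w = w1 + w2. By the chain rule, dL/dw1 = dL/dw2 = dL/dw,
# so each component receives the full gradient of the original parameter.
w1, w2 = 0.5, 0.5                     # any split with w1 + w2 == w
new_w1 = w1 - lr * grad_L(w1 + w2)
new_w2 = w2 - lr * grad_L(w1 + w2)
delta_w_reparam = (new_w1 + new_w2) - (w1 + w2)

print(delta_w)          # 0.4
print(delta_w_reparam)  # 0.8  -> twice the original update
```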

Similarly, we can convince ourselves by considering the reparametrization $w = C w_1$. The gradient with respect to $w_1$ is $C$ times the gradient with respect to $w$, and the resulting update is scaled by $C$ again when mapped back to $w$, so a gradient descent iteration updates $w$ by $C^2 \times \Delta_\lambda w$. By contrast, Newton's method discards the scaling factor $C^2$: the inverse Hessian with respect to $w_1$ picks up a factor $1/C^2$ that exactly cancels it.
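
The same toy loss can illustrate the scaling case. The sketch below assumes $w = C w_1$ with $C = 5$ and shows the gradient descent update on $w$ growing by a factor of $C^2 = 25$.

```python
def grad_L(w):
    """Gradient of the toy loss L(w) = (w - 3)**2 with respect to w."""
    return 2.0 * (w - 3.0)

lr = 0.1
C = 5.0
w = 1.0

# Plain gradient descent on w.
delta_w = -lr * grad_L(w)

# Gradient descent on w1, where w = C * w1.
# Chain rule: dL/dw1 = C * dL/dw, so w1 moves by -lr * C * dL/dw
# and w = C * w1 moves by -lr * C**2 * dL/dw = C**2 * Delta_lambda w.
w1 = w / C
new_w1 = w1 - lr * C * grad_L(C * w1)
delta_w_reparam = C * new_w1 - C * w1

print(delta_w)          # 0.4
print(delta_w_reparam)  # 10.0  -> C**2 = 25 times the original update
```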

Written on October 20, 2015