Gradient identities

In machine learning, it is common to manipulate vectors instead of scalars. This post lists a few identities that can be helpful for quickly computing gradients over computational graphs. If in doubt, do not hesitate to derive the scalar identities first and then generalize them to vectors.

Let’s define the following functions:

$$f, g : \mathbb{R}^n \to \mathbb{R}, \qquad u, v : \mathbb{R}^n \to \mathbb{R}^m, \qquad h : \mathbb{R}^m \to \mathbb{R}$$

With the following conventions for the gradients and Jacobian matrices:

$$\nabla f(x) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{pmatrix} \in \mathbb{R}^n, \qquad J_u(x) = \begin{pmatrix} \dfrac{\partial u_1}{\partial x_1} & \cdots & \dfrac{\partial u_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial u_m}{\partial x_1} & \cdots & \dfrac{\partial u_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$
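For example, with the (arbitrary) choices $f(x) = x_1 x_2$ and $u(x) = (x_1^2,\; x_1 x_2)^\top$ for $x \in \mathbb{R}^2$, these conventions give:

$$\nabla f(x) = \begin{pmatrix} x_2 \\ x_1 \end{pmatrix}, \qquad J_u(x) = \begin{pmatrix} 2x_1 & 0 \\ x_2 & x_1 \end{pmatrix}$$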

Addition:

$$\nabla (f + g) = \nabla f + \nabla g$$

Multiplication:

$$\nabla (fg) = g\, \nabla f + f\, \nabla g$$

Division:

$$\nabla \left( \frac{f}{g} \right) = \frac{g\, \nabla f - f\, \nabla g}{g^2}$$
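These three identities follow directly from the scalar rules, applied component by component. For instance, for the multiplication rule:

$$\frac{\partial (fg)}{\partial x_i} = \frac{\partial f}{\partial x_i}\, g + f\, \frac{\partial g}{\partial x_i}, \qquad i = 1, \dots, n$$

and stacking these $n$ partial derivatives into a column vector gives $\nabla(fg) = g\, \nabla f + f\, \nabla g$.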

Composition:

$$\nabla (h \circ u)(x) = J_u(x)^\top\, \nabla h(u(x))$$

proof:

Let’s mask the inner function by a new variable $y = u(x)$. Using the multivariate chain rule, we get:

$$\frac{\partial (h \circ u)}{\partial x_j}(x) = \sum_{i=1}^{m} \frac{\partial h}{\partial y_i}(y)\, \frac{\partial u_i}{\partial x_j}(x), \qquad \text{i.e.} \qquad J_{h \circ u}(x) = J_h(y)\, J_u(x)$$

where the last product is a typical matrix multiplication. Therefore, since the Jacobian of a scalar-valued function is the transpose of its gradient, we have:

$$\nabla (h \circ u)(x) = J_{h \circ u}(x)^\top = J_u(x)^\top\, J_h(y)^\top = J_u(x)^\top\, \nabla h(u(x))$$
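As a quick sanity check, here is a minimal sketch of how this identity can be verified numerically (assuming NumPy, with an arbitrary, hypothetical choice of $u$ and $h$), by comparing it against a finite-difference gradient:

```python
import numpy as np

# Hypothetical test functions, chosen only for this check:
#   u : R^3 -> R^2   and   h : R^2 -> R
def u(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def h(y):
    return y[0] ** 2 + 3.0 * y[1]

def grad_h(y):
    # Gradient of h, written by hand
    return np.array([2.0 * y[0], 3.0])

def jac_u(x):
    # Jacobian of u: rows index outputs, columns index inputs (2 x 3)
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])

def numerical_gradient(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

x = np.array([0.5, -1.2, 0.3])
analytic = jac_u(x).T.dot(grad_h(u(x)))            # J_u(x)^T grad h(u(x))
numeric = numerical_gradient(lambda z: h(u(z)), x)  # finite-difference gradient
print(np.allclose(analytic, numeric, atol=1e-5))    # expected: True
```

The same finite-difference comparison works for the other identities as well.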

Dot product:

$$\nabla (u \cdot v)(x) = J_u(x)^\top\, v(x) + J_v(x)^\top\, u(x)$$

Note that $u \cdot v = u^\top v$, where the second product is the matrix multiplication.

proof:

Writing the dot product component-wise, $u \cdot v = \sum_{i=1}^{m} u_i v_i$, and applying the scalar product rule to each term:

$$\frac{\partial (u \cdot v)}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial u_i}{\partial x_j}\, v_i + u_i\, \frac{\partial v_i}{\partial x_j}$$

Stacking these partial derivatives into a vector gives $\nabla (u \cdot v) = J_u^\top v + J_v^\top u$.
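In particular, taking $v = u$ gives $\nabla \lVert u(x) \rVert^2 = 2\, J_u(x)^\top u(x)$, which is the form that appears when differentiating a squared-error loss.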

Written on April 11, 2015