In machine learning, it is common to manipulate vectors instead of scalars. This post lists a few identities that can help you quickly compute gradients over computational graphs. If in doubt, do not hesitate to derive the scalar identities first and then generalize them to vectors.
Let's define the following functions:

$$g : \mathbb{R}^n \to \mathbb{R}^m, \qquad f : \mathbb{R}^m \to \mathbb{R}, \qquad h = f \circ g : \mathbb{R}^n \to \mathbb{R}$$
With the following conventions for the gradients and Jacobian matrices:

$$\nabla f(y) = \begin{pmatrix} \frac{\partial f}{\partial y_1}(y) \\ \vdots \\ \frac{\partial f}{\partial y_m}(y) \end{pmatrix} \in \mathbb{R}^m, \qquad J_g(x) = \left( \frac{\partial g_i}{\partial x_j}(x) \right)_{i,j} \in \mathbb{R}^{m \times n}$$
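As a quick illustration of these conventions, here is a minimal sketch in JAX; the concrete choices of $f$ and $g$ below are arbitrary examples of mine, not from the post. `jax.grad` returns the gradient as a vector in $\mathbb{R}^m$, and `jax.jacobian` returns the $m \times n$ Jacobian:

```python
import jax
import jax.numpy as jnp

# Illustrative example functions (assumptions, not from the post):
# f : R^3 -> R and g : R^2 -> R^3, so m = 3 and n = 2.
f = lambda y: jnp.sum(y ** 2)
g = lambda x: jnp.stack([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])

x = jnp.array([1.0, 2.0])

grad_f = jax.grad(f)(g(x))      # gradient of f evaluated at g(x), shape (3,)
J_g = jax.jacobian(g)(x)        # Jacobian of g evaluated at x, shape (3, 2)

print(grad_f.shape, J_g.shape)  # (3,) (3, 2)
```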
Let's mask $g(x)$ by a new variable $y = g(x)$. Using the multivariate chain rule, we get:

$$\frac{\partial h}{\partial x_j}(x) = \sum_{i=1}^{m} \frac{\partial f}{\partial y_i}(y) \, \frac{\partial g_i}{\partial x_j}(x), \qquad \text{i.e.} \qquad \frac{\partial h}{\partial x}(x) = \frac{\partial f}{\partial y}(y) \, \frac{\partial g}{\partial x}(x)$$
where the last product is a typical matrix multiplication between the $1 \times m$ row vector $\frac{\partial f}{\partial y}(y)$ and the $m \times n$ Jacobian $\frac{\partial g}{\partial x}(x) = J_g(x)$. Transposing back to a column vector, we therefore have:

$$\nabla h(x) = J_g(x)^T \, \nabla f(g(x))$$
Note that $\nabla h(x)^T = \nabla f(g(x))^T \, J_g(x)$, where the product on the right-hand side is again a matrix multiplication.
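To sanity-check this identity numerically, here is a small self-contained sketch (reusing my illustrative $f$ and $g$ from above) that compares the gradient of the composition against $J_g(x)^T \, \nabla f(g(x))$:

```python
import jax
import jax.numpy as jnp

# Same illustrative f and g as above (my own choices, not from the post).
f = lambda y: jnp.sum(y ** 2)
g = lambda x: jnp.stack([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])
h = lambda x: f(g(x))  # h = f ∘ g : R^2 -> R

x = jnp.array([1.0, 2.0])

grad_h = jax.grad(h)(x)                           # left-hand side: ∇h(x)
chain = jax.jacobian(g)(x).T @ jax.grad(f)(g(x))  # right-hand side: J_g(x)^T ∇f(g(x))

print(jnp.allclose(grad_h, chain))  # True
```

This is exactly the rule that reverse-mode automatic differentiation applies at each node of a computational graph: multiply the incoming gradient by the transposed Jacobian of the local operation.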