Gradient identities
In machine learning, it is common to manipulate vectors rather than scalars. This post lists a few identities that are helpful for quickly computing gradients over computational graphs. If in doubt, derive the scalar identities first and then generalize them to vectors.
Let’s define the following functions:
\[\begin{align*} h&\colon \mathbb{R} \rightarrow \mathbb{R} \\ f&\colon \mathbb{R}^n \rightarrow \mathbb{R} \\ g&\colon \mathbb{R}^n \rightarrow \mathbb{R} \\ \mathbf{F}&\colon \mathbb{R}^n \rightarrow \mathbb{R}^m \\ \mathbf{G}&\colon \mathbb{R}^n \rightarrow \mathbb{R}^m \end{align*}\]We use the following conventions for gradients and Jacobian matrices:
\[\mathbf{F}(\vec{x})=\left[\begin{array}{c}F_1(\vec{x})\\ \vdots\\ F_m(\vec{x})\end{array}\right]\] \[\nabla f(\vec{x})=\left[\begin{array}{c}\pder{f}{x_1}(\vec{x})\\ \vdots\\ \pder{f}{x_n}(\vec{x})\end{array}\right]\] \[\mathbf{J}_\mathbf{F}(\vec{x})=\left[\begin{array}{ccc} \pder{F_1}{x_1}(\vec{x}) & \dots & \pder{F_1}{x_n}(\vec{x})\\ \vdots & \ddots & \vdots\\ \pder{F_m}{x_1}(\vec{x}) & \dots & \pder{F_m}{x_n}(\vec{x})\\ \end{array}\right]\]
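As a sanity check, these conventions map directly onto automatic differentiation tools. Here is a minimal sketch using JAX (the choice of JAX and the particular functions and values are illustrative, not part of the post):

```python
import jax.numpy as jnp
from jax import grad, jacobian

# Illustrative scalar field f : R^3 -> R and vector field F : R^3 -> R^2.
def f(x):
    return jnp.sum(x ** 2)

def F(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

grad_f = grad(f)(x)    # shape (3,): the vector of partials df/dx_i
J_F = jacobian(F)(x)   # shape (2, 3): entry (i, j) is dF_i/dx_j
```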
Addition:
\[\nabla ( f + g ) = \nabla f + \nabla g\]Multiplication:
\[\nabla (f \, g) = g \,\nabla f + f \,\nabla g\]Division:
\[\nabla\left(\frac{f}{g}\right) = \frac{g\nabla f - f\nabla g}{g^2}\]
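The addition, multiplication, and division rules can be checked numerically against automatic differentiation. Below is a minimal sketch in JAX, with illustrative choices of \(f\), \(g\), and the evaluation point:

```python
import jax.numpy as jnp
from jax import grad

# Illustrative scalar fields f, g : R^3 -> R.
def f(x):
    return jnp.sum(x ** 2)

def g(x):
    return jnp.prod(x) + 2.0

x = jnp.array([0.5, -1.0, 2.0])
gf, gg = grad(f)(x), grad(g)(x)

# Addition: grad(f + g) = grad f + grad g
assert jnp.allclose(grad(lambda x: f(x) + g(x))(x), gf + gg)
# Multiplication: grad(f g) = g grad f + f grad g
assert jnp.allclose(grad(lambda x: f(x) * g(x))(x), g(x) * gf + f(x) * gg)
# Division: grad(f / g) = (g grad f - f grad g) / g^2
assert jnp.allclose(grad(lambda x: f(x) / g(x))(x), (g(x) * gf - f(x) * gg) / g(x) ** 2)
```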
Composition:
\[\nabla(h \circ f) = (h' \circ f) \nabla f\] \[\nabla(f \circ \mathbf{F}) = \mathbf{J}_\mathbf{F}^\mathrm{T} \, (\nabla f \circ \mathbf{F})\]Proof:
Let’s write \(\vec{y} = \mathbf{F}(\vec{x})\). Using the multivariate chain rule, we get:
\[\pder{f \circ \mathbf{F}}{x_k}(\vec{x}) = \sum_i \pder{f}{y_i}(\vec{y}) \pder{F_i}{x_k}(\vec{x}) = [\mathbf{J}_\mathbf{F}(\vec{x})]_{:,k}^\mathrm{T} \, \nabla f (\vec{y}) = [\mathbf{J}_\mathbf{F}(\vec{x})]_{:,k}^\mathrm{T} \, \nabla f (\mathbf{F}(\vec{x})),\]where the last product is an ordinary matrix multiplication. Therefore, we have:
\[\nabla(f \circ \mathbf{F})(\vec{x}) = [\mathbf{J}_\mathbf{F}(\vec{x})]^\mathrm{T} \, \nabla f (\mathbf{F}(\vec{x})).\]
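The identity \(\nabla(f \circ \mathbf{F}) = \mathbf{J}_\mathbf{F}^\mathrm{T} \, (\nabla f \circ \mathbf{F})\) can likewise be verified numerically. A minimal JAX sketch, with illustrative \(f\), \(\mathbf{F}\), and evaluation point:

```python
import jax.numpy as jnp
from jax import grad, jacobian

# Illustrative f : R^2 -> R and F : R^3 -> R^2.
def f(y):
    return y[0] ** 2 + jnp.exp(y[1])

def F(x):
    return jnp.array([x[0] * x[1], x[1] + x[2]])

x = jnp.array([1.0, -0.5, 2.0])

lhs = grad(lambda x: f(F(x)))(x)         # gradient of the composition, shape (3,)
rhs = jacobian(F)(x).T @ grad(f)(F(x))   # J_F(x)^T (grad f)(F(x))
assert jnp.allclose(lhs, rhs)
```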
Dot product:
Note that \(\mathbf{F} \cdot \mathbf{G} = \mathbf{F}^\mathrm{T} \mathbf{G}\), where the second product is the matrix multiplication.
\[\nabla(\mathbf{F} \cdot \mathbf{G}) = \mathbf{J}^\mathrm{T}_\mathbf{F} \, \mathbf{G} + \mathbf{J}^\mathrm{T}_\mathbf{G} \, \mathbf{F}\]Proof:
\[\pder{(\mathbf{F} \cdot \mathbf{G})}{x_k} = \sum_i \left[ \pder{F_i}{x_k} G_i + F_i \pder{G_i}{x_k} \right] = [\mathbf{J}^\mathrm{T}_\mathbf{F} \, \mathbf{G}]_k + [\mathbf{J}^\mathrm{T}_\mathbf{G} \, \mathbf{F}]_k,\]which is the \(k\)-th component of the identity above.
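As with the other identities, the dot product rule can be checked numerically. A minimal JAX sketch with illustrative vector fields:

```python
import jax.numpy as jnp
from jax import grad, jacobian

# Illustrative vector fields F, G : R^3 -> R^2.
def F(x):
    return jnp.array([x[0] * x[1], x[2] ** 2])

def G(x):
    return jnp.array([jnp.sin(x[0]), x[1] + x[2]])

x = jnp.array([0.3, 1.5, -2.0])

lhs = grad(lambda x: jnp.dot(F(x), G(x)))(x)             # gradient of F . G
rhs = jacobian(F)(x).T @ G(x) + jacobian(G)(x).T @ F(x)  # J_F^T G + J_G^T F
assert jnp.allclose(lhs, rhs)
```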