KL divergence minimization and maximum likelihood
Let’s consider two distributions $p$ and $q$. The KL divergence of $p$ from $q$ is:
\[KL(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}\]Let’s now consider that $q$ is a parametrized distribution $q_\theta$ and $p$ is the empirical distribution $p_D$ over a dataset $D = \left\lbrace x_1, \dots, x_n \right\rbrace$ of distinct points, i.e. $p_D(x) = \frac{1}{n}$ for $x \in D$ and \(p_D(x) = 0\) otherwise. The KL divergence can then be re-written as:
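As a quick numerical sanity check of the definition (the two distributions below are made up for illustration), we can compute $KL(p||q)$ for discrete distributions over three outcomes:

```python
import math

# Two hypothetical discrete distributions over {0, 1, 2} (illustrative values)
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

def kl(p, q):
    """KL(p||q) = sum_x p(x) * ln(p(x)/q(x)), skipping terms with p(x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl(p, q))  # non-negative, and 0 only when p == q
print(kl(q, p))  # KL is asymmetric: in general kl(p, q) != kl(q, p)
```

Note that $KL(p||q) \geq 0$ with equality if and only if $p = q$, and that it is not symmetric in its two arguments.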
\[KL(p_D||q_\theta) = \sum_{i=1}^n \frac{1}{n} \ln \frac{\frac{1}{n}}{q_\theta(x_i)} = -\ln n - \frac{1}{n} \sum_{i=1}^n \ln q_\theta(x_i).\]Since $-\ln n$ does not depend on $\theta$, minimizing this KL divergence with respect to $\theta$ is equivalent to minimizing:
\[- \sum_{i=1}^n \ln q_\theta(x_i),\]which is the negative log likelihood of the data $D$. In other words, we have:
\[\operatorname{argmin}_\theta KL(p_D||q_\theta) = \theta_{MLE}.\]
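A minimal numerical sketch of this equivalence, assuming a made-up Bernoulli example (the dataset, grid, and function names are all illustrative): we minimize both $KL(p_D||q_\theta)$, written exactly as in the equation above, and the negative log likelihood over a grid of $\theta$ values, and compare to the closed-form Bernoulli MLE (the sample mean).

```python
import math

# Hypothetical dataset of coin flips (1 = heads); values chosen for illustration
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
n = len(data)

def bernoulli(theta, x):
    """q_theta(x) for a Bernoulli distribution with parameter theta."""
    return theta if x == 1 else 1 - theta

def neg_log_likelihood(theta, xs):
    """Negative log likelihood: - sum_i ln q_theta(x_i)."""
    return -sum(math.log(bernoulli(theta, x)) for x in xs)

def kl_empirical(theta, xs):
    """KL(p_D || q_theta) = sum_i (1/n) ln((1/n) / q_theta(x_i)),
    following the post's formula literally."""
    n = len(xs)
    return sum(math.log((1 / n) / bernoulli(theta, x)) / n for x in xs)

# Grid search over theta in (0, 1); both objectives are strictly convex here
grid = [i / 1000 for i in range(1, 1000)]
theta_kl = min(grid, key=lambda t: kl_empirical(t, data))
theta_nll = min(grid, key=lambda t: neg_log_likelihood(t, data))
theta_mle = sum(data) / n  # closed-form Bernoulli MLE: the sample mean

# Both objectives are minimized at the same theta, which matches the MLE
print(theta_kl, theta_nll, theta_mle)
```

The two objectives differ only by the constant $-\ln n$ and a factor $\frac{1}{n}$, neither of which depends on $\theta$, so their minimizers coincide.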
Written on September 15, 2014