KL divergence minimization and maximum likelihood

Let’s consider two distributions $p$ and $q$. The KL divergence of $p$ from $q$ is:

\[KL(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}\]
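
As a quick sanity check, here is a minimal sketch in numpy that evaluates this sum for two made-up discrete distributions (the numbers are purely illustrative):

```python
import numpy as np

# Two made-up discrete distributions over the same three-point support.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL(p || q) = sum_x p(x) * ln(p(x) / q(x)); by convention, terms with
# p(x) = 0 contribute 0 to the sum.
kl = np.sum(np.where(p > 0, p * np.log(p / q), 0.0))
print(kl)  # ≈ 0.0253
```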

Now suppose that $q$ is a parametrized distribution $q_\theta$ and $p$ is the empirical distribution $p_D$ over a dataset $D = \left\lbrace x_1, \dots, x_n \right\rbrace$ of distinct points, i.e. $p_D(x) = \frac{1}{n}$ for all $x \in D$ and $p_D(x) = 0$ otherwise. The KL divergence can then be rewritten as:

\[KL(p_D||q_\theta) = \sum_{i=1}^n \frac{1}{n} \ln \frac{\frac{1}{n}}{q_\theta(x_i)}.\]
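
Expanding the logarithm separates a constant from the part that depends on $\theta$:

\[KL(p_D||q_\theta) = \sum_{i=1}^n \frac{1}{n} \ln \frac{1}{n} - \sum_{i=1}^n \frac{1}{n} \ln q_\theta(x_i) = -\ln n - \frac{1}{n} \sum_{i=1}^n \ln q_\theta(x_i).\]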

The first term, $-\ln n$, does not depend on $\theta$, and the positive factor $\frac{1}{n}$ does not change the location of the minimum. Minimizing this KL divergence with respect to $\theta$ is therefore equivalent to minimizing the following quantity:

\[- \sum_{i=1}^n \ln q_\theta(x_i),\]

which is the negative log likelihood of the data $D$ under $q_\theta$. Minimizing the negative log likelihood is the same as maximizing the likelihood, so we have:

\[\arg\min_\theta KL(p_D||q_\theta) = \theta_{MLE}.\]
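
To see this equivalence concretely, here is a minimal numerical check (my own illustration, not from the post): fitting a Bernoulli model $q_\theta$ by grid search on the negative log likelihood recovers the closed-form MLE, which for a Bernoulli is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)  # synthetic coin flips (illustrative)

# Negative log likelihood of the data under a Bernoulli(theta) model q_theta.
# By the decomposition above, KL(p_D || q_theta) equals this quantity divided
# by n, minus the constant ln n, so both have the same minimizer.
def neg_log_likelihood(theta):
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

thetas = np.linspace(0.001, 0.999, 999)
theta_hat = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]

print(theta_hat)    # grid-search minimizer of the NLL
print(data.mean())  # closed-form Bernoulli MLE; matches up to grid resolution
```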