# KL divergence minimization and maximum likelihood

Let’s consider two distributions $p$ and $q$. The KL divergence of $q$ from $p$ is:

$KL(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}$
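As a quick numerical sketch of this definition (the two distributions below are illustrative values, not from the text), we can compute the KL divergence between two small discrete distributions and observe that it is asymmetric:

```python
import numpy as np

# Two hypothetical discrete distributions over 3 outcomes (illustrative values).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL(p||q) = sum_x p(x) ln(p(x)/q(x)), and the reverse divergence.
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

# Both are nonnegative, but KL(p||q) != KL(q||p) in general.
```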

Let’s now consider that $q$ is a parametrized distribution $q_\theta$ and $p$ is the empirical distribution $p_D$ over a dataset $D = \left\lbrace x_1, \dots, x_n \right\rbrace$ (assuming the $x_i$ are distinct), i.e. $p_D(x) = \frac{1}{n}$ for all $x \in D$ and $p_D(x) = 0$ otherwise. The KL divergence can then be rewritten as:

$KL(p_D||q_\theta) = \sum_{i=1}^n \frac{1}{n} \ln \frac{\frac{1}{n}}{q_\theta(x_i)}.$

Expanding the logarithm gives $KL(p_D||q_\theta) = -\ln n - \frac{1}{n} \sum_{i=1}^n \ln q_\theta(x_i)$. Since the term $-\ln n$ and the factor $\frac{1}{n}$ do not depend on $\theta$, minimizing this KL divergence with respect to $\theta$ is equivalent to minimizing the following quantity:

$- \sum_{i=1}^n \ln q_\theta(x_i),$

which is the negative log-likelihood of the data $D$; minimizing it is the same as maximizing the likelihood. In other words, we have:

$\arg\min_\theta KL(p_D||q_\theta) = \theta_{MLE}.$
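To check this equivalence numerically, here is a minimal sketch assuming a toy Bernoulli model $q_\theta$ and a hypothetical dataset of coin flips (both chosen for illustration): the $\theta$ minimizing $KL(p_D||q_\theta)$ over a grid coincides with the $\theta$ minimizing the negative log-likelihood.

```python
import numpy as np

# Hypothetical dataset of coin flips (1 = heads), chosen for illustration.
data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Empirical distribution p_D over the support {0, 1}.
p_hat = np.array([np.mean(data == 0), np.mean(data == 1)])

# Grid of candidate Bernoulli parameters theta.
thetas = np.linspace(0.01, 0.99, 99)

def kl_to_model(t):
    """KL(p_D || q_theta) for a Bernoulli(theta) model."""
    q = np.array([1 - t, t])
    return np.sum(p_hat * np.log(p_hat / q))

def neg_log_likelihood(t):
    """-sum_i ln q_theta(x_i)."""
    return -np.sum(np.log(np.where(data == 1, t, 1 - t)))

theta_kl = thetas[np.argmin([kl_to_model(t) for t in thetas])]
theta_mle = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]
# Both recover the closed-form Bernoulli MLE, mean(data) = 0.7.
```

Note that this dataset has repeated values, so $p_D$ assigns count-based probabilities rather than $\frac{1}{n}$ to each point; the cross-entropy term, and hence the minimizer, is the same either way.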
Written on September 15, 2014