
Expectation Maximization as Coordinate Ascent

I read a very useful description of the EM algorithm as coordinate ascent at this website and wanted to rederive some of the ideas.

The marginal log-likelihood is \begin{align} \log \pi(x; \theta) &= \log \int \pi(x, z; \theta) ~\mathrm{d}z \\
&= \log \int \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \tilde{\pi}(z)~\mathrm{d}z \\
&= \log \underset{z\sim \tilde{\pi}}{\mathbb{E}} \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \\
&\geq \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \\
&= \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x; \theta) \pi(z\vert x; \theta)}{\tilde{\pi}(z)} \\
&= \log \pi(x; \theta) + \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(z\vert x; \theta)}{\tilde{\pi}(z)} \\
&= \log \pi(x; \theta) - \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta)). \end{align} The blog post then advocates thinking of the EM algorithm as coordinate ascent on the ELBO \begin{align} \mathrm{ELBO}(\theta, \tilde{\pi}) = \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)}. \end{align} Clearly the ELBO depends on two arguments: the model parameter $\theta$ and the density $\tilde{\pi}$. Suppose that we fix $\theta = \theta'$ and consider maximizing the ELBO with respect to $\tilde{\pi}$. Then the representation \begin{align} \sup_{\tilde{\pi}} \mathrm{ELBO}(\theta', \tilde{\pi}) &= \sup_{\tilde{\pi}} \log \pi(x; \theta') - \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta')) \\
&= \log \pi(x; \theta') - \inf_{\tilde{\pi}} \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta')) \end{align} shows that the optimal choice of density is $\tilde{\pi}(z) = \pi(z\vert x; \theta')$, since the KL divergence is non-negative and vanishes exactly when its two arguments coincide.
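To make the expectation step concrete, here is a minimal numerical sketch, assuming a two-component, unit-variance Gaussian mixture with known parameters (the weights, means, and observation below are purely illustrative). It checks that choosing $\tilde{\pi}$ to be the exact posterior makes the ELBO coincide with the marginal log-likelihood, while any other choice of $\tilde{\pi}$ falls strictly below it.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-component, unit-variance Gaussian mixture:
#   z ~ Categorical(w),  x | z ~ Normal(mu[z], 1).
w = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
x = 0.5  # a single observation

def log_marginal(x):
    # log pi(x; theta) = log sum_z w_z N(x; mu_z, 1)
    return np.log(np.sum(w * norm.pdf(x, mu, 1.0)))

def posterior(x):
    # pi(z | x; theta) via Bayes' rule
    unnorm = w * norm.pdf(x, mu, 1.0)
    return unnorm / unnorm.sum()

def elbo(x, q):
    # ELBO(theta, q) = E_{z ~ q}[ log pi(x, z; theta) - log q(z) ]
    log_joint = np.log(w) + norm.logpdf(x, mu, 1.0)
    return np.sum(q * (log_joint - np.log(q)))

q_star = posterior(x)
print(log_marginal(x))                 # the marginal log-likelihood
print(elbo(x, q_star))                 # matches it: the KL term vanishes
print(elbo(x, np.array([0.5, 0.5])))   # any other q gives a strictly smaller value
```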

Now let us regard $\tilde{\pi}(z) = \pi(z\vert x; \theta')$ as fixed and consider maximizing the ELBO with respect to the parameter $\theta$. Then we obtain \begin{align} \sup_{\theta} \mathrm{ELBO}(\theta, \pi(z\vert x; \theta')) &= \sup_{\theta} \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\pi(z\vert x; \theta')} \\
&= \sup_{\theta} \left[ \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(x, z; \theta) - \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(z\vert x; \theta') \right] \\
&= \sup_{\theta} \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(x, z; \theta), \end{align} where the last equality holds because the second expectation does not depend on $\theta$. Therefore we can view EM as coordinate ascent on the ELBO, where in the expectation step we identify the optimal choice of $\tilde{\pi}(z)$ and in the maximization step we optimize the expected complete-data log-likelihood.
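For the same illustrative mixture (unknown weights and means, unit variances), the maximization step has a closed form: the expected complete-data log-likelihood is maximized by responsibility-weighted averages of the data. A sketch under those assumptions, with hypothetical helper names `e_step` and `m_step`:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu):
    # Responsibilities r[i, k] = pi(z_i = k | x_i; theta'), i.e. the optimal
    # tilde-pi at the current parameters of a unit-variance Gaussian mixture.
    log_r = np.log(w) + norm.logpdf(x[:, None], mu, 1.0)
    return np.exp(log_r - np.logaddexp.reduce(log_r, axis=1, keepdims=True))

def m_step(x, r):
    # argmax over theta of the expected complete-data log-likelihood; for this
    # mixture the maximizer is available in closed form as responsibility-weighted
    # averages of the observations.
    w = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return w, mu
```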

Hence the EM sequence of estimators may be defined inductively according to \begin{align} \theta_{n+1} = \underset{\theta}{\mathrm{argmax}}~ \underset{z\sim \pi(z\vert x; \theta_n)}{\mathbb{E}} \log \pi(x, z; \theta). \end{align} We can easily show that the log-likelihood is monotonically increasing for a sequence of estimators constructed in this manner. First observe that, because $\theta_{n+1}$ maximizes the ELBO with $\tilde{\pi}(z) = \pi(z\vert x; \theta_n)$ held fixed, \begin{align} & \mathrm{ELBO}(\theta_{n+1}, \pi(z\vert x; \theta_n)) \geq \mathrm{ELBO}(\theta_n, \pi(z\vert x; \theta_n)) \\
\implies& \log \pi(x; \theta_{n+1}) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_{n+1})) \geq \log \pi(x; \theta_n) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_n)) \\
\implies& \log \pi(x; \theta_{n+1}) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_{n+1})) \geq \log \pi(x; \theta_n), \end{align} where the last implication uses the fact that the KL divergence of a distribution from itself is zero. But the KL divergence is non-negative, and therefore \begin{align} \log \pi(x; \theta_{n+1}) \geq \log \pi(x; \theta_n). \end{align}
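Putting the two steps together, the following self-contained sketch (same illustrative unit-variance mixture, with synthetic data) runs the EM iteration and asserts at every step that the log-likelihood never decreases, matching the monotonicity argument above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from a two-component, unit-variance mixture (illustrative values).
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(2.0, 1.0, 300)])

def log_lik(x, w, mu):
    # Sum of log pi(x_i; theta) over the (independent) observations.
    return np.logaddexp.reduce(np.log(w) + norm.logpdf(x[:, None], mu, 1.0), axis=1).sum()

w, mu = np.array([0.5, 0.5]), np.array([0.0, 1.0])
prev = -np.inf
for _ in range(50):
    # E-step: responsibilities pi(z | x; theta_n).
    log_r = np.log(w) + norm.logpdf(x[:, None], mu, 1.0)
    r = np.exp(log_r - np.logaddexp.reduce(log_r, axis=1, keepdims=True))
    # M-step: theta_{n+1} maximizes the expected complete-data log-likelihood.
    w, mu = r.mean(axis=0), (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    curr = log_lik(x, w, mu)
    assert curr >= prev - 1e-9  # monotone increase, up to floating-point error
    prev = curr
print(w, mu)  # should approach the generating weights and means
```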