
Expectation Maximization as Coordinate Ascent

I read a very useful description of the EM algorithm as coordinate ascent at this website and wanted to rederive some of the ideas.

The marginal log-likelihood is \begin{align} \log \pi(x; \theta) &= \log \int \pi(x, z; \theta) ~\mathrm{d}z \\
&= \log \int \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \tilde{\pi}(z)~\mathrm{d}z \\
&= \log \underset{z\sim \tilde{\pi}}{\mathbb{E}} \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \\
&\geq \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)} \\
&= \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x; \theta) \pi(z\vert x; \theta)}{\tilde{\pi}(z)} \\
&= \log \pi(x; \theta) + \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(z\vert x; \theta)}{\tilde{\pi}(z)} \\
&= \log \pi(x; \theta) - \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta)). \end{align} The blog post then advocates thinking of the EM algorithm as coordinate ascent on the ELBO \begin{align} \mathrm{ELBO}(\theta, \tilde{\pi}) = \underset{z\sim \tilde{\pi}}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\tilde{\pi}(z)}. \end{align} Clearly the ELBO depends on two arguments: the model parameter $\theta$ and the density $\tilde{\pi}$. Suppose that we fix $\theta = \theta'$ and consider maximizing the ELBO with respect to $\tilde{\pi}$. Then the representation \begin{align} \sup_{\tilde{\pi}} \mathrm{ELBO}(\theta', \tilde{\pi}) &= \sup_{\tilde{\pi}} \log \pi(x; \theta') - \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta')) \\
&= \log \pi(x; \theta') - \inf_{\tilde{\pi}} \mathbb{KL}(\tilde{\pi}(z)\Vert \pi(z\vert x; \theta')) \end{align} shows that the optimal choice of density is $\tilde{\pi}(z) = \pi(z\vert x; \theta')$, since the KL divergence is non-negative and vanishes exactly when its two arguments coincide.
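To make the expectation step concrete, here is a minimal numerical sketch, assuming a two-component, unit-variance Gaussian mixture with known parameters (the weights, means, and observation below are purely illustrative). It checks that choosing $\tilde{\pi}$ to be the exact posterior makes the ELBO coincide with the marginal log-likelihood, while any other choice of $\tilde{\pi}$ falls strictly below it.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-component, unit-variance Gaussian mixture:
#   z ~ Categorical(w),  x | z ~ Normal(mu[z], 1).
w = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
x = 0.5  # a single observation

def log_marginal(x):
    # log pi(x; theta) = log sum_z w_z N(x; mu_z, 1)
    return np.log(np.sum(w * norm.pdf(x, mu, 1.0)))

def posterior(x):
    # pi(z | x; theta) via Bayes' rule
    unnorm = w * norm.pdf(x, mu, 1.0)
    return unnorm / unnorm.sum()

def elbo(x, q):
    # ELBO(theta, q) = E_{z ~ q}[ log pi(x, z; theta) - log q(z) ]
    log_joint = np.log(w) + norm.logpdf(x, mu, 1.0)
    return np.sum(q * (log_joint - np.log(q)))

q_star = posterior(x)
print(log_marginal(x))                 # the marginal log-likelihood
print(elbo(x, q_star))                 # matches it: the KL term vanishes
print(elbo(x, np.array([0.5, 0.5])))   # any other q gives a strictly smaller value
```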

Now let us regard $\tilde{\pi}(z) = \pi(z\vert x; \theta')$ as fixed and consider maximizing the ELBO with respect to the parameter $\theta$. Then we obtain \begin{align} \sup_{\theta} \mathrm{ELBO}(\theta, \pi(z\vert x; \theta')) &= \sup_{\theta} \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \frac{\pi(x, z; \theta)}{\pi(z\vert x; \theta')} \\
&= \sup_{\theta} \left[ \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(x, z; \theta) - \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(z\vert x; \theta') \right] \\
&= \sup_{\theta} \underset{z\sim \pi(z\vert x; \theta')}{\mathbb{E}} \log \pi(x, z; \theta), \end{align} where the last equality holds because the second expectation does not depend on $\theta$. Therefore we can view EM as coordinate ascent on the ELBO, where in the expectation step we identify the optimal choice of $\tilde{\pi}(z)$ and in the maximization step we optimize the expected complete-data log-likelihood.
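For the same illustrative mixture (unknown weights and means, unit variances), the maximization step has a closed form: the expected complete-data log-likelihood is maximized by responsibility-weighted averages of the data. A sketch under those assumptions, with hypothetical helper names `e_step` and `m_step`:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu):
    # Responsibilities r[i, k] = pi(z_i = k | x_i; theta'), i.e. the optimal
    # tilde-pi at the current parameters of a unit-variance Gaussian mixture.
    log_r = np.log(w) + norm.logpdf(x[:, None], mu, 1.0)
    return np.exp(log_r - np.logaddexp.reduce(log_r, axis=1, keepdims=True))

def m_step(x, r):
    # argmax over theta of the expected complete-data log-likelihood; for this
    # mixture the maximizer is available in closed form as responsibility-weighted
    # averages of the observations.
    w = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return w, mu
```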

Hence the EM sequence of estimators may be defined inductively according to \begin{align} \theta_{n+1} = \underset{\theta}{\mathrm{argmax}}~ \underset{z\sim \pi(z\vert x; \theta_n)}{\mathbb{E}} \log \pi(x, z; \theta). \end{align} We can easily show that the log-likelihood is monotonically increasing for a sequence of estimators constructed in this manner. First observe that, because $\theta_{n+1}$ maximizes the ELBO with $\tilde{\pi}(z) = \pi(z\vert x; \theta_n)$ held fixed, \begin{align} & \mathrm{ELBO}(\theta_{n+1}, \pi(z\vert x; \theta_n)) \geq \mathrm{ELBO}(\theta_n, \pi(z\vert x; \theta_n)) \\
\implies& \log \pi(x; \theta_{n+1}) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_{n+1})) \geq \log \pi(x; \theta_n) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_n)) \\
\implies& \log \pi(x; \theta_{n+1}) - \mathbb{KL}(\pi(z\vert x; \theta_n)\Vert \pi(z\vert x; \theta_{n+1})) \geq \log \pi(x; \theta_n), \end{align} where the last implication uses the fact that the KL divergence of a distribution from itself is zero. But the KL divergence is non-negative, and therefore \begin{align} \log \pi(x; \theta_{n+1}) \geq \log \pi(x; \theta_n). \end{align}
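Putting the two steps together, the following self-contained sketch (same illustrative unit-variance mixture, with synthetic data) runs the EM iteration and asserts at every step that the log-likelihood never decreases, matching the monotonicity argument above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from a two-component, unit-variance mixture (illustrative values).
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(2.0, 1.0, 300)])

def log_lik(x, w, mu):
    # Sum of log pi(x_i; theta) over the (independent) observations.
    return np.logaddexp.reduce(np.log(w) + norm.logpdf(x[:, None], mu, 1.0), axis=1).sum()

w, mu = np.array([0.5, 0.5]), np.array([0.0, 1.0])
prev = -np.inf
for _ in range(50):
    # E-step: responsibilities pi(z | x; theta_n).
    log_r = np.log(w) + norm.logpdf(x[:, None], mu, 1.0)
    r = np.exp(log_r - np.logaddexp.reduce(log_r, axis=1, keepdims=True))
    # M-step: theta_{n+1} maximizes the expected complete-data log-likelihood.
    w, mu = r.mean(axis=0), (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    curr = log_lik(x, w, mu)
    assert curr >= prev - 1e-9  # monotone increase, up to floating-point error
    prev = curr
print(w, mu)  # should approach the generating weights and means
```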