# Expectation-maximization algorithm

In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between an expectation (E) step, which computes the expected complete-data log-likelihood under the conditional distribution of the latent variables given the observed data and the current parameter estimate, and a maximization (M) step, which re-estimates the parameters by maximizing that expectation.


## Specification of the EM procedure

Let the observed variables be known as [itex]y[itex] and the latent variables as [itex]z[itex]. Together, [itex]y[itex] and [itex]z[itex] form the complete data. Assume that [itex]p[itex] is a joint model of the complete data with parameters [itex]\theta[itex]: [itex]p(y,z | \theta)[itex]. An EM algorithm will then iteratively improve an initial estimate [itex]\theta_0[itex] and construct new estimates [itex]\theta_1[itex] through [itex]\theta_N[itex]. An individual re-estimation step that derives [itex]\theta_{n+1}[itex] from [itex]\theta_n[itex] takes the following form (shown for the discrete case; the continuous case is similar):

[itex]\theta_{n+1} = \arg\max_{\theta} \sum_z p \left(z \,|\, y, \theta_n \right) \log p \left(y, z \,|\, \theta \right)[itex]

In other words, [itex]\theta_{n+1}[itex] is the value that maximizes (M) the expectation (E) of the complete data log-likelihood with respect to the conditional distribution of the latent data under the previous parameter value. This expectation is usually denoted as [itex]Q(\theta)[itex]:

[itex]Q(\theta) = \sum_z p \left(z \,|\, y, \theta_n \right) \log p \left(y, z \,|\, \theta \right)[itex]

Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function [itex]Q[itex]. Once the parameters of [itex]Q[itex] are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
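To make the "parameters of Q" concrete, here is a minimal sketch for a toy model that is not taken from the text: each observation is the number of heads in a fixed number of flips of one of two coins, chosen with equal probability, and θ holds the two head probabilities. The data, the starting estimate and all function names are illustrative assumptions. Because the observations are independent, the sum over z decomposes into one sum per observation, and the E step reduces to computing the conditional probabilities p(z | y, θ_n).

```python
import math

def log_binom_pmf(heads, flips, p):
    """Log-probability of `heads` successes in `flips` Bernoulli(p) trials."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p) + (flips - heads) * math.log(1.0 - p))

def e_step(heads_counts, flips, theta_n):
    """Return p(z | y_i, theta_n) for each observation: the fixed 'parameters' of Q."""
    posteriors = []
    for h in heads_counts:
        # Joint p(y_i, z | theta_n), with prior probability 1/2 for each coin.
        joint = [0.5 * math.exp(log_binom_pmf(h, flips, p)) for p in theta_n]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors

def q_function(theta, heads_counts, flips, posteriors):
    """Q(theta) = sum_i sum_z p(z | y_i, theta_n) * log p(y_i, z | theta)."""
    return sum(
        post[z] * (math.log(0.5) + log_binom_pmf(h, flips, theta[z]))
        for h, post in zip(heads_counts, posteriors)
        for z in range(2)
    )

def m_step(heads_counts, flips, posteriors):
    """Closed-form maximizer of Q: posterior-weighted fraction of heads per coin."""
    return [
        sum(post[z] * h for h, post in zip(heads_counts, posteriors))
        / sum(post[z] * flips for post in posteriors)
        for z in range(2)
    ]

heads_counts = [5, 9, 8, 4, 7]   # made-up data: heads out of 10 flips per trial
theta_n = [0.6, 0.5]             # arbitrary current estimate
posteriors = e_step(heads_counts, 10, theta_n)
theta_next = m_step(heads_counts, 10, posteriors)
```

Once `posteriors` is computed, `q_function` depends on `theta` alone, which is the sense in which Q is "fully determined" before the M step.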

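It can be shown that an EM iteration does not decrease the observed data likelihood function, and that, under mild regularity conditions, the only stationary points of the iteration are the stationary points of the observed data likelihood function. In practice, this means that an EM algorithm typically converges to a local maximum of the observed data likelihood function. Convergence to a saddle point is also possible, and which stationary point is reached depends on the initial estimate, so a global maximum is not guaranteed.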
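The monotonicity claim can be sketched in a few lines for the discrete case. The auxiliary function H below is introduced here for the argument and does not appear in the text; the inequality in the third line is Gibbs' inequality (non-negativity of the Kullback-Leibler divergence).

```latex
\begin{align}
\log p(y \,|\, \theta)
  &= \sum_z p(z \,|\, y, \theta_n) \log \frac{p(y, z \,|\, \theta)}{p(z \,|\, y, \theta)}
   = Q(\theta) + H(\theta), \\
H(\theta)
  &= -\sum_z p(z \,|\, y, \theta_n) \log p(z \,|\, y, \theta), \\
H(\theta) - H(\theta_n)
  &= \sum_z p(z \,|\, y, \theta_n) \log \frac{p(z \,|\, y, \theta_n)}{p(z \,|\, y, \theta)} \;\ge\; 0, \\
\log p(y \,|\, \theta_{n+1}) - \log p(y \,|\, \theta_n)
  &= \bigl[ Q(\theta_{n+1}) - Q(\theta_n) \bigr] + \bigl[ H(\theta_{n+1}) - H(\theta_n) \bigr] \;\ge\; 0 .
\end{align}
```

The first bracket in the last line is non-negative because the M step chooses θ_{n+1} to maximize Q, and the second is non-negative by the third line, so any step that increases Q cannot decrease the observed data log-likelihood.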

EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if it is known which component generated each observation.
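The mixture-of-Gaussians case can be sketched as follows for one-dimensional data with two components. This is an illustrative sketch rather than a reference implementation: the synthetic data, the starting values, the iteration count and the variance floor are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Observed data log-likelihood: sum_i log sum_k w_k N(x_i; m_k, v_k)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    # E step: responsibilities resp[i][k] = p(z_i = k | x_i, current parameters).
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: closed-form weighted estimates for each component.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)                      # expected component count
        mean_k = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / n)
        new_m.append(mean_k)
        new_v.append(max(var_k, 1e-6))  # variance floor: a practical guard, not part of EM
    return new_w, new_m, new_v

random.seed(0)
# Synthetic data (assumption): random.gauss takes a standard deviation, not a variance.
data = ([random.gauss(-2.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 0.5) for _ in range(200)])

weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(50):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

Each M-step line is the ordinary Gaussian maximum likelihood estimator with every observation weighted by its responsibility, and `history` records the observed data log-likelihood after each iteration.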

"Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models. Another example is the EM algorithm for fitting a mixture density model.

An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step, rather than maximum likelihood.
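In the usual formulation, the MAP variant leaves the E step unchanged and has the M step maximize Q(θ) plus the log prior of θ. As a small illustration, assume θ is a vector of mixing weights with a Dirichlet prior. The prior, the pseudo-count values and the function names are choices made for this sketch and are not prescribed by the text.

```python
def ml_weights(expected_counts):
    """Maximum likelihood M step: normalize the expected component counts."""
    total = sum(expected_counts)
    return [c / total for c in expected_counts]

def map_weights(expected_counts, alpha):
    """MAP M step under a Dirichlet(alpha) prior on the mixing weights.

    Maximizes sum_k (N_k + alpha_k - 1) * log(w_k) subject to sum_k w_k = 1,
    which gives w_k proportional to N_k + alpha_k - 1.
    """
    if any(a < 1.0 for a in alpha):
        raise ValueError("this sketch assumes every alpha_k >= 1")
    numerators = [c + a - 1.0 for c, a in zip(expected_counts, alpha)]
    total = sum(numerators)
    if total <= 0.0:
        raise ValueError("no counts and a flat prior: the mode is undefined")
    return [v / total for v in numerators]

# Expected counts as an E step might produce them (made-up numbers).
expected_counts = [0.0, 7.5, 2.5]
ml_w = ml_weights(expected_counts)
map_w = map_weights(expected_counts, [2.0, 2.0, 2.0])
```

With these numbers the maximum likelihood update assigns the first component a weight of exactly zero, while the MAP update keeps it positive. With all pseudo-counts equal to 1 the prior is flat and the two updates coincide.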

There are other methods for finding maximum likelihood estimates, such as gradient ascent (equivalently, gradient descent on the negative log-likelihood), conjugate gradient, or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

## Incremental versions

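The classic EM procedure replaces both Q and θ with their optimal (argmax) values at each iteration. In this section, Q denotes a distribution over the latent variables (in the exact E step, the conditional distribution [itex]p(z | y, \theta_n)[itex]) rather than the function [itex]Q(\theta)[itex] defined above. However, it can be shown (see Neal & Hinton, 1999) that finding a Q and θ that merely improve on their current values will also ensure successful convergence.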

For example, to improve Q, we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,

[itex]Q=\prod_i Q_i. \![itex]

Thus, at each E step, we compute a variational approximation to the exact conditional distribution of the latent variables.

To improve θ, we could use any hill-climbing method that yields some improvement, without requiring the optimal θ. This approach is known as generalized EM (GEM).
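A partial M step can be sketched as follows, assuming an equal-weight mixture of two unit-variance Gaussians whose means are the only unknown parameters. The model, the data, the step size and the function names are illustrative assumptions. Instead of jumping to the maximizer of Q, each iteration takes one scaled gradient step on the means. For this model Q is a concave quadratic in each mean, so any step fraction in (0, 1] increases Q, and a fraction of 1 recovers the exact M step.

```python
import math
import random

def responsibilities(data, means):
    """Exact E step: p(z_i = k | x_i, means) for an equal-weight, unit-variance mixture."""
    resp = []
    for x in data:
        # Equal weights and the shared normalizing constant cancel in the ratio.
        joint = [math.exp(-0.5 * (x - m) ** 2) for m in means]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def log_likelihood(data, means):
    norm = 0.5 / math.sqrt(2.0 * math.pi)   # mixing weight 1/2 times Gaussian constant
    return sum(
        math.log(sum(norm * math.exp(-0.5 * (x - m) ** 2) for m in means))
        for x in data
    )

def gem_step(data, means, step=0.3):
    """One GEM iteration: exact E step, then a single hill-climbing step on Q."""
    resp = responsibilities(data, means)
    new_means = []
    for k, m in enumerate(means):
        nk = sum(r[k] for r in resp)
        grad = sum(r[k] * (x - m) for r, x in zip(resp, data))   # dQ/dmu_k
        # Move a fraction `step` of the way toward the weighted mean.
        new_means.append(m + step * grad / nk)
    return new_means

random.seed(1)
data = ([random.gauss(-2.0, 1.0) for _ in range(150)]
        + [random.gauss(2.0, 1.0) for _ in range(150)])

means = [-0.5, 0.5]
history = [log_likelihood(data, means)]
for _ in range(200):
    means = gem_step(data, means)
    history.append(log_likelihood(data, means))
```

Because the E step here is exact and every partial M step increases Q, the values in `history` should not decrease, although more iterations are needed than with the full M step.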

## Relation to variational Bayes methods

EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over θ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat θ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including θ) and optimize them one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this is easy to do as each variable's new Q depends only on its Markov blanket, so local message passing can be used for efficient inference.

## References

• Arthur Dempster, Nan Laird, and Donald Rubin. "Maximum likelihood from incomplete data via the EM algorithm". Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
• Radford Neal and Geoffrey Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (editor), Learning in Graphical Models, pp. 355–368. Cambridge, MA: MIT Press, 1999.
