The Expectation Maximization (EM) algorithm is an iterative optimization algorithm commonly used in machine learning and statistics to estimate the parameters of probabilistic models, where some of the variables in the model are hidden or unobserved.

## Expectation Maximization Algorithm Uses: Examples

The EM algorithm is used throughout statistics applications in fields such as neuroscience when inferring hierarchical parameters of a Bayesian model. It is also useful for statistical modeling in computer vision, genetics, finance and a number of other fields.

The EM algorithm consists of two main steps: the expectation (E) step and the maximization (M) step.

## What Is the EM Algorithm?

The EM algorithm is a widely used optimization algorithm in machine learning and statistics, particularly for estimating the parameters of probabilistic models where some variables are hidden or unobserved. The EM algorithm is frequently used in clustering, where the assignment of data points to clusters is uncertain or ambiguous. The Gaussian Mixture Model (GMM) is one of the more popular clustering algorithms that uses the EM algorithm to estimate cluster parameters and assignments.

The EM algorithm is also useful for missing data estimation. The observed data is used to estimate the probability distribution of the missing data, and the algorithm iteratively estimates the missing values. The EM algorithm is also useful for estimating latent variable models like hidden Markov models or hierarchical Bayesian models.

## EM Algorithm Steps

There are three main steps in the EM algorithm. We’ll go over the steps in the context of a Gaussian Mixture Model. Specifically, we assume that our data points belong to either of two Gaussian distributions:

First is the **initialization** step. We randomly initialize the parameters of the Gaussian distributions: µ_{1}, σ_{1}, µ_{2}σ_{2}.

In this illustration, we have nine “missing” data points (**z**) in one-dimensional space that we believe belong to either of two distributions (red or green). We can do this calculation easily using the probability density functions of the Gaussians.

Next is the **expectation **step. Here, we compute the probability of each possible value for the points. This is straightforward using the probability density function:

In our case, we plug in one of our points *z*_{1} for x, and we use either model’s parameters: µ_{1} or µ_{2} and σ_{1} or σ_{2} to get relative probabilities that *z*_{1} belongs to either of the two distributions. Finally, we compute the “posterior” using Bayes Rule for the class probability. For example, our updated probability that *z*_{1} in the first distribution is:

We do this for all of the points, and now have a probability of belonging to either distribution for each point. Here is what the probabilities for belonging to the first distribution might look like:

Next is the **maximization** step. Here we update the parameters of the model to their most likely values, given the assignments we just gave out unknown data points. For the new µ_{1}, we just take a weighted average of all the points, weighted by the probability that they belong to the first distribution. So our new µ_{1 }is:

where *r*_{1} is just the probability that *z*_{1} belongs to the first (red) distribution. Remember, this was computed in Step 2. The new σ_{1} is similar, we just take the weighted-average squared distance of the points from our new mean.* *

We do the same thing for µ_{2} and σ_{2}. Out updated distributions might now look something like this:

Now, all we do is repeat the **expectation **and **maximization **steps until convergence. And that’s the EM algorithm!

## Limitations of the EM Algorithm

Although the EM algorithm is a powerful statistical tool, it has some limitations. One of the main limitations is that it can get stuck in local optima. This means that the algorithm may not always converge to the global maximum likelihood estimates of the parameters.

Another limitation is that the algorithm assumes that the data is generated from a specific statistical model, which may not always be the case. For example, we may or may not actually have two Gaussian distributions generating our data but the EM algorithm of course relies on this assumption. Finally, the algorithm can be slow for larger datasets and more complicated models.