Bayesian Inference

Part II — Gaussian Mixtures & EM

Alexander Hoyle

March 2026

Part II — Gaussian Mixture Models and Expectation Maximization

Gaussian Distribution & Multivariate Data

The Gaussian Distribution

The most important distribution in statistics:

\(p(x \mid \textcolor{#3A7CA5}{\mu}, \textcolor{#3A7CA5}{\sigma^2}) = \dfrac{1}{(2\pi\textcolor{#3A7CA5}{\sigma^2})^{1/2}} \exp\!\left(-\dfrac{1}{2\textcolor{#3A7CA5}{\sigma^2}}(x - \textcolor{#3A7CA5}{\mu})^2\right)\)

Focus on the inner part: \(\;-(x - \textcolor{#3A7CA5}{\mu})^2\)

As we move further from the mean, density drops off exponentially in the squared distance. Many quantities cluster this way: measurement errors, for example, are typically small, and large deviations are increasingly rare.

The variance \(\textcolor{#3A7CA5}{\sigma^2}\) scales this: low variance → more confidence that values lie near the mean, and density drops rapidly as we move away. Large variance → the "cost" of moving away from the mean is smaller.
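A minimal numpy sketch of the density (illustration only, not part of the lecture code):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x | mu, sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Density falls off fast away from the mean; a larger variance flattens the falloff.
near = gaussian_pdf(0.5, 0.0, 1.0)
far = gaussian_pdf(2.5, 0.0, 1.0)
ratio_small_var = near / far                                                   # ~20x
ratio_large_var = gaussian_pdf(0.5, 0.0, 4.0) / gaussian_pdf(2.5, 0.0, 4.0)    # ~2x
```

The same two points are roughly 20× apart in density under \(\sigma^2 = 1\) but only about 2× apart under \(\sigma^2 = 4\): the larger the variance, the smaller the "cost" of distance from the mean.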

Gaussian Explorer

[Interactive widget: sliders for the mean μ and standard deviation σ, with the 68 – 95 – 99.7 rule marked at ±1σ, ±2σ, ±3σ.]

Review: Maximum Likelihood

Say we observe data \(x_i \sim \mathcal{N}(\textcolor{#3A7CA5}{\mu}, \textcolor{#3A7CA5}{\sigma^2})\) for \(i = 1,\ldots,n\). Since each sampling event is independent:

\(\textcolor{#D4853A}{\mathcal{L}(\mu, \sigma^2, \mathbf{x})} = \displaystyle\prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right)\)

Combine the product (all share the same \(\sigma^2\)):

\(= \dfrac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\dfrac{1}{2\sigma^2}\displaystyle\sum_{i=1}^n (x_i - \mu)^2\right)\)

Take the log:

\(\textcolor{#D4853A}{\ell(\mu, \sigma^2, \mathbf{x})} = -\dfrac{n}{2}\log(2\pi) - \dfrac{n}{2}\log(\sigma^2) - \dfrac{1}{2\sigma^2}\displaystyle\sum_{i=1}^n (x_i - \mu)^2\)

Finding \(\textcolor{#3A7CA5}{\hat{\mu}}\)

Differentiate with respect to \(\textcolor{#3A7CA5}{\mu}\) and set to zero:

\(\dfrac{\partial\,\ell}{\partial\,\mu} = \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n (x_i - \mu)\)

\(= -\dfrac{n\mu}{\sigma^2} + \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n x_i\)

Setting to zero and solving for \(\mu\):

\(\dfrac{n\mu}{\sigma^2} = \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n x_i\)

\(\textcolor{#3A7CA5}{\hat{\mu}} = \dfrac{1}{n}\displaystyle\sum_{i=1}^n x_i\)

The MLE for the Gaussian mean is the sample mean — again!
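A quick numerical check of this result (a sketch; the grid search exists only to confirm the closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # simulated data with known sigma^2
sigma2 = 1.5 ** 2

def log_lik(mu):
    n = len(x)
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

mu_hat = x.mean()                                 # closed-form MLE: the sample mean
grid = np.linspace(0.0, 4.0, 4001)                # brute-force maximization
mu_grid = grid[np.argmax([log_lik(m) for m in grid])]
# mu_grid agrees with mu_hat up to the grid resolution (0.001)
```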

The Multivariate Gaussian

So far we've dealt with scalar random variables. But in practice, our data points often live in \(d\)-dimensional space:

A point \(\;\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top \in \mathbb{R}^d\)

height, weight → 2 dimensions
ages of \(n\) deer → \(n\) dimensions
rainfall across a grid → \(d\) = grid cells

We want to model the joint distribution over all \(d\) dimensions — and we might expect systematic relationships between them.

The mean \(\textcolor{#3A7CA5}{\boldsymbol{\mu}} \in \mathbb{R}^d\) is now a vector, and the variance becomes a \(d \times d\) covariance matrix \(\textcolor{#3A7CA5}{\boldsymbol{\Sigma}}\) that captures both spread along each dimension and correlations between dimensions.

The Multivariate Gaussian

\(p(\mathbf{x} \mid \textcolor{#3A7CA5}{\boldsymbol{\mu}}, \textcolor{#3A7CA5}{\boldsymbol{\Sigma}}) = \dfrac{1}{(2\pi)^{d/2}|\textcolor{#3A7CA5}{\boldsymbol{\Sigma}}|^{1/2}} \exp\!\left(-\dfrac{1}{2}(\mathbf{x} - \textcolor{#3A7CA5}{\boldsymbol{\mu}})^\top \textcolor{#3A7CA5}{\boldsymbol{\Sigma}}^{-1} (\mathbf{x} - \textcolor{#3A7CA5}{\boldsymbol{\mu}})\right)\)

The quantity \((\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) is the Mahalanobis distance — a generalization of the squared distance from the mean that accounts for the shape of the distribution.
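Transcribing the density directly (a sketch; for numerical work one would usually use a Cholesky factor or `np.linalg.solve` rather than an explicit inverse):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, written via the Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff       # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm
```

At \(\mathbf{x} = \boldsymbol{\mu}\) the Mahalanobis distance is zero, so the density equals the normalizing constant; with \(\boldsymbol{\Sigma} = \mathbf{I}\) in 2D that is \(1/(2\pi)\).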

In 2D, the covariance matrix \(\boldsymbol{\Sigma}\) is:

\(\textcolor{#3A7CA5}{\boldsymbol{\Sigma}} = \begin{bmatrix} \sigma_1^2 & \Sigma_{12} \\ \Sigma_{12} & \sigma_2^2 \end{bmatrix}\)

Diagonal entries \(\sigma_j^2\): variance (spread) along each axis

Off-diagonal \(\Sigma_{12} = \text{Cov}(x_1, x_2)\): how dimensions co-vary

We can parameterize the off-diagonal as \(\Sigma_{12} = \rho\,\sigma_1\sigma_2\), where \(\rho \in [-1, 1]\) is the correlation coefficient. This is convenient because it guarantees \(\boldsymbol{\Sigma}\) is positive semi-definite — a requirement for a valid covariance matrix.

Covariance shapes density contours

Diagonal covariance \(\textcolor{#3A7CA5}{\boldsymbol{\Sigma}} = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)\) — assumes dimensions are independent (\(\rho = 0\)):

\(p(\mathbf{x} \mid \textcolor{#3A7CA5}{\boldsymbol{\mu}}, \textcolor{#3A7CA5}{\boldsymbol{\sigma}}^2) = \dfrac{1}{(2\pi)^{d/2}\prod_{j=1}^d \sigma_j} \exp\!\left(-\dfrac{1}{2}\displaystyle\sum_{j=1}^d \dfrac{(x_j - \mu_j)^2}{\sigma_j^2}\right)\)

Spherical covariance \(\textcolor{#3A7CA5}{\boldsymbol{\Sigma}} = \sigma^2\mathbf{I}\) — same variance in every dimension. Contours are circles (2D) / spheres. This is the form we'll use for GMMs. All arguments generalize to full covariance.

2D Gaussian Explorer

[Interactive widget: sliders for the means, standard deviations, and correlation ρ; displays the covariance matrix and a Mahalanobis-distance view.]

Motivating the Gaussian Mixture Model

Distance in \(\mathbb{R}^d\)

Our data points now live in \(\mathbb{R}^d\): each \(\mathbf{x}_i\) is a vector with \(d\) components. The (squared) Euclidean distance between two points \(\mathbf{a}\) and \(\mathbf{b}\) in this space is

\(\|\mathbf{a} - \mathbf{b}\|^2 = \displaystyle\sum_{j=1}^d (a_j - b_j)^2\)

The Pythagorean theorem generalized to \(d\) dimensions.

We have a number of unlabeled points \(\mathbf{x}_i\) and a suspicion that they fall into \(k\) clusters \(C_j\). Our objective is to minimize the distance between each point and the mean of the cluster to which it belongs.

But we can't do this all at once: we need an iterative algorithm.

The \(k\)-means objective

We have \(k\) cluster centers \(\mathbf{u}_1, \ldots, \mathbf{u}_k \in \mathbb{R}^d\). Each point \(\mathbf{x}_i\) is assigned to exactly one cluster, forming \(k\) disjoint sets:

\(C_j = \{\mathbf{x}_i : \mathbf{x}_i \text{ is assigned to cluster } j\}\)   — the set of points in cluster \(j\)

\(|C_j|\) is the number of points in cluster \(j\), and every point belongs to exactly one \(C_j\).

We want to find the assignments and centers that minimize the total within-cluster sum of squared distances:

\(J = \displaystyle\sum_{j=1}^k \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \mathbf{u}_j\|^2\)

Each dashed line contributes \(\|\mathbf{x}_i - \mathbf{u}_j\|^2\) to the total objective

The \(k\)-means algorithm

After randomly initializing \(k\) cluster centers \(\mathbf{u}_j\), repeat until convergence:

Step 1 — Assign

For each point \(\mathbf{x}_i\), calculate \(d_{ij} = \|\mathbf{x}_i - \mathbf{u}_j\|^2\) for all centers. Assign to the nearest one.

Step 2 — Update means

Recompute each center as the mean of its assigned points:

\(\mathbf{u}_j = \frac{1}{|C_j|}\displaystyle\sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i\)
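The two steps can be sketched in a few lines of numpy (a minimal version; production implementations add k-means++ initialization, a convergence test, and empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # random init
    for _ in range(iters):
        # Step 1 -- assign each point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # d2[i, j] = ||x_i - u_j||^2
        assign = d2.argmin(axis=1)
        # Step 2 -- recompute each center as the mean of its assigned points
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```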

\(k\)-means Explorer

[Interactive widget: step through Assign / Update; displays the iteration count, the total objective \(J\), the per-cluster objective, and the cluster centers ◆ = \(\mathbf{u}_j\).]

Gaussian Mixture Model

In a Gaussian Mixture Model, we have a similar setup: a set of points that we believe are generated by \(k\) component distributions. The difference is that we now have an explicit probabilistic model of our data, a "story" of how our observations came to be.

Parameters:

\(\textcolor{#C2566E}{\boldsymbol{\phi}} = [\textcolor{#C2566E}{\phi_1},\ldots,\textcolor{#C2566E}{\phi_k}]\),  \(\textstyle\sum_{j=1}^k \textcolor{#C2566E}{\phi_j} = 1\)   — mixing weights (proportion from each component)

\(\textcolor{#3A7CA5}{\boldsymbol{\mu}} = [\textcolor{#3A7CA5}{\mu_1},\ldots,\textcolor{#3A7CA5}{\mu_k}]\)   — component means

\(\textcolor{#D4853A}{\boldsymbol{\sigma}^2} = [\textcolor{#D4853A}{\sigma^2_1},\ldots,\textcolor{#D4853A}{\sigma^2_k}]\)   — component variances

Generative story:

\(z_i \sim \text{Categorical}(z_i \mid \textcolor{#C2566E}{\boldsymbol{\phi}})\)   — pick a component

\(\mathbf{x}_i \sim \mathcal{N}\!\left(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_{z_i}},\; \textcolor{#D4853A}{\sigma^2_{z_i}}\mathbf{I}\right)\)   — sample from that Gaussian

\(\textcolor{#C2566E}{\phi_j}\) is the proportion of points generated by component \(j\) (so \(\sum_{j=1}^k \textcolor{#C2566E}{\phi_j} = 1\)). \(z_i\) is a positive integer between 1 and \(k\) that indexes which Gaussian \(\mathbf{x}_i\) is drawn from. However, the \(z_i\) are unobserved: they're latent variables.
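The generative story translates line for line (a sketch with made-up parameters; spherical components in 2D):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.5, 0.3, 0.2])                       # mixing weights (sum to 1)
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # component means
sigma2 = np.array([0.3, 0.3, 0.3])                    # spherical variances

def sample_gmm(n):
    z = rng.choice(len(phi), size=n, p=phi)              # z_i ~ Categorical(phi)
    x = rng.normal(mu[z], np.sqrt(sigma2[z])[:, None])   # x_i ~ N(mu_{z_i}, sigma2_{z_i} I)
    return x, z
```

Over many draws, the fraction of points with \(z_i = j\) approaches \(\phi_j\), matching the interpretation of the mixing weights.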

The generative story

[Interactive widget: a slider for \(\textcolor{#C2566E}{\phi_1}\), with the remaining weight split evenly between components 2 and 3; displays the component parameters and a running point count.]

Dashed circles show 2σ contour of each component. "Generate 1" animates: first \(z_i\) is drawn, then \(\mathbf{x}_i\) is sampled.

GMM as a graphical model

Generative story:

\(\textcolor{#C2566E}{\phi_j}\) — mixing weights  (fixed or learned)

\(\textcolor{#3A7CA5}{\mu_j}\), \(\textcolor{#D4853A}{\sigma^2_j}\) — component parameters

For \(i = 1, \ldots, n\):

\(z_i \sim \text{Categorical}(\textcolor{#C2566E}{\boldsymbol{\phi}})\)

\(\mathbf{x}_i \mid z_i \sim \mathcal{N}(\textcolor{#3A7CA5}{\mu_{z_i}}, \textcolor{#D4853A}{\sigma^2_{z_i}}\mathbf{I})\)

\(z_i\) is unshaded (latent — we don't observe it). \(\mathbf{x}_i\) is shaded (observed). The inner plate repeats over data points; the outer plate repeats over components.

[Plate diagram: \(\boldsymbol{\phi} \rightarrow z_i \rightarrow \mathbf{x}_i \leftarrow (\mu_j, \sigma^2_j)\); one plate repeats over data points \(i = 1 \ldots n\) (containing \(z_i\), \(\mathbf{x}_i\)), the other over components \(j = 1 \ldots k\) (containing \(\mu_j\), \(\sigma^2_j\)); \(z_i\) unshaded, \(\mathbf{x}_i\) shaded.]

Hard vs. soft assignments

\(k\)-means

Each point belongs to exactly one cluster

\(\mathbf{x}_i \in C_j\) — deterministic

GMM

Each point has a probability of coming from each component

\(P(z_i = j \mid \mathbf{x}_i)\) — probabilistic

The difference from \(k\)-means is that we now can label each point with a probability that it came from a particular component. Cluster assignments are no longer "hard".


EM for Gaussian Mixtures

An attempt at Maximum Likelihood

We have a probabilistic model with unknown parameters. Try MLE of the observed data \(\mathbf{x}\):

\(\textcolor{#D4853A}{\ell(\boldsymbol{\phi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)} = \displaystyle\sum_{i=1}^n \log p(\mathbf{x}_i \mid \boldsymbol{\phi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)\)

If we knew \(z_i\), we'd maximize \(p(\mathbf{x}_i, z_i)\). We don't — but since \(P(A) = \sum_B P(A, B)\), we can introduce and sum over \(z_i\):

\(= \displaystyle\sum_{i=1}^n \log \sum_{z_i \in \{1\ldots k\}} p(\mathbf{x}_i, z_i \mid \boldsymbol{\phi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)\)

Since \(P(A, B) = P(A \mid B)\,P(B)\):

\(= \displaystyle\sum_{i=1}^n \log \sum_{j=1}^k \mathcal{N}(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_j}, \textcolor{#D4853A}{\sigma^2_j}\mathbf{I})\,\textcolor{#C2566E}{\phi_j}\)

Note the sum inside the log. This makes differentiation painful — but for discrete \(z\) with small \(k\), it's manageable.

Differentiating w.r.t. \(\textcolor{#3A7CA5}{\mu_j}\)

We need to differentiate \(\ell = \sum_i \log \sum_{j=1}^k \mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})\,\phi_j\). For each \(i\), the inner sum has the form:

\(\log\!\Big(\;\underbrace{\phi_j \cdot \mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})}_{a \cdot f(\mu_j)} \;+\; \underbrace{\textstyle\sum_{j' \neq j} \phi_{j'} \cdot \mathcal{N}(\mathbf{x}_i \mid \mu_{j'}, \sigma^2_{j'}\mathbf{I})}_{\text{does not depend on } \mu_j}\;\Big)\)

Only one term in the sum depends on \(\mu_j\). The rest are constant w.r.t. \(\mu_j\).

By the chain rule, \(\;\frac{d}{dx}\log(af(x) + c) = \frac{a\,f'(x)}{af(x) + c}\;\):

\(\dfrac{\partial\,\ell}{\partial\,\mu_j} = \displaystyle\sum_{i=1}^n \frac{\phi_j \;\cdot\; \frac{\partial}{\partial\,\mu_j}\,\mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})}{\sum_{j'=1}^k \mathcal{N}(\mathbf{x}_i \mid \mu_{j'}, \sigma^2_{j'}\mathbf{I})\,\phi_{j'}}\)

Now we just need \(\frac{\partial}{\partial\,\mu_j}\mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})\). The Gaussian is proportional to \(\exp\!\left(-\frac{1}{2\sigma^2_j}\|\mathbf{x}_i - \mu_j\|^2\right)\)…

The responsibility

The Gaussian is \(\mathcal{N} \propto \exp(-\frac{1}{2\sigma^2_j}\|\mathbf{x}_i - \mu_j\|^2)\). Since \(\frac{d}{dx}e^{f(x)} = e^{f(x)}f'(x)\), the Gaussian comes back out and we get \(\frac{1}{\sigma^2_j}(\mathbf{x}_i - \mu_j)\):

\(\dfrac{\partial\,\ell}{\partial\,\mu_j} = \displaystyle\sum_{i=1}^n \underbrace{\frac{\mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})\,\textcolor{#C2566E}{\phi_j}}{\sum_{j'=1}^k \mathcal{N}(\mathbf{x}_i \mid \mu_{j'}, \sigma^2_{j'}\mathbf{I})\,\textcolor{#C2566E}{\phi_{j'}}}}_{\Large \textcolor{#C2566E}{r_{ij}}} \;\cdot\; \frac{1}{\textcolor{#D4853A}{\sigma^2_j}}(\mathbf{x}_i - \textcolor{#3A7CA5}{\mu_j})\)

Responsibility

\(\textcolor{#C2566E}{r_{ij}} = P(z_i = j \mid \mathbf{x}_i, \mu_j, \sigma^2_j, \phi_j)\)

The posterior probability that point \(\mathbf{x}_i\) came from component \(j\). The numerator is \(p(\mathbf{x}_i \mid z_i\!=\!j)\,p(z_i\!=\!j)\), normalized by \(\sum_{j'} p(\mathbf{x}_i \mid z_i\!=\!j')\,p(z_i\!=\!j')\) — this is just Bayes' rule!
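In code, the responsibility is exactly this Bayes'-rule ratio (a sketch for spherical components; `X` is \(n \times d\), `mu` is \(k \times d\)):

```python
import numpy as np

def responsibilities(X, mu, sigma2, phi):
    """r[i, j] = P(z_i = j | x_i): likelihood times prior, normalized over components."""
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # ||x_i - mu_j||^2
    lik = np.exp(-0.5 * sq / sigma2) / (2 * np.pi * sigma2) ** (d / 2)
    unnorm = lik * phi                                           # p(x_i | z_i=j) p(z_i=j)
    return unnorm / unnorm.sum(axis=1, keepdims=True)            # normalize: Bayes' rule
```

Each row sums to 1; a point sitting on a component mean gets nearly all of its responsibility from that component (unless its \(\phi_j\) is tiny).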

Solving for \(\textcolor{#3A7CA5}{\hat{\mu}_j}\)

Let's forget for a moment that \(\textcolor{#C2566E}{r_{ij}}\) is a function of our unknown parameters. Setting the derivative to zero:

\(0 = \displaystyle\sum_{i=1}^n \textcolor{#C2566E}{r_{ij}} \frac{1}{\sigma^2_j}(\mathbf{x}_i - \mu_j)\)

\(0 = \displaystyle\sum_{i=1}^n \textcolor{#C2566E}{r_{ij}} \frac{1}{\sigma^2_j}\mathbf{x}_i \;-\; \sum_{i=1}^n \textcolor{#C2566E}{r_{ij}} \frac{1}{\sigma^2_j}\mu_j\)

\(\textcolor{#3A7CA5}{\hat{\mu}_j} = \dfrac{\sum_{i=1}^n \textcolor{#C2566E}{r_{ij}}\,\mathbf{x}_i}{\sum_{i=1}^n \textcolor{#C2566E}{r_{ij}}}\)

Note the correspondence to the MLE for a single Gaussian (\(\hat{\mu} = \frac{1}{n}\sum x_i\)) and to step 2 of \(k\)-means. To see this, imagine \(\textcolor{#C2566E}{r_{ij}} = 0\) or \(1\). The sum \(\sum_i \textcolor{#C2566E}{r_{ij}}\) is the "effective number of points assigned to component \(j\)". We have a "soft" version of \(k\)-means — weighted means — where each point is weighted by its probability of being in a given cluster.

All parameter estimates

The other parameters follow a similar approach. Stating them without proof:

\(\textcolor{#3A7CA5}{\hat{\mu}_j} = \dfrac{\sum_i \textcolor{#C2566E}{r_{ij}}\,\mathbf{x}_i}{\sum_i \textcolor{#C2566E}{r_{ij}}}\)

Weighted mean

\(\textcolor{#C2566E}{\hat{\phi}_j} = \dfrac{1}{n}\displaystyle\sum_{i=1}^n \textcolor{#C2566E}{r_{ij}}\)

Average responsibility — the fraction of data "softly assigned" to \(j\)

\(\textcolor{#D4853A}{\hat{\sigma}^2_j} = \dfrac{\sum_i \textcolor{#C2566E}{r_{ij}}\,\|\mathbf{x}_i - \mu_j\|^2}{d\,\sum_i \textcolor{#C2566E}{r_{ij}}}\)

Weighted variance (the factor \(d\) averages over the \(d\) coordinates of the spherical covariance)
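Given a fixed \(n \times k\) matrix of responsibilities, each update is a line of numpy (a sketch; for spherical covariance in \(d\) dimensions the weighted sum of squared norms is also divided by \(d\), averaging over coordinates):

```python
import numpy as np

def m_step(X, r):
    """Weighted-mean / weight / weighted-variance updates for a spherical GMM."""
    n, d = X.shape
    nk = r.sum(axis=0)                                    # effective counts per component
    mu = (r.T @ X) / nk[:, None]                          # weighted means
    phi = nk / n                                          # average responsibilities
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # ||x_i - mu_j||^2
    sigma2 = (r * sq).sum(axis=0) / (d * nk)              # weighted per-coordinate variance
    return mu, phi, sigma2
```

With hard 0/1 responsibilities these reduce exactly to the \(k\)-means mean update and simple cluster fractions.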

The problem: all estimates are functions of \(\textcolor{#C2566E}{r_{ij}}\), which is itself a function of \(\textcolor{#3A7CA5}{\mu_j}\), \(\textcolor{#D4853A}{\sigma^2_j}\), and \(\textcolor{#C2566E}{\phi_j}\). There is no closed-form solution.

We need the responsibilities to estimate the parameters, but we need the parameters to compute the responsibilities…

\(k\)-means vs. EM estimates

Note the correspondence to \(k\)-means. Compare the estimates side by side:

\(k\)-means — hard assignment

\(\hat{\mu}_j = \dfrac{\sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i}{|C_j|}\)

\(r_{ij} \in \{0, 1\}\) — count points

Soft (\(r_{ij}\)) — probabilistic

\(\hat{\mu}_j = \dfrac{\sum_i \textcolor{#C2566E}{r_{ij}}\,\mathbf{x}_i}{\sum_i \textcolor{#C2566E}{r_{ij}}}\)

\(\textcolor{#C2566E}{r_{ij}} \in [0, 1]\) — weight by belief

To see this, imagine \(\textcolor{#C2566E}{r_{ij}} = 0\) or \(1\): the soft estimate reduces to the \(k\)-means estimate exactly. \(\sum_i \textcolor{#C2566E}{r_{ij}}\) is the "effective number of points assigned to component \(j\)". We have a "soft" version of \(k\)-means — weighted means — where each point is weighted by its probability of being in a given cluster.

Aside: a note on probability

It's helpful to reflect for a moment on what a probability distribution can be thought to represent. In some sense, they can encode our beliefs about the world — if we think of a Gaussian, a small variance corresponds to a stronger belief.

Think of your best guess for the height of the Burj Khalifa. Maybe around 1000 meters, give or take 500m? You could write down a distribution to convey that uncertainty:

\(\text{height}_{\text{guess}} \;\sim\; \mathcal{N}\!\left(\textcolor{#3A7CA5}{1000},\; \textcolor{#3A7CA5}{500^2}\right)\)

Then, if you measured the time for a coin dropped from the top to hit the ground, you could update your belief — perhaps depending on how much you trusted your measurement.

The benefit of probabilistic models is that, in a very rough sense, they allow us to manipulate beliefs about numbers almost as if they were numbers themselves. We can also update these beliefs based on observed data, thus reducing our uncertainty.

The EM algorithm

What would make our problem easier? If we knew the latent variables \(z_i\), we'd know which component each point came from, and \(\textcolor{#C2566E}{r_{ij}}\) would be 0 or 1 — basically a counting problem.

Since we don't know \(z_i\), we have an informed guess: the posterior \(\textcolor{#C2566E}{r_{ij}} = P(z_i = j \mid \mathbf{x}_i, \mu_j, \sigma^2_j, \phi_j)\). A posterior, roughly speaking, can be thought to represent the information we have about an unknown quantity given observed data.

But the posterior depends on unknown parameters, which would be easy to compute if we knew the posteriors…

The EM algorithm

EM presents an intuitive solution to this conundrum:

E-step

Proceed as if we knew the parameters (starting with some initial guess). Fix them in place and use them to calculate the posteriors \(\textcolor{#C2566E}{r_{ij}}\).

M-step

Use these posteriors to get better estimates of the parameters \(\textcolor{#3A7CA5}{\mu_j}\), \(\textcolor{#D4853A}{\sigma^2_j}\), \(\textcolor{#C2566E}{\phi_j}\).

These steps are analogous to the two steps in \(k\)-means. The only difference is that we'll be working with "beliefs" about the cluster assignments, represented by probability distributions, rather than fixed numeric values.
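Putting the two steps together (a minimal sketch for spherical components; random initialization, a fixed iteration budget, no convergence test or restarts):

```python
import numpy as np

def em_gmm(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)  # init at random points
    sigma2 = np.full(k, X.var())
    phi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: fix parameters, compute posteriors r[i, j] = P(z_i = j | x_i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        lik = np.exp(-0.5 * sq / sigma2) / (2 * np.pi * sigma2) ** (d / 2)
        r = lik * phi
        r /= r.sum(axis=1, keepdims=True)
        # M-step: fix posteriors, re-estimate parameters
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        phi = nk / n
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        sigma2 = (r * sq).sum(axis=0) / (d * nk)
    return mu, sigma2, phi
```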

Why the complete log-likelihood?

Up until now, we've focused on \(\log p(\mathbf{x} \mid \theta)\). Recall that this involved marginalizing out \(\mathbf{z}\):

\(\ell(\theta) = \displaystyle\sum_{i=1}^n \log \underbrace{\sum_{j=1}^k \mathcal{N}(\mathbf{x}_i \mid \mu_j, \sigma^2_j\mathbf{I})\,\phi_j}_{\text{sum inside the log}}\)

The trouble is the sum inside the log. Even though we have beliefs about \(\mathbf{z}\) (via the posterior \(\textcolor{#C2566E}{P(z_i = j \mid \mathbf{x}_i, \theta)} = \textcolor{#C2566E}{r_{ij}}\)), there's no way to incorporate them — the marginalization "uses up" \(\mathbf{z}\) before we can leverage what we know about it.

What if instead we work with the complete log-likelihood \(\log p(\mathbf{x}, \mathbf{z} \mid \theta)\), which keeps \(\mathbf{z}\) visible? Since we don't actually know \(\mathbf{z}\), we'll use our best guess — the posterior — to take an expectation.

Taking expectations of functions

We want to take the expectation of the complete log-likelihood — a function of the random variable \(\mathbf{z}\) — under our best-guess posterior. Recall what this means:

\(\mathbb{E}_{p(z)}[g(z)] = \displaystyle\sum_z g(z)\;p(z)\) — evaluate \(g\) at all possible values of \(z\), weight by how probable each value is, sum.

Example: let \(g(z) = z^2\) and \(z \in \{1, 2, 3\}\) with \(p(z\!=\!1)=0.5\), \(p(z\!=\!2)=0.3\), \(p(z\!=\!3)=0.2\).

\(\mathbb{E}[z^2] = 1^2 \cdot 0.5 + 2^2 \cdot 0.3 + 3^2 \cdot 0.2 = 0.5 + 1.2 + 1.8 = 3.5\)
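The same computation in a couple of lines of Python (the dictionary plays the role of \(p(z)\)):

```python
def g(z):
    return z ** 2

p = {1: 0.5, 2: 0.3, 3: 0.2}                          # a distribution over z
expectation = sum(g(z) * pz for z, pz in p.items())   # 0.5 + 1.2 + 1.8 = 3.5
```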

For us, \(g\) is the complete log-likelihood, and \(p(z)\) is the posterior \(p(\mathbf{z} \mid \mathbf{x}, \theta^l)\) — our current best guess about which component generated each point:

\(Q(\theta, \theta^l) = \displaystyle\sum_{\mathbf{z}} \left[\log p(\mathbf{x}, \mathbf{z} \mid \theta)\right]\; p(\mathbf{z} \mid \mathbf{x}, \theta^l)\)

\(\theta^l\) = current best guess;  \(\theta\) = free variable to maximize

The \(Q\) function and EM

\(Q(\theta, \theta^l) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \theta^l)}\!\left[\log p(\mathbf{x}, \mathbf{z} \mid \theta)\right]\)

E-step

Use current \(\theta^l\) to evaluate the posterior \(p(\mathbf{z} \mid \mathbf{x}, \theta^l)\)

M-step

\(\theta^{l+1} = \displaystyle\argmax_\theta\; Q(\theta, \theta^l)\)

So we can think of \(q = p(\mathbf{z} \mid \mathbf{x}, \theta^l)\) as our best guess, at a particular iteration, for the posterior over the latents; taking expectations under it is what lets us maximize over \(\theta\).

EM applied to GMMs

Let's write out the joint probability. Using an indicator \(\mathbb{I}(z_i = j)\) which is 1 when \(z_i = j\), 0 otherwise:

\(p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\phi}) = \displaystyle\prod_{i=1}^n \prod_{j=1}^k p(\mathbf{x}_i, z_i\!=\!j \mid \mu_j, \sigma^2_j, \phi_j)^{\,\mathbb{I}(z_i = j)}\)

Since \(p(\mathbf{x}_i, z_i\!=\!j) = p(\mathbf{x}_i \mid z_i\!=\!j)\,p(z_i\!=\!j)\):

\(= \displaystyle\prod_{i=1}^n \prod_{j=1}^k \left(\mathcal{N}(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_j}, \textcolor{#D4853A}{\sigma^2_j}\mathbf{I})\;\textcolor{#C2566E}{\phi_j}\right)^{\mathbb{I}(z_i = j)}\)

Why does this work? \(\mathbb{I}(z_i = j) = 1\) only for the component \(j\) that actually generated \(\mathbf{x}_i\). The exponent "selects" just that component's contribution. This is the complete likelihood \(\mathcal{L}_c\).

Take the log (products become sums, exponents come down):

\(\ell_c = \displaystyle\sum_{i=1}^n \sum_{j=1}^k \mathbb{I}(z_i = j)\left(\log\textcolor{#C2566E}{\phi_j} + \log\mathcal{N}(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_j}, \textcolor{#D4853A}{\sigma^2_j}\mathbf{I})\right)\)

Taking the expectation

Take the expectation of \(\ell_c\) under the posterior \(p(\mathbf{z} \mid \mathbf{x})\). By linearity, the expectation passes inside the sums:

\(\mathbb{E}_{p(\mathbf{z} \mid \mathbf{x})}[\ell_c] = \displaystyle\sum_{i=1}^n \sum_{j=1}^k \mathbb{E}\!\left[\mathbb{I}(z_i = j)\right] \left(\log\textcolor{#C2566E}{\phi_j} + \log\mathcal{N}(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_j}, \textcolor{#D4853A}{\sigma^2_j}\mathbf{I})\right)\)

The expectation of \(\mathbb{I}(z_i = j)\) under the posterior is just \(P(z_i = j \mid \mathbf{x}_i, \theta^l) = \textcolor{#C2566E}{r_{ij}}\):

\(Q(\theta, \theta^l) = \displaystyle\sum_{i=1}^n \sum_{j=1}^k \textcolor{#C2566E}{r_{ij}}\!\left(\log\textcolor{#C2566E}{\phi_j} + \log\mathcal{N}(\mathbf{x}_i \mid \textcolor{#3A7CA5}{\mu_j}, \textcolor{#D4853A}{\sigma^2_j}\mathbf{I})\right)\)

The log is now inside the sum — much easier to optimize than the observed likelihood

The E-step evaluates \(\textcolor{#C2566E}{r_{ij}}\) with fixed parameters; the M-step maximizes w.r.t. the parameters while keeping \(\textcolor{#C2566E}{r_{ij}}\) fixed. The resulting estimates are the same as those we derived earlier.

EM Explorer

Uses data from the generative model — generate points there first, then come here to recover the parameters

[Interactive widget: step through E-step / M-step; displays the iteration count, the log-likelihood, and true vs. estimated parameters. Colors = \(\textcolor{#C2566E}{r_{ij}}\); dashed = 2σ contours; ✕ = true centers.]


Toward Variational EM

A more general setup

We'll avoid getting too specific about a particular generative model. Say we've sampled IID random variables \(x_i \sim p(x_i \mid \theta)\) for \(i = 1,\ldots,N\). The log-likelihood is:

\(\ell(\theta) = \displaystyle\sum_{i=1}^N \log p(x_i \mid \theta)\)   ← the marginal log-likelihood

Introducing latent variables \(z_i\) and using the law of total probability:

\(= \displaystyle\sum_{i=1}^N \log \int p(x_i, z_i \mid \theta)\, dz_i\)

By indexing by \(i\), we've assumed one latent variable per observation. In principle there could be more — it wouldn't change these derivations. Note the integral: \(z\) can now be continuous (or discrete, replacing \(\int\) with \(\sum\)).

Why do we need this?

For GMMs, the true posterior \(p(z_i \mid x_i, \theta)\) was tractable — we could compute \(\textcolor{#C2566E}{r_{ij}}\) exactly. But many models have posteriors that are intractable:

Topic models (LDA)

\(z_i\) = topic assignment for each word in a document

The posterior over topic assignments couples all words in a document — the number of possible configurations is combinatorially explosive

Variational Autoencoders

\(z_i\) = continuous latent representation of an image

The posterior \(p(z \mid x)\) depends on a neural network decoder — no closed-form solution. We approximate it with another neural network \(q(z \mid x)\)

In these cases we need to approximate the posterior with a simpler distribution \(q\). The variational framework gives us a principled way to do this.

Introducing \(q\)

Now, for a key trick. We introduce a new distribution over the latent variables, \(q(z_i \mid \phi)\), where \(\phi\) now denotes the variational parameters (not the GMM mixing weights). This is just multiplying by 1:

\(\ell(\theta) = \displaystyle\sum_{i=1}^N \log \int \frac{p(x_i, z_i \mid \theta)}{q(z_i \mid \phi)}\; q(z_i \mid \phi)\, dz_i\)

This distribution \(q\) can be more or less anything, but typically we select a parametric distribution that we believe is a good candidate for the posterior \(p(z \mid x)\).

The power of this formulation: in many cases, \(q\) is an approximation to the true posterior (and could be, for instance, a neural network or a fully factorized distribution). This is termed variational EM.

Jensen's inequality

We could try to maximize the expression directly, but the integral inside the log makes life difficult. However, it is in the form of an expectation over \(q(z_i \mid \phi)\).

By Jensen's inequality, for concave \(f\): \(\;\mathbb{E}[f(x)] \leq f(\mathbb{E}[x])\). Since \(\log\) is concave:

\(\ell(\theta) = \displaystyle\sum_{i=1}^N \log\,\mathbb{E}_{q(z_i \mid \phi)}\!\left[\frac{p(x_i, z_i \mid \theta)}{q(z_i \mid \phi)}\right] \;\geq\; \sum_{i=1}^N \mathbb{E}_{q(z_i \mid \phi)}\!\left[\log \frac{p(x_i, z_i \mid \theta)}{q(z_i \mid \phi)}\right]\)

ELBO

This is a lower bound on the marginal log-likelihood — the evidence lower-bound. We maximize this bound in two iterative steps.
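The inequality is easy to check numerically (a sketch: take any distribution \(q\) over a few latent values and any positive ratios \(w = p(x, z)/q(z)\) at those values):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(5) + 0.1        # stand-in for the positive ratios p(x, z) / q(z)
q = rng.random(5)
q = q / q.sum()                # an arbitrary distribution over 5 latent values

log_of_mean = np.log(np.sum(q * w))   # log E_q[w]  -- the marginal log-likelihood side
mean_of_log = np.sum(q * np.log(w))   # E_q[log w]  -- the ELBO side
# Jensen: log is concave, so log_of_mean >= mean_of_log, with equality iff w is constant
```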

E-step: decomposing the ELBO

Starting from the ELBO:

\(\ell(\theta) \;\geq\; \displaystyle\sum_{i=1}^N \int \log \frac{p(x_i, z_i \mid \theta)}{q(z_i \mid \phi)}\; q(z_i \mid \phi)\,dz_i\)

Rewrite the joint using \(p(x_i, z_i \mid \theta) = p(z_i \mid \theta, x_i)\,p(x_i \mid \theta)\):

\(= \displaystyle\sum_{i=1}^N \int \log \frac{p(z_i \mid \theta, x_i)\;p(x_i \mid \theta)}{q(z_i \mid \phi)}\; q(z_i \mid \phi)\,dz_i\)

Split the log into two integrals. Since \(p(x_i \mid \theta)\) does not depend on \(z_i\) and \(\int q\,dz = 1\):

\(= \displaystyle\sum_{i=1}^N \left[\int \log \frac{p(z_i \mid \theta, x_i)}{q(z_i \mid \phi)}\; q(z_i \mid \phi)\,dz_i \;+\; \log p(x_i \mid \theta)\right]\)

\(= \underbrace{-\text{KL}\!\left(q(\mathbf{z} \mid \phi)\;\|\;p(\mathbf{z} \mid \theta, \mathbf{x})\right)}_{\leq\; 0} + \;\ell(\theta)\)

E-step: minimizing KL divergence

\(\text{ELBO} = -\text{KL}\!\left(q(\mathbf{z} \mid \phi)\;\|\;p(\mathbf{z} \mid \theta, \mathbf{x})\right) + \ell(\theta)\)

The KL divergence roughly measures the difference between two distributions. It is always \(\geq 0\), and equals zero when \(q = p\). (NB: it is not a true distance — it is not symmetric and doesn't satisfy the triangle inequality.)
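For discrete distributions the KL divergence is a one-liner, and its asymmetry is easy to see (a sketch; assumes both distributions have full support):

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = sum_z q(z) * log(q(z) / p(z)) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.5])
p = np.array([0.9, 0.1])
# kl(q, q) == 0, kl(q, p) > 0, and kl(q, p) != kl(p, q): not a true distance
```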

Since \(\ell(\theta)\) is fixed during the E-step, maximizing the ELBO w.r.t. \(\phi\) amounts to:

E-step:

\(\displaystyle\min_\phi\; \text{KL}\!\left(q(\mathbf{z} \mid \phi)\;\|\;p(\mathbf{z} \mid \theta, \mathbf{x})\right)\)

In words: make \(q\) as close as possible to the true posterior \(p(\mathbf{z} \mid \theta, \mathbf{x})\). In general, \(q\) can be an approximation that has a different functional form, but in the case of GMMs, it will be exactly equal to the true posterior.

M-step: maximizing the expected complete LL

In the E-step, the parameters \(\theta\) are fixed while we optimize \(q\). In the M-step, we fix \(q\) and maximize over \(\theta\). Rewriting the ELBO:

\(\text{ELBO} = \displaystyle\sum_{i=1}^N \mathbb{E}_{q(z_i \mid \phi)}\!\left[\log p(x_i, z_i \mid \theta)\right] - \mathbb{E}_{q(z_i \mid \phi)}\!\left[\log q(z_i \mid \phi)\right]\)

\(= \displaystyle\sum_{i=1}^N \mathbb{E}_{q(z_i \mid \phi)}\!\left[\log p(x_i, z_i \mid \theta)\right] + \text{H}(q)\)

\(\text{H}(q)\) is the entropy of \(q\) — it does not depend on \(\theta\) and can be ignored when maximizing over \(\theta\).

M-step:

\(\displaystyle\max_\theta\; \sum_{i=1}^N \mathbb{E}_{q(z_i \mid \phi)}\!\left[\log p(x_i, z_i \mid \theta)\right]\)

We can think of \(q\) as our best guess for the posterior; we take expectations under it in order to maximize over \(\theta\).

Summary

MLE

\(\max_\theta \log p(\mathbf{x} \mid \theta)\)

No latent variables, no prior

MAP

\(\max_\theta \log p(\mathbf{x} \mid \theta)p(\theta)\)

Add a prior on \(\theta\)

EM

\(\max_\theta\;\mathbb{E}_{p(z|x,\theta^l)}[\log p(\mathbf{x},\mathbf{z} \mid \theta)]\)

Latent variables, exact posterior

Variational EM

\(\max_\theta\;\mathbb{E}_{q(z|\phi)}[\log p(\mathbf{x},\mathbf{z} \mid \theta)]\)

Latent variables, approximate \(q\)

The intuition from the GMM case still applies in general: we have posteriors of the latent variables that get updated in the E-step using our best guesses for the parameters \(\theta\), and use these posteriors to update estimates of the parameters in the M-step.

In exact EM (GMMs), \(q = p(\mathbf{z} \mid \mathbf{x}, \theta)\) and the KL is zero. In variational EM, \(q\) is an approximation — it could be a factorized distribution, a neural network, or anything tractable. The gap between the ELBO and the true log-likelihood is exactly the KL divergence between \(q\) and the true posterior.