Alexander Hoyle
March 2026
If I flip this coin again, what is \(P(X = \texttt{H})\)?
½ ?
Define a probability mass function:
\(P(X = \texttt{H}) = \textcolor{#3A7CA5}{\theta}\) with \(\textcolor{#3A7CA5}{\theta} \in [0, 1]\)
meaning \(P(X = \texttt{T}) = 1 - \textcolor{#3A7CA5}{\theta}\)
A coin flip is a sample: \(X \sim P(X)\), where \(X \in \{\texttt{H}, \texttt{T}\}\)
Probability of one head: \(\textcolor{#3A7CA5}{\theta}\)
Two heads: \(\textcolor{#3A7CA5}{\theta} \cdot \textcolor{#3A7CA5}{\theta}\)
Our exact sequence H H T T H H H H H T:
\(= \textcolor{#E8A838}{\theta^7} \textcolor{#6BA3C7}{(1 - \theta)^3}\)
Usually we care about the number of heads, not the exact sequence.
How many ways to arrange 7 heads in 10 flips?
\(\binom{10}{7} = 120\) arrangements
\(P(X = k \mid \textcolor{#3A7CA5}{\theta}) = \binom{n}{k} \textcolor{#3A7CA5}{\theta}^k (1 - \textcolor{#3A7CA5}{\theta})^{n-k}\)
\(n\) independent Bernoulli trials, \(k\) successes
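As a quick sanity check, the binomial PMF above can be computed directly with Python's standard library (the function name here is our own):

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(X = k | theta): probability of k heads in n independent flips."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# 7 heads in 10 flips of a fair coin: 120 / 1024
print(binom_pmf(7, 10, 0.5))  # 0.1171875
```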
What \(\textcolor{#3A7CA5}{\theta}\) makes our data most likely?
\(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})} = P(\mathcal{D} \mid \textcolor{#3A7CA5}{\theta})\)
Key notational point: \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})}\) and \(P(\mathcal{D} \mid \theta)\) are the same quantity — but \(\textcolor{#D4853A}{\mathcal{L}}\) highlights that \(\textcolor{#3A7CA5}{\theta}\) is the variable of interest, not the data \(\mathcal{D}\).
We want to maximize \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})} = \binom{n}{k}\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\)
Taking derivatives of products like \(\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\) is not very fun. But since \(\log\) is monotonic, the max of \(f(x)\) is the same as the max of \(\log f(x)\):
Both peak at the same \(\theta\) — \(\log\) just reshapes the curve
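A small numerical check of this claim: over a grid of \(\theta\) values, maximizing the likelihood and maximizing its log pick out the same point (the grid resolution here is an arbitrary choice):

```python
from math import comb, log

def likelihood(theta, n=10, k=7):
    """Binomial likelihood of k heads in n flips."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Grid of theta values in (0, 1); both criteria peak at the same point
grid = [i / 1000 for i in range(1, 1000)]
argmax_lik = max(grid, key=likelihood)
argmax_log = max(grid, key=lambda t: log(likelihood(t)))
print(argmax_lik, argmax_log)  # both 0.7
```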
Start from the likelihood:
\(\textcolor{#D4853A}{\ell(\theta \mid \mathcal{D})} = \log\!\left[\binom{n}{k}\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\right]\)
The \(\log\) of a product is a sum of \(\log\)s:
\(= \log\binom{n}{k} + \log\!\left(\textcolor{#3A7CA5}{\theta}^k\right) + \log\!\left((1-\textcolor{#3A7CA5}{\theta})^{n-k}\right)\)
And by the power rule, \(\log(a^b) = b\log(a)\):
\(= \log\binom{n}{k} + k\log\textcolor{#3A7CA5}{\theta} + (n-k)\log(1-\textcolor{#3A7CA5}{\theta})\)
\(\textcolor{#D4853A}{\ell(\theta \mid \mathcal{D})} = \underbrace{\log\binom{n}{k}}_{\text{constant w.r.t. }\theta} + k\log\textcolor{#3A7CA5}{\theta} + (n-k)\log(1-\textcolor{#3A7CA5}{\theta})\)
\(\log\binom{n}{k}\) is constant w.r.t. \(\textcolor{#3A7CA5}{\theta}\) and disappears when differentiating. Set the derivative to zero:
\(\dfrac{\partial\,\ell}{\partial\,\theta} = \dfrac{k}{\theta} - \dfrac{n-k}{1-\theta} = 0\)
\(\dfrac{k}{\theta} = \dfrac{n-k}{1-\theta}\)
\(k(1-\theta) = (n-k)\theta\)
\(k - k\theta = n\theta - k\theta\)
\(k = n\theta\)
\(\textcolor{#3A7CA5}{\theta_{\text{MLE}}^*} = \dfrac{k}{n} = \dfrac{7}{10}\)
Quite some work for a very intuitive result! The MLE is just the empirical mean. If \(\theta\) were 0.9, we'd expect 9 heads in 10 flips, not 7.
I flip a one-franc coin 5 times and get 4 heads:
\(\theta_{\text{MLE}}^* = \frac{4}{5} = 0.8\)
I flip it 10 times and get 9 heads:
\(\theta_{\text{MLE}}^* = \frac{9}{10} = 0.9\)
Is a real franc coin really 90% heads?
If 100 people each flipped 10 fair coins, roughly 1 would get 9 heads. Could we encode the idea that, even after observing 9 heads, we are not entirely convinced the coin is biased?
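We can check that figure by simulation; a sketch with the standard library (seed and trial count are arbitrary):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

# How often does a fair coin give exactly 9 heads in 10 flips?
trials = 100_000
hits = sum(
    sum(random.random() < 0.5 for _ in range(10)) == 9
    for _ in range(trials)
)
print(hits / trials)  # exact answer: 10/1024, about 0.0098
```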
Encoding beliefs before seeing data
The frequentist view: there is a true, fixed parameter \(\textcolor{#3A7CA5}{\theta}\). We observe data until our estimates converge to it. Estimates are noisy when data are scarce, but only samples can inform those estimates.
The Bayesian view: we can incorporate prior knowledge or beliefs about \(\textcolor{#3A7CA5}{\theta}\) before we've seen data, and the data lets us update those beliefs.
\(\textcolor{#5B8DB8}{p(}\textcolor{#3A7CA5}{\theta}\textcolor{#5B8DB8}{)}\)
A continuous distribution over possible parameter values
On notation: we use uppercase \(P(\cdot)\) for probability mass functions (discrete distributions, like the Binomial) and lowercase \(p(\cdot)\) for probability density functions (continuous distributions). You will see both going forward.
\(\textcolor{#5B8DB8}{p(}\textcolor{#3A7CA5}{\theta}\textcolor{#5B8DB8}{)} = \dfrac{1}{\mathrm{B}(\alpha,\beta)}\textcolor{#3A7CA5}{\theta}^{\alpha-1}(1-\textcolor{#3A7CA5}{\theta})^{\beta-1}\)
Don't worry too much about \(\mathrm{B}(\cdot)\): it is a normalization constant to make sure the total probability is one.
\(\mathrm{B}(\alpha,\beta) = \dfrac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} = \displaystyle\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta\)
Notice the shape: \(\textcolor{#3A7CA5}{\theta}^{\text{something}} \cdot (1 - \textcolor{#3A7CA5}{\theta})^{\text{something}}\) — remember this form!
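A minimal sketch of this density in code, with a numerical check that it integrates to one (the choice \(\alpha = \beta = 2\) is ours):

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta density: theta^(a-1) * (1-theta)^(b-1) / B(a, b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)  # the normalization constant
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# Midpoint-rule check that the density integrates to 1
steps = 100_000
total = sum(beta_pdf((i + 0.5) / steps, 2.0, 2.0) for i in range(steps)) / steps
print(round(total, 6))  # 1.0
```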
Updating beliefs with data
I've observed my data — my ten coin flips. Maybe I want to change my belief about \(\textcolor{#3A7CA5}{\theta}\)?
Sure, nine out of ten could be good luck — but if I got 29 out of 30 heads, I would probably wonder about that coin. Our beliefs would shift.
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D}, \alpha, \beta)}\)
The distribution over \(\textcolor{#3A7CA5}{\theta}\) conditioned on observed data, plus our original beliefs
We will write this as \(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})}\) because we treat \(\alpha, \beta\) as fixed. This is the posterior distribution.
\(P(A \mid B) = \dfrac{P(A,B)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{P(B)}\)
Applied to our setting (recall that \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})}\) is the same quantity as \(p(\mathcal{D}\mid\theta)\), just viewed as a function of \(\theta\)):
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} = \dfrac{\textcolor{#D4853A}{p(\mathcal{D} \mid \theta)} \cdot \textcolor{#5B8DB8}{p(\theta)}}{\displaystyle\int \textcolor{#D4853A}{p(\mathcal{D}\mid\theta')} \, \textcolor{#5B8DB8}{p(\theta')} \,d\theta'}\)
The denominator — the marginal likelihood — requires integrating over all possible \(\theta\). In general this integral is intractable (and effectively impossible when \(\theta\) is a vector of many parameters); our coin example will turn out to be a lucky exception.
\(p(\mathcal{D})\) is fixed and does not depend on \(\textcolor{#3A7CA5}{\theta}\), so:
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} = \frac{1}{Z}\,\textcolor{#D4853A}{p(\mathcal{D} \mid \theta)}\,\textcolor{#5B8DB8}{p(\theta)}\)
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} \;\propto\; \textcolor{#D4853A}{p(\mathcal{D} \mid \theta)} \;\cdot\; \textcolor{#5B8DB8}{p(\theta)}\)
posterior \(\;\propto\;\) likelihood \(\;\times\;\) prior
This is one of the fundamental mechanisms in Bayesian inference.
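The proportionality can be exercised numerically: compute likelihood \(\times\) prior on a grid, normalize by the grid sum, and read off posterior summaries. Everything below (the grid size, the Beta(2, 2) prior) is an illustrative choice:

```python
# n = 10 flips, k = 7 heads, Beta(a, b) prior
n, k, a, b = 10, 7, 2.0, 2.0

# Unnormalized posterior on a midpoint grid over (0, 1)
grid = [(i + 0.5) / 1000 for i in range(1000)]
unnorm = [t**k * (1 - t)**(n - k) * t**(a - 1) * (1 - t)**(b - 1) for t in grid]

Z = sum(unnorm) / 1000          # Riemann-sum estimate of the normalizer
post = [u / Z for u in unnorm]  # normalized density values on the grid

# Posterior mean from the grid should match (k + a) / (n + a + b) = 9/14
mean = sum(t * p for t, p in zip(grid, post)) / 1000
print(round(mean, 3))  # 0.643
```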
\(\textcolor{#C2566E}{p(\theta\mid\mathcal{D})} \;\propto\; \textcolor{#D4853A}{\underbrace{\binom{n}{k}\theta^k(1-\theta)^{n-k}}_{\text{Binomial likelihood}}} \;\cdot\; \textcolor{#5B8DB8}{\underbrace{\frac{1}{\mathrm{B}(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta prior}}}\)
Drop constants w.r.t. \(\theta\):
\(\propto\; \theta^k(1-\theta)^{n-k}\;\cdot\;\theta^{\alpha-1}(1-\theta)^{\beta-1}\)
Combine exponents:
\(\propto\; \theta^{\,k+\alpha-1}(1-\theta)^{\,n-k+\beta-1}\)
This looks like a Beta distribution!
We know the posterior must integrate to 1:
\(1 = \displaystyle\int_0^1 p(\theta\mid\mathcal{D})\,d\theta = \int_0^1 \frac{1}{Z}\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\,d\theta\)
So:
\(Z = \displaystyle\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\,d\theta = \mathrm{B}(k+\alpha,\; n-k+\beta)\)
\(\textcolor{#C2566E}{p(\theta\mid\mathcal{D})} = \dfrac{1}{\mathrm{B}(k+\alpha,\; n-k+\beta)}\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\)
\(= \textcolor{#C2566E}{\text{Beta}(k + \alpha,\; n - k + \beta)}\)
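In code, the conjugate update is just count addition; a sketch (the Beta(2, 2) prior is again an arbitrary choice):

```python
def posterior_params(a, b, k, n):
    """Conjugate Beta-Binomial update:
    Beta(a, b) prior + k heads in n flips -> Beta(a + k, b + n - k)."""
    return a + k, b + n - k

# 9 heads in 10 flips, starting from a Beta(2, 2) prior
print(posterior_params(2, 2, 9, 10))  # (11, 3)

# Updating in two batches matches updating once on the pooled data
a1, b1 = posterior_params(2, 2, 4, 5)
print(posterior_params(a1, b1, 5, 5) == posterior_params(2, 2, 9, 10))  # True
```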
When the posterior has the same functional form as the prior, we say the prior is conjugate to the likelihood. We can also maximize this quantity. Note: we don't need the normalizing constant!
\(\dfrac{\partial}{\partial\theta}\left[\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\right] = 0\)
\((k+\alpha-1)\,\theta^{k+\alpha-2}(1-\theta)^{n-k+\beta-1} = (n-k+\beta-1)\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-2}\)
\(\textcolor{#C2566E}{\theta_{\text{MAP}}^*} = \dfrac{k + \alpha - 1}{n + \alpha + \beta - 2}\)
The MAP is the mode (peak) of the posterior — not the mean. The posterior mean is \(\frac{k+\alpha}{n+\alpha+\beta}\). As \(n \to \infty\), both converge to the MLE \(\frac{k}{n}\) — the data eventually overwhelms the prior.
\(\textcolor{#C2566E}{\theta_{\text{MAP}}^*} = \dfrac{\textcolor{#D4853A}{\text{observed heads}} + \textcolor{#5B8DB8}{\text{prior heads}}}{\textcolor{#D4853A}{\text{observed flips}} + \textcolor{#5B8DB8}{\text{prior flips}}}\)
The \(\alpha\) and \(\beta\) parameters have been updated with the number of successes \(k\) and failures \(n - k\): the prior behaves like \(\alpha - 1\) pseudo-heads and \(\beta - 1\) pseudo-tails added to the observed counts. A stronger prior (more ghost coins) requires more real data to shift.
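Plugging in numbers makes the pull of the prior concrete; the Beta(5, 5) prior below (a fairly strong belief in a near-fair coin) is our illustrative choice:

```python
# Observed: 9 heads in 10 flips; prior: Beta(5, 5), i.e. a strong near-fair belief
n, k = 10, 9
a, b = 5.0, 5.0

theta_mle = k / n                            # empirical mean
theta_map = (k + a - 1) / (n + a + b - 2)    # posterior mode
theta_mean = (k + a) / (n + a + b)           # posterior mean
print(theta_mle, theta_map, theta_mean)  # 0.9, ~0.722, 0.7
```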
The generative story:
\(\textcolor{#3A7CA5}{\theta} \sim \textcolor{#5B8DB8}{\text{Beta}(\alpha, \beta)}\)
For \(i = 1, \ldots, n\):
\(x_i \mid \textcolor{#3A7CA5}{\theta} \sim \text{Bernoulli}(\textcolor{#3A7CA5}{\theta})\)
This is a plate diagram (or graphical model). Circles are random variables. Shaded = observed. The plate (rectangle) means "repeat for each \(i\)".
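The generative story translates almost line-for-line into code; a sketch using the standard library (the seed and hyperparameters are arbitrary):

```python
import random

random.seed(42)  # arbitrary seed

def generate(n, a, b):
    """theta ~ Beta(a, b); then x_i | theta ~ Bernoulli(theta) for i = 1..n."""
    theta = random.betavariate(a, b)
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    return theta, flips

theta, flips = generate(1000, 2.0, 2.0)
print(round(theta, 3), round(sum(flips) / len(flips), 3))  # empirical mean tracks theta
```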
We treated \(\alpha, \beta\) as fixed. But what if we're uncertain about them too?
\(\alpha \sim p(\alpha)\) — e.g., Gamma
\(\beta \sim p(\beta)\)
\(\textcolor{#3A7CA5}{\theta} \sim \textcolor{#5B8DB8}{\text{Beta}(\alpha, \beta)}\)
For \(i = 1, \ldots, n\):
\(x_i \mid \textcolor{#3A7CA5}{\theta} \sim \text{Bernoulli}(\textcolor{#3A7CA5}{\theta})\)
Now the data can inform \(\alpha\) and \(\beta\) too — they are random variables with their own priors. This is a hierarchical Bayesian model.
The deeper you go, the less sensitive the model is to your initial assumptions — the data "speaks for itself" at every level.
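Extending the sketch one level up, we first draw \(\alpha\) and \(\beta\) from their own priors; the Gamma shapes and scales here are arbitrary illustrative choices:

```python
import random

random.seed(0)  # arbitrary seed

def generate_hierarchical(n):
    """Hierarchical story: hyperpriors on alpha and beta, then theta, then flips."""
    a = random.gammavariate(2.0, 2.0)   # alpha ~ Gamma(shape=2, scale=2), our choice
    b = random.gammavariate(2.0, 2.0)   # beta  ~ Gamma(shape=2, scale=2), our choice
    theta = random.betavariate(a, b)
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    return a, b, theta, flips

a, b, theta, flips = generate_hierarchical(100)
print(round(a, 2), round(b, 2), round(theta, 2))
```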
Consider \(n\) students taking an exam. Each student \(j\) has a latent ability \(z_j\) that we don't observe:
\(\textcolor{#3A7CA5}{\theta}\) — exam difficulty (shared)
For \(j = 1, \ldots, n\):
\(\textcolor{#C2566E}{z_j} \sim p(z)\) — student ability
\(x_j \mid \textcolor{#C2566E}{z_j}, \textcolor{#3A7CA5}{\theta} \sim p(x \mid \textcolor{#C2566E}{z_j}, \textcolor{#3A7CA5}{\theta})\) — observed score
Now the latent variable \(\textcolor{#C2566E}{z_j}\) is inside the plate — one per student. We observe scores \(x_j\) but not abilities \(z_j\). This structure will reappear in GMMs, where each data point has its own latent variable indicating which component generated it.
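One hypothetical way to make the student model concrete (our choice, in the spirit of item-response models: abilities are Gaussian, and the pass probability is a logistic function of ability minus difficulty):

```python
import random
from math import exp

random.seed(1)  # arbitrary seed

def sigmoid(v):
    return 1 / (1 + exp(-v))

theta = 0.5  # shared exam difficulty (hypothetical value)
abilities = [random.gauss(0, 1) for _ in range(1000)]        # latent z_j, one per student
scores = [1 if random.random() < sigmoid(z - theta) else 0   # observed x_j
          for z in abilities]
print(sum(scores) / len(scores))  # pass rate; below 0.5 here since the exam is "hard"
```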