Alexander Hoyle
March 2026
If I flip this coin again, what is \(P(X = \texttt{H})\)?
½ ?
Define a probability mass function:
\(P(X = \texttt{H}) = \textcolor{#3A7CA5}{\theta}\) with \(\textcolor{#3A7CA5}{\theta} \in [0, 1]\)
meaning \(P(X = \texttt{T}) = 1 - \textcolor{#3A7CA5}{\theta}\)
A coin flip is a sample: \(X \sim P(X)\), where \(X \in \{\texttt{H}, \texttt{T}\}\)
Probability of one head: \(\textcolor{#3A7CA5}{\theta}\)
Two heads: \(\textcolor{#3A7CA5}{\theta} \cdot \textcolor{#3A7CA5}{\theta}\)
Our exact sequence H H T T H H H H H T:
\(= \textcolor{#E8A838}{\theta^7} \textcolor{#6BA3C7}{(1 - \theta)^3}\)
Usually we care about the number of heads, not the exact sequence.
How many ways to arrange 7 heads in 10 flips?
\(\binom{10}{7} = 120\) arrangements
\(P(X = k \mid \textcolor{#3A7CA5}{\theta}) = \binom{n}{k} \textcolor{#3A7CA5}{\theta}^k (1 - \textcolor{#3A7CA5}{\theta})^{n-k}\)
\(n\) independent Bernoulli trials, \(k\) successes
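As a quick sanity check, the binomial PMF above can be computed directly with Python's standard library (the function name here is our own):

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(X = k | theta): probability of k heads in n independent flips."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# 7 heads in 10 flips of a fair coin: 120 / 1024
print(binom_pmf(7, 10, 0.5))  # 0.1171875
```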
What \(\textcolor{#3A7CA5}{\theta}\) makes our data most likely?
\(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})} = P(\mathcal{D} \mid \textcolor{#3A7CA5}{\theta})\)
Key notational point: \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})}\) and \(P(\mathcal{D} \mid \theta)\) are the same quantity — but \(\textcolor{#D4853A}{\mathcal{L}}\) highlights that \(\textcolor{#3A7CA5}{\theta}\) is the variable of interest, not the data \(\mathcal{D}\).
We want to maximize \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})} = \binom{n}{k}\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\)
Taking derivatives of products like \(\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\) is not very fun. But since \(\log\) is monotonic, the max of \(f(x)\) is the same as the max of \(\log f(x)\):
Both peak at the same \(\theta\) — \(\log\) just reshapes the curve
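A small numerical check of this claim: over a grid of \(\theta\) values, maximizing the likelihood and maximizing its log pick out the same point (the grid resolution here is an arbitrary choice):

```python
from math import comb, log

def likelihood(theta, n=10, k=7):
    """Binomial likelihood of k heads in n flips."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Grid of theta values in (0, 1); both criteria peak at the same point
grid = [i / 1000 for i in range(1, 1000)]
argmax_lik = max(grid, key=likelihood)
argmax_log = max(grid, key=lambda t: log(likelihood(t)))
print(argmax_lik, argmax_log)  # both 0.7
```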
Start from the likelihood:
\(\textcolor{#D4853A}{\ell(\theta \mid \mathcal{D})} = \log\!\left[\binom{n}{k}\textcolor{#3A7CA5}{\theta}^k(1-\textcolor{#3A7CA5}{\theta})^{n-k}\right]\)
The \(\log\) of a product is a sum of \(\log\)s:
\(= \log\binom{n}{k} + \log\!\left(\textcolor{#3A7CA5}{\theta}^k\right) + \log\!\left((1-\textcolor{#3A7CA5}{\theta})^{n-k}\right)\)
And by the power rule, \(\log(a^b) = b\log(a)\):
\(= \log\binom{n}{k} + k\log\textcolor{#3A7CA5}{\theta} + (n-k)\log(1-\textcolor{#3A7CA5}{\theta})\)
\(\textcolor{#D4853A}{\ell(\theta \mid \mathcal{D})} = \underbrace{\log\binom{n}{k}}_{\text{constant w.r.t. }\theta} + k\log\textcolor{#3A7CA5}{\theta} + (n-k)\log(1-\textcolor{#3A7CA5}{\theta})\)
\(\log\binom{n}{k}\) is constant w.r.t. \(\textcolor{#3A7CA5}{\theta}\) and disappears when differentiating. Set the derivative to zero:
\(\dfrac{\partial\,\ell}{\partial\,\theta} = \dfrac{k}{\theta} - \dfrac{n-k}{1-\theta} = 0\)
\(\dfrac{k}{\theta} = \dfrac{n-k}{1-\theta}\)
\(k(1-\theta) = (n-k)\theta\)
\(k - k\theta = n\theta - k\theta\)
\(k = n\theta\)
\(\textcolor{#3A7CA5}{\theta_{\text{MLE}}^*} = \dfrac{k}{n} = \dfrac{7}{10}\)
Quite some work for a very intuitive result! The MLE is just the empirical mean. If \(\theta\) were 0.9, we'd expect 9 heads in 10 flips, not 7.
I flip a one-franc coin 5 times and get 4 heads:
\(\theta_{\text{MLE}}^* = \frac{4}{5} = 0.8\)
I flip it 10 times and get 9 heads:
\(\theta_{\text{MLE}}^* = \frac{9}{10} = 0.9\)
Is a real franc coin really 90% heads?
If 100 people each flipped 10 fair coins, roughly 1 would get 9 heads. Could we encode the idea that, even after observing 9 heads, we are not entirely convinced the coin is biased?
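We can check that figure by simulation; a sketch with the standard library (seed and trial count are arbitrary):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

# How often does a fair coin give exactly 9 heads in 10 flips?
trials = 100_000
hits = sum(
    sum(random.random() < 0.5 for _ in range(10)) == 9
    for _ in range(trials)
)
print(hits / trials)  # exact answer: 10/1024, about 0.0098
```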
Encoding beliefs before seeing data
The frequentist view: there is a true, fixed parameter \(\textcolor{#3A7CA5}{\theta}\). We observe data until our estimates converge to it. Estimates are noisy when data are scarce, but only samples can inform those estimates.
The Bayesian view: we can incorporate prior knowledge or beliefs about \(\textcolor{#3A7CA5}{\theta}\) before we've seen data, and the data lets us update those beliefs.
\(\textcolor{#5B8DB8}{p(}\textcolor{#3A7CA5}{\theta}\textcolor{#5B8DB8}{)}\)
A continuous distribution over possible parameter values
On notation: we use uppercase \(P(\cdot)\) for probability mass functions (discrete distributions, like the Binomial) and lowercase \(p(\cdot)\) for probability density functions (continuous distributions). You will see both going forward.
\(\textcolor{#5B8DB8}{p(}\textcolor{#3A7CA5}{\theta}\textcolor{#5B8DB8}{)} = \dfrac{1}{\mathrm{B}(\alpha,\beta)}\textcolor{#3A7CA5}{\theta}^{\alpha-1}(1-\textcolor{#3A7CA5}{\theta})^{\beta-1}\)
Don't worry too much about \(\mathrm{B}(\cdot)\): it is a normalization constant to make sure the total probability is one.
\(\mathrm{B}(\alpha,\beta) = \dfrac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} = \displaystyle\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta\)
Notice the shape: \(\textcolor{#3A7CA5}{\theta}^{\text{something}} \cdot (1 - \textcolor{#3A7CA5}{\theta})^{\text{something}}\) — remember this form!
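A minimal sketch of this density in code, with a numerical check that it integrates to one (the choice \(\alpha = \beta = 2\) is ours):

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta density: theta^(a-1) * (1-theta)^(b-1) / B(a, b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)  # the normalization constant
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# Midpoint-rule check that the density integrates to 1
steps = 100_000
total = sum(beta_pdf((i + 0.5) / steps, 2.0, 2.0) for i in range(steps)) / steps
print(round(total, 6))  # 1.0
```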
Updating beliefs with data
I've observed my data — my ten coin flips. Maybe I want to change my belief about \(\textcolor{#3A7CA5}{\theta}\)?
Sure, nine out of ten could be good luck — but if I got 29 out of 30 heads, I would probably wonder about that coin. Our beliefs would shift.
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D}, \alpha, \beta)}\)
The distribution over \(\textcolor{#3A7CA5}{\theta}\) conditioned on observed data, plus our original beliefs
We will write this as \(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})}\) because we treat \(\alpha, \beta\) as fixed. This is the posterior distribution.
\(P(A \mid B) = \dfrac{P(A,B)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{P(B)}\)
Applied to our setting (recall that \(\textcolor{#D4853A}{\mathcal{L}(\theta \mid \mathcal{D})}\) is the same quantity as \(p(\mathcal{D}\mid\theta)\), just viewed as a function of \(\theta\)):
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} = \dfrac{\textcolor{#D4853A}{p(\mathcal{D} \mid \theta)} \cdot \textcolor{#5B8DB8}{p(\theta)}}{\displaystyle\int \textcolor{#D4853A}{p(\mathcal{D}\mid\theta')} \, \textcolor{#5B8DB8}{p(\theta')} \,d\theta'}\)
The denominator — the marginal likelihood — requires integrating over all possible \(\theta\). In general this integral is intractable (and effectively impossible when \(\theta\) is a vector of many parameters); our coin example will turn out to be a lucky exception.
\(p(\mathcal{D})\) is fixed and does not depend on \(\textcolor{#3A7CA5}{\theta}\), so:
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} = \frac{1}{Z}\,\textcolor{#D4853A}{p(\mathcal{D} \mid \theta)}\,\textcolor{#5B8DB8}{p(\theta)}\)
\(\textcolor{#C2566E}{p(\theta \mid \mathcal{D})} \;\propto\; \textcolor{#D4853A}{p(\mathcal{D} \mid \theta)} \;\cdot\; \textcolor{#5B8DB8}{p(\theta)}\)
posterior \(\;\propto\;\) likelihood \(\;\times\;\) prior
This is one of the fundamental mechanisms in Bayesian inference.
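The proportionality can be exercised numerically: compute likelihood \(\times\) prior on a grid, normalize by the grid sum, and read off posterior summaries. Everything below (the grid size, the Beta(2, 2) prior) is an illustrative choice:

```python
# n = 10 flips, k = 7 heads, Beta(a, b) prior
n, k, a, b = 10, 7, 2.0, 2.0

# Unnormalized posterior on a midpoint grid over (0, 1)
grid = [(i + 0.5) / 1000 for i in range(1000)]
unnorm = [t**k * (1 - t)**(n - k) * t**(a - 1) * (1 - t)**(b - 1) for t in grid]

Z = sum(unnorm) / 1000          # Riemann-sum estimate of the normalizer
post = [u / Z for u in unnorm]  # normalized density values on the grid

# Posterior mean from the grid should match (k + a) / (n + a + b) = 9/14
mean = sum(t * p for t, p in zip(grid, post)) / 1000
print(round(mean, 3))  # 0.643
```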
\(\textcolor{#C2566E}{p(\theta\mid\mathcal{D})} \;\propto\; \textcolor{#D4853A}{\underbrace{\binom{n}{k}\theta^k(1-\theta)^{n-k}}_{\text{Binomial likelihood}}} \;\cdot\; \textcolor{#5B8DB8}{\underbrace{\frac{1}{\mathrm{B}(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta prior}}}\)
Drop constants w.r.t. \(\theta\):
\(\propto\; \theta^k(1-\theta)^{n-k}\;\cdot\;\theta^{\alpha-1}(1-\theta)^{\beta-1}\)
Combine exponents:
\(\propto\; \theta^{\,k+\alpha-1}(1-\theta)^{\,n-k+\beta-1}\)
This looks like a Beta distribution!
We know the posterior must integrate to 1:
\(1 = \displaystyle\int_0^1 p(\theta\mid\mathcal{D})\,d\theta = \int_0^1 \frac{1}{Z}\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\,d\theta\)
So:
\(Z = \displaystyle\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\,d\theta = \mathrm{B}(k+\alpha,\; n-k+\beta)\)
\(\textcolor{#C2566E}{p(\theta\mid\mathcal{D})} = \dfrac{1}{\mathrm{B}(k+\alpha,\; n-k+\beta)}\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\)
\(= \textcolor{#C2566E}{\text{Beta}(k + \alpha,\; n - k + \beta)}\)
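In code, the conjugate update is just count addition; a sketch (the Beta(2, 2) prior is again an arbitrary choice):

```python
def posterior_params(a, b, k, n):
    """Conjugate Beta-Binomial update:
    Beta(a, b) prior + k heads in n flips -> Beta(a + k, b + n - k)."""
    return a + k, b + n - k

# 9 heads in 10 flips, starting from a Beta(2, 2) prior
print(posterior_params(2, 2, 9, 10))  # (11, 3)

# Updating in two batches matches updating once on the pooled data
a1, b1 = posterior_params(2, 2, 4, 5)
print(posterior_params(a1, b1, 5, 5) == posterior_params(2, 2, 9, 10))  # True
```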
When the posterior has the same functional form as the prior, we say the prior is conjugate to the likelihood. We can also maximize this quantity. Note: we don't need the normalizing constant!
\(\dfrac{\partial}{\partial\theta}\left[\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}\right] = 0\)
\((k+\alpha-1)\,\theta^{k+\alpha-2}(1-\theta)^{n-k+\beta-1} = (n-k+\beta-1)\,\theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-2}\)
\(\textcolor{#C2566E}{\theta_{\text{MAP}}^*} = \dfrac{k + \alpha - 1}{n + \alpha + \beta - 2}\)
The MAP is the mode (peak) of the posterior — not the mean. The posterior mean is \(\frac{k+\alpha}{n+\alpha+\beta}\). As \(n \to \infty\), both converge to the MLE \(\frac{k}{n}\) — the data eventually overwhelms the prior.
\(\textcolor{#C2566E}{\theta_{\text{MAP}}^*} = \dfrac{\textcolor{#D4853A}{\text{observed heads}} + \textcolor{#5B8DB8}{\text{prior heads}}}{\textcolor{#D4853A}{\text{observed flips}} + \textcolor{#5B8DB8}{\text{prior flips}}}\)
The \(\alpha\) and \(\beta\) parameters have been updated with the number of successes \(k\) and failures \(n - k\): the prior behaves like \(\alpha - 1\) pseudo-heads and \(\beta - 1\) pseudo-tails added to the observed counts. A stronger prior (more ghost coins) requires more real data to shift.
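Plugging in numbers makes the pull of the prior concrete; the Beta(5, 5) prior below (a fairly strong belief in a near-fair coin) is our illustrative choice:

```python
# Observed: 9 heads in 10 flips; prior: Beta(5, 5), i.e. a strong near-fair belief
n, k = 10, 9
a, b = 5.0, 5.0

theta_mle = k / n                            # empirical mean
theta_map = (k + a - 1) / (n + a + b - 2)    # posterior mode
theta_mean = (k + a) / (n + a + b)           # posterior mean
print(theta_mle, theta_map, theta_mean)  # 0.9, ~0.722, 0.7
```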
The generative story:
\(\textcolor{#3A7CA5}{\theta} \sim \textcolor{#5B8DB8}{\text{Beta}(\alpha, \beta)}\)
For \(i = 1, \ldots, n\):
\(x_i \mid \textcolor{#3A7CA5}{\theta} \sim \text{Bernoulli}(\textcolor{#3A7CA5}{\theta})\)
This is a plate diagram (or graphical model). Circles are random variables. Shaded = observed. The plate (rectangle) means "repeat for each \(i\)".
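The generative story translates almost line-for-line into code; a sketch using the standard library (the seed and hyperparameters are arbitrary):

```python
import random

random.seed(42)  # arbitrary seed

def generate(n, a, b):
    """theta ~ Beta(a, b); then x_i | theta ~ Bernoulli(theta) for i = 1..n."""
    theta = random.betavariate(a, b)
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    return theta, flips

theta, flips = generate(1000, 2.0, 2.0)
print(round(theta, 3), round(sum(flips) / len(flips), 3))  # empirical mean tracks theta
```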
We treated \(\alpha, \beta\) as fixed. But what if we're uncertain about them too?
\(\alpha \sim p(\alpha)\) — e.g., Gamma
\(\beta \sim p(\beta)\)
\(\textcolor{#3A7CA5}{\theta} \sim \textcolor{#5B8DB8}{\text{Beta}(\alpha, \beta)}\)
For \(i = 1, \ldots, n\):
\(x_i \mid \textcolor{#3A7CA5}{\theta} \sim \text{Bernoulli}(\textcolor{#3A7CA5}{\theta})\)
Now the data can inform \(\alpha\) and \(\beta\) too — they are random variables with their own priors. This is a hierarchical Bayesian model.
The deeper you go, the less sensitive the model is to your initial assumptions — the data "speaks for itself" at every level.
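Extending the sketch one level up, we first draw \(\alpha\) and \(\beta\) from their own priors; the Gamma shapes and scales here are arbitrary illustrative choices:

```python
import random

random.seed(0)  # arbitrary seed

def generate_hierarchical(n):
    """Hierarchical story: hyperpriors on alpha and beta, then theta, then flips."""
    a = random.gammavariate(2.0, 2.0)   # alpha ~ Gamma(shape=2, scale=2), our choice
    b = random.gammavariate(2.0, 2.0)   # beta  ~ Gamma(shape=2, scale=2), our choice
    theta = random.betavariate(a, b)
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    return a, b, theta, flips

a, b, theta, flips = generate_hierarchical(100)
print(round(a, 2), round(b, 2), round(theta, 2))
```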
Consider \(n\) students taking an exam. Each student \(j\) has a latent ability \(z_j\) that we don't observe:
\(\textcolor{#3A7CA5}{\theta}\) — exam difficulty (shared)
For \(j = 1, \ldots, n\):
\(\textcolor{#C2566E}{z_j} \sim p(z)\) — student ability
\(x_j \mid \textcolor{#C2566E}{z_j}, \textcolor{#3A7CA5}{\theta} \sim p(x \mid \textcolor{#C2566E}{z_j}, \textcolor{#3A7CA5}{\theta})\) — observed score
Now the latent variable \(\textcolor{#C2566E}{z_j}\) is inside the plate — one per student. We observe scores \(x_j\) but not abilities \(z_j\). This structure will reappear in GMMs, where each data point has its own latent variable indicating which component generated it.
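One hypothetical way to make the student model concrete (our choice, in the spirit of item-response models: abilities are Gaussian, and the pass probability is a logistic function of ability minus difficulty):

```python
import random
from math import exp

random.seed(1)  # arbitrary seed

def sigmoid(v):
    return 1 / (1 + exp(-v))

theta = 0.5  # shared exam difficulty (hypothetical value)
abilities = [random.gauss(0, 1) for _ in range(1000)]        # latent z_j, one per student
scores = [1 if random.random() < sigmoid(z - theta) else 0   # observed x_j
          for z in abilities]
print(sum(scores) / len(scores))  # pass rate; below 0.5 here since the exam is "hard"
```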