4. Ingredients for Bayesian Inference
Let’s start with Bayes’ theorem again:

\[
\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}
\]
This is the notation we’ll use when talking about probability distributions, rather than events as we did in Unit 3.
\(\pi(\theta \mid x)\): the posterior distribution
This is the prior updated by the data and normalized.
\(f(x \mid \theta)\): the likelihood
The likelihood contains all the information about our experiment. Since \(x\) is observed, we describe it with \(f(\cdot)\) rather than the \(\pi(\cdot)\) we use for the posterior and prior. For independent observations \(x_1, \dots, x_n\), the likelihood is:

\[
f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)
\]

where \(n\) is the number of data points. As the sample size increases, the likelihood tends to dominate the prior, leading to a posterior more heavily influenced by the data. Put another way, as we gather more data, our prior beliefs become less important and the data drives our conclusions.
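A quick numerical illustration of this point, using a Beta-Binomial model as a convenient conjugate example (the priors, sample sizes, and success counts below are made up for illustration):

```python
from scipy.stats import beta

# Two very different priors on a success probability theta
priors = {"skeptical Beta(2, 8)": (2, 8), "optimistic Beta(8, 2)": (8, 2)}

# Same observed success rate (70%) at two sample sizes: (successes, failures)
datasets = {"n = 10": (7, 3), "n = 1000": (700, 300)}

for data_label, (successes, failures) in datasets.items():
    for prior_label, (a, b) in priors.items():
        # Beta prior + Binomial likelihood -> Beta posterior (conjugacy)
        posterior = beta(a + successes, b + failures)
        print(f"{data_label}, {prior_label}: posterior mean = {posterior.mean():.3f}")
```

With \(n = 10\) the two posterior means differ noticeably (about 0.45 versus 0.75); with \(n = 1000\) both land near 0.70, the observed proportion, because the likelihood has overwhelmed the prior.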
\(\pi(\theta)\): the prior
Usually we use the prior to describe the state of our knowledge, or expert opinion, before the experiment. Every unobserved variable in your model is a parameter, and for each parameter you will need to elicit a prior. Some students have criticized using \(\pi(\cdot)\) to describe both the posterior and the prior, but that is intentional: it accurately describes what we’re doing with Bayes’ theorem, updating our prior with new information to form the posterior.
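To make “every unobserved variable needs a prior” concrete, here is a minimal sketch of a simple normal model, assuming PyMC as the modeling library; the data, variable names, and prior choices are illustrative, not prescriptive:

```python
import numpy as np
import pymc as pm

# Toy observed data, made up for illustration
data = np.random.default_rng(42).normal(loc=2.0, scale=1.5, size=30)

with pm.Model() as model:
    # mu and sigma are unobserved, so each is a parameter and gets a prior
    mu = pm.Normal("mu", mu=0, sigma=10)      # vague prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5)   # prior on the standard deviation
    # x is observed, so it enters through the likelihood f(x | mu, sigma)
    x = pm.Normal("x", mu=mu, sigma=sigma, observed=data)
```

Note how the split mirrors the notation: \(\mu\) and \(\sigma\) get \(\pi(\cdot)\)-style priors, while the observed \(x\) is described by the likelihood \(f(\cdot)\).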
\(m(x)\): the marginal distribution or normalizing constant
Obtain it by integrating the joint distribution of \(x\) and \(\theta\) over all possible values of \(\theta\):

\[
m(x) = \int_{\Theta} f(x \mid \theta)\,\pi(\theta)\, d\theta
\]
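In simple conjugate models this integral can be evaluated in closed form or checked numerically. A minimal sketch, again assuming a Beta-Binomial setup with made-up numbers:

```python
import numpy as np
from scipy import integrate
from scipy.special import betaln, comb
from scipy.stats import beta, binom

n, x = 10, 7   # trials and successes (made-up data)
a, b = 2, 2    # Beta(a, b) prior on theta


def integrand(theta):
    # joint density: likelihood f(x | theta) times prior pi(theta)
    return binom.pmf(x, n, theta) * beta.pdf(theta, a, b)


# m(x) = integral of the joint density over all values of theta
m_numeric, _ = integrate.quad(integrand, 0, 1)

# Closed form for this conjugate case: C(n, x) * B(x + a, n - x + b) / B(a, b)
m_exact = comb(n, x) * np.exp(betaln(x + a, n - x + b) - betaln(a, b))

print(m_numeric, m_exact)  # the two values agree
```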
This integral is often intractable, which is why a large portion of this course is devoted to strategies for avoiding it. More on that later; because of this, we often just write Bayes’ theorem in its proportional form:

\[
\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)
\]
The \(\propto\) means “proportional to.” Often this is all you need, whether you’re taking a ratio (where the marginal cancels out) or recognizing the kernel of the posterior.
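As an illustration of recognizing a kernel (a standard conjugate example, not specific to this section), suppose \(x \mid \theta \sim \text{Binomial}(n, \theta)\) with a \(\text{Beta}(\alpha, \beta)\) prior on \(\theta\). Keeping only the factors that involve \(\theta\):

\[
\pi(\theta \mid x) \propto \theta^{x}(1-\theta)^{n-x} \cdot \theta^{\alpha - 1}(1-\theta)^{\beta - 1} = \theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1},
\]

which is the kernel of a \(\text{Beta}(x+\alpha,\, n-x+\beta)\) density. The posterior is therefore \(\text{Beta}(x+\alpha,\, n-x+\beta)\), and we never had to compute \(m(x)\).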