14. Prior Elicitation
Priors are one of the strengths of Bayesian inference. They’re also a source of criticism. Critics say that prior choice is essentially subjective and can make the resulting models too easy to manipulate into supporting whatever the creator of the model wants to say. But those criticisms apply to classical statistics as well; arguably, the explicitly defined prior distributions of Bayesian models are more transparent than the hidden assumptions made in many frequentist models.
Ideally, priors would be based on expert beliefs about whatever parameter you're putting them on. In reality, we don't always have strong beliefs about something, so we might use weak or non-informative priors (more on those in the next lesson).
Choosing parameters
This lecture mostly discusses what to do if you already know which distribution you want to use for your prior. The professor gives exponential and beta prior examples in the lecture.
Exponential
We know our parameter \(\theta\) is exponentially distributed with an expected value of 2. The expected value of an exponential distribution (Exponential Distribution) is \(1/\lambda\), where \(\lambda\) is the rate parameter, so our prior should be \(Exp(1/2)\).
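If you want to sanity-check the parameterization (a quick check of my own, not from the lecture), note that SciPy uses a scale parameter equal to \(1/\lambda\), so a rate of \(1/2\) corresponds to a scale of 2:

```python
# Quick check that Exp(rate = 1/2) has mean 2.
# SciPy parameterizes the exponential by scale = 1/lambda.
from scipy import stats

prior = stats.expon(scale=2)  # rate lambda = 1/2
print(prior.mean())           # 2.0
```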
Beta
The prior expected value of our beta-distributed parameter is \(1/2\) and the variance is \(1/8\). Using the known mean and variance of the beta distribution (Beta Distribution), we can find the shape parameters \(\alpha\) and \(\beta\) by solving the resulting system of equations.
The mean is:

$$\operatorname{E}[\theta] = \frac{\alpha}{\alpha + \beta} = \frac{1}{2}$$

and the variance is:

$$\operatorname{Var}[\theta] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} = \frac{1}{8}$$

The first equation gives \(\alpha = \beta\); substituting that into the second gives \(\frac{1}{4(2\alpha + 1)} = \frac{1}{8}\), so \(\alpha = \beta = 1/2\) and our prior is \(Be(1/2, 1/2)\).
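As a check on the algebra (a sketch of my own, not from the lecture), we can solve the same system numerically and confirm the moments of \(Be(1/2, 1/2)\):

```python
# Solve for the beta shape parameters given a target mean and variance,
# then confirm the result against scipy.stats.beta.
from scipy import stats
from scipy.optimize import fsolve

def moment_equations(params, target_mean=0.5, target_var=0.125):
    a, b = params
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return [mean - target_mean, var - target_var]

alpha, beta = fsolve(moment_equations, x0=[1.0, 1.0])
print(alpha, beta)                # ~0.5, ~0.5

prior = stats.beta(a=alpha, b=beta)
print(prior.mean(), prior.var())  # 0.5, 0.125
```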
Principled prior choice
But how do we choose the prior distribution in the first place? This might be students' single most-asked question in the course. The professor doesn't go in-depth on the topic in the lectures, so it feels like a blind spot for many. That's normal; this subject is big enough for a course of its own.
There are also practical considerations: we may need to select priors that are conjugate or otherwise play nicely with our model or sampling algorithm. For example, later in the course, the professor uses the Wishart prior on the covariance matrix of a multivariate normal distribution for multiple regression. While this works in OpenBUGS, other probabilistic programming libraries that use different sampling algorithms can't easily use the Wishart distribution.
There are many methods and philosophies behind prior choice. The professor references Garthwaite and Dickey [1988] in the lecture; they present a detailed procedure for questioning experts and quantifying their opinions in the form of a conjugate prior. This is usually what people mean by prior elicitation: specifically, the process of turning expert knowledge into a prior distribution. Some of the people involved with PyMC have a new overview paper on prior elicitation, Mikkola et al. [2023].
Michael Betancourt has a great, more general overview of prior modeling here.
Jaynes [2003] (Chapters 11 & 12) describes the principle of maximum entropy for choosing priors. He describes these as “ignorance priors” because the idea is that by choosing the distribution of maximum entropy, we add the least information possible to our state of knowledge of the parameter in question.
… nothing in the mathematics requires that any random experiment be in fact performed or conceivable; and so we interpret the principle in the broadest sense which gives it the widest range of applicability, i.e. whether or not any random experiment is involved, the maximum entropy distribution still represents the most ‘honest’ description of our state of knowledge.
—Jaynes [2003] Chapter 12.2
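To make the idea concrete (a standard result, not one worked in the lecture): among all densities on \([0, \infty)\) with a fixed mean \(\mu\), the one that maximizes the differential entropy

$$H(p) = -\int_0^\infty p(\theta) \log p(\theta)\, d\theta$$

subject to \(\int_0^\infty p(\theta)\, d\theta = 1\) and \(\int_0^\infty \theta\, p(\theta)\, d\theta = \mu\) is the exponential density \(p(\theta) = \frac{1}{\mu} e^{-\theta/\mu}\). So if all we're willing to say about \(\theta\) is that it's non-negative with expected value 2, maximum entropy points to the same \(Exp(1/2)\) prior from the example above.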
If you’re sick of theory, the developers of Stan have more practical notes on prior choices here based on their experience.
These links only scratch the surface of the subject. For now, just know that these resources exist; we'll keep coming back to prior choice throughout the course.