6. Metropolis#

Notes#

Some students have been confused by the professor’s notation for this lecture. Here he uses \(\pi\) to mean the target function, which, in Bayesian inference, will be a posterior distribution. However, in the examples, he only uses the numerator of Bayes’ theorem in the calculation. In other words, he’s using something proportional to the target, but not the target itself. That’s because the normalizing constant cancels out in the acceptance ratio, so there’s no need to include it.
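To see the cancellation, write the target as \(\pi(x) = f(x)/Z\), where \(f\) is the unnormalized numerator (likelihood times prior) and \(Z\) is the normalizing constant. For any two points \(x\) and \(x'\),

\[\frac{\pi(x')}{\pi(x)} = \frac{f(x')/Z}{f(x)/Z} = \frac{f(x')}{f(x)}\]

so the acceptance ratio is the same whether we normalize or not.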

Metropolis Algorithm#

Here’s the algorithm (Metropolis et al. [1953]):

Algorithm 1 (Metropolis)

Inputs Let \(\pi(x)\) be proportional to the target PDF. \(x_j\) is the current value and \(q(x|x_j)\) is a symmetric proposal distribution (\(q(x|x_j) = q(x_j|x)\)).

Output An array of samples representing the target PDF.

  1. Choose an initial value \(x_0\) from your sample space.

  2. Sample \(x_* \sim q(x|x_j)\).

  3. Calculate the acceptance probability: \(\rho(x_j, x_*) = \min\left\{1, \frac{\pi(x_*)}{\pi(x_j)}\right\}\).

  4. Update \(x_{j+1} = x_*\) with probability \(\rho(x_j, x_*)\), otherwise \(x_{j+1}\) remains equal to \(x_j\).

This is also known as random-walk Metropolis. The proposal distribution is a perturbation of the current location of the chain. It must be symmetric in this version of the algorithm, which can be considered a special case of the more general Metropolis-Hastings algorithm. You can control the “step size” of the algorithm by adjusting the variance of your proposal: a smaller variance means smaller steps.

The most important step is the acceptance probability. If the acceptance ratio comes out greater than 1, great! We definitely want that new proposed value because it is more likely than the previous value. That’s why we take the minimum of 1 and \(\frac{\pi(x_*)}{\pi(x_j)}\).

If the ratio is less than 1, then we might still accept the proposal. We accept it with exactly that probability: if it’s .8, we’ll take it 80% of the time; if it’s .1, we’ll accept it 10% of the time. This lets the sampler visit less likely values as well, while keeping it within a representative area of the target distribution.
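To make the accept/reject step concrete, here’s a minimal sketch of random-walk Metropolis in Python. The standard normal target and the function names are my own illustrative choices, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)

def pi(x):
    # proportional to a standard normal target (illustrative choice)
    return np.exp(-0.5 * x**2)

def random_walk_metropolis(pi, x0, n, step=0.5):
    samples = np.empty(n)
    x = x0
    for j in range(n):
        x_star = x + rng.normal(0, step)  # symmetric proposal: perturb the current value
        rho = min(1, pi(x_star) / pi(x))  # acceptance probability
        if rng.uniform() < rho:           # accept with probability rho...
            x = x_star
        samples[j] = x                    # ...otherwise the chain stays put
    return samples

samples = random_walk_metropolis(pi, x0=0.0, n=10_000)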

Metropolis example#

Let our model be a Gamma-Gamma conjugate model, where:

\[\begin{split} \begin{align*} X_i | \theta &\sim \text{Ga}(v, \theta) \\ \theta &\sim \text{Ga}(\alpha, \beta) \end{align*} \end{split}\]

We’ll just have a single datapoint, \(x=1\), for simplicity. So if we let \(v=1\), \(\alpha=1\), \(\beta=1\), our true posterior (see the Conjugate table) will be \(\text{Ga}(2, 2)\). We will use that to compare with our Metropolis results.

For our target, we have:

\[\pi(\theta|x) \propto \theta^{vn + \alpha - 1} e^{-\theta(\beta + \sum_{i=1}^{n}X_i)}\]

We’ll use a \(N(0, 0.4^2)\) proposal to perturb the current value. Don’t worry about the code for now. We’ll go over that on the next page.

Hide code cell source
from IPython.display import HTML, display
from pathlib import Path

video_path = Path("../_static/videos/MetropolisGamma1000.mp4")

html_code = f"""
<div class="video-container">
  <video id="video" muted controls>
    <source src="{video_path}" type="video/mp4">
  </video>
</div>
<style>
  .video-container {{
    text-align: center;
    width: 100%;
    max-width: fit-content;
    margin: auto;
  }}
</style>
<script>
  var video = document.getElementById('video');
  video.playbackRate = 0.25;
</script>
"""

display(HTML(html_code))

This shows the first 1,000 draws of the Metropolis algorithm. You can modify and try it out yourself using the files in this repository. If you generate the animation yourself, you can step through it frame by frame. I’m working on a little web app to run it, but it’s not ready yet.

The visualization was inspired by this excellent video by Kapil Sachdeva and this notebook by Colin Carroll.
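If you want a head start on the code, here’s a minimal sketch of a sampler for this model. It’s my own illustration, assuming the setup above (\(v = \alpha = \beta = 1\), a single datapoint \(x = 1\), and a \(N(0, 0.4^2)\) perturbation), not the repository code.

import numpy as np
from scipy import stats

rng = np.random.default_rng(18)

x, n = 1.0, 1  # single datapoint
v, alpha, beta = 1, 1, 1

def target(theta):
    # proportional to the posterior: theta^(v*n + alpha - 1) * exp(-theta * (beta + sum(x)))
    if theta <= 0:
        return 0.0
    return theta ** (v * n + alpha - 1) * np.exp(-theta * (beta + x))

theta = 1.0  # initial value
samples = np.empty(1000)
for j in range(samples.size):
    theta_star = theta + rng.normal(0, 0.4)           # N(0, 0.4^2) perturbation
    rho = min(1, target(theta_star) / target(theta))  # acceptance probability
    if rng.uniform() < rho:
        theta = theta_star
    samples[j] = theta

# sanity check against the true Ga(2, 2) posterior (mean = 1)
print(samples.mean(), stats.gamma(a=2, scale=1 / 2).mean())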

Metropolis-Hastings#

For Metropolis-Hastings (Hastings [1970]), the only change is in the acceptance ratio formula.

Algorithm 2 (Metropolis-Hastings)

Inputs Let \(\pi(x)\) be proportional to the target PDF. \(x_j\) is the current value and \(q(x|x_j)\) is a proposal distribution.

Output An array of samples representing the target PDF.

  1. Choose an initial value \(x_0\) from your sample space.

  2. Sample \(x_* \sim q(x|x_j)\).

  3. Calculate the acceptance probability: \(\rho(x_j, x_*) = \min\left\{1, \frac{\pi(x_*)}{\pi(x_j)}\frac{q(x_j|x_*)}{q(x_*|x_j)}\right\}\).

  4. Update \(x_{j+1} = x_*\) with probability \(\rho(x_j, x_*)\), otherwise \(x_{j+1}\) remains equal to \(x_j\).
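In code, the only change from the random-walk version is the extra proposal ratio. Here’s a sketch, assuming a hypothetical helper q_pdf(a, b) that evaluates the proposal density \(q(a|b)\):

def acceptance(pi, q_pdf, x_j, x_star):
    # Metropolis-Hastings: target ratio times the proposal correction q(x_j|x*) / q(x*|x_j)
    return min(1, pi(x_star) / pi(x_j) * q_pdf(x_j, x_star) / q_pdf(x_star, x_j))

For a symmetric proposal the correction term equals 1, and this reduces to the Metropolis acceptance probability.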

There’s another simplified variant that the professor calls “Independence Metropolis,” where we select a proposal distribution that doesn’t depend on the current state of the chain. In other words, step 2 becomes \(x_* \sim q(x)\) and \(\rho(x_j, x_*) = \min\left\{1, \frac{\pi(x_*)}{\pi(x_j)}\frac{q(x_j)}{q(x_*)}\right\}\). We’ll look at an example of this in the next lecture, but a quick sketch follows.
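Here’s an independence-sampler sketch, reusing the Gamma target from earlier with an \(\text{Exp}(1)\) proposal (my choice for illustration, not the lecture’s):

import numpy as np

rng = np.random.default_rng(42)

x, n = 1.0, 1
v, alpha, beta = 1, 1, 1

def target(theta):
    # proportional to the Ga(2, 2) posterior from the earlier example
    if theta <= 0:
        return 0.0
    return theta ** (v * n + alpha - 1) * np.exp(-theta * (beta + x))

def q_pdf(theta):
    # Exp(1) density; note it does not depend on the current state
    return np.exp(-theta)

theta = 1.0
samples = np.empty(1000)
for j in range(samples.size):
    theta_star = rng.exponential(1.0)  # x* ~ q(x), independent of the chain
    rho = min(1, target(theta_star) / target(theta) * q_pdf(theta) / q_pdf(theta_star))
    if rng.uniform() < rho:
        theta = theta_star
    samples[j] = theta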

Other resources#