12. Bayesian Testing#

Lecture errata#

I’ve marked the lecture errors in red text in the rewritten slides below.

Rewriting the slides#

We’ve gotten a lot of questions about these slides for this lecture, so I’ve rewritten them here with some additional information.

First, I recommend checking out 3blue1brown’s video on Bayes’ factor to help with intuition, particularly the part about expressing Bayes’ rule in terms of the prior and posterior odds.

There is a similar example in Professor Vidakovic's statistics book: Vidakovic [2017], page 103. Unlike the classical approach, Bayesian testing doesn't prioritize the null hypothesis. Instead, we calculate the posterior probabilities for both hypotheses, then choose the one with the larger posterior probability.


1#

Assume that \(\Theta_0\) and \(\Theta_1\) are two disjoint sets of possible values of the parameter \(\theta\). We want to test

\[ H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1 \]

The posterior probabilities of the two hypotheses are:

\[ p_0 = \int_{\Theta_0} \pi(\theta | x) d\theta = \mathbb{P}^{\theta | X}(H_0) \]
\[ p_1 = \int_{\Theta_1} \pi(\theta | x) d\theta = \mathbb{P}^{\theta | X}(H_1) \]

See slides 5 and 6 for an example.


2#

Prior probabilities of hypotheses#

\[ \pi_0 = \int_{\Theta_0} \pi(\theta) d\theta, \quad \pi_1 = \int_{\Theta_1} \pi(\theta) d\theta \]

\( B_{01} \), the Bayes factor in favor of \( H_0 \):

\[ B_{01} = \frac{p_0 / p_1}{\pi_0 / \pi_1} \quad \text{(posterior odds / prior odds)} \]

\( B_{10} \), the Bayes factor in favor of \( H_1 \), is the reciprocal:

\[ B_{10} = \frac{1}{B_{01}} \]
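For one-sided tests like the examples later in this lecture, both odds ratios come straight from the prior and posterior CDFs. Here's a minimal sketch of my own (not from the slides; the helper name `bayes_factor_01` and its arguments are assumptions) using SciPy frozen distributions:

```python
from scipy import stats

def bayes_factor_01(prior, posterior, cut):
    """B01 for H0: theta <= cut vs. H1: theta > cut, where `prior` and
    `posterior` are scipy.stats frozen continuous distributions."""
    pi0 = prior.cdf(cut)     # prior probability of H0
    p0 = posterior.cdf(cut)  # posterior probability of H0
    return (p0 / (1 - p0)) / (pi0 / (1 - pi0))

# Jeremy's IQ test from slides 5-6: B01 ~= 2.37,
# the reciprocal of the B10 ~= 0.4223 computed there.
B01 = bayes_factor_01(stats.norm(110, 120 ** 0.5),
                      stats.norm(102.8, 48 ** 0.5),
                      cut=100)
```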

Precise null \( H_0: \theta = \theta_0 \) requires a prior with point mass at \( \theta_0 \).#

See the example on slide 3.


3#

Precise Null#

The professor notes that if you want to test a precise value, your overall prior needs to contain a point mass at that value.

What he means is that you can only test a hypothesis to which your prior assigns positive probability. If your prior is a continuous distribution, the probability of any single point is 0, so a precise null is ruled out before you even build the rest of your model. Without mixing in the point mass, you've already predetermined that your hypothesis is impossible.

That's why he mixes the point mass \(\delta_{\theta_0}\) and the "spread" distribution \(\xi(\theta)\) in the prior \(\pi(\theta)\): you need a prior with an atom at \(\theta_0\) to test that specific point.

\[ H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0 \]

\[ \pi(\theta) = \pi_0 \cdot \delta_{\theta_0} + (1 - \pi_0) \cdot \xi(\theta) \quad \text{where } (1 - \pi_0) = \pi_1 \]
\[ m(x) = \pi_0 \cdot f(x | \theta_0) + \pi_1 \cdot m_1(x) \]
\[ m_1(x) = \int_{\{\theta \neq \theta_0\}} f(x | \theta) \xi(\theta) d\theta \]
\[\begin{split} \begin{align*} \textcolor{red}{\pi(\theta_0 | x)} &= \frac{f(x | \theta_0) \pi_0}{m(x)} \\ &= \frac{\pi_0 f(x|\theta_0)}{\pi_0 f(x|\theta_0) + \pi_1 m_1(x)} \\ &= \frac{\pi_0 f(x|\theta_0)}{\pi_0 f(x|\theta_0)\left(1+ \frac{\pi_1 m_1(x)}{\pi_0 f(x\mid\theta_0)}\right)} \\ &= \frac{1}{\left(1+ \frac{\pi_1 m_1(x)}{\pi_0 f(x\mid\theta_0)}\right)} \\ \textcolor{red}{\pi(\theta_0 | x)} & = \left( 1 + \frac{\pi_1}{\pi_0} \cdot \frac{m_1(x)}{f(x | \theta_0)} \right)^{-1} \\ \end{align*} \end{split}\]

Remembering that the Bayes factor is what updates the prior odds to the posterior odds,

\[ \text{Odds}(H_0 \mid x) = B_{01} \times \text{Odds}(H_0), \]

we see that

\[\begin{split} \begin{align*} B_{01} &= \frac{f(x | \theta_0)}{m_1(x)} \\ &= \frac{f(x | \theta_0)}{\int_{\{\theta \neq \theta_0\}} f(x | \theta) \xi(\theta) d\theta} \end{align*} \end{split}\]

4#

Scales for the strength of evidence#

The professor uses Jeffreys' scale (Jeffreys [2003], Appendix B) in the lecture. In Jeffreys' notation, \(q\) is the null hypothesis and \(K\) is the Bayes factor in its favor (our \(B_{01}\)).

| Grade | \(K\) Value | Interpretation |
|-------|-------------|----------------|
| 0 | \(K > 1\) | Null hypothesis supported. |
| 1 | \(1 > K > 10^{-1/2}\) | Evidence against \(q\), but not worth more than a bare mention. |
| 2 | \(10^{-1/2} > K > 10^{-1}\) | Evidence against \(q\) substantial. |
| 3 | \(10^{-1} > K > 10^{-3/2}\) | Evidence against \(q\) strong. |
| 4 | \(10^{-3/2} > K > 10^{-2}\) | Evidence against \(q\) very strong. |
| 5 | \(10^{-2} > K\) | Evidence against \(q\) decisive. |

This scale is just one of many, though. The Kass & Raftery scale (Kass and Raftery [1995]) and the Lee and Wagenmakers scale (Lee and Wagenmakers [2013]) are two alternatives.
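If you want the table as code, here's a small hypothetical helper of my own (the name and wording are mine) that maps a Bayes factor \(K = B_{01}\) to its Jeffreys grade:

```python
def jeffreys_grade(K):
    """Map K = B01 to a (grade, interpretation) pair on Jeffreys' scale."""
    scale = [
        (1,          "null hypothesis supported"),
        (10 ** -0.5, "evidence against the null not worth more than a bare mention"),
        (10 ** -1,   "substantial evidence against the null"),
        (10 ** -1.5, "strong evidence against the null"),
        (10 ** -2,   "very strong evidence against the null"),
    ]
    for grade, (cutoff, text) in enumerate(scale):
        if K > cutoff:
            return grade, text
    return 5, "decisive evidence against the null"
```

For example, `jeffreys_grade(0.9564)` returns grade 1 for the precise-null test on slide 9, matching its "very poor evidence" conclusion.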


5#

Example: Jeremy’s IQ#

In the context of Jeremy’s IQ example, test the hypotheses

\[ H_0: \theta \leq 100 \quad \text{vs.} \quad H_1: \theta > 100 \]

where the posterior is

\[ \theta | x \sim N(102.8, 48) \]
  • \( p_0 = \mathbb{P}^{\theta | X}(H_0) = \int_{-\infty}^{100} \frac{1}{\sqrt{2\pi \cdot 48}} \cdot e^{\frac{-(\theta - 102.8)^2}{2 \cdot 48}} d\theta \)

    \[ = \text{normcdf}(100, 102.8, \sqrt{48}) \]
    \[ = \boxed{0.3431} \]
  • \( p_1 = \mathbb{P}^{\theta | X}(H_1) = 1 - 0.3431 = \boxed{0.6569} \)


6#

  • \( \pi_0 = \mathbb{P}^{\theta}(H_0) = \int_{-\infty}^{100} \frac{1}{\sqrt{2\pi \cdot 120}} e^{\frac{-(\theta - 110)^2}{2 \cdot 120}} d\theta \)

    \[ = \text{normcdf}(100, 110, \sqrt{120}) \]
    \[ = \boxed{0.1807} \]
\[ \pi_1 = 1 - 0.1807 = \boxed{0.8193} \]
  • \( B_{10} = \frac{p_1 / p_0}{\pi_1 / \pi_0} = \frac{0.6569 / 0.3431}{0.8193 / 0.1807} = \frac{1.9146}{4.5340} = \boxed{0.4223} \)

\[ \frac{p_1}{p_0} = B_{10} \times \frac{\pi_1}{\pi_0} \]
  • \( \log_{10} B_{01} = -\log_{10} B_{10} = \boxed{0.3744} \quad \text{(poor evidence in favor of } H_0 \text{)} \)
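These values are easy to verify in Python, where `scipy.stats.norm.cdf` plays the role of the `normcdf` calls above:

```python
import numpy as np
from scipy.stats import norm

p0 = norm.cdf(100, loc=102.8, scale=np.sqrt(48))   # posterior P(H0) ~= 0.3431
p1 = 1 - p0                                        # ~= 0.6569
pi0 = norm.cdf(100, loc=110, scale=np.sqrt(120))   # prior P(H0) ~= 0.1807
pi1 = 1 - pi0                                      # ~= 0.8193

B10 = (p1 / p0) / (pi1 / pi0)                      # ~= 0.4223
print(-np.log10(B10))                              # log10(B01) ~= 0.3744
```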


7#

Example: 10 flips of a coin revised#

\[\begin{split} \begin{align*} X | p &\sim \text{Bin}(n, p)\\ p &\sim \text{Be}(500,500)\\ \end{align*} \end{split}\]

Having observed \(X = 0\) heads in \(n = 10\) flips, the posterior is \(p | X \sim \text{Be}(500,510)\).


  • We already found the posterior mean in a previous example:

    \[ \mathbb{E}(p | X) = \frac{500}{1010} = 0.4950495 \dots \]
  • The mode for \(\text{Be}(\alpha, \beta)\) is given by:

    \[ \frac{\alpha - 1}{\alpha + \beta - 2}; \quad \text{here the posterior mode is } \frac{499}{1008} = 0.4950397 \dots \]
  • The median (no closed form; evaluated numerically via special functions):

    \[ \text{betainv}(0.5, 500, 510) = 0.4950462 \dots \]

    Approximation:

    \[ \frac{\alpha - 1/3}{\alpha + \beta - 2/3} = \frac{499.6666}{1009.3333} = 0.4950462 \dots \]
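Here's a quick check of all three summaries in Python, where `beta.ppf` is the equivalent of MATLAB's `betainv`:

```python
from scipy.stats import beta

a, b = 500, 510
posterior = beta(a, b)

print(posterior.mean())            # 0.495049... = 500/1010
print((a - 1) / (a + b - 2))       # mode: 499/1008 = 0.495039...
print(posterior.ppf(0.5))          # median: 0.495046...
print((a - 1/3) / (a + b - 2/3))   # approximation: 0.495046...
```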

8#

Test: \(H_0: p \leq 0.5\) vs. \(H_1: p > 0.5\)#

\[\begin{split} \begin{align*} p_0 &= \int_0^{0.5} \frac{1}{B(500,510)} p^{500-1} (1 - p)^{510-1} dp\\ &= \text{betacdf}(0.5,500,510) \\ &= \boxed{0.6235}\\ p_1 &= 1 - p_0 = \boxed{0.3765} \end{align*} \end{split}\]
\[\begin{split} \begin{align*} \pi_0 &= \int_0^{0.5} \frac{1}{B(500,500)} p^{500-1} (1 - p)^{500-1} dp\\ &= \text{betacdf}(0.5,500,500)\\ &= \boxed{0.5} \\ \pi_1 &= 1 - \pi_0 = \boxed{0.5}\\ \end{align*} \end{split}\]

\[ B_{01} = \frac{p_0 / p_1}{\pi_0 / \pi_1} = \frac{0.6235 / 0.3765}{0.5 / 0.5} = \frac{0.6235}{0.3765} = \boxed{1.656} \]
\[ \log_{10} B_{01} = \boxed{0.2191} \quad \text{(Poor evidence against } H_1 \text{)} \]
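The same check in Python, with `beta.cdf` standing in for `betacdf`:

```python
import numpy as np
from scipy.stats import beta

p0 = beta.cdf(0.5, 500, 510)    # posterior P(H0) ~= 0.6235
p1 = 1 - p0                     # ~= 0.3765
pi0 = beta.cdf(0.5, 500, 500)   # prior P(H0) = 0.5 by symmetry
pi1 = 1 - pi0                   # = 0.5

B01 = (p0 / p1) / (pi0 / pi1)   # ~= 1.656
print(np.log10(B01))            # ~= 0.2191
```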

9#

Precise Null Test#

\( H_0: p = 0.5 \) vs. \( H_1: p \neq 0.5 \)

\[ \pi(p) = 0.8 \cdot \delta_{0.5} + 0.2 \cdot \text{Be}(500,500) \]
\[ \pi_0 = 0.8, \quad \pi_1 = 0.2 \]
\[\begin{split} \begin{align*} m_1(x) \Big|_{X=0} &= m_1(0)\\ &= \int_0^1 \binom{10}{0} p^0 (1-p)^{10} \frac{1}{B(500,500)} p^{500-1} (1-p)^{500-1} dp \\ &= \frac{B(500,510)}{B(500,500)} = \boxed{0.001021} \end{align*} \end{split}\]
\[\begin{split} \begin{align*} f(x|p) \Big|_{x=0, p=0.5} &= f(0|0.5)\\ &= \binom{10}{0}\, 0.5^0 \cdot 0.5^{10} = \frac{1}{1024} = \boxed{0.0009765}\\ \end{align*} \end{split}\]
\[ \textcolor{red}{B_{01}} = \frac{f(0|0.5)}{m_1(0)} = \frac{0.0009765}{0.001021} = \boxed{0.9564} \]
\[ \textcolor{red}{\log_{10} B_{10} = -\log_{10} B_{01}} = -\log_{10} 0.9564 = \boxed{0.0194} \]

Very poor evidence against \(\textcolor{red}{H_0}\).
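To check these numbers, compute \(m_1(0)\) on the log scale with `scipy.special.betaln`, which avoids overflow in the beta functions. The last line, my own addition, plugs the result into the posterior-probability formula from slide 3:

```python
import numpy as np
from scipy.special import betaln

pi0, pi1 = 0.8, 0.2

# m1(0) = B(500, 510) / B(500, 500), via log-beta for numerical stability
m1 = np.exp(betaln(500, 510) - betaln(500, 500))   # ~= 0.001021
f0 = 0.5 ** 10                                     # f(0 | 0.5) = 1/1024

B01 = f0 / m1                                      # ~= 0.9564
print(-np.log10(B01))                              # log10(B10) ~= 0.0194

# Posterior probability of the precise null (slide 3 formula): ~= 0.79
post_H0 = 1 / (1 + (pi1 / pi0) * (m1 / f0))
```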