6. Analysis of Variance#

One-Way ANOVA#

In an ANOVA model, we are trying to discern whether the means of \(G\) different treatment groups are equivalent or different. The null hypothesis is that the means of all treatment groups are equal:

\[\begin{split} \begin{align*} H_0 &: \mu_1 = \mu_2 = \cdots = \mu_G \\ H_1 &: \text{At least one group mean $\mu_{i}$ is different}\\ &\phantom{:}\text{from the others} \end{align*} \end{split}\]

The ANOVA model is defined as

\[\begin{split} \begin{align*} y_{ij} & \sim N(\mu_i, \sigma^2) \\ \mu_i & = \mu + a_i, \quad i = 1,\dots, G \ ; \ j = 1, \dots ,n_i\\ &\phantom{=} \text{s.t. }\sum_i a_{i} = 0 \text{ (see below)} \end{align*} \end{split}\]

Here, \(i\) enumerates the treatment groups, \(j\) enumerates the sample points, \(\mu\) is the overall mean (grand mean), and \(a_{i}\) represents the \(i\)-th treatment’s deviation from the grand mean (treatment effect size). Notice the implication of the hierarchical structure introduced earlier in Unit 7.

Since, under the null hypothesis, the group means are all equal to the grand mean:

\[\begin{split} \begin{align*} \mu_i & = \mu + a_i\\ \mu_{i} - \mu &= a_{i}\\ a_{i} &= 0 \quad \forall i\\ \end{align*} \end{split}\]

Observe that we could also write our null hypothesis as:

\[\begin{split} \begin{align*} H_0 & : a_1 = a_2 = \cdots =a_{G} = 0 \\ \end{align*} \end{split}\]

In classical statistics, an ANOVA table is built and an \(F\)-test is used to determine if \(H_0\) is rejected. Note that the rejection of \(H_{0}\) only implies that the presence of a difference was detected. Additional methods are required to identify which group means are significantly different from each other.
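For reference, the classical \(F\) statistic can be computed directly from the between- and within-group sums of squares. The sketch below uses NumPy on simulated data (the group means, sizes, and seed are illustrative choices, not from the text):

```python
import numpy as np

def one_way_anova_F(groups):
    """Classical one-way ANOVA F statistic for a list of 1-D group samples."""
    G = len(groups)
    n_i = np.array([len(g) for g in groups])
    N = n_i.sum()
    grand_mean = np.concatenate(groups).mean()
    group_means = np.array([g.mean() for g in groups])
    # Between-group (treatment) and within-group (error) sums of squares
    ss_between = np.sum(n_i * (group_means - grand_mean) ** 2)
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    # Mean squares and the F ratio
    return (ss_between / (G - 1)) / (ss_within / (N - G))

rng = np.random.default_rng(1)
# Three simulated groups; the third has a genuinely different mean
groups = [rng.normal(mu, 1.0, size=20) for mu in (0.0, 0.0, 1.5)]
F = one_way_anova_F(groups)
```

A large \(F\) ratio (relative to the \(F_{G-1,\,N-G}\) reference distribution) leads to rejecting \(H_0\).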

The Bayesian approach to ANOVA is to set priors on \(\mu\), \(a_1,a_2,...,a_G\), and \(\sigma^2\). The Bayesian model is constructed as:

\[\begin{split} \begin{align*} y_{ij} & \sim N(\mu_i,\sigma^2) && \text{likelihood}\\ \\ \mu & \sim N(0,\sigma_0^2) && \text{prior: grand mean}\\ \mu_i & = \mu + a_i && \text{deterministic relationship}\\ a_i & \sim N(0,\sigma_i^2) && \text{prior: } a_i\\ \tau &\sim Ga(0.001,0.001) && \text{prior: } \tau\\ \sigma^2 & = 1/\tau && \text{deterministic relationship}\\ \\ \text{Subject to: } & \sum a_i = 0 && \text{STZ constraint}\\ \end{align*} \end{split}\]

Here \(\sigma_0^2\) and \(\sigma_i^2\) represent the prior variances of the grand mean and the treatment effects, respectively. A non-informative variance might be something like \(\sigma_0^2 = 1000\). To assess \(H_0\), we look at the posterior distributions of the \(a_i\)’s to see if they are significantly different from zero. If we want to compare treatment groups, we can calculate their difference within the Bayesian model and then look at the posterior distribution of that difference to see if it is significantly different from zero.
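As an illustration of comparing two treatment groups, suppose we already have posterior draws of \(a_1\) and \(a_2\). Here the draws are simulated with NumPy as stand-ins for MCMC output, with made-up means and spreads; the point is that the posterior of the difference is computed draw by draw:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-ins for posterior draws of two treatment effects;
# in practice these arrays would come from a sampler such as PyMC
a1_draws = rng.normal(0.8, 0.2, size=5000)
a2_draws = rng.normal(0.1, 0.2, size=5000)

# Posterior of the difference, one value per draw
diff = a1_draws - a2_draws

# A 94% credible interval, and the posterior probability the difference is positive
lo, hi = np.quantile(diff, [0.03, 0.97])
prob_positive = (diff > 0).mean()
```

If the credible interval for the difference excludes zero, we conclude the two group means differ.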

This is another example of how Bayesian modeling allows for direct specification and interpretation of hypotheses in the “same world that the data live in”. Other methods for Bayesian model evaluation exist, and you will be exposed to a few throughout the remainder of the course.

Warning

Suppose instead that we specified only:

\[\begin{split} \begin{align*} y_{ij} & \sim N(\mu_i, \sigma^2) \\ \mu_i & = \mu + a_i, \quad i = 1,\dots,G \ ; \ j = 1,\dots,n_i \end{align*} \end{split}\]

The joint log-likelihood, dropping additive constants that do not involve \(\mu\) or \(\mathbf{a}\), would be:

\[ \log \mathcal{L}(\mathbf{y} \mid \mu, \mathbf{a}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{G} \sum_{j=1}^{n_i} (y_{ij} - \mu - a_i)^2 \]

Now consider a shifted parameterization:

\[ \mu' = \mu + c, \quad a_i' = a_i - c \]

The new joint log-likelihood is:

\[\begin{split} \begin{align*} \log \mathcal{L}(\mathbf{y} \mid \mu', \mathbf{a}', \sigma^2) &= -\frac{1}{2\sigma^2} \sum_{i=1}^{G} \sum_{j=1}^{n_i} (y_{ij} - (\mu + c) - (a_i - c))^2\\ &= -\frac{1}{2\sigma^2} \sum_{i=1}^{G} \sum_{j=1}^{n_i} (y_{ij} - \mu - a_i)^2\\ &= \log \mathcal{L}(\mathbf{y} \mid \mu, \mathbf{a}, \sigma^2) \end{align*} \end{split}\]

…the same as the first joint log-likelihood. This is problematic: without constraints on \(\mathbf{a}\), there exist infinitely many combinations of \((\mu', \mathbf{a}')\) that yield the same likelihood value. When this happens, the model is said to be non-identifiable or unidentifiable. You can learn more about the concept of identifiability in statistics here.
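The shift argument above is easy to check numerically. This sketch (simulated data and arbitrary parameter values, chosen only for the demonstration) evaluates the quadratic part of the log-likelihood before and after the shift:

```python
import numpy as np

def quad_loglik(y, mu, a, sigma2):
    """Quadratic part of the joint log-likelihood for the unconstrained model."""
    return sum(-0.5 / sigma2 * np.sum((y_i - mu - a_i) ** 2)
               for y_i, a_i in zip(y, a))

rng = np.random.default_rng(2)
# Three simulated groups of 10 observations each
y = [rng.normal(m, 1.0, size=10) for m in (1.0, 2.0, 3.0)]

mu, a, c = 2.0, np.array([-1.0, 0.0, 1.0]), 5.0
ll_original = quad_loglik(y, mu, a, 1.0)
# Shifted parameterization: mu' = mu + c, a' = a - c
ll_shifted = quad_loglik(y, mu + c, a - c, 1.0)
```

The two values coincide for any choice of \(c\), which is exactly the non-identifiability described above.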

In the frequentist paradigm, this results in a non-unique maximum likelihood estimate (MLE). In the Bayesian paradigm, the posterior may, under some circumstances, become non-integrable. In both cases, computational methods such as optimization or MCMC sampling may fail to converge, resulting in unreliable parameter estimates.

If, in your assignments or projects, you define a Bayesian ANOVA and convergence fails, check that you have specified your model in a way that is identifiable.

Recall that in any mean-centered sample, the signed deviations from the mean sum to zero. In a similar manner, imposing the sum to zero (STZ) constraint on \(\mathbf{a}\) ensures a unique solution and preserves the interpretability of each \(a_{i}\) as a centered deviation from the grand mean. In fact, subtracting the mean of \(\mathbf{a}\) at each sampling iteration ensures that \(\sum_i a_i = 0\), satisfying the STZ constraint. Some examples of how to do this in PyMC are provided in future lessons.
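A minimal sketch of the centering trick, independent of any particular sampler (the effect values here are arbitrary simulated draws):

```python
import numpy as np

rng = np.random.default_rng(3)
# Unconstrained effect values, e.g. one sampling iteration's draws for G = 4 groups
a_raw = rng.normal(0.0, 1.0, size=4)

# Subtracting the mean enforces the sum-to-zero (STZ) constraint exactly
a = a_raw - a_raw.mean()
```

Each centered \(a_i\) can then be interpreted directly as that group's deviation from the grand mean.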

Note

Notice that in the frequentist formulation, there is no \(\sigma_i^2\), \(\sigma_j^2\), or \(\sigma_{ij}^2\). This is deliberate: frequentist ANOVA assumes homoscedasticity, meaning constant variance across all groups. If this assumption is violated, especially when sample sizes differ, the risk of Type I or Type II errors increases.

In the Bayesian framework, we may place separate priors on each \(\sigma_i^2\), allowing us to relax the homoscedasticity assumption when appropriate. However, because the choice of priors on variance components can affect both model behavior and interpretation, homoscedasticity is often retained by choice, even though it is not strictly required.

Authors#

Jason Naramore, August 2024.

Rob Bennett, June 2025.