12. Multiple Linear Regression#

Contributed by Jason Naramore.

Multiple linear regression extends simple linear regression to the case of multiple predictors \( x_1, x_2, \ldots, x_k \).

\[\begin{split} \begin{align*} y_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i, \quad i = 1 , \ldots , n \\ \epsilon_i & \overset{iid}{\sim} N (0,\sigma^2) \end{align*} \end{split}\]

In matrix form, the model can be written compactly as:

\[\begin{split} \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1} \quad \mathbf{X} = \begin{bmatrix} 1 & X_{11} & \dots & X_{1k} \\ 1 & X_{21} & \dots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \dots & X_{nk} \end{bmatrix}_{n \times p} \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_k \end{bmatrix}_{p \times 1} \quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}_{n \times 1} \end{split}\]
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]

where \( p = k + 1 \), counting the intercept plus the \( k \) predictor coefficients.

There’s an elegant closed-form solution for finding the \( \hat{\boldsymbol{\beta}} \) estimates in classical statistics:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \]

The estimates \( \hat{\mathbf{y}} \) can then be found by \( \hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}} \).
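Below is a minimal NumPy sketch of this closed-form solution on synthetic data; the variable names (`X`, `y`, `beta_hat`) and the data-generating values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# synthetic data: n observations, k predictors, plus an intercept column of ones
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k + 1) design matrix
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# closed-form estimate: beta_hat = (X'X)^{-1} X'y
# (solving the normal equations is numerically preferable to inverting X'X explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat  # fitted values
```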

In the Bayesian form of multiple linear regression, we place priors on all the \( \beta \)'s and on \( \sigma^2 \).

\[\begin{split} \begin{align*} y_i & \sim N(\mu_i,\sigma^2) && \text{likelihood} \\ \mu_i & = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik} && \text{deterministic relationship} \\ \beta_j & \sim N(0,\sigma_j^2) && \text{prior: } \beta_j, \ j = 0, \ldots, k \\ \tau & \sim Ga(a,b) && \text{prior: } \tau \\ \sigma^2 & = 1/\tau && \text{deterministic relationship} \end{align*} \end{split}\]

We typically assume that the \( \beta \)'s are independent. For non-informative priors, we might set \( \sigma_j^2 \) to something large, like \( 10^3 \) or \( 10^4 \), and the \( a \) and \( b \) parameters in \( \tau \)'s Gamma distribution to something small, like \( 10^{-3} \).
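As a rough sketch of this specification (not the Taste of Cheese model itself), a PyMC version might look like the following, reusing the synthetic `X`, `y`, and `k` from the NumPy example above and the non-informative hyperparameter choices just mentioned.

```python
import numpy as np
import pymc as pm

with pm.Model() as mlr_model:
    # vague Normal priors on the coefficients (variance 10^3)
    beta = pm.Normal("beta", mu=0, sigma=np.sqrt(1e3), shape=k + 1)
    # vague Gamma prior on the precision tau, with sigma^2 = 1/tau
    tau = pm.Gamma("tau", alpha=1e-3, beta=1e-3)
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))

    mu = pm.math.dot(X, beta)  # deterministic mean for each observation
    pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample()
```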

An example using independent Normal priors on the \( \beta \)'s can be found here: Taste of Cheese.

Other methods for defining priors, such as Zellner’s prior, may help to account for covariance between predictors.

\[\begin{split} \begin{align*} \boldsymbol{\beta} & \sim MVN(\boldsymbol{\mu}, g \cdot \sigma^2 \mathbf{V}) && \text{prior: } \boldsymbol{\beta} \\ \sigma^2 & \sim IG(a,b) && \text{prior: } \sigma^2 \end{align*} \end{split}\]
\[ \text{typical choices: } g = n, \quad g = p^2, \quad g = \max\{n,p^2\} \]

where \( g \sigma^2 \mathbf{V} \) is the prior covariance matrix, and \( \mathbf{V} \) is commonly taken to be \( (\mathbf{X}^T \mathbf{X})^{-1} \). An example using Zellner’s prior can be found here: Brozek Index Prediction.
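A hedged PyMC sketch of a Zellner-style prior (again reusing `X` and `y` from the sketches above, and not the Brozek Index Prediction model itself) could look like this, with \( \mathbf{V} = (\mathbf{X}^T \mathbf{X})^{-1} \) and \( g = n \) as an assumed choice.

```python
import numpy as np
import pymc as pm

g = X.shape[0]  # one of the typical choices: g = n
p = X.shape[1]  # number of columns in the design matrix (k + 1)
V = np.linalg.inv(X.T @ X)

with pm.Model() as zellner_model:
    sigma2 = pm.InverseGamma("sigma2", alpha=1e-3, beta=1e-3)
    # multivariate Normal prior on all coefficients jointly,
    # with covariance g * sigma^2 * (X'X)^{-1}
    beta = pm.MvNormal("beta", mu=np.zeros(p), cov=g * sigma2 * V, shape=p)

    mu = pm.math.dot(X, beta)
    pm.Normal("likelihood", mu=mu, sigma=pm.math.sqrt(sigma2), observed=y)

    trace = pm.sample()
```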