12. Multiple Linear Regression
Contributed by Jason Naramore.
Multiple linear regression extends simple linear regression to the case of multiple predictors \( x_1, x_2, \ldots, x_k \):
\[\begin{split}
\begin{align*}
y_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i, \quad i = 1 , \ldots , n \\
\epsilon_i & \overset{iid}{\sim} N (0,\sigma^2)
\end{align*}
\end{split}\]
The model can be written compactly in matrix form, where \( p = k + 1 \) is the number of columns of \( \mathbf{X} \) (one intercept column plus the \( k \) predictors):
\[\begin{split}
\mathbf{y} =
\begin{bmatrix}
y_1 \\
\vdots \\
y_n
\end{bmatrix}_{n \times 1}
\quad
\mathbf{X} = \begin{bmatrix}
1 & X_{11} & \dots & X_{1k} \\
1 & X_{21} & \dots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \dots & X_{nk}
\end{bmatrix}_{n \times p}
\quad
\boldsymbol{\beta} =
\begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_k
\end{bmatrix}_{p \times 1}
\quad
\boldsymbol{\epsilon} =
\begin{bmatrix}
\epsilon_1 \\
\vdots \\
\epsilon_n
\end{bmatrix}_{n \times 1}
\end{split}\]
\[
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
\]
There’s an elegant closed-form solution for finding the \( \hat{\boldsymbol{\beta}} \) estimates in classical statistics, provided \( \mathbf{X}^T \mathbf{X} \) is invertible:
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\]
The estimates \( \hat{\mathbf{y}} \) can then be found by \( \hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}} \).
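As a quick check of the closed-form solution, here is a minimal NumPy sketch; the design matrix, coefficients, and noise level are made up purely for illustration.

```python
import numpy as np

# Illustrative data only: n observations, k predictors, made-up coefficients.
rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix, p = k + 1
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS estimate: beta_hat = (X^T X)^{-1} X^T y,
# solved via the normal equations rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat  # fitted values
print(beta_hat)
```

Solving the normal equations (or calling `np.linalg.lstsq` on \( \mathbf{X} \) directly) is numerically safer than explicitly inverting \( \mathbf{X}^T \mathbf{X} \).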
In the Bayesian form of multiple linear regression, we place priors on all the \( \beta \)’s and on \( \sigma^2 \).
\[\begin{split}
\begin{align*}
y_i & \sim N(\mu_i,\sigma^2) && \text{likelihood} \\
\mu_i & = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik} && \text{deterministic relationship} \\
\beta_j & \sim N(0,\sigma_j^2) && \text{prior: } \beta_j, \space j = 0 \text{ to } k \\
\tau & \sim Ga(a,b) && \text{prior: } \tau \\
\sigma^2 & = 1/\tau && \text{deterministic relationship}
\end{align*}
\end{split}\]
We typically assume that the \( \beta \)’s are independent. For non-informative priors, we might set \( \sigma_j^2 \) to something large, like \( 10^3 \) or \( 10^4 \), and the \( a \) and \( b \) parameters in \( \tau \)’s Gamma distribution to something small, like \( 10^{-3} \).
An example using independent Normal priors on the \( \beta \)’s can be found here: Taste of Cheese.
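A minimal sketch of this model, assuming PyMC as the probabilistic programming tool (the simulated data and variable names below are illustrative only, not taken from the linked example), might look like:

```python
import numpy as np
import pymc as pm

# Illustrative data only: made-up generating process.
rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

with pm.Model() as model:
    # Vague independent Normal priors on the intercept and slopes (sigma_j^2 = 10^4).
    beta0 = pm.Normal("beta0", mu=0, sigma=100)
    beta = pm.Normal("beta", mu=0, sigma=100, shape=k)
    # Ga(0.001, 0.001) prior on the precision tau, with sigma^2 = 1/tau.
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))

    mu = beta0 + pm.math.dot(X, beta)  # deterministic mean for each observation
    pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(1000, tune=1000)
```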
Other methods for defining priors, such as Zellner’s prior, may help to account for covariance between predictors.
\[\begin{split}
\begin{align*}
\boldsymbol{\beta} & \sim MVN(\boldsymbol{\mu},g \cdot \sigma^2 \mathbf{V}) && \text{prior: } \boldsymbol{\beta} \\
\sigma^2 & \sim IG(a,b) && \text{prior: } \sigma^2
\end{align*}
\end{split}\]
\[
\text{typical choices: } g = n, \quad g = p^2, \quad g = \max\{n,p^2\}
\]
where \( g \cdot \sigma^2 \mathbf{V} \) is the prior covariance matrix, with \( \mathbf{V} = (\mathbf{X}^T \mathbf{X})^{-1} \) in Zellner’s original formulation. An example using Zellner’s prior can be found here: Brozek Index Prediction.
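As a rough sketch of how Zellner’s prior could be coded (again assuming PyMC, with \( \mathbf{V} = (\mathbf{X}^T \mathbf{X})^{-1} \), \( g = n \), and made-up data; this is not the linked example’s implementation):

```python
import numpy as np
import pymc as pm

# Illustrative data only, with an explicit intercept column in X.
rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

p = k + 1
g = n                           # one of the typical choices for g
V = np.linalg.inv(X.T @ X)      # V = (X^T X)^{-1}

with pm.Model() as zellner_model:
    sigma2 = pm.InverseGamma("sigma2", alpha=0.001, beta=0.001)
    # Zellner's g-prior: beta ~ MVN(0, g * sigma^2 * (X^T X)^{-1}).
    beta = pm.MvNormal("beta", mu=np.zeros(p), cov=g * sigma2 * V)
    mu = pm.math.dot(X, beta)
    pm.Normal("likelihood", mu=mu, sigma=pm.math.sqrt(sigma2), observed=y)

    trace = pm.sample(1000, tune=1000)
```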