12. Multiple Linear Regression
Multiple linear regression extends simple linear regression to the case of multiple predictors \(x_1, x_2, \ldots, x_k\):
\[\begin{split}
\begin{align*}
y_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i, \quad i = 1 , \ldots , n \\
\epsilon_i & \overset{iid}{\sim} N (0,\sigma^2)
\end{align*}
\end{split}\]
The model has a compact matrix form, where \(\mathbf{X}\) has \(p = k + 1\) columns (one for the intercept plus one per predictor):
\[\begin{split}
\mathbf{y} =
\begin{bmatrix}
y_1 \\
\vdots \\
y_n
\end{bmatrix}_{n \times 1}
\quad
\mathbf{X} = \begin{bmatrix}
1 & X_{11} & \dots & X_{1k} \\
1 & X_{21} & \dots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \dots & X_{nk}
\end{bmatrix}_{n \times p}
\quad
\boldsymbol{\beta} =
\begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_k
\end{bmatrix}_{p \times 1}
\quad
\boldsymbol{\epsilon} =
\begin{bmatrix}
\epsilon_1 \\
\vdots \\
\epsilon_n
\end{bmatrix}_{n \times 1}
\end{split}\]
\[
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
\]
In classical statistics there is an elegant closed-form solution for the estimates \(\hat{\boldsymbol{\beta}}\): minimizing the sum of squared residuals leads to the normal equations \(\mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}\), whose solution is
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\]
The fitted values \(\hat{\mathbf{y}}\) then follow as \(\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}\).
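As a quick numerical illustration, the closed form maps directly onto NumPy. This is a minimal sketch with simulated data (the sample size, coefficients, and seed are arbitrary); `np.linalg.solve` is used on the normal equations rather than forming the inverse explicitly, which is numerically safer.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate n observations with k predictors; X gets an intercept column
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p, p = k + 1
beta_true = np.array([1.0, 2.0, -0.5, 0.7])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat  # fitted values
print(beta_hat)       # should be close to beta_true
```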
In the Bayesian form of multiple linear regression, we place priors on all \(\beta\)’s and \(\sigma^2\).
\[\begin{split}
\begin{align*}
y_{i} & \sim N(\mu_i,\sigma^2) && \text{likelihood} \\
\boldsymbol{\mu} & = \mathbf{X}\boldsymbol{\beta} && \text{deterministic relationship} \\
\beta_j & \sim N(0,\sigma_j^2) && \text{prior: } \beta_j, \ j = 0, \ldots, k \\
\tau & \sim Ga(a,b) && \text{prior: } \tau \\
\sigma^2 & = 1/\tau && \text{deterministic relationship}
\end{align*}
\end{split}\]
We typically assume that the \(\beta\)’s are independent. For non-informative priors, we might set \(\sigma_j^2\) to something large, like \(10^3\) or \(10^4\), and the \(a\) and \( b \) parameters in \(\tau\)’s Gamma distribution to something small, like \(10^{-3}\).
An example using independent Normal priors on \(\beta\)’s can be found here: Taste of Cheese.
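To make the specification concrete, here is a minimal PyMC sketch of this model. PyMC, the variable names, and the vague hyperparameter values are illustrative assumptions rather than the exact code from the linked example; `X` and `y` are the arrays from the NumPy snippet above.

```python
import pymc as pm

# X is n x p (intercept column included) and y is length n,
# as in the NumPy snippet above
with pm.Model() as mlr_model:
    # independent vague priors: beta_j ~ N(0, sigma_j^2 = 1e4), i.e. sd = 100
    beta = pm.Normal("beta", mu=0, sigma=100, shape=X.shape[1])
    # precision tau ~ Gamma(a, b) with small a = b = 1e-3
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    # deterministic relationships: sigma^2 = 1 / tau, mu = X beta
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))
    mu = pm.math.dot(X, beta)
    # likelihood
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()  # posterior draws via NUTS
```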
Other methods for defining priors, such as Zellner’s prior, may help to account for covariance between predictors.
\[\begin{split}
\begin{align*}
\boldsymbol{\beta} & \sim MVN(\boldsymbol{\mu}, g \sigma^2 \mathbf{V}) && \text{prior: } \boldsymbol{\beta} \\
\sigma^2 & \sim IG(a,b) && \text{prior: } \sigma^2
\end{align*}
\end{split}\]
\[
\text{typical choices: } g = n, \quad g = p^2, \quad g = \max\{n,p^2\}
\]
where the prior covariance matrix of \(\boldsymbol{\beta}\) is \(g \sigma^2 \mathbf{V}\). In Zellner's g-prior, \(\mathbf{V}\) is typically taken to be \((\mathbf{X}^T \mathbf{X})^{-1}\), so the prior reflects the correlation structure of the predictors. An example using Zellner’s prior can be found here: Brozek Index Prediction.
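As a rough sketch of how Zellner's prior could be coded in the same PyMC setting, taking \(\mathbf{V} = (\mathbf{X}^T \mathbf{X})^{-1}\), a zero prior mean, \(g = n\), and arbitrary Inverse Gamma hyperparameters (all illustrative choices):

```python
import numpy as np
import pymc as pm

g = X.shape[0]              # g = n, one of the typical choices
V = np.linalg.inv(X.T @ X)  # V = (X^T X)^{-1}
p = X.shape[1]

with pm.Model() as zellner_model:
    sigma2 = pm.InverseGamma("sigma2", alpha=2.0, beta=1.0)  # sigma^2 ~ IG(a, b)
    # beta ~ MVN(0, g * sigma^2 * V): the prior covariance scales with the
    # error variance and mirrors the predictors' correlation structure
    beta = pm.MvNormal("beta", mu=np.zeros(p), cov=g * sigma2 * V)
    mu = pm.math.dot(X, beta)
    pm.Normal("y_obs", mu=mu, sigma=pm.math.sqrt(sigma2), observed=y)
    idata = pm.sample()
```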
Authors
Jason Naramore, August 2024.