import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import pytensor.tensor as pt
11. Brozek index prediction
This example goes over linear regression and Bayesian \(R^2\).
Adapted from Unit 7: fat1.odc, fat2d.odc, and fatmulti.odc.
Data can be found here.
Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.
The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained with scales and a measuring tape.
y = Brozek index
X1 = 1 (intercept)
X2 = age
X3 = weight
X4 = height
X5 = adipose
X6 = neck
X7 = chest
X8 = abdomen
X9 = hip
X10 = thigh
X11 = knee
X12 = ankle
X13 = bicep
X14 = forearm
X15 = wrist
These anthropometric variables are less intrusive to measure, but also less reliable for assessing body fat.
Fit a linear regression to predict the Brozek index from these body measurements.
Single predictor (X8)
This is a recreation of fat1.odc.
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)
# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2
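The model below also tracks the “Bayesian \(R^2\)” used in fat1.odc, computed from the error variance \(\sigma^2 = 1/\tau\) and the total sum of squares:

\[
R^2_B = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{(n - p)\,\sigma^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

so each posterior draw of \(\tau\) yields a draw of br2.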
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 3 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
|  | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -35.043 | 2.436 | -39.797 | -30.335 | 0.046 | 0.032 | 2856.0 | 3356.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.026 | 0.533 | 0.635 | 0.000 | 0.000 | 2829.0 | 3291.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.042 | 0.058 | 0.000 | 0.000 | 3708.0 | 3918.0 | 1.0 |
| variance | 20.519 | 1.814 | 17.128 | 24.075 | 0.030 | 0.021 | 3708.0 | 3918.0 | 1.0 |
| br2 | 0.660 | 0.030 | 0.601 | 0.716 | 0.000 | 0.000 | 3708.0 | 3918.0 | 1.0 |
This matches the results from the U7 L11 video.
Another way to calculate \(R^2\) is with a posterior predictive check (keeping in mind that there is no single standard “Bayesian \(R^2\)”). The results will be slightly different:
# updated based on the arviz docs
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))["likelihood"].values.T
az.r2_score(y, y_pred)
r2 0.594825
r2_std 0.026037
dtype: float64
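For reference, here is a minimal sketch of what az.r2_score reports, as I understand the ArviZ implementation of the Gelman et al. (2019) posterior predictive \(R^2\): for each posterior draw, the variance of the predictions is compared against the variance of the residuals.

```python
# Sketch of the per-draw R^2 behind az.r2_score (an assumption about its
# internals, not a substitute for it). y_pred has shape (n_draws, n_obs),
# as stacked above; y has shape (n_obs,).
var_pred = y_pred.var(axis=1)        # variance of predictions, per draw
var_res = (y - y_pred).var(axis=1)   # variance of residuals, per draw
r2_draws = var_pred / (var_pred + var_res)
print(r2_draws.mean(), r2_draws.std())
```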
Multivariate regression with all predictors
Based on fat2d.odc or fatmulti.odc (they appear to be identical). This model uses Zellner’s g-prior [Zellner, 1986]. The PyMC devs now recommend LKJ Cholesky covariance priors for models like this, and the Stan User’s Guide recommends them as well. I haven’t been able to get one running smoothly for this particular model yet, but will update when I can.
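As an untested sketch of what that could look like (reusing y, X_aug, and p from the data setup in the next cell; the eta=2.0 and Exponential(1) scale choices are placeholder assumptions, not tuned values):

```python
# Untested sketch: an LKJ Cholesky covariance prior on beta in place of
# the g-prior used below. Not the model from the .odc files.
with pm.Model() as m_lkj:  # m_lkj is a hypothetical name
    tau = pm.Gamma("tau", 0.01, 0.01)
    # eta and sd_dist are placeholder assumptions
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=p, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    beta = pm.MvNormal("beta", mu=np.zeros(p), chol=chol)
    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)
```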
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)
# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape
# Zellner's g
g = p**2
n, p, g
(252, 15, 225)
X_aug.shape
(252, 15)
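Under Zellner’s g-prior, the coefficients get a multivariate normal prior whose precision is scaled by the Gram matrix:

\[
\beta \mid \tau \sim \mathcal{N}\!\left(\mathbf{0},\; \frac{g}{\tau}\left(X^\top X\right)^{-1}\right)
\]

i.e. precision \(\frac{\tau}{g} X^\top X\), which is the tau_beta constructed below. The model also tracks br2_adj \(= 1 - (n-1)\sigma^2/\mathrm{SST}\); since \(1 - R^2 = (n-p)\sigma^2/\mathrm{SST}\), this is the classical adjusted \(R^2\), \(1 - (1 - R^2)\frac{n-1}{n-p}\).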
mu_beta = np.zeros(p)
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    tau_matrix = pt.fill(pt.zeros((p, p)), tau)
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 95 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
|  | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -14.601 | 16.995 | -46.502 | 20.296 | 0.369 | 0.266 | 2125.0 | 2260.0 | 1.0 |
| beta[1] | 0.057 | 0.031 | -0.004 | 0.118 | 0.001 | 0.000 | 3102.0 | 2900.0 | 1.0 |
| beta[2] | -0.080 | 0.052 | -0.185 | 0.018 | 0.001 | 0.001 | 2135.0 | 2461.0 | 1.0 |
| beta[3] | -0.053 | 0.107 | -0.260 | 0.157 | 0.002 | 0.002 | 2917.0 | 2530.0 | 1.0 |
| beta[4] | 0.070 | 0.295 | -0.521 | 0.652 | 0.006 | 0.005 | 2411.0 | 2431.0 | 1.0 |
| beta[5] | -0.441 | 0.226 | -0.870 | 0.003 | 0.004 | 0.003 | 3272.0 | 2802.0 | 1.0 |
| beta[6] | -0.035 | 0.104 | -0.233 | 0.173 | 0.002 | 0.002 | 3209.0 | 2649.0 | 1.0 |
| beta[7] | 0.874 | 0.090 | 0.699 | 1.052 | 0.002 | 0.001 | 3220.0 | 2694.0 | 1.0 |
| beta[8] | -0.202 | 0.147 | -0.493 | 0.080 | 0.003 | 0.002 | 2688.0 | 2513.0 | 1.0 |
| beta[9] | 0.225 | 0.140 | -0.044 | 0.502 | 0.002 | 0.002 | 3408.0 | 3037.0 | 1.0 |
| beta[10] | -0.007 | 0.235 | -0.454 | 0.444 | 0.004 | 0.003 | 3451.0 | 3186.0 | 1.0 |
| beta[11] | 0.158 | 0.214 | -0.241 | 0.568 | 0.003 | 0.003 | 4151.0 | 2967.0 | 1.0 |
| beta[12] | 0.146 | 0.165 | -0.168 | 0.469 | 0.003 | 0.002 | 3832.0 | 2879.0 | 1.0 |
| beta[13] | 0.430 | 0.193 | 0.063 | 0.811 | 0.003 | 0.002 | 3781.0 | 3060.0 | 1.0 |
| beta[14] | -1.486 | 0.509 | -2.500 | -0.526 | 0.008 | 0.006 | 3603.0 | 3169.0 | 1.0 |
| tau | 0.059 | 0.005 | 0.049 | 0.070 | 0.000 | 0.000 | 3677.0 | 2993.0 | 1.0 |
| variance | 16.947 | 1.531 | 13.942 | 19.802 | 0.025 | 0.018 | 3677.0 | 2993.0 | 1.0 |
| br2 | 0.734 | 0.024 | 0.689 | 0.781 | 0.000 | 0.000 | 3677.0 | 2993.0 | 1.0 |
| br2_adj | 0.718 | 0.025 | 0.670 | 0.768 | 0.000 | 0.000 | 3677.0 | 2993.0 | 1.0 |
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))["likelihood"].values.T
az.r2_score(y, y_pred)
r2 0.654702
r2_std 0.024177
dtype: float64
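As in the single-predictor model, the posterior predictive \(R^2\) (about 0.65) comes out lower than the in-model br2 (about 0.73).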
%load_ext watermark
%watermark -n -u -v -iv
Last updated: Sun Mar 09 2025
Python implementation: CPython
Python version : 3.12.7
IPython version : 8.29.0
arviz : 0.20.0
pandas : 2.2.3
pymc : 5.19.1
numpy : 1.26.4
pytensor: 2.26.4