import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import pytensor.tensor as pt

11. Brozek index prediction#

This example goes over linear regression and Bayesian \(R^2\).

Adapted from Unit 7: fat1.odc, fat2d.odc, and fatmulti.odc.

Data can be found here.

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.

The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained with scales and a measuring tape.

  • y = Brozek index

  • X1 = 1 (intercept)

  • X2 = age

  • X3 = weight

  • X4 = height

  • X5 = adipose

  • X6 = neck

  • X7 = chest

  • X8 = abdomen

  • X9 = hip

  • X10 = thigh

  • X11 = knee

  • X12 = ankle

  • X13 = bicep

  • X14 = forearm

  • X15 = wrist

These anthropometric variables are less intrusive to measure, but also less reliable for assessing body fat.

Set up a linear regression to predict the Brozek index from these body measurements.

Single predictor (X8)#

This is a recreation of fat1.odc.
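
The model is a simple linear regression of the Brozek index on the abdomen measurement, with vague priors and the BUGS-style precision parameterization used in the code below:

\[
\begin{aligned}
y_i &\sim \mathcal{N}\left(\beta_0 + \beta_1\,\text{abdomen}_i,\ \tau^{-1}\right) \\
\beta_0,\ \beta_1 &\sim \mathcal{N}(0,\ \text{precision} = 0.001) \\
\tau &\sim \text{Gamma}(0.001,\ 0.001)
\end{aligned}
\]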

data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 3 seconds.
Sampling: [likelihood]

az.summary(trace, hdi_prob=0.95)
                   mean     sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
beta0_intercept -35.043  2.436   -39.797    -30.335      0.046    0.032    2856.0    3356.0    1.0
beta1_abdomen     0.583  0.026     0.533      0.635      0.000    0.000    2829.0    3291.0    1.0
tau               0.049  0.004     0.042      0.058      0.000    0.000    3708.0    3918.0    1.0
variance         20.519  1.814    17.128     24.075      0.030    0.021    3708.0    3918.0    1.0
br2               0.660  0.030     0.601      0.716      0.000    0.000    3708.0    3918.0    1.0

This matches the results from the U7 L11 video.
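
For reference, br2 is a deterministic function of the sampled error variance, so it carries a full posterior distribution rather than a single point estimate:

\[
R^2_B = 1 - \frac{SSE}{SST} = 1 - \frac{(n - p)\,\sigma^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]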

Here is another way to calculate \(R^2\), using a posterior predictive check (keeping in mind that there is no single, standard definition of “Bayesian \(R^2\)”). The results will be slightly different:

# updated based on the arviz docs
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))["likelihood"].values.T
az.r2_score(y, y_pred)
r2        0.594825
r2_std    0.026037
dtype: float64
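
az.r2_score follows the Gelman et al. (2018) definition of Bayesian \(R^2\): for each posterior predictive draw, the variance of the predictions is divided by the sum of that variance and the residual variance, then r2 and r2_std summarize the distribution over draws. A minimal sketch of that calculation (my reading of the ArviZ behavior, assuming y_pred keeps the (draws, observations) shape created above):

# sketch of the per-draw R^2 that az.r2_score summarizes
var_fit = y_pred.var(axis=1)        # variance of the predictions, per draw
var_res = (y - y_pred).var(axis=1)  # variance of the residuals, per draw
r2_draws = var_fit / (var_fit + var_res)
print(r2_draws.mean(), r2_draws.std())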

Multivariate regression with all predictors#

Based on fat2d.odc or fatmulti.odc (they appear to be identical). This model uses Zellner’s g-prior Zellner [1986]. The PyMC devs now recommend LKJ Cholesky covariance priors for models like this, and the Stan User Guide recommends them as well. I haven’t been able to get one running smoothly for this particular model yet, but will update when I can.
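
Zellner’s g-prior puts a multivariate normal on the coefficient vector with covariance proportional to \(\left(X^T X\right)^{-1}\), scaled by the error variance and a hyperparameter \(g\) (here \(g = p^2\)):

\[
\beta \mid \sigma^2 \sim \mathcal{N}\left(\mu_\beta,\ g\,\sigma^2 \left(X^T X\right)^{-1}\right)
\]

In precision form, which is how the model below builds it, that is \(\frac{\tau}{g} X^T X\) with \(\tau = 1/\sigma^2\).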

data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)

# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape

# Zellner's g
g = p**2

n, p, g
(252, 15, 225)
X_aug.shape
(252, 15)
mu_beta = np.zeros(p)
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    tau_matrix = pt.fill(pt.zeros((p, p)), tau)  # p x p matrix with tau in every cell
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)  # precision matrix: (tau / g) * X^T X
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 95 seconds.
Sampling: [likelihood]
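
br2 is the same quantity as in the single-predictor model. br2_adj also penalizes for the number of predictors: substituting \(R^2_B\) into the classical adjustment \(1 - (1 - R^2)\frac{n-1}{n-p}\) simplifies to the form used in the code:

\[
R^2_{B,\text{adj}} = 1 - \frac{(n - 1)\,\sigma^2}{SST}
\]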

az.summary(trace, hdi_prob=0.95)
           mean      sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
beta[0]  -14.601  16.995   -46.502     20.296      0.369    0.266    2125.0    2260.0    1.0
beta[1]    0.057   0.031    -0.004      0.118      0.001    0.000    3102.0    2900.0    1.0
beta[2]   -0.080   0.052    -0.185      0.018      0.001    0.001    2135.0    2461.0    1.0
beta[3]   -0.053   0.107    -0.260      0.157      0.002    0.002    2917.0    2530.0    1.0
beta[4]    0.070   0.295    -0.521      0.652      0.006    0.005    2411.0    2431.0    1.0
beta[5]   -0.441   0.226    -0.870      0.003      0.004    0.003    3272.0    2802.0    1.0
beta[6]   -0.035   0.104    -0.233      0.173      0.002    0.002    3209.0    2649.0    1.0
beta[7]    0.874   0.090     0.699      1.052      0.002    0.001    3220.0    2694.0    1.0
beta[8]   -0.202   0.147    -0.493      0.080      0.003    0.002    2688.0    2513.0    1.0
beta[9]    0.225   0.140    -0.044      0.502      0.002    0.002    3408.0    3037.0    1.0
beta[10]  -0.007   0.235    -0.454      0.444      0.004    0.003    3451.0    3186.0    1.0
beta[11]   0.158   0.214    -0.241      0.568      0.003    0.003    4151.0    2967.0    1.0
beta[12]   0.146   0.165    -0.168      0.469      0.003    0.002    3832.0    2879.0    1.0
beta[13]   0.430   0.193     0.063      0.811      0.003    0.002    3781.0    3060.0    1.0
beta[14]  -1.486   0.509    -2.500     -0.526      0.008    0.006    3603.0    3169.0    1.0
tau        0.059   0.005     0.049      0.070      0.000    0.000    3677.0    2993.0    1.0
variance  16.947   1.531    13.942     19.802      0.025    0.018    3677.0    2993.0    1.0
br2        0.734   0.024     0.689      0.781      0.000    0.000    3677.0    2993.0    1.0
br2_adj    0.718   0.025     0.670      0.768      0.000    0.000    3677.0    2993.0    1.0
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))["likelihood"].values.T
az.r2_score(y, y_pred)
r2        0.654702
r2_std    0.024177
dtype: float64
%load_ext watermark
%watermark -n -u -v -iv
Last updated: Sun Mar 09 2025

Python implementation: CPython
Python version       : 3.12.7
IPython version      : 8.29.0

arviz   : 0.20.0
pandas  : 2.2.3
pymc    : 5.19.1
numpy   : 1.26.4
pytensor: 2.26.4