import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import pytensor.tensor as pt

11. Brozek index prediction*#

This example goes over linear regression and Bayesian \(R^2\).

Adapted from Unit 7: fat1.odc, fat2d.odc, and fatmulti.odc.

Data can be found here.

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.

The data set has 252 observations and 15 variables. Brozek index (Brozek et al., 1963) is obtained by the underwater weighing while other 14 anthropometric variables are obtained using scales and a measuring tape.

  • y = Brozek index

  • X1 = 1 (intercept)

  • X2 = age

  • X3 = weight

  • X4 = height

  • X5 = adipose

  • X6 = neck

  • X7 = chest

  • X8 = abdomen

  • X9 = hip

  • X10 = thigh

  • X11 = knee

  • X12 = ankle

  • X13 = bicep

  • X14 = forearm

  • X15 = wrist

These anthropometric variables are less intrusive but also less reliable in assessing the body fat index.

Set a linear regression to predict the Brozek index from these body measurements.

Single predictor (X8)#

This is a recreation of fat1.odc.

data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)
Hide code cell output
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
/Users/aaron/mambaforge/envs/pymc_env2/lib/python3.11/site-packages/multipledispatch/dispatcher.py:27: AmbiguityWarning: 
Ambiguities exist in dispatched function _unify

The following signatures may result in ambiguous behavior:
	[object, ConstrainedVar, Mapping], [ConstrainedVar, object, Mapping]
	[object, ConstrainedVar, Mapping], [ConstrainedVar, Var, Mapping]
	[object, ConstrainedVar, Mapping], [ConstrainedVar, Var, Mapping]
	[object, ConstrainedVar, Mapping], [ConstrainedVar, object, Mapping]


Consider making the following additions:

@dispatch(ConstrainedVar, ConstrainedVar, Mapping)
def _unify(...)

@dispatch(ConstrainedVar, ConstrainedVar, Mapping)
def _unify(...)

@dispatch(ConstrainedVar, ConstrainedVar, Mapping)
def _unify(...)

@dispatch(ConstrainedVar, ConstrainedVar, Mapping)
def _unify(...)
  warn(warning_text(dispatcher.name, ambiguities), AmbiguityWarning)
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]
100.00% [12000/12000 00:16<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 44 seconds.
Sampling: [likelihood]
100.00% [8000/8000 00:00<00:00]
az.summary(trace, hdi_prob=0.95)
mean sd hdi_2.5% hdi_97.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta0_intercept -35.053 2.457 -39.958 -30.496 0.050 0.036 2382.0 2676.0 1.0
beta1_abdomen 0.583 0.026 0.534 0.636 0.001 0.000 2388.0 2770.0 1.0
tau 0.049 0.004 0.041 0.057 0.000 0.000 4226.0 4057.0 1.0
variance 20.516 1.808 17.086 24.142 0.028 0.020 4226.0 4057.0 1.0
br2 0.660 0.030 0.600 0.717 0.000 0.000 4226.0 4057.0 1.0

This matches the results from the U7 L11 video.

Another way to calculate the \(R^2\) using a posterior predictive check (keeping in mind that there is no standard “Bayesian \(R^2\)”). The results will be slightly different:

# updated based on the arviz docs
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2        0.595109
r2_std    0.026074
dtype: float64

Multivariate regression with all predictors#

Based on fat2d.odc or fatmulti.odc (they appear to be identical).

data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)

# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape

# Zellner's g
g = p**2

n, p, g
(252, 15, 225)
X_aug.shape
(252, 15)
mu_beta = np.zeros(p)
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    tau_matrix = pt.fill(pt.zeros((15, 15)), tau)
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)
Hide code cell output
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]
100.00% [8000/8000 06:57<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 418 seconds.
Sampling: [likelihood]
100.00% [4000/4000 00:00<00:00]
az.summary(trace, hdi_prob=0.95)
mean sd hdi_2.5% hdi_97.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta[0] -14.964 16.823 -47.976 17.982 0.379 0.283 1972.0 2123.0 1.0
beta[1] 0.057 0.032 -0.004 0.119 0.001 0.000 3449.0 3278.0 1.0
beta[2] -0.080 0.052 -0.184 0.021 0.001 0.001 2094.0 2456.0 1.0
beta[3] -0.054 0.107 -0.268 0.148 0.002 0.001 2998.0 2898.0 1.0
beta[4] 0.061 0.280 -0.438 0.653 0.005 0.004 3071.0 3086.0 1.0
beta[5] -0.442 0.223 -0.871 -0.007 0.004 0.003 3554.0 2945.0 1.0
beta[6] -0.030 0.098 -0.217 0.171 0.002 0.001 2988.0 3078.0 1.0
beta[7] 0.873 0.085 0.713 1.047 0.002 0.001 3131.0 3032.0 1.0
beta[8] -0.200 0.144 -0.465 0.101 0.003 0.002 2579.0 2522.0 1.0
beta[9] 0.226 0.134 -0.027 0.490 0.002 0.002 3452.0 3120.0 1.0
beta[10] -0.003 0.232 -0.462 0.433 0.004 0.004 3359.0 2859.0 1.0
beta[11] 0.156 0.219 -0.276 0.589 0.004 0.003 3686.0 2974.0 1.0
beta[12] 0.143 0.164 -0.176 0.472 0.003 0.002 3841.0 2981.0 1.0
beta[13] 0.433 0.197 0.030 0.793 0.003 0.002 3297.0 2921.0 1.0
beta[14] -1.492 0.502 -2.468 -0.502 0.008 0.006 3654.0 2731.0 1.0
tau 0.060 0.005 0.049 0.069 0.000 0.000 3291.0 2967.0 1.0
variance 16.931 1.481 14.208 19.997 0.026 0.019 3291.0 2967.0 1.0
br2 0.734 0.023 0.686 0.777 0.000 0.000 3291.0 2967.0 1.0
br2_adj 0.718 0.025 0.667 0.764 0.000 0.000 3291.0 2967.0 1.0
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2        0.654687
r2_std    0.023479
dtype: float64

Need to do some more reading on g-priors:

Original paper:

Zellner, A. (1986). “On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions”. In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.

%load_ext watermark
%watermark -n -u -v -iv -p aesara,aeppl
Last updated: Fri Feb 03 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

aesara: 2.8.10
aeppl : 0.1.1

pandas  : 1.5.3
pytensor: 2.8.11
arviz   : 0.14.0
numpy   : 1.24.1
pymc    : 5.0.1