import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import pytensor.tensor as pt
%load_ext lab_black
%load_ext watermark
11. Brozek index prediction*
This example goes over linear regression and Bayesian \(R^2\).
Adapted from Unit 7: fat1.odc, fat2d.odc, and fatmulti.odc.
Data can be found here.
Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.
The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained with scales and a measuring tape.
y = Brozek index
X1 = 1 (intercept)
X2 = age
X3 = weight
X4 = height
X5 = adipose
X6 = neck
X7 = chest
X8 = abdomen
X9 = hip
X10 = thigh
X11 = knee
X12 = ankle
X13 = bicep
X14 = forearm
X15 = wrist
These anthropometric variables are less intrusive but also less reliable in assessing the body fat index.
Set up a linear regression to predict the Brozek index from these body measurements.
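In other words, with \(X_1 = 1\) absorbing the intercept, the model for subject \(i\) is ordinary normal linear regression:

\[
y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{15} X_{i15} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
\]

where the PyMC models below parameterize the noise by its precision \(\tau = 1/\sigma^2\).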
Single predictor (X8)
This is a recreation of fat1.odc.
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)
# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2
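The model below also computes the Bayesian \(R^2\) from fat1.odc by plugging each posterior draw of the error variance \(\sigma^2 = 1/\tau\) into the classical decomposition:

\[
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{(n - p)\,\sigma^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\]

so br2 is a deterministic function of \(\tau\) and has its own posterior distribution.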
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 44 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -35.053 | 2.457 | -39.958 | -30.496 | 0.050 | 0.036 | 2382.0 | 2676.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.026 | 0.534 | 0.636 | 0.001 | 0.000 | 2388.0 | 2770.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.041 | 0.057 | 0.000 | 0.000 | 4226.0 | 4057.0 | 1.0 |
| variance | 20.516 | 1.808 | 17.086 | 24.142 | 0.028 | 0.020 | 4226.0 | 4057.0 | 1.0 |
| br2 | 0.660 | 0.030 | 0.600 | 0.717 | 0.000 | 0.000 | 4226.0 | 4057.0 | 1.0 |
This matches the results from the U7 L11 video.
Here is another way to calculate \(R^2\), using a posterior predictive check (keeping in mind that there is no standard “Bayesian \(R^2\)”). The results will be slightly different:
# updated based on the arviz docs
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2 0.595109
r2_std 0.026074
dtype: float64
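For reference, az.r2_score reports the Bayesian \(R^2\) of Gelman, Goodrich, Gabry, and Vehtari (2019): for each posterior predictive draw, the variance of the predictions is compared with the variance of the residuals. A minimal sketch of that calculation, reusing y and y_pred from above (illustrative only; ArviZ's exact implementation may differ):

# Per-draw Bayesian R^2: Var(predictions) / (Var(predictions) + Var(residuals)).
# y_pred has shape (n_draws, n_observations) after the stack/transpose above.
var_pred = y_pred.var(axis=1)           # variance of predictions, per draw
var_resid = (y - y_pred).var(axis=1)    # variance of residuals, per draw
r2_draws = var_pred / (var_pred + var_resid)
print(f"r2 mean: {r2_draws.mean():.3f}, r2 std: {r2_draws.std():.3f}")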
Multivariate regression with all predictors
Based on fat2d.odc or fatmulti.odc (they appear to be identical).
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)
# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape
# Zellner's g
g = p**2
n, p, g
(252, 15, 225)
X_aug.shape
(252, 15)
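The prior on the coefficients is Zellner's g-prior: conditional on the precision \(\tau = 1/\sigma^2\), the coefficient vector gets a zero-mean multivariate normal prior whose precision matrix is built from the design matrix,

\[
\beta \mid \tau \sim \mathcal{N}\!\left(0,\; \left(\frac{\tau}{g} X^\top X\right)^{-1}\right), \qquad g = p^2,
\]

so a larger \(g\) gives a vaguer prior on \(\beta\). This is exactly the tau_beta precision matrix constructed in the model below.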
mu_beta = np.zeros(p)
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    tau_matrix = pt.fill(pt.zeros((15, 15)), tau)
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 418 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -14.964 | 16.823 | -47.976 | 17.982 | 0.379 | 0.283 | 1972.0 | 2123.0 | 1.0 |
| beta[1] | 0.057 | 0.032 | -0.004 | 0.119 | 0.001 | 0.000 | 3449.0 | 3278.0 | 1.0 |
| beta[2] | -0.080 | 0.052 | -0.184 | 0.021 | 0.001 | 0.001 | 2094.0 | 2456.0 | 1.0 |
| beta[3] | -0.054 | 0.107 | -0.268 | 0.148 | 0.002 | 0.001 | 2998.0 | 2898.0 | 1.0 |
| beta[4] | 0.061 | 0.280 | -0.438 | 0.653 | 0.005 | 0.004 | 3071.0 | 3086.0 | 1.0 |
| beta[5] | -0.442 | 0.223 | -0.871 | -0.007 | 0.004 | 0.003 | 3554.0 | 2945.0 | 1.0 |
| beta[6] | -0.030 | 0.098 | -0.217 | 0.171 | 0.002 | 0.001 | 2988.0 | 3078.0 | 1.0 |
| beta[7] | 0.873 | 0.085 | 0.713 | 1.047 | 0.002 | 0.001 | 3131.0 | 3032.0 | 1.0 |
| beta[8] | -0.200 | 0.144 | -0.465 | 0.101 | 0.003 | 0.002 | 2579.0 | 2522.0 | 1.0 |
| beta[9] | 0.226 | 0.134 | -0.027 | 0.490 | 0.002 | 0.002 | 3452.0 | 3120.0 | 1.0 |
| beta[10] | -0.003 | 0.232 | -0.462 | 0.433 | 0.004 | 0.004 | 3359.0 | 2859.0 | 1.0 |
| beta[11] | 0.156 | 0.219 | -0.276 | 0.589 | 0.004 | 0.003 | 3686.0 | 2974.0 | 1.0 |
| beta[12] | 0.143 | 0.164 | -0.176 | 0.472 | 0.003 | 0.002 | 3841.0 | 2981.0 | 1.0 |
| beta[13] | 0.433 | 0.197 | 0.030 | 0.793 | 0.003 | 0.002 | 3297.0 | 2921.0 | 1.0 |
| beta[14] | -1.492 | 0.502 | -2.468 | -0.502 | 0.008 | 0.006 | 3654.0 | 2731.0 | 1.0 |
| tau | 0.060 | 0.005 | 0.049 | 0.069 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
| variance | 16.931 | 1.481 | 14.208 | 19.997 | 0.026 | 0.019 | 3291.0 | 2967.0 | 1.0 |
| br2 | 0.734 | 0.023 | 0.686 | 0.777 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
| br2_adj | 0.718 | 0.025 | 0.667 | 0.764 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
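The br2_adj variable in the table is the same plug-in \(R^2\) with the usual degrees-of-freedom correction:

\[
R^2_{adj} = 1 - \frac{(n - 1)\,\sigma^2}{SST} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p}.
\]

As in the single-predictor model, we can also compare against the posterior predictive \(R^2\):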
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2 0.654687
r2_std 0.023479
dtype: float64
Need to do some more reading on g-priors:
Original paper:
Zellner, A. (1986). “On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions”. In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.
%watermark -n -u -v -iv -p aesara,aeppl
Last updated: Fri Feb 03 2023
Python implementation: CPython
Python version : 3.11.0
IPython version : 8.9.0
aesara: 2.8.10
aeppl : 0.1.1
pandas : 1.5.3
pytensor: 2.8.11
arviz : 0.14.0
numpy : 1.24.1
pymc : 5.0.1