import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import pytensor.tensor as pt
%load_ext lab_black
%load_ext watermark
11. Brozek index prediction*
This example goes over linear regression and Bayesian \(R^2\).
Adapted from Unit 7: fat1.odc, fat2d.odc, and fatmulti.odc.
Data can be found here.
Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.
The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained with scales and a measuring tape.
y = Brozek index
X1 = 1 (intercept)
X2 = age
X3 = weight
X4 = height
X5 = adipose
X6 = neck
X7 = chest
X8 = abdomen
X9 = hip
X10 = thigh
X11 = knee
X12 = ankle
X13 = bicep
X14 = forearm
X15 = wrist
These anthropometric variables are less intrusive but also less reliable in assessing the body fat index.
Set up a linear regression to predict the Brozek index from these body measurements.
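In other words, with \(X_1 = 1\) absorbing the intercept, the model for subject \(i\) is ordinary normal linear regression:

\[
y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{15} X_{i15} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
\]

where the PyMC models below parameterize the noise by its precision \(\tau = 1/\sigma^2\).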
Single predictor (X8)
This is a recreation of fat1.odc.
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)
# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2
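The model below also computes the Bayesian \(R^2\) from fat1.odc by plugging each posterior draw of the error variance \(\sigma^2 = 1/\tau\) into the classical decomposition:

\[
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{(n - p)\,\sigma^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\]

so br2 is a deterministic function of \(\tau\) and has its own posterior distribution.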
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 44 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -35.053 | 2.457 | -39.958 | -30.496 | 0.050 | 0.036 | 2382.0 | 2676.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.026 | 0.534 | 0.636 | 0.001 | 0.000 | 2388.0 | 2770.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.041 | 0.057 | 0.000 | 0.000 | 4226.0 | 4057.0 | 1.0 |
| variance | 20.516 | 1.808 | 17.086 | 24.142 | 0.028 | 0.020 | 4226.0 | 4057.0 | 1.0 |
| br2 | 0.660 | 0.030 | 0.600 | 0.717 | 0.000 | 0.000 | 4226.0 | 4057.0 | 1.0 |
This matches the results from the U7 L11 video.
Here is another way to calculate \(R^2\), using a posterior predictive check (keeping in mind that there is no standard “Bayesian \(R^2\)”). The results will be slightly different:
# updated based on the arviz docs
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2 0.595109
r2_std 0.026074
dtype: float64
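For reference, az.r2_score reports the Bayesian \(R^2\) of Gelman, Goodrich, Gabry, and Vehtari (2019): for each posterior predictive draw, the variance of the predictions is compared with the variance of the residuals. A minimal sketch of that calculation, reusing y and y_pred from above (illustrative only; ArviZ's exact implementation may differ):

# Per-draw Bayesian R^2: Var(predictions) / (Var(predictions) + Var(residuals)).
# y_pred has shape (n_draws, n_observations) after the stack/transpose above.
var_pred = y_pred.var(axis=1)           # variance of predictions, per draw
var_resid = (y - y_pred).var(axis=1)    # variance of residuals, per draw
r2_draws = var_pred / (var_pred + var_resid)
print(f"r2 mean: {r2_draws.mean():.3f}, r2 std: {r2_draws.std():.3f}")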
Multivariate regression with all predictors
Based on fat2d.odc or fatmulti.odc (they appear to be identical).
data = pd.read_csv("../data/fat.tsv", sep="\t")
y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)
# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape
# Zellner's g
g = p**2
n, p, g
(252, 15, 225)
X_aug.shape
(252, 15)
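The prior on the coefficients is Zellner's g-prior: conditional on the precision \(\tau = 1/\sigma^2\), the coefficient vector gets a zero-mean multivariate normal prior whose precision matrix is built from the design matrix,

\[
\beta \mid \tau \sim \mathcal{N}\!\left(0,\; \left(\frac{\tau}{g} X^\top X\right)^{-1}\right), \qquad g = p^2,
\]

so a larger \(g\) gives a vaguer prior on \(\beta\). This is exactly the tau_beta precision matrix constructed in the model below.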
mu_beta = np.zeros(p)
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    tau_matrix = pt.fill(pt.zeros((15, 15)), tau)
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 418 seconds.
Sampling: [likelihood]
az.summary(trace, hdi_prob=0.95)
| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -14.964 | 16.823 | -47.976 | 17.982 | 0.379 | 0.283 | 1972.0 | 2123.0 | 1.0 |
| beta[1] | 0.057 | 0.032 | -0.004 | 0.119 | 0.001 | 0.000 | 3449.0 | 3278.0 | 1.0 |
| beta[2] | -0.080 | 0.052 | -0.184 | 0.021 | 0.001 | 0.001 | 2094.0 | 2456.0 | 1.0 |
| beta[3] | -0.054 | 0.107 | -0.268 | 0.148 | 0.002 | 0.001 | 2998.0 | 2898.0 | 1.0 |
| beta[4] | 0.061 | 0.280 | -0.438 | 0.653 | 0.005 | 0.004 | 3071.0 | 3086.0 | 1.0 |
| beta[5] | -0.442 | 0.223 | -0.871 | -0.007 | 0.004 | 0.003 | 3554.0 | 2945.0 | 1.0 |
| beta[6] | -0.030 | 0.098 | -0.217 | 0.171 | 0.002 | 0.001 | 2988.0 | 3078.0 | 1.0 |
| beta[7] | 0.873 | 0.085 | 0.713 | 1.047 | 0.002 | 0.001 | 3131.0 | 3032.0 | 1.0 |
| beta[8] | -0.200 | 0.144 | -0.465 | 0.101 | 0.003 | 0.002 | 2579.0 | 2522.0 | 1.0 |
| beta[9] | 0.226 | 0.134 | -0.027 | 0.490 | 0.002 | 0.002 | 3452.0 | 3120.0 | 1.0 |
| beta[10] | -0.003 | 0.232 | -0.462 | 0.433 | 0.004 | 0.004 | 3359.0 | 2859.0 | 1.0 |
| beta[11] | 0.156 | 0.219 | -0.276 | 0.589 | 0.004 | 0.003 | 3686.0 | 2974.0 | 1.0 |
| beta[12] | 0.143 | 0.164 | -0.176 | 0.472 | 0.003 | 0.002 | 3841.0 | 2981.0 | 1.0 |
| beta[13] | 0.433 | 0.197 | 0.030 | 0.793 | 0.003 | 0.002 | 3297.0 | 2921.0 | 1.0 |
| beta[14] | -1.492 | 0.502 | -2.468 | -0.502 | 0.008 | 0.006 | 3654.0 | 2731.0 | 1.0 |
| tau | 0.060 | 0.005 | 0.049 | 0.069 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
| variance | 16.931 | 1.481 | 14.208 | 19.997 | 0.026 | 0.019 | 3291.0 | 2967.0 | 1.0 |
| br2 | 0.734 | 0.023 | 0.686 | 0.777 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
| br2_adj | 0.718 | 0.025 | 0.667 | 0.764 | 0.000 | 0.000 | 3291.0 | 2967.0 | 1.0 |
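The br2_adj variable in the table is the same plug-in \(R^2\) with the usual degrees-of-freedom correction:

\[
R^2_{adj} = 1 - \frac{(n - 1)\,\sigma^2}{SST} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p}.
\]

As in the single-predictor model, we can also compare against the posterior predictive \(R^2\):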
y_pred = ppc.posterior_predictive.stack(sample=("chain", "draw"))[
    "likelihood"
].values.T
az.r2_score(y, y_pred)
r2 0.654687
r2_std 0.023479
dtype: float64
Need to do some more reading on g-priors:
Original paper:
Zellner, A. (1986). “On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions”. In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.
%watermark -n -u -v -iv -p aesara,aeppl
Last updated: Fri Feb 03 2023
Python implementation: CPython
Python version : 3.11.0
IPython version : 8.9.0
aesara: 2.8.10
aeppl : 0.1.1
pandas : 1.5.3
pytensor: 2.8.11
arviz : 0.14.0
numpy : 1.24.1
pymc : 5.0.1