import arviz as az
import pymc as pm
from pymc.math import switch, ge
import pandas as pd
import numpy as np

%load_ext lab_black

5. Loading Data, Step Function, and Deterministic Variables*#

This example introduces data containers and tracking of deterministic variables, and shows how to recreate the BUGS step function in PyMC.

Taste of Cheese#

Adapted from Unit 6: cheese.odc.

The link in the original .odc file is dead. I downloaded the data from here and have a copy here.

As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.

Can the score be predicted well by the predictors: Acetic, H2S, and Lactic?

data = pd.read_csv("../data/cheese.csv", index_col=0)
X = data[["Acetic", "H2S", "Lactic"]].to_numpy()
# add intercept column to X
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y = data["taste"].to_numpy()
data
    taste  Acetic     H2S  Lactic
1    12.3   4.543   3.135    0.86
2    20.9   5.159   5.043    1.53
3    39.0   5.366   5.438    1.57
4    47.9   5.759   7.496    1.81
5     5.6   4.663   3.807    0.99
6    25.9   5.697   7.601    1.09
7    37.3   5.892   8.726    1.29
8    21.9   6.078   7.966    1.78
9    18.1   4.898   3.850    1.29
10   21.0   5.242   4.174    1.58
11   34.9   5.740   6.142    1.68
12   57.2   6.446   7.908    1.90
13    0.7   4.477   2.996    1.06
14   25.9   5.236   4.942    1.30
15   54.9   6.151   6.752    1.52
16   40.9   6.365   9.588    1.74
17   15.9   4.787   3.912    1.16
18    6.4   5.412   4.700    1.49
19   18.0   5.247   6.174    1.63
20   38.9   5.438   9.064    1.99
21   14.0   4.564   4.949    1.15
22   15.2   5.298   5.220    1.33
23   32.0   5.455   9.242    1.44
24   56.7   5.855  10.199    2.01
25   16.8   5.366   3.664    1.31
26   11.6   6.043   3.219    1.46
27   26.5   6.458   6.962    1.72
28    0.7   5.328   3.912    1.25
29   13.4   5.802   6.685    1.08
30    5.5   6.176   4.787    1.25
X_aug
array([[ 1.   ,  4.543,  3.135,  0.86 ],
       [ 1.   ,  5.159,  5.043,  1.53 ],
       [ 1.   ,  5.366,  5.438,  1.57 ],
       [ 1.   ,  5.759,  7.496,  1.81 ],
       [ 1.   ,  4.663,  3.807,  0.99 ],
       [ 1.   ,  5.697,  7.601,  1.09 ],
       [ 1.   ,  5.892,  8.726,  1.29 ],
       [ 1.   ,  6.078,  7.966,  1.78 ],
       [ 1.   ,  4.898,  3.85 ,  1.29 ],
       [ 1.   ,  5.242,  4.174,  1.58 ],
       [ 1.   ,  5.74 ,  6.142,  1.68 ],
       [ 1.   ,  6.446,  7.908,  1.9  ],
       [ 1.   ,  4.477,  2.996,  1.06 ],
       [ 1.   ,  5.236,  4.942,  1.3  ],
       [ 1.   ,  6.151,  6.752,  1.52 ],
       [ 1.   ,  6.365,  9.588,  1.74 ],
       [ 1.   ,  4.787,  3.912,  1.16 ],
       [ 1.   ,  5.412,  4.7  ,  1.49 ],
       [ 1.   ,  5.247,  6.174,  1.63 ],
       [ 1.   ,  5.438,  9.064,  1.99 ],
       [ 1.   ,  4.564,  4.949,  1.15 ],
       [ 1.   ,  5.298,  5.22 ,  1.33 ],
       [ 1.   ,  5.455,  9.242,  1.44 ],
       [ 1.   ,  5.855, 10.199,  2.01 ],
       [ 1.   ,  5.366,  3.664,  1.31 ],
       [ 1.   ,  6.043,  3.219,  1.46 ],
       [ 1.   ,  6.458,  6.962,  1.72 ],
       [ 1.   ,  5.328,  3.912,  1.25 ],
       [ 1.   ,  5.802,  6.685,  1.08 ],
       [ 1.   ,  6.176,  4.787,  1.25 ]])
with pm.Model() as m:
    # associate data with model (this makes prediction easier)
    X_data = pm.Data("X", X_aug, mutable=True)
    y_data = pm.Data("y", y, mutable=False)

    # priors
    beta = pm.Normal("beta", mu=0, sigma=1000, shape=X.shape[1] + 1)
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))

    mu = pm.math.dot(X_data, beta)

    # likelihood
    pm.Normal("taste_score", mu=mu, sigma=sigma, observed=y_data)

    # start sampling
    trace = pm.sample(5000, target_accept=0.95)
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, tau]
100.00% [24000/24000 00:19<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 20 seconds.
Sampling: [taste_score]
100.00% [20000/20000 00:00<00:00]
az.summary(trace, hdi_prob=0.95)
            mean      sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
beta[0]  -28.910  20.681   -71.358     10.156      0.244    0.177    7242.0    8929.0    1.0
beta[1]    0.347   4.660    -9.125      9.136      0.056    0.041    7031.0    8609.0    1.0
beta[2]    3.913   1.310     1.315      6.457      0.014    0.010    9333.0   10101.0    1.0
beta[3]   19.616   8.900     2.887     37.918      0.089    0.063   10111.0   10106.0    1.0
tau        0.010   0.003     0.005      0.015      0.000    0.000   10585.0   11160.0    1.0
sigma     10.445   1.513     7.775     13.520      0.015    0.011   10585.0   11160.0    1.0
# reshape posterior predictive samples to (n_samples, n_obs) for az.r2_score
y_pred = trace.posterior_predictive.stack(sample=("chain", "draw"))[
    "taste_score"
].values.T
az.r2_score(y, y_pred)
r2        0.576223
r2_std    0.075841
dtype: float64
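
As an aside, registering X as a mutable data container is what makes prediction easy: we can swap in new predictor values with pm.set_data and rerun posterior predictive sampling. Here's a minimal sketch (not part of the original run; the nudged Lactic values are made up, and because the likelihood's shape was fixed by the 30 observed scores, the replacement matrix keeps 30 rows — pass shape=mu.shape in the likelihood if you want to predict at a different number of points):

with m:
    # hypothetical new predictors: same shape as X_aug, with Lactic nudged up
    X_new = X_aug.copy()
    X_new[:, 3] += 0.1
    pm.set_data({"X": X_new})
    # predictions land in their own "predictions" group on the returned object
    preds = pm.sample_posterior_predictive(trace, predictions=True)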

Results are pretty close to OpenBUGS:

       mean      sd        MC_error  val2.5pc  median    val97.5pc  start  sample
beta0  -29.75    20.24     0.7889    -70.06    -29.75    11.11      1000   100001
beta1  0.4576    4.6       0.189     -8.716    0.4388    9.786      1000   100001
beta2  3.906     1.291     0.02725   1.345     3.912     6.47       1000   100001
beta3  19.79     8.893     0.2379    2.053     19.88     37.2       1000   100001
tau    0.009777  0.002706  2.29E-05  0.00522   0.009528  0.01575    1000   100001

PyMC gives some warnings about this model unless we increase the target_accept parameter of pm.sample, while BUGS doesn't. This is because PyMC runs more diagnostics to check for problems with its exploration of the parameter space. Divergences indicate that the sampler may be producing biased results. BUGS will happily run this model without reporting any problems, but that doesn't mean there aren't any.

For further reading, check out Diagnosing Biased Inference with Divergences.
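
You can also check for divergences programmatically instead of watching the progress bar; NUTS records a per-draw flag in the sample stats. A quick sketch:

# count divergent transitions across all chains and draws
n_divergent = int(trace.sample_stats["diverging"].sum())
print(f"Divergences: {n_divergent}")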

Stress, Diet, and Acids#

Adapted from Unit 6: stressacids.odc.

In the study Interrelationships Between Stress, Dietary Intake, and Plasma Ascorbic Acid During Pregnancy conducted at the Virginia Polytechnic Institute and State University, the plasma ascorbic acid levels of pregnant women were compared for smokers versus non-smokers. Thirty-two women in the last three months of pregnancy, free of major health disorders, and ranging in age from 15 to 32 years were selected for the study. Prior to the collection of 20 ml of blood, the participants were told to avoid breakfast, forego their vitamin supplements, and avoid foods high in ascorbic acid content. From the blood samples, the plasma ascorbic acid values of each subject were determined in milligrams per 100 milliliters.


The purpose of this example in lectures was mostly just to show different ways to load data in BUGS. I’m not going to go into that too much, since there are a million ways to prepare your data in Python. In the next cell, I start with the data pasted from stressacids.odc, then use list comprehensions to create one list for smokers and one for nonsmokers.

# fmt: off
plasma = [0.97, 0.72, 1.00, 0.81, 0.62, 1.32, 1.24, 0.99, 0.90, 0.74,
          0.88, 0.94, 1.06, 0.86, 0.85, 0.58, 0.57, 0.64, 0.98, 1.09,
          0.92, 0.78, 1.24, 1.18, 0.48, 0.71, 0.98, 0.68, 1.18, 1.36,
          0.78, 1.64]

smo = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
       1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
# fmt: on

nonsmokers = [x for x, y in zip(plasma, smo) if y == 1]
smokers = [x for x, y in zip(plasma, smo) if y == 2]

BUGS step function#

I think this is the first time we’ve seen the BUGS step function.

BUGS defines the step function like this:

step(e) is 1 if e >= 0; 0 otherwise.

Keep in mind that in PyMC, "step methods" refer to the sampling algorithms, like NUTS or Metropolis. It's just different terminology for an unrelated concept.

We can recreate the BUGS step function with pm.math.switch():

pm.math.switch(e >= 0, 1, 0)

We should probably use pm.math.ge for the greater-than-or-equal comparison as well, so:

pm.math.switch(ge(e, 0), 1, 0)
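
These pm.math expressions build PyTensor graphs, so we can sanity-check the recreated step function outside of any model by calling .eval(), using the switch and ge imported from pymc.math at the top:

print(switch(ge(-1.5, 0), 1, 0).eval())  # 0: negative argument
print(switch(ge(0.0, 0), 1, 0).eval())   # 1: step(0) is 1 in BUGS
print(switch(ge(2.0, 0), 1, 0).eval())   # 1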

How do I track non-random variables in PyMC?#

One nice thing about BUGS is that you can easily track both stochastic and deterministic variables while sampling. In PyMC, you can track a deterministic quantity by wrapping it in pm.Deterministic(). Just make sure to use pm.math functions where possible.

with pm.Model() as m:
    # priors
    tau_nonsmokers = pm.Gamma("tau_nonsmokers", alpha=0.0001, beta=0.0001)
    sigma_nonsmokers = 1 / pm.math.sqrt(tau_nonsmokers)
    mu_nonsmokers = pm.Normal("mu_nonsmokers", mu=0, sigma=100)

    tau_smokers = pm.Gamma("tau_smokers", alpha=0.0001, beta=0.0001)
    sigma_smokers = 1 / pm.math.sqrt(tau_smokers)
    mu_smokers = pm.Normal("mu_smokers", mu=0, sigma=100)

    # likelihood
    plasma_aa_ns = pm.Normal(
        "nonsmokers_aa", mu=mu_nonsmokers, sigma=sigma_nonsmokers, observed=nonsmokers
    )
    plasma_aa_s = pm.Normal(
        "smokers_aa", mu=mu_smokers, sigma=sigma_smokers, observed=smokers
    )

    testmu = pm.Deterministic("test_mu", switch(ge(mu_smokers, mu_nonsmokers), 1, 0))
    r = pm.Deterministic("prec_ratio", tau_nonsmokers / tau_smokers)

    # start sampling
    trace = pm.sample(5000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau_nonsmokers, mu_nonsmokers, tau_smokers, mu_smokers]
100.00% [24000/24000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 3 seconds.
az.summary(trace, hdi_prob=0.95)
                  mean     sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu_nonsmokers    0.912  0.045     0.826      1.004      0.000    0.000   13842.0   12195.0    1.0
mu_smokers       0.976  0.165     0.646      1.299      0.002    0.001   13965.0    9774.0    1.0
tau_nonsmokers  22.605  6.758     9.906     35.476      0.054    0.038   15096.0   13040.0    1.0
tau_smokers      6.562  3.493     0.866     13.413      0.026    0.019   15445.0   10342.0    1.0
test_mu          0.664  0.472     0.000      1.000      0.004    0.003   16226.0   16226.0    1.0
prec_ratio       4.805  4.157     0.688     11.725      0.041    0.031   15058.0   11605.0    1.0
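
Because test_mu is a 0/1 indicator, its posterior mean is a Monte Carlo estimate of the probability that the smokers' mean exceeds the non-smokers' mean. We can pull it out directly:

# posterior probability that mu_smokers >= mu_nonsmokers
p = trace.posterior["test_mu"].mean().item()
print(f"P(mu_smokers >= mu_nonsmokers) ~= {p:.3f}")  # about 0.664, matching the summary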
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Sat Aug 05 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.11.1

pymc  : 5.3.0
numpy : 1.24.2
arviz : 0.15.1
pandas: 1.5.3