import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt

%load_ext lab_black

8. Prediction

Taste of Cheese

Adapted from Unit 6: cheese.odc.

The link in the original .odc file is dead. I downloaded the data from here and have a copy here.

Problem statement

As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.

Can the score be predicted well by the predictors: Acetic, H2S, and Lactic?

data = pd.read_csv("../data/cheese.csv", index_col=0)
X = data[["Acetic", "H2S", "Lactic"]].to_numpy()
# add intercept column to X
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y = data["taste"].to_numpy()
data
taste Acetic H2S Lactic
1 12.3 4.543 3.135 0.86
2 20.9 5.159 5.043 1.53
3 39.0 5.366 5.438 1.57
4 47.9 5.759 7.496 1.81
5 5.6 4.663 3.807 0.99
6 25.9 5.697 7.601 1.09
7 37.3 5.892 8.726 1.29
8 21.9 6.078 7.966 1.78
9 18.1 4.898 3.850 1.29
10 21.0 5.242 4.174 1.58
11 34.9 5.740 6.142 1.68
12 57.2 6.446 7.908 1.90
13 0.7 4.477 2.996 1.06
14 25.9 5.236 4.942 1.30
15 54.9 6.151 6.752 1.52
16 40.9 6.365 9.588 1.74
17 15.9 4.787 3.912 1.16
18 6.4 5.412 4.700 1.49
19 18.0 5.247 6.174 1.63
20 38.9 5.438 9.064 1.99
21 14.0 4.564 4.949 1.15
22 15.2 5.298 5.220 1.33
23 32.0 5.455 9.242 1.44
24 56.7 5.855 10.199 2.01
25 16.8 5.366 3.664 1.31
26 11.6 6.043 3.219 1.46
27 26.5 6.458 6.962 1.72
28 0.7 5.328 3.912 1.25
29 13.4 5.802 6.685 1.08
30 5.5 6.176 4.787 1.25
X_aug
array([[ 1.   ,  4.543,  3.135,  0.86 ],
       [ 1.   ,  5.159,  5.043,  1.53 ],
       [ 1.   ,  5.366,  5.438,  1.57 ],
       [ 1.   ,  5.759,  7.496,  1.81 ],
       [ 1.   ,  4.663,  3.807,  0.99 ],
       [ 1.   ,  5.697,  7.601,  1.09 ],
       [ 1.   ,  5.892,  8.726,  1.29 ],
       [ 1.   ,  6.078,  7.966,  1.78 ],
       [ 1.   ,  4.898,  3.85 ,  1.29 ],
       [ 1.   ,  5.242,  4.174,  1.58 ],
       [ 1.   ,  5.74 ,  6.142,  1.68 ],
       [ 1.   ,  6.446,  7.908,  1.9  ],
       [ 1.   ,  4.477,  2.996,  1.06 ],
       [ 1.   ,  5.236,  4.942,  1.3  ],
       [ 1.   ,  6.151,  6.752,  1.52 ],
       [ 1.   ,  6.365,  9.588,  1.74 ],
       [ 1.   ,  4.787,  3.912,  1.16 ],
       [ 1.   ,  5.412,  4.7  ,  1.49 ],
       [ 1.   ,  5.247,  6.174,  1.63 ],
       [ 1.   ,  5.438,  9.064,  1.99 ],
       [ 1.   ,  4.564,  4.949,  1.15 ],
       [ 1.   ,  5.298,  5.22 ,  1.33 ],
       [ 1.   ,  5.455,  9.242,  1.44 ],
       [ 1.   ,  5.855, 10.199,  2.01 ],
       [ 1.   ,  5.366,  3.664,  1.31 ],
       [ 1.   ,  6.043,  3.219,  1.46 ],
       [ 1.   ,  6.458,  6.962,  1.72 ],
       [ 1.   ,  5.328,  3.912,  1.25 ],
       [ 1.   ,  5.802,  6.685,  1.08 ],
       [ 1.   ,  6.176,  4.787,  1.25 ]])
with pm.Model() as m:
    # associate data with model (this makes prediction easier)
    X_data = pm.Data("X", X_aug, mutable=True)
    y_data = pm.Data("y", y, mutable=False)

    # priors
    beta = pm.Normal("beta", mu=0, sigma=1000, shape=X.shape[1] + 1)
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))

    mu = pm.math.dot(X_data, beta)

    # likelihood
    pm.Normal("taste_score", mu=mu, sigma=sigma, observed=y_data)

    # start sampling
    trace = pm.sample(5000, target_accept=0.95)
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, tau]
100.00% [24000/24000 00:19<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 19 seconds.
Sampling: [taste_score]
100.00% [20000/20000 00:00<00:00]
az.summary(trace, hdi_prob=0.95)
mean sd hdi_2.5% hdi_97.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta[0] -28.827 20.571 -68.654 12.278 0.243 0.172 7159.0 9390.0 1.0
beta[1] 0.314 4.650 -8.482 9.805 0.057 0.040 6760.0 9158.0 1.0
beta[2] 3.921 1.303 1.371 6.535 0.014 0.010 8847.0 10720.0 1.0
beta[3] 19.649 8.904 1.827 36.985 0.090 0.064 9816.0 10632.0 1.0
tau 0.010 0.003 0.005 0.015 0.000 0.000 10784.0 10427.0 1.0
sigma 10.435 1.492 7.768 13.419 0.015 0.010 10784.0 10427.0 1.0
y_pred = trace.posterior_predictive.stack(sample=("chain", "draw"))[
    "taste_score"
].values.T
az.r2_score(y, y_pred)
r2        0.576332
r2_std    0.075839
dtype: float64
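az.r2_score reports a Bayesian R² (Gelman et al. 2019): for each posterior-predictive draw, it computes explained variance over explained-plus-residual variance, then summarizes the resulting distribution with its mean and standard deviation. A minimal NumPy sketch of that calculation, using synthetic stand-ins for `y` and `y_pred` so it runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins: 30 "observations" and 1000 posterior-predictive draws
# (in the notebook these would be y and y_pred from the cells above)
y_obs = rng.normal(25.0, 12.0, size=30)
y_draws = y_obs + rng.normal(0.0, 10.0, size=(1000, 30))

# per-draw Bayesian R^2: Var(y_hat) / (Var(y_hat) + Var(residuals))
var_pred = y_draws.var(axis=1)
var_resid = (y_obs - y_draws).var(axis=1)
r2_draws = var_pred / (var_pred + var_resid)

print(r2_draws.mean(), r2_draws.std())  # the two numbers az.r2_score returns
```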

Results are pretty close to OpenBUGS:

        mean      sd        MC_error  val2.5pc  median    val97.5pc  start  sample
beta0   -29.75    20.24     0.7889    -70.06    -29.75    11.11      1000   100001
beta1   0.4576    4.6       0.189     -8.716    0.4388    9.786      1000   100001
beta2   3.906     1.291     0.02725   1.345     3.912     6.47       1000   100001
beta3   19.79     8.893     0.2379    2.053     19.88     37.2       1000   100001
tau     0.009777  0.002706  2.29E-05  0.00522   0.009528  0.01575    1000   100001

PyMC warns about this model unless we increase the target_accept parameter of pm.sample, while BUGS runs it without complaint. That's because PyMC's NUTS sampler runs more diagnostics on its exploration of the parameter space, and divergences are a sign that the results may be biased. BUGS will happily run this model without reporting any problems, but that doesn't mean there aren't any.

For further reading, check out Diagnosing Biased Inference with Divergences.

There are multiple ways to get predictions on out-of-sample data in PyMC. The easiest is to set up a shared variable with pm.Data in the original model, then use pm.set_data to swap in the new observations before calling pm.sample_posterior_predictive.

# single prediction on out-of-sample data
new_obs = np.array([[1.0, 5.0, 7.1, 1.5]])
pm.set_data({"X": new_obs}, model=m)
ppc = pm.sample_posterior_predictive(trace, model=m, predictions=True)
Sampling: [taste_score]
100.00% [20000/20000 00:00<00:00]

The default behavior now is to give one predictive draw per row of the input data. The professor often asks for a single prediction based on the new data; the equivalent here is to take the mean of the predicted values.

az.summary(ppc.predictions, hdi_prob=0.95).mean()
mean            30.068067
sd              11.205900
hdi_2.5%         7.859233
hdi_97.5%       52.049367
mcse_mean        0.085567
mcse_sd          0.060533
ess_bulk     17181.033333
ess_tail     18564.700000
r_hat            1.000000
dtype: float64
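As a quick sanity check on that predictive mean: since the Normal noise term averages out, it should be approximately the new observation dotted with the posterior means of beta. Plugging in the values from the az.summary table above:

```python
import numpy as np

# posterior means of beta, taken from the az.summary table above
beta_mean = np.array([-28.827, 0.314, 3.921, 19.649])

# the same out-of-sample point used with pm.set_data
new_obs = np.array([1.0, 5.0, 7.1, 1.5])

# posterior-predictive mean ~= x_new . E[beta] (noise term has mean zero)
mu = new_obs @ beta_mean
print(mu)  # ~30.06, matching the ~30.07 predictive mean above
```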
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Sun Mar 26 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

pymc      : 5.1.2
pandas    : 1.5.3
arviz     : 0.15.1
matplotlib: 3.6.3
numpy     : 1.24.2