Prediction*

8. Prediction*#

We previously covered Bayesian prediction in Unit 4, Lesson 13. Recall that the core of Bayesian prediction is the posterior predictive distribution

\[ f(x_{n+1}\vert x_{1} , \cdots , x_{n}) = \int_{\Theta}f(x_{n+1} \vert \mathbf{\theta})\pi(\theta \vert x_{1} , \cdots , x_{n})d\theta \]

Think of this as a “weighted average” of the likelihood for new data, with “weights” prescribed by for \(\pi(\theta \vert x_{1} , \cdots , x_{n})\). This is a departure from the “plug-in predictions” you may have encountered in a frequentist course. The result is a distribution of predictions. It should be clear by now that analytical methods are often intractible, and that is no different for the posterior predictive distribution. Fortunately, pymc has some tools that make this easier.

Taste of Cheese*#

Adapted from Unit 6: cheese.odc.

The link in the original .odc file is dead. I downloaded the data from here and have a copy here.

Problem Statement#

As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the Latrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.

Can the score be predicted well by the predictors: Acetic, H2S, and Lactic?

data = pd.read_csv("../data/cheese.csv", index_col=0)
X = data[["Acetic", "H2S", "Lactic"]].to_numpy()
# add intercept column to X
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y = data["taste"].to_numpy()

data.head(5)

	taste	Acetic	H2S	Lactic
1	12.3	4.543	3.135	0.86
2	20.9	5.159	5.043	1.53
3	39.0	5.366	5.438	1.57
4	47.9	5.759	7.496	1.81
5	5.6	4.663	3.807	0.99

with pm.Model() as m:
    # associate data with model (this makes prediction easier)
    X_data = pm.Data("X", X_aug)
    y_data = pm.Data("y", y)

    # priors
    beta = pm.Normal("beta", mu=0, sigma=1000, shape=X.shape[1] + 1)
    tau = pm.Gamma("tau", alpha=0.001, beta=0.001)
    sigma = pm.Deterministic("sigma", 1 / pm.math.sqrt(tau))

    mu = pm.math.dot(X_data, beta)

    # likelihood
    pm.Normal("taste_score", mu=mu, sigma=sigma, observed=y_data, shape=X_data.shape[0])

    # start sampling
    trace = pm.sample(5000, target_accept=0.95)
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)

az.summary(trace, hdi_prob=0.95)

	mean	sd	hdi_2.5%	hdi_97.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
beta[0]	-28.594	20.567	-68.302	12.992	0.240	0.176	7366.0	9567.0	1.0
beta[1]	0.219	4.635	-8.825	9.345	0.056	0.041	6929.0	8886.0	1.0
beta[2]	3.909	1.283	1.322	6.350	0.014	0.010	8958.0	10953.0	1.0
beta[3]	19.914	8.862	2.569	37.440	0.089	0.069	9970.0	11208.0	1.0
tau	0.010	0.003	0.005	0.015	0.000	0.000	10522.0	10951.0	1.0
sigma	10.446	1.529	7.770	13.536	0.015	0.014	10522.0	10951.0	1.0

y_pred = trace.posterior_predictive.stack(sample=("chain", "draw"))[
    "taste_score"
].values.T
az.r2_score(y, y_pred)

r2        0.577111
r2_std    0.075772
dtype: float64

PyMC may give warnings about the model unless we increase the target_accept parameter of pm.sample. This is because PyMC uses more diagnostics to check if there are any problems with its exploration of the parameter space. Divergences indicate bias in the results.

For further reading, check out Diagnosing Biased Inference with Divergences.

It looks like there are multiple ways to get predictions on out-of-sample data in PyMC. The easiest way is to set up a shared variable using pm.Data in the original model, then using pm.set_data to change to the new observations before calling pm.sample_posterior_predictive.

# single prediction on out-of-sample data
new_obs = np.array([[1.0, 5.0, 7.1, 1.5]])
pm.set_data({"X": new_obs}, model=m)
ppc = pm.sample_posterior_predictive(trace, model=m, predictions=True)

The default behavior now is to give one prediction per \(y\)-value. The professor often asks for a single prediction based on the new data; the equivalent here would be to take the mean of the predicted values.

az.summary(ppc.predictions, hdi_prob=0.95)

	mean	sd	hdi_2.5%	hdi_97.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
taste_score[0]	30.055	11.197	8.671	52.698	0.085	0.063	17533.0	17753.0	1.0

Intuition: From posterior trace to posterior predictive summary:#

Let’s “manually” simulate the posterior predictive distribution from the posterior samples in trace: without the help of pm.sample_posterior_predictive.

Understanding the PyMC Posterior Trace Structure:#

When we fit a model in PyMC, the results are stored in an arviz.InferenceData object: what we saved as trace above. For each parameter, PyMC stores the samples as an xarray DataArray with two main sampling dimensions:

chain: The MCMC chain index (typically 4 chains by default)
draw: The sequential draw within each chain (e.g., 5000 in this case: as determined by our use of draws=5000)

Specifically:

Since beta is vector-valued in our implementation above: the shape of beta within trace is (num_chains,num_draws,num_coefficients), IE : (4,5000,4)
Since sigma is scalar valued its shape is (4,5000,1)

To work with these samples in a vectorized manner, we flatten the chain and draw dimensions into a single “samples” dimension to make them easier to work with. The resulting shapes are:

posterior_betas has shape (4,20000)
posterior_sigmas has shape (1,20000)

This is done using xarray.DataArray.stack. You can check out the documentation here if you need more guidance. adding on .values converts the resulting xarray to a NumPy array.

Pre-Selecting Random Joint Posterior Indices:#

To draw from the joint posterior, we first predefine a list of random indices—each corresponding to a complete joint sample of all model parameters. Each entry in indices tells us which sample (across all chains and draws) to use for a full set of parameters. This is important: If we sampled each parameter from different indices, we would effectively be sampling from the marginals, which is incorrect for generating predictive samples.

Generate the Posterior Predictive Samples from the Posterior Samples:#

For each posterior predictive sample:

Extract the corresponding set of beta coefficients and sigma value using the same index from indices
Compute the mean prediction: dot product of the new_obs vector and the sampled beta
Add random noise that is sampled from a Normal distribution with mean 0 and standard deviation sigma[i]

We will then summarize our results manually using a DataFrame.

num_ppd_samples = 1000

#Create a stacked ArviZ/xarray Dataset object containing your posterior samples for beta and sigma. 

posterior_betas = trace.posterior.beta.stack(samples=('chain','draw')).values
posterior_sigmas = trace.posterior.sigma.stack(samples=('chain','draw')).values

_,num_samples = posterior_betas.shape


#pre-defined list of indices
indices = np.random.choice(a=num_samples,size=num_ppd_samples)

#noise term:
noise = np.random.normal(0,posterior_sigmas[indices])

#posterior predictive samples:
ppd = ((new_obs @ posterior_betas[:,indices]) + noise).flatten()

#manually constructed summary:
pd.DataFrame(   [dict(zip(
                ['mean', 'sd', 'hdi_2.5%', 'hdi_97.5%'],
                np.append([ppd.mean(), ppd.std()], az.hdi(ppd, hdi_prob=0.95)).round(3)))],
                index=['taste_score[0]'])

	mean	sd	hdi_2.5%	hdi_97.5%
taste_score[0]	30.325	11.233	7.877	51.912

The results are fairly close to the results from az.summary(ppc.predictions, hdi_prob=0.95).

%load_ext watermark
%watermark -n -u -v -iv -p pytensor

Last updated: Sun Mar 09 2025

Python implementation: CPython
Python version       : 3.12.7
IPython version      : 8.29.0

pytensor: 2.26.4

pymc      : 5.19.1
numpy     : 1.26.4
arviz     : 0.20.0
matplotlib: 3.9.2
pandas    : 2.2.3