8. Cumulative Example*

8. Cumulative Example*#

Notes#

From the OpenBUGS manual:

cumulative(s1, s2): tail area of distribution of s1 up to the value of s2, s1 must be stochastic, s1 and s2 can be the same

Fitting the model#

As we said on the previous page, we want to evaluate the Empirical CDF at the values of our response variable to check how well our model fits the data. First, we’ll fit our model and add posterior predictive samples to the trace.

y = np.array(
    [
        0.0,
        1.0,
        2.0,
        -1.0,
        0.4,
        -0.5,
        0.7,
        -1.2,
        0.1,
        -0.4,
        0.2,
        -0.5,
        -1.4,
        1.8,
        0.2,
        0.3,
        -0.6,
        1.1,
        5.1,
        -6.3,
    ]
)

with pm.Model() as model:
    # Priors
    theta = pm.Flat("theta")
    ls = pm.Flat("ls")
    s = pm.math.exp(ls)

    # Likelihood
    pm.Normal("y", mu=theta, sigma=s, observed=y)

    # Sampling
    trace = pm.sample(2000)
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)

az.summary(trace)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
theta	0.052	0.486	-0.855	0.975	0.006	0.006	6753.0	4931.0	1.0
ls	0.755	0.170	0.450	1.080	0.002	0.002	5754.0	5290.0	1.0

Plot posterior predictive distribution vs. observed values#

Now we have posterior samples, the equivalent of the stochastic part of the y[i] node in BUGS. There are different ways to compare these with our ECDF. The easiest is the arviz.plot_ppc function with the kind="cumulative" argument.

az.plot_ppc(trace, kind="cumulative", num_pp_samples=100)
plt.show()

../_images/1bbd865e4c609040dfc798a9bcb8d09615a10bf706426f6b88f94cae6590bb02.png

Doesn’t look like our posterior does a good job of fitting the observed data. The mean fits okay, but both tails have some issues because of some extreme values near the endpoints. This plot is just comparing individual ECDFs, though. It’s not evaluating the posterior ECDF at the observed \(y\) values.

Plots in general might not be the easiest thing to interpret if we have a lot of datapoints. I wrote the following function to recreate the lecture example.

prob = evaluate_ecdf(trace)

print(prob)
print(np.log(prob / (1 - prob)) ** 2)  # squared logit

[0.4905   0.668375 0.813875 0.304625 0.571    0.393625 0.610375 0.2775
508125 0.41075  0.528375 0.395    0.2455   0.79225  0.53125  0.540125
38525  0.681625 0.986875 0.005   ]
[ 0.00144435  0.49118313  2.17677064  0.68123523  0.08175727  0.18671035
20150114  0.91565089  0.00105644  0.13022411  0.01290998  0.18176809
26058627  1.79169352  0.01566581  0.0258714   0.21839098  0.57950114
66261234 28.01907597]

The last two values are the outliers. I’ve saved the old version of this code below, you should be able to click to expand it.

Another cool Arviz plot#

Arviz has the plot_lm function as well. It’s really nice for visualizing the observed data versus the posterior predictive samples. Let’s try it with the single-predictor model from 11. Brozek index prediction*.

data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

with pm.Model() as m:
    sigma = pm.Exponential("sigma", 0.05)
    beta0 = pm.Normal("intercept", 0, sigma=10)
    beta1 = pm.Normal("slope", 0, sigma=10)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(2000)
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)

Unfortunately, I haven’t been able to get this working for a year or two now.

x = xr.DataArray(np.linspace(0, 1, 252))
trace.posterior["y_model"] = (
    trace.posterior["intercept"] + trace.posterior["slope"] * x
)
az.plot_lm(idata=trace, y="y", x=x)

Show code cell output Hide code cell output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 3
      1 x = xr.DataArray(np.linspace(0, 1, 252))
      2 trace.posterior["y_model"] = trace.posterior["intercept"] + trace.posterior["slope"] * x
----> 3 az.plot_lm(idata=trace, y="y", x=x)

File ~/mambaforge/envs/pymc/lib/python3.11/site-packages/arviz/plots/lmplot.py:351, in plot_lm(y, idata, x, y_model, y_hat, num_samples, kind_pp, kind_model, xjitter, plot_dim, backend, y_kwargs, y_hat_plot_kwargs, y_hat_fill_kwargs, y_model_plot_kwargs, y_model_fill_kwargs, y_model_mean_kwargs, backend_kwargs, show, figsize, textsize, axes, legend, grid)
    348 backend = backend.lower()
    350 plot = get_plotting_function("plot_lm", "lmplot", backend)
--> 351 ax = plot(**lmplot_kwargs)
    352 return ax

File ~/mambaforge/envs/pymc/lib/python3.11/site-packages/arviz/plots/backends/matplotlib/lmplot.py:108, in plot_lm(x, y, y_model, y_hat, num_samples, kind_pp, kind_model, xjitter, length_plotters, rows, cols, y_kwargs, y_hat_plot_kwargs, y_hat_fill_kwargs, y_model_plot_kwargs, y_model_fill_kwargs, y_model_mean_kwargs, backend_kwargs, show, figsize, textsize, axes, legend, grid)
    106             ax_i.plot(x_plotters_jitter, y_hat_plotters[..., j], **y_hat_plot_kwargs)
    107         else:
--> 108             ax_i.plot(x_plotters, y_hat_plotters[..., j], **y_hat_plot_kwargs)
    109     ax_i.plot([], **y_hat_plot_kwargs, label="Posterior predictive samples")
    110 else:

File ~/mambaforge/envs/pymc/lib/python3.11/site-packages/matplotlib/axes/_axes.py:1688, in Axes.plot(self, scalex, scaley, data, *args, **kwargs)
   1445 """
   1446 Plot y versus x as lines and/or markers.
   1447 
   (...)
   1685 (``'green'``) or hex strings (``'#008000'``).
   1686 """
   1687 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1688 lines = [*self._get_lines(*args, data=data, **kwargs)]
   1689 for line in lines:
   1690     self.add_line(line)

File ~/mambaforge/envs/pymc/lib/python3.11/site-packages/matplotlib/axes/_base.py:311, in _process_plot_var_args.__call__(self, data, *args, **kwargs)
    309     this += args[0],
    310     args = args[1:]
--> 311 yield from self._plot_args(
    312     this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey)

File ~/mambaforge/envs/pymc/lib/python3.11/site-packages/matplotlib/axes/_base.py:504, in _process_plot_var_args._plot_args(self, tup, kwargs, return_kwargs, ambiguous_fmt_datakey)
    501     self.axes.yaxis.update_units(y)
    503 if x.shape[0] != y.shape[0]:
--> 504     raise ValueError(f"x and y must have same first dimension, but "
    505                      f"have shapes {x.shape} and {y.shape}")
    506 if x.ndim > 2 or y.ndim > 2:
    507     raise ValueError(f"x and y can be no greater than 2D, but have "
    508                      f"shapes {x.shape} and {y.shape}")

ValueError: x and y must have same first dimension, but have shapes (252,) and (1,)

../_images/1b52d5f5a1167b90b37b030e5b54aa97d1a9547d3f3adaf06f3dda9d2beade31.png

The below example from the Arviz docs still works, but the idata here was generated using PyMC 3. Maybe there’s some subtle difference that I’m missing. Let me know if you figure it out!

idata = az.load_arviz_data("regression1d")
x = xr.DataArray(np.linspace(0, 1, 100))
idata.posterior["y_model"] = (
    idata.posterior["intercept"] + idata.posterior["slope"] * x
)
az.plot_lm(idata=idata, y="y", x=x)

array([[<Axes: ylabel='y'>]], dtype=object)

../_images/a2873e84b439d423b95aade4bd911493b4fda1e87dcfa8b3346aa053078ae422.png

%load_ext watermark
%watermark -n -u -v -iv -p pytensor

Last updated: Tue Nov 07 2023

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pytensor: 2.17.1

arviz     : 0.16.1
pymc      : 5.9.0
pandas    : 2.1.0
numpy     : 1.25.2
xarray    : 2023.8.0
matplotlib: 3.7.2