import arviz as az
import numpy as np
import pymc as pm
6. Missing Data*

Dugongs
This is our first example of dealing with missing data. It is adapted from Unit 6: dugongsmissing.odc.
Problem statement
Carlin and Gelfand (1991) investigated the age (x) and length (y) of 27 captured dugongs (sea cows). Estimate parameters in a nonlinear growth model.
References
Data provided by Ratkowsky (1983).
Carlin, B. P. and Gelfand, A. E. (1991). An Iterative Monte Carlo Method for Nonconjugate Bayesian Analysis. Statistics and Computing, 1(2), 119–128.
Ratkowsky, D. (1983). Nonlinear Regression Modeling: A Unified Practical Approach. Marcel Dekker, New York.
# fmt: off
X = [1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0, 8.0, 8.5, 9.0, 9.5,
     9.5, 10.0, 12.0, 12.0, 13.0, 13.0, 14.5, 15.5, 15.5, 16.5, 17.0,
     22.5, 29.0, 31.5]

y = [1.80, 1.85, 1.87, np.nan, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26,
     2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47,
     2.64, 2.56, 2.70, 2.72, np.nan]
# fmt: on
PyMC imputes missing data automatically, similar to BUGS.

You used to need to create a masked NumPy array, but that is no longer the case: PyMC now detects NaN values passed to the observed argument and imputes them automatically. Note that this method only works for missing observed data. We'll learn about other types of missing data in Unit 8.
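As a quick check (not part of the original code), we can see which entries will be imputed by looking for NaNs in y:

```python
y_arr = np.array(y)
# indices of the missing observations (0-based)
print(np.flatnonzero(np.isnan(y_arr)))  # [ 3 26] -> y[4] and y[27] in BUGS's 1-based indexing
```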
with pm.Model() as m:
    # priors
    alpha = pm.Uniform("alpha", 0, 100)
    beta = pm.Uniform("beta", 0, 100)
    gamma = pm.Uniform("gamma", 0, 1)
    # the RV named "sigma" is on the log scale (like BUGS's log.sigma);
    # the Normal likelihood uses exp(sigma) as its scale
    sigma = pm.math.exp(pm.Uniform("sigma", -10, 10))

    # nonlinear growth curve
    mu = alpha - beta * gamma**X

    likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    # start sampling
    trace = pm.sample(5000)
/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pymc/model/core.py:1302: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
warnings.warn(impute_message, ImputationWarning)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma, sigma, likelihood_unobserved]
Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 8 seconds.
There were 30 divergences after tuning. Increase `target_accept` or reparameterize.
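Thirty divergences out of 20,000 draws is not many, but if you want to chase them down, a common first step (an optional tweak, not in the original notebook) is to raise target_accept so NUTS takes smaller steps:

```python
# re-sample with a higher acceptance target; results below use the original run
with m:
    trace_high_accept = pm.sample(5000, target_accept=0.95)
```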
az.summary(trace, hdi_prob=0.95, var_names="~likelihood")
| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| likelihood_unobserved[0] | 1.905 | 0.111 | 1.694 | 2.132 | 0.001 | 0.001 | 10455.0 | 10085.0 | 1.0 |
| likelihood_unobserved[1] | 2.697 | 0.127 | 2.446 | 2.945 | 0.002 | 0.001 | 6013.0 | 4239.0 | 1.0 |
| alpha | 2.743 | 0.146 | 2.519 | 3.017 | 0.003 | 0.002 | 4000.0 | 2198.0 | 1.0 |
| beta | 0.994 | 0.121 | 0.791 | 1.218 | 0.002 | 0.002 | 4980.0 | 2841.0 | 1.0 |
| gamma | 0.889 | 0.035 | 0.818 | 0.953 | 0.000 | 0.000 | 4174.0 | 2674.0 | 1.0 |
| sigma | -2.344 | 0.156 | -2.648 | -2.042 | 0.002 | 0.001 | 6996.0 | 7641.0 | 1.0 |
This output is very close to the BUGS results if you use the inits labeled Alternative (full) inits (initializing the missing values at 1):
| | mean | sd | MC_error | val2.5pc | median | val97.5pc | start | sample |
|---|---|---|---|---|---|---|---|---|
| alpha | 2.731 | 0.1206 | 0.003167 | 2.551 | 2.711 | 3.025 | 1001 | 100000 |
| beta | 0.9846 | 0.1021 | 0.002008 | 0.8053 | 0.9773 | 1.212 | 1001 | 100000 |
| gamma | 0.8874 | 0.03381 | 7.64E-04 | 0.8124 | 0.8908 | 0.9427 | 1001 | 100000 |
| log.sigma | -2.342 | 0.1543 | 7.83E-04 | -2.622 | -2.349 | -2.018 | 1001 | 100000 |
| sigma | 0.09733 | 0.0155 | 7.96E-05 | 0.07267 | 0.09547 | 0.1329 | 1001 | 100000 |
| y[4] | 1.906 | 0.1098 | 6.23E-04 | 1.689 | 1.906 | 2.123 | 1001 | 100000 |
| y[27] | 2.69 | 0.1255 | 0.001916 | 2.446 | 2.689 | 2.941 | 1001 | 100000 |
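If you want to pin the imputed values' starting points in PyMC, roughly mirroring the BUGS "full" inits above, something like the following sketch should work (it relies on the auto-created variable name likelihood_unobserved from the sampler output; since NUTS still uses jitter+adapt_diag initialization, this is unlikely to change the results much):

```python
# sketch: start the two imputed observations at 1.0, as in the BUGS full inits
with m:
    trace_full_inits = pm.sample(
        5000,
        initvals={"likelihood_unobserved": np.array([1.0, 1.0])},
    )
```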
If you use the first set of inits, as Professor Vidakovic did in the video, BUGS gives the y[4] mean as 2.047 and y[27] as 3.04, and all the other variables come out very differently as well. This is quite a mystery; as far as I can tell, the only difference in the following results is that the missing values use BUGS-generated inits. If someone figures out what's going on here, let me know!
| | mean | sd | MC_error | val2.5pc | median | val97.5pc | start | sample |
|---|---|---|---|---|---|---|---|---|
| alpha | 32.54 | 3.698 | 0.2076 | 23.51 | 32.69 | 40.31 | 1001 | 100000 |
| beta | 30.55 | 3.697 | 0.2076 | 21.52 | 30.69 | 38.31 | 1001 | 100000 |
| gamma | 0.9989 | 2.08E-04 | 8.70E-06 | 0.9984 | 0.9989 | 0.9992 | 1001 | 100000 |
| sigma | 0.1328 | 0.02071 | 8.17E-05 | 0.09964 | 0.1303 | 0.1805 | 1001 | 100000 |
| y[4] | 2.047 | 0.1424 | 5.18E-04 | 1.766 | 2.046 | 2.329 | 1001 | 100000 |
| y[27] | 3.04 | 0.1606 | 7.42E-04 | 2.722 | 3.04 | 3.358 | 1001 | 100000 |
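If you want to inspect the imputed lengths directly rather than just their summary rows, ArviZ can plot their posteriors (this uses the auto-generated variable name likelihood_unobserved shown in the sampler output above):

```python
# posterior densities of the two imputed dugong lengths
az.plot_posterior(trace, var_names=["likelihood_unobserved"], hdi_prob=0.95)
```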
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
The watermark extension is already loaded. To reload it, use:
%reload_ext watermark
Last updated: Sun Mar 09 2025
Python implementation: CPython
Python version : 3.12.7
IPython version : 8.29.0
pytensor: 2.26.4
numpy: 1.26.4
pymc : 5.19.1
arviz: 0.20.0