import arviz as az
import numpy as np
import pymc as pm
6. Missing Data*#
Dugongs#
This example is the first example of dealing with missing data.
It is adapted from Unit 6: dugongsmissing.odc.
Problem statement#
Carlin and Gelfand (1991) investigated the age (x) and length (y) of 27 captured dugongs (sea cows). Estimate parameters in a nonlinear growth model.
References#
Data provided by Ratkowsky (1983).
Carlin, B. and Gelfand, B. (1991). An Iterative Monte Carlo Method for Nonconjugate Bayesian Analysis, Statistics and Computing, 1, (2), 119†128.
Ratkowsky, D. (1983). Nonlinear regression modeling: A unified practical approach. M. Dekker, NY, viii, 276 p.
# fmt: off
X = [1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0, 8.0, 8.5, 9.0, 9.5,
9.5, 10.0, 12.0, 12.0, 13.0, 13.0, 14.5, 15.5, 15.5, 16.5, 17.0,
22.5, 29.0, 31.5]
y = [1.80, 1.85, 1.87, -1, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26,
2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47,
2.64, 2.56, 2.70, 2.72, -1]
# fmt: on
PyMC imputes missing data automatically, similar to BUGS, but it requires the missing data be input as a NumPy masked array. See the NumPy docs for np.ma.masked_values(). For y
, above, you can enter whatever number at the missing data positions, then plug it into the value
parameter below. Just make sure the number you’re using isn’t actually in the data. I chose -1 above.
Note that this method only works for missing observed data. We’ll learn about other types of missing data later.
y = np.ma.masked_values(y, value=-1)
y
masked_array(data=[1.8, 1.85, 1.87, --, 2.02, 2.27, 2.15, 2.26, 2.47,
2.19, 2.26, 2.4, 2.39, 2.41, 2.5, 2.32, 2.32, 2.43,
2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.7, 2.72, --],
mask=[False, False, False, True, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, True],
fill_value=-1.0)
Check it out! The array now has a mask attribute.
with pm.Model() as m:
# priors
alpha = pm.Uniform("alpha", 0, 100)
beta = pm.Uniform("beta", 0, 100)
gamma = pm.Uniform("gamma", 0, 1)
sigma = pm.math.exp(pm.Uniform("sigma", -10, 10))
mu = alpha - beta * gamma**X
likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)
# start sampling
trace = pm.sample(5000)
Show code cell output
/Users/aaron/mambaforge/envs/pymc_env2/lib/python3.11/site-packages/pymc/model.py:1402: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma, sigma, likelihood_missing]
Sampling 4 chains for 500 tune and 10_000 draw iterations (2_000 + 40_000 draws total) took 25 seconds.
There were 28 divergences after tuning. Increase `target_accept` or reparameterize.
Chain <xarray.DataArray 'chain' ()>
array(0)
Coordinates:
chain int64 0 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain <xarray.DataArray 'chain' ()>
array(1)
Coordinates:
chain int64 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain <xarray.DataArray 'chain' ()>
array(2)
Coordinates:
chain int64 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain <xarray.DataArray 'chain' ()>
array(3)
Coordinates:
chain int64 3 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
az.summary(trace, hdi_prob=0.95, var_names="~likelihood")
mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
likelihood_missing[0] | 1.907 | 0.110 | 1.691 | 2.125 | 0.001 | 0.001 | 22080.0 | 21790.0 | 1.0 |
likelihood_missing[1] | 2.694 | 0.129 | 2.439 | 2.947 | 0.001 | 0.001 | 11806.0 | 12392.0 | 1.0 |
alpha | 2.742 | 0.148 | 2.519 | 3.015 | 0.002 | 0.002 | 7872.0 | 5961.0 | 1.0 |
beta | 0.994 | 0.122 | 0.790 | 1.217 | 0.002 | 0.001 | 10184.0 | 6771.0 | 1.0 |
gamma | 0.889 | 0.035 | 0.816 | 0.953 | 0.000 | 0.000 | 8259.0 | 6797.0 | 1.0 |
sigma | -2.339 | 0.156 | -2.641 | -2.033 | 0.001 | 0.001 | 14593.0 | 17855.0 | 1.0 |
This output is very close to the BUGS results if you use the inits labeled Alternative (full) inits (initializing the missing values at 1):
mean |
sd |
MC_error |
val2.5pc |
median |
val97.5pc |
start |
sample |
|
---|---|---|---|---|---|---|---|---|
alpha |
2.731 |
0.1206 |
0.003167 |
2.551 |
2.711 |
3.025 |
1001 |
100000 |
beta |
0.9846 |
0.1021 |
0.002008 |
0.8053 |
0.9773 |
1.212 |
1001 |
100000 |
gamma |
0.8874 |
0.03381 |
7.64E-04 |
0.8124 |
0.8908 |
0.9427 |
1001 |
100000 |
log.sigma |
-2.342 |
0.1543 |
7.83E-04 |
-2.622 |
-2.349 |
-2.018 |
1001 |
100000 |
sigma |
0.09733 |
0.0155 |
7.96E-05 |
0.07267 |
0.09547 |
0.1329 |
1001 |
100000 |
y[4] |
1.906 |
0.1098 |
6.23E-04 |
1.689 |
1.906 |
2.123 |
1001 |
100000 |
y[27] |
2.69 |
0.1255 |
0.001916 |
2.446 |
2.689 |
2.941 |
1001 |
100000 |
If you use the first set of inits, as Professor Vidakovic did in the video, BUGS gives the y[4] mean as 2.047 and y[27] as 3.04. All the other variables are very different here, too. This is quite a mystery - as far as I can tell, the only difference is that the missing values use BUGS-generated inits for the following results. If someone figures out what’s going on here, let me know!
mean |
sd |
MC_error |
val2.5pc |
median |
val97.5pc |
start |
sample |
|
---|---|---|---|---|---|---|---|---|
alpha |
32.54 |
3.698 |
0.2076 |
23.51 |
32.69 |
40.31 |
1001 |
100000 |
beta |
30.55 |
3.697 |
0.2076 |
21.52 |
30.69 |
38.31 |
1001 |
100000 |
gamma |
0.9989 |
2.08E-04 |
8.70E-06 |
0.9984 |
0.9989 |
0.9992 |
1001 |
100000 |
sigma |
0.1328 |
0.02071 |
8.17E-05 |
0.09964 |
0.1303 |
0.1805 |
1001 |
100000 |
y[4] |
2.047 |
0.1424 |
5.18E-04 |
1.766 |
2.046 |
2.329 |
1001 |
100000 |
y[27] |
3.04 |
0.1606 |
7.42E-04 |
2.722 |
3.04 |
3.358 |
1001 |
100000 |
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Sat Mar 18 2023
Python implementation: CPython
Python version : 3.11.0
IPython version : 8.9.0
pytensor: 2.10.1
arviz : 0.14.0
numpy : 1.24.2
pymc : 5.1.1
matplotlib: 3.6.3