import arviz as az
import numpy as np
import pymc as pm

%load_ext lab_black

6. Missing Data*

Dugongs

This is our first example of dealing with missing data.

It is adapted from Unit 6: dugongsmissing.odc.

Problem statement

Carlin and Gelfand (1991) investigated the age (x) and length (y) of 27 captured dugongs (sea cows). Estimate parameters in a nonlinear growth model.
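
The growth curve fit in the model code below levels off at an asymptote $\alpha$ as age increases:

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \alpha - \beta \gamma^{x_i}, \qquad 0 < \gamma < 1$$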

References

Data provided by Ratkowsky (1983).

Carlin, B. and Gelfand, A. (1991). An Iterative Monte Carlo Method for Nonconjugate Bayesian Analysis. Statistics and Computing, 1(2), 119–128.

Ratkowsky, D. (1983). Nonlinear Regression Modeling: A Unified Practical Approach. Marcel Dekker, New York.

# fmt: off
X = [1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0, 8.0, 8.5, 9.0, 9.5, 
     9.5, 10.0, 12.0, 12.0, 13.0, 13.0, 14.5, 15.5, 15.5, 16.5, 17.0,
     22.5, 29.0, 31.5]
y = [1.80, 1.85, 1.87, -1, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26,
     2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47,
     2.64, 2.56, 2.70, 2.72, -1]
# fmt: on

PyMC imputes missing data automatically, much like BUGS, but it requires that the missing data be passed in as a NumPy masked array. See the NumPy docs for np.ma.masked_values(). For y above, you can enter any number at the missing-data positions, then pass that number as the value parameter below. Just make sure the number you use doesn't actually occur in the data; I chose -1 above.

Note that this method only works for missing observed data. We’ll learn about other types of missing data later.

y = np.ma.masked_values(y, value=-1)
y
masked_array(data=[1.8, 1.85, 1.87, --, 2.02, 2.27, 2.15, 2.26, 2.47,
                   2.19, 2.26, 2.4, 2.39, 2.41, 2.5, 2.32, 2.32, 2.43,
                   2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.7, 2.72, --],
             mask=[False, False, False,  True, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=-1.0)

Check it out! The array now has a mask attribute.
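
As an aside, if your raw data marks missing values with NaN rather than a sentinel, np.ma.masked_invalid builds the same kind of masked array. A minimal sketch (y_nan here is a hypothetical NaN-coded fragment of the data, not part of the original example):

import numpy as np

# hypothetical NaN-coded version of the first five observations
y_nan = np.array([1.80, 1.85, 1.87, np.nan, 2.02])

# masks NaN/inf entries, giving the same mask structure as masked_values
np.ma.masked_invalid(y_nan).mask  # [False False False  True False]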

with pm.Model() as m:
    # priors
    alpha = pm.Uniform("alpha", 0, 100)
    beta = pm.Uniform("beta", 0, 100)
    gamma = pm.Uniform("gamma", 0, 1)
    # the variable named "sigma" in the trace is on the log scale,
    # matching BUGS' log.sigma; its exponential enters the likelihood
    sigma = pm.math.exp(pm.Uniform("sigma", -10, 10))

    # nonlinear growth curve: length approaches alpha as age increases
    mu = alpha - beta * gamma**X

    likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    # start sampling; PyMC warns that the masked values will be imputed
    trace = pm.sample(5000)
/Users/aaron/mambaforge/envs/pymc_env2/lib/python3.11/site-packages/pymc/model.py:1402: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma, sigma, likelihood_missing]
Sampling 4 chains for 500 tune and 10_000 draw iterations (2_000 + 40_000 draws total) took 25 seconds.
There were 28 divergences after tuning. Increase `target_accept` or reparameterize.
Chain 0 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 3 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
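
These warnings are the sampler asking for more careful settings. If you want to chase them down, one option is to re-run with a higher target_accept, as the messages suggest. A hedged sketch (trace_careful and the 0.95 value are illustrative, not part of the original run):

# illustrative: slower but more careful sampling to reduce divergences
# and tree-depth hits; 0.95 is the usual first thing to try
with m:
    trace_careful = pm.sample(5000, target_accept=0.95)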
az.summary(trace, hdi_prob=0.95, var_names="~likelihood")
                        mean     sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
likelihood_missing[0]  1.907  0.110     1.691      2.125      0.001    0.001   22080.0   21790.0    1.0
likelihood_missing[1]  2.694  0.129     2.439      2.947      0.001    0.001   11806.0   12392.0    1.0
alpha                  2.742  0.148     2.519      3.015      0.002    0.002    7872.0    5961.0    1.0
beta                   0.994  0.122     0.790      1.217      0.002    0.001   10184.0    6771.0    1.0
gamma                  0.889  0.035     0.816      0.953      0.000    0.000    8259.0    6797.0    1.0
sigma                 -2.339  0.156    -2.641     -2.033      0.001    0.001   14593.0   17855.0    1.0
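
The likelihood_missing rows above are the imputed lengths, corresponding to y[4] and y[27] in BUGS' 1-based indexing. You can also pull their posterior draws straight from the trace; a minimal sketch:

# posterior means of the two imputed lengths
trace.posterior["likelihood_missing"].mean(dim=("chain", "draw")).values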

The summary above is very close to the BUGS results if you use the inits labeled Alternative (full) inits (initializing the missing values at 1):

            mean     sd       MC_error  val2.5pc  median   val97.5pc  start  sample
alpha       2.731    0.1206   0.003167   2.551     2.711    3.025     1001   100000
beta        0.9846   0.1021   0.002008   0.8053    0.9773   1.212     1001   100000
gamma       0.8874   0.03381  7.64E-04   0.8124    0.8908   0.9427    1001   100000
log.sigma  -2.342    0.1543   7.83E-04  -2.622    -2.349   -2.018     1001   100000
sigma       0.09733  0.0155   7.96E-05   0.07267   0.09547  0.1329    1001   100000
y[4]        1.906    0.1098   6.23E-04   1.689     1.906    2.123     1001   100000
y[27]       2.69     0.1255   0.001916   2.446     2.689    2.941     1001   100000

If you use the first set of inits, as Professor Vidakovic did in the video, BUGS gives the y[4] mean as 2.047 and y[27] as 3.04, and all the other variables come out very differently as well. This is quite a mystery: as far as I can tell, the only difference in the run below is that the missing values use BUGS-generated inits. If someone figures out what's going on here, let me know!

        mean    sd        MC_error  val2.5pc  median  val97.5pc  start  sample
alpha   32.54   3.698     0.2076    23.51     32.69   40.31      1001   100000
beta    30.55   3.697     0.2076    21.52     30.69   38.31      1001   100000
gamma   0.9989  2.08E-04  8.70E-06  0.9984    0.9989  0.9992     1001   100000
sigma   0.1328  0.02071   8.17E-05  0.09964   0.1303  0.1805     1001   100000
y[4]    2.047   0.1424    5.18E-04  1.766     2.046   2.329      1001   100000
y[27]   3.04    0.1606    7.42E-04  2.722     3.04    3.358      1001   100000
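
If you want to experiment with starting values in PyMC the way these BUGS runs do, pm.sample takes an initvals dict. The values below are illustrative only; unlike the BUGS runs above, the NUTS results shouldn't depend on them.

# illustrative: start chains from chosen values, loosely mirroring BUGS inits
with m:
    trace_inits = pm.sample(
        5000, initvals={"alpha": 1.0, "beta": 1.0, "gamma": 0.9}
    )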

%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Sat Mar 18 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

arviz     : 0.14.0
numpy     : 1.24.2
pymc      : 5.1.1
matplotlib: 3.6.3