import arviz as az
import numpy as np
import pymc as pm

6. Missing Data*

Dugongs

This is our first example of dealing with missing data.

It is adapted from Unit 6: dugongsmissing.odc.

Problem statement

Carlin and Gelfand (1991) investigated the age (x) and length (y) of 27 captured dugongs (sea cows). Estimate parameters in a nonlinear growth model.
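Concretely, the nonlinear growth curve fit in the code below is

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \alpha - \beta\,\gamma^{x_i}, \qquad 0 < \gamma < 1,$$

so length increases quickly at young ages and levels off toward \(\alpha\) as the dugong ages.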

References

Data provided by Ratkowsky (1983).

Carlin, B. P. and Gelfand, A. E. (1991). An Iterative Monte Carlo Method for Nonconjugate Bayesian Analysis. Statistics and Computing, 1(2), 119–128.

Ratkowsky, D. (1983). Nonlinear Regression Modeling: A Unified Practical Approach. Marcel Dekker, New York.

# fmt: off
X = [1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0, 8.0, 8.5, 9.0, 9.5, 
     9.5, 10.0, 12.0, 12.0, 13.0, 13.0, 14.5, 15.5, 15.5, 16.5, 17.0,
     22.5, 29.0, 31.5]
y = [1.80, 1.85, 1.87, np.nan, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26,
     2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47,
     2.64, 2.56, 2.70, 2.72, np.nan]
# fmt: on

PyMC imputes missing data automatically, similar to BUGS.

You used to need to pass a masked NumPy array, but that's no longer necessary: PyMC now detects NaN values in the observed argument and imputes them automatically. Note that this only works for missing observed data. We'll learn about other types of missing data in Unit 8.
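As a quick sanity check (a small sketch using the y list defined above), we can confirm which entries will be imputed. The 0-based positions 3 and 26 correspond to y[4] and y[27] in the BUGS results further down, and to likelihood_unobserved[0] and likelihood_unobserved[1] in the PyMC summary.

y_arr = np.asarray(y)
print(np.where(np.isnan(y_arr))[0])  # [ 3 26]: the 4th and 27th dugongs are missing lengths
print(int(np.isnan(y_arr).sum()))    # 2 values for PyMC to impute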

with pm.Model() as m:
    # priors
    alpha = pm.Uniform("alpha", 0, 100)
    beta = pm.Uniform("beta", 0, 100)
    gamma = pm.Uniform("gamma", 0, 1)
    # this "sigma" is sampled on the log scale (it corresponds to log.sigma in BUGS);
    # pm.math.exp maps it to a positive value for the likelihood
    sigma = pm.math.exp(pm.Uniform("sigma", -10, 10))

    mu = alpha - beta * gamma**X

    likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    # start sampling
    trace = pm.sample(5000)
/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pymc/model/core.py:1302: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma, sigma, likelihood_unobserved]

Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 8 seconds.
There were 30 divergences after tuning. Increase `target_accept` or reparameterize.
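Those divergences are relatively few, but if you want to follow the warning's suggestion, target_accept can be passed straight to pm.sample. This is just a sketch (not run here); 0.95 is an arbitrary choice, and smaller step sizes mean slower sampling.

with m:
    # higher target acceptance rate -> smaller steps -> usually fewer divergences
    trace_high_accept = pm.sample(5000, target_accept=0.95)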
az.summary(trace, hdi_prob=0.95, var_names="~likelihood")
|                          | mean   | sd    | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|--------------------------|--------|-------|----------|-----------|-----------|---------|----------|----------|-------|
| likelihood_unobserved[0] | 1.905  | 0.111 | 1.694    | 2.132     | 0.001     | 0.001   | 10455.0  | 10085.0  | 1.0   |
| likelihood_unobserved[1] | 2.697  | 0.127 | 2.446    | 2.945     | 0.002     | 0.001   | 6013.0   | 4239.0   | 1.0   |
| alpha                    | 2.743  | 0.146 | 2.519    | 3.017     | 0.003     | 0.002   | 4000.0   | 2198.0   | 1.0   |
| beta                     | 0.994  | 0.121 | 0.791    | 1.218     | 0.002     | 0.002   | 4980.0   | 2841.0   | 1.0   |
| gamma                    | 0.889  | 0.035 | 0.818    | 0.953     | 0.000     | 0.000   | 4174.0   | 2674.0   | 1.0   |
| sigma                    | -2.344 | 0.156 | -2.648   | -2.042    | 0.002     | 0.001   | 6996.0   | 7641.0   | 1.0   |
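One thing to keep in mind when reading this summary: the variable named sigma is the Uniform(-10, 10) quantity we exponentiated in the model, so it's on the log scale and matches log.sigma in the BUGS output below. If you want it on the natural scale, one option (a sketch; sigma_exp is just a name I'm making up) is to transform the posterior draws after sampling:

# exponentiate the log-scale draws and summarize on the natural scale
trace.posterior["sigma_exp"] = np.exp(trace.posterior["sigma"])
az.summary(trace, var_names=["sigma_exp"], hdi_prob=0.95)

This should land close to the BUGS sigma row below (around 0.097).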

This output is very close to the BUGS results if you use the inits labeled Alternative (full) inits (initializing the missing values at 1):

|           | mean    | sd      | MC_error | val2.5pc | median  | val97.5pc | start | sample |
|-----------|---------|---------|----------|----------|---------|-----------|-------|--------|
| alpha     | 2.731   | 0.1206  | 0.003167 | 2.551    | 2.711   | 3.025     | 1001  | 100000 |
| beta      | 0.9846  | 0.1021  | 0.002008 | 0.8053   | 0.9773  | 1.212     | 1001  | 100000 |
| gamma     | 0.8874  | 0.03381 | 7.64E-04 | 0.8124   | 0.8908  | 0.9427    | 1001  | 100000 |
| log.sigma | -2.342  | 0.1543  | 7.83E-04 | -2.622   | -2.349  | -2.018    | 1001  | 100000 |
| sigma     | 0.09733 | 0.0155  | 7.96E-05 | 0.07267  | 0.09547 | 0.1329    | 1001  | 100000 |
| y[4]      | 1.906   | 0.1098  | 6.23E-04 | 1.689    | 1.906   | 2.123     | 1001  | 100000 |
| y[27]     | 2.69    | 0.1255  | 0.001916 | 2.446    | 2.689   | 2.941     | 1001  | 100000 |

If you use the first set of inits, as Professor Vidakovic does in the video, BUGS gives the y[4] mean as 2.047 and y[27] as 3.04, and all the other variables differ substantially as well. This is quite a mystery; as far as I can tell, the only difference is that the missing values are left to BUGS-generated inits in the results below. If someone figures out what's going on here, let me know!

|       | mean   | sd       | MC_error | val2.5pc | median | val97.5pc | start | sample |
|-------|--------|----------|----------|----------|--------|-----------|-------|--------|
| alpha | 32.54  | 3.698    | 0.2076   | 23.51    | 32.69  | 40.31     | 1001  | 100000 |
| beta  | 30.55  | 3.697    | 0.2076   | 21.52    | 30.69  | 38.31     | 1001  | 100000 |
| gamma | 0.9989 | 2.08E-04 | 8.70E-06 | 0.9984   | 0.9989 | 0.9992    | 1001  | 100000 |
| sigma | 0.1328 | 0.02071  | 8.17E-05 | 0.09964  | 0.1303 | 0.1805    | 1001  | 100000 |
| y[4]  | 2.047  | 0.1424   | 5.18E-04 | 1.766    | 2.046  | 2.329     | 1001  | 100000 |
| y[27] | 3.04   | 0.1606   | 7.42E-04 | 2.722    | 3.04   | 3.358     | 1001  | 100000 |
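If you want to play with starting values on the PyMC side the way these two BUGS runs do, pm.sample accepts an initvals dictionary, and the imputed values live under the automatically created likelihood_unobserved variable. Here's a sketch (not run here) mirroring the run that initializes the missing values at 1; note that the default jitter+adapt_diag initialization will still perturb these starting points, and a converged NUTS run shouldn't depend on them the way the BUGS results above apparently do.

with m:
    # start both imputed lengths at 1, like the "Alternative (full) inits" BUGS run
    trace_inits = pm.sample(5000, initvals={"likelihood_unobserved": np.ones(2)})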

%load_ext watermark
%watermark -n -u -v -iv -p pytensor
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Last updated: Sun Mar 09 2025

Python implementation: CPython
Python version       : 3.12.7
IPython version      : 8.29.0

pytensor: 2.26.4

numpy: 1.26.4
pymc : 5.19.1
arviz: 0.20.0