import arviz as az
import numpy as np
import pymc as pm

%load_ext lab_black

6. Missing Data*

Dugongs

This is our first example of dealing with missing data.

It is adapted from Unit 6: dugongsmissing.odc.

Problem statement

Carlin and Gelfand (1991) investigated the age (x) and length (y) of 27 captured dugongs (sea cows). Estimate parameters in a nonlinear growth model.
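
The growth curve fit in the model code below levels off at an asymptote $\alpha$ as age increases:

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \alpha - \beta \gamma^{x_i}, \qquad 0 < \gamma < 1$$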

References

Data provided by Ratkowsky (1983).

Carlin, B. and Gelfand, A. (1991). An Iterative Monte Carlo Method for Nonconjugate Bayesian Analysis. Statistics and Computing, 1(2), 119–128.

Ratkowsky, D. (1983). Nonlinear Regression Modeling: A Unified Practical Approach. Marcel Dekker, New York.

# fmt: off
X = [1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0, 8.0, 8.5, 9.0, 9.5, 
     9.5, 10.0, 12.0, 12.0, 13.0, 13.0, 14.5, 15.5, 15.5, 16.5, 17.0,
     22.5, 29.0, 31.5]
y = [1.80, 1.85, 1.87, -1, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26,
     2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47,
     2.64, 2.56, 2.70, 2.72, -1]
# fmt: on

PyMC imputes missing data automatically, much like BUGS, but it requires that the missing data be passed in as a NumPy masked array. See the NumPy docs for np.ma.masked_values(). For y above, you can enter any number at the missing-data positions, then pass that number as the value parameter below. Just make sure the number you use doesn't actually occur in the data; I chose -1 above.

Note that this method only works for missing observed data. We’ll learn about other types of missing data later.

y = np.ma.masked_values(y, value=-1)
y
masked_array(data=[1.8, 1.85, 1.87, --, 2.02, 2.27, 2.15, 2.26, 2.47,
                   2.19, 2.26, 2.4, 2.39, 2.41, 2.5, 2.32, 2.32, 2.43,
                   2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.7, 2.72, --],
             mask=[False, False, False,  True, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=-1.0)

Check it out! The array now has a mask attribute.
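
As an aside, if your raw data marks missing values with NaN rather than a sentinel, np.ma.masked_invalid builds the same kind of masked array. A minimal sketch (y_nan here is a hypothetical NaN-coded fragment of the data, not part of the original example):

import numpy as np

# hypothetical NaN-coded version of the first five observations
y_nan = np.array([1.80, 1.85, 1.87, np.nan, 2.02])

# masks NaN/inf entries, giving the same mask structure as masked_values
np.ma.masked_invalid(y_nan).mask  # [False False False  True False]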

with pm.Model() as m:
    # priors
    alpha = pm.Uniform("alpha", 0, 100)
    beta = pm.Uniform("beta", 0, 100)
    gamma = pm.Uniform("gamma", 0, 1)
    # the variable named "sigma" in the trace is on the log scale,
    # matching BUGS' log.sigma; its exponential enters the likelihood
    sigma = pm.math.exp(pm.Uniform("sigma", -10, 10))

    # nonlinear growth curve: length approaches alpha as age increases
    mu = alpha - beta * gamma**X

    likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y)

    # start sampling; PyMC warns that the masked values will be imputed
    trace = pm.sample(5000)
/Users/aaron/mambaforge/envs/pymc_env2/lib/python3.11/site-packages/pymc/model.py:1402: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma, sigma, likelihood_missing]
Sampling 4 chains for 500 tune and 10_000 draw iterations (2_000 + 40_000 draws total) took 25 seconds.
There were 28 divergences after tuning. Increase `target_accept` or reparameterize.
Chain 0 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 3 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
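
These warnings are the sampler asking for more careful settings. If you want to chase them down, one option is to re-run with a higher target_accept, as the messages suggest. A hedged sketch (trace_careful and the 0.95 value are illustrative, not part of the original run):

# illustrative: slower but more careful sampling to reduce divergences
# and tree-depth hits; 0.95 is the usual first thing to try
with m:
    trace_careful = pm.sample(5000, target_accept=0.95)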
az.summary(trace, hdi_prob=0.95, var_names="~likelihood")
                        mean     sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
likelihood_missing[0]  1.907  0.110     1.691      2.125      0.001    0.001   22080.0   21790.0    1.0
likelihood_missing[1]  2.694  0.129     2.439      2.947      0.001    0.001   11806.0   12392.0    1.0
alpha                  2.742  0.148     2.519      3.015      0.002    0.002    7872.0    5961.0    1.0
beta                   0.994  0.122     0.790      1.217      0.002    0.001   10184.0    6771.0    1.0
gamma                  0.889  0.035     0.816      0.953      0.000    0.000    8259.0    6797.0    1.0
sigma                 -2.339  0.156    -2.641     -2.033      0.001    0.001   14593.0   17855.0    1.0
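
The likelihood_missing rows above are the imputed lengths, corresponding to y[4] and y[27] in BUGS' 1-based indexing. You can also pull their posterior draws straight from the trace; a minimal sketch:

# posterior means of the two imputed lengths
trace.posterior["likelihood_missing"].mean(dim=("chain", "draw")).values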

The summary above is very close to the BUGS results if you use the inits labeled Alternative (full) inits (initializing the missing values at 1):

            mean     sd       MC_error  val2.5pc  median   val97.5pc  start  sample
alpha       2.731    0.1206   0.003167   2.551     2.711    3.025     1001   100000
beta        0.9846   0.1021   0.002008   0.8053    0.9773   1.212     1001   100000
gamma       0.8874   0.03381  7.64E-04   0.8124    0.8908   0.9427    1001   100000
log.sigma  -2.342    0.1543   7.83E-04  -2.622    -2.349   -2.018     1001   100000
sigma       0.09733  0.0155   7.96E-05   0.07267   0.09547  0.1329    1001   100000
y[4]        1.906    0.1098   6.23E-04   1.689     1.906    2.123     1001   100000
y[27]       2.69     0.1255   0.001916   2.446     2.689    2.941     1001   100000

If you use the first set of inits, as Professor Vidakovic did in the video, BUGS gives the y[4] mean as 2.047 and y[27] as 3.04, and all the other variables come out very differently as well. This is quite a mystery: as far as I can tell, the only difference in the run below is that the missing values use BUGS-generated inits. If someone figures out what's going on here, let me know!

        mean    sd        MC_error  val2.5pc  median  val97.5pc  start  sample
alpha   32.54   3.698     0.2076    23.51     32.69   40.31      1001   100000
beta    30.55   3.697     0.2076    21.52     30.69   38.31      1001   100000
gamma   0.9989  2.08E-04  8.70E-06  0.9984    0.9989  0.9992     1001   100000
sigma   0.1328  0.02071   8.17E-05  0.09964   0.1303  0.1805     1001   100000
y[4]    2.047   0.1424    5.18E-04  1.766     2.046   2.329      1001   100000
y[27]   3.04    0.1606    7.42E-04  2.722     3.04    3.358      1001   100000
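
If you want to experiment with starting values in PyMC the way these BUGS runs do, pm.sample takes an initvals dict. The values below are illustrative only; unlike the BUGS runs above, the NUTS results shouldn't depend on them.

# illustrative: start chains from chosen values, loosely mirroring BUGS inits
with m:
    trace_inits = pm.sample(
        5000, initvals={"alpha": 1.0, "beta": 1.0, "gamma": 0.9}
    )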

%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Sat Mar 18 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

arviz     : 0.14.0
numpy     : 1.24.2
pymc      : 5.1.1
matplotlib: 3.6.3