2. Rats Example with Missing Data*#

import arviz as az
import numpy as np
import pymc as pm
from pymc.math import exp, invlogit

This example goes further into dealing with missing data in PyMC, including in the predictor variables.

Adapted from Unit 8: ratsignorable1.odc, ratsignorable2.odc, and ratsinformative.odc.

Data can be found here.

Problem statement#

Our previous example about dugongs dealt with missing data in the observed data (the y values). This example shows how to deal with missing data in the predictor variables (x) as well. It's still pretty easy. You can look at it like creating another likelihood in the model, a very simple one where the observed data is x, and you use a single distribution to fill in the missing values (see x_imputed in Model 2 below).
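As a minimal, standalone sketch of that idea (toy data, not the rats data), you can pass a NumPy masked array to observed= and PyMC will treat the masked entries as extra latent variables to be sampled:

# Toy data: the third value is "missing" (masked).
toy = np.ma.masked_values([1.0, 2.0, -1.0, 4.0], value=-1)

with pm.Model():
    mu = pm.Normal("mu", 0, 10)
    # PyMC raises an ImputationWarning and creates an "obs_unobserved" variable
    # for the masked entry, which is sampled along with mu.
    pm.Normal("obs", mu, 1, observed=toy)
    toy_trace = pm.sample(500, tune=500, chains=2)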

Original paper here.

Gelfand et al 1990 consider the problem of missing data, and delete the last observation of cases 6-10, the last two from 11-20, the last 3 from 21-25 and the last 4 from 26-30. The appropriate data file is obtained by simply replacing data values by NA (see below). The model specification is unchanged, since the distinction between observed and unobserved quantities is made in the data file and not the model specification.

x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])
# import y data and create mask (missing data is represented as nan in the file)
y = np.loadtxt("../data/rats.txt")
y = np.nan_to_num(y, nan=-1)  # nan to -1
y = np.ma.masked_values(y, value=-1)  # create mask
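As a quick sanity check (not in the original notebook), you can count the masked entries per rat and in total; the counts should match the deletion pattern Gelfand et al. describe above.

# Number of missing weights per rat (one row per rat) and overall.
print(y.mask.sum(axis=1))
print("total missing:", y.mask.sum())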

Model 1#

In this first model, we only have missing data in our response variable (y). Notice that I made the shapes of alpha and beta (30, 1) instead of just (30,) so that they broadcast correctly when combined in mu = alpha + beta * x. The NumPy docs have a helpful page about broadcasting.
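Here's a quick plain-NumPy illustration of that broadcasting, using the same shapes as the model below (the _demo names are just for this sketch).

# alpha and beta have shape (30, 1); x has shape (5,).
# Broadcasting expands the result to shape (30, 5): one growth curve per rat.
alpha_demo = np.zeros((30, 1))
beta_demo = np.ones((30, 1))
mu_demo = alpha_demo + beta_demo * x
print(mu_demo.shape)  # (30, 5)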

prior_tau = 1e-4

with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=prior_tau)
    alpha_tau = pm.Gamma("alpha_tau", 0.001, 0.001)
    beta_c = pm.Normal("beta_c", 0, tau=prior_tau)
    beta_tau = pm.Gamma("beta_tau", 0.001, 0.001)

    alpha = pm.Normal(
        "alpha", alpha_c, tau=alpha_tau, shape=(30, 1)
    )  # (30, 1) for broadcasting
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.001, 0.001)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    mu = alpha + beta * x

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.95
    )
/Users/aaron/mambaforge/envs/pymc/lib/python3.11/site-packages/pymc/model/core.py:1365: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag_grad...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_c, alpha_tau, beta_c, beta_tau, alpha, beta, lik_tau, likelihood_unobserved]
100.00% [36000/36000 01:28<00:00 Sampling 4 chains, 5 divergences]
Sampling 4 chains for 4_000 tune and 5_000 draw iterations (16_000 + 20_000 draws total) took 89 seconds.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
az.summary(
    trace,
    hdi_prob=0.95,
    var_names=["alpha_c", "alpha_tau", "beta_c", "beta_tau", "sigma"],
    kind="stats",
)
              mean     sd  hdi_2.5%  hdi_97.5%
alpha_c    101.161  2.306    96.773    105.803
alpha_tau    0.027  0.098     0.004      0.049
beta_c       6.570  0.164     6.255      6.905
beta_tau     2.869  1.207     0.953      5.264
sigma        6.025  0.664     4.777      7.341
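The NUTS variable list above shows that PyMC created likelihood_unobserved for the masked y entries. If you want to inspect the imputed weights themselves (not shown in the original), you can summarize that variable directly:

az.summary(trace, var_names=["likelihood_unobserved"], hdi_prob=0.95, kind="stats")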

Model 2: Imputing missing predictor variable data#

This is the same model, except we now have missing x data.

x_miss = np.array([8.0, 15.0, 22.0, -1, 36.0])
x_miss = np.ma.masked_values(x_miss, value=-1)
x_miss
masked_array(data=[8.0, 15.0, 22.0, --, 36.0],
             mask=[False, False, False,  True, False],
       fill_value=-1.0)
prior_tau = 1e-4

with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=prior_tau)
    alpha_tau = pm.Gamma("alpha_tau", 0.001, 0.001)
    beta_c = pm.Normal("beta_c", 0, tau=prior_tau)
    beta_tau = pm.Gamma("beta_tau", 0.001, 0.001)

    alpha = pm.Normal("alpha", alpha_c, tau=alpha_tau, shape=(30, 1))
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.001, 0.001)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    x_imputed = pm.Normal("x_imputed", mu=20, sigma=5, observed=x_miss)

    mu = alpha + beta * x_imputed

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace_2 = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.87
    )
/Users/aaron/mambaforge/envs/pymc/lib/python3.11/site-packages/pymc/model/core.py:1365: ImputationWarning: Data in x_imputed contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
/Users/aaron/mambaforge/envs/pymc/lib/python3.11/site-packages/pymc/model/core.py:1365: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag_grad...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_c, alpha_tau, beta_c, beta_tau, alpha, beta, lik_tau, x_imputed_unobserved, likelihood_unobserved]
100.00% [36000/36000 01:14<00:00 Sampling 4 chains, 18 divergences]
Sampling 4 chains for 4_000 tune and 5_000 draw iterations (16_000 + 20_000 draws total) took 75 seconds.
There were 18 divergences after tuning. Increase `target_accept` or reparameterize.
az.summary(
    trace_2,
    hdi_prob=0.95,
    var_names=["alpha_c", "beta_c", "alpha_tau", "beta_tau", "lik_tau"],
    kind="stats",
)
             mean     sd  hdi_2.5%  hdi_97.5%
alpha_c   101.744  2.348    97.048    106.248
beta_c      6.517  0.168     6.189      6.853
alpha_tau   0.025  0.077     0.004      0.043
beta_tau    2.981  1.269     0.967      5.414
lik_tau     0.029  0.006     0.017      0.041
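The imputed measurement day is likewise available as x_imputed_unobserved (see the NUTS variable list above), so you can check its posterior if you're curious:

az.summary(trace_2, var_names=["x_imputed_unobserved"], hdi_prob=0.95, kind="stats")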

Model 3: Non-ignorable missingness#

The probability of a weight measurement being missing increases at a rate of roughly 1% per unit increase in the weight; that's where b = log(1.01) in the code below comes from.
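Here's a small illustrative calculation (plain NumPy, not part of the original) of how that slope translates into odds of missingness under the logistic model logit(p) = a + b * y:

# Each additional unit of weight multiplies the odds of being missing by exp(b) = 1.01.
b_demo = np.log(1.01)
print(np.exp(b_demo))        # 1.01: about +1% odds per unit of weight
print(np.exp(100 * b_demo))  # ~2.70: odds ratio across a 100-unit weight difference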

y = np.atleast_2d(
    np.array([177.0, 236.0, 285.0, 350.0, -1])
)  # original value was 320
y = np.ma.masked_values(y, value=-1)  # create masked array
# y.mask is equivalent to the "miss" array from the professor's example
miss = y.mask
x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])
t = 0.1
s = 1 / t  # convert BUGS dlogis tau to s for pymc
b = np.log(1.01)

with pm.Model() as m:
    a = pm.Logistic("a", mu=0, s=s)
    alpha = pm.Flat("alpha")
    beta = pm.Flat("beta")
    log_sigma = pm.Flat("log_sigma")
    sigma2 = pm.Deterministic("sigma2", exp(2 * log_sigma))
    tau = pm.Deterministic("tau", 1 / sigma2)

    mu = pm.Deterministic("mu", alpha + beta * x)
    y_imputed = pm.Normal("likelihood", mu, tau=tau, observed=y)

    p = pm.Deterministic("p", invlogit(a + b * y_imputed))
    pm.Bernoulli("missing", p=p, observed=miss)

    trace_3 = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.95
    )
/Users/aaron/mambaforge/envs/pymc/lib/python3.11/site-packages/pymc/model/core.py:1365: ImputationWarning: Data in likelihood contains missing values and will be automatically imputed from the sampling distribution.
  warnings.warn(impute_message, ImputationWarning)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag_grad...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, alpha, beta, log_sigma, likelihood_unobserved]
100.00% [36000/36000 00:22<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 4_000 tune and 5_000 draw iterations (16_000 + 20_000 draws total) took 23 seconds.
az.summary(trace_3, hdi_prob=0.95, kind="stats")
                             mean        sd  hdi_2.5%  hdi_97.5%
a                          -4.816     1.366    -7.606     -2.334
alpha                     110.976    18.973    82.593    135.184
beta                        8.175     0.989     6.840      9.434
log_sigma                   1.821     0.649     0.768      3.156
likelihood_unobserved[0]  406.247    25.075   374.857    437.231
sigma2                    217.298  3347.053     2.286    448.064
tau                         0.047     0.047     0.000      0.140
mu[0]                     176.376    11.832   158.025    192.071
mu[1]                     233.601     7.089   222.921    245.086
mu[2]                     290.825     7.511   279.627    301.906
mu[3]                     348.050    12.589   330.514    364.368
mu[4]                     405.275    18.879   380.380    429.687
likelihood[0, 0]          177.000     0.000   177.000    177.000
likelihood[0, 1]          236.000     0.000   236.000    236.000
likelihood[0, 2]          285.000     0.000   285.000    285.000
likelihood[0, 3]          350.000     0.000   350.000    350.000
likelihood[0, 4]          406.247    25.075   374.857    437.231
p[0, 0]                     0.079     0.085     0.000      0.253
p[0, 1]                     0.127     0.121     0.000      0.378
p[0, 2]                     0.180     0.155     0.000      0.498
p[0, 3]                     0.272     0.198     0.000      0.654
p[0, 4]                     0.367     0.229     0.000      0.769
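To look at the posterior of the imputed final weight on its own (the deleted value was 320, per the comment in the data cell above), you could plot it with ArviZ:

az.plot_posterior(trace_3, var_names=["likelihood_unobserved"], hdi_prob=0.95)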
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Tue Nov 07 2023

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pytensor: 2.17.1

arviz: 0.16.1
numpy: 1.25.2
pymc : 5.9.0