2. Rats Example with Missing Data*

2. Rats Example with Missing Data*#

This example goes further into dealing with missing data in PyMC, including in the predictor variables.

Adapted from Unit 8: ratsignorable1.odc, ratsignorable2.odc, and ratsinformative.odc.

Data can be found here.

Problem statement#

We had a previous example about dugongs that dealt with missing data in the observed data (y values). This example shows how to deal with missing data in the input data (x). It’s still pretty easy. You could look at it like creating another likelihood in the model, a very simple one where the observed data is x, and you use a single distribution to fill in the missing values (see x_imputed in the model below).

Original paper here.

Gelfand et al 1990 consider the problem of missing data, and delete the last observation of cases 6-10, the last two from 11-20, the last 3 from 21-25 and the last 4 from 26-30. The appropriate data file is obtained by simply replacing data values by NA (see below). The model specification is unchanged, since the distinction between observed and unobserved quantities is made in the data file and not the model specification.

x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])

# import y data and create mask (missing data is represented as nan in the file)
y = np.loadtxt("../data/rats.txt")
y = np.nan_to_num(y, nan=-1)  # nan to -1
y = np.ma.masked_values(y, value=-1)  # create mask

Model 1#

This first model we only have missing data in our response variable (y). Notice that I made the shapes of alpha and beta (30, 1) instead of just 30. This is so that they broadcast correctly when combined (mu = alpha + beta * x). The NumPy docs have a helpful page about broadcasting.

prior_tau = 1e-4

with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=prior_tau)
    alpha_tau = pm.Gamma("alpha_tau", 0.001, 0.001)
    beta_c = pm.Normal("beta_c", 0, tau=prior_tau)
    beta_tau = pm.Gamma("beta_tau", 0.001, 0.001)

    alpha = pm.Normal(
        "alpha", alpha_c, tau=alpha_tau, shape=(30, 1)
    )  # (30, 1) for broadcasting
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.001, 0.001)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    mu = alpha + beta * x

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.95
    )

az.summary(
    trace,
    hdi_prob=0.95,
    var_names=["alpha_c", "alpha_tau", "beta_c", "beta_tau", "sigma"],
    kind="stats",
)

	mean	sd	hdi_2.5%	hdi_97.5%
alpha_c	101.161	2.306	96.773	105.803
alpha_tau	0.027	0.098	0.004	0.049
beta_c	6.570	0.164	6.255	6.905
beta_tau	2.869	1.207	0.953	5.264
sigma	6.025	0.664	4.777	7.341

Model 2: Imputing missing predictor variable data#

This is the same model, except we now have missing x data.

x_miss = np.array([8.0, 15.0, 22.0, -1, 36.0])
x_miss = np.ma.masked_values(x_miss, value=-1)

x_miss

masked_array(data=[8.0, 15.0, 22.0, --, 36.0],
             mask=[False, False, False,  True, False],
       fill_value=-1.0)

prior_tau = 1e-4

with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=prior_tau)
    alpha_tau = pm.Gamma("alpha_tau", 0.001, 0.001)
    beta_c = pm.Normal("beta_c", 0, tau=prior_tau)
    beta_tau = pm.Gamma("beta_tau", 0.001, 0.001)

    alpha = pm.Normal("alpha", alpha_c, tau=alpha_tau, shape=(30, 1))
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.001, 0.001)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    x_imputed = pm.Normal("x_imputed", mu=20, sigma=5, observed=x_miss)

    mu = alpha + beta * x_imputed

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace_2 = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.87
    )

az.summary(
    trace_2,
    hdi_prob=0.95,
    var_names=["alpha_c", "beta_c", "alpha_tau", "beta_tau", "lik_tau"],
    kind="stats",
)

	mean	sd	hdi_2.5%	hdi_97.5%
alpha_c	101.744	2.348	97.048	106.248
beta_c	6.517	0.168	6.189	6.853
alpha_tau	0.025	0.077	0.004	0.043
beta_tau	2.981	1.269	0.967	5.414
lik_tau	0.029	0.006	0.017	0.041

Model 3: Non-ignorable missingness#

Probability of missingness increases approx. at a rate of 1% with increasing the weight.

y = np.atleast_2d(
    np.array([177.0, 236.0, 285.0, 350.0, -1])
)  # original value was 320
y = np.ma.masked_values(y, value=-1)  # create masked array
# y.mask is equivalent to the "miss" array from the professor's example
miss = y.mask
x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])

t = 0.1
s = 1 / t  # convert BUGS dlogis tau to s for pymc
b = np.log(1.01)

with pm.Model() as m:
    a = pm.Logistic("a", mu=0, s=s)
    alpha = pm.Flat("alpha")
    beta = pm.Flat("beta")
    log_sigma = pm.Flat("log_sigma")
    sigma2 = pm.Deterministic("sigma2", exp(2 * log_sigma))
    tau = pm.Deterministic("tau", 1 / sigma2)

    mu = pm.Deterministic("mu", alpha + beta * x)
    y_imputed = pm.Normal("likelihood", mu, tau=tau, observed=y)

    p = pm.Deterministic("p", invlogit(a + b * y_imputed))
    pm.Bernoulli("missing", p=p, observed=miss)

    trace_3 = pm.sample(
        5000, tune=4000, init="jitter+adapt_diag_grad", target_accept=0.95
    )

az.summary(trace_3, hdi_prob=0.95, kind="stats")

	mean	sd	hdi_2.5%	hdi_97.5%
a	-4.816	1.366	-7.606	-2.334
alpha	110.976	18.973	82.593	135.184
beta	8.175	0.989	6.840	9.434
log_sigma	1.821	0.649	0.768	3.156
likelihood_unobserved[0]	406.247	25.075	374.857	437.231
sigma2	217.298	3347.053	2.286	448.064
tau	0.047	0.047	0.000	0.140
mu[0]	176.376	11.832	158.025	192.071
mu[1]	233.601	7.089	222.921	245.086
mu[2]	290.825	7.511	279.627	301.906
mu[3]	348.050	12.589	330.514	364.368
mu[4]	405.275	18.879	380.380	429.687
likelihood[0, 0]	177.000	0.000	177.000	177.000
likelihood[0, 1]	236.000	0.000	236.000	236.000
likelihood[0, 2]	285.000	0.000	285.000	285.000
likelihood[0, 3]	350.000	0.000	350.000	350.000
likelihood[0, 4]	406.247	25.075	374.857	437.231
p[0, 0]	0.079	0.085	0.000	0.253
p[0, 1]	0.127	0.121	0.000	0.378
p[0, 2]	0.180	0.155	0.000	0.498
p[0, 3]	0.272	0.198	0.000	0.654
p[0, 4]	0.367	0.229	0.000	0.769

%load_ext watermark
%watermark -n -u -v -iv -p pytensor

Last updated: Tue Nov 07 2023

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

pytensor: 2.17.1

arviz: 0.16.1
numpy: 1.25.2
pymc : 5.9.0