import pymc as pm
import pandas as pd
import numpy as np
import arviz as az

%load_ext lab_black

19. Paraguay Vaccination Status*

This example goes over a multilevel, or hierarchical, logistic regression model. It also shows how to use the PyMC coordinate system.

Adapted from Unit 7: paraguay.odc and paraguaynocluster.odc

Data can be found here.

This study considers factors influencing vaccination status among 3424 children of 2552 mothers across 264 survey clusters in Paraguay. In this analysis, we're specifically interested in mother-level factors related to child immunization. However, children within the same cluster tend to be more alike than children in different clusters, so we need to account for cluster-level variation.

  • ID3: cluster number

  • VACCODE: =1 if fully immunized, =0 otherwise

  • LB.TOT: number of live births

  • MAGE2: =1 if mother's age < 20, =0 otherwise

  • UN2: =1 if consensual union, =0 otherwise

  • TOILET2: =1 if household toilet is unsafe, =0 otherwise

  • PR.SPOC1: =1 if spouse is an unskilled laborer, =0 otherwise

  • SPANISH2: =1 if Spanish is not the household language, =0 otherwise

We need to add a random effect by cluster. This is a good use case for the PyMC coordinate system.
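Written out, this is the model we're about to build (priors match the code below; normal distributions are shown with variances, so $\tau$ is a precision):

$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) &= \mathbf{x}_i^\top \boldsymbol{\beta} + u_{j(i)} \\
u_j &\sim \mathcal{N}\!\left(0,\ 1/\tau\right), \quad j = 1, \dots, 264 \\
\tau &\sim \mathrm{Gamma}(0.01,\ 0.01) \\
\beta_k &\sim \mathcal{N}\!\left(0,\ 10^3\right), \quad k = 0, \dots, 6
\end{aligned}
$$

where $j(i)$ is the cluster that child $i$ belongs to and $\mathbf{x}_i$ includes the intercept.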

data = pd.read_csv("../data/paraguay.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID3       3424 non-null   int64
 1   VACCODE   3424 non-null   int64
 2   LB.TOT    3424 non-null   int64
 3   MAGE2     3424 non-null   int64
 4   UN2       3424 non-null   int64
 5   TOILET2   3424 non-null   int64
 6   PR.SPOC1  3424 non-null   int64
 7   SPANISH2  3424 non-null   int64
dtypes: int64(8)
memory usage: 214.1 KB
y = data["VACCODE"].to_numpy()
# separate array for clusters
clusters = data["ID3"].to_numpy()
X = data.drop(["VACCODE", "ID3"], axis=1).to_numpy()
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)  # prepend intercept column
y.shape, clusters.shape, X_aug.shape
((3424,), (3424,), (3424, 7))
cols = X_aug.shape[1]
# map each ID3 value to an integer code starting at 0;
# `clusters` now holds the unique cluster labels
cluster_idx, clusters = pd.factorize(data.ID3)
coords = {"cluster": clusters, "id": data.index.to_numpy()}
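If pd.factorize is unfamiliar: it returns integer codes starting at 0 plus the array of unique labels, which is exactly the pairing we need to index the per-cluster effects. A quick sketch on a toy column (values made up for illustration):

# toy example of pd.factorize (values are made up)
toy = pd.Series([101, 101, 205, 101, 307])
codes, uniques = pd.factorize(toy)
print(codes)    # [0 0 1 0 2] -- position of each value in `uniques`
print(uniques)  # Index([101, 205, 307], dtype='int64')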
# note that the coords dict is passed to pm.Model call
with pm.Model(coords=coords) as m:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=False)
    clust_idx = pm.Data("cluster_idx", cluster_idx, dims="id", mutable=True)

    cluster_tau = pm.Gamma("cluster_tau", 0.01, 0.01)
    cluster_variance = pm.Deterministic("cluster_variance", 1 / cluster_tau)
    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    cluster_effect = pm.Normal("cluster_effect", 0, tau=cluster_tau, dims="cluster")
    p = pm.math.dot(X_data, beta) + cluster_effect[clust_idx]

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace = pm.sample(3000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [cluster_tau, beta, cluster_effect]
100.00% [16000/16000 00:29<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 30 seconds.
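The key line in the model above is cluster_effect[clust_idx]: indexing the length-264 effect vector with the length-3424 code array broadcasts each cluster's effect out to every child in that cluster. A plain-NumPy analogue with toy numbers:

# toy analogue of cluster_effect[clust_idx] (numbers are made up)
effects = np.array([0.5, -0.2, 0.1])  # one random effect per cluster
idx = np.array([0, 0, 1, 2, 2, 2])    # cluster code for each child
print(effects[idx])                   # [ 0.5  0.5 -0.2  0.1  0.1  0.1]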
az.summary(
    trace, var_names=["beta", "cluster_variance"], filter_vars="like", kind="stats"
)
                    mean     sd  hdi_3%  hdi_97%
beta[0]            1.444  0.131   1.205    1.694
beta[1]           -0.070  0.015  -0.100   -0.043
beta[2]           -0.565  0.205  -0.959   -0.191
beta[3]           -0.196  0.096  -0.378   -0.015
beta[4]           -0.692  0.133  -0.935   -0.435
beta[5]           -0.283  0.110  -0.488   -0.075
beta[6]           -0.621  0.099  -0.804   -0.435
cluster_variance   0.530  0.094   0.351    0.701
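The betas are on the log-odds scale. Exponentiating gives odds ratios; for example, the MAGE2 coefficient (beta[2]) of about −0.565 corresponds to an odds ratio of exp(−0.565) ≈ 0.57, i.e. lower odds of full immunization for mothers under 20, holding the other covariates fixed. A sketch of doing this across the whole posterior:

# exponentiate posterior draws of beta to get odds ratios
odds_ratios = np.exp(trace.posterior["beta"])
# posterior mean odds ratio per coefficient (intercept first)
print(odds_ratios.mean(dim=("chain", "draw")).values)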
az.summary(trace, var_names=["cluster_effect"])
                       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
cluster_effect[1]    -0.215  0.666  -1.502    1.029      0.005    0.007   20599.0    8145.0    1.0
cluster_effect[2]    -0.032  0.630  -1.188    1.193      0.005    0.006   18577.0    8663.0    1.0
cluster_effect[3]     0.205  0.666  -1.065    1.457      0.005    0.007   18365.0    8420.0    1.0
cluster_effect[4]    -0.319  0.609  -1.510    0.787      0.004    0.005   20585.0    8696.0    1.0
cluster_effect[5]     0.307  0.656  -0.898    1.551      0.004    0.006   21794.0    8582.0    1.0
...                     ...    ...     ...      ...        ...      ...       ...       ...    ...
cluster_effect[260]   0.417  0.581  -0.662    1.527      0.004    0.004   19641.0    9036.0    1.0
cluster_effect[261]  -0.258  0.539  -1.300    0.715      0.004    0.005   17799.0    8935.0    1.0
cluster_effect[262]  -0.239  0.687  -1.539    1.026      0.005    0.007   17423.0    8331.0    1.0
cluster_effect[263]  -0.084  0.522  -1.054    0.912      0.004    0.005   19809.0    8587.0    1.0
cluster_effect[264]   0.581  0.428  -0.242    1.363      0.003    0.003   18284.0    9180.0    1.0

264 rows × 9 columns
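Because cluster_effect carries the "cluster" dimension, ArviZ can select effects by coordinate label rather than by position. A sketch plotting the first 20 clusters (the subset is arbitrary, chosen only for readability):

# forest plot of a readable subset of cluster effects
az.plot_forest(
    trace,
    var_names=["cluster_effect"],
    coords={"cluster": clusters[:20]},
    combined=True,
)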

Based on this coordinates example:

https://oriolabrilpla.cat/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html

No clusters

For comparison, here's the same regression with the cluster random effect removed, i.e. complete pooling across clusters.

with pm.Model() as m_nc:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=True)

    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    p = pm.math.dot(X_data, beta)

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace_nc = pm.sample(3000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta]
100.00% [16000/16000 00:20<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 21 seconds.
az.summary(
    trace_nc, var_names=["beta", "cluster_variance"], filter_vars="like", kind="stats"
)
           mean     sd  hdi_3%  hdi_97%
beta[0]   1.429  0.107   1.225    1.626
beta[1]  -0.064  0.013  -0.090   -0.040
beta[2]  -0.566  0.190  -0.927   -0.213
beta[3]  -0.188  0.086  -0.347   -0.025
beta[4]  -0.716  0.116  -0.943   -0.508
beta[5]  -0.452  0.089  -0.618   -0.281
beta[6]  -0.606  0.084  -0.769   -0.459
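The pooled model's betas are broadly similar, although beta[5] (PR.SPOC1) shifts noticeably once cluster variation is ignored. To compare the two models more formally, one option is PSIS-LOO via az.compare; this is a sketch and assumes you re-run sampling with pointwise log-likelihoods stored, since pm.sample doesn't keep them by default:

# re-sample with pointwise log-likelihoods stored, then compare via LOO
with m:
    trace = pm.sample(3000, idata_kwargs={"log_likelihood": True})
with m_nc:
    trace_nc = pm.sample(3000, idata_kwargs={"log_likelihood": True})

az.compare({"clustered": trace, "no_clusters": trace_nc})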
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Wed Mar 22 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

numpy : 1.24.2
pandas: 1.5.3
arviz : 0.14.0
pymc  : 5.1.2