import pymc as pm
import pandas as pd
import numpy as np
import arviz as az

%load_ext lab_black

19. Paraguay Vaccination Status*

This example goes over a multilevel, or hierarchical, logistic regression model. It also shows how to use the PyMC coordinate system.

Adapted from Unit 7: paraguay.odc and paraguaynocluster.odc

Data can be found here.

This study considers factors influencing vaccination status among 3424 children of 2552 mothers across 264 survey clusters in Paraguay. In this analysis, we're specifically interested in mother-level factors related to child immunization. However, children within the same cluster tend to be more alike than children in different clusters, so we need to account for cluster-level variation.

  • ID3: cluster number

  • VACCODE: =1 if fully immunized, =0 otherwise

  • LB.TOT: number of live births

  • MAGE2: =1 if mother's age < 20, =0 otherwise

  • UN2: =1 if consensual union, =0 otherwise

  • TOILET2: =1 if household toilet is unsafe, =0 otherwise

  • PR.SPOC1: =1 if spouse is an unskilled laborer, =0 otherwise

  • SPANISH2: =1 if Spanish is not the household language, =0 otherwise

We need to add a random effect by cluster. This is a good use case for the PyMC coordinate system.
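Written out, this is the model we're about to build (priors match the code below; normal distributions are shown with variances, so $\tau$ is a precision):

$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) &= \mathbf{x}_i^\top \boldsymbol{\beta} + u_{j(i)} \\
u_j &\sim \mathcal{N}\!\left(0,\ 1/\tau\right), \quad j = 1, \dots, 264 \\
\tau &\sim \mathrm{Gamma}(0.01,\ 0.01) \\
\beta_k &\sim \mathcal{N}\!\left(0,\ 10^3\right), \quad k = 0, \dots, 6
\end{aligned}
$$

where $j(i)$ is the cluster that child $i$ belongs to and $\mathbf{x}_i$ includes the intercept.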

data = pd.read_csv("../data/paraguay.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID3       3424 non-null   int64
 1   VACCODE   3424 non-null   int64
 2   LB.TOT    3424 non-null   int64
 3   MAGE2     3424 non-null   int64
 4   UN2       3424 non-null   int64
 5   TOILET2   3424 non-null   int64
 6   PR.SPOC1  3424 non-null   int64
 7   SPANISH2  3424 non-null   int64
dtypes: int64(8)
memory usage: 214.1 KB
y = data["VACCODE"].to_numpy()
# separate array for clusters
clusters = data["ID3"].to_numpy()
X = data.drop(["VACCODE", "ID3"], axis=1).to_numpy()
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)  # prepend intercept column
y.shape, clusters.shape, X_aug.shape
((3424,), (3424,), (3424, 7))
cols = X_aug.shape[1]
# map each ID3 value to an integer code starting at 0;
# `clusters` now holds the unique cluster labels
cluster_idx, clusters = pd.factorize(data.ID3)
coords = {"cluster": clusters, "id": data.index.to_numpy()}
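If pd.factorize is unfamiliar: it returns integer codes starting at 0 plus the array of unique labels, which is exactly the pairing we need to index the per-cluster effects. A quick sketch on a toy column (values made up for illustration):

# toy example of pd.factorize (values are made up)
toy = pd.Series([101, 101, 205, 101, 307])
codes, uniques = pd.factorize(toy)
print(codes)    # [0 0 1 0 2] -- position of each value in `uniques`
print(uniques)  # Index([101, 205, 307], dtype='int64')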
# note that the coords dict is passed to pm.Model call
with pm.Model(coords=coords) as m:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=False)
    clust_idx = pm.Data("cluster_idx", cluster_idx, dims="id", mutable=True)

    cluster_tau = pm.Gamma("cluster_tau", 0.01, 0.01)
    cluster_variance = pm.Deterministic("cluster_variance", 1 / cluster_tau)
    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    cluster_effect = pm.Normal("cluster_effect", 0, tau=cluster_tau, dims="cluster")
    p = pm.math.dot(X_data, beta) + cluster_effect[clust_idx]

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace = pm.sample(3000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [cluster_tau, beta, cluster_effect]
100.00% [16000/16000 00:29<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 30 seconds.
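The key line in the model above is cluster_effect[clust_idx]: indexing the length-264 effect vector with the length-3424 code array broadcasts each cluster's effect out to every child in that cluster. A plain-NumPy analogue with toy numbers:

# toy analogue of cluster_effect[clust_idx] (numbers are made up)
effects = np.array([0.5, -0.2, 0.1])  # one random effect per cluster
idx = np.array([0, 0, 1, 2, 2, 2])    # cluster code for each child
print(effects[idx])                   # [ 0.5  0.5 -0.2  0.1  0.1  0.1]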
az.summary(
    trace, var_names=["beta", "cluster_variance"], filter_vars="like", kind="stats"
)
                    mean     sd  hdi_3%  hdi_97%
beta[0]            1.444  0.131   1.205    1.694
beta[1]           -0.070  0.015  -0.100   -0.043
beta[2]           -0.565  0.205  -0.959   -0.191
beta[3]           -0.196  0.096  -0.378   -0.015
beta[4]           -0.692  0.133  -0.935   -0.435
beta[5]           -0.283  0.110  -0.488   -0.075
beta[6]           -0.621  0.099  -0.804   -0.435
cluster_variance   0.530  0.094   0.351    0.701
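The betas are on the log-odds scale. Exponentiating gives odds ratios; for example, the MAGE2 coefficient (beta[2]) of about −0.565 corresponds to an odds ratio of exp(−0.565) ≈ 0.57, i.e. lower odds of full immunization for mothers under 20, holding the other covariates fixed. A sketch of doing this across the whole posterior:

# exponentiate posterior draws of beta to get odds ratios
odds_ratios = np.exp(trace.posterior["beta"])
# posterior mean odds ratio per coefficient (intercept first)
print(odds_ratios.mean(dim=("chain", "draw")).values)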
az.summary(trace, var_names=["cluster_effect"])
                       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
cluster_effect[1]    -0.215  0.666  -1.502    1.029      0.005    0.007   20599.0    8145.0    1.0
cluster_effect[2]    -0.032  0.630  -1.188    1.193      0.005    0.006   18577.0    8663.0    1.0
cluster_effect[3]     0.205  0.666  -1.065    1.457      0.005    0.007   18365.0    8420.0    1.0
cluster_effect[4]    -0.319  0.609  -1.510    0.787      0.004    0.005   20585.0    8696.0    1.0
cluster_effect[5]     0.307  0.656  -0.898    1.551      0.004    0.006   21794.0    8582.0    1.0
...                     ...    ...     ...      ...        ...      ...       ...       ...    ...
cluster_effect[260]   0.417  0.581  -0.662    1.527      0.004    0.004   19641.0    9036.0    1.0
cluster_effect[261]  -0.258  0.539  -1.300    0.715      0.004    0.005   17799.0    8935.0    1.0
cluster_effect[262]  -0.239  0.687  -1.539    1.026      0.005    0.007   17423.0    8331.0    1.0
cluster_effect[263]  -0.084  0.522  -1.054    0.912      0.004    0.005   19809.0    8587.0    1.0
cluster_effect[264]   0.581  0.428  -0.242    1.363      0.003    0.003   18284.0    9180.0    1.0

264 rows × 9 columns
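Because cluster_effect carries the "cluster" dimension, ArviZ can select effects by coordinate label rather than by position. A sketch plotting the first 20 clusters (the subset is arbitrary, chosen only for readability):

# forest plot of a readable subset of cluster effects
az.plot_forest(
    trace,
    var_names=["cluster_effect"],
    coords={"cluster": clusters[:20]},
    combined=True,
)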

Based on this coordinates example:

https://oriolabrilpla.cat/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html

No clusters

For comparison, here's the same regression with the cluster random effect removed, i.e. complete pooling across clusters.

with pm.Model() as m_nc:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=True)

    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    p = pm.math.dot(X_data, beta)

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace_nc = pm.sample(3000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta]
100.00% [16000/16000 00:20<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 21 seconds.
az.summary(
    trace_nc, var_names=["beta", "cluster_variance"], filter_vars="like", kind="stats"
)
           mean     sd  hdi_3%  hdi_97%
beta[0]   1.429  0.107   1.225    1.626
beta[1]  -0.064  0.013  -0.090   -0.040
beta[2]  -0.566  0.190  -0.927   -0.213
beta[3]  -0.188  0.086  -0.347   -0.025
beta[4]  -0.716  0.116  -0.943   -0.508
beta[5]  -0.452  0.089  -0.618   -0.281
beta[6]  -0.606  0.084  -0.769   -0.459
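The pooled model's betas are broadly similar, although beta[5] (PR.SPOC1) shifts noticeably once cluster variation is ignored. To compare the two models more formally, one option is PSIS-LOO via az.compare; this is a sketch and assumes you re-run sampling with pointwise log-likelihoods stored, since pm.sample doesn't keep them by default:

# re-sample with pointwise log-likelihoods stored, then compare via LOO
with m:
    trace = pm.sample(3000, idata_kwargs={"log_likelihood": True})
with m_nc:
    trace_nc = pm.sample(3000, idata_kwargs={"log_likelihood": True})

az.compare({"clustered": trace, "no_clusters": trace_nc})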
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Wed Mar 22 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

numpy : 1.24.2
pandas: 1.5.3
arviz : 0.14.0
pymc  : 5.1.2