import arviz as az
import numpy as np
import pymc as pm
from pymc.math import exp

%load_ext lab_black

6. Time-to-event Models: Gastric Cancer*#

Adapted from code for Unit 8: gastric.odc.

Data can be found here.

Problem statement#

Stablein et al. (1981) provide data on 90 patients affected by locally advanced, nonresectable gastric carcinoma. The patients are randomized to two treatments: chemotherapy alone (coded as 0) and chemotherapy plus radiation (coded as 1). Survival time is reported in days. Recorded times are censored if the patient stopped participating in the study before it finished.

Stablein, D. M., Carter, W. H., & Novak, J. W. (1981). Analysis of survival data with nonproportional hazard functions. Controlled Clinical Trials, 2(2), 149–159.

Data#

Columns are, from left to right:

  • type: Treatment type, chemotherapy (0) or chemotherapy + radiation (1)

  • censored: Censoring time in days. If the observation is censored (meaning the patient survived past the end of the observation period), the time appears here rather than in the times column; 0 if not censored.

  • times: Recorded days without cancer recurrence. NaN if censored.

Model changes#

PyMC really did not like the noninformative exponential prior on v (α in this model); it's a very broad prior. To avoid divide-by-zero errors, I just kept increasing lambda until the model ran all the way through. (An Exponential(λ) prior has mean 1/λ, so raising λ makes the prior more informative and concentrates it closer to zero.) This is not ideal, but I haven't had time to look into it further. The results actually came out fairly close to the BUGS results.

Method 1: pm.Censored#

The way PyMC censoring works is described in some detail in this notebook by Dr. Benjamin T. Vincent. This is accomplished in the source code here if you want to take a look. For right-censoring, try this: pm.Censored("name", dist, lower=None, upper=censored, observed=y). The censored values can be an array of the same shape as the y values.

If the y value equals the right-censoring value, pm.Censored evaluates to the log of the complementary CDF (the survival function) at that value. If the y value is greater than the censoring value, it returns -np.inf. Otherwise, the distribution you passed to the dist parameter works as normal. What I've been doing is setting the values in the censored array to np.inf if the corresponding y value is not censored, and equal to the y value if it should be censored.
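Here's a minimal sketch of that log-probability behavior, using a toy Normal distribution and a made-up censoring point of 1.0 rather than the gastric data:

dist = pm.Normal.dist(mu=0, sigma=1)
censored_dist = pm.Censored.dist(dist, lower=None, upper=1.0)

# interior value: the ordinary Normal log-density
print(pm.logp(censored_dist, 0.5).eval())
# at the censoring point: log(1 - CDF(1.0)), the log survival function
print(pm.logp(censored_dist, 1.0).eval())
# beyond the censoring point: -inf
print(pm.logp(censored_dist, 1.5).eval())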

Note

I’ve noticed that this method is unstable with some distributions. Try using the imputed censoring model (below) if this one isn’t working.

data = np.loadtxt("../data/gastric.txt")
data.shape
(90, 3)
x = data[:, 0].copy()
censored = data[:, 1].copy()
y = data[:, 2].copy()
# for pymc, each right-censored observation must be recorded at its "upper" censoring value
y[np.isnan(y)] = censored[np.isnan(y)]
# uncensored observations get upper=inf, so the base Weibull density applies as usual
censored[censored == 0] = np.inf
y
array([1.700e+01, 4.200e+01, 4.400e+01, 4.800e+01, 6.000e+01, 7.200e+01,
       7.400e+01, 9.500e+01, 1.030e+02, 1.080e+02, 1.220e+02, 1.440e+02,
       1.670e+02, 1.700e+02, 1.830e+02, 1.850e+02, 1.930e+02, 1.950e+02,
       1.970e+02, 2.080e+02, 2.340e+02, 2.350e+02, 2.540e+02, 3.070e+02,
       3.150e+02, 4.010e+02, 4.450e+02, 4.640e+02, 4.840e+02, 5.280e+02,
       5.420e+02, 5.670e+02, 5.770e+02, 5.800e+02, 7.950e+02, 8.550e+02,
       8.820e+02, 8.920e+02, 1.031e+03, 1.033e+03, 1.306e+03, 1.335e+03,
       1.366e+03, 1.452e+03, 1.472e+03, 1.000e+00, 6.300e+01, 1.050e+02,
       1.290e+02, 1.820e+02, 2.160e+02, 2.500e+02, 2.620e+02, 3.010e+02,
       3.010e+02, 3.420e+02, 3.540e+02, 3.560e+02, 3.580e+02, 3.800e+02,
       3.810e+02, 3.830e+02, 3.830e+02, 3.880e+02, 3.940e+02, 4.080e+02,
       4.600e+02, 4.890e+02, 4.990e+02, 5.240e+02, 5.290e+02, 5.350e+02,
       5.620e+02, 6.750e+02, 6.760e+02, 7.480e+02, 7.480e+02, 7.780e+02,
       7.860e+02, 7.970e+02, 9.450e+02, 9.550e+02, 9.680e+02, 1.180e+03,
       1.245e+03, 1.271e+03, 1.277e+03, 1.397e+03, 1.512e+03, 1.519e+03])
censored
array([  inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
        882.,  892., 1031., 1033., 1306., 1335.,   inf, 1452., 1472.,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,  381.,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,  529.,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,  945.,
         inf,   inf, 1180.,   inf,   inf, 1277., 1397., 1512., 1519.])

Warning

PyMC and BUGS do not specify the Weibull distribution in the same way! BUGS's dweib(v, λ) corresponds to PyMC's pm.Weibull(alpha=α, beta=β) with:

α = v

β = λ ** (-1 / α)
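As a quick numeric check on this conversion (the v, lam, and t values are arbitrary, chosen only for illustration), the BUGS dweib(v, λ) log-density log(v * λ * t**(v - 1)) - λ * t**v should match PyMC's log-density after transforming:

v, lam, t = 1.5, 0.002, 300.0
beta_check = lam ** (-1 / v)

# BUGS dweib(v, lambda) log-density
bugs_logpdf = np.log(v * lam * t ** (v - 1)) - lam * t**v

# PyMC Weibull log-density under the converted parameters
pymc_logpdf = pm.logp(pm.Weibull.dist(alpha=v, beta=beta_check), t).eval()

print(bugs_logpdf, pymc_logpdf)  # the two should agree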

log2 = np.log(2)

with pm.Model() as m:
    beta0 = pm.Normal("beta0", 0, tau=0.01)
    beta1 = pm.Normal("beta1", 0, tau=0.1)
    α = pm.Exponential("α", 4)

    λ = exp(beta0 + beta1 * x)
    β = λ ** (-1 / α)

    obs_latent = pm.Weibull.dist(alpha=α, beta=β)
    likelihood = pm.Censored(
        "likelihood",
        obs_latent,
        lower=None,
        upper=censored,
        observed=y,
    )

    median0 = pm.Deterministic("median0", (log2 * exp(-beta0)) ** (1 / α))
    median1 = pm.Deterministic(
        "median1", (log2 * exp(-beta0 - beta1)) ** (1 / α)
    )

    S = pm.Deterministic("S", exp(-λ * (likelihood**α)))
    f = pm.Deterministic("f", λ * α * (likelihood ** (α - 1)) * S)
    h = pm.Deterministic("h", f / S)

    trace = pm.sample(
        10000,
        tune=2000,
        init="jitter+adapt_diag_grad",
        target_accept=0.9,
    )
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag_grad...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta0, beta1, α]
100.00% [48000/48000 00:13<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 14 seconds.
az.summary(trace, var_names=["~S", "~f", "~h"], kind="stats", hdi_prob=0.9)  # "~" excludes a variable
           mean      sd   hdi_5%  hdi_95%
beta0    -6.539   0.651   -7.602   -5.475
beta1     0.258   0.235   -0.116    0.655
α         0.990   0.095    0.830    1.142
median0 516.909  91.089  370.988  660.714
median1 398.074  71.253  281.726  510.506
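For reference, the median0 and median1 formulas come from solving the Weibull survival function for the median: setting S(t) = exp(-λ * t**α) equal to 0.5 and solving for t gives t = (log(2) / λ) ** (1 / α), and substituting λ = exp(beta0 + beta1 * x) yields (log(2) * exp(-beta0 - beta1 * x)) ** (1 / α), evaluated at x = 0 and x = 1.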

Method 2: Imputed censoring#

This method is from this notebook by Luis Mario Domenzain, George Ho, and Dr. Ben Vincent. It imputes the censored values using what is almost another likelihood (I'm not sure it actually meets the definition of one, so I'm calling the variable impute_censored), with the right-censored times as lower bounds. Since the two likelihoods share the same priors, this ends up working nearly as well as the previous method.

Warning

pm.Bound is being deprecated, so this method will stop working eventually. The notebook linked above has a new method, but I haven’t investigated it yet.
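For what it's worth, here's a minimal sketch of what the pm.Truncated replacement might look like, as a drop-in for the pm.Bound call in the model below (untested on my end; it assumes α, β_censored, and censored are defined as in that model, and that pm.Truncated accepts a Weibull base distribution):

impute_censored = pm.Truncated(
    "impute_censored",
    pm.Weibull.dist(alpha=α, beta=β_censored),
    lower=censored,
    shape=censored.shape[0],
)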

data = np.loadtxt("../data/gastric.txt")
x = data[:, 0].copy()
censored_vals = data[:, 1].copy()
y = data[:, 2].copy()

# we need to separate the observed values and the censored values
observed_mask = censored_vals == 0

censored = censored_vals[~observed_mask]
y_uncensored = y[observed_mask]
x_censored = x[~observed_mask]
x_uncensored = x[observed_mask]
log2 = np.log(2)

with pm.Model() as m:
    beta0 = pm.Normal("beta0", 0, tau=0.0001)
    beta1 = pm.Normal("beta1", 0, tau=0.0001)
    α = pm.Exponential("α", 3)

    λ_censored = exp(beta0 + beta1 * x_censored)
    β_censored = λ_censored ** (-1 / α)

    λ_uncensored = exp(beta0 + beta1 * x_uncensored)
    β_uncensored = λ_uncensored ** (-1 / α)

    impute_censored = pm.Bound(
        "impute_censored",
        pm.Weibull.dist(alpha=α, beta=β_censored),
        lower=censored,
        shape=censored.shape[0],
    )

    likelihood = pm.Weibull(
        "likelihood",
        alpha=α,
        beta=β_uncensored,
        observed=y_uncensored,
        shape=y_uncensored.shape[0],
    )

    median0 = pm.Deterministic("median0", (log2 * exp(-beta0)) ** (1 / α))
    median1 = pm.Deterministic(
        "median1", (log2 * exp(-beta0 - beta1)) ** (1 / α)
    )

    trace = pm.sample(10000, tune=2000, cores=4, init="auto", target_accept=0.9)
/Users/aaron/mambaforge/envs/pymc_env2/lib/python3.11/site-packages/pymc/distributions/bound.py:186: FutureWarning: Bound has been deprecated in favor of Truncated, and will be removed in a future release. If Truncated is not an option, Bound can be implemented by adding an IntervalTransform between lower and upper to a continuous variable. A Potential that returns negative infinity for values outside of the bounds can be used for discrete variables.
  warnings.warn(
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta0, beta1, α, impute_censored]
100.00% [48000/48000 00:34<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 35 seconds.
az.summary(trace, hdi_prob=0.9, kind="stats")
                         mean       sd   hdi_5%   hdi_95%
beta0                  -6.619    0.654   -7.658    -5.505
beta1                   0.261    0.236   -0.135     0.642
α                       1.002    0.096    0.844     1.158
impute_censored[0]   1470.516  624.422  882.003  2238.512
impute_censored[1]   1485.333  638.066  892.025  2267.560
impute_censored[2]   1623.087  636.271 1031.002  2399.749
impute_censored[3]   1629.358  644.748 1033.058  2417.048
impute_censored[4]   1896.556  636.864 1306.001  2680.457
impute_censored[5]   1927.291  645.113 1335.035  2706.771
impute_censored[6]   2044.492  647.964 1452.027  2827.988
impute_censored[7]   2064.511  637.872 1472.006  2851.689
impute_censored[8]   1144.755  823.971  381.008  2153.162
impute_censored[9]   1295.370  823.647  529.021  2299.400
impute_censored[10]  1717.105  832.210  945.080  2742.788
impute_censored[11]  1948.776  828.073 1180.049  2974.388
impute_censored[12]  2045.309  839.943 1277.020  3052.989
impute_censored[13]  2169.509  847.547 1397.058  3197.841
impute_censored[14]  2286.423  854.267 1512.011  3298.995
impute_censored[15]  2287.177  829.874 1519.089  3316.162
median0               520.012   90.909  369.907   658.412
median1               400.322   70.953  290.464   517.338
%load_ext watermark
%watermark -n -u -v -iv -p pytensor
Last updated: Wed Mar 22 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

pytensor: 2.10.1

numpy: 1.24.2
arviz: 0.14.0
pymc : 5.1.2