Supplementary Exercises

Supplementary Exercises#

Warning

This page contains solutions! We recommend attempting each problem before peeking.

1. Beetles (Bliss Data)#

In his 1935 paper, Bliss provides a table showing a number of flour beetles killed after 5 hours of exposure to gaseous carbon disulfide at various concentrations. This data set has since been used extensively by statisticians to illustrate and compare models for binary and binomial data.

Use logistic regression on the following data:

x = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])
n = np.array([59, 60, 62, 56, 63, 59, 62, 60])
y = np.array([6, 13, 18, 28, 52, 53, 61, 60])

Use one of the logit, probit, cloglog, loglog, or cauchyit link functions.

Solution

The logistic regression model will look like this:

\[\begin{align*} y_i | \boldsymbol{\beta}, \mathbf{x}_i &\sim \text{Binomial}(n_i, p_i) \\ g(p_i) &= \beta_0 + \sum_{j=1}^p \beta_j x_{ij} \\ \beta_0 &\sim \mathcal{N}(0, \sigma_0^2) \\ \beta_j &\sim \mathcal{N}(0, \sigma^2), \quad j = 0, 1, \ldots, p \end{align*}\]

where \(g(\cdot)\) is the link function.

For logit:

\[ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) \]

and its inverse \(g^{-1}(\cdot)\) is the logistic function:

\[ \text{logistic}(x) = \frac{1}{1 + e^{-x}} \]

For probit:

\[ \text{probit}(p) = \Phi^{-1}(p) \]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution:

\[ \Phi(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \]

\(\text{erf}\) is the error function:

\[ \text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt \]

For cloglog (complementary log-log):

\[ \text{cloglog}(p) = \log(-\log(1 - p)) \]

and its inverse \(g^{-1}(\cdot)\) is:

\[ g^{-1}(x) = 1 - e^{-e^x} \]

For loglog:

\[ \text{loglog}(p) = -\log(-\log(p)) \]

and its inverse \(g^{-1}(\cdot)\) is:

\[ g^{-1}(x) = e^{-e^{-x}} \]

For cauchyit:

\[ \text{cauchyit}(p) = \tan\left(\pi \left(p - \frac{1}{2}\right)\right) \]

and its inverse \(g^{-1}(\cdot)\) is:

\[ g^{-1}(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x) \]

Show code cell content Hide code cell content

import pymc as pm
import arviz as az
import pandas as pd
import numpy as np
from pymc.math import exp, invlogit, invprobit, arctan


# Complementary log-log transformation
def invcloglog(x):
    return 1 - exp(-exp(x))


# Log-log transformation
def invloglog(x):
    return exp(-exp(-x))


# Cauchit transformation
def invcauchit(x):
    return 0.5 + (1 / np.pi) * arctan(x)


x = np.array([1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839])
n = np.array([59, 60, 62, 56, 63, 59, 62, 60])
y = np.array([6, 13, 18, 28, 52, 53, 61, 60])

with pm.Model() as m:
    x_data = pm.MutableData("concentration", x)

    beta0 = pm.Normal("beta0", 0, 100)
    beta1 = pm.Normal("beta1", 0, 100)

    # Uncomment desired link function
    p = invlogit(beta0 + beta1 * x_data)  # logit
    # p = invprobit(beta0 + beta1 * x_data)  # probit
    # p = invcloglog(beta0 + beta1 * x_data)  # cloglog
    # p = invloglog(beta0 + beta1 * x_data)  # loglog
    # p = invcauchit(beta0 + beta1 * x_data)  # cauchyit

    pm.Binomial("likelihood", n=n, p=p, observed=y)

    trace = pm.sample(5000)

az.summary(trace, kind="stats")

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta0, beta1]

100.00% [24000/24000 00:10<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 11 seconds.

	mean	sd	hdi_3%	hdi_97%
beta0	-60.962	5.224	-70.699	-51.333
beta1	34.408	2.937	29.094	39.977

2. Vasoconstriction#

The data give the presence or absence (\(y_i = 1\) or \(0\)) of vasoconstriction in the skin of the fingers following inhalation of a certain volume of air (\(v_i\)) at a certain average rate (\(r_i\)). Total number of records is 39. The candidate models for analyzing the relationship are the usual logit, probit, cloglog, loglog, and cauchyit models.

Data is available for download in csv format here.

Transform covariates \(v\) and \(r\) as:

\[ x_1 = \log(10 \times v), \quad x_2 = \log(10 \times r) \]

Using a PPL, estimate posterior means for coefficients in the logit model. Use noninformative priors on all coefficients.
For a subject with \(v = r = 1.5\), find the probability of vasoconstriction.
Which of the five links: logit, probit, cloglog, loglog, and cauchyit, has the smallest deviance? An example for use of the five links is in Question 1, above. Uncomment and run one link at a time.

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

3. Caesarean Delivery: Categorical Response#

Table 1 contains grouped data on infection of mothers after a C-section, collected at the Clinical Center of the University of Munich. The response variable has three categories: Infection of type I, Infection of type II, and No infection. For each mother three covariates are collected:

\[\begin{align*} \text{NOPLAN} &= \begin{cases} 1 & \text{C-section was not planned} \\ 0 & \text{Planned} \end{cases} \\ \text{RISK} &= \begin{cases} 1 & \text{Risk factors present} \\ 0 & \text{No risk factors} \end{cases} \\ \text{ANTIB} &= \begin{cases} 1 & \text{Antibiotics given as prophylaxis} \\ 0 & \text{No antibiotics given} \end{cases} \end{align*}\]

Given the covariates, establish a multinomial model, where the outcome “No infection” serves as a baseline.

\[ \text{Table 1: Data on infections for 251 C-sections.} \]

\[\begin{split} \begin{array}{|l|lll|lll|} \hline & \rlap{\text{Planned}} & & & \rlap{\text{Unplanned}} & & \\ \text{Infection} & \text{I} & \text{II} & \text{No} & \text{I} & \text{II} & \text{No} \\ \hline \textbf{Antibiotics} & & & & & & \\ \hspace{1em}\text{Risk factor} & 0 & 1 & 17 & 4 & 7 & 87 \\ \hspace{1em}\text{No risk factor} & 0 & 0 & 2 & 0 & 0 & 0 \\ \hline \textbf{No Antibiotics} & & & & & & \\ \hspace{1em}\text{Risk factor} & 11 & 17 & 30 & 10 & 13 & 3 \\ \hspace{1em}\text{No risk factor} & 4 & 4 & 32 & 0 & 0 & 9 \\ \hline \end{array} \end{split}\]

A new C-section delivery for a mother with covariates \((\text{NOPLAN}, \text{RISK}, \text{ANTIBIO}) = (1, 0, 0)\) is to be evaluated for risks of infection. What are the estimated probabilities of no infection, and type I and II infections?

Hint: Consult 17. Multinomial regression*.

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

4. Magnesium Ammonium Phosphate and Chrysanthemums#

Walpole et al. (2007) provide data from a study on the effect of magnesium ammonium phosphate on the height of chrysanthemums, which was conducted at George Mason University in order to determine a possible optimum level of fertilization, based on the enhanced vertical growth response of the chrysanthemums. Forty chrysanthemum seedlings were assigned to 4 groups, each containing 10 plants. Each was planted in a similar pot containing a uniform growth medium. An increasing concentration of \(\text{MgNH}_4\text{PO}_4\), measured in grams per bushel, was added to each plant. The 4 groups of plants were grown under uniform conditions in a greenhouse for a period of 4 weeks. The treatments and the respective changes in heights, measured in centimeters, are given in the following table:

Data is available for download in csv format here.

Solve the problem as a Bayesian one-way ANOVA. Use STZ constraints on treatment effects.

Do different concentrations of \(\text{MgNH}_4\text{PO}_4\) affect the average attained height of chrysanthemums? Look at the 95% credible sets for the differences between treatment effects.
Find the 95% credible set for the contrast \(\mu_1 - \mu_2 - \mu_3 + \mu_4\).

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

5. Third-degree Burns#

The data for this exercise, discussed in Fan et al. (1995), refer to \(n = 435\) adults who were treated for third-degree burns by the University of Southern California General Hospital Burn Center. The patients were grouped according to the area of third-degree burns on the body. For each midpoint of the groupings “log(area +1),” the number of patients in the corresponding group who survived and the number who died from the burns was recorded:

\[\begin{split} \begin{array}{|c|c|c|} \hline \text{Log(area+1)} & \text{Survived} & \text{Died} \\ \hline 1.35 & 13 & 0 \\ 1.60 & 19 & 0 \\ 1.75 & 67 & 2 \\ 1.85 & 45 & 5 \\ 1.95 & 71 & 8 \\ 2.05 & 50 & 20 \\ 2.15 & 35 & 31 \\ 2.25 & 7 & 49 \\ 2.35 & 1 & 12 \\ \hline \end{array} \end{split}\]

Data is available for download in csv format here.

Fit the logistic regression on the probability of death due to third-degree burns with the covariate \(x = \log(\text{area} + 1)\). What is the deviance?
Using your model, estimate the posterior probability of survival for a person for whom \(\log(\text{area} + 1)\) equals 2.
Repeat (a) with probit and complementary log-log links. In terms of deviance, which model provides the best fit?

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

6. Shocks!#

An experiment was conducted to assess the effect of small electrical currents on farm animals, with the eventual goal of understanding the effects of high-voltage powerlines on livestock. The experiment was carried out with seven cows, and six shock intensities, 0, 1, 2, 3, 4, and 5 milliamps (shocks on the order of 15 milliamps are painful for many humans). Each cow was given 30 shocks, five at each intensity, in random order. The entire experiment was then repeated, so each cow received a total of 60 shocks. For each shock the response, mouth movement, was either present or absent. The data as quoted give the total number of responses, out of 70 trials, at each shock level. We ignore cow differences and differences between blocks (experiments).

\[\begin{split} \begin{array}{|c|c|c|c|} \hline \text{Current (ma)} & y & \text n & p \\ \hline 0 & 0 & 70 & 0.000 \\ 1 & 9 & 70 & 0.129 \\ 2 & 21 & 70 & 0.300 \\ 3 & 47 & 70 & 0.671 \\ 4 & 60 & 70 & 0.857 \\ 5 & 63 & 70 & 0.900 \\ \hline \end{array} \end{split}\]

Here, \(y\) is the number of responses, \(n\) is the number of trials, and \(p\) is the proportion of responses.

Data is available for download in csv format here.

As in the exercise on beetles (bliss data), model \(y\) as a function of \(x\) via binary regression with 5 different links and propose the link that minimizes the deviance.

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

7. Binary Regression and IOP#

Laser refractive surgery often decreases Intraocular Pressure (IOP) and may lead to hypotony (clinically significant low IOP that may lead to corneal decompensation, accelerated cataract formation, maculopathy, and discomfort). An investigator wished to determine whether the post-operative IOP in patients after laser refractive surgery was related to the residual thickness of the cornea.

In a sample of 140 patients who had undergone laser surgery, post-operative IOP and the thickness of the cornea were measured. The dataset is provided in the startup file iop2.odc which consists of two columns:

Indicator of low IOP (IOP < 10)
Central corneal thickness (in micrometers)

Fit the logistic regression with cornea thickness as the predictor of incidence of low IOP.
For a person who had a refractive surgery with residual thickness of cornea of 420 micrometers, what is the risk of a low IOP.
Compare deviances for two links: logit (as in (a)), and probit. Which link provides better fit?

Data is available for download in csv format here.

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

8. Negative Binomial as a Gamma Mixture of Poissons#

Implement this model, which recreates a negative binomial regression as a gamma mixture of Poissons.

\[\begin{align*} y_i | \lambda_i &\sim \text{Poisson}(\lambda_i) \\ \lambda_i | \mu_i, \alpha &\sim \text{Gamma}(\alpha, \frac{\alpha}{\mu_i}) \\ \mu_i &= \exp(\beta_0 + \beta_1 x_i) \\ \beta_0 &\sim \mathcal{N}(0, 10^4) \\ \beta_1 &\sim \mathcal{N}(0, 10^4) \\ \alpha &\sim \text{Uniform}(0.01, 100) \end{align*}\]

Here’s the data:

x = np.array([12.2, 14.2, 9.8, 16.3, 20.1, 18.4, 22.7, 20.2, 21.4, 33.0, 30.5])
y = np.array([5, 6, 6, 7, 6, 8, 11, 10, 18, 20, 22])

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

9. Sex of Diamond-backed Terrapins and Incubation Temperature#

Temperature-dependent sex determination, observed in some reptiles and fish, is a type of environmental sex determination in which the temperatures experienced during embryonic development determine the sex of the offspring. Data on the relationship between the ratio of male/female diamond-backed terrapins (Malaclemys terrapin) and incubation temperature are reported by Burke and Calichio [2014].

Develop a binary regression model for both logit and cloglog links.
Which link gives smaller deviance?
Predict the probability of a female terrapin for a temperature of 29°C, using both models.

Data is available for download in csv format here.

Solution

Solution to be added. Feel free to share yours on Ed Discussion!

Supplementary Exercises

Contents

Supplementary Exercises#

1. Beetles (Bliss Data)#

2. Vasoconstriction#

3. Caesarean Delivery: Categorical Response#

4. Magnesium Ammonium Phosphate and Chrysanthemums#

5. Third-degree Burns#

6. Shocks!#

7. Binary Regression and IOP#

8. Negative Binomial as a Gamma Mixture of Poissons#

9. Sex of Diamond-backed Terrapins and Incubation Temperature#