1. Model Fit, Selection, and Diagnostics#

Deviance#

For a model with likelihood \(f(y|\theta)\), the deviance is defined as

\[ D(\theta) = -2\log(f(y|\theta)) \]
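This is straightforward to compute directly. A small sketch with SciPy (supplementary; the rate \(\lambda = 0.3\) is an arbitrary choice, and the data are from the example later in this section):

```python
import numpy as np
from scipy import stats

y = np.array([1, 1, 2, 2, 3, 4, 4, 5, 5, 8])
lam = 0.3  # an arbitrary candidate rate, just for illustration

# D(theta) = -2 * log f(y | theta); with independent observations,
# the log-likelihood is the sum of the pointwise log-densities
deviance = -2 * stats.expon.logpdf(y, scale=1 / lam).sum()
print(deviance)
```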

When comparing two models by deviance, the model with the smaller deviance fits the data better. An alternative is the saturated deviance,

\[ D_s(\theta) = -2(\log(f(y|\theta)) - \log(f(y|\theta_s))) \]

where \(f(y|\theta_s)\) is the likelihood of the saturated model, that is, a model with one parameter per observation that fits the data perfectly. Saturated deviances for a few common likelihoods:

\[\begin{split}\begin{align*} y_i \sim Bin(n_i,\theta_i): & \space \space D_s(\underset{\sim}{\theta}) = 2 \sum_i \left ( y_i \log \frac{y_i/n_i}{\theta_i} + (n_i - y_i)\log \frac{1-y_i/n_i}{1-\theta_i} \right ) \\ y_i \sim Poi(\theta_i): & \space \space D_s(\underset{\sim}{\theta}) = 2 \sum_i \left ( y_i \log \frac{y_i}{\theta_i} - (y_i - \theta_i) \right ) \\ y_i \sim N(\theta_i,\delta_i^2): & \space \space D_s(\underset{\sim}{\theta}) = 2 \sum_i \left ( \frac{y_i - \theta_i}{\delta_i} \right ) ^2 \\ \end{align*}\end{split}\]
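The two definitions differ only by a constant, so the closed forms above can be checked numerically against \(-2(\log f(y|\theta) - \log f(y|\theta_s))\). A sketch for the Poisson case (the counts and means here are made up for illustration):

```python
import numpy as np
from scipy import stats

y = np.array([3, 1, 4, 2, 6])                # made-up counts (all positive, so log(y) is defined)
theta = np.array([2.5, 1.5, 3.0, 2.0, 5.0])  # made-up Poisson means

# closed-form Poisson saturated deviance from the table above
d_closed = 2 * np.sum(y * np.log(y / theta) - (y - theta))

# same quantity from the definition: -2 * (log f(y|theta) - log f(y|theta_s)),
# where the saturated model sets theta_s = y
d_def = -2 * (stats.poisson.logpmf(y, theta).sum() - stats.poisson.logpmf(y, y).sum())

print(d_closed, d_def)  # the two agree
```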

Deviance example#

Consider two models, Weibull and exponential, for the data

\[ y = (1,1,2,2,3,4,4,5,5,8) \]

with noninformative priors. Which model has the smaller deviance?

import pymc as pm
import arviz as az
import numpy as np

y = np.array([1,1,2,2,3,4,4,5,5,8])
# Weibull model
with pm.Model() as m1:
    # priors (note: the normal prior on the shape technically allows negative values)
    mu = pm.Normal("mu", 0, tau=0.001)
    gamma = pm.Gamma("gamma", 0.001, 0.001)

    # convert the BUGS Weibull(mu, gamma) rate parameterization
    # to PyMC's Weibull(alpha, beta) scale parameterization
    beta = gamma ** (-1 / mu)

    # likelihood
    like = pm.Weibull("like", mu, beta, observed=y)

    # Weibull log-density, summed over observations and scaled by -2
    log_like = pm.math.log(mu * y ** (mu - 1) * pm.math.exp(-((y / beta) ** mu)) / beta**mu)
    deviance = pm.Deterministic("deviance", -2 * pm.math.sum(log_like))
    
    trace1 = pm.sample()

az.summary(trace1, var_names=["deviance"])
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, gamma]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 0 seconds.
            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
deviance  42.911  2.121  40.869   46.741       0.08    0.126     984.0     906.0    1.0
# Exponential model
with pm.Model() as m2:
    # prior
    lambda_ = pm.Gamma("lambda_", 0.001, 0.001)

    # likelihood
    like = pm.Exponential("like", lambda_, observed=y)

    # exponential log-density, summed over observations and scaled by -2
    log_like = pm.math.log(lambda_ * pm.math.exp(-lambda_ * y))
    deviance = pm.Deterministic("deviance", -2 * pm.math.sum(log_like))

    trace2 = pm.sample()

az.summary(trace2, var_names=["deviance"])
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [lambda_]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 0 seconds.
            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
deviance  46.045  1.402  45.055   48.603      0.033    0.058    1864.0    2570.0    1.0

We see the Weibull model has the lower posterior mean deviance (about 42.9 versus 46.0), so by this criterion it fits the data better than the exponential model. Keep in mind that deviance measures fit alone; it does not penalize the Weibull model's extra parameter the way criteria such as DIC or WAIC do.
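As a supplementary sanity check (plain NumPy, not part of the PyMC models above): plugging the exponential maximum-likelihood rate \(\hat{\lambda} = 1/\bar{y}\) into the deviance gives the smallest value the exponential deviance can take, so it should sit near the lower end of the posterior deviance summary.

```python
import numpy as np

y = np.array([1, 1, 2, 2, 3, 4, 4, 5, 5, 8])

# exponential MLE: lambda_hat = 1 / sample mean
lam_hat = 1 / y.mean()

# D(lambda) = -2 * (n * log(lambda) - lambda * sum(y))
d_min = -2 * (len(y) * np.log(lam_hat) - lam_hat * y.sum())
print(d_min)  # ~45.06, close to the hdi_3% bound in the summary above
```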

Extra#

Papers to check out:

https://arxiv.org/abs/1307.5928

https://arxiv.org/abs/1507.04544?context=stat