# 7. Using the Empirical CDF and the Probability Integral Transform
This lecture is about a goodness-of-fit test based on the Probability Integral Transform (PIT). If you have a continuous random variable \(X\) and you apply its own cumulative distribution function (CDF) to it,

$$Y = F_X(X)$$

the resulting random variable \(Y\) will be \(U(0, 1)\)-distributed. We can use this idea to check our model’s fit by taking our sample’s response variable values and running them through the posterior CDF.
We only have samples from our posterior rather than an actual CDF function, so we’ll need to use the Empirical CDF (ECDF).
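To see the transform in action, here's a minimal sketch (my own, not from the lecture) using SciPy: draws from a standard normal, pushed through their own CDF, come out approximately uniform.

```python
# A quick demonstration of the Probability Integral Transform: push
# standard-normal draws through their own CDF and the result is
# (approximately) Uniform(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)  # X ~ N(0, 1)
y = stats.norm.cdf(x)                            # Y = F_X(X)

# A Kolmogorov-Smirnov test against U(0, 1) should not reject uniformity.
print(stats.kstest(y, "uniform"))
```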
## The Empirical CDF
Remember, the CDF is the function that maps a number \(x\) to the probability that the random variable \(X\) takes on a value less than or equal to \(x\):

$$F_X(x) = P(X \le x)$$

The ECDF replaces that probability with the proportion of the \(n\) observed samples at or below \(x\):

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \le x)$$
where \(\mathbf{1}(X_i \le x)\) is an indicator function that evaluates to 1 if \(X_i \le x\) is true and 0 otherwise (Vidakovic [2017]). In other words, we count the number of samples less than or equal to \(x\) and divide by \(n\) to get the probability.
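Here's a tiny sketch of what that formula computes; the `ecdf` helper below is just illustrative, not a library function.

```python
# A minimal ECDF: the proportion of samples at or below the point x.
import numpy as np

def ecdf(samples: np.ndarray, x: float) -> float:
    """Evaluate the empirical CDF of `samples` at the point `x`."""
    return float(np.mean(samples <= x))  # average of the indicators 1(X_i <= x)

rng = np.random.default_rng(0)
draws = rng.normal(size=1_000)
print(ecdf(draws, 0.0))  # roughly 0.5 for a sample symmetric around 0
```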
## The test(s)
Once we’ve evaluated the ECDF at each original \(y\) value from our sample’s response variable, we need some way to compare the resulting values to a standard uniform distribution. Professor Vidakovic notes that values extremely close to 0 or 1 indicate outliers. In the next lecture, he also applies a transformation to make those outliers stand out more:

$$\left(\log \frac{u}{1 - u}\right)^2$$

This is the square of the logit function: the logit maps probabilities in \((0, 1)\) to the whole real line, and squaring makes values near 0 or 1 blow up into large positive numbers.
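Here's a small sketch of that transform; the `logit_squared` name is mine, for illustration.

```python
# Squared-logit transform: PIT values near 0 or 1 map to large positive
# numbers, so potential outliers stand out.
import numpy as np

def logit_squared(u: np.ndarray) -> np.ndarray:
    """Square of the logit; u should lie strictly inside (0, 1)."""
    return np.log(u / (1 - u)) ** 2

u = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
print(logit_squared(u))  # endpoints dominate: ~21.1 at 0.01/0.99 vs 0 at 0.5
```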
The ArviZ library (Kumar et al. [2019]) has a function called `plot_ecdf` based on Säilynoja et al. [2022]. That paper explains the above process really well and outlines a graphical check based on these ideas.
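Here's a minimal usage sketch, assuming a recent ArviZ version; keyword names like `confidence_bands` can differ between releases, so check the docs for the one you have.

```python
# Graphical uniformity check with ArviZ's plot_ecdf: compare the ECDF of
# PIT values against the Uniform(0, 1) CDF with simultaneous confidence bands.
import arviz as az
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pit_values = stats.norm.cdf(rng.normal(size=500))  # PIT of well-specified draws

az.plot_ecdf(pit_values, cdf=stats.uniform(0, 1).cdf, confidence_bands=True)
```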
On the next page, we’ll check out how to put these ideas into practice with PyMC.