Conformal inference is a hot research topic among statisticians but has not made its way into the world of econometrics yet. My goal in this article is to provide a gentle introduction to the main idea behind conformal inference. You will learn a new way of thinking about uncertainty in the context of machine learning (i.e., prediction) models.
Let’s imagine a size-$n$ i.i.d. sample of an outcome variable $Y_i$ and a covariate vector $X_i$, $(X_i, Y_i)$ for $i = 1, \dots, n$. Conformal inference is concerned with building a confidence interval for a new outcome observation $Y_{n+1}$ from a new feature realization $X_{n+1}$.
Importantly, this interval should be valid:
- in finite samples (i.e., non-asymptotically),
- without assumptions on the data generating process, and
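- for any estimator of the regression function, $\mu(x) = E[Y \mid X = x]$.
In mathematical notation, given a significance level $\alpha \in (0, 1)$, we want to construct a confidence interval $C(X_{n+1})$ satisfying the above properties and such that:
$$P\big(Y_{n+1} \in C(X_{n+1})\big) \geq 1 - \alpha.$$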
While the technical term for this is a prediction interval, I will loosely be calling it a confidence interval, a term with which most of you are familiar.
As a teaser, the basic idea behind the method rests on a simple result about sample quantiles.
Let me explain.
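I will start by reviewing sample quantiles. Given an i.i.d. sample, $U_1, \dots, U_n$, the $(1-\alpha)$th quantile is the value such that approximately $(1-\alpha) \times 100\%$ of the data is smaller than it. For instance, the 95th quantile (sometimes also called percentile) is the value for which 95% of the observations are at least as small.
So, given a new observation $U_{n+1}$, we know that:
$$P\big(U_{n+1} \leq \hat{q}_{1-\alpha}\big) \geq 1 - \alpha,$$
where $\hat{q}_{1-\alpha}$ is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest value of the sample (taken to be $+\infty$ if that rank exceeds $n$).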
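The quantile fact above is easy to check by simulation. The snippet below is a minimal sketch, not part of the original analysis: the function names, the standard normal distribution, and the sample size are all my own assumptions, chosen only for illustration. It draws many samples, computes the sample quantile at rank ceil((n + 1)(1 - alpha)), and records how often a fresh draw falls below it.

```python
import math
import random


def conformal_quantile(sample, alpha):
    """Value at rank ceil((n + 1) * (1 - alpha)) of the sorted sample.

    Returns +inf when that rank exceeds the sample size.
    (Hypothetical helper written for this illustration.)
    """
    n = len(sample)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(sample)[k - 1]


def simulate_coverage(n=19, alpha=0.1, reps=20000, seed=0):
    """Share of repetitions in which a new draw is <= the sample quantile."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # assumed N(0, 1) data
        new_obs = rng.gauss(0, 1)
        hits += new_obs <= conformal_quantile(sample, alpha)
    return hits / reps


coverage = simulate_coverage()
print(f"empirical coverage: {coverage:.3f} (target: at least 0.9)")
```

Nothing here depends on normality; swapping in any other continuous distribution should give a similar coverage rate, since only the ranks of the observations matter.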
The Naïve Approach
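Let’s turn back to the regression example with $Y_i$ and $X_i$. We are given a new observation $X_{n+1}$ and our focus is on $Y_{n+1}$. Following the fact described above, a naïve way to construct a confidence interval for $Y_{n+1}$ is as follows:
$$C_{naive}(X_{n+1}) = \Big[\hat{\mu}(X_{n+1}) - \hat{F}_n^{-1}(1-\alpha),\ \hat{\mu}(X_{n+1}) + \hat{F}_n^{-1}(1-\alpha)\Big].$$
Here $\hat{\mu}$ is an estimate of the regression function $\mu$, $\hat{F}_n$ is the empirical distribution function of the fitted residuals $|Y_i - \hat{\mu}(X_i)|$, and $\hat{F}_n^{-1}(1-\alpha)$ is the $(1-\alpha)$th quantile of that distribution.
Put simply, we can look at an interval around our best prediction for $Y_{n+1}$ (i.e., $\hat{\mu}(X_{n+1})$) defined by the residuals estimated on the original data.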
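To make the naïve construction concrete, here is a rough sketch under simplifying assumptions of mine: a single covariate, ordinary least squares as the estimator, and simulated data. The helper names `fit_ols` and `naive_interval` are hypothetical and do not come from any package.

```python
import math
import random


def fit_ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def naive_interval(xs, ys, x_new, alpha=0.1):
    """Prediction +/- the empirical (1 - alpha) quantile of in-sample |residuals|."""
    mu_hat = fit_ols(xs, ys)
    abs_resid = sorted(abs(y - mu_hat(x)) for x, y in zip(xs, ys))
    k = math.ceil((1 - alpha) * len(abs_resid))  # rank of the empirical quantile
    half_width = abs_resid[k - 1]
    center = mu_hat(x_new)
    return center - half_width, center + half_width


# Simulated data (an assumption for illustration): y = 1 + 2x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(50)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
print(naive_interval(xs, ys, x_new=0.5))
```

Note that the residuals are computed on the same observations used to fit the model, which is exactly why they tend to understate the error on a new point.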
It turns out this interval is too narrow. In a series of papers Vladimir Vovk and co-authors show that the empirical distribution function of the fitted residuals is often biased downward and hence this interval is invalid. This is where conformal inference comes in.
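Consider the following strategy. For each $y$ we fit a regression $\hat{\mu}_y$ on the augmented sample $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, y)$. We calculate the residuals $R_{y,i} = |Y_i - \hat{\mu}_y(X_i)|$ for $i = 1, \dots, n$ and $R_{y,n+1} = |y - \hat{\mu}_y(X_{n+1})|$ and count the proportion of $R_{y,i}$’s smaller than $R_{y,n+1}$. Let’s call this number $\pi(y)$. That is,
$$\pi(y) = \frac{1}{n+1} \sum_{i=1}^{n+1} I\big(R_{y,i} \leq R_{y,n+1}\big),$$
where $I(\cdot)$ is the indicator function, equal to 1 when the statement in the parentheses is true and 0 when it is not.
The test statistic $\pi(Y_{n+1})$ is uniformly distributed over the set $\{1/(n+1), 2/(n+1), \dots, 1\}$, implying we can use $1 - \pi(y)$ as a valid p-value for testing the null that $Y_{n+1} = y$. Then, using the sample quantiles logic outlined above we arrive at the following confidence interval for $Y_{n+1}$:
$$C_{conf}(X_{n+1}) = \big\{ y \in \mathbb{R} : (n+1)\,\pi(y) \leq \lceil (1-\alpha)(n+1) \rceil \big\}.$$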
This is summarized in the following procedure:
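Algorithm: For each value $y$:
- fit the regression function $\hat{\mu}_y$ on $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, y)$ using your favorite estimator/learner.
- calculate the residuals $R_{y,i}$ for $i = 1, \dots, n+1$.
- calculate the proportion $\pi(y)$.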
Software Package: conformalInference
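The algorithm can be sketched in a few lines of plain Python. This is an illustration under assumptions of mine, not the conformalInference implementation: it uses one covariate with ordinary least squares as the learner, tries candidate values on a finite grid whose range and padding I chose arbitrarily, and reports the smallest and largest accepted candidates (the accepted set need not be a single interval in general). The names `fit_ols` and `conformal_interval` are hypothetical.

```python
import math
import random


def fit_ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def conformal_interval(xs, ys, x_new, alpha=0.1, grid_size=200):
    """Full conformal prediction set for the outcome at x_new, on a grid."""
    n = len(xs)
    # Candidate outcomes: the observed range padded by half its width (assumption)
    y_min, y_max = min(ys), max(ys)
    pad = 0.5 * (y_max - y_min)
    lo, hi = y_min - pad, y_max + pad
    grid = [lo + (hi - lo) * j / (grid_size - 1) for j in range(grid_size)]

    threshold = math.ceil((1 - alpha) * (n + 1))
    kept = []
    for y in grid:
        # Refit on the sample augmented with the candidate point (x_new, y)
        mu_y = fit_ols(list(xs) + [x_new], list(ys) + [y])
        resid = [abs(yi - mu_y(xi)) for xi, yi in zip(xs, ys)]
        resid_new = abs(y - mu_y(x_new))
        # (n + 1) * pi(y): the candidate point always counts itself once
        rank = 1 + sum(r <= resid_new for r in resid)
        if rank <= threshold:
            kept.append(y)
    return (min(kept), max(kept)) if kept else None


# Simulated data (an assumption for illustration): y = 1 + 2x + noise
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(40)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
lo, hi = conformal_interval(xs, ys, x_new=0.5)
print(f"conformal interval at x = 0.5: ({lo:.2f}, {hi:.2f})")
```

The model is refit once per grid point, so the cost grows with `grid_size`; with a slower learner this loop quickly becomes the bottleneck.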
Two notes. First, conformal inference guarantees unconditional coverage. This is conceptually different and should not be confused with the conditional statement $P\big(Y_{n+1} \in C(x) \mid X_{n+1} = x\big) \geq 1 - \alpha$. The latter is stronger and more difficult to assert, requiring additional assumptions such as consistency of our estimator of $\mu$.
Second, this procedure can be computationally expensive. For a given value $X_{n+1}$ we need to fit a regression model and compute residuals for every $y$ which we consider including in the confidence interval. This is where split conformal inference comes in.
Split Conformal Inference
Split conformal inference is a modification of the original algorithm that requires significantly less computation power. The idea is to split the fitting and ranking steps, so that the former is done only once. Here is the algorithm.
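- Randomly split the data in two equal-sized bins.
- Get $\hat{\mu}$ on the first bin.
- Calculate the residuals $R_i = |Y_i - \hat{\mu}(X_i)|$ for each observation in the second bin.
- Let $d$ be the $s$-th smallest residual, where $s = \lceil (n/2 + 1)(1-\alpha) \rceil$.
- Construct $C_{split}(X_{n+1}) = \big[\hat{\mu}(X_{n+1}) - d,\ \hat{\mu}(X_{n+1}) + d\big]$.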
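The steps above translate almost directly into code. The following is a minimal sketch under assumptions of mine (one covariate, ordinary least squares as the learner, simulated data, hypothetical function names), followed by a small simulation that checks how often the interval covers a new outcome.

```python
import math
import random


def fit_ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def split_conformal_interval(xs, ys, x_new, alpha=0.1, seed=0):
    """Fit on one random half, rank absolute residuals on the other half."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fit_idx, rank_idx = idx[:half], idx[half:]

    mu_hat = fit_ols([xs[i] for i in fit_idx], [ys[i] for i in fit_idx])
    resid = sorted(abs(ys[i] - mu_hat(xs[i])) for i in rank_idx)

    # s-th smallest residual; infinite half-width if s exceeds the bin size
    s = math.ceil((len(resid) + 1) * (1 - alpha))
    d = resid[s - 1] if s <= len(resid) else math.inf
    center = mu_hat(x_new)
    return center - d, center + d


def simulate_coverage(n=40, alpha=0.1, reps=2000, seed=0):
    """Share of simulated datasets whose interval contains the new outcome."""
    rng = random.Random(seed)
    hits = 0
    for r in range(reps):
        xs = [rng.uniform(0, 1) for _ in range(n + 1)]
        ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]  # assumed linear model
        lo, hi = split_conformal_interval(xs[:n], ys[:n], xs[n], alpha, seed=r)
        hits += lo <= ys[n] <= hi
    return hits / reps


coverage = simulate_coverage()
print(f"empirical coverage: {coverage:.3f} (target: at least 0.9)")
```

The regression is fit only once per interval, which is where the computational savings come from. The price is that only half of the data informs the fit and only the other half informs the interval width.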
A downside of this splitting approach is the introduction of extra randomness. One way to mitigate this is to perform the split multiple times and construct a final confidence interval by taking the intersection of all intervals, with each individual interval built at a Bonferroni-adjusted level ($1 - \alpha/N$ for $N$ splits). The aggregation decreases the variability from a single data split and, as this paper shows, still remains valid. Similar random split aggregation has also been used in the context of statistical significance in high-dimensional models.
The outcome variable was age, and the matrix $X$ included pclass (ticket class), sibsp (number of siblings aboard), parch (number of parents aboard), embarked (port of embarkation), and cabin. Some of these were categorical, in which case I converted them to a bunch of binary variables. I used the first 888 observations to estimate the regression function using lasso and the 889th row to form the prediction (i.e., the test set).
The age value in the test set was 32, while the conformal inference approach computed a confidence interval (21.25, 123.75). It did contain the true value, but it was rather wide. The splitting algorithm gave similar results.
You can find the code in this GitHub repository.
- Conformal inference offers a novel approach for constructing valid finite-sample prediction intervals in machine learning models.
Where to Learn More
Conformal inference in statistics is an ongoing research topic and I do not know of any review papers or textbook treatments of the subject. If you are interested in learning more, check the papers referenced below.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094-1111.
Lei, J., Rinaldo, A., & Wasserman, L. (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74, 29-43.
Shafer, G., & Vovk, V. (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(3).
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world (Vol. 29). New York: Springer.