A Brief Introduction to Conformal Inference

Background

Conformal inference is a hot research topic among statisticians but has not made its way into the world of econometrics yet. My goal in this article is to provide a gentle introduction to the main idea behind conformal inference. You will learn a new way of thinking about uncertainty in the context of machine learning (i.e., prediction) models.

Let’s imagine an i.i.d. sample of size $n$ of an outcome variable $Y$ and a covariate vector $X$, $(X_1, Y_1), \dots, (X_n, Y_n)$. Conformal inference is concerned with building a confidence interval for a new outcome observation $Y_{n+1}$ from a new feature realization $X_{n+1}$.

Importantly, this interval should be valid:

in finite samples (i.e., non-asymptotically),
without distributional assumptions on the data generating process (beyond the i.i.d. sampling above), and
for any estimator of the regression function, $\mu(x)=E[Y|X=x]$.

In mathematical notation, given a significance level $\alpha$, we want to construct a confidence interval $CI(X_{n+1})$ satisfying the above properties and such that:

$\begin{equation*}P(Y_{n+1} \in CI(X_{n+1})) \geq 1-\alpha.\end{equation*}$

While the technical term for this is a prediction interval, I will loosely be calling it a confidence interval, a term with which most of you are familiar.

As a teaser, the basic idea behind the method rests on a simple result about sample quantiles.

Let me explain.

Diving Deeper

Sample Quantiles

I will start by reviewing sample quantiles. Given an i.i.d. sample, $U_1, \dots, U_n$, the $(1-\alpha)$th quantile is the value $\hat{q}_{1-\alpha}$ such that approximately $(1-\alpha)\times100\%$ of the data is no larger than it. For instance, the 95th quantile (sometimes also called percentile) is the value for which 95% of the observations are at least as small.

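So, given a new observation $U_{n+1}$ drawn from the same distribution, and taking $\hat{q}_{1-\alpha}$ to be the $\lceil (n+1)(1-\alpha) \rceil$th smallest value in the sample (a small finite-sample correction to the usual sample quantile), we know that: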

$\begin{equation*}P(U_{n+1}\leq \hat{q}_{1-\alpha})\geq 1-\alpha.\end{equation*}$
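A quick simulation can make this quantile fact tangible. The sketch below is only an illustration under assumptions of my own: standard normal draws, $n=50$, $\alpha=0.1$, and the helper names `conformal_quantile` and `simulated_coverage` are made up for this example. It uses the $\lceil (n+1)(1-\alpha) \rceil$th smallest value as the quantile and counts how often a fresh draw falls at or below it.

```python
import math
import random


def conformal_quantile(sample, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest value of the sample."""
    n = len(sample)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few observations for this level: only the trivial bound is valid.
        return math.inf
    return sorted(sample)[k - 1]


def simulated_coverage(n=50, alpha=0.1, trials=5000, seed=0):
    """Share of trials in which a fresh draw is at or below the quantile."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        new_draw = rng.gauss(0.0, 1.0)
        hits += new_draw <= conformal_quantile(sample, alpha)
    return hits / trials


if __name__ == "__main__":
    print("simulated coverage:", simulated_coverage())
```

The simulated share should sit close to, and on average not below, $1-\alpha$. Nothing here depends on normality; any continuous distribution could be swapped in.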

The Naïve Approach

Let’s turn back to the regression example with $Y$ and $X$. We are given a new observation $X_{n+1}$ and our focus is on $Y_{n+1}$. Following the fact described above, a naïve way to construct a confidence interval for $Y_{n+1}$ is as follows:

$\begin{equation*}CI^{\text{naïve}}(X_{n+1}) = [\hat{\mu}(X_{n+1})-\hat{F}_n^{-1}(1-\alpha), \hat{\mu}(X_{n+1})+\hat{F}_n^{-1}(1-\alpha)].\end{equation*}$

Here $\hat{\mu}(\cdot)$ is an estimate of the regression function $E[Y|X]$, $\hat{F}_n$ is the empirical distribution function of the fitted residuals $|Y_i-\hat{\mu}(X_i)|$, and $\hat{F}_n^{-1}(1-\alpha)$ is the $(1-\alpha)$th quantile of that distribution.

Put simply, we can look at an interval around our best prediction for $Y_{n+1}$ (i.e., $\hat{\mu}(X_{n+1})$) defined by the residuals estimated on the original data.

It turns out this interval is too narrow. In a series of papers, Vladimir Vovk and co-authors show that the empirical distribution function of the fitted residuals is often biased downward: the residuals are computed on the same data used to fit $\hat{\mu}$, so they tend to understate out-of-sample prediction errors. Hence this interval is invalid. This is where conformal inference comes in.
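To see the undercoverage in action, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a simulated linear model with standard normal noise, a simple least-squares line as $\hat{\mu}$, a deliberately small sample ($n=12$), and the hypothetical helper names `fit_line`, `naive_interval` and `naive_coverage`.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def naive_interval(xs, ys, x_new, alpha):
    """Prediction +/- the empirical (1 - alpha) quantile of in-sample residuals."""
    mu_hat = fit_line(xs, ys)
    abs_resid = sorted(abs(y - mu_hat(x)) for x, y in zip(xs, ys))
    k = math.ceil(len(abs_resid) * (1 - alpha))
    d = abs_resid[k - 1]
    centre = mu_hat(x_new)
    return centre - d, centre + d


def naive_coverage(n=12, alpha=0.1, trials=4000, seed=1):
    """Share of simulated test points that land inside the naive interval."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
        x_new = rng.uniform(0, 1)
        y_new = 1 + 2 * x_new + rng.gauss(0, 1)
        lo, hi = naive_interval(xs, ys, x_new, alpha)
        hits += lo <= y_new <= hi
    return hits / trials


if __name__ == "__main__":
    print("naive coverage at nominal 0.90:", naive_coverage())
```

In this setup the simulated coverage falls noticeably short of the nominal 90%. The gap shrinks as $n$ grows relative to the flexibility of the estimator, so the effect is most visible in small samples or with learners that overfit.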

Conformal Inference

Consider the following strategy. For each candidate value $y$, we fit a regression $\hat{\mu}_y$ on the augmented sample $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, y)$. We calculate the absolute residuals $R^y_i = |Y_i - \hat{\mu}_y(X_i)|$ for $i=1,\dots,n$ and $R^y_{n+1} = |y - \hat{\mu}_y(X_{n+1})|$, and compute the proportion of the $n+1$ residuals that are no larger than $R^y_{n+1}$. Let’s call this number $\sigma(y)$. That is,

$\begin{equation*}\sigma(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I (R^y_i \leq R^y_{n+1}),\end{equation*}$

where $I(\cdot)$ is the indicator function, equal to one when the statement in the parentheses is true and zero when it is not.

By exchangeability, and absent ties among the residuals, the test statistic $\sigma(Y_{n+1})$ is uniformly distributed over the set $\{ \frac{1}{n+1}, \frac{2}{n+1},\dots, 1\}$, implying we can use $1-\sigma(y)$ as a valid p-value for testing the null that $Y_{n+1}=y$. Then, using the sample quantiles logic outlined above, we arrive at the following confidence interval for $Y_{n+1}$:

$\begin{equation*} CI^{\text{conformal}}(X_{n+1}) \approx \{ y\in \mathbb{R} : \sigma(y)\leq 1-\alpha \}.\end{equation*}$

This is summarized in the following procedure:

Algorithm: For each candidate value $y$:

fit the regression function $\hat{\mu}_y(\cdot)$ on $(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, y)$ using your favorite estimator/learner,
calculate the $n+1$ residuals,
calculate the proportion $\sigma(y)$.

Then construct $CI = \{y: \sigma(y) \leq 1-\alpha\}$.
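The procedure above can be sketched in a few lines of Python. This is a toy version under my own assumptions, not the conformalInference implementation: the learner is a simple least-squares line, candidate values of $y$ come from a user-supplied grid, and the threshold is applied on the rank scale, keeping $y$ when $(n+1)\sigma(y) \leq \lceil (1-\alpha)(n+1) \rceil$, which is one way to handle the rounding hidden behind the $\approx$ sign. The function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def in_conformal_set(xs, ys, x_new, y, alpha):
    """Check whether the candidate value y is kept in the conformal set."""
    aug_x = list(xs) + [x_new]
    aug_y = list(ys) + [y]
    mu_y = fit_line(aug_x, aug_y)  # refit with (x_new, y) included
    resid = [abs(b - mu_y(a)) for a, b in zip(aug_x, aug_y)]
    rank = sum(r <= resid[-1] for r in resid)  # equals (n + 1) * sigma(y)
    return rank <= math.ceil((1 - alpha) * len(resid))


def full_conformal_interval(xs, ys, x_new, alpha, grid):
    """Smallest and largest grid values kept, or None if no value is kept."""
    kept = [y for y in grid if in_conformal_set(xs, ys, x_new, y, alpha)]
    return (min(kept), max(kept)) if kept else None


def simulated_coverage(n=30, alpha=0.1, trials=2000, seed=0):
    """Share of trials in which the true new outcome is kept in the set."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
        x_new = rng.uniform(0, 1)
        y_new = 1 + 2 * x_new + rng.gauss(0, 1)
        hits += in_conformal_set(xs, ys, x_new, y_new, alpha)
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [rng.uniform(0, 1) for _ in range(30)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    grid = [-5 + 0.1 * i for i in range(151)]
    print(full_conformal_interval(xs, ys, 0.5, 0.1, grid))
    print("simulated coverage:", simulated_coverage())
```

Note that the model is refit once per grid value, which is exactly why the full procedure gets expensive with heavier learners. Reporting the minimum and maximum kept values treats the set as an interval, which is a simplification when the set has gaps.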

Software Package: conformalInference

Two notes. First, conformal inference guarantees unconditional (marginal) coverage. This is conceptually different from, and should not be confused with, the conditional statement $P(Y_{n+1}\in CI(x) | X_{n+1}=x)\geq 1-\alpha$. The latter is stronger and more difficult to guarantee, requiring additional assumptions such as consistency of our estimator of $\mu(\cdot)$.

Second, this procedure can be computationally expensive. For a given value $X_{n+1}$ we need to fit a regression model and compute residuals for every $y$ which we consider including in the confidence interval. This is where split conformal inference comes in.

Split Conformal Inference

Split conformal inference is a modification of the original algorithm that requires significantly less computation. The idea is to separate the fitting and ranking steps, so that the former is done only once. Here is the algorithm.

Algorithm:

Randomly split the data into two equal-sized bins.
Estimate $\hat{\mu}$ on the first bin.
Calculate the absolute residuals $|Y_i-\hat{\mu}(X_i)|$ for each observation in the second bin.
Let $d$ be the $s$-th smallest of these residuals, where $s=\lceil(\frac{n}{2}+1)(1-\alpha)\rceil$.
Construct $CI^{\text{split}}(X_{n+1})=[\hat{\mu}(X_{n+1})-d,\hat{\mu}(X_{n+1})+d]$.
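Here is a minimal sketch of these steps, again under assumptions of my own: a least-squares line stands in for your favorite learner, the data are simulated, and the function names are hypothetical. The rank $s$ is rounded up, and the interval is taken to be infinite when $s$ exceeds the number of held-out residuals.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def split_conformal_interval(xs, ys, x_new, alpha, rng):
    """Fit on one random half, rank absolute residuals on the other half."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, calib = idx[:half], idx[half:]
    mu_hat = fit_line([xs[i] for i in train], [ys[i] for i in train])
    resid = sorted(abs(ys[i] - mu_hat(xs[i])) for i in calib)
    s = math.ceil((len(calib) + 1) * (1 - alpha))
    d = resid[s - 1] if s <= len(resid) else math.inf
    centre = mu_hat(x_new)
    return centre - d, centre + d


def simulated_coverage(n=100, alpha=0.1, trials=2000, seed=0):
    """Share of simulated test points that land inside the split interval."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
        x_new = rng.uniform(0, 1)
        y_new = 1 + 2 * x_new + rng.gauss(0, 1)
        lo, hi = split_conformal_interval(xs, ys, x_new, alpha, rng)
        hits += lo <= y_new <= hi
    return hits / trials


if __name__ == "__main__":
    print("simulated coverage:", simulated_coverage())
```

The model is fit once per split, regardless of how many new points you want intervals for, which is where the computational savings come from.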

A downside of this splitting approach is the introduction of extra randomness. One way to mitigate this is to perform the split multiple times and construct a final confidence interval by taking the intersection of all intervals. The aggregation decreases the variability from a single data split and, as this paper shows, remains valid, provided each individual interval is built at a stricter level (for example, $1-\alpha/N$ when using $N$ splits, a Bonferroni-type correction). Similar random split aggregation has also been used in the context of statistical significance in high-dimensional models.
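A rough sketch of the multiple-split idea follows. It rests on an assumption that I want to flag clearly: each of the $N$ split intervals is built at level $1-\alpha/N$, so that a simple Bonferroni argument keeps the coverage of the intersection at $1-\alpha$ or above. The learner, the simulated data and all function names are illustrative choices of mine.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def split_interval(xs, ys, x_new, alpha, rng):
    """One split conformal interval at level 1 - alpha."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, calib = idx[:half], idx[half:]
    mu_hat = fit_line([xs[i] for i in train], [ys[i] for i in train])
    resid = sorted(abs(ys[i] - mu_hat(xs[i])) for i in calib)
    s = math.ceil((len(calib) + 1) * (1 - alpha))
    d = resid[s - 1] if s <= len(resid) else math.inf
    centre = mu_hat(x_new)
    return centre - d, centre + d


def multi_split_interval(xs, ys, x_new, alpha, n_splits, rng):
    """Intersect n_splits intervals, each at the stricter level alpha / n_splits."""
    pieces = [
        split_interval(xs, ys, x_new, alpha / n_splits, rng)
        for _ in range(n_splits)
    ]
    lo = max(p[0] for p in pieces)
    hi = min(p[1] for p in pieces)
    return lo, hi  # lo > hi would mean the intersection is empty


def simulated_coverage(n=100, alpha=0.1, n_splits=5, trials=1000, seed=0):
    """Share of simulated test points that land inside the intersection."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
        x_new = rng.uniform(0, 1)
        y_new = 1 + 2 * x_new + rng.gauss(0, 1)
        lo, hi = multi_split_interval(xs, ys, x_new, alpha, n_splits, rng)
        hits += lo <= y_new <= hi
    return hits / trials


if __name__ == "__main__":
    print("simulated coverage:", simulated_coverage())
```

The stricter per-split level makes each individual interval wider, so the intersection is not guaranteed to be narrower than a single split interval. What the aggregation buys is less dependence on any one random split.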

An Example

I used the popular Titanic dataset to try out the conformalInference R package. Like most of my data demos, this is meant to be a mere illustration and you should not take the results seriously.

The outcome variable was age, and the matrix $X$ included pclass (ticket class), age, sibsp (number of siblings/spouses aboard), parch (number of parents/children aboard), fare, embarked (port of embarkation), and cabin. Some of these were categorical, in which case I converted them to a set of binary indicator variables. I used the first 888 observations to estimate the regression function $\mu(X)$ using lasso and the 889th row to form the prediction (i.e., the test set).

The actual age value in the test set was 32 while the conformal inference approach computed a confidence interval (21.25, 123.75). It did contain the true value but it was rather wide. The splitting algorithm gave similar results.

You can find the code in this GitHub repository.

Bottom Line

Conformal inference offers a novel approach for constructing valid finite-sample prediction intervals in machine learning models.

Where to Learn More

Conformal inference in statistics is an ongoing research topic and I do not know of any review papers or textbook treatments of the subject. If you are interested in learning more, check the papers referenced below.

References

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094-1111.

Lei, J., Rinaldo, A., & Wasserman, L. (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74, 29-43.

Shafer, G., & Vovk, V. (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(3).

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world (Vol. 29). New York: Springer.