{"title": "Conformalized Quantile Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3543, "page_last": 3553, "abstract": "Conformal prediction is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. Despite this appeal, existing conformal methods can be unnecessarily conservative because they form intervals of constant or weakly varying length across the input space. In this paper we propose a new method that is fully adaptive to heteroscedasticity. It combines conformal prediction with classical quantile regression, inheriting the advantages of both. We establish a theoretical guarantee of valid coverage, supplemented by extensive experiments on popular regression datasets. We compare the efficiency of conformalized quantile regression to other conformal methods, showing that our method tends to produce shorter intervals.", "full_text": "Conformalized Quantile Regression\n\nYaniv Romano\n\nDepartment of Statistics\n\nStanford University\n\nEvan Patterson\n\nDepartment of Statistics\n\nStanford University\n\nEmmanuel J. Cand\u00e8s\n\nDepartments of Mathematics and of Statistics\n\nStanford University\n\nAbstract\n\nConformal prediction is a technique for constructing prediction intervals that at-\ntain valid coverage in \ufb01nite samples, without making distributional assumptions.\nDespite this appeal, existing conformal methods can be unnecessarily conserva-\ntive because they form intervals of constant or weakly varying length across the\ninput space. In this paper we propose a new method that is fully adaptive to het-\neroscedasticity. It combines conformal prediction with classical quantile regression,\ninheriting the advantages of both. 
We establish a theoretical guarantee of valid\ncoverage, supplemented by extensive experiments on popular regression datasets.\nWe compare the ef\ufb01ciency of conformalized quantile regression to other conformal\nmethods, showing that our method tends to produce shorter intervals.\n\n1\n\nIntroduction\n\nIn many applications of regression modeling, it is important not only to predict accurately but also to\nquantify the accuracy of the predictions. This is especially true in situations involving high-stakes\ndecision making, such as estimating the ef\ufb01cacy of a drug or the risk of a credit default. The\nuncertainty in a prediction can be quanti\ufb01ed using a prediction interval, giving lower and upper\nbounds between which the response variable lies with high probability. An ideal procedure for\ngenerating prediction intervals should satisfy two properties. First, it should provide valid coverage\nin \ufb01nite samples, without making strong distributional assumptions, such as Gaussianity. Second, its\nintervals should be as short as possible at each point in the input space, so that the predictions will be\ninformative. When the data is heteroscedastic, getting valid but short prediction intervals requires\nadjusting the lengths of the intervals according to the local variability at each query point in predictor\nspace. This paper introduces a procedure that performs well on both criteria, being distribution-free\nand adaptive to heteroscedasticity.\nOur work is heavily inspired by conformal prediction, a general methodology for constructing\nprediction intervals [1\u20136]. Conformal prediction has the virtue of providing a nonasymptotic,\ndistribution-free coverage guarantee. The main idea is to \ufb01t a regression model on the training\nsamples, then use the residuals on a held-out validation set to quantify the uncertainty in future\npredictions. 
The effect of the underlying model on the length of the prediction intervals, and attempts\nto construct intervals with locally varying length, have been studied in numerous recent works [6\u201316].\nNevertheless, existing methods yield conformal intervals of either \ufb01xed length or length depending\nonly weakly on the predictors, as argued in [6, 15, 17].\nIn conformal prediction to date, there has been a mismatch between the primary inferential focus\u2014\nconditional mean estimation\u2014and the ultimate inferential goal\u2014prediction interval estimation.\nStatistical ef\ufb01ciency is lost by estimating a mean when an interval is needed. A more direct approach\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fto interval estimation is offered by quantile regression [18]. Take any algorithm for quantile regression,\ni.e., for estimating conditional quantile functions from data. To obtain prediction intervals with, say,\nnominal 90% coverage, simply \ufb01t the conditional quantile function at the 5% and 95% levels and\nform the corresponding intervals. Even for highly heteroscedastic data, this methodology has been\nshown to be adaptive to local variability [19\u201325]. However, the validity of the estimated intervals is\nguaranteed only for speci\ufb01c models, under certain regularity and asymptotic conditions [22\u201324].\nIn this work, we combine conformal prediction with quantile regression. The resulting method, which\nwe call conformalized quantile regression (CQR), inherits both the \ufb01nite sample, distribution-free\nvalidity of conformal prediction and the statistical ef\ufb01ciency of quantile regression.1 On one hand,\nCQR is \ufb02exible in that it can wrap around any algorithm for quantile regression, including random\nforests and deep neural networks [26\u201329]. 
On the other hand, a key strength of CQR is its rigorous\ncontrol of the miscoverage rate, independent of the underlying regression algorithm.\n\nSummary and outline\nSuppose we are given n training samples {(Xi, Yi)}n\ni=1 and we must now predict the unknown\nvalue of Yn+1 at a test point Xn+1. We assume that all the samples {(Xi, Yi)}n+1\ni=1 are drawn\nexchangeably\u2014for instance, they may be drawn i.i.d.\u2014from an arbitrary joint distribution PXY\nover the feature vectors X \u2208 Rp and response variables Y \u2208 R. We aim to construct a marginal\ndistribution-free prediction interval C(Xn+1) \u2286 R that is likely to contain the unknown response\nYn+1. That is, given a desired miscoverage rate \u03b1, we ask that\n\nP{Yn+1 \u2208 C(Xn+1)} \u2265 1 \u2212 \u03b1\n\n(1)\nfor any joint distribution PXY and any sample size n. The probability in this statement is marginal,\nbeing taken over all the samples {(Xi, Yi)}n+1\ni=1 .\nTo accomplish this, we build on the method of conformal prediction [2, 3, 8]. We \ufb01rst split the\ntraining data into two disjoint subsets, a proper training set and a calibration set.2 We \ufb01t two quantile\nregressors on the proper training set to obtain initial estimates of the lower and upper bounds of the\nprediction interval, as explained in Section 2. Then, using the calibration set, we conformalize and, if\nnecessary, correct this prediction interval. Unlike the original interval, the conformalized prediction\ninterval is guaranteed to satisfy the coverage requirement (1) regardless of the choice or accuracy of\nthe quantile regression estimator. We prove this in Section 4.\nOur method differs from the standard method of conformal prediction [3, 15], recalled in Section 3,\nin that we calibrate the prediction interval using conditional quantile regression, while the standard\nmethod uses only classical, conditional mean regression. 
The result is that our intervals are adaptive\nto heteroscedasticity whereas the standard intervals are not. We evaluate the statistical ef\ufb01ciency of\nour framework by comparing its miscoverage rate and average interval length with those of other\nmethods. We review existing state-of-the-art schemes for conformal prediction in Section 5 and we\ncompare them with our method in Section 6. Based on extensive experiments across eleven datasets,\nwe conclude that conformal quantile regression yields shorter intervals than the competing methods.\n\n2 Quantile regression\n\nThe aim of conditional quantile regression [18] is to estimate a given quantile, such as the median, of\nY conditional on X. Recall that the conditional distribution function of Y given X = x is\n\nF (y | X = x) := P{Y \u2264 y | X = x},\n\nand that the \u03b1th conditional quantile function is\n\nq\u03b1(x) := inf{y \u2208 R : F (y | X = x) \u2265 \u03b1}.\n\nFix the lower and upper quantiles to be equal to \u03b1lo = \u03b1/2 and \u03b1hi = 1 \u2212 \u03b1/2, say. Given the\npair q\u03b1lo (x) and q\u03b1hi (x) of lower and upper conditional quantile functions, we obtain a conditional\nprediction interval for Y given X = x, with miscoverage rate \u03b1, as\n\nC(x) = [q\u03b1lo (x), q\u03b1hi (x)].\n\n(2)\n\n1An implementation of CQR is available online at https://github.com/yromano/cqr.\n2Like conformal regression, CQR has a variant that does not require data splitting.\n\n2\n\n\fBy construction, this interval satis\ufb01es\n\nP{Y \u2208 C(X)|X = x} \u2265 1 \u2212 \u03b1.\n\n(3)\nNotice that the length of the interval C(X) can vary greatly depending on the value of X. The\nuncertainty in the prediction of Y is naturally re\ufb02ected in the length of the interval. 
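To make the varying interval length concrete, consider a hypothetical Gaussian model (our illustration, not from the paper) in which Y | X = x ~ N(0, σ(x)²) with σ(x) = 1 + |x|. The ideal interval (2) then has a closed form, and its length grows linearly with |x|:

```python
from statistics import NormalDist

# Hypothetical heteroscedastic model (illustration only):
# Y | X = x  ~  N(0, sigma(x)^2)  with  sigma(x) = 1 + |x|.
def sigma(x):
    return 1.0 + abs(x)

def oracle_interval(x, alpha=0.1):
    """Ideal conditional interval C(x) = [q_alo(x), q_ahi(x)] of equation (2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal 95% quantile
    return -z * sigma(x), z * sigma(x)

for x in (0.0, 2.0):
    lo, hi = oracle_interval(x)
    print(f"x = {x}: C(x) length = {hi - lo:.2f}")
```

The interval at x = 2 is three times longer than at x = 0, exactly tracking σ(x); this is the adaptivity that an estimated band should ideally reproduce.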
In practice we cannot know this ideal prediction interval, but we can try to estimate it from the data.

Estimating quantiles from data

Classical regression analysis estimates the conditional mean of the test response Yn+1 given the features Xn+1 = x by minimizing the sum of squared residuals on the n training points:

μ̂(x) = μ(x; θ̂),    θ̂ = argmin_θ (1/n) Σ_{i=1}^n (Yi − μ(Xi; θ))² + R(θ).

Here θ are the parameters of the regression model, μ(x; θ) is the regression function, and R is a potential regularizer.
Analogously, quantile regression estimates a conditional quantile function qα of Yn+1 given Xn+1 = x. This can be cast as the optimization problem

q̂α(x) = f(x; θ̂),    θ̂ = argmin_θ (1/n) Σ_{i=1}^n ρα(Yi, f(Xi; θ)) + R(θ),

where f(x; θ) is the quantile regression function and the loss function ρα is the "check function" or "pinball loss" [18, 24], defined by

ρα(y, ŷ) := α(y − ŷ) if y − ŷ > 0, and (1 − α)(ŷ − y) otherwise.

The simplicity and generality of this formulation make quantile regression widely applicable. As in classical regression, one can leverage the great variety of machine learning methods to design and learn q̂α [19–21, 23, 30].
All this suggests an obvious strategy to construct a prediction band with nominal miscoverage rate α: estimate q̂αlo(x) and q̂αhi(x) using quantile regression, then output Ĉ(Xn+1) = [q̂αlo(Xn+1), q̂αhi(Xn+1)] as an estimate of the ideal interval C(Xn+1) from equation (2). This approach is widely applicable and often works well in practice, yielding intervals that are adaptive to heteroscedasticity. 
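To see the check function in action, note that minimizing the average pinball loss over a constant prediction recovers an empirical α-quantile of the data. A minimal sketch (ours, for illustration):

```python
def pinball_loss(y, y_hat, alpha):
    """rho_alpha(y, y_hat): alpha*(y - y_hat) if y > y_hat, else (1 - alpha)*(y_hat - y)."""
    return alpha * (y - y_hat) if y > y_hat else (1 - alpha) * (y_hat - y)

def fit_constant_quantile(ys, alpha):
    """Minimize the average pinball loss over constant predictions.
    A minimizer is always an empirical alpha-quantile of ys."""
    return min(sorted(ys), key=lambda c: sum(pinball_loss(y, c, alpha) for y in ys))

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(fit_constant_quantile(ys, 0.5))   # an empirical median of ys
print(fit_constant_quantile(ys, 0.9))   # an empirical 90% quantile of ys
```

A general quantile regression algorithm replaces the constant c by a flexible function f(x; θ) and minimizes the same loss over θ.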
However, it is not guaranteed to satisfy the coverage statement (3) when C(X)\nis replaced by the estimated interval \u02c6C(Xn+1). Indeed, the absence of any \ufb01nite sample guarantee\ncan sometimes be disastrous. This worry is corroborated by our experiments, which show that the\nintervals constructed by neural networks can substantially undercover.\nUnder regularity conditions and for speci\ufb01c models, estimates of conditional quantile functions\nvia the pinball loss or related methods are known to be asymptotically consistent [23, 24, 31, 32].\nCertain methods that do not minimize the pinball loss, such as quantile random forests [22], are also\nasymptotically consistent. But to get valid coverage in \ufb01nite samples, we must draw on a different set\nof ideas, from conformal prediction.\n\n3 Conformal Prediction\n\nWe now describe how conformal prediction [1, 3] constructs prediction intervals that satisfy the \ufb01nite-\nsample coverage guarantee (1). To be carried out exactly, the original, or full, conformal procedure\neffectively requires the regression algorithm to be invoked in\ufb01nitely many times. In contrast, the\nmethod of split, or inductive, conformal prediction [2, 8] avoids this problem, at the cost of splitting\nthe data. While our proposal is applicable to both versions of conformal prediction, in the interest of\nspace we will restrict our attention to split conformal prediction and refer the reader to [3, 15] for a\nmore detailed comparison between the two methods.\nUnder the assumptions of Section 1, the split conformal method begins by splitting the train-\ning data into two disjoint subsets: a proper training set {(Xi, Yi) : i \u2208 I1} and calibration set\n\n3\n\n\f{(Xi, Yi) : i \u2208 I2}. 
Then, given any regression algorithm A,3 a regression model is \ufb01t to the proper\ntraining set:\n\n\u02c6\u00b5(x) \u2190 A ({(Xi, Yi) : i \u2208 I1}) .\n\nNext, the absolute residuals are computed on the calibration set, as follows:\n\n(4)\nFor a given level \u03b1, we then compute a quantile of the empirical distribution4 of the absolute residuals,\n\nRi = |Yi \u2212 \u02c6\u00b5(Xi)|,\n\ni \u2208 I2.\n\nQ1\u2212\u03b1(R,I2) := (1 \u2212 \u03b1)(1 + 1/|I2|)-th empirical quantile of{Ri : i \u2208 I2} .\n\nFinally, the prediction interval at a new point Xn+1 is given by\n\nC(Xn+1) = [\u02c6\u00b5(Xn+1) \u2212 Q1\u2212\u03b1(R,I2), \u02c6\u00b5(Xn+1) + Q1\u2212\u03b1(R,I2)] .\n\n(5)\nThis interval is guaranteed to satisfy (1), as shown in [3]. For related theoretical studies, see [15, 33].\nA closer look at the prediction interval (5) reveals a major limitation of this procedure: the length\nof C(Xn+1) is \ufb01xed and equal to 2Q1\u2212\u03b1(R,I2), independent of Xn+1. Lei et al [15] observe that\nthe intervals produced by the full conformal method also vary only slightly with Xn+1, provided the\nregression algorithm is moderately stable. This brings us to our proposal, which offers a principled\napproach to constructing variable-width conformal prediction intervals.\n\n4 Conformalized quantile regression (CQR)\n\nIn this section we introduce our procedure, beginning with a small experiment on simulated data\nto show how it improves upon standard conformal prediction. Figure 1 compares the prediction\nintervals produced by (a) the split conformal method, (b) its locally adaptive variant (described later\nin Section 5), and (c) our method, conformalized quantile regression (CQR). The heteroskedasticity\nof the data is evident, as the dispersion of Y varies considerably with X. The data also contains\noutliers, shown in the supplementary material. For all three methods, we construct 90% prediction\nintervals on the test data. 
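For reference, the split conformal recipe of Section 3 takes only a few lines to implement. In this sketch (ours), the base regressor is a deliberately crude global-mean predictor, standing in for an arbitrary algorithm A:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """(1 - alpha)(1 + 1/n)-th empirical quantile of the scores."""
    n = len(scores)
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    return np.sort(scores)[k - 1]

def split_conformal(fit, X_tr, y_tr, X_cal, y_cal, X_te, alpha=0.1):
    """Split conformal prediction: fit mu_hat on the proper training set,
    then calibrate a fixed half-width from the absolute residuals (4)."""
    mu = fit(X_tr, y_tr)                                  # any regression algorithm A
    q = conformal_quantile(np.abs(y_cal - mu(X_cal)), alpha)
    m = mu(X_te)
    return m - q, m + q                                   # interval (5): constant length 2q

def mean_fit(X, y):
    """Toy 'regressor': predict the global training mean everywhere."""
    c = y.mean()
    return lambda Xq: np.full(len(Xq), c)

rng = np.random.default_rng(0)
def sample(n):
    X = rng.uniform(-2, 2, n)
    return X, np.sin(X) + rng.normal(0.0, 0.3, n)

(X_tr, y_tr), (X_cal, y_cal), (X_te, y_te) = sample(1000), sample(1000), sample(1000)
lo, hi = split_conformal(mean_fit, X_tr, y_tr, X_cal, y_cal, X_te)
print("empirical coverage:", np.mean((lo <= y_te) & (y_te <= hi)))
```

Despite the poor regressor, empirical coverage lands close to the nominal 90%: the guarantee (1) holds regardless of the model, and the price of a bad fit is paid in interval length, not in coverage.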
From Figures 1a and 1d, we see that the lengths of the split conformal\nintervals are \ufb01xed and equal to 2.91. The prediction intervals of the locally weighted variant, shown\nin Figure 1b, are partially adaptive, resulting in slightly shorter intervals, of average length 2.86. Our\nmethod, shown in Figure 1c, is also adaptive, but its prediction intervals are considerably shorter, of\naverage length 1.99, due to better estimation of the lower and upper quantiles. We refer the reader to\nthe supplementary material for further details about this experiment, as well as a second simulation\ndemonstrating the advantage of CQR on heavy-tailed data.\nWe now describe CQR itself. As in split conformal prediction, we begin by splitting the data into a\nproper training set, indexed by I1, and a calibration set, indexed by I2. Given any quantile regression\nalgorithm A, we then \ufb01t two conditional quantile functions \u02c6q\u03b1lo and \u02c6q\u03b1hi on the proper training set:\n\n{\u02c6q\u03b1lo, \u02c6q\u03b1hi} \u2190 A({(Xi, Yi) : i \u2208 I1}).\n\nIn the essential next step, we compute conformity scores that quantify the error made by the plug-in\nprediction interval \u02c6C(x) = [\u02c6q\u03b1lo (x), \u02c6q\u03b1hi(x)]. The scores are evaluated on the calibration set as\n\nEi := max{\u02c6q\u03b1lo(Xi) \u2212 Yi, Yi \u2212 \u02c6q\u03b1hi(Xi)}\n\n(6)\nfor each i \u2208 I2. The conformity score Ei has the following interpretation. If Yi is below the lower\nendpoint of the interval, Yi < \u02c6q\u03b1lo (Xi), then Ei = |Yi \u2212 \u02c6q\u03b1lo (Xi)| is the magnitude of the error\nincurred by this mistake. Similarly, if Yi is above the upper endpoint of the interval, Yi > \u02c6q\u03b1hi (Xi),\nthen Ei = |Yi \u2212 \u02c6q\u03b1hi(Xi)|. 
Finally, if Yi correctly belongs to the interval, q̂αlo(Xi) ≤ Yi ≤ q̂αhi(Xi), then Ei is the larger of the two non-positive numbers q̂αlo(Xi) − Yi and Yi − q̂αhi(Xi) and so is itself non-positive. The conformity score thus accounts for both undercoverage and overcoverage.
Given new input data Xn+1, we then construct the prediction interval for Yn+1 as

C(Xn+1) = [q̂αlo(Xn+1) − Q1−α(E, I2), q̂αhi(Xn+1) + Q1−α(E, I2)],   (7)

where

Q1−α(E, I2) := (1 − α)(1 + 1/|I2|)-th empirical quantile of {Ei : i ∈ I2}   (8)

conformalizes the plug-in prediction interval.
For ease of reference, the CQR procedure is summarized in Algorithm 1. The following theorem, establishing its validity, is proved in the supplementary material.

3 In full conformal prediction, the regression algorithm must treat the data exchangeably, but no such restrictions apply to split conformal prediction.
4 The explicit formula for empirical quantiles is recalled in the supplementary material.

Figure 1: Prediction intervals on simulated heteroscedastic data with outliers (see the supplementary material for a full range display): (a) the standard split conformal method (avg. coverage 91.4%; avg. length 2.91), (b) its locally adaptive variant (avg. coverage 91.7%; avg. length 2.86), and (c) CQR, our method (avg. coverage 91.06%; avg. length 1.99). The length of the interval as a function of X is shown in (d). The target coverage rate is 90%. The broken black curve in (a) and (b) is the pointwise prediction from the random forest estimator. In (c), we show two curves, representing the lower and upper quantile regression estimates based on random forests [22]. Observe how in this example the quantile regression estimates closely match the adjusted estimates (the boundary of the blue region) obtained by conformalization.

Algorithm 1: Split Conformal Quantile Regression.
Input: Data (Xi, Yi), 1 ≤ i ≤ n; miscoverage level α ∈ (0, 1); quantile regression algorithm A.
Process:
  Randomly split {1, . . . , n} into two disjoint sets I1 and I2.
  Fit two conditional quantile functions: {q̂αlo, q̂αhi} ← A({(Xi, Yi) : i ∈ I1}).
  Compute Ei for each i ∈ I2, as in equation (6).
  Compute Q1−α(E, I2), the (1 − α)(1 + 1/|I2|)-th empirical quantile of {Ei : i ∈ I2}.
Output: Prediction interval C(x) = [q̂αlo(x) − Q1−α(E, I2), q̂αhi(x) + Q1−α(E, I2)] for Xn+1 = x.

Theorem 1. If (Xi, Yi), i = 1, . . . , n + 1 are exchangeable, then the prediction interval C(Xn+1) constructed by the split CQR algorithm satisfies

P{Yn+1 ∈ C(Xn+1)} ≥ 1 − α.

Moreover, if the conformity scores Ei are almost surely distinct, then the prediction interval is nearly perfectly calibrated:

P{Yn+1 ∈ C(Xn+1)} ≤ 1 − α + 1/(|I2| + 1).

Practical considerations and extensions

Conformalized quantile regression can accommodate a wide range of quantile regression methods [18–23, 25, 30] to estimate the conditional quantile functions, qαlo and qαhi. The estimators can even be aggregates of different quantile regression algorithms. Recently, new deep learning techniques have been proposed [26–29] for constructing prediction intervals. These methods could be wrapped by our framework and would then immediately enjoy rigorous coverage guarantees. 
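Algorithm 1 is equally short to implement. In the sketch below (ours), the quantile "regressor" is a crude binned empirical-quantile estimator, a self-contained stand-in for the random forests or neural networks used in the paper:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """(1 - alpha)(1 + 1/|I2|)-th empirical quantile, as in equation (8)."""
    n = len(scores)
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    return np.sort(scores)[k - 1]

def binned_quantile_fit(X, y, q_lo, q_hi, n_bins=10):
    """Crude quantile regressor: empirical (q_lo, q_hi) quantiles of y within
    equal-frequency bins of X. A stand-in for any quantile regression algorithm A."""
    edges = np.quantile(X, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, X) - 1, 0, n_bins - 1)
    table = np.array([np.quantile(y[bins == b], [q_lo, q_hi]) for b in range(n_bins)])
    def predict(Xq):
        idx = np.clip(np.searchsorted(edges, Xq) - 1, 0, n_bins - 1)
        return table[idx, 0], table[idx, 1]
    return predict

def cqr(X_tr, y_tr, X_cal, y_cal, X_te, alpha=0.1):
    """Split CQR (Algorithm 1): fit quantiles on I1, conformalize on I2."""
    q_hat = binned_quantile_fit(X_tr, y_tr, alpha / 2, 1 - alpha / 2)
    lo_c, hi_c = q_hat(X_cal)
    E = np.maximum(lo_c - y_cal, y_cal - hi_c)   # conformity scores, equation (6)
    Q = conformal_quantile(E, alpha)
    lo, hi = q_hat(X_te)
    return lo - Q, hi + Q                        # conformalized interval, equation (7)

# Heteroscedastic demo: noise standard deviation grows with X.
rng = np.random.default_rng(1)
def sample(n):
    X = rng.uniform(0.0, 4.0, n)
    return X, rng.normal(0.0, 0.2 + X)

(X_tr, y_tr), (X_cal, y_cal), (X_te, y_te) = sample(4000), sample(2000), sample(2000)
lo, hi = cqr(X_tr, y_tr, X_cal, y_cal, X_te)
print("coverage:", np.mean((lo <= y_te) & (y_te <= hi)))
print("avg length, X < 1 vs X > 3:", (hi - lo)[X_te < 1].mean(), (hi - lo)[X_te > 3].mean())
```

On this synthetic data the conformalized band is several times wider where the noise is large (X > 3) than where it is small (X < 1), while the marginal coverage stays near the nominal 90%.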
In\nour experiments, we focus on quantile neural networks [20] and quantile regression forests [22].\nBecause the underlying quantile regression algorithm may process the proper training set in arbitrary\nways, our framework affords broad \ufb02exibility in hyper-parameter tuning. Consider, for instance, the\ntuning of typical hyper-parameters of neural networks, such as the batch size, the learning rate, and\nthe number of epochs. The hyperparameters may be selected, as usual, by cross validation, where we\nminimize the average interval length over the folds.\nIn this vein, we record two speci\ufb01c implementation details that we have found to be useful.\n\n1. Quantile regression is sometimes too conservative, resulting in unnecessarily wide prediction\nintervals. In our experience, quantile regression forests [22] are often overly conservative\nand quantile neural networks [20] are occasionally so. We can mitigate this problem by\ntuning the nominal quantiles of the underlying method as additional hyper-parameters in\ncross validation. Notably, this tuning does not invalidate the coverage guarantee, but it may\nyield shorter intervals, as our experiments con\ufb01rm.\n\n2. To reduce the computational cost, instead of \ufb01tting two separate neural networks to estimate\nthe lower and upper quantile functions, we can replace the standard one-dimensional estimate\nof the unknown response by a two-dimensional estimate of the lower and upper quantiles.\nIn this way, most of the network parameters are shared between the two quantile estimators.\nWe adopt this approach in the experiments of Section 6.\n\nAnother avenue for extension is the conformalization step. The conformalization implemented by\nequations (7) and (8) allows coverage errors to be spread arbitrarily over the left and right tails. Using\na method reminiscent of [34], we can control the left and right tails independently, yielding a stronger\ncoverage guarantee. 
It is stated below and proved in the supplementary material. As we will see in Section 6, the price paid for the stronger coverage guarantee is slightly longer intervals.

Theorem 2. Define the prediction interval

C(Xn+1) := [q̂αlo(Xn+1) − Q1−αlo(Elo, I2), q̂αhi(Xn+1) + Q1−αhi(Ehi, I2)],

where Q1−αlo(Elo, I2) is the (1 − αlo)-th empirical quantile of {q̂αlo(Xi) − Yi : i ∈ I2} and Q1−αhi(Ehi, I2) is the (1 − αhi)-th empirical quantile of {Yi − q̂αhi(Xi) : i ∈ I2}. If the samples (Xi, Yi), i = 1, . . . , n + 1 are exchangeable, then

P{Yn+1 ≥ q̂αlo(Xn+1) − Q1−αlo(Elo, I2)} ≥ 1 − αlo

and

P{Yn+1 ≤ q̂αhi(Xn+1) + Q1−αhi(Ehi, I2)} ≥ 1 − αhi.

Consequently, assuming α = αlo + αhi, we also have P{Yn+1 ∈ C(Xn+1)} ≥ 1 − α.

5 Related work: locally adaptive conformal prediction

Locally adaptive split conformal prediction, first proposed in [7, 9] and later studied in [15], is an earlier approach to making conformal prediction adaptive to heteroskedasticity. Like our method, it starts from the observation that one can replace the absolute residuals in equation (4) by any other loss function that treats the data exchangeably. In this case, the absolute residuals Ri are replaced by the scaled residuals R̃i := |Yi − μ̂(Xi)|/σ̂(Xi) = Ri/σ̂(Xi), i ∈ I2, where σ̂(Xi) is a measure of the dispersion of the residuals at Xi. Usually σ̂(x) is an estimate of the conditional mean absolute deviation (MAD) of |Y − μ̂(x)| given X = x. Finally, the prediction interval at Xn+1 is computed as

C(Xn+1) = [μ̂(Xn+1) − σ̂(Xn+1)Q1−α(R̃, I2), μ̂(Xn+1) + σ̂(Xn+1)Q1−α(R̃, I2)].

Both μ̂ and σ̂ are fit only on the proper training set. Consequently, μ̂ and σ̂ satisfy the assumptions of conformal prediction and, hence, locally adaptive conformal prediction inherits the coverage guarantee of standard conformal prediction.
In practice, locally adaptive conformal prediction requires fitting two functions, in sequence, on the proper training set. (Thus it is more computationally expensive than standard conformal prediction.) First, one fits the conditional mean function μ̂(x), as described in Section 3. Then one fits σ̂(x) to the pairs {(Xi, Ri) : i ∈ I1}, using a regression model that predicts the residuals Ri given the inputs Xi. As an example, the intervals in Figure 1b above are created by locally adaptive split conformal prediction, where both μ̂ and σ̂ are random forests.
Locally adaptive conformal prediction is limited in several ways, some more important than others. A first limitation, already noted in [15], appears when the data is actually homoskedastic. In this case, the locally adaptive method suffers from inflated prediction intervals compared to the standard method. This is presumably due to the extra variability introduced by estimating σ̂ as well as μ̂.
The locally adaptive method faces a more fundamental statistical limitation. There is an essential difference between the residuals on the proper training set and the residuals on the calibration set: the former are biased by an optimization procedure designed to minimize them, while the latter are unbiased. Because it uses the proper training residuals (as it must to ensure valid coverage), the locally adaptive method tends to systematically underestimate the prediction error. 
In general, this\nforces the correction constant Q1\u2212\u03b1( \u02dcR,I2) to be large and the intervals to be less adaptive.\nTo press this point further, suppose the conditional mean function \u02c6\u00b5 is a deep neural network. It is\nwell attested in the deep learning literature that, given enough training samples, the best prediction\nerror is attained by \u201cover-\ufb01tting\u201d to the training data, in the sense that the training error is nearly zero.\nThe training residuals are then very poor estimates of the true prediction error, resulting in severe loss\nof adaptivity. Our method, in contrast, does not suffer from this problem because the original training\nobjective is to estimate the lower and upper conditional quantiles, not the conditional mean.\n\n6 Experiments\n\nIn this section we systematically compare our method, conformalized quantile regression, to the\nstandard and locally adaptive versions of split conformal prediction. Among preexisting conformal\nprediction algorithms, we select leading variants that use random forests [10] and neural networks [35]\nfor conditional mean regression. Speci\ufb01cally, we evaluate the original version of split conformal\nprediction (Section 3) using three regression algorithms: Ridge, Random Forests and Neural Net.\nWe evaluate locally adaptive conformal prediction (Section 5) using the same three underlying\nregression algorithms: Ridge Local, Random Forests Local, and Neural Net Local. Likewise, we\ncon\ufb01gure our method (Algorithm 1) to use quantile random forests [22], CQR Random Forests, and\nquantile neural networks [20], CQR Neural Net. Finally, as a baseline, we also include the previous\ntwo quantile regression algorithms, but without any conformalization: Quantile Random Forests\nand Quantile Neural Net. The last two methods, in contrast to the others, do not have \ufb01nite-sample\ncoverage guarantees. 
All implementation details are available in the supplementary material.
We conduct the experiments on eleven benchmark datasets for regression, listed in the supplementary material. In each case, we standardize the features to have zero mean and unit variance and we rescale the response by dividing it by its mean absolute value.5 The performance metrics are averaged over 20 different training-test splits; 80% of the examples are used for training and the remaining 20% for testing.

5 In the experiments, we compute the needed sample means and variances only on the proper training set. This ensures that if the original data is exchangeable, then the rescaled data remains so. That being said, we could also rescale using sample means and variances computed on the test data, because it would preserve exchangeability even while it destroys independence.

Method                     Avg. Length   Avg. Coverage
Ridge                      3.07          90.08
Ridge Local                2.93          90.14
Random Forests             2.24          90.00
Random Forests Local       1.82          89.99
Neural Net                 2.20          89.95
Neural Net Local           1.79          90.02
CQR Random Forests         1.40          90.34
CQR Neural Net             1.40          90.02
*Quantile Random Forests   2.21          92.62
*Quantile Neural Net       1.50          88.87

Table 1: Length and coverage of prediction intervals (α = 0.1), averaged across 11 datasets and 20 random training-test splits. Our methods are shown in bold font. The methods marked by an asterisk are not supported by finite-sample coverage guarantees.

Figure 2: Average length (left) and coverage (right) of prediction intervals (α = 0.1) on the bio dataset [36]. The numbers in the colored boxes are the average lengths, shown in red for split conformal, in gray for locally adaptive split conformal, and in light blue for our method.

The proper training and calibration sets for split conformal prediction have equal size. Throughout the experiments the nominal miscoverage rate is fixed and set to α = 0.1.
Table 1 summarizes our 2,200 experiments, showing the average performance across all the datasets and training-test splits. On average, our method achieves shorter prediction intervals than both standard and locally adaptive conformal prediction. It may seem surprising that our method also outperforms non-conformalized quantile regression, which is permitted more training data. There are several possible explanations for this. First, the non-conformalized methods sometimes overcover, but that is mitigated by our signed conformity scores (6). In addition, by using CQR, we can tune the quantiles of the underlying quantile regression algorithms using cross-validation (Section 4). Interestingly, CQR selects quantiles below the nominal level.
Turning to the issue of valid coverage, all methods based on conformal prediction successfully construct prediction bands at the nominal coverage rate of 90%, as the theory suggests they should. One of the non-conformalized methods, based on random forests, is slightly conservative, while the other, based on neural networks, tends to undercover. In fact, other authors have shown that the coverage of quantile neural networks depends greatly on the tuning of the hyper-parameters, with, for instance, the actual coverage in [25, Figure 3] ranging from the nominal 95% to well below 50%. Such volatility demonstrates the importance of conformal prediction's finite-sample guarantee.
When estimating a lower and an upper quantile by two separate quantile regressions, there is no guarantee that the lower estimate will actually be smaller than the upper estimate. This is known as the quantile crossing problem [37]. Quantile crossing can affect quantile neural networks, but not quantile regression forests. When the two quantiles are far apart, as in the 5% and 95% quantiles, we should expect the estimates to cross very infrequently and that is indeed what we find in the experiments. Nevertheless, we also evaluated a post-processing method to eliminate crossings [38]. It yields a slight improvement in performance: the average interval length of the CQR neural networks drops from 1.40 to 1.35, while the coverage rate remains the same. The average interval length of the unconformalized quantile neural networks drops from 1.50 to 1.40, with a decrease in the average coverage rate, from 88.87 to 87.99.
As expected, adopting the two-tailed, asymmetric conformalization proposed in Theorem 2 causes an increase in average interval length compared to the symmetric conformalization of Theorem 1. Specifically, the average length for CQR neural networks increases from 1.40 to 1.58, while the coverage rate stays about the same. The average length for the CQR random forests increases from 1.40 to 1.57, accompanied by a slight increase in the average coverage rate, from 90.34 to 90.94.
In a series of figures, provided in the supplementary material, we break down the performance of the different methods on each of the benchmark datasets. The performance on individual datasets confirms the overall trend in Table 1. Locally adaptive conformal prediction generally outperforms standard conformal prediction, and, on ten out of eleven datasets, conformalized quantile regression outperforms both. 
As a representative example, Figure 2 shows our results on a dataset (bio) about the physicochemical properties of protein tertiary structure [36].

7 Conclusion

Conformalized quantile regression is a new way of constructing prediction intervals that combines the advantages of conformal prediction and quantile regression. It provably controls the miscoverage rate in finite samples, under the mild distributional assumption of exchangeability, while adapting the interval lengths to heteroscedasticity in the data.
We expect the ideas behind conformalized quantile regression to be applicable in the related setting of conformal predictive distributions [39]. In this extension of conformal prediction, the aim is to estimate a predictive probability distribution, not just an interval. We see intriguing connections between our work and a very recent, independently written paper on conformal distributions [17].

Acknowledgements

E. C. was partially supported by the Office of Naval Research (ONR) under grant N00014-16-1-2712, by the Army Research Office (ARO) under grant W911NF-17-1-0304, by the Math + X award from the Simons Foundation, and by a generous gift from TwoSigma. E. P. and Y. R. were partially supported by the ARO grant. Y. R. was also supported by the same Math + X award. Y. R. thanks the Zuckerman Institute, ISEF Foundation and the Viterbi Fellowship, Technion, for providing additional research support. We thank Chiara Sabatti for her insightful comments on a draft of this paper and Ryan Tibshirani for his crucial remarks on our early experimental findings.

References

[1] Volodya Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, pages 444–453, 1999.

[2] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression.
In European Conference on Machine Learning, pages 345–356. Springer, 2002.

[3] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005.

[4] Vladimir Vovk, Ilia Nouretdinov, and Alex Gammerman. On-line predictive linear regression. The Annals of Statistics, 37(3):1566–1590, 2009.

[5] Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.

[6] Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014.

[7] Harris Papadopoulos, Alex Gammerman, and Volodya Vovk. Normalized nonconformity measures for regression conformal prediction. In International Conference on Artificial Intelligence and Applications, pages 64–69, 2008.

[8] Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Tools in Artificial Intelligence. IntechOpen, 2008.

[9] Harris Papadopoulos, Vladimir Vovk, and Alexander Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.

[10] Ulf Johansson, Henrik Boström, Tuve Löfström, and Henrik Linusson. Regression conformal prediction with random forests. Machine Learning, 97(1–2):155–176, 2014.

[11] Ulf Johansson, Cecilia Sönströd, Henrik Linusson, and Henrik Boström. Regression trees for streaming data with local performance guarantees. In IEEE International Conference on Big Data, pages 461–470. IEEE, 2014.

[12] Ulf Johansson, Cecilia Sönströd, and Henrik Linusson. Efficient conformal regressors using bagged neural nets.
In IEEE International Joint Conference on Neural Networks, pages 1–8. IEEE, 2015.

[13] Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1–2):9–28, 2015.

[14] Henrik Boström, Henrik Linusson, Tuve Löfström, and Ulf Johansson. Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence, 81(1–2):125–144, 2017.

[15] Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

[16] Wenyu Chen, Kelli-Jean Chun, and Rina Foygel Barber. Discretized conformal prediction for efficient distribution-free inference. Stat, 7(1):e173, 2018.

[17] Vladimir Vovk, Ivan Petej, Paolo Toccaceli, and Alex Gammerman. Conformal calibrators. arXiv preprint arXiv:1902.06579, 2019.

[18] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.

[19] David R. Hunter and Kenneth Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, 2000.

[20] James W. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299–311, 2000.

[21] Roger Koenker and Kevin F. Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143–156, 2001.

[22] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

[23] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.

[24] Ingo Steinwart and Andreas Christmann.
Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211–225, 2011.

[25] Natasa Tagasovska and David Lopez-Paz. Frequentist uncertainty estimates for deep learning. arXiv preprint arXiv:1811.00908, 2018.

[26] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[27] Cheng Lian, Zhigang Zeng, Wei Yao, Huiming Tang, and Chun Lung Philip Chen. Landslide displacement prediction with uncertainty based on neural networks with random hidden weights. IEEE Transactions on Neural Networks and Learning Systems, 27(12):2683–2695, 2016.

[28] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[29] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In International Conference on Machine Learning, pages 6473–6482, 2018.

[30] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[31] Kenneth Q. Zhou and Stephen L. Portnoy. Direct use of regression quantiles to construct confidence sets in linear models. The Annals of Statistics, 24(1):287–306, 1996.

[32] Kenneth Q. Zhou and Stephen L. Portnoy. Statistical inference on heteroscedastic models based on regression quantiles. Journal of Nonparametric Statistics, 9(3):239–260, 1998.

[33] Wenyu Chen, Zhaokai Wang, Wooseok Ha, and Rina Foygel Barber. Trimmed conformal prediction for high-dimensional models. arXiv preprint arXiv:1611.09933, 2016.

[34] Henrik Linusson, Ulf Johansson, and Tuve Löfström.
Signed-error conformal regression. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 224–236. Springer, 2014.

[35] Harris Papadopoulos and Haris Haralambous. Reliable prediction intervals with regression neural networks. Neural Networks, 24(8):842–851, 2011.

[36] Physicochemical properties of protein tertiary structure data set. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure. Accessed: January, 2019.

[37] Gilbert Bassett Jr. and Roger Koenker. An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association, 77(378):407–415, 1982.

[38] Victor Chernozhukov, Iván Fernández-Val, and Alfred Galichon. Quantile and probability curves without crossing. Econometrica, 78(3):1093–1125, 2010.

[39] Vladimir Vovk, Jieli Shen, Valery Manokhin, and Min-ge Xie. Nonparametric predictive distributions based on conformal prediction. Machine Learning, pages 1–30, 2017.