{"title": "Verified Uncertainty Calibration", "book": "Advances in Neural Information Processing Systems", "page_first": 3792, "page_last": 3803, "abstract": "Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates---those representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient---it requires $O(B/\\epsilon^2)$ samples, compared to $O(1/\\epsilon^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function that acts like a baseline for variance reduction and then bins the function values to actually ensure calibration. This requires only $O(1/\\epsilon^2 + B)$ samples. We then show that methods used to estimate calibration error are suboptimal---we prove that an alternative estimator introduced in the meteorological community requires fewer samples ($O(\\sqrt{B})$ instead of $O(B)$). 
We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35\\% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration.", "full_text": "Veri\ufb01ed Uncertainty Calibration\n\nAnanya Kumar, Percy Liang, Tengyu Ma\n\nDepartment of Computer Science\n\nStanford University\n\nAbstract\n\nApplications such as weather forecasting and personalized medicine demand mod-\nels that output calibrated probability estimates\u2014those representative of the true\nlikelihood of a prediction. Most models are not calibrated out of the box but are\nrecalibrated by post-processing model outputs. We \ufb01nd in this work that popular re-\ncalibration methods like Platt scaling and temperature scaling are (i) less calibrated\nthan reported, and (ii) current techniques cannot estimate how miscalibrated they\nare. An alternative method, histogram binning, has measurable calibration error\nbut is sample inef\ufb01cient\u2014it requires O(B/\u270f2) samples, compared to O(1/\u270f2) for\nscaling methods, where B is the number of distinct probabilities the model can\noutput. To get the best of both worlds, we introduce the scaling-binning calibrator,\nwhich \ufb01rst \ufb01ts a parametric function to reduce variance and then bins the function\nvalues to actually ensure calibration. This requires only O(1/\u270f2 + B) samples.\nNext, we show that we can estimate a model\u2019s calibration error more accurately\nusing an estimator from the meteorological community\u2014or equivalently measure\nits calibration error with fewer samples (O(pB) instead of O(B)). We validate\nour approach with multiclass calibration experiments on CIFAR-10 and ImageNet,\nwhere we obtain a 35% lower calibration error than histogram binning and, unlike\nscaling methods, guarantees on true calibration. 
We implement all these methods in a Python library.

1 Introduction

The probability that a system outputs for an event should reflect the true frequency of that event: if an automated diagnosis system says 1,000 patients have cancer with probability 0.1, approximately 100 of them should indeed have cancer. In this case, we say the model is uncertainty calibrated. The importance of this notion of calibration has been emphasized in personalized medicine [1], meteorological forecasting [2, 3, 4, 5, 6], and natural language processing applications [7, 8]. As most modern machine learning models, such as neural networks, do not output calibrated probabilities out of the box [9, 10, 11], researchers use recalibration methods that take the output of an uncalibrated model and transform it into a calibrated probability. Scaling approaches for recalibration—Platt scaling [12], isotonic regression [13], and temperature scaling [9]—are widely used and require very few samples, but do they actually produce calibrated probabilities?

We discover that these methods are less calibrated than reported. Past work approximates a model's calibration error using a finite set of bins. We show that by using more bins, we can uncover a higher calibration error for models on CIFAR-10 and ImageNet. We show that a fundamental limitation of approaches that output a continuous range of probabilities is that their true calibration error is unmeasurable with a finite number of bins (Example 3.2).

An alternative approach, histogram binning [10], outputs probabilities from a finite set. Histogram binning can produce a model that is calibrated, and unlike scaling methods we can measure its calibration error, but it is sample inefficient.
In particular, the number of samples required to calibrate scales linearly with the number of distinct probabilities the model can output, B [14], which can be large particularly in the multiclass setting, where B typically scales with the number of classes. Recalibration sample efficiency is crucial—we often want to recalibrate our models in the presence of domain shift [15] or recalibrate a model trained on simulated data, and may have access to only a small labeled dataset from the target domain.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Visualization of the three recalibration approaches: (a) Platt scaling, (b) histogram binning, (c) scaling-binning calibrator. The black crosses are the ground truth labels, and the red lines are the output of the recalibration methods. Platt scaling (Figure 1a) fits a function to the recalibration data, but its calibration error is not measurable. Histogram binning (Figure 1b) outputs the average label in each bin. The scaling-binning calibrator (Figure 1c) fits a function g ∈ G to the recalibration data and then takes the average of the function values (the gray circles) in each bin. The function values have lower variance than the labels, as visualized by the blue dotted lines, which is why our approach has lower variance.

To get the sample efficiency of Platt scaling and the verification guarantees of histogram binning, we propose the scaling-binning calibrator (Figure 1c). Like scaling methods, we fit a simple function g ∈ G to the recalibration dataset. We then bin the input space so that an equal number of inputs land in each bin. In each bin, we output the average of the g values in that bin—these are the gray circles in Figure 1c. In contrast, histogram binning outputs the average of the label values in each bin (Figure 1b).
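Histogram binning as just described is short to implement. The sketch below is our own NumPy illustration, not the authors' released library; `histogram_binning` and the toy data are hypothetical names introduced for this example. It builds uniform-mass bins over the model outputs and returns the average label per bin:

```python
import numpy as np

def histogram_binning(z, y, num_bins):
    """Fit histogram binning: uniform-mass bin edges over the model
    outputs z, then output the average label y in each bin."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    # Uniform-mass binning: each bin receives roughly len(z)/num_bins points.
    edges = np.quantile(z, np.linspace(0, 1, num_bins + 1))[1:-1]
    which_bin = np.searchsorted(edges, z, side="right")
    # The recalibrated output for bin j is the mean label in bin j.
    bin_means = np.array([y[which_bin == j].mean() for j in range(num_bins)])
    def recalibrator(z_new):
        z_new = np.asarray(z_new, dtype=float)
        return bin_means[np.searchsorted(edges, z_new, side="right")]
    return recalibrator

# Toy usage: an overconfident model, with true P(Y=1 | z) = 0.5*z + 0.25.
rng = np.random.default_rng(0)
z = rng.uniform(size=2000)
y = rng.binomial(1, 0.5 * z + 0.25)
g = histogram_binning(z, y, num_bins=10)
```

A uniform-mass scheme keeps every bin populated with roughly n/B recalibration points, which is the property that Section 4's well-balanced binning definition formalizes.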
The motivation behind our method is that the g values in each bin are in a narrower range than the label values, so when we take the average we incur lower estimation error. If G is well chosen, our method requires O(1/ε² + B) samples to achieve calibration error ε, instead of the O(B/ε²) samples required by histogram binning, where B is the number of model outputs (Theorem 4.1). Note that in prior work, binning the outputs of a function was used for evaluation and without any guarantees, whereas in our case it is used for the method itself, and we show improved sample complexity.

We run multiclass calibration experiments on CIFAR-10 [16] and ImageNet [17]. The scaling-binning calibrator achieves a lower calibration error than histogram binning, while allowing us to measure the true calibration error, unlike for scaling methods. We get a 35% lower calibration error on CIFAR-10 and a 5x lower calibration error on ImageNet than histogram binning for B = 100.

Finally, we show how to estimate the calibration error of models more accurately. Prior work in machine learning [7, 9, 15, 18, 19] directly estimates each term in the calibration error from samples (Definition 5.1). The sample complexity of this plugin estimator scales linearly with B. A debiased estimator introduced in the meteorological literature [20, 21] reduces the bias of the plugin estimator; we prove that it achieves sample complexity that scales with √B by leveraging error cancellations across bins. Experiments on CIFAR-10 and ImageNet confirm that the debiased estimator measures the calibration error more accurately.

2 Setup and background

2.1 Binary classification

Let X be the input space and Y be the label space, where Y = {0, 1} for binary classification. Let X ∈ X and Y ∈ Y be random variables denoting the input and label, given by an unknown joint distribution P.
As usual, expectations are taken over all random variables.

Suppose we have a model f : X → [0, 1], where the (possibly uncalibrated) output of the model represents the model's confidence that the label is 1. The calibration error examines the difference between the model's probability and the true probability given the model's output:

Definition 2.1 (Calibration error). The calibration error of f : X → [0, 1] is given by:

CE(f) = (E[|f(X) − E[Y | f(X)]|²])^{1/2}    (1)

If CE(f) = 0 then f is perfectly calibrated. This notion of calibration error is the most commonly used [2, 3, 4, 7, 15, 18, 19, 20]. Replacing the 2s in the above definition by p ≥ 1, we get the ℓ_p calibration error—the ℓ_1 and ℓ_∞ calibration errors are also used in the literature [9, 22, 23]. In addition to CE, we also deal with the ℓ_1 calibration error (known as ECE) in Sections 3 and 5.

Calibration alone is not sufficient: consider an image dataset containing 50% dogs and 50% cats. If f outputs 0.5 on all inputs, f is calibrated but not very useful. We often also wish to minimize the mean-squared error—also known as the Brier score—subject to a calibration budget [5, 24].

Definition 2.2. The mean-squared error of f : X → [0, 1] is given by MSE(f) = E[(f(X) − Y)²].

Note that MSE and CE are not orthogonal, and MSE = 0 implies perfect calibration; in fact, the MSE is the sum of the squared calibration error and a "sharpness" term [2, 4, 18].

2.2 Multiclass classification

While calibration in binary classification is well studied, it is less clear what to do for multiclass classification, where multiple definitions abound, differing in their strengths. In the multiclass setting, Y = [K] = {1, . . . , K} and f : X → [0, 1]^K outputs a confidence measure for each class in [K].

Definition 2.3 (Top-label calibration error).
The top-label calibration error examines the difference between the model's probability for its top prediction and the true probability of that prediction given the model's output:

TCE(f) = (E[(P(Y = argmax_{j∈[K]} f(X)_j | max_{j∈[K]} f(X)_j) − max_{j∈[K]} f(X)_j)²])^{1/2}    (2)

We would often like the model to be calibrated on less likely predictions as well—imagine that a medical diagnosis system says there is a 50% chance a patient has a benign tumor, a 10% chance she has an aggressive form of cancer, and a 40% chance she has one of a long list of other conditions. We would like the model to be calibrated on all of these predictions, so we define the marginal calibration error, which examines, for each class, the difference between the model's probability and the true probability of that class given the model's output.

Definition 2.4 (Marginal calibration error). Let w_k ∈ [0, 1] denote how important calibrating class k is, where w_k = 1/K if all classes are equally important. The marginal calibration error is:

MCE(f) = (Σ_{k=1}^{K} w_k E[(f(X)_k − P(Y = k | f(X)_k))²])^{1/2}    (3)

Prior works [9, 15, 19] propose methods for multiclass calibration but only measure top-label calibration—[23] and concurrent work to ours [25] define similar per-class calibration metrics, under which temperature scaling [9] is worse than vector scaling despite having better top-label calibration.

For notational simplicity, our theory focuses on the binary classification setting. We can transform top-label calibration into a binary calibration problem—the model outputs a probability corresponding to its top prediction, and the label represents whether the model gets it correct or not.
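As a concrete illustration of this reduction, and of the binned plugin estimate used throughout to approximate Definition 2.1, the sketch below is our own (both function names are hypothetical, and it uses equal-width evaluation bins; it is not the authors' library):

```python
import numpy as np

def top_label_to_binary(probs, labels):
    """Top-label reduction: the model's output is its top confidence,
    and the binary label is whether the top prediction was correct."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    z = probs.max(axis=1)                           # confidence in top class
    y = (probs.argmax(axis=1) == labels).astype(float)  # 1 iff correct
    return z, y

def plugin_squared_ce(z, y, num_bins=10):
    """Binned plugin estimate of the squared calibration error with
    equal-width bins: sum_j P(bin j) * (mean z - mean y in bin j)^2."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    which = np.minimum((z * num_bins).astype(int), num_bins - 1)
    total = 0.0
    for j in range(num_bins):
        mask = which == j
        if mask.any():
            total += mask.mean() * (z[mask].mean() - y[mask].mean()) ** 2
    return total
```

On a toy model whose confidence equals the true probability of being correct, this estimate is close to zero; shifting the labels' probabilities away from the confidences makes it grow.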
Marginal calibration can be transformed into K one-vs-all binary calibration problems, where for each k ∈ [K] the model outputs the probability associated with the k-th class, and the label represents whether the correct class is k [13]. We consider both top-label calibration and marginal calibration in our experiments. Other notions of multiclass calibration include joint calibration (which requires the entire probability vector to be calibrated) [2, 6] and event-pooled calibration [18].

2.3 Recalibration

Since most machine learning models do not output calibrated probabilities out of the box [9, 10], recalibration methods take the output of an uncalibrated model and transform it into a calibrated probability. That is, given a trained model f : X → [0, 1], let Z = f(X). We are given recalibration data T = {(z_i, y_i)}_{i=1}^n independently sampled from P(Z, Y), and we wish to learn a recalibrator g : [0, 1] → [0, 1] such that g ∘ f is well-calibrated.

Scaling methods, for example Platt scaling [12], output a function g = argmin_{g∈G} Σ_{(z,y)∈T} ℓ(g(z), y), where G is a model family, g ∈ G is differentiable, and ℓ is a loss function, for example the log-loss or mean-squared error. The advantage of such methods is that they converge very quickly since they only fit a small number of parameters.

Histogram binning first constructs a set of bins (intervals) that partitions [0, 1], formalized below.

Definition 2.5 (Binning schemes). A binning scheme B of size B is a set of B intervals I_1, . . . , I_B that partitions [0, 1]. Given z ∈ [0, 1], let β(z) = j, where j is the interval that z lands in (z ∈ I_j).

The bins are typically chosen such that either I_1 = [0, 1/B], I_2 = (1/B, 2/B], . . . , I_B = ((B−1)/B, 1] (equal-width binning) [9], or such that each bin contains an equal number of z_i values in the recalibration data (uniform-mass binning) [10]. Histogram binning then outputs the average y_i value in each bin.

3 Is Platt scaling calibrated?

In this section, we show that methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and that (ii) it is difficult to tell how miscalibrated they are. That is, we show, both theoretically and with experiments on CIFAR-10 and ImageNet, why the calibration error of models that output a continuous range of values is underestimated. We defer proofs to Appendix B.

The key to estimating the calibration error is estimating the conditional expectation E[Y | f(X)]. If f(X) is continuous, this is impossible without smoothness assumptions on E[Y | f(X)] (assumptions that cannot be verified in practice). This is analogous to the difficulty of measuring the mutual information between two continuous signals [26].

To approximate the calibration error, prior work bins the output of f into B intervals. The calibration error in each bin is estimated as the difference between the average value of f(X) and Y in that bin. Note that the binning here is for evaluation only, whereas in histogram binning it is used for the recalibration method itself. We formalize the notion of this binned calibration error below.

Definition 3.1. The binned version of f outputs the average value of f in each bin I_j:

f_B(x) = E[f(X) | f(X) ∈ I_j]    where f(x) ∈ I_j    (4)

Given B, the binned calibration error of f is simply the calibration error of f_B. A simple example shows that using binning to estimate the calibration error can severely underestimate the true calibration error.

Example 3.2. For any binning scheme B and continuous bijective function f : [0, 1] → [0, 1], there exists a distribution P over X, Y such that CE(f_B) = 0 but CE(f) ≥ 0.49.
Note that for all f, 0 ≤ CE(f) ≤ 1.

The intuition of the construction is that in each interval I_j in B, the model could underestimate the true probability E[Y | f(X)] half the time and overestimate it half the time. So if we average over the entire bin, the model appears to be calibrated, even though it is very uncalibrated. The formal proof is in Appendix B and holds for arbitrary ℓ_p calibration errors, including the ECE.

Next, we show that given a function f, its binned version always has lower calibration error. The proof, in Appendix B, is by Jensen's inequality. Intuitively, averaging a model's prediction within a bin allows errors at different parts of the bin to cancel out with each other. This result is similar to Theorem 2 in recent work [27] and holds for arbitrary ℓ_p calibration errors, including the ECE.

Proposition 3.3 (Binning underestimates error). Given any binning scheme B and model f : X → [0, 1], we have:

CE(f_B) ≤ CE(f).

Figure 2: Binned calibration errors of a recalibrated VGG-net model on (a) ImageNet and (b) CIFAR-10, with 90% confidence intervals. The binned calibration error increases as we increase the number of bins. This suggests that binning cannot be reliably used to measure the true calibration error.

3.1 Experiments

Our experiments on ImageNet and CIFAR-10 suggest that previous work reports numbers that are lower than the actual calibration error of their models. Recall that binning lower bounds the calibration error. We cannot compute the actual calibration error, but if we use a 'finer' set of bins then we get a tighter lower bound on the calibration error.

As in [9], our model's objective was to output the top predicted class and a confidence score associated with the prediction. For ImageNet, we started with a trained VGG16 model with an accuracy of 64.3%.
We split the validation set into 3 sets of sizes (20000, 5000, 25000). We used the first set of data to recalibrate the model using Platt scaling, the second to select the binning scheme B so that each bin contains an equal number of points, and the third to measure the binned calibration error. We calculated 90% confidence intervals for the binned calibration error using 1,000 bootstrap resamples, and performed the same experiment with varying numbers of bins.

Figure 2a shows that as we increase the number of bins on ImageNet, the measured calibration error is higher, and this is statistically significant. For example, if we use 15 bins as in [9], we would think the calibration error is around 0.02 when in reality the calibration error is at least twice as high. Figure 2b shows similar findings for CIFAR-10, and in Appendix C we show that our findings hold even if we use the ℓ_1 calibration error (ECE) and alternative binning strategies.

4 The scaling-binning calibrator

Section 3 shows that the problem with scaling methods is that we cannot estimate their calibration error. The upside of scaling methods is that if the function family has at least one function that can achieve calibration error ε, they require O(1/ε²) samples to reach calibration error ε, while histogram binning requires O(B/ε²) samples. Can we devise a method that is sample efficient to calibrate and one where it is possible to estimate the calibration error? To achieve this, we propose the scaling-binning calibrator (Figure 1c), where we first fit a scaling function and then bin the outputs of the scaling function.

4.1 Algorithm

We split the recalibration data T of size n into 3 sets: T_1, T_2, T_3.
The scaling-binning calibrator, illustrated in Figure 1, outputs ĝ_B such that ĝ_B ∘ f has low calibration error:

Step 1 (Function fitting): Select g = argmin_{g∈G} Σ_{(z,y)∈T_1} (y − g(z))².

Step 2 (Binning scheme construction): We choose the bins so that an equal number of g(z_i) in T_2 land in each bin I_j, for each j ∈ {1, . . . , B}—this uniform-mass binning scheme [10], as opposed to equal-width binning [9], is essential for being able to estimate the calibration error in Section 5.

Step 3 (Discretization): Discretize g by outputting the average g value in each bin—these are the gray circles in Figure 1c. Let μ(S) = (1/|S|) Σ_{s∈S} s denote the mean of a set of values S. Let μ̂[j] = μ({g(z_i) | g(z_i) ∈ I_j ∧ (z_i, y_i) ∈ T_3}) be the mean of the g(z_i) values that landed in the j-th bin. Recall that if z ∈ I_j, β(z) = j is the interval z lands in. Then we set ĝ_B(z) = μ̂[β(g(z))]—that is, we simply output the mean value in the bin that g(z) falls in.

4.2 Analysis

We now show that the scaling-binning calibrator requires O(B + 1/ε²) samples to calibrate, and in Section 5 we show that we can efficiently measure its calibration error. For the main theorem, we make some standard regularity assumptions on G, which we formalize in Appendix D. Our result is a generalization result—we show that if G contains some g* with low calibration error, then our method is at least almost as well-calibrated as g* given sufficiently many samples.

Theorem 4.1 (Calibration bound). Assume regularity conditions on G (finite parameters, injectivity, Lipschitz-continuity, consistency, twice differentiability).
Given δ ∈ (0, 1), there is a constant c such that for all B and ε > 0, with n ≥ c(B log B + log(B/δ)/ε²) samples, the scaling-binning calibrator finds ĝ_B with (CE(ĝ_B))² ≤ 2 min_{g∈G} (CE(g))² + ε², with probability at least 1 − δ.

Note that our method can potentially be better calibrated than g*, because we bin the outputs of the scaling function, which reduces its calibration error (Proposition 3.3). While binning worsens the sharpness and can increase the mean-squared error of the model, in Proposition D.4 we show that if we use many bins, binning the outputs cannot increase the mean-squared error by much.

We prove Theorem 4.1 in Appendix D but give a sketch here. Step 1 of our algorithm is Platt scaling, which simply fits a function g to the data—standard results in asymptotic statistics show that g converges in O(1/ε²) samples. Step 3, where we bin the outputs of g, is the main step of the algorithm. If we had infinite data, Proposition 3.3 showed that the binned version g_B has lower calibration error than g, so we would be done. However, we do not have infinite data—the core of our proof is to show that the empirically binned ĝ_B is within ε of g_B in O(B + 1/ε²) samples, instead of the O(B + B/ε²) samples required by histogram binning. The intuition is in Figure 1—the g(z_i) values in each bin (gray circles in Figure 1c) are in a narrower range than the y_i values (black crosses in Figure 1b) and thus have lower variance, so when we take the average we incur less estimation error. The perhaps surprising part is that we are estimating B numbers with Õ(1/ε²) samples.
In fact, there may be a small number of bins where the g(z_i) values are not in a narrow range, but our proof still shows that the overall estimation error is small.

Our uniform-mass binning scheme allows us to estimate the calibration error efficiently (see Section 5), unlike for scaling methods, where we cannot estimate the calibration error (Section 3). Recall that we chose our bins so that each bin has an equal proportion of points in the recalibration set. Lemma 4.3 shows that this property approximately holds in the population as well. This allows us to estimate the calibration error efficiently (Theorem 5.4).

Definition 4.2 (Well-balanced binning). Given a binning scheme B of size B and α ≥ 1, we say B is α-well-balanced if for all j,

1/(αB) ≤ P(Z ∈ I_j) ≤ α/B.

Lemma 4.3. For a universal constant c, if n ≥ cB log(B/δ), then with probability at least 1 − δ, the binning scheme B we chose is 2-well-balanced.

While the way we choose bins is not novel [10], we believe the guarantees around it are—not all binning schemes in the literature allow us to efficiently estimate the calibration error; for example, the binning scheme in [9] does not. Our proof of Lemma 4.3 is in Appendix D. We use a discretization argument to prove the result—this gives a tighter bound than applying Chernoff bounds or a standard VC dimension argument, which would tell us we need O(B² log(B/δ)) samples.

(a) Effect of number of bins on squared calibration error.

Figure 3: (Left) Recalibrating using 1,000 data points on CIFAR-10, the scaling-binning calibrator achieves lower squared calibration error than histogram binning, especially when the number of bins B is large. (Right) For a fixed calibration error, the scaling-binning calibrator allows us to use more bins. This results in models with more predictive power, which can be measured by the mean-squared error.
Note the vertical axis range is [0.04, 0.08] to zoom into the relevant region.

(b) Tradeoff between calibration and MSE.

4.3 Experiments

Our experiments on CIFAR-10 and ImageNet show that in the low-data regime, for example when we use ≤ 1,000 data points to recalibrate, the scaling-binning calibrator produces models with much lower calibration error than histogram binning. The uncalibrated model outputs a confidence score associated with each class. We recalibrated each class separately as in [13], using B bins per class, and evaluated calibration using the marginal calibration error (Definition 2.4).

We describe our experimental protocol for CIFAR-10. The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. We ran either the scaling-binning calibrator (we fit a sigmoid in the function fitting step) or histogram binning, and measured the marginal calibration error on the entire set of 10K points. We repeated this entire procedure 100 times and computed means and 90% confidence intervals, and we repeated all of this varying the number of bins B. Figure 3a shows that the scaling-binning calibrator produces models with lower calibration error, for example 35% lower calibration error when we use 100 bins per class.

Using more bins allows a model to produce more fine-grained predictions, e.g. [20] use B = 51 bins, which improves the quality of predictions as measured by the mean-squared error—Figure 3b shows that our method achieves better mean-squared errors for any given calibration constraint. More concretely, the figure shows a scatter plot of the mean-squared error and squared calibration error for histogram binning and the scaling-binning calibrator when we vary the number of bins. For example, if we want our models to have a calibration error ≤ 0.02 = 2%, we get a 9% lower mean-squared error.
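The three steps of Section 4.1 can be sketched compactly. This is our own minimal illustration, not the released library: `scaling_binning` is a hypothetical name, and for self-containedness Step 1 fits the sigmoid g(z) = sigmoid(a·logit(z) + b) with a few steps of plain gradient descent on the squared loss rather than a library optimizer:

```python
import numpy as np

def scaling_binning(z, y, num_bins, steps=3000, lr=0.1):
    """Sketch of the scaling-binning calibrator's three steps."""
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    idx = np.random.default_rng(0).permutation(len(z))
    t1, t2, t3 = np.array_split(idx, 3)  # the three splits T1, T2, T3

    logit = lambda p: np.log(p / (1 - p))
    sigmoid = lambda t: 1 / (1 + np.exp(-t))
    l1 = logit(np.clip(z[t1], 1e-6, 1 - 1e-6))

    # Step 1 (function fitting): fit g(z) = sigmoid(a*logit(z) + b) on T1
    # by gradient descent on the mean squared loss.
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * l1 + b)
        grad = 2 * (p - y[t1]) * p * (1 - p)  # d(loss)/d(a*l1 + b)
        a -= lr * np.mean(grad * l1)
        b -= lr * np.mean(grad)
    g = lambda zz: sigmoid(a * logit(np.clip(zz, 1e-6, 1 - 1e-6)) + b)

    # Step 2 (binning scheme): uniform-mass bin edges from g's values on T2.
    edges = np.quantile(g(z[t2]), np.linspace(0, 1, num_bins + 1))[1:-1]

    # Step 3 (discretization): output the average g value in each bin,
    # estimated on T3 (an empty bin falls back to its upper edge).
    gz3 = g(z[t3])
    which = np.searchsorted(edges, gz3, side="right")
    uppers = np.append(edges, 1.0)
    mu = np.array([gz3[which == j].mean() if np.any(which == j) else uppers[j]
                   for j in range(num_bins)])
    return lambda zz: mu[np.searchsorted(edges, g(np.asarray(zz, dtype=float)),
                                         side="right")]
```

The returned calibrator outputs at most B distinct values, which is what makes its calibration error measurable by the estimators of Section 5.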
In Appendix E we show that we get 5x lower top-label calibration error on ImageNet, and give further experiment details.

Validating theoretical bounds: In Appendix E we run synthetic experiments to validate the bound in Theorem 4.1. In particular, we show that if we fix the number of samples n and vary the number of bins B, the squared calibration error for the scaling-binning calibrator is nearly constant, but for histogram binning it increases nearly linearly with B. For both methods, the squared calibration error decreases approximately as 1/n—that is, when we double the number of samples, the squared calibration error halves.

5 Verifying calibration

Before deploying our model we would like to check that it has calibration error below some desired threshold E. In this section we show that we can accurately estimate the calibration error of binned models if the binning scheme is 2-well-balanced. Recent work in machine learning uses a plugin estimate for each term in the calibration error [7, 15, 18, 19]. Older work in meteorology [20, 21] notices that this is a biased estimate and proposes a debiased estimator that subtracts off an approximate correction term to reduce the bias. Our contribution is to show that the debiased estimator is more accurate: while the plugin estimator requires samples proportional to B to estimate the calibration error, the debiased estimator requires samples proportional to √B. Note that we show an improved sample complexity—prior work only showed that the naive estimator is biased. In Appendix G we also propose a way to debias the ℓ_1 calibration error (ECE), and show that we can estimate the ECE more accurately on CIFAR-10 and ImageNet.

Suppose we wish to measure the squared calibration error E² of a binned model f : X → S, where S ⊆ [0, 1] and |S| = B. Suppose we get an evaluation set T_n = {(x_1, y_1), . . . , (x_n, y_n)}.
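Given such an evaluation set for a binned model, both the plugin estimate Σ_s p̂_s(s − ŷ_s)² and the debiased variant, which subtracts ŷ_s(1 − ŷ_s)/(p̂_s·n − 1) in each bin, take only a few lines. This sketch uses our own function name (not the authors' library) and assumes each bin contains at least two samples so the correction term is defined:

```python
import numpy as np

def plugin_and_debiased(f_out, y):
    """Plugin and debiased estimates of the squared calibration error
    of a binned model (Definitions 5.1 and 5.2). Assumes every output
    value s is observed at least twice, so p_hat*n - 1 > 0."""
    f_out, y = np.asarray(f_out, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    plugin, debiased = 0.0, 0.0
    for s in np.unique(f_out):
        mask = f_out == s
        p_hat = mask.mean()      # estimated P(f(X) = s)
        y_hat = y[mask].mean()   # empirical average of Y when f outputs s
        plugin += p_hat * (s - y_hat) ** 2
        # Subtract an approximation of the per-bin bias of the plugin term.
        debiased += p_hat * ((s - y_hat) ** 2
                             - y_hat * (1 - y_hat) / (p_hat * n - 1))
    return plugin, debiased
```

Since the subtracted correction is nonnegative, the debiased estimate never exceeds the plugin one, which is always nonnegative even for a perfectly calibrated model.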
Past work typically estimates the calibration error by directly estimating each term from samples:

Definition 5.1 (Plugin estimator). Let L_s denote the y_j values where the model outputs s: L_s = {y_j | (x_j, y_j) ∈ T_n ∧ f(x_j) = s}. Let p̂_s be the estimated probability of f outputting s: p̂_s = |L_s|/n. Let ŷ_s be the empirical average of Y when the model outputs s: ŷ_s = Σ_{y∈L_s} y / |L_s|.

The plugin estimate for the squared calibration error is the weighted squared difference between ŷ_s and s:

Ê²_pl = Σ_{s∈S} p̂_s (s − ŷ_s)²

Alternatively, [20, 21] propose to subtract an approximation of the bias from the estimate:

Definition 5.2 (Debiased estimator). The debiased estimator for the squared calibration error is:

Ê²_db = Σ_{s∈S} p̂_s [ (s − ŷ_s)² − ŷ_s(1 − ŷ_s)/(p̂_s·n − 1) ]

We are interested in analyzing the number of samples required to estimate the calibration error within a constant multiplicative factor, that is, to give an estimate Ê² such that |Ê² − E²| ≤ (1/2)E² (where 1/2 can be replaced by any constant r with 0 < r < 1). Our main result is that the plugin estimator requires Õ(B/E²) samples (Theorem 5.3) while the debiased estimator requires Õ(√B/E²) samples (Theorem 5.4).

Theorem 5.3 (Plugin estimator bound). Suppose we have a binned model with squared calibration error E², where the binning scheme is 2-well-balanced, that is, for all s ∈ S, P(f(X) = s) ≥ 1/(2B).¹ If n ≥ c (B/E²) log(B/δ) for some universal constant c, then for the plugin estimator we have (1/2)E² ≤ Ê²_pl ≤ (3/2)E² with probability at least 1 − δ.

Theorem 5.4 (Debiased estimator bound). Suppose we have a binned model with squared calibration error E² and for all s ∈ S, P(f(X) = s) ≥ 1/(2B). If n ≥ c (√B/E²) log(B/δ) for some universal constant c, then for the debiased estimator we have (1/2)E² ≤ Ê²_db ≤ (3/2)E² with probability at least 1 − δ.

The proofs of both theorems are in Appendix F. The idea is that for the plugin estimator, each term in the sum has bias 1/n. These biases accumulate, giving total bias B/n. The debiased estimator has much lower bias, and the estimation variance cancels across bins—this intuition is captured in Lemma F.8, which requires careful conditioning to make the argument go through.

5.1 Experiments

We run a multiclass marginal calibration experiment on CIFAR-10 which suggests that the debiased estimator produces better estimates of the calibration error than the plugin estimator. We split the validation set of size 10,000 into two sets S_C and S_E of sizes 3,000 and 7,000 respectively. We use S_C to re-calibrate and discretize a trained VGG-16 model. We calibrate each of the K = 10 classes separately as described in Section 2, and used B = 100 or B = 10 bins per class. For varying values of n, we sample n points with replacement from S_E, and estimate the calibration error using the debiased estimator and the plugin estimator. We then compute the squared deviation of these estimates from the squared calibration error measured on the entire set S_E. We repeat this resampling 1,000 times to get the mean squared deviation of the estimates from the ground truth, with confidence intervals. Figure 4a shows that the debiased estimates are much closer to the ground truth than the plugin estimates—the difference is especially significant when the number of samples n is small or

¹We do not need the upper bound of the 2-well-balanced property.

(a) B = 10  (b) B = 100

Figure 4: Mean-squared errors of plugin and debiased estimators on a recalibrated VGG16 model on CIFAR-10 with 90% confidence intervals (lower values better).
The debiased estimator is closer to\nthe ground truth, which corresponds to 0 on the vertical axis, especially when B is large or n is small.\nNote that this is the MSE of the squared calibration error, not the MSE of the model in Figure 3.\n\nthe number of bins B is large. Note that having a perfect estimate corresponds to 0 on the vertical\naxis.\nIn Appendix G, we include histograms of the absolute difference between the estimates and ground\ntruth for the plugin and debiased estimator, over the 1,000 resamples.\n\n6 Related work\n\nCalibration, including the squared calibration error, has been studied in many \ufb01elds besides machine\nlearning including meteorology [2, 3, 4, 5, 6], fairness [28, 29], healthcare [1, 30, 31, 32], reinforce-\nment learning [33], natural language processing [7, 8], speech recognition [34], econometrics [24],\nand psychology [35]. Besides the calibration error, prior work also uses the Hosmer-Lemeshov\ntest [36] and reliability diagrams [4, 37] to evaluate calibration. Concurrent work to ours [38]\nalso notice that using the plugin calibration error estimator to test for calibration leads to rejecting\nwell-calibrated models too often. Besides calibration, other ways of producing and quantifying\nuncertainties include Bayesian methods [39] and conformal prediction [40, 41].\nAlgorithms and analysis in density estimation typically assume the true density is LLipschitz, while\nin calibration applications, the calibration error of the \ufb01nal model should be measurable from data,\nwithout making untestable assumptions on L.\nBias is a common issue with statistical estimators, for example, the seminal work by Stein [42] \ufb01xes\nthe bias of the mean-squared error. However, debiasing an estimator does not typically lead to an\nimproved sample complexity, as it does in our case. 
Recalibration is related to (conditional) density estimation [43, 44], as the goal is to estimate E[Y | f(X)].

7 Conclusion

This paper makes three contributions:

1. We showed that the calibration error of continuous methods is underestimated.

2. We introduced the first method, to our knowledge, that has better sample complexity than histogram binning and has a measurable calibration error, giving us the best of scaling and binning methods.

3. We showed that an alternative estimator for calibration error has better sample complexity than the plugin estimator.

There are many exciting avenues for future work:

1. Dataset shifts: Can we maintain calibration under dataset shifts (for example, train on MNIST, but evaluate on SVHN) without labeled examples from the target dataset?

2. Measuring calibration: Can we come up with alternative metrics that still capture a notion of calibration, but are measurable for scaling methods?

Reproducibility. Our Python calibration library is available at . All code, data, and experiments can be found on CodaLab at . Updated code can be found at .

Acknowledgements. The authors would like to thank the Open Philanthropy Project, the Stanford Graduate Fellowship, and the Toyota Research Institute for funding. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

We are grateful to Pang Wei Koh, Chuan Guo, Anand Avati, Shengjia Zhao, Weihua Hu, Yu Bai, John Duchi, Dan Hendrycks, Jonathan Uesato, Michael Xie, Albert Gu, Aditi Raghunathan, Fereshte Khani, Stefano Ermon, Eric Nalisnick, and Pushmeet Kohli for insightful discussions.
We thank the anonymous reviewers for their thorough reviews and suggestions that have improved our paper. We would also like to thank Pang Wei Koh, Yair Carmon, Albert Gu, Rachel Holladay, and Michael Xie for their inputs on our draft, and Chuan Guo for providing code snippets from their temperature scaling paper.

References

[1] X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.

[2] A. H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973.

[3] A. H. Murphy and R. L. Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26:41–47, 1977.

[4] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32:12–22, 1983.

[5] T. Gneiting and A. E. Raftery. Weather forecasting with ensemble methods. Science, 310, 2005.

[6] J. Brocker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, 2009.

[7] K. Nguyen and B. O'Connor. Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP), pages 1587–1598, 2015.

[8] D. Card and N. A. Smith. The importance of calibration for estimating proportions from annotations. In Association for Computational Linguistics (ACL), 2018.

[9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), pages 1321–1330, 2017.

[10] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning (ICML), pages 609–616, 2001.

[11] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning (ICML), 2018.

[12] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[13] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694–699, 2002.

[14] M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Binary classifier calibration: Non-parametric approach. arXiv, 2014.

[15] D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019.

[16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[18] V. Kuleshov and P. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

[19] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning (ICML), 2019.

[20] J. Brocker. Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39:655–667, 2012.

[21] C. A. T. Ferro and T. E. Fricker. A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138(668):1954–1960, 2012.

[22] M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.

[23] J. V. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran. Measuring calibration in deep learning. arXiv, 2019.

[24] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.

[25] M. Kull, M. P. Nieto, M. Kängsepp, T. S. Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[26] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.

[27] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. In Artificial Intelligence and Statistics (AISTATS), 2019.

[28] U. Hebert-Johnson, M. P. Kim, O. Reingold, and G. N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning (ICML), 2018.

[29] L. T. Liu, M. Simchowitz, and M. Hardt. The implicit fairness criterion of unconstrained learning. In International Conference on Machine Learning (ICML), 2019.

[30] C. S. Crowson, E. J. Atkinson, and T. M. Therneau. Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research, 25:1692–1706, 2017.

[31] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996.

[32] S. Yadlowsky, S. Basu, and L. Tian. A calibration metric for risk scores with survival data. Machine Learning for Healthcare, 2019.

[33] A. Malik, V. Kuleshov, J. Song, D. Nemer, H. Seymour, and S. Ermon. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning (ICML), 2019.

[34] D. Yu, J. Li, and L. Deng. Calibration of confidence measures in speech recognition. Transactions on Audio, Speech, and Language Processing, 19(8):2461–2473, 2011.

[35] S. Lichtenstein, B. Fischhoff, and L. D. Phillips. Judgement under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.

[36] D. W. Hosmer and S. Lemeshow. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9:1043–1069, 1980.

[37] J. Bröcker and L. A. Smith. Increasing the reliability of reliability diagrams. Weather and Forecasting, 22(3):651–661, 2007.

[38] D. Widmann, F. Lindsten, and D. Zachariah. Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[39] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 1995.

[40] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research (JMLR), 9:371–421, 2008.

[41] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113:1094–1111, 2016.

[42] C. M. Stein. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135–1151, 1981.

[43] L. Wasserman. Density estimation. 2019.

[44] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.

[45] F. Chollet. keras.

[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[47] Y. Geifman. cifar-vgg.

[48] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

[49] J. H. Hubbard and B. B. Hubbard. Vector Calculus, Linear Algebra, and Differential Forms. Prentice Hall, 1998.

[50] M. Kull, T. M. S. Filho, and P. Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11:5052–5080, 2017.