Title: Calibration tests in multi-class classification: A unifying framework
Published in: Advances in Neural Information Processing Systems (NeurIPS 2019), pages 12257–12267

Calibration tests in multi-class classification:
A unifying framework

David Widmann
Department of Information Technology
Uppsala University, Sweden
david.widmann@it.uu.se

Fredrik Lindsten
Division of Statistics and Machine Learning
Linköping University, Sweden
fredrik.lindsten@liu.se

Dave Zachariah
Department of Information Technology
Uppsala University, Sweden
dave.zachariah@it.uu.se

Abstract

In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient.
We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. Importantly, these estimators can be interpreted as test statistics associated with well-defined bounds and approximations of the p-value under the null hypothesis that the model is calibrated, significantly improving the interpretability of calibration measures, which otherwise lack any meaningful unit or scale.

1 Introduction

Consider the problem of analyzing microscopic images of tissue samples and reporting a tumour grade, i.e., a score that indicates whether cancer cells are well-differentiated or not, affecting both prognosis and treatment of patients. Since for some pathological images not even experienced pathologists might all agree on one classification, this task contains an inherent component of uncertainty. This type of uncertainty that cannot be removed by increasing the size of the training data set is typically called aleatoric uncertainty (Kiureghian and Ditlevsen, 2009). Unfortunately, even if the ideal model is among the class of models we consider, with a finite training data set we will never obtain the ideal model but we can only hope to learn a model that is, in some sense, close to it. Worse still, our model might not even be close to the ideal model if the model class is too restrictive or the number of training data is small, which is not unlikely given the fact that annotating pathological images is expensive. Thus ideally our model should be able to express not only aleatoric uncertainty but also the uncertainty about the model itself.
In contrast to aleatoric uncertainty, this so-called epistemic uncertainty can be reduced by additional training data.

Dealing with these different types of uncertainty is one of the major problems in machine learning. The application of our model in clinical practice demands "meaningful" uncertainties to avoid doing harm to patients. Being too certain about high tumour grades might cause harm due to unneeded aggressive therapies and overly pessimistic prognoses, whereas being too certain about low tumour grades might result in insufficient therapies. "Proper" uncertainty estimates are also crucial if the model is supervised by a pathologist who takes over if the uncertainty reported by the model is too high. False but highly certain gradings might incorrectly keep the pathologist out of the loop, and on the other hand too uncertain gradings might demand unneeded and costly human intervention.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Probability theory provides a solid framework for dealing with uncertainties. Instead of assigning exactly one grade to each pathological image, so-called probabilistic models report subjective probabilities, sometimes also called confidence scores, of the tumour grades for each image. The model can be evaluated by comparing these subjective probabilities to the ground truth.

One desired property of such a probabilistic model is sharpness (or high accuracy), i.e., if possible, the model should assign the highest probability to the true tumour grade (which perhaps cannot be inferred from the image at hand but only by other means such as an additional immunohistochemical staining). However, to be able to trust the predictions the probabilities should be calibrated (or reliable) as well (DeGroot and Fienberg, 1983; Murphy and Winkler, 1977).
This property requires the subjective probabilities to match the relative empirical frequencies: intuitively, if we could observe a long run of predictions (0.5, 0.1, 0.1, 0.3) for tumour grades 1, 2, 3, and 4, the empirical frequencies of the true tumour grades should be (0.5, 0.1, 0.1, 0.3). Note that accuracy and calibration are two complementary properties: a model with over-confident predictions can be highly accurate but miscalibrated, whereas a model that always reports the overall proportion of patients of each tumour grade in the considered population is calibrated but highly inaccurate.

Research on calibration in the statistics and machine learning literature has focused mainly on binary classification problems or the most confident predictions: common calibration measures such as the expected calibration error (ECE) (Naeini et al., 2015), the maximum calibration error (MCE) (Naeini et al., 2015), and the kernel-based maximum mean calibration error (MMCE) (Kumar et al., 2018), as well as reliability diagrams (Murphy and Winkler, 1977), have been developed for binary classification. This is insufficient since many recent applications of machine learning involve multiple classes. Furthermore, the crucial finding of Guo et al. (2017) that many modern deep neural networks are miscalibrated is also based only on the most confident prediction.

Recently Vaicenavicius et al. (2019) suggested that this analysis might be too reduced for many realistic scenarios. In our example, a prediction of (0.5, 0.3, 0.1, 0.1) is fundamentally different from a prediction of (0.5, 0.1, 0.1, 0.3), since according to the model in the first case it is only half as likely that a tumour is of grade 3 or 4, and hence the subjective probability of missing out on a more aggressive therapy is smaller.
However, commonly in the study of calibration all predictions with a highest reported confidence score of 0.5 are grouped together, and a calibrated model only has to be correct about the most confident tumour grade in 50% of the cases, regardless of the other predictions. Although the ECE can be generalized to multi-class classification, its applicability seems to be limited since its histogram-regression based estimator requires partitioning of the potentially high-dimensional probability simplex and is asymptotically inconsistent in many cases (Vaicenavicius et al., 2019). Sample complexity bounds for a bias-reduced estimator of the ECE introduced in the meteorological literature (Bröcker, 2011; Ferro and Fricker, 2012) were derived in concurrent work (Kumar et al., 2019).

2 Our contribution

In this work, we propose and study a general framework of calibration measures for multi-class classification. We show that this framework encompasses common calibration measures for binary classification such as the expected calibration error (ECE), the maximum calibration error (MCE), and the maximum mean calibration error (MMCE) by Kumar et al. (2018). In more detail, we study a class of measures based on vector-valued reproducing kernel Hilbert spaces, for which we derive consistent and unbiased estimators. The statistical properties of the proposed estimators are not only theoretically appealing, but also of high practical value, since they allow us to address two main problems in calibration evaluation.

As discussed by Vaicenavicius et al. (2019), all calibration error estimates are inherently random, and comparing competing models based on these estimates without taking the randomness into account can be very misleading, in particular when the estimators are biased (which, for instance, is the case for the commonly used histogram-regression based estimator of the ECE).
Even more fundamentally, all commonly used calibration measures lack a meaningful unit or scale and are therefore not interpretable as such (regardless of any finite sample issues).

The consistency and unbiasedness of the proposed estimators facilitate comparisons between competing models, and allow us to derive multiple statistical tests for calibration that exploit these properties. Moreover, by viewing the estimators as calibration test statistics, with well-defined bounds and approximations of the corresponding p-value, we give them an interpretable meaning.

We evaluate the proposed estimators and statistical tests empirically and compare them with existing methods. To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively.

3 Background

We start by shortly summarizing the most relevant definitions and concepts. Due to space constraints and to improve the readability of our paper, we do not provide any proofs in the main text but only refer to the results in the supplementary material, which is intended as a reference for mathematically precise statements and proofs.

3.1 Probabilistic setting

Let (X, Y) be a pair of random variables with X and Y representing inputs (features) and outputs, respectively. We focus on classification problems and hence without loss of generality we may assume that the outputs consist of the m classes 1, ..., m.

Let Δm denote the (m − 1)-dimensional probability simplex Δm := {z ∈ R^m_{≥0} : ‖z‖_1 = 1}. Then a probabilistic model g is a function that for every input x outputs a prediction g(x) ∈ Δm that models the distribution

    (P[Y = 1 | X = x], ..., P[Y = m | X = x]) ∈ Δm

of class Y given input X = x.

3.2 Calibration

3.2.1 Common notion

The common notion of calibration, as, e.g., used by Guo et al. (2017), considers only the most confident predictions max_y g_y(x) of a model g. According to this definition, a model is calibrated if

    P[Y = argmax_y g_y(X) | max_y g_y(X)] = max_y g_y(X)    (1)

holds almost always. Thus a model that is calibrated according to Eq. (1) ensures that we can partly trust the uncertainties reported by its predictions. As an example, for a prediction of (0.4, 0.3, 0.3) the model would only guarantee that in the long run inputs that yield a most confident prediction of 40% are in the corresponding class 40% of the time.¹

3.2.2 Strong notion

According to the more general calibration definition of Bröcker (2009); Vaicenavicius et al. (2019), a probabilistic model g is calibrated if for almost all inputs x the prediction g(x) is equal to the distribution of class Y given prediction g(X) = g(x). More formally, a calibrated model satisfies

    P[Y = y | g(X)] = g_y(X)    (2)

almost always for all classes y ∈ {1, ..., m}. As Vaicenavicius et al. (2019) showed, for multi-class classification this formulation is stronger than the definition of Zadrozny and Elkan (2002) that only demands calibrated marginal probabilities. Thus we can fully trust the uncertainties reported by the predictions of a model that is calibrated according to Eq. (2). The prediction (0.4, 0.3, 0.3) would actually imply that the class distribution of the inputs that yield this prediction is (0.4, 0.3, 0.3). To emphasize the difference to the definition in Eq. (1), we call calibration according to Eq.
(2) calibration in the strong sense or strong calibration.

¹This notion of calibration does not consider for which class the most confident prediction was obtained.

To simplify our notation, we rewrite Eq. (2) in vectorized form. Equivalently to the definition above, a model g is calibrated in the strong sense if

    r(g(X)) − g(X) = 0    (3)

holds almost always, where

    r(ξ) := (P[Y = 1 | g(X) = ξ], ..., P[Y = m | g(X) = ξ])

is the distribution of class Y given prediction g(X) = ξ.

The calibration of certain aspects of a model, such as the five largest predictions or groups of classes, can be investigated by studying the strong calibration of models induced by so-called calibration lenses. For more details about evaluation and visualization of strong calibration we refer to Vaicenavicius et al. (2019).

3.3 Matrix-valued kernels

The miscalibration measure that we propose in this work is based on matrix-valued kernels k : Δm × Δm → R^{m×m}. Matrix-valued kernels can be defined in a similar way as the more common real-valued kernels, which can be characterized as symmetric positive definite functions (Berlinet and Thomas-Agnan, 2004, Lemma 4).

Definition 1 (Micchelli and Pontil (2005, Definition 2); Caponnetto et al. (2008, Definition 1)). We call a function k : Δm × Δm → R^{m×m} a matrix-valued kernel if for all s, t ∈ Δm, k(s, t) = k(t, s)^T, and it is positive semi-definite, i.e., if for all n ∈ N, t_1, ..., t_n ∈ Δm, and u_1, ..., u_n ∈ R^m,

    Σ_{i,j=1}^n u_i^T k(t_i, t_j) u_j ≥ 0.

There exists a one-to-one mapping between such kernels and reproducing kernel Hilbert spaces (RKHSs) of vector-valued functions f : Δm → R^m. We provide a short summary of RKHSs of vector-valued functions on the probability simplex in Appendix D.
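The positive semi-definiteness condition of Definition 1 can be checked numerically for a concrete kernel. The following is a minimal Python sketch (not part of the paper's Julia implementation); the kernel choice exp(−‖s − t‖) I_m and all variable names are our own illustrative assumptions:

```python
import numpy as np

def matrix_kernel(s, t, m):
    """Illustrative matrix-valued kernel k(s, t) = exp(-||s - t||) * I_m."""
    return np.exp(-np.linalg.norm(s - t)) * np.eye(m)

def quadratic_form(points, coeffs, m):
    """Evaluate sum_{i,j} u_i^T k(t_i, t_j) u_j from Definition 1."""
    return sum(
        ui @ matrix_kernel(ti, tj, m) @ uj
        for ti, ui in zip(points, coeffs)
        for tj, uj in zip(points, coeffs)
    )

rng = np.random.default_rng(0)
m, n = 3, 20
points = rng.dirichlet(np.ones(m), size=n)  # points t_1, ..., t_n on the simplex
coeffs = rng.normal(size=(n, m))            # arbitrary vectors u_1, ..., u_n
q = quadratic_form(points, coeffs, m)
print(q >= -1e-9)  # positive semi-definiteness, up to floating-point error
```

Since the scalar Laplacian kernel is positive semi-definite, multiplying it by the identity matrix preserves this property, so the quadratic form is nonnegative for any choice of points and coefficient vectors.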
A more detailed general introduction to RKHSs of vector-valued functions can be found in the publications by Caponnetto et al. (2008); Carmeli et al. (2010); Micchelli and Pontil (2005).

Similar to the scalar case, matrix-valued kernels can be constructed from other matrix-valued kernels and even from real-valued kernels. Very simple matrix-valued kernels are kernels of the form k(s, t) = k̃(s, t) I_m, where k̃ is a scalar-valued kernel, such as the Gaussian or Laplacian kernel, and I_m is the identity matrix. As Example D.1 shows, this construction can be generalized by, e.g., replacing the identity matrix with an arbitrary positive semi-definite matrix.

An important class of kernels are the so-called universal kernels. Loosely speaking, a kernel is called universal if its RKHS is a dense subset of the space of continuous functions, i.e., if in the neighbourhood of every continuous function we can find a function in the RKHS. Prominent real-valued kernels on the probability simplex such as the Gaussian and the Laplacian kernel are universal, and can be used to construct universal matrix-valued kernels of the form in Example D.1, as Lemma D.3 shows.

4 Unification of calibration measures

In this section we introduce a general measure of strong calibration and show its relation to other existing measures.

4.1 Calibration error

In the analysis of strong calibration the discrepancy in the left-hand side of Eq. (3) lends itself naturally to the following calibration measure.

Definition 2. Let F be a non-empty space of functions f : Δm → R^m. Then the calibration error (CE) of model g with respect to class F is

    CE[F, g] := sup_{f∈F} E[(r(g(X)) − g(X))^T f(g(X))].

A trivial consequence of the design of the CE is that the measure is zero for every model that is calibrated in the strong sense, regardless of the choice of F. The converse statement is not true in general.
As we show in Theorem C.2, the class of continuous functions is a space for which the CE is zero if and only if model g is strongly calibrated, and hence allows us to characterize calibrated models. However, since this space is extremely large, for every model the CE is either 0 or ∞.² Thus a measure based on this space does not allow us to compare miscalibrated models and hence is rather impractical.

4.2 Kernel calibration error

Due to the correspondence between kernels and RKHSs we can define the following kernel measure.

Definition 3. Let k be a matrix-valued kernel as in Definition 1. Then we define the kernel calibration error (KCE) with respect to kernel k as KCE[k, g] := CE[F, g], where F is the unit ball in the RKHS corresponding to kernel k.

As mentioned above, an RKHS with a universal kernel is a dense subset of the space of continuous functions. Hence these kernels yield a function space that is still large enough for identifying strongly calibrated models.

Theorem 1 (cf. Theorem C.1). Let k be a matrix-valued kernel as in Definition 1, and assume that k is universal. Then KCE[k, g] = 0 if and only if model g is calibrated in the strong sense.

From the supremum-based Definition 2 it might not be obvious how the KCE can be evaluated. Fortunately, there exists an equivalent kernel-based formulation.

Lemma 1 (cf. Lemma E.2). Let k be a matrix-valued kernel as in Definition 1. If E[‖k(g(X), g(X))‖] < ∞, then

    KCE[k, g] = (E[(e_Y − g(X))^T k(g(X), g(X')) (e_{Y'} − g(X'))])^{1/2},    (4)

where (X', Y') is an independent copy of (X, Y) and e_i denotes the ith unit vector.

4.3 Expected calibration error

The most common measure of calibration error is the expected calibration error (ECE).
Typically it is used for quantifying calibration in a binary classification setting, but it generalizes to strong calibration in a straightforward way. Let d : Δm × Δm → R_{≥0} be a distance measure on the probability simplex. Then the expected calibration error of a model g with respect to d is defined as

    ECE[d, g] = E[d(r(g(X)), g(X))].    (5)

If d(p, q) = 0 ⇔ p = q, as is the case for standard choices of d such as the total variation distance or the (squared) Euclidean distance, then ECE[d, g] is zero if and only if g is strongly calibrated as per Eq. (3).

The ECE with respect to the cityblock distance, the total variation distance, or the squared Euclidean distance is a special case of the calibration error CE, as we show in Lemma I.1.

4.4 Maximum mean calibration error

Kumar et al. (2018) proposed a kernel-based calibration measure, the so-called maximum mean calibration error (MMCE), for training (better) calibrated neural networks. In contrast to their work, in our publication we do not discuss how to obtain calibrated models but focus on the evaluation of calibration and on calibration tests. Moreover, the MMCE applies only to a binary classification setting whereas our measure quantifies strong calibration and hence is more generally applicable. In fact, as we show in Example I.1, the MMCE is a special case of the KCE.

²Assume CE[F, g] < ∞ and let f_1, f_2, ... be a sequence of continuous functions with CE[F, g] = lim_{n→∞} E[(r(g(X)) − g(X))^T f_n(g(X))]. From Remark C.2 we know that CE[F, g] ≥ 0. Moreover, f̃_n := 2 f_n are continuous functions with 2 CE[F, g] = lim_{n→∞} E[(r(g(X)) − g(X))^T f̃_n(g(X))] ≤ sup_{f∈F} E[(r(g(X)) − g(X))^T f(g(X))] = CE[F, g].
Thus CE[F, g] = 0.

5 Calibration error estimators

Consider the task of estimating the calibration error of model g using a validation set D = {(X_i, Y_i)}_{i=1}^n of n i.i.d. random pairs of inputs and labels that are distributed according to (X, Y). From the expression for the ECE in Eq. (5), the natural (and, indeed, standard) approach for estimating the ECE is as the sample average of the distance d between the predictions g(X) and the calibration function r(g(X)). However, this is problematic since the calibration function is not readily available and needs to be estimated as well. Typically, this is addressed using histogram-regression, see, e.g., Guo et al. (2017); Naeini et al. (2015); Vaicenavicius et al. (2019), which unfortunately leads to inconsistent and biased estimators in many cases (Vaicenavicius et al., 2019) and can scale poorly to large m. In contrast, for the KCE in Eq. (4) there is no explicit dependence on r, which enables us to derive multiple consistent and also unbiased estimators.

Let k be a matrix-valued kernel as in Definition 1 with E[‖k(g(X), g(X))‖] < ∞, and define for 1 ≤ i, j ≤ n

    h_{i,j} := (e_{Y_i} − g(X_i))^T k(g(X_i), g(X_j)) (e_{Y_j} − g(X_j)).

Then the estimators listed in Table 1 are consistent estimators of the squared kernel calibration error SKCE[k, g] := KCE²[k, g] (see Lemmas F.1 to F.3). The subscript letters "q" and "l" refer to the quadratic and linear computational complexity of the unbiased estimators, respectively.

Table 1: Three consistent estimators of the SKCE.

    Notation    Definition
    ŜKCE_b      n^{-2} Σ_{i,j=1}^n h_{i,j}
    ŜKCE_uq     (n choose 2)^{-1} Σ_{1≤i<j≤n} h_{i,j}
    ŜKCE_ul     ⌊n/2⌋^{-1} Σ_{i=1}^{⌊n/2⌋} h_{2i−1,2i}

6 Calibration tests

If K_{p;q} := sup_{s,t∈Δm} ‖k(s, t)‖_{p;q} > 0 and B_{p;q} := 2^{1+1/p−1/q} K_{p;q},³ then for the biased estimator we can bound

    P[ŜKCE_b ≥ t | H0] ≤ exp(−(1/2) max{0, √(n t / B_{p;q}) − 1}²),

and for either of the unbiased estimators T ∈ {ŜKCE_uq, ŜKCE_ul}, we can bound

    P[T ≥ t | H0] ≤ exp(−⌊n/2⌋ t² / (2 B_{p;q}²)).

³For a matrix A we denote by ‖A‖_{p;q} the induced matrix norm sup_{x≠0} ‖Ax‖_q / ‖x‖_p.

Asymptotic bounds exploit the asymptotic distribution of the test statistics under the null hypothesis, as the number of validation data points goes to infinity. The central limit theorem implies that the linear estimator is asymptotically normally distributed.

Lemma 3 (Asymptotic distribution of ŜKCE_ul (see Corollary G.1)). Let k be a matrix-valued kernel as in Definition 1, and assume that E[‖k(g(X), g(X))‖] < ∞.
If Var[h_{i,j}] < ∞, then

    P[√⌊n/2⌋ ŜKCE_ul ≥ t | H0] → 1 − Φ(t/σ̂)  as n → ∞,

where σ̂ is the sample standard deviation of h_{2i−1,2i} (i = 1, ..., ⌊n/2⌋) and Φ is the cumulative distribution function of the standard normal distribution.

In Theorem G.2 we derive a theoretical expression of the asymptotic distribution of n ŜKCE_uq under the assumption of strong calibration. This limit distribution can be approximated, e.g., by bootstrapping (Arcones and Giné, 1992) or Pearson curve fitting, as discussed by Gretton et al. (2012).

7 Experiments

We conduct experiments to confirm the derived theoretical properties of the proposed calibration error estimators empirically and to compare them with a standard histogram-regression based estimator of the ECE, denoted by ÊCE.⁴

We construct synthetic data sets {(g(X_i), Y_i)}_{i=1}^{250} of 250 labeled predictions for m = 10 classes from three generative models. For each model we first sample predictions g(X_i) ~ Dir(0.1, ..., 0.1), and then simulate corresponding labels Y_i conditionally on g(X_i) from

    M1: Cat(g(X_i)),    M2: 0.5 Cat(g(X_i)) + 0.5 Cat(1, 0, ..., 0),    M3: Cat(0.1, ..., 0.1),

where M1 gives a calibrated model, and M2 and M3 are uncalibrated. In Appendix J.2 we investigate the theoretical properties of these models in more detail.

For simplicity, we use the matrix-valued kernel k(x, y) = exp(−‖x − y‖/ν) I_10, where the kernel bandwidth ν > 0 is chosen by the median heuristic.
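The setup of model M1 and the three estimators from Table 1 can be sketched in a few lines. This is a minimal Python illustration (the paper's reference implementations are the Julia packages mentioned in Section 2); all variable names are ours, and the kernel and median heuristic follow the choices above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 250, 10

# Predictions g(X_i) ~ Dir(0.1, ..., 0.1); labels from Cat(g(X_i)), i.e. model M1.
g = rng.dirichlet(0.1 * np.ones(m), size=n)
labels = np.array([rng.choice(m, p=p) for p in g])

# Pairwise distances and the median heuristic for the bandwidth nu.
dists = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=-1)
nu = np.median(dists[np.triu_indices(n, k=1)])

# h_{i,j} = (e_{Y_i} - g(X_i))^T k(g(X_i), g(X_j)) (e_{Y_j} - g(X_j));
# for k(x, y) = exp(-||x - y|| / nu) * I_m the matrix kernel factorizes.
d = np.eye(m)[labels] - g
h = np.exp(-dists / nu) * (d @ d.T)

skce_b = h.sum() / n**2                                          # biased estimator
skce_uq = h[np.triu_indices(n, k=1)].mean()                      # unbiased, quadratic
skce_ul = h[np.arange(0, n - 1, 2), np.arange(1, n, 2)].mean()   # unbiased, linear
print(skce_b, skce_uq, skce_ul)
```

Since M1 is calibrated, the unbiased estimates fluctuate around zero, whereas the biased estimate is nonnegative by construction.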
The total variation distance is a common distance measure of probability distributions and the standard distance measure for the ECE (Bröcker and Smith, 2007; Guo et al., 2017; Vaicenavicius et al., 2019), and hence it is chosen as the distance measure for all studied calibration errors.

7.1 Calibration error estimates

In Fig. 1 we show the distribution of ÊCE and of the three proposed estimators of the SKCE, evaluated on 10⁴ randomly sampled data sets from each of the three models. The true calibration error of these models, indicated by a dashed line, is calculated theoretically for the ECE (see Appendix J.2.1) and empirically for the SKCE using the sample mean of all unbiased estimates of ŜKCE_uq.

We see that the standard estimator of the ECE exhibits both negative and positive bias, whereas ŜKCE_b is theoretically guaranteed to be biased upwards. The results also confirm the unbiasedness of ŜKCE_ul.

Figure 1: Distribution of calibration error estimates of 10⁴ data sets that are randomly sampled from the generative models M1, M2, and M3. The solid line indicates the mean of the calibration error estimates, and the dashed line displays the true calibration error.

7.2 Calibration tests

We repeatedly compute the bounds and approximations of the p-value for the calibration error estimators that were derived in Section 6 on 10⁴ randomly sampled data sets from each of the three models. More concretely, we evaluate the distribution-free bounds for ŜKCE_b (Db), ŜKCE_uq (Duq), and ŜKCE_ul (Dul) and the asymptotic approximations for ŜKCE_uq (Auq) and ŜKCE_ul (Al), where the former is approximated by bootstrapping. We compare them with a previously proposed hypothesis test for the standard ECE estimator based on consistency resampling (C), in which data sets are resampled under the assumption that the model is calibrated by sampling labels from resampled predictions (Bröcker and Smith, 2007; Vaicenavicius et al., 2019).

For a chosen significance level α we compute from the p-value approximations p_1, ..., p_{10⁴} the empirical test error

    10⁻⁴ Σ_{i=1}^{10⁴} 1_[0,α](p_i)  (for M1)    and    10⁻⁴ Σ_{i=1}^{10⁴} 1_(α,1](p_i)  (for M2 and M3).

In Fig. 2 we plot these empirical test errors versus the significance level.

As expected, the distribution-free bounds seem to be very loose upper bounds of the p-value, resulting in statistical tests without much power. The asymptotic approximations, however, seem to estimate the p-value quite well on average, as can be seen from the overlap with the diagonal in the results for the calibrated model M1 (the empirical test error matches the chosen significance level). Additionally, calibration tests based on the asymptotic distribution of these statistics, and in particular of ŜKCE_uq, are quite powerful in our experiments, as the results for the uncalibrated models M2 and M3 show. For the calibrated model, consistency resampling leads to an empirical test error that is not upper bounded by the significance level, i.e., the null hypothesis of the model being calibrated is rejected too often. This behaviour is caused by an underestimation of the p-value on average, which unfortunately makes the calibration test based on consistency resampling for the standard ECE estimator unreliable.

7.3 Additional experiments

In Appendix J.2.3 we provide additional results for varying number of classes and a non-standard ECE estimator with data-dependent bins.

⁴The implementation of the experiments is available online at https://github.com/devmotion/CalibrationPaper.
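The asymptotic test based on ŜKCE_ul (approximation Al) reduces to a one-sided normal test on the pair evaluations h_{2i−1,2i}, per Lemma 3. The following hedged Python sketch illustrates the mechanics; the mean-zero draws used below are a stand-in for h values under a calibrated model, not samples from the models M1 to M3:

```python
import numpy as np
from math import erf, sqrt

def asymptotic_p_value(h_pairs):
    """One-sided p-value from the normal approximation in Lemma 3:
    P[sqrt(n2) * SKCE_ul >= t | H0] ~ 1 - Phi(t / sigma_hat)."""
    n2 = len(h_pairs)
    t = sqrt(n2) * np.mean(h_pairs)
    sigma_hat = np.std(h_pairs, ddof=1)
    return 1.0 - 0.5 * (1.0 + erf(t / (sigma_hat * sqrt(2.0))))

rng = np.random.default_rng(2)
# Stand-in for h_{2i-1,2i} under H0: mean-zero draws, 125 pairs per data set.
p_values = [asymptotic_p_value(rng.normal(size=125)) for _ in range(2000)]

alpha = 0.05
# Empirical test error for a calibrated model: fraction of runs with p <= alpha.
test_error = np.mean([p <= alpha for p in p_values])
print(round(test_error, 3))  # typically close to alpha
```

Under the null hypothesis the p-values are approximately uniform, so the fraction of rejections matches the significance level, which is exactly the diagonal behaviour reported for Al on model M1.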
We observe that the bias of ÊCE becomes more prominent with increasing number of classes, showing high calibration error estimates even for calibrated models. The estimators of the SKCE are not affected by the number of classes in the same way. In some experiments with 100 and 1000 classes, however, the distribution of ŜKCE_ul shows multi-modality.

The considered calibration measures depend only on the predictions and the true labels, not on how these predictions are computed. Hence directly specifying the predictions allows a clean numerical evaluation and enables comparisons of the estimates with the true calibration error. Nevertheless, we provide a more practical evaluation of calibration for several modern neural networks in Appendix J.3.

Figure 2: Empirical test error versus significance level for different bounds and approximations of the p-value, evaluated on 10⁴ data sets that are randomly sampled from the generative models M1, M2, and M3. The dashed line highlights the diagonal of the unit square.

8 Conclusion

We have presented a unified framework for quantifying the calibration error of probabilistic classifiers. The framework encompasses several existing error measures and enables the formulation of a new kernel-based measure.
We have derived unbiased and consistent estimators of the kernel-based error measures, which are properties not readily enjoyed by the more common and less tractable ECE.

The impact of the kernel and its hyperparameters on the estimators is an important question for future research. We have refrained from investigating it in this paper, since it deserves a more exhaustive study than would have been possible in this work.

The calibration error estimators can be viewed as test statistics. This confers probabilistic interpretability to the error measures. Specifically, we can compute well-founded bounds and approximations of the p-value for the observed error estimates under the null hypothesis that the model is calibrated. We have derived distribution-free bounds and asymptotic approximations for the estimators of the proposed kernel-based error measure that allow reliable calibration tests, in contrast to previously proposed tests based on consistency resampling with the standard estimator of the ECE.

Acknowledgements

We thank the reviewers for all the constructive feedback on our paper. This research is financially supported by the Swedish Research Council via the projects Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278) and Counterfactual Prediction Methods for Heterogeneous Populations (contract number: 2018-05040), by the Swedish Foundation for Strategic Research via the project Probabilistic Modeling and Inference for Machine Learning (contract number: ICA16-0015), and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

M. A. Arcones and E. Giné. On the bootstrap of U and V statistics.
The Annals of Statistics, 20(2):655–674, 1992.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer US, 2004.

J. Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, 2009.

J. Bröcker. Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39(3–4):655–667, 2011.

J. Bröcker and L. A. Smith. Increasing the reliability of reliability diagrams. Weather and Forecasting, 22(3):651–661, 2007.

A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615–1646, 2008.

C. Carmeli, E. De Vito, A. Toigo, and V. Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(1):19–61, 2010.

A. Christmann and I. Steinwart. Support Vector Machines. Information Science and Statistics. Springer New York, 2008.

M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32(1/2):12–22, 1983.

C. A. T. Ferro and T. E. Fricker. A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138(668):1954–1960, 2012.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger.
On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, 2017.

W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

A. D. Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Kumar, S. Sarawagi, and U. Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2805–2814, 2018.

A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems 32, 2019.

C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

A. H. Murphy and R. L. Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Applied Statistics, 26(1):41–47, 1977.

M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

M. Pastell. Weave.jl: Scientific reports using Julia. The Journal of Open Source Software, 2(11):204, 2017.

H. Phan. PyTorch-CIFAR10, 2019. URL https://github.com/huyvnphan/PyTorch-CIFAR10/.

W. Rudin. Real and Complex Analysis. McGraw-Hill, New York, 3rd edition, 1986.

R. J. Serfling. Approximation Theorems of Mathematical Statistics.
John Wiley & Sons, Inc., 1980.

J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön. Evaluating model calibration in classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 3459–3467, 2019.

A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer New York, 1996.

D. Widmann. devmotion/CalibrationErrors.jl: v0.1.0, Sept. 2019a. URL https://doi.org/10.5281/zenodo.3457945.

D. Widmann. devmotion/CalibrationTests.jl: v0.1.0, Oct. 2019b. URL https://doi.org/10.5281/zenodo.3514933.

D. Widmann. devmotion/ConsistencyResampling.jl: v0.2.0, May 2019c. URL https://doi.org/10.5281/zenodo.3232854.

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, 2002.