{"title": "Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration", "book": "Advances in Neural Information Processing Systems", "page_first": 12316, "page_last": 12326, "abstract": "Class probabilities predicted by most multiclass classifiers are uncalibrated, often tending towards over-confidence. With neural networks, calibration can be improved by temperature scaling, a method to learn a single corrective multiplicative factor for inputs to the last softmax layer. On non-neural models the existing methods apply binary calibration in a pairwise or one-vs-rest fashion. We propose a natively multiclass calibration method applicable to classifiers from any model class, derived from Dirichlet distributions and generalising the beta calibration method from binary classification. It is easily implemented with neural nets since it is equivalent to log-transforming the uncalibrated probabilities, followed by one linear layer and softmax. Experiments demonstrate improved probabilistic predictions according to multiple measures (confidence-ECE, classwise-ECE, log-loss, Brier score) across a wide range of datasets and classifiers. 
Parameters of the learned Dirichlet calibration map \nprovide insights to the biases in the uncalibrated model.", "full_text": "Beyond temperature scaling:\n\nObtaining well-calibrated multiclass probabilities\n\nwith Dirichlet calibration\n\nMeelis Kull\n\nDepartment of Computer Science\n\nUniversity of Tartu\n\nmeelis.kull@ut.ee\n\nMarkus K\u00e4ngsepp\n\nDepartment of Computer Science\n\nUniversity of Tartu\n\nmarkus.kangsepp@ut.ee\n\nHao Song\n\nDepartment of Computer Science\n\nUniversity of Bristol\n\nhao.song@bristol.ac.uk\n\nMiquel Perello-Nieto\n\nDepartment of Computer Science\n\nUniversity of Bristol\n\nmiquel.perellonieto@bris.ac.uk\n\nTelmo Silva Filho\n\nDepartment of Statistics\n\nUniversidade Federal da Para\u00edba\n\ntelmo@de.ufpb.br\n\nPeter Flach\n\nDepartment of Computer Science\n\nUniversity of Bristol and\nThe Alan Turing Institute\n\npeter.flach@bristol.ac.uk\n\nAbstract\n\nClass probabilities predicted by most multiclass classi\ufb01ers are uncalibrated, often\ntending towards over-con\ufb01dence. With neural networks, calibration can be im-\nproved by temperature scaling, a method to learn a single corrective multiplicative\nfactor for inputs to the last softmax layer. On non-neural models the existing\nmethods apply binary calibration in a pairwise or one-vs-rest fashion. We propose\na natively multiclass calibration method applicable to classi\ufb01ers from any model\nclass, derived from Dirichlet distributions and generalising the beta calibration\nmethod from binary classi\ufb01cation. It is easily implemented with neural nets since it\nis equivalent to log-transforming the uncalibrated probabilities, followed by one lin-\near layer and softmax. Experiments demonstrate improved probabilistic predictions\naccording to multiple measures (con\ufb01dence-ECE, classwise-ECE, log-loss, Brier\nscore) across a wide range of datasets and classi\ufb01ers. 
Parameters of the learned Dirichlet calibration map provide insights to the biases in the uncalibrated model.

1 Introduction

A probabilistic classifier is well-calibrated if among test instances receiving a predicted probability vector p, the class distribution is (approximately) distributed as p. This property is of fundamental importance when using a classifier for cost-sensitive classification, for human decision making, or within an autonomous system. Due to overfitting, most machine learning algorithms produce over-confident models, unless dedicated procedures are applied, such as Laplace smoothing in decision trees [8]. The goal of (post-hoc) calibration methods is to use hold-out validation data to learn a calibration map that transforms the model's predictions to be better calibrated. Meteorologists were among the first to think about calibration, with [3] introducing an evaluation measure for probabilistic forecasts, which we now call Brier score; [21] proposing reliability diagrams, which allow us to visualise calibration (reliability) errors; and [6] discussing proper scoring rules for forecaster evaluation and the decomposition of these loss measures into calibration and refinement losses.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Calibration methods for binary classifiers have been well studied and include: logistic calibration, also known as 'Platt scaling' [24]; binning calibration [26] with either equal-width or equal-frequency bins; isotonic calibration [27]; and beta calibration [15].
Extensions of the above approaches include: [22] which performs Bayesian averaging of multiple calibration maps obtained with equal-frequency binning; [23] which uses near-isotonic regression to allow for some non-monotonic segments in the calibration maps; and [1] which introduces a non-parametric Bayesian isotonic calibration method.

Calibration in multiclass scenarios has been approached by decomposing the problem into k one-vs-rest binary calibration tasks [27], one for each class. The predictions of these k calibration models form unnormalised probability vectors, which, after normalisation, are not guaranteed to be calibrated. Native multiclass calibration methods were introduced recently with a focus on neural networks, including: matrix scaling, vector scaling and temperature scaling [9], which can all be seen as multiclass extensions of Platt scaling and have been proposed as a calibration layer which should be applied to the logits of a neural network, before the softmax layer. An alternative to post-hoc calibration is to modify the classifier learning algorithm itself: MMCE [17] trains neural networks by optimising the combination of log-loss with a kernel-based measure of calibration loss; SWAG [19] models the posterior distribution over the weights of the neural network and then samples from this distribution to perform Bayesian model averaging; [20] proposed a method to transform the classification task into regression and to learn a Gaussian Process model. Calibration methods have been proposed for the regression task as well, including a method by [13] which adopts isotonic regression to calibrate the predicted quantiles.
The theory of calibration functions and empirical calibration evaluation in classification was studied by [25], also proposing a statistical test of calibration.

While there are several calibration methods tailored for deep neural networks, we propose a general-purpose, natively multiclass calibration method called Dirichlet calibration, applicable for calibrating any probabilistic classifier. We also demonstrate that the multiclass setting introduces numerous subtleties that have not always been recognised or correctly dealt with by other authors. For example, some authors use the weaker notion of confidence calibration (our term), which requires only that the classifier's predicted probability for what it considers the most likely class is calibrated. There are also variations in the evaluation metric used and in the way calibrated probabilities are visualised. Consequently, Section 2 is concerned with clarifying such fundamental issues. We then propose the approach of Dirichlet calibration in Section 3, present and discuss experimental results in Section 4, and conclude in Section 5.

2 Evaluation of calibration and temperature scaling

Consider a probabilistic classifier p̂ : X → Δ^k that outputs class probabilities for k classes 1, ..., k. For any given instance x in the feature space X it would output some probability vector p̂(x) = (p̂_1(x), ..., p̂_k(x)) belonging to Δ^k = {(q_1, ..., q_k) ∈ [0,1]^k | ∑_{i=1}^k q_i = 1}, which is the (k−1)-dimensional probability simplex over k classes.

Definition 1. A probabilistic classifier p̂ : X → Δ^k is multiclass-calibrated, or simply calibrated, if for any prediction vector q = (q_1, ..., q_k) ∈ Δ^k, the proportions of classes among all possible instances x getting the same prediction p̂(x) = q are equal to the prediction vector q:

    P(Y = i | p̂(X) = q) = q_i    for i = 1, ..., k.    (1)

One can define several weaker notions of calibration [25] which provide necessary conditions for the model to be fully calibrated. One of these weaker notions was originally proposed by [27], requiring that all one-vs-rest probability estimators obtained from the original multiclass model are calibrated.

Definition 2. A probabilistic classifier p̂ : X → Δ^k is classwise-calibrated, if for any class i and any predicted probability q_i for this class:

    P(Y = i | p̂_i(X) = q_i) = q_i.    (2)

Another weaker notion of calibration was used by [9], requiring that among all instances where the probability of the most likely class is predicted to be c (the confidence), the expected accuracy is c.

Figure 1: Reliability diagrams of c10_resnet_wide32 on CIFAR-10: (a) confidence-reliability before calibration; (b) confidence-reliability after temperature scaling; (c) classwise-reliability for class 2 after temperature scaling; (d) classwise-reliability for class 2 after Dirichlet calibration.

Definition 3. A probabilistic classifier p̂ : X → Δ^k is confidence-calibrated, if for any c ∈ [0,1]:

    P( Y = argmax(p̂(X)) | max(p̂(X)) = c ) = c.    (3)

For practical evaluation purposes these idealistic definitions need to be relaxed. A common approach for checking confidence-calibration is to do equal-width binning of predictions according to confidence level and check if Eq. (3) is approximately satisfied within each bin. This can be visualised using the reliability diagram (which we will call the confidence-reliability diagram), see Fig.
1a, where the wide blue bars show observed accuracy within each bin (the empirical version of the conditional probability in Eq. (3)), and the narrow red bars show the gap between the two sides of Eq. (3). With accuracy below the average confidence in most bins, this figure, for a wide ResNet trained on CIFAR-10, shows over-confidence, typical for neural networks which predict probabilities through the last softmax layer and are trained by minimising cross-entropy.

The calibration method called temperature scaling was proposed by [9]; it uses a hold-out validation set to learn a single temperature parameter t > 0 which decreases confidence (if t > 1) or increases confidence (if t < 1). This is achieved by rescaling the logit vector z (the input to softmax σ), so that instead of σ(z) the predicted class probabilities are obtained as σ(z/t). The confidence-reliability diagram in Fig. 1b shows that the same c10_resnet_wide32 model has come closer to being confidence-calibrated after temperature scaling, having smaller gaps to the accuracy-equals-confidence diagonal. This is reflected in a lower Expected Calibration Error (confidence-ECE), defined as the average gap across bins, weighted by the number of instances in the bin. In fact, confidence-ECE is low enough that the statistical test proposed by [25] with significance level α = 0.01 does not reject the hypothesis that the model is confidence-calibrated (p-value 0.017). The main idea behind this test is that for a perfectly calibrated model, the ECE against actual labels is in expectation equal to the ECE against pseudo-labels which have been drawn from the categorical distributions corresponding to the predicted class probability vectors.
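This resampling test is easy to replicate in code. A minimal NumPy sketch (the 15-bin equal-width binning and all function names below are our own illustrative choices, not the authors' reference implementation):

```python
import numpy as np

def conf_ece(probs, labels, n_bins=15):
    """Confidence-ECE: bin by predicted confidence, then average the
    |accuracy - confidence| gap weighted by the fraction of instances per bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def calibration_test_pvalue(probs, labels, n_resamples=1000, seed=0):
    """p-value = share of pseudo-label ECEs at least as large as the actual ECE."""
    rng = np.random.default_rng(seed)
    actual = conf_ece(probs, labels)
    n = len(labels)
    count = 0
    for _ in range(n_resamples):
        # draw one pseudo-label per instance from its predicted categorical distribution
        u = rng.random(n)
        pseudo = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
        if conf_ece(probs, pseudo) >= actual:
            count += 1
    return count / n_resamples
```

Under this sketch, a grossly over-confident model yields a p-value near zero, while a calibrated one yields a large p-value.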
The above p-value was obtained by randomly drawing 10,000 sets of pseudo-labels and finding 170 of these to have higher ECE than the actual one.

While the above temperature-scaled model is (nearly) confidence-calibrated, it is far from being classwise-calibrated. This becomes evident in Fig. 1c, demonstrating that it systematically over-estimates the probability of instances to belong to class 2, with predicted probability (x-axis) smaller than the observed frequency of class 2 (y-axis) in all the equal-width bins. In contrast, the model systematically under-estimates class 4 probability (Supplementary Fig. 12a). Having only a single tuneable parameter, temperature scaling cannot learn to act differently on different classes. We propose plots such as Fig. 1c,d across all classes to be used for evaluating classwise-calibration, and we will call these the classwise-reliability diagrams. We propose classwise-ECE as a measure of classwise-calibration, defined as the average gap across all classwise-reliability diagrams, weighted by the number of instances in each bin:

    classwise-ECE = (1/k) ∑_{j=1}^{k} ∑_{i=1}^{m} (|B_{i,j}| / n) |y_j(B_{i,j}) − p̂_j(B_{i,j})|    (4)

where k, m, n are the numbers of classes, bins and instances, respectively, |B_{i,j}| denotes the size of the bin, and p̂_j(B_{i,j}) and y_j(B_{i,j}) denote the average prediction of class j probability and the actual proportion of class j in the bin B_{i,j}.

[Figure 1 panel details: (a) conf-ECE = 0.0451 uncalibrated; (b) conf-ECE = 0.0078 after temperature scaling; (c) class-2-ECE = 0.0098 after temperature scaling; (d) class-2-ECE = 0.0033 after Dirichlet calibration.]

The contribution of a single class j to the classwise-ECE will be called class-j-ECE. As seen in Fig. 1(d), the same model gets closer to being class-2-calibrated after applying our proposed Dirichlet calibration. By averaging class-j-ECE across all classes we get the overall classwise-ECE, which for temperature scaling is cwECE = 0.1857 and for Dirichlet calibration cwECE = 0.1795. This small difference in classwise-ECE appears more substantial when running the statistical test of [25], rejecting the null hypothesis that temperature scaling is classwise-calibrated (p < 0.0001), while for Dirichlet calibration the decision depends on the significance level (p = 0.016). A similar measure of classwise-calibration called L2 marginal calibration error was proposed in concurrent work by [16].

Before explaining the Dirichlet calibration method, let us highlight the fundamental limitation of evaluation using any of the above reliability diagrams and ECE measures. Namely, it is easy to obtain almost perfectly calibrated probabilities by predicting the overall class distribution, regardless of the given instance. Therefore, it is always important to consider other evaluation measures as well. In addition to the error rate, the obvious candidates are proper losses (such as Brier score or log-loss), as they evaluate probabilistic predictions and decompose into calibration loss and refinement loss [14]. Proper losses are often used as objective functions in post-hoc calibration methods, which take an uncalibrated probabilistic classifier p̂ and use a hold-out validation dataset to learn a calibration map μ̂ : Δ^k → Δ^k that can be applied as μ̂(p̂(x)) on top of the uncalibrated outputs of the classifier to make them better calibrated.
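Returning to Eq. (4), the classwise-ECE is straightforward to compute; a minimal NumPy sketch assuming equal-width binning (the function name and bin count are our own choices):

```python
import numpy as np

def classwise_ece(probs, labels, n_bins=15):
    """Classwise-ECE of Eq. (4): average over classes of the per-class binned
    gap between mean predicted probability and observed class frequency."""
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    total = 0.0
    for j in range(k):
        bins = np.minimum((probs[:, j] * n_bins).astype(int), n_bins - 1)
        class_j_ece = 0.0
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                # |B_{i,j}|/n * |y_j(B_{i,j}) - p_j(B_{i,j})|
                class_j_ece += (mask.sum() / n) * abs(
                    onehot[mask, j].mean() - probs[mask, j].mean())
        total += class_j_ece
    return total / k
```

For a predictor that always outputs the true class proportions, each per-class gap is zero, illustrating the limitation discussed below.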
Every proper loss is minimised by the same calibration map, known as the canonical calibration function [25] of p̂, defined as

    μ(q) = (P(Y = 1 | p̂(X) = q), ..., P(Y = k | p̂(X) = q))

The goal of Dirichlet calibration, as of any other post-hoc calibration method, is to estimate this canonical calibration map μ for a given probabilistic classifier p̂.

3 Dirichlet calibration

A key decision in designing a calibration method is the choice of parametric family. Our choice was based on the following desiderata: (1) the family needs enough capacity to express biases of particular classes or pairs of classes; (2) the family must contain the identity map for the case where the model is already calibrated; (3) for every map in the family we must be able to provide a semi-reasonable synthetic example where it is the canonical calibration function; (4) the parameters should be interpretable to some extent at least.

Dirichlet calibration map family. Inspired by beta calibration for binary classifiers [15], we consider the distribution of prediction vectors p̂(x) separately on instances of each class, and assume these k distributions are Dirichlet distributions with different parameters:

    p̂(X) | Y = j ∼ Dir(α^(j))    (5)

where α^(j) = (α^(j)_1, ..., α^(j)_k) ∈ (0,∞)^k are the Dirichlet parameters for class j. Combining likelihoods P(p̂(X) | Y) with priors P(Y) expressing the overall class distribution π ∈ Δ^k, we can use Bayes' rule to express the canonical calibration function P(Y | p̂(X)) as follows:

    generative parametrisation:    μ̂_DirGen(q; α, π) = (π_1 f_1(q), ..., π_k f_k(q)) / z    (6)

where z = ∑_{j=1}^k π_j f_j(q) is the normaliser, and f_j is the probability density function of the Dirichlet distribution with parameters α^(j), gathered into a matrix α.
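Eq. (6) can be evaluated directly from the Dirichlet densities, e.g. with SciPy; a small sketch (assuming strictly interior prediction vectors; the function name is ours):

```python
import numpy as np
from scipy.stats import dirichlet

def mu_dir_gen(q, alphas, pi):
    """Generative Dirichlet calibration map of Eq. (6):
    posterior P(Y = j | p_hat(X) = q) proportional to pi_j * Dir(q; alpha^(j))."""
    f = np.array([dirichlet.pdf(q, a) for a in alphas])  # class-conditional densities
    unnorm = pi * f
    return unnorm / unnorm.sum()  # divide by the normaliser z
```

With symmetric parameters such as α^(j) = 1 + e_j and a uniform prior, the class-conditional densities are proportional to q_j, so the posterior reproduces q itself, showing that the identity map lies in the family.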
It will also be convenient to have two alternative parametrisations of the same family: a linear parametrisation for fitting purposes and a canonical parametrisation for interpretation purposes. These parametrisations are defined as follows:

    linear parametrisation:    μ̂_DirLin(q; W, b) = σ(W ln q + b)    (7)

where W ∈ R^{k×k} is a k × k parameter matrix, ln is a vector function that calculates the natural logarithm component-wise, and b ∈ R^k is a parameter vector of length k;

    canonical parametrisation:    μ̂_Dir(q; A, c) = σ(A ln(q / (1/k)) + ln c)    (8)

where each column in the k-by-k matrix A ∈ [0,∞)^{k×k} with non-negative entries contains at least one value 0, division of q by 1/k is component-wise, and c ∈ Δ^k is a probability vector of length k.

Figure 2: Interpretation of Dirichlet calibration maps: (a) calibration map for MLP on the abalone dataset, 4 interpretation points shown by black dots, and canonical parametrisation as a matrix with A, c; (b) canonical parametrisation of a map on SVHN_convnet; (c) changes to the confusion matrix after applying this calibration map.

Theorem 1 (Equivalence of generative, linear and canonical parametrisations). The parametric families μ̂_DirGen(q; α, π), μ̂_DirLin(q; W, b) and μ̂_Dir(q; A, c) are equal, i.e. they contain exactly the same calibration maps.

Proof. All proofs are given in the Supplemental Material.

The benefit of the linear parametrisation is that it can be easily implemented as (additional) layers in a neural network: a logarithmic transformation followed by a fully connected layer with softmax activation.
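A minimal sketch of the linear parametrisation in Eq. (7), which also lets one check numerically that the identity map, and (as shown in the next section) temperature scaling, are in the family (names are ours):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mu_dir_lin(q, W, b):
    """Linear parametrisation of Eq. (7): sigma(W ln q + b)."""
    return softmax(np.log(q) @ W.T + b)
```

With W = I and b = 0 this is the identity map; with W = (1/t) I and b = 0 it reproduces temperature scaling of the underlying logits, since the log of a softmax differs from the logits only by a constant shift.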
Out of the three parametrisations only the canonical parametrisation is unique, in the sense that any function in the Dirichlet calibration map family can be represented by a single pair of a matrix A and a vector c satisfying the requirements set by the canonical parametrisation μ̂_Dir(q; A, c).

Interpretability. In addition to providing uniqueness, the canonical parametrisation is to some extent interpretable. As demonstrated in the proof of Thm. 1 provided in the Supplemental Material, the linear parametrisation W, b obtained after fitting can be easily transformed into the canonical parametrisation by a_ij = w_ij − min_i w_ij and c = σ(W ln u + b), where u = (1/k, ..., 1/k). In the canonical parametrisation, increasing the value of element a_ij in matrix A increases the calibrated probability of class i (and decreases the probabilities of all other classes), with effect size depending on the uncalibrated probability of class j. E.g., element a_{3,9} = 0.63 of Fig. 2b increases class 2 probability whenever class 8 has high predicted probability, modifying decision boundaries and resulting in 26 fewer confusions of class 2 for 8, as seen in Fig. 2c. Looking at the matrix A and vector c, it is hard to know the effect of the calibration map without performing the computations. However, at k + 1 'interpretation points' this is (approximately) possible. One of these is the centre of the probability simplex, which maps to c. The other k points are vectors where one value is (almost) zero and the other values are equal, summing up to 1. Figure 2a shows the 3+1 interpretation points in an example for k = 3, where each arrow visualises the result of calibration (end of arrow) at a particular point (beginning of arrow). The result of the calibration map at the interpretation points in the centres of sides (facets) is each determined by a single column of A only.
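The conversion from the linear to the canonical parametrisation described above can be sketched as follows (the function name is ours):

```python
import numpy as np

def to_canonical(W, b):
    """Convert linear parameters (W, b) to the canonical (A, c):
    a_ij = w_ij - min_i w_ij, so each column of A gets at least one zero,
    and c = softmax(W ln u + b) at the uniform vector u = (1/k, ..., 1/k)."""
    k = W.shape[0]
    A = W - W.min(axis=0, keepdims=True)  # subtract each column's minimum
    u = np.full(k, 1.0 / k)
    z = W @ np.log(u) + b
    e = np.exp(z - z.max())
    c = e / e.sum()
    return A, c
```

The per-column shifts and the softmax normalisation absorb exactly the terms that do not depend on the class index, which is why (A, c) describes the same calibration map as (W, b).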
The k columns of matrix A and the vector c determine, respectively, the behaviour of the calibration map near the k + 1 points

    (ε, (1−ε)/(k−1), ..., (1−ε)/(k−1)), ..., ((1−ε)/(k−1), ..., (1−ε)/(k−1), ε), and (1/k, ..., 1/k)

The first k points are infinitesimally close to the centres of facets of the probability simplex, and the last point is the centre of the whole simplex. For 3 classes these 4 points have been visualised on the simplex in Fig. 2a. The Dirichlet calibration map μ̂_Dir(q; A, c) transforms these k + 1 points into:

    (ε^{a_11}, ..., ε^{a_k1}) / z_1, ..., (ε^{a_1k}, ..., ε^{a_kk}) / z_k, and (c_1, ..., c_k)

where z_i are normalising constants, and a_ij, c_j are elements of the
matrix A and vector c, respectively. However, the effect of each parameter goes beyond the interpretation points and also changes classification decision boundaries. This can be seen for the calibration map for a model SVHN_convnet in Fig. 2b, where larger off-diagonal coefficients a_ij often result in a bigger change in the confusion matrix as seen in Fig. 2c (particularly in the 3rd row and 9th column).

Relationship to other families. For 2 classes, the Dirichlet calibration map family coincides with the beta calibration map family [15]. Although temperature scaling has been defined on logits z, it can be expressed in terms of the model outputs p̂ = σ(z) as well. It turns out that temperature scaling maps all belong to the Dirichlet family, with μ̂_TempS(q; t) = μ̂_DirLin(q; (1/t)I, 0), where I is the identity matrix and 0 is the zero vector (see Prop. 1 in the Supplemental Material). The Dirichlet calibration family is also related to the matrix scaling family μ̂_MatS(z; W, b) = σ(Wz + b) proposed by [9] alongside temperature scaling. Both families use a fully connected layer with softmax activation, but the crucial difference is in the inputs to this layer. Matrix scaling uses logits z, while the linear parametrisation of Dirichlet calibration uses log-transformed probabilities ln(p̂) = ln(σ(z)). As softmax followed by log-transform loses information, matrix scaling has an informational advantage over Dirichlet calibration on deep neural networks, which we will return to in the experiments section.

Fitting and ODIR regularisation. The results of [9] showed poor performance for matrix scaling (with ECE, log-loss, error rate), leading the authors to the conclusion that "[a]ny calibration model with tens of thousands (or more) parameters will overfit to a small validation set, even when applying regularization".
We agree that some overfitting happens, but in our experiments a simple L2 regularisation suffices on many non-neural models, whereas for other cases including deep neural nets we propose a novel ODIR (Off-Diagonal and Intercept Regularisation) scheme, which is efficient enough in fighting overfitting to make both Dirichlet calibration and matrix scaling outperform temperature scaling on many occasions, including cases with 100 classes and hence 10,100 parameters. Fitting of Dirichlet calibration maps is performed by minimising log-loss, with ODIR regularisation terms added to the loss function as follows:

    L = (1/n) ∑_{i=1}^{n} logloss(μ̂_DirLin(p̂(x_i); W, b), y_i) + λ · (1/(k(k−1))) ∑_{i≠j} w_ij² + μ · (1/k) ∑_j b_j²

where (x_i, y_i) are validation instances and w_ij, b_j are elements of W and b, respectively, and λ, μ are hyper-parameters tunable with internal cross-validation on the validation data. The intuition is that the diagonal is allowed to freely follow the biases of classes, whereas the intercept is regularised separately from the off-diagonal elements due to having different scales (additive vs. multiplicative).

Implementation details. Implementation of Dirichlet calibration is straightforward in standard deep neural network frameworks (we used Keras [5] in the neural experiments). Alternatively, it is also possible to use the Newton–Raphson method on the L2-regularised objective function, which is constructed by applying multinomial logistic regression with k features (log-transformed predicted class probabilities). Both the gradient and Hessian matrix can be calculated either analytically or using automatic differentiation libraries (e.g. JAX [2]).
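For concreteness, the ODIR-regularised objective above can be written out directly; a NumPy sketch evaluating the loss for given parameters (names are ours; no optimiser is shown):

```python
import numpy as np

def odir_loss(W, b, probs, labels, lam, mu):
    """Log-loss of the Dirichlet-calibrated predictions plus ODIR penalties:
    lambda on the off-diagonal weights, mu on the intercept; diagonal left free."""
    k = W.shape[0]
    z = np.log(probs) @ W.T + b            # linear parametrisation on log-probs
    z -= z.max(axis=1, keepdims=True)      # stabilise log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(labels)), labels].mean()
    off = W[~np.eye(k, dtype=bool)]        # off-diagonal elements of W
    return (nll
            + lam * (off ** 2).sum() / (k * (k - 1))
            + mu * (b ** 2).sum() / k)
```

At W = I, b = 0 both penalties vanish, so the objective reduces to the plain log-loss of the uncalibrated predictions.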
Such implementations normally yield faster convergence given the convexity of the multinomial logistic loss, and are a better choice with a small number of target classes (tractable Hessian). One can also simply adopt existing implementations of logistic regression (e.g. scikit-learn) with the log-transformed predicted probabilities. If the uncalibrated model outputs zero probability for some class, then this needs to be clipped to a small positive number (we used 2.2e−308, the smallest positive normal number for the type float64 in Python).

4 Experiments

The main goals of our experiments are to: (1) compare the performance of Dirichlet calibration with other general-purpose calibration methods on a wide range of datasets and classifiers; (2) compare Dirichlet calibration with temperature scaling on several deep neural networks and study the effectiveness of ODIR regularisation; and (3) study whether the neural-specific calibration methods outperform general-purpose calibration methods due to the information loss in going from logits to softmax outputs.

Table 1: Calibration methods ranked for p-cw-ECE (Friedman's test significant with p-value 1.19e−118).

             Uncal  DirL2  DirODIR  Beta  TempS  VecS  Isot  FreqB  WidthB
adas           7.4    3.1     3.0    5.1    6.8   3.7   4.1    6.0     5.7
forest         6.1    4.9     4.7    3.2    4.4   4.3   3.6    8.0     5.8
knn            7.7    2.0     4.2    5.0    6.1   4.6   3.0    7.0     5.5
lda            7.3    2.9     4.1    3.5    6.0   3.8   3.7    7.9     5.8
logistic       6.9    3.0     4.3    3.6    5.1   4.3   4.1    8.0     5.6
mlp            5.0    2.5     5.2    2.9    4.3   4.5   5.0    8.6     6.9
nbayes         8.2    1.6     2.3    4.9    7.0   3.2   4.5    6.8     6.5
qda            7.5    2.2     3.6    3.8    6.0   4.6   3.7    8.2     5.5
svc-linear     6.2    3.0     4.4    4.0    4.5   4.0   4.8    8.5     5.5
svc-rbf        5.5    3.8     4.1    4.4    5.0   4.8   3.4    8.3     5.8
tree           3.7    4.6     3.6    5.0    4.2   4.4   4.2    8.0     7.3
avg rank      6.50   3.07    3.97   4.12   5.40  4.21  4.01   7.75    5.98

Table 2: Rankings for log-loss (Friedman's test significant with p-value 1.58e−103).

             Uncal  DirL2  DirODIR  Beta  TempS  VecS  Isot  FreqB  WidthB
adas           8.4    2.1     2.0    4.7    7.8   5.1   4.9    4.8     5.1
forest         7.0    5.7     4.4    3.1    3.9   3.4   5.8    6.8     4.8
knn            8.2    2.8     4.9    6.4    7.2   5.4   2.7    4.4     3.0
lda            7.4    1.9     3.0    3.5    5.6   4.0   7.2    7.1     5.2
logistic       7.6    1.7     2.5    3.2    5.3   4.0   8.1    7.6     5.0
mlp            4.5    2.7     4.5    3.1    3.5   4.0   8.0    8.6     6.2
nbayes         8.6    1.4     2.5    4.7    6.2   4.4   5.8    4.9     6.5
qda            7.5    2.2     3.0    3.8    5.4   4.5   6.7    7.3     4.6
svc-linear     6.7    1.9     2.3    3.7    4.4   3.7   8.1    8.1     6.0
svc-rbf        7.6    4.0     3.4    3.7    5.7   2.9   6.0    5.4     6.3
tree           6.2    3.6     5.7    6.2    7.4   6.7   2.0    4.0     3.2
avg rank      7.24   2.73    3.48   4.21   5.66  4.38  5.93   6.28    5.10

4.1 Calibration of non-neural models

Experimental setup.
Calibration methods were compared on 21 UCI datasets (abalone, balance-scale, car, cleveland, dermatology, glass, iris, landsat-satellite, libras-movement, mfeat-karhunen, mfeat-morphological, mfeat-zernike, optdigits, page-blocks, pendigits, segment, shuttle, vehicle, vowel, waveform-5000, yeast) with 11 classifiers: multiclass logistic regression (logistic), naive Bayes (nbayes), random forest (forest), adaboost on trees (adas), linear discriminant analysis (lda), quadratic discriminant analysis (qda), decision tree (tree), K-nearest neighbours (knn), multilayer perceptron (mlp), and support vector machines with linear (svc-linear) and RBF (svc-rbf) kernels.

In each of the 21 × 11 = 231 settings¹ we performed nested cross-validation to evaluate 8 calibration methods: one-vs-rest isotonic calibration (Isot), which learns an isotonic calibration map on each class vs rest separately and renormalises the individual calibration map outputs to add up to one at test time; one-vs-rest equal-width binning (WidthB), where one-vs-rest calibration maps predict the empirical proportion of labels in each of the equal-width bins of the range [0,1]; one-vs-rest equal-frequency binning (FreqB), constructing bins with equal numbers of instances; one-vs-rest beta calibration (Beta); temperature scaling (TempS); vector scaling (VecS), which restricts the matrix scaling family, fixing off-diagonal elements to zero [9] (here applied on log-transformed class probabilities instead of logits); and Dirichlet calibration with L2 (DirL2) and with ODIR (DirODIR) regularisation. We used 3-fold internal cross-validation to train the calibration maps within the 5×5-fold external cross-validation. Following [24], the 3 calibration maps learned in the internal cross-validation were all used as an ensemble by averaging their predictions.
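As noted in the implementation details of Section 3, an L2-regularised Dirichlet calibration map of the DirL2 kind can be obtained from off-the-shelf multinomial logistic regression on log-transformed probabilities; a sketch with scikit-learn (function names are ours; C is the inverse regularisation strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_dirichlet_l2(val_probs, val_labels, C=1.0):
    """Fit a Dirichlet calibration map: multinomial logistic regression
    on log-transformed predicted probabilities from a validation set."""
    eps = np.finfo(np.float64).tiny  # clip zero probabilities before the log
    X = np.log(np.clip(val_probs, eps, 1.0))
    return LogisticRegression(C=C, max_iter=1000).fit(X, val_labels)

def calibrate(clf, probs):
    """Apply the fitted calibration map to new predicted probabilities."""
    eps = np.finfo(np.float64).tiny
    return clf.predict_proba(np.log(np.clip(probs, eps, 1.0)))
```

On synthetic over-confident predictions (e.g. squared and renormalised true probabilities), fitting such a map on held-out data reduces the log-loss, since the inverse distortion lies inside the Dirichlet family.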
For calibration methods with hyperparameters, we used the training folds of the classifier to choose the hyperparameter values with the lowest log-loss.

We used 8 evaluation measures: accuracy, log-loss, Brier score, maximum calibration error (MCE), confidence-ECE (conf-ECE), classwise-ECE (cw-ECE), as well as the significance measures p-conf-ECE and p-cw-ECE, which evaluate how often the respective ECE measures are not significantly higher than expected under calibration. For p-conf-ECE and p-cw-ECE we used significance level α = 0.05 in the test of [25], as explained in Section 2, and counted the proportion of significance tests accepting the model as calibrated out of the 5×5 cases of external cross-validation. With each of the 8 evaluation measures we ranked the methods on each of the 21 × 11 tasks and performed Friedman tests to find statistical differences [7]. When the p-value of the Friedman test was under 0.005, we performed a post-hoc one-tailed Bonferroni-Dunn test to obtain Critical Differences (CDs), which indicate the minimum ranking difference for methods to be considered significantly different. Further details of the experimental setup are provided in the Supplemental Material.

Results. The results showed that Dirichlet calibration with ODIR or L2 regularisation was the best method based on log-loss and p-cw-ECE, and in the group of best calibrators for the other measures except MCE (WidthB was the best for MCE, with all other calibration methods in the second-best group). After grouping the results by the classifier learning algorithm, the average ranks with respect to log-loss are shown in Table 2, and with respect to p-cw-ECE in Table 1. The critical difference diagram for p-cw-ECE is presented in Fig. 3a. Fig. 
3b shows the average p-cw-ECE for each calibration method across all datasets and how frequently the statistical test accepted the null hypothesis of the classifier being calibrated (higher p-cw-ECE is better). The results show that DirL2 was considered calibrated on more than 70% of the p-cw-ECE tests. An evaluation of classwise-calibration without post-hoc calibration is given in Fig. 3c. Note that svc-linear and svc-rbf have an unfair advantage because their sklearn implementation uses Platt scaling with 3-fold internal cross-validation to provide probabilities.

Figure 3: Summarised results for p-cw-ECE: (a) CD diagram; (b) proportion of times each calibrator was calibrated (α = 0.05); (c) proportion of times each classifier was already calibrated (α = 0.05).

¹Naive Bayes and QDA ran too long on dataset shuttle, leaving us with a total of 229 sets of results.

The Supplemental Material contains the final ranking tables and CD diagrams for every metric, an analysis of the best calibrator hyperparameters, and a more detailed comparison of classwise calibration for the 11 classifiers.

4.2 Calibration of deep neural networks

Experimental setup. We used 3 datasets (CIFAR-10, CIFAR-100 and SVHN), training 11 deep convolutional neural nets with various architectures: ResNet 110 [10], ResNet 110 SD [12], ResNet 152 SD [12], DenseNet 40 [11], WideNet 32 [28], and LeNet 5 [18], and acquiring 3 pretrained models from [4]. For the latter we set aside 5,000 test instances for fitting the calibration map. For the other models we followed [9], setting aside 5,000 training instances (6,000 for SVHN) for calibration purposes and training the models as in the original papers. For calibration methods with hyperparameters we used 5-fold cross-validation on the validation set to find the optimal regularisation parameters. 
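For reference, the classwise-ECE reported in these experiments can be estimated as sketched below: equal-width binning of each class's predicted probability and averaging of the per-class binned errors, consistent with the measure's description; the bin count is an assumption here (the exact estimator is defined in Section 2).

```python
import numpy as np

def classwise_ece(probs, labels, n_bins=15):
    """Classwise-ECE sketch: for each class j, bin that class's predicted
    probability into equal-width bins, accumulate the bin-weighted gap
    between empirical frequency of class j and mean predicted probability,
    then average over the classes."""
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    total = 0.0
    for j in range(k):
        p, y = probs[:, j], onehot[:, j]
        bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total / k
```

Unlike confidence-ECE, this penalises miscalibration on every class probability, not only on the most likely class.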
We used all 5 calibration models fitted with the optimal hyperparameter values as an ensemble, averaging their predictions as in [24]. Among general-purpose calibration methods, we compared 2 variants of Dirichlet calibration (with L2 regularisation and with ODIR) against temperature scaling (as discussed in Section 3, it can equivalently act on probabilities instead of logits and is therefore general-purpose). Other methods from our non-neural experiment were not included, as these were outperformed by temperature scaling in the experiments of [9]. Among methods that use logits (neural-specific calibration methods) we included matrix scaling with ODIR regularisation, and vector scaling. As reported by [9], non-regularised matrix scaling performed very poorly and was not included in our comparisons. Full details and source code for training the models are in the Supplemental Material.

Results. Tables 3 and 4 show that the best among the three general-purpose calibration methods depends heavily on the model and dataset. Both variants of Dirichlet calibration (with L2 and with ODIR) outperformed temperature scaling in most cases on CIFAR-10. On CIFAR-100, Dir-L2 is poor, but Dir-ODIR outperforms TempS in cw-ECE, showing the effectiveness of ODIR regularisation. However, this comes at the expense of a minor increase in log-loss. According to the average rank across all deep net experiments, Dir-ODIR is best, but without statistical significance.

The full comparison including calibration methods that use logits confirms that the information loss going from logits to softmax outputs has an effect: MS-ODIR (matrix scaling with ODIR) outperforms Dir-ODIR in 8 out of 14 cases on cw-ECE and in 11 out of 14 on log-loss. However, the effect is usually numerically very small, as the average relative reduction of cw-ECE and log-loss is less than 1% (compared to an average relative reduction of over 30% from the uncalibrated model). 
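To make the compared families concrete: Dirichlet calibration fits q = softmax(W log p + b) on probabilities, matrix scaling fits the same linear-plus-softmax map on logits, temperature scaling is the special case W = I/t with b = 0, and vector scaling restricts W to be diagonal. The sketch below fits the map by full-batch gradient descent with an ODIR-style penalty; the penalty form λ·Σ_{i≠j} W_ij² + μ·Σ_j b_j² and the optimiser settings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_dirichlet_odir(probs, labels, lam=0.1, mu=0.1,
                       lr=0.05, steps=4000, eps=1e-12):
    """Fit q = softmax(W @ log(p) + b) by gradient descent on NLL plus an
    ODIR-style penalty: lam penalises off-diagonal entries of W and mu
    penalises the intercepts b (assumed penalty form, for illustration)."""
    X = np.log(np.clip(probs, eps, 1.0))      # (n, k) log-probabilities
    n, k = X.shape
    Y = np.eye(k)[labels]                     # one-hot labels, (n, k)
    W, b = np.eye(k), np.zeros(k)             # start at the identity map
    off = 1.0 - np.eye(k)                     # off-diagonal mask
    for _ in range(steps):
        Z = X @ W.T + b
        Z = Z - Z.max(axis=1, keepdims=True)  # numerically stable softmax
        Q = np.exp(Z)
        Q /= Q.sum(axis=1, keepdims=True)
        G = (Q - Y) / n                       # gradient of NLL w.r.t. Z
        W -= lr * (G.T @ X + 2 * lam * off * W)
        b -= lr * (G.sum(axis=0) + 2 * mu * b)

    def calibrate(p):
        Z = np.log(np.clip(p, eps, 1.0)) @ W.T + b
        Z = Z - Z.max(axis=1, keepdims=True)
        Q = np.exp(Z)
        return Q / Q.sum(axis=1, keepdims=True)

    return calibrate
```

Setting lam very high drives the off-diagonal entries of W towards zero, recovering a vector-scaling-like map, which mirrors the MS-ODIR-zero comparison discussed below.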
According to the average rank on cw-ECE, the best method is vector scaling, but this comes at the expense of increased log-loss. According to the average rank on log-loss, the best method is MS-ODIR, while its cw-ECE is on average 2% higher than for vector scaling.

[Table 3: Scores and ranking of calibration methods for cw-ECE. Table 4: Scores and ranking of calibration methods for log-loss. Each compares the uncalibrated model and the general-purpose calibrators (TempS, Dir-L2, Dir-ODIR) against the calibrators using logits (VecS, MS-ODIR) across the 14 dataset–model pairs; the tabular contents were not recoverable from the extraction.]

As the difference between MS-ODIR and vector scaling was quite small on some models, we further investigated the importance of the off-diagonal coefficients in MS-ODIR. For this we introduced a new model, MS-ODIR-zero, obtained from the respective MS-ODIR model by replacing the off-diagonal entries with zeroes. In 6 out of 14 cases (c10_convnet, c10_densenet40, c10_resnet110_SD, c100_convnet, c100_resnet110_SD, SVHN_resnet152_SD) MS-ODIR-zero and MS-ODIR had almost identical performance (difference in log-loss of less than 0.0001), indicating that ODIR regularisation had forced the off-diagonal entries to practically zero. However, MS-ODIR-zero was significantly worse in the remaining 8 out of 14 cases, indicating that the learned off-diagonal coefficients in MS-ODIR were meaningful. In all those cases MS-ODIR outperformed VecS in log-loss. To eliminate the potential explanation that this could be due to random chance, we retrained these networks on 2 more train-test splits (except for the pretrained SVHN_convnet). In all the reruns MS-ODIR remained better than VecS, confirming that it is important to model the pairwise effects between classes in these cases. Detailed results are presented in the Supplemental Material.

5 Conclusion

In this paper we proposed a new parametric general-purpose multiclass calibration method called Dirichlet calibration, which is a natural extension of two-class beta calibration. 
Dirichlet calibration is easy to implement as a layer in a neural net, or as multinomial logistic regression on log-transformed class probabilities. Its parameters provide insight into the biases of the model. While derived from Dirichlet-distributed likelihoods, it does not assume that the probability vectors are actually Dirichlet-distributed within each class, just as logistic calibration (Platt scaling), although derivable from Gaussian likelihoods, does not assume that the scores are Gaussian-distributed. Comparisons with other general-purpose calibration methods across 21 datasets × 11 models showed best or tied-best performance for Dirichlet calibration on all 8 evaluation measures. Evaluation with our proposed classwise-ECE measure quantifies how calibrated the predicted probabilities are on all classes, not only on the most likely predicted class as with the commonly used (confidence-)ECE. On neural networks we advance the state of the art by introducing the ODIR regularisation scheme for matrix scaling and Dirichlet calibration, leading these to outperform temperature scaling on many deep neural networks.

Interestingly, on many deep nets Dirichlet calibration learns a map which is very close to being in the temperature scaling family. This raises a fundamental theoretical question: which neural architectures and training methods result in a classifier whose canonical calibration function is contained in the temperature scaling family? 
But even in those cases Dirichlet calibration can become useful after any kind of dataset shift, learning an interpretable calibration map to reveal the shift and recalibrate the predictions for the new context.

Deriving calibration maps from Dirichlet distributions opens up the possibility of using other distributions of the exponential family to obtain new calibration maps designed for various score types, as well as investigating scores coming from mixtures of distributions inside each class.

Acknowledgements

The work of MKu and MKä was supported by the Estonian Research Council under grant PUT1458. The work of MPN and HS was supported by the SPHERE Next Steps Project funded by the UK Engineering and Physical Sciences Research Council (EPSRC), Grant EP/R005273/1. The work of PF and HS was supported by The Alan Turing Institute under EPSRC Grant EP/N510129/1.

References

[1] M.-L. Allikivi and M. Kull. Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification. In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD'19), pages 68–85. Springer, 2019.

[2] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.

[3] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[4] A. Cheni. Base pretrained models and datasets in PyTorch, 2017.

[5] F. Chollet et al. Keras. https://keras.io, 2015.

[6] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32(1/2):12–22, 1983.

[7] J. Demšar. Statistical comparisons of classifiers over multiple data sets. J. Machine Learning Research, 7(Jan):1–30, 2006.

[8] C. Ferri, P. A. Flach, and J. 
Hernández-Orallo. Improving the AUC of probabilistic estimation trees. In European Conference on Machine Learning, pages 121–132. Springer, 2003.

[9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Thirty-fourth International Conference on Machine Learning, Sydney, Australia, June 2017.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[11] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.

[12] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.

[13] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.

[14] M. Kull and P. Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Machine Learning and Knowledge Discovery in Databases (ECML-PKDD'15), pages 68–85. Springer, 2015.

[15] M. Kull, T. M. Silva Filho, and P. Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electron. J. Statist., 11(2):5052–5080, 2017.

[16] A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS'19), 2019.

[17] A. Kumar, S. Sarawagi, and U. Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2805–2814, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] W. Maddox, T. Garipov, P. Izmailov, D. P. Vetrov, and A. G. Wilson. A simple baseline for Bayesian uncertainty in deep learning. CoRR, abs/1902.02476, 2019.

[20] D. Milios, R. Camoriano, P. Michiardi, L. Rosasco, and M. Filippone. Dirichlet-based Gaussian processes for large-scale calibrated classification. In Advances in Neural Information Processing Systems, pages 6005–6015, 2018.

[21] A. H. Murphy and R. L. Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(1):41–47, 1977.

[22] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI Conference on Artificial Intelligence, 2015.

[23] M. P. Naeini and G. F. Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 360–369. IEEE, 2016.

[24] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.

[25] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. Schön. Evaluating model calibration in classification. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 3459–3467. PMLR, 16–18 Apr 2019.

[26] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proc. 18th Int. Conf. on Machine Learning (ICML'01), pages 609–616, 2001.

[27] B. Zadrozny and C. Elkan. 
Transforming classifier scores into accurate multiclass probability estimates. In Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), pages 694–699. ACM, 2002.

[28] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.