{"title": "Calibrated Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3474, "page_last": 3482, "abstract": "In user-facing applications, displaying calibrated confidence measures---probabilities that correspond to true frequency---can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.", "full_text": "Calibrated Structured Prediction\n\nVolodymyr Kuleshov\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\nPercy Liang\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\nAbstract\n\nIn user-facing applications, displaying calibrated con\ufb01dence measures\u2014\nprobabilities that correspond to true frequency\u2014can be as important as obtaining\nhigh accuracy. We are interested in calibration for structured prediction problems\nsuch as speech recognition, optical character recognition, and medical diagnosis.\nStructured prediction presents new challenges for calibration: the output space is\nlarge, and users may issue many types of probability queries (e.g., marginals) on\nthe structured output. 
We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.

1 Introduction

Applications such as speech recognition [1], medical diagnosis [2], optical character recognition [3], machine translation [4], and scene labeling [5] have two properties: (i) they are instances of structured prediction, where the predicted output is a complex structured object; and (ii) they are user-facing applications for which it is important to provide accurate estimates of confidence. This paper explores confidence estimation for structured prediction.

Central to this paper is the idea of probability calibration [6, 7, 8, 9], which is prominent in the meteorology [10] and econometrics [9] literature. Calibration requires that the probability a system outputs for an event reflect the true frequency of that event: of the times that a system says it will rain with probability 0.3, it should rain 30% of the time. In the context of structured prediction, we do not have a single event or a fixed set of events, but rather a multitude of events that depend on the input, corresponding to different conditional and marginal probabilities that one could ask of a structured prediction model. 
We must therefore extend the definition of calibration in a way that deals with the complexities that arise in the structured setting.

We also consider the practical question of building a system that outputs calibrated probabilities. We introduce a new framework for calibration in structured prediction, which involves defining probabilities of interest, and then training binary classifiers to predict these probabilities based on a set of features. Our framework generalizes current methods for binary and multiclass classification [11, 12, 13], which predict class probabilities based on a single feature, the uncalibrated prediction score. In structured prediction, the space of interesting probabilities and useful features is considerably richer. This motivates us to introduce a new concept of events as well as a range of new features (margin, pseudomargin) which have varying computational demands. We perform a thorough study of which features yield good calibration, and find that domain-general features are quite good for calibrating MAP and marginal estimates over three tasks: object recognition, optical character recognition, and scene understanding. Interestingly, features based on MAP inference alone can achieve good calibration on marginal probabilities (which can be more difficult to compute).

Figure 1: In the context of an OCR system, our framework augments the structured predictor with calibrated confidence measures for a set of events, e.g., whether the first letter is "l".

2 Background

2.1 Structured Prediction

In structured prediction, we want to assign a structured label y = (y1, . . . , yL) ∈ Y to an input x ∈ X. 
For example, in optical character recognition (OCR), x is a sequence of images and y is the sequence of associated characters (see Figure 1(a)); note that the number of possible outputs y for a given x may be exponentially large.

A common approach to structured prediction is conditional random fields (CRFs), where we posit a probabilistic model pθ(y | x). We train pθ by optimizing a maximum-likelihood or a max-margin objective over a training set {(x^(i), y^(i))}, i = 1, . . . , n, assumed to be drawn i.i.d. from an unknown data-generating distribution P(x, y). The promise of a probabilistic model is that in addition to computing the most likely output ŷ = arg max_y pθ(y | x), we can also get its probability pθ(y = ŷ | x) ∈ [0, 1], or even marginal probabilities pθ(y1 = ŷ1 | x) ∈ [0, 1].

2.2 Probabilistic Forecasting

Probabilities from a CRF pθ are just numbers that sum to 1. In order for these probabilities to be useful as confidence measures, we would ideally like them to be calibrated. Calibration intuitively means that whenever a forecaster assigns 0.7 probability to an event, the event should actually hold about 70% of the time. In the case of binary classification (Y = {0, 1}), we say that a forecaster F : X → [0, 1] is perfectly calibrated if for all possible probabilities p ∈ [0, 1]:

P[y = 1 | F(x) = p] = p.    (1)

Calibration by itself does not guarantee a useful confidence measure. A forecaster that always outputs the marginal class probability F(x) = P(y = 1) is calibrated but useless for accurate prediction. Good forecasts must also be sharp, i.e., their probabilities should be close to 0 or 1.

Calibration and sharpness. Given a forecaster F : X → [0, 1], define T(x) = E[y | F(x)] to be the true probability of y = 1 given that x received the forecast F(x). 
We can use T to decompose the ℓ2 prediction loss as follows:

E[(y − F(x))^2] = E[(y − T(x))^2] + E[(T(x) − F(x))^2]    (2)
               = Var[y] − Var[T(x)] + E[(T(x) − F(x))^2],  (3)

where Var[y] is the uncertainty, Var[T(x)] is the sharpness, and E[(T(x) − F(x))^2] is the calibration error. The first equality follows because y − T(x) has expectation 0 conditioned on F(x), and the second equality follows from the variance decomposition of y onto F(x).

The three terms in (3) formalize our intuitions about calibration and sharpness [7]. The calibration term measures how close the predicted probability is to the true probability over that region, and is a natural generalization of perfect calibration (1) (which corresponds to zero calibration error). The sharpness term measures how much variation there is in the true probability across forecasts. It does not depend on the numerical value of the forecaster F(x), but only on the induced grouping of points; it is maximized by making F(x) closer to 0 and 1. Uncertainty does not depend on the forecaster and can be mostly ignored; note that it is always greater than sharpness and thus ensures that the loss stays positive.

[Figure 1, referenced earlier: (a) the structured prediction model over y1, . . . , y4 = "l", "a", "n", "d"; (b) the forecaster output: P[y = "land"] = 0.8, P[y1 = "l"] = 0.8, P[y2 = "a"] = 0.9, P[y3 = "n"] = 0.9, P[y4 = "d"] = 0.8.]

Examples. To illustrate the difference between calibration error (lower is better) and sharpness (higher is better), consider the following binary classification example: we have a uniform distribution (P(x) = 1/3) over inputs X = {0, 1, 2}. For x ∈ {0, 1}, y = x with probability 1, and for x = 2, y is either 0 or 1, each with probability 1/2.

                                 x = 0   x = 1   x = 2   calib.  sharp.
true P(y | x)                    0       1       0.5     0       0.167
calibrated, unsharp pθ(y | x)    0.5     0.5     0.5     0       0
uncalibrated, sharp pθ(y | x)    0.2     0.8     0.4     0.03    0.167
balanced pθ(y | x)               0       0.75    0.75    0       0.125

Setting pθ(y | x) ≡ 0.5 would achieve perfect calibration (0) but no sharpness (0). We can get excellent sharpness (0.167) but suffer in calibration (0.03) by predicting probabilities 0.2, 0.8, 0.4. We can trade off some sharpness (0.125) for perfect calibration (0) by predicting 0 for x = 0 and 0.75 for x ∈ {1, 2}.

Discretized probabilities. We have assumed so far that the forecaster might return arbitrary probabilities in [0, 1]. In this case, we might need an infinite amount of data to estimate T(x) = E[y | F(x)] accurately for each value of F(x). In order to estimate calibration and sharpness from finite data, we use a discretized version of calibration and sharpness. Let B be a partitioning of the interval [0, 1]; for example B = {[0, 0.1), [0.1, 0.2), . . .}. Let B : [0, 1] → B map a probability p to the interval B(p) containing p; e.g., B(0.15) = [0.1, 0.2). In this case, we simply redefine T(x) to be the true probability of y = 1 given that F(x) lies in a bucket: T(x) = E[y | B(F(x))]. It is not hard to see that discretized calibration estimates form an upper bound on the calibration error (3) [14].

3 Calibration in the Context of Structured Prediction

We have so far presented calibration in the context of binary classification. In this section, we extend these definitions to structured prediction. Our ultimate motivation is to construct forecasters that augment pre-trained structured models pθ(y | x) with confidence estimates. 
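The decomposition (3) and the toy example above can be checked numerically. Below is a minimal pure-Python sketch; the helper name and the grouping-by-forecast logic are ours, not part of the paper:

```python
from collections import defaultdict

# Toy distribution from the example above: P(x) = 1/3 for x in {0, 1, 2},
# and the true probability that y = 1 given x.
true_p = {0: 0.0, 1: 1.0, 2: 0.5}

def calibration_and_sharpness(forecast):
    """Compute calibration error E[(T(x) - F(x))^2] and sharpness Var[T(x)],
    where T(x) = E[y | F(x)] pools all x that receive the same forecast."""
    groups = defaultdict(list)
    for x in true_p:
        groups[forecast[x]].append(x)
    T = {}  # T(x) for each x, via the group sharing its forecast (uniform P(x))
    for p, xs in groups.items():
        t = sum(true_p[x] for x in xs) / len(xs)
        for x in xs:
            T[x] = t
    n = len(true_p)
    calibration = sum((T[x] - forecast[x]) ** 2 for x in true_p) / n
    mean_T = sum(T.values()) / n
    sharpness = sum((t - mean_T) ** 2 for t in T.values()) / n
    return calibration, sharpness

# The three forecasters from the table above.
print(calibration_and_sharpness({0: 0.5, 1: 0.5, 2: 0.5}))    # calibrated, unsharp
print(calibration_and_sharpness({0: 0.2, 1: 0.8, 2: 0.4}))    # sharp but miscalibrated
print(calibration_and_sharpness({0: 0.0, 1: 0.75, 2: 0.75}))  # balanced
```

Up to floating-point rounding, this reproduces the (calib., sharp.) columns of the table: (0, 0), (0.03, 0.167), and (0, 0.125).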
Unlike in the multiclass setting [12], we cannot learn a forecaster F_y : X → [0, 1] that targets P(y | x) for each y ∈ Y, because the cardinality of Y is too large; in fact, the user will probably not be interested in every y.

Events of interest. Instead, we assume that for a given x and associated prediction y, the user is interested in a set I(x) of events concerning x and y. An event E ∈ I(x) is a subset E ⊆ Y; we would like to determine the probability P(y ∈ E | x) for each E ∈ I(x). Here are two useful types of events that will serve as running examples:

1. {MAP(x)}, which encodes whether MAP(x) = arg max_y pθ(y | x) is correct.
2. {y : yj = MAP(x)j}, which encodes whether the label at position j in MAP(x) is correct.

In the OCR example (Figure 1), suppose we predict MAP(x) = "land". Define the events of interest to be the MAP and the marginals: I(x) = {{MAP(x)}, {y : y1 = MAP(x)1}, . . . , {y : yL = MAP(x)L}}. Then we have I(x) = {{"land"}, {y : y1 = "l"}, {y : y2 = "a"}, {y : y3 = "n"}, {y : y4 = "d"}}. Note that the events of interest I(x) depend on x through MAP(x).

Event pooling. We now define calibration in analogy with (1). We will construct a forecaster F(x, E) that tries to predict P(y ∈ E | x). As we remarked earlier, we cannot make a statement that holds uniformly for all events E; we can only make a guarantee in expectation. Thus, let E be drawn uniformly from I(x), so that P is extended to be a joint distribution over (x, y, E). We say that a forecaster F : X × 2^Y → [0, 1] is perfectly calibrated if

P(y ∈ E | F(x, E) = p) = p.    (4)

In other words, averaged over all x, y and events of interest E ∈ I(x), whenever the forecaster outputs probability p, the event E actually holds with probability p. 
Note that this definition corresponds to perfect binary calibration (1) for the transformed pair of variables y′ = I[y ∈ E], x′ = (x, E). As an example, if I(x) = {{MAP(x)}}, then (4) says that of all the MAP predictions with confidence p, a p fraction will be correct. If I(x) = {{y : yj = MAP(x)j} : j = 1, . . . , L}, then (4) states that out of all the marginals (pooled together across all samples x and all positions j) with confidence p, a p fraction will be correct.

Algorithm 1: Recalibration procedure for calibrated structured prediction.
Input: Features φ(x, E) from trained model pθ, event set I(x), recalibration set S = {(xi, yi)}, i = 1, . . . , n.
Output: Forecaster F(x, E).
Construct the events dataset: Sbinary = {(φ(x, E), I[y ∈ E]) : (x, y) ∈ S, E ∈ I(x)}.
Train the forecaster F (e.g., k-NN or decision trees) on Sbinary.

The second example hints at an important subtlety inherent to having multiple events in structured prediction. The confidence scores for marginals are only calibrated when averaged over all positions. If a user only looked at the marginals for the first position, she might be sorely disappointed. As an extreme example, suppose y = (y1, y2), where y1 is 0 or 1 each with probability 1/2 and y2 ≡ 1. Then a forecaster that outputs a confidence of 0.75 for both events {y : y1 = 1} and {y : y2 = 1} will be perfectly calibrated. However, neither event is calibrated in isolation (P(y1 = 1 | x) = 1/2 and P(y2 = 1 | x) = 1). 
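Algorithm 1 can be sketched in a few lines. The following is a minimal illustration rather than the paper's implementation: it simulates a recalibration set with a single hypothetical margin-like feature, and uses histogram binning as a simple region-based stand-in for the k-NN or decision-tree forecaster:

```python
import random
from collections import defaultdict

random.seed(0)

def make_event_dataset(n):
    """Simulated events dataset Sbinary: pairs (phi(x, E), I[y in E]).
    The single feature is a made-up margin-like score in [0, 1], constructed
    so that the event holds more often when the feature is large."""
    data = []
    for _ in range(n):
        margin = random.random()            # stand-in for phi(x, E)
        correct = random.random() < margin  # event I[y in E]
        data.append((margin, int(correct)))
    return data

def train_forecaster(event_data, num_buckets=10):
    """Region-based forecaster: partition the 1-d feature space into buckets
    and output the empirical event frequency F_R in each bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [event count, total]
    for phi, label in event_data:
        b = min(int(phi * num_buckets), num_buckets - 1)
        counts[b][0] += label
        counts[b][1] += 1
    return {b: hits / total for b, (hits, total) in counts.items()}

F = train_forecaster(make_event_dataset(5000))
# Forecasts increase with the feature and sit near the bucket centers.
print([round(F[b], 2) for b in sorted(F)])
```

In the paper's setting, `make_event_dataset` corresponds to constructing Sbinary from the recalibration set S and events I(x), and the binning step stands in for the trained classifier F.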
Finally, perfect calibration can be relaxed; following (3), we may define the calibration error to be E[(T(x, E) − F(x, E))^2], where T(x, E) := P(y ∈ E | F(x, E)).

4 Constructing Calibrated Forecasters

Having discussed the aspects of calibration specific to structured prediction, let us now turn to the problem of constructing calibrated (and sharp) forecasters from finite data.

Recalibration framework. We propose a framework that generalizes existing recalibration strategies to structured prediction models pθ. First, the user specifies a set of events of interest I(x) as well as features φ(x, E), which will in general depend on the trained model pθ. We then train a forecaster F to predict whether the event E holds (i.e., I[y ∈ E]) given features φ(x, E). We train F by minimizing the empirical ℓ2 loss over a recalibration set S (disjoint from the training examples):

min_F Σ_{(x,y)∈S} Σ_{E∈I(x)} (F(x, E) − I[y ∈ E])^2.

Algorithm 1 outlines our procedure.

As an example, consider again the OCR setting in Figure 1. The margin feature φ(x, E) = log pθ(MAP(1)(x)) − log pθ(MAP(2)(x)) (where MAP(1)(x) and MAP(2)(x) are the first and second highest scoring labels for x according to pθ, respectively) will typically correlate with the event that the MAP prediction is correct. We can perform isotonic regression using this feature on the recalibration set S to produce well-calibrated probabilities.

In the limit of infinite data, Algorithm 1 minimizes the expected loss E[(F(x, E) − I[y ∈ E])^2], where the expectation is over (x, y, E). By (3), the calibration error E[(T(x, E) − F(x, E))^2] will also be small. If there are not too many features φ, we can drive the ℓ2 loss close to zero with a nonparametric method such as k-NN. 
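For intuition on the margin feature above: in a log-linear model where pθ(y | x) ∝ exp(score(y)), the log-probability gap between the top two outputs equals the gap of their unnormalized scores, since the partition function cancels. A small sketch with made-up candidate words and scores (both are hypothetical, not from the paper's data):

```python
def margin_feature(scores):
    """phi(x, E) = log p(MAP1 | x) - log p(MAP2 | x).
    For a log-linear model this is the gap between the two highest
    unnormalized log-scores, because the normalizer cancels."""
    top2 = sorted(scores.values(), reverse=True)[:2]
    return top2[0] - top2[1]

# Hypothetical log-potentials for candidate words in the OCR example.
scores = {"land": 4.1, "lend": 2.3, "band": 1.7}
print(round(margin_feature(scores), 2))  # -> 1.8
```

A large margin suggests the model strongly prefers its MAP output, which is why this single feature is already a useful input to the recalibrator.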
This is also why isotonic regression is sensible for binary recalibration: we first project the data into a highly informative one-dimensional feature space; then we predict labels from that space to obtain small ℓ2 loss.

Note also that standard multiclass recalibration is a special case of this framework, where we use the raw uncalibrated score from pθ as a single feature. In the structured setting, one must invest careful thought in the choice of classifier and features; we discuss these choices below.

Features. Calibration is possible even with a single constant feature (e.g., φ(x, E) ≡ 1), but sharpness depends strongly on the features' quality. If φ collapses points of opposite labels, no forecaster will be able to separate them and be sharp. While we want informative features, we can only afford to have a few, since our recalibration set is typically small.

Compared to calibration for binary classification, our choice of features must also be informed by their computational requirements: the most informative features might require performing full inference in an intractable model. It is therefore useful to think of features as belonging to one of three types, depending on whether they are derived from unstructured classifiers (e.g., an SVM trained individually on each label), MAP inference, or marginal inference. 
In Section 5, we will show that marginal inference produces the sharpest features, but clever MAP-based features can do almost as well. In Table 1, we propose several features that follow our guiding principles and that illustrate the computational tradeoffs inherent to structured prediction.

Table 1: Features for MAP recalibration (I(x) = {{y^MAP(x)}}) and marginal recalibration (I(x) = {{y : yj = y^MAP(x)j} : j = 1, . . . , L}).

MAP recalibration on y:
  none:  φ^no_1 (SVM margin):    min_j mrg_{yj}[s^SVM_j(yj)]
  MAP:   φ^mp_1 (Label length):  |y^MAP|
         φ^mp_2 (Admissibility): I[y^MAP ∈ G(x)]
         φ^mp_3 (Margin):        mrg_y[pθ(y | x)]
  Marg.: φ^mg_1 (Margin):        min_j mrg_{yj}[pθ(yj | x)]

Marginal recalibration on yj:
  none:  φ^no_2 (SVM margin):    mrg_{yj}[s^SVM_j(yj)]
  MAP:   φ^mp_4 (Label freq.):   % positions j′ labeled y^MAP_j
         φ^mp_5 (Neighbors):     % neighbors j′ labeled y^MAP_j
         φ^mp_6 (Label type):    I[y^MAP_j ∈ L(x)]
         φ^mp_7 (Pseudomargin):  mrg_{yj}[pθ(yj | y^MAP_{−j}, x)]
  Marg.: φ^mg_2 (Margin):        mrg_{yj}[pθ(yj | x)]
         φ^mg_3 (Concordance):   I[y^MG_j = y^MAP_j]

We consider three types of features, requiring either unstructured, MAP, or marginal inference. For a generic function f, define mrg_a f(a) := f(a(1)) − f(a(2)), where a(1) and a(2) are the top two inputs to f, ordered by f(a). Let y^MG_j := arg max_{yj} pθ(yj | x); let s^SVM_j(yj) be the score of an SVM classifier predicting label yj. Features φ^mp_2 and φ^mp_6 require domain-specific knowledge: defining admissible sets G(x), L(x). In OCR, G(x) contains all English words and L(x) contains similar-looking letters. Percentages in φ^mp_4 and φ^mp_5 are relative to all the labels in y^MAP.

Region-based forecasters. Recall from (4) that calibration examines the true probability of an event (y ∈ E) conditioned on the forecaster's prediction F(x, E) = p. By limiting the number of different probabilities p that F can output, we can more accurately estimate the true probability for each p. To this end, let us partition the feature space (the range of φ) into regions R, and output a probability F_R ∈ [0, 1] for each region R ∈ R. Formally, we consider region-based forecasters of the form

F(x, E) = Σ_{R∈R} F_R I[φ(x, E) ∈ R],

where F_R is the fraction of points in region R (that is, (x, E) for which φ(x, E) ∈ R) for which the event holds (y ∈ E). Note that the partitioning R could itself depend on the recalibration set. Two examples of region-based forecasters are k-nearest neighbors (k-NN) and decision trees.

Let us obtain additional insight into the performance of region-based forecasters as a function of recalibration set size. Let S denote here a recalibration set of size n, which is used to derive a partitioning R and probability estimates F_R for each region R ∈ R. Let T_R := P(y ∈ E | φ(x, E) ∈ R) be the true event probability for region R, and w_R := P(φ(x, E) ∈ R) be the probability mass of region R. We may rewrite the expected calibration error (3) of F_R trained on a random S of size n (drawn i.i.d. from P) as

CalibrationError_n = E_R[ Σ_{R∈R} w_R E_S[(F_R − T_R)^2 | R] ].    (5)

We see that there is a classic bias-variance tradeoff between having smaller regions (lower bias, increased sharpness) and having more data points per region (lower variance, better calibration):

E[(F_R − T_R)^2 | R] = (E[F_R | R] − T_R)^2 + E[(F_R − E[F_R | R])^2 | R],

where the first term is the bias and the second is the variance. If R is a fixed partitioning independent of S, then the bias will be zero, and the variance is due to an empirical average, falling off as 1/n. However, both k-NN and decision trees produce biased estimates F_R of T_R because the regions are chosen adaptively, which is important for achieving sharpness. In this case, we can still ensure that the calibration error vanishes to zero if we let the regions grow uniformly larger: min_{R∈R} |{(x, y) ∈ S : φ(x, E) ∈ R, E ∈ I(x)}| → ∞ in probability.

5 Experiments

We test our proposed recalibrators and features on three real-world tasks.

Multiclass image classification. The task is to predict an image label given an image. This setting is a special case of structured prediction in which we show that our framework improves over existing multiclass recalibration strategies. We perform our experiments on the CIFAR-10 dataset [15], which consists of 60,000 32x32 color images of different types of animals and vehicles (ten classes in total).

Figure 2: MAP recalibration in the multiclass and chain CRF settings (left and middle) and marginal recalibration of the graph CRF (right). The legend includes the ℓ2 loss before and after calibration. The radius of the black balls reflects the number of points having the given forecasted and true probabilities.
We train a linear SVM on features derived from k-means clustering that produce high accuracies (79%) on this dataset [16]. We use 800 out of the 1600 features having the highest mutual information with the label (the drop in performance is negligible). 38,000 images were used for training, 2,000 for calibration, and 20,000 for testing.

Optical character recognition. The task is to predict the word (sequence of characters) given a sequence of images (Figure 1). Calibrated OCR systems can be useful for automatic sorting of mail. This setting demonstrates calibration on a tractable linear-chain CRF. We used a dataset consisting of ~8-character-long words from 150 human subjects [3]. Each character is rasterized into a 16 × 8 binary image. We chose 2,000 words for training and another 2,000 for testing. The remaining words are subsampled in various ways to produce recalibration sets.

Scene understanding. Given an image divided into a set of regions, the task is to label each region with its type (e.g., person, tree, etc.). Calibrated scene understanding is important for building autonomous agents that try to take optimal actions in the environment, integrating over uncertainty. This is a structured prediction setting in which inference is intractable. We conduct experiments on a post-processed Pascal VOC dataset [5]. In brief, we train a graph CRF to predict the joint labeling yi of superpixels yij in an image (~100 superpixels per image; 21 possible labels). The input xi consists of 21 node features; CRF edges connect adjacent superpixels. We use 600 examples for training, 500 for testing, and subsample the remaining ~800 examples to produce calibration sets. We perform MAP inference using AD3, a dual decomposition algorithm; we use a mean-field approximation to compute marginals.

Experimental setup. 
We perform both MAP and marginal calibration as described in Section 3. We use decision trees and k-NN as our recalibration algorithms and examine the quality of our forecasts based on calibration and sharpness (Section 2). We further discretize probabilities into buckets of size 0.1: B = {[(i−1)/10, i/10) : i = 1, . . . , 10}.

We report results using calibration curves: for each test point (xi, Ei, yi), let fi = F(xi, Ei) ∈ [0, 1] be the forecasted probability and ti = I[yi ∈ Ei] ∈ {0, 1} be the true outcome. For each bucket B ∈ B, we compute averages fB = N_B^{−1} Σ_{i : fi∈B} fi and tB = N_B^{−1} Σ_{i : fi∈B} ti, where N_B = |{fi ∈ B}| is the number of points in bucket B. A calibration curve plots tB as a function of fB. Perfect calibration corresponds to a straight line. See Figure 2 for an example.

5.1 "Out-of-the-Box" Recalibration

We would first like to demonstrate that our approach works well "out of the box" with very simple parameters: a single feature, k-NN with k = 100, and a reasonably-sized calibration set. We report results in three settings: (i) multiclass and (ii) chain CRF MAP recalibration with the margin feature φ^mg_1 (Figure 2, left, middle), as well as (iii) graph CRF marginal recalibration with the margin feature φ^mg_2 (Figure 2, right). 
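The calibration-curve computation just described is straightforward to implement. A minimal sketch (the function name and the example forecasts are ours):

```python
def calibration_curve(forecasts, outcomes, num_buckets=10):
    """Bucket forecasts f_i into B = {[0, 0.1), ..., [0.9, 1.0]} and return
    the per-bucket averages (f_B, t_B); perfectly calibrated forecasts put
    these points on the diagonal."""
    sums = [[0.0, 0.0, 0] for _ in range(num_buckets)]  # sum f, sum t, count
    for f, t in zip(forecasts, outcomes):
        b = min(int(f * num_buckets), num_buckets - 1)  # clamp f = 1.0
        sums[b][0] += f
        sums[b][1] += t
        sums[b][2] += 1
    return [(sf / n, st / n) for sf, st, n in sums if n > 0]

# Tiny made-up example: five forecasts f_i with outcomes t_i = I[y_i in E_i].
points = calibration_curve([0.1, 0.15, 0.8, 0.85, 0.9], [0, 0, 1, 1, 0])
print(points)  # three buckets are populated: [0.1,0.2), [0.8,0.9), [0.9,1.0]
```

Plotting t_B against f_B for each populated bucket gives exactly the curves shown in Figures 2 and 3.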
We use calibration sets of 2,000, 1,000, and 300 examples (respectively) and compare to the raw CRF probabilities pθ(y ∈ E | x).

[Figure 2 panels (axes: mean predicted value vs. fraction of positives), with ℓ2 losses: image classification (multiclass MAP recalibration; 75% accuracy of the raw uncalibrated SVM): raw 23.0, calibrated 19.6, one-vs-all 20.1. OCR (chain CRF MAP recalibration; 45% per-word accuracy using Viterbi decoding): raw 29.5, calibrated 13.6. Scene understanding (graph CRF marginal recalibration; 78% accuracy using mean-field marginal decoding): raw 65.9, calibrated 18.6.]

Figure 3: Feature analysis for MAP and marginal recalibration of the chain CRF (left and middle, respectively) and marginal recalibration of the graph CRF (right). Subplots show calibration curves for various groups of features from Table 1, as well as ℓ2 losses; dot sizes indicate relative bucket size.

Figure 2 shows that our predictions (green line) are well-calibrated in every setting. In the multiclass setting, we outperform an existing approach which individually recalibrates one-vs-all classifiers and normalizes their probability estimates [12]. This suggests that recalibrating for a specific event (e.g., the highest scoring class) is better than first estimating all the multiclass probabilities.

5.2 Feature Analysis

Next, we investigate the role of features. In Figure 3, we consider three structured settings, and in each setting evaluate performance using different sets of features from Table 1. From top to bottom, the subplots describe progressively more computationally demanding features. 
Our main takeaways are that clever inexpensive features do as well as naive expensive ones, that features may be complementary and help each other, and that recalibration allows us to add "global" features to a chain CRF. We also see that features affect only sharpness.

In the intractable graph CRF setting (Figure 3, right), we observe that pseudomargins φ^mp_7 (which require only MAP inference) fare almost as well as true marginals φ^mg_2, although they lack resolution. Augmenting with additional MAP-based features (φ^mp_4, φ^mp_5) that capture whether a label is similar to its neighbors and whether it occurs elsewhere in the image resolves this.

This synergistic interaction of features appears elsewhere. On marginal chain CRF recalibration (Figure 3, left), the margin φ^mg_2 between the two best classes yields calibrated forecasts that slightly lack sharpness near zero (points with e.g. 50% and 10% confidences will have similarly small margins). Adding the MAP-marginal concordance feature φ^mg_3 improves calibration, since we can further differentiate between low and very low confidence estimates. Similarly, individual SVM and MAP-based features φ^no_2, φ^mp_6 (the φ^mp_6 are 26 binary indicators, one per character) are calibrated, but not very sharp. They accurately identify 70%, 80% and 90% confidence sets, which may be sufficient in practice, given that they take no additional time to compute. Adding features based on marginals improves sharpness.

On MAP CRF recalibration (Figure 3, middle), we see that simple features (φ^mp_1, φ^mp_2) can fare better than more sophisticated ones like the margin φ^mp_3 (recall that φ^mp_1 is the length of a word; G in φ^mp_2 encodes whether the word y^MAP is in the dictionary). This demonstrates that recalibration lets us introduce new global features beyond what's in the original CRF, which can dramatically improve calibration at no additional inferential cost.

[Figure 3 panels, with ℓ2 losses per feature group. Per-letter OCR (chain CRF marginal recalibration; 84% per-letter accuracy using Viterbi decoding): uncalibrated 30.2; unstructured SVM scores φ^no_2 15.8; 26 character indicators φ^mp_6 16.1; marginal probabilities φ^mg_2 12.0; marginals + marginal/MAP agreement φ^mg_2, φ^mg_3 10.9; all features 10.8. Per-word OCR (chain CRF MAP recalibration; 45% per-word accuracy using Viterbi decoding): uncalibrated 21.0; unstructured SVM scores φ^no_1 20.5; length + presence in dictionary φ^mp_1, φ^mp_2 4.2; margin between 1st and 2nd best φ^mp_3 13.1; lowest marginal probability φ^mg_1 20.6; all features 4.0. Scene understanding (graph CRF marginal recalibration; 78% accuracy using mean-field marginal decoding): uncalibrated 67.0; unstructured SVM scores φ^no_2 14.7; pseudomargins φ^mp_7 17.0; pseudomargins + other MAP features φ^mp_4, φ^mp_5, φ^mp_7 15.4; marginals + MAP/marginal concordance φ^mg_2 15.9; all features 14.0.]

Figure 4: Calibration error (blue) and sharpness (green) of k-NN (left) and decision trees (right) as a function of calibration set size (chain CRF; marginal recalibration).

5.3 Effects of Recalibration Set Size and Recalibration Technique

Lastly, in Figure 4, we compare k-NN and decision trees on chain CRF marginal prediction using feature φ^mg_2. We subsample calibration sets S of various sizes N. For each N and each algorithm, we choose a hyperparameter (minimum leaf size for decision trees, k in k-NN) by 10-fold cross-validation on S. We tried values between 5 and 500 in increments of 5.

Figure 4 shows that for both methods, sharpness remains constant, while the calibration error decreases with N and quickly stabilizes below 10^-3; this confirms that we can always recalibrate with enough data. The decrease in calibration error also indicates that cross-validation successfully finds a good model for each N. Finally, we found that k-NN fared better when using continuous features (see also right columns of Figures 2 and 3); decision trees performed much better on categorical features.

6 Previous Work and Discussion

Calibration and sharpness provide the conceptual basis for this work. These ideas and their connection to ℓ2 losses have been explored extensively in the statistics literature [7, 9] in connection with forecast evaluation; there exist generalizations to other losses as well [17, 10]. Calibration in the online setting is a field in itself; see [8] for a starting point. Finally, calibration has been explored extensively from a Bayesian viewpoint, starting with the seminal work of Dawid [18].

Recalibration has been mostly studied in the binary classification setting, with Platt scaling [11] and isotonic regression [13] being two popular and effective methods. Non-binary methods typically involve training one-vs-all predictors [12] and include extensions to ranking losses [19] and combinations of estimators [20]. 
Our generalization to structured prediction required us to develop the notion of events of interest, which, even in the multiclass setting, works better than estimating every class probability and may prove useful beyond typical structured prediction problems.
Confidence estimation methods play a key role in speech recognition [21], but they require domain-specific acoustic features [1]. Our approach is more general: it applies in any graphical model (including ones where inference is intractable), uses domain-independent features, and guarantees calibrated probabilities rather than mere scores that correlate with accuracy.
The issue of calibration arises any time one needs to assess the confidence of a prediction. Its importance has been discussed and emphasized in medicine [22], natural language processing [23], speech recognition [21], meteorology [10], econometrics [9], and psychology [24]. Unlike uncalibrated confidence measures, calibrated probabilities are formally tied to objective frequencies. They are easy for users to understand, e.g., patients undergoing diagnosis or researchers querying a probabilistic database. Moreover, modern AI systems typically consist of a pipeline of modules [23]; in this setting, calibrated probabilities are important for expressing uncertainty meaningfully across different (potentially third-party) modules. We hope our extension to the structured prediction setting can help make calibration more accessible and easier to apply in more complex and diverse settings.

Acknowledgements. This research is supported by an NSERC Canada Graduate Scholarship to the first author and a Sloan Research Fellowship to the second author.

Reproducibility.
All code, data, and experiments for this paper are available on CodaLab at https://www.codalab.org/worksheets/0xecc9a01cfcbc4cd6b0444a92d259a87c/.

References

[1] M. Seigel. Confidence Estimation for Automatic Speech Recognition Hypotheses. PhD thesis, University of Cambridge, 2013.

[2] D. E. Heckerman and B. N. Nathwani. Towards normative expert systems: Probability-based representations for efficient knowledge acquisition and inference. Methods of Information in Medicine, 31(2):106–116, 1992.

[3] R. H. Kassel. A comparison of approaches to on-line handwritten character recognition. PhD thesis, Massachusetts Institute of Technology, 1995.

[4] P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. An end-to-end discriminative approach to machine translation. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), 2006.

[5] A. Mueller. Methods for Learning Structured Prediction in Semantic Segmentation of Natural Images. PhD thesis, University of Bonn, 2013.

[6] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[7] A. H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973.

[8] D. P. Foster and R. V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.

[9] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.

[10] J. Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, 2009.

[11] J. Platt.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[12] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694–699, 2002.

[13] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning (ICML), pages 625–632, 2005.

[14] D. B. Stephenson, C. A. S. Coelho, and I. T. Jolliffe. Two extra components in the Brier score decomposition. Weather and Forecasting, 23:752–757, 2008.

[15] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[16] A. Coates and A. Y. Ng. Learning feature representations with K-means. Neural Networks: Tricks of the Trade - Second Edition, 2(1):561–580, 2012.

[17] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: Structure and applications, 2005.

[18] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association (JASA), 77(379):605–610, 1982.

[19] A. K. Menon, X. Jiang, S. Vembu, C. Elkan, and L. Ohno-Machado. Predicting accurate probabilities with a ranking loss. In International Conference on Machine Learning (ICML), 2012.

[20] L. W. Zhong and J. Kwok. Accurate probability calibration for multiple classifiers. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1939–1945, 2013.

[21] D. Yu, J. Li, and L. Deng. Calibration of confidence measures in speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(8):2461–2473, 2011.

[22] X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado.
Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.

[23] K. Nguyen and B. O'Connor. Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP), pages 1587–1598, 2015.

[24] S. Lichtenstein, B. Fischhoff, and L. D. Phillips. Calibration of probabilities: The state of the art to 1980. In Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.