{"title": "On the Calibration of Multiclass Classification  with Rejection", "book": "Advances in Neural Information Processing Systems", "page_first": 2586, "page_last": 2596, "abstract": "We investigate the problem of multiclass classification with rejection, where a classifier can choose not to make a prediction to avoid critical misclassification. First, we consider an approach based on simultaneous training of a classifier and a rejector, which achieves the state-of-the-art performance in the binary case. We analyze this approach for the multiclass case and derive a general condition for calibration to the Bayes-optimal solution, which suggests that calibration is hard to achieve by general loss functions unlike the binary case. Next, we consider another traditional approach based on confidence scores, in which the existing work focuses on a specific class of losses. We propose rejection criteria for more general losses for this approach and guarantee calibration to the Bayes-optimal solution. Finally, we conduct experiments to validate the relevance of our theoretical findings.", "full_text": "On the Calibration of Multiclass Classi\ufb01cation\n\nwith Rejection\n\nChenri Ni1 Nontawat Charoenphakdee1,2\n1 The University of Tokyo, Japan\n\nJunya Honda1,2 Masashi Sugiyama2,1\n2 RIKEN Center for Advanced Intelligence Project, Japan\n\n{nichenri, nontawat}@ms.k.u-tokyo.ac.jp\n\n{jhonda, sugi}@k.u-tokyo.ac.jp\n\nAbstract\n\nWe investigate the problem of multiclass classi\ufb01cation with rejection, where a\nclassi\ufb01er can choose not to make a prediction to avoid critical misclassi\ufb01cation.\nFirst, we consider an approach based on simultaneous training of a classi\ufb01er and a\nrejector, which achieves the state-of-the-art performance in the binary case. 
We\nanalyze this approach for the multiclass case and derive a general condition for\ncalibration to the Bayes-optimal solution, which suggests that calibration is hard to\nachieve by general loss functions unlike the binary case. Next, we consider another\ntraditional approach based on con\ufb01dence scores, in which the existing work focuses\non a speci\ufb01c class of losses. We propose rejection criteria for more general losses\nfor this approach and guarantee calibration to the Bayes-optimal solution. Finally,\nwe conduct experiments to validate the relevance of our theoretical \ufb01ndings.\n\n1\n\nIntroduction\n\nIn real-world classi\ufb01cation tasks, e.g., medical diagnosis, autonomous driving, and product inspection,\nmisclassi\ufb01cation can be costly and even life-threatening. Classi\ufb01cation with rejection is a framework\naiming to prevent critical misclassi\ufb01cation by providing an option not to make a prediction at the\nexpense of the pre-de\ufb01ned rejection cost [6, 7]. If the rejection cost is less than the misclassi\ufb01cation\ncost, there is an incentive to reject an instance. In practice, once the reject option is selected, one may\ngather more information about the instance or ask experts to give the correct label.\nMuch research on the theoretical perspective of classi\ufb01cation with rejection has been devoted to the\nbinary classi\ufb01cation scenario [16, 1, 14, 29, 8, 9]. However, rather less attention has been paid to\nthe multiclass scenario, which is undoubtedly important for real-world applications and is a more\ngeneral framework. To the best of our knowledge, although there exist many methods that rely on\nheuristics [11, 28, 23], only the work by Ramaswamy et al. [20] provides the theoretical guarantee\nfor their method. Nevertheless, the work by Ramaswamy et al. 
[20] only focuses on speci\ufb01c types of\nnon-differentiable losses and their method requires re-training of the classi\ufb01er when the rejection\ncost changes.\nThe key concept to validate the soundness of the method for classi\ufb01cation with rejection lies in the\nnotion of calibration, i.e., in\ufb01nite-sample consistency [29, 8]. Calibration suggests that the minimizer\nof a surrogate risk behaves identically to the Bayes-optimal solution almost surely. The existing\nmethods with calibration guarantees can be divided into two categories, which we detail in the\nfollowing.\nThe \ufb01rst category is called the con\ufb01dence-based approach. The main idea is to use the real-valued\noutput of the classi\ufb01er as a con\ufb01dence score [1, 14, 26]. Whether to reject the input is then determined\nfrom the classi\ufb01er\u2019s output and a threshold depending on the rejection cost and the choice of the\nsurrogate loss function.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe second category is what we call the classi\ufb01er-rejector approach. Unlike the con\ufb01dence-based\napproach, this approach separates the role of the classi\ufb01er and the rejector, and trains both functions\nsimultaneously [8, 9]. This problem formulation enables more \ufb02exible modeling for the rejector,\nwhich can be more robust to model-misspeci\ufb01cation. This is a state-of-the-art method in binary\nclassi\ufb01cation, and has been further discussed in online learning setting [10], structured output learning\nsetting [13], and also in some real-world applications such as liver disease diagnosis [15].\nThe goal of this paper is to provide a better understanding of multiclass classi\ufb01cation with rejection.\nWe \ufb01rst investigate the classi\ufb01er-rejector approach and derive a calibration condition of this approach\nin the multiclass case. Our condition recovers the known result by Cortes et al. 
[9] in the binary case as a special case. However, when there are more than two classes, we argue that the condition is hard to satisfy. We next analyze the confidence-based approach and prove the calibration results for various classes of smooth losses, which guarantees the use of well-known losses such as the logistic loss, the squared loss, the exponential loss and the cross-entropy loss. Our experiments support the above findings, that is, the failure of the classifier-rejector approach and the success of the confidence-based approach with smooth loss functions, particularly the cross-entropy loss.\n\n2 Preliminaries\n\nIn this section, we formulate the problem of classification with rejection and review related work.\n\n2.1 Problem setting\nLet X \u2286 R^d be a d-dimensional input space and Y = {1, . . . , K} be an output space representing K classes. Suppose we are given n training samples {(x_i, y_i)}_{i=1}^n drawn independently from an unknown probability distribution over X \u00d7 Y with density p(x, y). In classification with rejection, we will learn a pair (r, f) consisting of a rejector r and a classifier f. The rejector r : X \u2192 R rejects a point x \u2208 X if r(x) \u2264 0, and accepts it otherwise. The classifier f : X \u2192 Y is assumed to take the following form:\n\nf(x) = argmax_{y\u2208Y} g_y(x),\n\nwhere g_y : X \u2192 R is a score function for multiclass classification. By a slight abuse of notation, we identify the classifier f(x) with g(x), where g(x) = [g_1(x), . . . , g_K(x)]^\u22a4 and \u22a4 denotes the transpose. Given a loss function L(r, f; x, y), we define its risk R by R(r, f) = E_{p(x,y)}[L(r, f; x, y)], where E_{p(x,y)}[\u00b7] denotes the expectation over the distribution p(x, y). 
We also define the pointwise risk W of the loss L at x by\n\nW(r(x), f(x); \u03b7(x)) = \u2211_{y\u2208Y} \u03b7_y(x) L(r, f; x, y),\n\nwhere \u03b7(x) = [\u03b7_1(x), . . . , \u03b7_K(x)]^\u22a4 for \u03b7_y(x) = p(y|x) denotes the class probability vector. Note that minimizing R(r, f) with respect to (r, f) over all measurable functions is equivalent to minimizing W(r(x), f(x); \u03b7(x)) over (r(x), f(x)) for all x \u2208 X. Thus, it is sufficient to only consider the pointwise risk to minimize R(r, f) [22, 25]. For brevity, we omit the notation of x and write, for example, W(r, f; \u03b7) instead of W(r(x), f(x); \u03b7(x)) for the pointwise risk. We will also drop the notation of r when classification without rejection is considered and write, for example, L(f; x, y), R(f) and W(f; \u03b7).\nIn multiclass classification with rejection, our goal is to minimize the 0-1-c risk defined as\n\nR_{0-1-c}(r, f) = E_{p(x,y)}[L_{0-1-c}(r, f; x, y)],    (1)\n\nwhere the 0-1-c loss L_{0-1-c} is given by\n\nL_{0-1-c}(r, f; x, y) = 1[f(x)\u2260y] 1[r(x)>0] + c 1[r(x)\u22640].    (2)\n\nThe first term of (2) is the misclassification loss and the second term is the rejection loss. Here, c \u2208 [0, 1] denotes the rejection cost, and 1[\u00b7] denotes the indicator function.\nIt is well known that the Bayes-optimal classifier and rejector [20], i.e., the classifier and the rejector that minimize (1), are given by\n\nf\u2217(x) = argmax_{y\u2208Y} \u03b7_y(x),    r\u2217(x) = max_{y\u2208Y} \u03b7_y(x) \u2212 (1 \u2212 c).\n\nIn this paper, we assume c < 1/2 since data points with low confidence are accepted otherwise.\n\n2.2 Calibration\n\nIn classification without rejection, the classification risk, i.e., the expected risk with respect to the 0-1 loss L_{0-1}(f; x, y) = 1[f(x)\u2260y], is the standard performance metric. It is known that minimizing the 0-1 risk is computationally infeasible [4, 12]. Therefore, an important question is what kind of surrogate loss can be used instead of the 0-1 loss [30, 24, 19]. Intuitively, a surrogate loss should be optimization-friendly and its minimization should lead to minimization of the 0-1 risk. The notion of calibration is defined for loss functions as the minimum requirement to assure that the risk-minimizing classifier becomes the Bayes-optimal classifier (see Zhang [30] for the formal definition).\nIn classification with rejection, our goal is to minimize the 0-1-c risk. Similarly to the 0-1 risk, the 0-1-c risk is also difficult to directly minimize [1, 20]. For the purpose of theoretical analysis, it is more convenient to directly define calibration for classifiers and rejectors based on whether they are Bayes-optimal. Thus, we propose to define the notions of calibration as follows.\nDefinition 1 (Calibration of a classifier-rejector pair). We say that (r, f) : X \u2192 R \u00d7 Y is calibrated if R_{0-1-c}(r, f) = R_{0-1-c}(r\u2217, f\u2217).\nIn this paper, we also consider the notions of calibration separately for classifiers and rejectors, which enables better understanding of where the difficulty of classification with rejection comes from.\nDefinition 2 (Rejection calibration of a rejector). We say that r : X \u2192 R is rejection-calibrated if sign[r(x)] = sign[r\u2217(x)] for all x \u2208 X such that r\u2217(x) \u2260 0.\nDefinition 3 (Classification calibration of a classifier). We say that f : X \u2192 Y is classification-calibrated if f(x) = f\u2217(x) holds almost everywhere on X.\nAs we can see from these definitions and the form of loss function (2), if r is rejection-calibrated and f is classification-calibrated, then (r, f) is calibrated. 
Furthermore, rejection calibration of r is necessary for calibration of (r, f), while classification calibration of f is not, as exemplified in [20].\n\n2.3 Related work\n\nHere, we review some related work for both the confidence-based and classifier-rejector approaches. Note that we follow the conventional notation where the output domain is Y = {+1, \u22121} and the score function f : X \u2192 R is regarded as a classifier when discussing binary classification.\n\n2.3.1 Confidence-based approach\n\nIn the confidence-based approach, we first train a classifier based on some surrogate of the 0-1 loss, where we regard the real-valued output of the classifier as some confidence score. We then construct a rejector based on the output and a pre-specified threshold \u03b8, which takes the form\n\nr(x) = |f(x)| \u2212 \u03b8    (3)\n\nin the binary case. Bartlett and Wegkamp [1] proposed a loss called the modified hinge loss and designed an SVM-like algorithm. Later, Yuan and Wegkamp [29] considered a smooth margin loss \u03c6(y f(x)).\nHere the smoothness of the loss is quite important in the construction of rejectors, since the threshold \u03b8 is sometimes not uniquely determined if a non-smooth loss is used. In Bartlett and Wegkamp [1], a calibration guarantee for the non-smooth loss is shown for a range of \u03b8, but its empirical performance is heavily affected by the choice of the threshold. In addition, the loss function also contains a parameter that has to be determined by the rejection cost c, which means that we need to re-train the classifier once we change the value of c. On the other hand, for smooth losses, the value of c does not affect the parameter of the loss, but only the threshold \u03b8. This suggests that we do not need to re-train a classifier when the rejection cost c changes [29].\nRamaswamy et al. 
[20] extended the method of Bartlett and Wegkamp [1] to multiclass classification, and designed non-smooth losses with excess risk bounds. However, their method has the drawbacks of non-unique \u03b8 and the dependence of the loss on c, which comes from the use of non-smooth losses.\n\n2.3.2 Classifier-rejector approach\n\nCortes et al. [8, 9] pointed out that it is too restrictive to require the rejector r to be of form (3) when the true classifier is out of the considered hypothesis set. Based on this observation, they proposed to separate the roles of the classifier and the rejector, and directly minimize an upper bound of the 0-1-c risk with respect to (r, f) in the training phase. Plus bound (PB) loss L_PB was proposed as an upper bound of the 0-1-c loss in Cortes et al. [9]:\n\nL_PB(r, f; x, y) = \u03c6(\u03b1[y f(x) \u2212 r(x)]) + c \u03c8(\u03b2 r(x)),    (4)\n\nwhere \u03c6 and \u03c8 are convex upper bounds of 1[z\u22640]. Cortes et al. [9] derived the calibration result for the exponential loss \u03c6(z) = \u03c8(z) = exp(\u2212z) with appropriately chosen parameters \u03b1, \u03b2 > 0. However, to the best of our knowledge, this approach is currently available only for the binary case, and an extension to the multiclass case is highly nontrivial as we will see later.\n\n3 An analysis of the classifier-rejector approach\n\nIn this section, we provide a general result on multiclass classification with rejection using the classifier-rejector approach. In the following, we discuss the achievability of rejection calibration of r, which is a necessary condition for calibration of (r, f).\nGiven a loss L(r, f; x, y), we denote by (r\u2020_\u03b7, f\u2020_\u03b7) = argmin_{r\u2208R, g\u2208R^K} W(r, f; \u03b7) the minimizer of the corresponding pointwise risk W over the real space. First we derive the following theorem, which is the main result of this section.\nTheorem 4 (Necessary and sufficient condition for rejection calibration). Assume that L is a convex function of class C^1 with respect to r, and also assume \u2202^2 W(r, f\u2020_\u03b7; \u03b7)/\u2202r^2 |_{r=0} > 0. Let (r\u2020, f\u2020) be the minimizer of the surrogate risk R over all measurable functions. Then, r\u2020 is rejection-calibrated if and only if\n\nsup_{\u03b7: max_y \u03b7_y \u2265 1\u2212c} \u2202W(r, f\u2020_\u03b7; \u03b7)/\u2202r |_{r=0} \u2264 0 \u2264 inf_{\u03b7: max_y \u03b7_y \u2264 1\u2212c} \u2202W(r, f\u2020_\u03b7; \u03b7)/\u2202r |_{r=0}.    (5)\n\nThe proof of this theorem is given in Appendix B.1. The following corollary is a weaker version of this theorem but gives more insight into the strength of the requirement for rejection calibration.\nCorollary 5 (Necessary condition for rejection calibration). Under the same assumption as Theorem 4, r\u2020 is rejection-calibrated only if\n\nsup_{\u03b7: max_y \u03b7_y = 1\u2212c} \u2202W(r, f\u2020_\u03b7; \u03b7)/\u2202r |_{r=0} = 0,    inf_{\u03b7: max_y \u03b7_y = 1\u2212c} \u2202W(r, f\u2020_\u03b7; \u03b7)/\u2202r |_{r=0} = 0.    (6)\n\nThis corollary is straightforward from the relation\n\ninf_{\u03b7: max_y \u03b7_y \u2264 1\u2212c} h(\u03b7) \u2264 inf_{\u03b7: max_y \u03b7_y = 1\u2212c} h(\u03b7) \u2264 sup_{\u03b7: max_y \u03b7_y = 1\u2212c} h(\u03b7) \u2264 sup_{\u03b7: max_y \u03b7_y \u2265 1\u2212c} h(\u03b7)\n\nfor any function h(\u03b7). The conditions in (6) require that the supremum and the infimum of the objective function \u2202W(r, f\u2020_\u03b7; \u03b7)/\u2202r |_{r=0} coincide under the same constraint. Therefore, the objective function is required to depend only on max_y \u03b7_y, but not on the class probabilities of other classes. Whereas max_y \u03b7_y uniquely determines the other probability as 1 \u2212 max_y \u03b7_y in the binary case, it still allows a degree of freedom in the multiclass case, which results in the situation where two conditions in (6) do not necessarily hold simultaneously.\nThe failure of the classifier-rejector approach is intuitively explained as follows. The Bayes-optimal rejector r\u2217 must be determined only from max_y \u03b7_y. Nevertheless, the classifier-rejector approach ignores this requirement and tries to directly construct a rejector r, which does not satisfy this requirement in general. This contrasts to the rejector in (10) obtained by the confidence-based approach, where the requirement is encoded by the inverse link function and the max operator.\nRemark 1. An error of the rejector can be classified into False Reject (FR) and False Accept (FA), which correspond to the outcomes when the rejector mistakenly rejects (resp. accepts) the data that should be accepted (resp. rejected). We can see from close inspection of the proof of Theorem 4 that the first inequality of (5) is the condition for the FR rate to be zero, while the second inequality is the condition for the FA rate to be zero.\n\nTo understand the above difference between the binary and multiclass cases more precisely, let us consider the following example so that the conditions in (6) are explicitly written. Define two surrogate losses given by\n\nL_MPC(r, f; x, y) = \u2211_{y'\u2260y} \u03c6(\u03b1(g_y(x) \u2212 g_{y'}(x))) \u03c8(\u2212\u03b1 r(x)) + c \u03c8(\u03b2 r(x)),    (7)\n\nL_APC(r, f; x, y) = \u2211_{y'\u2260y} \u03c6(\u03b1(g_y(x) \u2212 g_{y'}(x) \u2212 r(x))) + c \u03c8(\u03b2 r(x)),    (8)\n\nwhich we call the multiplicative pairwise comparison (MPC) loss and the additive pairwise comparison (APC) loss, respectively. Here, \u03c6 and \u03c8 are convex losses that bound 1[z\u22640] from above, and \u03b1 and \u03b2 are positive constants that control the performance of the rejector. Note that the pairwise comparison loss is often used as a multiclass extension of a binary loss [27]. Also note that the APC loss reduces to the PB loss [9] in (4) when K = 2. Here the MPC loss and the APC loss are natural ones at least for the purpose of classification in the sense that the classifiers induced by them are classification-calibrated (see Appendix B.3 for the proof). Nevertheless, when \u03c6 and \u03c8 are exponential losses, (6) gives the following conditions:\n\n\u03b2/\u03b1 = (K \u2212 2) + 2 \u221a((K \u2212 1)(1 \u2212 c)/c),    \u03b2/\u03b1 = 2 \u221a((1 \u2212 c)/c),    (9)\n\nwhich recover the result proved by Cortes et al. [9] when K = 2 (see Appendix B.4 for details). Here the RHSs of (9) for K > 2 are not identical and therefore we cannot find any \u03b1 and \u03b2 satisfying the above equations, even though we get a classification-calibrated classifier. This implies the failure in rejection calibration. 
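
As a quick numerical illustration of why no single ratio beta/alpha can work for K > 2, the sketch below evaluates the two right-hand sides of (9), read here as (K - 2) + 2*sqrt((K - 1)(1 - c)/c) and 2*sqrt((1 - c)/c). The function name and the chosen values of K and c are illustrative assumptions, not part of the paper's experiments.

```python
import math


def rhs_of_condition_9(num_classes, c):
    """Return the two right-hand sides of (9) for K classes and rejection cost c.

    Hypothetical helper: the formulas follow the reading of (9) stated above.
    """
    odds = (1.0 - c) / c
    first = (num_classes - 2) + 2.0 * math.sqrt((num_classes - 1) * odds)
    second = 2.0 * math.sqrt(odds)
    return first, second


if __name__ == "__main__":
    # Compare the two required values of beta/alpha for a few class counts.
    for k in (2, 3, 8):
        first, second = rhs_of_condition_9(k, c=0.2)
        print(k, first, second, math.isclose(first, second))
```

For K = 2 the two values coincide, while for K > 2 the first is strictly larger (K - 2 > 0 and sqrt(K - 1) > 1), so the two conditions cannot hold for the same alpha and beta.
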
The same proof technique also shows the failure of the classifier-rejector approach when \u03c6 and \u03c8 are logistic losses, not only when they are exponential losses (see Appendix B.4).\nNote that, strictly speaking, it remains an open question whether it is possible to find a calibrated surrogate loss in the classifier-rejector approach. In this paper, our result emphasizes that calibration in the multiclass scenario is significantly more difficult. Intuitively, a necessary condition in Corollary 5 is relatively easy to satisfy for K = 2 but it is not the case when K > 2, as illustrated in our examples.\n\n4 An analysis of the confidence-based approach\n\nThis section focuses on the extension of the confidence-based approach to the multiclass case using smooth losses. When we need some confidence score in the multiclass case, it is convenient to consider a class of loss functions called strictly proper composite losses [22] defined as follows.\nDefinition 6 (Strictly proper composite loss [22]). A loss L is strictly proper composite with link function \u03a8 : [0, 1]^K \u2192 R^K if the pointwise risk W of L satisfies argmin_g W(g; \u03b7) = \u03a8(\u03b7) = [\u03a8_1(\u03b7), . . . , \u03a8_K(\u03b7)]^\u22a4.\nWith this class of losses, the threshold \u03b8 used in the rejector derived in Yuan and Wegkamp [29] is expressed as \u03a8_1((1 \u2212 c, c)) in the binary case. However, unlike the binary case, it is known that the link function \u03a8 sometimes does not have a closed form whereas the inverse link function \u03a8^{\u22121} often does in multiclass classification [22]. Thus, when we design a rejector in the multiclass case, it would be natural to use the inverse link function to map output \u011d to the estimated class probability vector \u03b7\u0302 rather than to use the link function itself as in the binary case. 
Based on this discussion, we consider the following rejector based on the relationship between the inverse link \u03a8^{\u22121}_y and the Bayes-optimal rejector r\u2217(x) = max_{y\u2208Y} \u03b7_y(x) \u2212 (1 \u2212 c):\n\nr(x) = r_f(x) = max_{y\u2208Y} \u03a8^{\u22121}_y(g(x)) \u2212 (1 \u2212 c).    (10)\n\nRecall that we identify the classifier f with g, and we use the notation r_f in the sense that r is determined by f. Below, we focus on two frequently used losses: one-versus-all (OVA) loss L_OVA and cross-entropy (CE) loss L_CE:\n\nL_OVA(f; x, y) = \u03c6(g_y(x)) + \u2211_{y'\u2260y} \u03c6(\u2212g_{y'}(x)),    L_CE(f; x, y) = \u2212g_y(x) + log \u2211_{y'\u2208Y} exp(g_{y'}(x)),    (11)\n\nfor which the inverse link functions are given by\n\n\u03a8^{\u22121}_{y,OVA}(g) = \u03c6'(\u2212g_y) / (\u03c6'(\u2212g_y) + \u03c6'(g_y)),    \u03a8^{\u22121}_{y,CE}(g) = exp(g_y) / \u2211_{y'\u2208Y} exp(g_{y'}),    (12)\n\nrespectively. Here, \u03c6 denotes a margin loss [3]. Note that unlike the losses proposed in Ramaswamy et al. [20], the OVA loss and the CE loss do not contain c. Thus, training a classifier once is sufficient for various choices of c.\n\nTable 1: A list of margin losses and the values of \u03b8, C and s that satisfy (14) and (15) in Theorem 7.\n\nLoss Name        \u03c6(z)                 \u03b8                       C        s\nLogistic         log(1 + exp(\u2212z))     log((1 \u2212 c)/c)          1/2      2\nExponential      exp(\u2212z)              (1/2) log((1 \u2212 c)/c)    1/\u221a2     2\nSquared          (1 \u2212 z)^2            1 \u2212 2c                  1/2      2\nSquared Hinge    (1 \u2212 z)_+^2          1 \u2212 2c                  1/2      2\n\nWe rely on the notion of excess risk bounds to prove the calibration result of the OVA loss and the CE loss. 
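
The rejector in (10), combined with the CE inverse link in (12) (the softmax), can be sketched in a few lines. This is a minimal illustration only: the function names, the 0-based class indices, and the `REJECT` sentinel are assumptions made for the example, and the scores stand in for the outputs g(x) of an already trained classifier.

```python
import math

REJECT = -1  # hypothetical sentinel meaning "no prediction"


def softmax(scores):
    """Inverse link of the CE loss: map scores g to estimated class probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_rejection(scores, c):
    """Apply r_f(x) = max_y softmax_y(g(x)) - (1 - c); reject when r_f(x) <= 0."""
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    r = probs[label] - (1.0 - c)
    return label if r > 0 else REJECT


if __name__ == "__main__":
    scores = [1.0, 0.0, 0.0]
    # The same scores are reused for two rejection costs; only the threshold moves.
    print(predict_with_rejection(scores, c=0.2))
    print(predict_with_rejection(scores, c=0.45))
```

Because c enters only through the threshold 1 - c, the same trained scores can be reused when the rejection cost changes, which is the property noted above for the OVA and CE losses.
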
Excess risk bounds [30, 3, 19] are a tool to directly quantify the relationship between the surrogate risk R and the risk we truly want to minimize. In our problem, the true risk is the 0-1-c risk in (1) and the excess risk bound to be derived is expressed as\n\n\u03be(\u2206R_{0-1-c}(r_f, f)) \u2264 \u2206R(f),    (13)\n\nwhere \u03be : R \u2192 R_{\u22650} is called a calibration function [19], which is increasing, continuous at 0 and satisfies \u03be(0) = 0. Here, excess risks \u2206R_{0-1-c}(r_f, f) and \u2206R(f) are defined as follows:\n\n\u2206R_{0-1-c}(r_f, f) = R_{0-1-c}(r_f, f) \u2212 R_{0-1-c}(r\u2217, f\u2217),\n\u2206R(f) = R(f) \u2212 inf_{f': measurable} R(f').\n\nIneq. (13) ensures that the minimization of a surrogate risk leads to the minimization of the 0-1-c risk. Therefore, the existence of an excess risk bound guarantees calibration.\nNow we give excess risk bounds for the OVA loss and the CE loss in the following theorems.\nTheorem 7 (Excess risk bound for OVA loss). Assume that \u03c6 is a convex function, and there exists \u03b8 > 0 such that \u03c6'(\u03b8) and \u03c6'(\u2212\u03b8) both exist, \u03c6'(\u03b8) < 0 and\n\n\u03c6'(\u2212\u03b8) / (\u03c6'(\u2212\u03b8) + \u03c6'(\u03b8)) = 1 \u2212 c.    (14)\n\nIn addition, suppose that there exist some constants C > 0 and s \u2265 1 such that\n\ninf_{g: g_y=\u03b8} { W_OVA(f; \u03b7) \u2212 inf_{g'\u2208R^K} W_OVA(f'; \u03b7) } \u2265 C^{\u2212s} |\u03b7_y \u2212 (1 \u2212 c)|^s    (15)\n\nfor all y \u2208 Y and probability vector \u03b7. Then, for all f and c \u2208 [0, 1/2), we have\n\n(2C)^{\u2212s} \u2206R_{0-1-c}(r_f, f)^s \u2264 \u2206R_OVA(f).    (16)\n\nFigure 1: Average 0-1-c risk on the test data as a function of the training data size on synthetic datasets.\n\nTable 1 summarizes some margin losses with the values of \u03b8, C and s that satisfy the assumptions (14) and (15). Their derivations are given in Appendix A.2.\nTheorem 8 (Excess risk bound for CE loss). For all f and c \u2208 (0, 1/2), we have\n\n(1/2) \u2206R_{0-1-c}(r_f, f)^2 \u2264 \u2206R_CE(f).\n\nThe proofs of Theorems 7 and 8 can be found in Appendices A.1 and A.4, respectively. The derivation of the bound for the OVA loss is a natural extension of Yuan and Wegkamp [29] for the binary case. On the other hand, although the CE loss can be regarded as a generalization of the logistic loss in binary classification, the derivation of the excess risk bound for the logistic loss in Yuan and Wegkamp [29] heavily relies on the binary setting and is not applicable to the multiclass case. In fact, the CE loss is generally hard to bound even in the setting without rejection as discussed in Pires and Szepesv\u00e1ri [19]. In this paper, we reduce the analysis of the CE loss into that of the KL divergence instead of trying to extend the argument of Yuan and Wegkamp [29] or Pires and Szepesv\u00e1ri [19]. This enabled us to derive the bound in a considerably simple way.\nThe excess risk bounds in Theorems 7 and 8 ensure that the minimization of the expected surrogate risk leads to the minimization of the 0-1-c risk. 
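
Assumption (14) of Theorem 7 can be checked numerically for the thresholds listed in Table 1. The sketch below is illustrative only: it assumes (14) reads phi'(-theta) / (phi'(-theta) + phi'(theta)) = 1 - c and that Table 1 pairs the logistic, exponential, squared and squared hinge losses with theta = log((1 - c)/c), (1/2) log((1 - c)/c), 1 - 2c and 1 - 2c, respectively; the dictionary and function names are hypothetical.

```python
import math

# Derivative phi'(z) of each margin loss, paired with its threshold theta(c).
# The pairing is an assumed reading of Table 1.
LOSSES = {
    "logistic": (lambda z: -1.0 / (1.0 + math.exp(z)),
                 lambda c: math.log((1.0 - c) / c)),
    "exponential": (lambda z: -math.exp(-z),
                    lambda c: 0.5 * math.log((1.0 - c) / c)),
    "squared": (lambda z: -2.0 * (1.0 - z),
                lambda c: 1.0 - 2.0 * c),
    "squared_hinge": (lambda z: -2.0 * max(0.0, 1.0 - z),
                      lambda c: 1.0 - 2.0 * c),
}


def lhs_of_condition_14(dphi, theta):
    """Left-hand side of (14): phi'(-theta) / (phi'(-theta) + phi'(theta))."""
    return dphi(-theta) / (dphi(-theta) + dphi(theta))


def check_thresholds(costs=(0.05, 0.2, 0.4)):
    """Return {(loss name, c): |LHS of (14) - (1 - c)|} for each loss and cost."""
    gaps = {}
    for name, (dphi, theta_of_c) in LOSSES.items():
        for c in costs:
            lhs = lhs_of_condition_14(dphi, theta_of_c(c))
            gaps[(name, c)] = abs(lhs - (1.0 - c))
    return gaps


if __name__ == "__main__":
    for key, gap in check_thresholds().items():
        print(key, gap)
```

Under these assumptions every gap should be zero up to floating-point error, since each theta is exactly the score at which the OVA inverse link in (12) equals 1 - c.
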
On the other hand, we can also derive an estimation\nerror bound for the above losses, which shows that the minimization of the empirical surrogate\nrisk leads to the minimization of the expected surrogate risk for a \ufb01nite number of samples with a\nhypothesis class of our interest. Combining these results completes the scenario to minimize the 0-1-c\nrisk from \ufb01nite number of samples under the considered hypothesis class. Here the derivation of the\nestimation error bound using the notion of Rademacher complexity [2] is given in Appendix A.3.\n\n5 Experiments\n\nIn this section, we report the results of two experiments based on synthetic and benchmark datasets.\nThe purpose of the experiment on synthetic datasets is to verify the performance of calibration for the\nsetting where Bayes-optimal 0-1-c risk is available. On the other hand, we use benchmark datasets to\nevaluate the practical performance.\nCommon setup: For all methods, we used one-hidden-layer neural networks with the recti\ufb01ed\nlinear units (ReLU) as activation functions, where the number of hidden units is 3 for synthetic\ndatasets, and 50 for benchmark datasets. We added weight decay with candidates {10\u22127, 10\u22124, 10\u22121}.\nAMSGRAD [21] was used for optimization. More detailed setups can be found in Appendix C.\nSynthetic datasets: Here we report the performance of four methods analyzed in this paper. For the\nclassi\ufb01er-rejector approach, we used the MPC loss with the logistic loss in (7), where we used \u03b1 = 1\nas in Cortes et al. [9]. To see the performance of the rejector, we set two values for \u03b2 to satisfy either\nof (6) denoted by MPC+log+acc and MPC+log+rej, respectively. It is expected that MPC+log+acc\nwill over-accept the data, and MPC+log+rej will over-reject the data as discussed in Remark 1. For\nthe con\ufb01dence-based approach, we used the CE loss (CE) and OVA loss with the logistic loss in (11)\ndenoted by OVA+log. 
Synthetic data consist of eight classes. More detailed information on the data generation process can be found in Appendix C.1.\n\n[Figure 1 plot: three panels (c = 0.05, c = 0.2, c = 0.4); x-axis: training size (10^3 to 10^5); y-axis: 0-1-c risk; curves: MPC+log+acc, MPC+log+rej, OVA+log, CE, Bayes 0-1-c risk, Bayes 0-1 risk]\n\nTable 2: Mean and standard deviation of the ratio (%) of the rejected data over all test data on synthetic datasets when the training data size is 10,000 per class.\n\nc       MPC+log+acc    MPC+log+rej    OVA+log       CE\n0.05    25.4 (8.6)     46.4 (7.6)     43.9 (1.3)    33.9 (0.5)\n0.2     0.0 (0.0)      23.2 (1.6)     31.5 (0.3)    28.3 (1.5)\n0.4     0.0 (0.0)      28.8 (9.9)     23.1 (0.7)    17.3 (0.8)\n\nFigure 2: Average and standard error of 0-1-c risk on the test data as a function of rejection cost c on benchmark datasets for 10 trials. The standard error is plotted in shaded regions.\n\nFigure 1 shows the average 0-1-c risk on the test data for various training data sizes, where a lower 0-1-c risk is better. CE shows the best performance in terms of convergence to the Bayes-optimal 0-1-c risk for all values of c. In spite of the theoretical guarantees of the confidence-based methods of OVA losses, they did not show better performance than the others. A possible reason is that the inverse link function of the OVA loss is not normalized as can be seen from (12), which resulted in poor estimation of class probability \u03b7(x). It is observed that the classifier-rejector methods (MPC+log+acc and MPC+log+rej) show unstable performance compared to the other methods. Table 2 shows the rejection ratio when the training data size is 10,000 per class. 
We can con\ufb01rm that MPC+log+acc\ntends to over-accept and MPC+log+rej tends to over-reject the data, which agrees with the discussion\nin Remark 1.\nBenchmark datasets: We compared the empirical performance using benchmark datasets with\nrejection cost ranged over c \u2208 {0.05, 0.1, 0.2, 0.3, 0.4}. In addition to APC+log, MPC+log, OVA+log\nand CE, we further implemented the existing method proposed in Ramaswamy et al. [20] (OVA+hin),\nwhich uses OVA loss with non-smooth hinge loss in (11). We show the results of vehicle, satimage,\nyeast, covtype and letter datasets from UCI Machine Learning Repository [17], which are the same\ndatasets as those used in Ramaswamy et al. [20]. Table 3 summarizes the speci\ufb01cation of the\nbenchmark datasets we used. For the classi\ufb01er-rejector methods (APC+log, MPC+log), we have extra\nparameters \u03b1 and \u03b2. We set \u03b1 = 1 as in Cortes et al. [9]. We chose \u03b2 by cross-validation, where the\nchoices of \u03b2 that satis\ufb01es either of (6) were also included. In the OVA+hin formulation, Ramaswamy\net al. [20] suggested that the threshold parameter \u03c4 \u2208 (\u22121, 1) in their methods is preferable at 0.\nNevertheless, we observed that the performance is considerably affected by its choice and thus we\ndecided to choose the best parameter from \ufb01ve candidates by cross-validation. See Appendix C.2 for\nthe detailed information on experimental setups. Note that APC+log, MPC+log and OVA+hin must\nbe re-trained for different rejection costs, while OVA+log and CE do not need re-training. The full\nexperimental results including the performance of other methods, the full report of the 0-1-c risk, the\naccuracy of the non-rejected data, and the rejection ratio can be found in Appendix C.\nFigure 2 illustrates the 0-1-c risk as functions of the rejection cost. It can be observed that CE is\neither competitive or preferable in all datasets. 
Despite its calibration guarantees, OVA+log is\noutperformed by CE on all datasets and is even outperformed by MPC+log on the letter dataset. The\nfailure of the OVA methods on letter might be due to their weakness for a large number of classes [5]\nand poor estimation of \u03b7(x). It is also worth noting that the standard deviations of MPC+log and\nOVA+hin are considerably larger than those of OVA+log and CE, which might be caused by the\nadditional hyper-parameters \u03b2 and \u03c4. Moreover, model fitting for a rejector and the non-convexity\nof the MPC loss function also make MPC+log unstable. Table 4 shows the mean and standard\ndeviation of the accuracy on non-rejected data. As can be seen clearly on the yeast dataset, unlike the\nconfidence-based methods, the classifier-rejector methods reject all the test data even when the value\n\n[Figure 2 plot panels: vehicle, satimage, letter; x-axis: c; y-axis: 0-1-c risk; curves: MPC+log, OVA+log, CE, OVA+hin]\n\nTable 3: Specification of benchmark datasets: the number of features, the number of classes, the\nnumber of training data, and the number of test data.\n\nName      #features  #classes  #train  #test\nvehicle   18         4         700     146\nsatimage  36         6         4435    2000\nyeast     8          10        1000    484\ncovtype   54         7         15120   565892\nletter    16         26        15000   5000\n\nTable 4: Mean and standard deviation of the accuracy (%) of the non-rejected data samples for 10\ntrials. Best and equivalent methods (with 5% t-test) with respect to the 0-1-c risk are shown in bold\nface. 
\u201c\u2013\u201d corresponds to the case where all the test data samples are rejected.\n\ndataset   c     APC+log     MPC+log     OVA+log     CE\nvehicle   0.05  \u2013 ( \u2013 )     96.6 (2.3)  100 (0.0)   100 (0.0)\nvehicle   0.2   98.4 (1.9)  92.4 (3.0)  97.9 (0.7)  97.4 (0.1)\nvehicle   0.4   89.1 (2.9)  85.3 (4.2)  90.2 (1.6)  91.7 (0.9)\nsatimage  0.05  99.1 (0.2)  97.2 (1.4)  98.7 (0.1)  98.3 (0.1)\nsatimage  0.2   95.0 (1.0)  92.6 (1.2)  96.2 (0.2)  95.7 (0.1)\nsatimage  0.4   91.5 (0.7)  89.0 (1.1)  92.2 (0.3)  91.8 (0.2)\nyeast     0.05  \u2013 ( \u2013 )     \u2013 ( \u2013 )     \u2013 ( \u2013 )     \u2013 ( \u2013 )\nyeast     0.2   \u2013 ( \u2013 )     \u2013 ( \u2013 )     \u2013 ( \u2013 )     80.6 (6.2)\nyeast     0.4   \u2013 ( \u2013 )     \u2013 ( \u2013 )     75.0 (3.9)  76.6 (1.7)\ncovtype   0.05  79.5 (2.1)  79.8 (1.7)  82.1 (2.7)  82.0 (3.2)\ncovtype   0.2   74.0 (1.8)  73.8 (1.0)  74.9 (1.4)  77.1 (0.3)\ncovtype   0.4   69.8 (1.3)  64.9 (3.4)  68.7 (1.1)  69.4 (1.8)\nletter    0.05  99.8 (0.1)  98.6 (0.2)  99.6 (0.2)  99.8 (0.0)\nletter    0.2   97.9 (0.3)  96.9 (0.5)  98.3 (0.2)  98.4 (0.1)\nletter    0.4   95.2 (0.5)  94.6 (3.8)  94.6 (0.2)  94.9 (0.3)\n\nof c is large. This implies that if the dataset is hard to learn, then classifier-rejector methods may fail\nto learn the rejector.\n\n6 Conclusion\n\nWe presented a series of theoretical results on multiclass classification with rejection. First, we\nprovided a necessary condition for rejection calibration for the classifier-rejector approach, which\nsuggests that calibration is difficult for this approach in the multiclass case. Second, we investigated\nthe confidence-based approach and established the calibration results for the OVA loss and the CE loss\nby deriving excess risk bounds. Experimental results suggested that the CE loss is the most preferable\nand that, unlike in the binary case, the classifier-rejector approach can no longer outperform the\nconfidence-based methods.\n\nAcknowledgements\n\nWe thank Han Bao for fruitful discussions. 
We also thank anonymous reviewers for providing\ninsightful comments. NC was supported by MEXT scholarship and JST AIP challenge. JH was\nsupported by KAKENHI 18K17998. MS was supported by the International Research Center for\nNeurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.\n\nReferences\n\n[1] P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823\u20131840, 2008.\n\n[2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487\u20131537, 2002.\n\n[3] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138\u2013156, 2006.\n\n[4] S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496\u2013514, 2003.\n\n[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2nd edition, 2006.\n\n[6] C. K. Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6(4):247\u2013254, 1957.\n\n[7] C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41\u201346, 1970.\n\n[8] C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In Proceedings of International Conference on Algorithmic Learning Theory, pages 67\u201382, 2016.\n\n[9] C. Cortes, G. DeSalvo, and M. Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems 29, pages 1660\u20131668, 2016.\n\n[10] C. Cortes, G. DeSalvo, C. Gentile, M. Mohri, and S. Yang. Online learning with abstention. In Proceedings of the 35th International Conference on Machine Learning, pages 1059\u20131067, 2018.\n\n[11] B. Dubuisson and M. Masson. 
A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 26:155\u2013165, 1993.\n\n[12] V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558\u20131590, 2012.\n\n[13] A. Garcia, S. Essid, C. Clavel, and F. d\u2019Alch\u00e9 Buc. Structured output learning with abstention: Application to accurate opinion prediction. In Proceedings of the 35th International Conference on Machine Learning, pages 1695\u20131703, 2018.\n\n[14] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems 21, pages 537\u2013544, 2009.\n\n[15] K. Hamid, A. Asif, W. Abbasi, D. Sabih, and F. A. Minhas. Machine learning with abstention for automated liver disease diagnosis. In Proceedings of International Conference on Frontiers of Information Technology, pages 356\u2013361, 2017.\n\n[16] R. Herbei and M. H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709\u2013721, 2006.\n\n[17] M. Lichman et al. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.\n\n[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.\n\n[19] B. \u00c1. Pires and C. Szepesv\u00e1ri. Multiclass classification calibration functions. arXiv preprint arXiv:1609.06385, 2016.\n\n[20] H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12:530\u2013554, 2018.\n\n[21] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In Proceedings of International Conference on Learning Representations, 2018.\n\n[22] M. D. Reid and R. C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387\u20132422, 2010.\n\n[23] D. 
Tax and R. Duin. Growing a multi-class classifier with a reject option. Pattern Recognition Letters, 29(10):1565\u20131570, 2008.\n\n[24] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007\u20131025, 2007.\n\n[25] E. Vernet, M. D. Reid, and R. C. Williamson. Composite multiclass losses. In Advances in Neural Information Processing Systems 24, pages 1224\u20131232, 2011.\n\n[26] M. Wegkamp and M. Yuan. Support vector machines with a reject option. Bernoulli, 17(4):1368\u20131385, 2011.\n\n[27] J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Royal Holloway, 1998.\n\n[28] Q. Wu, C. Jia, and W. Chen. A novel classification-rejection sphere SVMs for multi-class classification problems. In Proceedings of the 3rd International Conference on Natural Computation, volume 1, pages 34\u201338, 2007.\n\n[29] M. Yuan and M. H. Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111\u2013130, 2010.\n\n[30] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225\u20131251, 2004.\n\n[31] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56\u201385, 2004.\n", "award": [], "sourceid": 1475, "authors": [{"given_name": "Chenri", "family_name": "Ni", "institution": "The University of Tokyo"}, {"given_name": "Nontawat", "family_name": "Charoenphakdee", "institution": "The University of Tokyo / RIKEN"}, {"given_name": "Junya", "family_name": "Honda", "institution": "The University of Tokyo / RIKEN"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "RIKEN / University of Tokyo"}]}