{"title": "Agnostic Selective Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1665, "page_last": 1673, "abstract": "For a learning problem whose associated excess loss class is $(\\beta,B)$-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate $O(B\\theta (\\sqrt{1/m}))^\\beta)$, where $\\theta$ is Hanneke's disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary.", "full_text": "Agnostic Selective Classi\ufb01cation\n\nRan El-Yaniv and Yair Wiener\nComputer Science Department\n\nTechnion \u2013 Israel Institute of Technology\n\n{rani,wyair}@{cs,tx}.technion.ac.il\n\nAbstract\n\nFor a learning problem whose associated excess loss class is (\u03b2, B)-Bernstein, we\nshow that it is theoretically possible to track the same classi\ufb01cation performance\nof the best (unknown) hypothesis in our class, provided that we are free to abstain\nfrom prediction in some region of our choice. The (probabilistic) volume of this\n1/m)\u03b2),\nrejected region of the domain is shown to be diminishing at rate O(B\u03b8(\nwhere \u03b8 is Hanneke\u2019s disagreement coef\ufb01cient. The strategy achieving this perfor-\nmance has computational barriers because it requires empirical error minimization\nin an agnostic setting. 
Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary.

1 Introduction

Is it possible to achieve the same test performance as the best classifier in hindsight? The answer to this question is "probably not." However, if we change the rules of the standard game, it becomes possible. Indeed, consider a game where our classifier is allowed to abstain from prediction, without penalty, in some region of our choice. For this case, and assuming a noise-free "realizable" setting, it was shown in [1] that there is a "perfect classifier." This means that after observing only a finite labeled training sample, the learning algorithm outputs a classifier that, with certainty, will never err on any test point. To achieve this, the classifier must refuse to classify in some region of the domain. Perhaps surprisingly, it was shown that the volume of this rejection region is bounded and, in fact, diminishes with increasing training set size (under certain conditions). An open question, posed in [1], is what would be an analogous notion of perfection in an agnostic, noisy setting. Is it possible to achieve any kind of perfection in a real-world scenario?
The setting under consideration, where classifiers can abstain from prediction, is called classification with a reject option [2, 3], or selective classification [1]. Focusing on this model, in this paper we present a blend of theoretical and practical results. We first show that the concept of "perfect classification" that was introduced for the realizable case in [1] can be extended to the agnostic setting.
While pure perfection is impossible to accomplish in a noisy environment, a more realistic objective is to perform as well as the best hypothesis in the class within a region of our choice. We call this type of learning "weakly optimal" selective classification and show that a novel strategy accomplishes this type of learning with a diminishing rejection rate under certain Bernstein-type conditions (a stronger notion of optimality is mentioned later as well). This strategy relies on empirical risk minimization, which is computationally difficult. In the practical part of the paper we present a heuristic approximation algorithm, which relies on constrained SVMs and mimics the optimal behavior. We conclude with numerical examples that examine the empirical performance of the new algorithm and compare it with that of the widely used selective classification method for rejection, based on distance from decision boundary.

2 Selective classification and other definitions

Consider a standard agnostic binary classification setting where X is some feature space and H is our hypothesis class of binary classifiers, h : X → {±1}. Given a finite training sample of m labeled examples, Sm = {(xi, yi)}, i = 1, ..., m, assumed to be sampled i.i.d. from some unknown underlying distribution P(X, Y) over X × {±1}, our goal is to select the best possible classifier from H.
For any h ∈ H, its true error, R(h), and its empirical error, R̂(h), are

R(h) ≜ Pr(X,Y)∼P {h(X) ≠ Y},    R̂(h) ≜ (1/m) Σ_{i=1}^{m} I(h(xi) ≠ yi).

Let ĥ ≜ arg inf_{h∈H} R̂(h) be the empirical risk minimizer (ERM), and h* ≜ arg inf_{h∈H} R(h) the true risk minimizer.
In selective classification [1], given Sm we need to select a binary selective classifier defined to be a pair (h, g), with h ∈ H being a standard binary classifier, and g : X → {0, 1} a selection function defining the sub-region of activity of h in X. For any x ∈ X,

(h, g)(x) ≜ reject, if g(x) = 0;  h(x), if g(x) = 1.    (1)

Selective classification performance is characterized in terms of two quantities: coverage and risk. The coverage of (h, g) is

Φ(h, g) ≜ E[g(X)].

For a bounded loss function ℓ : Y × Y → [0, 1], the risk of (h, g) is defined as the average loss on the accepted samples,

R(h, g) ≜ E[ℓ(h(X), Y) · g(X)] / Φ(h, g).

As pointed out in [1], the trade-off between risk and coverage is the main characteristic of a selective classifier. This trade-off is termed there the "risk-coverage curve" (RC curve).¹
Let G ⊆ H. The disagreement set [4, 1] w.r.t. G is defined as

DIS(G) ≜ {x ∈ X : ∃ h1, h2 ∈ G s.t. h1(x) ≠ h2(x)}.

For any hypothesis class H, target hypothesis h ∈ H, distribution P, sample Sm, and real r > 0, define

V(h, r) ≜ {h′ ∈ H : R(h′) ≤ R(h) + r}  and  V̂(h, r) ≜ {h′ ∈ H : R̂(h′) ≤ R̂(h) + r}.    (2)

Finally, for any h ∈ H we define a ball in H of radius r around h [5].
Specifically, with respect to class H, marginal distribution P over X, h ∈ H, and real r > 0, define

B(h, r) ≜ {h′ ∈ H : Pr_{X∼P} {h′(X) ≠ h(X)} ≤ r}.

¹Some authors refer to an equivalent variant of this curve as the "Accuracy-Rejection Curve" or ARC.

3 Perfect and weakly optimal selective classifiers

The concept of perfect classification was introduced in [1] within a realizable selective classification setting. Perfect classification is an extreme case of selective classification where a selective classifier (h, g) achieves R(h, g) = 0 with certainty; that is, the classifier never errs on its region of activity. Obviously, the classifier must sacrifice a sufficiently large part of the domain X in order to achieve this outstanding performance. Surprisingly, it was shown in [1] that non-trivial perfect classification exists, in the sense that under certain conditions (e.g., a finite hypothesis class) the rejected region diminishes at rate Ω(1/m), where m is the size of the training set.
In agnostic environments, as we consider here, such perfect classification appears to be out of reach. In general, in the worst case no hypothesis can achieve zero error over any nonempty subset of the domain. We consider here the following weaker, but still extremely desirable, behavior, which we call "weakly optimal selective classification." Let h* ∈ H be the true risk minimizer of our problem. Let (h, g) be a selective classifier selected after observing the training set Sm. We say that (h, g) is a weakly optimal selective classifier if, for any 0 < δ < 1, with probability of at least 1 − δ over random choices of Sm, R(h, g) ≤ R(h*, g).
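In empirical terms, the coverage Φ(h, g) and the selective risk R(h, g) of any candidate pair (h, g) can be estimated from a labeled sample. The following is a minimal sketch; the function name and interface are ours, not from the paper:

```python
import numpy as np

def selective_risk_coverage(h_pred, g_sel, y):
    """Empirical coverage Phi(h,g) and selective risk R(h,g) on a labeled sample.

    h_pred: predictions of h on the sample; g_sel: 0/1 selection decisions of g;
    y: true labels. Risk is the 0/1 loss averaged over the accepted points only.
    """
    g_sel = np.asarray(g_sel, dtype=bool)
    h_pred = np.asarray(h_pred)
    y = np.asarray(y)
    coverage = float(g_sel.mean())
    if coverage == 0.0:
        return 0.0, 0.0  # empty coverage: selective risk is undefined; report 0
    risk = float((h_pred[g_sel] != y[g_sel]).mean())
    return risk, coverage
```

For example, a classifier that rejects exactly its misclassified points attains zero selective risk at the corresponding (partial) coverage.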
That is, with high probability our classifier is at least as good as the true risk minimizer over its region of activity. We call this classifier "weakly optimal" because a stronger requirement would be that the classifier should achieve the best possible error among all hypotheses in H restricted to the region of activity defined by g.

4 A learning strategy

We now present a strategy that will be shown later to achieve non-trivial weakly optimal selective classification under certain conditions. We call it a "strategy" rather than an "algorithm" because it does not include implementation details.
Let's begin with some motivation. Using standard concentration inequalities one can show that the training error of the true risk minimizer, h*, cannot be "too far" from the training error of the empirical risk minimizer, ĥ. Therefore, we can guarantee, with high probability, that the class of all hypotheses with "sufficiently low" empirical error includes the true risk minimizer h*. Selecting only the subset of the domain on which all hypotheses in that class agree is then sufficient to guarantee weak optimality. Strategy 1 formulates this idea. In the next section we analyze this strategy and show that it achieves this optimality with non-trivial (bounded) coverage.

Strategy 1 Learning strategy for weakly optimal selective classifiers
Input: Sm; m; δ; d
Output: a selective classifier (h, g) such that R(h, g) = R(h*, g) w.p. 1 − δ
1: Set ĥ = ERM(H, Sm), i.e., ĥ is any empirical risk minimizer from H
2: Set G = V̂(ĥ, 4√(2(d ln(2me/d) + ln(8/δ))/m))  (see Eq. (2))
3: Construct g such that g(x) = 1 ⟺ x ∈ {X \ DIS(G)}
4: h = ĥ

5 Analysis

We begin with a few definitions.
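For intuition, Strategy 1 can be sketched end-to-end for a small finite hypothesis class, where the ERM and the low-empirical-error set G are computed by enumeration. Here we use threshold classifiers on the line; the `slack` parameter stands in for the sample-size-dependent term in line 2 of the strategy, and all names are ours:

```python
import numpy as np

def strategy1(X, y, hypotheses, slack):
    # Step 1: h_hat = ERM over the (finite) class, by enumeration.
    errs = np.array([np.mean(h(X) != y) for h in hypotheses])
    h_hat = hypotheses[int(np.argmin(errs))]
    # Step 2: G = all hypotheses whose empirical error is within `slack` of the ERM's.
    G = [h for h, e in zip(hypotheses, errs) if e <= errs.min() + slack]
    # Step 3: g(x) = 1 iff x lies outside the disagreement region DIS(G).
    def g(x):
        preds = {int(h(np.array([x]))[0]) for h in G}
        return 1 if len(preds) == 1 else 0
    return h_hat, g

# Toy data and a finite class of threshold classifiers h_t(x) = sign(x - t).
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1])
hyps = [lambda x, t=t: np.where(x >= t, 1, -1) for t in (0.5, 1.5, 2.5)]
h_hat, g = strategy1(X, y, hyps, slack=0.3)
```

With slack 0.3 all three thresholds enter G, so a point between the extreme thresholds (e.g., x = 1.0) falls in the disagreement region and is rejected, while points far from the boundary are accepted.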
Consider an instance of a binary learning problem with hypothesis class H, an underlying distribution P over X × Y, and a loss function ℓ(Y, Y). Let h* = arg inf_{h∈H} {Eℓ(h(X), Y)} be the true risk minimizer. The associated excess loss class [6] is defined as

F ≜ {ℓ(h(x), y) − ℓ(h*(x), y) : h ∈ H}.

Class F is said to be a (β, B)-Bernstein class with respect to P (where 0 < β ≤ 1 and B ≥ 1) if every f ∈ F satisfies

Ef² ≤ B(Ef)^β.

Bernstein classes arise in many natural situations; see discussions in [7, 8]. For example, if the probability P(X, Y) satisfies Tsybakov's noise conditions, then the excess loss class is a Bernstein class [8, 9]. In the following sequence of lemmas and theorems we assume a binary hypothesis class H with VC-dimension d, an underlying distribution P over X × {±1}, and ℓ is the 0/1 loss function. Also, F denotes the associated excess loss class. Our results can be extended to losses other than 0/1 by techniques similar to those used in [10].
Lemma 5.1. If F is a (β, B)-Bernstein class with respect to P, then for any r > 0,

V(h*, r) ⊆ B(h*, Br^β).

Proof. If h ∈ V(h*, r) then, by definition,

E{I(h(X) ≠ Y)} ≤ E{I(h*(X) ≠ Y)} + r.

Using the linearity of expectation we have

E{I(h(X) ≠ Y) − I(h*(X) ≠ Y)} ≤ r.    (3)

Since F is a (β, B)-Bernstein class,

E{I(h(X) ≠ h*(X))} = E{|I(h(X) ≠ Y) − I(h*(X) ≠ Y)|} = E{(ℓ(h(X), Y) − ℓ(h*(X), Y))²} = Ef² ≤ B(Ef)^β = B(E{I(h(X) ≠ Y) − I(h*(X) ≠ Y)})^β.

By (3), for any r > 0, E{I(h(X) ≠ h*(X))} ≤ Br^β.
Therefore, by definition, h ∈ B(h*, Br^β). □

Throughout this section we denote

σ(m, δ, d) ≜ 2√(2(d ln(2me/d) + ln(2/δ))/m).

Theorem 5.2 ([11]). For any 0 < δ < 1, with probability of at least 1 − δ over the choice of Sm from P^m, any hypothesis h ∈ H satisfies

R(h) ≤ R̂(h) + σ(m, δ, d).

Similarly, R̂(h) ≤ R(h) + σ(m, δ, d) under the same conditions.
Lemma 5.3. For any r > 0 and 0 < δ < 1, with probability of at least 1 − δ,

V̂(ĥ, r) ⊆ V(h*, 2σ(m, δ/2, d) + r).

Proof. If h ∈ V̂(ĥ, r), then, by definition, R̂(h) ≤ R̂(ĥ) + r. Since ĥ minimizes the empirical error, we have R̂(ĥ) ≤ R̂(h*). Using Theorem 5.2 twice, and applying the union bound, we know that w.p. of at least 1 − δ,

R(h) ≤ R̂(h) + σ(m, δ/2, d)  ∧  R̂(h*) ≤ R(h*) + σ(m, δ/2, d).

Therefore, R(h) ≤ R(h*) + 2σ(m, δ/2, d) + r, and h ∈ V(h*, 2σ(m, δ/2, d) + r). □

For any G ⊆ H and distribution P we define ΔG ≜ Pr{DIS(G)}. Hanneke introduced a complexity measure for active learning problems termed the disagreement coefficient [5]. The disagreement coefficient of h with respect to H under distribution P is

θ_h ≜ sup_{r>ε} ΔB(h, r)/r,    (4)

where ε = 0. The disagreement coefficient of the hypothesis class H with respect to P is defined as

θ ≜ lim sup_{k→∞} θ_{h^(k)},

where h^(k) is any sequence of h^(k) ∈ H with R(h^(k)) monotonically decreasing.
Theorem 5.4. Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class w.r.t.
P. Then, for any r > 0 and 0 < δ < 1, with probability of at least 1 − δ,

ΔV̂(ĥ, r) ≤ Bθ (2σ(m, δ/2, d) + r)^β.

Proof. Applying Lemmas 5.3 and 5.1, we get that with probability of at least 1 − δ,

V̂(ĥ, r) ⊆ B(h*, B(2σ(m, δ/2, d) + r)^β).

Therefore,

ΔV̂(ĥ, r) ≤ ΔB(h*, B(2σ(m, δ/2, d) + r)^β).

By the definition of the disagreement coefficient, for any r′ > 0, ΔB(h*, r′) ≤ θr′. □

Theorem 5.5. Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class w.r.t. P. Let (h, g) be the selective classifier chosen by Strategy 1. Then, with probability of at least 1 − δ,

Φ(h, g) ≥ 1 − Bθ (4σ(m, δ/4, d))^β  ∧  R(h, g) = R(h*, g).

Proof. Applying Theorem 5.2, we get that with probability of at least 1 − δ/4,

R̂(h*) ≤ R(h*) + σ(m, δ/4, d).

Since h* minimizes the true error, we get that R(h*) ≤ R(ĥ). Applying Theorem 5.2 again, we know that with probability of at least 1 − δ/4, R(ĥ) ≤ R̂(ĥ) + σ(m, δ/4, d). Applying the union bound, we have that with probability of at least 1 − δ/2, R̂(h*) ≤ R̂(ĥ) + 2σ(m, δ/4, d). Hence, with probability of at least 1 − δ/2, h* ∈ V̂(ĥ, 2σ(m, δ/4, d)) = G. We note that the selection function g(x) equals one only for x ∈ X \ DIS(G). Therefore, for any x ∈ X for which g(x) = 1, all the hypotheses in G agree, and in particular h* and ĥ agree.
Thus,

R(ĥ, g) = E{I(ĥ(X) ≠ Y) · g(X)} / E{g(X)} = E{I(h*(X) ≠ Y) · g(X)} / E{g(X)} = R(h*, g).

Applying Theorem 5.4 and the union bound, we therefore know that with probability of at least 1 − δ,

Φ(ĥ, g) = E{g(X)} = 1 − ΔG ≥ 1 − Bθ (4σ(m, δ/4, d))^β. □

Hanneke introduced, in his original work [5], an alternative definition of the disagreement coefficient θ, for which the supremum in (4) is taken with respect to any fixed ε > 0. Using this alternative definition it is possible to show that fast coverage rates are achievable not only for finite disagreement coefficients (Theorem 5.5), but also when the disagreement coefficient grows slowly with respect to 1/ε (as shown by Wang [12], under sufficient smoothness conditions). This extension will be discussed in the full version of this paper.

6 A disbelief principle and the risk-coverage trade-off

Theorem 5.5 tells us that the strategy presented in Section 4 not only outputs a weakly optimal selective classifier, but that this classifier also has guaranteed coverage (under some conditions). As emphasized in [1], in practical applications it is desirable to allow for some control over the trade-off between risk and coverage; in other words, we would like to be able to develop the entire risk-coverage curve for the classifier at hand and select the cutoff point along this curve ourselves, in accordance with other practical considerations we may have. How can this be achieved?
The following lemma facilitates the construction of a risk-coverage trade-off curve. The result is an alternative characterization of the selection function g of the weakly optimal selective classifier chosen by Strategy 1.
This result allows for calculating the value of g(x), for any individual test point x ∈ X, without actually constructing g for the entire domain X.
Lemma 6.1. Let (h, g) be a selective classifier chosen by Strategy 1 after observing the training sample Sm. Let ĥ be the empirical risk minimizer over Sm. Let x be any point in X and let

h̃x ≜ argmin_{h∈H} { R̂(h) : h(x) = −sign(ĥ(x)) },

an empirical risk minimizer forced to label x the opposite of ĥ(x). Then

g(x) = 0 ⟺ R̂(h̃x) − R̂(ĥ) ≤ 2σ(m, δ/4, d).

Proof. According to the definition of V̂ (see Eq. (2)),

R̂(h̃x) − R̂(ĥ) ≤ 2σ(m, δ/4, d) ⟺ h̃x ∈ V̂(ĥ, 2σ(m, δ/4, d)).

Thus, ĥ, h̃x ∈ V̂. However, by construction, ĥ(x) = −h̃x(x), so x ∈ DIS(V̂) and g(x) = 0. □

Lemma 6.1 tells us that in order to decide whether point x should be rejected, we need to measure the empirical error R̂(h̃x) of a special empirical risk minimizer, h̃x, which is constrained to label x the opposite of ĥ(x). If this error is sufficiently close to R̂(ĥ), our classifier cannot be too sure about the label of x and we must reject it. This result strongly motivates the following definition of a "disbelief index" for each individual point.
Definition 6.2 (disbelief index). For any x ∈ X, define its disbelief index w.r.t. Sm and H,

D(x) ≜ D(x, Sm) ≜ R̂(h̃x) − R̂(ĥ).

Observe that D(x) is large whenever our model is sensitive to the label of x, in the sense that when we are forced to bend our best model to fit the opposite label of x, our model substantially deteriorates, giving rise to a large disbelief index.
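Definition 6.2 can be made concrete for a finite hypothesis class, where both the unconstrained ERM and the constrained ERM h̃x are found by enumeration. This is a toy stand-in for the constrained-SVM heuristic used later in the paper; all names and the toy class are ours:

```python
import numpy as np

def disbelief_index(x, X, y, hypotheses):
    # Unconstrained ERM h_hat and its empirical error.
    errs = [float(np.mean(h(X) != y)) for h in hypotheses]
    i_hat = int(np.argmin(errs))
    label = int(hypotheses[i_hat](np.array([x]))[0])
    # Constrained ERM: best empirical error among hypotheses forced to flip x's label.
    forced = [e for h, e in zip(hypotheses, errs)
              if int(h(np.array([x]))[0]) == -label]
    if not forced:
        return float('inf')  # no hypothesis can flip x: maximal disbelief
    return min(forced) - errs[i_hat]

# Threshold classifiers h_t(x) = sign(x - t) on toy data.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1])
hyps = [lambda x, t=t: np.where(x >= t, 1, -1) for t in (-0.5, 0.5, 1.5, 2.5)]
```

A point near the decision boundary (x = 1.0) gets a lower disbelief index than a point deep inside a class region (x = 0.0), so it is rejected first when ranking by D(x).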
This large D(x) can be interpreted as our disbelief in the possibility that x can be labeled so differently. In this case we should definitely predict the label of x using our unforced model. Conversely, if D(x) is small, our model is indifferent to the label of x and, in this sense, is not committed to its label. In this case we should abstain from prediction at x.
This "disbelief principle" facilitates an exploration of the risk-coverage trade-off curve for our classifier. Given a pool of test points we can rank these test points according to their disbelief index, and points with a low index should be rejected first. Thus, this ranking provides the means for constructing a risk-coverage trade-off curve.
A similar technique, using an ERM oracle that can enforce an arbitrary number of example-based constraints, was used in [13, 14] in the context of active learning. As in our disbelief index, the difference between the empirical risk (or importance-weighted empirical risk [14]) of two ERM oracles (with different constraints) is used to estimate prediction confidence.

7 Implementation

At this point in the paper we switch from theory to practice, aiming to implement rejection methods inspired by the disbelief principle and to see how well they work on real-world (well, ..., UCI) problems. Attempting to implement a learning algorithm driven by the disbelief index, we face a major bottleneck because the calculation of the index requires the identification of ERM hypotheses. To handle this computationally difficult problem, we "approximate" the ERM as follows. Focusing on SVMs, we use a high C value (10^5 in our experiments) to penalize training errors much more heavily than small margins. In this way the solution to the optimization problem tends to get closer to the ERM. Another problem we face is that the disbelief index is a noisy statistic that highly depends on the sample Sm.
To overcome this noise we use robust statistics. First we generate 11 different samples (S_m^1, S_m^2, ..., S_m^11) using bootstrap sampling. For each sample we calculate the disbelief index for all test points, and for each point we take the median of these measurements as the final index.
We note that for any finite training sample the disbelief index is a discrete variable. It is often the case that several test points share the same disbelief index. In those cases we can use any confidence measure as a tie-breaker. In our experiments we use distance from the decision boundary to break ties.
In order to estimate R̂(h̃x) we have to restrict the SVM optimizer to consider only hypotheses that classify the point x in a specific way. To accomplish this we use a weighted SVM for unbalanced data. We add the point x as another training point with a weight 10 times larger than the weight of all training points combined. Thus, the penalty for misclassification of x is very large and the optimizer finds a solution that does not violate the constraint.

8 Empirical results

Focusing on SVMs with a linear kernel, we compared the RC (risk-coverage) curves achieved by the proposed method with those achieved by SVM with rejection based on distance from the decision boundary. This latter approach is very common in practical applications of selective classification. For implementation we used LIBSVM [15].
Before presenting these results we wish to emphasize that the proposed method leads to rejection regions fundamentally different from those obtained by the traditional distance-based technique. In Figure 1 we depict those regions for a training sample of 150 points sampled from a mixture of two identical normal distributions (centered at different locations).
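Stepping back to the robust-statistics step described earlier in this section, the bootstrap-and-median smoothing can be sketched generically; `index_fn` is any single-sample disbelief estimate, and the interface is ours:

```python
import numpy as np

def robust_disbelief(x, X, y, index_fn, n_boot=11, seed=0):
    # Median of the disbelief index over bootstrap resamples of the training set,
    # mirroring the 11-sample smoothing described in Section 7.
    rng = np.random.default_rng(seed)
    m = len(y)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)  # resample with replacement
        vals.append(index_fn(x, X[idx], y[idx]))
    return float(np.median(vals))
```

The median (rather than the mean) keeps a few unlucky resamples from dominating the final index.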
The height map reflects the "confidence regions" of each technique according to its own confidence measure.

Figure 1: Confidence height map using (a) the disbelief index; (b) distance from the decision boundary.

We tested our algorithm on standard medical diagnosis problems from the UCI repository, including all datasets used in [16]. We transformed nominal features to numerical ones in a standard way using binary indicator attributes. We also normalized each attribute independently so that its dynamic range is [0, 1]. No other preprocessing was employed.
In each iteration we chose, uniformly at random, non-overlapping training (100 samples) and test (200 samples) sets for each dataset. The SVM was trained on the entire training set, and test samples were sorted according to confidence (using either distance from the decision boundary or the disbelief index). Figure 2 depicts the RC curves of our technique (red solid line) and of rejection based on distance from the decision boundary (green dashed line) for a linear kernel on all 6 datasets. All results are averaged over 500 iterations (error bars show standard error).

Figure 2: RC curves for SVM with a linear kernel. Our method in solid red, and rejection based on distance from the decision boundary in dashed green. The horizontal axis (c) represents coverage.

With the exception of the Hepatitis dataset, on which both methods were statistically indistinguishable, on all other datasets the proposed method exhibits a significant advantage over the traditional approach. We would like to highlight the performance of the proposed method on the Pima dataset. While the traditional approach cannot achieve an error of less than 8% for any rejection rate, with our approach the test error decreases monotonically to zero with the rejection rate.
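The RC curves in these experiments are produced by sweeping a rejection threshold over the confidence ranking: sort test points by confidence, reject the least-confident points first, and record the risk at each coverage level. A minimal sketch of this construction (names are ours):

```python
import numpy as np

def rc_curve(conf, h_pred, y):
    # Sort test points by confidence (disbelief index or distance from the
    # boundary), accept the top-k most confident points for k = 1..n, and
    # report the (coverage, risk) pair for every cutoff.
    order = np.argsort(-np.asarray(conf, dtype=float))  # most confident first
    wrong = (np.asarray(h_pred)[order] != np.asarray(y)[order]).astype(float)
    k = np.arange(1, len(y) + 1)
    coverage = k / len(y)
    risk = np.cumsum(wrong) / k  # 0/1 error among the accepted points
    return coverage, risk
```

When the confidence measure is well calibrated, the errors concentrate among the least-confident points and the curve's risk decreases as coverage shrinks.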
Furthermore, a clear advantage for our method over a large range of rejection rates is evident in the Haberman dataset.²

²The Haberman dataset contains survival data of patients who had undergone surgery for breast cancer. With an estimated 207,090 new cases of breast cancer in the United States during 2010 [17], an improvement of 1% affects the lives of more than 2000 women.

[Figure 2 panels: Hypo, Pima, Hepatitis, Haberman, BUPA, Breast; vertical axis: test error.]

For the sake of fairness, we note that the running time of our algorithm (as presented here) is substantially longer than that of the traditional technique. The performance of our algorithm can be substantially improved when many unlabeled samples are available. Details will be provided in the full paper.

9 Related work

The literature on theoretical studies of selective classification is rather sparse. El-Yaniv and Wiener [1] studied the performance of a simple selective learning strategy for the realizable case. Given a hypothesis class H and a sample Sm, their method abstains from prediction whenever the hypotheses in the version space do not all agree on the target sample. They were able to show that their selective classifier achieves perfect classification with meaningful coverage under some conditions. Our work can be viewed as an extension of the above algorithm to the agnostic case.
Freund et al. [18] studied another simple ensemble method for binary classification. Given a hypothesis class H, the method outputs a weighted average of all the hypotheses in H, where the weight of each hypothesis depends exponentially on its individual training error. Their algorithm abstains from prediction whenever the weighted average of all individual predictions is close to zero.
They were able to bound the probability of misclassification by 2R(h*) + ε(m) and, under some conditions, they proved a bound of 5R(h*) + ε(F, m) on the rejection rate. Our algorithm can be viewed as an extreme variation of the Freund et al. method. We include in our "ensemble" only hypotheses with sufficiently low empirical error, and we abstain if the weighted average of all predictions is not definitive (≠ ±1). Our risk and coverage bounds are asymptotically tighter.
Excess risk bounds were developed by Herbei and Wegkamp [19] for a model where each rejection incurs a cost 0 ≤ d ≤ 1/2. Their bound applies to any empirical risk minimizer over a hypothesis class of ternary hypotheses (whose output is in {±1, reject}). See also various extensions [20, 21].
A rejection mechanism for SVMs based on distance from the decision boundary is perhaps the most widely known and used rejection technique. It is routinely used in medical applications [22, 23, 24]. A few papers have proposed alternative techniques for rejection in the case of SVMs. These include taking the reject area into account during optimization [25], training two SVM classifiers with asymmetric cost [26], and using a hinge loss [20]. Grandvalet et al. [16] proposed an efficient implementation of SVM with a reject option using a double hinge loss. They empirically compared their results with two other selective classifiers: the one proposed by Bartlett and Wegkamp [20] and the traditional rejection mechanism based on distance from the decision boundary. In their experiments, neither method had a statistically significant advantage over the traditional approach for high rejection rates.

10 Conclusion

We presented and analyzed a learning strategy for selective classification that achieves weak optimality.
We showed that the coverage rate directly depends on the disagreement coefficient, thus linking active learning and selective classification. Recently it has been shown that, for the noise-free case, active learning can be reduced to selective classification [27]. We conjecture that such a reduction also holds in noisy settings. Exact implementation of our strategy, or exact computation of the disbelief index, may be too difficult to achieve, or even to approximate with guarantees. We presented one algorithm that heuristically approximates the required behavior, and there is certainly room for other, perhaps better, methods and variants. Our empirical examination of the proposed algorithm indicates that it can provide a significant and consistent advantage over the traditional rejection technique with SVMs. This advantage can be of great value, especially in medical diagnosis applications and other mission-critical classification tasks. The algorithm itself can be implemented using off-the-shelf packages.

Acknowledgments

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.

References
[1] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 11:1605–1641, 2010.
[2] C.K. Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6(4):247–254, 1957.
[3] C.K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.
[4] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353–360, 2007.
[5] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
[6] P.L. Bartlett, S. Mendelson, and P. Philips.
Local complexities for empirical risk minimization. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 2004.

[7] V. Koltchinskii. 2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34:2593–2656, 2006.

[8] P.L. Bartlett and S. Mendelson. Discussion of "2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii. Annals of Statistics, 34:2657–2663, 2006.

[9] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.

[10] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56. ACM, 2009.

[11] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer, 2003.

[12] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. JMLR, pages 2269–2292, 2011.

[13] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

[14] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. Advances in Neural Information Processing Systems 23, 2010.

[15] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[16] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In NIPS, pages 537–544.
MIT Press, 2008.

[17] American Cancer Society. Cancer facts and figures. 2010.

[18] Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004.

[19] R. Herbei and M.H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006.

[20] P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.

[21] M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155–168, 2007.

[22] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J.P. Mesirov, and T. Poggio. Support vector machine classification of microarray data. Technical Report AI Memo 1677, Massachusetts Institute of Technology, 1998.

[23] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, pages 389–422, 2002.

[24] S. Mukherjee. Classifying microarray data using support vector machines (chapter 9). Kluwer Academic Publishers, 2003.

[25] G. Fumera and F. Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines: First International Workshop, pages 811–919, 2002.

[26] R. Sousa, B. Mora, and J.S. Cardoso. An ordinal data method for the classification with reject option. In ICMLA, pages 746–750. IEEE Computer Society, 2009.

[27] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification.
Accepted to JMLR, 2011.