{"title": "Support Vector Machines with a Reject Option", "book": "Advances in Neural Information Processing Systems", "page_first": 537, "page_last": 544, "abstract": "We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow's rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions.", "full_text": "Support Vector Machines with a Reject Option\n\nYves Grandvalet 1, 2, Alain Rakotomamonjy 3, Joseph Keshet 2 and St\u00e9phane Canu 3\n\n1 Heudiasyc, UMR CNRS 6599\nUniversit\u00e9 de Technologie de Compi\u00e8gne\nBP 20529, 60205 Compi\u00e8gne Cedex, France\n\n2 Idiap Research Institute\nCentre du Parc\nCP 592, CH-1920 Martigny, Switzerland\n\n3 LITIS, EA 4108\nUniversit\u00e9 de Rouen & INSA de Rouen\n76801 Saint Etienne du Rouvray, France\n\nAbstract\n\nWe consider the problem of binary classification where the classifier may abstain instead of classifying each observation. 
The Bayes decision rule for this setup, known as Chow\u2019s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions.\n\n1 Introduction\n\nIn decision problems where errors incur a severe loss, one may have to build classifiers that abstain from classifying ambiguous examples. Rejecting these examples has been investigated since the early days of pattern recognition. In particular, Chow (1970) analyses how the error rate may be decreased thanks to the reject option.\nThere have been several attempts to integrate a reject option in Support Vector Machines (SVMs), using strategies based on the thresholding of SVM scores (Kwok, 1999) or on a new training criterion (Fumera & Roli, 2002). These approaches have however critical drawbacks: the former is not consistent and the latter adds considerable computational overhead to the original SVM algorithm and lacks some of its most appealing features like convexity and sparsity.\nWe introduce a piecewise linear and convex training criterion dedicated to the problem of classification with the reject option. 
Our proposal, inspired by the probabilistic interpretation of SVM fitting (Grandvalet et al., 2006), is a double hinge loss, reflecting the two thresholds in Chow\u2019s rule. Hence, we generalize the loss suggested by Bartlett and Wegkamp (2008) to arbitrary asymmetric misclassification and rejection costs. For the symmetric case, our probabilistic viewpoint motivates another decision rule. We then propose the first algorithm specifically dedicated to training SVMs with a double hinge loss. Its implementation shows that our decision rule is at least on par with the one of Bartlett and Wegkamp (2008).\nThe paper is organized as follows. Section 2 defines the problem and recalls the Bayes rule for binary classification with a reject option. The proposed double hinge loss is derived in Section 3, together with the decision rule associated with SVM scores. Section 4 addresses implementation issues: it formalizes the SVM training problem and details an active set algorithm specifically designed for training with the double hinge loss. This implementation is tested empirically in Section 5. Finally, Section 6 concludes the paper.\n\n2 Problem Setting and the Bayes Classifier\n\nClassification aims at predicting a class label y \u2208 Y from an observed pattern x \u2208 X. For this purpose, we construct a decision rule d : X \u2192 A, where A is a set of actions that typically consists in assigning a label to x \u2208 X. In binary problems, where the class is tagged either as +1 or \u22121, the two types of errors are: (i) false positive, where an example labeled \u22121 is predicted as +1, incurring a cost c\u2212; (ii) false negative, where an example labeled +1 is predicted as \u22121, incurring a cost c+.\nIn general, the goal of classification is to predict the true label for an observed pattern. However, patterns close to the decision boundary are misclassified with high probability. 
This problem becomes especially acute in cases where the costs, c\u2212 or c+, are high, such as in medical decision making. In these processes, it might be better to alert the user and abstain from prediction. This motivates the introduction of a reject option for classifiers that cannot predict a pattern with enough confidence. This decision to abstain, which is denoted by 0, incurs a cost, r\u2212 and r+ for examples labeled \u22121 and +1, respectively.\nThe costs pertaining to each possible decision are recapped in the following table, whose rows are the decisions d(x) and whose columns are the true labels y:\n\n            y = +1    y = \u22121\nd(x) = +1   0         c\u2212\nd(x) = 0    r+        r\u2212\nd(x) = \u22121   c+        0\n\nIn what follows, we assume that all costs are strictly positive:\n\n$$c_- > 0 , \\quad c_+ > 0 , \\quad r_- > 0 , \\quad r_+ > 0 . \\tag{1}$$\n\nFurthermore, it should be possible to incur a lower expected loss by choosing the reject option instead of any prediction, that is\n\n$$c_- r_+ + c_+ r_- < c_- c_+ . \\tag{2}$$\n\nBayes\u2019 decision theory is the paramount framework in statistical decision theory, where decisions are taken to minimize expected losses. For classification with a reject option, the overall risk is\n\n$$R(d) = c_+ \\, \\mathbb{E}_{XY}[Y = 1, d(X) = -1] + c_- \\, \\mathbb{E}_{XY}[Y = -1, d(X) = 1] + r_+ \\, \\mathbb{E}_{XY}[Y = 1, d(X) = 0] + r_- \\, \\mathbb{E}_{XY}[Y = -1, d(X) = 0] , \\tag{3}$$\n\nwhere X and Y denote the random variables describing patterns and labels.\nThe Bayes classifier d\u2217 is defined as the minimizer of the risk R(d). 
Since the seminal paper of Chow (1970), this rule is sometimes referred to as Chow\u2019s rule:\n\n$$d^*(x) = \\begin{cases} +1 & \\text{if } P(Y = 1 \\mid X = x) > p_+ \\\\ -1 & \\text{if } P(Y = 1 \\mid X = x) < p_- \\\\ 0 & \\text{otherwise,} \\end{cases} \\tag{4}$$\n\nwhere\n\n$$p_+ = \\frac{c_- - r_-}{c_- - r_- + r_+} \\quad \\text{and} \\quad p_- = \\frac{r_-}{c_+ - r_+ + r_-} .$$\n\nNote that, assuming that (1) and (2) hold, we have 0 < p\u2212 < p+ < 1.\nOne of the major inductive principles is empirical risk minimization, where one minimizes the empirical counterpart of the risk (3). In classification, this principle usually leads to an NP-hard problem, which can be circumvented by using a smooth proxy of the misclassification loss. For example, Vapnik (1995) motivated the hinge loss as a \u201ccomputationally simple\u201d (i.e., convex) surrogate of classification error. The following section is dedicated to the construction of such a surrogate for classification with a reject option.\n\n3 Training Criterion\n\nOne method to get around the hardness of learning decision functions is to replace the conditional probability P(Y = 1|X = x) with its estimate P\u0302(Y = 1|X = x), and then plug this estimate back in (4) to build a classification rule (Herbei & Wegkamp, 2006). One of the most widespread representatives of this line of attack is the logistic regression model, which estimates the conditional probability using the maximum (penalized) likelihood framework.\n\nFigure 1: Double hinge loss function \u2113_{p\u2212,p+} for positive (left) and negative examples (right), with p\u2212 = 0.4 and p+ = 0.8 (solid: double hinge, dashed: likelihood). Note that the decision thresholds f+ and f\u2212 are not symmetric around zero.\n\nAs a starting point, we consider the generalized logistic regression model for binary classification, where\n\n$$\\widehat{P}(Y = y \\mid X = x) = \\frac{1}{1 + \\exp(-y f(x))} , \\tag{5}$$\n\nand the function f : X \u2192 R is estimated by the minimization of a regularized empirical risk on the training sample T = {(xi, yi)}, i = 1, . . . , n:\n\n$$\\sum_{i=1}^{n} \\ell(y_i, f(x_i)) + \\lambda \\Omega(f) , \\tag{6}$$\n\nwhere \u2113 is a loss function and \u03a9(\u00b7) is a regularization functional, such as the (squared) norm of f in a suitable Hilbert space \u03a9(f) = \u2016f\u2016\u00b2_H, and \u03bb is a regularization parameter. In the standard logistic regression procedure, \u2113 is the negative log-likelihood loss\n\n$$\\ell(y, f(x)) = \\log(1 + \\exp(-y f(x))) .$$\n\nThis loss function is convex and decision-calibrated (Bartlett & Tewari, 2007), but it lacks an appealing feature of the hinge loss used in SVMs, that is, it does not lead to sparse solutions. This drawback is the price to pay for the ability to estimate the posterior probability P(Y = 1|X = x) on the whole range (0, 1) (Bartlett & Tewari, 2007).\nHowever, the definition of the Bayes\u2019 rule (4) clearly shows that the estimation of P(Y = 1|X = x) does not have to be accurate everywhere, but only in the vicinity of p+ and p\u2212. This motivates the construction of a training criterion that focuses on this goal, without estimating P(Y = 1|X = x) on the whole range as an intermediate step. 
Our purpose is to derive such a loss function, without sacrificing sparsity to the consistency of the decision rule.\nThough not a proper negative log-likelihood, the hinge loss can be interpreted in a maximum a posteriori framework: the hinge loss can be derived as a relaxed minimization of negative log-likelihood (Grandvalet et al., 2006). According to this viewpoint, minimizing the hinge loss aims at deriving a loose approximation to the logistic regression model (5) that is accurate only at f(x) = 0, thus allowing one to estimate whether P(Y = 1|X = x) > 1/2 or not. More generally, one can show that, in order to have a precise estimate of P(Y = 1|X = x) = p, the surrogate loss should be tangent to the neg-log-likelihood at f = log(p/(1 \u2212 p)). For a positive example, this tangent is the line \u2212(1 \u2212 p)f(x) + H(p), where H is the binary entropy defined below.\nFollowing this simple constructive principle, we derive the double hinge loss, which aims at reliably estimating P(Y = 1|X = x) at the threshold points p+ and p\u2212. Furthermore, to encourage sparsity, we set the loss to zero for all points classified with high confidence. This loss function is displayed in Figure 1. Formally, for the positive examples, the double hinge loss satisfying the above conditions can be expressed as\n\n$$\\ell_{p_-,p_+}(+1, f(x)) = \\max\\{ -(1 - p_-) f(x) + H(p_-), \\; -(1 - p_+) f(x) + H(p_+), \\; 0 \\} , \\tag{7}$$\n\nand for the negative examples it can be expressed as\n\n$$\\ell_{p_-,p_+}(-1, f(x)) = \\max\\{ p_+ f(x) + H(p_+), \\; p_- f(x) + H(p_-), \\; 0 \\} , \\tag{8}$$\n\nwhere H(p) = \u2212p log(p) \u2212 (1 \u2212 p) log(1 \u2212 p). Note that, unless p\u2212 = 1 \u2212 p+, there is no simple symmetry with respect to the labels.\nAfter training, the decision rule is defined as the plug-in estimation of (4) using the logistic regression probability estimation. 
Letting f+ = log(p+/(1 \u2212 p+)) and f\u2212 = log(p\u2212/(1 \u2212 p\u2212)), the decision rule can be expressed in terms of the function f as follows:\n\n$$d_{p_-,p_+}(x; f) = \\begin{cases} +1 & \\text{if } f(x) > f_+ \\\\ -1 & \\text{if } f(x) < f_- \\\\ 0 & \\text{otherwise.} \\end{cases} \\tag{9}$$\n\nThe following result shows that the rule d_{p\u2212,p+}(\u00b7; f) is universally consistent when f is learned by minimizing empirical risk based on \u2113_{p\u2212,p+}. Hence, in the limit, learning with the double hinge loss is optimal in the sense that the risk for the learned decision rule converges to the Bayes\u2019 risk.\n\nTheorem 1. Let H be a functional space that is dense in the set of continuous functions. Suppose that we have a positive sequence {\u03bbn} with \u03bbn \u2192 0 and n \u03bbn\u00b2 / log n \u2192 \u221e. We define f\u2217_n as\n\n$$f_n^* = \\arg\\min_{f \\in \\mathcal{H}} \\; \\frac{1}{n} \\sum_{i=1}^{n} \\ell_{p_-,p_+}(y_i, f(x_i)) + \\lambda_n \\|f\\|_{\\mathcal{H}}^2 .$$\n\nThen, lim_{n\u2192\u221e} R(d_{p\u2212,p+}(X; f\u2217_n)) = R(d\u2217(X)) holds almost surely, that is, the classifier d_{p\u2212,p+}(\u00b7; f\u2217_n) is strongly universally consistent.\n\nProof. Our theorem follows directly from (Steinwart, 2005, Corollary 3.15), since \u2113_{p\u2212,p+} is regular (Steinwart, 2005, Definition 3.9). Besides mild regularity conditions that hold for \u2113_{p\u2212,p+}, a loss function is said to be regular if, for every \u03b1 \u2208 [0, 1], and every t\u03b1 such that\n\n$$t_\\alpha = \\arg\\min_t \\; \\alpha \\, \\ell_{p_-,p_+}(+1, t) + (1 - \\alpha) \\, \\ell_{p_-,p_+}(-1, t) ,$$\n\nwe have that d_{p\u2212,p+}(t\u03b1, x) agrees with d\u2217(x) almost everywhere.\nLet f1 = \u2212H(p\u2212)/p\u2212, f2 = \u2212(H(p+) \u2212 H(p\u2212))/(p+ \u2212 p\u2212) and f3 = H(p+)/(1 \u2212 p+) denote the hinge locations in \u2113_{p\u2212,p+}(\u00b11, f(x)). 
Note that we have f1 < f\u2212 < f2 < f+ < f3, and that\n\n$$t_\\alpha \\in \\begin{cases} (-\\infty, f_1] & \\text{if } 0 \\le \\alpha < p_- \\\\ [f_1, f_2] & \\text{if } \\alpha = p_- \\\\ \\{f_2\\} & \\text{if } p_- < \\alpha < p_+ \\\\ [f_2, f_3] & \\text{if } \\alpha = p_+ \\\\ [f_3, \\infty) & \\text{if } p_+ < \\alpha \\le 1 \\end{cases} \\;\\Rightarrow\\; d_{p_-,p_+}(t_\\alpha, x) = \\begin{cases} -1 & \\text{if } P(Y = 1 \\mid x) < p_- \\\\ -1 \\text{ or } 0 & \\text{if } P(Y = 1 \\mid x) = p_- \\\\ 0 & \\text{if } p_- < P(Y = 1 \\mid x) < p_+ \\\\ 0 \\text{ or } +1 & \\text{if } P(Y = 1 \\mid x) = p_+ \\\\ +1 & \\text{if } P(Y = 1 \\mid x) > p_+ \\end{cases}$$\n\nwhich is the desired result.\n\nNote also that the analysis of Bartlett and Tewari (2007) can be used to show that minimizing \u2113_{p\u2212,p+} cannot provide consistent estimates of P(Y = 1|X = x) = p for p \u2209 {p\u2212, p+}. This property is desirable regarding sparsity, since sparseness does not occur when the conditional probabilities can be unambiguously estimated.\n\nNote on a Close Relative. A double hinge loss function has been proposed recently with a different perspective by Bartlett and Wegkamp (2008). Their formulation is restricted to symmetric classification, where c+ = c\u2212 = 1 and r+ = r\u2212 = r. In this situation, rejection may occur only if 0 \u2264 r < 1/2, and the thresholds on the conditional probabilities in Bayes\u2019 rule (4) are p\u2212 = 1 \u2212 p+ = r.\nFor symmetric classification, the loss function of Bartlett and Wegkamp (2008) is a scaled version of our proposal that leads to equivalent solutions for f, but our decision rule differs. While our probabilistic derivation of the double hinge loss motivates the decision function (9), the decision rule of Bartlett and Wegkamp (2008) has a free parameter (corresponding to the threshold f+ = \u2212f\u2212) whose value is set by optimizing a generalization bound.\nOur decision rule rejects more examples when the loss incurred by rejection is small and fewer examples otherwise. 
The two rules are identical for r \u2243 0.24. We will see in Section 5 that this difference has noticeable outcomes.\n\n4 SVMs with Double Hinge\n\nIn this section, we show how the standard SVM optimization problem is modified when the hinge loss is replaced by the double hinge loss. The optimization problem is first written using a compact notation, and the dual problem is then derived.\n\n4.1 Optimization Problem\n\nMinimizing the regularized empirical risk (6) with the double hinge loss (7\u20138) is an optimization problem akin to the standard SVM problem. Letting C be an arbitrary constant, we define D = C(p+ \u2212 p\u2212), Ci = C(1 \u2212 p+) for positive examples, and Ci = Cp\u2212 for negative examples. With the introduction of slack variables \u03be and \u03b7, the optimization problem can be stated as\n\n$$\\begin{array}{ll} \\displaystyle \\min_{f,b,\\xi,\\eta} & \\displaystyle \\frac{1}{2} \\|f\\|_{\\mathcal{H}}^2 + \\sum_{i=1}^{n} C_i \\xi_i + D \\sum_{i=1}^{n} \\eta_i \\\\ \\text{s.t.} & y_i (f(x_i) + b) \\ge t_i - \\xi_i , \\quad i = 1, \\dots, n \\\\ & y_i (f(x_i) + b) \\ge \\tau_i - \\eta_i , \\quad i = 1, \\dots, n \\\\ & \\xi_i \\ge 0 , \\quad \\eta_i \\ge 0 , \\quad i = 1, \\dots, n , \\end{array} \\tag{10}$$\n\nwhere, for positive examples, ti = H(p+)/(1 \u2212 p+), \u03c4i = \u2212(H(p\u2212) \u2212 H(p+))/(p\u2212 \u2212 p+), while, for negative examples, ti = H(p\u2212)/p\u2212, \u03c4i = (H(p\u2212) \u2212 H(p+))/(p\u2212 \u2212 p+).\nFor functions f belonging to a Hilbert space H endowed with a reproducing kernel k(\u00b7,\u00b7), efficient optimization algorithms can be drawn from the dual formulation:\n\n$$\\begin{array}{ll} \\displaystyle \\min_{\\alpha,\\gamma} & \\displaystyle \\frac{1}{2} \\gamma^T G \\gamma - \\tau^T \\gamma - (t - \\tau)^T \\alpha \\\\ \\text{s.t.} & y^T \\gamma = 0 \\\\ & 0 \\le \\alpha_i \\le C_i , \\quad i = 1, \\dots, n \\\\ & 0 \\le \\gamma_i - \\alpha_i \\le D , \\quad i = 1, \\dots, n , \\end{array} \\tag{11}$$\n\nwhere y = (y1, . . . , yn)T, t = (t1, . . . , tn)T and \u03c4 = (\u03c41, . . . , \u03c4n)T are vectors of Rn and G is the n \u00d7 n Gram matrix with general entry Gij = yiyjk(xi, xj). Note that (11) is a simple quadratic problem under box constraints. Compared to the standard SVM dual problem, one has an additional vector to optimize, but, with the active set method we developed, we only have to optimize a single vector of Rn. The primal variables f and b are then derived from the Karush-Kuhn-Tucker (KKT) conditions. For f, we have\n\n$$f(\\cdot) = \\sum_{i=1}^{n} \\gamma_i y_i k(\\cdot, x_i) ,$$\n\nand b is obtained in the optimization process described below.\n\n4.2 Solving the Problem\n\nTo solve (11), we use an active set algorithm, following a strategy that proved to be efficient in SimpleSVM (Vishwanathan et al., 2003). This algorithm solves the SVM training problem by a greedy approach, in which one solves a series of small problems. First, the partition of training examples into support and non-support vectors is assumed to be known, and the training criterion is optimized with this partition held fixed. Then, this optimization results in an updated partition of examples into support and non-support vectors. These two steps are iterated until some level of accuracy is reached.\n\nPartitioning the Training Set. The training set is partitioned into five subsets defined by the activity of the box constraints of Problem (11). 
The training examples indexed by:\n\nI0, defined by I0 = {i|\u03b3i = 0}, are such that yi(f(xi) + b) > ti;\nIt, defined by It = {i|0 < \u03b3i < Ci}, are such that yi(f(xi) + b) = ti;\nIC, defined by IC = {i|\u03b3i = Ci}, are such that \u03c4i < yi(f(xi) + b) \u2264 ti;\nI\u03c4, defined by I\u03c4 = {i|Ci < \u03b3i < Ci + D}, are such that yi(f(xi) + b) = \u03c4i;\nID, defined by ID = {i|\u03b3i = Ci + D}, are such that yi(f(xi) + b) \u2264 \u03c4i.\n\nWhen example i belongs to one of the subsets described above, the KKT conditions yield that \u03b1i is either equal to \u03b3i or constant. Hence, provided that the partition of examples into the subsets I0, It, IC, I\u03c4 and ID is known, we only have to consider a problem in \u03b3. Furthermore, \u03b3i has to be computed only for i \u2208 It \u222a I\u03c4.\n\nUpdating Dual Variables. Assuming a correct partition, Problem (11) reduces to the considerably smaller problem of computing \u03b3i for i \u2208 IT = It \u222a I\u03c4:\n\n$$\\begin{array}{ll} \\displaystyle \\min_{\\{\\gamma_i \\mid i \\in I_T\\}} & \\displaystyle \\frac{1}{2} \\sum_{i \\in I_T, \\, j \\in I_T} \\gamma_i \\gamma_j G_{ij} - \\sum_{i \\in I_T} \\gamma_i s_i \\\\ \\text{s.t.} & \\displaystyle \\sum_{i \\in I_T} y_i \\gamma_i + \\sum_{i \\in I_C} C_i y_i + \\sum_{i \\in I_D} (C_i + D) y_i = 0 , \\end{array} \\tag{12}$$\n\nwhere\n\n$$s_i = t_i - \\sum_{j \\in I_C} C_j G_{ji} - \\sum_{j \\in I_D} (C_j + D) G_{ji} \\;\\; \\text{for } i \\in I_t \\quad \\text{and} \\quad s_i = \\tau_i - \\sum_{j \\in I_C} C_j G_{ji} - \\sum_{j \\in I_D} (C_j + D) G_{ji} \\;\\; \\text{for } i \\in I_\\tau .$$\n\nNote that the box constraints of Problem (11) do not appear here, because we assumed the partition to be correct.\nThe solution of Problem (12) is simply obtained by solving the following linear system resulting from the first-order optimality conditions:\n\n$$\\begin{cases} \\displaystyle \\sum_{j \\in I_T} G_{ij} \\gamma_j + y_i \\lambda = s_i & \\text{for } i \\in I_T \\\\ \\displaystyle \\sum_{i \\in I_T} y_i \\gamma_i = - \\sum_{i \\in I_C} C_i y_i - \\sum_{i \\in I_D} (C_i + D) y_i , & \\end{cases} \\tag{13}$$\n\nwhere \u03bb, which is the (unknown) Lagrange parameter associated with the equality constraint in (12), is computed along with \u03b3. Note that the |IT| equations of the linear system given on the first line of (13) express that, for i \u2208 It, yi(f(xi) + \u03bb) = ti and for i \u2208 I\u03c4, yi(f(xi) + \u03bb) = \u03c4i. Hence, the primal variable b is equal to \u03bb.\n\nAlgorithm. The algorithm, described in Algorithm 1, simply alternates updates of the partition of examples in {I0, It, IC, I\u03c4, ID}, and the ones of coefficients \u03b3i for the current active set IT. As for standard SVMs, the initialization step consists in either using the solution obtained for a different hyper-parameter, such as a higher value of C, or in picking one or several examples of each class to arbitrarily initialize It to a non-empty set, and putting all the other ones in I0 = {1, . . . , n} \\ It.\n\nAlgorithm 1 SVM Training with a Reject Option\ninput {xi, yi}1\u2264i\u2264n and hyper-parameters C, p+, p\u2212\ninitialize \u03b3old, IT = {It, I\u03c4}, \u012aT = {I0, IC, ID}\nrepeat\n  solve linear system (13) \u2192 (\u03b3i)i\u2208IT, b = \u03bb\n  if any (\u03b3i)i\u2208IT violates the box constraints (11) then\n    compute the largest \u03c1 s.t., for all i \u2208 IT, \u03b3i^new = \u03b3i^old + \u03c1(\u03b3i \u2212 \u03b3i^old) obey box constraints\n    let j denote the index of (\u03b3i^new)i\u2208IT at bound\n    IT = IT \\ {j}, \u012aT = \u012aT \u222a {j}\n    \u03b3j^old = \u03b3j^new\n  else\n    for all i \u2208 IT do \u03b3i^new = \u03b3i\n    if any (yi(f(xi) + b))i\u2208\u012aT violates primal constraints (10) then\n      select i with violated constraint\n      \u012aT = \u012aT \\ {i}, IT = IT \u222a {i}\n    else\n      exact convergence\n    end if\n    for all i \u2208 IT do \u03b3i^old = \u03b3i^new\n  end if\nuntil convergence\noutput f, b.\n\nThe exact convergence is obtained when all constraints are fulfilled, that is, when all examples belong to the same subset at the beginning and the end of the main loop. However, it is possible to relax the convergence criteria while keeping good control of the precision of the solution by monitoring the duality gap, that is, the difference between the primal and the dual objectives, respectively provided in the definition of Problems (10) and (11).\n\nTable 1: Performances in terms of average test loss, rejection rate and misclassification rate (rejection is not an error) with r+ = r\u2212 = 0.45, for the three rejection methods over four different datasets.\n\nDataset   Method   Average test loss   Rejection rate (%)   Error rate (%)\nWdbc      Naive    2.9 \u00b1 1.6           0.7                  2.6\nWdbc      B&W\u2019s    3.5 \u00b1 1.8           3.9                  1.8\nWdbc      Ours     2.9 \u00b1 1.7           1.2                  2.4\nLiver     Naive    28.9 \u00b1 5.4          3.3                  27.4\nLiver     B&W\u2019s    30.9 \u00b1 4.0          34.5                 15.4\nLiver     Ours     28.8 \u00b1 5.1          7.9                  25.2\nThyroid   Naive    4.1 \u00b1 2.9           0.9                  3.7\nThyroid   B&W\u2019s    4.4 \u00b1 2.7           6.1                  1.6\nThyroid   Ours     3.7 \u00b1 2.7           2.1                  2.8\nPima      Naive    23.7 \u00b1 1.9          7.5                  20.3\nPima      B&W\u2019s    24.7 \u00b1 2.1          24.3                 13.8\nPima      Ours     23.1 \u00b1 1.3          6.9                  20.0\n\nTheorem 2. Algorithm 1 converges in a finite number of steps to the exact solution of (11).\n\nProof. 
The proof follows the ones used to prove the convergence of active set methods in general, and SimpleSVM in particular; see Proposition 1 in (Vishwanathan et al., 2003).\n\n5 Experiments\n\nWe compare the performances of three different rejection schemes based on SVMs. For this purpose, we selected datasets from the UCI repository related to medical problems, as medical decision making is an application domain for which rejection is of primary importance. Since these datasets are small, we repeated 10 trials for each problem. Each trial consists in randomly splitting the examples into a training set with 80% of the examples and an independent test set. Note that the training examples were normalized to zero-mean and unit variance before cross-validation (test sets were of course rescaled accordingly).\nIn a first series of experiments, to compare our decision rule with the one proposed by Bartlett and Wegkamp (2008) (B&W\u2019s), we used symmetric costs: c+ = c\u2212 = 1 and r+ = r\u2212 = r. We also chose r = 0.45, which corresponds to rather low rejection rates, in order to favour different behaviors between these two decision rules (recall that they are identical for r \u2243 0.24). Besides the double hinge loss, we also implemented a \u201cnaive\u201d method that consists in running the standard SVM algorithm (using the hinge loss) and selecting a symmetric rejection region around zero by cross-validation.\nFor all methods, we used Gaussian kernels. Model selection is performed by cross-validation. This includes the selection of the kernel widths, the regularization parameter C for all methods and additionally of the rejection thresholds for the naive method. Note that B&W\u2019s and our decision rules are based on learning with the double-hinge loss. 
Hence, the results displayed in Table 1 only differ due to the size of the rejection region, and to the disparities in the choice of hyper-parameters that may arise in the cross-validation process (since the decision rules differ, the cross-validation scores differ also).\nTable 1 summarizes the averaged performances over the 10 trials. Overall, all methods lead to equivalent average test losses, with an insignificant but consistent advantage for our decision rule. We also see that the naive method tends to reject fewer test examples than the consistent methods. This means that, for comparable average losses, the decision rules based on the scores learned by minimizing the double hinge loss tend to classify more accurately the examples that are not rejected, as seen in the last column of the table.\nFor noisy problems such as Liver and Pima, we observed that reducing rejection costs considerably decreases the error rate on classified examples (not shown in the table). The performances of the two learning methods based on the double-hinge get closer, and there is still no significant gain compared to the naive approach. Note however that the symmetric setting is favourable to the naive approach, since we only have to estimate a single decision threshold. We are experimenting to see whether the double-hinge loss shows more substantial improvements for asymmetric losses and for larger training sets.\n\n6 Conclusion\n\nIn this paper we proposed a new solution to the general problem of classification with a reject option. The double hinge loss was derived from the simple desideratum of obtaining accurate estimates of posterior probabilities only in the vicinity of the decision boundaries. 
Our formulation handles asymmetric misclassification and rejection costs and compares favorably to the one of Bartlett and Wegkamp (2008).\nWe showed that for suitable kernels, including usual ones such as the Gaussian kernel, training a kernel machine with the double hinge loss provides a universally consistent classifier with reject option. Furthermore, the loss provides sparse solutions, with a limited number of support vectors, similarly to the standard L1-SVM classifier.\nWe presented what we believe to be the first principled and efficient implementation of SVMs for classification with a reject option. Our optimization scheme is based on an active set method, whose complexity is comparable to that of standard SVMs. The dimension of our quadratic program is bounded by the number of examples, and is effectively limited to the number of support vectors. The only computational overhead is brought by monitoring five categories of examples, instead of the three considered in standard SVMs (support vector, support at bound, inactive example).\nOur approach for deriving the double hinge loss can be used for other decision problems relying on conditional probabilities at specific values or in a limited range of values. As a first example, one may target the estimation of discretized confidence ratings, such as the ones reported in weather forecasts. Multi-category classification also belongs to this class of problems, since there, decisions rely on having precise conditional probabilities within a predefined interval.\n\nAcknowledgements\n\nThis work was supported in part by the French national research agency (ANR) through project GD2GS, and by the IST Programme of the European Community through project DIRAC.\n\nReferences\n\nBartlett, P. L., & Tewari, A. (2007). Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8, 775\u2013790.\n\nBartlett, P. L., & Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9, 1823\u20131840.\n\nChow, C. K. (1970). On optimum recognition error and reject tradeoff. IEEE Trans. on Info. Theory, 16, 41\u201346.\n\nFumera, G., & Roli, F. (2002). Support vector machines with embedded reject option. Pattern Recognition with Support Vector Machines: First International Workshop (pp. 68\u201382). Springer.\n\nGrandvalet, Y., Mari\u00e9thoz, J., & Bengio, S. (2006). A probabilistic interpretation of SVMs with an application to unbalanced classification. NIPS 18 (pp. 467\u2013474). MIT Press.\n\nHerbei, R., & Wegkamp, M. H. (2006). Classification with reject option. The Canadian Journal of Statistics, 34, 709\u2013721.\n\nKwok, J. T. (1999). Moderating the outputs of support vector machine classifiers. IEEE Trans. on Neural Networks, 10, 1018\u20131031.\n\nSteinwart, I. (2005). Consistency of support vector machine and other regularized kernel classifiers. IEEE Trans. on Info. Theory, 51, 128\u2013142.\n\nVapnik, V. N. (1995). The nature of statistical learning theory. Springer Series in Statistics. Springer.\n\nVishwanathan, S. V. N., Smola, A., & Murty, N. (2003). SimpleSVM. Proceedings of the Twentieth International Conference on Machine Learning (pp. 68\u201382). AAAI.", "award": [], "sourceid": 939, "authors": [{"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "Alain", "family_name": "Rakotomamonjy", "institution": null}, {"given_name": "Joseph", "family_name": "Keshet", "institution": null}, {"given_name": "St\u00e9phane", "family_name": "Canu", "institution": null}]}