{"title": "A Geometric Interpretation of v-SVM Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 250, "abstract": null, "full_text": "A Geometric Interpretation of v-SVM \n\nClassifiers \n\nDavid J. Crisp \n\nCentre for Sensor Signal and \n\nInformation Processing, \n\nDeptartment of Electrical Engineering, \nUniversity of Adelaide, South Australia \n\ndcrisp@eleceng.adelaide.edu.au \n\nChristopher J.C. Burges \n\nAdvanced Technologies, \n\nBell Laboratories, \nLucent Technologies \nHolmdel, New Jersey \nburges@lucent.com \n\nAbstract \n\nWe show that the recently proposed variant of the Support Vector \nmachine (SVM) algorithm, known as v-SVM, can be interpreted \nas a maximal separation between subsets of the convex hulls of the \ndata, which we call soft convex hulls. The soft convex hulls are \ncontrolled by choice of the parameter v. If the intersection of the \nconvex hulls is empty, the hyperplane is positioned halfway between \nthem such that the distance between convex hulls, measured along \nthe normal, is maximized; and if it is not, the hyperplane's normal \nis similarly determined by the soft convex hulls, but its position \n(perpendicular distance from the origin) is adjusted to minimize \nthe error sum. The proposed geometric interpretation of v-SVM \nalso leads to necessary and sufficient conditions for the existence of \na choice of v for which the v-SVM solution is nontrivial. \n\n1 \n\nIntroduction \n\nRecently, SchOlkopf et al. [I) introduced a new class of SVM algorithms, called \nv-SVM, for both regression estimation and pattern recognition. The basic idea is to \nremove the user-chosen error penalty factor C that appears in SVM algorithms by \nintroducing a new variable p which, in the pattern recognition case, adds another \ndegree of freedom to the margin. For a given normal to the separating hyperplane, \nthe size of the margin increases linearly with p. 
It turns out that by adding $\rho$ to the primal objective function with coefficient $-\nu$, $\nu \ge 0$, the variable $C$ can be absorbed, and the behaviour of the resulting SVM - the number of margin errors and the number of support vectors - can to some extent be controlled by setting $\nu$. Moreover, the decision function produced by ν-SVM can also be produced by the original SVM algorithm with a suitable choice of $C$. \n\nIn this paper we show that ν-SVM, for the pattern recognition case, has a clear geometric interpretation, which also leads to necessary and sufficient conditions for the existence of a nontrivial solution to the ν-SVM problem. All our considerations apply to feature space, after the mapping of the data induced by some kernel. We adopt the usual notation: $w$ is the normal to the separating hyperplane, the mapped data is denoted by $x_i \in \mathbb{R}^N$, $i = 1, \ldots, l$, with corresponding labels $y_i \in \{\pm 1\}$, $b$ and $\rho$ are scalars, and $\xi_i$, $i = 1, \ldots, l$, are positive scalar slack variables. \n\n2 ν-SVM Classifiers \n\nThe ν-SVM formulation, as given in [1], is as follows: minimize \n\n$$P' = \frac{1}{2}\|w'\|^2 - \nu\rho' + \frac{1}{l}\sum_i \xi'_i \qquad (1)$$ \n\nwith respect to $w'$, $b'$, $\rho'$, $\xi'_i$, subject to: \n\n$$y_i(w' \cdot x_i + b') \ge \rho' - \xi'_i, \qquad \xi'_i \ge 0, \qquad \rho' \ge 0. \qquad (2)$$ \n\nHere $\nu$ is a user-chosen parameter between 0 and 1. The decision function (whose sign determines the label given to a test point $x$) is then: \n\n$$f'(x) = w' \cdot x + b'. \qquad (3)$$ \n\nThe Wolfe dual of this problem is: maximize $P'_D = -\frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i \cdot x_j$ subject to \n\n$$\sum_i y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le \frac{1}{l}, \qquad \sum_i \alpha_i \ge \nu, \qquad (4)$$ \n\nwith $w'$ given by $w' = \sum_i \alpha_i y_i x_i$. Schölkopf et al. [1] show that $\nu$ is an upper bound on the fraction of margin errors¹, a lower bound on the fraction of support vectors, and that both of these quantities approach $\nu$ asymptotically. \n\nNote that the point $w' = b' = \rho' = \xi'_i = 0$ is feasible, and that at this point, $P' = 0$. Thus any solution of interest must have $P' \le 0$.
Furthermore, if $\nu\rho' = 0$, the optimal solution is at $w' = b' = \rho' = \xi'_i = 0$². Thus we can assume that $\nu\rho' > 0$ (and therefore $\nu > 0$) always. Given this, the constraint $\rho' \ge 0$ is in fact redundant: a negative value of $\rho'$ cannot appear in a solution (to the problem with this constraint removed), since the above feasible solution (with $\rho' = 0$) gives a lower value of $P'$. Thus below we replace the constraints (2) by \n\n$$y_i(w' \cdot x_i + b') \ge \rho' - \xi'_i, \qquad \xi'_i \ge 0. \qquad (5)$$ \n\n¹A margin error $x_i$ is defined to be any point for which $\xi_i > 0$ (see [1]). \n²In fact we can prove that, even if the optimal solution is not unique, the global solutions still all have $w = 0$: see Burges and Crisp, \"Uniqueness of the SVM Solution\", in this volume. \n\n2.1 A Reparameterization of ν-SVM \n\nWe reparameterize the primal problem by dividing the objective function $P'$ by $\nu^2/2$, dividing the constraints (5) by $\nu$, and making the following substitutions: \n\n$$\mu = \frac{2}{\nu l}, \qquad w = \frac{w'}{\nu}, \qquad b = \frac{b'}{\nu}, \qquad \rho = \frac{\rho'}{\nu}, \qquad \xi_i = \frac{\xi'_i}{\nu}. \qquad (6)$$ \n\nThis gives the equivalent formulation: minimize \n\n$$P = \|w\|^2 - 2\rho + \mu\sum_i \xi_i \qquad (7)$$ \n\nwith respect to $w$, $b$, $\rho$, $\xi_i$, subject to: \n\n$$y_i(w \cdot x_i + b) \ge \rho - \xi_i, \qquad \xi_i \ge 0. \qquad (8)$$ \n\nIf we use as decision function $f(x) = f'(x)/\nu$, the formulation is exactly equivalent, although both primal and dual appear different. The dual problem is now: minimize \n\n$$P_D = \frac{1}{4}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i \cdot x_j \qquad (9)$$ \n\nwith respect to the $\alpha_i$, subject to: \n\n$$\sum_i y_i \alpha_i = 0, \qquad \sum_i \alpha_i = 2, \qquad 0 \le \alpha_i \le \mu, \qquad (10)$$ \n\nwith $w$ given by $w = \frac{1}{2}\sum_i \alpha_i y_i x_i$. In the following, we will refer to the reparameterized version of ν-SVM given above as μ-SVM, although we emphasize that it describes the same problem.
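The reparameterized dual (9)-(10) is a small box-constrained quadratic program, so it can be checked numerically with a general-purpose solver. The following is a minimal sketch, not from the paper: the toy data, the choice $\nu = 0.5$, and the use of scipy's SLSQP method are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Invented toy data: two well-separated classes in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
l = len(y)
nu = 0.5
mu = 2.0 / (nu * l)                    # reparameterization mu = 2/(nu*l)

K = np.outer(y, y) * (X @ X.T)         # entries y_i y_j x_i . x_j

def dual(a):                           # (1/4) sum_ij a_i a_j y_i y_j x_i.x_j
    return 0.25 * a @ K @ a

cons = ({'type': 'eq', 'fun': lambda a: a @ y},          # sum_i y_i a_i = 0
        {'type': 'eq', 'fun': lambda a: a.sum() - 2.0})  # sum_i a_i = 2
res = minimize(dual, np.full(l, 2.0 / l), method="SLSQP",
               bounds=[(0.0, mu)] * l, constraints=cons)
alpha = res.x
w = 0.5 * (alpha * y) @ X              # w = (1/2) sum_i alpha_i y_i x_i
p_bar = 0.5 * alpha @ X                # midpoint between the closest hull points
b = -w @ p_bar                         # soft convex hull offset b = -w . p_bar
```

On this separable toy set, sign($w \cdot x + b$) classifies every training point correctly; with $\mu = 2/3$ each $\alpha_i$ is capped below 1, so the closest points lie strictly inside the two convex hulls.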
\n\n3 A Geometric Interpretation of l/-SVM \n\nIn the separable case, it is clear that the optimal separating hyperplane is just that \nhyperplane which bisects the shortest vector joining the convex hulls of the positive \nand negative polarity points3 \u2022 We now show that this geometric interpretation can \nbe extended to the case of v-SVM for both separable and nonseparable cases. \n\n3.1 The Separable Case \n\nWe start by giving the analysis for the separable case. The convex hulls of the two \nclasses are \n\nand \n\n(11) \n\n(12) \n\nFinding the two closest points can be written as the following optimization problem: \n\nmin \n\nCIt \n\n(13) \n\n3See, for example, K. Bennett, 1997, in http://www.rpi.edu/bennek/svmtalk.ps (also, \n\nto appear). \n\n\fA Geometric Interpretation of v-SVM Classifiers \n\nsubject to: \n\nL ai = 1, \n\ni:y;=+l \n\nL ai = 1, \n\ni:y;=-l \n\na ' > 0 \n\nt _ \n\n247 \n\n(14) \n\nTaking the decision boundary j(x) = w\u00b7 x + b = 0 to be the perpendicular bisector \nof the line segment joining the two closest points means that at the solution, \n\nand b = -w\u00b7 p, where \n\n(15) \n\n(16) \n\nThus w lies along the line segment (and is half its size) and p is the midpoint of the \nline segment. By rescaling the objective function and using the class labels Yi = \u00b11 \nwe can rewrite this as4 : \n\nsubject to \n\nThe associated decision function is j( x) = w . x + b where w = ~ L:i aiYiXi, \np = ~ L:i aiXi and b = -w.p = -t L:ij aiYiajXi . Xj. \n\n3.2 The Connection with v-SVM \n\nConsider now the two sets of points defined by: \n\nH+ JJ = { '. ~ aiXil .. ~ ai = 1, 0 ~ ai ~ fL} \n\nI.y;-+l \n\nI.y.-+l \n\nand \n\n(17) \n\n(18) \n\n(19) \n\n(20) \n\nWe have the following simple proposition: \nProposition 1: H+ JJ C H+ and H-JJ C H_, and H+ JJ and H-JJ are both convex \nsets. Furthermore, the positions of the points H+ JJ and H-JJ with respect to the Xi \ndo not depend on the choice of origin. 
\nProof: Clearly, since the set of coefficients $\alpha_i$ allowed in the definition of $H_{+\mu}$ is a subset of that allowed in the definition of $H_+$, we have $H_{+\mu} \subset H_+$; similarly for $H_-$. Now consider two points in $H_{+\mu}$ defined by coefficients $\alpha^1$ and $\alpha^2$. Then every point on the line segment joining these two points can be written as $\sum_{i:y_i=+1} ((1-\lambda)\alpha^1_i + \lambda\alpha^2_i) x_i$, $0 \le \lambda \le 1$. Since $\alpha^1_i$ and $\alpha^2_i$ both satisfy $0 \le \alpha_i \le \mu$, so does $(1-\lambda)\alpha^1_i + \lambda\alpha^2_i$, and since also $\sum_{i:y_i=+1} ((1-\lambda)\alpha^1_i + \lambda\alpha^2_i) = 1$, the set $H_{+\mu}$ is convex. The argument for $H_{-\mu}$ is similar. Finally, suppose that every $x_i$ is translated by $x_0$, i.e. $x_i \to x_i + x_0$. Then since $\sum_{i:y_i=+1} \alpha_i = 1$, every point in $H_{+\mu}$ is also translated by the same amount; similarly for $H_{-\mu}$. □ \n\n⁴That one can rescale the objective function without changing the constraints follows from uniqueness of the solution. See also Burges and Crisp, \"Uniqueness of the SVM Solution\", in this volume. \n\nThe problem of finding the optimal separating hyperplane between the convex sets $H_{+\mu}$ and $H_{-\mu}$ then becomes: \n\n$$\min_\alpha \frac{1}{4}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i \cdot x_j \qquad (21)$$ \n\nsubject to \n\n$$\sum_i y_i \alpha_i = 0, \qquad \sum_i \alpha_i = 2, \qquad 0 \le \alpha_i \le \mu. \qquad (22)$$ \n\nSince Eqs. (21) and (22) are identical to (9) and (10), we see that the ν-SVM algorithm is in fact finding the optimal separating hyperplane between the convex sets $H_{+\mu}$ and $H_{-\mu}$. We note that the convex sets $H_{+\mu}$ and $H_{-\mu}$ are not simply uniformly scaled versions of $H_+$ and $H_-$. An example is shown in Figure 1. \n\nFigure 1: The soft convex hull for the vertices of a right isosceles triangle, for various $\mu$. Note how the shape changes as the set grows and is constrained by the boundaries of the encapsulating convex hull. For $\mu < \frac{1}{3}$, the set is empty. \n\nBelow, we will refer to the formulation given in this section as the soft convex hull formulation, and the sets of points defined in Eqs.
(19) and (20) as soft convex hulls. \n\n3.3 Comparing the Offsets and Margin Widths \n\nThe natural value of the offset in the soft convex hull approach, $\bar{b} = -w \cdot \bar{p}$, arose by asking that the separating hyperplane lie halfway between the closest extremities of the two soft convex hulls. Different choices of offset just amount to hyperplanes with the same normal but at different perpendicular distances from the origin. This value will not in general coincide with the value $b$ for which the cost term in Eq. (7) is minimized. We can compare the two values as follows. The KKT conditions for the μ-SVM formulation are \n\n$$(\mu - \alpha_i)\xi_i = 0 \qquad (23)$$ \n$$\alpha_i\big(y_i(w \cdot x_i + b) - \rho + \xi_i\big) = 0. \qquad (24)$$ \n\nMultiplying (24) by $y_i$, summing over $i$ and using (23) gives \n\n$$b = \bar{b} - \frac{\mu}{2}\sum_i y_i \xi_i. \qquad (25)$$ \n\nThus the separating hyperplane found by the μ-SVM algorithm sits a perpendicular distance $\big|\frac{\mu}{2\|w\|}\sum_i y_i \xi_i\big|$ away from that found in the soft convex hull formulation. For the given $w$, this choice of $b$ results in the lowest value of the cost $\mu\sum_i \xi_i$. \n\nThe soft convex hull approach suggests taking $\bar{\rho} = w \cdot w$, since this is the value $|\bar{f}|$ takes at the points $\sum_{i:y_i=+1} \alpha_i x_i$ and $\sum_{i:y_i=-1} \alpha_i x_i$. Again, we can use the KKT conditions to compare this with $\rho$. Summing (24) over $i$ and using (23) gives \n\n$$\rho = \bar{\rho} + \frac{\mu}{2}\sum_i \xi_i. \qquad (26)$$ \n\nSince $\bar{\rho} = w \cdot w$, this again shows that if $\rho = 0$ then $w = \xi_i = 0$ and, by (25), $b = 0$. \n\n3.4 The Primal for the Soft Convex Hull Formulation \n\nBy substituting (25) and (26) into the μ-SVM primal formulation (7) and (8), we obtain the primal formulation for the soft convex hull problem: minimize \n\n$$\bar{P} = \|w\|^2 - 2\bar{\rho} \qquad (27)$$ \n\nwith respect to $w$, $\bar{b}$, $\bar{\rho}$, $\xi_i$, subject to: \n\n$$y_i(w \cdot x_i + \bar{b}) \ge \bar{\rho} - \xi_i + \mu\sum_j \frac{1 + y_i y_j}{2}\,\xi_j, \qquad \xi_i \ge 0. \qquad (28)$$ \n\nIt is straightforward to check that the dual is exactly (9) and (10).
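The equivalence between the soft convex hull problem and the dual (9)-(10) can also be verified numerically. The sketch below, an illustration with invented toy data rather than code from the paper, minimizes the distance between a point of $H_{+\mu}$ and a point of $H_{-\mu}$ under the per-class constraints, then checks that the normal is half the vector joining the two closest points and that the constraints of (10) hold automatically.

```python
import numpy as np
from scipy.optimize import minimize

# Invented toy data; mu is chosen large enough that both soft hulls are nonempty.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
mu = 2.0 / 3.0
pos, neg = y == 1, y == -1

def gap(a):
    # squared distance between a point of H_{+mu} and a point of H_{-mu}
    u = a[pos] @ X[pos]
    v = a[neg] @ X[neg]
    return (u - v) @ (u - v)

cons = ({'type': 'eq', 'fun': lambda a: a[pos].sum() - 1.0},  # class +1 weights sum to 1
        {'type': 'eq', 'fun': lambda a: a[neg].sum() - 1.0})  # class -1 weights sum to 1
res = minimize(gap, np.full(len(y), 1.0 / 3.0), method="SLSQP",
               bounds=[(0.0, mu)] * len(y), constraints=cons)
a = res.x
w = 0.5 * ((a[pos] @ X[pos]) - (a[neg] @ X[neg]))  # half the connecting vector
```

Because $y_i = \pm 1$, the same $w$ can be written as $\frac{1}{2}\sum_i \alpha_i y_i x_i$, and the two per-class sum constraints are together equivalent to $\sum_i \alpha_i = 2$ and $\sum_i y_i \alpha_i = 0$, i.e. exactly the constraints (10).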
Moreover, by summing the relevant KKT conditions, as above, we see that $\bar{b} = -w \cdot \bar{p}$ and $\bar{\rho} = w \cdot w$. Note that in this formulation the variables $\xi_i$ retain their meaning according to (8). \n\n4 Choosing ν \n\nIn this section we establish some results on the choices for $\nu$, using the μ-SVM formulation. First, note that $\sum_i \alpha_i y_i = 0$ and $\sum_i \alpha_i = 2$ imply $\sum_{i:y_i=+1} \alpha_i = \sum_{i:y_i=-1} \alpha_i = 1$. Then $\alpha_i \ge 0$ gives $\alpha_i \le 1$, $\forall i$. Thus choosing $\mu > 1$, which corresponds to choosing $\nu < 2/l$, results in the same solution of the dual (and hence the same normal $w$) as choosing $\mu = 1$. (Note that different values of $\mu > 1$ can still result in different values of the other primal variables, e.g. $b$.) \n\nThe equalities $\sum_{i:y_i=+1} \alpha_i = \sum_{i:y_i=-1} \alpha_i = 1$ also show that if $\mu < 2/l$ then the feasible region for the dual is empty, and hence the problem is insoluble. This corresponds to the requirement $\nu \le 1$. However, we can improve upon this. Let $l_+$ ($l_-$) be the number of positive (negative) polarity points, so that $l_+ + l_- = l$, and let $l_{\min} \equiv \min\{l_+, l_-\}$. Then the minimal value of $\mu$ which still results in a nonempty feasible region is $\mu_{\min} = 1/l_{\min}$. This gives the condition $\nu \le 2 l_{\min}/l$. \n\nWe define a \"nontrivial\" solution of the problem to be any solution with $w \ne 0$. The following proposition gives conditions for the existence of nontrivial solutions. \n\nProposition 2: A value of $\nu$ exists which will result in a nontrivial solution to the ν-SVM classification problem if and only if $H_{+\mu_{\min}} \cap H_{-\mu_{\min}} = \emptyset$. \n\nProof: Suppose that $H_{+\mu_{\min}} \cap H_{-\mu_{\min}} \ne \emptyset$. Then for all allowable values of $\mu$ (and hence of $\nu$), the two soft convex hulls will intersect, since $H_{+\mu_{\min}} \subset H_{+\mu}$ and $H_{-\mu_{\min}} \subset H_{-\mu}$ for every $\mu \ge \mu_{\min}$.
If the two soft convex hulls intersect, then the solution is trivial, since by definition there then exist feasible points $z$ such that $z = \sum_{i:y_i=+1} \alpha_i x_i$ and $z = \sum_{i:y_i=-1} \alpha_i x_i$, and hence $2w = \sum_i \alpha_i y_i x_i = \sum_{i:y_i=+1} \alpha_i x_i - \sum_{i:y_i=-1} \alpha_i x_i = 0$ (cf. (21), (22)). Now suppose that $H_{+\mu_{\min}} \cap H_{-\mu_{\min}} = \emptyset$. Then clearly a nontrivial solution exists, since the shortest distance between the two convex sets $H_{+\mu_{\min}}$ and $H_{-\mu_{\min}}$ is not zero, and hence the corresponding $w \ne 0$. □ \n\nNote that when $l_+ = l_-$, the condition amounts to the requirement that the centroid of the positive examples does not coincide with that of the negative examples: at $\mu = \mu_{\min}$ every $\alpha_i$ is forced to equal $1/l_{\min}$, so each soft convex hull reduces to a single point, the class centroid. Note also that this shows that, given a data set, one can find a lower bound on $\nu$ by finding the largest $\mu$ that satisfies $H_{-\mu} \cap H_{+\mu} = \emptyset$. \n\n5 Discussion \n\nThe soft convex hull interpretation suggests that an appropriate way to penalize positive polarity errors differently from negative ones is to replace the sum $\mu\sum_i \xi_i$ in (7) with $\mu_+\sum_{i:y_i=+1} \xi_i + \mu_-\sum_{i:y_i=-1} \xi_i$. In fact one can go further and introduce a $\mu$ for every training point. The μ-SVM formulation makes this possibility explicit, in a way that the original ν-SVM formulation does not. \n\nNote also that the fact that ν-SVM leads to values of $b$ which differ from the value that would place the optimal hyperplane halfway between the soft convex hulls suggests that there may be principled methods for choosing the best $b$ for a given problem, other than that dictated by minimizing the sum of the $\xi_i$'s. Indeed, the sum-of-$\xi_i$'s term originally arose in an attempt to approximate the number of errors on the training set [2]. The above reasoning in a sense separates the justification for $w$ from that for $b$. For example, given $w$, a simple line search could be used to find that value of $b$ which actually does minimize the number of errors on the training set.
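Such a line search over $b$ only needs to examine one candidate offset between each pair of consecutive sorted projections $w \cdot x_i$, since the error count is piecewise constant in $b$. A minimal sketch with invented scores and labels, not code from the paper:

```python
import numpy as np

def best_offset(scores, y):
    """Choose b minimizing training errors of sign(scores + b),
    where scores[i] = w . x_i for a fixed normal w."""
    s = np.sort(scores)
    # one candidate threshold below, between, and above the projections
    thresholds = np.concatenate(([s[0] - 1.0],
                                 (s[:-1] + s[1:]) / 2.0,
                                 [s[-1] + 1.0]))
    best_b, best_err = None, np.inf
    for t in thresholds:
        pred = np.where(scores - t > 0, 1, -1)   # offset b = -t
        err = int(np.sum(pred != y))
        if err < best_err:
            best_b, best_err = -t, err
    return best_b, best_err

# Invented example: projections of six training points and their labels.
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.5, 2.5])
labels = np.array([-1, -1, 1, 1, 1, 1])
b, err = best_offset(scores, labels)   # err == 0 here, with b = 0.75
```

With $l$ training points this costs $O(l \log l)$ for the sort plus $O(l)$ error counts, so it is cheap compared with solving the quadratic program for $w$.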
Other methods (for example, minimizing the estimated Bayes error [3]) may also prove useful. \n\nAcknowledgments \n\nC. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their support. \n\nReferences \n\n[1] B. Schölkopf, A. Smola, R. Williamson and P. Bartlett. New support vector algorithms. NeuroCOLT2 Technical Report NC2-TR-1998-031, GMD First and Australian National University, 1998. \n\n[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995. \n\n[3] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press. \n", "award": [], "sourceid": 1687, "authors": [{"given_name": "David", "family_name": "Crisp", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}]}