{"title": "Uniqueness of the SVM Solution", "book": "Advances in Neural Information Processing Systems", "page_first": 223, "page_last": 229, "abstract": null, "full_text": "Uniqueness of the SVM Solution\n\nChristopher J. C. Burges\nAdvanced Technologies, Bell Laboratories, Lucent Technologies\nHolmdel, New Jersey\nburges@lucent.com\n\nDavid J. Crisp\nCentre for Sensor Signal and Information Processing,\nDepartment of Electrical Engineering,\nUniversity of Adelaide, South Australia\ndcrisp@eleceng.adelaide.edu.au\n\nAbstract\n\nWe give necessary and sufficient conditions for uniqueness of the support vector solution for the problems of pattern recognition and regression estimation, for a general class of cost functions. We show that if the solution is not unique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. We note that uniqueness of the primal (dual) solution does not necessarily imply uniqueness of the dual (primal) solution. We show how to compute the threshold $b$ when the solution is unique, but when all support vectors are at bound, in which case the usual method for determining $b$ does not work.\n\n1 Introduction\n\nSupport vector machines (SVMs) have attracted wide interest as a means to implement structural risk minimization for the problems of classification and regression estimation. The fact that training an SVM amounts to solving a convex quadratic programming problem means that the solution found is global, and that if it is not unique, then the set of global solutions is itself convex; furthermore, if the objective function is strictly convex, the solution is guaranteed to be unique [1]¹. For quadratic programming problems, convexity of the objective function is equivalent to positive semi-definiteness of the Hessian, and strict convexity, to positive definiteness [1]. 
For reference, we summarize the basic uniqueness result in the following theorem, the proof of which can be found in [1]:\n\nTheorem 1: The solution to a convex programming problem, for which the objective function is strictly convex, is unique. Positive definiteness of the Hessian implies strict convexity of the objective function.\n\nNote that in general strict convexity of the objective function does not necessarily imply positive definiteness of the Hessian. Furthermore, the solution can still be unique, even if the objective function is loosely convex (we will use the term \"loosely convex\" to mean convex but not strictly convex). Thus the question of uniqueness for a convex programming problem for which the objective function is loosely convex is one that must be examined on a case by case basis. In this paper we will give necessary and sufficient conditions for the support vector solution to be unique, even when the objective function is loosely convex, for both the classification and regression cases, and for a general class of cost functions.\n\n¹ This is in contrast with the case of neural nets, where local minima of the objective function can occur.\n\nOne of the central features of the support vector method is the implicit mapping $\\Phi$ of the data $z \\in \\mathbb{R}^n$ to some feature space $\\mathcal{F}$, which is accomplished by replacing dot products between data points $z_i$, $z_j$, wherever they occur in the training and test algorithms, with a symmetric function $K(z_i, z_j)$, which is itself an inner product in $\\mathcal{F}$ [2]: $K(z_i, z_j) = \\langle \\Phi(z_i), \\Phi(z_j) \\rangle = \\langle x_i, x_j \\rangle$, where we denote the mapped points in $\\mathcal{F}$ by $x = \\Phi(z)$. In order for this to hold the kernel function $K$ must satisfy Mercer's positivity condition [3]. 
The algorithms then amount to constructing an optimal separating hyperplane in $\\mathcal{F}$, in the pattern recognition case, or fitting the data to a linear regression tube (with a suitable choice of loss function [4]) in the regression estimation case. Below, without loss of generality, we will work in the space $\\mathcal{F}$, whose dimension we denote by $d_F$. The conditions we will find for non-uniqueness of the solution will not depend explicitly on $\\mathcal{F}$ or $\\Phi$.\n\nMost approaches to solving the support vector training problem employ the Wolfe dual, which we describe below. By uniqueness of the primal (dual) solution, we mean uniqueness of the set of primal (dual) variables at the solution. Notice that strict convexity of the primal objective function does not imply strict convexity of the dual objective function. For example, for the optimal hyperplane problem (the problem of finding the maximal separating hyperplane in input space, for the case of separable data), the primal objective function is strictly convex, but the dual objective function will be loosely convex whenever the number of training points exceeds the dimension of the data in input space. In that case, the dual Hessian $H$ will necessarily be positive semidefinite but not positive definite, since $H$ (or a submatrix of $H$, for the cases in which the cost function also contributes to the (block-diagonal) Hessian) is a Gram matrix of the training data, and some rows of the matrix will then necessarily be linearly dependent [5]². In the cases of support vector pattern recognition and regression estimation studied below, one of four cases can occur: (1) both primal and dual solutions are unique; (2) the primal solution is unique while the dual solution is not; (3) the dual is unique but the primal is not; (4) both solutions are not unique. Case (2) occurs when the unique primal solution has more than one expansion in terms of the dual variables. We will give an example of case (3) below. 
It is easy to construct trivial examples where case (1) holds, and based on the discussion below, it will be clear how to construct examples of (4). However, since the geometrical motivation and interpretation of SVMs rests on the primal variables, the theorems given below address uniqueness of the primal solution³.\n\n² Recall that a Gram matrix is a matrix whose $ij$'th element has the form $\\langle x_i, x_j \\rangle$ for some inner product $\\langle \\cdot, \\cdot \\rangle$, where $x_i$ is an element of a vector space, and that the rank of a Gram matrix is the maximum number of linearly independent vectors $x_i$ that appear in it [6].\n\n³ Due to space constraints some proofs and other details will be omitted. Complete details will be given elsewhere.\n\n2 The Case of Pattern Recognition\n\nWe consider a slightly generalized form of the problem given in [6], namely to minimize the objective function\n\n$$F = \\frac{1}{2} \\|w\\|^2 + \\sum_i C_i \\xi_i^p \\qquad (1)$$\n\nwith constants $p \\in [1, \\infty)$, $C_i > 0$, subject to constraints:\n\n$$y_i (w \\cdot x_i + b) \\ge 1 - \\xi_i, \\quad i = 1, \\ldots, l \\qquad (2)$$\n$$\\xi_i \\ge 0, \\quad i = 1, \\ldots, l \\qquad (3)$$\n\nwhere $w$ is the vector of weights, $b$ a scalar threshold, $\\xi_i$ are non-negative slack variables which are introduced to handle the case of nonseparable data, the $y_i$ are the polarities of the training samples ($y_i \\in \\{\\pm 1\\}$), $x_i$ are the images of training samples in the space $\\mathcal{F}$ by the mapping $\\Phi$, the $C_i$ determine how much errors are penalized (here we have allowed each pattern to have its own penalty), and the index $i$ labels the $l$ training patterns. The goal is then to find the values of the primal variables $\\{w, b, \\xi_i\\}$ that solve this problem. Most workers choose $p = 1$, since this results in a particularly simple dual formulation, but the problem is convex for any $p \\ge 1$. We will not go into further details on support vector classification algorithms themselves here, but refer the interested reader to [3], [7]. 
Note that, at the solution, $b$ is determined from $w$ and $\\xi_i$ by the Karush Kuhn Tucker (KKT) conditions (see below), but we include it in the definition of a solution for convenience.\n\nNote that Theorem 1 gives an immediate proof that the solution to the optimal hyperplane problem is unique, since there the objective function is just $\\frac{1}{2} \\|w\\|^2$, which is strictly convex, and the constraints (Eq. (2) with the $\\xi$ variables removed) are linear inequality constraints which therefore define a convex set⁴.\n\nFor the discussion below we will need the dual formulation of this problem, for the case $p = 1$. It takes the following form: minimize $\\frac{1}{2} \\sum_{ij} \\alpha_i \\alpha_j y_i y_j \\langle x_i, x_j \\rangle - \\sum_i \\alpha_i$ subject to constraints:\n\n$$\\eta_i \\ge 0, \\quad \\alpha_i \\ge 0 \\qquad (4)$$\n$$C_i = \\alpha_i + \\eta_i \\qquad (5)$$\n$$\\sum_i \\alpha_i y_i = 0 \\qquad (6)$$\n\nand where the solution takes the form $w = \\sum_i \\alpha_i y_i x_i$, and the KKT conditions, which are satisfied at the solution, are $\\eta_i \\xi_i = 0$, $\\alpha_i (y_i (w \\cdot x_i + b) - 1 + \\xi_i) = 0$, where $\\eta_i$ are Lagrange multipliers to enforce positivity of the $\\xi_i$, and $\\alpha_i$ are Lagrange multipliers to enforce the constraint (2). The $\\eta_i$ can be implicitly encapsulated in the condition $0 \\le \\alpha_i \\le C_i$, but we retain them to emphasize that the above equations imply that whenever $\\xi_i \\ne 0$, we must have $\\alpha_i = C_i$. Note that, for a given solution, a support vector is defined to be any point $x_i$ for which $\\alpha_i > 0$. Now suppose we have some solution to the problem (1), (2), (3). Let $N_1$ denote the set $\\{i : y_i = 1, w \\cdot x_i + b < 1\\}$, $N_2$ the set $\\{i : y_i = -1, w \\cdot x_i + b > -1\\}$, $N_3$ the set $\\{i : y_i = 1, w \\cdot x_i + b = 1\\}$, $N_4$ the set $\\{i : y_i = -1, w \\cdot x_i + b = -1\\}$, $N_5$ the set $\\{i : y_i = 1, w \\cdot x_i + b > 1\\}$, and $N_6$ the set $\\{i : y_i = -1, w \\cdot x_i + b < -1\\}$. Then we have the following theorem:\n\nTheorem 2: The solution to the soft-margin problem, (1), (2) and (3), is unique for $p > 1$. 
For $p = 1$, the solution is not unique if and only if at least one of the following two conditions holds:\n\n$$\\sum_{i \\in N_1} C_i = \\sum_{i \\in N_2 \\cup N_4} C_i \\qquad (7)$$\n$$\\sum_{i \\in N_2} C_i = \\sum_{i \\in N_1 \\cup N_3} C_i \\qquad (8)$$\n\nFurthermore, whenever the solution is not unique, all solutions share the same $w$, and any support vector $x_i$ has Lagrange multiplier satisfying $\\alpha_i = C_i$, and when (7) holds, then $N_3$ contains no support vectors, and when (8) holds, then $N_4$ contains no support vectors.\n\n⁴ This is of course not a new result: see for example [3].\n\nProof: For the case $p > 1$, the objective function $F$ is strictly convex, since a sum of strictly convex functions is a strictly convex function, and since the function $g(v) = v^p$, $v \\in \\mathbb{R}_+$ is strictly convex for $p > 1$. Furthermore the constraints define a convex set, since any set of simultaneous linear inequality constraints defines a convex set. Hence by Theorem 1 the solution is unique.\n\nFor the case $p = 1$, define $z$ to be that $(d_F + l)$-component vector with $z_i = w_i$, $i = 1, \\ldots, d_F$, and $z_{d_F + i} = \\xi_i$, $i = 1, \\ldots, l$. In terms of the variables $z$, the problem is still a convex programming problem, and hence has the property that any solution is a global solution. Suppose that we have two solutions, $z_1$ and $z_2$. Then we can form the family of solutions $z_t$, where $z_t \\equiv (1 - t) z_1 + t z_2$, and since the solutions are global, we have $F(z_1) = F(z_2) = F(z_t)$. By expanding $F(z_t) - F(z_1) = 0$ in terms of $z_1$ and $z_2$ and differentiating twice with respect to $t$ we find that $w_1 = w_2$ (for $p = 1$ the cost term is linear in $t$, so the second derivative is $\\|w_1 - w_2\\|^2$, which must vanish). Now given $w$ and $b$, the $\\xi_i$ are completely determined by the KKT conditions. Thus the solution is not unique if and only if $b$ is not unique.\n\nDefine $\\delta \\equiv \\min \\{ \\min_{i \\in N_1} \\xi_i, \\min_{i \\in N_6} (-1 - w \\cdot x_i - b) \\}$, and suppose that condition (7) holds. Then a different solution $\\{w', b', \\xi'\\}$ is given by $w' = w$, $b' = b + \\delta$, and $\\xi'_i = \\xi_i - \\delta$, $\\forall i \\in N_1$, $\\xi'_i = \\xi_i + \\delta$, $\\forall i \\in N_2 \\cup N_4$, all other $\\xi'_i = 0$, since by construction $F$ then remains the same, and the constraints (2), (3) are satisfied by the primed variables. Similarly, suppose that condition (8) holds. Define $\\delta \\equiv \\min \\{ \\min_{i \\in N_2} \\xi_i, \\min_{i \\in N_5} (w \\cdot x_i + b - 1) \\}$. Then a different solution $\\{w', b', \\xi'\\}$ is given by $w' = w$, $b' = b - \\delta$, and $\\xi'_i = \\xi_i - \\delta$, $\\forall i \\in N_2$, $\\xi'_i = \\xi_i + \\delta$, $\\forall i \\in N_1 \\cup N_3$, all other $\\xi'_i = 0$, since again by construction $F$ is unchanged and the constraints are still met. Thus the given conditions are sufficient for the solution to be non-unique. To show necessity, assume that the solution is not unique: then by the above argument, the solutions must differ by their values of $b$. Given a particular solution $b$, suppose that $b + \\delta$, $\\delta > 0$ is also a solution. Since the set of solutions is itself convex, then $b + \\delta'$ will also correspond to a solution for all $\\delta' : 0 \\le \\delta' \\le \\delta$. Given some $b' = b + \\delta'$, we can use the KKT conditions to compute all the $\\xi_i$, and we can choose $\\delta'$ sufficiently small so that no $\\xi_i$, $i \\in N_6$ that was previously zero becomes nonzero. Then we find that in order that $F$ remain the same, condition (7) must hold. If $b - \\delta$, $\\delta > 0$ is a solution, similar reasoning shows that condition (8) must hold. To show the final statement of the theorem, we use the equality constraint (6), together with the fact that, from the KKT conditions, all support vectors $x_i$ with indices in $N_1 \\cup N_2$ satisfy $\\alpha_i = C_i$. Substituting (6) in (7) then gives $\\sum_{N_3} \\alpha_i + \\sum_{N_4} (C_i - \\alpha_i) = 0$ which implies the result, since all $\\alpha_i$ are non-negative. Similarly, substituting (6) in (8) gives $\\sum_{N_3} (C_i - \\alpha_i) + \\sum_{N_4} \\alpha_i = 0$ which again implies the result. $\\Box$\n\nCorollary: For any solution which is not unique, letting $S$ denote the set of indices of the corresponding set of support vectors, then we must have $\\sum_{i \\in S} C_i y_i = 0$. Furthermore, if the number of data points is finite, then for at least one of the family of solutions, all support vectors have corresponding $\\xi_i \\ne 0$.\n\nNote that it follows from the corollary that if the $C_i$ are chosen such that there exists no subset $T$ of the training data such that $\\sum_{i \\in T} C_i y_i = 0$, then the solution is guaranteed to be unique, even if $p = 1$. Furthermore this can be done by choosing all the $C_i$ very close to some central value $C$, although the resulting solution can depend sensitively on the values chosen (see the example immediately below). Finally, note that if all $C_i$ are equal, the theorem shows that a necessary condition for the solution to be non-unique is that the negative and positive polarity support vectors be equal in number.\n\nA simple example of a non-unique solution, for the case $p = 1$, is given by a training set in one dimension with just two examples, $\\{x_1 = 1, y_1 = 1\\}$ and $\\{x_2 = -1, y_2 = -1\\}$, with $C_1 = C_2 \\equiv C$. It is straightforward to show analytically that for $C \\ge \\frac{1}{2}$ the solution is unique, with $w = 1$, $\\xi_1 = \\xi_2 = b = 0$, and margin⁵ equal to 2, while for $C < \\frac{1}{2}$ there is a family of solutions, with $-1 + 2C \\le b \\le 1 - 2C$ and $\\xi_1 = 1 - b - 2C$, $\\xi_2 = 1 + b - 2C$, and margin $1/C$. The case $C < \\frac{1}{2}$ corresponds to Case (3) in Section (1) (dual unique but primal not), since the dual variables are uniquely specified by $\\alpha = C$. Note also that this family of solutions also satisfies the condition that any solution is smoothly deformable into another solution [7]. If $C_1 > C_2$, the solution becomes unique, and is quite different from the unique solution found when $C_2 > C_1$. 
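
The two-point example above can be checked numerically. The following is a minimal Python sketch and not part of the original paper: the name `primal_objective` is hypothetical, and the slacks are set to the smallest values allowed by constraints (2) and (3) for the given w and b, which is an assumption that holds at any optimum when p = 1.

```python
# Sketch: the one-dimensional example with x1 = 1, y1 = 1, x2 = -1, y2 = -1.
# Names are illustrative; slacks take the smallest values allowed by (2), (3).

def primal_objective(w, b, C):
    # Objective (1) with p = 1 and C1 = C2 = C.
    xi1 = max(0.0, 1.0 - (w * 1.0 + b))    # slack of the positive point
    xi2 = max(0.0, 1.0 + (w * -1.0 + b))   # slack of the negative point
    return 0.5 * w * w + C * (xi1 + xi2)


C = 0.25                                   # any C < 1/2 gives a family
w = 2.0 * C                                # margin 2/w = 1/C, as in the text
b_values = [-0.5, -0.25, 0.0, 0.25, 0.5]   # spans [-1 + 2C, 1 - 2C]
costs = [primal_objective(w, b, C) for b in b_values]

# Coarse grid search over (w, b) to confirm that no grid point does better.
grid = [i / 20.0 for i in range(-40, 41)]
best = min(primal_objective(wg, bg, C) for wg in grid for bg in grid)

print(costs)   # every entry should equal 2C - 2C^2 up to rounding
print(best)    # should agree with the entries of costs
```

Every b in the stated interval gives the same cost, which is the non-uniqueness the text describes; a value of b outside the interval raises the cost.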
When the $C$'s are not equal, one can interpret what happens in terms of the mechanical analogy [8], with the central separating hyperplane sliding away from the point that exerts the higher force, until that point lies on the edge of the margin region.\n\nNote that if the solution is not unique, the possible values of $b$ fall on an interval of the real line: in this case a suitable choice would be one that minimizes an estimate of the Bayes error, where the SVM output densities are modeled using a validation set⁶. Alternatively, requiring continuity with the cases $p > 1$, so that one would choose that value of $b$ that would result by considering the family of solutions generated by different choices of $p$, and taking the limit from above of $p \\to 1$, would again result in a unique solution.\n\n3 The Case of Regression Estimation⁷\n\nHere one has a set of $l$ pairs $\\{x_1, y_1\\}, \\{x_2, y_2\\}, \\ldots, \\{x_l, y_l\\}$, $\\{x_i \\in \\mathcal{F}, y_i \\in \\mathbb{R}\\}$, and the goal is to estimate the unknown functional dependence $f$ of the $y$ on the $x$, where the function $f$ is assumed to be related to the measurements $\\{x_i, y_i\\}$ by $y_i = f(x_i) + n_i$, and where $n_i$ represents noise. For details we refer the reader to [3], [9]. Again we generalize the original formulation [10], as follows: for some choice of positive error penalties $C_i$, $C_i^*$, and for positive $\\epsilon_i$, minimize\n\n$$F = \\frac{1}{2} \\|w\\|^2 + \\sum_{i=1}^{l} (C_i \\xi_i^p + C_i^* (\\xi_i^*)^p) \\qquad (9)$$\n\nwith constant $p \\in [1, \\infty)$, subject to constraints\n\n$$y_i - w \\cdot x_i - b \\le \\epsilon_i + \\xi_i \\qquad (10)$$\n$$w \\cdot x_i + b - y_i \\le \\epsilon_i + \\xi_i^* \\qquad (11)$$\n$$\\xi_i^{(*)} \\ge 0 \\qquad (12)$$\n\nwhere we have adopted the notation $\\xi_i^{(*)} \\equiv \\{\\xi_i, \\xi_i^*\\}$ [9]. This formulation results in an \"$\\epsilon$ insensitive\" loss function, that is, there is no penalty ($\\xi_i^{(*)} = 0$) associated with point $x_i$ if $|y_i - w \\cdot x_i - b| \\le \\epsilon_i$. Now let $\\beta$, $\\beta^*$ be the Lagrange multipliers introduced to enforce the constraints (10), (11). The dual then gives\n\n$$\\sum_i \\beta_i = \\sum_i \\beta_i^*, \\quad 0 \\le \\beta_i \\le C_i, \\quad 0 \\le \\beta_i^* \\le C_i^*, \\qquad (13)$$\n\nwhich we will need below. For this formulation, we have the following\n\nTheorem 3: For a given solution, define $f(x_i, y_i) \\equiv y_i - w \\cdot x_i - b$, and define $N_1$ to be the set of indices $\\{i : f(x_i, y_i) > \\epsilon_i\\}$, $N_2$ the set $\\{i : f(x_i, y_i) = \\epsilon_i\\}$, $N_3$ the set $\\{i : f(x_i, y_i) = -\\epsilon_i\\}$, and $N_4$ the set $\\{i : f(x_i, y_i) < -\\epsilon_i\\}$. Then the solution to (9) - (12) is unique for $p > 1$, and for $p = 1$ it is not unique if and only if at least one of the following two conditions holds:\n\n$$\\sum_{i \\in N_1 \\cup N_2} C_i = \\sum_{i \\in N_4} C_i^* \\qquad (14)$$\n$$\\sum_{i \\in N_3 \\cup N_4} C_i^* = \\sum_{i \\in N_1} C_i \\qquad (15)$$\n\nFurthermore, whenever the solution is not unique, all solutions share the same $w$, and all support vectors are at bound (that is⁸, either $\\beta_i = C_i$ or $\\beta_i^* = C_i^*$), and when (14) holds, then $N_3$ contains no support vectors, and when (15) holds, then $N_2$ contains no support vectors.\n\nThe theorem shows that in the non-unique case one will only be able to move the tube (and get another solution) if one does not change its normal $w$. A trivial example of a non-unique solution is when all the data fits inside the $\\epsilon$-tube with room to spare, in which case for all the solutions, the normal to the $\\epsilon$-tubes always lies along the $y$ direction. Another example is when all $C_i$ are equal, all data falls outside the tube, and there are the same number of points above the tube as below it.\n\n⁵ The margin is defined to be the distance between the two hyperplanes corresponding to equality in Eq. (2), namely $2/\\|w\\|$, and the margin region is defined to be the set of points between the two hyperplanes.\n\n⁶ This method was used to estimate $b$ under similar circumstances in [8].\n\n⁷ The notation in this section only coincides with that used in section 2 where convenient.\n\n4 Computing $b$ when all SVs are at Bound\n\nThe threshold $b$ in Eqs. 
(2), (10) and (11) is usually determined from that subset of the constraint equations which become equalities at the solution and for which the corresponding Lagrange multipliers are not at bound. However, it may be that at the solution, this subset is empty. In this section we consider the situation where the solution is unique, where we have solved the optimization problem and therefore know the values of all Lagrange multipliers, and hence know also $w$, and where we wish to find the unique value of $b$ for this solution. Since the $\\xi_i^{(*)}$ are known once $b$ is fixed, we can find $b$ by finding that value which both minimizes the cost term in the primal Lagrangian, and which satisfies all the constraint equations. Let us consider the pattern recognition case first. Let $S_+$ ($S_-$) denote the set of indices of positive (negative) polarity support vectors. Also let $V_+$ ($V_-$) denote the set of indices of positive (negative) vectors which are not support vectors. It is straightforward to show that if $\\sum_{i \\in S_-} C_i > \\sum_{i \\in S_+} C_i$, then $b = \\max \\{ \\max_{i \\in S_-} (-1 - w \\cdot x_i), \\max_{i \\in V_+} (1 - w \\cdot x_i) \\}$, while if $\\sum_{i \\in S_-} C_i < \\sum_{i \\in S_+} C_i$, then $b = \\min \\{ \\min_{i \\in S_+} (1 - w \\cdot x_i), \\min_{i \\in V_-} (-1 - w \\cdot x_i) \\}$. Furthermore, if $\\sum_{i \\in S_-} C_i = \\sum_{i \\in S_+} C_i$, and if the solution is unique, then these two values coincide.\n\nIn the regression case, let us denote by $S$ the set of indices of all support vectors, $\\bar{S}$ its complement, $S_1$ the set of indices for which $\\beta_i = C_i$, and $S_2$ the set of indices for which $\\beta_i^* = C_i^*$, so that $S = S_1 \\cup S_2$ (note $S_1 \\cap S_2 = \\emptyset$). Then if $\\sum_{i \\in S_2} C_i^* > \\sum_{i \\in S_1} C_i$, the desired value of $b$ is $b = \\max \\{ \\max_{i \\in S_2} (y_i - w \\cdot x_i + \\epsilon_i), \\max_{i \\in \\bar{S}} (y_i - w \\cdot x_i - \\epsilon_i) \\}$ while if $\\sum_{i \\in S_2} C_i^* < \\sum_{i \\in S_1} C_i$, then $b = \\min \\{ \\min_{i \\in S_1} (y_i - w \\cdot x_i - \\epsilon_i), \\min_{i \\in \\bar{S}} (y_i - w \\cdot x_i + \\epsilon_i) \\}$.\n\n⁸ Recall that if $\\epsilon_i > 0$, then $\\beta_i \\beta_i^* = 0$. 
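
The pattern recognition rule in this section can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the names `dot` and `threshold_bounds` are hypothetical, and it assumes that a positive multiplier marks a support vector, that every support vector is at bound, and that both polarities contain at least one support vector. The function returns the max expression and the min expression from the text as a pair.

```python
# Sketch of the pattern-recognition rule for b when every support vector
# is at bound. All names are illustrative, not from the paper.

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def threshold_bounds(w, X, y, alpha):
    # Returns (lower, upper): the smallest and largest b that satisfy
    # constraints (2), (3), assuming alpha_i > 0 marks a support vector
    # and every support vector has alpha_i = C_i.
    lower, upper = [], []
    for x_i, y_i, a_i in zip(X, y, alpha):
        wx = dot(w, x_i)
        is_sv = a_i > 0
        if y_i > 0 and is_sv:      # S+: slack 1 - w.x - b must be >= 0
            upper.append(1.0 - wx)
        elif y_i > 0:              # V+: w.x + b must be >= 1
            lower.append(1.0 - wx)
        elif is_sv:                # S-: slack 1 + w.x + b must be >= 0
            lower.append(-1.0 - wx)
        else:                      # V-: w.x + b must be <= -1
            upper.append(-1.0 - wx)
    # lower is the max{...} expression, upper the min{...} expression.
    return max(lower), min(upper)


# Two-point example from Section 2: x1 = 1 (y = +1), x2 = -1 (y = -1).
X = [[1.0], [-1.0]]
y = [1, -1]
unique_case = threshold_bounds([1.0], X, y, [0.5, 0.5])     # C = 1/2, w = 1
family_case = threshold_bounds([0.5], X, y, [0.25, 0.25])   # C = 1/4, w = 2C
print(unique_case)
print(family_case)
```

When the two returned values agree, that common value is the threshold; when they differ, they are the ends of the interval of admissible b, as in the C < 1/2 example of Section 2.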
\n\nAgain, if the solution is unique, and if also $\\sum_{i \\in S_2} C_i^* = \\sum_{i \\in S_1} C_i$, then these two values coincide.\n\n5 Discussion\n\nWe have shown that non-uniqueness of the SVM solution will be the exception rather than the rule: it will occur only when one can rigidly parallel transport the margin region without changing the total cost. If non-unique solutions are encountered, other techniques for finding the threshold, such as minimizing the Bayes error arising from a model of the SVM posteriors [8], will be needed. The method of proof in the above theorems is straightforward, and should be extendable to similar algorithms, for example Mangasarian's Generalized SVM [11]. In fact one can extend this result to any problem whose objective function consists of a sum of strictly convex and loosely convex functions: for example, it follows immediately that for the case of the $\\nu$-SVM pattern recognition and regression estimation algorithms [12], with arbitrary convex costs, the value of the normal $w$ will always be unique.\n\nAcknowledgments\n\nC. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their support.\n\nReferences\n\n[1] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.\n\n[2] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM.\n\n[3] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998.\n\n[4] A.J. Smola and B. Scholkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.\n\n[5] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.\n\n[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.\n\n[7] C.J.C. 
Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.\n\n[8] C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.\n\n[9] A. Smola and B. Scholkopf. A tutorial on support vector regression. Statistics and Computing, 1998. In press: also, COLT Technical Report TR-1998-030.\n\n[10] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems, 9:281-287, 1996.\n\n[11] O.L. Mangasarian. Generalized support vector machines, mathematical programming technical report 98-14. Technical report, University of Wisconsin, October 1998.\n\n[12] B. Scholkopf, A. Smola, R. Williamson and P. Bartlett. New Support Vector Algorithms. NeuroCOLT2 NC2-TR-1998-031, 1998.", "award": [], "sourceid": 1735, "authors": [{"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "David", "family_name": "Crisp", "institution": null}]}