{"title": "Active Support Vector Machine Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 583, "abstract": null, "full_text": "Active Support Vector Machine \n\nClassification \n\no. L. Mangasarian \nComputer Sciences Dept. \nUniversity of Wisconsin \n1210 West Dayton Street \n\nMadison, WI 53706 \nolvi@cs.wisc.edu \n\nDavid R. Musicant \n\nDept. of Mathematics and Computer Science \n\nCarleton College \n\nOne North College Street \n\nNorthfield, MN 55057 \n\ndmusican@carleton.edu \n\nAbstract \n\nAn active set strategy is applied to the dual of a simple reformula(cid:173)\ntion of the standard quadratic program of a linear support vector \nmachine. This application generates a fast new dual algorithm \nthat consists of solving a finite number of linear equations, with a \ntypically large dimensionality equal to the number of points to be \nclassified. However, by making novel use of the Sherman-Morrison(cid:173)\nWoodbury formula, a much smaller matrix of the order of the orig(cid:173)\ninal input space is inverted at each step. Thus, a problem with a \n32-dimensional input space and 7 million points required inverting \npositive definite symmetric matrices of size 33 x 33 with a total run(cid:173)\nning time of 96 minutes on a 400 MHz Pentium II. The algorithm \nrequires no specialized quadratic or linear programming code, but \nmerely a linear equation solver which is publicly available. \n\n1 \n\nIntroduction \n\nSupport vector machines (SVMs) [23, 5, 14, 12] are powerful tools for data classifi(cid:173)\ncation. Classification is achieved by a linear or nonlinear separating surface in the \ninput space of the dataset. In this work we propose a very fast simple algorithm, \nbased on an active set strategy for solving quadratic programs with bounds [18]. 
\nThe algorithm is capable of accurately solving problems with millions of points and requires nothing more complicated than a commonly available linear equation solver [17, 1, 6] for a typically small (100) dimensional input space of the problem. \n\nKey to our approach are the following two changes to the standard linear SVM: \n\n1. Maximize the margin (distance) between the parallel separating planes with respect to both orientation (w) as well as location relative to the origin (γ). See equation (7) below. Such an approach was also successfully utilized in the successive overrelaxation (SOR) approach of [15] as well as the smooth support vector machine (SSVM) approach of [12]. \n\n2. The error in the soft margin (y) is minimized using the 2-norm squared instead of the conventional 1-norm. See equation (7). Such an approach has also been used successfully in generating virtual support vectors [4]. \n\nThese simple but fundamental changes lead to a considerably simpler positive definite dual problem with nonnegativity constraints only. See equation (8). \n\nIn Section 2 of the paper we begin with the standard SVM formulation and its dual, and then give our formulation and its simpler dual. We corroborate with solid computational evidence that our simpler formulation does not compromise on generalization ability, as evidenced by numerical tests in Section 4 on 6 public datasets. See Table 1. Section 3 gives our active support vector machine (ASVM) Algorithm 3.1, which consists of solving a system of linear equations in m dual variables with a positive definite matrix. By invoking the Sherman-Morrison-Woodbury (SMW) formula (1), we need only invert an (n + 1) x (n + 1) matrix, where n is the dimensionality of the input space. This is a key feature of our approach that allows us to solve problems with millions of points by merely inverting much smaller matrices of the order of n. 
In concurrent work [8], Ferris and Munson also use the SMW formula, but in conjunction with an interior point approach, to solve massive problems based on our formulation (8) as well as the conventional formulation (6). Burges [3] has also used an active set method, but applied to the standard SVM formulation (2) instead of (7) as we do here. Both this work and Burges' appeal, in different ways, to the active set computational strategy of More and Toraldo [18]. We note that an active set computational strategy bears no relation to active learning. Section 4 describes our numerical results, which indicate that the ASVM formulation has tenfold testing correctness as good as that of the ordinary SVM, and has the capability of accurately solving massive problems with millions of points that cannot be attacked by standard methods for ordinary SVMs. \n\nWe now describe our notation and give some background material. All vectors will be column vectors unless transposed to a row vector by a prime '. For a vector x in R^n, x_+ denotes the vector in R^n with all of its negative components set to zero. The notation A in R^{m x n} will signify a real m x n matrix. For such a matrix, A' will denote the transpose of A and A_i will denote the i-th row of A. A vector of ones or zeros in a real space of arbitrary dimension will be denoted by e or 0, respectively. The identity matrix of arbitrary dimension will be denoted by I. For two vectors x and y in R^n, x ⊥ y denotes orthogonality, that is x'y = 0. For u in R^m, Q in R^{m x m} and B ⊂ {1, 2, ..., m}, u_B denotes u_{i in B}, Q_B denotes Q_{i in B}, and Q_{BB} denotes the principal submatrix of Q with rows i in B and columns j in B. The notation argmin_{x in S} f(x) denotes the set of minimizers in the set S of the real-valued function f defined on S. We use := to denote definition. The 2-norm of a matrix Q will be denoted by ||Q||_2. 
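As a small illustration (a sketch not found in the paper; the arrays are made up), the plus operation x_+ and the principal-submatrix notation Q_{BB} map directly onto NumPy operations:

```python
import numpy as np

# x_+ : the vector with all negative components set to zero
x = np.array([3.0, -1.5, 0.0, 2.0])
x_plus = np.maximum(x, 0)

# Q_BB : principal submatrix of Q with rows i in B and columns j in B
Q = np.arange(16.0).reshape(4, 4)
B = np.array([0, 2])            # index set B (0-based here)
Q_BB = Q[np.ix_(B, B)]          # 2 x 2 principal submatrix
```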
A separating plane, with respect to two given point sets A and B in R^n, is a plane that attempts to separate R^n into two halfspaces such that each open halfspace contains points mostly of A or B. A special case of the Sherman-Morrison-Woodbury (SMW) formula [9] will be utilized: \n\n(I/ν + HH')^{-1} = ν(I - H(I/ν + H'H)^{-1}H'),   (1) \n\nwhere ν is a positive number and H is an arbitrary m x k matrix. This formula enables us to invert a large m x m matrix by merely inverting a smaller k x k matrix. \n\n2 The Linear Support Vector Machine \n\nWe consider the problem of classifying m points in the n-dimensional real space R^n, represented by the m x n matrix A, according to membership of each point A_i in the class A+ or A- as specified by a given m x m diagonal matrix D with +1's or -1's along its diagonal. For this problem the standard SVM with a linear kernel [23, 5] is given by the following quadratic program with parameter ν > 0: \n\nmin_{(w,γ,y) in R^{n+1+m}}  νe'y + (1/2)w'w  s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0.   (2) \n\n[Figure 1: The bounding planes (3) with a soft (i.e. with some errors) margin 2/||w||_2, and the plane (4) approximately separating A+ from A-.] \n\nHere w is the normal to the bounding planes: \n\nx'w = γ ± 1,   (3) \n\nand γ determines their location relative to the origin (Figure 1). The plane x'w = γ + 1 bounds the A+ points, possibly with error, and the plane x'w = γ - 1 bounds the A- points, also possibly with some error. The separating surface is the plane: \n\nx'w = γ,   (4) \n\nmidway between the bounding planes (3). The quadratic term in (2) is twice the reciprocal of the square of the 2-norm distance 2/||w||_2 between the two bounding planes of (3) (see Figure 1). 
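The SMW identity (1) can be checked numerically; the following NumPy sketch (not part of the paper; the sizes m, k and the value of ν are illustrative) inverts an m x m matrix on the left and only a k x k matrix on the right:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, nu = 200, 5, 0.1               # large m, small k, positive nu
H = rng.standard_normal((m, k))      # arbitrary m x k matrix

# Left side of (1): direct inverse of the large m x m matrix
big = np.linalg.inv(np.eye(m) / nu + H @ H.T)

# Right side of (1): only a k x k matrix is inverted
small = nu * (np.eye(m) - H @ np.linalg.inv(np.eye(k) / nu + H.T @ H) @ H.T)

print(np.allclose(big, small))       # True
```

This is exactly the mechanism that lets the ASVM algorithm work with (n + 1) x (n + 1) inverses instead of m x m ones.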
This term maximizes this distance, which is often called the \"margin\". If the classes are linearly inseparable, as depicted in Figure 1, then the two planes bound the two classes with a \"soft margin\". That is, they bound each set approximately, with some error determined by the nonnegative error variable y: \n\nA_i w + y_i ≥ γ + 1, for D_ii = 1, \nA_i w - y_i ≤ γ - 1, for D_ii = -1.   (5) \n\nTraditionally the 1-norm of the error variable y is minimized parametrically with weight ν in (2), resulting in an approximate separation as depicted in Figure 1. The dual to the standard quadratic linear SVM (2) [13, 22, 14, 7] is the following: \n\nmin_{u in R^m}  (1/2)u'DAA'Du - e'u  s.t.  e'Du = 0,  0 ≤ u ≤ νe.   (6) \n\nThe variables (w, γ) of the primal problem which determine the separating surface (4) can be obtained from the solution of the dual problem above [15, Eqns. 5 and 7]. We note immediately that the matrix DAA'D appearing in the dual objective function (6) is not positive definite in general because typically m >> n. Also, there is an equality constraint present, in addition to bound constraints, which for large problems necessitates special computational procedures such as SMO [21]. Furthermore, a one-dimensional optimization problem [15] must be solved in order to determine the locator γ of the separating surface (4). In order to overcome all these difficulties, as well as the necessity of having to essentially invert a very large matrix of the order of m x m, we propose the following simple but critical modification of the standard SVM formulation (2). We change ||y||_1 to ||y||_2^2, which makes the constraint y ≥ 0 redundant. We also append the term γ^2 to w'w. 
This in effect maximizes the margin between the parallel separating planes (3) with respect to both w and γ [15], that is, with respect to both the orientation and the location of the planes, rather than just with respect to w, which merely determines the orientation. This leads to the following reformulation of the SVM: \n\nmin_{(w,γ,y) in R^{n+1+m}}  ν(y'y)/2 + (1/2)(w'w + γ^2)  s.t.  D(Aw - eγ) + y ≥ e.   (7) \n\nThe dual of this problem is [13]: \n\nmin_{0 ≤ u in R^m}  (1/2)u'(I/ν + D(AA' + ee')D)u - e'u.   (8) \n\nThe variables (w, γ) of the primal problem which determine the separating surface (4) are recovered directly from the solution of the dual (8) above by the relations: \n\nw = A'Du,  y = u/ν,  γ = -e'Du.   (9) \n\nWe immediately note that the matrix appearing in the dual objective function is positive definite and that there is no equality constraint and no upper bound on the dual variable u. The only constraint present is a simple nonnegativity one. These facts lead us to our simple finite active set algorithm, which requires nothing more sophisticated than inverting an (n + 1) x (n + 1) matrix at each iteration in order to solve the dual problem (8). \n\n3 ASVM (Active Support Vector Machine) Algorithm \n\nThe algorithm consists of determining a partition of the dual variable u into nonbasic and basic variables. The nonbasic variables are those which are set to zero. The values of the basic variables are determined by finding the gradient of the objective function of (8) with respect to these variables, setting this gradient equal to zero, and solving the resulting linear equations for the basic variables. If any basic variable takes on a negative value after solving the linear equations, it is set to zero and becomes nonbasic. This is the essence of the algorithm. 
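The essence just described can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's full Algorithm 3.1: it omits the convergence safeguards, forms Q^{-1} directly rather than via the SMW formula, and the toy data are hypothetical.

```python
import numpy as np

def asvm_sketch(A, d, nu=1.0, iters=50):
    """Active set sketch for the dual (8): min_{u >= 0} u'Qu/2 - e'u,
    where Q = I/nu + D(AA' + ee')D. Returns (w, gamma) via (9)."""
    m, n = A.shape
    D = np.diag(d)
    H = D @ np.hstack([A, -np.ones((m, 1))])        # H = D[A  -e]
    Q = np.eye(m) / nu + H @ H.T                    # positive definite
    e = np.ones(m)
    u = np.maximum(np.linalg.solve(Q, e), 0)        # start: u0 = (Q^{-1} e)_+
    for _ in range(iters):
        B = u > 0                                   # basic variables; rest nonbasic
        u = np.zeros(m)
        # zero the gradient on the basic set: Q_BB u_B = e_B, clip negatives
        u[B] = np.maximum(np.linalg.solve(Q[np.ix_(B, B)], e[B.nonzero()]), 0)
        grad = Q @ u - e
        if np.all(grad >= -1e-9) and abs(u @ grad) < 1e-9:
            break                                   # KKT: 0 <= u ⊥ Qu - e >= 0
    return A.T @ D @ u, -e @ D @ u                  # w = A'Du, gamma = -e'Du

# Hypothetical toy data: two separable clusters in R^2
A = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -2.0], [-3.0, -1.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w, gamma = asvm_sketch(A, d)
signs = np.sign(A @ w - gamma)                      # side of the plane (4)
```

On this toy problem the loop terminates after a couple of active set changes, and the recovered plane x'w = γ classifies all four points correctly.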
In order to make the algorithm converge and terminate, a few additional safeguards need to be put in place in order to allow us to invoke the More-Toraldo finite termination result [18]. The other key feature of the algorithm is a computational one and makes use of the SMW formula. This feature allows us to invert an (n + 1) x (n + 1) matrix at each step instead of a much bigger matrix of order m x m. \n\nBefore stating our algorithm we define two matrices to simplify notation as follows: \n\nH = D[A  -e],  Q = I/ν + HH'.   (10) \n\nWith these definitions the dual problem (8) becomes \n\nmin_{0 ≤ u in R^m}  f(u) := (1/2)u'Qu - e'u.   (11) \n\nIt will be understood that within the ASVM Algorithm, Q^{-1} will always be evaluated using the SMW formula and hence only an (n + 1) x (n + 1) matrix is inverted. We state our algorithm now. Note that commented (%) parts of the algorithm are not needed in general and were rarely used in our numerical results presented in Section 4. The essence of the algorithm is displayed in the two boxes below. \n\nAlgorithm 3.1 Active SVM (ASVM) Algorithm for (8). \n\n(0) Start with u^0 := (Q^{-1}e)_+. For i = 0, 1, 2, ..., having u^i compute u^{i+1} as follows. \n\n(1) Define B_i := {j | u^i_j > 0}, N_i := {j | u^i_j = 0}. \n\n(2) Determine u^{i+1}_{B_i} := ((Q_{B_iB_i})^{-1} e_{B_i})_+, u^{i+1}_{N_i} := 0. Stop if u^{i+1} is the global solution, that is if 0 ≤ u^{i+1} ⊥ Qu^{i+1} - e ≥ 0. \n\n(2a) % If f(u^{i+1}) ≥ f(u^i), then go to (4a). \n\n(2b) % If 0 ≤ u^{i+1}_{B_{i+1}} ⊥ Q_{B_{i+1}B_{i+1}} u^{i+1}_{B_{i+1}} - e_{B_{i+1}} ≥ 0, then u^{i+1} is a global solution on the face of active constraints u_{N_i} = 0. Set u^i := u^{i+1} and go to (4b). \n\n(3) Set i := i + 1 and go to (1). \n\n(4a) % Move in the direction of the global minimum on the face of active constraints u_{N_i} = 0. Set ū^{i+1}_{B_i} := (Q_{B_iB_i})^{-1} e_{B_i} and u^{i+1} := argmin_{0≤λ≤1} {f(u^i + λ(ū^{i+1} - u^i)) | u^i + λ(ū^{i+1} - u^i) ≥ 0}. If u^{i+1}_j = 0 for some j in B_i, set i := i + 1 and go to (1). Otherwise u^{i+1} is a global minimum on the face u_{N_i} = 0, and go to (4b). \n\n(4b) % Iterate a gradient projection step. Set k := 0 and u^k := u^i. Iterate u^{k+1} := argmin_0