{"title": "Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 705, "page_last": 711, "abstract": null, "full_text": "Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM \n\nShai Fine \nIBM T. J. Watson Research Center \nfshai@us.ibm.com \n\nKatya Scheinberg \nIBM T. J. Watson Research Center \nkatyas@us.ibm.com \n\nAbstract \n\nWe propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The latter method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solves the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. \n\n1 Introduction \n\nSVM training is a convex optimization problem which scales with the training set size rather than the input dimension. While this is usually considered to be a desirable quality, in large scale problems it may make training impractical. The common way to handle massive data applications is to turn to active set methods, which gradually build the set of active constraints by feeding a generic optimizer with small scale sub-problems. 
Active set methods are guaranteed to converge to the global solution; however, convergence may be very slow, it may require too many passes over the data set, and at each iteration there is an implicit computational overhead for the actual active set selection. By using some heuristics and caching mechanisms, one can, in practice, reduce this load significantly. \n\nAnother common practice is to modify the SVM optimization problem so that it does not handle the bias term directly. Instead, the bias is either fixed in advance¹ (e.g. [6]) or added as another dimension to the feature space (e.g. [4]). The advantage is that the resulting dual optimization problem does not contain the linear constraint, in which case one can suggest a procedure which updates only one Lagrange multiplier at a time. Thus, an incremental approach, which efficiently updates an existing solution given a new training point, can be devised. Though widely used, the solution resulting from this practice has inferior generalization performance and the number of SV tends to be much higher [4]. \n\n¹Throughout this sequel we will refer to such a solution as the fixed bias solution. \n\nTo the best of our knowledge, the only incremental algorithm suggested so far to exactly solve the 1-Norm Soft Margin² optimization problem has been described by Cauwenberghs and Poggio in [3]. This algorithm handles adiabatic increments by solving a system of linear equations resulting from a parametric transcription of the KKT conditions. This approach is somewhat close to the one independently developed here, and we offer a more thorough comparison in the discussion section. \n\nIn this paper³ we introduce two new methods derived from parametric QP techniques. The two methods are based on the same framework, which we call Parametric Optimization for Kernel methods (POKER), and are essentially the same methodology applied to somewhat different problems. 
The first method solves the fixed bias problem, while the second one starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. Each of these methods can be used independently to solve the SVM training problem. The most interesting application, however, is alternating between the two methods to obtain a unique incremental algorithm. We will show how, by using this approach, we can adjust the optimal solution as more data becomes available, and how applying Selective Sampling techniques may further boost the convergence rate. \n\nBoth our methods converge after a finite number of iterations. In principle, this number may be exponential in the training set size, n. However, since parametric QP methods are based on the well-known Simplex method for linear programming, a similar behavior is expected: though in theory the Simplex method is known to have exponential complexity, in practice it hardly ever displays exponential behavior. The per-iteration complexity is expected to be O(nl), where l is the number of active points at that iteration, with the exception of some rare cases in which the complexity is expected to be bounded by O(nl²). \n\n2 Parametric QP for SVM \n\nAny optimal solution to the 1-Norm Soft Margin SVM optimization problem must satisfy the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions: \n\n1. α_i s_i = 0, i = 1, ..., n \n2. (c − α_i) ξ_i = 0, i = 1, ..., n \n3. yᵀα = 0 \n4. −Qα + by + s − ξ = −e \n5. 0 ≤ α ≤ c, s ≥ 0, ξ ≥ 0.   (1) \n\n²A different incremental approach stems from a geometric interpretation of the primal problem: Keerthi et al. [7] were the first to suggest a nearest point batch algorithm and Kowalczyk [8] provided the on-line version. 
They handled the inseparable case with the well-known transformation w ↦ (w, √c ξ) and b ↦ b, which establishes the equivalence between the Hard Margin and the 2-Norm Soft Margin optimization problems. Although the 1-Norm and the 2-Norm have been shown to yield equivalent generalization properties, it is often observed (cf. [7]) that the former method results in a smaller number of SV. It is obvious from the above transformation that the 1-Norm Soft Margin is the most general SVM optimization problem. \n\n³The detailed statements of the algorithms and the supporting lemmas were omitted due to space limitations, and can be found in [5]. \n\nHere α ∈ Rⁿ is the vector of Lagrange multipliers, b is the bias (scalar), and s and ξ are the n-dimensional vectors of slack and surplus variables, respectively. y is the vector of labels, ±1. Q is the label encoded kernel matrix, i.e. Q_ij = y_i y_j K(x_i, x_j), e is the vector of all 1's of length n, and c is the penalty associated with errors. If we assume that the value of the bias is fixed to some predefined value b, then condition 3 disappears from the system (1) and condition 4 becomes \n\n−Qα + s − ξ = −e − by.   (2) \n\nConsider the following modified parametric system of KKT conditions \n\nα_i s_i = 0, i = 1, ..., n \n(c − α_i) ξ_i = 0, i = 1, ..., n \n−Qα + s − ξ = p + u(−e − yb − p) \n0 ≤ α ≤ c, s ≥ 0, ξ ≥ 0,   (3) \n\nfor some vector p. It is easy to find p, α, s and ξ satisfying (3) for u = 0. For example, one may pick α = 0, s = e, ξ = 0 and p = −Qα + s. For u = 1 the system (3) reduces to the fixed bias system. Our fixed bias method starts at a solution to (3) for u = 0 and, by increasing u while updating α, s and ξ so that they satisfy (3), obtains the optimal solution for u = 1. \n\nSimilarly, we can obtain a solution to (1) by starting at a fixed bias solution and updating b, while maintaining α, s and ξ feasible for (2), until the optimal value of b is reached. 
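The starting point of the u-homotopy described above is easy to check numerically. The following sketch (not the authors' code; the data, kernel and all variable names are our own) builds a random label encoded kernel matrix Q and verifies that α = 0, s = e, ξ = 0 with p = −Qα + s satisfy the parametric system (3) at u = 0: \n\n```python
import numpy as np

# Sketch: verify the u = 0 starting point of the parametric system (3).
rng = np.random.default_rng(0)
n = 20
X = rng.normal(size=(n, 3))
y = np.sign(rng.normal(size=n))
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # label encoded linear kernel matrix
c = 1.0

alpha = np.zeros(n)        # alpha = 0
s = np.ones(n)             # s = e
xi = np.zeros(n)           # xi = 0
p = -Q @ alpha + s         # p = -Q alpha + s (= e here)

u, b = 0.0, 0.0
rhs = p + u * (-np.ones(n) - y * b - p)
assert np.allclose(-Q @ alpha + s - xi, rhs)              # linear equation of (3)
assert np.all(alpha * s == 0)                             # complementarity
assert np.all((c - alpha) * xi == 0)
assert np.all((alpha >= 0) & (alpha <= c)) and np.all(s >= 0) and np.all(xi >= 0)
```
 \nThe homotopy then drives u from 0 to 1 while keeping these conditions satisfied, which is what the fixed bias method above does. \n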
The optimal value of the bias is recognized when the corresponding solution satisfies (1), namely αᵀy = 0. \n\nBoth these methods are based on the same framework of adjusting a scalar parameter in the right hand side of a KKT system. In the next section we will present the method for adjusting the bias (adjusting u in (3) is very similar, save for a few technical differences). An advantage of this special case is that it solves the original problem and can, in principle, be applied \"from scratch\". \n\n3 Correcting a \"Fixed Bias\" Solution \n\nLet (α(b), s(b), ξ(b)) be a fixed bias solution for a given b. The algorithm that we present here is based on increasing (or decreasing) b monotonically, until the optimal b* is found, while updating and maintaining (α(b), s(b), ξ(b)). \n\nLet us introduce some notation. For a given b and a fixed bias solution (α(b), s(b), ξ(b)), we partition the index set I = {1, ..., n} into three sets I_0(b), I_c(b) and I_s(b) in the following way: ∀i ∈ I_0(b), s_i(b) > 0 and α_i(b) = 0; ∀i ∈ I_c(b), ξ_i(b) > 0 and α_i(b) = c; and ∀i ∈ I_s(b), s_i(b) = ξ_i(b) = 0 and 0 ≤ α_i(b) ≤ c. It is easy to see that I_0(b) ∪ I_c(b) ∪ I_s(b) = I and I_0(b) ∩ I_c(b) = I_c(b) ∩ I_s(b) = I_0(b) ∩ I_s(b) = ∅. We will call the partition (I_0(b), I_c(b), I_s(b)) the optimal partition for the given b. We will refer to I_s as the active set. Based on the partition (I_0, I_c, I_s) we define Q_ss (Q_cs, Q_sc, Q_cc, Q_os, Q_oo) as the submatrix of Q whose columns are the columns of Q indexed by the set I_s (I_c, I_s, I_c, I_0, I_0) and whose rows are the rows of Q indexed by I_s (I_s, I_c, I_c, I_s, I_0). We also define y_s (y_c, y_0) and α_s (α_c, α_0) as the subvectors of y and α whose entries are indexed by I_s (I_c, I_0). By e_s (e_c) we denote a vector of all ones of the appropriate size. \n\nAssume that we are given an initial guess⁴ b⁰ < b*. 
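The optimal partition defined above is straightforward to recover from a fixed bias solution. A minimal sketch (the function name and tolerance are our own, not part of the paper): \n\n```python
import numpy as np

# Recover the optimal partition (I_0, I_c, I_s) from a fixed bias
# solution (alpha, s, xi), following the definitions in the text.
def optimal_partition(alpha, s, xi, c, tol=1e-9):
    alpha, s, xi = map(np.asarray, (alpha, s, xi))
    I0 = np.where((s > tol) & (alpha <= tol))[0]          # s_i > 0, alpha_i = 0
    Ic = np.where((xi > tol) & (alpha >= c - tol))[0]     # xi_i > 0, alpha_i = c
    Is = np.where((np.abs(s) <= tol) & (np.abs(xi) <= tol))[0]  # s_i = xi_i = 0
    return I0, Ic, Is

alpha = np.array([0.0, 1.0, 0.4, 0.0])
s     = np.array([0.7, 0.0, 0.0, 0.0])
xi    = np.array([0.0, 0.3, 0.0, 0.0])
I0, Ic, Is = optimal_partition(alpha, s, xi, c=1.0)
# the last point (alpha = 0, s = xi = 0) sits on the margin and lands in I_s
```
 \nNote that a point with α_i = 0 but s_i = 0 belongs to the active set I_s, not to I_0, which is why the three sets are disjoint and cover I. \n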
To initiate the algorithm we assume that we know the optimal partition (I_0(b⁰), I_c(b⁰), I_s(b⁰)) that corresponds to α⁰ = α(b⁰). We know that ∀i ∈ I_0, α_i = 0 and ∀i ∈ I_c, α_i = c. We also know that −Q_i α + y_i b = −1, ∀i ∈ I_s (here Q_i is the i-th row of Q). We can write the set of active constraints as \n\nQ_ss α_s + c Q_cs e_c = e_s + b y_s.   (4) \n\n⁴Whether b⁰ < b* can be determined by evaluating −yᵀα(b⁰): if −yᵀα(b⁰) > 0 then b⁰ < b*, otherwise b⁰ > b*, in which case the algorithm is essentially the same, save for obvious changes. \n\nIf Q_ss is nonsingular (the nondegenerate case), then α_s depends linearly on the scalar b. Similarly, we can express s_0 and ξ_c as linear functions of b. If Q_ss is singular (the degenerate case), then the set of all possible solutions α_s changes linearly with b as long as the partition remains optimal. In either case, if 0 < α_s < c, s_0 > 0 and ξ_c > 0, then sufficiently small changes in b preserve these constraints. At each iteration b can increase until one of the four types of inequality constraints becomes active. Then the optimal partition is updated, new linear expressions of the active variables through b are computed, and the algorithm iterates. We terminate when yᵀα < 0, that is, when b > b*. The final iteration gives us the correct optimal active set and optimal partition; from these we can easily compute b* and α*. \n\nA geometric interpretation of the algorithmic steps suggests that we are trying to move the separating hyperplane by increasing its bias while at the same time adjusting its orientation so it stays optimal for the current bias. At each iteration we move the hyperplane until either a support vector is dropped from the support set, a support vector becomes violated, a violated point becomes a support vector, or an inactive point joins the support vector set. \n\nThe algorithm is guaranteed to terminate after finitely many iterations. 
At each iteration the algorithm covers an interval that corresponds to an optimal partition. The same partition cannot correspond to two different intervals and the number of partitions is finite, hence so is the number of iterations (cf. [1, 9]). Per-iteration complexity depends on whether an iteration is degenerate or not. A nondegenerate iteration takes O(n|I_s|) + O(|I_s|³) arithmetic operations, while a degenerate iteration should in theory take O(n²|I_s|²) operations, but in practice it only takes⁵ O(n|I_s|²). Note that degeneracy occurs when the active support vectors are linearly dependent. The larger the rank of the kernel matrix, the less likely such a situation. The storage requirement of the algorithm is O(n) + O(|I_s|²). \n\n4 Incremental Algorithm \n\nIncremental and on-line algorithms are aimed at training problems for which the data becomes available in the course of training. Such an algorithm, when given an optimal solution for a training set of size n and m additional training points, has to efficiently find the optimal solution to the extended n + m training set. Assume we have an optimal solution (α, b, s, ξ) for a given data set X of size n. For each new point that is added, we take the following actions: a new Lagrange multiplier α_{n+1} = 0 is added to the set of multipliers, then the distance to the margin is evaluated for this point. If the point is not violated, that is, if s_{n+1} = y_{n+1}(wᵀx_{n+1} − b) − 1 > 0, then the new positive slack s_{n+1} is added to the set of slack variables. If the point is violated then s_{n+1} = 1 is added to the set of slack variables. (Notice that at this point the condition −y_{n+1} wᵀx_{n+1} + y_{n+1} b + s_{n+1} = −1 is violated.) A surplus variable ξ_{n+1} = 0 is also added to the set of surplus variables. The optimal partition is adjusted accordingly. The process is repeated for all the points that have to be added at the given step. 
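The per-point bookkeeping just described can be sketched in code. This is our own illustration (function and variable names are hypothetical), using an explicit primal normal w for a linear kernel; with a nonlinear kernel, wᵀx_{n+1} would be the usual kernel expansion over the current multipliers: \n\n```python
import numpy as np

# Sketch of adding one point (x_new, y_new) to the current solution.
def add_point(w, b, x_new, y_new, alpha, s, xi):
    s_new = y_new * (w @ x_new - b) - 1.0   # slack of the new point
    alpha = np.append(alpha, 0.0)           # new multiplier starts at zero
    xi = np.append(xi, 0.0)                 # new surplus variable is zero
    if s_new > 0:                           # not violated: keep its true slack
        s = np.append(s, s_new)
    else:                                   # violated: insert s = 1; the fixed
        s = np.append(s, 1.0)               # bias phase repairs the KKT row later
    return alpha, s, xi

w, b = np.array([1.0, -0.5]), 0.2
alpha, s, xi = np.zeros(3), np.ones(3), np.zeros(3)
alpha, s, xi = add_point(w, b, np.array([2.0, 0.0]), 1.0, alpha, s, xi)
# here s_new = 1*(2.0 - 0.2) - 1 = 0.8 > 0, so the point is not violated
```
 \n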
If no violated points were encountered, then no further action is necessary: the current solution is optimal and the bias is unchanged. If at least one point is violated, then the new set (α, b, s, ξ) is not feasible for the KKT system (1) with the extended data set. However, it is easy to find p such that (α, b, s, ξ) is optimal for (3). Thus we can first apply the fixed bias algorithm to find a new solution and then apply the adjustable bias algorithm to find the optimal solution to the new extended problem (see Figure 1). \n\n⁵This assumes solving such a problem by an interior point method. \n\n0. Given a dataset X, a solution (α⁰, b⁰, s⁰, ξ⁰), and new points x^{n+i}, i = 1, ..., m \n1. Set p = −e − by \n2. For i = 1, ..., m: set α_{n+i} = ξ_{n+i} = 0 and s_{n+i} = y^{n+i}((x^{n+i})ᵀw − b) − 1; \n   if s_{n+i} ≤ 0, set p^{n+i} := −y^{n+i}(x^{n+i})ᵀw + 1 and s_{n+i} = 1; else set p^{n+i} := −1 − by^{n+i} \n3. X := X ∪ {x^{n+1}, ..., x^{n+m}}, y := (y^1, ..., y^n, y^{n+1}, ..., y^{n+m}) \n4. If p ≠ −e − by, call POKERfixedbias(X, y, α, b, s, ξ, p) \n5. Call POKERadjustbias(X, y, α, b, s, ξ); if there are more data points, go to 0 \n\nFigure 1: Outline of the incremental algorithm (AltPOKER) \n\nIn theory, adding even one point may force the algorithm to work as hard as if it were solving the problem \"from scratch\". But in practice this virtually never happens. In our experiments, just a few iterations of the fixed bias and adjustable bias algorithms were sufficient to find the solution to the extended problem. Overall, the computational complexity of the incremental algorithm is expected to be O(n²). \n\n5 Experiments \n\nConvergence in Batch Mode: The most straightforward way to activate POKER in batch mode is to construct the trivial partition⁶ and then apply the adjustable bias algorithm to get the optimal solution. We term this method SelfInit POKER. 
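The trivial initialization behind SelfInit POKER (see footnote 6) can be sketched as follows. This is our own illustration with hypothetical names; which class receives 0 and which receives C depends on the sign chosen for the large fixed bias, so the assignment below is one of the two symmetric choices: \n\n```python
import numpy as np

# Trivial initial partition: fix the bias to a large value and set each
# Lagrange multiplier to 0 or C according to its class label.
def trivial_init(y, C, b_magnitude=1e3):
    y = np.asarray(y, dtype=float)
    b = b_magnitude                      # bias fixed large (could also be negative)
    alpha = np.where(y < 0, C, 0.0)      # negative class -> C, positive class -> 0
    return alpha, b

alpha, b = trivial_init([1.0, -1.0, -1.0, 1.0], C=2.0)
```
 \nFrom this starting point the adjustable bias algorithm drives b toward b* exactly as in section 3. \n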
Note that the initial value of the bias is most likely far away from the global solution, and as such, the results presented here should be regarded as a lower bound. We examined performance on a moderate size problem, the Abalone data set from the UCI Repository [2]. We fed the training algorithm with increasing subsets up to the whole set (of size 4177). The gender encoding (male/female/infant) was mapped into {(1,0,0), (0,1,0), (0,0,1)}. Then the data was scaled to lie in the [−1, 1] interval. We demonstrate convergence for polynomial kernels of increasing degree, which in this setting corresponds to the level of difficulty. However naive our implementation is, one can observe (see Figure 2) a linear convergence rate in batch mode. \n\nConvergence in Incremental Mode: AltPOKER is the incremental algorithm described in section 4. We examined its performance on the \"diabetes\" problem⁷ that was used by Cauwenberghs and Poggio in [3] to test the performance of their algorithm. We demonstrate convergence for the RBF kernel with increasing penalty (\"C\"). Figure 3 demonstrates the advantage of the more flexible approach \n\n⁶Fixing the bias term to be large enough (positive or negative) and the Lagrange multipliers to 0 or C based on their class (negative/positive) membership. \n\n⁷Available at http://bach.ece.jhu.edu/pub/gert/svm/incremental \n\nFigure 2: SelfInit POKER - Convergence in Batch mode (No. of iterations vs. problem size, polynomial kernels (<x,y>+1)^d) \n\nFigure 3: AltPOKER - Convergence in Incremental mode (No. of iterations vs. chunk size, C = 0.1, 1, 10, 25, 50, 75, 100) \n\nwhich allows various increment sizes: using increments of only one point resulted in performance of a similar scale as that of Cauwenberghs and Poggio, but with increasing chunk sizes we observe rapid improvement in the convergence rate. \n\nSelective Sampling: We can use the incremental algorithm even in the case when all the data is available in advance, to improve the overall efficiency. If one can select a good representative small subset of the data set, then one can use it for training, hoping that the majority of the data points are classified correctly using the initial sampled data⁸. We applied selective sampling as a preprocess in incremental mode: at each meta-iteration, we ranked the points according to a predefined selection criterion, and then picked just the top ones for the increment. \n\nThe following selection criteria have been used in our experiments: Cls2W picks the closest point to the current hyperplane. This approach is inspired by active learning schemes which strive to halve the version space. However, the notion of a version space is more complex when the problem is inseparable. Thus, it is reasonable to adopt a greedy approach which selects the point that will cause the larger change in the value of the objective function. \n\nWhile solving the optimization problem for all possible increments is impracticable, it may still be worthwhile to approximate the potential change: MaxSlk picks the most violating point. 
This corresponds to an upper bound estimate of the change in the objective, since the value of the slack (times c) is an upper bound on the feasibility gap. dObj performs only a few iterations of the adjustable bias algorithm and examines the change in the objective value. This is similar to the Strong Branching technique used in branch and bound methods for integer programming. Here it provides a lower bound estimate of the change in the objective value. \n\nAlthough performing only a few iterations is much cheaper than converging to the optimal solution, this technique is still more demanding than the previous selection methods. Hence we first ranked the points using Cls2W (MaxSlk) and then applied dObj only to the top few. Table 1 presents the application of the above mentioned criteria to three different problems. The results clearly show the advantage of using the information obtained by the dObj estimate. \n\n⁸This is different from a full-fledged Active Learning scheme in which the data is not labeled, but rather queried at selected points. \n\nSelection Criteria | Synthetic (10Kx2) | \"ionosphere\" [2] | \"diabetes\" \n(a*, |I_s|, |I_c|, |I_0|) | (400, 4, 11, 9985) | (8, 73, 1, 277) | (40, 20, 313, 243) \nNo Selection | 234 | 871 | 3078 \nMaxSlk | 112 | 303 | 3860 \nMaxSlk+dObj | 92 | 269 | 3184 \nCls2W | 128 | 433 | 2576 \nCls2W+dObj | 116 | 407 | 2218 \n\nTable 1: The impact of Selective Sampling on the No. of iterations of AltPOKER: synthetic data (10Kx2), \"ionosphere\" [2] and \"diabetes\" (columns ordered resp.) \n\n6 Conclusions and Discussion \n\nWe propose a new finitely convergent method that can be applied in both batch and incremental modes to solve the 1-Norm Soft Margin SVM problem. Assuming that the number of support vectors is small compared to the size of the data, the method is expected to perform O(n²) arithmetic operations, where n is the problem size. 
Applying Selective Sampling techniques may further boost convergence and reduce the computational load. \n\nOur method was independently developed, but is somewhat similar to that of [3]. Our method, however, is more general: it can be applied to solve fixed bias problems as well as to obtain the optimal bias from a given fixed bias solution; it is not restricted to increments of size one, but rather can handle increments of arbitrary size; and it can be used to get an estimate of the drop in the value of the objective function, which is a useful selective sampling criterion. \n\nFinally, it is possible to extend this method to produce a true on-line algorithm, by assuming certain properties of the data. This re-introduces some very important applications of the on-line technology, such as active learning and various forms of adaptation. Pursuing this direction with a special emphasis on massive data applications (e.g. speech related applications) is left for further study. \n\nReferences \n\n[1] A. B. Berkelaar, B. Jansen, K. Roos, and T. Terlaky. Sensitivity analysis in (degenerate) quadratic programming. Technical Report 96-26, Delft University, 1996. \n\n[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. \n\n[3] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems 13, pages 409-415, 2001. \n\n[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000. \n\n[5] S. Fine and K. Scheinberg. POKER: Parametric optimization framework for kernel methods. Technical report, IBM T. J. Watson Research Center, 2001. Submitted. \n\n[6] T. T. Friess, N. Cristianini, and C. Campbell. The kernel-adatron algorithm: A fast and simple learning procedure for SVM. In Proc. of 15th ICML, pages 188-196, 1998. \n\n[7] S. S. Keerthi, S. K. 
Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative nearest point algorithm for SVM classifier design. IEEE Trans. NN, 11:124-136, 2000. \n\n[8] A. Kowalczyk. Maximal margin perceptron. In Advances in Large Margin Classifiers, pages 75-113. MIT Press, 2000. \n\n[9] R. T. Rockafellar. Conjugate Duality and Optimization. SIAM, Philadelphia, 1974.", "award": [], "sourceid": 1978, "authors": [{"given_name": "Shai", "family_name": "Fine", "institution": null}, {"given_name": "Katya", "family_name": "Scheinberg", "institution": null}]}