{"title": "An Improved Decomposition Algorithm for Regression Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 484, "page_last": 490, "abstract": null, "full_text": "An Improved Decomposition Algorithm for Regression Support Vector Machines

Pavel Laskov
Department of Computer and Information Sciences
University of Delaware
Newark, DE 19718
laskov@asel.udel.edu

Abstract

A new decomposition algorithm for training regression Support Vector Machines (SVM) is presented. The algorithm builds on the basic principles of decomposition proposed by Osuna et al., and addresses the issue of optimal working set selection. New criteria for testing the optimality of a working set are derived. Based on these criteria, the principle of \"maximal inconsistency\" is proposed to form (approximately) optimal working sets. Experimental results show superior performance of the new algorithm in comparison with traditional training of regression SVM without decomposition. Similar results have previously been reported on decomposition algorithms for pattern recognition SVM. The new algorithm is also applicable to advanced SVM formulations based on regression, such as density estimation and integral equation SVM.

1 Introduction

The increasing interest in applications of Support Vector Machines (SVM) to large-scale problems ushers in new requirements for the computational complexity of their training algorithms. Requests have recently been made for algorithms capable of handling problems containing 10^5 - 10^6 examples [1]. Training an SVM constitutes a quadratic programming problem, and a typical SVM package uses off-the-shelf optimization software to obtain a solution to it.
The number of variables in the optimization problem is equal to the number of training data points (for the pattern recognition SVM) or twice that number (for the regression SVM). The speed of general-purpose optimization methods is insufficient for problems containing more than a few thousand examples. This has motivated a quest for special-purpose training algorithms that take advantage of the particular structure of SVM training problems.

The main avenue of research in SVM training algorithms is decomposition. The key idea of decomposition, due to Osuna et al. [2], is to freeze all but a small number of optimization variables, and to solve a sequence of small fixed-size problems. The set of variables whose values are optimized at the current iteration is called the working set. The complexity of re-optimizing the working set is assumed to be constant-time.

In order for a decomposition algorithm to be successful, the working set must be selected in a smart way. The fastest known decomposition algorithm is due to Joachims [3]. It is based on Zoutendijk's method of feasible directions, proposed in the optimization community in the early 1960s. However, Joachims' algorithm is limited to the pattern recognition SVM because it makes use of the labels being ±1. The current article presents a similar algorithm for the regression SVM.

The new algorithm utilizes a slightly different background from optimization theory. The Karush-Kuhn-Tucker Theorem is used to derive conditions for determining whether or not a given working set is optimal. These conditions become the algorithm's termination criteria, as an alternative to Osuna's criteria (also used by Joachims without modification), which were stated as conditions on individual points.
The advantage of the new conditions is that knowledge of the hyperplane's constant factor b, which in some cases is difficult to compute, is not required. Further investigation of the new termination conditions leads to a strategy for selecting an optimal working set. The new algorithm is applicable to the pattern recognition SVM, and is provably equivalent to Joachims' algorithm. One can also interpret the new algorithm in the sense of the method of feasible directions. Experimental results presented in the last section demonstrate superior performance of the new method in comparison with traditional training of regression SVM.

2 General Principles of Regression SVM Decomposition

The original decomposition algorithm proposed for the pattern recognition SVM in [2] has been extended to the regression SVM in [4]. For the sake of completeness I will repeat the main steps of this extension with the aim of providing terse and streamlined notation to lay the ground for working set selection.

Given training data of size l, training of the regression SVM amounts to solving the following quadratic programming problem in 2l variables:

Maximize W(ᾱ) = ȳᵀᾱ − (1/2) ᾱᵀ D ᾱ

subject to:  cᵀᾱ = 0
             ᾱ − C·1 ≤ 0                                          (1)
             ᾱ ≥ 0

where

ᾱᵀ = (αᵀ, α*ᵀ),   ȳᵀ = (yᵀ − ε·1ᵀ, −yᵀ − ε·1ᵀ),   cᵀ = (1ᵀ, −1ᵀ),   D = [  K  −K ]
                                                                        [ −K   K ]

The basic idea of decomposition is to split the variable vector ᾱ into the working set ᾱ_B of fixed size q and the non-working set ᾱ_N containing the rest of the variables. The corresponding parts of vectors c and ȳ will also bear subscripts N and B. The matrix D is partitioned into D_BB, D_BN = D_NBᵀ and D_NN. A further requirement is that, for the i-th element of the training data, both α_i and α_i* are either included in or omitted from the working set.¹ The values of the variables in the non-working set are frozen for the iteration, and optimization is only performed with respect to the variables in the working set.
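As a quick numerical illustration of this split (the toy data and all names below are my own, not from the paper's implementation), freezing ᾱ_N and collecting the ᾱ_B-dependent terms of the objective yields a restricted objective that differs from the full one only by a constant:

```python
import numpy as np

# Toy instance of the dual problem: D built from a PSD kernel matrix K.
rng = np.random.default_rng(0)
l = 4
A = rng.normal(size=(l, l))
K = A @ A.T                                   # positive semi-definite kernel matrix
D = np.block([[K, -K], [-K, K]])              # curvature matrix of the dual
ybar = rng.normal(size=2 * l)
alpha = rng.uniform(0.0, 1.0, size=2 * l)     # current iterate

def W(a):
    # Objective of the full problem.
    return ybar @ a - 0.5 * a @ D @ a

# Working set: the variable pairs (alpha_i, alpha_i^*) for points i = 0, 1.
B = np.array([0, 1, l + 0, l + 1])
N = np.setdiff1d(np.arange(2 * l), B)

lin_B = ybar[B] - D[np.ix_(B, N)] @ alpha[N]  # linear term of the sub-problem
D_BB = D[np.ix_(B, B)]

def W_sub(aB):
    # Sub-problem objective: equals W up to a constant independent of alpha_B.
    return lin_B @ aB - 0.5 * aB @ D_BB @ aB

def embed(aB):
    a = alpha.copy()
    a[B] = aB
    return a

aB1 = rng.uniform(0.0, 1.0, size=B.size)
aB2 = rng.uniform(0.0, 1.0, size=B.size)
# Differences agree, so optimizing the sub-problem optimizes the full
# objective over the working set:
assert np.isclose(W(embed(aB1)) - W(embed(aB2)), W_sub(aB1) - W_sub(aB2))
```

The check passes for any choice of B because the cross term ᾱ_Bᵀ D_BN ᾱ_N is linear in ᾱ_B and all remaining terms do not involve ᾱ_B at all.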
Optimization of the working set is also a quadratic program. This can be seen by re-arranging the terms of the objective function and the equality constraint in (1) and dropping the terms independent of ᾱ_B from the objective. The resulting quadratic program (sub-problem) is formulated as follows:

Maximize (ȳ_Bᵀ − ᾱ_Nᵀ D_NB) ᾱ_B − (1/2) ᾱ_Bᵀ D_BB ᾱ_B

subject to:  c_Bᵀ ᾱ_B + c_Nᵀ ᾱ_N = 0
             ᾱ_B − C·1 ≤ 0                                        (2)
             ᾱ_B ≥ 0

¹This rule facilitates the formulation of the sub-problems to be solved at each iteration.

The basic decomposition algorithm chooses the first working set at random, and proceeds iteratively by selecting sub-optimal working sets and re-optimizing them, by solving quadratic program (2), until all subsets of size q are optimal. The precise formulation of the termination conditions will be developed in the following section.

3 Optimality of a Working Set

In order to maintain strict improvement of the objective function, the working set must be sub-optimal before re-optimization. The classical Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for optimality of a quadratic program. I will use these conditions applied to the standard form of a quadratic program, as described in [5], p. 36.

The standard form of a quadratic program requires that all constraints be of equality type except for non-negativity constraints. To cast the regression SVM quadratic program (1) into the standard form, the slack variables s = (s_1, …, s_2l) corresponding to the box constraints, and the following matrices are introduced:

     [ c   I  ]        [ ᾱ ]        [ 0 ]
E =  [ 1   0ᵀ ],   z = [ 0 ],   f = [   ]                         (3)
     [ 0   I  ]        [ s ]        [ C ]

where 1 is a vector of ones and C is a vector of length 2l. The zero element in vector z reflects the fact that the slack variable for the equality constraint must be zero.
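The snippet below is a small numerical sanity check of this standard form, assembled the way (3) reads above (the toy numbers and the block layout are my own assumptions, not code from the paper):

```python
import numpy as np

# Sanity check of the standard form: a point satisfying the box and equality
# constraints of the dual yields z >= 0 with E^T z = f.
l = 3
C = 1.0
c = np.concatenate([np.ones(l), -np.ones(l)])   # equality-constraint coefficients
rng = np.random.default_rng(1)

# A point satisfying the linear constraints: 0 <= alpha <= C, c^T alpha = 0.
# (Complementarity of alpha and alpha^* is ignored here; only the linear
# constraints are being checked.)
a = rng.uniform(0.0, C, size=l)
alpha = np.concatenate([a, a])                  # c^T alpha = sum(a) - sum(a) = 0

s = C - alpha                                   # box-constraint slacks s_1..s_2l
z = np.concatenate([alpha, [0.0], s])           # zero slack for the equality constraint

# E^T assembled blockwise from the definition of E above.
ET = np.block([
    [c[None, :], np.ones((1, 1)), np.zeros((1, 2 * l))],
    [np.eye(2 * l), np.zeros((2 * l, 1)), np.eye(2 * l)],
])
f = np.concatenate([[0.0], C * np.ones(2 * l)])

assert np.all(z >= 0)
assert np.allclose(ET @ z, f)                   # the compact constraints hold
```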
In the matrix notation all constraints of problem (1) can be compactly expressed as:

Eᵀz = f,   z ≥ 0                                                  (4)

In this notation the Karush-Kuhn-Tucker Theorem can be stated as follows:

Theorem 1 (Karush-Kuhn-Tucker Theorem) The primal vector z solves the quadratic problem (1) if and only if it satisfies (4) and there exists a dual vector uᵀ = (πᵀ, wᵀ) = (πᵀ, (μ, νᵀ)) such that:

π = Dᾱ + Ew − ȳ ≥ 0                                               (5)
ν ≥ 0                                                             (6)
uᵀz = 0                                                           (7)

It follows from the Karush-Kuhn-Tucker Theorem that if for all u satisfying conditions (6)-(7) the system of inequalities (5) is inconsistent, then the solution of problem (1) is not optimal. Since the objective function of sub-problem (2) was obtained by merely re-arranging terms in the objective function of the initial problem (1), the same conditions guarantee that the sub-problem (2) is not optimal. Thus, the main strategy for identifying sub-optimal working sets will be to enforce inconsistency of the system (5) while satisfying conditions (6)-(7).

Let us further analyze the inequalities in (5). Each inequality has one of the following forms:

π_i  = −φ_i + ε + ν_i  + μ ≥ 0                                    (8)
π_i* =  φ_i + ε + ν_i* − μ ≥ 0                                    (9)

where

φ_i = y_i − Σ_{j=1..l} (α_j − α_j*) K_ij

Consider the values α_i can possibly take:

1. α_i = 0. In this case s_i = C, and, by the complementarity condition (7), ν_i = 0. Then inequality (8) becomes:

   π_i = −φ_i + ε + μ ≥ 0  ⟹  μ ≥ φ_i − ε

2. α_i = C. By the complementarity condition (7), π_i = 0. Then inequality (8) becomes:

   −φ_i + ε + μ + ν_i = 0  ⟹  μ ≤ φ_i − ε

3. 0 < α_i < C. By the complementarity condition (7), ν_i = 0 and π_i = 0. Then inequality (8) becomes:

   −φ_i + ε + μ = 0  ⟹  μ = φ_i − ε

Similar reasoning for α_i* and inequality (9) yields the following results:

1. α_i* = 0. Then μ ≤ φ_i + ε.

2. α_i* = C.
Then μ ≥ φ_i + ε.

3. 0 < α_i* < C. Then μ = φ_i + ε.

As one can see, the only free variable in system (5) is μ. Each inequality restricts μ to a certain interval on the real line. Such intervals will be denoted as μ-sets in the rest of the exposition. Any subset of the inequalities in (5) is inconsistent if the intersection of the corresponding μ-sets is empty. This provides a lucid rule for determining the optimality of any working set: it is sub-optimal if the intersection of the μ-sets of all its points is empty. A sub-optimal working set will also be called \"inconsistent\". The following summarizes the rules for calculation of μ-sets, taking into account that for the regression SVM α_i α_i* = 0:

μ-set(i) =  [φ_i − ε, φ_i + ε]   if α_i = 0,        α_i* = 0
            [φ_i − ε, φ_i − ε]   if 0 < α_i < C,    α_i* = 0
            (−∞, φ_i − ε]        if α_i = C,        α_i* = 0
            [φ_i + ε, φ_i + ε]   if α_i = 0,        0 < α_i* < C
            [φ_i + ε, +∞)        if α_i = 0,        α_i* = C
                                                                  (10)

4 Maximal Inconsistency Algorithm

While inconsistency of the working set at each iteration guarantees convergence of decomposition, the rate of convergence is quite slow if arbitrary inconsistent working sets are chosen. A natural heuristic is to select \"maximally inconsistent\" working sets, in the hope that such a choice provides the greatest improvement of the objective function.
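The μ-set rules (10) translate directly into code. The sketch below (the function names are mine, not from a published implementation) computes the μ-set of one point and tests a set of points for inconsistency by intersecting intervals:

```python
import numpy as np

def mu_set(phi_i, a_i, a_star_i, C, eps, tol=1e-12):
    # The five cases of rules (10), using alpha_i * alpha_i^* = 0:
    # at most one of the pair (alpha_i, alpha_i^*) is nonzero.
    if a_star_i <= tol:                      # alpha_i^* = 0
        if a_i <= tol:                       # alpha_i = 0
            return (phi_i - eps, phi_i + eps)
        if a_i >= C - tol:                   # alpha_i = C
            return (-np.inf, phi_i - eps)
        return (phi_i - eps, phi_i - eps)    # 0 < alpha_i < C
    if a_star_i >= C - tol:                  # alpha_i = 0, alpha_i^* = C
        return (phi_i + eps, np.inf)
    return (phi_i + eps, phi_i + eps)        # alpha_i = 0, 0 < alpha_i^* < C

def is_inconsistent(mu_sets):
    # A set of points is sub-optimal iff its mu-sets have empty intersection.
    left = max(m[0] for m in mu_sets)        # largest left boundary
    right = min(m[1] for m in mu_sets)       # smallest right boundary
    return left > right

# Example: a point with alpha_i = C and a point with alpha_i^* = C whose
# mu-sets do not overlap form an inconsistent (sub-optimal) pair.
pair = [mu_set(0.2, 1.0, 0.0, C=1.0, eps=0.1),   # mu-set (-inf, 0.1]
        mu_set(0.5, 0.0, 1.0, C=1.0, eps=0.1)]   # mu-set [0.6, +inf)
assert is_inconsistent(pair)
```

Note that the intersection test uses a strict inequality: two μ-sets that touch at a single point still share an admissible μ and are therefore consistent.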
The notion of \"maximal inconsistency\" is easy to define: let it be the gap between the smallest right boundary and the largest left boundary of the μ-sets of elements in the training set:

G = L − R,   where  L = max_i μ_i^l,   R = min_i μ_i^r

While the working set is inconsistent (L > R), the algorithm repeats the following steps:

• compute the μ-sets according to the rules (10) for all elements in S
• select q/2 elements with the largest values of μ^l (\"left pass\")
• select q/2 elements with the smallest values of μ^r (\"right pass\")
• re-optimize the working set

Although the motivation provided for the maximal inconsistency algorithm is purely heuristic, the algorithm can be rigorously derived, in a similar fashion to Joachims' algorithm, from Zoutendijk's feasible direction problem. Details of this derivation cannot be presented here due to space constraints. Because of this relationship I will further refer to both algorithms as \"feasible direction\" algorithms.

5 Experimental Results

Experimental evaluation of the new algorithm was performed on the modified KDD Cup 1998 data set. The original data set is available at http://www.ics.uci.edu/~kdd/databases/kddcup98/kddcup98.html. The following modifications were made to obtain a pure regression problem:

• All 75 character fields were eliminated.
• The numeric fields CONTROLN, ODATEDW, TCODE and DOB were eliminated.

The remaining 400 features and the labels were scaled between 0 and 1. Initial subsets of the training database of different sizes were selected for evaluation of the scaling properties of the new algorithm. The training times of the algorithms, with and without decomposition, the numbers of support vectors, including bounded support vectors, and the experimental scaling factors, are displayed in Table 1.
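The two selection passes admit a compact sketch (illustrative code of my own; mu_left and mu_right hold the boundaries μ^l and μ^r of each point's μ-set, with invented toy values):

```python
import numpy as np

def select_working_set(mu_left, mu_right, q):
    left_pass = np.argsort(mu_left)[-(q // 2):]   # q/2 largest left boundaries
    right_pass = np.argsort(mu_right)[: q // 2]   # q/2 smallest right boundaries
    gap = mu_left.max() - mu_right.min()          # G = L - R; G > 0 means inconsistent
    return np.union1d(left_pass, right_pass), gap

mu_l = np.array([0.1, 0.9, 0.4, 0.7])
mu_r = np.array([0.2, 1.5, 0.3, 2.0])
ws, G = select_working_set(mu_l, mu_r, q=2)
assert set(ws.tolist()) == {0, 1}   # point 1 has the largest mu^l, point 0 the smallest mu^r
assert np.isclose(G, 0.7)           # L = 0.9, R = 0.2
```

Points at their bounds contribute infinite boundaries, which argsort orders correctly, so the same code covers all five cases of (10) without special handling.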
Table 1: Training time (sec) and number of SVs for the KDD Cup problem

Examples    no dcmp    dcmp    total SV    BSV
500              39      10         274      0
1000            226      41         518      3
2000           1490     158         970      5
3000           5744     397        1429      7
5000          27052    1252        2349     15
scaling factor:      2.84    2.08
SV-scaling factor:   3.06    2.24

Table 2: Training time (sec) and number of SVs for the KDD Cup problem, reduced feature space

Examples    no dcmp    dcmp    total SV    BSV
500              56      18         170     30
1000            346      44         374     62
2000           1768     198         510    144
3000           4789     366         729    222
5000          22115     863        1139    354
scaling factor:      2.55    1.72
SV-scaling factor:   3.55    2.35

The experimental scaling factors are obtained by fitting lines to log-log plots of the running times against sample sizes, in the number of examples and the number of unbounded support vectors respectively. Experiments were run on an SGI Octane with a 195 MHz clock and 256 MB RAM. An RBF kernel with γ = 10, C = 1, termination accuracy 0.001, a working set size of 20, and a cache size of 5000 samples were used. A similar experiment was performed on a reduced feature set consisting of the first 50 features selected from the full-size data set. This experiment illustrates the behavior of the algorithms when a large number of support vectors are bounded. The results are presented in Table 2.

6 Discussion

It comes as no surprise that the decomposition algorithm outperforms the conventional training algorithm by an order of magnitude. Similar results have been well established for the pattern recognition SVM. Remarkable is the coincidence of the scaling factors of the maximal inconsistency algorithm and Joachims' algorithm: his scaling factors range from 1.7 to 2.1 [3].
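The reported scaling factors can be reproduced from the tabulated numbers; for example, a least-squares line through (log n, log t) for the dcmp column of Table 1 recovers a slope of about 2.08:

```python
import numpy as np

# Log-log fit of training time vs. number of examples (dcmp column, Table 1).
examples = np.array([500.0, 1000.0, 2000.0, 3000.0, 5000.0])
t_dcmp = np.array([10.0, 41.0, 158.0, 397.0, 1252.0])

slope, intercept = np.polyfit(np.log(examples), np.log(t_dcmp), 1)
assert abs(slope - 2.08) < 0.05   # matches the scaling factor reported in Table 1
```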
I believe, however, that a more important performance measure is the SV-scaling factor, and the results above suggest that this factor is consistent even for problems with significantly different compositions of support vectors. Further experiments should investigate the properties of this measure.

Finally, I would like to mention other methods proposed in order to speed up training of SVM, although no experimental results have been reported for these methods with regard to training of the regression SVM. Chunking [6], p. 366, iterates through the training data, accumulating support vectors and adding a \"chunk\" of new data until no more changes to the solution occur. The main problem with this method is that when the percentage of support vectors is high it essentially solves a problem of almost the same size more than once. Sequential Minimal Optimization (SMO), proposed by Platt [7] and easily extendable to the regression SVM [1], employs an idea similar to decomposition but always uses a working set of size 2. For such a working set, a solution can be calculated \"by hand\" without numerical optimization. A number of heuristics are applied in order to choose a good working set. It is difficult to draw a comparison between the working set selection mechanisms of SMO and the feasible direction algorithms, but the experimental results of Joachims [3] suggest that SMO is slower. Another advantage of feasible direction algorithms is that the size of the working set is not limited to 2, as in SMO. Practical experience shows that the optimal size of the working set is between 10 and 100. Lastly, traditional optimization methods, such as Newton's or conjugate gradient methods, can be modified to yield a complexity of O(s³), where s is the number of detected support vectors [8].
This can be a considerable improvement over methods that have a complexity of O(l³), where l is the total number of training samples.

The real challenge lies in attaining sub-O(s³) complexity. While the experimental results suggest that feasible direction algorithms might attain such complexity, their complexity is not fully understood from the theoretical point of view. More specifically, the convergence rate, and its dependence on the number of support vectors, needs to be analyzed. This will be the main direction of future research in feasible direction SVM training algorithms.

References

[1] Smola, A., Schölkopf, B. (1998) A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.

[2] Osuna, E., Freund, R., Girosi, F. (1997) An Improved Training Algorithm for Support Vector Machines. Proceedings of IEEE NNSP'97, Amelia Island, FL.

[3] Joachims, T. (1998) Making Large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning. B. Schölkopf, C. Burges, A. Smola (eds.), MIT Press.

[4] Osuna, E. (1998) Support Vector Machines: Training and Applications. Ph.D. Dissertation, Operations Research Center, MIT.

[5] Boot, J. (1964) Quadratic Programming. Algorithms - Anomalies - Applications. North-Holland Publishing Company, Amsterdam.

[6] Vapnik, V. (1982) Estimation of Dependences Based on Empirical Data. Springer-Verlag.

[7] Platt, J. (1998) Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In: Advances in Kernel Methods - Support Vector Learning. B. Schölkopf, C. Burges, A. Smola (eds.), MIT Press.

[8] Kaufman, L. (1998) Solving the Quadratic Programming Problem Arising in Support Vector Classification. In: Advances in Kernel Methods - Support Vector Learning. B. Schölkopf, C. Burges, A. Smola (eds.), MIT Press.
", "award": [], "sourceid": 1717, "authors": [{"given_name": "Pavel", "family_name": "Laskov", "institution": null}]}