{"title": "Leveraged Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 610, "page_last": 616, "abstract": null, "full_text": "Leveraged Vector Machines \n\nYoram Singer \n\nHebrew University \n\nsinger@cs.huji.ac.il \n\nAbstract \n\nWe describe an iterative algorithm for building vector machines used in \nclassification tasks. The algorithm builds on ideas from support vector \nmachines, boosting, and generalized additive models. The algorithm can \nbe used with various continuously differential functions that bound the \ndiscrete (0-1) classification loss and is very simple to implement. We test \nthe proposed algorithm with two different loss functions on synthetic and \nnatural data. We also describe a norm-penalized version of the algorithm \nfor the exponential loss function used in AdaBoost. The performance of \nthe algorithm on natural data is comparable to support vector machines \nwhile typically its running time is shorter than of SVM. \n\n1 Introduction \nSupport vector machines (SVM) [1, 13] and boosting [10, 3, 4, 11] are highly popular and \neffective methods for constructing linear classifiers. The theoretical basis for SVMs stems \nfrom Vapnik's seminal on learning and generalization [12] and has proved to be of great \npractical usage. The first boosting algorithms [10, 3], on the other hand, were developed \nto answer certain fundamental questions about PAC-learnability [6]. While mathemati(cid:173)\ncally beautiful, these algorithms were rather impractical. Later, Freund and Schapire [4] \ndeveloped the AdaBoost algorithm, which proved to be a practically useful meta-learning \nalgorithm. AdaBoost works by making repeated calls to a weak learner. On each call the \nweak learner generates a single weak hypothesis, and these weak hypotheses are combined \ninto an ensemble called strong hypothesis. 
Recently, Schapire and Singer [11] studied a simple generalization of AdaBoost in which a weak hypothesis can assign a real-valued confidence to each prediction. Even more recently, Friedman, Hastie, and Tibshirani [5] presented an alternative, statistical view of boosting and described a new family of algorithms for constructing generalized additive models of base learners in a fashion similar to AdaBoost. Their work generated much attention and motivated research on classification algorithms that employ various loss functions [8, 7]. \n\nIn this work we combine ideas from the research mentioned above and devise an alternative approach to constructing vector machines for classification. As in SVM, the base predictors that we use are Mercer kernels. The value of a kernel evaluated at an input pattern, i.e., the dot-product between two instances embedded in a high-dimensional space, is viewed as a real-valued prediction. We describe a simple extension of additive models in which the prediction of a base learner is a linear transformation of a given kernel. We then describe an iterative algorithm that greedily adds kernels. We derive our algorithm using the exponential loss function used in AdaBoost and the loss function used by Friedman, Hastie, and Tibshirani [5] in \"LogitBoost\". For brevity we call the resulting classifiers boosted vector machines (BVM) and logistic vector machines (LVM). We would like to note in passing that the resulting algorithms are not boosting algorithms in the PAC sense. For instance, the weak-learnability assumption that the weak learner can always find a weak hypothesis is violated. We therefore adopt the terminology used in [2] and call the resulting classifiers leveraged vector machines. \n\nThe leveraging procedure we give adopts the chunking technique from SVM. 
After presenting the basic leveraging algorithms, we compare their performance with SVM on synthetic data. The experimental results show that leveraged vector machines achieve performance similar to SVM, and the resulting vector machines are often smaller than those obtained by SVM. The experiments also demonstrate that BVM is especially sensitive to (malicious) label noise, while LVM seems to be much less sensitive. We also describe a simple norm-penalized extension of BVM that provides a partial solution to overfitting in the presence of noise. Finally, we give results of experiments performed with natural data from the UCI repository and conclude. \n\n2 Preliminaries \n\nLet S = ((x_1, y_1), ..., (x_m, y_m)) be a sequence of training examples where each instance x_i belongs to a domain or instance space X, and each label y_i is in {-1, +1}. (The methods described in this paper for building vector machines and SVMs can be extended to solve multiclass problems using, for instance, error-correcting output coding. Such methods are beyond the scope of this paper and will be discussed elsewhere.) For convenience, we will use \bar{y}_i to denote (y_i + 1)/2 \in {0, 1}. \n\nAs in boosting, we assume access to a weak or base learning algorithm which accepts as input a weighted sequence of training examples S. Given such input, the weak learner computes a weak (or base) hypothesis h. In general, h has the form h : X -> R. We interpret the sign of h(x) as the predicted label (-1 or +1) to be assigned to instance x, and the magnitude |h(x)| as the \"confidence\" in this prediction. \n\nTo build vector machines we use the notion of confidence-rated predictions: we take for base hypotheses sample-based Mercer kernels [13], and define the confidence (i.e., the magnitude of prediction) of a base learner to be the value of its dot-product with another instance. The sign of the prediction is set to be the label of the corresponding instance. 
Formally, for each base hypothesis h there exists (x_j, y_j) \in S such that h(x) = y_j K(x_j, x), where K(u, v) defines an inner product in a feature space: K(u, v) = \sum_{k=1}^{\infty} a_k \psi_k(u) \psi_k(v). We denote the function induced by an instance-label pair (x_j, y_j) with a kernel K by \bar{h}_j(x) = y_j K(x_j, x). Our goal is to find a classifier f(x), called a strong hypothesis in the context of boosting algorithms, of the form f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) + \beta, such that the signs of the predictions of the classifier agree, as much as possible, with the labels of the training instances. \n\nThe leveraging algorithm we describe maintains a distribution D over {1, ..., m}, i.e., over the indices of S. This distribution is simply a vector of non-negative weights, one weight per example, and is an exponential function of the classifier f, which is built incrementally: \n\nD(i) = (1/Z) exp(-y_i f(x_i)) , where Z = \sum_{i=1}^{m} exp(-y_i f(x_i)) .   (1) \n\nFor a random function g of the input instances and the labels, we denote the sample expectation of g according to D by E_D(g) = \sum_{i=1}^{m} D(i) g(x_i, y_i). We also use this notation to denote the expectation of matrices of random functions. We convert a confidence-rated classifier f into a randomized predictor by using the soft-max function, denoted P(x_i), where \n\nP(x_i) = exp(f(x_i)) / (exp(f(x_i)) + exp(-f(x_i))) = 1 / (1 + exp(-2 f(x_i))) .   (2) \n\n3 The leveraging algorithm \n\nThe basic procedure for constructing leveraged vector machines builds on ideas from [11, 5] by extending the prediction to be a linear function of the base classifiers. The algorithm works in rounds, constructing a new classifier f_t from the previous one f_{t-1} by adding a new base hypothesis h_t to the current classifier f_{t-1}. Denoting by D_t and P_{t+1} the distribution and probability given by Eqn. (1) and Eqn. 
(2) using f_{t-1} and f_t respectively, the algorithm attempts to minimize either the exponential loss function that arises in AdaBoost: \n\nZ = \sum_{i=1}^{m} exp(-y_i f_t(x_i)) = \sum_{i=1}^{m} exp(-y_i (f_{t-1}(x_i) + \alpha_t h_t(x_i) + \beta_t)) \propto \sum_{i=1}^{m} D_t(i) exp(-y_i (\alpha_t h_t(x_i) + \beta_t)) ,   (3) \n\nor the logistic loss function: \n\nL = \sum_{i=1}^{m} log(1 + exp(-2 y_i f_t(x_i)))   (4) \n  = - \sum_{i=1}^{m} (\bar{y}_i log(P_{t+1}(x_i)) + (1 - \bar{y}_i) log(1 - P_{t+1}(x_i))) .   (5) \n\nWe initialize f_0(x) to be zero everywhere and run the procedure for a predefined number of rounds T. The final classifier is therefore f_T(x) = \sum_{t=1}^{T} (\alpha_t h_t(x) + \beta_t) = \beta + \sum_{t=1}^{T} \alpha_t h_t(x), where \beta = \sum_t \beta_t. We would like to note parenthetically that it is possible to use other loss functions that bound the 0-1 (classification) loss (see for instance [8]). Here we focus on the loss functions above, L and Z. Fixing f_{t-1} and h_t, these functions are convex in \alpha_t and \beta_t, which guarantees, under mild conditions (details omitted due to lack of space), the uniqueness of \alpha_t and \beta_t. \n\nOn each round we look for the base hypothesis h_t that reduces the loss function (Z or L) the most. As discussed before, each input instance x_j defines a function \bar{h}_j(x) = y_j K(x_j, x); using a quadratic approximation of the loss, we find the pair (x_j, y_j) whose induced function attains the minimal loss, and set h_t = \bar{h}_j. We then numerically search for the optimal values of \alpha and \beta by iterating Eqn. (6) or Eqn. (7) and summing the values into \alpha_t and \beta_t. We would like to note that typically two or three iterations suffice, and we can save time by using the values of \alpha and \beta found with the quadratic approximation, without a full numerical search for their optimal values. (See also Fig. 1.) We repeat this process for T rounds or until no instance can serve as a base hypothesis. We note that the same instance can be chosen more than once, although not in consecutive iterations, and typically only a small fraction of the instances is actually used in building f. 
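A minimal sketch of one reading of the leveraging loop for the exponential loss may help fix ideas. All helper names below are ours, not the paper's, and a few damped Newton-style steps stand in for the quadratic approximation and numerical search described above:

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """Degree-d polynomial kernel K(u, v) = (u . v + 1)^d."""
    return (np.dot(u, v) + 1.0) ** d

def leverage(X, y, T=20, n_steps=3):
    """Greedy leveraging sketch for the exponential loss (hypothetical names).

    Candidate base hypotheses are h_j(x) = y_j K(x_j, x); on each round we
    pick the candidate with the largest weighted correlation under D_t and
    then take a few Newton-style steps in (alpha, beta) on Eqn. (3)."""
    m = len(y)
    K = np.array([[poly_kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    H = y[None, :] * K              # H[i, j] = h_j(x_i) = y_j K(x_j, x_i)
    f = np.zeros(m)                 # f_0 is zero everywhere
    alphas, betas, chosen = [], [], []
    for _ in range(T):
        w = np.exp(-y * f)
        D = w / w.sum()             # distribution D_t of Eqn. (1)
        j = int(np.argmax(np.abs((D * y) @ H)))   # greedy choice of h_t
        h = H[:, j]
        a, b = 0.0, 0.0
        for _ in range(n_steps):    # crude stand-in for the numerical search
            r = D * np.exp(-y * (a * h + b))      # summand of Eqn. (3)
            a += np.sum(r * y * h) / np.sum(r * h * h)
            b += np.sum(r * y) / np.sum(r)
        f += a * h + b
        alphas.append(a); betas.append(b); chosen.append(j)
    return f, alphas, betas, chosen
```

The separate one-dimensional steps in alpha and beta are a simplification of a joint second-order update, but they convey how each round reweights the sample and adds one kernel to f.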
Roughly speaking, these instances are the \"support patterns\" of the leveraged machines, although they are not necessarily the geometric support patterns. \n\nAs in SVMs, in order to make the search for a base hypothesis efficient, we pre-compute and store K(x, x') for all pairs x \neq x' from S. Storing these values requires |S|^2 space, which might be prohibitive for large problems. To save space, we employ the idea of chunking used in SVM. We partition S into r blocks S_1, S_2, ..., S_r of about the same size. We divide the iterations into sub-groups such that all iterations belonging to the i-th sub-group use and evaluate kernels based on instances from the i-th block only. When switching to a new block S_k we need to compute the values K(x, x') for x \in S and x' \in S_k. This division into blocks might be more expensive since we typically use each block of instances more than once. However, the storage of the kernel values can be done in place, and we thus save a factor of r in memory requirements. In practice we found that chunking does not hurt performance. In Fig. 1 we show the test error as a function of the number of rounds when using (a) a full numerical search to determine \alpha and \beta on each round, (b) the quadratic approximation (\"one-step\") to find \alpha and \beta, and (c) the quadratic approximation with chunking. The number of instances in the experiment is 1000, each block for chunking is of size 100, and we switch to a different block every 100 iterations. (Further description of the data is given in the next section.) In this example, after 10 iterations, there is virtually no difference in the performance of the different schemes. \n\n4 Experiments with synthetic data \n\nIn this section we describe experiments with synthetic data comparing different aspects of leveraged vector machines to SVMs. 
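Before turning to the experiments, the chunking scheme described above can be sketched compactly. This is a rough sketch with hypothetical names; a single damped update per round replaces the full search for alpha and beta, and the per-block kernel cache stands in for the in-place storage:

```python
import numpy as np

def chunked_leverage(X, y, kernel, T=20, n_blocks=4):
    """Leveraging with chunking (sketch; names are ours).

    S is split into n_blocks blocks; the iterations of the i-th sub-group
    only evaluate kernels K(x, x') with x' in block S_i, so only an
    m x (m / n_blocks) slice of the kernel matrix is stored at a time."""
    m = len(y)
    blocks = np.array_split(np.arange(m), n_blocks)    # S_1, ..., S_r
    f = np.zeros(m)
    for block in blocks:
        # recompute the cache in place when switching to a new block
        Kblk = np.array([[kernel(X[i], X[j]) for j in block] for i in range(m)])
        Hblk = y[block][None, :] * Kblk                # h_j(x) = y_j K(x_j, x)
        for _ in range(T // n_blocks):
            w = np.exp(-y * f)
            D = w / w.sum()
            jloc = int(np.argmax(np.abs((D * y) @ Hblk)))
            h = Hblk[:, jloc]
            # one damped quadratic-approximation step in alpha (beta omitted)
            a = 0.5 * np.sum(D * y * h) / np.sum(D * h * h)
            f += a * h
    return f
```

The memory saving is the point: the cache is |S| x |S_i| rather than |S|^2, at the cost of recomputing a block's kernel values each time that block is revisited.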
The original instance space is two dimensional, where the positive class includes all points inside a circle of radius R, i.e., an instance (u_1, u_2) \in R^2 is labeled +1 iff u_1^2 + u_2^2 <= R. The instances were picked at random according to a zero-mean unit-variance normal distribution, and R was set such that exactly half of the instances belong to the positive class. In all the experiments described in this section we generated 10 groups of training and test sets, each of which includes 1000 training and 1000 test examples. Overall, there are 10,000 training examples and 10,000 test examples. \n\nFigure 2: Performance comparison of SVM and BVM as a function of the training data size (left), the dimension of the kernels (middle), and the number of redundant features. \n\nFigure 3: Train and test errors for SVM, LVM, and BVM as a function of the label noise. \n\nThe average variance of the estimates of the empirical errors across experiments is about 0.2%. For SVM we set the regularization parameter C to 100, and we used 500 iterations to build leveraged machines. In all the experiments without noise the results for BVM and LVM were practically the same. We therefore only compare BVM to SVM in Fig. 2. Unless said otherwise, we used polynomials of degree two as kernels: K(x, x') = (x . x' + 1)^2. Hence, the data is separable in the absence of noise. \n\nIn the first experiment we tested the sensitivity to the number of training examples by omitting examples from the training data (without any modification to the test sets). 
On the left part of Fig. 2 we plot the test error as a function of the number of training examples. The test error of BVM is almost indistinguishable from the error of SVM, and the performance of both methods improves very quickly with the number of training examples. Next, we compared the performance as a function of the degree of the polynomial constituting the kernel. We ran the algorithms with kernels of the form K(x, x') = (x . x' + 1)^d for d = 2, ..., 8. The results are depicted in the middle plots of Fig. 2. Again, the performance of BVM and SVM is very close (note the small scale of the y axis for the test error in this experiment). To conclude the experiments with clean, realizable data, we checked the sensitivity to irrelevant features of the input. Each input instance (u_1, u_2) was augmented with random elements u_3, ..., u_l to form an input vector of dimension l. The right-hand-side graphs of Fig. 2 show the test error as a function of l for l = 2, ..., 12. Once more we see that the performance of both algorithms is very similar. \n\nWe next compared the performance of the algorithms in the presence of noise. We used kernels of dimension two and instances without redundant features. The label of each instance was flipped with probability \epsilon. We ran 15 sets of experiments, for \epsilon = 0.01, ..., 0.15. As before, each set included 10 runs, each of which used 1000 training examples and 1000 test examples. In Fig. 3 we show the average training error (left) and the average test error (right) for each of the algorithms. It is apparent from the graphs that BVMs, built based on the exponential loss, are much more sensitive to noise than SVMs and LVMs, and their generalization error degrades significantly even for low noise rates. The generalization error of LVMs is, on the other hand, only slightly worse than that of SVMs, although the only algorithmic difference in constructing BVMs and LVMs is the loss function. The fact that LVMs exhibit performance similar to SVM can be partially attributed to the fact that the asymptotic behavior of their loss functions is the same. \n\nFigure 4: The training error, test error, and the cumulative L_1 norm (\sum_{s=1}^{t} |\alpha_s|) as a function of the number of leveraging iterations for LVM, BVM, and PBVM. \n\n5 A norm-penalized version \n\nOne of the problems with boosting, and with the corresponding leveraging algorithm with the exponential loss described here, is that it might increase the confidence in a few instances while misclassifying many other instances, albeit with a small confidence. This often happens in late rounds, during which the distribution D_t(i) is concentrated on a few examples, and the leveraging algorithm typically assigns a large weight to a weak hypothesis that does not affect most of the instances. It is therefore desirable to control the complexity of the leveraged classifiers by limiting the magnitude of the base hypotheses' weights. Several methods have been proposed to limit the confidence of AdaBoost, using, for instance, regularization (e.g., [9]) or \"smoothing\" the predictions [11]. Here we propose a norm-penalized method for BVM that is very simple to implement and maintains the convexity properties of the objective function. Following the idea of Cortes and Vapnik's SVMs in the non-separable case [1], we add the following penalization term: \gamma_0 exp(\sum_{t} |\alpha_t|^p). Simple 
Simple \n\nalgebric manipulation implies that the objective function at the tth round for BVMs with \nthe penalization term above is, \n\nm \n\nZt = I: Dt(i) exp (-Yi(atht(xi) + f3t\u00bb +,t exp{latIP ) . \n\ni=l \n\n(8) \n\nIt is also easy to show that the penalty parameter should be updated after each round is: \n,t = ,t-l exp(lat-lIP)/Zt-l. Since Zt < 1, unless there is no kernel function better \n\nthan random, ,t typically increases as a function of t, forcing more and more the new \n\nweights to be small. Note that Eqn. (8) implies that the search for a base predictor ht \nand weights at, f3t on each round can still be done independently of previous rounds by \nmaintaining the distribution D t and a single regularization value 't. The penalty term for \np = 1 and p = 2 simply adds a diagonal term to the matrix of second order derivatives \n(Eqn. (6\u00bb and the algorithm follows the same line (details omitted). For brevity we call \nthe norm-penalized leveraging procedure PBVM. In Fig. 4 we plot the test error (right), \ntraining error (middle), and Lt latl as functions of number of rounds for LVM, BVM, \nand PBVM with p = 1 ,0 = 0.01. The training set in this example was made small on \n\npurpose (200 examples) and was contaminated with 5% label noise. In this very small \nexample both LVM and BVM overfit while PBVM stops increasing the weights and finds \na reasonably good classifier. The plots demonstrate that the norm-penalized version can \nsafeguard against overfitting by preventing the weights from growing arbitrarily large, and \nthat the effect of the penalized version is very similar to early stopping. We would like \n\n\f616 \n\nY. Singer \n\nDataSet \n(Source) \n\nlabor (UC!) \nechocard. 
(uci) \nbridges (uci) \nhepati tis (uci) \nhorse\u00b7colic (uci) \nliver (uci) \nionosphere (uci) \nvote (uci) \nticketl (att) \nticket2 (att) \nticket3 (att) \nbands (uci) \nbreast-wisc (uci) \npima (uci) \ngerman (uci) \nweather (uci) \nnetwork (att) \nsplice (uci) \nboa (att) \n\n#Example \n\n& \n\n#Feature \n57 : 16 \n74 : 12 \n102 : 7 \n155: 19 \n300: 23 \n345 : 6 \n351: 34 \n435 : 16 \n556 : 78 \n556: 53 \n556 : 61 \n690: 39 \n699: 9 \n768 : 8 \n1000: 10 \n1000: 35 \n2600: 35 \n3190: 60 \n5000: 68 \n\nSVM \n\nLVM \n\nBVM \n\nRBVM \n\nSVM \n\nLVM \n\nBVM \n\nPBVM \n\nSize \n\nSize \n\nSize \n\nSize \n\nError \n\nError \n\nError \n\nError \n\n12.5 \n7.8 \n27.2 \n41.2 \n122.0 \n228.6 \n63.4 \n37.0 \n48.1 \n52.6 \n46.1 \n265.5 \n49.3 \n360.7 \n485.2 \n562.0 \n1031.0 \n318.0 \n637.0 \n\n13.7 \n13.0 \n20.2 \n13.5 \n13.0 \n11.3 \n58.9 \n37.0 \n84.6 \n77.1 \n76.2 \n78.2 \n26.5 \n47.7 \n89.8 \n52.0 \n42.0 \n153.0 \n183.0 \n\n16.1 \n12.6 \n18.5 \n17.4 \n13.0 \n12.8 \n67.9 \n41.0 \n89.3 \n75.4 \n77.8 \n76.4 \n24.4 \n30.3 \n96.5 \n52.0 \n43.0 \n156.0 \n178.0 \n\n13.6 \n12.4 \n17.9 \n14.0 \n13.0 \n10.7 \n59.1 \n37.0 \n82.3 \n74.0 \n73.3 \n75.6 \n24.0 \n22.8 \n87.0 \n52.0 \n42.0 \n153.0 \n160.0 \n\n6.0 \n8.6 \n15.0 \n21.3 \n14.7 \n33.8 \n13.7 \n4.4 \n8.4 \n6.6 \n6.9 \n32.8 \n3.5 \n23.0 \n23.5 \n25.9 \n24.8 \n8.0 \n41.5 \n\n14.0 \n5.7 \n15.0 \n22.0 \n14.7 \n35.6 \n13.1 \n5.2 \n3.3 \n6.4 \n4.9 \n33.2 \n3.6 \n22.6 \n24.0 \n25.4 \n21.2 \n8.4 \n40.8 \n\n14.0 \n10.0 \n23.0 \n22.7 \n14.7 \n33.5 \n16.9 \n5.9 \n11.5 \n8.0 \n7.6 \n34.3 \n4.1 \n23.2 \n23.8 \n25.4 \n23.5 \n8.4 \n40.8 \n\n12.0 \n10.0 \n14.0 \n22.0 \n13.2 \n35.6 \n13.7 \n5.2 \n5.1 \n6.4 \n6.7 \n33.3 \n4.1 \n22.1 \n24.1 \n25.4 \n21.2 \n8.4 \n41.0 \n\nTable 1: Summary of results for a collection of binary classification problems. \n\nto note that we found experimentally that the norm-penalized version does compensate for \nincorrect estimates of a and fJ due to malicious label noise. 
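The penalized round objective of Eqn. (8) and the update of the penalty parameter are straightforward to express. The sketch below uses our own names, and the toy grid search at the end merely illustrates how the penalty pulls the minimizing weight toward zero:

```python
import numpy as np

def penalized_Z(alpha, beta, D, y, h, gamma, p=1):
    """Eqn. (8): sum_i D_t(i) exp(-y_i (alpha h(x_i) + beta))
                 + gamma_t exp(|alpha|^p)."""
    return (np.sum(D * np.exp(-y * (alpha * h + beta)))
            + gamma * np.exp(abs(alpha) ** p))

def next_gamma(gamma, alpha, Z):
    """gamma_t = gamma_{t-1} exp(|alpha_{t-1}|^p) / Z_{t-1}, here with p = 1."""
    return gamma * np.exp(abs(alpha)) / Z

# Toy illustration: with a base hypothesis that agrees with every label,
# the unpenalized loss keeps decreasing in alpha, while the penalized
# objective attains its minimum at a much smaller weight.
D = np.full(4, 0.25)
y = np.array([1.0, 1.0, 1.0, -1.0])
h = np.array([2.0, 1.0, 1.0, -0.5])
grid = np.linspace(-3.0, 3.0, 601)
a_free = grid[np.argmin([penalized_Z(a, 0.0, D, y, h, 0.0) for a in grid])]
a_pen = grid[np.argmin([penalized_Z(a, 0.0, D, y, h, 0.5) for a in grid])]
```

Because the penalty enters only through the current round's gamma_t and alpha_t, the per-round search keeps the same form as before, which is the convexity-preserving property the section emphasizes.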
The experimental results given in the next section show, however, that it does indeed help in preventing overfitting when the training set is small. \n\n6 Experiments with natural data \n\nWe compared the practical performance of leveraged vector machines with SVMs on a collection of nineteen datasets from the UCI machine learning repository and from AT&T networking and marketing data. For SVM we set C = 100. We built each of the leveraged vector machines using 500 rounds. For PBVM we again used p = 1 and \gamma_0 = 0.01. We used chunking in building the leveraged vector machines, dividing each training set into 10 blocks. For all the datasets, with the exception of \"boa\", we used 10-fold cross validation to calculate the test error. (The dataset \"boa\" has 5000 training examples and 6000 test examples.) The results are summarized in Table 1. The performance of SVM, LVM, and PBVM seems comparable; in fact, with the exception of very few datasets, the differences in error rates are not statistically significant. Of the three methods (SVM, PBVM, and LVM), LVM is the simplest to implement, and the time required to build an LVM is typically much shorter than that of an SVM. It is also worth noting that the size of the leveraged machines is often smaller than the size of the corresponding SVM. Finally, it is apparent that PBVMs frequently yield better results than BVMs, especially for small and medium size datasets. \n\nReferences \n\n[1] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995. \n[2] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. EuroCOLT '99. \n[3] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995. \n[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997. \n[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical Report, 1998. \n[6] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery, 41(1):67-95, January 1994. \n[7] John D. Lafferty. Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999. \n[8] L. Mason, J. Baxter, P. Bartlett, and M. Frean. DOOM II. Technical report, Dept. of Systems Engineering, ANU, 1999. \n[9] G. Rätsch, T. Onoda, and K.-R. Müller. Regularizing AdaBoost. In Advances in Neural Information Processing Systems 12, 1998. \n[10] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990. \n[11] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. COLT '98. \n[12] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982. \n[13] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. \n", "award": [], "sourceid": 1771, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}]}