{"title": "Escaping the Convex Hull with Extrapolated Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 753, "page_last": 760, "abstract": null, "full_text": "Escaping the Convex Hull with \nExtrapolated Vector Machines. \n\nPatrick Haffner \n\nAT&T Labs-Research, 200 Laurel Ave, Middletown, NJ 07748 \n\nhaffner@research.att.com \n\nAbstract \n\nMaximum margin classifiers such as Support Vector Machines \n(SVMs) critically depends upon the convex hulls of the training \nsamples of each class, as they implicitly search for the minimum \ndistance between the convex hulls. We propose Extrapolated Vec(cid:173)\ntor Machines (XVMs) which rely on extrapolations outside these \nconvex hulls. XVMs improve SVM generalization very significantly \non the MNIST [7] OCR data. They share similarities with the \nFisher discriminant: maximize the inter-class margin while mini(cid:173)\nmizing the intra-class disparity. \n\n1 \n\nIntroduction \n\nBoth intuition and theory [9] seem to support that the best linear separation be(cid:173)\ntween two classes is the one that maximizes the margin. But is this always true? \nIn the example shown in Fig.(l), the maximum margin hyperplane is Wo; however, \nmost observers would say that the separating hyperplane WI has better chances to \ngeneralize, as it takes into account the expected location of additional training sam-\n\n\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 f\\J:- \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n\u2022\u2022 . . , --.Q. \n~\"-\n.. \n................ ~x~... \n\n_ \n\n~, \n\n\u2022\u2022 \n. ' \n\nW\n\n1 \n\n--------------- ~---------------\n\n\u00b7 K~ \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 \n. .. . ....... (} ............ . \n\n'''-,,- /0- 0 00 00 \n\n00 0'\\ \no- o::-o \n\n\"OW-0_ o __ o-\n\nFigure 1: Example of separation where the large margin is undesirable. 
The convex \nhull and the separation that corresponds to the standard SVM use plain lines while \nthe extrapolated convex hulls and XVMs use dotted lines. \n\n\fpIes. Traditionally, to take this into account, one would estimate the distribution of \nthe data. In this paper, we just use a very elementary form of extrapolation (\"the \npoor man variance\") and show that it can be implemented into a new extension to \nSVMs that we call Extrapolated Vector Machines (XVMs). \n\n2 Adding Extrapolation to Maximum Margin Constraints \n\nThis section states extrapolation as a constrained optimization problem and com(cid:173)\nputes a simpler dual form. \nTake two classes C+ and C_ with Y+ = +1 and Y_ = -1 1 as respective targets. \nThe N training samples {(Xi, Yi); 1 ::::; i ::::; N} are separated with a margin p if there \nexists a set of weights W such that Ilwll = 1 and \n\nVk E {+, -}, Vi E Ck, Yk(w,xi+b) 2: p \n\n(1) \nSVMs offer techniques to find the weights W which maximize the margin p. Now, \ninstead of imposing the margin constraint on each training point, suppose that for \ntwo points in the same class Ck, we require any possible extrapolation within a \nrange factor 17k 2: 0 to be larger than the margin: \n\nVi,j E Ck, V)\" E [-17k, l+17k], Yk (W.()\"Xi + (l-)\")Xj) + b) 2: P \n\n(2) \nIt is sufficient to enforce the constraints at the end of the extrapolation segments, \nand \n\n(3) \nKeeping the constraint over each pair of points would result in N 2 Lagrange multi(cid:173)\npliers. But we can reduce it to a double constraint applied to each single point. If \nfollows from Eq.(3) that: \n\n(4) \n\n(5) \n\nWe consider J.Lk = max (Yk(W.Xj)) and Vk = min (Yk(W.Xj)) as optimization vari-\n\nabIes. By adding Eq.(4) and (5), the margin becomes \n\nlEC. \n\nlEC. 
\n\n2p = L ((17k+ 1)vk - 17kJ.Lk) = L (Vk -17dJ.Lk - Vk)) \n\n(6) \n\nOur problem is to maximize the margin under the double constraint: \n\nk \n\nk \n\nVi E Ck, Vk ::::; Yk(W.Xi) ::::; J.Lk \n\nIn other words, the extrapolated margin maximization is equivalent to squeezing \nthe points belonging to a given class between two hyperplanes. Eq.(6) shows that \np is maximized when Vk is maximized while J.Lk - Vk is minimized. \n\nMaximizing the margin over J.Lk , Vk and W with Lagrangian techniques gives us the \nfollowing dual problem: \n\n(7) \n\nlIn this paper, it is necessary to index the outputs y with the class k rather than \nthe more traditional sample index i, as extrapolation constraints require two examples to \nbelong to the same class. The resulting equations are more concise, but harder to read. \n\n\fCompared to the standard SVM formulation, we have two sets of support vectors. \nMoreover, the Lagrange multipliers that we chose are normalized differently from \nthe traditional SVM multipliers (note that this is one possible choice of notation, \nsee Section.6 for an alternative choice). They sum to 1 and allow and interesting \ngeometric interpretation developed in the next section. \n\n3 Geometric Interpretation and Iterative Algorithm \n\nFor each class k, we define the nearest point to the other class convex hull along \nthe direction of w: Nk = I:iECk f3iXi. Nk is a combination of the internal sup(cid:173)\nport vectors that belong to class k with f3i > O. At the minimum of (7), because \nthey correspond to non zero Lagrange multipliers, they fallon the internal margin \nYk(W,Xi) = Vk; therefore, we obtain Vk = Ykw.Nk\u00b7 \nSimilarly, we define the furthest point Fk = I:i ECk ~iXi' Fk is a combination of the \nexternal support vectors, and we have flk = Ykw.Fk. 
\n\nThe dual problem is equivalent to the distance minimization problem \n\nmin \n\nNk ,Fk EHk \n\nIILYk ((1Jk+I)Nk _1Jk F k)11\n\nk \n\n2 \n\nwhere 1{k is the convex hull containing the examples of class k. \n\nIt is possible to solve this optimization problem using an iterative Extrapolated \nConvex Hull Distance Minimization (XCHDM) algorithm. It is an extension of the \nNearest Point [5] or Maximal Margin Percept ron [6] algorithms. An interesting \ngeometric interpretation is also offered in [3]. All the aforementioned algorithms \nsearch for the points in the convex hulls of each class that are the nearest to each \nother (Nt and No on Fig.I) , the maximal margin weight vector w = Nt - No-' \nXCHDM look for nearest points in the extrapolated convex hulls (X+ I and X-I \non Fig.I). The extrapolated nearest points are X k = 1JkNk - 1JkFk' Note that \nthey can be outside the convex hull because we allow negative contribution from \nexternal support vectors. Here again, the weight vector can be expressed as a \ndifference between two points w = X+ - X - . When the data is non-separable, the \nsolution is trivial with w = O. With the double set of Lagrange multipliers, the \ndescription of the XCHDM algorithm is beyond the scope of this paper. XCHDM \nwith 1Jk = 0 are simple SVMs trained by the same algorithm as in [6]. \n\nAn interesting way to follow the convergence of the XCHDM algorithm is the fol(cid:173)\nlowing. Define the extrapolated primal margin \n\nand the dual margin \n\n1'; = 2p = L \n\nk \n\n((1Jk+ I )vk - 1Jkflk) \n\n1'; = IIX+ - X-II \n\nConvergence consists in reducing the duality gap 1'~ -1'; down to zero. In the rest \nof the paper, we will measure convergence with the duality ratio r = 1'~ . \n1'2 \n\nTo determine the threshold to compute the classifier output class sign(w.x+b) leaves \nus with two choices. 
We can require the separation to happen at the center of the \nprimal margin, with the primal threshold (subtract Eq.(5) from Eq.(4)) \n\n1 \n\nbl = -2\" LYk ((1Jk+ I )vk-1JkJ.lk) \n\nk \n\n\for at the center of the dual margin, with the dual threshold \n\nb2 = - ~w. 2:)(T}k+1)Nk - T}kFk) = - ~ (IIx+ 112 -lix-in \n\nk \n\nAgain, at the minimum, it is easy to verify that b1 = b2 . When we did not let \nthe XCHDM algorithm converge to the minimum, we found that b1 gave better \ngeneralization results. \n\nOur standard stopping heuristic is numerical: stop when the duality ratio gets over \na fixed value (typically between 0.5 and 0.9). \n\nThe only other stopping heuristic we have tried so far is based on the following idea. \nDefine the set of extrapolated pairs as {(T}k+1)Xi -T}kXj; 1 :S i,j :S N}. Convergence \nmeans that we find extrapolated support pairs that contain every extrapolated pair \non the correct side of the margin. We can relax this constraint and stop when the \nextrapolated support pairs contain every vector. This means that 12 must be lower \nthan the primal true margin along w (measured on the non-extrapolated data) \n11 = y+ + Y -. This causes the XCHDM algorithm to stop long before 12 reaches \nIi and is called the hybrid stopping heuristic. \n\n4 Beyond SVMs and discriminant approaches. \n\nKernel Machines consist of any classifier of the type f(x) = L:i Yi(XiK(x, Xi). SVMs \noffer one solution among many others, with the constraint (Xi > O. \nXVMs look for solutions that no longer bear this constraint. While the algorithm \ndescribed in Section 2 converges toward a solution where vectors act as support of \nmargins (internal and external), experiments show that the performance of XVMs \ncan be significantly improved if we stopped before full convergence. In this case, \nthe vectors with (Xi =/: 0 do not line up onto any type of margin, and should not be \ncalled support vectors. 
\n\nThe extrapolated margin contains terms which are caused by the extrapolation \nand are proportional to the width of each class along the direction of w. We \nwould observe the same phenomenon if we had trained the classifier using Maximum \nLikelihood Estimation (MLE) (replace class width with variance). In both MLE and \nXVMs, examples which are the furthest from the decision surface play an important \nrole. XVMs suggest an explanation why. \n\nNote also that like the Fisher discriminant, XVMs look for the projection that \nmaximizes the inter-class variance while minimizing the intra-class variances. \n\n5 Experiments on MNIST \n\nThe MNIST OCR database contains 60,000 handwritten digits for training and \n10,000 for testing (the testing data can be extended to 60,000 but we prefer to \nkeep unseen test data for final testing and comparisons). This database has been \nextensively studied on a large variety of learning approaches [7]. It lead to the \nfirst SVM \"success story\" [2], and results have been improved since then by using \nknowledge about the invariance of the data [4]. \n\nThe input vector is a list of 28x28 pixels ranging from 0 to 255. Before computing \nthe kernels, the input vectors are normalized to 1: x = II~II' \n\nGood polynomial kernels are easy to define as Kp(x, y) = (x.y)P. We found these \nnormalized kernels to outperform the unnormalized kernels Kp(x, y) = (a(x.y)+b)P \n\n\fthat have been traditionally used for the MNIST data significantly. For instance, \nthe baseline error rate with K4 is below 1.2%, whereas it hovers around 1.5% for \nK4 (after choosing optimal values for a and b)2. \n\nWe also define normalized Gaussian kernels: \n\nKp(x, y) = exp (-~ Ilx - y112) = [exp (x.y- 1)JP. \n\n(8) \n\nEq.(8) shows how they relate to normalized polynomial kernels: when x.y \u00ab 1, \nKp and Kp have the same asymptotic behavior. 
We observed that on MNIST, \nthe performance with Kp is very similar to what is obtained with unnormalized \nGaussian kernels Ku(x , y) = exp _(X~Y)2. However, they are easier to analyze and \ncompare to polynomial kernels. \n\nMNIST contains 1 class per digit, so the total number of classes is M=10. To com(cid:173)\nbine binary classifiers to perform multiclass classifications, the two most common \napproaches were considered . \n\n\u2022 In the one-vs-others case (lvsR) , we have one classifier per class c, with the \npositive examples taken from class c and negative examples form the other \nclasses. Class c is recognized when the corresponding classifier yields the \nlargest output . \n\n\u2022 In the one-vs-one case (lvs1), each classifier only discriminates one class \n\nfrom another: we need a total of (MU:;-l) = 45 classifiers. \n\nDespite the effort we spent on optimizing the recombination of the classifiers [8] 3, \n1 vsR SVMs (Table 1) perform significantly better than 1 vs1 SVMs (Table 2). 4 \n\nFor each trial, the number of errors over the 10,000 test samples (#err) and the \ntotal number of support vectors( #SV) are reported. As we only count SV s which \nare shared by different classes once, this predicts the test time. For instance, 12,000 \nsupport vectors mean that 20% of the 60,000 vectors are used as support. \n\nPreliminary experiments to choose the value of rJk with the hybrid criterion show \nthat the results for rJk = 1 are better than rJk = 1.5 in a statistically significant \nway, and slightly better than rJk = 0.5. We did not consider configurations where \nrJ+ f; rJ -; however, this would make sense for the assymetrical 1 vsR classifiers. \nXVM gain in performance over SVMs for a given configuration ranges from 15% \n(1 vsR in Table 3) to 25% (1 vs1 in Table 2). \n\n2This may partly explain a nagging mystery among researchers working on MNIST: \n\nhow did Cortes and Vapnik [2] obtain 1.1% error with a degree 4 polynomial ? 
\n\n3We compared the Max Wins voting algorithm with the DAGSVM decision tree algo(cid:173)\n\nrithm and found them to perform equally, and worse than 1 vsR SVMs. This is is surprising \nin the light of results published on other tasks [8] , and would require further investigations \nbeyond the scope of this paper. \n\n4Slightly better performance was obtained with a new algorithm that uses the incre(cid:173)\n\nmental properties of our training procedure (this is be the performance reported in the \ntables). In a transductive inference framework , treat the test example as a training exam-\nple: for each of the M possible labels, retrain the M among (M(\":-l) classifiers that use \nexamples with such label. The best label will be the one that causes the smallest increase \nin the multiclass margin p such that it combines the classifier margins pc in the following \nmanner \n\n~= ,,~ \n2 ~ 2 \nP \nc~M Pc \n\nThe fact that this margin predicts generalization is \"justified\" by Theorem 1 in [8]. 
\n\n\fKernel \n\nK3 \nK4 \nK5 \nKg \n[(2 \n[(4 \nK5 \n\nDuality Ratio stop \n\n0.40 \n\n0.75 \n\n0.99 \n\n#err \n136 \n127 \n125 \n136 \n147 \n125 \n125 \n\n#SV \n8367 \n8331 \n8834 \n13002 \n9014 \n8668 \n8944 \n\n#err \n136 \n117 \n119 \n137 \n128 \n119 \n125 \n\n#SV \n11132 \n11807 \n12786 \n18784 \n11663 \n12222 \n12852 \n\n# err \n132 \n119 \n119 \n141 \n131 \n117 \n125 \n\n#SV \n13762 \n15746 \n17868 \n25953 \n13918 \n16604 \n18085 \n\nTable 1: SVMs on MNIST with 10 1vsR classifiers \n\nKernel \n\nK3 \nK4 \nK5 \n\nSVM/ratio at 0.99 XVM/Hybrid \n# err \n#SV \n138 \n17020 \n16066 \n135 \n191 \n15775 \n\n#SV \n11952 \n13526 \n13526 \n\n# err \n117 \n110 \n114 \n\nTable 2: SVMjXVM on MNIST with 45 1 vs1 classifiers \n\nThe 103 errors obtained with K4 and r = 0.5 in Table 3 represent only about 1% \nerror: t his is the lowest error ever reported for any learning technique without a \npriori knowledge about the fact that t he input data corresponds to a pixel map (the \nlowest reproducible error previously reported was 1.2% with SVMs and polynomials \nof degree 9 [4], it could be reduced to 0.6% by using invariance properties of the \npixel map). The downside is that XVMs require 4 times as many support vectors \nas standards SVMs. \n\nTable 3 compares stopping according to the duality ratio and the hybrid criterion. \nWith the duality ratio, the best performance is most often reached with r = 0.50 (if \nt his happens to be consistent ly true, validation data to decide when to stop could \nbe spared). The hybrid criterion does not require validation data and yields errors \nthat, while higher than the best XVM, are lower than SVMs and only require a few \nmore support vectors. It takes fewer iterations to train than SVMs. One way to \ninterpret this hybrid stopping criterion is that we stop when interpolation in some \n(but not all) directions account for all non-interpolated vectors. This suggest that \ninterpolation is only desirable in a few directions. 
\n\nXVM gain is stronger in the 1 vs 1 case (Table 2). This suggests that extrapolating \non a convex hull that contains several different classes (in the 1 vsR case) may be \nundesirable. \n\nKernel \n\nK3 \nK4 \nK5 \nKg \nK2 \n[(4 \n\nDuality Ratio stop \n\n0.40 \n\n0. 50 \n\n0.75 \n\n# err \n118 \n112 \n109 \n128 \n114 \n108 \n\n#SV \n46662 \n40274 \n36912 \n35809 \n43909 \n36980 \n\n# err \n111 \n103 \n106 \n126 \n114 \n111 \n\n#SV \n43819 \n43132 \n44226 \n39462 \n46905 \n40329 \n\n# err \n116 \n110 \n110 \n131 \n114 \n114 \n\n#SV \n50216 \n52861 \n49383 \n50233 \n53676 \n51088 \n\nHybrid. \nStop Crit. \n\n# err \n125 \n107 \n107 \n125 \n119 \n108 \n\n#SV \n20604 \n18002 \n17322 \n19218 \n20152 \n16895 \n\nTable 3: XVMs on MNIST with 10 1 vsR classifiers \n\n\f6 The Soft Margin Case \n\nMNIST is characterized by the quasi-absence of outliers, so to assume that the \ndata is fully separable does not impair performance at all. To extend XVMs to \nnon-separable data, we first considered the traditional approaches of adding slack \nvariables to allow margin constraints to be violated. The most commonly used ap(cid:173)\nproach with SVMs adds linear slack variables to the unitary margin. Its application \nto the XVM requires to give up the weight normalization constraint, so that the \nusual unitary margin can be used in the constraints [9] . \n\nCompared to standard SVMs, a new issue to tackle is the fact that each constraint \ncorresponds to a pair of vectors: ideally, we should handle N 2 slack variables ~ij. \nTo have linear constraints that can be solved with KKT, we need to have the \ndecomposition ~ij = ('T}k+1)~i+'T}k~; (factors ('T}k+1) and 'T}k are added here to ease \nlater simplifications). \n\nSimilarly to Eq.(3), the constraint on the extrapolation from any pair of points is \n\nVi,j E Ck, Yk (w. 
(('T}k+1)xi - 'T}kXj) +b) 2: 1 - ('T}k+1)~i - 'T}k~; with ~i'~; 2: 0 \n\n(9) \nIntroducing J.tk = max (Yk(w,xj+b) - ~;) and Vk = min (Yk(W,Xi+b) + ~i)' we ob-\ntain the simpler double constraint \n\nJECk \n\n.ECk \n\nVi E Ck , Vk -~i ~ Yk(W,Xi+b) ~ J.tk+~; with ~i'~; 2: 0 \n\n(10) \n\nIt follows from Eq.(9) that J.tk and Vk are tied through (l+'T}k)vk = l+'T}kJ.tk \nIf we fix J.tk (and thus Vk) instead of treating it as an optimization variable, it would \namount to a standard SVM regression problem with {-I, + I} outputs, the width \nof the asymmetric f-insensitive tube being J.tk-Vk = (~~~;)' \nThis remark makes it possible for the reader to verify the results we reported on \nMNIST. Vsing the publicly available SVM software SVMtorch [1] with C = 10 and \nf = 0.1 as the width of the f-tube yields a 10-class error rate of 1.15% while the \nbest performance using SVMtorch in classification mode is 1.3% (in both cases, we \nuse Gaussian kernels with parameter (J = 1650). \n\nAn explicit minimization on J.tk requires to add to the standard SVM regression \nproblem the following constraint over the Lagrange multipliers (we use the same \nnotation as in [9]) : \n\nYi= l \n\nYi=- l \n\nYi= l \n\nYi=- l \n\nNote that we still have the standard regression constraint I: ai = I: ai \nThis has not been implemented yet , as we question the pertinence of the ~; slack \nvariables for XVMs. Experiments with SVMtorch on a variety of tasks where \nnon-zero slacks are required to achieve optimal performance (Reuters, VCI/Forest, \nVCI/Breast cancer) have not shown significant improvement using the regression \nmode while we vary the width of the f-tube. \n\nMany experiments on SVMs have reported that removing the outliers often gives \nefficient and sparse solutions. The early stopping heuristics that we have presented \nfor XVMs suggest strategies to avoid learning (or to unlearn) the outliers, and this \nis the approach we are currently exploring. 
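The tie (1+eta_k) nu_k = 1 + eta_k mu_k, and the resulting epsilon-tube width, can be checked with exact arithmetic; the values of eta and mu below are hypothetical:

```python
from fractions import Fraction

# Eq.(9) ties the two hyperplane positions: (1 + eta) * nu = 1 + eta * mu,
# so fixing mu fixes nu, and the asymmetric epsilon-insensitive tube of the
# equivalent SVM regression problem has width mu - nu = (mu - 1) / (1 + eta).
def nu_from_mu(mu, eta):
    return (1 + eta * mu) / (1 + eta)

eta, mu = Fraction(1), Fraction(3, 2)    # hypothetical values, exact arithmetic
nu = nu_from_mu(mu, eta)

assert (1 + eta) * nu == 1 + eta * mu    # the tie from Eq.(9)
assert mu - nu == (mu - 1) / (1 + eta)   # tube width
print(mu - nu)                           # 1/4
```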
\n\n\f7 Concluding Remarks \n\nThis paper shows that large margin classification on extrapolated data is equivalent \nto the addition of the minimization of a second external margin to the standard SVM \napproach. The associated optimization problem is solved efficiently with convex \nhull distance minimization algorithms. A 1 % error rate is obtained on the MNIST \ndataset: it is the lowest ever obtained without a-priori knowledge about the data. \n\nWe are currently trying to identify what other types of dataset show similar gains \nover SVMs, to determine how dependent XVM performance is on the facts that the \ndata is separable or has invariance properties. We have only explored a few among \nthe many variations the XVM models and algorithms allow, and a justification \nof why and when they generalize would help model selection. Geometry-based \nalgorithms that handle potential outliers are also under investigation. \n\nLearning Theory bounds that would be a function of both the margin and some \nform of variance of the data would be necessary to predict XVM generalization and \nallow us to also consider the extrapolation factor 'TJ as an optimization variable. \n\nReferences \n\n[1] R. Collobert and S. Bengio. Support vector machines for large-scale regression \n\nproblems. Technical Report IDIAP-RR-00-17, IDIAP, 2000. \n\n[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1- 25 , \n\n1995. \n\n[3] D. Crisp and C.J.C. Burges. A geometric interpretation of v-SVM classifiers. \nIn Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. \nLeen, K.-R. Mller, eds, Cambridge, MA, 2000. MIT Press. \n\n[4] D. DeCoste and B. Schoelkopf. Training invariant support vector machines. \nMachine Learning, special issue on Support Vector Machines and Methods, 200l. \n[5] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. A fast \niterative nearest point algorithm for support vector machine classifier design. 
\nIEEE transactions on neural networks, 11(1):124 - 136, jan 2000. \n\n[6] A. Kowalczyk. Maximal margin perceptron. In Advances in Large Margin Clas(cid:173)\nsifiers, Smola, Bartlett, Schlkopf, and Schuurmans, editors, Cambridge, MA, \n2000. MIT Press. \n\n[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning ap(cid:173)\n\nplied to document recognition. proceedings of the IEEE, 86(11), 1998. \n\n[8] J. Platt, N. Christianini, and J. Shawe-Taylor. Large margin dags for multiclass \nclassification. In Advances in Neural Information Processing Systems 12, S. A. \nSolla, T. K. Leen, K.-R. Mller, eds, Cambridge, MA, 2000. MIT Press. \n\n[9] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New-York, 1998. \n\n\f", "award": [], "sourceid": 2037, "authors": [{"given_name": "Patrick", "family_name": "Haffner", "institution": null}]}