{"title": "The Relaxed Online Maximum Margin Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 498, "page_last": 504, "abstract": null, "full_text": "The Relaxed Online \n\nMaximum Margin Algorithm \n\nYi Li and Philip M. Long \n\nDepartment of Computer Science \nNational University of Singapore \n\nSingapore 119260, Republic of Singapore \n\n{liyi,p/ong}@comp.nus.edu.sg \n\nAbstract \n\nWe describe a new incremental algorithm for training linear thresh(cid:173)\nold functions: \nthe Relaxed Online Maximum Margin Algorithm, or \nROMMA. ROMMA can be viewed as an approximation to the algorithm \nthat repeatedly chooses the hyperplane that classifies previously seen ex(cid:173)\namples correctly with the maximum margin. It is known that such a \nmaximum-margin hypothesis can be computed by minimizing the length \nof the weight vector subject to a number of linear constraints. ROMMA \nworks by maintaining a relatively simple relaxation of these constraints \nthat can be efficiently updated. We prove a mistake bound for ROMMA \nthat is the same as that proved for the perceptron algorithm. Our analysis \nimplies that the more computationally intensive maximum-margin algo(cid:173)\nrithm also satisfies this mistake bound; this is the first worst-case perfor(cid:173)\nmance guarantee for this algorithm. We describe some experiments us(cid:173)\ning ROMMA and a variant that updates its hypothesis more aggressively \nas batch algorithms to recognize handwritten digits. The computational \ncomplexity and simplicity of these algorithms is similar to that of per(cid:173)\nceptron algorithm , but their generalization is much better. We describe a \nsense in which the performance of ROMMA converges to that of SVM \nin the limit if bias isn't considered. \n\n1 \n\nIntroduction \n\nThe perceptron algorithm [10, 11] is well-known for its simplicity and effectiveness in the \ncase of linearly separable data. 
Vapnik's support vector machines (SVMs) [13] use quadratic programming to find the weight vector that classifies all the training data correctly and maximizes the margin, i.e. the minimal distance between the separating hyperplane and the instances. This algorithm is slower than the perceptron algorithm, but generalizes better. On the other hand, as an incremental algorithm, the perceptron algorithm is better suited for online learning, where the algorithm must repeatedly classify patterns one at a time, then find out the correct classification, and then update its hypothesis before making the next prediction. \n\nIn this paper, we design and analyze a new simple online algorithm called ROMMA (the Relaxed Online Maximum Margin Algorithm) for classification using a linear threshold function. ROMMA has time complexity similar to that of the perceptron algorithm, but its generalization performance in our experiments is much better on average. Moreover, ROMMA can be applied with kernel functions. \n\nWe conducted experiments similar to those performed by Cortes and Vapnik [2] and Freund and Schapire [3] on the problem of handwritten digit recognition. We tested the standard perceptron algorithm, the voted perceptron algorithm (for details, see [3]) and our new algorithm, using the polynomial kernel function with d = 4 (the choice that was best in [3]). We found that our new algorithm performed better than the standard perceptron algorithm and slightly better than the voted perceptron. \n\nFor some other research with aims similar to ours, we refer the reader to [9, 4, 5, 6]. \n\nThe paper is organized as follows. In Section 2, we describe ROMMA in enough detail to determine its predictions, and prove a mistake bound for it. In Section 3, we describe ROMMA in more detail. 
In Section 4, we compare the experimental results of ROMMA and an aggressive variant of ROMMA with the perceptron and the voted perceptron algorithms. \n\n2 A mistake-bound analysis \n\n2.1 The online algorithms \n\nFor concreteness, our analysis will concern the case in which instances (also called patterns) and weight vectors are in R^n. Fix n ∈ N. In the standard online learning model [7], learning proceeds in trials. In the tth trial, the algorithm is first presented with an instance x_t ∈ R^n. Next, the algorithm outputs a prediction ŷ_t of the classification of x_t. Finally, the algorithm finds out the correct classification y_t ∈ {-1, 1}. If ŷ_t ≠ y_t, then we say that the algorithm makes a mistake. It is worth emphasizing that in this model, when making its prediction for the tth trial, the algorithm only has access to instance-classification pairs for previous trials. \n\nAll of the online algorithms that we will consider work by maintaining a weight vector w_t which is updated between trials, and predicting ŷ_t = sign(w_t · x_t), where sign(z) is 1 if z is positive, -1 if z is negative, and 0 otherwise.¹ \n\nThe perceptron algorithm. The perceptron algorithm, due to Rosenblatt [10, 11], starts off with w_1 = 0. When its prediction differs from the label y_t, it updates its weight vector by w_{t+1} = w_t + y_t x_t. If the prediction is correct, then the weight vector is not changed. \n\nThe next three algorithms that we will consider assume that all of the data seen by the online algorithm is collectively linearly separable, i.e. that there is a weight vector u such that for each trial t, y_t = sign(u · x_t). When kernel functions are used, this is often the case in practice. \n\nThe ideal online maximum margin algorithm. On each trial t, this algorithm chooses a weight vector w_t for which, for all previous trials s ≤ t, sign(w_t · x_s) = y_s, and which maximizes the minimum distance of any x_s to the separating hyperplane. 
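As a concrete illustration (ours, not the paper's), the online protocol and the perceptron update described above can be sketched in Python as follows; `data` is assumed to be a sequence of (x_t, y_t) pairs with x_t a NumPy array and y_t ∈ {-1, +1}.

```python
import numpy as np

def sign(z):
    """sign(z) as in the text: 1 if positive, -1 if negative, 0 otherwise."""
    return 1 if z > 0 else (-1 if z < 0 else 0)

def perceptron(data, n):
    """Run the online perceptron; return the final weights and mistake count."""
    w = np.zeros(n)          # w_1 = 0
    mistakes = 0
    for x, y in data:        # trial t presents instance x with label y
        y_hat = sign(w @ x)  # predict sign(w_t . x_t); a prediction of 0 is a mistake
        if y_hat != y:       # mistake: w_{t+1} = w_t + y_t x_t
            w = w + y * x
            mistakes += 1
    return w, mistakes
```

Note that, per the convention above, a prediction of 0 always counts as a mistake.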
It is known [1, 14] that this can be implemented by choosing w_t to minimize ||w_t|| subject to the constraints that y_s(w_t · x_s) ≥ 1 for all s ≤ t. These constraints define a convex polyhedron in weight space which we will refer to as P_t. \n\nThe relaxed online maximum margin algorithm. This is our new algorithm. The first difference is that trials in which mistakes are not made are ignored. The second difference is in how the algorithm responds to mistakes. \n\n¹The prediction of 0, which ensures a mistake, is to make the proofs simpler. The usual mistake bound proof for the perceptron algorithm goes through with this change. \n\nThe relaxed algorithm starts off like the ideal algorithm. Before the second trial, it sets w_2 to be the shortest weight vector such that y_1(w_2 · x_1) ≥ 1. If there is a mistake on the second trial, it chooses w_3 as would the ideal algorithm, to be the smallest element of \n\n{w : y_1(w · x_1) ≥ 1} ∩ {w : y_2(w · x_2) ≥ 1}.   (1) \n\nHowever, if the third trial is a mistake, then it behaves differently. Instead of choosing w_4 to be the smallest element of \n\n{w : y_1(w · x_1) ≥ 1} ∩ {w : y_2(w · x_2) ≥ 1} ∩ {w : y_3(w · x_3) ≥ 1}, \n\nit lets w_4 be the smallest element of \n\n{w : w_3 · w ≥ ||w_3||^2} ∩ {w : y_3(w · x_3) ≥ 1}. \n\nThis can be thought of as, before the third trial, replacing the polyhedron defined by (1) with the halfspace {w : w_3 · w ≥ ||w_3||^2} (see Figure 1). \n\nFigure 1: In ROMMA, a convex polyhedron in weight space is replaced with the halfspace with the same smallest element. \n\nNote that this halfspace contains the polyhedron of (1); in fact, it contains any convex set whose smallest element is w_3. Thus, it can be thought of as the least restrictive convex constraint for which the smallest satisfying weight vector is w_3. Let us call this halfspace H_3. The algorithm continues in this manner. If the tth trial is a mistake, then w_{t+1} is chosen to be the smallest element of H_t ∩ {w : y_t(w · x_t) ≥ 1}, and H_{t+1} is set to be {w : w_{t+1} · w ≥ ||w_{t+1}||^2}. 
If the tth trial is not a mistake, then w_{t+1} = w_t and H_{t+1} = H_t. We will call H_t the old constraint, and {w : y_t(w · x_t) ≥ 1} the new constraint. \n\nNote that after each mistake, this algorithm needs only to solve a quadratic programming problem with two linear constraints. In fact, there is a simple closed-form expression for w_{t+1} as a function of w_t, x_t and y_t that enables it to be computed incrementally using time similar to that of the perceptron algorithm. This is described in Section 3. \n\nThe relaxed online maximum margin algorithm with aggressive updating. The algorithm is the same as the previous algorithm, except that an update is made after any trial in which y_t(w_t · x_t) < 1, not just after mistakes. \n\n2.2 Upper bound on the number of mistakes made \n\nNow we prove a bound on the number of mistakes made by ROMMA. As in previous mistake bound proofs (e.g. [8]), we will show that mistakes result in an increase in a \"measure of progress\", and then appeal to a bound on the total possible progress. Our proof will use the squared length of w_t as its measure of progress. \n\nFirst we will need the following lemmas. \n\nLemma 1 On any run of ROMMA on linearly separable data, if trial t was a mistake, then the new constraint is binding at the new weight vector; i.e. y_t(w_{t+1} · x_t) = 1. \n\nProof: For the purpose of contradiction, suppose the new constraint is not binding at the new weight vector w_{t+1}. Since w_t fails to satisfy this constraint, the line connecting w_{t+1} and w_t intersects the border hyperplane of the new constraint; denote the intersection point by w_q. Then w_q can be represented as w_q = α w_t + (1 − α) w_{t+1}, 0 < α < 1. Since the squared Euclidean length ||·||^2 is a convex function, the following holds: \n\n||w_q||^2 ≤ α ||w_t||^2 + (1 − α) ||w_{t+1}||^2. \n\nSince w_t is the unique smallest member of H_t and w_{t+1} ≠ w_t, we have ||w_t||^2 < ||w_{t+1}||^2, which implies \n\n||w_q||^2 < ||w_{t+1}||^2.   (2) \n\nSince w_t and w_{t+1} are both in H_t, w_q is too, and hence (2) contradicts the definition of w_{t+1}. □ \n\nLemma 2 On any run of ROMMA on linearly separable data, if trial t was a mistake, and not the first one, then the old constraint is binding at the new weight vector, i.e. w_{t+1} · w_t = ||w_t||^2. \n\nProof: Let A_t be the plane of weight vectors that make the new constraint tight, i.e. A_t = {w : y_t(w · x_t) = 1}. By Lemma 1, w_{t+1} ∈ A_t. Let a_t = y_t x_t / ||x_t||^2 be the element of A_t that is perpendicular to the hyperplane A_t. Then each w ∈ A_t satisfies ||w||^2 = ||a_t||^2 + ||w − a_t||^2, and therefore the length of a vector w in A_t is minimized when w = a_t and is monotone in the distance from w to a_t. Thus, if the old constraint is not binding, then w_{t+1} = a_t, since otherwise the solution could be improved by moving w_{t+1} a little bit toward a_t. But the old constraint requires that w_t · w_{t+1} ≥ ||w_t||^2, and if w_{t+1} = a_t = y_t x_t / ||x_t||^2, this means that w_t · (y_t x_t / ||x_t||^2) ≥ ||w_t||^2. Rearranging, we get y_t(w_t · x_t) ≥ ||x_t||^2 ||w_t||^2 > 0 (||x_t|| > 0 follows from the fact that the data is linearly separable, and ||w_t|| > 0 follows from the fact that there was at least one previous mistake). But since trial t was a mistake, y_t(w_t · x_t) ≤ 0, a contradiction. □ \n\nNow we are ready to prove the mistake bound. \n\nTheorem 3 Choose m ∈ N, and a sequence (x_1, y_1), ..., (x_m, y_m) of pattern-classification pairs in R^n × {-1, +1}. Let R = max_t ||x_t||. If there is a weight vector u such that y_t(u · x_t) ≥ 1 for all 1 ≤ t ≤ m, then the number of mistakes made by ROMMA on (x_1, y_1), ..., (x_m, y_m) is at most R^2 ||u||^2. \n\nProof: First, we claim that for all t, u ∈ H_t. 
This is easily seen since u satisfies all the constraints that are ever imposed on a weight vector, and therefore all relaxations of such constraints. Since w_t is the smallest element of H_t, we have ||w_t|| ≤ ||u||. \n\nWe have w_2 = y_1 x_1 / ||x_1||^2, and therefore ||w_2|| = 1/||x_1|| ≥ 1/R, which implies ||w_2||^2 ≥ 1/R^2. We claim that if any trial t > 1 is a mistake, then ||w_{t+1}||^2 ≥ ||w_t||^2 + 1/R^2. This will imply by induction that after M mistakes, the squared length of the algorithm's weight vector is at least M/R^2, which, since all of the algorithm's weight vectors are no longer than u, will complete the proof. \n\nFigure 2: A_t, B_t, and P_t. \n\nChoose an index t > 1 of a trial in which a mistake is made. Let A_t = {w : y_t(w · x_t) = 1} and B_t = {w : w · w_t = ||w_t||^2}. By Lemmas 1 and 2, w_{t+1} ∈ A_t ∩ B_t. \n\nThe distance from w_t to A_t (call it ρ_t) satisfies \n\nρ_t = |y_t(x_t · w_t) − 1| / ||x_t|| ≥ 1/||x_t|| ≥ 1/R,   (3) \n\nsince the fact that there was a mistake in trial t implies y_t(x_t · w_t) ≤ 0. Also, since w_{t+1} ∈ A_t, \n\n||w_{t+1} − w_t|| ≥ ρ_t.   (4) \n\nBecause w_t is the normal vector of B_t and w_{t+1} ∈ B_t, we have \n\n||w_{t+1}||^2 = ||w_t||^2 + ||w_{t+1} − w_t||^2. \n\nThus, applying (3) and (4), we have ||w_{t+1}||^2 − ||w_t||^2 = ||w_{t+1} − w_t||^2 ≥ ρ_t^2 ≥ 1/R^2, which, as discussed above, completes the proof. □ \n\nUsing the fact, easily proved using induction, that for all t, P_t ⊆ H_t, we can easily prove the following, which complements analyses of the maximum margin algorithm using independence assumptions [1, 14, 12]. Details are omitted due to space constraints. \n\nTheorem 4 Choose m ∈ N, and a sequence (x_1, y_1), ..., (x_m, y_m) of pattern-classification pairs in R^n × {-1, +1}. Let R = max_t ||x_t||. If there is a weight vector u such that y_t(u · x_t) ≥ 1 for all 1 ≤ t ≤ m, then the number of mistakes made by the ideal online maximum margin algorithm on (x_1, y_1), ..., (x_m, y_m) is at most R^2 ||u||^2. \n\nIn the proof of Theorem 3, if an update is made when y_t(w_t · x_t) < 1 − δ instead of y_t(w_t · x_t) ≤ 0, then the progress made can be seen to be at least δ^2/R^2. This can be applied to prove the following. \n\nTheorem 5 Choose δ > 0, m ∈ N, and a sequence (x_1, y_1), ..., (x_m, y_m) of pattern-classification pairs in R^n × {-1, +1}. Let R = max_t ||x_t||. If there is a weight vector u such that y_t(u · x_t) ≥ 1 for all 1 ≤ t ≤ m, then if (x_1, y_1), ..., (x_m, y_m) are presented online, the number of trials in which aggressive ROMMA has y_t(w_t · x_t) < 1 − δ is at most R^2 ||u||^2 / δ^2. \n\nTheorem 5 implies that, in a sense, repeatedly cycling through a dataset using aggressive ROMMA will eventually converge to SVM; note however that bias is not considered. \n\n3 An efficient implementation \n\nWhen the prediction of ROMMA differs from the expected label, the algorithm chooses w_{t+1} to minimize ||w_{t+1}|| subject to A w_{t+1} = b, where A is the matrix with rows w_t^T and x_t^T, and b = (||w_t||^2, y_t)^T. Simple calculation shows that \n\nw_{t+1} = A^T (A A^T)^{-1} b = ((||x_t||^2 ||w_t||^2 − y_t (w_t · x_t)) / (||x_t||^2 ||w_t||^2 − (w_t · x_t)^2)) w_t + ((||w_t||^2 (y_t − (w_t · x_t))) / (||x_t||^2 ||w_t||^2 − (w_t · x_t)^2)) x_t.   (5) \n\nOn trials t in which a mistake is made, write c_t and d_t for the coefficients of w_t and x_t in (5), so that the update is w_{t+1} = c_t w_t + d_t x_t. \n\nSince the computations required by ROMMA involve inner products together with a few operations on scalars, we can apply the kernel method to our algorithm, efficiently solving the original problem in a very high-dimensional space. Computationally, we only need to modify the algorithm by replacing each inner product computation (x_i · x_j) with a kernel function computation K(x_i, x_j). \n\nTo make a prediction for the tth trial, the algorithm must compute the inner product between x_t and the prediction vector w_t. 
In order to apply the kernel function, as in [1, 3], we store each prediction vector w_t in an implicit manner, as a weighted sum of the examples on which mistakes occurred during training. In particular, each w_t is represented as \n\nw_t = (∏_{j=1}^{t−1} c_j) w_1 + Σ_{j=1}^{t−1} (∏_{n=j+1}^{t−1} c_n) d_j x_j. \n\nThe above formula may seem daunting; however, making use of the recurrence (w_{t+1} · x) = c_t (w_t · x) + d_t (x_t · x), it is clear that the complexity of our new algorithm is similar to that of the perceptron algorithm. This was borne out by our experiments. \n\nThe implementation for aggressive ROMMA is similar to the above. \n\n4 Experiments \n\nWe did some experiments using ROMMA and aggressive ROMMA as batch algorithms on the MNIST OCR database.² We obtained a batch algorithm from our online algorithm in the usual way, making a number of passes over the dataset and using the final weight vector to classify the test data. \n\nEvery example in this database has two parts: the first is a 28 × 28 matrix which represents the image of the corresponding digit. Each entry in the matrix takes a value from {0, ..., 255}. The second part is a label taking a value from {0, ..., 9}. The dataset consists of 60,000 training examples and 10,000 test examples. We adopt the following polynomial kernel: K(x_i, x_j) = (1 + (x_i · x_j))^d. This corresponds to using an expanded collection of features including all products of at most d components of the original feature vector (see [14]). Let us refer to the mapping from the original feature vector to the expanded feature vector as Φ. Note that one component of Φ(x) is always 1, and therefore the component of the weight vector corresponding to that component can be viewed as a bias. 
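As an illustrative sketch (ours, not the paper's), the recurrence above lets the kernelized algorithm keep one coefficient per stored example, rescaling all of them by c_t at each mistake; for simplicity this sketch starts from w_1 = 0, whereas the experiments below start from Φ(0). `kernel` is assumed to take two instance vectors, e.g. the polynomial kernel used here.

```python
import numpy as np

def poly_kernel(a, b, d=4):
    """Polynomial kernel from the experiments: K(a, b) = (1 + a.b)^d."""
    return (1.0 + np.dot(a, b)) ** d

def kernel_romma(data, kernel=poly_kernel):
    """Kernelized ROMMA (simplified to w_1 = 0). Returns support examples and
    coefficients alpha with w = sum_j alpha_j Phi(x_j) implicitly."""
    support, alpha = [], []   # examples on which mistakes occurred, and weights
    ww = 0.0                  # ||w_t||^2, maintained incrementally

    def dot_w(x):             # (w_t . Phi(x)) via the kernel expansion
        return sum(a * kernel(s, x) for a, s in zip(alpha, support))

    for x, y in data:
        wx = dot_w(x)
        if np.sign(wx) != y:                  # mistake (a prediction of 0 counts)
            kxx = kernel(x, x)
            if not support:                   # first update: w_2 = y Phi(x)/K(x, x)
                c, d = 1.0, y / kxx
            else:                             # c_t, d_t from eq. (5), via kernels
                denom = kxx * ww - wx ** 2
                c = (kxx * ww - y * wx) / denom
                d = ww * (y - wx) / denom
            alpha = [c * a for a in alpha]    # recurrence: rescale old coefficients
            alpha.append(d)
            support.append(x)
            ww = c * c * ww + 2 * c * d * wx + d * d * kxx
    return support, alpha
```

Since each mistake touches only scalars plus one kernel evaluation per stored example, the per-trial cost is comparable to that of the kernel perceptron.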
In our experiments, we set w_1 = Φ(0) rather than 0 to speed up the learning of the coefficient corresponding to the bias. We chose d = 4 since in experiments on the same problem conducted in [3, 2], the best results occurred with this value. \n\nTo cope with multiclass data, we trained ROMMA and aggressive ROMMA once for each of the 10 labels. Classification of an unknown pattern is done according to the maximum output of these ten classifiers. \n\nAs every entry in the image matrix takes a value from {0, ..., 255}, the order of magnitude of K(x, x) is at least 10^26, which might cause round-off error in the computation of c_i and d_i. We scale the data by dividing each entry by 1100 when training with ROMMA. \n\nTable 1: Experimental results on MNIST data \n\n                   T=1           T=2           T=3           T=4 \n                 Err  MisNo    Err  MisNo    Err  MisNo    Err  MisNo \npercep          2.84   7970   2.27  10539   1.99  11945   1.85  12800 \nvoted-percep    2.26   7970   1.88  10539   1.76  11945   1.69  12800 \nROMMA           2.48   7963   1.96   9995   1.79  10971   1.77  11547 \nagg-ROMMA       2.14   6077   1.82   7391   1.71   7901   1.67   8139 \nagg-ROMMA(NC)   2.05   5909   1.76   6979   1.67   7339   1.63   7484 \n\nSince the performance of online learning is affected by the order of the sample sequence, all the results shown in Table 1 are averages over 10 random permutations. The columns marked \"MisNo\" in Table 1 show the total number of mistakes made during training for the 10 labels. Although online learning would involve only one epoch, we present results for a batch setting for up to four epochs (T in Table 1 denotes the number of epochs). \n\n²National Institute for Standards and Technology, special database 3. See http://www.research.att.com/~yann/ocr for information on obtaining this dataset. 
\n\nTo deal with data which are linearly inseparable in the feature space, and also to improve \ngeneralization, Friess et al [4] suggested the use of quadratic penalty in the cost function, \nwhich can be implemented using a slightly different kernel function [4, 5]: iC(Xk ' Xj) = \nK(Xk, Xj) + c5kj ).., where c5kj is the Kronecker delta function. The last row in Table 1 is the \nresult of aggressive ROMMA using this method to control noise ().. = 30 for 10 classifiers). \n\nWe conducted three groups of experiments, one for the perceptron algorithm (denoted \"per(cid:173)\ncep\"), the second for the voted perceptron (denoted \"voted-percep\") whose description is \nin [3], the third for ROMMA, aggressive ROMMA (denoted \"agg-ROMMA\"), and aggres(cid:173)\nsive ROMMA with noise control (denoted \"agg-ROMMA(NC)\"). Data in the third group \nare scaled. All three groups set 'lih = <1>(0). \nThe results in Table 1 demonstrate that ROMMA has better performance than the standard \nperceptron, aggressive ROMMA has slightly better performance than the voted perceptron. \nAggressive ROMMA with noise control should not be compared with perceptrons without \nnoise control. Its presentation is used to show what performance our new online algorithm \ncould achieve (of course it's not the best, since all 10 classifiers use the same )\"). A remark(cid:173)\nable phenomenon is that our new algorithm behaves very well at the first two epochs. \n\nReferences \n\n[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. \nProceedings of the 1992 Workshop on Computational Learning Theory, pages 144-152, 1992. \n[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297,1995. \n[3] y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. \n\nProceedings of the 1998 Conference on Computational Learning Theory, 1998. \n\n[4] T. T. Friess, N. Cristianini, and C. Campbell. 
The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proc. 15th Int. Conf. on Machine Learning. Morgan Kaufmann Publishers, 1998. \n\n[5] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative nearest point algorithm for support vector machine classifier design. Technical report TR-ISL-99-03, Indian Institute of Science, 1999. \n\n[6] Adam Kowalczyk. Maximal margin perceptron. In Smola, Bartlett, Scholkopf, and Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999. \n\n[7] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2:285-318, 1988. \n\n[8] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, UC Santa Cruz, 1989. \n\n[9] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, 1998. \n\n[10] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958. \n\n[11] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D. C., 1962. \n\n[12] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. A framework for structural risk minimization. In Proc. of the 1996 Conference on Computational Learning Theory, 1996. \n\n[13] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982. \n\n[14] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. ", "award": [], "sourceid": 1727, "authors": [{"given_name": "Yi", "family_name": "Li", "institution": null}, {"given_name": "Philip", "family_name": "Long", "institution": null}]}