{"title": "DUOL: A Double Updating Approach for Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2259, "page_last": 2267, "abstract": "In most online learning algorithms, the weights assigned to the misclassified examples (or support vectors) remain unchanged during the entire learning process. This is clearly insufficient since when a new misclassified example is added to the pool of support vectors, we generally expect it to affect the weights for the existing support vectors. In this paper, we propose a new online learning method, termed Double Updating Online Learning\", or \"DUOL\" for short. Instead of only assigning a fixed weight to the misclassified example received in current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be significantly improved by the proposed online learning method. Encouraging experimental results show that the proposed technique is in general considerably more effective than the state-of-the-art online learning algorithms.\"", "full_text": "DUOL: A Double Updating Approach for\n\nOnline Learning\n\nPeilin Zhao\n\nSchool of Comp. Eng.\n\nNanyang Tech. University\n\nSingapore 639798\n\nSteven C.H. Hoi\nSchool of Comp. Eng.\n\nNanyang Tech. University\n\nSingapore 639798\n\nRong Jin\n\nDept. of Comp. Sci. & Eng.\nMichigan State University\nEast Lansing, MI, 48824\n\nzhao0106@ntu.edu.sg\n\nchhoi@ntu.edu.sg\n\nrongjin@cse.msu.edu\n\nAbstract\n\nIn most online learning algorithms, the weights assigned to the misclassi\ufb01ed ex-\namples (or support vectors) remain unchanged during the entire learning process.\nThis is clearly insuf\ufb01cient since when a new misclassi\ufb01ed example is added to\nthe pool of support vectors, we generally expect it to affect the weights for the\nexisting support vectors. 
In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short. Instead of only assigning a fixed weight to the misclassified example received in the current trial, the proposed online learning algorithm also updates the weight of one of the existing support vectors. We show that the mistake bound can be significantly improved by the proposed method. Encouraging experimental results show that the proposed technique is in general considerably more effective than state-of-the-art online learning algorithms.

1 Introduction

Online learning has been extensively studied in the machine learning community (Rosenblatt, 1958; Freund & Schapire, 1999; Kivinen et al., 2001a; Crammer et al., 2006). Most online learning algorithms work by assigning a fixed weight to a new example when it is misclassified. As a result, the weights assigned to the misclassified examples, or support vectors, remain unchanged during the entire process of learning. This is clearly insufficient, because when a new example is added to the pool of support vectors, we expect it to affect the weights assigned to the support vectors received in previous trials.

Although several online algorithms are capable of updating the example weights as the learning process goes on, most of them are designed for purposes other than improving the classification accuracy and reducing the mistake bound. For instance, in (Orabona et al., 2008; Crammer et al., 2003; Dekel et al., 2005), online learning algorithms adjust the example weights in order to fit the constraint of a fixed number of support vectors; in (Cesa-Bianchi & Gentile, 2006), example weights are adjusted to track drifting concepts.
In this paper, we propose a new formulation for online learning that dynamically updates the example weights in order to improve both the classification accuracy and the mistake bound. Instead of only assigning a weight to the misclassified example received in the current trial, the proposed online learning algorithm also updates the weight of one of the existing support vectors. As a result, the example weights are dynamically updated as learning proceeds. We refer to the proposed approach as Double Updating Online Learning, or DUOL for short.

The key question in the proposed online learning approach is which one of the existing support vectors should be selected for weight updating. To this end, we employ an analysis of double updating online learning that is based on the recent work on online convex programming by incremental dual ascent (Shalev-Shwartz & Singer, 2006). Our analysis shows that, under certain conditions, the proposed online learning algorithm can significantly reduce the mistake bound of existing online algorithms. This result is further verified empirically by extensive experiments and comparison with state-of-the-art algorithms for online learning.

The rest of this paper is organized as follows. Section 2 reviews related work on online learning. Section 3 presents the proposed "double updating" approach to online learning. Section 4 gives our experimental results.
Section 5 concludes this work and discusses future directions.

2 Related Work

Online learning has been extensively studied in machine learning (Rosenblatt, 1958; Crammer & Singer, 2003; Cesa-Bianchi et al., 2004; Crammer et al., 2006; Fink et al., 2006; Yang et al., 2009). One of the most well-known online approaches is the Perceptron algorithm (Rosenblatt, 1958; Freund & Schapire, 1999), which updates the learning function by adding a new example with a constant weight into the current set of support vectors when it is misclassified. Recently, a number of online learning algorithms have been developed based on the criterion of maximum margin (Crammer & Singer, 2003; Gentile, 2001; Kivinen et al., 2001b; Crammer et al., 2006; Li & Long, 1999). One example is the Relaxed Online Maximum Margin Algorithm (ROMMA) (Li & Long, 1999), which repeatedly chooses the hyperplanes that correctly classify the existing training examples with the maximum margin. Another representative example is the Passive-Aggressive (PA) method (Crammer et al., 2006), which updates the classification function when a new example is misclassified or its classification score does not exceed some predefined margin. Empirical studies showed that the maximum margin based online learning algorithms are generally more effective than the Perceptron algorithm. Despite these differences, however, most online learning algorithms only update the weight of the newly added support vector and keep the weights of the existing support vectors unchanged. This constraint could significantly limit the effect of online learning.

Besides the studies of regular online learning, several algorithms have been proposed for online learning with a fixed budget. In these studies, the total number of support vectors is required to be bounded, either by a theoretical bound or by a manually fixed budget.
Example algorithms for fixed-budget online learning include (Weston & Bordes, 2005; Crammer et al., 2003; Cavallanti et al., 2007; Dekel et al., 2008). The key idea of these algorithms is to dynamically update the weights of the existing support vectors as a new support vector is added, and to discard the support vector with the least weight when the number of support vectors exceeds the budget. The idea of discarding support vectors is also used in (Kivinen et al., 2001b) and (Cheng et al., 2006). In a recently proposed method (Orabona et al., 2008), a new "projection" approach ensures that the number of support vectors is bounded. In addition, (Cesa-Bianchi & Gentile, 2006) proposes an online learning algorithm for handling drifting concepts, in which the weights of the existing support vectors are reduced whenever a new support vector is added. Although these online learning algorithms are capable of dynamically adjusting the weights of support vectors, they are designed either to fit a budget on the number of support vectors or to handle drifting concepts, not to improve the classification accuracy and the mistake bound.

The proposed online learning algorithm is closely related to the recent work on online convex programming by incremental dual ascent (Shalev-Shwartz & Singer, 2006). Although the idea of simultaneously updating the weights of multiple support vectors was mentioned in (Shalev-Shwartz & Singer, 2006), no efficient updating algorithm was explicitly proposed. As will be shown later, the online algorithm proposed in this work has the same computational cost as conventional online learning algorithms, despite the need to update the weights of two support vectors.

3 Double Updating for Online Learning

3.1 Motivation

We consider an online learning trial t with an incoming example that is misclassified.
Let κ(·,·) : R^d × R^d → R be the kernel function used in our classifier. Let D = {(x_i, y_i), i = 1, ..., n} be the collection of the n misclassified examples received before trial t, where x_i ∈ R^d and y_i ∈ {−1, +1}. We also refer to these misclassified training examples as "support vectors". We denote by α = (α_1, ..., α_n) ∈ [0, C]^n the weights assigned to the support vectors in D, where C is a predefined constant. The resulting classifier, denoted by f(x), is expressed as

    f(x) = Σ_{i=1}^n α_i y_i κ(x, x_i)    (1)

Let (x_a, y_a) be the misclassified example received in trial t, i.e., y_a f(x_a) ≤ 0. In the conventional approach to online learning, we simply assign a constant weight, denoted by β, to (x_a, y_a), and the resulting classifier becomes

    f′(x) = β y_a κ(x, x_a) + Σ_{i=1}^n α_i y_i κ(x, x_i) = β y_a κ(x, x_a) + f(x)    (2)

The shortcoming of the conventional online learning approach is that the introduction of the new support vector (x_a, y_a) may harm the classification of the existing support vectors in D, as revealed by the following proposition.

Proposition 1. Let (x_a, y_a) be an example misclassified by the current classifier f(x) = Σ_{i=1}^n α_i y_i κ(x, x_i), i.e., y_a f(x_a) < 0. Let f′(x) = β y_a κ(x, x_a) + f(x) be the updated classifier with β > 0. Then there exists at least one support vector x_i ∈ D such that y_i f(x_i) > y_i f′(x_i).

Proof. It follows from the fact that y_a f(x_a) < 0 implies ∃ x_i ∈ D with y_i y_a κ(x_i, x_a) < 0.

As indicated by the above proposition, when a new misclassified example is added to the classifier, the classification confidence of at least one support vector will be reduced.
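To make the classifier in (1) and the conventional single update in (2) concrete, here is a minimal Python sketch (our illustration only; the class name and the Gaussian kernel width are our own choices, not fixed by the analysis). Proposition 1 can be observed directly: adding a misclassified example lowers the classification confidence of some existing support vector.

```python
import math

def gaussian_kernel(x, z, sigma=8.0):
    # kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2)); note kappa(x, x) = 1,
    # matching the assumption kappa(x, x) <= 1 used in the analysis.
    d2 = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

class KernelClassifier:
    """f(x) = sum_i alpha_i y_i kappa(x, x_i) over the support vectors."""

    def __init__(self, kernel=gaussian_kernel):
        self.kernel = kernel
        self.sv = []               # list of (x_i, y_i, alpha_i) triples

    def f(self, x):
        return sum(a * y * self.kernel(x, xi) for xi, y, a in self.sv)

    def single_update(self, x, y, beta):
        # Conventional online step: append the misclassified example with
        # a fixed weight beta; the existing alphas are left untouched.
        self.sv.append((x, y, beta))
```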
In the case when y_a f(x_a) ≤ −γ, it is easy to verify that there exists some support vector (x_b, y_b) satisfying β y_a y_b κ(x_a, x_b) ≤ −γ/n; meanwhile, when the classification confidence of (x_b, y_b) is less than γ/n, i.e., y_b f(x_b) ≤ γ/n, this support vector will be misclassified after the classifier is updated with the example (x_a, y_a). To alleviate this problem, we propose to update the weight of the existing support vector whose classification confidence is significantly affected by the new misclassified example. In particular, we consider a support vector (x_b, y_b) ∈ D for weight updating if it satisfies the following two conditions:

• y_b f(x_b) ≤ 0, i.e., the support vector (x_b, y_b) is misclassified by the current classifier f(x);
• κ(x_b, x_a) y_a y_b ≤ −ρ, where ρ ≥ 0 is a predefined threshold, i.e., the support vector (x_b, y_b) "conflicts" with the new misclassified example (x_a, y_a).

We refer to a support vector satisfying the above conditions as an auxiliary example. It is clear that by adding the misclassified example (x_a, y_a) to the classifier f(x) with weight β, the classification score of (x_b, y_b) will be reduced by at least βρ, which could lead to the misclassification of the auxiliary example (x_b, y_b). To avoid such a mistake, we propose to update the weights for both (x_a, y_a) and (x_b, y_b) simultaneously.
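The two conditions above translate into a linear scan over the cached classification scores. The following sketch (our own; the function name and data layout are illustrative) returns the index of the auxiliary example with the most negative conflict value w = y_a y_i κ(x_a, x_i), or None when no support vector qualifies:

```python
def find_auxiliary(sv, scores, kernel, xa, ya, rho=0.2):
    """Pick the auxiliary example among the current support vectors.

    sv      -- list of (x_i, y_i) support vectors
    scores  -- scores[i] caches the classification score y_i f(x_i)
    Returns the index of the SV with the most negative conflict value
    w = y_a y_i kernel(x_a, x_i) among those satisfying condition (I)
    y_i f(x_i) <= 0 and condition (II) w <= -rho, or None if no support
    vector satisfies both conditions.
    """
    best, w_min = None, -rho
    for i, (xi, yi) in enumerate(sv):
        if scores[i] > 0:          # condition (I) fails: correctly classified
            continue
        w = ya * yi * kernel(xa, xi)
        if w <= w_min:             # condition (II), tracking the minimum
            best, w_min = i, w
    return best
```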
In the next section, we give the details of the double updating algorithm for online learning and analyze its mistake bound.

Our analysis follows closely the previous work on the relationship between online learning and the dual formulation of SVM (Shalev-Shwartz & Singer, 2006), in which online learning is interpreted as an efficient updating rule for maximizing the objective function of the dual form of SVM. We denote by Δ_t the improvement in the dual SVM objective when adding a new misclassified example to the classification function in the t-th trial. If an online learning algorithm A is designed to ensure that every Δ_t is bounded from below by a positive constant Δ, then the number of mistakes made by A when trained over a sequence of trials (x_1, y_1), ..., (x_T, y_T), denoted by M, is upper bounded by

    M ≤ (1/Δ) min_{f∈H_κ} ( (1/2)‖f‖²_{H_κ} + C Σ_{i=1}^T ℓ(y_i f(x_i)) )    (3)

where ℓ(y_i f(x_i)) = max(0, 1 − y_i f(x_i)) is the hinge loss function. In our analysis, we will show that Δ, which we refer to as the bounding constant for the improvement in the objective function, can be significantly improved by updating the weights of both the newly misclassified example and the auxiliary example.

For the remaining part of this paper, we denote by (x_b, y_b) an auxiliary example that satisfies the two conditions specified above. We slightly abuse notation by using α = (α_1, ..., α_{n−1}) ∈ R^{n−1} to denote the weights assigned to all the support vectors in D except (x_b, y_b). Similarly, we denote by y = (y_1, ..., y_{n−1}) ∈ {−1, +1}^{n−1} the class labels assigned to all the examples in D except (x_b, y_b).
We define

    s_a = κ(x_a, x_a),  s_b = κ(x_b, x_b),  s_ab = κ(x_a, x_b),  w_ab = y_a y_b s_ab.    (4)

According to the assumption about the auxiliary example, we have w_ab = s_ab y_a y_b ≤ −ρ. Finally, we denote by γ̂_b the weight of the auxiliary example (x_b, y_b) in the current classifier f(x), and by γ_a and γ_b the updated weights for (x_a, y_a) and (x_b, y_b), respectively. Throughout the analysis, we assume κ(x, x) ≤ 1 for any example x.

3.2 Double Updating Online Learning

Recall that an auxiliary example (x_b, y_b) should satisfy two conditions: (I) y_b f(x_b) ≤ 0, and (II) w_ab ≤ −ρ. In addition, the new example (x_a, y_a) received in the current trial t is misclassified, i.e., y_a f(x_a) ≤ 0. Following the dual formulation framework of online learning, the following lemma shows how to compute Δ_t, i.e., the improvement in the objective function of the dual SVM obtained by adjusting the weights for (x_a, y_a) and (x_b, y_b).

Lemma 1. The maximal improvement in the objective function of the dual SVM obtained by adjusting the weights for (x_a, y_a) and (x_b, y_b), denoted by Δ_t, is computed by solving the following optimization problem:

    Δ_t = max_{γ_a, Δγ_b} { h(γ_a, Δγ_b) : 0 ≤ γ_a ≤ C, 0 ≤ Δγ_b ≤ C − γ̂_b }    (5)

where

    h(γ_a, Δγ_b) = γ_a (1 − y_a f(x_a)) + Δγ_b (1 − y_b f(x_b)) − (s_a/2) γ_a² − (s_b/2) Δγ_b² − w_ab γ_a Δγ_b    (6)

Proof. It is straightforward to verify that the dual of min_{f_t∈H_κ} (1/2)‖f_t‖²_{H_κ} + C Σ_{i=1}^t ℓ(y_i f_t(x_i)), denoted by D_t(γ_1, ..., γ_t), is computed as

    D_t(γ_1, ..., γ_t) = Σ_{i=1}^t γ_i − Σ_{i=1}^t γ_i y_i f_t(x_i) + (1/2)‖f_t‖²_{H_κ}    (7)

where 0 ≤ γ_i ≤ C, i = 1, ..., t, and f_t(·) = Σ_{i=1}^t γ_i y_i κ(·, x_i) is the current classifier. Thus,

    h(γ_a, Δγ_b) = D_t(γ_1, ..., γ̂_b + Δγ_b, ..., γ_{t−1}, γ_a) − D_{t−1}(γ_1, ..., γ̂_b, ..., γ_{t−1})
                 = Σ_{i=1}^{t−1} γ_i + Δγ_b + γ_a − ( Σ_{i=1}^{t−1} γ_i y_i f_t(x_i) + Δγ_b y_b f_t(x_b) + γ_a y_a f_t(x_a) ) + (1/2)‖f_t‖²_{H_κ}
                   − ( Σ_{i=1}^{t−1} γ_i − Σ_{i=1}^{t−1} γ_i y_i f_{t−1}(x_i) + (1/2)‖f_{t−1}‖²_{H_κ} )

Using the relation f_t(x) = f_{t−1}(x) + Δγ_b y_b κ(x, x_b) + γ_a y_a κ(x, x_a), we have

    h(γ_a, Δγ_b) = γ_a (1 − y_a f_{t−1}(x_a)) + Δγ_b (1 − y_b f_{t−1}(x_b)) − (s_a/2) γ_a² − (s_b/2) Δγ_b² − w_ab γ_a Δγ_b

Finally, we need to show Δγ_b ≥ 0. Note that this constraint does not come directly from the box constraint that the weight of example (x_b, y_b) lies in [0, C], i.e., γ̂_b + Δγ_b ∈ [0, C]. To this end, we consider the part of h(γ_a, Δγ_b) that depends on Δγ_b, i.e.,

    g(Δγ_b) = Δγ_b (1 − y_b f_{t−1}(x_b) − w_ab γ_a) − (s_b/2) Δγ_b²

Since w_ab ≤ −ρ and y_b f_{t−1}(x_b) ≤ 0, it is clear that Δγ_b ≥ 0 when maximizing g(Δγ_b), which justifies the constraint Δγ_b ≥ 0.

The following theorem gives a bound on Δ when C is sufficiently large.

Theorem 1. Assume C > γ̂_b + 1/(1 − ρ) for the selected auxiliary example (x_b, y_b). Then

    Δ ≥ 1/(1 − ρ)    (8)

Proof.
Using the facts s_a, s_b ≤ 1, γ_a, Δγ_b ≥ 0, y_a f(x_a) ≤ 0, y_b f(x_b) ≤ 0, and w_ab ≤ −ρ, we have

    h(γ_a, Δγ_b) ≥ γ_a + Δγ_b − (1/2) γ_a² − (1/2) Δγ_b² + ρ γ_a Δγ_b

Thus, Δ is bounded as

    Δ ≥ max_{γ_a∈[0,C], Δγ_b∈[0,C−γ̂_b]}  γ_a + Δγ_b − (1/2)(γ_a² + Δγ_b²) + ρ γ_a Δγ_b

Under the condition C > γ̂_b + 1/(1 − ρ), it is easy to verify that the optimal solution of the above problem is γ_a = Δγ_b = 1/(1 − ρ), which leads to the result in the theorem.

We now consider the general case, where we only assume C ≥ 1. The following theorem gives the bound on Δ in this case.

Theorem 2. Assume C ≥ 1. Then, when updating the weights for the new example (x_a, y_a) and the auxiliary example (x_b, y_b), we have

    Δ ≥ 1/2 + (1/2) min( (1 + ρ)², (C − γ̂_b)² )

Proof. By setting γ_a = 1, we have h(γ_a, Δγ_b) bounded as

    h(γ_a = 1, Δγ_b) ≥ 1/2 + (1 + ρ) Δγ_b − (1/2) Δγ_b²

Hence, Δ is lower bounded by

    Δ ≥ 1/2 + max_{Δγ_b∈[0,C−γ̂_b]} ( (1 + ρ) Δγ_b − (1/2) Δγ_b² ) ≥ 1/2 + (1/2) min( (1 + ρ)², (C − γ̂_b)² )

Since we only have Δ ≥ 1/2 when updating the weight of the new misclassified example (x_a, y_a) alone, the result in Theorem 2 indicates an increase in Δ when updating the weights of both (x_a, y_a) and the auxiliary example (x_b, y_b).
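As a quick numerical sanity check of Theorem 1 (our own addition, not part of the original analysis), one can maximize the lower bound γ_a + Δγ_b − (γ_a² + Δγ_b²)/2 + ρ γ_a Δγ_b over a grid and compare the result with 1/(1 − ρ):

```python
def h_lower(ga, dgb, rho):
    # Lower bound on h(gamma_a, delta_gamma_b) used in the proof of
    # Theorem 1 (taking s_a = s_b = 1, y_a f(x_a) <= 0, y_b f(x_b) <= 0,
    # and w_ab <= -rho).
    return ga + dgb - 0.5 * (ga ** 2 + dgb ** 2) + rho * ga * dgb

def grid_max(rho, C=10.0, steps=400):
    # Brute-force maximization over a grid on [0, C] x [0, C]; with a
    # sufficiently large C the maximum approaches 1 / (1 - rho), attained
    # at gamma_a = delta_gamma_b = 1 / (1 - rho).
    vals = [C * k / steps for k in range(steps + 1)]
    return max(h_lower(a, b, rho) for a in vals for b in vals)
```

For ρ = 0.2, the grid maximum is 1.25 = 1/(1 − 0.2), attained at γ_a = Δγ_b = 1.25, well above the single-update constant of 1/2.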
Furthermore, when C is sufficiently large, as indicated by Theorem 1, the improvement in Δ can be very significant.

The final remaining question is how to identify the auxiliary example (x_b, y_b) efficiently, which requires efficiently updating the classification score y_i f(x_i) for all the support vectors. To this end, we introduce a variable f^i_t for each support vector to keep track of its classification score. When a new support vector (x_a, y_a) with weight γ_a is added to the classifier, we update the classification score by f^i_t ← f^i_{t−1} + y_i γ_a y_a κ(x_i, x_a); and when the weight of an auxiliary example (x_b, y_b) is updated from γ̂_b to γ_b, we update it by f^i_t ← f^i_{t−1} + y_i (γ_b − γ̂_b) y_b κ(x_i, x_b). This updating procedure ensures that the computational cost of double updating online learning is O(n), where n is the number of support vectors, the same as that of conventional kernel online learning algorithms. Figure 1 shows the details of the DUOL algorithm.

Finally, we give a bound on the number of mistakes, assuming C is sufficiently large.

Theorem 3. Let (x_1, y_1), ..., (x_T, y_T) be a sequence of examples, where x_t ∈ R^d, y_t ∈ {−1, +1}, and κ(x_t, x_t) ≤ 1 for all t, and assume C is sufficiently large. Then, for any function f in H_κ, the number of prediction mistakes M made by DUOL on this sequence of examples is bounded by

    M ≤ 2 min_{f∈H_κ} ( (1/2)‖f‖²_{H_κ} + C Σ_{i=1}^T ℓ(y_i f(x_i)) ) − ((1 + ρ)/(1 − ρ)) M_d(ρ)    (9)

where M_d(ρ) is the number of mistakes made in trials where an auxiliary example exists; it depends on the threshold ρ and the dataset (M_d(ρ) is in fact a decreasing function of ρ).

Proof.
We denote by M_s the number of mistakes made in trials where a single update was performed because no appropriate auxiliary example was found. Using Theorem 1, we have the following inequality:

    (1/2) M_s + (1/(1 − ρ)) M_d(ρ) ≤ min_{f∈H_κ} ( (1/2)‖f‖²_{H_κ} + C Σ_{i=1}^T ℓ(y_i f(x_i)) )    (10)

Plugging M = M_s + M_d(ρ) into the inequality above, we get

    M ≤ 2 min_{f∈H_κ} ( (1/2)‖f‖²_{H_κ} + C Σ_{i=1}^T ℓ(y_i f(x_i)) ) − ((1 + ρ)/(1 − ρ)) M_d(ρ)    (11)

It is worth pointing out that, although Theorem 3 may suggest that the larger the value of ρ, the smaller the mistake bound, this is not necessarily true, since M_d(ρ) is in general a monotonically decreasing function of ρ. As a result, it is unclear whether M_d(ρ) × (1 + ρ)/(1 − ρ) will increase when ρ is increased.

Algorithm 1 The DUOL Algorithm (DUOL)
PROCEDURE
 1: Initialize S_0 = ∅, f_0 = 0;
 2: for t = 1, 2, ..., T do
 3:   Receive new instance x_t;
 4:   Predict ŷ_t = sign(f_{t−1}(x_t));
 5:   Receive label y_t;
 6:   l_t = max{0, 1 − y_t f_{t−1}(x_t)};
 7:   if l_t > 0 then
 8:     w_min = 0;
 9:     for ∀i ∈ S_{t−1} do
10:       if (f^i_{t−1} ≤ 0) then
11:         if (y_i y_t κ(x_i, x_t) < w_min) then
12:           w_min = y_i y_t κ(x_i, x_t);
13:           (x_b, y_b) = (x_i, y_i);  /* auxiliary example */
14:         end if
15:       end if
16:     end for
17:     f^t_{t−1} = y_t f_{t−1}(x_t);
18:     S_t = S_{t−1} ∪ {t};
19:     if (w_min ≤ −ρ) then
20:       γ_t = min(C, 1/(1−ρ));
21:       γ_b = min(C, γ̂_b + 1/(1−ρ));
22:       for ∀i ∈ S_t do
23:         f^i_t ← f^i_{t−1} + y_i γ_t y_t κ(x_i, x_t) + y_i (γ_b − γ̂_b) y_b κ(x_i, x_b);
24:       end for
25:       f_t = f_{t−1} + γ_t y_t κ(x_t, ·) + (γ_b − γ̂_b) y_b κ(x_b, ·);
26:     else  /* no auxiliary example found */
27:       γ_t = min(C, 1);
28:       for ∀i ∈ S_t do
29:         f^i_t ← f^i_{t−1} + y_i γ_t y_t κ(x_i, x_t);
30:       end for
31:       f_t = f_{t−1} + γ_t y_t κ(x_t, ·);
32:     end if
33:   else
34:     f_t = f_{t−1}; S_t = S_{t−1};
35:     for ∀i ∈ S_t do
36:       f^i_t ← f^i_{t−1};
37:     end for
38:   end if
39: end for

Figure 1: The Algorithm of Double Updating Online Learning (DUOL).

4 Experimental Results

4.1 Experimental Testbed and Setup

We now evaluate the empirical performance of the proposed double updating online learning (DUOL) algorithm. We compare DUOL with a number of state-of-the-art techniques, including the Perceptron (Rosenblatt, 1958; Freund & Schapire, 1999), the ROMMA algorithm and its aggressive version agg-ROMMA (Li & Long, 1999), the ALMA_p(α) algorithm (Gentile, 2001), and the Passive-Aggressive ("PA") algorithms (Crammer et al., 2006).
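To complement the pseudocode of Algorithm 1 in Figure 1, the following self-contained Python sketch (our own simplified rendering; the function name and data layout are ours) runs the DUOL loop over a stream of examples, caching the scores y_i f_t(x_i) as the algorithm prescribes:

```python
def duol_fit(stream, kernel, C=5.0, rho=0.2):
    """Sketch of DUOL (Algorithm 1) over an iterable of (x, y) pairs.

    Support vectors are stored as [x_i, y_i, gamma_i]; scores[i] caches
    y_i f_t(x_i) so that the auxiliary-example search is a linear scan.
    Returns (number_of_mistakes, support_vectors).
    """
    sv, scores, mistakes = [], [], 0

    def f(x):
        return sum(g * yi * kernel(x, xi) for xi, yi, g in sv)

    for x, y in stream:
        fx = f(x)
        if y * fx <= 0:                      # prediction mistake
            mistakes += 1
        if 1.0 - y * fx <= 0:                # zero hinge loss: no update
            continue
        # Scan misclassified SVs for the most conflicting auxiliary example.
        b, w_min = None, 0.0
        for i, (xi, yi, _) in enumerate(sv):
            if scores[i] <= 0:               # condition (I)
                w = y * yi * kernel(x, xi)
                if w < w_min:
                    b, w_min = i, w
        sv.append([x, y, 0.0])
        scores.append(y * fx)                # cache f^t_{t-1} = y_t f_{t-1}(x_t)
        if b is not None and w_min <= -rho:  # double update
            gt = min(C, 1.0 / (1.0 - rho))
            xb, yb, gb_old = sv[b]
            gb = min(C, gb_old + 1.0 / (1.0 - rho))
            sv[b][2] = gb
            sv[-1][2] = gt
            for i, (xi, yi, _) in enumerate(sv):
                scores[i] += yi * (gt * y * kernel(xi, x)
                                   + (gb - gb_old) * yb * kernel(xi, xb))
        else:                                # single update
            gt = min(C, 1.0)
            sv[-1][2] = gt
            for i, (xi, yi, _) in enumerate(sv):
                scores[i] += yi * gt * y * kernel(xi, x)
    return mistakes, sv
```

Note that, as in Figure 1, a margin error (0 < y_t f_{t−1}(x_t) < 1) also triggers an update even though it is not counted as a mistake.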
The original Perceptron algorithm was proposed for learning linear models; in our experiments, we follow (Kivinen et al., 2001b) by adapting it to the kernel case. Two versions of the PA algorithm (PA-I and PA-II) were implemented as described in (Crammer et al., 2006). Finally, as an ideal yardstick, we also implement a full online SVM algorithm ("Online-SVM") (Shalev-Shwartz & Singer, 2006), which updates all the support vectors in each trial and is thus computationally extremely intensive, as will be revealed in our study.

To extensively examine the performance, we test all the algorithms on a number of benchmark datasets from public machine learning repositories. All of the datasets can be downloaded from the LIBSVM website¹, the UCI machine learning repository², and the MIT CBCL face datasets³. Due to space limitations, we randomly choose six of them for our discussion: "german", "splice", "spambase", "MITFace", "a7a", and "w7a".

To make a fair comparison, all algorithms adopt the same experimental setup. In particular, for all the compared algorithms, we set the penalty parameter C = 5 and employ the same Gaussian kernel with σ = 8. For the ALMA_p(α) algorithm, parameters p and α are set to 2 and 0.9, respectively, based on our experience. For the proposed DUOL algorithm, we fix ρ = 0.2 in all cases.

All the experiments were conducted over 20 random permutations of each dataset, and all results are reported as averages over these 20 runs. We evaluate online learning performance by measuring the mistake rate, i.e., the ratio of the number of mistakes made by the online learning algorithm to the total number of examples received for prediction. In addition, to examine the sparsity of the resulting classifiers, we also evaluate the number of support vectors produced by each online learning algorithm.
Finally, we also evaluate the computational efficiency of all the algorithms by their running time (in seconds). All experiments were run in Matlab on a machine with a 2.3GHz CPU.

4.2 Performance Evaluation

Tables 1 to 6 summarize the performance of all the compared algorithms on the six datasets⁴, and Figures 2 to 6 show the mistake rates of the compared online learning algorithms over the trials. We observe that Online-SVM yields considerably better performance than the other online learning algorithms on the "german", "splice", "spambase", and "MITFace" datasets, however, at the price of extremely high computational cost. In most cases, the running time of Online-SVM is two orders of magnitude, sometimes three, higher than that of the other online learning algorithms, making it unsuitable for online learning. For the remaining part of this section, we restrict our discussion to the other six baseline online learning algorithms.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
² http://www.ics.uci.edu/~mlearn/MLRepository.html
³ http://cbcl.mit.edu/software-datasets
⁴ Due to the huge computational cost, we are unable to obtain the results of Online-SVM on the two large datasets.

First, among the six baseline algorithms in comparison, we observe that agg-ROMMA and the two PA algorithms (PA-I and PA-II) perform considerably better than the other three algorithms (i.e., Perceptron, ROMMA, and ALMA) in most cases. We also notice that agg-ROMMA and the two PA algorithms consume considerably larger numbers of support vectors than the other three algorithms. We believe this is because agg-ROMMA and the two PA algorithms adopt more aggressive update strategies than the other three algorithms, resulting in more updates and better classification performance.
For convenience of discussion, we refer to agg-ROMMA and the two PA algorithms as aggressive algorithms, and the other three as non-aggressive ones.

Second, compared with all six competing algorithms, DUOL achieves significantly smaller mistake rates than the other single-updating algorithms in all cases. This shows that the proposed double updating approach is effective in improving online prediction performance. Examining the sparsity of the resulting classifiers, we observe that DUOL produces sparser classifiers than the three aggressive online learning algorithms, and denser classifiers than the three non-aggressive ones.

Third, according to the running-time results, DUOL is overall efficient compared to the state-of-the-art online learning algorithms. Among all the compared algorithms, the Perceptron, for its simplicity, is clearly the most efficient, while agg-ROMMA is significantly slower than the others (except for "Online-SVM").
Although DUOL requires double updating, its efficiency is comparable to the PA and ROMMA algorithms.

Table 1: Evaluation on german (n=1000, d=24).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    0.018      35.305 ± 1.510    353.05 ± 15.10
ROMMA         0.154      35.105 ± 1.189    351.05 ± 11.89
agg-ROMMA     1.068      33.350 ± 1.287    643.25 ± 12.31
ALMA2(0.9)    0.225      34.025 ± 0.910    402.00 ± 7.33
PA-I          0.029      33.670 ± 1.278    732.60 ± 9.74
PA-II         0.030      33.175 ± 1.229    757.00 ± 10.02
Online-SVM    16.097     28.860 ± 0.651    646.10 ± 5.00
DUOL          0.089      29.990 ± 1.033    682.50 ± 12.87

Table 2: Evaluation on splice (n=1000, d=60).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    0.016      27.120 ± 0.975    271.20 ± 9.75
ROMMA         0.055      25.560 ± 0.814    255.60 ± 8.14
agg-ROMMA     0.803      22.980 ± 0.780    602.95 ± 7.43
ALMA2(0.9)    0.075      26.040 ± 0.965    314.95 ± 9.41
PA-I          0.028      23.815 ± 1.042    665.60 ± 5.60
PA-II         0.028      23.515 ± 1.005    689.00 ± 7.85
Online-SVM    12.243     17.455 ± 0.518    614.90 ± 2.92
DUOL          0.076      20.560 ± 0.566    577.85 ± 8.93

Table 3: Evaluation on spambase (n=4601, d=57).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    0.204      24.987 ± 0.525    1149.65 ± 24.17
ROMMA         10.128     23.953 ± 0.510    1102.10 ± 23.44
agg-ROMMA     95.028     21.242 ± 0.384    2550.60 ± 27.32
ALMA2(0.9)    25.294     23.579 ± 0.411    1550.15 ± 15.65
PA-I          0.490      22.112 ± 0.374    2861.50 ± 24.36
PA-II         0.505      21.907 ± 0.340    3029.10 ± 24.69
Online-SVM    2521.665   17.138 ± 0.321    2396.95 ± 10.57
DUOL          0.985      19.438 ± 0.432    2528.55 ± 20.57

Table 4: Evaluation on MITFace (n=6977, d=361).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    0.164      4.665 ± 0.192     325.50 ± 13.37
ROMMA         0.362      4.114 ± 0.155     287.05 ± 10.84
agg-ROMMA     11.074     3.137 ± 0.093     1121.15 ± 24.18
ALMA2(0.9)    0.675      4.467 ± 0.169     400.10 ± 10.53
PA-I          0.356      3.190 ± 0.128     1155.45 ± 14.53
PA-II         0.370      3.108 ± 0.112     1222.05 ± 13.73
Online-SVM    7238.105   1.142 ± 0.073     520.05 ± 4.55
DUOL          0.384      2.409 ± 0.161     768.65 ± 16.18

Table 5: Evaluation on a7a (n=16100, d=123).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    2.043      22.022 ± 0.202    3545.50 ± 32.49
ROMMA         306.793    21.297 ± 0.272    3428.85 ± 43.77
agg-ROMMA     661.632    20.832 ± 0.234    4541.30 ± 109.39
ALMA2(0.9)    338.609    20.096 ± 0.214    3571.05 ± 40.38
PA-I          4.296      21.826 ± 0.239    6760.70 ± 47.89
PA-II         4.536      21.478 ± 0.237    7068.40 ± 51.32
DUOL          10.122     19.389 ± 0.227    7089.85 ± 38.93

Table 6: Results on w7a (n=24292, d=300).
Algorithm     Time (s)   Mistake (%)       Support Vectors (#)
Perceptron    1.233      4.027 ± 0.095     994.40 ± 23.57
ROMMA         13.860     4.158 ± 0.087     1026.75 ± 21.51
agg-ROMMA     137.975    3.500 ± 0.061     2317.70 ± 58.92
ALMA2(0.9)    13.245     3.518 ± 0.071     1031.05 ± 15.33
PA-I          3.732      3.701 ± 0.057     2839.60 ± 41.57
PA-II         4.719      3.571 ± 0.053     3391.50 ± 51.94
DUOL          2.677      2.771 ± 0.041     1699.80 ± 22.78

5 Conclusions
This paper presented a novel "double updating" approach to online learning, named DUOL, which not only updates the weight of the newly added support vector, but also adjusts the weight of one existing support vector that seriously conflicts with the new support vector. We show that the mistake bound for an online classification task can be significantly reduced by the proposed DUOL algorithm.
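The double-updating mechanism just summarized can be sketched structurally as follows. This is an illustration only, not the paper's exact update rule: it picks as the "conflicting" support vector the one minimizing y_i * y_t * k(x_i, x_t), and it uses a fixed step size `eta` for both weights, whereas DUOL derives the actual weights (and the improved mistake bound) from a dual optimization.

```python
import numpy as np

def linear_kernel(x1, x2):
    return float(np.dot(x1, x2))

class DUOLSketch:
    """Structural sketch of double updating: on a mistake, add the new
    example as a support vector AND increase the weight of the existing
    support vector that conflicts with it most. The fixed step `eta`
    and the conflict criterion are illustrative simplifications."""

    def __init__(self, kernel=linear_kernel, eta=1.0):
        self.kernel = kernel
        self.eta = eta
        self.sv = []              # list of (x_i, y_i, alpha_i)

    def predict_score(self, x):
        return sum(a * yi * self.kernel(xi, x) for xi, yi, a in self.sv)

    def update(self, x_t, y_t):
        if y_t * self.predict_score(x_t) > 0:
            return False          # correct prediction: no update
        # First update: add the misclassified example with weight eta.
        self.sv.append((x_t, y_t, self.eta))
        # Second update: find the existing SV conflicting most with the
        # new one, i.e. the one minimizing y_i * y_t * k(x_i, x_t).
        if len(self.sv) > 1:
            conflicts = [yi * y_t * self.kernel(xi, x_t)
                         for xi, yi, _ in self.sv[:-1]]
            i = int(np.argmin(conflicts))
            if conflicts[i] < 0:  # only double-update on a real conflict
                xi, yi, a = self.sv[i]
                self.sv[i] = (xi, yi, a + self.eta)
        return True
```

A single-updating algorithm would stop after the `append`; the extra weight adjustment is what distinguishes the double-updating scheme.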
We have conducted an extensive set of experiments comparing DUOL with a number of competing algorithms. Promising empirical results validate the effectiveness of our technique. Future work will address multi-class double updating online learning.
Acknowledgements
This work was supported in part by MOE tier-1 Grant (RG67/07), NRF IDM Grant (NRF2008IDM-IDM-004-018), National Science Foundation (IIS-0643494), and US Navy Research Office (N00014-09-1-0663).

Figures 2-6 each plot (a) the online average rate of mistakes, (b) the online average number of support vectors, and (c) the average time cost (log10 t) against the number of samples.

Figure 2: Evaluation on the german dataset. The data size is 1000 and the dimensionality is 24.

Figure 3: Evaluation on the splice dataset. The data size is 1000 and the dimensionality is 60.

Figure 4: Evaluation on the spambase dataset. The data size is 4601 and the dimensionality is 57.

Figure 5: Evaluation on the a7a dataset. The data size is 16100 and the dimensionality is 123.

Figure 6: Evaluation on the w7a dataset.
The data size is 24292 and the dimensionality is 300.
References
Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2007). Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69, 143-167.
Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory, 50, 2050-2057.
Cesa-Bianchi, N., & Gentile, C. (2006). Tracking the best hyperplane with a simple budget perceptron. COLT (pp. 483-498).
Cheng, L., Vishwanathan, S. V. N., Schuurmans, D., Wang, S., & Caelli, T. (2006). Implicit online learning with kernels. NIPS (pp. 249-256).
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. JMLR, 7, 551-585.
Crammer, K., Kandola, J. S., & Singer, Y. (2003). Online classification on a budget. NIPS.
Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. JMLR, 3, 951-991.
Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2005). The forgetron: A kernel-based perceptron on a fixed budget. NIPS.
Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2008). The forgetron: A kernel-based perceptron on a budget. SIAM J.
Comput., 37, 1342-1372.
Fink, M., Shalev-Shwartz, S., Singer, Y., & Ullman, S. (2006). Online multiclass learning by interclass hypothesis sharing. ICML (pp. 313-320).
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Mach. Learn., 37, 277-296.
Gentile, C. (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213-242.
Kivinen, J., Smola, A. J., & Williamson, R. C. (2001a). Online learning with kernels. NIPS (pp. 785-792).
Li, Y., & Long, P. M. (1999). The relaxed online maximum margin algorithm. NIPS (pp. 498-504).
Orabona, F., Keshet, J., & Caputo, B. (2008). The projectron: a bounded kernel-based perceptron. ICML (pp. 720-727).
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-407.
Shalev-Shwartz, S., & Singer, Y. (2006). Online learning meets optimization in the dual. COLT (pp. 423-437).
Weston, J., & Bordes, A. (2005). Online (and offline) on an even tighter budget. AISTATS (pp. 413-420).
Yang, L., Jin, R., & Ye, J. (2009). Online learning by ellipsoid method. ICML (p. 145).