{"title": "Online Passive-Aggressive Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1229, "page_last": 1236, "abstract": "", "full_text": "Online Passive-Aggressive Algorithms\n\nKoby Crammer, Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer\n\nSchool of Computer Science & Engineering\n\nThe Hebrew University, Jerusalem 91904, Israel\n\n{kobics,oferd,shais,singer}@cs.huji.ac.il\n\nAbstract\n\nWe present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss.\n\n1 Introduction\n\nIn this paper we describe and analyze several learning tasks through the same algorithmic prism. Specifically, we discuss online classification, online regression, and online uniclass prediction. In all three settings we receive instances in a sequential manner. For concreteness we assume that these instances are vectors in $\\mathbb{R}^n$ and denote the instance received on round $t$ by $x_t$. In the classification problem our goal is to find a mapping from the instance space into the set of labels, $\\{-1, +1\\}$. In the regression problem the mapping is into $\\mathbb{R}$. Our goal in the uniclass problem is to find a center-point in $\\mathbb{R}^n$ with a small Euclidean distance to all of the instances.\n\nWe first describe the classification and regression problems. For classification and regression we restrict ourselves to mappings based on a weight vector $w \\in \\mathbb{R}^n$, namely the mapping $f : \\mathbb{R}^n \\to \\mathbb{R}$ takes the form $f(x) = w \\cdot x$. After receiving $x_t$ we extend a prediction $\\hat{y}_t$ using $f$. 
For regression the prediction is simply $\\hat{y}_t = f(x_t)$ while for classification $\\hat{y}_t = \\mathrm{sign}(f(x_t))$. After extending the prediction $\\hat{y}_t$, we receive the true outcome $y_t$. We then suffer an instantaneous loss based on the discrepancy between $y_t$ and $f(x_t)$. The goal of the online learning algorithm is to minimize the cumulative loss. The losses we discuss in this paper depend on a pre-defined insensitivity parameter $\\epsilon$ and are denoted $\\ell_\\epsilon(w; (x, y))$. For regression the $\\epsilon$-insensitive loss is,\n\n$$\\ell_\\epsilon(w; (x, y)) = \\begin{cases} 0 & |y - w \\cdot x| \\le \\epsilon \\\\ |y - w \\cdot x| - \\epsilon & \\text{otherwise} \\end{cases} \\tag{1}$$\n\nwhile for classification the $\\epsilon$-insensitive loss is defined to be,\n\n$$\\ell_\\epsilon(w; (x, y)) = \\begin{cases} 0 & y(w \\cdot x) \\ge \\epsilon \\\\ \\epsilon - y(w \\cdot x) & \\text{otherwise} \\end{cases} \\tag{2}$$\n\nAs in other online algorithms the weight vector $w$ is updated after receiving the feedback $y_t$. Therefore, we denote by $w_t$ the vector used for prediction on round $t$. We leave the details on the form this update takes to later sections.\n\nProblem | Example ($z_t$) | Discrepancy ($\\delta$) | Update Direction ($v_t$)\nClassification | $(x_t, y_t) \\in \\mathbb{R}^n \\times \\{-1, +1\\}$ | $-y_t(w_t \\cdot x_t)$ | $y_t x_t$\nRegression | $(x_t, y_t) \\in \\mathbb{R}^n \\times \\mathbb{R}$ | $\\lvert y_t - w_t \\cdot x_t \\rvert$ | $\\mathrm{sign}(y_t - w_t \\cdot x_t)\\, x_t$\nUniclass | $(x_t, y_t) \\in \\mathbb{R}^n \\times \\{1\\}$ | $\\lVert x_t - w_t \\rVert$ | $(x_t - w_t)/\\lVert x_t - w_t \\rVert$\n\nTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.\n\nThe setting for uniclass is slightly different as we only observe a sequence of instances. The goal of the uniclass algorithm is to find a center-point $w$ such that all instances $x_t$ fall within a radius of $\\epsilon$ from $w$. Since we employ the framework of online learning the vector $w$ is constructed incrementally. 
The vector $w_t$ therefore plays the role of the instantaneous center and is adapted after observing each instance $x_t$. If an example $x_t$ falls within a Euclidean distance $\\epsilon$ from $w_t$ then we suffer no loss. Otherwise, the loss is the distance between $x_t$ and a ball of radius $\\epsilon$ centered at $w_t$. Formally, the uniclass loss is,\n\n$$\\ell_\\epsilon(w_t; x_t) = \\begin{cases} 0 & \\|x_t - w_t\\| \\le \\epsilon \\\\ \\|x_t - w_t\\| - \\epsilon & \\text{otherwise} \\end{cases} \\tag{3}$$\n\nIn the next sections we give additive and multiplicative online algorithms for the above learning problems and prove respective online loss bounds. A common thread of our approach is a unified view of all three tasks which leads to a single algorithmic framework with a common analysis.\n\nRelated work: Our work builds on numerous techniques from online learning. The updates we derive are based on an optimization problem directly related to the one employed by Support Vector Machines [15]. Li and Long [14] were among the first to suggest the idea of converting a batch optimization problem into an online task. Our work borrows ideas from the work of Warmuth and colleagues [11]. In particular, Gentile and Warmuth [6] generalized and adapted techniques from [11] to the hinge loss which is closely related to the losses defined in Eqs. (1)-(3). Kivinen et al. [10] discussed a general framework for gradient-based online learning where some of their bounds bear similarities to the bounds presented in this paper. Our work also generalizes and greatly improves online loss bounds for classification given in [3]. Herbster [8] suggested an algorithm for classification and regression that is equivalent to one of the algorithms given in this paper; however, the loss bound derived by Herbster is somewhat weaker. Finally, we would like to note that similar algorithms have been devised in the convex optimization community (cf. [1, 2]). 
The main difference between these algorithms and the online algorithms presented in this paper lies in the analysis: while we derive worst case, finite horizon loss bounds, the optimization community is mostly concerned with asymptotic convergence properties.\n\n2 A Unified Loss\n\nThe three problems described in the previous section share common algebraic properties which we explore in this section. The end result is a common algorithmic framework that is applicable to all three problems and an accompanying analysis (Sec. 3). Let $z_t = (x_t, y_t)$ denote the instance-target pair received on round $t$ where in the case of uniclass we set $y_t = 1$ as a placeholder. For a given example $z_t$, let $\\delta(w; z_t)$ denote the discrepancy of $w$ on $z_t$: for classification we set the discrepancy to be $-y_t(w_t \\cdot x_t)$ (the negative of the margin), for regression it is $|y_t - w_t \\cdot x_t|$, and for uniclass $\\|x_t - w_t\\|$. Fixing $z_t$, we also view $\\delta(w; z_t)$ as a convex function of $w$. Let $[a]_+$ be the function that equals $a$ whenever $a > 0$ and otherwise equals zero. Using the discrepancies defined above, the three different losses given in Eqs. (1)-(3) can all be written as $\\ell_\\epsilon(w; z) = [\\delta(w; z) - \\epsilon]_+$, where for classification we set $\\epsilon \\leftarrow -\\epsilon$ since the discrepancy is defined as the negative of the margin. While this construction might seem a bit odd for classification, it is very useful in unifying the three problems. To conclude, the loss in all three problems can be derived by applying the same hinge loss to different (problem dependent) discrepancies.\n\n3 An Additive Algorithm for the Realizable Case\n\nEquipped with the simple unified notion of loss we describe in this section a single online algorithm that is applicable to all three problems. The algorithm and the analysis we present in this section assume that there exist a weight vector $w^\\star$ and an insensitivity parameter $\\epsilon^\\star$ for which the data is perfectly realizable. Namely, we assume that $\\ell_{\\epsilon^\\star}(w^\\star; z_t) = 0$ for all $t$ which implies that,\n\n$$y_t(w^\\star \\cdot x_t) \\ge |\\epsilon^\\star| \\;\\text{(Class.)} \\qquad |y_t - w^\\star \\cdot x_t| \\le \\epsilon^\\star \\;\\text{(Reg.)} \\qquad \\|x_t - w^\\star\\| \\le \\epsilon^\\star \\;\\text{(Unic.)} \\tag{4}$$\n\nA modification of the algorithm for the unrealizable case is given in Sec. 5.\n\nThe general method we use for deriving our on-line update rule is to define the new weight vector $w_{t+1}$ as the solution to the following projection problem\n\n$$w_{t+1} = \\operatorname*{argmin}_{w} \\tfrac{1}{2}\\|w - w_t\\|^2 \\quad \\text{s.t.} \\quad \\ell_\\epsilon(w; z_t) = 0 , \\tag{5}$$\n\nnamely, $w_{t+1}$ is set to be the projection of $w_t$ onto the set of all weight vectors that attain a loss of zero. We denote this set by $C$. For the case of classification, $C$ is a half space, $C = \\{w : -y_t w \\cdot x_t \\le \\epsilon\\}$. For regression $C$ is an $\\epsilon$-hyper-slab, $C = \\{w : |w \\cdot x_t - y_t| \\le \\epsilon\\}$ and for uniclass it is a ball of radius $\\epsilon$ centered at $x_t$, $C = \\{w : \\|w - x_t\\| \\le \\epsilon\\}$. In Fig. 2 we illustrate the projection for the three cases. This optimization problem attempts to keep $w_{t+1}$ as close to $w_t$ as possible, while forcing $w_{t+1}$ to achieve a zero loss on the most recent example. The resulting algorithm is passive whenever the loss is zero, that is, $w_{t+1} = w_t$ whenever $\\ell_\\epsilon(w_t; z_t) = 0$. In contrast, on rounds for which $\\ell_\\epsilon(w_t; z_t) > 0$ we aggressively force $w_{t+1}$ to satisfy the constraint $\\ell_\\epsilon(w_{t+1}; z_t) = 0$. Therefore we name the algorithm passive-aggressive or PA for short. In the following we show that for the three problems described above the solution to the optimization problem in Eq. (5) yields the following update rule,\n\n$$w_{t+1} = w_t + \\tau_t v_t , \\tag{6}$$\n\nwhere $v_t$ is minus the gradient of the discrepancy and $\\tau_t = \\ell_\\epsilon(w_t; z_t)/\\|v_t\\|^2$. (Note that although the discrepancy might not be differentiable everywhere, its gradient exists whenever the loss is greater than zero).\n\nParameter: Insensitivity: $\\epsilon$\nInitialize: Set $w_1 = 0$ (R&C) ; $w_1 = x_0$ (U)\nFor $t = 1, 2, \\ldots$\n  \u2022 Get a new instance: $z_t \\in \\mathbb{R}^n$\n  \u2022 Suffer loss: $\\ell_\\epsilon(w_t; z_t)$\n  \u2022 If $\\ell_\\epsilon(w_t; z_t) > 0$ :\n      1. Set $v_t$ (see Table 1)\n      2. Set $\\tau_t = \\ell_\\epsilon(w_t; z_t)/\\|v_t\\|^2$\n      3. Update: $w_{t+1} = w_t + \\tau_t v_t$\n\nFigure 1: The additive PA algorithm.\n\nFigure 2: An illustration of the update: $w_{t+1}$ is found by projecting the current vector $w_t$ onto the set of vectors attaining a zero loss on $z_t$. This set is a stripe in the case of regression, a half-space for classification, and a ball for uniclass.\n\nTo see that the update from Eq. (6) is the solution to the problem defined by Eq. (5), first note that the equality constraint $\\ell_\\epsilon(w; z_t) = 0$ is equivalent to the inequality constraint $\\delta(w; z_t) \\le \\epsilon$. The Lagrangian of the optimization problem is\n\n$$\\mathcal{L}(w, \\tau) = \\tfrac{1}{2}\\|w - w_t\\|^2 + \\tau (\\delta(w; z_t) - \\epsilon) , \\tag{7}$$\n\nwhere $\\tau \\ge 0$ is a Lagrange multiplier. To find a saddle point of $\\mathcal{L}$ we first differentiate $\\mathcal{L}$ with respect to $w$ and use the fact that $v_t$ is minus the gradient of the discrepancy to get,\n\n$$\\nabla_w \\mathcal{L} = w - w_t + \\tau \\nabla_w \\delta = 0 \\;\\Rightarrow\\; w = w_t + \\tau v_t .$$\n\nTo find the value of $\\tau$ we use the KKT conditions. 
Hence, whenever $\\tau$ is positive (as in the case of non-zero loss), the inequality constraint, $\\delta(w; z_t) \\le \\epsilon$, becomes an equality. Simple algebraic manipulations yield that the value $\\tau$ for which $\\delta(w; z_t) = \\epsilon$ for all three problems is equal to, $\\tau_t = \\ell_\\epsilon(w_t; z_t)/\\|v_t\\|^2$. A summary of the discrepancy functions and their respective updates is given in Table 1. The pseudo-code of the additive algorithm for all three settings is given in Fig. 1.\n\nWe now discuss the initialization of $w_1$. For classification and regression a reasonable choice for $w_1$ is the zero vector. However, in the case of uniclass initializing $w_1$ to be the zero vector might incur large losses if, for instance, all the instances are located far away from the origin. A more sensible choice for uniclass is to initialize $w_1$ to be one of the examples. For simplicity of the description we assume that we are provided with an example $x_0$ prior to the run of the algorithm and initialize $w_1 = x_0$.\n\nTo conclude this section we note that for all three cases the weight vector $w_t$ is a linear combination of the instances. This representation enables us to employ kernels [15].\n\n4 Analysis\n\nThe following theorem provides a unified loss bound for all three settings. After proving the theorem we discuss a few of its implications.\n\nTheorem 1 Let $z_1, z_2, \\ldots, z_t, \\ldots$ be a sequence of examples for one of the problems described in Table 1. Assume that there exist $w^\\star$ and $\\epsilon^\\star$ such that $\\ell_{\\epsilon^\\star}(w^\\star; z_t) = 0$ for all $t$. Then if the additive PA algorithm is run with $\\epsilon \\ge \\epsilon^\\star$, the following bound holds for any $T \\ge 1$\n\n$$\\sum_{t=1}^{T} (\\ell_\\epsilon(w_t; z_t))^2 + 2(\\epsilon - \\epsilon^\\star) \\sum_{t=1}^{T} \\ell_\\epsilon(w_t; z_t) \\le B \\|w^\\star - w_1\\|^2 , \\tag{8}$$\n\nwhere for classification and regression $B$ is a bound on the squared norm of the instances ($\\forall t : B \\ge \\|x_t\\|_2^2$) and $B = 1$ for uniclass.\n\nProof: Define $\\Delta_t = \\|w_t - w^\\star\\|^2 - \\|w_{t+1} - w^\\star\\|^2$. We prove the theorem by bounding $\\sum_{t=1}^{T} \\Delta_t$ from above and below. First note that $\\sum_{t=1}^{T} \\Delta_t$ is a telescopic sum and therefore\n\n$$\\sum_{t=1}^{T} \\Delta_t = \\|w_1 - w^\\star\\|^2 - \\|w_{T+1} - w^\\star\\|^2 \\le \\|w_1 - w^\\star\\|^2 . \\tag{9}$$\n\nThis provides an upper bound on $\\sum_t \\Delta_t$. In the following we prove the lower bound\n\n$$\\Delta_t \\ge \\frac{\\ell_\\epsilon(w_t; z_t)}{B} \\left( \\ell_\\epsilon(w_t; z_t) + 2(\\epsilon - \\epsilon^\\star) \\right) . \\tag{10}$$\n\nFirst note that we do not modify $w_t$ if $\\ell_\\epsilon(w_t; z_t) = 0$. Therefore, this inequality trivially holds when $\\ell_\\epsilon(w_t; z_t) = 0$ and thus we can restrict ourselves to rounds on which the discrepancy is larger than $\\epsilon$, which implies that $\\ell_\\epsilon(w_t; z_t) = \\delta(w_t; z_t) - \\epsilon$. Let $t$ be such a round then by rewriting $w_{t+1}$ as $w_t + \\tau_t v_t$ we get,\n\n$$\\begin{aligned} \\Delta_t &= \\|w_t - w^\\star\\|^2 - \\|w_{t+1} - w^\\star\\|^2 = \\|w_t - w^\\star\\|^2 - \\|w_t + \\tau_t v_t - w^\\star\\|^2 \\\\ &= \\|w_t - w^\\star\\|^2 - \\left( \\tau_t^2 \\|v_t\\|^2 + 2\\tau_t (v_t \\cdot (w_t - w^\\star)) + \\|w_t - w^\\star\\|^2 \\right) \\\\ &= -\\tau_t^2 \\|v_t\\|^2 + 2\\tau_t v_t \\cdot (w^\\star - w_t) . \\end{aligned} \\tag{11}$$\n\nUsing the fact that $-v_t$ is the gradient of the convex function $\\delta(w; z_t)$ at $w_t$ we have,\n\n$$\\delta(w^\\star; z_t) - \\delta(w_t; z_t) \\ge (-v_t) \\cdot (w^\\star - w_t) . \\tag{12}$$\n\nAdding and subtracting $\\epsilon$ from the left-hand side of Eq. (12) and rearranging we get,\n\n$$v_t \\cdot (w^\\star - w_t) \\ge \\delta(w_t; z_t) - \\epsilon + \\epsilon - \\delta(w^\\star; z_t) . \\tag{13}$$\n\nRecall that $\\delta(w_t; z_t) - \\epsilon = \\ell_\\epsilon(w_t; z_t)$ and that $\\epsilon^\\star \\ge \\delta(w^\\star; z_t)$. Therefore,\n\n$$(\\delta(w_t; z_t) - \\epsilon) + (\\epsilon - \\delta(w^\\star; z_t)) \\ge \\ell_\\epsilon(w_t; z_t) + (\\epsilon - \\epsilon^\\star) . \\tag{14}$$\n\nCombining Eq. (11) with Eqs. (13)-(14) we get\n\n$$\\begin{aligned} \\Delta_t &\\ge -\\tau_t^2 \\|v_t\\|^2 + 2\\tau_t \\left( \\ell_\\epsilon(w_t; z_t) + (\\epsilon - \\epsilon^\\star) \\right) \\\\ &= \\tau_t \\left( -\\tau_t \\|v_t\\|^2 + 2\\ell_\\epsilon(w_t; z_t) + 2(\\epsilon - \\epsilon^\\star) \\right) . \\end{aligned} \\tag{15}$$\n\nPlugging $\\tau_t = \\ell_\\epsilon(w_t; z_t)/\\|v_t\\|^2$ into Eq. (15) we get\n\n$$\\Delta_t \\ge \\frac{\\ell_\\epsilon(w_t; z_t)}{\\|v_t\\|^2} \\left( \\ell_\\epsilon(w_t; z_t) + 2(\\epsilon - \\epsilon^\\star) \\right) .$$\n\nFor uniclass $\\|v_t\\|^2$ is always equal to 1 by construction and for classification and regression we have $\\|v_t\\|^2 = \\|x_t\\|^2 \\le B$ which gives,\n\n$$\\Delta_t \\ge \\frac{\\ell_\\epsilon(w_t; z_t)}{B} \\left( \\ell_\\epsilon(w_t; z_t) + 2(\\epsilon - \\epsilon^\\star) \\right) .$$\n\nComparing the above lower bound with the upper bound in Eq. (9) we get\n\n$$\\sum_{t=1}^{T} (\\ell_\\epsilon(w_t; z_t))^2 + \\sum_{t=1}^{T} 2(\\epsilon - \\epsilon^\\star) \\ell_\\epsilon(w_t; z_t) \\le B \\|w^\\star - w_1\\|^2 .$$\n\nThis concludes the proof.\n\nLet us now discuss the implications of Thm. 1. We first focus on the classification case. Due to the realizability assumption, there exist $w^\\star$ and $\\epsilon^\\star$ such that for all $t$, $\\ell_{\\epsilon^\\star}(w^\\star; z_t) = 0$ which implies that $y_t(w^\\star \\cdot x_t) \\ge -\\epsilon^\\star$. Dividing $w^\\star$ by its norm we can rewrite the latter as $y_t(\\hat{w}^\\star \\cdot x_t) \\ge \\hat{\\epsilon}^\\star$ where $\\hat{w}^\\star = w^\\star/\\|w^\\star\\|$ and $\\hat{\\epsilon}^\\star = |\\epsilon^\\star|/\\|w^\\star\\|$. The parameter $\\hat{\\epsilon}^\\star$ is often referred to as the margin of a unit-norm separating hyperplane. 
Now, setting $\\epsilon = -1$ we get that $\\ell_\\epsilon(w; z) = [1 - y(w \\cdot x)]_+$, which is the hinge loss for classification. We now use Thm. 1 to obtain two loss bounds for the hinge loss in a classification setting. First, note that by also setting $w^\\star = \\hat{w}^\\star/\\hat{\\epsilon}^\\star$ and thus $\\epsilon^\\star = -1$ we get that the second term on the left hand side of Eq. (8) vanishes as $\\epsilon^\\star = \\epsilon = -1$ and thus,\n\n$$\\sum_{t=1}^{T} ([1 - y_t(w_t \\cdot x_t)]_+)^2 \\le B \\|w^\\star\\|^2 = \\frac{B}{(\\hat{\\epsilon}^\\star)^2} . \\tag{17}$$\n\nWe thus have obtained a bound on the squared hinge loss. The same bound was also derived by Herbster [8]. We can immediately use this bound to derive a mistake bound for the PA algorithm. Note that the algorithm makes a prediction mistake iff $y_t(w_t \\cdot x_t) \\le 0$. In this case, $[1 - y_t(w_t \\cdot x_t)]_+ \\ge 1$ and therefore the number of prediction mistakes is bounded by $B/(\\hat{\\epsilon}^\\star)^2$. This bound is common to online algorithms for classification such as ROMMA [14].\n\nWe can also manipulate the result of Thm. 1 to obtain a direct bound on the hinge loss. Using again $\\epsilon = -1$ and omitting the first term in the left hand side of Eq. (8) we get,\n\n$$\\sum_{t=1}^{T} [1 - y_t(w_t \\cdot x_t)]_+ \\le \\frac{B \\|w^\\star\\|^2}{2(-1 - \\epsilon^\\star)} .$$\n\nBy setting $w^\\star = 2\\hat{w}^\\star/\\hat{\\epsilon}^\\star$, which implies that $\\epsilon^\\star = -2$, we can further simplify the above to get a bound on the cumulative hinge loss,\n\n$$\\sum_{t=1}^{T} [1 - y_t(w_t \\cdot x_t)]_+ \\le 2 \\frac{B}{(\\hat{\\epsilon}^\\star)^2} .$$\n\nTo conclude this section, we would like to point out that the PA online algorithm can also be used as a building block for a batch algorithm. Concretely, let $S = \\{z_1, \\ldots, z_m\\}$ be a fixed training set and let $\\beta \\in \\mathbb{R}$ be a small positive number. We start with an initial weight vector $w_1$ and then invoke the PA algorithm as follows. 
We choose an example $z \\in S$ such that $\\ell_\\epsilon(w_1; z)^2 > \\beta$ and present $z$ to the PA algorithm. We repeat this process and obtain $w_2, w_3, \\ldots$ until the $T$\u2019th iteration on which for all $z \\in S$, $\\ell_\\epsilon(w_T; z)^2 \\le \\beta$. The output of the batch algorithm is $w_T$. Due to the bound of Thm. 1, $T$ is at most $\\lceil B \\|w^\\star - w_1\\|^2/\\beta \\rceil$ and by construction the loss of $w_T$ on any $z \\in S$ is at most $\\sqrt{\\beta}$. Moreover, in the following lemma we show that the norm of $w_T$ cannot be too large. Since $w_T$ achieves a small empirical loss and its norm is small, it can be shown using classical techniques (cf. [15]) that the loss of $w_T$ on unseen data is small as well.\n\nLemma 2 Under the same conditions of Thm. 1, the following bound holds for any $T \\ge 1$\n\n$$\\|w_T - w_1\\| \\le 2 \\|w^\\star - w_1\\| .$$\n\nProof: First note that the inequality trivially holds for $T = 1$ and thus we focus on the case $T > 1$. We use the definition of $\\Delta_t$ from the proof of Thm. 1. Eq. (10) implies that $\\Delta_t$ is non-negative for all $t$. Therefore, we get from Eq. (9) that\n\n$$0 \\le \\sum_{t=1}^{T-1} \\Delta_t = \\|w_1 - w^\\star\\|^2 - \\|w_T - w^\\star\\|^2 . \\tag{18}$$\n\nRearranging the terms in Eq. (18) we get that $\\|w_T - w^\\star\\| \\le \\|w^\\star - w_1\\|$. Finally, we use the triangle inequality to get the bound,\n\n$$\\begin{aligned} \\|w_T - w_1\\| &= \\|(w_T - w^\\star) + (w^\\star - w_1)\\| \\\\ &\\le \\|w_T - w^\\star\\| + \\|w^\\star - w_1\\| \\le 2 \\|w^\\star - w_1\\| . \\end{aligned}$$\n\nThis concludes the proof.\n\n5 A Modification for the Unrealizable Case\n\nWe now briefly describe an algorithm for the unrealizable case. This algorithm applies only to regression and classification problems. The case of uniclass is more involved and will be discussed in detail elsewhere. The algorithm employs two parameters. The first is the insensitivity parameter $\\epsilon$ which defines the loss function as in the realizable case. However, in this case we do not assume that there exists $w^\\star$ that achieves zero loss over the sequence. We instead measure the loss of the online algorithm relative to the loss of any vector $w^\\star$. The second parameter, $\\gamma > 0$, is a relaxation parameter. Before describing the effect of this parameter we define the update step for the unrealizable case. As in the realizable case, the algorithm is conservative. That is, if the loss on example $z_t$ is zero then $w_{t+1} = w_t$. In case the loss is positive the update rule is $w_{t+1} = w_t + \\tau_t v_t$ where $v_t$ is the same as in the realizable case. However, the scaling factor $\\tau_t$ is modified and is set to,\n\n$$\\tau_t = \\frac{\\ell_\\epsilon(w_t; z_t)}{\\|v_t\\|^2 + \\gamma} .$$\n\nThe following theorem provides a loss bound for the online algorithm relative to the loss of any fixed weight vector $w^\\star$.\n\nTheorem 3 Let $z_1 = (x_1, y_1), z_2 = (x_2, y_2), \\ldots, z_t = (x_t, y_t), \\ldots$ be a sequence of classification or regression examples. Let $w^\\star$ be any vector in $\\mathbb{R}^n$. Then if the PA algorithm for the unrealizable case is run with $\\epsilon$, and with $\\gamma > 0$, the following bound holds for any $T \\ge 1$ and a constant $B$ satisfying $B \\ge \\|x_t\\|^2$,\n\n$$\\sum_{t=1}^{T} (\\ell_\\epsilon(w_t; z_t))^2 \\le (\\gamma + B) \\|w^\\star - w_1\\|^2 + \\left( 1 + \\frac{B}{\\gamma} \\right) \\sum_{t=1}^{T} (\\ell_\\epsilon(w^\\star; z_t))^2 . \\tag{19}$$\n\nThe proof of the theorem is based on a reduction to the realizable case (cf. [4, 13, 14]) and is omitted due to the lack of space.\n\n6 Extensions\n\nThere are numerous potential extensions to our approach. For instance, if all the components of the instances are non-negative we can derive a multiplicative version of the PA algorithm. The multiplicative PA algorithm maintains a weight vector $w_t \\in \\mathbb{P}^n$ where $\\mathbb{P}^n = \\{x : x \\in \\mathbb{R}^n_+, \\sum_{j=1}^{n} x_j = 1\\}$. The multiplicative update of $w_t$ is,\n\n$$w_{t+1,j} = (1/Z_t) \\, w_{t,j} \\, e^{\\tau_t v_{t,j}} ,$$\n\nwhere $v_t$ is the same as the one used in the additive algorithm (Table 1), $\\tau_t$ now becomes $4\\ell_\\epsilon(w_t; z_t)/\\|v_t\\|_\\infty^2$ for regression and classification and $\\ell_\\epsilon(w_t; z_t)/(8\\|v_t\\|_\\infty^2)$ for uniclass and $Z_t = \\sum_{j=1}^{n} w_{t,j} e^{\\tau_t v_{t,j}}$ is a normalization factor. For the multiplicative PA we can prove the following loss bound.\n\nTheorem 4 Let $z_1, z_2, \\ldots, z_t = (x_t, y_t), \\ldots$ be a sequence of examples such that $x_{t,j} \\ge 0$ for all $t$. Let $D_{RE}(w \\| w') = \\sum_j w_j \\log(w_j/w'_j)$ denote the relative entropy between $w$ and $w'$. Assume that there exist $w^\\star$ and $\\epsilon^\\star$ such that $\\ell_{\\epsilon^\\star}(w^\\star; z_t) = 0$ for all $t$. Then when the multiplicative version of the PA algorithm is run with $\\epsilon > \\epsilon^\\star$, the following bound holds for any $T \\ge 1$,\n\n$$\\sum_{t=1}^{T} (\\ell_\\epsilon(w_t; z_t))^2 + 2(\\epsilon - \\epsilon^\\star) \\sum_{t=1}^{T} \\ell_\\epsilon(w_t; z_t) \\le \\frac{1}{2} B \\, D_{RE}(w^\\star \\| w_1) ,$$\n\nwhere for classification and regression $B$ is a bound on the square of the infinity norm of the instances ($\\forall t : B \\ge \\|x_t\\|_\\infty^2$) and $B = 16$ for uniclass.\n\nThe proof of the theorem is rather technical and uses the proof technique of Thm. 1 in conjunction with inequalities on the logarithm of $Z_t$ (see for instance [7, 11, 9]).\n\nAn interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [14] and ALMA [5]. Another, rather general direction for possible extension surfaces when replacing the Euclidean distance between $w_{t+1}$ and $w_t$ with other distances and divergences such as the Bregman divergence. The resulting optimization problem may be solved via Bregman projections. In this case it might be possible to derive general loss bounds, see for example [12]. 
We are currently exploring generalizations of our framework to other decision tasks such as distance-learning [16] and online convex programming [17].\n\nReferences\n\n[1] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving convex feasibility problems. SIAM Review, 1996.\n\n[2] Y. Censor and S. A. Zenios. Parallel Optimization. Oxford University Press, 1997.\n\n[3] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951\u2013991, 2003.\n\n[4] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277\u2013296, 1999.\n\n[5] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213\u2013242, 2001.\n\n[6] C. Gentile and M. Warmuth. Linear hinge loss and average margin. In NIPS\u201998.\n\n[7] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. A comparison of new and old algorithms for a mixture estimation problem. In COLT\u201995.\n\n[8] M. Herbster. Learning additive models online with fast evaluating kernels. In COLT\u201901.\n\n[9] J. Kivinen, D. P. Helmbold, and M. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291\u20131304, 1999.\n\n[10] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. In NIPS\u201902.\n\n[11] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201364, January 1997.\n\n[12] J. Kivinen and M. K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301\u2013329, July 2001.\n\n[13] N. Klasner and H. U. Simon. From noise-free to noise-tolerant and from on-line to batch learning. In COLT\u201995.\n\n[14] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1\u20133):361\u2013387, 2002.\n\n[15] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n\n[16] E. Xing, A. Y. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS\u201903.\n\n[17] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML\u201903.", "award": [], "sourceid": 2360, "authors": [{"given_name": "Shai", "family_name": "Shalev-shwartz", "institution": null}, {"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Ofer", "family_name": "Dekel", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}