{"title": "Online Classification for Complex Problems Using Simultaneous Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": null, "full_text": "Online Classi\ufb01cation for Complex Problems Using\n\nSimultaneous Projections\n\nYonatan Amit1\n\nShai Shalev-Shwartz1 Yoram Singer1,2\n\n1 School of Computer Sci. & Eng., The Hebrew University, Jerusalem 91904, Israel\n\n2 Google Inc. 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA\n\n{mitmit,shais,singer}@cs.huji.ac.il\n\nAbstract\n\nWe describe and analyze an algorithmic framework for online classi\ufb01cation where\neach online trial consists of multiple prediction tasks that are tied together. We\ntackle the problem of updating the online hypothesis by de\ufb01ning a projection\nproblem in which each prediction task corresponds to a single linear constraint.\nThese constraints are tied together through a single slack parameter. We then in-\ntroduce a general method for approximately solving the problem by projecting\nsimultaneously and independently on each constraint which corresponds to a pre-\ndiction sub-problem, and then averaging the individual solutions. We show that\nthis approach constitutes a feasible, albeit not necessarily optimal, solution for the\noriginal projection problem. We derive concrete simultaneous projection schemes\nand analyze them in the mistake bound model. We demonstrate the power of\nthe proposed algorithm in experiments with online multiclass text categorization.\nOur experiments indicate that a combination of class-dependent features with the\nsimultaneous projection method outperforms previously studied algorithms.\n\n1 Introduction\n\nIn this paper we discuss and analyze a framework for devising ef\ufb01cient online learning algorithms\nfor complex prediction problems such as multiclass categorization. 
In the settings we cover, a complex prediction problem is cast as the task of simultaneously coping with multiple simplified sub-problems which are nonetheless tied together. For example, in multiclass categorization, the task is to predict a single label out of k possible outcomes. Our simultaneous projection approach is based on the fact that we can retrospectively (after making a prediction) cast the problem as the task of making k − 1 binary decisions, each of which involves the correct label and one of the competing labels. The performance of the k − 1 predictions is measured through a single loss. Our approach stands in contrast to previously studied methods, which can roughly be partitioned into three paradigms. The first, and probably the simplest, previously studied approach is to break the problem into multiple decoupled problems that are solved independently. Such an approach was used, for instance, by Weston and Watkins [1] for batch learning of multiclass support vector machines. The simplicity of this approach also underscores its deficiency, as it is detached from the original loss of the complex decision problem. The second approach maintains the original structure of the problem but focuses on a single, worst-performing, derived sub-problem (see for instance [2]). While this approach adheres to the original structure of the problem, the resulting update mechanism is by construction sub-optimal, as it overlooks almost all of the constraints imposed by the complex prediction problem. (See also [6] for an analysis and explanation of the sub-optimality of this approach.) The third approach for dealing with complex problems is to tailor a specific efficient solution for the problem at hand.
While this approach has yielded efficient learning algorithms for multiclass categorization problems [2] and elegant solutions for structured output problems [3, 4], devising these algorithms required dedicated efforts. Moreover, tailored solutions typically impose rather restrictive assumptions on the representation of the data in order to yield efficient algorithmic solutions.

In contrast to previously studied approaches, we propose a simple, general, and efficient framework for online learning of a wide variety of complex problems. We do so by casting the online update task as an optimization problem in which the newly devised hypothesis is required to be similar to the current hypothesis while attaining a small loss on multiple binary prediction problems. Casting the online learning task as a sequence of instantaneous optimization problems was first suggested and analyzed by Kivinen and Warmuth [12] for binary classification and regression problems. In our optimization-based approach, the complex decision problem is cast as an optimization problem that consists of multiple linear constraints, each of which represents a simplified sub-problem. These constraints are tied through a single slack variable whose role is to assess the overall prediction quality for the complex problem. We describe and analyze a family of two-phase algorithms. In the first phase, the algorithms simultaneously solve multiple sub-problems. Each sub-problem distills to an optimization problem with a single linear constraint from the original multiple-constraints problem. The simple structure of each single-constraint problem results in an analytical solution which is efficiently computable. In the second phase, the algorithms take a convex combination of the independent solutions to obtain a solution for the multiple-constraints problem.
The end result is an approach whose time complexity and mistake bounds are equivalent to approaches which solely deal with the worst-violating constraint [9]. In practice, though, the performance of the simultaneous projection framework is much better than that of single-constraint update schemes.

2 Problem Setting

In this section we introduce the notation used throughout the paper and formally describe our problem setting. We denote vectors by lower-case bold-face letters (e.g. x and ω), where the j'th element of x is denoted by x_j. We denote matrices by upper-case bold-face letters (e.g. X), where the j'th row of X is denoted by x_j. The set of integers {1, . . . , k} is denoted by [k]. Finally, we use the hinge function [a]_+ = max{0, a}.

Online learning is performed in a sequence of trials. At trial t the algorithm receives a matrix X^t of size k_t × n, where each row of X^t is an instance, and is required to make a prediction on the label associated with each instance. We denote the vector of predicted labels by ŷ^t. We allow ŷ^t_j to take any value in ℝ, where the actual label being predicted is sign(ŷ^t_j) and |ŷ^t_j| is the confidence in the prediction. After making a prediction ŷ^t the algorithm receives the correct labels y^t, where y^t_j ∈ {−1, 1} for all j ∈ [k_t]. In this paper we assume that the predictions in each trial are formed by calculating the inner product between a weight vector ω^t ∈ ℝ^n and each instance in X^t, thus ŷ^t = X^t ω^t. Our goal is to perfectly predict the entire vector y^t. We thus say that the vector ŷ^t was imperfectly predicted if there exists an outcome j such that y^t_j ≠ sign(ŷ^t_j). That is, we suffer a unit loss on trial t if there exists j such that sign(ŷ^t_j) ≠ y^t_j. Directly minimizing this combinatorial error is a computationally difficult task.
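To make the per-trial protocol concrete, here is a minimal sketch of the prediction step and the unit (combinatorial) loss just described. This is our illustration, not the authors' code; it assumes NumPy and counts a zero-confidence prediction as a mistake.

```python
import numpy as np

def predict_trial(w, X):
    """Predict confidences for all k_t instances of one trial: y_hat = X w."""
    return X @ w

def unit_loss(y_hat, y):
    """Unit (combinatorial) loss: 1 if any instance's sign disagrees with
    its label, 0 if the whole label vector is predicted perfectly."""
    return int(np.any(np.sign(y_hat) != y))
```

For example, with w = (1, 0), instances X = [[1, 0], [−1, 0]], and labels y = (1, −1), the trial is perfect and the unit loss is 0; flipping the second label to +1 makes the loss 1.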
Therefore, we use an adaptation of the hinge-loss, defined as

  ℓ(ŷ^t, y^t) = max_{j∈[k_t]} [1 − y^t_j ŷ^t_j]_+ ,

as a proxy for the combinatorial error. The quantity y^t_j ŷ^t_j is often referred to as the (signed) margin of the prediction and ties together the correctness and the confidence of the prediction. We use ℓ(ω; (X^t, y^t)) to denote ℓ(ŷ^t, y^t) where ŷ^t = X^t ω. We also denote the set of instances whose labels were predicted incorrectly by M^t = {j | sign(ŷ^t_j) ≠ y^t_j}, and similarly the set of instances whose hinge-losses are greater than zero by Γ^t = {j | [1 − y^t_j ŷ^t_j]_+ > 0}.

3 Derived Problems

In this section we further explore the motivation for our problem setting by describing two different complex decision tasks and showing how they can be cast as special cases of our setting. We would also like to note that our approach can be employed in other prediction problems (see Sec. 7).

Multilabel Categorization. In the multilabel categorization task each instance is associated with a set of relevant labels from the set [k]. The multilabel categorization task can be cast as a special case of a ranking task in which the goal is to rank the relevant labels above the irrelevant ones. Many learning algorithms for this task employ class-dependent features (for example, see [7]). For simplicity, assume that each class is associated with n features and denote by φ(x, r) the feature vector for class r. We would like to note that features obtained for different classes typically convey different information and are often substantially different. A categorizer, or label ranker, is based on a weight vector ω.
A vector ω induces a score ω · φ(x, r) for each class which, in turn, defines an ordering of the classes. A learner is required to build a vector ω that successfully ranks the labels according to their relevance; namely, for each pair of classes (r, s) such that r is relevant while s is not, the class r should be ranked higher than the class s. Thus we require that ω · φ(x, r) > ω · φ(x, s) for every such pair (r, s). We say that a label ranking is imperfect if there exists any pair (r, s) which violates this requirement. The loss associated with each such violation is [1 − (ω · φ(x, r) − ω · φ(x, s))]_+ and the loss of the categorizer is defined as the maximum over the losses induced by the violated pairs. In order to map the problem to our setting, we define a virtual instance for every pair (r, s) such that r is relevant and s is not. The new instance is the n-dimensional vector defined by φ(x, r) − φ(x, s). The label associated with all of these instances is set to 1. It is clear that an imperfect categorizer makes a prediction mistake on at least one of the instances, and that the losses defined by both problems are the same.

Ordinal Regression. In the problem of ordinal regression an instance x is a vector of n features that is associated with a target rank y ∈ [k]. A learning algorithm is required to find a vector ω and k thresholds b_1 ≤ · · · ≤ b_{k−1} ≤ b_k = ∞. The value of ω · x provides a score from which the prediction can be defined as the smallest index i for which ω · x < b_i, that is, ŷ = min{i | ω · x < b_i}. In order to obtain a correct prediction, an ordinal regressor is required to ensure that ω · x ≥ b_i for all i < y and that ω · x < b_i for all i ≥ y.
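The threshold prediction rule just defined — the smallest index i for which ω · x < b_i — can be sketched as follows (our illustration, not the authors' code; NumPy assumed, with b holding the k − 1 finite thresholds and b_k = ∞ appended so the minimum always exists):

```python
import numpy as np

def ordinal_predict(w, b, x):
    """Predict the rank as the smallest index i with w.x < b_i.

    `b` holds the k-1 finite thresholds b_1 <= ... <= b_{k-1};
    b_k = +inf is appended so {i : w.x < b_i} is never empty.
    """
    score = float(np.dot(w, x))
    thresholds = np.append(b, np.inf)
    # ranks are 1-based: argmax returns the first True position
    return int(np.argmax(score < thresholds)) + 1
```

For instance, with w = (1, 1) and thresholds b = (1, 2) (so k = 3), an instance with score 0.5 is ranked 1, score 1.5 is ranked 2, and any score of 2 or more is ranked 3.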
It is considered a prediction mistake if any of these constraints is violated. In order to map the ordinal regression task to our setting, we introduce k − 1 instances. Each instance is a vector in ℝ^{n+k−1}. The first n entries of the vector are set to the elements of x, and the remaining k − 1 entries are set to −δ_{i,j}; that is, the i'th of these entries in the j'th vector is set to −1 if i = j and to 0 otherwise. The label of the first y − 1 instances is 1, while the remaining k − y instances are labeled −1. Once we have learned an expanded vector in ℝ^{n+k−1}, the regressor ω is obtained by taking the first n components of the expanded vector, and the thresholds b_1, . . . , b_{k−1} are set to be the last k − 1 elements. A prediction mistake on any of the instances corresponds to an incorrect rank in the original problem.

Figure 1: Illustration of the simultaneous projections algorithm: each instance casts a constraint on ω and each such constraint defines a halfspace of feasible solutions. We project on each halfspace in parallel and the new vector is a weighted average of these projections.

4 Simultaneous Projection Algorithms

Recall that on trial t the algorithm receives a matrix, X^t, of k_t instances, and predicts ŷ^t = X^t ω^t. After performing its prediction, the algorithm receives the corresponding labels y^t. Each such instance-label pair casts a constraint on ω^t, namely y^t_j (ω^t · x^t_j) ≥ 1. If all the constraints are satisfied by ω^t then ω^{t+1} is set to be ω^t and the algorithm proceeds to the next trial. Otherwise, we would like to set ω^{t+1} as close as possible to ω^t while satisfying all constraints. Such an aggressive approach may be sensitive to outliers and over-fitting.
Thus, we allow some of the constraints to remain violated by introducing a tradeoff between the change to ω^t and the loss attained on (X^t, y^t). Formally, we would like to set ω^{t+1} to be the solution of the following optimization problem: min_{ω∈ℝ^n} ½‖ω − ω^t‖² + C ℓ(ω; (X^t, y^t)), where C is a tradeoff parameter. As we discuss below, this formalism effectively translates into a cap on the maximal change to ω^t. We rewrite the above optimization problem by introducing a single slack variable as follows:

  min_{ω∈ℝ^n, ξ≥0} ½‖ω − ω^t‖² + Cξ  s.t. ∀j ∈ [k_t]: y^t_j (ω · x^t_j) ≥ 1 − ξ, ξ ≥ 0.   (1)

We denote the objective function of Eq. (1) by P^t and refer to it as the instantaneous primal problem to be solved on trial t. The dual optimization problem of P^t is the maximization problem

  max_{α^t_1,...,α^t_{k_t}} Σ_{j=1}^{k_t} α^t_j − ½ ‖ω^t + Σ_{j=1}^{k_t} α^t_j y^t_j x^t_j‖²  s.t. Σ_{j=1}^{k_t} α^t_j ≤ C, ∀j: α^t_j ≥ 0.   (2)

Each dual variable corresponds to a single constraint of the primal problem. The minimizer of the primal problem is calculated from the optimal dual solution as ω^{t+1} = ω^t + Σ_{j=1}^{k_t} α^t_j y^t_j x^t_j. Unfortunately, in the common case, where each x^t_j is in an arbitrary orientation, there does not exist an analytic solution for the dual problem (Eq. (2)). We tackle the problem by breaking it down into k_t reduced problems, each of which focuses on a single dual variable. Formally, for the j'th variable, the j'th reduced problem solves Eq. (2) while fixing α^t_{j'} = 0 for all j' ≠ j. Each reduced optimization problem amounts to the following problem:

  max_{α^t_j} α^t_j − ½ ‖ω^t + α^t_j y^t_j x^t_j‖²  s.t. α^t_j ∈ [0, C].   (3)

We next obtain an exact or approximate solution for each reduced problem as if it were independent of the rest. We then choose a distribution µ^t ∈ Δ_{k_t}, where Δ_{k_t} = {µ ∈ ℝ^{k_t} : Σ_j µ_j = 1, µ_j ≥ 0} is the probability simplex, and multiply each α^t_j by the corresponding µ^t_j. Since µ^t ∈ Δ_{k_t}, this yields a feasible solution to the dual problem defined in Eq. (2) for the following reason: each µ^t_j α^t_j ≥ 0, and the fact that Σ_j α^t_j µ^t_j is a convex combination of values α^t_j ≤ C implies that Σ_{j=1}^{k_t} µ^t_j α^t_j ≤ C. Finally, the algorithm uses the combined solution and sets ω^{t+1} = ω^t + Σ_{j=1}^{k_t} µ^t_j α^t_j y^t_j x^t_j.

Figure 2: Simultaneous projections algorithm.
  Input: Aggressiveness parameter C > 0
  Initialize: ω^1 = (0, . . . , 0)
  For t = 1, 2, . . . , T:
    Receive instance matrix X^t ∈ ℝ^{k_t×n}
    Predict ŷ^t = X^t ω^t
    Receive correct labels y^t
    Suffer loss ℓ(ω^t; (X^t, y^t))
    If ℓ > 0:
      Choose importance weights µ^t ∈ Δ_{k_t}
      Choose individual dual solutions α^t_j
      Update ω^{t+1} = ω^t + Σ_{j=1}^{k_t} µ^t_j α^t_j y^t_j x^t_j

We next present three schemes for obtaining a solution to the reduced problem (Eq. (3)) and combining the solutions into a single update.

Simultaneous Perceptron: The simplest of the update forms generalizes the famous Perceptron algorithm from [8] by setting α^t_j to C if the j'th instance is incorrectly labeled, and to 0 otherwise. We similarly set the weight µ^t_j to 1/|M^t| for j ∈ M^t and to 0 otherwise.
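This Perceptron-style update can be sketched as follows (our illustration, not the authors' code; NumPy assumed, and a zero-confidence prediction is counted as a mistake):

```python
import numpy as np

def sim_perceptron_update(w_t, X_t, y_t, C):
    """Set alpha_j = C for every mistaken instance and average them
    uniformly: w_{t+1} = w_t + (C / |M_t|) * sum_{j in M_t} y_j x_j."""
    y_hat = X_t @ w_t
    M = np.flatnonzero(np.sign(y_hat) != y_t)  # mistaken instances M_t
    if M.size == 0:
        return w_t                              # all constraints satisfied
    mu = 1.0 / M.size                           # uniform weights over M_t
    return w_t + mu * C * (y_t[M][:, None] * X_t[M]).sum(axis=0)
```

For example, starting from w = (0, 0) with X = [[1, 0], [0, 1]], y = (1, 1), and C = 1, both instances are mistaken and the update yields w = (0.5, 0.5).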
We abbreviate this scheme as the SimPerc algorithm.

Soft Simultaneous Projections: The soft simultaneous projections scheme uses the fact that each reduced problem has an analytic solution, yielding α^t_j = min{C, ℓ(ω^t; (x^t_j, y^t_j)) / ‖x^t_j‖²}. We independently assign each α^t_j this optimal solution. We next set µ^t_j to 1/|Γ^t| for j ∈ Γ^t and to 0 otherwise. We would like to comment that this solution may update α^t_j also for instances which were correctly classified, as long as the margin they attain is not sufficiently large. We abbreviate this scheme as the SimProj algorithm.

Conservative Simultaneous Projections: Combining ideas from both methods, the conservative simultaneous projections scheme optimally sets α^t_j according to the analytic solution. The difference from the SimProj algorithm lies in the selection of µ^t. In the conservative scheme only the instances which were incorrectly predicted (j ∈ M^t) are assigned a positive weight. Put differently, µ^t_j is set to 1/|M^t| for j ∈ M^t and to 0 otherwise. We abbreviate this scheme as the ConProj algorithm.

To recap, on each trial t we obtain a feasible solution for the instantaneous dual given in Eq. (2). This solution combines independently calculated α^t_j according to a weight vector µ^t ∈ Δ_{k_t}. While this solution may not be optimal, it does constitute an infrastructure for obtaining a mistake bound and, as we demonstrate in Sec. 6, performs well in practice.

5 Analysis

The algorithms described in the previous section perform updates in order to increase the instantaneous dual problem defined in Eq. (2).
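As a concrete reference point, one such dual-increasing update — the SimProj scheme of the previous section — might be sketched as follows (our illustration, not the authors' code; NumPy assumed), using the closed-form per-constraint solution α^t_j = min{C, ℓ(ω^t; (x^t_j, y^t_j)) / ‖x^t_j‖²} and uniform weights over Γ^t:

```python
import numpy as np

def simproj_update(w_t, X_t, y_t, C):
    """Solve each single-constraint reduced problem analytically,
    alpha_j = min(C, hinge_j / ||x_j||^2), then average the individual
    solutions uniformly over Gamma_t (instances with positive hinge loss)."""
    y_hat = X_t @ w_t
    hinge = np.maximum(0.0, 1.0 - y_t * y_hat)
    Gamma = np.flatnonzero(hinge > 0)
    if Gamma.size == 0:
        return w_t                              # zero loss: no update
    alpha = np.minimum(C, hinge[Gamma] / (X_t[Gamma] ** 2).sum(axis=1))
    mu = 1.0 / Gamma.size                       # uniform weights over Gamma_t
    return w_t + mu * ((alpha * y_t[Gamma])[:, None] * X_t[Gamma]).sum(axis=0)
```

For example, from w = (0, 0) with X = [[1, 0], [0, 2]], y = (1, 1), and C = 10, the per-constraint solutions are α = (1, 0.25) and the averaged update gives w = (0.5, 0.25).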
We now use the mistake bound model to derive an upper bound on the number of trials on which the predictions of the SimPerc and ConProj algorithms are imperfect. Following [6], the first step in the analysis is to tie the instantaneous dual problems to a global loss function. To do so, we introduce a primal optimization problem defined over the entire sequence of examples as follows: min_{ω∈ℝ^n} ½‖ω‖² + C Σ_{t=1}^T ℓ(ω; (X^t, y^t)). We rewrite this optimization problem as the following equivalent constrained optimization problem:

  min_{ω∈ℝ^n, ξ∈ℝ^T} ½‖ω‖² + C Σ_{t=1}^T ξ_t  s.t. ∀t ∈ [T], ∀j ∈ [k_t]: y^t_j (ω · x^t_j) ≥ 1 − ξ_t, ∀t: ξ_t ≥ 0.   (4)

We denote the value of the objective function at (ω, ξ) for this optimization problem by P(ω, ξ). A competitor who may see the entire sequence of examples in advance may in particular set (ω, ξ) to be the minimizer of the problem, which we denote by (ω⋆, ξ⋆). Standard usage of Lagrange multipliers yields that the dual of Eq. (4) is

  max_λ Σ_{t=1}^T Σ_{j=1}^{k_t} λ_{t,j} − ½ ‖Σ_{t=1}^T Σ_{j=1}^{k_t} λ_{t,j} y^t_j x^t_j‖²  s.t. ∀t: Σ_{j=1}^{k_t} λ_{t,j} ≤ C, ∀t, j: λ_{t,j} ≥ 0.   (5)

We denote the value of the objective function of Eq. (5) by D(λ^1, . . . , λ^T), where each λ^t is a vector in ℝ^{k_t}. Throughout our derivation we use the fact that any set of dual variables λ^1, . . . , λ^T defines a feasible solution ω = Σ_{t=1}^T Σ_{j=1}^{k_t} λ_{t,j} y^t_j x^t_j with a corresponding assignment of the slack variables. Clearly, the optimization problem given by Eq. (5) depends on all the examples from the first trial through time step T and thus can only be solved in hindsight.
We note, however, that if we ensure that λ_{s,j} = 0 for all s > t, then the dual function no longer depends on instances occurring on rounds following round t. As we show next, we use this primal-dual view to derive the skeleton algorithm of Fig. 2 by finding a new feasible solution for the dual problem on every trial. Formally, the instantaneous dual problem, given by Eq. (2), is equivalent (after omitting an additive constant) to the following constrained optimization problem:

  max_λ D(λ^1, . . . , λ^{t−1}, λ, 0, . . . , 0)  s.t. λ ≥ 0, Σ_{j=1}^{k_t} λ_j ≤ C.   (6)

That is, the instantaneous dual problem is obtained from D(λ^1, . . . , λ^T) by fixing λ^1, . . . , λ^{t−1} to the values set in previous rounds, forcing λ^{t+1} through λ^T to the zero vectors, and choosing a feasible vector for λ^t. Given the set of dual variables λ^1, . . . , λ^{t−1}, it is straightforward to show that the prediction vector used on trial t is ω^t = Σ_{s=1}^{t−1} Σ_{j=1}^{k_s} λ_{s,j} y^s_j x^s_j. Equipped with these relations and omitting constants which do not depend on λ^t, Eq. (6) can be rewritten as

  max_{λ_1,...,λ_{k_t}} Σ_{j=1}^{k_t} λ_j − ½ ‖ω^t + Σ_{j=1}^{k_t} λ_j y^t_j x^t_j‖²  s.t. ∀j: λ_j ≥ 0, Σ_{j=1}^{k_t} λ_j ≤ C.   (7)

The problems defined by Eq. (7) and Eq. (2) are equivalent. Thus, weighing the variables α^t_1, . . . , α^t_{k_t} by µ^t_1, . . . , µ^t_{k_t} also yields a feasible solution for the problem defined in Eq. (6), namely λ_{t,j} = µ^t_j α^t_j. We now tie all of these observations together by using the weak-duality theorem. Our first bound is given for the SimPerc algorithm.

Theorem 1. Let (X^1, y^1), . . . , (X^T, y^T) be a sequence of examples where X^t is a matrix of k_t examples and y^t are the associated labels. Assume that for all t and j the norm of an instance x^t_j is at most R. Then, for any ω⋆ ∈ ℝ^n, the number of trials on which the prediction of SimPerc is imperfect is at most

  ( ½‖ω⋆‖² + C Σ_{t=1}^T ℓ(ω⋆; (X^t, y^t)) ) / ( C − ½ C²R² ).

Proof. To prove the theorem we make use of the weak-duality theorem. Recall that any dual feasible solution induces a value for the dual's objective function which is upper bounded by the optimum value of the primal problem, P(ω⋆, ξ⋆). In particular, the solution obtained at the end of trial T is dual feasible, and thus D(λ^1, . . . , λ^T) ≤ P(ω⋆, ξ⋆). We now rewrite the left-hand side of the above inequality as the following telescoping sum:

  D(0, . . . , 0) + Σ_{t=1}^T [ D(λ^1, . . . , λ^t, 0, . . . , 0) − D(λ^1, . . . , λ^{t−1}, 0, . . . , 0) ].   (8)