{"title": "Online Learning: Random Averages, Combinatorial Parameters, and Learnability", "book": "Advances in Neural Information Processing Systems", "page_first": 1984, "page_last": 1992, "abstract": "We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fat-shattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting.", "full_text": "Online Learning: Random Averages, Combinatorial\n\nParameters, and Learnability\n\nAlexander Rakhlin\nDepartment of Statistics\nUniversity of Pennsylvania\n\nKarthik Sridharan\n\nToyota Technological Institute\n\nat Chicago\n\nAmbuj Tewari\n\nComputer Science Department\nUniversity of Texas at Austin\n\nAbstract\n\nWe develop a theory of online learning by de\ufb01ning several complexity measures.\nAmong them are analogues of Rademacher complexity, covering numbers and fat-\nshattering dimension from statistical learning theory. Relationship among these\ncomplexity measures, their connection to online learning, and tools for bounding\nthem are provided. We apply these results to various learning problems. We\nprovide a complete characterization of online learnability in the supervised setting.\n\n1\n\nIntroduction\n\nIn the online learning framework, the learner is faced with a sequence of data appearing at discrete\ntime intervals.\nIn contrast to the classical \u201cbatch\u201d learning scenario where the learner is being\nevaluated after the sequence is completely revealed, in the online framework the learner is evaluated\nat every round. 
Furthermore, in the batch scenario the data source is typically assumed to be i.i.d. with an unknown distribution, while in the online framework we relax or eliminate any stochastic assumptions on the data source. As such, the online learning problem can be phrased as a repeated two-player game between the learner (player) and the adversary (Nature).

Let F be a class of functions and X some set. The Online Learning Model is defined as the following T-round interaction between the learner and the adversary: on round t = 1, ..., T, the Learner chooses f_t ∈ F, the Adversary picks x_t ∈ X, and the Learner suffers loss f_t(x_t). At the end of T rounds we define regret as the difference between the cumulative loss of the player and the cumulative loss of the best fixed comparator. For the given pair (F, X), the problem is said to be online learnable if there exists an algorithm for the learner such that regret grows sublinearly. Learnability is closely related to Hannan consistency [13, 9].

There has been a lot of interest in a particular setting of the online learning model, called online convex optimization. In this setting, we write x_t(f_t) as the loss incurred by the learner, and the assumption is made that the function x_t is convex in its argument. The particular convexity structure enables the development of optimization-based algorithms for the learner's choices. Learnability and precise rates of growth of regret have been established in a number of recent papers (e.g. [33, 25, 1]). The online learning model also subsumes the prediction setting. In the latter, the learner's choice of a Y-valued function g_t leads to the loss ℓ(g_t(z_t), y_t) according to a fixed loss function ℓ : Y × Y → R. The choice of the learner is equivalently written as f_t(x) = ℓ(g_t(z), y), where x = (z, y), and x_t = (z_t, y_t) is the choice of the adversary.
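The protocol just described is short enough to state in code. The following sketch (our own illustration, not part of the paper; the finite class, the alternating adversary, and the uniform-random learner are all hypothetical choices made for the example) plays the T-round game and computes the regret against the best fixed comparator:

```python
import random

def play_game(F, adversary, T, learner, seed=0):
    """Run the T-round online learning protocol and return the regret.

    F         : finite list of functions f : X -> R (the comparator class)
    adversary : maps the history of (f_t, x_t) pairs to the next x_t
    learner   : maps (history, F, rng) to the next chosen f in F
    """
    rng = random.Random(seed)
    history, xs = [], []
    cum_loss = 0.0
    for t in range(T):
        f_t = learner(history, F, rng)   # learner moves first
        x_t = adversary(history)         # then the adversary picks x_t
        cum_loss += f_t(x_t)             # learner suffers f_t(x_t)
        history.append((f_t, x_t))
        xs.append(x_t)
    best_fixed = min(sum(f(x) for x in xs) for f in F)
    return cum_loss - best_fixed         # regret after T rounds

# Toy instance: X = {-1, +1}, F = {x -> x, x -> -x}.
F = [lambda x: x, lambda x: -x]
adversary = lambda hist: 1 if len(hist) % 2 == 0 else -1  # alternating sequence
learner = lambda hist, F, rng: rng.choice(F)              # naive randomized learner
regret = play_game(F, adversary, 100, learner)
```

A learner whose regret grows sublinearly in T against every adversary witnesses online learnability of the pair (F, X).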
In Section 6 we discuss the prediction setting in more detail.

In the “batch” learning scenario, data {(x_i, y_i)}_{i=1}^T is presented as an i.i.d. draw from a fixed distribution over some product X × Y. Learnability results have been extensively studied in the PAC framework [29] and its agnostic extensions [14, 17]. It is well-known that learnability in the binary case (that is, Y = {−1, +1}) is completely characterized by finiteness of the Vapnik-Chervonenkis combinatorial dimension of the function class [32, 31]. In the real-valued case, a number of combinatorial quantities have been proposed: the P-dimension [23], the V-dimension, as well as the scale-sensitive versions, the P_γ-dimension [17, 5] and the V_γ-dimension [3]. The last two dimensions were shown to characterize learnability [3] and uniform convergence of means to expectations for function classes.

In contrast to the classical learning setting, there has been surprisingly little work on characterizing learnability for the online learning framework. Littlestone [19] has shown that, in the setting of prediction of binary outcomes, a certain combinatorial property of the binary-valued function class characterizes learnability in the realizable case. The result has been extended to the non-realizable case by Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz [7], who named this combinatorial quantity the Littlestone dimension. In parallel to [7], minimax analysis of online convex optimization yielded new insights into the value of the game, its minimax dual representation, as well as algorithm-independent upper and lower bounds [1, 27]. In this paper, we build upon these results and the findings of [7] to develop a theory of online learning.

We show that in the online learning model, a notion which we call Sequential Rademacher complexity allows us to easily prove learnability for a vast array of problems.
The role of this complexity is similar to the role of the Rademacher complexity in statistical learning theory. Next, we extend Littlestone's dimension to the real-valued case. We show that finiteness of this scale-sensitive version, which we call the fat-shattering dimension, is necessary and sufficient for learnability in the prediction setting. Extending the binary-valued result of [7], we introduce a generic algorithm which plays a role similar to that of empirical risk minimization for i.i.d. data: if the problem is learnable in the supervised setting, then it is learnable by this algorithm. Along the way we develop analogues of Massart's finite class lemma, the Dudley integral upper bound on the Sequential Rademacher complexity, appropriately defined packing and covering numbers, and even an analogue of the Sauer-Shelah combinatorial lemma. In the full version of this paper, we introduce a generalization of the uniform law of large numbers for non-i.i.d. distributions and show that finiteness of the fat-shattering dimension implies this convergence.

Many of the results come with more work than their counterparts in statistical learning theory. In particular, instead of training sets we have to work with trees, making the results somewhat involved. For this reason, we state our results without proofs, deferring the details to the full version of this paper. While the spirit of the online theory is that it provides a “temporal” generalization of the “batch” learning problem, not all the results from statistical learning theory transfer to our setting. For instance, two distinct notions of a packing set exist for trees, and these notions can be seen to coincide in “batch” learning.
The fact that many notions of statistical learning theory can be extended to the online learning model is indeed remarkable.

2 Preliminaries

By phrasing the online learning model as a repeated game and considering its minimax value, we naturally arrive at an important object in combinatorial game theory: trees. Unless specified, all trees considered in this paper are rooted binary trees with equal-depth paths from the root to the leaves. While it is useful to have the tree picture in mind when reading the paper, it is also necessary to precisely define trees as mathematical objects. We opt for the following definition. Given some set Z, a Z-valued tree of depth T is a sequence (z_1, ..., z_T) of T mappings z_i : {±1}^{i−1} → Z. The root of the tree z is the constant function z_1 ∈ Z. Armed with this definition, we can talk about various operations on trees. For a function f : Z → U, f(x) denotes the U-valued tree defined by the mappings (f∘x_1, ..., f∘x_T). A path of length T is a sequence ε = (ε_1, ..., ε_{T−1}) ∈ {±1}^{T−1}. We shall abuse notation by referring to x_i(ε_1, ..., ε_{i−1}) by x_i(ε). Clearly x_i only depends on the first i − 1 elements of ε.

We denote (y_a, ..., y_b) by y_{a:b}. The set of all functions from X to Y is denoted by Y^X, and the t-fold product X × ... × X is denoted by X^t. For any T ∈ N, [T] denotes the set {1, ..., T}. Whenever the variable in sup (inf) is not quantified, it ranges over the set of all possible values.

3 Value of the Game

Fix the sets F and X and consider the online learning model stated in the introduction. We assume that F is a separable metric space. Let Q be the set of Borel probability measures on F. Assume that Q is weakly compact.
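Since a Z-valued tree is just a sequence of mappings z_i : {±1}^{i−1} → Z, it has a direct computational representation. The sketch below (our own, purely illustrative) stores each level as a dictionary keyed by sign prefixes and evaluates x_i(ε) along a path ε:

```python
from itertools import product

def make_tree(values, depth):
    """A Z-valued tree of depth T as mappings z_i : {±1}^(i-1) -> Z.

    `values(prefix)` assigns an element of Z to each sign prefix.
    Returns a list `tree` with tree[i][prefix] = z_{i+1}(prefix).
    """
    return [
        {prefix: values(prefix) for prefix in product((-1, +1), repeat=i)}
        for i in range(depth)
    ]

def evaluate(tree, i, eps):
    """x_i(eps): the i-th mapping applied to the first i-1 signs of the path."""
    return tree[i - 1][tuple(eps[: i - 1])]

# Toy {0,1,...}-valued tree of depth 3: label each node by the number of +1
# signs on the path leading to it.
tree = make_tree(lambda prefix: sum(1 for e in prefix if e == +1), depth=3)
path = (+1, -1, +1)
labels = [evaluate(tree, i, path) for i in (1, 2, 3)]  # x_1(eps), x_2(eps), x_3(eps)
```

The root is the constant mapping `tree[0][()]`, and level i holds 2^(i−1) values, matching the definition above.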
We consider randomized learners who predict a distribution q_t ∈ Q on every round. Formally, define a learner's strategy π as a sequence of mappings π_t : X^{t−1} × F^{t−1} → Q for each t ∈ [T]. We define the value of the game as

V_T(F, X) = inf_{q_1∈Q} sup_{x_1∈X} E_{f_1∼q_1} · · · inf_{q_T∈Q} sup_{x_T∈X} E_{f_T∼q_T} [ Σ_{t=1}^T f_t(x_t) − inf_{f∈F} Σ_{t=1}^T f(x_t) ]    (1)

where f_t has distribution q_t. We consider here the adaptive adversary who gets to choose each x_t based on the history of moves f_{1:t−1} and x_{1:t−1}.

Note that our assumption that F is a separable metric space implies that Q is tight [28], and Prokhorov's theorem states that compactness of Q under the weak topology is equivalent to tightness [28]. Hence we have that Q is compact under the weak topology, and this is essentially what we need to apply a modification of Theorem 1 of [1]. Specifically, we show the following:

Theorem 1. Let F and X be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Denote by Q and P the sets of probability distributions (mixed strategies) on F and X, respectively. Then

V_T(F, X) = sup_{p_1} E_{x_1∼p_1} · · · sup_{p_T} E_{x_T∼p_T} [ Σ_{t=1}^T inf_{f_t∈F} E_{x_t∼p_t}[f_t(x_t)] − inf_{f∈F} Σ_{t=1}^T f(x_t) ] .    (2)

The question of learnability in the online learning model is now reduced to the study of V_T(F, X), taking Eq. (2) as the starting point. In particular, under our definition, showing that the value grows sublinearly with T is equivalent to showing learnability.

Definition 1.
A class F is said to be online learnable with respect to the given X if

lim sup_{T→∞} V_T(F, X) / T = 0 .

The rest of the paper is aimed at understanding the value of the game V_T(F, X) for various function classes F. Since the complexity of F is the focus of the paper, we shall often write V_T(F), and the dependence on X will be implicit. One of the key notions introduced in this paper is the complexity which we term Sequential Rademacher complexity. A natural generalization of Rademacher complexity [18, 6, 21], the sequential analogue possesses many of the nice properties of its classical cousin. The properties are proved in Section 7 and then used to show learnability for many of the examples in Section 8. The first step, however, is to show that the Sequential Rademacher complexity upper bounds the value of the game. This is the subject of the next section.

4 Random Averages

Definition 2. The Sequential Rademacher Complexity of a function class F ⊆ R^X is defined as

R_T(F) = sup_x E_ε [ sup_{f∈F} Σ_{t=1}^T ε_t f(x_t(ε)) ]

where the outer supremum is taken over all X-valued trees of depth T and ε = (ε_1, ..., ε_T) is a sequence of i.i.d. Rademacher random variables.

Theorem 2. The minimax value of a randomized game is bounded as V_T(F) ≤ 2 R_T(F).

Theorem 2 relies on a technical lemma, whose proof requires considerably more work than the classical symmetrization proof [11, 21] due to the non-i.i.d. nature of the sequences. We mention that under strong assumptions on the space of functions, the Sequential Rademacher and the classical Rademacher complexities coincide (see [1]). In general, however, the two complexities are very different.
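For a finite class on one fixed tree, the expectation in Definition 2 is a finite average over the 2^T sign paths and can be computed exactly. The sketch below (our own illustrative computation; the depth-2 tree and the two linear functions are hypothetical choices) represents the tree by prefix-keyed dictionaries as in Section 2:

```python
from itertools import product

def seq_rademacher_on_tree(tree, F):
    """E_eps[ sup_{f in F} sum_t eps_t * f(x_t(eps)) ] for one fixed tree.

    tree : list of dicts, tree[i][prefix] = x_{i+1}(prefix), prefix in {±1}^i
    F    : finite list of real-valued functions on the tree's label set
    """
    T = len(tree)
    total = 0.0
    for eps in product((-1, +1), repeat=T):       # uniform over sign paths
        best = max(
            sum(e * f(tree[i][eps[:i]]) for i, e in enumerate(eps))
            for f in F
        )
        total += best
    return total / 2 ** T

# Depth-2 X-valued tree with X = {-1, +1}: root -1; children follow the sign.
tree = [{(): -1}, {(-1,): -1, (+1,): +1}]
F = [lambda x: x, lambda x: -x]                   # two linear functions on X
R2 = seq_rademacher_on_tree(tree, F)              # inner quantity of Definition 2
```

The outer supremum over all trees in Definition 2 is of course not computable this way in general; the sketch only evaluates the inner expectation on one fixed tree.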
For example, the discrepancy is exhibited by a class of linear threshold functions.

5 Covering Numbers and Combinatorial Parameters

In online learning, the notion characterizing learnability for binary prediction in the realizable case has been introduced by Littlestone [19] and extended to the non-realizable case of binary prediction by Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz [7]. Next, we define the Littlestone dimension [19, 7] and propose its scale-sensitive versions for real-valued function classes. In the sequel, these combinatorial parameters are shown to control the growth of covering numbers on trees. In the setting of prediction, the combinatorial parameters are shown to exactly characterize learnability (see Section 6).

Definition 3 ([19, 7]). An X-valued tree x of depth d is shattered by a function class F ⊆ {±1}^X if for all ε ∈ {±1}^d, there exists f ∈ F such that f(x_t(ε)) = ε_t for all t ∈ [d]. The Littlestone dimension Ldim(F, X) is the largest d such that F shatters an X-valued tree of depth d.

Definition 4. An X-valued tree x of depth d is α-shattered by a function class F ⊆ R^X if there exists an R-valued tree s of depth d such that

∀ε ∈ {±1}^d, ∃f ∈ F s.t. ∀t ∈ [d], ε_t (f(x_t(ε)) − s_t(ε)) ≥ α/2 .

The tree s is called the witness to shattering. The fat-shattering dimension fat_α(F, X) at scale α is the largest d such that F α-shatters an X-valued tree of depth d.

With these definitions it is easy to see that fat_α(F, X) = Ldim(F, X) for a binary-valued function class F ⊆ {0, 1}^X for any 0 < α ≤ 1.
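Shattering in the sense of Definition 3 is a finite condition and can be checked exhaustively for small trees and finite classes. A minimal sketch (our own; the threshold class on the real line is a hypothetical example, not one from the paper):

```python
from itertools import product

def is_shattered(tree, F):
    """Definition 3: for every eps in {±1}^d some f in F realizes
    f(x_t(eps)) = eps_t at every level t."""
    d = len(tree)
    return all(
        any(
            all(f(tree[t][eps[:t]]) == eps[t] for t in range(d))
            for f in F
        )
        for eps in product((-1, +1), repeat=d)
    )

# Thresholds f_a(x) = +1 iff x >= a on the real line shatter this depth-2
# tree: root 2; the child after sign -1 is 3, the child after sign +1 is 1.
tree = [{(): 2}, {(-1,): 3, (+1,): 1}]
thresholds = [lambda x, a=a: +1 if x >= a else -1 for a in (1, 2, 3, 4)]
shattered = is_shattered(tree, thresholds)
```

Note how the two subtrees present different points: on a constant tree x_t(ε) = x_t the same check reduces to ordinary VC shattering, in line with the remark below on constant mappings.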
When X and/or F is understood from the context, we will simply write fat_α or fat_α(F) instead of fat_α(F, X).

Let us mention that if the trees x are defined by constant mappings x_t(ε) = x_t, the combinatorial parameters coincide with the Vapnik-Chervonenkis dimension and with the scale-sensitive dimension P_γ. Therefore, the notions we are studying are a strict “temporal” generalization of the VC theory.

As in statistical learning theory, the combinatorial parameters are only useful if they can be shown to capture that aspect of F which is important for learnability. In particular, a “size” of a function class is known to be related to the complexity of learning from i.i.d. data, and the classical way to measure “size” is through a cover or a packing set. We propose the following definitions for online learning.

Definition 5. A set V of R-valued trees of depth T is an α-cover (with respect to the ℓ_p-norm) of F ⊆ R^X on a tree x of depth T if

∀f ∈ F, ∀ε ∈ {±1}^T, ∃v ∈ V s.t.  (1/T) Σ_{t=1}^T |v_t(ε) − f(x_t(ε))|^p ≤ α^p .

The covering number N_p(α, F, x) of a function class F on a given tree x is the size of the smallest cover. Further define N_p(α, F, T) = sup_x N_p(α, F, x), the maximal ℓ_p covering number of F over depth-T trees.

In particular, a set V of R-valued trees of depth T is a 0-cover of F ⊆ R^X on a tree x of depth T if for all f ∈ F and ε ∈ {±1}^T, there exists v ∈ V s.t. v_t(ε) = f(x_t(ε)). We denote by N(0, F, x) the size of a smallest 0-cover on x and N(0, F, T) = sup_x N(0, F, x). The 0-cover should not be mistaken for the size |{f(x) : f ∈ F}| of the projection of F onto the tree x, and the same care should be taken when dealing with α-covers.

We would like to comment that while in the i.i.d. setting there is a notion of packing number that upper and lower bounds the covering number, in the sequential counterpart such an analogue fails.

5.1 A Combinatorial Upper Bound

We now relate the combinatorial parameters introduced in the previous section to the size of a cover. In the binary case (k = 1 below), a reader might notice a similarity of Theorem 3 to the classical results due to Sauer [24], Shelah [26] (also, Perles and Shelah), and Vapnik and Chervonenkis [32]. There are several approaches to proving what is often called the Sauer-Shelah lemma. We opt for the inductive-style proof (e.g. Alon and Spencer [4]). Dealing with trees, however, requires more work than in the VC case.

Theorem 3. Let F ⊆ {0, ..., k}^X be a class of functions with fat_1(F) = d_1 and fat_2(F) = d_2. Then

N_∞(1/2, F, T) ≤ Σ_{i=0}^{d_2} (T choose i) k^i ≤ (ekT)^{d_2} ,    N(0, F, T) ≤ Σ_{i=0}^{d_1} (T choose i) k^i ≤ (ekT)^{d_1} .

Of particular interest is the case k = 1, when fat_1(F) = Ldim(F). Armed with Theorem 3, we can reduce the problem of bounding the size of a cover at an α scale by a discretization trick. For the classical case of a cover based on a set of points, the discretization idea appears in [3, 22]. We now show that the covering numbers are bounded in terms of the fat-shattering dimension.

Corollary 4. Suppose F is a class of [−1, 1]-valued functions on X. Then for any α > 0, any T > 0, and any X-valued tree x of depth T,

N_1(α, F, x) ≤ N_2(α, F, x) ≤ N_∞(α, F, x) ≤ (2eT/α)^{fat_α(F)} .

When bounding deviations of means from expectations uniformly over the function class, the usual approach proceeds by a symmetrization argument [12] followed by passing to a cover of the function class and a union bound (e.g. [21]).
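The binomial-sum bound of Theorem 3 and its closed form (ekT)^d are easy to compare numerically. The following quick check (ours, purely illustrative) confirms the final inequality Σ_{i=0}^{d} (T choose i) k^i ≤ (ekT)^d on a small grid of parameters:

```python
import math

def sauer_shelah_sum(T, d, k):
    """Sum_{i=0}^{d} C(T, i) * k^i, the covering-number bound of Theorem 3."""
    return sum(math.comb(T, i) * k**i for i in range(d + 1))

def closed_form(T, d, k):
    """(e k T)^d, the closed-form upper bound."""
    return (math.e * k * T) ** d

# Verify the inequality on a 3 x 3 x 3 grid of (T, d, k) values.
checks = [
    (T, d, k)
    for T in (5, 10, 50)
    for d in (1, 2, 5)
    for k in (1, 2, 3)
    if sauer_shelah_sum(T, d, k) <= closed_form(T, d, k)
]
all_hold = len(checks) == 27  # the inequality held in every tested case
```

For d = T the sum telescopes via the binomial theorem to (1 + k)^T, which makes the comparison easy to sanity-check by hand.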
Alternatively, a more refined chaining analysis integrates over covering at different scales (e.g. [30]). By following the same path, we are able to prove a number of similar results for our setting. Next, we present a bound similar to Massart's finite class lemma [20, Lemma 5.2]. This result will be used when integrating over different scales for the cover.

5.2 Finite Class Lemma and the Chaining Method

Lemma 5. For any finite set V of R-valued trees of depth T we have that

E_ε [ max_{v∈V} Σ_{t=1}^T ε_t v_t(ε) ] ≤ √( 2 log(|V|) max_{v∈V} max_{ε∈{±1}^T} Σ_{t=1}^T v_t(ε)² ) .

A simple consequence of the above lemma is that if F ⊆ [0, 1]^X is a finite class, then for any given tree x we obtain a √(2T log(|F|)) upper bound. If f ∈ F is associated with an “expert” (see [9]), this result combined with Theorem 2 yields a bound given by the experts algorithm. In Section 8 we discuss this case in more detail. However, as we show next, Lemma 5 goes well beyond just finite classes and can be used to get an analogue of the Dudley entropy bound [10] for the online setting through a chaining argument.

Definition 6. The Integrated complexity of a function class F ⊆ [−1, 1]^X is defined as

D_T(F) = inf_α { 4Tα + 12 ∫_α^1 √( T log N_2(δ, F, T) ) dδ } .

The basic idea in the proof of the following theorem is the same as in statistical learning: R_T(F) is bounded by controlling the complexity along the chain of coverings. The argument for trees, though, is more involved than the classical case.

Theorem 6.
For any function class F ⊆ [−1, 1]^X, R_T(F) ≤ D_T(F).

6 Supervised Learning

In this section we study the supervised learning problem, where the player picks a function f_t ∈ R^X at any time t, the adversary provides an input-target pair (x_t, y_t), and the player suffers loss |f_t(x_t) − y_t|. Note that if F ⊆ {±1}^X and each y_t ∈ {±1}, then the problem boils down to the binary classification problem. As we are interested in prediction, we allow f_t to be outside of F. Though we use the absolute loss in this section, it is easy to see that all the results hold (with modified rates) for any loss ℓ(f(x), y) which is such that for all f, x and y, φ(ℓ(ŷ, y)) ≤ |ŷ − y| ≤ Φ(ℓ(ŷ, y)), where Φ and φ are monotonically increasing functions. For instance, the squared loss is a classic example.

To formally define the value of the online supervised learning game, fix a set of labels Y ⊆ [−1, 1]. Given F, define the associated loss class FS = {(x, y) ↦ |f(x) − y| : f ∈ F}. Now, the supervised game is obtained using the pair (FS, X × Y), and we accordingly define V^S_T(F) = V_T(FS, X × Y). Binary classification is, of course, a special case when Y = {±1} and F ⊆ {±1}^X. In that case, we simply use V^Binary_T for V^S_T.

Proposition 7. For the supervised learning game played with a function class F ⊆ [−1, 1]^X, for any T ≥ 1,

(1/(4√2)) sup_α { α √( T min{fat_α, T} ) } ≤ (1/2) V^S_T(F) ≤ R_T(F) ≤ D_T(F) ≤ inf_α { 4Tα + 12 ∫_α^1 √( T fat_β log(2eT/β) ) dβ } .    (3)

Theorem 8. For any function class F ⊆ [−1, 1]^X, F is online learnable in the supervised setting if and only if fat_α(F) is finite for any α > 0. Moreover, if the function class is online learnable, then the value of the supervised game V^S_T(F), the Sequential Rademacher complexity R_T(F), and the Integrated complexity D_T(F) are within a multiplicative factor of O(log^{3/2} T) of each other.

Corollary 9. For the binary classification game played with a function class F we have that

K_1 √( T min{Ldim(F), T} ) ≤ V^Binary_T(F) ≤ K_2 √( T Ldim(F) log T )

for some universal constants K_1, K_2. This recovers the result of [7].

We wish to point out that the lower bound of Proposition 7 also holds for “improper” supervised learning algorithms, i.e. those that simply output a prediction ŷ_t ∈ Y rather than a function f_t ∈ F. Since a proper learning strategy can always be used as an improper learning strategy, we trivially have that if a class is online learnable in the supervised setting, then it is improperly online learnable. Because of the above-mentioned property of the lower bound of Proposition 7, we also have the non-trivial reverse implication: if a class is improperly online learnable in the supervised setting, it is online learnable.

6.1 Generic Algorithm

We shall now present a generic improper learning algorithm for the supervised setting that achieves a low regret bound whenever the function class is online learnable. For any α > 0, define an α-discretization of the [−1, 1] interval as B_α = {−1 + α/2, −1 + 3α/2, ..., −1 + (2k + 1)α/2, ...} for 0 ≤ k and (2k + 1)α ≤ 4. Also, for any a ∈ [−1, 1] define ⌊a⌋_α = argmin_{r∈B_α} |r − a|. For a set of functions V ⊆ F, any r ∈ B_α and x ∈ X, define V(r, x) = {f ∈ V | f(x) ∈ (r − α/2, r + α/2]}.

The algorithm proceeds by generating “experts” in a way similar to [7].
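The exponentially weighted experts algorithm used on top of the generated experts is the standard Hedge-style aggregation. A generic sketch (our own minimal version over an abstract finite expert set; the learning rate η = √(8 ln N / T) is a common textbook tuning, not a parameter fixed by this paper):

```python
import math
import random

def exponentially_weighted(expert_losses, eta):
    """Exponentially weighted experts: expected cumulative loss of the aggregator.

    expert_losses : T x N matrix, expert_losses[t][i] in [0, 1] is expert i's
                    loss at round t, revealed after the round.
    eta           : learning rate.
    """
    N = len(expert_losses[0])
    log_w = [0.0] * N                       # log-weights, for numerical stability
    expected_loss = 0.0
    for losses in expert_losses:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        probs = [wi / z for wi in w]        # q_t: distribution over experts
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        log_w = [lw - eta * l for lw, l in zip(log_w, losses)]
    return expected_loss

# Toy run: N = 2 experts over T = 200 rounds; expert 0 is better on average.
T, N = 200, 2
rng = random.Random(1)
losses = [[0.0 if rng.random() < 0.7 else 1.0, 0.5] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)        # standard textbook tuning
alg = exponentially_weighted(losses, eta)
best = min(sum(l[i] for l in losses) for i in range(N))
regret = alg - best                         # O(sqrt(T log N)) for this eta
```

With this tuning the classical guarantee regret ≤ √(T ln N / 2) holds for any loss sequence in [0, 1], which is the √(T log N) behaviour referenced throughout the paper.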
Using these experts along with the exponentially weighted experts algorithm, we shall provide the generic algorithm for online supervised learning.

Algorithm 1 Expert (F, α, 1 ≤ i_1 < ... < i_L ≤ T, Y_1, ..., Y_L)
  V_1 ← F
  for t = 1 to T do
    R_t(x) = {r ∈ B_α : fat_α(V_t(r, x)) = max_{r′∈B_α} fat_α(V_t(r′, x))}
    For each x ∈ X, let f′_t(x) = (1/|R_t(x)|) Σ_{r∈R_t(x)} r
    if t ∈ {i_1, ..., i_L} then
      ∀x ∈ X, f_t(x) = Y_j where j is s.t. t = i_j
      Play f_t, receive x_t, and update V_{t+1} = V_t(f_t(x_t), x_t)
    else
      Play f_t = f′_t, receive x_t, and set V_{t+1} = V_t
    end if
  end for

For each L ≤ fat_α(F) and every possible choice of 1 ≤ i_1 < ... < i_L ≤ T and Y_1, ..., Y_L ∈ B_α we generate an expert. Denote this set of experts as E_T. Each expert outputs a function f_t ∈ F at every round t. Hence each expert e ∈ E_T can be seen as a sequence (e_1, ..., e_T) of mappings e_t : X^{t−1} → F. The number of unique experts is

|E_T| = Σ_{L=0}^{fat_α} (T choose L) (|B_α| − 1)^L ≤ (2T/α)^{fat_α} .

Using an argument similar to [7], for any f ∈ F there exists e ∈ E_T such that for any t ∈ [T], |f(x_t) − e(x_{1:t−1})(x_t)| ≤ α.

Theorem 10. For any α > 0, if we run the exponentially weighted experts algorithm with the set E_T of experts, then the expected regret of the algorithm is bounded as

E [ Σ_{t=1}^T f_t(x_t) ] − inf_{f∈F} Σ_{t=1}^T f(x_t) ≤ αT + √( T fat_α log(2T/α) ) .

Further, if F is bounded by 1, then by running an additional experts algorithm over the experts for discretizations over α, we can provide a regret guarantee of

E [ Σ_{t=1}^T f_t(x_t) ] − inf_{f∈F} Σ_{t=1}^T f(x_t) ≤ inf_α { αT + √( T fat_α log(2T/α) ) + √T ( 3 + 2 log log(1/α) ) } .

7 Structural Results

Being able to bound the complexity of a function class by the complexity of a simpler class is of great utility for proving bounds. In statistical learning theory, such structural results are obtained through properties of Rademacher averages [21, 6]. In particular, the contraction inequality due to Ledoux and Talagrand allows one to pass from a composition of a Lipschitz function with a class to the function class itself. This wonderful property permits easy convergence proofs for a vast array of problems. We show that the notion of Sequential Rademacher complexity also enjoys many of the same properties. In Section 8, the effectiveness of the results is illustrated on a number of examples. First, we prove the contraction inequality.

Lemma 11. Fix a class F ⊆ R^Z and a function φ : R × Z → R. Assume, for all z ∈ Z, φ(·, z) is an L-Lipschitz function.
Then R(\u03c6(F)) \u2264 L \u00b7 R(F) where \u03c6(F) = {z (cid:55)\u2192 \u03c6(f(z), z) : f \u2208 F}.\n\nThe next lemma bounds the Sequential Rademacher complexity for the product of classes.\nLemma 12. Let F = F1 \u00d7 . . . \u00d7 Fk where each Fj \u2282 RX . Also let \u03c6 : Rk (cid:55)\u2192 R be L-Lipschitz\n\nw.r.t. (cid:107) \u00b7 (cid:107)\u221e norm. Then we have that R(\u03c6 \u25e6 F) \u2264 LO(cid:16)\nvalued functions, R(g(F1, . . . ,Fk)) \u2264 O(cid:16)\n(cid:17)(cid:80)k\n\nlog3/2(T )\n\nCorollary 13. For a \ufb01xed binary function b : {\u00b11}k (cid:55)\u2192 {\u00b11} and classes F1, . . . ,Fk of {\u00b11}-\n\nj=1 R(Fj) .\n\n(cid:17)(cid:80)k\n\nlog3/2(T )\n\nj=1 R(Fj) .\n\nIn the next proposition, we summarize some basic properties of Sequential Rademacher complexity\n(see [21, 6] for the results in the i.i.d. setting):\nProposition 14. Sequential Rademacher complexity satis\ufb01es the following properties: (i) if F \u2282 G,\nthen R(F) \u2264 R(G); (ii) R(F) = R(conv(F)); (iii) R(cF) = |c|R(F) for all c \u2208 R; (iv) If\n\u03c6 : R (cid:55)\u2192 R is L-Lipschitz, then R(\u03c6(F)) \u2264 LR(F); (v) For any h, R(F + h) = R(F) where\nF + h = {f + h : f \u2208 F}.\n\n8 Examples and Applications\nExample: Linear Function Classes Suppose FW is a class consisting of linear functions x (cid:55)\u2192\n(cid:104)w, x(cid:105) where the weight vector w comes from some set W, FW = {x (cid:55)\u2192 (cid:104)w, x(cid:105) : w \u2208 W}. Often,\nit is possible to \ufb01nd a strongly convex function \u03a8(w) \u2265 0 such that \u03a8(w) \u2264 \u03a8max < \u221e for all\nw \u2208 W (for example the function (cid:107)w(cid:107)2\nTheorem 15. Let W be a class of weight vectors such that 0 \u2264 \u03a8(w) \u2264 \u03a8max for all w \u2208 W.\nSuppose that \u03a8 is \u03c3-strongly convex w.r.t. a given norm (cid:107) \u00b7 (cid:107). 
Then, we have, RT (FW ) \u2264\n(cid:107)X(cid:107)(cid:63)\ninput space.\n\np2 \u03a8max T /\u03c3 , where (cid:107)X(cid:107)(cid:63) = supx\u2208X (cid:107)x(cid:107)(cid:63), the maximum dual norm of any vector in the\n\n2 on any bounded subset of Rd).\n\nT ) regret bounds of online mirror descent\nThe above result actually allows us to recover the O(\n(including Zinkevich\u2019s online gradient descent) obtained in the online convex optimization literature.\nThere, the set X is set of convex Lipschitz functions on a convex set F. We interpret f(x) as x(f).\nIt is easy to bound the value of the convex game by that of the linear game [2], i.e. one in which X\nis the set of linear functions. Then we directly appeal to the above theorem to bound the value of\n\u221a\nthe linear game. The online convex optimization setting includes supervised learning using convex\nT ) regret algorithms\nlosses and linear predictors and so our theorem also proves existence of O(\nin that setting.\n\n\u221a\n\nExample: Margin Based Regret We prove a general margin based mistake bound for binary\nclassi\ufb01cation. This shows the generality of our framework since we do not require assumptions\n\n7\n\n\flike convexity to bound the minimax regret. The proof of the following result uses a non-convex\nLipschitz \u201cramp\u201d function along with Lemma 11. As far as we know, this is the \ufb01rst general margin\nbased mistake bound in the online setting for a general function class.\nTheorem 16. For any function class F \u2282 RX bounded by B, there exists a randomized player\nstrategy \u03c0 such that for any sequence (x1, y1), . . . 
$(x_T, y_T) \in (\mathcal{X} \times \{\pm 1\})^T$,
$$\sum_{t=1}^{T} \mathbb{E}_{f_t \sim \pi_t(x_{1:t-1})}\big[\mathbf{1}\{f_t(x_t)\, y_t < 0\}\big] \;\le\; \inf_{\gamma > 0} \left\{ \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathbf{1}\{f(x_t)\, y_t < \gamma\} \;+\; \frac{4}{\gamma}\, \mathcal{R}_T(\mathcal{F}) \;+\; \sqrt{T \log\log\left(\frac{B}{\gamma}\right)} \right\}.$$

Example: Neural Networks and Decision Trees. We now consider a $k$-layer 1-norm neural network. To this end, let the function class $\mathcal{F}_1$ be given by
$$\mathcal{F}_1 = \left\{ x \mapsto \sum_j w^1_j x_j \;\middle|\; \|w^1\|_1 \le B_1 \right\}, \qquad \mathcal{F}_i = \left\{ x \mapsto \sum_j w^i_j\, \sigma(f_j(x)) \;\middle|\; \forall j,\; f_j \in \mathcal{F}_{i-1},\; \|w^i\|_1 \le B_i \right\}$$
for $2 \le i \le k$. The theory we have developed provides us with enough tools to control the sequential Rademacher complexity of classes like the above that are built from simpler components. The following result shows that neural networks can be learned online. A similar result, but for statistical learning, appeared in [6]. Let $\mathcal{X} \subseteq \mathbb{R}^d$, and let $X_\infty$ be such that $\forall x \in \mathcal{X}$, $\|x\|_\infty \le X_\infty$.

Theorem 17. Let $\sigma : \mathbb{R} \mapsto [-1, 1]$ be $L$-Lipschitz. Then
$$\mathcal{R}_T(\mathcal{F}_k) \;\le\; \left( \prod_{i=1}^{k} B_i \right) L^{k-1}\, X_\infty \sqrt{2T \log d}.$$

We can also prove online learnability of decision trees under appropriate restrictions on their depth and number of leaves. We skip the formal statement in the interest of space, but the proof proceeds in a fashion similar to the decision tree result in [6]. The structural results enjoyed by the sequential Rademacher complexity (esp. Corollary 13) are key to making the proof work.

Example: Transductive Learning and Prediction of Individual Sequences. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and let $\hat{N}_\infty(\alpha, \mathcal{F})$ be the classical pointwise (over $\mathcal{X}$) covering number at scale $\alpha$. It is easy to verify that $N_\infty(\alpha, \mathcal{F}, T) \le \hat{N}_\infty(\alpha, \mathcal{F})$ for all $T$. This simple observation can be applied in several situations. First, consider transductive learning, where the set $\mathcal{X} = \{z_i\}_{i=1}^{n}$ is a finite set. To ensure online learnability, it is sufficient to consider an assumption on the dependence of $\hat{N}_\infty(\alpha, \mathcal{F})$ on $\alpha$. An obvious example of such a class is a VC-type class with $\hat{N}_\infty(\alpha, \mathcal{F}) \le (c/\alpha)^d$ for some $c$ which can depend on $n$. Assuming that $\mathcal{F} \subseteq [0,1]^{\mathcal{X}}$, the value of the game is upper bounded by $2 D_T(\mathcal{F}) \le 4\sqrt{dT \log c}$. In particular, for binary prediction, using the Sauer–Shelah lemma ensures that the value of the game is at most $4\sqrt{dT \log(eT)}$, matching the result of [15] up to a constant 2.

In the context of prediction of individual sequences, Cesa-Bianchi and Lugosi [8] proved upper bounds in terms of the (classical) Rademacher complexity and the (classical) Dudley integral. The particular assumption made in [8] is that experts are static. Formally, we define static experts as mappings $f : \{1, \ldots, T\} \mapsto [0,1]$, and let $\mathcal{F}$ denote a class of such experts. Defining $\mathcal{X} = \{1, \ldots, T\}$ puts us in the setting considered earlier with $n = T$. We immediately obtain $4\sqrt{dT \log(eT)}$, matching the results of [8, p. 1873]. For the case of a finite number of experts, clearly $\hat{N}_\infty \le N$, which gives the classical $O(\sqrt{T \log N})$ bound [9].

Example: Isotron. Recently, Kalai and Sastry [16] introduced a method called Isotron for learning Single Index Models (SIM), which generalize linear and logistic regression, generalized linear models, and classification by linear threshold functions. A natural open question posed by the authors is whether there is an online variant of Isotron.
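For intuition about the method whose online variant is in question: the batch Isotron of [16] alternates a perceptron-style update of the weight vector with an isotonic (Pool Adjacent Violators) refit of the monotone link. The following is a minimal numpy sketch, not the authors' implementation; the function names `pav` and `isotron` and the synthetic single-index data are purely illustrative.

```python
import numpy as np

def pav(z, y):
    """Isotonic (non-decreasing) least-squares fit of y against the
    ordering induced by z, via Pool Adjacent Violators.  Returns the
    fitted values in the original order of the inputs."""
    order = np.argsort(z)
    vals, wts = [], []  # block means and block sizes
    for v in y[order].astype(float):
        vals.append(v)
        wts.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, w2 = vals.pop(), wts.pop()
            v1, w1 = vals.pop(), wts.pop()
            vals.append((w1 * v1 + w2 * v2) / (w1 + w2))
            wts.append(w1 + w2)
    fitted = np.repeat(vals, wts)  # expand block means back to points
    out = np.empty(len(y))
    out[order] = fitted
    return out

def isotron(X, y, iters=100):
    """Alternate a perceptron-like update of w with an isotonic refit of
    the link u, following the Isotron idea of Kalai and Sastry."""
    n, d = X.shape
    w = np.zeros(d)
    u_of_z = np.zeros(n)  # current estimates of u(<w, x_i>)
    for _ in range(iters):
        w = w + X.T @ (y - u_of_z) / n  # gradient-like update of the index
        u_of_z = pav(X @ w, y)          # nonparametric refit of the link
    return w, u_of_z

# Illustrative data from a single index model y = u(<w*, x>)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # Euclidean ball
w_star = np.array([0.6, -0.3, 0.2])
y = np.clip(X @ w_star, -1.0, 1.0)  # u = clip is non-decreasing, 1-Lipschitz
w_hat, y_hat = isotron(X, y)
mse = np.mean((y_hat - y) ** 2)
```

The alternation is the essential structure: the linear index is learned by gradient-like updates while the one-dimensional monotone link is re-estimated nonparametrically at each pass.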
Before even attempting a quest for such an algorithm, we can ask a more basic question: is the (Idealized) SIM problem even learnable in the online framework? We answer the question in the positive with the tools we have developed, by proving that the following class (with $\mathcal{X}$ a Euclidean ball in $\mathbb{R}^d$ and $\mathcal{Y} = [-1, 1]$) is learnable:
$$\mathcal{H} = \big\{ f(x, y) = (y - u(\langle w, x \rangle))^2 \;\big|\; u : [-1,1] \mapsto [-1,1] \text{ is non-decreasing and 1-Lipschitz},\; \|w\|_2 \le 1 \big\} \quad (4)$$
where $u$ and $w$ range over the possibilities. Using the machinery we developed, it is not hard to show that the class $\mathcal{H}$ is online learnable in the supervised setting. Moreover, $V_T(\mathcal{H}, \mathcal{X} \times \mathcal{Y}) = O(\sqrt{T} \log^{3/2} T)$.

References

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[2] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, pages 414–424, 2008.
[3] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44:615–631, 1997.
[4] N. Alon and J. Spencer. The Probabilistic Method. John Wiley & Sons, 2nd edition, 2000.
[5] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996. (Special issue on COLT'94.)
[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
[7] S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
[8] N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. Annals of Statistics, pages 1865–1895, 1999.
[9] N.
Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[10] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
[11] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
[12] E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12(4):929–989, 1984.
[13] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
[14] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
[15] S. M. Kakade and A. Kalai. From batch to transductive online learning. In NIPS, 2005.
[16] A. Tauman Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[17] M. J. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
[18] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. High Dimensional Probability II, 47:443–459, 2000.
[19] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
[20] P. Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX(2):245–303, 2000.
[21] S. Mendelson. A few notes on statistical learning theory. In MLSS 2002, pages 1–40, 2003.
[22] S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Inventiones Mathematicae, 152:37–55, 2003.
[23] D. Pollard.
Empirical Processes: Theory and Applications, volume 2. Hayward, CA, 1990.
[24] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145–147, 1972.
[25] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In NIPS, pages 1265–1272. MIT Press, Cambridge, MA, 2007.
[26] S. Shelah. A combinatorial problem: Stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 4:247–261, 1972.
[27] K. Sridharan and A. Tewari. Convex games in Banach spaces. In COLT, 2010.
[28] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
[29] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[30] S. A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
[31] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[32] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
[33] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.