{"title": "Online Learning: Stochastic, Constrained, and Smoothed Adversaries", "book": "Advances in Neural Information Processing Systems", "page_first": 1764, "page_last": 1772, "abstract": "Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario whereby at every time step the worst instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.", "full_text": "Online Learning: Stochastic, Constrained, and\n\nSmoothed Adversaries\n\nAlexander Rakhlin\n\nDepartment of Statistics\n\nUniversity of Pennsylvania\n\nKarthik Sridharan\n\nToyota Technological Institute at Chicago\n\nkarthik@ttic.edu\n\nrakhlin@wharton.upenn.edu\n\nAmbuj Tewari\n\nComputer Science Department\nUniversity of Texas at Austin\nambuj@cs.utexas.edu\n\nAbstract\n\nLearning theory has largely focused on two main learning scenarios: the classical\nstatistical setting where instances are drawn i.i.d. from a \ufb01xed distribution, and\nthe adversarial scenario wherein, at every time step, an adversarially chosen in-\nstance is revealed to the player. It can be argued that in the real world neither of\nthese assumptions is reasonable. 
We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that an exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

1 Introduction

In the papers [1, 10, 11], an array of tools has been developed to study the minimax value of diverse sequential problems under the worst-case assumption on Nature. In [10], many analogues of the classical notions from statistical learning theory have been developed, and these have been extended in [11] to performance measures well beyond the additive regret. The process of sequential symmetrization emerged as a key technique for dealing with complicated nested minimax expressions. In the worst-case model, the developed tools give a unified treatment to such sequential problems as regret minimization, calibration of forecasters, Blackwell's approachability, $\Phi$-regret, and more.

Learning theory has so far focused predominantly on the i.i.d. and the worst-case learning scenarios. Much less is known about learnability in between these two extremes. In the present paper, we make progress towards filling this gap by proposing a framework in which it is possible to variously restrict the behavior of Nature. By restricting Nature to play i.i.d. sequences, the results boil down to the classical notions of statistical learning in the supervised learning scenario. By not placing any restrictions on Nature, we recover the worst-case results of [10]. 
Between these two endpoints of the spectrum, particular assumptions on the adversary yield interesting bounds on the minimax value of the associated problem. Once again, the sequential symmetrization technique arises as the main tool for dealing with the minimax value, but the proofs require more care than in the i.i.d. or completely adversarial settings.

Adapting the game-theoretic language, we will think of the learner and the adversary as the two players of a zero-sum repeated game. The adversary's moves will be associated with "data", while the moves of the learner with a function or a parameter. This point of view is not new: game-theoretic minimax analysis has been at the heart of statistical decision theory for more than half a century (see [3]). In fact, there is a well-developed theory of minimax estimation when restrictions are put on either the choice of the adversary or the allowed estimators by the player. We are not aware of a similar theory for sequential problems with non-i.i.d. data.

The main contribution of this paper is the development of tools for the analysis of online scenarios where the adversary's moves are restricted in various ways. In addition to the general theory, we consider several interesting scenarios which can be captured by our framework. All proofs are deferred to the appendix.

2 Value of the Game

Let $\mathcal{F}$ be a closed subset of a complete separable metric space, denoting the set of moves of the learner. Suppose the adversary chooses from the set $\mathcal{X}$. Consider the Online Learning Model, defined as a $T$-round interaction between the learner and the adversary: on round $t = 1, \ldots, T$, the learner chooses $f_t \in \mathcal{F}$, the adversary simultaneously picks $x_t \in \mathcal{X}$, and the learner suffers loss $f_t(x_t)$. The goal of the learner is to minimize regret, defined as
$$\sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t).$$
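As an illustrative aside (not part of the paper's formal development), the $T$-round protocol and the regret it induces can be sketched in a few lines of Python; the finite class, the loss, and the adversary and learner below are hypothetical choices made only for this example.

```python
def online_play(F, adversary, learner, T):
    """Run the T-round protocol: on round t the learner picks f_t, the
    adversary simultaneously picks x_t, and the learner suffers f_t(x_t).
    Returns regret against the best fixed f in F in hindsight."""
    history, cum_loss = [], 0.0
    for t in range(T):
        f_t = learner(list(history))    # may depend on x_1..x_{t-1} only
        x_t = adversary(list(history))  # may also depend on the past
        cum_loss += f_t(x_t)
        history.append(x_t)
    best = min(sum(f(x) for x in history) for f in F)
    return cum_loss - best

# Hypothetical instance: two constant predictors under absolute loss on bits.
F = [lambda x: abs(x - 0), lambda x: abs(x - 1)]
alternating = lambda hist: len(hist) % 2   # deterministic adversary: 0,1,0,1,...
always_f0 = lambda hist: F[0]              # a fixed (non-adaptive) learner
print(online_play(F, alternating, always_f0, T=100))  # prints 0.0
```

Here both fixed predictors incur loss 50 on the alternating sequence, so the fixed learner achieves zero regret; an adversary who always plays 1 would instead inflict regret against the better constant.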
It is a standard fact that simultaneity of the choices can be formalized by the first player choosing a mixed strategy; the second player then picks an action based on this mixed strategy, but not on its realization. We therefore consider randomized learners who predict a distribution $q_t \in \mathcal{Q}$ on every round, where $\mathcal{Q}$ is the set of probability distributions on $\mathcal{F}$, assumed to be weakly compact. The set of probability distributions on $\mathcal{X}$ (mixed strategies of the adversary) is denoted by $\mathcal{P}$.

We would like to capture the fact that sequences $(x_1, \ldots, x_T)$ cannot be arbitrary. This is achieved by defining restrictions on the adversary, that is, subsets of "allowed" distributions for each round. These restrictions limit the scope of available mixed strategies for the adversary.

Definition 1. A restriction $\mathcal{P}_{1:T}$ on the adversary is a sequence $\mathcal{P}_1, \ldots, \mathcal{P}_T$ of mappings $\mathcal{P}_t : \mathcal{X}^{t-1} \to 2^{\mathcal{P}}$ such that $\mathcal{P}_t(x_{1:t-1})$ is a convex subset of $\mathcal{P}$ for any $x_{1:t-1} \in \mathcal{X}^{t-1}$.

Note that the restrictions depend on the past moves of the adversary, but not on those of the player. We will write $\mathcal{P}_t$ instead of $\mathcal{P}_t(x_{1:t-1})$ when $x_{1:t-1}$ is clearly defined. Using the notion of restrictions, we can give names to several types of adversaries that we will study in this paper.

(1) A worst-case adversary is defined by the vacuous restrictions $\mathcal{P}_t(x_{1:t-1}) = \mathcal{P}$. That is, any mixed strategy is available to the adversary, including any deterministic point distribution.

(2) A constrained adversary is defined by $\mathcal{P}_t(x_{1:t-1})$ being the set of all distributions supported on the set $\{x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1\}$ for some deterministic binary-valued constraint $C_t$. The deterministic constraint can, for instance, ensure that the length of the path determined by the moves $x_1, \ldots, x_t$ stays below the allowed budget.

(3) A smoothed adversary picks the worst-case sequence, which then gets corrupted by i.i.d. noise. Equivalently, we can view this as a restriction on the adversary, who chooses the "center" (or a parameter) of the noise distribution.

Using techniques developed in this paper, we can also study the following adversaries (omitted due to lack of space):

(4) A hybrid adversary in the supervised learning game picks the worst-case label $y_t$, but is forced to draw the $x_t$-variable from a fixed distribution [6].

(5) An i.i.d. adversary is defined by a time-invariant restriction $\mathcal{P}_t(x_{1:t-1}) = \{p\}$ for every $t$ and some $p \in \mathcal{P}$.

For the given restrictions $\mathcal{P}_{1:T}$, we define the value of the game as
$$\mathcal{V}_T(\mathcal{P}_{1:T}) \triangleq \inf_{q_1 \in \mathcal{Q}} \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{f_1, x_1} \inf_{q_2 \in \mathcal{Q}} \sup_{p_2 \in \mathcal{P}_2} \mathbb{E}_{f_2, x_2} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{f_T, x_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \quad (1)$$
where $f_t$ has distribution $q_t$ and $x_t$ has distribution $p_t$. As in [10], the adversary is adaptive, that is, chooses $p_t$ based on the history of moves $f_{1:t-1}$ and $x_{1:t-1}$. At this point, the only difference from the setup of [10] is in the restrictions $\mathcal{P}_t$ on the adversary. Because these restrictions might not allow point distributions, the suprema over the $p_t$'s in (1) cannot be equivalently written as suprema over the $x_t$'s.

A word about the notation. In [10], the value of the game is written as $\mathcal{V}_T(\mathcal{F})$, signifying that the main object of study is $\mathcal{F}$. In [11], it is written as $\mathcal{V}_T(\ell, \Phi_T)$, since the focus is on the complexity of the set of transformations $\Phi_T$ and the payoff mapping $\ell$. In the present paper, the main focus is indeed on the restrictions on the adversary, justifying our choice $\mathcal{V}_T(\mathcal{P}_{1:T})$ for the notation.

The first step is to apply the minimax theorem. 
To this end, we verify the necessary conditions. Our assumption that $\mathcal{F}$ is a closed subset of a complete separable metric space implies that $\mathcal{Q}$ is tight, and by Prokhorov's theorem compactness of $\mathcal{Q}$ under the weak topology is equivalent to tightness [14]. Compactness under the weak topology allows us to proceed as in [10]. Additionally, we require that the restriction sets are compact and convex.

Theorem 1. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\mathcal{P}_{1:T}$ be the restrictions, and assume that for any $x_{1:t-1}$, $\mathcal{P}_t(x_{1:t-1})$ satisfies the necessary conditions for the minimax theorem to hold. Then
$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]. \quad (2)$$

The nested sequence of suprema and expected values in Theorem 1 can be re-written succinctly as
$$\mathcal{V}_T(\mathcal{P}_{1:T}) = \sup_{p \in \mathbf{P}} \mathbb{E}_{x_1 \sim p_1} \mathbb{E}_{x_2 \sim p_2(\cdot|x_1)} \cdots \mathbb{E}_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]$$
$$= \sup_{p \in \mathbf{P}} \mathbb{E} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \quad (3)$$
where the supremum is over all joint distributions $p$ over sequences such that $p$ satisfies the restrictions as described below. Given a joint distribution $p$ on sequences $(x_1, \ldots, x_T) \in \mathcal{X}^T$, we denote the associated conditional distributions by $p_t(\cdot|x_{1:t-1})$. We can think of the choice $p$ as a sequence of oblivious strategies $\{p_t : \mathcal{X}^{t-1} \to \mathcal{P}\}_{t=1}^{T}$, mapping the prefix $x_{1:t-1}$ to a conditional distribution $p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1})$. We will indeed call $p$ a "joint distribution" or an "oblivious strategy" interchangeably. 
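To make the notion of an oblivious strategy concrete, here is a small sketch of ours (not from the paper): a joint distribution $p$ is represented by its conditional samplers $p_t(\cdot|x_{1:t-1})$, and the particular restriction illustrated, a unit-step random walk (cf. the random walk in Example 1), is a hypothetical choice.

```python
import random

def sample_sequence(conditionals, T, rng):
    """Draw (x_1, ..., x_T) from a joint distribution given as an oblivious
    strategy: a list of maps taking the prefix x_{1:t-1} to a sampler for
    the conditional distribution p_t(. | x_{1:t-1})."""
    xs = []
    for t in range(T):
        sampler = conditionals[t](tuple(xs))  # p_t(. | x_{1:t-1})
        xs.append(sampler(rng))
    return xs

# Hypothetical restriction: each p_t(. | x_{1:t-1}) must put the next move
# exactly one unit away from the previous move, i.e. a simple random walk.
def random_walk_step(prefix):
    center = prefix[-1] if prefix else 0.0
    return lambda rng: center + rng.choice([-1.0, 1.0])

rng = random.Random(0)
path = sample_sequence([random_walk_step] * 10, T=10, rng=rng)
```

Each round's sampler only sees the prefix of past moves, never the learner's choices, matching the definition of a restriction.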
We say that a joint distribution $p$ satisfies the restrictions if for any $t$ and any $x_{1:t-1} \in \mathcal{X}^{t-1}$, $p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1})$. The set of all joint distributions satisfying the restrictions is denoted by $\mathbf{P}$. We note that Theorem 1 cannot be deduced immediately from the analogous result in [10], as it is not clear how the per-round restrictions on the adversary come into play after applying the minimax theorem. Nevertheless, it is comforting that the restrictions directly translate into the set $\mathbf{P}$ of oblivious strategies satisfying the restrictions.

Before continuing with our goal of upper-bounding the value of the game, we state the following interesting facts.

Proposition 2. There is an oblivious minimax optimal strategy for the adversary, and there is a corresponding minimax optimal strategy for the player that does not depend on its own moves.

The latter statement of the proposition is folklore for worst-case learning, yet we have not seen a proof of it in the literature. The proposition holds for all online learning settings with legal restrictions $\mathcal{P}_{1:T}$, encompassing also the no-restrictions setting of worst-case online learning [10]. The result crucially relies on the fact that the objective is external regret.

3 Symmetrization and Random Averages

Theorem 1 is a useful representation of the value of the game. As the next step, we upper bound it with an expression which is easier to study. Such an expression is obtained by introducing Rademacher random variables. This process can be termed sequential symmetrization and has been exploited in [1, 10, 11]. The restrictions $\mathcal{P}_t$, however, make sequential symmetrization considerably more involved than in the papers cited above. 
The main difficulty arises from the fact that the set $\mathcal{P}_t(x_{1:t-1})$ depends on the sequence $x_{1:t-1}$, and symmetrization (that is, replacement of $x_s$ with $x'_s$) has to be done with care as it affects this dependence. Roughly speaking, in the process of symmetrization, a tangent sequence $x'_1, x'_2, \ldots$ is introduced such that $x_t$ and $x'_t$ are independent and identically distributed given "the past". However, "the past" is itself an interleaving choice of the original sequence and the tangent sequence.

Define the "selector function" $\chi : \mathcal{X} \times \mathcal{X} \times \{\pm 1\} \to \mathcal{X}$ by $\chi(x, x', \epsilon) = x'$ if $\epsilon = 1$ and $\chi(x, x', \epsilon) = x$ if $\epsilon = -1$. When $x_t$ and $x'_t$ are understood from the context, we will use the shorthand $\chi_t(\epsilon) := \chi(x_t, x'_t, \epsilon)$. In other words, $\chi_t$ selects between $x_t$ and $x'_t$ depending on the sign of $\epsilon$. Throughout the paper, we deal with binary trees, which arise from symmetrization [10]. Given some set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree of depth $T$ is a sequence $\mathbf{z} = (z_1, \ldots, z_T)$ of $T$ mappings $z_i : \{\pm 1\}^{i-1} \to \mathcal{Z}$. The $T$-tuple $\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ defines a path. For brevity, we write $z_t(\epsilon)$ instead of $z_t(\epsilon_{1:t-1})$.

Given a joint distribution $p$, consider the "$(\mathcal{X} \times \mathcal{X})^{T-1} \to \mathcal{P}(\mathcal{X} \times \mathcal{X})$"-valued probability tree $\rho = (\rho_1, \ldots, \rho_T)$ defined by
$$\rho_t(\epsilon_{1:t-1})\big( (x_1, x'_1), \ldots, (x_{T-1}, x'_{T-1}) \big) = \big( p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})),\ p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})) \big).$$
In other words, the values of the mappings $\rho_t(\epsilon)$ are products of conditional distributions, where conditioning is done with respect to a sequence made from the $x_s$ and $x'_s$ depending on the signs of the $\epsilon_s$. We note that the difficulty in intermixing the $x$ and $x'$ sequences does not arise in i.i.d. or worst-case symmetrization. However, in between these extremes the notational complexity seems to be unavoidable if we are to employ symmetrization and obtain a version of Rademacher complexity.

As an example, consider the "left-most" path $\epsilon = -\mathbf{1}$ in a binary tree of depth $T$, where $\mathbf{1} = (1, \ldots, 1)$ is a $T$-dimensional vector of ones. Then all the selectors $\chi(x_t, x'_t, \epsilon_t)$ choose the sequence $x_1, \ldots, x_T$. The probability tree $\rho$ on the "left-most" path is, therefore, defined by the conditional distributions $p_t(\cdot|x_{1:t-1})$; on the path $\epsilon = \mathbf{1}$, the conditional distributions are $p_t(\cdot|x'_{1:t-1})$. Slightly abusing the notation, we will write $\rho_t(\epsilon)\big( (x_1, x'_1), \ldots, (x_{t-1}, x'_{t-1}) \big)$ for the probability tree, since $\rho_t$ clearly depends only on the prefix up to time $t-1$. Throughout the paper, it will be understood that the tree $\rho$ is obtained from $p$ as described above. Since all the conditional distributions of $p$ satisfy the restrictions, so do the corresponding distributions of the probability tree $\rho$. By saying that $\rho$ satisfies the restrictions we then mean that $p \in \mathbf{P}$.

Sampling of a pair of $\mathcal{X}$-valued trees from $\rho$, written as $(\mathbf{x}, \mathbf{x}') \sim \rho$, is defined as the following recursive process: for any $\epsilon \in \{\pm 1\}^T$, $(x_1(\epsilon), x'_1(\epsilon)) \sim \rho_1(\epsilon)$ and
$$(x_t(\epsilon), x'_t(\epsilon)) \sim \rho_t(\epsilon)\big( (x_1(\epsilon), x'_1(\epsilon)), \ldots, (x_{t-1}(\epsilon), x'_{t-1}(\epsilon)) \big) \quad \text{for } 2 \le t \le T. \quad (4)$$

To gain a better understanding of the sampling process, consider the first few levels of the tree. The roots $x_1, x'_1$ of the trees $\mathbf{x}, \mathbf{x}'$ are sampled from $p_1$, the conditional distribution for $t = 1$ given by $p$. Next, say, $\epsilon_1 = +1$. Then the "right" children of $x_1$ and $x'_1$ are sampled via $x_2(+1), x'_2(+1) \sim p_2(\cdot|x'_1)$, since $\chi_1(+1)$ selects $x'_1$. On the other hand, the "left" children $x_2(-1), x'_2(-1)$ are both distributed according to $p_2(\cdot|x_1)$. Now, suppose $\epsilon_1 = +1$ and $\epsilon_2 = -1$. Then $x_3(+1,-1), x'_3(+1,-1)$ are both sampled from $p_3(\cdot|x'_1, x_2(+1))$.

The proof of Theorem 3 reveals why such an intricate conditional structure arises, and Proposition 5 below shows that this structure greatly simplifies for i.i.d. and worst-case situations. Nevertheless, the process described above allows us to define a unified notion of Rademacher complexity for the spectrum of assumptions between the two extremes.

Definition 2. The distribution-dependent sequential Rademacher complexity of a function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ is defined as
$$\mathfrak{R}_T(\mathcal{F}, p) \triangleq \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t(\epsilon)) \right]$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of i.i.d. Rademacher random variables and $\rho$ is the probability tree associated with $p$.

We now prove an upper bound on the value $\mathcal{V}_T(\mathcal{P}_{1:T})$ of the game in terms of this distribution-dependent sequential Rademacher complexity. The result cannot be deduced directly from [10], and it greatly increases the scope of problems whose learnability can now be studied in a unified manner.

Theorem 3. 
The minimax value is bounded as
$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{p \in \mathbf{P}} \mathfrak{R}_T(\mathcal{F}, p). \quad (5)$$
More generally, for any measurable function $M_t$ such that $M_t(p, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(p, f, \mathbf{x}', \mathbf{x}, -\epsilon)$,
$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{p \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \big( f(x_t(\epsilon)) - M_t(p, f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right].$$

The following corollary provides a natural "centered" version of the distribution-dependent Rademacher complexity. That is, the complexity can be measured by relative shifts in the adversarial moves.

Corollary 4. For the game with restrictions $\mathcal{P}_{1:T}$,
$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{p \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \big( f(x_t(\epsilon)) - \mathbb{E}_{t-1} f(x_t(\epsilon)) \big) \right]$$
where $\mathbb{E}_{t-1}$ denotes the conditional expectation of $x_t(\epsilon)$.

Example 1. Suppose $\mathcal{F}$ is a unit ball in a Banach space and $f(x) = \langle f, x \rangle$. Then
$$\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2 \sup_{p \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^{T} \epsilon_t \big( x_t(\epsilon) - \mathbb{E}_{t-1} x_t(\epsilon) \big) \right\|.$$
Suppose the adversary plays a simple random walk (e.g., $p_t(x|x_1, \ldots, x_{t-1}) = p_t(x|x_{t-1})$ is uniform on a unit sphere). For simplicity, suppose this is the only strategy allowed by the set $\mathbf{P}$. Then $x_t(\epsilon) - \mathbb{E}_{t-1} x_t(\epsilon)$ are independent increments when conditioned on the history. Further, the increments do not depend on $\epsilon_t$. Thus, $\mathcal{V}_T(\mathcal{P}_{1:T}) \le 2\, \mathbb{E} \left\| \sum_{t=1}^{T} Y_t \right\|$, where $\{Y_t\}$ is the corresponding random walk.

We now show that the distribution-dependent sequential Rademacher complexity for i.i.d. 
data is precisely the classical Rademacher complexity, and further show that the distribution-dependent sequential Rademacher complexity is always upper bounded by the worst-case sequential Rademacher complexity defined in [10].

Proposition 5. First, consider the i.i.d. restrictions $\mathcal{P}_t = \{p\}$ for all $t$, where $p$ is some fixed distribution on $\mathcal{X}$, and let $\rho$ be the process associated with the joint distribution $\mathbf{p} = p^T$. Then
$$\mathfrak{R}_T(\mathcal{F}, \mathbf{p}) = \mathcal{R}_T(\mathcal{F}, p), \quad \text{where} \quad \mathcal{R}_T(\mathcal{F}, p) \triangleq \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right] \quad (6)$$
is the classical Rademacher complexity. Second, for any joint distribution $p$,
$$\mathfrak{R}_T(\mathcal{F}, p) \le \mathfrak{R}_T(\mathcal{F}), \quad \text{where} \quad \mathfrak{R}_T(\mathcal{F}) \triangleq \sup_{\mathbf{x}}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t(\epsilon)) \right] \quad (7)$$
is the sequential Rademacher complexity defined in [10].

In the case of hybrid learning, the adversary chooses a sequence of pairs $(x_t, y_t)$ where the instances $x_t$ are i.i.d. but the labels $y_t$ are fully adversarial. The distribution-dependent Rademacher complexity in such a hybrid case can be upper bounded by a very natural quantity: a random average where the expectation is taken over the $x_t$'s and a supremum is taken over $\mathcal{Y}$-valued trees. The distribution-dependent Rademacher complexity thus itself becomes a hybrid between the classical Rademacher complexity and the worst-case sequential Rademacher complexity. For more details, see Lemma 17 in the Appendix as another example of an analysis of the distribution-dependent sequential Rademacher complexity.

Distribution-dependent sequential Rademacher complexity enjoys many of the nice properties satisfied by both the classical and worst-case Rademacher complexities. As shown in [10], these properties are handy tools for proving upper bounds on the value in various examples. 
We have: (a) if $\mathcal{F} \subset \mathcal{G}$, then $\mathfrak{R}_T(\mathcal{F}, p) \le \mathfrak{R}_T(\mathcal{G}, p)$; (b) $\mathfrak{R}_T(\mathcal{F}, p) = \mathfrak{R}_T(\mathrm{conv}(\mathcal{F}), p)$; (c) $\mathfrak{R}_T(c\mathcal{F}, p) = |c|\, \mathfrak{R}_T(\mathcal{F}, p)$ for all $c \in \mathbb{R}$; (d) for any $h$, $\mathfrak{R}_T(\mathcal{F} + h, p) = \mathfrak{R}_T(\mathcal{F}, p)$, where $\mathcal{F} + h = \{f + h : f \in \mathcal{F}\}$.

In addition to the above properties, upper bounds on $\mathfrak{R}_T(\mathcal{F}, p)$ can be derived via the sequential covering numbers defined in [10]. This notion of a cover captures the sequential complexity of a function class on a given $\mathcal{X}$-valued tree $\mathbf{x}$. One can then show an analogue of the Dudley integral bound, where the complexity is averaged with respect to the underlying process $(\mathbf{x}, \mathbf{x}') \sim \rho$.

4 Application: Constrained Adversaries

In this section, we consider adversaries who are deterministically constrained in the sequences of actions they can play. It is often useful to consider scenarios where the adversary is worst-case, yet has some budget or constraint to satisfy while picking the actions. Examples of such scenarios include, for instance, games where the adversary is constrained to make moves that are close in some fashion to the previous move, linear games with bounded variance, and so on. Below we formulate such games quite generally through arbitrary constraints that the adversary has to satisfy on each round. We easily derive several results to illustrate the versatility of the developed framework.

For a $T$-round game, consider an adversary who is only allowed to play sequences $x_1, \ldots, x_T$ such that at round $t$ the constraint $C_t(x_1, \ldots, x_t) = 1$ is satisfied, where $C_t : \mathcal{X}^t \to \{0, 1\}$ represents the constraint on the sequence played so far. The constrained adversary can be viewed as a stochastic adversary with restrictions on the conditional distribution at time $t$ given by the set of all Borel distributions on the set $\mathcal{X}_t(x_{1:t-1}) \triangleq \{x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1\}$. 
Since this set includes all point distributions on each $x \in \mathcal{X}_t$, the sequential complexity simplifies in a way similar to worst-case adversaries. We write $\mathcal{V}_T(C_{1:T})$ for the value of the game with the given constraints. Now, assume that for any $x_{1:t-1}$, the set of all distributions on $\mathcal{X}_t(x_{1:t-1})$ is weakly compact, in a way similar to the compactness of $\mathcal{P}$. That is, the sets $\mathcal{P}_t(x_{1:t-1})$ satisfy the necessary conditions for the minimax theorem to hold. We have the following corollaries of Theorems 1 and 3.

Corollary 6. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\{C_t : \mathcal{X}^t \to \{0, 1\}\}_{t=1}^{T}$ be the constraints. Then
$$\mathcal{V}_T(C_{1:T}) = \sup_{p \in \mathbf{P}} \mathbb{E} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right] \quad (8)$$
where $p$ ranges over all distributions over sequences $(x_1, \ldots, x_T)$ such that $C_t(x_{1:t}) = 1$ for all $t$.

Corollary 7. Let the set $\mathcal{T}$ be a set of pairs $(\mathbf{x}, \mathbf{x}')$ of $\mathcal{X}$-valued trees with the property that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$, $C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x_t(\epsilon)) = C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x'_t(\epsilon)) = 1$. The minimax value is bounded as
$$\mathcal{V}_T(C_{1:T}) \le 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t(\epsilon)) \right].$$
More generally, for any measurable function $M_t$ such that $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$,
$$\mathcal{V}_T(C_{1:T}) \le 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \big( f(x_t(\epsilon)) - M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right].$$

Armed with these results, we can recover and extend some known results on online learning against budgeted adversaries. 
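As a small illustration of ours (not from the paper), per-round constraints $C_t$ can be checked mechanically on a candidate sequence; the scalar "stay near the running mean" budget below is a hypothetical stand-in for the variance-style constraints analyzed next.

```python
def satisfies_constraints(xs, constraints):
    """Check that a sequence x_1..x_T is legal for a constrained adversary:
    C_t(x_1, ..., x_t) must equal 1 (True) for every round t."""
    return all(C(xs[: t + 1]) for t, C in enumerate(constraints))

# Hypothetical budget: each move must stay within sigma of the running
# average of the previous moves (a scalar variance-style constraint).
def within_sigma_of_mean(sigma):
    def C(prefix):
        *past, x = prefix
        center = sum(past) / len(past) if past else 0.0
        return abs(x - center) <= sigma
    return C

constraints = [within_sigma_of_mean(0.5)] * 4
print(satisfies_constraints([0.0, 0.4, 0.1, 0.3], constraints))  # prints True
print(satisfies_constraints([0.0, 1.0, 0.0, 0.0], constraints))  # prints False
```

The second sequence is rejected because its second move jumps a full unit away from the running mean, exceeding the budget of 0.5.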
The first result says that if the adversary is not allowed to move by more than $\sigma_t$ away from its previous average of decisions, the player has a strategy to exploit this fact and obtain lower regret. For the $\ell_2$-norm, such "total variation" bounds have been achieved in [4] up to a $\log T$ factor. Our analysis seamlessly incorporates variance measured in arbitrary norms, not just $\ell_2$. We emphasize that such certificates of learnability are not possible with the analysis of [10].

Proposition 8 (Variance Bound). Consider the online linear optimization setting with $\mathcal{F} = \{f : \Psi(f) \le R^2\}$ for a $\lambda$-strongly convex function $\Psi : \mathcal{F} \to \mathbb{R}_+$ on $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_* \le 1\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$ and $x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^{T}$ given by $C_t(x_1, \ldots, x_{t-1}, x) = 1$ if $\|x - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau\|_* \le \sigma_t$ and $0$ otherwise. Then
$$\mathcal{V}_T(C_{1:T}) \le 2\sqrt{2}\, R \sqrt{\lambda^{-1} \sum_{t=1}^{T} \sigma_t^2}.$$

In particular, we obtain the following $\ell_2$ variance bound. Consider the case when $\Psi : \mathcal{F} \to \mathbb{R}_+$ is given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal{F} = \{f : \|f\|_2 \le 1\}$ and $\mathcal{X} = \{x : \|x\|_2 \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau\|_2 \le \sigma_t$. In this case we can conclude that $\mathcal{V}_T(C_{1:T}) \le 2\sqrt{2}\sqrt{\sum_{t=1}^{T} \sigma_t^2}$. We can also derive a variance bound over the simplex. Let $\Psi(f) = \sum_{i=1}^{d} f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\max_{j \in [d]} |x_t[j] - \frac{1}{t-1}\sum_{\tau=1}^{t-1} x_\tau[j]| \le \sigma_t$. For any $f \in \mathcal{F}$, $\Psi(f) \le \log(d)$, and so we conclude that $\mathcal{V}_T(C_{1:T}) \le 2\sqrt{2}\sqrt{\log(d) \sum_{t=1}^{T} \sigma_t^2}$.

The next proposition gives a bound whenever the adversary is constrained to choose his decision from a small ball around the previous decision.

Proposition 9 (Slowly-Changing Decisions). Consider the online linear optimization setting where the adversary's move at any time is close to the move during the previous time step. Let $\mathcal{F} = \{f : \Psi(f) \le R^2\}$ where $\Psi : \mathcal{F} \to \mathbb{R}_+$ is a $\lambda$-strongly convex function on $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_* \le B\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$ and $x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^{T}$ given by $C_t(x_1, \ldots, x_{t-1}, x) = 1$ if $\|x - x_{t-1}\|_* \le \delta$ and $0$ otherwise. Then
$$\mathcal{V}_T(C_{1:T}) \le 2 R \delta \sqrt{2T/\lambda}.$$
In particular, consider the case of a Euclidean-norm restriction on the moves. Let $\Psi : \mathcal{F} \to \mathbb{R}_+$ be given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal{F} = \{f : \|f\|_2 \le 1\}$ and $\mathcal{X} = \{x : \|x\|_2 \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_2 \le \delta$. In this case we can conclude that $\mathcal{V}_T(C_{1:T}) \le 2\delta\sqrt{2T}$. For the case of decision-making on the simplex, we obtain the following result. Let $\Psi(f) = \sum_{i=1}^{d} f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_\infty \le \delta$. In this case, note that for any $f \in \mathcal{F}$, $\Psi(f) \le \log(d)$, and so we can conclude that $\mathcal{V}_T(C_{1:T}) \le 2\delta\sqrt{2T \log(d)}$.

5 Application: Smoothed Adversaries

The development of smoothed analysis over the past decade is arguably one of the landmarks in the study of complexity of algorithms. 
In contrast to the overly optimistic average-case complexity and the overly pessimistic worst-case complexity, smoothed complexity can be seen as a more realistic measure of an algorithm's performance. In their groundbreaking work, Spielman and Teng [13] showed that the smoothed running-time complexity of the simplex method is polynomial. This result explains the good performance of the method in practice despite its exponential-time worst-case complexity. In this section, we consider the effect of smoothing on learnability.

It is well known that there is a gap between the i.i.d. and the worst-case scenarios. In fact, we do not need to go far for an example: a simple class of threshold functions on a unit interval is learnable in the i.i.d. supervised learning scenario, yet difficult in the online worst-case model [8, 2, 9]. This fact is reflected in the corresponding combinatorial dimensions: the Vapnik-Chervonenkis dimension is one, whereas the Littlestone dimension is infinite. The proof of the latter fact, however, reveals that the infinite number of mistakes on the part of the player is due to the infinite resolution of the carefully chosen adversarial sequence. We can argue that this infinite precision is an unreasonable assumption on the power of a real-world opponent. The idea of limiting the power of the malicious adversary through perturbing the sequence can be traced back to Posner and Kulkarni [9]. The authors considered online learning of functions of bounded variation, but in the so-called realizable setting (that is, when labels are given by some function in the given class).

We define the smoothed online learning model as the following $T$-round interaction between the learner and the adversary. 
On round $t$, the learner chooses $f_t \in \mathcal{F}$; the adversary simultaneously chooses $x_t \in \mathcal{X}$, which is then perturbed by some noise $s_t \sim \sigma$, yielding a value $\tilde{x}_t = \omega(x_t, s_t)$; and the player suffers $f_t(\tilde{x}_t)$. Regret is defined with respect to the perturbed sequence. Here $\omega : \mathcal{X} \times \mathcal{S} \to \mathcal{X}$ is some measurable mapping; for instance, additive disturbances can be written as $\tilde{x} = \omega(x, s) = x + s$. If $\omega$ keeps $x_t$ unchanged, that is, $\omega(x_t, s_t) = x_t$, the setting is precisely the standard online learning model. In the full-information version, we assume that the choice $\tilde{x}_t$ is revealed to the player at the end of round $t$. We now recognize that the setting is nothing but a particular way to restrict the adversary. That is, the choice $x_t \in \mathcal{X}$ defines a parameter of a mixed strategy from which an actual move $\omega(x_t, s_t)$ is drawn; for instance, for additive zero-mean Gaussian noise, $x_t$ defines the center of the distribution from which $x_t + s_t$ is drawn. In other words, the noise does not allow the adversary to play any desired mixed strategy.

The value of the smoothed online learning game (as defined in (1)) can be equivalently written as
$$\mathcal{V}_T = \inf_{q_1} \sup_{x_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ s_1 \sim \sigma}} \inf_{q_2} \sup_{x_2} \mathbb{E}_{\substack{f_2 \sim q_2 \\ s_2 \sim \sigma}} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{\substack{f_T \sim q_T \\ s_T \sim \sigma}} \left[ \sum_{t=1}^{T} f_t(\omega(x_t, s_t)) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(\omega(x_t, s_t)) \right]$$
where the infima are over $q_t \in \mathcal{Q}$ and the suprema are over $x_t \in \mathcal{X}$. Using sequential symmetrization, we deduce the following upper bound on the value of the smoothed online learning game.

Theorem 10. The value of the smoothed online learning game is bounded above as
$$\mathcal{V}_T \le 2 \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{s_1 \sim \sigma} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T \in \mathcal{X}} \mathbb{E}_{s_T \sim \sigma} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\omega(x_t, s_t)) \right].$$

We now demonstrate how Theorem 10 can be used to show learnability for smoothed learning of threshold functions. First, consider the supervised game with threshold functions on a unit interval (that is, non-homogenous hyperplanes). The moves of the adversary are pairs $x = (z, y)$ with $z \in [0, 1]$ and $y \in \{0, 1\}$, and the binary-valued function class $\mathcal{F}$ is defined by
$$\mathcal{F} = \{ f_\theta(z, y) = |y - \mathbf{1}\{z < \theta\}| : \theta \in [0, 1] \}, \quad (9)$$
that is, every function is associated with a threshold $\theta \in [0, 1]$. The class $\mathcal{F}$ has infinite Littlestone dimension and is not learnable in the worst-case online framework. Consider a smoothed scenario, with the $z$-variable of the adversarial move $(z, y)$ perturbed by additive uniform noise $\sigma = \mathrm{Unif}[-\gamma/2, \gamma/2]$ for some $\gamma \ge 0$. That is, the actual move revealed to the player at time $t$ is $(z_t + s_t, y_t)$, with $s_t \sim \sigma$. Any non-trivial upper bound on regret has to depend on particular noise assumptions, as $\gamma = 0$ corresponds to the case with infinite Littlestone dimension. For the uniform disturbance, intuition tells us that noise implies a margin, and we should expect a $1/\gamma$ complexity parameter to appear in the bounds. The next lemma quantifies the intuition that additive noise limits the precision of the adversary.

Lemma 11. Let $\theta_1, \ldots, \theta_N$ be obtained by discretizing the interval $[0, 1]$ into $N = T^a$ bins $[\theta_i, \theta_{i+1})$ of length $T^{-a}$, for some $a \ge 3$. Then, for any sequence $z_1, \ldots, z_T \in [0, 1]$, with probability at least $1 - \frac{1}{\gamma T^{a-2}}$, no two elements of the sequence $z_1 + s_1, \ldots, z_T + s_T$ belong to the same interval $[\theta_i, \theta_{i+1})$, where $s_1, \ldots, s_T$ are i.i.d. 
Unif[\u2212\u03b3/2, \u03b3/2].\nWe now observe that, conditioned on the event in Lemma 11, the upper bound on the value in\nTheorem 10 is a supremum of N martingale difference sequences! We then arrive at:\nProposition 12. For the problem of smoothed online learning of thresholds in 1-D, the value is\n\nVT \u2264 2 +p2T (4 log T + log(1/\u03b3))\n\nWhat we found is somewhat surprising: for a problem which is not learnable in the online worst-\ncase scenario, an exponentially small noise added to the moves of the adversary yields a learnable\nproblem. This shows, at least in the given example, that the worst-case analysis and Littlestone\u2019s\ndimension are brittle notions which might be too restrictive in the real world, where some noise is\nunavoidable. It is comforting that small additive noise makes the problem learnable!\n\nThe proof for smoothed learning of half-spaces in higher dimension follows the same route as the\none-dimensional exposition. For simplicity, assume the hyperplanes are homogenous and Z =\nSd\u22121 \u2282 Rd, Y = {\u22121, 1}, X = Z\u00d7Y. De\ufb01ne F = {f\u03b8(z, y) = 1{y hz, \u03b8i > 0} : \u03b8 \u2208 Sd\u22121}, and\nassume that the noise is distributed uniformly on a square patch with side-length \u03b3 on the surface of\nthe sphere Sd\u22121. We can also consider other distributions, possibly with support on a d-dimensional\nball instead.\nProposition 13. For the problem of smoothed online learning of half-spaces,\n\nVT = O sdT (cid:18)log(cid:18) 1\n\n\u03b3(cid:19) +\n\n3\n\nd \u2212 1\n\n\u03b3(cid:19)\nlog T(cid:19) + vd\u22122 \u00b7(cid:18) 1\n\n3\n\nd\u22121!\n\nwhere vd\u22122 is constant depending only on the dimension d.\n\nWe conclude that half spaces are online learnable in the smoothed model, since the upper bound of\nProposition 13 guarantees existence of an algorithm which achieves this regret. 
In fact, for the two examples considered in this section, the Exponential Weights Algorithm on the discretization given by Lemma 11 is a (computationally infeasible) algorithm achieving the bound.

References

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.

[2] S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[3] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.

[4] E. Hazan and S. Kale. Better algorithms for benign bandits. In SODA, 2009.

[5] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. NIPS, 22, 2008.

[6] A. Lazaric and R. Munos. Hybrid stochastic-adversarial on-line learning. In COLT, 2009.

[7] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.

[8] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[9] S. Posner and S. Kulkarni. On-line learning of functions of bounded variation under various sampling schemes. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 439–445. ACM, 1993.

[10] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010. Full version available at arXiv:1006.1138.

[11] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In COLT, 2011. Full version available at arXiv:1011.3168.

[12] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010.

[13] D. A. Spielman and S. H. Teng.
Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.

[14] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.