{"title": "Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2508, "page_last": 2516, "abstract": "We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves $O(\\sqrt{T\\log|\\Pi|}+\\log|\\Pi|)$ regret with respect to a comparison set of policies $\\Pi$.  The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set $\\Pi$ has polynomial size, this algorithm is efficient.  We also consider the episodic adversarial online shortest path problem.  Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses.  When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.  
Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.", "full_text": "Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions\n\nYasin Abbasi-Yadkori\nQueensland University of Technology\nyasin.abbasiyadkori@qut.edu.au\n\nPeter L. Bartlett\nUC Berkeley and QUT\nbartlett@eecs.berkeley.edu\n\nVarun Kanade\nUC Berkeley\nvkanade@eecs.berkeley.edu\n\nYevgeny Seldin\nQueensland University of Technology\nyevgeny.seldin@gmail.com\n\nCsaba Szepesv\u00e1ri\nUniversity of Alberta\nszepesva@cs.ualberta.ca\n\nAbstract\n\nWe study the problem of online learning Markov Decision Processes (MDPs)\nwhen both the transition distributions and loss functions are chosen by an adversary.\nWe present an algorithm that, under a mixing assumption, achieves\n$O(\\sqrt{T\\log|\\Pi|}+\\log|\\Pi|)$ regret with respect to a comparison set of policies $\\Pi$.\nThe regret is independent of the size of the state and action spaces. When expectations\nover sample paths can be computed efficiently and the comparison set $\\Pi$\nhas polynomial size, this algorithm is efficient.\nWe also consider the episodic adversarial online shortest path problem. Here, in\neach episode an adversary may choose a weighted directed acyclic graph with an\nidentified start and finish node. The goal of the learning algorithm is to choose a\npath that minimizes the loss while traversing from the start to finish node. At the\nend of each episode the loss function (given by weights on the edges) is revealed\nto the learning algorithm. The goal is to minimize regret with respect to a fixed\npolicy for selecting paths. This problem is a special case of the online MDP\nproblem. 
It was shown that for randomly chosen graphs and adversarial losses,\nthe problem can be efficiently solved. We show that it also can be efficiently\nsolved for adversarial graphs and randomly chosen losses. When both graphs and\nlosses are adversarially chosen, we show that designing efficient algorithms for\nthe adversarial online shortest path problem (and hence for the adversarial MDP\nproblem) is as hard as learning parity with noise, a notoriously difficult problem\nthat has been used to design efficient cryptographic schemes. Finally, we present\nan efficient algorithm whose regret scales linearly with the number of distinct\ngraphs.\n\n1 Introduction\n\nIn many sequential decision problems, the transition dynamics can change with time. For example,\nin steering a vehicle, the state of the vehicle is determined by the actions taken by the driver, but\nalso by external factors, such as terrain and weather conditions. As another example, the state of a\nrobot that moves in a room is determined both by its actions and by how people in the room interact\nwith it. The robot might not have influence over these external factors, or it might be very difficult\nto model them. Other examples occur in portfolio optimization, clinical trials, and two player games\nsuch as poker.\nWe consider the problem of online learning Markov Decision Processes (MDPs) when the transition\nprobability distributions and loss functions are chosen adversarially and are allowed to change with\ntime. We study the following game between a learner and an adversary:\n\n1. The (oblivious) adversary chooses a sequence of transition kernels $m_t$ and loss functions $\\ell_t$.\n2. 
At time $t$:\n(a) The learner observes the state $x_t$ in state space $\\mathcal{X}$ and chooses an action $a_t$ in the action space $\\mathcal{A}$.\n(b) The new state $x_{t+1} \\in \\mathcal{X}$ is drawn at random according to distribution $m_t(\\cdot|x_t, a_t)$.\n(c) The learner observes the transition kernel $m_t$ and the loss function $\\ell_t$, and suffers the loss $\\ell_t(x_t, a_t)$.\n\nTo handle the case when the representation of $m_t$ or $\\ell_t$ is very large, we assume that the learner has\nblack-box access to $m_t$ and $\\ell_t$. The above game is played for a total of $T$ rounds and the total\nloss suffered by the learner is $\\sum_{t=1}^T \\ell_t(x_t, a_t)$. In the absence of state variables, the MDP problem\nreduces to a full information online learning problem (Cesa-Bianchi and Lugosi [1]). The difficulty\nwith MDP problems is that, unlike full information online learning problems, the choice of a policy\nat each round changes the future states and losses.\nA policy is a mapping $\\pi : \\mathcal{X} \\to \\Delta_{\\mathcal{A}}$, where $\\Delta_{\\mathcal{A}}$ denotes the set of distributions over $\\mathcal{A}$. To evaluate\nthe learner\u2019s performance, we imagine a hypothetical game where at each round the action played is\nchosen according to a fixed policy $\\pi$, and the transition kernels $m_t$ and loss functions $\\ell_t$ are the same\nas those chosen by the oblivious adversary. Let $(x_t^\\pi, a_t^\\pi)$ denote a sequence of state and action pairs\nin this game. Then the loss of the policy $\\pi$ is $\\sum_{t=1}^T \\ell_t(x_t^\\pi, a_t^\\pi)$. Define a set $\\Pi$ of policies that will be\nused as a benchmark to evaluate the learner\u2019s performance. The regret of a learner $A$ with respect to\na policy $\\pi \\in \\Pi$ is defined as the random variable $R_T(A, \\pi) = \\sum_{t=1}^T \\ell_t(x_t, a_t) - \\sum_{t=1}^T \\ell_t(x_t^\\pi, a_t^\\pi)$.\nThe goal in adversarial online learning is to design learning algorithms for which the regret with\nrespect to any policy grows sublinearly with $T$, the total number of rounds played. 
Algorithms with\nsuch a guarantee, somewhat unfortunately, are typically termed no-regret algorithms.\nWe also study a special case of this problem: the episodic online adversarial shortest path problem.\nHere, in each episode the adversary chooses a layered directed acyclic graph with a unique start and\nfinish node. The adversary also chooses a loss function, i.e., a weight for every edge in the graph.\nThe goal of the learning algorithm is to choose a path from start to finish that minimizes the total\nloss. The loss along any path is simply the sum of the weights on the edges. At the end of the round\nthe graph and the loss function are revealed to the learner. The goal, as in the case of the online\nMDP problem, is to minimize regret with respect to a class of policies for choosing the path. Note\nthat the online shortest path problem is a special case of the online MDP problem; the states are the\nnodes in the graph and the transition dynamics are specified by the edges.\n\n1.1 Related Work\n\nBurnetas and Katehakis [2], Jaksch et al. [3], and Bartlett and Tewari [4] propose efficient algorithms\nfor finite MDP problems with stochastic transitions and loss functions. These results are extended\nto MDPs with large state and action spaces in [5, 6, 7]. Abbasi-Yadkori and Szepesv\u00e1ri [5] and\nAbbasi-Yadkori [6] derive algorithms with $O(\\sqrt{T})$ regret for linearly parameterized MDP problems,\nwhile Ortner and Ryabko [7] derive $O(T^{(2d+1)/(2d+2)})$ regret bounds under a Lipschitz assumption,\nwhere $d$ is the dimensionality of the state space. We note that these algorithms are computationally\nexpensive.\nEven-Dar et al. [8] consider the problem of online learning MDPs with fixed and known dynamics,\nbut adversarially changing loss functions. 
They show that when the transition kernel satisfies a\nmixing condition (see Section 3), there is an algorithm with regret bound $O(\\sqrt{T})$. Yu and Mannor [9,\n10] study a harder setting, where the transition dynamics may also change adversarially over time.\nHowever, their regret bound scales with the amount of variation in the transition kernels and in the\nworst case may grow linearly with time.\nRecently, Neu et al. [11] give a no-regret algorithm for the episodic shortest path problem with\nadversarial losses but stochastic transition dynamics.\n\n1.2 Our Contributions\n\nFirst, we study a general MDP problem with large (possibly continuous) state and action spaces\nand adversarially changing dynamics and loss functions. We present an algorithm that guarantees\n$O(\\sqrt{T})$ regret with respect to a suitably small (totally bounded) class of policies $\\Pi$ for this online\nMDP problem. The regret grows with the metric entropy of $\\Pi$, so that if the comparison class is\nthe set of all policies (that is, the algorithm must compete with the optimal fixed policy), it scales\npolynomially with the size of the state and action spaces. The above algorithm is efficient as long\nas the comparison class has polynomial size and we can compute expectations over sample paths\nfor each policy. This result has several advantages over the results of [5, 6, 7]. First, the transition\ndistributions and loss functions are chosen adversarially. 
Second, by designing an appropriate small\nclass of comparison policies, the algorithm is efficient, even in the face of very large state and action\nspaces.\nNext, we present efficient no-regret algorithms for the episodic online shortest path problem for two\ncases: when the graphs and loss functions (edge weights) are chosen adversarially and the set of\ngraphs is small; and when the graphs are chosen adversarially, but the loss is stochastic.\nFinally, we show that for the general adversarial online shortest path problem, designing an efficient\nno-regret algorithm is at least as hard as learning parity with noise. Since the online shortest path\nproblem is a special case of the online MDP problem, the hardness result is also applicable there.1 The\nnoisy parity problem is widely believed to be computationally intractable, and has been used to\ndesign cryptographic schemes.\nOrganization: In Section 3 we introduce an algorithm for MDP problems with adversarially chosen\ntransition kernels and loss functions. Section 4 discusses how this algorithm can also be applied to\nthe online episodic shortest path problem with adversarially varying graphs and loss functions and\nalso considers the case of stochastic loss functions. Finally, in Section 4.2, we show the reduction\nfrom the adversarial online episodic shortest path problem to learning parity with noise.\n\n2 Notations\n\nLet $\\mathcal{X} \\subset \\mathbb{R}^n$ be a state space and $\\mathcal{A} \\subset \\mathbb{R}^d$ be an action space. Let $\\Delta_S$ be the space of probability\ndistributions over a set $S$. Define a policy $\\pi$ as a mapping $\\pi : \\mathcal{X} \\to \\Delta_{\\mathcal{A}}$. We use $\\pi(a|x)$ to\ndenote the probability of choosing an action $a$ in state $x$ under policy $\\pi$. A random action under\npolicy $\\pi$ is denoted by $\\pi(x)$. A transition probability kernel (or transition kernel) $m$ is a mapping\n$m : \\mathcal{X} \\times \\mathcal{A} \\to \\Delta_{\\mathcal{X}}$. For finite $\\mathcal{X}$, let $P(\\pi, m)$ be the transition probability matrix of policy $\\pi$ under\ntransition kernel $m$. 
A loss function is a bounded real-valued function over state and action spaces,\n$\\ell : \\mathcal{X} \\times \\mathcal{A} \\to \\mathbb{R}$. For a vector $v$, define $\\|v\\|_1 = \\sum_i |v_i|$. For a real-valued function $f$ defined over\n$\\mathcal{X} \\times \\mathcal{A}$, define $\\|f\\|_{\\infty,1} = \\max_{x \\in \\mathcal{X}} \\sum_{a \\in \\mathcal{A}} |f(x, a)|$. The inner product between two vectors $v$ and\n$w$ is denoted by $\\langle v, w \\rangle$.\n\n3 Online MDP Problems\n\nIn this section, we study a general MDP problem with large state and action spaces. The adversary\ncan change the dynamics and the loss functions, but is restricted to choose dynamics that satisfy a\nmixing condition.\nAssumption A1 (Uniform Mixing) There exists a constant $\\tau > 0$ such that for all distributions\n$d$ and $d'$ over the state space, any deterministic policy $\\pi$, and any transition kernel $m \\in M$,\n$\\|dP(\\pi, m) - d'P(\\pi, m)\\|_1 \\le e^{-1/\\tau} \\|d - d'\\|_1$.\n\n1 There was an error in the proof of a claimed hardness result for the online adversarial MDP problem [8];\nthis claim has since been retracted [12, 13].\n\nFor all policies $\\pi \\in \\Pi$, $w_{\\pi,0} = 1$. $\\eta = \\min\\{\\sqrt{\\log|\\Pi| / T}, 1/2\\}$.\nChoose $\\pi_1$ uniformly at random.\nfor t := 1, 2, . . . , T do\n  Learner takes the action $a_t \\sim \\pi_t(\\cdot|x_t)$ and adversary chooses $m_t$ and $\\ell_t$.\n  Learner suffers loss $\\ell_t(x_t, a_t)$ and observes $m_t$ and $\\ell_t$. Update state: $x_{t+1} \\sim m_t(\\cdot|x_t, a_t)$.\n  For all policies $\\pi$, $w_{\\pi,t} = w_{\\pi,t-1} (1 - \\eta)^{\\mathbb{E}[\\ell_t(x_t^\\pi, \\pi)]}$.\n  $W_t = \\sum_{\\pi \\in \\Pi} w_{\\pi,t}$. For any $\\pi$, $p_{\\pi,t+1} = w_{\\pi,t} / W_t$.\n  With probability $\\beta_t = w_{\\pi_t,t} / w_{\\pi_t,t-1}$ choose the previous policy, $\\pi_{t+1} = \\pi_t$, while with\n  probability $1 - \\beta_t$, choose $\\pi_{t+1}$ based on the distribution $p_{\\cdot,t+1}$.\nend for\n\nFigure 1: OMDP: The Online Algorithm for Markov Decision Processes\n\nThis assumption excludes deterministic MDPs that can be more difficult to deal with. As discussed\nby Neu et al. 
[14], if Assumption A1 holds for deterministic policies, then it holds for all policies.\nWe propose an exponentially-weighted average algorithm for this problem. The algorithm, called\nOMDP and shown in Figure 1, maintains a distribution over the policy class, but changes its policy\nwith a small probability. The main results of this section are the following regret bounds for the\nOMDP algorithm. The proofs can be found in Appendix A.\nTheorem 1. Let the loss functions selected by the adversary be bounded in $[0, 1]$, and the transition\nkernels selected by the adversary satisfy Assumption A1. Then, for any policy $\\pi \\in \\Pi$,\n\n$\\mathbb{E}[R_T(\\mathrm{OMDP}, \\pi)] \\le (4 + 2\\tau^2) \\sqrt{T \\log|\\Pi|} + \\log|\\Pi|$.\n\nCorollary 2. Let $\\Pi$ be an arbitrary policy space, $\\mathcal{N}(\\epsilon)$ be the $\\epsilon$-covering number of the metric space\n$(\\Pi, \\|\\cdot\\|_{\\infty,1})$, and $\\mathcal{C}(\\epsilon)$ be an $\\epsilon$-cover. Assume that we run the OMDP algorithm on $\\mathcal{C}(\\epsilon)$. Then, under\nthe same assumptions as in Theorem 1, for any policy $\\pi \\in \\Pi$,\n\n$\\mathbb{E}[R_T(\\mathrm{OMDP}, \\pi)] \\le (4 + 2\\tau^2) \\sqrt{T \\log \\mathcal{N}(\\epsilon)} + \\log \\mathcal{N}(\\epsilon) + \\tau T \\epsilon$.\n\nRemark 3. If we choose $\\Pi$ to be the space of deterministic policies and $\\mathcal{X}$ and $\\mathcal{A}$ are finite spaces,\nfrom Theorem 1 we obtain that $\\mathbb{E}[R_T(\\mathrm{OMDP}, \\pi)] \\le (4 + 2\\tau^2) \\sqrt{T |\\mathcal{X}| \\log|\\mathcal{A}|} + |\\mathcal{X}| \\log|\\mathcal{A}|$. This\nresult, however, is not sufficient to show that the average regret with respect to the optimal stationary\npolicy converges to zero. This is because, unlike in the standard MDP framework, the optimal\nstationary policy is not necessarily deterministic. 
Corollary 2 extends the result of Theorem 1 to\ncontinuous policy spaces.\nIn particular, if $\\mathcal{X}$ and $\\mathcal{A}$ are finite spaces and $\\Pi$ is the space of all policies, $\\mathcal{N}(\\epsilon) \\le (|\\mathcal{A}|/\\epsilon)^{|\\mathcal{A}||\\mathcal{X}|}$,\nso the expected regret satisfies $\\mathbb{E}[R_T(\\mathrm{OMDP}, \\pi)] \\le (4 + 2\\tau^2) \\sqrt{T |\\mathcal{A}||\\mathcal{X}| \\log(|\\mathcal{A}|/\\epsilon)} + |\\mathcal{A}||\\mathcal{X}| \\log(|\\mathcal{A}|/\\epsilon) + \\tau T \\epsilon$.\nBy the choice of $\\epsilon = 1/T$, we get that $\\mathbb{E}[R_T(\\mathrm{OMDP}, \\pi)] = O(\\tau^2 \\sqrt{T |\\mathcal{A}||\\mathcal{X}| \\log(|\\mathcal{A}| T)})$.\n\n3.1 Proof Sketch\n\nThe main idea behind the design and the analysis of our algorithm is the following regret decomposition:\n\n$R_T(A, \\pi) = \\sum_{t=1}^T \\ell_t(x_t^A, a_t) - \\sum_{t=1}^T \\ell_t(x_t^{\\pi_t}, \\pi_t) + \\sum_{t=1}^T \\ell_t(x_t^{\\pi_t}, \\pi_t) - \\sum_{t=1}^T \\ell_t(x_t^\\pi, \\pi)$ , (1)\n\nwhere $A$ is an online learning algorithm that generates a policy $\\pi_t$ at round $t$, $x_t^A$ is the state at round\n$t$ if we have followed the policies generated by algorithm $A$, and $\\ell(x, \\pi) = \\ell(x, \\pi(x))$. Let\n\n$B_T(A) = \\sum_{t=1}^T \\ell_t(x_t^A, a_t) - \\sum_{t=1}^T \\ell_t(x_t^{\\pi_t}, \\pi_t)$ , $C_T(A, \\pi) = \\sum_{t=1}^T \\ell_t(x_t^{\\pi_t}, \\pi_t) - \\sum_{t=1}^T \\ell_t(x_t^\\pi, \\pi)$ .\n\nNote that the choice of policies has no influence over future losses in $C_T(A, \\pi)$. Thus, $C_T(A, \\pi)$\ncan be bounded by a reduction to full information online learning algorithms. Also, notice that the\ncompetitor policy $\\pi$ does not appear in $B_T(A)$. In fact, $B_T(A)$ depends only on the algorithm $A$.\nWe will show that if the class of transition kernels satisfies Assumption A1 and algorithm $A$ rarely\nchanges its policies, then $B_T(A)$ can be bounded by a sublinear term. To be more precise, let $\\alpha_t$\nbe the probability that algorithm $A$ changes its policy at round $t$. We will require that there exists\na constant $D$ such that for any $1 \\le t \\le T$, any sequence of models $m_1, \\ldots, m_t$ and loss functions\n$\\ell_1, \\ldots, \\ell_t$, $\\alpha_t \\le D/\\sqrt{t}$.\nWe would like to have a full information online learning algorithm that rarely changes its policy.\nThe first candidate that we consider is the well-known Exponentially Weighted Average (EWA)\nalgorithm [15, 16]. In our MDP problem, the EWA algorithm chooses a policy $\\pi \\in \\Pi$ according\nto distribution $q_t(\\pi) \\propto \\exp\\left(-\\lambda \\sum_{s=1}^{t-1} \\mathbb{E}[\\ell_s(x_s^\\pi, \\pi)]\\right)$ for some $\\lambda > 0$. The policies that this\nEWA algorithm generates most likely are different in consecutive rounds and thus, the EWA algorithm\nmight change its policy frequently. However, a variant of EWA, called Shrinking Dartboard\n(SD) (Geulen et al. [17]), rarely changes its policy (see Lemma 8 in Appendix A). The OMDP algorithm\nis based on the SD algorithm. Note that the algorithm needs to know the number of rounds,\n$T$, in advance. Also note that we could use any rarely switching algorithm such as Follow the Lazy\nLeader of Kalai and Vempala [18] as the subroutine.\n\n4 Adversarial Online Shortest Path Problem\n\nWe consider the following adversarial online shortest path problem with changing graphs. The\nproblem is a repeated game played between a decision-maker and an (oblivious) adversary over\n$T$ rounds. At each round $t$ the adversary presents a directed acyclic graph $g_t$ on $n$ nodes to the\ndecision-maker, with $L$ layers indexed by $\\{1, \\ldots, L\\}$ and a special start and finish node. Each layer\ncontains a fixed set of nodes and has connections only with the next layer.2 The decision-maker\nmust choose a path $p_t$ from the start to the finish node. Then, the adversary reveals weights across\nall the edges of the graph. The loss $\\ell_t(g_t, p_t)$ of the decision-maker is the weight along the path that\nthe decision-maker took on that round.\nDenote by $[k]$ the set $\\{1, 2, \\ldots, k\\}$. A policy is a mapping $\\pi : [n] \\to [n]$. Each policy may be\ninterpreted as giving a start to finish path. 
Suppose that the start node is $s \\in [n]$, then $\\pi(i)$ gives the\nsubsequent node. The path is interpreted as follows: if at a node $v$, the edge $(v, \\pi(v))$ exists then the\nnext node is $\\pi(v)$. Otherwise, the next node is an arbitrary (pre-determined) choice that is adjacent\nto $v$. We compete against the class of such policies for choosing the shortest path. Denote the class\nof such policies by $\\Pi$. The regret of a decision-maker $A$ with respect to a policy $\\pi \\in \\Pi$ is defined as:\n$R_T(A, \\pi) = \\sum_{t=1}^T \\ell_t(g_t, p_t) - \\sum_{t=1}^T \\ell_t(g_t, \\pi(g_t))$, where $\\pi(g_t)$ is the path obtained by following\nthe policy $\\pi$ starting at the source node. Note that it is possible that there exists no policy that would\nresult in an actual path that leads to the sink for some graph. In this case we say that the loss of the\npolicy is infinite. Thus, there may be adversarially chosen sequences of graphs for which the regret\nof a decision-maker is not well-defined. This can be easily corrected by the adversary ensuring that\nthe graph always has some fixed set of edges which result in a (possibly high loss) $s \\to f$ path.\nIn fact, we show that the adversary can choose a sequence of graphs and loss functions that make\nthis problem at least as hard as learning noisy parities. Learning noisy parities is a notoriously hard\nproblem in computational learning theory. The best known algorithm runs in time $2^{O(n/\\log(n))}$ [20]\nand the presumed hardness of this and related problems has been used for designing cryptographic\nprotocols [21].\nInterestingly, for the hardness result to hold, it is essential that the adversary have the ability to\ncontrol both the sequence of graphs and losses. The problem is well-understood when the graphs\nare generated randomly and the losses are adversarial. Jaksch et al. [3] and Bartlett and Tewari [4]\npropose efficient algorithms for problems with stochastic losses.3 Neu et al. 
[22] extend these results\nto problems with adversarial loss functions.\n\n2 As noted by Neu et al. [19], any directed acyclic graph can be transformed into a graph that satisfies our\nassumptions.\n3 These algorithms are originally proposed for continuing problems, but we can use them in shortest path\nproblems with small modifications.\n\nOne can also ask what happens in the case when the graphs are chosen by the adversary, but the\nweight of each edge is drawn at random according to a fixed stationary distribution. In this setting,\nwe show a reduction to bandit linear optimization. Thus, in fact, that algorithm does not need to see\nthe weights of all edges at the end of the round, but only needs to know the loss it suffered.\nFinally, we consider the case when both graphs and losses are chosen adversarially. Although the\ngeneral problem is at least as hard as learning noisy parities, we give an efficient algorithm whose\nregret scales linearly with the number of different graphs. Thus, if the adversary is forced to choose\ngraphs from some small set $\\mathcal{G}$, then we have an efficient algorithm for solving the problem. We note\nthat in fact, our algorithm does not need to see the graph $g_t$ at the beginning of the round, in which\ncase an algorithm achieving $O(|\\mathcal{G}|\\sqrt{T})$ may be trivially obtained.\n\n4.1 Stochastic Loss Functions and Adversarial Graphs\n\nConsider the case when the weight of each edge is chosen from a fixed distribution. Then it is easy\nto see that the expected loss of any path is a fixed linear function of the expected weights vector. The\nset of available paths depends on the graph and it may change from time to time. This is an instance\nof stochastic linear bandit problem, for which efficient algorithms exist [23, 24, 25].\nTheorem 4. Let us represent each path by a binary vector of length $n(n - 1)/2$, such that the $i$th\nelement is 1 only if the corresponding edge is present in the path. 
Assume that the learner suffers\nthe loss of $c(p)$ for choosing path $p$, where $\\mathbb{E}[c(p)] = \\langle \\ell, p \\rangle$ and the loss vector $\\ell \\in \\mathbb{R}^{n(n-1)/2}$ is\nfixed. Let $\\mathcal{P}_t$ be the set of paths in a graph $g_t$. Consider the CONFIDENCEBALL1 algorithm of Dani\net al. [24] applied to the shortest path problem with a changing action set $\\mathcal{P}_t$ and the loss function\n$\\ell$. Then the regret with respect to the best path in each round is $C n^3 \\sqrt{T}$ for a problem-independent\nconstant $C$.\n\nLet $\\hat{\\ell}_t$ be the least squares estimate of $\\ell$ at round $t$, $V_t = \\sum_{s=1}^{t-1} p_s p_s^\\top$ be the covariance matrix, and\n$\\mathcal{P}_t$ be the decision set at round $t$. The CONFIDENCEBALL1 algorithm constructs a high probability\nnorm-1 ball confidence set, $C_t = \\{\\ell : \\|V_t^{1/2} (\\ell - \\hat{\\ell}_t)\\|_1 \\le \\beta_t\\}$ for an appropriate $\\beta_t$, and chooses\nan action $p_t$ according to $p_t = \\mathrm{argmin}_{\\ell \\in C_t, p \\in \\mathcal{P}_t} \\langle \\ell, p \\rangle$. Dani et al. [24] prove that the regret of the\nCONFIDENCEBALL1 algorithm is bounded by $O(m^{3/2} \\sqrt{T})$, where $m$ is the dimensionality of the\naction set (in our case $m = n(n - 1)/2$). The above optimization can be solved efficiently, because\nonly $2n$ corners of $C_t$ need to be evaluated.\nNote that the regret in Theorem 4 is with respect to the best path in each round, which is a stronger\nresult than competing with a fixed policy.\n\n4.2 Hardness Result\n\nIn this section, we show that in the setting where both the graphs and the losses are chosen by an\nadversary, the problem is at least as hard as the noisy parity problem. We consider the online agnostic\nparity learning problem. Recall that the class of parity functions over $\\{0, 1\\}^n$ is the following: For\n$S \\subseteq [n]$, $\\mathrm{PAR}_S(x) = \\oplus_{i \\in S} x_i$, where $\\oplus$ denotes modulo 2 addition. The class is PARITIES =\n$\\{\\mathrm{PAR}_S \\mid S \\subseteq [n]\\}$. In the online setting, the learning algorithm is given $x_t \\in \\{0, 1\\}^n$, the learning\nalgorithm then picks $\\hat{y}_t \\in \\{0, 1\\}$, and then the true label $y_t$ is revealed. The learning algorithm\nsuffers loss $\\mathbb{I}(\\hat{y}_t \\ne y_t)$. 
The regret of the learning algorithm with respect to PARITIES is defined\nas: $\\mathrm{Regret} = \\sum_{t=1}^T \\mathbb{I}(\\hat{y}_t \\ne y_t) - \\min_{\\mathrm{PAR}_S \\in \\mathrm{PARITIES}} \\sum_{t=1}^T \\mathbb{I}(\\mathrm{PAR}_S(x_t) \\ne y_t)$. The goal is to\ndesign a learning algorithm that runs in time polynomial in $n$, $T$ and suffers regret $O(\\mathrm{poly}(n) T^{1-\\delta})$\nfor some constant $\\delta > 0$. It follows from prior work that online agnostic learning of parities is at\nleast as hard as the offline version (see Littlestone [26], Kanade and Steinke [27]). As mentioned\npreviously, the agnostic parity learning problem is notoriously difficult. Thus, it seems unlikely that\na computationally efficient no-regret algorithm for this problem exists.\nTheorem 5. Suppose there is a no-regret algorithm for the online adversarial shortest path problem\nthat runs in time $\\mathrm{poly}(n, T)$ and achieves regret $O(\\mathrm{poly}(n) T^{1-\\delta})$ for any constant $\\delta > 0$.\nThen there is a polynomial-time algorithm for online agnostic parity learning that achieves regret\n$O(\\mathrm{poly}(n) T^{1-\\delta})$. By the online to batch reduction, this would imply a polynomial time algorithm\nfor agnostically learning parities.\n\n(a) [Graph on nodes $1_a, \\ldots, 6_a$, $2_b, \\ldots, 6_b$ and the finish node ?; the two edges into the finish node carry weights $y$ and $1 - y$.]\n\n(b)\nfor t := 1, 2, . . . do\n  Adversary chooses a graph $g_t \\in \\mathcal{G}$\n  for l = 1, . . . , L do\n    Initialize an EWA expert algorithm, E\n    for s = 1, . . . , t - 1 do\n      if $g_s \\in C(x_{t,l})$ then\n        Feed expert E with the value function $Q_s = Q_{\\pi_s, g_s, c_s}$\n      end if\n    end for\n    Let $\\pi_t(\\cdot|x_{t,l})$ be the distribution over actions of the expert E\n    Take the action $a_{t,l} \\sim \\pi_t(\\cdot|x_{t,l})$, suffer the loss $c_t(n_{t,l}, a_{t,l})$, and move to the node $n_{t,l+1} = g_t(n_{t,l}, a_{t,l})$\n  end for\n  Learner observes the graph $g_t$ and the loss function $c_t$\n  Compute the value function $Q_t = Q_{\\pi_t, g_t, c_t}$ for all nodes $n' \\in [n]$\nend for\n\nFigure 2: (a) Encoding the example $(1, 0, 1, 0, 1) \\in \\{0, 1\\}^5$ as a graph. (b) Improved Algorithm for\nthe Online Shortest Path Problem.\n\nProof. 
We first show how to map a point $(x, y)$ to a graph and a loss function. Let $(x, y) \\in \\{0, 1\\}^n \\times\n\\{0, 1\\}$. We define a graph $g(x)$ and a loss function $\\ell_{x,y}$ associated with $(x, y)$. Define a graph on\n$2n + 2$ nodes \u2013 named $1_a, 2_a, 2_b, 3_a, 3_b, \\ldots, n_a, n_b, (n+1)_a, (n+1)_b$, ? in that order. Let $E(x)$\ndenote the set of edges of $g(x)$. The set $E(x)$ contains the following edges:\n(i) If $x_1 = 1$, both $(1_a, 2_a)$ and $(1_a, 2_b)$ are in $E(x)$, else if $x_1 = 0$, only $(1_a, 2_a)$ is present.\n(ii) For $1 < i \\le n$, if $x_i = 1$, the edges $(i_a, (i+1)_a)$, $(i_a, (i+1)_b)$, $(i_b, (i+1)_a)$, $(i_b, (i+1)_b)$\nare all present; if $x_i = 0$ only the two edges $(i_a, (i+1)_a)$ and $(i_b, (i+1)_b)$ are present.\n(iii) The two edges $((n+1)_a, ?)$ and $((n+1)_b, ?)$ are always present.\nFor the loss function, define the weights as follows. The weight of the edge $((n+1)_a, ?)$ is $y$;\nthe weight of the edge $((n+1)_b, ?)$ is $1 - y$. The weights of all the remaining edges are set to 0.\nFigure 2(a) shows the encoding of the example $(1, 0, 1, 0, 1) \\in \\{0, 1\\}^5$.\nSuppose an algorithm with the stated regret bound for the online shortest path problem exists, call\nit U. We will use this algorithm to solve the online parity learning problem. Let $x_t$ be an example\nreceived; then pass the graph $g(x_t)$ to the algorithm U. The start vertex is $1_a$ and the finish vertex\nis ?. If the path $p_t$ chosen by U reaches ? using the edge $((n+1)_a, ?)$, then set $\\hat{y}_t$ to be 0.\nOtherwise, choose $\\hat{y}_t = 1$.\nThus, in effect we are using algorithm U as a meta-algorithm for the online agnostic parity learning\nproblem. First, it is easy to check that the loss suffered by the meta-algorithm on the parity problem\nis exactly the same as the loss of U on the online shortest path problem. 
This follows directly from\nthe definition of the losses on the edges.\nNext, we claim that for any $S \\subseteq [n]$, there is a policy $\\pi_S$ that achieves the same loss (on the online\nshortest path problem) as the parity $\\mathrm{PAR}_S$ does (on the parity learning problem). The policy is as\nfollows:\n(i) From node $i_a$, if $i \\in S$ and $(i_a, (i+1)_b) \\in E(g_t)$, go to $(i+1)_b$, otherwise go to $(i+1)_a$.\n(ii) From node $i_b$, if $i \\in S$ and $(i_b, (i+1)_a) \\in E(g_t)$, go to $(i+1)_a$, otherwise go to $(i+1)_b$.\n(iii) Finally, from either $(n+1)_a$ or $(n+1)_b$, just move to ?.\nWe can think of the path $p_t$ as being in type $a$ nodes or type $b$ nodes. For each $i \\in S$, such that\n$x^t_i = 1$, the path $p_t$ switches types. Thus, if $\\mathrm{PAR}_S(x_t) = 1$, $p_t$ reaches ? via the edge $((n+1)_b, ?)$\nand if $\\mathrm{PAR}_S(x_t) = 0$, $p_t$ reaches ? via the edge $((n+1)_a, ?)$. Recall that the loss function is\ndefined as follows: weight of the edge $((n+1)_a, ?)$ is $y_t$, weight of the edge $((n+1)_b, ?)$ is\n$1 - y_t$; other edges have loss 0. Thus, the loss suffered by the policy $\\pi_S$ is 1 if $\\mathrm{PAR}_S(x_t) \\ne y_t$\nand 0 otherwise. This is exactly the loss of the parity function $\\mathrm{PAR}_S$ on the agnostic parity learning\nproblem. Thus, if the algorithm U has regret $O(\\mathrm{poly}(n) T^{1-\\delta})$, then the meta-algorithm for the\nonline agnostic parity learning problem also has regret $O(\\mathrm{poly}(n) T^{1-\\delta})$.\nRemark 6. We observe that the online shortest path problem is a special case of online MDP\nlearning. Thus, the above reduction also shows that, short of a major breakthrough, it is unlikely\nthat there exists a computationally efficient algorithm for the fully adversarial online MDP problem.\n\n4.3 Small Number of Graphs\n\nIn this section, we design an efficient algorithm and prove an $O(|\\mathcal{G}|\\sqrt{T})$ regret bound, where $\\mathcal{G}$ is the\nset of graphs played by the adversary up to round $T$. The computational complexity of the algorithm\nis $O(L^2 t)$ at round $t$. The algorithm does not need to know the set $\\mathcal{G}$ or $|\\mathcal{G}|$. 
This regret bound holds\neven if the graphs are revealed at the end of the rounds. Notice that if the graphs are shown at the\nbeginning of the rounds, obtaining regret bounds that scale like $O(|\\mathcal{G}|\\sqrt{T})$ is trivial; the learner only\nneeds to run $|\\mathcal{G}|$ copies of the MDP-E algorithm of Even-Dar et al. [12], one for each graph.\nLet $n^\\pi_{t,l}$ denote the node at layer $l$ of round $t$ if we run policy $\\pi$. Let $c_t(n', a)$ be the loss incurred\nfor taking action $a$ in node $n'$ at round $t$.4 We construct a new graph, called $G$, as follows: graph $G$\nalso has a layered structure with the same number of layers, $L$. At each layer, we have a number of\nstates that represent all possible observations that we might have upon arriving at that layer. Thus,\na state at layer $l$ has the form of $x = (s, a_0, n_1, a_1, \\ldots, n_{l-1}, a_{l-1}, n_l)$, where $n_i$ belongs to layer $i$\nand $a_i \\in \\mathcal{A}$.\nLet $\\mathcal{X}$ be the set of states in $G$ and $\\mathcal{X}_l$ be the set of states in layer $l$ of $G$. For $(x, a) \\in \\mathcal{X} \\times \\mathcal{A}$,\nlet $c(x, a) = c(n(x), a)$, where $n(x)$ is the last node observed in state $x$. Let $g(n', a)$ be the next\nnode under graph $g$ if we take action $a$ in node $n'$. Let $g(x, a) = g(n(x), a)$. Let $c(x, \\pi) =\n\\sum_a \\pi(a|x) c(x, a)$. For a graph $g$ and a loss function $c$, define the value functions by\n\n$\\forall n' \\in [n], \\quad Q_{\\pi,g,c}(n', \\pi') = \\mathbb{E}_{a \\sim \\pi'(n')} [c(n', a) + Q_{\\pi,g,c}(g(n', a), \\pi)]$ ,\n\n$\\forall x \\text{ s.t. } g \\in C(x), \\quad Q_{\\pi,g,c}(x, \\pi') = Q_{\\pi,g,c}(n(x), \\pi')$ ,\n\nwith $Q_{\\pi,g,c}(f, a) = 0$ for any $\\pi, g, c, a$ where $f$ is the finish node. Let $Q_t = Q_{\\pi_t,g_t,c_t}$ denote the\nvalue function associated with policy $\\pi_t$ at time $t$. For $x = (s, a_0, n_1, a_1, \\ldots, n_{l-1}, a_{l-1}, n_l)$, define\n$C(x) = \\{g \\in \\mathcal{G} : n_1 = g(s, a_0), \\ldots, n_l = g(n_{l-1}, a_{l-1})\\}$, the set of graphs that are consistent\nwith the state $x$.\nWe can use the MDP-E algorithm to generate policies. The algorithm, however, is computationally\nexpensive as it updates a large set of experts at each round. 
Notice that the number of states at stage $l$, $|\\mathcal{X}_l|$, can be exponential in the number of graphs. We show a modification of the MDP-E algorithm that would generate the same sequence of policies, with the advantage that the new algorithm is computationally efficient. The algorithm is shown in Figure 2(b). As the generated policies are always the same, the regret bound in the next theorem, which is proven for the MDP-E algorithm, also applies to the new algorithm. The proof can be found in Appendix B.\nTheorem 7. For any policy $\\pi$,\n\n$$\\mathbb{E}\\left[R_T(\\text{MDP-E}, \\pi)\\right] \\le 2L\\sqrt{8T\\log(2T)} + L \\min\\{|\\mathcal{G}|, \\max_l |\\mathcal{X}_l|\\} \\sqrt{\\frac{T \\log|\\mathcal{A}|}{2}} + 2L.$$\n\nThe theorem gives a sublinear regret as long as $|\\mathcal{G}| = o(\\sqrt{T})$. On the other hand, the hardness result in Theorem 5 applies when $|\\mathcal{G}| = \\Theta(T)$. Characterizing regret vs. computational complexity tradeoffs when $|\\mathcal{G}|$ is in between remains for future work.\n\n$^{4}$ Thus, $\\ell_t(G_t, \\pi(G_t)) = \\sum_{l=1}^{L} c_t(n^{\\pi}_{t,l}, \\pi)$.\n\nReferences\n[1] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n[2] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222\u2013255, 1997.\n[3] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n[4] P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, 2009.\n[5] Yasin Abbasi-Yadkori and Csaba Szepesv\u00e1ri. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.\n[6] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012.\n[7] Ronald Ortner and Daniil Ryabko. 
Online regret bounds for undiscounted continuous reinforcement learning. In NIPS, 2012.\n[8] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In NIPS, 2004.\n[9] Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In IEEE Conference on Decision and Control, 2009.\n[10] Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets, 2009.\n[11] Gergely Neu, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesv\u00e1ri. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.\n[12] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726\u2013736, 2009.\n[13] Eyal Even-Dar. Personal communication, 2013.\n[14] Gergely Neu, Andr\u00e1s Gy\u00f6rgy, Csaba Szepesv\u00e1ri, and Andr\u00e1s Antos. Online Markov decision processes under bandit feedback. In NIPS, 2010.\n[15] Vladimir Vovk. Aggregating strategies. In COLT, pages 372\u2013383, 1990.\n[16] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212\u2013261, 1994.\n[17] Sascha Geulen, Berthold V\u00f6cking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, 2010.\n[18] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291\u2013307, 2005.\n[19] Gergely Neu, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesv\u00e1ri. The online loop-free stochastic shortest path problem. In COLT, 2010.\n[20] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In STOC, pages 629\u2013638, 2008.\n[21] Oded Regev. 
On lattices, learning with errors, random linear codes, and cryptography. In STOC, pages 84\u201393, 2005.\n[22] Gergely Neu, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesv\u00e1ri. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.\n[23] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2002.\n[24] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, COLT, pages 355\u2013366, 2008.\n[25] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In NIPS, 2011.\n[26] Nick Littlestone. From on-line to batch learning. In COLT, pages 269\u2013284, 1989.\n[27] Varun Kanade and Thomas Steinke. Learning hurdles for sleeping experts. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS \u201912, pages 11\u201318, 2012.\n", "award": [], "sourceid": 1187, "authors": [{"given_name": "Yasin", "family_name": "Abbasi Yadkori", "institution": "Queensland University of Technology"}, {"given_name": "Peter", "family_name": "Bartlett", "institution": "UC Berkeley"}, {"given_name": "Varun", "family_name": "Kanade", "institution": "UC Berkeley"}, {"given_name": "Yevgeny", "family_name": "Seldin", "institution": "Queensland University of Technology & UC Berkeley"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}