{"title": "Online learning in episodic Markovian decision processes by relative entropy policy search", "book": "Advances in Neural Information Processing Systems", "page_first": 1583, "page_last": 1591, "abstract": "We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\\A$ and the state space $\\X$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\\sqrt{L\\nX\\nA T\\log(\\nX\\nA/L)}$ in the bandit setting and $2L\\sqrt{T\\log(\\nX\\nA/L)}$ in the full information setting. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.", "full_text": "Online Learning in Episodic Markovian Decision\n\nProcesses by Relative Entropy Policy Search\n\nAlexander Zimin\n\nInstitute of Science and Technology Austria\n\nalexander.zimin@ist.ac.at\n\nGergely Neu\n\nINRIA Lille \u2013 Nord Europe\n\ngergely.neu@gmail.com\n\nAbstract\n\nWe study the problem of online learning in \ufb01nite episodic Markov decision pro-\ncesses (MDPs) where the loss function is allowed to change between episodes.\nThe natural performance measure in this learning problem is the regret de\ufb01ned as\nthe difference between the total loss of the best stationary policy and the total loss\nsuffered by the learner. 
We assume that the learner is given access to a finite action space $A$ and that the state space $X$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|X||A|T\log(|X||A|/L)}$ in the bandit setting and $2L\sqrt{T\log(|X||A|/L)}$ in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

1 Introduction

In this paper, we study the problem of online learning in a class of finite non-stationary episodic Markov decision processes. The learning problem that we consider can be formalized as a sequential interaction between a learner (often called an agent) and an environment, where the interaction between the two entities proceeds in episodes. Every episode consists of multiple time steps: in every time step of an episode, the learner has to choose one of its available actions after observing some part of the current state of the environment. The chosen action influences the observable state of the environment in a stochastic fashion and imposes some loss on the learner. However, the entire state (be it observed or not) also influences the loss. The goal of the learner is to minimize the total (non-discounted) loss that it suffers. In this work, we assume that the unobserved part of the state evolves independently of the observed part of the state and of the actions chosen by the learner, thus corresponding to a state sequence generated by an oblivious adversary such as nature. Otherwise, absolutely no statistical assumption is made about the mechanism generating the unobserved state variables. 
As usual for such learning problems, we set our goal as minimizing the regret, defined as the difference between the total loss suffered by the learner and the total loss of the best stationary state-feedback policy. This setting fuses two important paradigms of learning theory: online learning [5] and reinforcement learning [21, 22].

(Parts of this work were done while Alexander Zimin was enrolled in the MSc programme of the Central European University, Budapest, and Gergely Neu was working on his PhD thesis at the Budapest University of Technology and Economics and the MTA SZTAKI Institute for Computer Science and Control, Hungary. Both authors would like to express their gratitude to László Györfi for making this collaboration possible.)

The learning problem outlined above can be formalized as an online learning problem where the actions of the learner correspond to choosing policies in a known Markovian decision process whose loss function changes arbitrarily between episodes. This setting is a simplified version of the learning problem first addressed by Even-Dar et al. [8, 9], who consider online learning in unichain MDPs. In their variant of the problem, the learner faces a continuing MDP task where all policies are assumed to generate a unique stationary distribution over the state space and losses can change arbitrarily between consecutive time steps. Assuming that the learner observes the complete loss function after each time step (that is, assuming full information feedback), they propose an algorithm called MDP-E and show that its regret is $O(\tau^2\sqrt{T\log|A|})$, where $\tau > 0$ is an upper bound on the mixing time of any policy. The core idea of MDP-E is the observation that the regret of the global decision problem can be decomposed into regrets of simpler decision problems defined in each state. Yu et al. 
[23] consider the same setting and propose an algorithm that guarantees $o(T)$ regret under bandit feedback, where the learner only observes the losses that it actually suffers, but not the whole loss function. Based on the results of Even-Dar et al. [9], Neu et al. [16] propose an algorithm that is shown to enjoy an $O(T^{2/3})$ bound on the regret in the bandit setting, given some further assumptions concerning the transition structure of the underlying MDP. For the case of continuing deterministic MDP tasks, Dekel and Hazan [7] describe an algorithm guaranteeing $O(T^{2/3})$ regret. The immediate precursor of the current paper is the work of Neu et al. [14], who consider online learning in episodic MDPs where the state space has a layered (or loop-free) structure and every policy visits every state with a positive probability of at least $\alpha > 0$. Their analysis is based on a decomposition similar to the one proposed by Even-Dar et al. [9], and is sufficient to prove a regret bound of $O(L^2\sqrt{T|A|\log|A|}/\alpha)$ in the bandit case and $O(L^2\sqrt{T\log|A|})$ in the full information case.

In this paper, we present a learning algorithm that directly aims to minimize the global regret of the algorithm instead of trying to minimize the local regrets in a decomposed problem. Our approach is motivated by the insightful paper of Peters et al. [17], who propose an algorithm called Relative Entropy Policy Search (REPS) for reinforcement learning problems. As Peters et al. [17] and Kakade [11] point out, good performance of policy search algorithms requires that the information loss between the consecutive policies selected by the algorithm is bounded, so that policies are only modified in small steps. Accordingly, REPS aims to select policies that minimize the expected loss while guaranteeing that the state-action distributions generated by the policies stay close in terms of Kullback–Leibler divergence. Further, Daniel et al. 
[6] point out that REPS is closely related to a number of previously known probabilistic policy search methods. Our paper is based on the observation that REPS is closely related to the Proximal Point Algorithm (PPA) first proposed by Martinet [13] (see also [20]).

We propose a variant of REPS called online REPS or O-REPS and analyze it using fundamental results concerning the PPA family. Our analysis improves all previous results concerning online learning in episodic MDPs: we show that the expected regret of O-REPS is bounded by $2\sqrt{L|X||A|T\log(|X||A|/L)}$ in the bandit setting and by $2L\sqrt{T\log(|X||A|/L)}$ in the full information setting. Unlike previous works in the literature, we do not have to make any assumptions about the transition dynamics apart from the loop-free assumption. The full discussion of our results is deferred to Section 5.

Before we move to the technical content of the paper, we first fix some conventions. Random variables will be typeset in boldface (e.g., x, a), and indefinite sums over states and actions are to be understood as sums over the entire state and action spaces. For clarity, we assume that all actions are available in all states; however, this assumption is not essential. The indicator of any event $A$ will be denoted by $I\{A\}$.

2 Problem definition

An episodic loop-free Markov decision process is formally defined by the tuple $M = \{X, A, P\}$, where $X$ is the finite state space, $A$ is the finite action space, and $P : X \times X \times A \to [0, 1]$ is the transition function, where $P(x'|x, a)$ is the probability that the next state of the Markovian environment will be $x'$, given that action $a$ is selected in state $x$. We will assume that $M$ satisfies the following assumptions:

- The state space $X$ can be decomposed into non-intersecting layers, i.e. 
$X = \bigcup_{k=0}^{L} X_k$, where $X_l \cap X_k = \emptyset$ for $l \neq k$.

- $X_0$ and $X_L$ are singletons, i.e., $X_0 = \{x_0\}$ and $X_L = \{x_L\}$.

- Transitions are possible only between consecutive layers. Formally, if $P(x'|x, a) > 0$, then $x' \in X_{k+1}$ and $x \in X_k$ for some $0 \le k \le L - 1$.

The interaction between the learner and the environment is described in Figure 1. The interaction of the agent and the Markovian environment proceeds in episodes, where in each episode the agent starts in state $x_0$ and moves forward across the consecutive layers until it reaches state $x_L$.(1) We assume that the environment selects a sequence of loss functions $\{\ell_t\}_{t=1}^T$ and that the losses only change between episodes. Furthermore, we assume that the learner only observes the losses that it suffers in each individual state-action pair that it visits; in other words, we consider bandit feedback.(2)

Parameters: Markovian environment $M = \{X, A, P\}$.
For all episodes $t = 1, 2, \ldots, T$, repeat
  1. The environment chooses the loss function $\ell_t : X \times A \to [0, 1]$.
  2. The learner starts in state $x_0(t) = x_0$.
  3. For all time steps $l = 0, 1, 2, \ldots, L - 1$, repeat
     (a) The learner observes $x_l(t) \in X_l$.
     (b) Based on its previous observations (and randomness), the learner selects $a_l(t)$.
     (c) The learner suffers and observes the loss $\ell_t(x_l(t), a_l(t))$.
     (d) The environment draws the new state $x_{l+1}(t) \sim P(\cdot|x_l(t), a_l(t))$.

Figure 1: The protocol of online learning in episodic MDPs.

For defining our performance measure, we need to specify a set of reference controllers that is made available to the learner. To this end, we define the concept of (stochastic stationary) policies: a policy is defined as a mapping $\pi : A \times X \to [0, 1]$, where $\pi(a|x)$ gives the probability of selecting action $a$ in state $x$. 
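The interaction protocol of Figure 1 is easy to simulate. Below is a minimal sketch in Python; the dict-based encodings of $P$, the policy, and the loss function, as well as the toy two-layer MDP, are our own illustrative assumptions, not notation from the paper:

```python
import random

def run_episode(layers, P, policy, loss, rng=None):
    # Play one episode of the Figure-1 protocol and return the total loss.
    # layers[0] == [x0] and layers[-1] == [xL]; P[(x, a)] and policy[x] are
    # dicts of next-state and action probabilities (assumed encodings).
    rng = rng or random.Random(0)
    x, total = layers[0][0], 0.0
    for l in range(len(layers) - 1):              # time steps l = 0, ..., L-1
        acts = list(policy[x])                    # draw a_l(t) ~ pi(.|x_l(t))
        a = rng.choices(acts, [policy[x][b] for b in acts])[0]
        total += loss[(x, a)]                     # suffer/observe loss_t(x_l, a_l)
        nxt = list(P[(x, a)])                     # environment draws x_{l+1}(t)
        x = rng.choices(nxt, [P[(x, a)][y] for y in nxt])[0]
    return total

# A toy MDP with L = 2: x0 -> {u, w} -> xL and two actions everywhere.
layers = [['x0'], ['u', 'w'], ['xL']]
P = {('x0', 0): {'u': 1.0}, ('x0', 1): {'w': 1.0},
     ('u', 0): {'xL': 1.0}, ('u', 1): {'xL': 1.0},
     ('w', 0): {'xL': 1.0}, ('w', 1): {'xL': 1.0}}
policy = {x: {0: 0.5, 1: 0.5} for x in ['x0', 'u', 'w']}
loss = {(x, a): 0.25 for x in ['x0', 'u', 'w'] for a in (0, 1)}
print(run_episode(layers, P, policy, loss))       # 0.5: two steps of loss 0.25
```

Note that the learner sees only the losses along its own trajectory, which is exactly the bandit feedback discussed above.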
The expected total loss of a policy $\pi$ is defined as

$L_T(\pi) = E\Big[ \sum_{t=1}^{T} \sum_{k=0}^{L-1} \ell_t(x'_k, a'_k) \,\Big|\, P, \pi \Big]$,

where the notation $E[\cdot \mid P, \pi]$ is used to emphasize that the random variables $x'_k$ and $a'_k$ are generated by executing $\pi$ in the MDP specified by the transition function $P$. Denote the total expected loss suffered by the learner by $\hat{L}_T = \sum_{t=1}^{T} \sum_{k=0}^{L-1} E[\ell_t(x_k(t), a_k(t)) \mid P]$, where the expectation is taken over the internal randomization of the learner and the random transitions of the Markovian environment. Using these notations, we define the learner's goal as minimizing the (total expected) regret defined as

$\hat{R}_T = \hat{L}_T - \min_{\pi} L_T(\pi)$,

where the minimum is taken over the complete set of stochastic stationary policies.(3)

It is beneficial to introduce the concept of occupancy measures on the state-action space $X \times A$: the occupancy measure $q^{\pi}$ of policy $\pi$ is defined as the collection of distributions generated by executing policy $\pi$ on the episodic MDP described by $P$:

$q^{\pi}(x, a) = P\big[ x_{k(x)} = x,\, a_{k(x)} = a \,\big|\, P, \pi \big]$,

where $k(x)$ denotes the index of the layer that $x$ belongs to. It is easy to see that the occupancy measure of any policy $\pi$ satisfies

$\sum_{a} q^{\pi}(x, a) = \sum_{x' \in X_{k(x)-1}} \sum_{a'} P(x|x', a')\, q^{\pi}(x', a')$    (1)

Footnotes:
(1) Such MDPs naturally arise in episodic decision tasks where some notion of time is present in the state description.
(2) In the literature of online combinatorial optimization, this feedback scheme is often called semi-bandit feedback; see Audibert et al. 
[2].
(3) The existence of this minimum is a standard result of MDP theory; see Puterman [18].

for all $x \in X \setminus \{x_0, x_L\}$, with $q^{\pi}(x_0, a) = \pi(a|x_0)$ for all $a \in A$. The set of all occupancy measures satisfying the above equality in the MDP $M$ will be denoted by $\Delta(M)$. The policy $\pi$ is said to generate the occupancy measure $q \in \Delta(M)$ if

$\pi(a|x) = \frac{q(x, a)}{\sum_{b} q(x, b)}$

holds for all $(x, a) \in X \times A$. It is clear that there exists a unique generating policy for every measure in $\Delta(M)$ and vice versa. The policy generating $q$ will be denoted by $\pi_q$. In what follows, we redefine the task of the learner from having to select individual actions $a_k(t)$ to having to select occupancy measures $q_t \in \Delta(M)$ in each episode $t$. To see why this notion simplifies the treatment of the problem, observe that

$E\Big[ \sum_{k=0}^{L-1} \ell_t(x'_k, a'_k) \,\Big|\, P, \pi_q \Big] = \sum_{k=0}^{L-1} \sum_{x \in X_k} \sum_{a} q(x, a) \ell_t(x, a) = \sum_{x,a} q(x, a) \ell_t(x, a) \stackrel{\mathrm{def}}{=} \langle q, \ell_t \rangle$,    (2)

where we defined the inner product $\langle \cdot, \cdot \rangle$ on $X \times A$ in the last line. Using this notation, we can reformulate our original problem as an instance of online linear optimization with decision space $\Delta(M)$. 
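Since occupancy measures are central to everything that follows, here is a hedged sketch of how $q^{\pi}$ can be computed layer by layer via the recursion (1), and how identity (2) turns the expected episode loss into the inner product $\langle q, \ell \rangle$; the dict encodings and the toy MDP are illustrative assumptions of ours:

```python
from collections import defaultdict

def occupancy_measure(layers, P, policy):
    # Forward recursion (1): propagate state-visitation probabilities mu
    # layer by layer; q(x, a) = mu(x) * pi(a|x).  P[(x, a)] and policy[x]
    # are dicts of probabilities (assumed, illustrative encodings).
    mu = {layers[0][0]: 1.0}
    q = defaultdict(float)
    for k in range(len(layers) - 1):
        nxt = defaultdict(float)
        for x in layers[k]:
            for a, pa in policy[x].items():
                q[(x, a)] = mu.get(x, 0.0) * pa
                for y, p in P[(x, a)].items():
                    nxt[y] += q[(x, a)] * p     # flow into the next layer
        mu = dict(nxt)
    return q

layers = [['x0'], ['u', 'w'], ['xL']]
P = {('x0', 0): {'u': 1.0}, ('x0', 1): {'w': 1.0},
     ('u', 0): {'xL': 1.0}, ('u', 1): {'xL': 1.0},
     ('w', 0): {'xL': 1.0}, ('w', 1): {'xL': 1.0}}
policy = {x: {0: 0.5, 1: 0.5} for x in ['x0', 'u', 'w']}
loss = {(x, a): 0.25 for x in ['x0', 'u', 'w'] for a in (0, 1)}
q = occupancy_measure(layers, P, policy)
print(sum(q[xa] * loss[xa] for xa in q))  # 0.5, the expected loss <q, l> of (2)
```

By construction, the entries of $q$ restricted to any layer sum to one, which is the constraint set $\Delta(M)$ the algorithm below optimizes over.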
Assuming that the learner selects occupancy measure $q_t$ in episode $t$, the regret can be rewritten as

$\hat{R}_T = \max_{q \in \Delta(M)} E\Big[ \sum_{t=1}^{T} \langle q_t - q, \ell_t \rangle \Big]$.

3 The algorithm: O-REPS

Using the formalism introduced in the previous section, we now describe our algorithm, called Online Relative Entropy Policy Search (O-REPS). O-REPS is an instance of the family of online linear optimization methods usually referred to as Follow-the-Regularized-Leader (FTRL), Online Stochastic Mirror Descent (OSMD) or the Proximal Point Algorithm (PPA) – see, e.g., [1], [19], [3] and [2] for a discussion of these methods and their relations. To allow comparisons with the original derivation of REPS by Peters et al. [17], we formalize our algorithm as an instance of PPA. Before describing the algorithm, some more definitions are in order. First, define $D(q \| q')$ as the unnormalized Kullback–Leibler divergence between two occupancy measures $q$ and $q'$:

$D(q \| q') = \sum_{x,a} q(x, a) \log \frac{q(x, a)}{q'(x, a)} - \sum_{x,a} \big(q(x, a) - q'(x, a)\big)$.

Furthermore, let $R(q)$ denote the unnormalized negative entropy of the occupancy measure $q$:

$R(q) = \sum_{x,a} q(x, a) \log q(x, a) - \sum_{x,a} q(x, a)$.

We are now ready to define O-REPS formally. In the first episode, O-REPS chooses the uniform policy $\pi_1(a|x) = 1/|A|$ for all $x$ and $a$, and we let $q_1 = q^{\pi_1}$.(4) Then the algorithm proceeds recursively: after observing

$u_t = \big(x_0(t), a_0(t), \ell_t(x_0(t), a_0(t)), \ldots, x_{L-1}(t), a_{L-1}(t), \ell_t(x_{L-1}(t), a_{L-1}(t)), x_L(t)\big)$

in episode $t$, we define the loss estimates $\hat{\ell}_t$ as

$\hat{\ell}_t(x, a) = \frac{\ell_t(x, a)}{q_t(x, a)}\, I\{(x, a) \in u_t\}$,

where we used the notation $(x, a) \in u_t$ to indicate that the state-action pair $(x, a)$ was observed during episode $t$. After episode $t$, O-REPS selects the occupancy measure that solves the optimization problem

$q_{t+1} = \arg\min_{q \in \Delta(M)} \big\{ \eta \langle q, \hat{\ell}_t \rangle + D(q \| q_t) \big\}$.    (3)

(4) Note that $q^{\pi}$ can be simply computed by using (1) recursively.

In episode $t$, our algorithm follows the policy $\pi_t = \pi_{q_t}$. Defining $U_t = (u_1, u_2, \ldots, u_t)$, we clearly have that $q_t(x, a) = P[(x, a) \in u_t \mid U_{t-1}]$, so $\hat{\ell}_t(x, a)$ is an unbiased estimate of $\ell_t(x, a)$ for all $(x, a)$ such that $q_t(x, a) > 0$:

$E\big[ \hat{\ell}_t(x, a) \,\big|\, U_{t-1} \big] = \frac{\ell_t(x, a)}{q_t(x, a)}\, P[(x, a) \in u_t \mid U_{t-1}] = \ell_t(x, a)$.    (4)

We now proceed to explain how the policy update step (3) can be implemented efficiently. It is known (see, e.g., Bartók et al. [3, Lemma 8.6]) that performing this optimization can be reformulated as first solving the unconstrained optimization problem

$\tilde{q}_{t+1} = \arg\min_{q} \big\{ \eta \langle q, \hat{\ell}_t \rangle + D(q \| q_t) \big\}$

and then projecting the result onto $\Delta(M)$ as

$q_{t+1} = \arg\min_{q \in \Delta(M)} D(q \| \tilde{q}_{t+1})$.

The first step can be simply carried out by setting $\tilde{q}_{t+1}(x, a) = q_t(x, a) e^{-\eta \hat{\ell}_t(x, a)}$. The projection step, however, requires more care. To describe the projection procedure, we need to introduce some more notation. 
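To make the two-step update concrete, the following sketch performs the multiplicative step and then computes the projection numerically by plain gradient descent on the convex dual over value functions $v$ (with $v(x_0) = v(x_L) = 0$ fixed), in the spirit of the dual reformulation the paper derives next in Proposition 1. The dict encodings, the toy problem, the choice of plain gradient descent, and all parameter values are our own illustrative assumptions:

```python
import math

def oreps_update(layers, P, q, est_loss, eta, steps=4000, lr=0.05):
    # One O-REPS update: multiplicative step qtilde = q * exp(-eta * lhat),
    # then the KL projection onto Delta(M), computed here by gradient descent
    # on the convex dual over value functions v (a sketch; any convex solver
    # would do).
    qt = {xa: q[xa] * math.exp(-eta * est_loss.get(xa, 0.0)) for xa in q}
    by_layer = [[xa for xa in q if xa[0] in layer] for layer in layers[:-1]]
    inner = [x for layer in layers[1:-1] for x in layer]
    v = {x: 0.0 for x in inner}           # v(x0) = v(xL) = 0 stays implicit

    def weights(keys):
        # w(x, a) = qtilde(x, a) * exp(v(x) - sum_y P(y|x, a) v(y)); the
        # -eta * lhat part of the Bellman error is already absorbed in qtilde.
        return {(x, a): qt[(x, a)] * math.exp(
            v.get(x, 0.0) - sum(p * v.get(y, 0.0) for y, p in P[(x, a)].items()))
            for (x, a) in keys}

    for _ in range(steps):                # minimize sum_k log Z(v, k)
        grad = {x: 0.0 for x in inner}
        for keys in by_layer:
            w = weights(keys)
            Z = sum(w.values())
            for (x, a), wxa in w.items():
                if x in grad:
                    grad[x] += wxa / Z    # d log Z / d v(x)
                for y, p in P[(x, a)].items():
                    if y in grad:
                        grad[y] -= wxa * p / Z
        for x in inner:
            v[x] -= lr * grad[x]

    out = {}
    for keys in by_layer:                 # normalize each layer by Z(v, k)
        w = weights(keys)
        Z = sum(w.values())
        out.update({xa: wxa / Z for xa, wxa in w.items()})
    return out

layers = [['x0'], ['u', 'w'], ['xL']]
P = {('x0', 0): {'u': 1.0}, ('x0', 1): {'w': 1.0},
     ('u', 0): {'xL': 1.0}, ('u', 1): {'xL': 1.0},
     ('w', 0): {'xL': 1.0}, ('w', 1): {'xL': 1.0}}
q1 = {('x0', 0): 0.5, ('x0', 1): 0.5,
      ('u', 0): 0.25, ('u', 1): 0.25, ('w', 0): 0.25, ('w', 1): 0.25}
lhat = {('x0', 0): 1.0}                   # bandit-style estimate: one nonzero entry
q2 = oreps_update(layers, P, q1, lhat, eta=0.5)
```

The returned $q_2$ is again a valid occupancy measure: each layer sums to one, the flow constraints are restored, and the penalized pair $(x_0, 0)$ loses probability mass relative to $(x_0, 1)$.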
For any function $v : X \to \mathbb{R}$ and loss function $\ell : X \times A \to [0, 1]$, we define the function

$\delta(x, a | v, \ell) = -\eta \ell(x, a) - \sum_{x' \in X} v(x') P(x'|x, a) + v(x)$.    (5)

As noted by Peters et al. [17], the above function can be regarded as the Bellman error corresponding to the value function $v$. The next proposition provides a succinct formalization of the optimization problem (3).

Proposition 1. Let $t > 1$ and define the function

$Z_t(v, k) = \sum_{x \in X_k, a \in A} q_t(x, a)\, e^{\delta(x, a | v, \hat{\ell}_t)}$.

The update step (3) can be performed as

$q_{t+1}(x, a) = \frac{q_t(x, a)\, e^{\delta(x, a | \hat{v}_t, \hat{\ell}_t)}}{Z_t(\hat{v}_t, k(x))}$,

where

$\hat{v}_t = \arg\min_{v} \sum_{k=0}^{L-1} \ln Z_t(v, k)$.    (6)

Minimizing the expression on the right-hand side of Equation (6) is an unconstrained convex optimization problem (see Boyd and Vandenberghe [4] and the comments of Peters et al. [17]) and can be solved efficiently. It is important to note that since $q_1(x, a) > 0$ holds for all $(x, a)$ pairs, $q_t(x, a)$ is also positive for all $t > 0$ by the multiplicative update rule, so Equation (4) holds for all state-action pairs $(x, a)$ in all time steps.

The proof follows the steps of Peters et al. [17]; however, their original formalization of REPS is slightly different, which results in small changes in the analysis as well. For further comments regarding the differences between O-REPS and REPS, see Section 5.

Proof of Proposition 1. We start by formulating the projection step as a constrained optimization problem:

minimize $D(q \| \tilde{q}_{t+1})$ over $q$

subject to $\sum_{a} q(x, a) = \sum_{x', a'} P(x|x', a')\, q(x', a')$ for all $x \in X \setminus \{x_0, x_L\}$,

$\sum_{x \in X_k} \sum_{a} q(x, a) = 1$ for all $k = 0, 1, \ldots, L - 1$.

To solve the problem, consider the Lagrangian:

$L_t(q) = D(q \| \tilde{q}_{t+1}) + \sum_{k=0}^{L-1} \lambda_k \Big( \sum_{x \in X_k, a \in A} q(x, a) - 1 \Big) + \sum_{k=1}^{L-1} \sum_{x \in X_k} v(x) \Big( \sum_{x' \in X_{k-1}} \sum_{a'} q(x', a') P(x|x', a') - \sum_{a} q(x, a) \Big)$

$= D(q \| \tilde{q}_{t+1}) + \sum_{a} q(x_0, a) \Big( \lambda_0 + \sum_{x'} v(x') P(x'|x_0, a) \Big) + \sum_{x \neq x_0} \sum_{a} q(x, a) \Big( \lambda_{k(x)} + \sum_{x'} v(x') P(x'|x, a) - v(x) \Big) - \sum_{k=0}^{L-1} \lambda_k$,

where $\{\lambda_k\}_{k=0}^{L-1}$ and $\{v(x)\}_{x \in X \setminus \{x_0, x_L\}}$ are Lagrange multipliers. In what follows, we set $v(x_0) = v(x_L) = 0$ for convenience. Differentiating the Lagrangian with respect to any $q(x, a)$, we get

$\frac{\partial L_t(q)}{\partial q(x, a)} = \ln q(x, a) - \ln \tilde{q}_{t+1}(x, a) + \lambda_{k(x)} + \sum_{x'} v(x') P(x'|x, a) - v(x)$.

Hence, setting the gradient to zero, we obtain the formula for $q_{t+1}(x, a)$:

$q_{t+1}(x, a) = \tilde{q}_{t+1}(x, a)\, e^{-\lambda_{k(x)} - \sum_{x'} v(x') P(x'|x, a) + v(x)}$.

Substituting the formula for $\tilde{q}_{t+1}(x, a)$, we get

$q_{t+1}(x, a) = q_t(x, a)\, e^{-\lambda_{k(x)} + \delta(x, a | v, \hat{\ell}_t)}$.

Using the second constraint, we have for every $k = 0, 1, \ldots, L - 1$ that

$\sum_{x \in X_k} \sum_{a} q_t(x, a)\, e^{-\lambda_k + \delta(x, a | v, \hat{\ell}_t)} = 1$,

yielding $e^{-\lambda_k} = 1/Z_t(v, k)$, which leaves us with computing the value of $v$ at the optimum. 
This can be done by solving the dual problem of maximizing

$\sum_{x,a} \tilde{q}_{t+1}(x, a) - L - \sum_{k=0}^{L-1} \lambda_k$

over $\{\lambda_k\}_{k=0}^{L-1}$. If we drop the constants and express each $\lambda_k$ in terms of $Z_t(v, k)$, then the problem is equivalent to maximizing $-\sum_{k=0}^{L-1} \ln Z_t(v, k)$, that is, solving the optimization problem (6).

4 Analysis

The next theorem states our main result concerning the regret of O-REPS under bandit feedback. The proof of the theorem is based on rather common ideas used in the analysis of FTRL/OSMD/PPA-style algorithms (see, e.g., [24], Chapter 11 of [5], [1], [12], [2]). After proving the theorem, we also present the regret bound for O-REPS in the full information setting, where the learner gets to observe $\ell_t$ after each episode $t$.

Theorem 1. Assuming bandit feedback, the total expected regret of O-REPS satisfies

$\hat{R}_T \le \eta |X||A| T + \frac{L \log(|X||A|/L)}{\eta}$.

In particular, setting $\eta = \sqrt{\frac{L \log(|X||A|/L)}{T |X||A|}}$ yields

$\hat{R}_T \le 2 \sqrt{L |X||A| T \log \frac{|X||A|}{L}}$.

Proof. By standard arguments (see, e.g., [19, Lemma 12], [3, Lemma 9.2] or [5, Theorem 11.1]), we have

$\sum_{t=1}^{T} \langle q_t - q, \hat{\ell}_t \rangle \le \sum_{t=1}^{T} \langle q_t - \tilde{q}_{t+1}, \hat{\ell}_t \rangle + \frac{D(q \| q_1)}{\eta}$.    (7)

Using the exact form of $\tilde{q}_{t+1}$ and the fact that $e^x \ge 1 + x$, we get that

$\tilde{q}_{t+1}(x, a) \ge q_t(x, a) - \eta\, q_t(x, a)\, \hat{\ell}_t(x, a)$,

and thus

$\sum_{t=1}^{T} \langle q_t - \tilde{q}_{t+1}, \hat{\ell}_t \rangle \le \eta \sum_{t=1}^{T} \sum_{x,a} q_t(x, a)\, \hat{\ell}_t^2(x, a) = \eta \sum_{t=1}^{T} \sum_{x,a} \ell_t(x, a)\, \hat{\ell}_t(x, a) \le \eta \sum_{t=1}^{T} \sum_{x,a} \hat{\ell}_t(x, a)$.

Combining this with (7), we get

$\sum_{t=1}^{T} \langle q_t - q, \hat{\ell}_t \rangle \le \eta \sum_{t=1}^{T} \sum_{x,a} \hat{\ell}_t(x, a) + \frac{D(q \| q_1)}{\eta}$.    (8)

Next, we take expectations on both sides. By Equation (4), we have $E[\langle q, \hat{\ell}_t \rangle] = \langle q, \ell_t \rangle$ and $E[\langle q_t, \hat{\ell}_t \rangle] = E[\langle q_t, \ell_t \rangle]$. It also follows from Equation (4) that

$E\Big[ \sum_{t=1}^{T} \sum_{x,a} \hat{\ell}_t(x, a) \Big] \le |X||A| T$.

Finally, notice that

$D(q \| q_1) \le R(q) - R(q_1) \le \sum_{k=0}^{L-1} \sum_{x \in X_k} \sum_{a} q_1(x, a) \log \frac{1}{q_1(x, a)}$    (since $R(q) \le 0$)

$\le \sum_{k=0}^{L-1} \log\big(|X_k||A|\big) \le L \log \frac{|X||A|}{L}$,

where we used the trivial upper bound on the entropy of distributions and Jensen's inequality in the last step. Plugging the above upper bound into Equation (8), we obtain the statement of the theorem.

Theorem 2. 
Assuming full information feedback, the total expected regret of O-REPS satisfies

$\hat{R}_T \le \eta L T + \frac{L \log(|X||A|/L)}{\eta}$.

In particular, setting $\eta = \sqrt{\frac{\log(|X||A|/L)}{T}}$ yields

$\hat{R}_T \le 2 L \sqrt{T \log \frac{|X||A|}{L}}$.

The proof of the statement follows directly from the proof of Theorem 1, with the only difference being that we set $\hat{\ell}_t = \ell_t$ and can use the tighter upper bound

$\sum_{t=1}^{T} \langle q_t - \tilde{q}_{t+1}, \ell_t \rangle \le \eta \sum_{t=1}^{T} \sum_{x,a} q_t(x, a)\, \ell_t^2(x, a) \le \eta \sum_{t=1}^{T} \sum_{x,a} q_t(x, a) = \eta L T$,

where we used that $\sum_{x \in X_k} \sum_{a} q_t(x, a) = 1$ for all layers $k$.

5 Conclusions and future work

Comparison with previous results. We first compare our regret bounds with previous results from the literature. First, our guarantees for the full information case trade a factor of $L$ present in the bounds of Neu et al. [14] for a (usually much smaller) factor of $\sqrt{\log |X|}$. More importantly, our bounds trade a factor of $L^{3/2}/\alpha$ in the bandit case for a factor of $\sqrt{|X|}$. This improvement is particularly remarkable considering that we do not need to assume that $\alpha > 0$; that is, we drop the rather unnatural assumption that every stationary policy has to visit every state with positive probability. In particular, dropping this assumption enables our algorithm to work in deterministic loop-free MDPs, that is, to solve the online shortest path problem (see, e.g., [10]). In the shortest path setting, O-REPS provides an alternative implementation of the Component Hedge algorithm analyzed by Koolen et al. [12], who prove identical bounds in the full information case. As shown by Audibert et al. 
[2], Component Hedge achieves the analogue of our bounds in the bandit case as well.

O-REPS also bears a close resemblance to the algorithms of Even-Dar et al. [9] and Neu et al. [16], who also use policy updates of the form $\pi_{t+1}(a|x) \propto \pi_t(a|x) \exp\big(-\eta \ell_t(x, a) - \sum_{x'} P(x'|x, a) v_t(x')\big)$. The most important difference between their algorithms and O-REPS is that their value functions $v_t$ are computed as the solution of the Bellman equations instead of the solution of the optimization problem (6). By a simple combination of our analysis and that of Even-Dar et al. [9], it is possible to show that O-REPS attains a regret of $\tilde{O}(\sqrt{\tau T})$ in the unichain setting with full information feedback, improving their bound by a factor of $\tau^{3/2}$ under the same assumptions. It is an interesting open problem to find out whether using the O-REPS value functions is a strictly better idea than solving the Bellman equations in general. Another important direction of future work is to extend our results to the case of unichain MDPs with bandit feedback and to the setting where the transition probabilities of the underlying MDP are unknown (see Neu et al. [15]).

Lower bounds. Following the proof of Theorem 10 of Audibert et al. [2], it is straightforward to construct an MDP consisting of $|X|/L$ chains of $L$ consecutive bandit problems, each with $|A|$ actions, such that no algorithm can achieve regret smaller than $0.03\, L \sqrt{T \log(|X||A|)}$ in the full information case and $0.04 \sqrt{L |X||A| T}$ in the bandit case. These results suggest that our bounds cannot be significantly improved in general; however, finding an appropriate problem-dependent lower bound remains an interesting open problem in the much broader field of online linear optimization.

REPS vs. O-REPS. As noted several times above, our algorithm is directly inspired by the work of Peters et al. [17]. 
However, there is a slight difference between the original version of REPS and O-REPS: namely, Peters et al. aim to solve the optimization problem $q_{t+1} = \arg\min_{q \in \Delta(M)} \langle q, \hat{\ell}_t \rangle$ subject to the constraint $D(q \| q_t) \le \varepsilon$ for some $\varepsilon > 0$. This is to be contrasted with the following property of the occupancy measures generated by O-REPS (proved in the supplementary material):

Lemma 1. For any $t > 0$, $D(q_t \| q_{t+1}) \le \frac{\eta^2}{2} \langle q_t, \hat{\ell}_t^2 \rangle$.

In particular, if the losses are estimated by bounded sample averages, as done by Peters et al. [17], this gives $D(q_t \| q_{t+1}) \le \eta^2/2$. While this is not the exact same property as the one desired by REPS, both inequalities imply that the occupancy measures stay close to each other in the 1-norm sense by Pinsker's inequality. Thus we conjecture that our formulation of O-REPS has properties similar to the one studied by Peters et al. [17], while it might be somewhat simpler to implement.

Acknowledgments

Alexander Zimin is an OMV scholar. Gergely Neu's work was carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from INRIA, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements 246016 and 231495 (project CompLACS), the Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER through the “Contrat de Projets Etat Region (CPER) 2007-2013”.

References

[1] Abernethy, J., Hazan, E., and Rakhlin, A. (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 263-274.

[2] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2013). Regret in online combinatorial optimization. Mathematics of Operations Research. 
to appear.

[3] Bartók, G., Pál, D., Szepesvári, C., and Szita, I. (2011). Online learning. Lecture notes, University of Alberta. https://moodle.cs.ualberta.ca/file.php/354/notes.pdf.

[4] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[5] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.

[6] Daniel, C., Neumann, G., and Peters, J. (2012). Hierarchical relative entropy policy search. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR Workshop and Conference Proceedings, pages 273-281.

[7] Dekel, O. and Hazan, E. (2013). Better rates for any adversarial deterministic MDP. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 675-683. JMLR Workshop and Conference Proceedings.

[8] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a Markov decision process. In NIPS-17, pages 401-408.

[9] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736.

[10] György, A., Linder, T., Lugosi, G., and Ottucsák, Gy. (2007). The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369-2403.

[11] Kakade, S. (2001). A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1531-1538.

[12] Koolen, W. M., Warmuth, M. K., and Kivinen, J. (2010). Hedging structured concepts. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 93-105.

[13] Martinet, B. (1970). Régularisation d'inéquations variationnelles par approximations successives. 
ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 4(R3):154-158.

[14] Neu, G., György, A., and Szepesvári, Cs. (2010a). The online loop-free stochastic shortest-path problem. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 231-243.

[15] Neu, G., György, A., and Szepesvári, Cs. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS 2012, pages 805-813.

[16] Neu, G., György, A., Szepesvári, Cs., and Antos, A. (2010b). Online Markov decision processes under bandit feedback. In NIPS-23, pages 1804-1812. Curran.

[17] Peters, J., Mülling, K., and Altun, Y. (2010). Relative entropy policy search. In AAAI 2010, pages 1607-1612.

[18] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

[19] Rakhlin, A. (2009). Lecture notes on online learning.

[20] Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898.

[21] Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

[22] Szepesvári, Cs. (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.

[23] Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757.

[24] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. 
In Proceedings of the Twentieth International Conference on Machine Learning, pages 928-936.", "award": [], "sourceid": 789, "authors": [{"given_name": "Alexander", "family_name": "Zimin", "institution": "Institute of Science and Technology Austria"}, {"given_name": "Gergely", "family_name": "Neu", "institution": "INRIA"}]}