{"title": "Near-optimal Regret Bounds for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": "For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.", "full_text": "Near-optimal Regret Bounds for\n\nReinforcement Learning\n\nPeter Auer\nRonald Ortner\nUniversity of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria\n\nThomas Jaksch\n\n{auer,tjaksch,rortner}@unileoben.ac.at\n\nAbstract\n\nFor undiscounted reinforcement learning in Markov decision processes (MDPs)\nwe consider the total regret of a learning algorithm with respect to an optimal\npolicy. In order to describe the transition structure of an MDP we propose a new\nparameter: An MDP has diameter D if for any pair of states s, s0 there is a policy\nwhich moves from s to s0 in at most D steps (on average). We present a rein-\nforcement learning algorithm with total regret \u02dcO(DS\nAT ) after T steps for any\nunknown MDP with S states, A actions per state, and diameter D. This bound\nholds with high probability. 
We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.\n\n1 Introduction\n\nIn a Markov decision process (MDP) M with finite state space S and finite action space A, a learner in state s ∈ S needs to choose an action a ∈ A. When executing action a in state s, the learner receives a random reward r with mean r̄(s, a) according to some distribution on [0, 1]. Further, according to the transition probabilities p(s′|s, a), a random transition to a state s′ ∈ S occurs.\nReinforcement learning of MDPs is a standard model for learning with delayed feedback. In contrast to other important work on reinforcement learning — where the performance of the learned policy is considered (see e.g. [1, 2] and also the discussion and references given in the introduction of [3]) — we are interested in the performance of the learning algorithm during learning. For that, we compare the rewards collected by the algorithm during learning with the rewards of an optimal policy.\nIn this paper we will consider undiscounted rewards. The accumulated reward of an algorithm A after T steps in an MDP M is defined as\n\nR(M, A, s, T) := Σ_{t=1}^{T} r_t,\n\nwhere s is the initial state and r_t are the rewards received during the execution of algorithm A. The average reward\n\nρ(M, A, s) := lim_{T→∞} (1/T) E[R(M, A, s, T)]\n\ncan be maximized by an appropriate stationary policy π : S → A which defines an optimal action for each state [4].\nThe difficulty of learning an MDP does not only depend on its size (given by the number of states and actions), but also on its transition structure. In order to measure this transition structure we propose a new parameter, the diameter D of an MDP. 
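For a known MDP, this diameter (made precise in Definition 1 below) can be computed directly: for each target state, run value iteration on the Bellman equation for minimal expected first-passage times, h(s) = 1 + min_a Σ_{s″} p(s″|s, a) h(s″) with h(target) = 0, and take the maximum over all start/target pairs. A minimal sketch; the dense transition tensor P[s, a, s′] is an assumed input format, not from the paper:

```python
import numpy as np

def diameter(P, max_iters=100000, tol=1e-9):
    """Diameter D(M) of a known MDP with transition tensor P[s, a, s']:
    for each target state, value iteration on the first-passage equation
        h(s) = 1 + min_a sum_{s'} P[s, a, s'] * h(s'),  h(target) = 0,
    then D is the maximum expected hitting time over all pairs."""
    S = P.shape[0]
    D = 0.0
    for target in range(S):
        h = np.zeros(S)  # h[s] = minimal expected steps from s to target
        for _ in range(max_iters):
            h_new = 1.0 + np.min(P @ h, axis=1)  # one step, then optimal continuation
            h_new[target] = 0.0                  # the target absorbs at cost 0
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, float(h.max()))
    return D
```

For communicating MDPs the iteration converges to the minimal expected hitting times, so the returned value matches Definition 1 up to the tolerance.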
The diameter D is the time it takes to move from any state s to any other state s′, using an appropriate policy for this pair of states s and s′:\nDefinition 1. Let T(s′|M, π, s) be the first (random) time step in which state s′ is reached when policy π is executed on MDP M with initial state s. Then the diameter of M is given by\n\nD(M) := max_{s,s′∈S} min_{π:S→A} E[T(s′|M, π, s)].\n\nA finite diameter seems necessary for interesting bounds on the regret of any algorithm with respect to an optimal policy. When a learner explores suboptimal actions, this may take it into a “bad part” of the MDP from which it may take about D steps to reach again a “good part” of the MDP. Hence, the learner may suffer regret D for such exploration, and it is very plausible that the diameter appears in the regret bound.\nFor MDPs with finite diameter (which usually are called communicating, see e.g. [4]) the optimal average reward ρ∗ does not depend on the initial state (cf. [4], Section 8.3.3), and we set\n\nρ∗(M) := ρ∗(M, s) := max_π ρ(M, π, s).\n\nThe optimal average reward is the natural benchmark for a learning algorithm A, and we define the total regret of A after T steps as¹\n\nΔ(M, A, s, T) := Tρ∗(M) − R(M, A, s, T).\n\nIn the following, we present our reinforcement learning algorithm UCRL2 (a variant of the UCRL algorithm of [5]) which uses upper confidence bounds to choose an optimistic policy. We show that the total regret of UCRL2 after T steps is Õ(D|S|√(|A|T)). A corresponding lower bound of Ω(√(D|S||A|T)) on the total regret of any learning algorithm is given as well. These results establish the diameter as an important parameter of an MDP. Further, the diameter seems to be more natural than other parameters that have been proposed for various PAC and regret bounds, such as the mixing time [3, 6] or the hitting time of an optimal policy [7] (cf. 
the discussion below).\n\n1.1 Relation to previous Work\n\nWe first compare our results to the PAC bounds for the well-known algorithms E3 of Kearns, Singh [3], and R-Max of Brafman, Tennenholtz [6] (see also Kakade [8]). These algorithms achieve ε-optimal average reward with probability 1 − δ after time polynomial in 1/δ, 1/ε, |S|, |A|, and the mixing time T^mix_ε (see below). As the polynomial dependence on ε is of order 1/ε³, the PAC bounds translate into T^{2/3} regret bounds at best. Moreover, both algorithms need the ε-return mixing time T^mix_ε of an optimal policy π∗ as input parameter. This parameter T^mix_ε is the number of steps until the average reward of π∗ over these T^mix_ε steps is ε-close to the optimal average reward ρ∗.\nIt is easy to construct MDPs of diameter D with T^mix_ε ≈ D/ε. This additional dependency on ε further increases the exponent in the above mentioned regret bounds for E3 and R-max. Also, the exponents of the parameters |S| and |A| in the PAC bounds of [3] and [6] are substantially larger than in our bound.\nThe MBIE algorithm of Strehl and Littman [9, 10] — similarly to our approach — applies confidence bounds to compute an optimistic policy. However, Strehl and Littman consider only a discounted reward setting, which seems to be less natural when dealing with regret. Their definition of regret measures the difference between the rewards² of an optimal policy and the rewards of the learning algorithm along the trajectory taken by the learning algorithm. 
In contrast, we are interested in the regret of the learning algorithm with respect to the rewards of the optimal policy along the trajectory of the optimal policy.\nTewari and Bartlett [7] propose a generalization of the index policies of Burnetas and Katehakis [11]. These index policies choose actions optimistically by using confidence bounds only for the estimates in the current state. The regret bounds for the index policies of [11] and the OLP algorithm of [7] are asymptotically logarithmic in T. However, unlike our bounds, these bounds depend on the gap between the “quality” of the best and the second best action, and these asymptotic bounds also hide an additive term which is exponential in the number of states. Actually, it is possible to prove a corresponding gap-dependent logarithmic bound for our UCRL2 algorithm as well (cf. Remark 4 below). This bound holds uniformly over time and under weaker assumptions: While [7] and [11] consider only ergodic MDPs in which any policy will reach every state after a sufficient number of steps, we make only the more natural assumption of a finite diameter.\n\n¹It can be shown that max_A E[R(M, A, s, T)] = Tρ∗(M) + O(D(M)) and max_A R(M, A, s, T) = Tρ∗(M) + Õ(√T) with high probability.\n²Actually, the state values.\n\n2 Results\n\nWe summarize the results achieved for our algorithm UCRL2 which is described in the next section, and also state a corresponding lower bound. We assume an unknown MDP M to be learned, with S := |S| states, A := |A| actions, and finite diameter D := D(M). Only S and A are known to the learner, and UCRL2 is run with parameter δ.\nTheorem 2. 
With probability 1 − δ it holds that for any initial state s ∈ S and any T > 1, the regret of UCRL2 is bounded by\n\nΔ(M, UCRL2, s, T) ≤ c1 · DS√(AT log(T/δ)),\n\nfor a constant c1 which is independent of M, T, and δ.\n\nIt is straightforward to obtain from Theorem 2 the following sample complexity bound.\nCorollary 3. With probability 1 − δ the average per-step regret is at most ε for any\n\nT ≥ c2 (D²S²A/ε²) log(DSA/(δε))\n\nsteps, where c2 is a constant independent of M.\nRemark 4. The proof method of Theorem 2 can be modified to give for each initial state s and T > 1 an alternative upper bound on the expected regret,\n\nE[Δ(M, UCRL2, s, T)] ≤ c3 D²S²A log(T) / g,\n\nwhere g := ρ∗(M) − max_{π,s}{ρ(M, π, s) : ρ(M, π, s) < ρ∗(M)} is the gap between the optimal average reward and the second best average reward achievable in M.\n\nThese new bounds are improvements over the bounds that have been achieved in [5] for the original UCRL algorithm in various respects: the exponents of the relevant parameters have been decreased considerably, the parameter D we use here is substantially smaller than the corresponding mixing time in [5], and finally, the ergodicity assumption is replaced by the much weaker and more natural assumption that the MDP has finite diameter.\nThe following is an accompanying lower bound on the expected regret.\nTheorem 5. For some c4 > 0, any algorithm A, and any natural numbers S, A ≥ 10, D ≥ 20 log_A S, and T ≥ DSA, there is an MDP³ M with S states, A actions, and diameter D, such that for any initial state s ∈ S the expected regret of A after T steps is\n\nE[Δ(M, A, s, T)] ≥ c4 · √(DSAT).\n\nIn a different setting, a modification of UCRL2 can also deal with changing MDPs.\nRemark 6. 
Assume that the MDP (i.e. its transition probabilities and reward distributions) is allowed to change ℓ times up to step T, such that the diameter is always at most D (we assume an initial change at time t = 1). In this model we measure regret as the sum of missed rewards compared to the ℓ policies which are optimal after the changes of the MDP. Restarting UCRL2 with parameter δ/ℓ² at steps ⌈i³/ℓ²⌉ for i = 1, 2, 3, . . ., this regret is upper bounded by\n\nc5 · ℓ^{1/3} T^{2/3} DS√(A log(T/δ))\n\nwith probability 1 − 2δ.\n\nMDPs with a different model of changing rewards have already been considered in [12]. There, the transition probabilities are assumed to be fixed and known to the learner, but the rewards are allowed to change in every step. A best possible upper bound of O(√T) on the regret against an optimal stationary policy, given all the reward changes in advance, is derived.\n\n³The diameter of any MDP with S states and A actions is at least log_A S.\n\nInput: A confidence parameter δ ∈ (0, 1).\nInitialization: Set t := 1, and observe the initial state s1.\nFor episodes k = 1, 2, . . . do\n\nInitialize episode k:\n1. Set the start time of episode k, tk := t.\n2. For all (s, a) in S × A initialize the state-action counts for episode k, vk(s, a) := 0. Further, set the state-action counts prior to episode k,\n\nNk(s, a) := #{τ < tk : sτ = s, aτ = a}.\n\n3. For s, s′ ∈ S and a ∈ A set the observed accumulated rewards and the transition counts prior to episode k,\n\nRk(s, a) := Σ_{τ=1}^{tk−1} rτ 1_{sτ=s, aτ=a},\nPk(s, a, s′) := #{τ < tk : sτ = s, aτ = a, sτ+1 = s′},\n\nand compute estimates r̂k(s, a) := Rk(s, a)/max{1, Nk(s, a)}, p̂k(s′|s, a) := Pk(s, a, s′)/max{1, Nk(s, a)}.\n\nCompute policy π̃k:\n4. 
Let Mk be the set of all MDPs with states and actions as in M, and with transition probabilities p̃(·|s, a) close to p̂k(·|s, a), and rewards r̃(s, a) ∈ [0, 1] close to r̂k(s, a), that is,\n\n|r̃(s, a) − r̂k(s, a)| ≤ √( 7 log(2SAtk/δ) / (2 max{1, Nk(s, a)}) )   (1)\n\nand\n\n‖p̃(·|s, a) − p̂k(·|s, a)‖₁ ≤ √( 14S log(2Atk/δ) / max{1, Nk(s, a)} ).   (2)\n\n5. Use extended value iteration (Section 3.1) to find a policy π̃k and an optimistic MDP M̃k ∈ Mk such that\n\nρ̃k := min_s ρ(M̃k, π̃k, s) ≥ max_{M′∈Mk, π, s′} ρ(M′, π, s′) − 1/√tk.   (3)\n\nExecute policy π̃k:\n6. While vk(st, π̃k(st)) < max{1, Nk(st, π̃k(st))} do\n(a) Choose action at := π̃k(st), obtain reward rt, and observe next state st+1.\n(b) Update vk(st, at) := vk(st, at) + 1.\n(c) Set t := t + 1.\n\nFigure 1: The UCRL2 algorithm.\n\n3 The UCRL2 Algorithm\n\nOur algorithm is a variant of the UCRL algorithm in [5]. As its predecessor, UCRL2 implements the paradigm of “optimism in the face of uncertainty”. As such, it defines a set M of statistically plausible MDPs given the observations so far, and chooses an optimistic MDP M̃ (with respect to the achievable average reward) among these plausible MDPs. Then it executes a policy π̃ which is (nearly) optimal for the optimistic MDP M̃.\nMore precisely, UCRL2 (Figure 1) proceeds in episodes and computes a new policy π̃k only at the beginning of each episode k. The lengths of the episodes are not fixed a priori, but depend on the observations made. 
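The episode structure of Figure 1 can be sketched in code. This is a simplified sketch under stated assumptions: the environment interface (`reset`/`step`) and the helper `extended_value_iteration` (Step 5) are assumed placeholders, not part of the paper; the confidence radii follow conditions (1) and (2):

```python
import numpy as np

def confidence_radii(N, t_k, S, A, delta):
    """Radii of the confidence regions (1) (rewards) and (2) (transitions)."""
    N = np.maximum(1, N)
    d_r = np.sqrt(7 * np.log(2 * S * A * t_k / delta) / (2 * N))
    d_p = np.sqrt(14 * S * np.log(2 * A * t_k / delta) / N)
    return d_r, d_p

def ucrl2(env, S, A, T, delta, extended_value_iteration):
    """Skeleton of UCRL2: an episode ends once some state-action count
    within the episode reaches the count accumulated before it (doubling)."""
    N = np.zeros((S, A))        # state-action counts prior to the episode
    R = np.zeros((S, A))        # accumulated rewards
    P = np.zeros((S, A, S))     # transition counts
    t, s = 1, env.reset()
    while t <= T:
        t_k = t                                   # Step 1: episode start time
        r_hat = R / np.maximum(1, N)              # Step 3: empirical estimates
        p_hat = P / np.maximum(1, N)[:, :, None]
        d_r, d_p = confidence_radii(N, t_k, S, A, delta)   # Step 4
        policy = extended_value_iteration(r_hat, p_hat, d_r, d_p, t_k)  # Step 5 (stub)
        v = np.zeros((S, A))                      # Step 2: within-episode counts
        # Step 6: follow the policy until a count doubles.
        while t <= T and v[s, policy[s]] < max(1, N[s, policy[s]]):
            a = policy[s]
            s_next, r = env.step(a)
            R[s, a] += r; P[s, a, s_next] += 1; v[s, a] += 1
            s = s_next; t += 1
        N += v
    return N
```

The doubling criterion implies at most O(SA log T) episodes, which is what makes recomputing the policy only at episode starts cheap.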
In Steps 2–3, UCRL2 computes estimates p̂k(s′|s, a) and r̂k(s, a) for the transition probabilities and mean rewards from the observations made before episode k. In Step 4, a set Mk of plausible MDPs is defined in terms of confidence regions around the estimated mean rewards r̂k(s, a) and transition probabilities p̂k(s′|s, a). This guarantees that with high probability the true MDP M is in Mk. In Step 5, extended value iteration (see below) is used to choose a near-optimal policy π̃k on an optimistic MDP M̃k ∈ Mk. This policy π̃k is executed throughout episode k (Step 6). Episode k ends when a state s is visited in which the action a = π̃k(s) induced by the current policy has been chosen in episode k as often as before episode k. Thus, the total number of occurrences of any state-action pair is at most doubled during an episode. The counts vk(s, a) keep track of these occurrences in episode k.⁴\n\n3.1 Extended Value Iteration\n\nIn Step 5 of the UCRL2 algorithm we need to find a near-optimal policy π̃k for an optimistic MDP. While value iteration typically calculates a policy for a fixed MDP, we also need to select an optimistic MDP M̃k which gives almost maximal reward among all plausible MDPs. This can be achieved by extending value iteration to search also among the plausible MDPs. Formally, this can be seen as undiscounted value iteration [4] on an MDP with extended action set. 
We denote the state values of the i-th iteration by ui(s) and the normalized state values by u′i(s), and get for all s ∈ S:\n\nu0(s) = 0,\nui+1(s) = max_{a∈A} { r̃k(s, a) + max_{p(·)∈P(s,a)} Σ_{s′∈S} p(s′) · ui(s′) }.   (4)\n\nHere r̃k(s, a) are the maximal rewards satisfying condition (1) in algorithm UCRL2, and P(s, a) is the set of transition probabilities p̃(·|s, a) satisfying condition (2).\nWhile (4) may look like a step of value iteration with an infinite action space, max_p p · ui is actually a linear optimization problem over the convex polytope P(s, a). This implies that only the finite number of vertices of the polytope need to be considered as extended actions, which guarantees convergence of the value iteration.⁵\nThe value iteration is stopped when\n\nmax_{s∈S} {ui+1(s) − ui(s)} − min_{s∈S} {ui+1(s) − ui(s)} < 1/√tk,   (5)\n\nwhich means that the change of the state values is almost uniform and actually close to the average reward of the optimal policy. It can be shown that the actions, rewards, and transition probabilities chosen in (4) for this i-th iteration define an optimistic MDP M̃k and a policy π̃k which satisfy condition (3) of algorithm UCRL2.\n\n4 Analysis of UCRL2 and Proof Sketch of Theorem 2\n\nIn the following we present an outline of the main steps of the proof of Theorem 2. Details and the complete proofs can be found in the full version of the paper [13]. We also make the assumption that the rewards r(s, a) are deterministic and known to the learner.⁶ This simplifies the exposition. Considering unknown stochastic rewards adds little to the proof and only lower order terms to the regret bounds. 
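The inner maximization max_{p∈P(s,a)} p · ui in (4) has the efficient greedy solution alluded to above: since P(s, a) is an L1-ball around p̂k(·|s, a), one shifts as much probability mass as allowed onto the state with the largest value ui, removing the surplus from the lowest-valued states. A minimal sketch with illustrative names (not from the paper):

```python
import numpy as np

def inner_max(p_hat, eps, u):
    """Maximize p·u over distributions p with ||p - p_hat||_1 <= eps
    (the polytope P(s, a) of condition (2)). Greedy after sorting:
    add eps/2 mass to the highest-valued state, then renormalize by
    removing mass from the lowest-valued states, so the total L1
    shift stays within eps."""
    order = np.argsort(u)                 # states by ascending value u
    p = p_hat.astype(float).copy()
    best = order[-1]
    p[best] = min(1.0, p_hat[best] + eps / 2.0)
    surplus = p.sum() - 1.0               # mass that must be removed
    for s in order:
        if surplus <= 1e-12:
            break
        if s == best:
            continue
        take = min(p[s], surplus)
        p[s] -= take
        surplus -= take
    return p
```

After sorting the state values once per iteration, each call runs in O(S), which is the efficiency claim made for the linear program in (4).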
We also assume that the true MDP M satisfies the confidence bounds in Step 4 of algorithm UCRL2 such that M ∈ Mk. This can be shown to hold with sufficiently high probability (using a union bound over all T).\nWe start by considering the regret in a single episode k. Since the optimistic average reward ρ̃k of the optimistically chosen policy π̃k is essentially larger than the true optimal average reward ρ∗, it is sufficient to calculate by how much the optimistic average reward ρ̃k overestimates the actual rewards of policy π̃k. By the choice of π̃k and M̃k in Step 5 of UCRL2, ρ̃k ≥ ρ∗ − 1/√tk. Thus the regret Δk during episode k is bounded as\n\nΔk := Σ_{t=tk}^{tk+1−1} (ρ∗ − rt) ≤ Σ_{t=tk}^{tk+1−1} (ρ̃k − rt) + (tk+1 − tk)/√tk.\n\n⁴Since the policy π̃k is fixed for episode k, vk(s, a) ≠ 0 only for a = π̃k(s). Nevertheless, we find it convenient to use a notation which explicitly includes the action a in vk(s, a).\n⁵Because of the special structure of the polytope P(s, a), the linear program in (4) can be solved very efficiently in O(S) steps after sorting the state values ui(s′). For the formal convergence proof also the periodicity of optimal policies in the extended MDP needs to be considered.\n⁶In this case all plausible MDPs considered in Steps 4 and 5 of algorithm UCRL2 would give these rewards.\n\nThe sum over k of the second term on the right hand side is O(√T) and will not be considered further in this proof sketch. 
The \ufb01rst term on the right hand side can be rewritten using the known\ndeterministic rewards r(s, a) and the occurrences of state action pairs (s, a) in episode k,\n\ntk+1\u22121X\n\n(\u02dc\u03c1k \u2212 rt) = X\n\nvk(s, a)(cid:0)\u02dc\u03c1k \u2212 r(s, a)(cid:1).\n\n\u2206k .\n\nt=tk\n\n(s,a)\n\n4.1 Extended Value Iteration revisited\n\n(6)\n\n(7)\n\nTo proceed, we reconsider the extended value iteration in Section 3.1. As an important observation\nfor our analysis, we \ufb01nd that for any iteration i the range of the state values is bounded by the\ndiameter of the MDP M,\n\nmax\n\ns\n\nui(s) \u2212 min\n\nui(s) \u2264 D.\n\ns\n\nTo see this, observe that ui(s) is the total expected reward after i steps of an optimal non-stationary\ni-step policy starting in state s, on the MDP with extended action set as considered for the extended\nvalue iteration. The diameter of this extended MDP is at most D as it contains the actions of the true\nMDP M. If there were states with ui(s1) \u2212 ui(s0) > D, then an improved value for ui(s0) could\nbe achieved by the following policy: First follow a policy which moves from s0 to s1 most quickly,\nwhich takes at most D steps on average. Then follow the optimal i-step policy for s1. 
Since only D of the i rewards of the policy for s1 are missed, this policy gives ui(s0) ≥ ui(s1) − D, proving (7).\nFor the convergence criterion (5) it can be shown that at the corresponding iteration\n\n|ui+1(s) − ui(s) − ρ̃k| ≤ 1/√tk\n\nfor all s ∈ S, where ρ̃k is the average reward of the policy π̃k chosen in this iteration on the optimistic MDP M̃k.⁷ Expanding ui+1(s) according to (4), we get\n\nui+1(s) = r(s, π̃k(s)) + Σ_{s′} p̃k(s′|s, π̃k(s)) · ui(s′),\n\nand hence\n\n| ρ̃k − r(s, π̃k(s)) − ( Σ_{s′} p̃k(s′|s, π̃k(s)) · ui(s′) − ui(s) ) | ≤ 1/√tk.\n\nDefining rk := (r(s, π̃k(s)))_s as the (column) vector of rewards for policy π̃k, P̃k := (p̃k(s′|s, π̃k(s)))_{s,s′} as the transition matrix of π̃k on M̃k, and vk := (vk(s, π̃k(s)))_s as the (row) vector of visit counts for each state and the corresponding action chosen by π̃k, we can rewrite (6) as\n\nΔk ≲ Σ_{(s,a)} vk(s, a) (ρ̃k − r(s, a)) ≤ vk (P̃k − I) ui + Σ_{(s,a)} vk(s, a)/√tk,   (8)\n\nrecalling that vk(s, a) = 0 for a ≠ π̃k(s). Since the rows of P̃k sum to 1, we can replace ui by wk with wk(s) = ui(s) − min_s ui(s) (we again use the subscript k to reference the episode). The last term on the right hand side of (8) is of lower order, and by (7) we have\n\nΔk ≲ vk (P̃k − I) wk,   (9)\n‖wk‖∞ ≤ D.   (10)\n\n⁷This is quite intuitive. 
We expect to receive average reward ρ̃k per step, such that the difference of the state values after i + 1 and i steps should be about ρ̃k.\n\n4.2 Completing the Proof\n\nReplacing the transition matrix P̃k of the policy π̃k in the optimistic MDP M̃k by the transition matrix Pk of π̃k in the true MDP M, we get\n\nΔk ≲ vk (P̃k − I) wk = vk (P̃k − Pk + Pk − I) wk = vk (P̃k − Pk) wk + vk (Pk − I) wk.   (11)\n\nThe intuition about the second term in (11) is that the counts of the state visits vk are relatively close to the stationary distribution of the transition matrix Pk, such that vk (Pk − I) should be small. The formal proof requires the definition of a suitable martingale and the use of concentration inequalities for this martingale. This yields\n\nΣ_k vk (Pk − I) wk = O( D √(T log(T/δ)) )\n\nwith high probability, which gives a lower order term in our regret bound. Thus, our regret bound is mainly determined by the first term in (11). Since M̃k and M are in the set of plausible MDPs Mk, this term can be bounded using condition (2) in algorithm UCRL2:\n\nvk (P̃k − Pk) wk = Σ_s Σ_{s′} vk(s, π̃k(s)) · (P̃k(s, s′) − Pk(s, s′)) · wk(s′)\n≤ Σ_s vk(s, π̃k(s)) · ‖P̃k(s, ·) − Pk(s, ·)‖₁ · ‖wk‖∞\n≤ Σ_s vk(s, π̃k(s)) · 2√( 14S log(2AT/δ) / max{1, Nk(s, π̃k(s))} ) · D.\n\nwhile the average reward of any other policy is 1/2. Thus the regret suffered by a suboptimal action in state s0 is Ω(ε/δ). 
The main observation for the proof of the lower bound is that any algorithm needs to probe Ω(A′) actions in state s0 for Ω(δ/ε²) times on average, to detect the “good” action a∗ reliably.\nConsidering k := ⌊S/2⌋ copies of this MDP where only one of the copies has such a “good” action a∗, we find that Ω(kA′) actions in the s0-states of the copies need to be probed for Ω(δ/ε²) times to detect the “good” action. Setting ε = √(δkA′/T), suboptimal actions need to be taken Ω(kA′δ/ε²) = Ω(T) times, which gives Ω(Tε/δ) = Ω(√(TDSA)) regret.\nFinally, we need to connect the k copies into a single MDP. This can be done by introducing A′ + 1 additional deterministic actions per state, which do not leave the s1-states but connect the s0-states of the k copies by inducing an A′-ary tree structure on the s0-states (1 action for going toward the root, A′ actions to go toward the leaves). The diameter of the resulting MDP is at most 2(D/10 + ⌈log_{A′} k⌉), which is twice the time to travel to or from the root for any state in the MDP. Thus we have constructed an MDP with ≤ S states, ≤ A actions, and diameter ≤ D which forces regret Ω(√(DSAT)) on any algorithm. This proves the theorem.\n\nAcknowledgments\n\nThis work was supported in part by the Austrian Science Fund FWF (S9104-N13 SP4). The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements n° 216886 (PASCAL2 Network of Excellence), and n° 216529 (Personal Information Navigator Adapting Through Viewing, PinView). This publication only reflects the authors’ views.\n\nReferences\n\n[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 
MIT Press, 1998.\n[2] Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.\n[3] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209–232, 2002.\n[4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.\n[5] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 19, pages 49–56. MIT Press, 2007.\n[6] Ronen I. Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.\n[7] Ambuj Tewari and Peter Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems 20, pages 1505–1512. MIT Press, 2008.\n[8] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n[9] Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Proc. 22nd ICML 2005, pages 857–864, 2005.\n[10] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci., 74(8):1309–1331, 2008.\n[11] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222–255, 1997.\n[12] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, 2005.\n[13] Peter Auer, Thomas Jaksch, and Ronald Ortner. 
Near-optimal regret bounds for reinforcement learning. Technical Report CIT-2009-01, University of Leoben, Chair for Information Technology, 2009. http://institute.unileoben.ac.at/infotech/publications/TR/CIT-2009-01.pdf.", "award": [], "sourceid": 242, "authors": [{"given_name": "Peter", "family_name": "Auer", "institution": null}, {"given_name": "Thomas", "family_name": "Jaksch", "institution": null}, {"given_name": "Ronald", "family_name": "Ortner", "institution": null}]}