{"title": "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 49, "page_last": 56, "abstract": null, "full_text": "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning\n\nPeter Auer, Ronald Ortner\nUniversity of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria\n{auer,rortner}@unileoben.ac.at\n\nAbstract\nWe present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.\n\n1 Introduction\n\n1.1 Preliminaries\n\nDefinition 1. A Markov decision process (MDP) M on a finite set of states S with a finite set of actions A available in each state s ∈ S consists of (i) an initial distribution μ_0 over S, (ii) the transition probabilities p(s, a, s') that specify the probability of reaching state s' when choosing action a in state s, and (iii) the payoff distributions with mean r(s, a) and support in [0, 1] that specify the random reward for choosing action a in state s.\n\nA policy on an MDP M is a mapping π : S → A. We will mainly consider unichain MDPs, in which under any policy any state can be reached (after a finite number of transitions) from any state. For a policy π let μ_π be the stationary distribution induced by π on M (see footnote 1). The average reward of π then is defined as\n\nρ(M, π) := Σ_{s ∈ S} μ_π(s) r(s, π(s)). (1)\n\nA policy π* is called optimal on M if ρ(M, π) ≤ ρ(M, π*) =: ρ*(M) =: ρ* for all policies π. Our measure for the quality of a learning algorithm is the total regret after some finite number of steps. 
When a learning algorithm A executes action a_t in state s_t at step t obtaining reward r_t, then\n\nR_T := Σ_{t=0}^{T-1} r_t - T ρ*\n\ndenotes the total regret of A after T steps. The total regret R_T^ε with respect to an ε-optimal policy (i.e. a policy whose return differs from ρ* by at most ε) is defined accordingly.\n\nFootnote 1: Every policy π induces a Markov chain C_π on M. If C_π is ergodic with transition matrix P, then there exists a unique invariant and strictly positive distribution μ_π, such that independent of μ_0 one has μ_n = μ_0 P̄_n → μ_π, where P̄_n = (1/n) Σ_{j=1}^{n} P^j. If C_π is not ergodic, μ_π will depend on μ_0.\n\n1.2 Discussion\n\nWe would like to compare this approach with the various PAC-like bounds in the literature as given for the E^3 algorithm of Kearns, Singh [1] and the R-Max algorithm of Brafman, Tennenholtz [2] (cf. also [3]). Both take as inputs (among others) a confidence parameter δ and an accuracy parameter ε. The algorithms then are shown to yield ε-optimal return after time polynomial in 1/δ, 1/ε (among others) with probability 1 - δ. In contrast, our algorithm has no such input parameters and converges to an optimal policy with expected logarithmic online regret in the number of steps taken. Obviously, by using a decreasing sequence ε_t, online regret bounds for E^3 and R-Max can be achieved. However, it is not clear whether such a procedure can give logarithmic online regret bounds. We rather conjecture that these bounds either will not be logarithmic in the total number of steps (if ε_t decreases quickly) or that the dependency on the parameters of the MDP, in particular on the distance between the reward of the best and a second best policy, won't be polynomial (if ε_t decreases slowly). Moreover, although our UCRL algorithm shares the \"optimism under uncertainty\" maxim with R-Max, our mechanism for the exploitation-exploration tradeoff is implicit, while E^3 and R-Max have to distinguish between \"known\" and \"unknown\" states explicitly. 
Finally, in their original form both E^3 and R-Max need a policy's ε-return mixing time T_ε as input parameter. The knowledge of this parameter then is eliminated by calculating the ε-optimal policy for T_ε = 1, 2, ..., so that sooner or later the correct ε-return mixing time is reached. This is sufficient to obtain polynomial PAC bounds, but seems to be intricate for practical purposes. Moreover, as noted in [2], at some time step the assumed T_ε may be exponential in the true T_ε, which makes policy computation exponential in T_ε. Unlike that, we need our mixing time parameter only in the analysis. This makes our algorithm rather simple and intuitive. Recently, more refined performance measures such as the sample complexity of exploration [3] were introduced. Strehl and Littman [4] showed that in the discounted setting, efficiency in the sample complexity implies efficiency in the average loss. However, average loss is defined with respect to the actually visited states, so that small average loss does not guarantee small total regret, which is defined with respect to the states visited by an optimal policy. For this average loss, polylogarithmic online bounds were shown for the MBIE algorithm [4], while more recently logarithmic bounds for delayed Q-learning were given in [5]. However, discounted reinforcement learning is a bit simpler than undiscounted reinforcement learning, as depending on the discount factor only a finite number of steps is relevant. This makes discounted reinforcement learning similar to the setting with trials of constant length from a fixed initial state [6]. For this case logarithmic online regret bounds in the number of trials have already been given in [7]. Since we measure performance during exploration, the exploration vs. exploitation dilemma becomes an important issue. In the multi-armed bandit problem, similar exploration-exploitation tradeoffs were handled with upper confidence bounds for the expected immediate returns [8, 9]. 
This approach has been shown to allow good performance during the learning phase, while still converging fast to a nearly optimal policy. Our UCRL algorithm takes into account the state structure of the MDP, but is still based on upper confidence bounds for the expected return of a policy. Upper confidence bounds have been applied to reinforcement learning in various places and different contexts, e.g. interval estimation [10, 11], action elimination [12], or PAC-learning [6]. Our UCRL algorithm is similar to Strehl, Littman's MBIE algorithm [10, 4], but our confidence bounds are different, and we are interested in the undiscounted case. Another paper with a similar approach is Burnetas, Katehakis [13]. The basic idea of their rather complex index policies is to choose the action with maximal return in some specified confidence region of the MDP's probability distributions. The online regret of their algorithm is asymptotically logarithmic in the number of steps, which is best possible. Our UCRL algorithm is simpler and achieves logarithmic regret not only asymptotically but uniformly over time. Moreover, unlike in the approach of [13], knowledge about the MDP's underlying state structure is not needed. More recently, online reinforcement learning with changing rewards chosen by an adversary was considered under the presumption that the learner has full knowledge of the transition probabilities [14]. The given algorithm achieves best possible regret of O(√T) after T steps. In the subsequent Sections 2 and 3 we introduce our UCRL algorithm and show that its expected online regret in unichain MDPs is O(log T) after T steps. In Section 4 we consider problems that arise when the underlying MDP is not unichain.\n\n2 The UCRL Algorithm\n\nTo select good policies, we keep track of estimates for the average rewards and the transition probabilities. For each step t let\n\nN_t(s, a) = |{0 ≤ τ < t : s_τ = s, a_τ = a}|,\nR_t(s, a) = Σ_{0 ≤ τ < t : s_τ = s, a_τ = a} r_τ,\nP_t(s, a, s') = |{0 ≤ τ < t : s_τ = s, a_τ = a, s_{τ+1} = s'}|,\n\nfrom which the estimates r̂_t(s, a) and p̂_t(s, a, s') are calculated as given in Figure 1. 
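The running statistics N_t, R_t and P_t can be maintained incrementally, one update per observed transition. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name UcrlCounts and its method names are hypothetical, and the defaults for unvisited state-action pairs (reward estimate 1, uniform transition estimate 1/|S|) are taken from Figure 1.

```python
from collections import defaultdict


class UcrlCounts:
    '''Running statistics N_t, R_t, P_t and the estimates derived from them.'''

    def __init__(self, n_states):
        self.n_states = n_states
        self.visits = defaultdict(int)         # N_t(s, a)
        self.reward_sum = defaultdict(float)   # R_t(s, a)
        self.transitions = defaultdict(int)    # P_t(s, a, s')

    def update(self, s, a, reward, s_next):
        # One observed step (s, a, reward, s_next) increments all three statistics.
        self.visits[(s, a)] += 1
        self.reward_sum[(s, a)] += reward
        self.transitions[(s, a, s_next)] += 1

    def reward_estimate(self, s, a):
        # Empirical mean reward; optimistic default 1 if (s, a) was never visited.
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n > 0 else 1.0

    def transition_estimate(self, s, a, s_next):
        # Empirical transition frequency; uniform default if (s, a) was never visited.
        n = self.visits[(s, a)]
        if n == 0:
            return 1.0 / self.n_states
        return self.transitions[(s, a, s_next)] / n
```

Keeping sums and counts rather than the estimates themselves makes each update constant-time, and the estimates are only needed when a new round starts.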
In general, these estimates will deviate from the respective true values. However, together with appropriate confidence intervals they may be used to define a set ℳ_t of plausible MDPs. Our algorithm then chooses an optimal policy π̃_t for an MDP M̃_t with maximal average reward ρ̃*_t := ρ*(M̃_t) among the MDPs in ℳ_t. That is,\n\nπ̃_t := arg max_π {ρ(M, π) : M ∈ ℳ_t}, and M̃_t := arg max_{M ∈ ℳ_t} {ρ(M, π̃_t)}.\n\nMore precisely, we want ℳ_t to be a set of plausible MDPs in the sense that\n\nP{ρ* > ρ̃*_t} < t^{-α} (2)\n\nfor some α > 2. Essentially, condition (2) means that it is unlikely that the true MDP M is not in ℳ_t. Actually, ℳ_t is defined to contain exactly those unichain MDPs M' whose transition probabilities p'(·, ·, ·) and rewards r'(·, ·) satisfy for all states s, s' and actions a\n\nr'(s, a) ≤ r̂_t(s, a) + √( log(2 t^α |S||A|) / (2 N_t(s, a)) ), and (3)\n\n|p'(s, a, s') - p̂_t(s, a, s')| ≤ √( log(4 t^α |S|^2 |A|) / (2 N_t(s, a)) ). (4)\n\nConditions (3) and (4) describe confidence bounds on the rewards and transition probabilities of the true MDP M such that (2) is implied (cf. Section 3.1 below). The intuition behind the algorithm is that if a non-optimal policy is followed, then this is eventually observed and something about the MDP is learned. In the proofs we show that this learning happens sufficiently fast to approach an optimal policy with only logarithmic regret. As switching policies too often may be harmful, and estimates don't change very much after few steps, our algorithm discards the policy π̃_t only if there was considerable progress concerning the estimates p̂(s, π̃_t(s), s') or r̂(s, π̃_t(s)). That is, UCRL sticks to a policy until the length of some of the confidence intervals given by conditions (3) and (4) is halved. Only then a new policy is calculated. We will see below (cf. Section 3.3) that this condition limits the number of policy changes without paying too much for not changing to an optimal policy earlier. Summing up, Figure 1 displays our algorithm. Remark 1. 
The optimal policy π̃ in the algorithm can be efficiently calculated by a modified version of value iteration (cf. [15]).\n\n3 Analysis for Unichain MDPs\n\n3.1 An Upper Bound on the Optimal Reward\n\nWe show that with high probability the true MDP M is contained in the set ℳ_t of plausible MDPs.\n\nNotation: Set conf_p(t, s, a) := min{1, √( log(4 t^α |S|^2 |A|) / (2 N_t(s, a)) )} and conf_r(t, s, a) := min{1, √( log(2 t^α |S||A|) / (2 N_t(s, a)) )}.\n\nInitialization: Set t = 0. Set N_0(s, a) := R_0(s, a) := P_0(s, a, s') = 0 for all s, a, s'. Observe first state s_0.\n\nFor rounds k = 1, 2, ... do\n\nInitialize round k:\n1. Set t_k := t.\n2. Recalculate estimates r̂_t(s, a) and p̂_t(s, a, s') according to r̂_t(s, a) := R_t(s, a) / N_t(s, a) and p̂_t(s, a, s') := P_t(s, a, s') / N_t(s, a), provided that N_t(s, a) > 0. Otherwise set r̂_t(s, a) := 1 and p̂_t(s, a, s') := 1/|S|.\n3. Calculate new policy π̃_{t_k} := arg max_π {ρ(M, π) : M ∈ ℳ_t}, where ℳ_t consists of plausible unichain MDPs M' with rewards r'(s, a) - r̂_t(s, a) ≤ conf_r(t, s, a) and transition probabilities |p'(s, a, s') - p̂_t(s, a, s')| ≤ conf_p(t, s, a).\n\nExecute chosen policy π̃_{t_k}:\n4. While conf_r(t, S, A) > conf_r(t_k, S, A)/2 and conf_p(t, S, A) > conf_p(t_k, S, A)/2 do\n(a) Choose action a_t := π̃_{t_k}(s_t).\n(b) Observe obtained reward r_t and next state s_{t+1}.\n(c) Update: Set N_{t+1}(s_t, a_t) := N_t(s_t, a_t) + 1. Set R_{t+1}(s_t, a_t) := R_t(s_t, a_t) + r_t. Set P_{t+1}(s_t, a_t, s_{t+1}) := P_t(s_t, a_t, s_{t+1}) + 1. All other values N_{t+1}(s, a), R_{t+1}(s, a), and P_{t+1}(s, a, s') are set to N_t(s, a), R_t(s, a), and P_t(s, a, s'), respectively.\n(d) Set t := t + 1.\n\nFigure 1: The UCRL algorithm.\n\nLemma 1. For any t, any reward r(s, a) and any transition probability p(s, a, s') of the true MDP M we have\n\nP{ r̂_t(s, a) < r(s, a) - √( log(2 t^α |S||A|) / (2 N_t(s, a)) ) } < t^{-α} / (2 |S||A|), (5)\n\nP{ |p̂_t(s, a, s') - p(s, a, s')| > √( log(4 t^α |S|^2 |A|) / (2 N_t(s, a)) ) } < t^{-α} / (2 |S|^2 |A|). (6)\n\nProof. 
By Chernoff-Hoeffding's inequality.\n\nUsing the definition of ℳ_t as given by (3) and (4) and summing over all s, a, and s', Lemma 1 shows that M ∈ ℳ_t with high probability. This implies that the maximal average reward ρ̃*_t assumed by our algorithm when calculating a new policy at step t is an upper bound on ρ*(M) with high probability.\n\nCorollary 1. For any t: P{ρ* > ρ̃*_t} < t^{-α}.\n\n3.2 Sufficient Precision and Mixing Times\n\nIn order to upper bound the loss, we consider the precision needed to guarantee that the policy calculated by UCRL is (ε-)optimal. This sufficient precision will of course depend on ε or, in case one wants to compete with an optimal policy, on the minimal difference between ρ* and the average reward of some suboptimal policy,\n\nΔ := min_{π : ρ(M, π) < ρ*} (ρ* - ρ(M, π)).\n\nIt is sufficient that the difference between ρ(M̃_t, π̃_t) and ρ(M, π̃_t) is small in order to guarantee that π̃_t is an (ε-)optimal policy. For if |ρ(M̃_t, π̃_t) - ρ(M, π̃_t)| < ε, then by Corollary 1 with high probability\n\nε > |ρ(M̃_t, π̃_t) - ρ(M, π̃_t)| ≥ |ρ*(M) - ρ(M, π̃_t)|, (7)\n\nso that π̃_t is already an ε-optimal policy on M. For ε = Δ, (7) implies the optimality of π̃_t. Thus, we consider bounds on the deviation of the transition probabilities and rewards for the assumed MDP M̃_t from the true values, such that (7) is implied. This is handled in the subsequent proposition, where we use the notion of the MDP's mixing time, which will play an essential role throughout the analysis.\n\nDefinition 2. Given an ergodic Markov chain C, let T_{s,s'} be the first passage time for two states s, s', that is, the time needed to reach s' when starting in s. Furthermore let T_{s,s} be the return time to state s. Let T_C := max_{s,s' ∈ S} E(T_{s,s'}), and κ_C := max_{s ∈ S} max_{s' ≠ s} E(T_{s',s}) / (2 E(T_{s,s})). Then the mixing time of a unichain MDP M is T_M := max_π T_{C_π}, where C_π is the Markov chain induced by π on M. 
Furthermore, we set κ_M := max_π κ_{C_π}.\n\nOur notion of mixing time is different from the notion of ε-return mixing time given in [1, 2], which depends on an additional parameter ε. However, it serves a similar purpose.\n\nProposition 1. Let p(·, ·), p̃(·, ·) and r(·), r̃(·) be the transition probabilities and rewards of the MDPs M and M̃ under the policy π̃, respectively. If for all states s, s'\n\n|r̃(s) - r(s)| < ε_r := ε/2 and |p̃(s, s') - p(s, s')| < ε_p := ε / (2 κ_M |S|^2),\n\nthen |ρ(M̃, π̃) - ρ(M, π̃)| < ε.\n\nThe proposition is an easy consequence of the following result about the difference in the stationary distributions of ergodic Markov chains.\n\nTheorem 1 (Cho, Meyer [16]). Let C, C̃ be two ergodic Markov chains on the same state space S with transition probabilities p(·, ·), p̃(·, ·) and stationary distributions μ, μ̃. Then the difference in the distributions μ, μ̃ can be upper bounded by the difference in the transition probabilities as follows:\n\nmax_{s ∈ S} |μ(s) - μ̃(s)| ≤ κ_C max_{s ∈ S} Σ_{s' ∈ S} |p(s, s') - p̃(s, s')|, (8)\n\nwhere κ_C is as given in Definition 2.\n\nProof of Proposition 1. By (8),\n\nΣ_{s ∈ S} |μ(s) - μ̃(s)| ≤ |S| κ_M max_{s ∈ S} Σ_{s' ∈ S} |p(s, s') - p̃(s, s')| ≤ κ_M |S|^2 ε_p.\n\nAs the rewards are in [0, 1] and Σ_{s ∈ S} μ(s) = 1, we have by (1)\n\n|ρ(M̃, π̃) - ρ(M, π̃)| ≤ Σ_{s ∈ S} |μ̃(s) - μ(s)| r̃(s) + Σ_{s ∈ S} |r̃(s) - r(s)| μ(s) < κ_M |S|^2 ε_p + ε_r = ε.\n\nSince ε_r > ε_p and the confidence intervals for rewards are smaller than for transition probabilities (cf. Lemma 1), in the following we only consider the precision needed for transition probabilities.\n\n3.3 Bounding the Regret\n\nAs can be seen from the description of the algorithm, we split the sequence of steps into rounds, where a new round starts whenever the algorithm recalculates its policy. The following facts follow immediately from the form of our confidence intervals and Lemma 1, respectively. Proposition 2. 
For halving a confidence interval of a reward or transition probability for some (s, a) ∈ S × A, the number N_t(s, a) of visits in (s, a) has to be at least doubled.\n\nCorollary 2. The number of rounds after T steps cannot exceed |S||A| log_2( T / (|S||A|) ).\n\nProposition 3. If N_t(s, a) ≥ log(4 t^α |S|^2 |A|) / (2 θ^2), then the confidence intervals for (s, a) are smaller than θ.\n\nWe need to consider three sources of regret: first, by executing a suboptimal policy in a round of length τ, we may lose reward up to τ within this round; second, there may be some loss when changing policies; third, we have to consider the error probabilities with which some of our confidence intervals fail.\n\n3.3.1 Regret due to Suboptimal Rounds\n\nProposition 3 provides an upper bound on the number of visits needed in each (s, a) in order to guarantee that a newly calculated policy is optimal. This can be used to upper bound the total number of steps in suboptimal rounds. Consider all suboptimal rounds with |p̂_{t_k}(s, a, s') - p(s, a, s')| ≥ ε_p for some s', where a policy π̃_{t_k} with π̃_{t_k}(s) = a is played. Let m(s, a) be the number of these rounds and τ_i(s, a) (i = 1, ..., m(s, a)) their respective lengths. The mean passage time between any state s'' and s is upper bounded by T_M. Then by Markov's inequality, the probability that it takes more than 2 T_M steps to reach s from s'' is smaller than 1/2. Thus we may separate each round i into ⌊τ_i(s, a) / (2 T_M)⌋ intervals of length 2 T_M, in each of which the probability of visiting state s is at least 1/2. Thus we may lower bound the number of visits N_{s,a}(n) in (s, a) within n such intervals by an application of Chernoff-Hoeffding's inequality:\n\nP{ N_{s,a}(n) ≥ n/2 - √(n log T) } ≥ 1 - 1/T. (9)\n\nSince by Proposition 3, N_t(s, a) < 2 log(4 T^α |S|^2 |A|) / ε_p^2, we get\n\nΣ_{i=1}^{m(s,a)} ⌊τ_i(s, a) / (2 T_M)⌋ < c log(4 T^α |S|^2 |A|) / ε_p^2\n\nwith probability 1 - 1/T for a suitable constant c < 11. 
This gives for the expected regret in these rounds\n\nE( Σ_{i=1}^{m(s,a)} τ_i(s, a) ) < 2 c T_M log(4 T^α |S|^2 |A|) / ε_p^2 + 2 m(s, a) T_M + (1/T) T.\n\nApplying Corollary 2 and summing up over all (s, a), one sees that the expected regret due to suboptimal rounds cannot exceed\n\n2 c |S||A| T_M log(4 T^α |S|^2 |A|) / ε_p^2 + 2 T_M |S|^2 |A|^2 log_2( T / (|S||A|) ) + |S||A|.\n\n3.3.2 Loss by Policy Changes\n\nFor any policy π̃_t there may be some states from which the expected average reward for the next τ steps is larger than when starting in some other state. This does not play a role if τ → ∞. However, as we are playing our policies only for a finite number of steps before considering a change, we have to take into account that every time we switch policies, we may need a start-up phase to get into such a favorable state. On average, this cannot take more than T_M steps, as this time is sufficient to reach any \"good\" state from some \"bad\" state. This is made more precise in the following lemma. We omit a detailed proof.\n\nLemma 2. For all policies π, all starting states s_0 and all T ≥ 0\n\nE( Σ_{t=0}^{T-1} r(s_t, π(s_t)) ) ≥ T ρ(M, π) - T_M.\n\nBy Corollary 2, the corresponding expected regret after T steps is at most |S||A| T_M log_2( T / (|S||A|) ).\n\n3.3.3 Regret if Confidence Intervals Fail\n\nFinally, we have to take into account the error probabilities, with which in each round a transition probability or a reward, respectively, is not contained in its confidence interval. According to Lemma 1, the probability that this happens at some step t for a given state-action pair is\n\n< t^{-α} / (2 |S||A|) + |S| t^{-α} / (2 |S|^2 |A|) = t^{-α} / (|S||A|).\n\nNow let t_1 = 1, t_2, ..., t_N ≤ T be the steps in which a new round starts. As the regret in each round can be upper bounded by its length, one obtains for the regret caused by failure of confidence intervals\n\nΣ_{i=1}^{N-1} ( t_i^{-α} / (|S||A|) ) (t_{i+1} - t_i) < Σ_{i=1}^{N-1} ( t_i^{-α} / (|S||A|) ) c t_i < c Σ_{t=1}^{T} t^{1-α} / (|S||A|) < c',\n\nusing that t_{i+1} - t_i < c t_i for a suitable constant c = c(|S|, |A|, T_M) and provided that α > 2. 
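The role of the condition α > 2 can be checked numerically: the failure regret is controlled by a sum of terms t^(1-α), which stays bounded as T grows only if the exponent is below -1. The snippet below is an illustrative Python sketch, not part of the original analysis; the function name partial_sum and the chosen values of α and the horizon are assumptions of this example.

```python
import math


def partial_sum(alpha, horizon):
    '''Sum of t**(1 - alpha) for t = 1, ..., horizon.'''
    return sum(t ** (1.0 - alpha) for t in range(1, horizon + 1))


# alpha = 3 gives the series of 1 / t**2, whose partial sums stay below pi**2 / 6.
bounded = [partial_sum(3.0, n) for n in (10, 1000, 100000)]

# alpha = 2 gives the harmonic series, whose partial sums grow like log(horizon).
growing = [partial_sum(2.0, n) for n in (10, 1000, 100000)]
```

With α = 3 the contribution is a constant independent of T, whereas the borderline choice α = 2 would already add a term growing logarithmically in T.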
3.3.4 Putting Everything Together\n\nSumming up over all the sources of regret and replacing for ε_p yields the following theorem, which is a generalization of similar results that were achieved for the multi-armed bandit problem in [8].\n\nTheorem 2. On unichain MDPs, the expected total regret of the UCRL algorithm with respect to an (ε-)optimal policy after T > 1 steps can be upper bounded by\n\nE(R_T^ε) < const · ( |A| T_M κ_M^2 |S|^5 / ε^2 ) log T + 3 T_M |S|^2 |A|^2 log_2( T / (|S||A|) ), and\n\nE(R_T) < const · ( |A| T_M κ_M^2 |S|^5 / Δ^2 ) log T + 3 T_M |S|^2 |A|^2 log_2( T / (|S||A|) ).\n\n4 Remarks and Open Questions on Multichain MDPs\n\nIn a multichain MDP a policy π may split up the MDP into ergodic subchains S_i^π. Thus it may happen during the learning phase that one goes wrong and ends up in a part of the MDP that gives suboptimal return but cannot be left under any policy whatsoever. As already observed by Kearns, Singh [1], in this case it seems fair to compete with ρ*(M) := max_π min_{S_i^π} ρ(S_i^π, π).\n\nUnfortunately, the original UCRL algorithm may not work very well in this setting, as it is impossible for the algorithm to distinguish between a very low probability for a transition and its impossibility. Here the \"optimism in the face of uncertainty\" idea fails, as there is no way to falsify the wrong belief in a possible transition. Obviously, if we knew for each policy which subchains it induces on M (the MDP's ergodic structure), UCRL could choose an MDP M̃_t and a policy π̃_t that maximizes the reward among all plausible MDPs with the given ergodic structure. However, only the empirical ergodic structure (based on the observations so far) is known. As the empirical ergodic structure may not be reliable, one may additionally explore the ergodic structures of all policies. Alas, the number of additional exploration steps will depend on the smallest positive transition probability. 
If the latter is not known, it seems that logarithmic online regret bounds can no longer be guaranteed.\n\nHowever, we conjecture that for a slightly modified algorithm the logarithmic online regret bounds still hold for communicating MDPs, in which for any two states s, s' there is a suitable policy π such that s' is reachable from s under π (i.e., s, s' are contained in the same subchain S_i^π). As Theorem 1 does not hold for communicating MDPs in general, a proof would need a different analysis.\n\n5 Conclusion and Outlook\n\nBesides the open problems on multichain MDPs, it is an interesting question whether our results also hold when assuming for the mixing time not the slowest policy for reaching any state but the fastest. Another research direction is to consider value function approximation and continuous reinforcement learning problems. For practical purposes, using the variance of the estimates will reduce the width of the upper confidence bounds and will make the exploration even more focused, improving learning speed and regret bounds. In this setting, we have experimental results comparable to those of the MBIE algorithm [10], which clearly outperforms other learning algorithms like R-Max or ε-greedy.\n\nAcknowledgements. This work was supported in part by the Austrian Science Fund FWF (S9104-N04 SP4) and the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.\n\nReferences\n[1] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209-232, 2002.\n[2] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213-231, 2002.\n[3] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n[4] Alexander L. Strehl and Michael L. Littman. 
A theoretical analysis of model-based interval estimation. In Proc. 22nd ICML 2005, pages 857-864, 2005.\n[5] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proc. 23rd ICML 2006, pages 881-888, 2006.\n[6] Claude-Nicolas Fiechter. Efficient reinforcement learning. In Proc. 7th COLT, pages 88-97. ACM, 1994.\n[7] Peter Auer and Ronald Ortner. Online regret bounds for a new reinforcement learning algorithm. In Proc. 1st ACVW, pages 35-42. OCG, 2005.\n[8] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397-422, 2002.\n[9] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Mach. Learn., 47:235-256, 2002.\n[10] Alexander L. Strehl and Michael L. Littman. An empirical evaluation of interval estimation for Markov decision processes. In Proc. 16th ICTAI, pages 128-135. IEEE Computer Society, 2004.\n[11] Leslie P. Kaelbling. Learning in Embedded Systems. MIT Press, 1993.\n[12] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for reinforcement learning. In Proc. 20th ICML, pages 162-169. AAAI Press, 2003.\n[13] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222-255, 1997.\n[14] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Proc. 17th NIPS, pages 401-408. MIT Press, 2004.\n[15] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.\n[16] Grace E. Cho and Carl D. Meyer. Markov chain sensitivity measured by mean first passage times. Linear Algebra Appl., 316:21-28, 2000.\n", "award": [], "sourceid": 3052, "authors": [{"given_name": "Peter", "family_name": "Auer", "institution": null}, {"given_name": "Ronald", "family_name": "Ortner", "institution": null}]}