{"title": "Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5713, "page_last": 5723, "abstract": "Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.", "full_text": "Unifying PAC and Regret: Uniform PAC Bounds for\n\nEpisodic Reinforcement Learning\n\nChristoph Dann\n\nMachine Learning Department\nCarnegie-Mellon University\n\ncdann@cdann.net\n\nTor Lattimore\u2217\n\ntor.lattimore@gmail.com\n\nEmma Brunskill\n\nComputer Science Department\n\nStanford University\n\nebrun@cs.stanford.edu\n\nAbstract\n\nStatistical performance bounds for reinforcement learning (RL) algorithms can be\ncritical for high-stakes applications like healthcare. This paper introduces a new\nframework for theoretically measuring the performance of such algorithms called\nUniform-PAC, which is a strengthening of the classical Probably Approximately\nCorrect (PAC) framework. In contrast to the PAC framework, the uniform version\nmay be used to derive high probability regret guarantees and so forms a bridge\nbetween the two setups that has been missing in the literature. 
We demonstrate\nthe bene\ufb01ts of the new framework for \ufb01nite-state episodic MDPs with a new\nalgorithm that is Uniform-PAC and simultaneously achieves optimal regret and\nPAC guarantees except for a factor of the horizon.\n\n1\n\nIntroduction\n\nThe recent empirical successes of deep reinforcement learning (RL) are tremendously exciting, but the\nperformance of these approaches still varies signi\ufb01cantly across domains, each of which requires the\nuser to solve a new tuning problem [1]. Ultimately we would like reinforcement learning algorithms\nthat simultaneously perform well empirically and have strong theoretical guarantees. Such algorithms\nare especially important for high stakes domains like health care, education and customer service,\nwhere non-expert users demand excellent outcomes.\nWe propose a new framework for measuring the performance of reinforcement learning algorithms\ncalled Uniform-PAC. Brie\ufb02y, an algorithm is Uniform-PAC if with high probability it simultaneously\nfor all \u03b5 > 0 selects an \u03b5-optimal policy on all episodes except for a number that scales polynomially\nwith 1/\u03b5. Algorithms that are Uniform-PAC converge to an optimal policy with high probability\nand immediately yield both PAC and high probability regret bounds, which makes them superior to\nalgorithms that come with only PAC or regret guarantees. Indeed,\n\n(a) Neither PAC nor regret guarantees imply convergence to optimal policies with high probability;\n(b) (\u03b5, \u03b4)-PAC algorithms may be \u03b5/2-suboptimal in every episode;\n(c) Algorithms with small regret may be maximally suboptimal in\ufb01nitely often.\n\u2217Tor Lattimore is now at DeepMind, London\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fUniform-PAC algorithms suffer none of these drawbacks. 
One could hope that existing algorithms\nwith PAC or regret guarantees might be Uniform-PAC already, with only the analysis missing.\nUnfortunately this is not the case and modi\ufb01cation is required to adapt these approaches to satisfy\nthe new performance metric. The key insight for obtaining Uniform-PAC guarantees is to leverage\ntime-uniform concentration bounds such as the \ufb01nite-time versions of the law of iterated logarithm,\nwhich obviates the need for horizon-dependent con\ufb01dence levels.\nWe provide a new optimistic algorithm for episodic RL called UBEV that is Uniform PAC. Unlike its\npredecessors, UBEV uses con\ufb01dence intervals based on the law of iterated logarithm (LIL) which\nhold uniformly over time. They allow us to more tightly control the probability of failure events\nin which the algorithm behaves poorly. Our analysis is nearly optimal according to the traditional\nmetrics, with a linear dependence on the state space for the PAC setting and square root dependence\nfor the regret. Therefore UBEV is a Uniform PAC algorithm with PAC bounds and high probability\nregret bounds that are near optimal in the dependence on the length of the episodes (horizon) and\noptimal in the state and action spaces cardinality as well as the number of episodes. To our knowledge\nUBEV is the \ufb01rst algorithm with both near-optimal PAC and regret guarantees.\n\nNotation and setup. We consider episodic \ufb01xed-horizon MDPs with time-dependent dynamics,\nwhich can be formalized as a tuple M = (S,A, pR, P, p0, H). The statespace S and the actionspace\nA are \ufb01nite sets with cardinality S and A. The agent interacts with the MDP in episodes of H time\nsteps each. At the beginning of each time-step t \u2208 [H] the agent observes a state st and chooses an\naction at based on a policy \u03c0 that may depend on the within-episode time step (at = \u03c0(st, t)). 
The next state is sampled from the tth transition kernel s_{t+1} ∼ P(·|s_t, a_t, t) and the initial state from s_1 ∼ p_0. The agent then receives a reward drawn from a distribution pR(s_t, a_t, t) which can depend on s_t, a_t and t, with mean r(s_t, a_t, t) determined by the reward function. The reward distribution pR is supported on [0, 1].² The value function from time step t for policy π is defined as\n\nV^π_t(s) := E[ ∑_{i=t}^{H} r(s_i, a_i, i) | s_t = s ] = ∑_{s'∈S} P(s'|s, π(s, t), t) V^π_{t+1}(s') + r(s, π(s, t), t),\n\nand the optimal value function is denoted by V*_t. In any fixed episode, the quality of a policy π is evaluated by the total expected reward or return\n\nρ^π := E[ ∑_{i=1}^{H} r(s_i, a_i, i) | π ] = p_0^⊤ V^π_1,\n\nwhich is compared to the optimal return ρ* = p_0^⊤ V*_1. For this notation p_0 and the value functions V*_t, V^π_1 are interpreted as vectors of length S. If an algorithm follows policy π_k in episode k, then the optimality gap in episode k is Δ_k := ρ* − ρ^{π_k}, which is bounded by Δ_max = max_π (ρ* − ρ^π) ≤ H. We let N_ε := ∑_{k=1}^{∞} I{Δ_k > ε} be the number of ε-errors and R(T) := ∑_{k=1}^{T} Δ_k be the regret after T episodes. 
Note that T is the number of episodes and not total time steps (which is HT after T episodes) and k is an episode index while t usually denotes time indices within an episode. The Õ notation is similar to the usual O-notation but suppresses additional polylog-factors, that is, g(x) = Õ(f(x)) iff there is a polynomial p such that g(x) = O(f(x) p(log(x))).\n\n2 Uniform PAC and Existing Learning Frameworks\n\nWe briefly summarize the most common performance measures used in the literature.\n\n• (ε, δ)-PAC: There exists a polynomial function F_PAC(S, A, H, 1/ε, log(1/δ)) such that P(N_ε > F_PAC(S, A, H, 1/ε, log(1/δ))) ≤ δ.\n• Expected Regret: There exists a function F_ER(S, A, H, T) such that E[R(T)] ≤ F_ER(S, A, H, T).\n• High Probability Regret: There exists a function F_HPR(S, A, H, T, log(1/δ)) such that P(R(T) > F_HPR(S, A, H, T, log(1/δ))) ≤ δ.\n• Uniform High Probability Regret: There exists a function F_UHPR(S, A, H, T, log(1/δ)) such that P(exists T : R(T) > F_UHPR(S, A, H, T, log(1/δ))) ≤ δ.\n\nIn all definitions the function F should be polynomial in all arguments. For notational conciseness we often omit some of the parameters of F where the context is clear. The different performance guarantees are widely used (e.g. PAC: [2, 3, 4, 5]; (uniform) high-probability regret: [6, 7, 8]; expected regret: [9, 10, 11, 12]). Due to space constraints, we will not discuss Bayesian-style performance guarantees that only hold in expectation with respect to a distribution over problem instances.\n\n²The reward may be allowed to depend on the next state with no further effort in the proofs. The boundedness assumption could be replaced by the assumption of subgaussian noise with known subgaussian parameter.\n\n
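To make these quantities concrete, here is a minimal Python sketch (ours, not code from the paper) computing N_ε and R(T) from a sequence of per-episode optimality gaps Δ_k; the gap sequence below is made up purely for illustration:

```python
# Illustrative sketch (not from the paper): computing the number of
# eps-errors N_eps and the regret R(T) from per-episode optimality gaps.

def num_eps_errors(gaps, eps):
    """N_eps = number of episodes k with Delta_k > eps."""
    return sum(1 for d in gaps if d > eps)

def regret(gaps, T):
    """R(T) = sum of the first T optimality gaps Delta_k."""
    return sum(gaps[:T])

# Made-up gap sequence for illustration only.
gaps = [1.0, 0.5, 0.5, 0.25, 0.1, 0.1, 0.05, 0.0, 0.0, 0.0]

print(num_eps_errors(gaps, 0.2))  # -> 4 (episodes with gap > 0.2)
print(regret(gaps, 10))           # -> 2.5
```

A uniform-PAC bound controls num_eps_errors(gaps, eps) for every eps simultaneously, which is why it subsumes both the PAC statement (one fixed eps) and, as shown later, regret statements (the integral of the gaps).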
We will shortly discuss the limitations of the frameworks listed above, but first formally define the Uniform-PAC criterion.\n\nDefinition 1 (Uniform-PAC). An algorithm is Uniform-PAC for δ > 0 if\n\nP(exists ε > 0 : N_ε > F_UPAC(S, A, H, 1/ε, log(1/δ))) ≤ δ,\n\nwhere F_UPAC is polynomial in all arguments.\n\nAll the performance metrics are functions of the distribution of the sequence of errors over the episodes (Δ_k)_{k∈N}. Regret bounds are the integral of this sequence up to time T, which is a random variable. The expected regret is just the expectation of the integral, while the high-probability regret is a quantile. PAC bounds are the quantile of the size of the superlevel set for a fixed level ε. Uniform-PAC bounds are like PAC bounds, but hold for all ε simultaneously.\n\nLimitations of regret. Since regret guarantees only bound the integral of Δ_k over k, they do not distinguish between making a few severe mistakes and making many small mistakes. In fact, since regret bounds provably grow with the number of episodes T, an algorithm that achieves optimal regret may still make infinitely many mistakes (of arbitrary quality, see proof of Theorem 2 below). This is highly undesirable in high-stakes scenarios. For example, in drug treatment optimization in healthcare we would like to distinguish between infrequent severe complications (few large Δ_k) and frequent minor side effects (many small Δ_k). In fact, even with an optimal regret bound, we could still serve infinitely many patients with the worst possible treatment.\n\nLimitations of PAC. PAC bounds limit the number of mistakes for a given accuracy level ε, but are otherwise non-restrictive. That means an algorithm with Δ_k > ε/2 for all k almost surely might still be (ε, δ)-PAC. 
Worse, many algorithms designed to be (ε, δ)-PAC actually exhibit this behavior because they explicitly halt learning once an ε-optimal policy has been found. The less widely used TCE (total cost of exploration) bounds [13] and KWIK guarantees [14] suffer from the same issue and for conciseness are not discussed in detail.\n\nAdvantages of Uniform-PAC. The new criterion overcomes the limitations of PAC and regret guarantees by measuring the number of ε-errors at every level simultaneously. By definition, algorithms that are Uniform-PAC for a δ are (ε, δ)-PAC for all ε > 0. We will soon see that an algorithm with a non-trivial Uniform-PAC guarantee also has small regret with high probability. Furthermore, there is no loss in the reduction, so that an algorithm with optimal Uniform-PAC guarantees also has optimal regret, at least in the episodic RL setting. In this sense Uniform-PAC is the missing bridge between regret and PAC. Finally, for algorithms based on confidence bounds, Uniform-PAC guarantees are usually obtained without much additional work by replacing standard concentration bounds with versions that hold uniformly over episodes (e.g. using the law of the iterated logarithm). In this sense we think Uniform-PAC is the new 'gold standard' of theoretical guarantees for RL algorithms.\n\n2.1 Relationships between Performance Guarantees\n\nExisting theoretical analyses usually focus exclusively on either the regret or the PAC framework. Besides occasional heuristic translations, Proposition 4 in [15] and Corollary 3 in [6] are the only results relating a notion of PAC and regret that we are aware of. Yet the guarantees there are not widely used³\n\n³The average per-step regret in [6] is superficially a PAC bound, but does not hold over infinitely many time-steps and exhibits the limitations of a conventional regret bound. 
The translation to average loss in [15] comes at additional costs due to the discounted infinite horizon setting.\n\nFigure 1: Visual summary of the relationships among the different learning frameworks: Expected regret (ER) and PAC preclude each other, while the other crossed arrows represent only a does-not-imply relationship. Blue arrows represent implications. For details see the theorem statements.\n\nunlike the definitions given above, which we now formally relate to each other. A simplified overview of the relations discussed below is shown in Figure 1.\n\nTheorem 1. No algorithm can achieve\n• a sub-linear expected regret bound for all T and\n• a finite (ε, δ)-PAC bound for a small enough ε\nsimultaneously for all two-armed bandits with Bernoulli reward distributions. This implies that such guarantees also cannot be satisfied simultaneously for all episodic MDPs.\n\nA full proof is in Appendix A.1, but the intuition is simple. Suppose a two-armed Bernoulli bandit has mean rewards 1/2 + ε and 1/2 respectively and the second arm is chosen at most F < ∞ times with probability at least 1 − δ; then one can easily show that in an alternative bandit with mean rewards 1/2 + ε and 1/2 + 2ε there is a non-zero probability that the second arm is played finitely often, and in this bandit the expected regret will be linear. Therefore, sub-linear expected regret is only possible if each arm is pulled infinitely often almost surely.\n\nTheorem 2. The following statements hold for performance guarantees in episodic MDPs:\n\n(a) If an algorithm satisfies an (ε, δ)-PAC bound with F_PAC = Θ(1/ε²), then it satisfies for a specific T = Θ(ε⁻³) an F_HPR = Θ(T^{2/3}) bound. 
Further, there is an MDP and algorithm that satisfies the (ε, δ)-PAC bound F_PAC = Θ(1/ε²) on that MDP and has regret R(T) = Ω(T^{2/3}) on that MDP for any T. That means an (ε, δ)-PAC bound with F_PAC = Θ(1/ε²) can only be converted to a high-probability regret bound with F_HPR = Ω(T^{2/3}).\n\n(b) For any chosen ε, δ > 0 and F_PAC, there is an MDP and algorithm that satisfies the (ε, δ)-PAC bound F_PAC on that MDP and has regret R(T) = Ω(T) on that MDP. That means an (ε, δ)-PAC bound cannot be converted to a sub-linear uniform high-probability regret bound.\n\n(c) For any F_UHPR(T, δ) with F_UHPR(T, δ) → ∞ as T → ∞, there is an algorithm that satisfies that uniform high-probability regret bound on some MDP but makes infinitely many mistakes for any sufficiently small accuracy level ε > 0 for that MDP. Therefore, a high-probability regret bound (uniform or not) cannot be converted to a finite (ε, δ)-PAC bound.\n\n(d) For any F_UHPR(T, δ) there is an algorithm that satisfies that uniform high-probability regret bound on some MDP but suffers expected regret E[R(T)] = Ω(T) on that MDP.\n\nFor most interesting RL problems, including episodic MDPs, the worst-case expected regret grows with O(√T). The theorem shows that establishing an optimal high probability regret bound does not imply any finite PAC bound. While PAC bounds may be converted to regret bounds, the resulting bounds are necessarily severely suboptimal with a rate of T^{2/3}. The next theorem formalises the claim that Uniform-PAC is stronger than both the PAC and high-probability regret criteria.\n\nTheorem 3. 
Suppose an algorithm is Uniform-PAC for some δ with F_UPAC = Õ(C₁/ε + C₂/ε²), where C₁, C₂ > 0 are constant in ε but may depend on other quantities such as S, A, H, log(1/δ). Then the algorithm\n\n(a) converges to optimal policies with high probability: P(lim_{k→∞} Δ_k = 0) ≥ 1 − δ.\n(b) is (ε, δ)-PAC with bound F_PAC = F_UPAC for all ε.\n(c) enjoys a high-probability regret at level δ with F_UHPR = Õ(√(C₂T) + max{C₁, C₂}).\n\nObserve that stronger uniform PAC bounds lead to stronger regret bounds, and for RL in episodic MDPs an optimal uniform-PAC bound implies a uniform regret bound. To our knowledge, there are no existing approaches with PAC or regret guarantees that are Uniform-PAC. PAC methods such as MBIE, MoRMax, UCRL-γ, UCFH, Delayed Q-Learning or Median-PAC all depend on advance knowledge of ε and eventually stop improving their policies. Even when disabling the stopping condition, these methods are not uniform-PAC as their confidence bounds only hold for finitely many episodes and are eventually violated according to the law of iterated logarithms. Existing algorithms with uniform high-probability regret bounds such as UCRL2 or UCBVI [16] also do not satisfy uniform-PAC bounds since they use upper confidence bounds with width √(log(T)/n), where T is the number of observed episodes and n is the number of observations for a specific state and action. The presence of log(T) causes the algorithm to try each action in each state infinitely often. One might begin to wonder if uniform-PAC is too good to be true. Can any algorithm meet the requirements? We demonstrate in Section 4 that the answer is yes by showing that UBEV has meaningful Uniform-PAC bounds. 
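For intuition, the regret conversion in Theorem 3(c) can be derived heuristically (this is our sketch, not the paper's proof): let ε_k denote the k-th largest optimality gap among the first T episodes, invert the uniform-PAC bound k ≤ N_{ε_k}, and sum over k:

```latex
% Heuristic sketch of Theorem 3(c); the paper's full proof is in its appendix.
\[
k \le N_{\varepsilon_k}
  \le \tilde O\!\Big(\tfrac{C_1}{\varepsilon_k} + \tfrac{C_2}{\varepsilon_k^2}\Big)
\;\Longrightarrow\;
\varepsilon_k \le \tilde O\!\Big(\tfrac{C_1}{k} + \sqrt{\tfrac{C_2}{k}}\Big),
\qquad
R(T) = \sum_{k=1}^{T} \varepsilon_k
  \le \tilde O\!\Big(C_1 \log T + \sqrt{C_2\,T}\Big)
  = \tilde O\!\Big(\sqrt{C_2\,T} + \max\{C_1, C_2\}\Big).
\]
```

The final step uses that Õ suppresses the log T factor, leaving the √(C₂T) + max{C₁, C₂} form stated in the theorem.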
A key technique that allows us to prove these bounds is the use of finite-time law-of-iterated-logarithm confidence bounds, which decrease at rate √((log log n)/n).\n\n3 The UBEV Algorithm\n\nThe pseudo-code for the proposed UBEV algorithm is given in Algorithm 1. In each episode it follows an optimistic policy π_k that is computed by backwards induction using a carefully chosen confidence interval on the transition probabilities in each state. In line 8 an optimistic estimate of the Q-function for the current state-action-time triple is computed using the empirical estimates of the expected next-state value V̂_next ∈ R (given that the values at the next time are Ṽ_{t+1}) and expected immediate reward r̂, plus confidence bounds (H − t)φ and φ. We show in Lemma D.1 in the appendix that the policy update in Lines 3–9 finds an optimal solution to max_{P', r', V', π'} E_{s∼p_0}[V'_1(s)] subject to the constraints that for all s ∈ S, a ∈ A, t ∈ [H],\n\nV'_t(s) = r'(s, π'(s, t), t) + P'(s, π'(s, t), t)^⊤ V'_{t+1}   (Bellman Equation)\nV'_{H+1} = 0,  P'(s, a, t) ∈ Δ^S,  r'(s, a, t) ∈ [0, 1]\n|[(P' − P̂_k)(s, a, t)]^⊤ V'_{t+1}| ≤ φ(s, a, t)(H − t)   (1)\n|r'(s, a, t) − r̂_k(s, a, t)| ≤ φ(s, a, t)   (2)\n\nwhere (P' − P̂_k)(s, a, t) is short for P'(s, a, t) − P̂_k(s, a, t) = P'(·|s, a, t) − P̂_k(·|s, a, t) and\n\nφ(s, a, t) = √( (2 ln ln max{e, n(s, a, t)} + ln(18SAH/δ)) / n(s, a, t) ) = O( √( ln(SAH ln(n(s, a, t))/δ) / n(s, a, t) ) )\n\nis the width of a confidence bound with e = exp(1), and P̂_k(s'|s, a, t) = m(s', s, a, t)/n(s, a, t) are the 
empirical transition probabilities and r̂_k(s, a, t) = l(s, a, t)/n(s, a, t) the empirical immediate rewards (both at the beginning of the kth episode). Our algorithm is conceptually similar to other algorithms based on the optimism principle such as MBIE [5], UCFH [3], UCRL2 [6] or UCRL-γ [2], but there are several key differences:\n\n• Instead of using confidence intervals over the transition kernel by itself, we incorporate the value function directly into the concentration analysis. Ultimately this saves a factor of S in the sample complexity, but the price is a more difficult analysis. Previously MoRMax [17] also used the idea of directly bounding the transition and value function, but in a very different algorithm that required discarding data and had a less tight bound. A similar technique has been used by Azar et al. [16].\n\nAlgorithm 1: UBEV (Upper Bounding the Expected Next State Value) Algorithm\nInput: failure tolerance δ ∈ (0, 1]\n1  n(s, a, t) = l(s, a, t) = m(s', s, a, t) = 0;  Ṽ_{H+1}(s') := 0  ∀ s, s' ∈ S, a ∈ A, t ∈ [H]\n2  for k = 1, 2, 3, . . . do\n       /* Optimistic planning */\n3      for t = H to 1 do\n4          for s ∈ S do\n5              for a ∈ A do\n6                  φ := √( (2 ln ln(max{e, n(s, a, t)}) + ln(18SAH/δ)) / n(s, a, t) )   // confidence bound\n7                  r̂ := l(s, a, t)/n(s, a, t);  V̂_next := m(·, s, a, t)^⊤ Ṽ_{t+1} / n(s, a, t)   // empirical estimates\n8                  Q(a) := min{1, r̂ + φ} + min{ max_{s'} Ṽ_{t+1}(s'),  V̂_next + (H − t)φ }\n9              π_k(s, t) := arg max_a Q(a);  Ṽ_t(s) := Q(π_k(s, t))\n       /* Execute policy for one episode */\n10     s_1 ∼ p_0;\n11     for t = 1 to H do\n12         a_t := π_k(s_t, t);  r_t ∼ pR(s_t, a_t, t) and s_{t+1} ∼ P(s_t, a_t, t)\n13         n(s_t, a_t, t)++;  m(s_{t+1}, s_t, a_t, t)++;  l(s_t, a_t, t) += r_t   // update statistics\n\nFigure 2: Empirical comparison of optimism-based algorithms with frequentist regret or PAC bounds on a randomly generated MDP with 3 actions, time horizon 10 and S = 5, 50, 200 states. All algorithms are run with parameters that satisfy their bound requirements. A detailed description of the experimental setup including a link to the source code can be found in Appendix B.\n\n• Many algorithms update their policy less and less frequently (usually when the number of samples doubles), and only finitely often in total. Instead, we update the policy after every episode, which means that UBEV immediately leverages new observations.\n\n• Confidence bounds in existing algorithms that keep improving the policy (e.g. Jaksch et al. [6], Azar et al. 
[16]) scale at a rate √(log(k)/n), where k is the number of episodes played so far and n is the number of times the specific (s, a, t) has been observed. As the results of a brief empirical comparison in Figure 2 indicate, this leads to slow learning (compare UCBVI_1 and UBEV's performance, which differ essentially only by their use of different rate bounds). Instead, the width of UBEV's confidence bounds φ scales at rate √(ln ln(max{e, n})/n) ≈ √((log log n)/n), which is the best achievable rate and results in significantly faster learning.\n\n4 Uniform PAC Analysis\n\nWe now discuss the Uniform-PAC analysis of UBEV which results in the following Uniform-PAC and regret guarantee.\n\n[Figure 2 panels: expected return vs. number of episodes (10³–10⁷) for S = 5, 50, 200; compared methods: MoRMax, UBEV, UCRL2, MBIE, MedianPAC, DelayedQL, OIM, UCFH, UCBVI_1, UCBVI_2, and the optimal return.]\n\nTheorem 4. Let π_k be the policy of UBEV in the kth episode. Then with probability at least 1 − δ, for all ε > 0 jointly, the number of episodes k where the expected return from the start state is not ε-optimal (that is, Δ_k > ε) is at most\n\nO( (SAH⁴/ε²) min{1 + εS²A, S} polylog(A, S, H, 1/ε, 1/δ) ).\n\nTherefore, with probability at least 1 − δ, UBEV converges to optimal policies and for all episodes T has regret\n\nR(T) = O( H²(√(SAT) + S³A²) polylog(S, A, H, T) ).\n\nHere polylog(x . . .) is a function that can be bounded by a polynomial of a logarithm, that is, ∃ k, C : polylog(x . . .) ≤ ln(x . . .)^k + C. 
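For intuition, UBEV's LIL confidence width φ from Section 3 can be written out directly; below is a minimal sketch (ours, with illustrative parameter values) contrasting it with a √(ln(T)/n)-style width of the kind used by UCRL2/UCBVI, assuming n ≥ 1:

```python
import math

def phi(n, S, A, H, delta):
    """UBEV's LIL-style confidence width (Section 3):
    sqrt((2 ln ln max(e, n) + ln(18*S*A*H/delta)) / n); requires n >= 1."""
    return math.sqrt((2.0 * math.log(math.log(max(math.e, n)))
                      + math.log(18.0 * S * A * H / delta)) / n)

def log_T_width(n, T, delta):
    """Illustrative sqrt(ln(T/delta)/n)-style width: it grows with the
    episode count T even when n is fixed, unlike the LIL width above."""
    return math.sqrt(math.log(T / delta) / n)

S, A, H, delta = 5, 3, 10, 0.1  # illustrative values, not from the paper
# The LIL width shrinks as more samples of (s, a, t) arrive ...
assert phi(10000, S, A, H, delta) < phi(100, S, A, H, delta)
# ... while a log(T)-based width increases with T for fixed n.
assert log_T_width(100, 10**6, delta) > log_T_width(100, 10**3, delta)
```

This is the mechanism behind the Figure 2 comparison: holding the failure probability fixed, the ln ln n numerator keeps φ shrinking at essentially the fastest admissible uniform rate.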
In Appendix C we provide a lower bound on the sample complexity showing that if ε < 1/(S²A), the Uniform-PAC bound is tight up to log-factors and a factor of H. To our knowledge, UBEV is the first algorithm with both near-tight (up to H factors) high probability regret and (ε, δ)-PAC bounds, as well as the first algorithm with any nontrivial uniform-PAC bound. Using Theorem 3, the convergence and regret bounds follow immediately from the uniform PAC bound. After a discussion of the different confidence bounds allowing us to prove uniform-PAC bounds, we will provide a short proof sketch of the uniform PAC bound.\n\n4.1 Enabling Uniform PAC With Law-of-Iterated-Logarithm Confidence Bounds\n\nTo have a PAC bound for all ε jointly, it is critical that UBEV continually make use of new experience. If UBEV stopped leveraging new observations after some fixed number, it would not be able to distinguish with high probability which of the remaining plausible MDPs do or do not have optimal policies that are sufficiently optimal in the other MDPs. The algorithm therefore could potentially follow a policy that is not at least ε-optimal for infinitely many episodes for a sufficiently small ε. To enable UBEV to incorporate all new observations, the confidence bounds in UBEV must hold for an infinite number of updates. We therefore require a proof that the total probability of all possible failure events (of the high confidence bounds not holding) is bounded by δ, in order to obtain high probability guarantees. 
In contrast to prior (ε, δ)-PAC proofs that only consider a finite number of failure events (which is enabled by requiring an RL algorithm to stop using additional data), we must bound the probability of an infinite set of possible failure events.\n\nSome choices of confidence bounds will hold uniformly across all sample sizes but are not sufficiently tight for uniform PAC results. For example, the recent work by Azar et al. [16] uses confidence intervals that shrink at a rate of √(ln(T)/n), where T is the number of episodes and n is the number of samples of an (s, a) pair at a particular time step. This confidence interval will hold for all episodes, but these intervals do not shrink sufficiently quickly and can even increase. One simple approach for constructing confidence intervals that is sufficient for uniform PAC guarantees is to combine bounds for a fixed number of samples with a union bound allocating failure probability δ/n² to the failure case with n samples. This results in confidence intervals that shrink at rate √((1/n) ln n). Interestingly, we know of no algorithms that do so in our setting.\n\nWe follow a similarly simple but much stronger approach of using law-of-iterated-logarithm (LIL) bounds that shrink at the better rate of √((1/n) ln ln n). Such bounds have sparked recent interest in sequential decision making [18, 19, 20, 21, 22], but to the best of our knowledge we are the first to leverage them for RL. We prove several general LIL bounds in Appendix F and explain how we use these results in our analysis in Appendix E.2. These LIL bounds are both sufficient to ensure uniform PAC bounds and much tighter (and therefore will lead to much better performance) than √((1/n) ln T) bounds. 
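The simple δ/n² union-bound construction just described can be sketched numerically as follows (our illustration; the normalizing constant 6/π² is chosen so the allocated failure probabilities sum to δ, and is not taken from the paper):

```python
import math

def union_bound_width(n, delta):
    """Hoeffding-style width valid for all sample sizes n simultaneously:
    allocate delta_n = (6/pi^2) * delta / n^2 to sample size n, so that
    sum_{n>=1} delta_n = delta. Width = sqrt(ln(2/delta_n) / (2n)),
    which shrinks at rate sqrt(ln(n)/n)."""
    delta_n = (6.0 / math.pi**2) * delta / n**2
    return math.sqrt(math.log(2.0 / delta_n) / (2.0 * n))

delta = 0.05
# The per-n failure probabilities indeed sum to (at most) delta ...
total = sum((6.0 / math.pi**2) * delta / n**2 for n in range(1, 100000))
assert total <= delta
# ... and the resulting interval width still shrinks with n.
assert union_bound_width(10000, delta) < union_bound_width(100, delta)
```

The extra ln n inside the square root is exactly the price of this naive allocation, which LIL bounds reduce to ln ln n.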
Indeed, LIL bounds have the tightest possible rate dependence on the number of samples n for a bound that holds for all timesteps (though they are not tight with respect to constants).\n\n4.2 Proof Sketch\n\nWe now provide a short overview of our uniform PAC bound in Theorem 4. It follows the typical scheme for optimism-based algorithms: we show that in each episode UBEV follows a policy that is optimal with respect to the MDP M̃_k that yields highest return in a set of MDPs M_k given by the constraints in Eqs. (1)–(2) (Lemma D.1 in the appendix). We then define a failure event F (for more details see below) such that on the complement F^C, the true MDP is in M_k for all k.\n\nUnder the event that the true MDP is in the desired set, the value Ṽ^{π_k}_1 of π_k in MDP M̃_k is higher than the optimal value function of the true MDP M, i.e., V*_1 ≤ Ṽ^{π_k}_1 (Lemma E.16). Therefore, the optimality gap is bounded by Δ_k ≤ p_0^⊤(Ṽ^{π_k}_1 − V^{π_k}_1). The right-hand side of this expression is then decomposed via a standard identity (Lemma E.15) as\n\n∑_{t=1}^{H} ∑_{(s,a)∈S×A} w_{tk}(s, a) ((P̃_k − P)(s, a, t))^⊤ Ṽ^{π_k}_{t+1} + ∑_{t=1}^{H} ∑_{(s,a)∈S×A} w_{tk}(s, a) (r̃_k(s, a, t) − r(s, a, t)),\n\nwhere w_{tk}(s, a) is the probability that when following policy π_k in the true MDP we encounter s_t = s and a_t = a. The quantities P̃_k, r̃_k are the model parameters of the optimistic MDP M̃_k. For the sake of conciseness, we ignore the second term above in the following, which can be bounded by ε/3 in the same way as the first. 
We further decompose the first term as\n\n∑_{t∈[H]} ∑_{(s,a)∈L^c_{tk}} w_{tk}(s, a) ((P̃_k − P)(s, a, t))^⊤ Ṽ^{π_k}_{t+1}   (3)\n+ ∑_{t∈[H]} ∑_{(s,a)∈L_{tk}} w_{tk}(s, a) ((P̃_k − P̂_k)(s, a, t))^⊤ Ṽ^{π_k}_{t+1} + ∑_{t∈[H]} ∑_{(s,a)∈L_{tk}} w_{tk}(s, a) ((P̂_k − P)(s, a, t))^⊤ Ṽ^{π_k}_{t+1}   (4)\n\nwhere L_{tk} = { (s, a) ∈ S × A : w_{tk}(s, a) ≥ w_min = ε/(3HS²) } is the set of state-action pairs with non-negligible visitation probability. The value of w_min is chosen so that (3) is bounded by ε/3. Since Ṽ^{π_k} is the optimal solution of the optimization problem in Eq. (1), we can bound\n\n|((P̃_k − P̂_k)(s, a, t))^⊤ Ṽ^{π_k}_{t+1}| ≤ φ_k(s, a, t)(H − t) = O( √( H² ln(ln(n_{tk}(s, a))/δ) / n_{tk}(s, a) ) ),   (5)\n\nwhere φ_k(s, a, t) is the value of φ(s, a, t) and n_{tk}(s, a) the value of n(s, a, t) right before episode k. Further we decompose\n\n|((P̂_k − P)(s, a, t))^⊤ Ṽ^{π_k}_{t+1}| ≤ ‖(P̂_k − P)(s, a, t)‖₁ ‖Ṽ^{π_k}_{t+1}‖_∞ ≤ O( √( SH² ln(ln(n_{tk}(s, a))/δ) / n_{tk}(s, a) ) ),   (6)\n\nwhere the second inequality follows from a standard concentration bound used in the definition of the failure event F (see below). Substituting this and (5) into (4) leads to\n\n(4) ≤ O( ∑_{t=1}^{H} ∑_{(s,a)∈L_{tk}} w_{tk}(s, a) √( SH² ln(ln(n_{tk}(s, a))/δ) / n_{tk}(s, a) ) ),   (7)\n\ni.e., ∑_i