{"title": "Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model", "book": "Advances in Neural Information Processing Systems", "page_first": 5625, "page_last": 5634, "abstract": "This paper focuses on the problem of computing an $\\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model. We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.", "full_text": "Almost Horizon-Free Structure-Aware\n\nBest Policy Identi\ufb01cation with a Generative Model\n\nInstitute for Computational and Mathematical Engineering,\n\nAndrea Zanette\n\nStanford University, CA\nzanette@stanford.edu\n\nMykel J. Kochenderfer\n\nEmma Brunskill\n\nDepartment of Aeronautics and Astronautics,\n\nDepartment of Computer Science,\n\nStanford University, CA\nmykel@stanford.edu\n\nStanford University, CA\n\nebrun@cs.stanford.edu\n\nAbstract\n\nThis paper focuses on the problem of computing an \u01eb-optimal policy in a discounted\nMarkov Decision Process (MDP) provided that we can access the reward and\ntransition function through a generative model. 
We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function, to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.

1 Introduction

A key goal is to design reinforcement learning (RL) agents that can leverage problem structure to efficiently learn a good policy, especially in problems with very long time horizons. Ideally the RL algorithm should be able to adjust without a priori information about the problem structure. Formal analyses that characterize the performance of such algorithms by yielding instance-dependent bounds help to advance our core understanding of the characteristics that govern the hardness of learning to make good decisions under uncertainty.

Though there is relatively limited work in reinforcement learning, strong problem-dependent guarantees are available for multi-armed bandits. In particular, well-known bounds for online learning scale as a function of the gap between the expected reward of a particular action and the optimal action [ABF02] and also of the variance of the rewards [AMS09]. In the pure exploration setting in bandits, which is related to the setting we consider in this paper, there exist multiple algorithms with problem-dependent bounds [EMM06; MM94; MSA08; Jam+14; BMS09; ABM10; GGL12; KKS13] of this form. Ideally the complexity of learning to make good decisions in reinforcement learning tasks would scale with the previously identified quantities of gap and variance of the next-state value function. 
As a step towards this, in this paper we introduce an algorithm for an RL agent operating in a discrete state and action space that has access to a generative model and can leverage problem-dependent structure to obtain strong instance-dependent PAC sample complexity bounds that are a function of the variance of the rewards and next-state value functions, as well as the gaps between the optimal and suboptimal state-action values. While the sequential setting brings additional difficulties due to a possibly long horizon, our bounds also show that, in the dominant terms, our approach avoids suffering any horizon dependence for suboptimal actions beyond the scaling of the value function. This significantly improves in statistical efficiency over prior worst-case bounds for the generative model case [GAMK13; Sid+18] and matches existing worst-case bounds in worst-case settings.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To do so we introduce a novel algorithm structure that acquires samples of state-action pairs in iterative rounds. A slight variant of the well-known simulation lemma (see e.g. [KMN02]) suggests that in order to improve our estimate of the optimal value function and policy, it is sufficient to ensure that after each round of sampling, the confidence intervals shrink over the MDP parameter estimates of both the state–action pairs visited by the optimal policy and the state–action pairs visited by the empirically-optimal policy. While of course both are unknown, we show that we can implicitly maintain a set of candidate policies that are ε-accurate, and by ensuring that we shrink the confidence sets of all state–action pairs likely to be visited by any such policy, we are also guaranteed (with high probability) to shrink the confidence intervals of the optimal policy. 
Interestingly, we can show that by focusing on such state–action pairs, we can avoid the horizon dependence on suboptimal actions. The key idea is to take into account the learned MDP dynamics to enforce a constraint on the suboptimality of the candidate policies. The sampling strategy is derived by solving a minimax problem that minimizes the number of samples needed to guarantee that every policy in the set of candidate policies is accurately estimated. Importantly, this minimax problem can be reformulated as a convex minimization problem that can be solved with any standard solver for convex optimization.

Our algorithmic approach is quite different from many prior approaches, both in the generative model setting and the online setting. When a generative model is available, the available worst-case optimal algorithms [AMK12; Sid+18] allocate samples uniformly to all state-action pairs. We show our approach can be substantially more effective for the general case of MDPs with heterogeneous structure, and even for the pathologically hard instances, because of the reduced horizon dependence on suboptimal actions. Note too that our approach is quite different from online RL algorithms that often (implicitly) allocate exploration budget to state-action pairs encountered by the policy with the most optimistic upper bound [JOA10; AOM17; OVR13; DB15; DLB17; SLL09; LH14], since here we explicitly reason about the reduction in the confidence intervals across a large set of policies whose value is near the empirical optimal value at this round.

2 Notation and Preliminaries

We consider discounted infinite horizon MDPs [SB18], which are defined by a tuple M = ⟨S, A, p, r, γ⟩, where S and A are the state and action spaces with cardinality S and A, respectively. 
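As a purely illustrative aside (not from the paper), the tabular objects just defined, together with the generative-model access used throughout, can be sketched in a few lines; all class and function names below are our own, and the Bernoulli reward is an assumption made only for the sketch:

```python
import numpy as np

class GenerativeModel:
    """Illustrative tabular generative model: query any (s, a), get (reward, next state)."""

    def __init__(self, p, r, rng=None):
        # p[s, a] is a probability vector over next states; r[s, a] is a mean reward in [0, 1].
        self.p, self.r = p, r
        self.rng = rng or np.random.default_rng(0)

    def sample(self, s, a):
        s_next = self.rng.choice(self.p.shape[-1], p=self.p[s, a])
        reward = float(self.rng.random() < self.r[s, a])  # Bernoulli reward with mean r(s, a)
        return reward, s_next

def empirical_mdp(model, n_per_pair, S, A):
    """Maximum-likelihood estimates (p_hat, r_hat) from n samples of every (s, a) pair."""
    p_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_pair):
                reward, s_next = model.sample(s, a)
                p_hat[s, a, s_next] += 1.0
                r_hat[s, a] += reward
    return p_hat / n_per_pair, r_hat / n_per_pair
```

The uniform allocation `n_per_pair` mirrors the strategy of prior worst-case algorithms; the point of the paper is precisely to replace this uniform loop with a structure-aware allocation.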
We denote by p(s′ | s, a) the probability of transitioning to state s′ after taking action a in state s, while r(s, a) ∈ [0, 1] is the average instantaneous reward collected and R(s, a) ∈ [0, 1] the corresponding random variable. The value function of policy π is denoted by V^π. If ρ is the initial starting distribution then V^π(ρ) = Σ_s ρ_s V^π(s). The value function of the optimal policy π⋆ is denoted by V⋆ = V^{π⋆}. We call Var R(s, a) and Var_{p(s,a)} V⋆ the variances of R(s, a) and of V⋆(s′), where s′ ∼ p(s, a). The agent interacts with the MDP via a generative model that takes as input an (s, a) pair and returns a random sample of the reward R(s, a) and a random next state s+ according to the transition model s+ ∼ p(s, a). The reinforcement learning agent maintains an empirical MDP M̂_k = ⟨S, A, p̂_k, r̂_k, γ⟩ for every iteration/episode k; the maximum likelihood transitions p̂_k(s, a) and rewards r̂_k(s, a) have received n^k_{sa} samples. Variables with a hat refer to the empirical MDP M̂_k and the subscript indicates which iteration/episode they refer to; V̂⋆_k is the empirical estimate of the optimal value function using MDP M̂_k, with empirical optimal policy π̂⋆_k. We denote with w^{π,ρ}_{sa} = Σ_{t=0}^∞ γ^t Pr(s, a, t, ρ) the discounted sum of visit probabilities Pr(s, a, t, ρ) of the (s, a) pair at timestep t if the starting state is drawn from ρ, and ŵ^{π,k,ρ}_{sa} is its analogue on M̂_k. We use the Õ(·) notation to indicate a quantity that depends on (·) up to a polylog expression of a quantity at most polynomial in S, A, 1/(1−γ), 1/δ, where δ is the “failure probability”. Before proceeding, we first recall the following lemma from [GAMK13]:

Lemma 2 (Simulation Lemma for Optimal Value Function Estimate [GAMK13]). 
With probability at least 1 − δ, i.e., outside the failure event, for any starting distribution ρ it holds that:

(V⋆ − V̂⋆_k)(ρ) ≤ Σ_{(s,a)} ŵ^{π⋆,k,ρ}_{sa} ( (r − r̂_k)(s, a) + γ(p − p̂_k)(s, a)⊤ V⋆ ) ≤ Σ_{(s,a)} ŵ^{π⋆,k,ρ}_{sa} CI_{sa}(n^k_{sa})

(V⋆ − V̂⋆_k)(ρ) ≥ Σ_{(s,a)} ŵ^{π̂⋆_k,k,ρ}_{sa} ( (r − r̂_k)(s, a) + γ(p − p̂_k)(s, a)⊤ V⋆ ) ≥ −Σ_{(s,a)} ŵ^{π̂⋆_k,k,ρ}_{sa} CI_{sa}(n^k_{sa})

The CI_{sa}(n^k_{sa}) are Bernstein confidence intervals (defined in more detail in Appendix A) after n^k_{sa} samples over the rewards and transitions, and are a function of the unknown reward and transition variances. The proof (see appendix) is a slight variation of Lemma 3 in [GAMK13].

3 Sampling Strategy Given an Empirical MDP

We first describe how our approach will allocate samples to state–action pairs given a current empirical MDP, before presenting our full algorithm in the next section.

Lemma 2 suggests that to estimate the optimal value function it suffices to accurately estimate the (s, a) pairs in the trajectories identified by two policies, namely the optimal policy π⋆ (optimal on M) and the empirical optimal policy π̂⋆_k (optimal on M̂_k). Since π⋆ and π̂⋆_k are unknown (in particular, π̂⋆_k is a random variable prior to sampling), a common strategy is to allocate an identical number of samples uniformly [GAMK13; Sid+18] so that the confidence intervals CI_{sa}(n^k_{sa}) are sufficiently small for all state–action pairs, leading to a small |(V⋆ − V̂⋆_k)(ρ)|; from here it is possible to show that the empirical optimal policy π̂⋆_k can be extracted and that |(V⋆ − V^{π̂⋆_k})(ρ)| is also small (so π̂⋆_k is near-optimal). Therefore, in the main text we mostly focus on showing that |(V⋆ − V̂⋆_k)(ρ)| is small, and defer additional details to the appendix. The proposed approach is to proceed in iterations or episodes. In each episode our algorithm implicitly maintains a set of candidate policies, which are near-optimal, and allocates more samples to the (s, a) pairs visited by these policies to refine their estimated value. On the next episode those policies that are too suboptimal relative to their estimation accuracy are implicitly discarded. In particular, the samples are placed in a way that is related to the visit probabilities of the near-optimal empirical policies in addition to the variances of the rewards and transitions of state–action pairs encountered in potentially good policies.

3.1 Oracle Minimax Program

Suppose we have already allocated some samples and have computed the maximum likelihood MDP M̂_k with empirical optimal policy π̂⋆_k, and know that the optimal value function estimate is at least ε_k-accurate, i.e., ‖V⋆ − V̂⋆_k‖∞ ≤ ε_k. How should we allocate further sampling resources to improve the accuracy of the optimal value function estimate? The idea is given by the simulation lemma (Lemma 2): in order to see an improvement after sampling (i.e., in the next episode k + 1), the maximum likelihood MDP M̂_{k+1} must have smaller confidence intervals CI_{sa}(n^{k+1}_{sa}) in the (s, a) pairs visited by π⋆ and by the empirical optimal policy π̂⋆_{k+1} on M̂_{k+1}. Both are of course unknown. However, we introduce the constraint (V̂⋆_k − V̂^π_k)(ρ) ≤ Cε_k that restricts sampling to Cε_k-optimal policies (and starting distributions) on M̂_k. Here, C is a numerical constant that will ensure that π⋆ and π̂⋆_{k+1} satisfy this condition and are therefore allocated enough samples. Given C and ε_k, the idea is that we should choose a sampling strategy {n_{sa}}_{sa} with enough samples to ensure Σ_{(s,a)} ŵ^{π,k+1,ρ}_{sa} CI_{sa}(n^{k+1}_{sa}) ≤ ε_{k+1} for all policies that satisfy (V̂⋆_k − V̂^π_k)(ρ) ≤ Cε_k, so that Lemma 2 ensures |(V⋆ − V̂⋆_{k+1})(ρ)| ≤ ε_{k+1} = ε_k/2. This is equivalent to solving the following¹:

Definition 1 (Oracle Minimax Problem).

min_n max_{π,ρ} Σ_{(s,a)} ŵ^{π,k+1,ρ}_{sa} CI_{sa}(n^{k+1}_{sa})   s.t.   (V̂⋆_k − V̂^π_k)(ρ) ≤ Cε_k,   Σ_{(s,a)} n_{sa} ≤ n_max.   (1)

¹For space, we omit the constraints ρ_s ≥ 0 and ‖ρ‖₁ = 1 on the starting distribution.

Here the vector of the discounted sum of visit probabilities ŵ^{π,k+1,ρ} is computable from the linear system (I − γ(P̂^π_{k+1})⊤) ŵ^{π,k+1,ρ} = ρ, and n_max is a guess on the number of samples needed to ensure that the objective function is ≤ ε_k/2. 
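To make the role of this linear system concrete, here is a minimal numpy sketch (our own illustration, not the paper's code) that computes the discounted visit distribution of a fixed policy and checks it against direct policy evaluation via the identity V^π(ρ) = (w^{π,ρ})⊤ r_π:

```python
import numpy as np

def visit_distribution(P_pi, rho, gamma):
    """Solve (I - gamma * P_pi^T) w = rho for the discounted visit distribution w.

    P_pi[s, s'] is the state-transition matrix under a fixed policy pi;
    w[s] = sum_t gamma^t Pr(s_t = s | s_0 ~ rho).
    """
    S = P_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

def policy_value(P_pi, r_pi, rho, gamma):
    """V^pi(rho) from the Bellman evaluation equations (I - gamma * P_pi) v = r_pi."""
    S = P_pi.shape[0]
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(rho @ v)
```

Since w⊤ r_π = ρ⊤ (I − γ P_π)^{-1} r_π, the two computations agree; this per-(s, a) decomposition of a policy's value is exactly what the minimax programs above optimize over.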
We call this problem the oracle minimax problem because it uses the next-episode empirical visit probabilities ŵ^{π,k+1,ρ}, which are not known. In addition, it uses the true variance of the next-state value function (embedded in the definition of the confidence intervals CI_{sa}(n^k_{sa})). As these quantities are unknown in episode k, the program cannot be solved.

3.2 Algorithm Minimax Program

This section shows how to construct a minimax program that is ‘close’ enough to the oracle minimax problem (Equation 1) but is a function only of empirical quantities computable from M̂_k. The idea is 1) to avoid using the next-episode empirical distribution ŵ^{π,k+1,ρ} and instead use the currently-computable ŵ^{π,k,ρ}, and 2) to use the empirical variance of the next-state value function Var_{p̂_k(s,a)} V̂⋆_k instead of the real, unknown variance Var_{p(s,a)} V⋆. Estimating the visit distribution ŵ^{π,k+1,ρ} accurately leads to a high sample complexity; fortunately we are able to claim that the product between the visit distribution shift ŵ^{π,k+1,ρ} − ŵ^{π,k,ρ} and the confidence interval vector CI^{k+1} on the rewards and transitions is already small if policy π has received enough samples along its trajectories before the current episode. Let us rewrite the objective function of Equation 1 as a function of the visit distribution on M̂_k plus a term that takes into account the shift in the distribution from M̂_k to M̂_{k+1}:

Σ_{(s,a)} ŵ^{π,k+1,ρ}_{sa} CI_{sa}(n^{k+1}_{sa}) = Σ_{(s,a)} ŵ^{π,k,ρ}_{sa} CI_{sa}(n^{k+1}_{sa}) [computable] + Σ_{(s,a)} (ŵ^{π,k+1,ρ}_{sa} − ŵ^{π,k,ρ}_{sa}) CI_{sa}(n^{k+1}_{sa}) [shift of empirical distributions]

Lemma 9 in the appendix allows us to claim that the rightmost summation above is less than 2C_p(n_min)ε_k for both² π = π⋆ and π̂⋆_{k+1}. Here C_p(n_min) is defined in Appendix A and can be made (see Lemma 16), for example, < 1/100 by allocating a small constant number of samples Õ(S/(1 − γ)²) to each (s, a) pair³, independent of the desired accuracy ε_{k+1}. This way we can ensure that we can use ŵ^{π,k,ρ} instead of ŵ^{π,k+1,ρ}, plus a small correction term ≪ ε_k.

Now the only quantities that are not known by the algorithm are the variances of the transitions and rewards that appear in the confidence intervals CI_{sa}(n^{k+1}_{sa}). Precisely, to estimate the variance of the transitions Var_{p(s,a)} V⋆ in the (s, a) pair, we would need to know both the transition probability p(s, a) and the true value function V⋆, both of which are unknown. Fortunately it is possible to use the empirical transitions p̂_k(s, a) and the empirical value function V̂⋆_k plus a fast-shrinking (as a function of the number of samples) correction term. Since this analysis was similarly performed in prior papers for this setting [GAMK13; Sid+18], we defer its discussion to Lemma 11 in the appendix. With these corrections (B^k_{sa}, defined in Appendix A, is the variance correction and 2ε_k/625 accounts for the distribution shift) we can write the following minimax problem:

Definition 2 (Algorithm Minimax Problem).

min_n max_{π,ρ} Σ_{(s,a)} ŵ^{π,k,ρ}_{sa} ( ĈI_{sa}(n^{k+1}_{sa}) + B^k_{sa} ) + 2ε_k/625   s.t.:   (V̂⋆_k − V̂^π_k)(ρ) ≤ Cε_k;   Σ_{(s,a)} n_{sa} ≤ n_max;   (I − γ(P̂^π_k)⊤) ŵ^{π,k,ρ} = ρ.   (2)

Here ĈI_{sa}(n^{k+1}_{sa}) are the confidence intervals evaluated with the empirical variances defined in Appendix A. 
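The exact form of ĈI_sa is deferred to the paper's Appendix A; as a generic stand-in with the same qualitative shape (a variance term decaying as 1/√n plus a range term decaying as 1/n), one can use the empirical Bernstein bound of Maurer and Pontil. The constants below follow that bound, not the paper's:

```python
import math

def empirical_bernstein_ci(sample_var, value_range, n, delta):
    """Empirical Bernstein confidence radius for the mean of n i.i.d. samples
    taking values in an interval of width value_range, valid with prob. 1 - delta."""
    return (math.sqrt(2.0 * sample_var * math.log(2.0 / delta) / n)
            + 7.0 * value_range * math.log(2.0 / delta) / (3.0 * (n - 1)))
```

The feature that matters for the analysis is that once n is roughly Var/Δ² + range/Δ, this radius drops below Δ, which is the mechanism behind the gap-confidence-interval lemma of Section 5.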
This program is fully expressed in terms of empirical quantities that depend on M̂_k. As long as a solution to the above minimax program is ≤ ε_k/2, the oracle objective function will also be ≤ ε_k/2 at the solution of the program (for more details see Lemma 6 in the appendix). In other words, by solving the minimax program (Definition 2) we allocate enough samples to satisfy the oracle program (Definition 1), which ensures accuracy in the value function estimate through Lemma 2.

²Lemma 9 bounds this term as 2C_p(n_min)ε^π_k for π = π⋆ and π = π̂⋆_{k+1}, respectively; ε^π_k is defined in Appendix A and represents the “accuracy” of policy π in episode k. To ensure ε^π_k ≤ ε_k we need an inductive argument, which is sketched out in the main theorem (Theorem 1).

³As we will shortly see, this will contribute only a constant term to the final sample complexity.

4 Algorithm

We now take the sampling approach described in the previous section and use it to construct an iterative algorithm for quickly learning a near-optimal or optimal policy given access to a generative model. Specifically, we present BESt POlicy identification with no Knowledge of the Environment (BESPOKE) in Algorithm 1. The algorithm proceeds in episodes. Each episode starts with an empirical MDP M̂_k whose optimal value function V̂⋆_k is ε_k-accurate, ‖V⋆ − V̂⋆_k‖∞ ≤ ε_k, under an inductive assumption.

Algorithm 1 BESPOKE
Input: failure probability δ > 0, accuracy ε_Input > 0
Set ε_1 = 1/(1 − γ) and allocate n_min samples to each (s, a) pair
for k = 1, 2, . . .
    for n_max = 2⁰, 2¹, 2², . . .
        Solve the optimization program of Definition 7 (appendix)
        if the optimal value of the program of Definition 7 is ≤ ε_k/2
            Break and return sampling strategy {n^{k+1}_{sa}}_{sa}
    Query the generative model up to n^{k+1}_{sa} samples, ∀(s, a)
    Compute π̂⋆_{k+1} and V̂⋆_{k+1}
    Set ε_{k+1} = ε_k/2
    if ε_{k+1} ≤ ε_Input
        Break and return the policy π̂⋆_{k+1}

The samples are allocated at each episode k by solving an optimization program equivalent to that in Definition 2 so as to halve the accuracy of the value function estimate, i.e., ‖V⋆ − V̂⋆_{k+1}‖∞ ≤ ε_{k+1} = ε_k/2. In the innermost loop of the algorithm the required number of samples for the next episode is guessed as n_max = 1, 2, 4, 8, . . . , until n_max is sufficient to ensure that the objective function of the minimax problem of Definition 2 will be ≤ ε_k/2; the purpose of the inner loop is to avoid drawing more samples than needed, and it allows us to obtain the sample complexity result of Theorem 2. In Appendix G we reformulate the optimization program 2 (described more precisely in Definition 5 in the appendix), obtaining a convex minimization program that avoids optimizing over the policy and instead works directly with the distribution ŵ^{π,k,ρ}; this can be efficiently solved with standard techniques from convex optimization [BV04].

Theorem 1 (BESPOKE Works as Intended). With probability at least 1 − δ, in every episode k BESPOKE maintains an empirical MDP M̂_k such that its optimal value function V̂⋆_k and its optimal policy π̂⋆_k satisfy:

‖V⋆ − V̂⋆_k‖∞ ≤ ε_k,   ‖V⋆ − V^{π̂⋆_k}‖∞ ≤ 2ε_k,   where ε_{k+1} := ε_k/2, ∀k. 
In particular, when BESPOKE terminates in episode k_Final it holds that ε_Input/2 ≤ ε_{k_Final} ≤ ε_Input.

The proof is reported in the appendix, and shows by induction that for every episode k, π⋆ and π̂⋆_{k+1} are in the set of ‘candidate’ policies because they are near-optimal on M̂_k, satisfying (V̂⋆_k − V̂^{π⋆}_k)(ρ) ≤ Cε_k and (V̂⋆_k − V̂^{π̂⋆_{k+1}}_k)(ρ) ≤ Cε_k for all ρ, and are therefore allocated enough samples; this guarantees accuracy in V̂⋆_{k+1} by Lemma 2.

5 Sample Complexity Analysis

To analyze the sample complexity of BESPOKE we derive another optimization program that is a function of only problem-dependent quantities. We 1) shift from the empirical visit distribution ŵ^{π,k,ρ} on M̂_k to the “real” visit distribution w^{π,ρ} on M; 2) move from empirical confidence intervals to those evaluated with the true variances; and finally 3) relax the near-optimality constraint on the policies by using Lemma 7 in the appendix (for an appropriate numerical constant C⋆ > C) in order to be able to use the value functions on M:

(V̂⋆_k − V̂^π_k)(ρ) ≤ Cε_k → (V⋆ − V^π)(ρ) ≤ C⋆ε_k, ∀ρ.   (3)

In the end, we have enlarged the feasible set of the algorithm minimax problem while upper bounding its objective function, obtaining:⁴

Definition 3 (⋆-Minimax Program).

min_n max_{w^{π,ρ}} Σ_{(s,a)} w^{π,ρ}_{sa} ( CI_{sa}(n^{k+1}_{sa}) + 2B^k_{sa} ) + ε_k/25   (4)

subject to the constraints (r ∈ R^{SA} is the reward vector):

(V⋆ − V^π)(ρ) = V⋆(ρ) − (w^{π,ρ})⊤ r ≤ C⋆ε_k;   Σ_{(s,a)} n_{sa} ≤ n_max;   (I − γ(P^π)⊤) w^{π,ρ} = ρ.   (5)

⁴The relaxed optimization program is over the distribution induced by the policy. Here, P^π is the transition matrix identified by the policy π on M.

This is made rigorous in Lemma 6, but essentially we have obtained a minimax program whose solution can be studied in terms of problem-dependent quantities; in particular, its solution in terms of the number of samples n_{sa} upper bounds the sample complexity of the algorithm in every episode.

Problem Dependent Analysis  Due to space constraints, here we sketch the sample complexity analysis of suboptimal actions to make the gaps Δ_{sa} := V⋆(s) − Q⋆(s, a) appear while simultaneously eliminating the horizon dependence. We recall the following (e.g., Lemma 5.2.1 in [Kak+03]; see also our appendix):

Lemma 1 (Sum of Losses). It holds that:

(V⋆ − V^π)(ρ) = Σ_{(s,a)} w^{π,ρ}_{sa} ( Q⋆(s, π⋆(s)) − Q⋆(s, a) ) = Σ_{(s,a)} w^{π,ρ}_{sa} Δ_{sa}

Lemma 1 expresses the value of a suboptimal policy as a sum of per-step losses Δ_{sa} weighted by the discounted sum of probabilities of being in that (s, a) pair. The key step that enables us to obtain strong problem-dependent bounds and to remove the horizon dependence for suboptimal actions is synthesized in the following short lemma, where we ignore the term (Σ_{(s,a)} 2w^{π,ρ}_{sa} B^k_{sa} + 3ε_k/625).

Lemma 1 (Gap-Confidence Interval Lemma). If (π, ρ) satisfies (V⋆ − V^π)(ρ) ≤ C⋆ε_k, then a sample complexity of

n_{sa} = Õ( Var R(s, a)/Δ²_{sa} [reward estimation] + γ² Var_{p(s,a)} V⋆/Δ²_{sa} [transition estimation] + 1/Δ_{sa} + γ/((1 − γ)Δ_{sa}) ),   ∀(s, a)   (6)

suffices to ensure

max_{w^{π,ρ}} Σ_{(s,a)} w^{π,ρ}_{sa} CI_{sa}(n^{k+1}_{sa}) ≤ ε_k/2.   (7)

Proof. A direct computation shows that if n^{k+1}_{sa} satisfies Equation 6 with appropriate constants⁵ then:

CI_{sa}(n^{k+1}_{sa}) ≤ Δ_{sa}/(2C⋆).   (8)

This justifies the first inequality below:

Σ_{(s,a)} w^{π,ρ}_{sa} CI_{sa}(n^{k+1}_{sa}) ≤ (1/(2C⋆)) Σ_{(s,a)} w^{π,ρ}_{sa} Δ_{sa} = (1/(2C⋆)) (V⋆ − V^π)(ρ) ≤ ε_k/2.   (9)

The equality follows from Lemma 1 and the last inequality from the constraint on the optimality of π.

⁵Note that, in particular, C⋆ is a constant.

The key idea is that having confidence intervals of the same size as the gaps is sufficient to estimate the policy as accurately as its suboptimality gap (V⋆ − V^π)(ρ), regardless of the horizon. By augmenting this argument with the law of total variance [GAMK13], splitting into further subcases, and by taking into account the correction terms, we obtain:

Theorem 2 (Sample Complexity of the Algorithm BESPOKE). 
With probability at least 1 − δ, the total sample complexity of BESPOKE up to episode k is upper bounded by Σ_{(s,a)} n_{sa}, where n_{sa} is the total number of samples allocated to the (s, a) pair:

n_{sa} = Õ( min{ 1/((1 − γ)³(ε_k)²),  (Var R(s, a) + γ² Var_{p(s,a)} V⋆)/((1 − γ)²(ε_k)²) + 1/((1 − γ)²ε_k),  (Var R(s, a) + γ² Var_{p(s,a)} V⋆)/Δ²_{s,a} + 1/((1 − γ)Δ_{s,a}) } + γS/(1 − γ)² ).

Notice that BESPOKE would suffer a worst-case sample complexity similar to [GAMK13; Sid+18] only in the initial phases of learning, i.e., whenever ε_k is much larger than the gaps.

6 Significance of the Bound

We motivate the importance of Theorem 2 by specializing the result in two noteworthy cases.

Sample Complexity to Identify the Best Policy and the Worst-Case Lower Bound  If the optimal policy is unique, define the minimum gap Δ_min = min_{s,a : a ≠ π⋆(s)} Δ_{sa}. To identify the optimal policy we must set ε_Input ≤ Δ_min, and the sample complexity of BESPOKE at termination becomes:

Õ( Σ_s min{ (Var R(s, π⋆(s)) + γ² Var_{p(s,π⋆(s))} V⋆)/((1 − γ)²Δ²_min) + 1/((1 − γ)²Δ_min),  1/((1 − γ)³Δ²_min) } [ESTIMATING π⋆] + Σ_{(s,a): a ≠ π⋆(s)} ( (Var R(s, a) + γ² Var_{p(s,a)} V⋆)/Δ²_{sa} + 1/((1 − γ)Δ_{sa}) ) [RULING-OUT SUBOPTIMAL ACTIONS] + γS²A/(1 − γ)² [CONSTANT] )   (10)

One of our core contributions is that we suffer a dependence on the horizon 1/(1 − γ) only in estimating the optimal (s, a) pairs (the first summation over the state space). 
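Ignoring logarithmic factors and setting all absolute constants to 1 (our own simplification, not the paper's precise statement), the per-pair allocation of Theorem 2 can be tabulated numerically; the sketch below shows how a suboptimal pair with a sizable gap needs far fewer samples than the worst-case 1/((1 − γ)³ε²) allocation:

```python
def n_sa(var_r, var_v, gap, eps, gamma, S):
    """Per-pair sample count of Theorem 2 with logs and constants dropped."""
    h = 1.0 / (1.0 - gamma)                      # horizon scale 1/(1 - gamma)
    worst_case = h**3 / eps**2                   # horizon-cubed worst-case term
    variance_aware = (var_r + gamma**2 * var_v) / ((1.0 - gamma)**2 * eps**2) + h**2 / eps
    gap_based = (var_r + gamma**2 * var_v) / gap**2 + h / gap  # horizon-light gap term
    return min(worst_case, variance_aware, gap_based) + gamma * S / (1.0 - gamma)**2
```

For instance with γ = 0.99 and ε = 0.01, a pair with gap 0.5 and small variance is dominated by the gap-based term plus the additive constant, orders of magnitude below the 10¹⁰-sample worst case.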
The summation over suboptimal (s, a) is independent of the horizon, although the horizon implicitly affects the scaling of the variance Var_{p(s,a)} V⋆ and explicitly the maximum value function range (the term 1/((1 − γ)Δ_{sa})). It is important to compare the above result with the established lower bound [GAMK13], which is Ω(N/((1 − γ)³ε²)) to obtain an ε-accurate policy, where N is the number of state-action pairs. Since Δ_{sa} = Δ_min for all (s, a) with a ≠ π⋆(s) in the lower bound construction, and the variance is maximal, Var_{p(s,a)} V⋆ ≤ 1/(1 − γ)², we are able to identify the optimal policy in Õ( S/((1 − γ)³Δ²_min) + S(A − 1)/((1 − γ)²Δ²_min) + S²A/(1 − γ)² ) samples, which improves⁶ on the worst-case bound Õ( SA/((1 − γ)³Δ²_min) + S²A/(1 − γ)² ) of [GAMK13; Sid+18] by a full horizon factor for suboptimal actions. While our result can be surprising at first, it does not contradict the lower bound: the lower bound makes no attempt to distinguish between optimal and suboptimal actions, as it is only expressed in terms of the total number of (s, a) pairs N, and the construction uses a number of (s, a) pairs that is a constant multiple of the state space cardinality, i.e., one could only deduce Ω( S/((1 − γ)³Δ²_min) ) as a lower bound. Our result, therefore, does not violate the lower bound, but rather it shows that while we must suffer an unavoidable worst-case 1/(1 − γ)³ factor on the state space corresponding to the optimal (s, a) pairs, the dependence on the planning horizon is absent for suboptimal (s, a) except for the scaling of the value function implicit in the variance. Surprisingly, excluding the constant term S²A/(1 − γ)², suboptimal (s, a) pairs get a combined number of samples Õ( Σ_{(s,a): a ≠ π⋆(s)} ( (Var R(s, a) + γ² Var_{p(s,a)} V⋆)/Δ²_{sa} + 1/((1 − γ)Δ_{sa}) ) ), which is the sample complexity (ignoring log and constant factors) that a variance-aware bandit algorithm for best arm identification would need (see e.g., [GGL12], Appendix B) to ‘reject’ these suboptimal arms, provided that it can obtain samples⁷ of the random variable R(s, a) + γV⋆(s′), s′ ∼ p(s, a). In this case, however, the V⋆ vector would need to be known to the bandit algorithm. In other words, the sample complexity of BESPOKE at termination consists of two main terms: a leading-order term with a dependence on the state space and an unavoidable (due to the lower bound) dependence on the horizon 1/(1 − γ), and a horizon-free bandit-like sample complexity to rule out suboptimal actions, as if the optimal value function V⋆ were known.

⁶The paper [Sid+18] has the same bound as [GAMK13] but avoids the constant term S²A/(1 − γ)².

BESPOKE applied to Bandits  Finally, if γ = 0 we are in the bandit setting, and the sample complexity of BESPOKE at step k becomes exactly (since Var R(s, a) ≤ 1):

Õ( Σ_{(s,a)} ( Var R(s, a)/max{ε²_k, Δ²_{sa}} + 1/max{ε_k, Δ_{sa}} ) ) ≤ Õ( Σ_{(s,a)} 1/max{ε²_k, Δ²_{sa}} )   (11)

This matches the best-known sample complexity bound for best arm identification for tabular bandits with gaps and variances [ABM10; GGL12] except for constants and log terms. 
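For intuition on the γ = 0 comparison, a textbook successive-elimination routine for best-arm identification (a standard technique, not the paper's algorithm, and using Hoeffding rather than variance-aware radii for brevity) allocates roughly 1/Δ²_a samples to each suboptimal arm a before rejecting it:

```python
import math
import numpy as np

def successive_elimination(means, delta=0.05, batch=100, max_rounds=50, seed=0):
    """Sample all surviving Bernoulli arms in batches; drop an arm once its
    upper confidence bound falls below the best lower confidence bound."""
    rng = np.random.default_rng(seed)
    K = len(means)
    alive = list(range(K))
    counts = np.zeros(K)
    sums = np.zeros(K)
    for _ in range(max_rounds):
        for a in alive:
            sums[a] += (rng.random(batch) < means[a]).sum()
            counts[a] += batch
        radius = {a: math.sqrt(math.log(2 * K * max_rounds / delta) / (2 * counts[a]))
                  for a in alive}
        mu = {a: sums[a] / counts[a] for a in alive}
        best_lcb = max(mu[a] - radius[a] for a in alive)
        alive = [a for a in alive if mu[a] + radius[a] >= best_lcb]
        if len(alive) == 1:
            break
    return max(alive, key=lambda a: sums[a] / counts[a])
```

An arm with gap Δ survives only while the radius exceeds roughly Δ/2, i.e., for about 1/Δ² samples; BESPOKE achieves the analogous per-pair behavior for suboptimal actions of an MDP, without being given V⋆.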
This is encouraging, as it suggests it may be possible to have algorithms with a smooth transition in sample complexity, as a function of the discount factor, when moving from a bandit to an RL setting.

7 Related Literature and Conclusion

Related Literature   In the more challenging setting of online exploration (i.e., without a generative model), the PAC literature [DB15; DLB17; LH14; SLL09] directly provides algorithms to identify an ǫ-optimal policy with high probability in the worst case. Gap-aware analyses exist; see for example [BK97; TB08; OPT18] for asymptotic regret bounds on ergodic MDPs with matching upper and lower bounds and with an emphasis on the minimum gap; since these analyses look at the asymptotic regret, they are not comparable to the proposal here. Very recently, [SJ19] presented a gap-based non-asymptotic regret bound for episodic MDPs, though not yet free of the horizon and of dependencies on Δ_min. Gaps in MDPs have also been used to justify the observed relation between the value function accuracy and the resulting policy performance [FSM10]. In addition, [EMM06; Bru10] also propose algorithms and PAC bounds that depend on the minimum gap, but the results do not leverage recent advances in tighter sample complexity analysis. [JOA10] presents a regret bound based on the same quantity. The maximum variance of the next-state optimal value function is discussed in [MMM14; ZB19].

The closest related work in the PAC setting similarly assumes access to a generative model, and provides near-matching worst-case sample complexity upper and lower bounds [AMK12] for tabular MDPs, even in terms of computational complexity [Sid+18]. However, this work focuses on near-optimal worst-case performance: as these algorithms allocate samples uniformly, they do not adapt to the problem structure.
Finally, [Aga+19] show how to improve on the constant sample complexity term for model-based approaches like the one we use here; it is possible that their techniques can be applied to our setting.

Conclusion   This work leverages domain structure, notably the action-value function gaps, to eliminate the impact of the horizon when ruling out suboptimal actions to identify a near-optimal policy for discounted-reward Markov decision processes using a generative model, except for a constant term and the inherent value function scaling. This is achieved through a tractable algorithm. In doing so, our finite-time sample complexity analysis quantifies the sample complexity contribution of each state-action pair as a function of the action-value function gaps and the variances of the rewards and next-state value function, and recovers the best-known bounds (except for logs and constants) when deployed on bandit instances using these quantities.

Our work provides at least two important analytical tools: 1) the way we relate the suboptimality of the policies with the gaps to reduce the dependence on the horizon is new, and could be used in

⁷Here, Var R(s, a) + γ² Var_{p(s,a)} V⋆ is the variance of the random variable R(s, a) + γV⋆(s′) with s′ ∼ p(s, a).
Note the scaling of this random variable, which has range 1/(1 − γ).

other settings to make the gap appear while simultaneously reducing the horizon dependence; 2) the way we analyze the visit distribution shift induced by the policies, weighted by the local reward and transition confidence intervals, and show it is small, is another analytical contribution of our work, which can be extended to settings where one is interested in obtaining a good policy from a given starting distribution ρ as opposed to all starting states.

Acknowledgment

This work is partially supported by a Total Innovation Fellowship program, an NSF CAREER award and an Office of Naval Research Young Investigator Award. The authors are grateful to the reviewers for the high-quality reviews and suggestions.

References

[ABF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time Analysis of the Multiarmed Bandit Problem". In: Machine Learning (2002).

[ABM10] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. "Best Arm Identification in Multi-Armed Bandits". In: Conference on Learning Theory (COLT). 2010.

[Aga+19] Alekh Agarwal et al. "Optimality and approximation with policy gradient methods in Markov decision processes". In: arXiv preprint arXiv:1908.00261 (2019).

[AMK12] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. "On the Sample Complexity of Reinforcement Learning with a Generative Model". In: International Conference on Machine Learning (ICML). 2012.

[AMS09] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. "Exploration-exploitation trade-off using variance estimates in multi-armed bandits". In: Theoretical Computer Science (2009).

[AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. "Minimax Regret Bounds for Reinforcement Learning". In: International Conference on Machine Learning (ICML). 2017.

[BK97] Apostolos N. Burnetas and Michael N. Katehakis. "Optimal adaptive policies for Markov decision processes". In: Mathematics of Operations Research 22.1 (1997), pp. 222–255.

[BMS09] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. "Pure exploration in multi-armed bandits problems". In: International Conference on Algorithmic Learning Theory. 2009.

[Bru10] Emma Brunskill. "When Policies Can Be Trusted: Analyzing a Criteria to Identify Optimal Policies in MDPs with Unknown Model Parameters". In: International Conference on Automated Planning and Scheduling (ICAPS). 2010, pp. 218–221.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[DB15] Christoph Dann and Emma Brunskill. "Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning". In: Advances in Neural Information Processing Systems (NIPS). 2015.

[DLB17] Christoph Dann, Tor Lattimore, and Emma Brunskill. "Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning". In: Advances in Neural Information Processing Systems (NIPS). 2017.

[EMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems". In: Journal of Machine Learning Research (2006).

[FSM10] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. "Error propagation for approximate policy and value iteration". In: Advances in Neural Information Processing Systems (NIPS). 2010.

[GAMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. "Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model". In: Machine Learning 91.3 (June 2013), pp. 325–349. DOI: 10.1007/s10994-013-5368-1.
URL: https://doi.org/10.1007/s10994-013-5368-1.

[GGL12] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. "Best arm identification: A unified approach to fixed budget and fixed confidence". In: Advances in Neural Information Processing Systems (NIPS). 2012, pp. 3212–3220.

[Jam+14] Kevin Jamieson et al. "lil'ucb: An optimal exploration algorithm for multi-armed bandits". In: Conference on Learning Theory (COLT). 2014, pp. 423–439.

[JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. "Near-optimal Regret Bounds for Reinforcement Learning". In: Journal of Machine Learning Research (2010).

[Kak+03] Sham Machandranath Kakade. "On the sample complexity of reinforcement learning". PhD thesis. University of London, England, 2003.

[KKS13] Zohar Karnin, Tomer Koren, and Oren Somekh. "Almost optimal exploration in multi-armed bandits". In: International Conference on Machine Learning (ICML). 2013.

[KMN02] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. "A sparse sampling algorithm for near-optimal planning in large Markov decision processes". In: Machine Learning 49.2-3 (2002), pp. 193–208.

[LH14] Tor Lattimore and Marcus Hutter. "Near-optimal PAC bounds for discounted MDPs". In: Theoretical Computer Science 558 (2014), pp. 125–143.

[MM94] Oded Maron and Andrew W. Moore. "Hoeffding races: Accelerating model selection search for classification and function approximation". In: Advances in Neural Information Processing Systems (NIPS). 1994, pp. 59–66.

[MMM14] Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. ""How hard is my MDP?" The distribution-norm to the rescue". In: Advances in Neural Information Processing Systems (NIPS). 2014.

[MP09] Andreas Maurer and Massimiliano Pontil. "Empirical Bernstein Bounds and Sample Variance Penalization". In: Conference on Learning Theory (COLT). 2009.

[MSA08] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. "Empirical Bernstein stopping". In: International Conference on Machine Learning (ICML). ACM, 2008, pp. 672–679.

[OPT18] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. "Exploration in Structured Reinforcement Learning". In: Advances in Neural Information Processing Systems (NeurIPS). 2018, pp. 8874–8882.

[OVR13] Ian Osband, Benjamin Van Roy, and Daniel Russo. "(More) Efficient Reinforcement Learning via Posterior Sampling". In: Advances in Neural Information Processing Systems (NIPS). 2013.

[SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[Sid+18] Aaron Sidford et al. "Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model". In: Advances in Neural Information Processing Systems (NIPS). 2018.

[SJ19] Max Simchowitz and Kevin Jamieson. "Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs". In: arXiv preprint arXiv:1905.03814 (2019).

[SLL09] Alexander L. Strehl, Lihong Li, and Michael L. Littman. "Reinforcement learning in finite MDPs: PAC analysis". In: Journal of Machine Learning Research 10.Nov (2009), pp. 2413–2444.

[TB08] Ambuj Tewari and Peter L. Bartlett. "Optimistic linear programming gives logarithmic regret for irreducible MDPs". In: Advances in Neural Information Processing Systems (NIPS). 2008, pp. 1505–1512.

[WBS07] Tao Wang, Michael Bowling, and Dale Schuurmans. "Dual representations for dynamic programming and reinforcement learning". In: 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. IEEE, 2007, pp. 44–51.

[Wei+03] Tsachy Weissman et al. Inequalities for the l1 deviation of the empirical distribution. Tech. rep. Hewlett-Packard Labs, 2003.

[ZB19] Andrea Zanette and Emma Brunskill. "Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds". In: International Conference on Machine Learning (ICML). 2019. URL: http://proceedings.mlr.press/v97/zanette19a.html.