{"title": "Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 12224, "page_last": 12234, "abstract": "State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies -- act by 1-step planning -- can achieve tight minimax performance in terms of regret, O(\\sqrt{HSAT}). Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to such that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.", "full_text": "Tight Regret Bounds for Model-Based Reinforcement\n\nLearning with Greedy Policies\n\nYonathan Efroni\u2217\nTechnion, Israel\n\nNadav Merlis \u2217\nTechnion, Israel\n\nMohammad Ghavamzadeh\n\nFacebook AI Research\n\nShie Mannor\nTechnion, Israel\n\nAbstract\n\nState-of-the-art ef\ufb01cient model-based Reinforcement Learning (RL) algorithms typ-\nically act by iteratively solving empirical models, i.e., by performing full-planning\non Markov Decision Processes (MDPs) built by the gathered experience. 
In this paper, we focus on model-based RL in the finite-state finite-horizon undiscounted MDP setting and establish that exploring with greedy policies – acting by 1-step planning – can achieve tight minimax performance in terms of regret, Õ(√(HSAT)). Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

1 Introduction

Reinforcement learning (RL) [Sutton and Barto, 2018] is a field of machine learning that tackles the problem of learning how to act in an unknown dynamic environment. An agent interacts with the environment, and receives feedback on its actions in the form of a state-dependent reward signal. Using this experience, the agent's goal is then to find a policy that maximizes the long-term reward. There are two main approaches for learning such a policy: model-based and model-free. The model-based approach estimates the system's model and uses it to assess the long-term effects of actions via full-planning (e.g., Jaksch et al. 2010). Model-based RL algorithms usually enjoy good performance guarantees in terms of the regret – the difference between the sum of rewards gained by playing an optimal policy and the sum of rewards that the agent accumulates [Jaksch et al., 2010, Bartlett and Tewari, 2009]. Nevertheless, model-based algorithms suffer from high space and computational complexity. The former is caused by the need for storing a model.
The latter is due to the frequent full-planning, which requires a full solution of the estimated model. Alternatively, model-free RL algorithms directly estimate quantities that take into account the long-term effect of an action, thus avoiding model estimation and planning operations altogether [Jin et al., 2018]. These algorithms usually enjoy better computational and space complexity, but seem to have worse performance guarantees.

In many applications, the high computational complexity of model-based RL makes it infeasible. Thus, practical model-based approaches alleviate this computational burden by using short-term planning, e.g., Dyna [Sutton, 1991], instead of full-planning. To the best of our knowledge, there are no regret guarantees for such algorithms, even in the tabular setting. This raises the following question: Can a model-based approach coupled with short-term planning enjoy the favorable performance of model-based RL?

*Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Algorithm                           | Regret               | Time Complexity | Space Complexity
UCRL2² [Jaksch et al., 2010]        | Õ(√(H²S²AT))         | Õ(NSAH)         | Õ(HS + NSA)
UCBVI [Azar et al., 2017]           | Õ(√(HSAT) + √(H²T))  | Õ(NSAH)         | Õ(HS + NSA)
EULER [Zanette and Brunskill, 2019] | Õ(√(HSAT))           | Õ(NSAH)         | Õ(HS + NSA)
UCRL2-GP                            | Õ(√(H²S²AT))         | Õ(NAH)          | Õ(HS + NSA)
EULER-GP                            | Õ(√(HSAT))           | Õ(NAH)          | Õ(HS + NSA)
Q-v2 [Jin et al., 2018]             | Õ(√(H³SAT))          | Õ(AH)           | Õ(HSA)
Lower bounds                        | Ω(√(SAHT))           | –               | –

Table 1: Comparison of our bounds with several state-of-the-art bounds for RL in tabular finite-horizon MDPs.
The time complexity of the algorithms is per episode; S and A are the sizes of the state and action sets, respectively; H is the horizon of the MDP; T is the total number of samples that the algorithm gathers; N ≤ S is the maximum number of non-zero transition probabilities across the entire state-action pairs. The algorithms proposed in this paper are UCRL2-GP and EULER-GP.

In this work, we show that model-based algorithms that use 1-step planning can achieve the same performance as algorithms that perform full-planning, thus answering the above question affirmatively. To this end, we study Real-Time Dynamic Programming (RTDP) [Barto et al., 1995], which finds the optimal policy of a known model by acting greedily based on 1-step planning, and establish new and sharper finite-sample guarantees. We demonstrate how the new analysis of RTDP can be incorporated into two model-based RL algorithms, and prove that the regret of the resulting algorithms remains unchanged, while their computational complexity drastically decreases. As Table 1 shows, this reduces the computational complexity of model-based RL methods by a factor of S.

The contributions of our paper are as follows: we first prove regret bounds for RTDP when the model is known. To do so, we establish concentration results on Decreasing Bounded Processes, which are of independent interest. We then show that the regret bound translates into a Uniform Probably Approximately Correct (PAC) [Dann et al., 2017] bound for RTDP that greatly improves existing PAC results [Strehl et al., 2006]. Next, we move to the learning problem, where the model is unknown. Based on the analysis developed for RTDP, we adapt UCRL2 [Jaksch et al., 2010] and EULER [Zanette and Brunskill, 2019], both of which act by full-planning, to UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP): model-based algorithms that act by 1-step planning.
The adapted versions are shown to preserve the performance guarantees, while improving in terms of computational complexity.

2 Notations and Definitions

We consider finite-horizon MDPs with time-independent dynamics [Bertsekas and Tsitsiklis, 1996]. A finite-horizon MDP is defined by the tuple M = (S, A, R, p, H), where S and A are the state and action spaces with cardinalities S and A, respectively. The immediate reward for taking an action a at state s is a random variable R(s, a) ∈ [0, 1] with expectation ER(s, a) = r(s, a). The transition probability is p(s' | s, a), the probability of transitioning to state s' upon taking action a at state s. Furthermore, N := max_{s,a} |{s' : p(s' | s, a) > 0}| is the maximum number of non-zero transition probabilities across the entire state-action pairs. If this number is unknown to the designer of the algorithm in advance, then we set N = S. The initial state in each episode is arbitrarily chosen, and H ∈ N is the horizon, i.e., the number of time-steps in each episode. We define [N] := {1, . . . , N} for all N ∈ N, and throughout the paper use t ∈ [H] and k ∈ [K] to denote the time-step inside an episode and the index of an episode, respectively.

A deterministic policy π : S × [H] → A is a mapping from states and time-step indices to actions. We denote by a_t := π(s_t, t) the action taken at time t at state s_t according to a policy π. The quality

²Similarly to previous work in the finite-horizon setting, we state the regret in terms of the horizon H.
The regret in the infinite-horizon setting is DS√(AT), where D is the diameter of the MDP.

of a policy π from state s at time t is measured by its value function, which is defined as

    V^π_t(s) := E[ Σ_{t'=t}^H r(s_{t'}, π(s_{t'}, t')) | s_t = s ],

where the expectation is over the environment's randomness. An optimal policy maximizes this value for all states s and time-steps t, and the corresponding optimal value is denoted by V*_t(s) := max_π V^π_t(s), for all t ∈ [H]. The optimal value satisfies the optimal Bellman equation, i.e.,

    V*_t(s) = T*V*_{t+1}(s) := max_a { r(s, a) + p(· | s, a)^T V*_{t+1} }.    (1)

We consider an agent that repeatedly interacts with an MDP in a sequence of episodes [K]. The performance of the agent is measured by its regret, defined as Regret(K) := Σ_{k=1}^K ( V*_1(s^k_1) − V^{π_k}_1(s^k_1) ).

Throughout this work, the policy π_k is computed by a 1-step planning operation with respect to the value function estimated by the algorithm at the end of episode k − 1, denoted by V̄^{k−1}. We also call such a policy a greedy policy. Moreover, s^k_t and a^k_t stand, respectively, for the state and the action taken at the t-th time-step of the k-th episode.

Next, we define the filtration F_k that includes all events (states, actions, and rewards) until the end of the k-th episode, as well as the initial state of episode k + 1. We denote by T = KH the total number of time-steps (samples). Moreover, we denote by n_k(s, a) the number of times that the agent has visited the state-action pair (s, a), and by X̂_k the empirical average of a random variable X. Both quantities are based on the experience gathered until the end of the k-th episode and are F_k measurable.
We also define the probability to visit the state-action pair (s, a) at the k-th episode at time-step t by w_{tk}(s, a) = Pr(s^k_t = s, a^k_t = a | F_{k−1}). We note that π_k is F_{k−1} measurable, and thus, w_{tk}(s, a) = Pr(s^k_t = s, a^k_t = a | s^k_1, π_k). We also denote w_k(s, a) = Σ_{t=1}^H w_{tk}(s, a).

We use Õ(X) to refer to a quantity that depends on X up to a poly-log expression of a quantity at most polynomial in S, A, T, K, H, and 1/δ. Similarly, ≲ represents ≤ up to numerical constants or poly-log factors. We define ||X||_{2,p} := √(E_p X²), where p is a probability distribution over the domain of X, and use X ∨ Y := max{X, Y}. Lastly, P(S) is the set of probability distributions over the state space S.

3 Real-Time Dynamic Programming

Algorithm 1 Real-Time Dynamic Programming
  Initialize: ∀s ∈ S, ∀t ∈ [H], V̄^0_t(s) = H − (t − 1).
  for k = 1, 2, . . . do
    Initialize s^k_1
    for t = 1, . . . , H do
      a^k_t ∈ argmax_a r(s^k_t, a) + p(· | s^k_t, a)^T V̄^{k−1}_{t+1}
      V̄^k_t(s^k_t) = r(s^k_t, a^k_t) + p(· | s^k_t, a^k_t)^T V̄^{k−1}_{t+1}
      Act with a^k_t and observe s^k_{t+1}.
    end for
  end for

RTDP [Barto et al., 1995] is a well-known algorithm that solves an MDP when a model of the environment is given. Unlike, e.g., Value Iteration (VI) [Bertsekas and Tsitsiklis, 1996], which solves an MDP by offline calculations, RTDP solves an MDP in a real-time manner. As mentioned in Barto et al. [1995], RTDP can be interpreted as an asynchronous VI adjusted to a real-time algorithm. Algorithm 1 contains the pseudocode of RTDP for finite-horizon MDPs. The value function is initialized with an optimistic value, i.e., an upper bound of the optimal value.
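As one concrete (and unofficial) rendering, the loop of Algorithm 1 can be sketched in Python; the array layout, function name, and episode driver below are our own illustrative choices, not the paper's:

```python
import numpy as np

def rtdp_episode(r, p, v_bar, s0, rng):
    """One episode of RTDP (Algorithm 1) on a known finite-horizon MDP.

    r: (S, A) expected rewards; p: (S, A, S) transition kernel;
    v_bar: (H + 1, S) optimistic value table with v_bar[H] == 0,
    updated in place; s0: initial state; rng: a numpy Generator.
    """
    H = v_bar.shape[0] - 1
    s = s0
    for t in range(H):
        # 1-step greedy planning: Q(s, a) = r(s, a) + p(. | s, a)^T v_bar[t + 1].
        # v_bar[t + 1] is read before step t + 1 overwrites it, so within an
        # episode the updates use the previous episode's values, as in Algorithm 1.
        q = r[s] + p[s] @ v_bar[t + 1]
        a = int(np.argmax(q))
        v_bar[t, s] = q[a]                          # update only the visited state
        s = int(rng.choice(p.shape[2], p=p[s, a]))  # environment transition
    return v_bar

# Optimistic initialization: v_bar[t] = H - t upper-bounds every value function.
```

On a small deterministic MDP, repeated calls drive V̄^k_1(s_1) down from its optimistic initialization toward V*_1(s_1) without ever crossing below it, in line with Lemma 1 below.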
At each time-step t and episode k, the agent acts from the current state s^k_t greedily with respect to the current value at the next time-step, V̄^{k−1}_{t+1}. It then updates the value of s^k_t according to the optimal Bellman operator. We denote by V̄ the value function, and as we show in the following, it always upper bounds V*. Note that since the action at a fixed state is chosen according to V̄^{k−1}, π_k is F_{k−1} measurable.

Since RTDP is an online algorithm, i.e., it updates its value estimates through interactions with the environment, it is natural to measure its performance in terms of the regret. The rest of this section is devoted to supplying expected and high-probability bounds on the regret of RTDP, which will also lead to PAC bounds for this algorithm. In Section 4, based on the observations from this section, we will establish minimax regret bounds for 1-step greedy model-based RL.

We start by stating two basic properties of RTDP in the following lemma: the value is always optimistic and decreases in k (see proof in Appendix B). Although the first property is known [Barto et al., 1995], to the best of our knowledge, the second one has not been proven in previous work.

Lemma 1. For all s, t, and k, it holds that (i) V*_t(s) ≤ V̄^{k−1}_t(s) and (ii) V̄^k_t(s) ≤ V̄^{k−1}_t(s).

The following lemma, which we believe is new, relates the difference between the optimistic value V̄^{k−1}_1(s^k_1) and the real value V^{π_k}_1(s^k_1) to the expected cumulative update of the value function at the end of the k-th episode (see proof in Appendix B).

Lemma 2 (Value Update for Exact Model).
The expected cumulative value update of RTDP at the k-th episode satisfies

    V̄^{k−1}_1(s^k_1) − V^{π_k}_1(s^k_1) = Σ_{t=1}^H E[ V̄^{k−1}_t(s^k_t) − V̄^k_t(s^k_t) | F_{k−1} ].

The result relates the difference between the optimistic value V̄^{k−1} and the value of the greedy policy V^{π_k} to the expected update along the trajectory created by following π_k. Thus, for example, if the optimistic value is overestimated, then the value update throughout this episode is expected to be large.

3.1 Regret and PAC Analysis

Using Lemma 1, we observe that the sequence of values is decreasing and bounded from below. Thus, intuitively, the decrements of the values cannot be indefinitely large. Importantly, Lemma 2 states that when the expected decrements of the values are small, then V^{π_k}_1(s^k_1) is close to V̄^{k−1}_1(s^k_1), and thus to V*, since V̄^{k−1}_1(s^k_1) ≥ V*_1(s^k_1) ≥ V^{π_k}_1(s^k_1).

Building on this reasoning, we are led to establish a general result on a decreasing process. This result will allow us to formally justify the aforementioned reasoning and derive regret bounds for RTDP. The proof utilizes self-normalized concentration bounds [de la Peña et al., 2007], applied on martingales, and can be found in Appendix A.

Definition 1 (Decreasing Bounded Process). We call a random process {X_k, F_k}_{k≥0}, where {F_k}_{k≥0} is a filtration and {X_k}_{k≥0} is adapted to this filtration, a Decreasing Bounded Process if it satisfies the following properties:

1. {X_k}_{k≥0} decreases, i.e., X_{k+1} ≤ X_k a.s.
2. X_0 = C ≥ 0, and for all k, X_k ≥ 0 a.s.

Theorem 3 (Regret Bound of a Decreasing Bounded Process). Let {X_k, F_k}_{k≥0} be a Decreasing Bounded Process and let R_K = Σ_{k=1}^K X_{k−1} − E[X_k | F_{k−1}] be its K-round regret.
Then,

    Pr{ ∃K > 0 : R_K ≥ C (1 + 2√(ln(2/δ)))² } ≤ δ.

Specifically, it holds that Pr{ ∃K > 0 : R_K ≥ 9C ln(3/δ) } ≤ δ.

We are now ready to prove the central result of this section, the expected and high-probability regret bounds on RTDP (see full proof in Appendix B).

Theorem 4 (Regret Bounds for RTDP). The following regret bounds hold for RTDP:

1. E[Regret(K)] ≤ SH².
2. For any δ > 0, with probability 1 − δ, for all K > 0, Regret(K) ≤ 9SH² ln(3SH/δ).

Proof Sketch. We give a sketch of the proof of the second claim. Applying Lemmas 1 and then 2,

    Regret(K) := Σ_{k=1}^K V*_1(s^k_1) − V^{π_k}_1(s^k_1)
               ≤ Σ_{k=1}^K V̄^{k−1}_1(s^k_1) − V^{π_k}_1(s^k_1)
               ≤ Σ_{k=1}^K Σ_{t=1}^H E[ V̄^{k−1}_t(s^k_t) − V̄^k_t(s^k_t) | F_{k−1} ].    (2)

We then establish (see Lemma 34) that the RHS of (2) is, in fact, a sum of SH Decreasing Bounded Processes, i.e.,

    (2) = Σ_{t=1}^H Σ_{s∈S} Σ_{k=1}^K V̄^{k−1}_t(s) − E[ V̄^k_t(s) | F_{k−1} ].    (3)

Since for any fixed s, t, {V̄^k_t(s)}_{k≥0} is a decreasing process by Lemma 1, we can use Theorem 3 for a fixed s, t, and conclude the proof by applying the union bound on all SH terms in (3).

Theorem 4 exhibits a regret bound that does not depend on T = KH. While it is expected that RTDP, which has access to the exact model, would achieve better performance than an RL algorithm with no such access, a regret bound independent of T is a noteworthy result. Indeed, it leads to the following Uniform PAC (see Dann et al. 2017 for the definition) and (0, δ) PAC guarantees for RTDP (see proofs in Appendix B).
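The deterministic core of this argument is telescoping: along any realization of a Decreasing Bounded Process, the realized decrements sum to X_0 − X_K ≤ C, independently of K. Theorem 3's additional work is the martingale concentration needed to pass from realized to conditionally expected decrements; the toy check below (an arbitrary decreasing sequence of our own choosing) illustrates only the telescoping part:

```python
# Any decreasing sequence C = X_0 >= X_1 >= ... >= X_K >= 0 has total
# realized decrement X_0 - X_K <= C, no matter how large K is.
C = 5.0
xs = [C, 4.2, 4.2, 3.0, 1.5, 1.5, 0.7, 0.0, 0.0]  # arbitrary decreasing path
decrements = [xs[k - 1] - xs[k] for k in range(1, len(xs))]

assert all(d >= 0 for d in decrements)                  # monotonicity
assert abs(sum(decrements) - (xs[0] - xs[-1])) < 1e-12  # telescoping sum
assert sum(decrements) <= C                             # bounded by C, not by K
```

This is why the regret of RTDP with a known model stays bounded as K grows: each of the SH per-(s, t) value sequences can only decrease by at most its initial optimistic gap.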
To the best of our knowledge, both are the first PAC guarantees for RTDP.³

Corollary 5 (RTDP is Uniform PAC). Let δ > 0 and let N_ε be the number of episodes in which RTDP outputs a policy with V*_1(s^k_1) − V^{π_k}_1(s^k_1) > ε. Then,

    Pr{ ∃ε > 0 : N_ε ≥ 9SH² ln(3SH/δ) / ε } ≤ δ.

Corollary 6 (RTDP is (0, δ) PAC). Let δ > 0 and let N be the number of episodes in which RTDP outputs a non-optimal policy. Define the (unknown) gap of the MDP, Δ(M) = min_s min_{π : V^π_1(s) ≠ V*_1(s)} V*_1(s) − V^π_1(s) > 0. Then,

    Pr{ N ≥ 9SH² ln(3SH/δ) / Δ(M) } ≤ δ.

4 Exploration in Model-based RL: Greedy Policy Achieves Minimax Regret

We start this section by formulating a general optimistic RL scheme that acts by 1-step planning (see Algorithm 2). Then, we establish Lemma 7, which generalizes Lemma 2 to the case where a non-exact model is used for the value updates. Using this lemma, we offer a novel regret decomposition for algorithms that follow Algorithm 2. Based on the decomposition, we analyze generalizations of UCRL2 [Jaksch et al., 2010] (for finite-horizon MDPs) and EULER [Zanette and Brunskill, 2019] that use greedy policies instead of solving an MDP (full planning) at the beginning of each episode. Surprisingly, we find that both generalized algorithms do not suffer from performance degradation, up to numerical constants and logarithmic factors. Thus, we conclude that there exists an RL algorithm that achieves the minimax regret bound while acting according to greedy policies.

Consider the general RL scheme that explores by greedy policies, depicted in Algorithm 2. The value V̄ is initialized optimistically, and the algorithm interacts with the unknown environment in an episodic manner.
At each time-step t, a greedy policy from the current state, s^k_t, is calculated optimistically based on the empirical model (r̂_{k−1}, p̂_{k−1}, n_{k−1}) and the current value at the next time-step, V̄^{k−1}_{t+1}. This is done in a subroutine called 'ModelBaseOptimisticQ'.⁴ We further assume the optimistic Q-function has the form Q̄(s^k_t, a) = r̃_{k−1}(s^k_t, a) + p̃_{k−1}(· | s^k_t, a)^T V̄^{k−1}_{t+1} and refer to

³Existing PAC results on RTDP analyze variations of RTDP in which ε is an input parameter of the algorithm.
⁴We also allow the subroutine to use O(S) internal memory for auxiliary calculations, which does not change the overall space complexity.

Algorithm 2 Model-based RL with Greedy Policies
 1: Initialize: ∀s ∈ S, ∀t ∈ [H], V̄^0_t(s) = H − (t − 1).
 2: for k = 1, 2, . . . do
 3:   Initialize s^k_1
 4:   for t = 1, . . . , H do
 5:     ∀a, Q̄(s^k_t, a) = ModelBaseOptimisticQ(r̂_{k−1}, p̂_{k−1}, n_{k−1}, V̄^{k−1}_{t+1})
 6:     a^k_t ∈ argmax_a Q̄(s^k_t, a)
 7:     V̄^k_t(s^k_t) = min{ V̄^{k−1}_t(s^k_t), Q̄(s^k_t, a^k_t) }
 8:     Act with a^k_t and observe s^k_{t+1}.
 9:   end for
10:   Update r̂_k, p̂_k, n_k with all experience gathered in the episode.
11: end for

(r̃_{k−1}, p̃_{k−1}) as the optimistic model. The agent interacts with the environment based on the greedy policy with respect to Q̄ and uses the gathered experience to update the empirical model at the end of the episode.

By construction of the update rule (see Line 7), the value is a decreasing function of k, for all (s, t) ∈ S × [H]. Thus, property (ii) in Lemma 1 holds for Algorithm 2.
Furthermore, the algorithms analyzed in this section will also be optimistic with high probability, i.e., property (i) in Lemma 1 also holds. Finally, since the value update uses the empirical quantities r̂_{k−1}, p̂_{k−1}, n_{k−1} and V̄^{k−1}_{t+1} from the previous episode, the policy π_k is still F_{k−1} measurable.

The following lemma generalizes Lemma 2 to the case where, unlike in RTDP, the update rule does not use the exact model (see proof in Appendix C).

Lemma 7 (Value Update for Optimistic Model). The expected cumulative value update of Algorithm 2 in the k-th episode is bounded by

    V̄^{k−1}_1(s^k_1) − V^{π_k}_1(s^k_1) ≤ Σ_{t=1}^H E[ V̄^{k−1}_t(s^k_t) − V̄^k_t(s^k_t) | F_{k−1} ]
        + Σ_{t=1}^H E[ (r̃_{k−1} − r)(s^k_t, a^k_t) + (p̃_{k−1} − p)(· | s^k_t, a^k_t)^T V̄^{k−1}_{t+1} | F_{k−1} ].

In the rest of the section, we consider two instantiations of the subroutine 'ModelBaseOptimisticQ' in Algorithm 2. We use the bonus terms of UCRL2 and of EULER to acquire an optimistic Q-function, Q̄.
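For the UCRL2-style instantiation, the key computational step is maximizing P'^T V̄^{k−1}_{t+1} over an L1 ball around the empirical transition distribution, which has the standard O(S log S) solution of Strehl and Littman [2008]: move as much probability mass as the budget allows onto the highest-value next state, draining it from the lowest-value ones. A hedged Python sketch (function and variable names are ours):

```python
import numpy as np

def optimistic_transition(p_hat, v, beta):
    """Maximize P'^T v over {P' a distribution : ||P' - p_hat||_1 <= beta}.

    p_hat: empirical next-state distribution; v: optimistic values of the
    next states; beta: L1 confidence radius. One sort => O(S log S).
    """
    p = p_hat.astype(float).copy()
    best = int(np.argmax(v))
    # Add half the budget to the best state (the matching half is removed
    # below, so total L1 movement stays within beta), capping at probability 1.
    p[best] = min(1.0, p[best] + beta / 2.0)
    excess = p.sum() - 1.0
    for s in np.argsort(v):          # drain mass from low-value states first
        if excess <= 0:
            break
        if s == best:
            continue
        drain = min(p[s], excess)
        p[s] -= drain
        excess -= drain
    return p
```

For example, on p̂ = (0.5, 0.3, 0.2), v = (1, 2, 3), β = 0.2, the sketch shifts 0.1 of mass from the worst next state to the best one, yielding (0.4, 0.3, 0.3).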
These two options then lead to the UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP) algorithms.

4.1 UCRL2 with Greedy Policies for Finite-Horizon MDPs

Algorithm 3 UCRL2 with Greedy Policies (UCRL2-GP)
1: r̃_{k−1}(s^k_t, a) = r̂_{k−1}(s^k_t, a) + √( 2 ln(8SAT/δ) / (n_{k−1}(s^k_t, a) ∨ 1) )
2: CI(s^k_t, a) = { P' ∈ P(S) : ||P'(·) − p̂_{k−1}(· | s^k_t, a)||₁ ≤ √( 4S ln(12SAT/δ) / (n_{k−1}(s^k_t, a) ∨ 1) ) }
3: p̃_{k−1}(· | s^k_t, a) = argmax_{P' ∈ CI(s^k_t, a)} P'(· | s^k_t, a)^T V̄^{k−1}_{t+1}
4: Q̄(s^k_t, a) = r̃_{k−1}(s^k_t, a) + p̃_{k−1}(· | s^k_t, a)^T V̄^{k−1}_{t+1}
5: Return Q̄(s^k_t, a)

We form the optimistic local model based on the confidence set of UCRL2 [Jaksch et al., 2010]. This amounts to using Algorithm 3 as the subroutine 'ModelBaseOptimisticQ' in Algorithm 2. The maximization problem on Line 3 of Algorithm 3 is common when using a bonus based on an optimistic model [Jaksch et al., 2010], and it can be solved efficiently in Õ(N) operations (e.g., Strehl and Littman 2008, Section 3.1.5). A full version of the algorithm can be found in Appendix D.

Thus, Algorithm 3 performs NAH operations per episode. This saves the need to perform Extended Value Iteration [Jaksch et al., 2010], which costs NSAH operations per episode (an extra factor of S). Despite the significant improvement in terms of computational complexity, the regret of UCRL2-GP is similar to that of UCRL2 [Jaksch et al., 2010], as the following theorem formalizes (see proof in Appendix D).

Theorem 8 (Regret Bound of UCRL2-GP).
For any time T ≤ KH, with probability at least 1 − δ, the regret of UCRL2-GP is bounded by Õ( HS√(AT) + H²S^{3/2}A ).

Proof Sketch. Using the optimism of the value function (see Section D.2) and by applying Lemma 7, we bound the regret as follows:

    Regret(K) = Σ_{k=1}^K V*_1(s^k_1) − V^{π_k}_1(s^k_1)
              ≤ Σ_{k=1}^K V̄^{k−1}_1(s^k_1) − V^{π_k}_1(s^k_1)
              ≤ Σ_{k=1}^K Σ_{t=1}^H E[ V̄^{k−1}_t(s^k_t) − V̄^k_t(s^k_t) | F_{k−1} ]
                + Σ_{k=1}^K Σ_{t=1}^H E[ (r̃_{k−1} − r)(s^k_t, a^k_t) + (p̃_{k−1} − p)(· | s^k_t, a^k_t)^T V̄^{k−1}_{t+1} | F_{k−1} ].    (4)

Thus, the regret is upper bounded by two terms. As in Theorem 4, by applying Lemma 11 (Appendix A), the first term in (4) is a sum of SH Decreasing Bounded Processes, and can thus be bounded by Õ(SH²). The presence of the second term in (4) is common in recent regret analyses (e.g., Dann et al. 2017). Using standard techniques [Jaksch et al., 2010, Dann et al., 2017, Zanette and Brunskill, 2019], this term can be bounded (up to additive constant factors) with high probability by

    ≲ H√S Σ_{k=1}^K Σ_{t=1}^H E[ √( 1 / n_{k−1}(s^k_t, a^k_t) ) | F_{k−1} ] ≤ Õ(HS√(AT)).

4.2 EULER with Greedy Policies

In this section, we use bonus terms as in EULER [Zanette and Brunskill, 2019]. Similar to the previous section, this amounts to replacing the subroutine 'ModelBaseOptimisticQ' in Algorithm 2 with a subroutine based on the bonus terms from [Zanette and Brunskill, 2019]. Algorithm 5 in Appendix E contains the pseudocode of the algorithm.
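As a rough illustration of the variance-aware bonuses that EULER-type algorithms build on, an empirical-Bernstein-style bonus couples a √(Var/n) term with a lower-order 1/n correction; the constants and the function below are our own placeholders, not EULER's exact terms:

```python
import math

def bernstein_bonus(sample_var, n, delta):
    """Illustrative empirical-Bernstein exploration bonus (placeholder
    constants): a variance-aware sqrt term plus a lower-order 1/n term.
    """
    if n < 2:
        return float("inf")  # no data yet: stay maximally optimistic
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * sample_var * log_term / n) \
        + 7.0 * log_term / (3.0 * (n - 1))
```

The bonus shrinks both with more samples and with smaller empirical variance, which is the mechanism behind problem-dependent bounds that scale with quantities like Q* rather than with the worst-case range H.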
The bonus terms in EULER are based on the empirical Bernstein inequality and on tracking both an upper bound V̄_t and a lower bound V_t on V*_t. Using these, EULER achieves both minimax-optimal and problem-dependent regret bounds.

EULER [Zanette and Brunskill, 2019] performs O(NSAH) computations per episode (same as the VI algorithm), while EULER-GP requires only O(NAH). Despite this advantage in computational complexity, EULER-GP exhibits similar minimax regret bounds to EULER (see proof in Appendix E), much like the equivalent performance of UCRL2 and UCRL2-GP proved in Section 4.1.

Theorem 9 (Regret Bound of EULER-GP). Let G be an upper bound on the total reward collected within an episode. Define Q* := max_{s,a,t} ( Var R(s, a) + Var_{s'∼p(·|s,a)} V*_{t+1}(s') ) and H_eff := min{ Q*, G²/H }. With probability 1 − δ, for any time T ≤ KH, jointly on all episodes k ∈ [K], the regret of EULER-GP is bounded by Õ( √(H_eff SAT) + S^{3/2}AH²(√S + √H) ). Thus, it is also bounded by Õ( √(HSAT) + S^{3/2}AH²(√S + √H) ).

Note that Theorem 9 exhibits similar problem-dependent regret bounds as Theorem 1 of [Zanette and Brunskill, 2019]. Thus, the same corollaries derived in [Zanette and Brunskill, 2019] for EULER can also be applied to EULER-GP.

5 Experiments

In this section, we present an empirical evaluation of both UCRL2 and EULER, and compare their performance to the proposed variants that use greedy policy updates, UCRL2-GP and EULER-GP, respectively.

(a) Chain environment with N = 25 states    (b) 2D chain environment with 5 × 5 grid

Figure 1: A comparison of UCRL2 and EULER with their greedy counterparts. Results are averaged over 5 random seeds and are shown alongside error bars (±3 std).
We evaluated the algorithms on two environments. (i) Chain environment [Osband and Van Roy, 2017]: In this MDP, there are N states, connected in a chain. The agent starts at the left side of the chain and can either move to the left or try moving to the right, which succeeds w.p. 1 − 1/N and results in movement to the left otherwise. The agent's goal is to reach the right side of the chain and try moving to the right, which results in a reward r ∼ N(1, 1). Moving backwards from the initial state also results in r ∼ N(0, 1); otherwise, the reward is r = 0. Furthermore, the horizon is set to H = N, so that the agent must always move to the right to have a chance to receive a reward. (ii) 2D chain: A generalization of the chain environment, in which the agent starts at the upper-left corner of an N × N grid and aims to reach the lower-right corner and move towards this corner, in H = 2N − 1 steps. Similarly to the chain environment, there is a probability 1/H of moving backwards (up or left), and the agent must always move toward the corner to observe a reward r ∼ N(1, 1). Moving into the starting corner results in r ∼ N(0, 1); otherwise, r = 0. This environment is more challenging for greedy updates, since there are many possible trajectories that lead to reward.

The simulation results can be found in Figure 1, and clearly indicate that using greedy planning leads to negligible degradation in performance. Thus, the simulations verify our claim that greedy policy updates greatly improve the efficiency of the algorithm while maintaining the same performance.

6 Related Work

Real-Time Dynamic Programming: RTDP [Barto et al., 1995] has been extensively used and has many variants that exhibit superior empirical performance (e.g., [Bonet and Geffner, 2003, McMahan et al., 2005, Smith and Simmons, 2006]).
For discounted MDPs, Strehl et al. [2006] proved (ε, δ)-PAC bounds of Õ( SA / (ε²(1 − γ)⁴) ) for a modified version of RTDP in which the value updates occur only if the decrease in value is larger than ε(1 − γ). That is, their algorithm explicitly uses ε to mark states with an accurate value estimate. We prove that RTDP converges at a rate of Õ(SH²/ε) without knowing ε. Indeed, Strehl et al. [2006] posed whether the original RTDP is PAC as an open problem. Furthermore, no regret bound for RTDP has been reported in the literature.

Regret bounds for RL: The most renowned algorithms with regret guarantees for undiscounted infinite-horizon MDPs are UCRL2 [Jaksch et al., 2010] and REGAL [Bartlett and Tewari, 2009], which have been extended throughout the years (e.g., by Fruit et al. 2018, Talebi and Maillard 2018). Recently, there is increasing interest in regret bounds for MDPs with finite horizon H and stationary dynamics. In this scenario, UCRL2 enjoys a regret bound of order HS√(AT). Azar et al. [2017] proposed UCBVI, with an improved regret bound of order √(HSAT), which is also asymptotically tight [Osband and Van Roy, 2016]. Dann et al. [2018] presented ORLC, which achieves tight regret bounds and (nearly) tight PAC guarantees for non-stationary MDPs. Finally, Zanette and Brunskill [2019] proposed EULER, an algorithm that enjoys tight minimax regret bounds and has additional problem-dependent bounds that encapsulate the MDP's complexity. All of these algorithms are model-based and require full-planning. Model-free RL was analyzed by Jin et al. [2018]. There, the authors exhibit regret bounds that are worse by a factor of H relative to the lower bound. To the best of our knowledge, there are no model-based algorithms with regret guarantees that avoid full-planning.
It is worth noting that while all the above algorithms, and the ones in this work, rely on the Optimism in the Face of Uncertainty principle [Lai and Robbins, 1985], Thompson-Sampling model-based RL algorithms also exist [Osband et al., 2013, Gopalan and Mannor, 2015, Agrawal and Jia, 2017, Osband and Van Roy, 2017]. There, a model is sampled from a distribution over models, on which full-planning takes place.
Greedy policies in model-based RL: By adjusting RTDP to the case where the model is unknown, Strehl et al. [2012] formulated model-based RL algorithms that act using a greedy policy. They proved an Õ(S²A/ε³(1 − γ)⁶) sample-complexity bound for discounted MDPs. To the best of our knowledge, there are no regret bounds for model-based RL algorithms that act by greedy policies.
Practical model-based RL: Due to the high computational complexity of planning in model-based RL, most practical algorithms are model-free (e.g., Mnih et al. 2015). Algorithms that do use a model usually take advantage of local information only. For example, Dyna [Sutton, 1991, Peng et al., 2018] selects state-action pairs, either randomly or via prioritized sweeping [Moore and Atkeson, 1993, Van Seijen and Sutton, 2013], and updates them according to a local model. Other papers use the local model to plan for a short horizon from the current state [Tamar et al., 2016, Hafner et al., 2018]. The performance of such algorithms depends heavily on the planning horizon, which in turn dramatically increases the computational complexity.
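To make the computational contrast discussed above concrete, the following sketch compares full-planning, i.e., backward induction over the entire model, with a 1-step greedy update that backs up only the currently visited state. This is a simplified illustration under an assumed known model, without the optimistic bonuses the algorithms above rely on: a full stage backup costs O(S²A), whereas the greedy update costs O(SA), a factor of S less.

```python
import numpy as np

def full_planning(p, r, H):
    """Backward induction over the whole model.
    p: (S, A, S) transition tensor, r: (S, A) rewards.
    Performs a backup for every (h, s): O(H * S^2 * A) time in total."""
    S, A = r.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        q = r + p @ V[h + 1]       # (S, A): Q-values for all states at stage h
        V[h] = q.max(axis=1)
    return V

def greedy_1step(p, r, V, h, s):
    """1-step planning: back up only the visited state s at stage h.
    O(S * A) time -- a factor of S cheaper than a full stage backup."""
    q = r[s] + p[s] @ V[h + 1]     # (A,): Q-values at (h, s) only
    a = int(q.argmax())
    V[h, s] = q[a]                 # Bellman backup at the visited state
    return a
```

In the paper's setting the greedy agent applies such updates along its trajectories on an optimistic empirical model, which is how the overall factor-S saving in computation arises.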
Put differently, the variance caused by exploring with greedy policies is smaller than the variance caused by learning a sufficiently good model. Indeed, the extra term that appears due to the greedy exploration is Õ(SH²) (e.g., the first term in (4)): a constant term, smaller than the existing constant terms of UCRL2 and EULER.
This work raises and highlights several interesting research questions. The obvious ones are extensions to average-reward and discounted MDPs, as well as to Thompson-Sampling-based RL algorithms. Although these scenarios are harder or different in terms of analysis, we believe this work introduces the relevant approach for tackling them. Another interesting question is the applicability of the results to large-scale problems, where tabular representation is infeasible and approximation must be used. There, algorithms that act using lookahead policies, instead of 1-step planning, are expected to yield better performance, as they are less sensitive to value approximation errors (e.g., Bertsekas and Tsitsiklis 1996, Jiang et al. 2018, Efroni et al. 2018b,a). Even then, full-planning, as opposed to short-horizon planning, might be unnecessary. Lastly, establishing whether the model-based approach is provably better than the model-free approach, as the current state of the literature suggests, remains an important open problem.

Acknowledgments

We thank Oren Louidor for illuminating discussions regarding the Decreasing Bounded Process, and Esther Derman for very helpful comments. This work was partially funded by the Israel Science Foundation under ISF grant number 1380/16.

References

Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos.
Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.

Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.

Victor H. de la Peña, Michael J. Klass, and Tze Leung Lai. Pseudo-maximization and self-normalized processes. Probability Surveys, 4:172–192, 2007.

Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.

Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. How to combine tree-search methods in reinforcement learning. arXiv preprint arXiv:1809.01843, 2018a.

Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning.
In Advances in Neural Information Processing Systems, pages 5238–5247, 2018b.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Daniel R. Jiang, Emmanuel Ekwedike, and Han Liu. Feedback-based tree search for reinforcement learning. arXiv preprint arXiv:1805.05935, 2018.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

H. Brendan McMahan, Maxim Likhachev, and Geoffrey J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proceedings of the 22nd International Conference on Machine Learning, pages 569–576. ACM, 2005.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529, 2015.

Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130, 1993.

Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, pages 2701–2710. JMLR.org, 2017.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.

Trey Smith and Reid Simmons. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In AAAI, pages 1227–1232, 2006.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. PAC reinforcement learning bounds for RTDP and Rand-RTDP. In Proceedings of the AAAI Workshop on Learning for Search, 2006.

Alexander L. Strehl, Lihong Li, and Michael L. Littman. Incremental model-based learners with formal learning-time guarantees. arXiv preprint arXiv:1206.6870, 2012.

Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Mohammad Sadegh Talebi and Odalric-Ambrym Maillard.
Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. arXiv preprint arXiv:1803.01626, 2018.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

Harm Van Seijen and Richard S. Sutton. Planning by prioritized sweeping with small backups. arXiv preprint arXiv:1301.2343, 2013.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.