{"title": "Exploration in Structured Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8874, "page_last": 8882, "abstract": "We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration rates of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes S and A of the state and action spaces, i.e., they are smaller than c log T where T is the time horizon and the constant c only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as SA log T. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.", "full_text": "Exploration in Structured Reinforcement Learning\n\nJungseul Ok\nKTH, EECS\n\nStockholm, Sweden\n\nockjs@illinois.edu\n\nAlexandre Proutiere\n\nKTH, EECS\n\nStockholm, Sweden\nalepro@kth.se\n\nDamianos Tranos\n\nKTH, EECS\n\nStockholm, Sweden\ntranos@kth.se\n\nAbstract\n\nWe address reinforcement learning problems with \ufb01nite state and action spaces\nwhere the underlying MDP has some known structure that could be potentially\nexploited to minimize the exploration rates of suboptimal (state, action) pairs. 
For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes S and A of the state and action spaces, i.e., they are smaller than c log T where T is the time horizon and the constant c only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as SA log T. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure.

1 Introduction

Real-world Reinforcement Learning (RL) problems often concern dynamical systems with large state and action spaces, which make the design of efficient algorithms extremely challenging. This difficulty is well illustrated by the known fundamental limits on regret. The regret compares the accumulated reward of an optimal policy (aware of the system dynamics and reward function) to that of the algorithm considered, and it quantifies the loss incurred by the need to explore sub-optimal (state, action) pairs to learn the system dynamics and rewards. In online RL problems with undiscounted reward, regret lower bounds typically scale as SA log T or √(SAT)¹, where S, A, and T denote the sizes of the state and action spaces and the time horizon, respectively.
Hence, with large state and action spaces, it is essential to identify and exploit any possible structure existing in the system dynamics and reward function so as to minimize exploration phases and in turn reduce regret to reasonable values. Modern RL algorithms actually implicitly impose some structural properties either on the model parameters (transition probabilities and reward function, see e.g. [Ortner and Ryabko, 2012]) or directly on the Q-function (for discounted RL problems, see e.g. [Mnih et al., 2015]). Despite the successes of these recent algorithms, our understanding of structured RL problems remains limited.

In this paper, we explore structured RL problems with finite state and action spaces. We first derive problem-specific regret lower bounds satisfied by any algorithm for RL problems with any arbitrary structure. These lower bounds are instrumental to devise algorithms optimally balancing exploration and exploitation, i.e., achieving the regret fundamental limits. A similar approach has been recently applied with success to stochastic bandit problems, where the average reward of arms exhibits structural properties, e.g. unimodality [Combes and Proutiere, 2014], Lipschitz continuity [Magureanu et al., 2014], or more general properties [Combes et al., 2017]. Extending these results to RL problems is highly nontrivial, and to our knowledge, this paper is the first to provide problem-specific regret lower bounds for structured RL problems.

¹The first lower bound is asymptotic in T and problem-specific; the second is minimax. We ignore here for simplicity the dependence of these bounds on the diameter, bias span, and action sub-optimality gap.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Although the results presented here concern ergodic RL problems with undiscounted reward, they could be easily generalized to discounted problems (under an appropriate definition of regret).

Our contributions are as follows:

1. For ergodic structured RL problems, we derive problem-specific regret lower bounds. The latter are valid for any structure (but are structure-specific), and for unknown system dynamics and reward function.

2. We analyze the lower bounds for unstructured MDPs, and show that they scale at most as (H+1)²/δ_min · SA log T, where H and δ_min represent the span of the bias function and the minimal state-action sub-optimality gap, respectively. These results extend previously known regret lower bounds derived in the seminal paper [Burnetas and Katehakis, 1997] to the case where the reward function is unknown.

3. We further study the regret lower bounds in the case of Lipschitz MDPs. Interestingly, these bounds are shown to scale at most as (H+1)³/δ²_min · S_lip A_lip log T, where S_lip and A_lip only depend on the Lipschitz properties of the transition probabilities and reward function. This indicates that when H and δ_min do not scale with the sizes of the state and action spaces, we can hope for a regret growing logarithmically with the time horizon, and independent of S and A.

4. We propose DEL, an algorithm that achieves our regret fundamental limits for any structured MDP. DEL is rather complex to implement since it requires in each round to solve an optimization problem similar to that providing the regret lower bounds. Fortunately, we were able to devise simplified versions of DEL, with regret scaling at most as (H+1)²/δ_min · SA log T and (H+1)³/δ²_min · S_lip A_lip log T for unstructured and Lipschitz MDPs, respectively.
In the absence of structure, DEL, in its simplified version, does not need to compute action indexes as done in OLP [Tewari and Bartlett, 2008], and yet achieves similar regret guarantees without knowledge of the reward function. DEL, simplified for Lipschitz MDPs, only needs, in each step, to compute the optimal policy of the estimated MDP and to solve a simple linear program.

5. Preliminary numerical experiments (presented in the supplementary material) illustrate our theoretical findings. In particular, we provide examples of Lipschitz MDPs for which the regret under DEL does not seem to scale with S and A, and significantly outperforms that of algorithms that do not exploit the structure.

2 Related Work

Regret lower bounds have been extensively investigated for unstructured ergodic RL problems. [Burnetas and Katehakis, 1997] provided a problem-specific lower bound similar to ours, but only valid when the reward function is known. Minimax regret lower bounds have been studied e.g. in [Auer et al., 2009] and [Bartlett and Tewari, 2009]: in the worst case, the regret has to scale as √(DSAT), where D is the diameter of the MDP. In spite of these results, regret lower bounds for unstructured RL problems are still attracting some attention; see e.g. [Osband and Van Roy, 2016] for insightful discussions. To our knowledge, this paper constitutes the first attempt to derive regret lower bounds in the case of structured RL problems. Our bounds are asymptotic in the time horizon T, but we hope to extend them to finite time horizons using techniques similar to those recently used to provide such bounds for bandit problems [Garivier et al., Jun. 2018]. These techniques address problem-specific and minimax lower bounds in a unified manner, and can be leveraged to derive minimax lower bounds for structured RL problems.
However, we do not expect minimax lower bounds to be very informative about the regret gains that one may achieve by exploiting a structure (indeed, the MDPs leading to worst-case regret in unstructured RL comply with many structures).

A plethora of algorithms has been developed for ergodic unstructured RL problems. We may classify these algorithms depending on their regret guarantees, which scale either as log T or as √T. In the absence of structure, [Burnetas and Katehakis, 1997] developed an asymptotically optimal, but involved, algorithm. This algorithm has been simplified in [Tewari and Bartlett, 2008], but remains more complex than our proposed algorithm. Some algorithms have finite-time regret guarantees scaling as log T [Auer and Ortner, 2007], [Auer et al., 2009], [Filippi et al., 2010]. For example, the authors of [Filippi et al., 2010] propose KL-UCRL, an extension of UCRL [Auer and Ortner, 2007], with regret bounded by D²S²A/δ_min · log T. Having finite-time regret guarantees is arguably desirable, but so far this comes at the expense of a much larger constant in front of log T. Algorithms with regret scaling as √T include UCRL2 [Auer et al., 2009] and KL-UCRL, with regret guarantees Õ(DS√(AT)), and REGAL.C [Bartlett and Tewari, 2009], with guarantees Õ(HS√(AT)). Recently, the authors of [Agrawal and Jia, 2017] managed to achieve a regret guarantee of Õ(D√(SAT)), but this is only valid when T ≥ S⁵A.

Algorithms devised to exploit some known structure are most often applicable to RL problems with continuous state or action spaces. Typically, the transition probabilities and reward function are assumed to be smooth in the state and action, e.g. Lipschitz continuous [Ortner and Ryabko, 2012], [Lakshmanan et al., 2015]. The regret then needs to scale as a power of T, e.g. T^(2/3) in [Lakshmanan et al., 2015] for 1-dimensional state spaces.
An original approach to RL problems for which the transition probabilities belong to some known class of functions was proposed in [Osband and Van Roy, 2014]. The regret upper bounds derived there depend on the so-called Kolmogorov and eluder dimensions, which in turn depend on the chosen class of functions. Our approach to designing learning algorithms that exploit the structure is different from all aforementioned methods, as we aim at matching the problem-specific minimal exploration rates of sub-optimal (state, action) pairs.

3 Models and Objectives

We consider an MDP φ = (p_φ, q_φ) with finite state and action spaces S and A of respective cardinalities S and A. p_φ and q_φ are the transition and reward kernels of φ. Specifically, when in state x, taking action a, the system moves to state y with probability p_φ(y|x, a), and a reward drawn from distribution q_φ(·|x, a) of average r_φ(x, a) is collected. The rewards are bounded, w.l.o.g., in [0, 1]. We assume that for any (x, a), q_φ(·|x, a) is absolutely continuous w.r.t. some measure λ on [0, 1].² The random vector Z_t := (X_t, A_t, R_t) represents the state, the action, and the collected reward at step t. A policy π selects an action, denoted by π_t(x), in step t when the system is in state x, based on the history captured through H^π_t, the σ-algebra generated by (Z_1, ..., Z_{t−1}, X_t) observed under π: π_t(x) is H^π_t-measurable. We denote by Π the set of all such policies.

Structured MDPs. The MDP φ is initially unknown. However, we assume that φ belongs to some well-specified set Φ which may encode a known structure of the MDP. The knowledge of Φ can be exploited to devise (more) efficient policies.
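To make the notation concrete, the following is a minimal numerical instance of an MDP φ = (p_φ, q_φ) with S = 2 and A = 2, together with one-step sampling from its kernels. All numbers, and the choice of Bernoulli reward distributions, are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy MDP phi = (p_phi, q_phi): p[x, a, y] is the probability of moving to
# state y when playing action a in state x; r[x, a] is the mean of the
# reward distribution q_phi(.|x, a), bounded in [0, 1].
S, A = 2, 2
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[0.5, 0.1],
              [0.2, 0.8]])

def step(x, a, rng):
    """Sample the next state and a Bernoulli reward with mean r[x, a]."""
    y = int(rng.choice(S, p=p[x, a]))
    reward = float(rng.random() < r[x, a])
    return y, reward

rng = np.random.default_rng(0)
x, total = 0, 0.0
for t in range(1000):
    x, rew = step(x, 1, rng)   # always play action 1 (a stationary policy)
    total += rew
print(total / 1000)   # empirical long-run average reward of this policy
```

The empirical average printed at the end approaches the gain of the stationary policy "always play action 1" as the horizon grows.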
The results derived in this paper are valid under any structure, but we give particular attention to the cases of:

(i) Unstructured MDPs: φ ∈ Φ if for all (x, a), p_φ(·|x, a) ∈ P(S) and q_φ(·|x, a) ∈ P([0, 1])³;
(ii) Lipschitz MDPs: φ ∈ Φ if p_φ(·|x, a) and r_φ(x, a) are Lipschitz-continuous w.r.t. x and a in some metric space (we provide a precise definition in the next section).

The learning problem. The expected cumulative reward up to step T of a policy π ∈ Π when the system starts in state x is V^π_T(x) := E^π_x[Σ_{t=1..T} R_t], where E^π_x[·] denotes the expectation under policy π given that X_1 = x. Now assume that the system starts in state x and evolves according to the initially unknown MDP φ ∈ Φ for a given structure Φ; the objective is to devise a policy π ∈ Π maximizing V^π_T(x) or, equivalently, minimizing the regret R^π_T(x) up to step T, defined as the difference between the cumulative reward of an optimal policy and that obtained under π:

R^π_T(x) := V*_T(x) − V^π_T(x), where V*_T(x) := sup_{π∈Π} V^π_T(x).

Preliminaries and notations. Let Π_D be the set of stationary (deterministic) policies, i.e., when in state X_t = x, f ∈ Π_D selects an action f(x) independent of t. φ is communicating if each pair of states is connected by some policy. Further, φ is ergodic if under any stationary policy, the resulting Markov chain (X_t)_{t≥1} is irreducible.

²λ can be the Lebesgue measure; alternatively, if rewards take values in {0, 1}, λ can be the sum of Dirac measures at 0 and 1.
³P(S) is the set of distributions on S and P([0, 1]) is the set of distributions on [0, 1] absolutely continuous w.r.t. λ.

For any communicating φ and any policy f ∈ Π_D, we denote by g^f_φ(x) the gain of f (or long-term average reward) started from initial state x: g^f_φ(x) := lim_{T→∞} (1/T) V^f_T(x). We denote by Π*(φ) the set of stationary policies with maximal gain: Π*(φ) := {f ∈ Π_D : g^f_φ(x) = g*_φ(x) ∀x ∈ S}, where g*_φ(x) := max_{π∈Π} g^π_φ(x). If φ is communicating, the maximal gain is constant and denoted by g*_φ. The bias function h^f_φ of f ∈ Π_D is defined by h^f_φ(x) := C-lim_{T→∞} E^f_x[Σ_{t=1..∞} (R_t − g^f_φ(X_t))], and quantifies the advantage of starting in state x. We denote by B^a_φ and B*_φ, respectively, the Bellman operator under action a and the optimal Bellman operator under φ. They are defined by: for any h : S → R and x ∈ S,

(B^a_φ h)(x) := r_φ(x, a) + Σ_{y∈S} p_φ(y|x, a) h(y)   and   (B*_φ h)(x) := max_{a∈A} (B^a_φ h)(x).

Then for any f ∈ Π_D, g^f_φ and h^f_φ satisfy the evaluation equation: for all states x ∈ S, g^f_φ(x) + h^f_φ(x) = (B^{f(x)}_φ h^f_φ)(x). Furthermore, f ∈ Π*(φ) if and only if g^f_φ and h^f_φ verify the optimality equation:

g^f_φ(x) + h^f_φ(x) = (B*_φ h^f_φ)(x).

We denote by h*_φ the bias function of an optimal stationary policy⁴, and by H its span H := max_{x,y} h*_φ(x) − h*_φ(y). For x ∈ S, h : S → R, and φ ∈ Φ, let O(x; h, φ) := {a ∈ A : (B*_φ h)(x) = (B^a_φ h)(x)}. For ergodic φ, h*_φ is unique up to an additive constant.
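For a fixed stationary policy f, the evaluation equation is a linear system in the unknowns (g^f_φ, h^f_φ); since the bias is only defined up to an additive constant, one can pin h^f_φ(0) = 0 and solve directly. A sketch on a small hand-made ergodic MDP (illustrative numbers, not from the paper):

```python
import numpy as np

# Solve g + h(x) = r_f(x) + sum_y Pf[x, y] h(y) with h(0) pinned to 0,
# for the stationary policy f = "play action 1 everywhere".
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[0.5, 0.1],
              [0.2, 0.8]])
S = 2
f = [1, 1]
Pf = np.array([p[x, f[x]] for x in range(S)])   # transition matrix of f
rf = np.array([r[x, f[x]] for x in range(S)])   # reward vector of f

# Unknowns: (g, h(1), ..., h(S-1)); one equation per state.
M = np.zeros((S, S))
M[:, 0] = 1.0                    # coefficient of the gain g
M[:, 1:] = (np.eye(S) - Pf)[:, 1:]   # coefficients of h after pinning h(0) = 0
sol = np.linalg.solve(M, rf)
g, h = sol[0], np.concatenate(([0.0], sol[1:]))
print(g, h)   # gain and bias function of f
```

The printed pair satisfies the evaluation equation exactly (up to floating-point error), which the reader can check by plugging it back in.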
Hence, for ergodic φ, the set of optimal actions in state x under φ is O(x; φ) := O(x; h*_φ, φ), and Π*(φ) = {f ∈ Π_D : f(x) ∈ O(x; φ) ∀x ∈ S}. Finally, we define for any state x and action a,

δ*(x, a; φ) := (B*_φ h*_φ)(x) − (B^a_φ h*_φ)(x).

This can be interpreted as the long-term regret incurred by initially selecting action a in state x (and then applying an optimal stationary policy) rather than following an optimal policy. The minimum gap is defined as δ_min := min_{(x,a): δ*(x,a;φ)>0} δ*(x, a; φ).

We denote R̄_+ = R_+ ∪ {∞}. The set of MDPs is equipped with the following ℓ∞-norm: ‖φ − ψ‖ := max_{(x,a)∈S×A} ‖φ(x, a) − ψ(x, a)‖, where ‖φ(x, a) − ψ(x, a)‖ := |r_φ(x, a) − r_ψ(x, a)| + max_{y∈S} |p_φ(y|x, a) − p_ψ(y|x, a)|.

The proofs of all results are presented in the supplementary material.

4 Regret Lower Bounds

In this section, we present an (asymptotic) regret lower bound satisfied by any uniformly good learning algorithm. An algorithm π ∈ Π is uniformly good if for all ergodic φ ∈ Φ, any initial state x, and any constant α > 0, the regret of π satisfies R^π_T(x) = o(T^α).

To state our lower bound, we introduce the following notations. For φ and ψ, we write φ ≪ ψ if the kernel of φ is absolutely continuous w.r.t. that of ψ, i.e., ∀E, P_φ[E] = 0 if P_ψ[E] = 0.
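On small instances, h*_φ, the gaps δ*(x, a; φ), and δ_min can be computed numerically by relative value iteration, h ← B*_φ h − (B*_φ h)(x₀). A sketch on the same illustrative toy MDP as in the earlier snippets (numbers are assumptions, not from the paper):

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[0.5, 0.1],
              [0.2, 0.8]])

def bellman(h):
    """Q[x, a] = (B^a_phi h)(x) = r(x, a) + sum_y p(y|x, a) h(y)."""
    return r + p @ h          # shape (S, A)

h = np.zeros(2)
for _ in range(10_000):       # relative value iteration, pinning h(0) = 0
    q = bellman(h)
    h = q.max(axis=1) - q.max(axis=1)[0]

q = bellman(h)
g_star = q.max(axis=1)[0]                       # optimal gain (h(0) = 0)
delta = q.max(axis=1, keepdims=True) - q        # delta*(x, a; phi)
delta_min = delta[delta > 1e-9].min()           # minimum positive gap
print(g_star, delta, delta_min)
```

For these numbers the optimal policy plays action 0 in state 0 and action 1 in state 1; the two sub-optimal pairs have gaps 0.1 and roughly 0.47, so δ_min = 0.1.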
For φ and ψ such that φ ≪ ψ, and for any (x, a), we define the KL-divergence KL_{φ|ψ}(x, a) between φ and ψ in state-action pair (x, a) as the KL-divergence between the distributions of the next state and collected reward when the state is x and a is selected, under these two MDPs:

KL_{φ|ψ}(x, a) = Σ_{y∈S} p_φ(y|x, a) log( p_φ(y|x, a) / p_ψ(y|x, a) ) + ∫₀¹ q_φ(r|x, a) log( q_φ(r|x, a) / q_ψ(r|x, a) ) λ(dr).

We further define the set of confusing MDPs as:

Δ_Φ(φ) = {ψ ∈ Φ : φ ≪ ψ, (i) KL_{φ|ψ}(x, a) = 0 ∀x, ∀a ∈ O(x; φ); (ii) Π*(φ) ∩ Π*(ψ) = ∅}.

This set consists of MDPs ψ that (i) coincide with φ for state-action pairs where the actions are optimal (the kernels of φ and ψ cannot be statistically distinguished under an optimal policy), and such that (ii) the optimal policies under ψ are not optimal under φ.

⁴In case h*_φ is not unique, we arbitrarily select an optimal stationary policy and define h*_φ accordingly.

Theorem 1. Let φ ∈ Φ be ergodic. For any uniformly good algorithm π ∈ Π and for any x ∈ S,

lim inf_{T→∞} R^π_T(x) / log T ≥ K_Φ(φ),   (1)

where K_Φ(φ) is the value of the following optimization problem:

inf_{η ∈ F_Φ(φ)} Σ_{(x,a)∈S×A} η(x, a) δ*(x, a; φ),   (2)

where F_Φ(φ) := {η ∈ R̄_+^{S×A} : Σ_{(x,a)∈S×A} η(x, a) KL_{φ|ψ}(x, a) ≥ 1, ∀ψ ∈ Δ_Φ(φ)}.

The above theorem can be interpreted as follows. When selecting a sub-optimal action a in state x, one has to pay a regret of δ*(x, a; φ).
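For a finite next-state distribution and, say, Bernoulli rewards (for which the integral against λ reduces to a two-point sum), KL_{φ|ψ}(x, a) is straightforward to evaluate. A sketch with illustrative kernels:

```python
import numpy as np

def kl_pair(p_phi, p_psi, r_phi, r_psi):
    """KL between the (next state, reward) laws of phi and psi at one fixed
    (x, a), assuming Bernoulli rewards with means r_phi and r_psi."""
    trans = np.sum(p_phi * np.log(p_phi / p_psi))
    rew = (r_phi * np.log(r_phi / r_psi)
           + (1 - r_phi) * np.log((1 - r_phi) / (1 - r_psi)))
    return trans + rew

# A "confusing" psi that differs from phi only at this single (x, a):
kl = kl_pair(np.array([0.9, 0.1]), np.array([0.7, 0.3]), 0.5, 0.6)
print(kl)   # strictly positive, since the kernels differ at (x, a)
```

The feasibility constraint of Theorem 1 then forces any uniformly good algorithm to visit this pair at a rate of roughly (1/kl) log T in order to separate ψ from φ.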
Then the minimal number of times any sub-optimal action a in state x has to be explored scales as η*(x, a) log T, where η* solves the optimization problem (2). It is worth mentioning that our lower bound is tight, as we present in Section 5 an algorithm achieving this fundamental limit of regret.

The regret lower bound stated in Theorem 1 extends the problem-specific regret lower bound derived in [Burnetas and Katehakis, 1997] for unstructured ergodic MDPs with known reward function. Our lower bound is valid for an unknown reward function, and it also applies to any structure Φ. Note however that at this point, it is only implicitly defined through the solution of (2), which seems difficult to solve. The optimization problem can actually be simplified, as shown later in this section, by providing useful structural properties of the feasibility set F_Φ(φ) depending on the structure considered. The simplification will be instrumental to quantify the gain that can be achieved when optimally exploiting the structure, as well as to design efficient algorithms.

In the following, the optimization problem inf_{η∈F} Σ_{(x,a)∈S×A} η(x, a) δ*(x, a; φ) is referred to as P(φ, F), so that P(φ, F_Φ(φ)) corresponds to (2).

The proof of Theorem 1 combines a characterization of the regret as a function of the numbers of times N_T(x, a) the (state, action) pairs (x, a) are visited up to step T and of the δ*(x, a; φ)'s, with change-of-measure arguments like those recently used to prove, in a very direct manner, regret lower bounds in bandit optimization problems [Kaufmann et al., 2016].
More precisely, for any uniformly good algorithm π and for any confusing MDP ψ ∈ Δ_Φ(φ), we show that the exploration rates required to statistically distinguish ψ from φ satisfy

lim inf_{T→∞} (1/log T) Σ_{(x,a)∈S×A} E^π_{x_1}[N_T(x, a)] KL_{φ|ψ}(x, a) ≥ 1,

where the expectation is taken w.r.t. φ given any initial state x_1. The theorem is then obtained by considering (hence optimizing the lower bound over) all possible confusing MDPs.

4.1 Decoupled exploration in unstructured MDPs

In the absence of structure, Φ = {ψ : p_ψ(·|x, a) ∈ P(S), q_ψ(·|x, a) ∈ P([0, 1]), ∀(x, a)}, and we have:

Theorem 2. Consider the unstructured model Φ, and let φ ∈ Φ be ergodic. We have:

F_Φ(φ) = {η ∈ R̄_+^{S×A} : ∀(x, a) s.t. a ∉ O(x; φ), η(x, a) KL_{φ|ψ}(x, a) ≥ 1, ∀ψ ∈ Δ_Φ(x, a; φ)},

where Δ_Φ(x, a; φ) := {ψ ∈ Φ : KL_{φ|ψ}(y, b) = 0 ∀(y, b) ≠ (x, a) and (B^a_ψ h*_φ)(x) > g*_φ + h*_φ(x)}.

The theorem states that in the constraints of the optimization problem (2), we can restrict our attention to confusing MDPs ψ that differ from the original MDP φ only at a single state-action pair (x, a). Further note that the condition (B^a_ψ h*_φ)(x) > g*_φ + h*_φ(x) is equivalent to saying that action a becomes optimal in state x under ψ (see Lemma 1(i) in [Burnetas and Katehakis, 1997]). Hence, to obtain the lower bound in unstructured MDPs, we may just consider confusing MDPs ψ which make an initially sub-optimal action a in state x optimal by locally changing the kernels and rewards of φ at (x, a) only.
Importantly, this observation implies that an optimal algorithm π must satisfy E^π_{x_1}[N_T(x, a)] ∼ log T / inf_{ψ∈Δ_Φ(x,a;φ)} KL_{φ|ψ}(x, a). In other words, the required levels of exploration of the various sub-optimal state-action pairs are decoupled, which significantly simplifies the design of optimal algorithms.

To get an idea of how the regret lower bound scales with the sizes of the state and action spaces, we can further provide an upper bound on it. One may easily observe that F_un(φ) ⊂ F_Φ(φ), where

F_un(φ) = {η ∈ R̄_+^{S×A} : η(x, a) (δ*(x, a; φ)/(H + 1))² ≥ 2, ∀(x, a) s.t. a ∉ O(x; φ)}.

From this result, an upper bound on the regret lower bound is K_un(φ) := 2 (H+1)²/δ_min · SA log T, and we can devise algorithms achieving this regret scaling (see Section 5).

Theorem 2 relies on the following decoupling lemma, actually valid under any structure Φ.

Lemma 1. Let U_1, U_2 be two non-overlapping subsets of the (state, action) pairs such that for all (x, a) ∈ U_0 := U_1 ∪ U_2, a ∉ O(x; φ). Define the following three MDPs in Φ, obtained starting from φ and changing the kernels for (state, action) pairs in U_1 ∪ U_2. Specifically, let (p, q) be some transition and reward kernels.
For all (x, a), define ψ_j, j ∈ {0, 1, 2}, as

(p_{ψ_j}(·|x, a), q_{ψ_j}(·|x, a)) = (p(·|x, a), q(·|x, a)) if (x, a) ∈ U_j, and (p_φ(·|x, a), q_φ(·|x, a)) otherwise.

Then, if Π*(φ) ∩ Π*(ψ_0) = ∅, then Π*(φ) ∩ Π*(ψ_1) = ∅ or Π*(φ) ∩ Π*(ψ_2) = ∅.

4.2 Lipschitz structure

Lipschitz structures have been widely studied in the bandit and reinforcement learning literature. We find it convenient to use the following structure, although one could imagine other variants in more general metric spaces. We assume that the state (resp. action) space can be embedded in the d- (resp. d′-) dimensional Euclidean space: S ⊂ [0, D]^d and A ⊂ [0, D′]^{d′}. We consider MDPs whose transition kernels and average rewards are Lipschitz w.r.t. the states and actions. Specifically, let L, L′ > 0, α, α′ > 0, and

Φ = {ψ : p_ψ(·|x, a) ∈ P(S), q_ψ(·|x, a) ∈ P([0, 1]) : (L1)-(L2) hold, ∀(x, a)},

where

(L1) ‖p_ψ(·|x, a) − p_ψ(·|x′, a′)‖_1 ≤ L d(x, x′)^α + L′ d(a, a′)^{α′},
(L2) |r_ψ(x, a) − r_ψ(x′, a′)| ≤ L d(x, x′)^α + L′ d(a, a′)^{α′}.

Here d(·,·) is the Euclidean distance, and for two distributions p_1 and p_2 on S, we denote ‖p_1 − p_2‖_1 = Σ_{y∈S} |p_1(y) − p_2(y)|.

Theorem 3. For the model Φ with Lipschitz structure (L1)-(L2), we have F_lip(φ) ⊂ F_Φ(φ), where F_lip(φ) is the set of η ∈ R̄_+^{S×A} satisfying, for all (x′, a′) such that a′ ∉ O(x′, φ),

Σ_{x∈S} Σ_{a∉O(x,φ)} η(x, a) ( [ δ*(x′, a′; φ)/(H + 1) − 2(L d(x, x′)^α + L′ d(a, a′)^{α′}) ]_+ )² ≥ 2,   (3)

where we use the notation [u]_+ := max{0, u} for u ∈ R. Furthermore, the optimal values K_Φ(φ) and K_lip(φ) of P(φ, F_Φ(φ)) and P(φ, F_lip(φ)) are upper bounded by 8 (H+1)³/δ²_min · S_lip A_lip, where

S_lip := min{ S, ( D√d / (δ_min/(8L(H+1)))^{1/α} + 1 )^d }   and   A_lip := min{ A, ( D′√d′ / (δ_min/(8L′(H+1)))^{1/α′} + 1 )^{d′} }.

The above theorem has important consequences. First, it states that by exploiting the Lipschitz structure optimally, one may achieve a regret scaling at most as (H+1)³/δ²_min · S_lip A_lip log T. This scaling is independent of the sizes of the state and action spaces provided that the minimal gap δ_min is fixed, and provided that the span H does not scale with S. The latter condition typically holds for fast-mixing models or for MDPs with a diameter not scaling with S (refer to [Bartlett and Tewari, 2009] for a precise connection between H and the diameter). Hence, exploiting the structure can really yield significant regret improvements.
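The quantities S_lip and A_lip in Theorem 3 are covering numbers: up to the constants, they count cells of side (δ_min/(8L(H+1)))^{1/α} needed to cover [0, D]^d, and similarly for actions. A small sketch with illustrative constants (the formula below follows the definition of S_lip in Theorem 3, as reconstructed from the extracted text):

```python
import numpy as np

def s_lip(S, D, d, L, alpha, delta_min, H):
    """Covering-number surrogate of the state space: min(S, (D sqrt(d)/eps + 1)^d)
    with cell size eps = (delta_min / (8 L (H+1)))^(1/alpha)."""
    eps = (delta_min / (8 * L * (H + 1))) ** (1.0 / alpha)
    return min(S, (D * np.sqrt(d) / eps + 1) ** d)

# A 1-D state space discretized into S = 10^4 points:
val = s_lip(S=10_000, D=1.0, d=1, L=1.0, alpha=1.0, delta_min=0.4, H=0.0)
print(val)   # the covering number; it no longer grows once S is large
```

With these constants the covering number is about 21, so the Lipschitz bound replaces the factor S = 10,000 of the unstructured bound by a constant independent of the discretization.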
As shown in the next section, leveraging the simplified structure in F_lip(φ), we may devise a simple algorithm achieving these improvements, i.e., having a regret scaling at most as K_lip(φ) log T.

Algorithm 1 DEL(γ)

input: model structure Φ
Initialize N_1(x) ← 1[x = X_1], N_1(x, a) ← 0, s_1(x) ← 0, p_1(y|x, a) ← 1/S, r_1(x, a) ← 0 for each x, y ∈ S, a ∈ A, and φ_1 accordingly.
for t = 1, ..., T do
    For each x ∈ S, let C_t(x) := {a ∈ A : N_t(x, a) ≥ log² N_t(x)}, φ′_t := φ_t(C_t), h′_t(x) := h*_{φ′_t}(x), ζ_t := 1/(1 + log log t), and γ_t := (1 + γ)(1 + log t).
    For each (x, a) ∈ S × A, let δ_t(x, a) := δ*(x, a; φ_t, C_t, ζ_t).
    if ∀a ∈ O(X_t; φ′_t), N_t(X_t, a) < log² N_t(X_t) + 1 then
        Monotonize: A_t ← A^mnt_t := arg min_{a ∈ O(X_t; φ′_t)} N_t(X_t, a)
    else if ∃a ∈ A s.t. N_t(X_t, a) < log N_t(X_t) / (1 + log log N_t(X_t)) then
        Estimate: A_t ← A^est_t := arg min_{a ∈ A} N_t(X_t, a)
    else if (N_t(x, a)/γ_t : (x, a) ∈ S × A) ∈ F_Φ(φ_t; C_t, ζ_t) then
        Exploit: A_t ← A^xpt_t := arg min_{a ∈ O(X_t; φ′_t)} N_t(X_t, a)
    else
        if F_t := F_Φ(φ_t; C_t, ζ_t) ∩ {η : η(x, a) = ∞ if δ_t(x, a) = 0} = ∅ then
            Let η_t(x, a) = ∞ if δ_t(x, a) = 0 and η_t(x, a) = 0 otherwise.
        else
            Obtain a solution η_t of P(δ_t, F_t): inf_{η ∈ F_t} Σ_{(x,a)∈S×A} η(x, a) δ_t(x, a).
        end if
        Explore: A_t ← A^xpr_t := arg min_{a ∈ A : N_t(X_t, a) ≤ η_t(X_t, a) γ_t} N_t(X_t, a)
    end if
    Select action A_t, and observe the next state X_{t+1} and the instantaneous reward R_t.
    Update φ_{t+1}, N_{t+1}(x), and N_{t+1}(x, a) for each (x, a) ∈ S × A.
end for

5 Algorithms

In this section, we present DEL (Directed Exploration Learning), an algorithm that achieves the regret limits identified in the previous section. Asymptotically optimal algorithms for generic controlled Markov chains have already been proposed in [Graves and Lai, 1997], and could be adapted to our setting. By presenting DEL, we aim at providing simplified, yet optimal, algorithms. Moreover, DEL can be adapted so that the exploration rates of sub-optimal actions are directed towards the solution of an optimization problem P(φ, F(φ)), provided that F(φ) ⊂ F_Φ(φ) (it suffices to use F(φ_t) instead of F_Φ(φ_t) in DEL). For example, in the case of a Lipschitz structure Φ, running DEL on F_lip(φ) yields a regret scaling at most as (H+1)³/δ²_min · S_lip A_lip log T.

The pseudo-code of DEL with input parameter γ > 0 is given in Algorithm 1. There, for notational convenience, we abuse notations and redefine log t as 1[t ≥ 1] log t, and let ∞ · 0 = 0.
\u03c6t refers\nto the estimated MDP at time t (using empirical transition rates and rewards). For any non-empty\ncorrespondence C : S (cid:16) A (i.e., for any x, C(x) is a non-empty subset of A), let \u03c6(C) denote the\nrestricted MDP where the set of actions available at state x is C(x). Then, g\u2217\n\u03c6(C) are the\n(optimal) gain and bias functions corresponding to the restricted MDP \u03c6(C). Given a restriction\nde\ufb01ned by C, for each (x, a) \u2208 S \u00d7 A, let \u03b4\u2217(x, a; \u03c6,C) := (B\u2217\n\u03c6(C))(x) and\n\u03c6(C)(y). For \u03b6 \u2265 0, let \u03b4\u2217(x, a; \u03c6,C, \u03b6) := 0 if \u03b4\u2217(x, a; \u03c6,C) \u2264 \u03b6,\nH\u03c6(C) := maxx,y\u2208S h\u2217\nand let \u03b4\u2217(x, a; \u03c6,C, \u03b6) := \u03b4\u2217(x, a; \u03c6,C) otherwise. For \u03b6 \u2265 0, we further de\ufb01ne the set of confusing\nMDPs \u2206\u03a6(\u03c6;C, \u03b6), and the set of feasible solutions F\u03a6(\u03c6;C, \u03b6) as:\n\n\u03c6(C) and h\u2217\n\u03c6h\u2217\n\n\u03c6(C))(x) \u2212 (Ba\n\n\u03c6(C)h\u2217\n\n\u2206\u03a6(\u03c6;C, \u03b6) :=\n\n\u03c8 \u2208 \u03a6 \u222a {\u03c6} : \u03c6 (cid:28) \u03c8;\n\n(i) KL\u03c6|\u03c8(x, a) = 0 \u2200x,\u2200a \u2208 O(x; \u03c6(C));\n(ii) \u2203(x, a) \u2208 S \u00d7 A s.t.\n\na /\u2208 O(x; \u03c6(C)) and \u03b4\u2217(x, a; \u03c8,C, \u03b6) = 0\n\n(cid:41)\n\n(cid:41)\n\nF\u03a6(\u03c6;C, \u03b6) :=\n\n\u03b7(x, a)KL\u03c6|\u03c8(x, a) \u2265 1, \u2200\u03c8 \u2208 \u2206\u03a6(\u03c6;C, \u03b6)\n\n.\n\n\u03c6(C)(x) \u2212 h\u2217\n(cid:40)\n(cid:40)\n\n\u03b7 \u2208 \u00afRS\u00d7A\n\n+\n\n(cid:88)\n\n(cid:88)\n\nx\u2208S\n\na\u2208A\n\n:\n\n7\n\n\fSimilar sets Fun(\u03c6;C, \u03b6) and Flip(\u03c6;C, \u03b6) can be de\ufb01ned for the cases of unstructured and Lipschitz\nMDPs (refer to the supplementary material), and DEL can be simpli\ufb01ed in these cases by replac-\ning F\u03a6(\u03c6;C, \u03b6) by Fun(\u03c6;C, \u03b6) or Flip(\u03c6;C, \u03b6) in the pseudo-code. 
Finally, P(\u03b4,F) refers to the\n\noptimization problem inf \u03b7\u2208F(cid:80)\n\n(x,a)\u2208S\u00d7A \u03b7(x, a)\u03b4(x, a).\n\nDEL combines the ideas behind OSSB [Combes et al., 2017], an asymptotically optimal algorithm for\nstructured bandits, and the asymptotically optimal algorithm presented in [Burnetas and Katehakis,\n1997] for RL problems without structure. DEL design aims at exploring sub-optimal actions no more\nthan what the regret lower bound prescribes. To this aim, it essentially solves in each iteration t an\noptimization problem close to P (\u03c6t,F\u03a6(\u03c6t)) where \u03c6t is an estimate of the true MDP \u03c6. Depending\non the solution and the number of times apparently sub-optimal actions have been played, DEL\ndecides to explore or exploit. The estimation phase ensures that certainty equivalence holds. The\n\"monotonization\" phase together with the restriction to relatively well selected actions were already\nproposed in [Burnetas and Katehakis, 1997] to make sure that accurately estimated actions only are\nselected in the exploitation phase. The various details and complications introduced in DEL ensure\nthat its regret analysis can be conducted. In practice (see the supplementary material), our initial\nexperiments suggest that many details can be removed without large regret penalties.\nTheorem 4. 
For a structure \u03a6 with Bernoulli rewards and for any ergodic MDP \u03c6 \u2208 \u03a6, assume\nthat: (i) \u03c6 is in the interior of \u03a6 (i.e., there exists a constant \u03b60 > 0 such that for any \u03b6 \u2208 (0, \u03b60),\n\u03c8 \u2208 \u03a6 if (cid:107)\u03c6 \u2212 \u03c8(cid:107) \u2264 \u03b6 and \u03c8 (cid:28) \u03c6), (ii) the solution \u03b7\u2217(\u03c6) is uniquely de\ufb01ned for each (x, a) such\nthat a /\u2208 O(x; \u03c6), (iii) continuous at \u03c6 (i.e., for any given \u03b5 > 0, there exists \u03b6(\u03b5) > 0 such that\nfor all \u03b6 \u2208 (0, \u03b6(\u03b5)), if (cid:107)\u03c8 \u2212 \u03c6(cid:107) \u2264 \u03b6, max(x,a):a /\u2208O(x;\u03c6) |\u03b7\u2217(x, a; \u03c8, \u03b6) \u2212 \u03b7\u2217(x, a; \u03c6)| \u2264 \u03b5 where\n\u03b7\u2217(\u03c8, \u03b6) is solution of P(\u03b4\u2217(\u03c8,A, \u03b6),F\u03a6(\u03c8;A, \u03b6)), and \u03b7\u2217(x, a; \u03c6) that of P (\u03c6,F\u03a6(\u03c6))). Then, for\n\u03c0 = DEL(\u03b3) with any \u03b3 > 0, we have:\n\nlim sup\nT\u2192\u221e\n\nR\u03c0\nT (\u03c6)\nlog T\n\n\u2264 (1 + \u03b3)K\u03a6(\u03c6) .\n\n(4)\n\nFor Lipschitz \u03a6 with (L1)-(L2) (resp.\nunstructured \u03a6), if \u03c0 = DEL uses in each step t,\nFlip(\u03c6t;Ct, \u03b6t) (resp. Fun(\u03c6t;Ct, \u03b6t)) instead of F\u03a6(\u03c6t;Ct, \u03b6t), its regret is asymptotically smaller\nthan (1 + \u03b3)Klip(\u03c6) log T (resp. (1 + \u03b3)Kun(\u03c6) log T ).\nIn the above theorem, the assumptions about the uniqueness and continuity of the solution \u03b7\u2217(\u03c6)\ncould be veri\ufb01ed for particular structures. In particular, we believe that they generally hold in the\ncase of unstructured and Lipschitz MDPs. 
Also note that similar assumptions have been made in [Graves and Lai, 1997].

6 Extensions and Future Work

It is worth extending the approach developed in this paper to the case of structured discounted RL problems (although for such problems, there is no ideal way of defining the regret of an algorithm). There are other extensions worth investigating. For example, since our framework allows any kind of structure, we may specialize our regret lower bounds to structures stronger than Lipschitz continuity, e.g., rewards exhibiting some kind of unimodality or convexity. Under such structures, the regret improvements might become even more significant. Another interesting direction consists in generalizing the results to the case of communicating MDPs. This would allow us, for example, to consider deterministic system dynamics and unknown probabilistic rewards.

7 Acknowledgements

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Jungseul Ok is now with UIUC in Prof. Sewoong Oh's group. He would like to thank UIUC for financially supporting his participation in the NIPS 2018 conference.

References

Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems 31, 2017.

Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems 19, 2007.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 22, 2009.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs.
In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.

Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems 30, 2017.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.

Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, Jun. 2018.

Todd L. Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.

Kailasam Lakshmanan, Ronald Ortner, and Daniil Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In 32nd International Conference on Machine Learning, 2015.

Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. In Conference on Learning Theory, 2014.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems 25, 2012.

Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the Eluder dimension. In Advances in Neural Information Processing Systems 27, 2014.

Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.

Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems 20, 2008.