{"title": "Online Regret Bounds for Undiscounted Continuous Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1763, "page_last": 1771, "abstract": "We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hoelder continuity of rewards and transition probabilities.", "full_text": "Online Regret Bounds for Undiscounted Continuous Reinforcement Learning\n\nRonald Ortner\u2217\u2020\n\n\u2217Montanuniversitaet Leoben\n\n8700 Leoben, Austria\n\nrortner@unileoben.ac.at\n\nDaniil Ryabko\u2020\n\n\u2020INRIA Lille-Nord Europe, \u00e9quipe SequeL\n\n59650 Villeneuve d\u2019Ascq, France\n\ndaniil@ryabko.net\n\nAbstract\n\nWe derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are H\u00f6lder continuity of rewards and transition probabilities.\n\n1 Introduction\n\nReal world problems usually demand continuous state or action spaces, and one of the challenges for reinforcement learning is to deal with such continuous domains. In many problems there is a natural metric on the state space such that close states exhibit similar behavior. 
Often such similarities can be formalized as Lipschitz or more generally H\u00f6lder continuity of reward and transition functions.\nThe simplest continuous reinforcement learning problem is the 1-dimensional continuum-armed bandit, where the learner has to choose arms from a bounded interval. Bounds on the regret with respect to an optimal policy under the assumption that the reward function is H\u00f6lder continuous have been given in [15, 4]. The proposed algorithms apply the UCB algorithm [2] to a discretization of the problem. That way, the regret suffered by the algorithm consists of the loss by aggregation (which can be bounded using H\u00f6lder continuity) plus the regret the algorithm incurs in the discretized setting. More recently, algorithms that adapt the used discretization (making it finer in more promising regions) have been proposed and analyzed [16, 8].\nWhile the continuous bandit case has been investigated in detail, in the general case of continuous state Markov decision processes (MDPs) a lot of work is confined to rather particular settings, primarily with respect to the considered transition model. In the simplest case, the transition function is considered to be deterministic as in [19], and mistake bounds for the respective discounted setting have been derived in [6]. Another common assumption is that transition functions are linear functions of state and action plus some noise. For such settings sample complexity bounds have been given in [23, 7], while $\\tilde{O}(\\sqrt{T})$ bounds for the regret after T steps are shown in [1]. However, there is also some research considering more general transition dynamics under the assumption that close states behave similarly, as will be considered here. While most of this work is purely experimental [12, 24], there are also some contributions with theoretical guarantees. 
Thus, [13] considers PAC-learning for continuous reinforcement learning in metric state spaces, when generative sampling is possible. The proposed algorithm is a generalization of the $E^3$ algorithm [14] to continuous domains. A respective adaptive discretization approach is suggested in [20]. The PAC-like bounds derived there however depend on the (random) behavior of the proposed algorithm.\n\nHere we suggest a learning algorithm for undiscounted reinforcement learning in continuous state space. The proposed algorithm is in the tradition of algorithms like UCRL2 [11] in that it implements the \u201coptimism in the face of uncertainty\u201d maxim, here combined with state aggregation. Thus, the algorithm does not need a generative model or access to \u201cresets:\u201d learning is done online, that is, in a single continual session of interactions between the environment and the learning policy.\nFor our algorithm we derive regret bounds of $\\tilde{O}(T^{(2+\\alpha)/(2+2\\alpha)})$ for MDPs with 1-dimensional state space and H\u00f6lder-continuous rewards and transition probabilities with parameter \u03b1. These bounds also straightforwardly generalize to dimension d where the regret is bounded by $\\tilde{O}(T^{(2d+\\alpha)/(2d+2\\alpha)})$. Thus, in particular, if rewards and transition probabilities are Lipschitz, the regret is bounded by $\\tilde{O}(T^{(2d+1)/(2d+2)})$ in dimension d and $\\tilde{O}(T^{3/4})$ in dimension 1. We also present an accompanying lower bound of $\\Omega(\\sqrt{T})$. As far as we know, these are the first regret bounds for a general undiscounted continuous reinforcement learning setting.\n\n2 Preliminaries\n\nWe consider the following setting. Given is a Markov decision process (MDP) M with state space $S = [0, 1]^d$ and finite action space A. For the sake of simplicity, in the following we assume d = 1. However, proofs and results generalize straightforwardly to arbitrary dimension, cf. 
Remark 5 below.\nThe random rewards in state s under action a are assumed to be bounded in [0, 1] with mean r(s, a). The transition probability distribution in state s under action a is denoted by p(\u00b7|s, a).\nWe will make the natural assumption that rewards and transition probabilities are similar in close states. More precisely, we assume that rewards and transition probabilities are H\u00f6lder continuous.\nAssumption 1. There are L, \u03b1 > 0 such that for any two states $s, s'$ and all actions a,\n\n$$|r(s, a) - r(s', a)| \\le L|s - s'|^{\\alpha}.$$\n\nAssumption 2. There are L, \u03b1 > 0 such that for any two states $s, s'$ and all actions a,\n\n$$\\|p(\\cdot|s, a) - p(\\cdot|s', a)\\|_1 \\le L|s - s'|^{\\alpha}.$$\n\nFor the sake of simplicity we will assume that \u03b1 and L in Assumptions 1 and 2 are the same.\nWe also assume existence of an optimal policy $\\pi^* : S \\to A$ which gives optimal average reward $\\rho^* = \\rho^*(M)$ on M independent of the initial state. A sufficient condition for state-independent optimal reward is geometric convergence of $\\pi^*$ to an invariant probability measure. This is a natural condition which e.g. holds for any communicating finite state MDP. It also ensures (cf. Chapter 10 of [10]) that the Poisson equation holds for the optimal policy. In general, under suitable technical conditions (like geometric convergence to an invariant probability measure $\\mu_\\pi$) the Poisson equation\n\n$$\\rho_\\pi + \\lambda_\\pi(s) = r(s, \\pi(s)) + \\int_S p(ds'|s, \\pi(s)) \\cdot \\lambda_\\pi(s') \\qquad (1)$$\n\nrelates the rewards and transition probabilities under any measurable policy \u03c0 to its average reward $\\rho_\\pi$ and the bias function $\\lambda_\\pi : S \\to \\mathbb{R}$ of \u03c0. Intuitively, the bias is the difference in accumulated rewards when starting in a different state. 
Formally, the bias is defined by the Poisson equation (1) and the normalizing equation $\\int_S \\lambda_\\pi \\, d\\mu_\\pi = 0$ (cf. e.g. [9]). The following result follows from the bias definition and Assumptions 1 and 2 (together with results from Chapter 10 of [10]).\nProposition 3. Under Assumptions 1 and 2, the bias of the optimal policy is bounded.\n\nConsequently, it makes sense to define the bias span H(M) of a continuous state MDP M satisfying Assumptions 1 and 2 to be $H(M) := \\sup_s \\lambda_{\\pi^*}(s) - \\inf_s \\lambda_{\\pi^*}(s)$. Note that since $\\inf_s \\lambda_{\\pi^*}(s) \\le 0$ by definition of the bias, the bias function $\\lambda_{\\pi^*}$ is upper bounded by H(M).\nWe are interested in algorithms which can compete with the optimal policy $\\pi^*$ and measure their performance by the regret (after T steps) defined as $T\\rho^*(M) - \\sum_{t=1}^{T} r_t$, where $r_t$ is the random reward obtained by the algorithm at step t. Indeed, within T steps no canonical or even bias optimal policy (cf. Chapter 10 of [10]) can obtain higher accumulated reward than $T\\rho^* + H(M)$.\n\n3 Algorithm\n\nAlgorithm 1 The UCCRL algorithm\n\nInput: State space S = [0, 1], action space A, confidence parameter \u03b4 > 0, aggregation parameter $n \\in \\mathbb{N}$, upper bound H on the bias span, H\u00f6lder parameters L, \u03b1.\nInitialization:\n  \u25b7 Let $I_1 := [0, \\frac{1}{n}]$, $I_j := (\\frac{j-1}{n}, \\frac{j}{n}]$ for j = 2, 3, . . . , n.\n  \u25b7 Set t := 1, and observe the initial state $s_1$ and interval $I(s_1)$.\nfor episodes k = 1, 2, . . . do\n  \u25b7 Let $N_k(I_j, a)$ be the number of times action a has been chosen in a state $\\in I_j$ prior to episode k, and $v_k(I_j, a)$ the respective counts in episode k.\n  Initialize episode k:\n  \u25b7 Set the start time of episode k, $t_k := t$.\n  \u25b7 Compute estimates $\\hat{r}_k(s, a)$ and $\\hat{p}^{\\mathrm{agg}}_k(I_i|s, a)$ for rewards and transition probabilities, using all samples from states in the same interval I(s), respectively.\n  Compute policy $\\tilde{\\pi}_k$:\n  \u25b7 Let $\\mathcal{M}_k$ be the set of plausible MDPs $\\tilde{M}$ with $H(\\tilde{M}) \\le H$ and rewards $\\tilde{r}(s, a)$ and transition probabilities $\\tilde{p}(\\cdot|s, a)$ satisfying\n\n$$|\\tilde{r}(s, a) - \\hat{r}_k(s, a)| \\le Ln^{-\\alpha} + \\sqrt{\\frac{7 \\log(2nAt_k/\\delta)}{2 \\max\\{1, N_k(I(s), a)\\}}}, \\qquad (2)$$\n\n$$\\|\\tilde{p}^{\\mathrm{agg}}(\\cdot|s, a) - \\hat{p}^{\\mathrm{agg}}_k(\\cdot|s, a)\\|_1 \\le Ln^{-\\alpha} + \\sqrt{\\frac{56n \\log(2At_k/\\delta)}{\\max\\{1, N_k(I(s), a)\\}}}. \\qquad (3)$$\n\n  \u25b7 Choose policy $\\tilde{\\pi}_k$ and $\\tilde{M}_k \\in \\mathcal{M}_k$ such that\n\n$$\\rho_{\\tilde{\\pi}_k}(\\tilde{M}_k) = \\max\\{\\rho^*(M) \\mid M \\in \\mathcal{M}_k\\}. \\qquad (4)$$\n\n  Execute policy $\\tilde{\\pi}_k$:\n  while $v_k(I(s_t), \\tilde{\\pi}_k(s_t)) < \\max\\{1, N_k(I(s_t), \\tilde{\\pi}_k(s_t))\\}$ do\n    \u25b7 Choose action $a_t = \\tilde{\\pi}_k(s_t)$, obtain reward $r_t$, and observe next state $s_{t+1}$.\n    \u25b7 Set t := t + 1.\n  end while\nend for\n\nOur algorithm UCCRL, shown in detail in Algorithm 1, implements the \u201coptimism in the face of uncertainty\u201d maxim just like UCRL2 [11] or REGAL [5]. It maintains a set of plausible MDPs $\\mathcal{M}$ and chooses optimistically an MDP $\\tilde{M} \\in \\mathcal{M}$ and a policy $\\tilde{\\pi}$ such that the average reward $\\rho_{\\tilde{\\pi}}(\\tilde{M})$ is maximized, cf. (4). Whereas for UCRL2 and REGAL the set of plausible MDPs is defined by confidence intervals for rewards and transition probabilities for each individual state-action pair, for UCCRL we assume an MDP to be plausible if its aggregated rewards and transition probabilities are within a certain range. 
This range is defined by the aggregation error (determined by the assumed H\u00f6lder continuity) and respective confidence intervals, cf. (2), (3). Correspondingly, the estimates for rewards and transition probabilities for some state-action pair (s, a) are calculated from all sampled values of action a in states close to s.\n\nMore precisely, for the aggregation UCCRL partitions the state space into intervals $I_1 := [0, \\frac{1}{n}]$, $I_j := (\\frac{j-1}{n}, \\frac{j}{n}]$ for j = 2, 3, . . . , n. The corresponding aggregated transition probabilities are defined by\n\n$$p^{\\mathrm{agg}}(I_j|s, a) := \\int_{I_j} p(ds'|s, a). \\qquad (5)$$\n\nGenerally, for a (transition) probability distribution p(\u00b7) over S we write $p^{\\mathrm{agg}}(\\cdot)$ for the aggregated probability distribution with respect to $\\{I_1, I_2, \\ldots, I_n\\}$. Now, given the aggregated state space $\\{I_1, I_2, \\ldots, I_n\\}$, estimates $\\hat{r}(s, a)$ and $\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a)$ are calculated from all samples of action a in states in I(s), the interval $I_j$ containing s. (Consequently, the estimates are the same for states in the same interval.)\nLike UCRL2 and REGAL, UCCRL proceeds in episodes in which the chosen policy remains fixed. Episodes are terminated when the number of times an action has been sampled from some interval $I_j$ has been doubled. Only then are estimates updated and a new policy calculated.\n\nSince all states in the same interval $I_j$ have the same confidence intervals, finding the optimal pair $\\tilde{M}_k, \\tilde{\\pi}_k$ in (4) is equivalent to finding the respective optimistic discretized MDP $\\tilde{M}^{\\mathrm{agg}}_k$ and an optimal policy $\\tilde{\\pi}^{\\mathrm{agg}}_k$ on $\\tilde{M}^{\\mathrm{agg}}_k$. Then $\\tilde{\\pi}_k$ can be set to be the extension of $\\tilde{\\pi}^{\\mathrm{agg}}_k$ to S, that is, $\\tilde{\\pi}_k(s) := \\tilde{\\pi}^{\\mathrm{agg}}_k(I(s))$ for all s. 
However, due to the constraint on the bias even in this finite case efficient computation of $\\tilde{M}^{\\mathrm{agg}}_k$ and $\\tilde{\\pi}^{\\mathrm{agg}}_k$ is still an open problem. We note that the REGAL.C algorithm [5] selects optimistic MDP and optimal policy in the same way as UCCRL.\nWhile the algorithm presented here is the first modification of UCRL2 to continuous reinforcement learning problems, there are similar adaptations to online aggregation [21] and learning in finite state MDPs with some additional similarity structure known to the learner [22].\n\n4 Regret Bounds\n\nFor UCCRL we can derive the following bounds on the regret.\nTheorem 4. Let M be an MDP with continuous state space [0, 1], A actions, rewards and transition probabilities satisfying Assumptions 1 and 2, and bias span upper bounded by H. Then with probability 1 \u2212 \u03b4, the regret of UCCRL (run with input parameters n and H) after T steps is upper bounded by\n\n$$\\mathrm{const} \\cdot nH\\sqrt{AT \\log(T/\\delta)} + \\mathrm{const}' \\cdot HLn^{-\\alpha}T. \\qquad (6)$$\n\nTherefore, setting $n = T^{1/(2+2\\alpha)}$ gives regret upper bounded by\n\n$$\\mathrm{const} \\cdot HL\\sqrt{A \\log(T/\\delta)} \\cdot T^{(2+\\alpha)/(2+2\\alpha)}.$$\n\nWith no known upper bound on the bias span, guessing H by log T one still obtains an upper bound on the regret of $\\tilde{O}(T^{(2+\\alpha)/(2+2\\alpha)})$.\n\nIntuitively, the second term in the regret bound of (6) is the discretization error, while the first term corresponds to the regret on the discretized MDP. A detailed proof of Theorem 4 can be found in Section 5 below.\nRemark 5 (d-dimensional case). The general d-dimensional case can be handled as described for dimension 1, with the only difference being that the discretization now has $n^d$ states, so that one has $n^d$ instead of n in the first term of (6). 
Then choosing $n = T^{1/(2d+2\\alpha)}$ bounds the regret by $\\tilde{O}(T^{(2d+\\alpha)/(2d+2\\alpha)})$.\nRemark 6 (unknown horizon). If the horizon T is unknown then the doubling trick (executing the algorithm in rounds i = 1, 2, . . . guessing $T = 2^i$ and setting the confidence parameter to $\\delta/2^i$) gives the same bounds.\nRemark 7 (unknown H\u00f6lder parameters). The UCCRL algorithm receives (bounds on) the H\u00f6lder parameters L and \u03b1 as inputs. If these parameters are not known, then one can still obtain sublinear regret bounds albeit with worse dependence on T . Specifically, we can use the model-selection technique introduced in [17]. To do this, fix a certain number J of values for the constants L and \u03b1; each of these values will be considered as a model. The model selection consists in running UCCRL with each of these parameter values for a certain period of $\\tau_0$ time steps (exploration). Then one selects the model with the highest reward and uses it for a period of $\\tau'_0$ time steps (exploitation), while checking that its average reward stays within (6) of what was obtained in the exploration phase. If the average reward does not pass this test, then the model with the second-best average reward is selected, and so on. Then one switches to exploration with longer periods $\\tau_1$, etc. Since there are no guarantees on the behavior of UCCRL when the H\u00f6lder parameters are wrong, none of the models can be discarded at any stage. Optimizing over the parameters $\\tau_i$ and $\\tau'_i$ as done in [17], and increasing the number J of considered parameter values, one can obtain regret bounds of $\\tilde{O}(T^{(2+2\\alpha)/(2+3\\alpha)})$, or $\\tilde{O}(T^{4/5})$ in the Lipschitz case. For details see [17]. Since in this model-selection process UCCRL is used in a \u201cblack-box\u201d fashion, the exploration is rather wasteful, and thus we think that this bound is suboptimal. 
Recently, the results of [17] have been improved [18], and it seems that similar analysis gives improved regret bounds for the case of unknown H\u00f6lder parameters as well.\n\nThe following is a complementing lower bound on the regret for continuous state reinforcement learning.\n\nTheorem 8. For any A, H > 1 and any reinforcement learning algorithm there is a continuous state reinforcement learning problem with A actions and bias span H satisfying Assumption 1 such that the algorithm suffers regret of $\\Omega(\\sqrt{HAT})$.\n\nProof. Consider the following reinforcement learning problem with state space [0, 1]. The state space is partitioned into n intervals $I_j$ of equal size. The transition probabilities for each action a are on each of the intervals $I_j$ concentrated and equally distributed on the same interval $I_j$. The rewards on each interval $I_j$ are also constant for each a and are chosen as in the lower bounds for a multi-armed bandit problem [3] with nA arms. That is, giving only one arm slightly higher reward, it is known [3] that regret of $\\Omega(\\sqrt{nAT})$ can be forced upon any algorithm on the respective bandit problem. Adding another action giving no reward and equally distributing over the whole state space, the bias span of the problem is n and the regret $\\Omega(\\sqrt{HAT})$.\n\nRemark 9. Note that Assumption 2 does not hold in the example used in the proof of Theorem 8. However, the transition probabilities are piecewise constant (and hence Lipschitz) and known to the learner. Actually, it is straightforward to deal with piecewise H\u00f6lder continuous rewards and transition probabilities where the finitely many points of discontinuity are known to the learner. If one makes sure that the intervals of the discretized state space do not contain any discontinuities, it is easy to adapt UCCRL and Theorem 4 accordingly.\nRemark 10 (comparison to bandits). 
The bounds of Theorems 4 and 8 cannot be directly compared to bounds for the continuous-armed bandit problem [15, 4, 16, 8], because the latter is not a special case of learning MDPs with continuous state space (and rather corresponds to a continuous action space). Thus, in particular one cannot freely sample an arbitrary state of the state space as assumed in continuous-armed bandits.\n\n5 Proof of Theorem 4\n\nFor the proof of the main theorem we adapt the proof of the regret bounds for finite MDPs in [11] and [5]. Although the state space is now continuous, due to the finite horizon T , we can reuse some arguments, so that we keep the structure of the original proof of Theorem 2 in [11]. Some of the necessary adaptations made are similar to techniques used for showing regret bounds for other modifications of the original UCRL2 algorithm [21, 22], which however only considered finite-state MDPs.\n\n5.1 Splitting into Episodes\n\nLet $v_k(s, a)$ be the number of times action a has been chosen in episode k when being in state s, and denote the total number of episodes by m. Then setting $\\Delta_k := \\sum_{s,a} v_k(s, a)(\\rho^* - r(s, a))$, with probability at least $1 - \\frac{\\delta}{12T^{5/4}}$ the regret of UCCRL after T steps is upper bounded by (cf. Section 4.1 of [11]),\n\n$$\\sqrt{\\tfrac{5}{8} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + \\sum_{k=1}^{m} \\Delta_k. \\qquad (7)$$\n\n5.2 Failing Confidence Intervals\n\nNext, we consider the regret incurred when the true MDP M is not contained in the set of plausible MDPs $\\mathcal{M}_k$. Thus, fix a state-action pair (s, a), and recall that $\\hat{r}(s, a)$ and $\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a)$ are the estimates for rewards and transition probabilities calculated from all samples of state-action pairs contained in the same interval I(s). 
Now assume that at step t there have been N > 0 samples of action a in states in I(s) and that in the i-th sample a transition from state $s_i \\in I(s)$ to state $s'_i$ has been observed (i = 1, . . . , N).\nFirst, concerning the rewards one obtains as in the proof of Lemma 17 in Appendix C.1 of [11] (but now using Hoeffding for independent and not necessarily identically distributed random variables) that\n\n$$\\Pr\\Big\\{ \\big|\\hat{r}(s, a) - E[\\hat{r}(s, a)]\\big| \\ge \\sqrt{\\tfrac{7}{2N} \\log\\big(\\tfrac{2nAt}{\\delta}\\big)} \\Big\\} \\le \\frac{\\delta}{60nAt^7}. \\qquad (8)$$\n\nConcerning the transition probabilities, we have for a suitable $x \\in \\{-1, 1\\}^n$\n\n$$\\big\\|\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a) - E[\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a)]\\big\\|_1 = \\sum_{j=1}^{n} \\big|\\hat{p}^{\\mathrm{agg}}(I_j|s, a) - E[\\hat{p}^{\\mathrm{agg}}(I_j|s, a)]\\big|$$\n$$= \\sum_{j=1}^{n} \\Big(\\hat{p}^{\\mathrm{agg}}(I_j|s, a) - E[\\hat{p}^{\\mathrm{agg}}(I_j|s, a)]\\Big)\\, x(I_j)$$\n$$= \\frac{1}{N} \\sum_{i=1}^{N} \\Big( x(I(s'_i)) - \\int_S p(ds'|s_i, a) \\cdot x(I(s')) \\Big). \\qquad (9)$$\n\nFor any $x \\in \\{-1, 1\\}^n$, $X_i := x(I(s'_i)) - \\int_S p(ds'|s_i, a) \\cdot x(I(s'))$ is a martingale difference sequence with $|X_i| \\le 2$, so that by Azuma-Hoeffding inequality (e.g., Lemma 10 in [11]), $\\Pr\\{\\sum_{i=1}^{N} X_i \\ge \\theta\\} \\le \\exp(-\\theta^2/8N)$ and in particular\n\n$$\\Pr\\Big\\{ \\sum_{i=1}^{N} X_i \\ge \\sqrt{56nN \\log\\big(\\tfrac{2At}{\\delta}\\big)} \\Big\\} \\le \\Big(\\frac{\\delta}{2At}\\Big)^{7n} \\le \\frac{\\delta}{2^n 20nAt^7}.$$\n\nA union bound over all sequences $x \\in \\{-1, 1\\}^n$ then yields from (9) that\n\n$$\\Pr\\Big\\{ \\big\\|\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a) - E[\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a)]\\big\\|_1 \\ge \\sqrt{\\tfrac{56n}{N} \\log\\big(\\tfrac{2At}{\\delta}\\big)} \\Big\\} \\le \\frac{\\delta}{20nAt^7}. \\qquad (10)$$\n\nAnother union bound over all t possible values for N, all n intervals and all actions shows that the confidence intervals in (8) and (10) hold at time t with probability at least $1 - \\frac{\\delta}{15t^6}$ for the actual counts N(I(s), a) and all state-action pairs (s, a). (Note that the equations (8) and (10) are the same for state-action pairs with states in the same interval.)\nNow, by linearity of expectation $E[\\hat{r}(s, a)]$ can be written as $\\frac{1}{N} \\sum_{i=1}^{N} r(s_i, a)$. Since the $s_i$ are assumed to be in the same interval I(s), it follows that $|E[\\hat{r}(s, a)] - r(s, a)| < Ln^{-\\alpha}$. Similarly, $\\|E[\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a)] - p^{\\mathrm{agg}}(\\cdot|s, a)\\|_1 < Ln^{-\\alpha}$. Together with (8) and (10) this shows that with probability at least $1 - \\frac{\\delta}{15t^6}$ for all state-action pairs (s, a)\n\n$$\\big|\\hat{r}(s, a) - r(s, a)\\big| < Ln^{-\\alpha} + \\sqrt{\\frac{7 \\log(2nAt/\\delta)}{2 \\max\\{1, N(I(s), a)\\}}}, \\qquad (11)$$\n\n$$\\big\\|\\hat{p}^{\\mathrm{agg}}(\\cdot|s, a) - p^{\\mathrm{agg}}(\\cdot|s, a)\\big\\|_1 < Ln^{-\\alpha} + \\sqrt{\\frac{56n \\log(2At/\\delta)}{\\max\\{1, N(I(s), a)\\}}}. \\qquad (12)$$\n\nThis shows that the true MDP is contained in the set of plausible MDPs $\\mathcal{M}(t)$ at step t with probability at least $1 - \\frac{\\delta}{15t^6}$, just as in Lemma 17 of [11]. 
The argument that\n\n$$\\sum_{k=1}^{m} \\Delta_k \\mathbb{1}_{M \\notin \\mathcal{M}_k} \\le \\sqrt{T} \\qquad (13)$$\n\nwith probability at least $1 - \\frac{\\delta}{12T^{5/4}}$ then can be taken without any changes from Section 4.2 of [11].\n\n5.3 Regret in Episodes with $M \\in \\mathcal{M}_k$\n\nNow for episodes with $M \\in \\mathcal{M}_k$, by the optimistic choice of $\\tilde{M}_k$ and $\\tilde{\\pi}_k$ in (4) we can bound\n\n$$\\Delta_k = \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\rho^* - r(s, \\tilde{\\pi}_k(s))\\big)$$\n$$\\le \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{\\rho}^*_k - r(s, \\tilde{\\pi}_k(s))\\big)$$\n$$= \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{\\rho}^*_k - \\tilde{r}_k(s, \\tilde{\\pi}_k(s))\\big) + \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{r}_k(s, \\tilde{\\pi}_k(s)) - r(s, \\tilde{\\pi}_k(s))\\big).$$\n\nAny term $\\tilde{r}_k(s, a) - r(s, a) \\le |\\tilde{r}_k(s, a) - \\hat{r}_k(s, a)| + |\\hat{r}_k(s, a) - r(s, a)|$ is bounded according to (2) and (11), as we assume that $\\tilde{M}_k, M \\in \\mathcal{M}_k$, so that summarizing states in the same interval $I_j$\n\n$$\\Delta_k \\le \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{\\rho}^*_k - \\tilde{r}_k(s, \\tilde{\\pi}_k(s))\\big) + 2 \\sum_{j=1}^{n} \\sum_{a \\in A} v_k(I_j, a) \\Big( Ln^{-\\alpha} + \\sqrt{\\frac{7 \\log(2nAt_k/\\delta)}{2 \\max\\{1, N_k(I_j, a)\\}}} \\Big).$$\n\nSince $\\max\\{1, N_k(I_j, a)\\} \\le t_k \\le T$, setting $\\tau_k := t_{k+1} - t_k$ to be the length of episode k we have\n\n$$\\Delta_k \\le \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{\\rho}^*_k - \\tilde{r}_k(s, \\tilde{\\pi}_k(s))\\big) + 2Ln^{-\\alpha}\\tau_k + \\sqrt{14 \\log\\big(\\tfrac{2nAT}{\\delta}\\big)} \\sum_{j=1}^{n} \\sum_{a \\in A} \\frac{v_k(I_j, a)}{\\sqrt{\\max\\{1, N_k(I_j, a)\\}}}. \\qquad (14)$$\n\nWe continue analyzing the first term on the right hand side of (14). By the Poisson equation (1) for $\\tilde{\\pi}_k$ on $\\tilde{M}_k$, denoting the respective bias by $\\tilde{\\lambda}_k := \\tilde{\\lambda}_{\\tilde{\\pi}_k}$ we can write\n\n$$\\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\big(\\tilde{\\rho}^*_k - \\tilde{r}_k(s, \\tilde{\\pi}_k(s))\\big) = \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\Big( \\int_S \\tilde{p}_k(ds'|s, \\tilde{\\pi}_k(s)) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s) \\Big)$$\n$$= \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\Big( \\int_S p(ds'|s, \\tilde{\\pi}_k(s)) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s) \\Big) \\qquad (15)$$\n$$+ \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\sum_{j=1}^{n} \\int_{I_j} \\Big( \\tilde{p}_k(ds'|s, \\tilde{\\pi}_k(s)) - p(ds'|s, \\tilde{\\pi}_k(s)) \\Big) \\cdot \\tilde{\\lambda}_k(s'). \\qquad (16)$$\n\n5.4 The True Transition Functions\n\nNow $\\|\\tilde{p}^{\\mathrm{agg}}_k(\\cdot|s, a) - p^{\\mathrm{agg}}(\\cdot|s, a)\\|_1 \\le \\|\\tilde{p}^{\\mathrm{agg}}_k(\\cdot|s, a) - \\hat{p}^{\\mathrm{agg}}_k(\\cdot|s, a)\\|_1 + \\|\\hat{p}^{\\mathrm{agg}}_k(\\cdot|s, a) - p^{\\mathrm{agg}}(\\cdot|s, a)\\|_1$ can be bounded by (3) and (12), because we assume $\\tilde{M}_k, M \\in \\mathcal{M}_k$. Hence, since by definition of the algorithm H bounds the bias function $\\tilde{\\lambda}_k$, the term in (16) is bounded by\n\n$$\\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\sum_{j=1}^{n} \\int_{I_j} \\Big( \\tilde{p}_k(ds'|s, \\tilde{\\pi}_k(s)) - p(ds'|s, \\tilde{\\pi}_k(s)) \\Big) \\tilde{\\lambda}_k(s')$$\n$$\\le \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\cdot H \\cdot \\sum_{j=1}^{n} \\Big| \\tilde{p}^{\\mathrm{agg}}_k(I_j|s, \\tilde{\\pi}_k(s)) - p^{\\mathrm{agg}}(I_j|s, \\tilde{\\pi}_k(s)) \\Big|$$\n$$\\le \\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\cdot H \\cdot 2 \\Big( Ln^{-\\alpha} + \\sqrt{\\frac{56n \\log(2AT/\\delta)}{\\max\\{1, N_k(I(s), \\tilde{\\pi}_k(s))\\}}} \\Big)$$\n$$= 2HLn^{-\\alpha}\\tau_k + 4H \\sqrt{14n \\log\\big(\\tfrac{2AT}{\\delta}\\big)} \\sum_{j=1}^{n} \\sum_{a \\in A} \\frac{v_k(I_j, a)}{\\sqrt{\\max\\{1, N_k(I_j, a)\\}}}, \\qquad (17)$$\n\nwhile for the term in (15)\n\n$$\\sum_s v_k(s, \\tilde{\\pi}_k(s)) \\Big( \\int_S p(ds'|s, \\tilde{\\pi}_k(s)) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s) \\Big)$$\n$$= \\sum_{t=t_k}^{t_{k+1}-1} \\Big( \\int_S p(ds'|s_t, a_t) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s_t) \\Big)$$\n$$= \\sum_{t=t_k}^{t_{k+1}-1} \\Big( \\int_S p(ds'|s_t, a_t) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s_{t+1}) \\Big) + \\tilde{\\lambda}_k(s_{t_{k+1}}) - \\tilde{\\lambda}_k(s_{t_k}).$$\n\nLet k(t) be the index of the episode time step t belongs to. Then the sequence $X_t := \\int_S p(ds'|s_t, a_t) \\cdot \\tilde{\\lambda}_{k(t)}(s') - \\tilde{\\lambda}_{k(t)}(s_{t+1})$ is a sequence of martingale differences so that Azuma-Hoeffding inequality shows (cf. Section 4.3.2 and in particular eq. (18) in [11]) that after summing over all episodes we have\n\n$$\\sum_{k=1}^{m} \\Big( \\sum_{t=t_k}^{t_{k+1}-1} \\Big( \\int_S p(ds'|s_t, a_t) \\cdot \\tilde{\\lambda}_k(s') - \\tilde{\\lambda}_k(s_{t+1}) \\Big) + \\tilde{\\lambda}_k(s_{t_{k+1}}) - \\tilde{\\lambda}_k(s_{t_k}) \\Big) \\le H \\sqrt{\\tfrac{5}{2} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + HnA \\log_2\\big(\\tfrac{8T}{nA}\\big), \\qquad (18)$$\n\nwhere the second term comes from an upper bound on the number of episodes, which can be derived analogously to Appendix C.2 of [11].\n\n5.5 Summing over Episodes with $M \\in \\mathcal{M}_k$\n\nTo conclude, we sum (14) over all the episodes with $M \\in \\mathcal{M}_k$, using (15), (17), and (18). This yields that with probability at least $1 - \\frac{\\delta}{12T^{5/4}}$\n\n$$\\sum_{k=1}^{m} \\Delta_k \\mathbb{1}_{M \\in \\mathcal{M}_k} \\le 2HLn^{-\\alpha}T + 4H \\sqrt{14n \\log\\big(\\tfrac{2AT}{\\delta}\\big)} \\sum_{k=1}^{m} \\sum_{j=1}^{n} \\sum_{a \\in A} \\frac{v_k(I_j, a)}{\\sqrt{\\max\\{1, N_k(I_j, a)\\}}}$$\n$$+ H \\sqrt{\\tfrac{5}{2} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + HnA \\log_2\\big(\\tfrac{8T}{nA}\\big) + 2Ln^{-\\alpha}T + \\sqrt{14 \\log\\big(\\tfrac{2nAT}{\\delta}\\big)} \\sum_{k=1}^{m} \\sum_{j=1}^{n} \\sum_{a \\in A} \\frac{v_k(I_j, a)}{\\sqrt{\\max\\{1, N_k(I_j, a)\\}}}. \\qquad (19)$$\n\nAnalogously to Section 4.3.3 and Appendix C.3 of [11], one can show that\n\n$$\\sum_{k=1}^{m} \\sum_{j=1}^{n} \\sum_{a \\in A} \\frac{v_k(I_j, a)}{\\sqrt{\\max\\{1, N_k(I_j, a)\\}}} \\le \\big(\\sqrt{2} + 1\\big) \\sqrt{nAT},$$\n\nand we get from (19) after some simplifications that with probability $\\ge 1 - \\frac{\\delta}{12T^{5/4}}$\n\n$$\\sum_{k=1}^{m} \\Delta_k \\mathbb{1}_{M \\in \\mathcal{M}_k} \\le H \\sqrt{\\tfrac{5}{2} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + HnA \\log_2\\big(\\tfrac{8T}{nA}\\big) + (4H + 1) \\sqrt{14n \\log\\big(\\tfrac{2AT}{\\delta}\\big)} \\big(\\sqrt{2} + 1\\big) \\sqrt{nAT} + 2(H + 1)Ln^{-\\alpha}T. \\qquad (20)$$\n\nFinally, evaluating (7) by summing $\\Delta_k$ over all episodes, by (13) and (20) we have with probability $\\ge 1 - \\frac{\\delta}{4T^{5/4}}$ an upper bound on the regret of\n\n$$\\sqrt{\\tfrac{5}{8} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + \\sum_{k=1}^{m} \\Delta_k \\mathbb{1}_{M \\notin \\mathcal{M}_k} + \\sum_{k=1}^{m} \\Delta_k \\mathbb{1}_{M \\in \\mathcal{M}_k}$$\n$$\\le \\sqrt{\\tfrac{5}{8} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + \\sqrt{T} + H \\sqrt{\\tfrac{5}{2} T \\log\\big(\\tfrac{8T}{\\delta}\\big)} + HnA \\log_2\\big(\\tfrac{8T}{nA}\\big) + (4H + 1) \\sqrt{14n \\log\\big(\\tfrac{2AT}{\\delta}\\big)} \\big(\\sqrt{2} + 1\\big) \\sqrt{nAT} + 2(H + 1)Ln^{-\\alpha}T.$$\n\nA union bound over all possible values of T and further simplifications as in Appendix C.4 of [11] finish the proof.\n\n6 Outlook\n\nWe think that a generalization of our results to continuous action space should not pose any major problems. In order to improve over the given bounds, it may be promising to investigate more sophisticated discretization patterns.\nThe assumption of H\u00f6lder continuity is an obvious, yet not the only possible assumption one 
can\nmake about the transition probabilities and reward functions. A more general problem is to assume\na set F of functions, find a way to measure the \u201csize\u201d of F, and derive regret bounds depending on\nthis size of F.\n\nAcknowledgments\n\nThe authors would like to thank the three anonymous reviewers for their helpful suggestions and\nR\u00e9mi Munos for useful discussion which helped to improve the bounds. This research was funded by\nthe Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER\n(Contrat de Projets Etat Region CPER 2007-2013), ANR projects EXPLO-RA (ANR-08-COSI-004),\nLampada (ANR-09-EMER-007) and CoAdapt, and by the European Community\u2019s FP7 Program\nunder grant agreements n\u00b0 216886 (PASCAL2) and n\u00b0 270327 (CompLACS). The first author\nis currently funded by the Austrian Science Fund (FWF): J 3259-N13.\n\nReferences\n\n[1] Yasin Abbasi-Yadkori and Csaba Szepesv\u00e1ri. Regret bounds for the adaptive control of linear quadratic systems. COLT 2011, JMLR Proceedings Track, 19:1\u201326, 2011.\n\n[2] Peter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Mach. Learn., 47:235\u2013256, 2002.\n\n[3] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32:48\u201377, 2002.\n\n[4] Peter Auer, Ronald Ortner, and Csaba Szepesv\u00e1ri. Improved rates for the stochastic continuum-armed bandit problem. In Learning Theory, 20th Annual Conference on Learning Theory, COLT 2007, pages 454\u2013468, 2007.\n\n[5] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proc. 25th Conference on Uncertainty in Artificial Intelligence, UAI 2009, pages 25\u201342, 2009.\n\n[6] Andrey Bernstein and Nahum Shimkin. 
Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains. Mach. Learn., 81(3):359\u2013397, 2010.\n\n[7] Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, and Nicholas Roy. Provably efficient learning with typed parametric models. J. Mach. Learn. Res., 10:1955\u20131988, 2009.\n\n[8] S\u00e9bastien Bubeck, R\u00e9mi Munos, Gilles Stoltz, and Csaba Szepesv\u00e1ri. Online optimization of \u03c7-armed bandits. In Advances in Neural Information Processing Systems 22, NIPS 2009, pages 201\u2013208, 2010.\n\n[9] On\u00e9simo Hern\u00e1ndez-Lerma and Jean Bernard Lasserre. Discrete-time Markov control processes, volume 30 of Applications of mathematics. Springer, 1996.\n\n[10] On\u00e9simo Hern\u00e1ndez-Lerma and Jean Bernard Lasserre. Further topics on discrete-time Markov control processes, volume 42 of Applications of mathematics. Springer, 1999.\n\n[11] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563\u20131600, 2010.\n\n[12] Nicholas K. Jong and Peter Stone. Model-based exploration in continuous state spaces. In Abstraction, Reformulation, and Approximation, 7th International Symposium, SARA 2007, pages 258\u2013272. Springer, 2007.\n\n[13] Sham Kakade, Michael J. Kearns, and John Langford. Exploration in metric state spaces. In Machine Learning, Proc. 20th International Conference, ICML 2003, pages 306\u2013312, 2003.\n\n[14] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209\u2013232, 2002.\n\n[15] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, NIPS 2004, pages 697\u2013704, 2005.\n\n[16] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. 
In Proc. 40th Annual ACM Symposium on Theory of Computing, STOC 2008, pages 681\u2013690, 2008.\n\n[17] Odalric-Ambrym Maillard, R\u00e9mi Munos, and Daniil Ryabko. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems 24, NIPS 2011, pages 2627\u20132635, 2012.\n\n[18] Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, and Daniil Ryabko. Optimal regret bounds for selecting the state representation in reinforcement learning. Accepted for ICML 2013.\n\n[19] Gerhard Neumann, Michael Pfeiffer, and Wolfgang Maass. Efficient continuous-time reinforcement learning with adaptive state graphs. In Machine Learning: ECML 2007, 18th European Conference on Machine Learning, pages 250\u2013261, 2007.\n\n[20] Ali Nouri and Michael L. Littman. Multi-resolution exploration in continuous spaces. In Advances in Neural Information Processing Systems 21, NIPS 2008, pages 1209\u20131216, 2009.\n\n[21] Ronald Ortner. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Ann. Oper. Res., 2012. doi:10.1007/s10479-12-1064-y, to appear.\n\n[22] Ronald Ortner, Daniil Ryabko, Peter Auer, and R\u00e9mi Munos. Regret bounds for restless Markov bandits. In Proc. 23rd Conference on Algorithmic Learning Theory, ALT 2012, pages 214\u2013228, 2012.\n\n[23] Alexander L. Strehl and Michael L. Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems 20, NIPS 2007, pages 1417\u20131424, 2008.\n\n[24] William T. B. Uther and Manuela M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proc. 
15th National Conference on Artificial Intelligence and 10th Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, pages 769\u2013774, 1998.\n", "award": [], "sourceid": 861, "authors": [{"given_name": "Ronald", "family_name": "Ortner", "institution": null}, {"given_name": "Daniil", "family_name": "Ryabko", "institution": null}]}