{"title": "Learning Unknown Markov Decision Processes: A Thompson Sampling Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1333, "page_last": 1342, "abstract": "We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion happens when the number of visits to any state-action pair is doubled. We establish $\\tilde O(HS\\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.", "full_text": "Learning Unknown Markov Decision Processes:\n\nA Thompson Sampling Approach\n\nYi Ouyang\n\nUniversity of California, Berkeley\n\nouyangyi@berkeley.edu\n\nMukul Gagrani\n\nUniversity of Southern California\n\nmgagrani@usc.edu\n\nAshutosh Nayyar\n\nUniversity of Southern California\n\nashutosn@usc.edu\n\nRahul Jain\n\nUniversity of Southern California\n\nrahul.jain@usc.edu\n\nAbstract\n\nWe consider the problem of learning an unknown Markov Decision Process (MDP)\nthat is weakly communicating in the in\ufb01nite horizon setting. We propose a Thomp-\nson Sampling-based reinforcement learning algorithm with dynamic episodes\n(TSDE). 
At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion triggers when the number of visits to some state-action pair is doubled. We establish Õ(HS√(AT)) bounds on expected regret under a Bayesian setting, where S and A are the sizes of the state and action spaces, T is time, and H is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.\n\n1 Introduction\n\nWe consider the problem of reinforcement learning by an agent interacting with an environment while trying to minimize the total cost accumulated over time. The environment is modeled by an infinite horizon Markov Decision Process (MDP) with finite state and action spaces. When the environment is perfectly known, the agent can determine optimal actions by solving a dynamic program for the MDP [1]. In reinforcement learning, however, the agent is uncertain about the true dynamics of the MDP. A naive approach to an unknown model is the certainty equivalence principle: estimate the unknown MDP parameters from available information and then choose actions as if the estimates were the true parameters. But it is well known in adaptive control theory that the certainty equivalence principle may lead to suboptimal performance due to the lack of exploration [2]. 
This issue stems from the fundamental exploration-exploitation trade-off: the agent wants to exploit available information to minimize cost, but it also needs to explore the environment to learn the system dynamics.\n\nOne common way to handle the exploration-exploitation trade-off is the optimism in the face of uncertainty (OFU) principle [3]. Under this principle, the agent constructs confidence sets for the system parameters at each time, finds the optimistic parameters that are associated with the minimum cost, and then selects an action based on the optimistic parameters. The optimism procedure encourages exploration of rarely visited states and actions. Several optimistic algorithms have been proved to possess strong theoretical performance guarantees [4-10].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAn alternative way to incentivize exploration is the Thompson Sampling (TS) or Posterior Sampling method. The idea of TS was first proposed by Thompson in [11] for stochastic bandit problems. It has been applied to MDP environments [12-17] where the agent computes the posterior distribution of the unknown parameters using observed information and a prior distribution. A TS algorithm generally proceeds in episodes: at the beginning of each episode a set of MDP parameters is randomly sampled from the posterior distribution, and actions are then selected based on the sampled model during the episode. TS algorithms have the following advantages over optimistic algorithms. First, TS algorithms can easily incorporate problem structure through the prior distribution. Second, they are more computationally efficient, since a TS algorithm only needs to solve the sampled MDP, while an optimistic algorithm requires solving all MDPs that lie within the confidence sets. 
Third, empirical studies suggest that TS algorithms outperform optimistic algorithms in bandit problems [18, 19] as well as in MDP environments [13, 16, 17].\n\nDue to the above advantages, we focus on TS algorithms for the MDP learning problem. The main challenge in the design of a TS algorithm is the choice of episode lengths. For finite horizon MDPs under the episodic setting, the length of each episode can be set to the time horizon [13]. When there exists a recurrent state under every stationary policy, the TS algorithm of [15] starts a new episode whenever the system enters the recurrent state. However, these methods for ending an episode cannot be applied to MDPs without such special features. The work of [16] proposed a dynamic episode schedule based on the doubling trick used in [7], but a mistake in their proof of the regret bound was pointed out by [20]. In view of the mistake in [16], there is no TS algorithm with strong performance guarantees for general MDPs, to the best of our knowledge.\n\nWe consider the most general subclass of weakly communicating MDPs in which meaningful finite time regret guarantees can be established. We propose the Thompson Sampling with Dynamic Episodes (TSDE) learning algorithm. In TSDE, there are two stopping criteria for an episode to end. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is a doubling trick, similar to the one in [7-10, 16], that stops when the number of visits to some state-action pair is doubled. Under a Bayesian framework, we show that the expected regret of TSDE accumulated up to time T is bounded by Õ(HS√(AT)) where Õ hides logarithmic factors. Here S and A are the sizes of the state and action spaces, T is time, and H is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs [7], and it matches the theoretical lower bound in order of T up to logarithmic factors. 
We present numerical results showing that TSDE outperforms current algorithms with known regret bounds of the same order in T, on a benchmark MDP problem as well as on randomly generated MDPs.\n\n2 Problem Formulation\n\n2.1 Preliminaries\n\nAn infinite horizon Markov Decision Process (MDP) is described by (S, A, c, θ). Here S is the state space, A is the action space, c : S × A → [0, 1] is the cost function,¹ and θ : S² × A → [0, 1] represents the transition probabilities such that θ(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a), where s_t ∈ S and a_t ∈ A are the state and the action at t = 1, 2, 3, . . . . We assume that S and A are finite spaces with sizes S ≥ 2 and A ≥ 2, and that the initial state s_1 is a known and fixed state. A stationary policy is a deterministic map π : S → A that maps a state to an action. The average cost per stage of a stationary policy is defined as\n\nJ_π(θ) = limsup_{T→∞} (1/T) E[ Σ_{t=1}^{T} c(s_t, a_t) ].\n\nHere we use J_π(θ) to explicitly show the dependence of the average cost on θ.\n\nTo have meaningful finite time regret bounds, we consider the subclass of weakly communicating MDPs defined as follows.\n\nDefinition 1. 
An MDP is weakly communicating (or weakly accessible) if its states can be partitioned into two subsets: in the first subset all states are transient under every stationary policy, and every two states in the second subset can be reached from each other under some stationary policy.\n\n¹Since S and A are finite, we can normalize the cost function to [0, 1] without loss of generality.\n\nFrom MDP theory [1], we know that if the MDP is weakly communicating, the optimal average cost per stage J(θ) = min_π J_π(θ) satisfies the Bellman equation\n\nJ(θ) + v(s, θ) = min_{a∈A} { c(s, a) + Σ_{s′∈S} θ(s′|s, a) v(s′, θ) }   (1)\n\nfor all s ∈ S. The corresponding optimal stationary policy π* is the minimizer of the above optimization:\n\na = π*(s, θ).   (2)\n\nSince the cost function c(s, a) ∈ [0, 1], we have J(θ) ∈ [0, 1] for all θ. If v satisfies the Bellman equation, v plus any constant also satisfies the Bellman equation. Without loss of generality, let min_{s∈S} v(s, θ) = 0 and define the span of the MDP as sp(θ) = max_{s∈S} v(s, θ).²\n\nWe define Ω* to be the set of all θ such that the MDP with transition probabilities θ is weakly communicating and there exists a number H such that sp(θ) ≤ H. We will focus on MDPs with transition probabilities in the set Ω*.\n\n2.2 Reinforcement Learning for Weakly Communicating MDPs\n\nWe consider the reinforcement learning problem of an agent interacting with a random weakly communicating MDP (S, A, c, θ*). We assume that S, A and the cost function c are completely known to the agent. The actual transition probabilities θ* are randomly generated at the beginning, before the MDP starts interacting with the agent. 
The value of θ* is then fixed but unknown to the agent. The complete knowledge of the cost is typical, as in [7, 15]. Algorithms can generally be extended to the case of unknown costs/rewards at the expense of a constant factor in the regret bound.\n\nAt each time t, the agent selects an action according to a_t = φ_t(h_t), where h_t = (s_1, s_2, . . . , s_t, a_1, a_2, . . . , a_{t-1}) is the history of states and actions. The collection φ = (φ_1, φ_2, . . . ) is called a learning algorithm. The functions φ_t allow for the possibility of randomization over actions at each time.\n\nWe focus on a Bayesian framework for the unknown parameter θ*. Let μ_1 be the prior distribution for θ*, i.e., for any set Θ, P(θ* ∈ Θ) = μ_1(Θ). We make the following assumption on μ_1.\n\nAssumption 1. The support of the prior distribution μ_1 is a subset of Ω*. That is, the MDP is weakly communicating and sp(θ*) ≤ H.\n\nIn this Bayesian framework, we define the expected regret (also called Bayesian regret or Bayes risk) of a learning algorithm φ up to time T as\n\nR(T, φ) = E[ Σ_{t=1}^{T} ( c(s_t, a_t) − J(θ*) ) ]   (3)\n\nwhere s_t, a_t, t = 1, . . . , T are generated by φ and J(θ*) is the optimal per stage cost of the MDP. The above expectation is with respect to the prior distribution μ_1 for θ*, the randomness in state transitions, and the randomized algorithm. The expected regret is an important metric to quantify the performance of a learning algorithm.\n\n3 Thompson Sampling with Dynamic Episodes\n\nIn this section, we propose the Thompson Sampling with Dynamic Episodes (TSDE) learning algorithm. The input of TSDE is the prior distribution μ_1. 
At each time t, given the history h_t, the agent can compute the posterior distribution μ_t given by μ_t(Θ) = P(θ* ∈ Θ | h_t) for any set Θ. Upon applying the action a_t and observing the new state s_{t+1}, the posterior distribution at t + 1 is updated according to Bayes' rule as\n\nμ_{t+1}(dθ) = θ(s_{t+1}|s_t, a_t) μ_t(dθ) / ∫ θ′(s_{t+1}|s_t, a_t) μ_t(dθ′).   (4)\n\n²See [7] for a discussion of the connection of the span with other parameters, such as the diameter, appearing in the lower bound on regret.\n\nLet N_t(s, a) be the number of visits to the state-action pair (s, a) before time t. That is,\n\nN_t(s, a) = |{τ < t : (s_τ, a_τ) = (s, a)}|.   (5)\n\nWith these notations, TSDE is described as follows.\n\nAlgorithm 1 Thompson Sampling with Dynamic Episodes (TSDE)\n\nInput: μ_1\nInitialization: t ← 1, t_k ← 0\nfor episodes k = 1, 2, . . . do\n  T_{k-1} ← t − t_k\n  t_k ← t\n  Generate θ_k ∼ μ_{t_k} and compute π_k(·) = π*(·, θ_k) from (1)-(2)\n  while t ≤ t_k + T_{k-1} and N_t(s, a) ≤ 2 N_{t_k}(s, a) for all (s, a) ∈ S × A do\n    Apply action a_t = π_k(s_t)\n    Observe new state s_{t+1}\n    Update μ_{t+1} according to (4)\n    t ← t + 1\n  end while\nend for\n\nThe TSDE algorithm operates in episodes. Let t_k be the start time of the kth episode and T_k = t_{k+1} − t_k be the length of the episode, with the convention T_0 = 1. From the description of the algorithm, t_1 = 1 and t_{k+1}, k ≥ 1, is given by\n\nt_{k+1} = min{ t > t_k : t > t_k + T_{k-1} or N_t(s, a) > 2 N_{t_k}(s, a) for some (s, a) }.   (6)\n\nAt the beginning of episode k, a parameter θ_k is sampled from the posterior distribution μ_{t_k}. During each episode k, actions are generated from the optimal stationary policy π_k for the sampled parameter θ_k. 
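The episode loop of Algorithm 1 can be sketched in code. Below is a minimal sketch, assuming an independent Dirichlet prior over each transition row θ(·|s, a), so that the Bayes update (4) reduces to a conjugate count update; the environment interface `env_step` and the average-cost solver `solve_avg_cost_mdp` standing in for steps (1)-(2) are hypothetical caller-supplied functions, not part of the paper.

```python
import numpy as np

def tsde(env_step, S, A, cost, solve_avg_cost_mdp, T, rng=None):
    """Sketch of the TSDE loop with a product-Dirichlet prior.

    env_step(s, a) -> next state (assumed environment interface).
    solve_avg_cost_mdp(theta, cost) -> optimal stationary policy, an
    array pi with pi[s] in {0, ..., A-1} (assumed solver for (1)-(2)).
    Returns the total cost accumulated over T steps.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    alpha = np.full((S, A, S), 0.1)     # Dirichlet posterior parameters
    N = np.zeros((S, A), dtype=int)     # visit counts N_t(s, a)
    s, t, T_prev, total_cost = 0, 1, 1, 0.0
    while t <= T:
        t_k = t
        N_k = N.copy()                  # counts at the episode start
        # Sample theta_k ~ posterior and compute the episode policy pi_k
        theta = np.array([[rng.dirichlet(alpha[si, ai]) for ai in range(A)]
                          for si in range(S)])
        pi = solve_avg_cost_mdp(theta, cost)
        # Two stopping criteria: bounded episode growth, doubled visits
        while t <= T and t <= t_k + T_prev and np.all(N <= 2 * N_k):
            a = pi[s]
            s_next = env_step(s, a)
            total_cost += cost[s, a]
            alpha[s, a, s_next] += 1.0  # conjugate form of update (4)
            N[s, a] += 1
            s, t = s_next, t + 1
        T_prev = t - t_k                # T_{k-1} for the next episode
    return total_cost
```

Note how the first criterion (`t <= t_k + T_prev`) only needs the previous episode's length, and the second compares the live counts `N` against the snapshot `N_k`, exactly mirroring (6).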
One important feature of TSDE is that its episode lengths are not fixed. The length T_k of each episode is dynamically determined according to two stopping criteria: (i) t > t_k + T_{k-1}, and (ii) N_t(s, a) > 2 N_{t_k}(s, a) for some state-action pair (s, a). The first stopping criterion ensures that the episode length grows at most at a linear rate when the second criterion is not triggered. The second stopping criterion ensures that the number of visits to any state-action pair (s, a) is at most doubled during an episode.\n\nRemark 1. Note that TSDE only requires knowledge of S, A, c, and the prior distribution μ_1. TSDE can operate without knowledge of the time horizon T, the bound H on the span used in [7], or any knowledge about the actual θ*, such as the recurrent state needed in [15].\n\n3.1 Main Result\n\nTheorem 1. Under Assumption 1,\n\nR(T, TSDE) ≤ (H + 1)√(2SAT log T) + 49HS√(AT log(AT)).\n\nThe proof of Theorem 1 appears in Section 4.\n\nRemark 2. Note that our regret bound has the same order in H, S, A and T as that of the optimistic algorithm in [7], which is the best available bound for weakly communicating MDPs. Moreover, the bound does not depend on the prior distribution or on other problem-dependent parameters such as the recurrent time of the optimal policy used in the regret bound of [15].\n\n3.2 Approximation Error\n\nAt the beginning of each episode, TSDE computes the optimal stationary policy π_k for the parameter θ_k. This step requires solving a fixed finite MDP. Policy iteration or value iteration can be used to solve the sampled MDP, but the resulting stationary policy may be only approximately optimal in practice. 
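For concreteness, the sampled MDP can be solved by relative value iteration for the average-cost Bellman equation (1). The following is a minimal sketch of that standard method (not the authors' code), with a damping step added as the usual aperiodicity transformation so the iteration is robust to periodic models.

```python
import numpy as np

def relative_value_iteration(theta, cost, tol=1e-8, max_iter=100_000):
    """Solve the average-cost Bellman equation (1) by relative value
    iteration (standard method; a sketch, not the paper's implementation).

    theta: (S, A, S) transition probabilities, cost: (S, A) costs.
    Returns (J, v, pi): optimal average cost, relative value function
    normalized so min_s v(s) = 0, and an optimal stationary policy.
    """
    S, A, _ = theta.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        # Q(s, a) = c(s, a) + sum_{s'} theta(s'|s, a) v(s')
        Q = cost + theta @ v
        # Damped update (aperiodicity transform), then normalize
        v_new = 0.5 * v + 0.5 * Q.min(axis=1)
        v_new -= v_new.min()            # enforce min_s v(s) = 0
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    Q = cost + theta @ v
    J = (Q.min(axis=1) - v).mean()      # at convergence, Tv - v = J * 1
    pi = Q.argmin(axis=1)
    return J, v, pi
```

At the fixed point the Bellman residual Tv − v is a constant vector, which recovers J(θ); an approximate fixed point yields exactly the kind of ε-approximate policy discussed next.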
We call π an ε-approximate policy if\n\nc(s, π(s)) + Σ_{s′∈S} θ(s′|s, π(s)) v(s′, θ) ≤ min_{a∈A} { c(s, a) + Σ_{s′∈S} θ(s′|s, a) v(s′, θ) } + ε.\n\nWhen the algorithm returns an ε_k-approximate policy π̃_k instead of the optimal stationary policy π_k at episode k, we have the following regret bound in the presence of such approximation error.\n\nTheorem 2. If TSDE computes an ε_k-approximate policy π̃_k instead of the optimal stationary policy π_k at each episode k, the expected regret of TSDE satisfies\n\nR(T, TSDE) ≤ Õ(HS√(AT)) + E[ Σ_{k: t_k ≤ T} T_k ε_k ].\n\nFurthermore, if ε_k ≤ 1/(k+1), then E[ Σ_{k: t_k ≤ T} T_k ε_k ] ≤ √(2SAT log T).\n\nTheorem 2 shows that the approximation error in the computation of the optimal stationary policy is only additive to the regret under TSDE. The regret bound remains Õ(HS√(AT)) if the approximation error satisfies ε_k ≤ 1/(k+1). The proof of Theorem 2 is in the appendix due to lack of space.\n\n4 Analysis\n\n4.1 Number of Episodes\n\nTo analyze the performance of TSDE over T time steps, define K_T = arg max{k : t_k ≤ T} to be the number of episodes of TSDE until time T. Note that K_T is a random variable because the visit counts N_t(s, a) depend on the random state trajectory. In the analysis for time T we use the convention t_{K_T + 1} = T + 1. We provide an upper bound on K_T as follows.\n\nLemma 1.\n\nK_T ≤ √(2SAT log T).\n\nProof. Define macro episodes with start times t_{n_i}, i = 1, 2, . . . 
where t_{n_1} = t_1 and\n\nt_{n_{i+1}} = min{ t_k > t_{n_i} : N_{t_k}(s, a) > 2 N_{t_{k-1}}(s, a) for some (s, a) }.\n\nThe idea is that each macro episode starts when the second stopping criterion is triggered. Let M be the number of macro episodes until time T and define n_{M+1} = K_T + 1.\n\nLet T̃_i = Σ_{k=n_i}^{n_{i+1}−1} T_k be the length of the ith macro episode. By the definition of macro episodes, every episode except the last one in a macro episode must be ended by the first stopping criterion. Therefore, within the ith macro episode, T_k = T_{k-1} + 1 for all k = n_i, n_i + 1, . . . , n_{i+1} − 2. Hence,\n\nT̃_i = Σ_{k=n_i}^{n_{i+1}−1} T_k = Σ_{j=1}^{n_{i+1}−n_i−1} (T_{n_i − 1} + j) + T_{n_{i+1}−1} ≥ Σ_{j=1}^{n_{i+1}−n_i−1} (j + 1) + 1 = 0.5 (n_{i+1} − n_i)(n_{i+1} − n_i + 1).   (7)\n\nConsequently, n_{i+1} − n_i ≤ √(2T̃_i) for all i = 1, . . . , M. From this property we obtain\n\nK_T = n_{M+1} − 1 = Σ_{i=1}^{M} (n_{i+1} − n_i) ≤ Σ_{i=1}^{M} √(2T̃_i) ≤ √( M Σ_{i=1}^{M} 2T̃_i ) = √(2MT)   (8)\n\nwhere the second inequality is Cauchy-Schwarz and the last equality uses Σ_{i=1}^{M} T̃_i = T.\n\nFrom Lemma 6 in the appendix, the number of macro episodes satisfies M ≤ SA log T. Substituting this bound into (8), we obtain the result of the lemma.\n\nRemark 3. TSDE computes the optimal stationary policy of a finite MDP at each episode. Lemma 1 ensures that this computation only needs to be done at a sublinear rate of √(2SAT log T).\n\n4.2 Regret Bound\n\nAs discussed in [13, 20, 21], one key property of Thompson/Posterior Sampling algorithms is that E[f(θ_t)] = E[f(θ*)] for any function f, if θ_t is sampled from the posterior distribution at time t. 
This property leads to regret bounds for algorithms with fixed sampling episodes, since the start time t_k of each episode is then deterministic. However, our TSDE algorithm has dynamic episodes, which requires a stopping-time version of the above property.\n\nLemma 2. Under TSDE, t_k is a stopping time for any episode k. Then for any measurable function f and any σ(h_{t_k})-measurable random variable X, we have\n\nE[ f(θ_k, X) ] = E[ f(θ*, X) ].\n\nProof. From the definition (6), the start time t_k is a stopping time, i.e., t_k is σ(h_{t_k})-measurable. Note that θ_k is randomly sampled from the posterior distribution μ_{t_k}. Since t_k is a stopping time, t_k and μ_{t_k} are both measurable with respect to σ(h_{t_k}). By assumption, X is also measurable with respect to σ(h_{t_k}). Then, conditioned on h_{t_k}, the only randomness in f(θ_k, X) is the random sampling in the algorithm. This gives\n\nE[ f(θ_k, X) | h_{t_k} ] = E[ f(θ_k, X) | h_{t_k}, t_k, μ_{t_k} ] = ∫ f(θ, X) μ_{t_k}(dθ) = E[ f(θ*, X) | h_{t_k} ]   (9)\n\nsince μ_{t_k} is the posterior distribution of θ* given h_{t_k}. The result follows by taking the expectation of both sides.\n\nFor t_k ≤ t < t_{k+1} in episode k, the Bellman equation (1) holds by Assumption 1 for s = s_t, θ = θ_k and action a_t = π_k(s_t). 
Then we obtain\n\nc(s_t, a_t) = J(θ_k) + v(s_t, θ_k) − Σ_{s′∈S} θ_k(s′|s_t, a_t) v(s′, θ_k).   (10)\n\nUsing (10), the expected regret of TSDE is equal to\n\nE[ Σ_{k=1}^{K_T} Σ_{t=t_k}^{t_{k+1}−1} c(s_t, a_t) ] − T E[ J(θ*) ] = R_0 + R_1 + R_2,   (11)\n\nwhere R_0, R_1 and R_2 are given by\n\nR_0 = E[ Σ_{k=1}^{K_T} T_k J(θ_k) ] − T E[ J(θ*) ],\n\nR_1 = E[ Σ_{k=1}^{K_T} Σ_{t=t_k}^{t_{k+1}−1} ( v(s_t, θ_k) − v(s_{t+1}, θ_k) ) ],\n\nR_2 = E[ Σ_{k=1}^{K_T} Σ_{t=t_k}^{t_{k+1}−1} ( v(s_{t+1}, θ_k) − Σ_{s′∈S} θ_k(s′|s_t, a_t) v(s′, θ_k) ) ].\n\nWe proceed to derive bounds on R_0, R_1 and R_2.\n\nBased on the key property of Lemma 2, we first derive an upper bound on R_0.\n\nLemma 3. The first term R_0 is bounded as\n\nR_0 ≤ E[K_T].\n\nProof. From the monotone convergence theorem we have\n\nR_0 = E[ Σ_{k=1}^{∞} 1{t_k ≤ T} T_k J(θ_k) ] − T E[ J(θ*) ] = Σ_{k=1}^{∞} E[ 1{t_k ≤ T} T_k J(θ_k) ] − T E[ J(θ*) ].\n\nNote that the first stopping criterion of TSDE ensures that T_k ≤ T_{k-1} + 1 for all k. 
Because J(θ_k) ≥ 0, each term in the first summation satisfies\n\nE[ 1{t_k ≤ T} T_k J(θ_k) ] ≤ E[ 1{t_k ≤ T} (T_{k-1} + 1) J(θ_k) ].\n\nNote that 1{t_k ≤ T}(T_{k-1} + 1) is measurable with respect to σ(h_{t_k}). Then, Lemma 2 gives\n\nE[ 1{t_k ≤ T} (T_{k-1} + 1) J(θ_k) ] = E[ 1{t_k ≤ T} (T_{k-1} + 1) J(θ*) ].\n\nCombining the above equations we get\n\nR_0 ≤ Σ_{k=1}^{∞} E[ 1{t_k ≤ T} (T_{k-1} + 1) J(θ*) ] − T E[ J(θ*) ] = E[ Σ_{k=1}^{K_T} (T_{k-1} + 1) J(θ*) ] − T E[ J(θ*) ] = E[ K_T J(θ*) ] + E[ ( Σ_{k=1}^{K_T} T_{k-1} − T ) J(θ*) ] ≤ E[ K_T ],\n\nwhere the last inequality holds because J(θ*) ≤ 1 and Σ_{k=1}^{K_T} T_{k-1} = T_0 + Σ_{k=1}^{K_T − 1} T_k ≤ T.\n\nNote that the first stopping criterion of TSDE plays a crucial role in the proof of Lemma 3. It allows us to bound the length of an episode by the length of the previous episode, which is measurable with respect to the information available at the beginning of the episode.\n\nThe other two terms R_1 and R_2 of the regret are bounded in the following lemmas. Their proofs follow steps similar to those in [13, 16] and are given in the appendix due to lack of space.\n\nLemma 4. The second term R_1 is bounded as\n\nR_1 ≤ E[ H K_T ].\n\nLemma 5. The third term R_2 is bounded as\n\nR_2 ≤ 49HS√(AT log(AT)).\n\nWe are now ready to prove Theorem 1.\n\nProof of Theorem 1. From (11), R(T, TSDE) = R_0 + R_1 + R_2 ≤ E[K_T] + E[H K_T] + R_2, where the inequality comes from Lemma 3 and Lemma 4. 
Then the claim of the theorem directly follows from Lemma 1 and Lemma 5.\n\n5 Simulations\n\nIn this section, we compare through simulations the performance of TSDE with three learning algorithms with the same regret order: UCRL2 [8], TSMDP [15], and Lazy PSRL [16]. UCRL2 is an optimistic algorithm with similar regret bounds. TSMDP and Lazy PSRL are TS algorithms for infinite horizon MDPs. TSMDP has the same regret order in T given a recurrent state for resampling. The original regret analysis of Lazy PSRL is incorrect, but its regret bounds are conjectured to be correct [20]. We chose δ = 0.05 for the implementation of UCRL2 and assume an independent Dirichlet prior with parameters [0.1, . . . , 0.1] over the transition probabilities for all TS algorithms.\n\nWe consider two environments: randomly generated MDPs and the RiverSwim example [22]. For randomly generated MDPs, we use the independent Dirichlet prior over 6 states and 2 actions, but with a fixed cost. We select the resampling state s_0 = 1 for TSMDP here since all states are recurrent under the Dirichlet prior. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. The MDP consists of six states arranged in a chain, with the agent starting in the leftmost state (s = 1). If the agent decides to move left, i.e., with the river current, the move always succeeds, but if he decides to move right the move may fail with some probability. The cost function is given by: c(s, a) = 0.8 if s = 1, a = left; c(s, a) = 0 if s = 6, a = right; and c(s, a) = 1 otherwise. The optimal policy is to swim right to reach the rightmost state, which minimizes the cost. For TSMDP in RiverSwim, we consider two versions, with resampling state s_0 = 1 and with s_0 = 3. 
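As a concrete illustration, the RiverSwim instance described above can be written down directly. The cost entries follow the text; the slip probabilities for the right action (0.1 / 0.6 / 0.3 below) are assumptions following common RiverSwim implementations, since the paper takes the dynamics from [22] and does not reproduce them here.

```python
import numpy as np

LEFT, RIGHT = 0, 1

def make_riverswim(S=6):
    """Sketch of the RiverSwim chain MDP (costs per the text; the exact
    slip probabilities are assumed, not given in this section)."""
    theta = np.zeros((S, 2, S))     # theta[s, a, s'] = P(s' | s, a)
    cost = np.ones((S, 2))
    for s in range(S):
        # Moving left (with the current) always succeeds.
        theta[s, LEFT, max(s - 1, 0)] = 1.0
        # Moving right (against the current) may fail.
        if s == 0:
            theta[s, RIGHT, 0] = 0.4
            theta[s, RIGHT, 1] = 0.6
        elif s == S - 1:
            theta[s, RIGHT, s] = 0.9
            theta[s, RIGHT, s - 1] = 0.1
        else:
            theta[s, RIGHT, s - 1] = 0.1
            theta[s, RIGHT, s] = 0.6
            theta[s, RIGHT, s + 1] = 0.3
    cost[0, LEFT] = 0.8             # c(s, a) = 0.8 if s = 1, a = left
    cost[S - 1, RIGHT] = 0.0        # c(s, a) = 0 if s = 6, a = right
    return theta, cost

theta, cost = make_riverswim()
assert np.allclose(theta.sum(axis=2), 1.0)   # each row is a distribution
```

The left action gives a cheap but suboptimal escape (cost 0.8 in state 1), while reaching the rightmost state pays cost 0, which is why exploration is essential in this benchmark.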
We simulate 500 Monte Carlo runs for both examples and run for T = 10⁵.\n\nFigure 1: Simulation results. (a) Expected regret vs. time for random MDPs. (b) Expected regret vs. time for RiverSwim.\n\nFrom Figure 1(a) we can see that TSDE outperforms all three algorithms on randomly generated MDPs. In particular, there is a significant gap between the regret of TSDE and that of UCRL2 and TSMDP. The poor performance of UCRL2 reinforces the motivation to consider TS algorithms. From the specification of TSMDP, its performance heavily hinges on the choice of an appropriate resampling state, which is not possible for a general unknown MDP. This is reflected in the randomly generated MDPs experiment.\n\nIn the RiverSwim example, Figure 1(b) shows that TSDE significantly outperforms UCRL2, Lazy PSRL, and TSMDP with s_0 = 3. Although TSMDP with s_0 = 1 performs slightly better than TSDE, there is no way to pick this specific s_0 if the MDP is unknown in practice. Since Lazy PSRL is also equipped with the doubling-trick criterion, the performance gap between TSDE and Lazy PSRL highlights the importance of the first stopping criterion on the growth rate of episode length. We would also like to point out that in this example the MDP is fixed and is not generated from the Dirichlet prior. Therefore, we conjecture that TSDE also has the same regret bounds under a non-Bayesian setting.\n\n6 Conclusion\n\nWe propose the Thompson Sampling with Dynamic Episodes (TSDE) learning algorithm and establish Õ(HS√(AT)) bounds on expected regret for the general subclass of weakly communicating MDPs. Our result fills a gap in the theoretical analysis of Thompson Sampling for MDPs. Numerical results validate that the TSDE algorithm outperforms other learning algorithms for infinite horizon MDPs.\n\nThe TSDE algorithm determines the end of an episode by two stopping criteria. 
The second criterion comes from the doubling trick used in many reinforcement learning algorithms. The first criterion, on the linear growth rate of the episode length, appears to be a new idea for episodic learning algorithms. This stopping criterion is crucial in the proof of the regret bound (Lemma 3). The simulation results of TSDE versus Lazy PSRL further show that this criterion is not merely a technical device for the proofs; it indeed helps balance exploitation and exploration.\n\nAcknowledgments\n\nYi Ouyang would like to thank Yang Liu from Harvard University for helpful discussions. Rahul Jain and Ashutosh Nayyar were supported by NSF Grants 1611574 and 1446901.\n\nReferences\n\n[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 2. Athena Scientific, Belmont, MA, 2012.\n\n[2] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. SIAM, 2015.\n\n[3] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.\n\n[4] A. N. Burnetas and M. N. Katehakis, “Optimal adaptive policies for Markov decision processes,” Mathematics of Operations Research, vol. 22, no. 1, pp. 222-255, 1997.\n\n[5] M. Kearns and S. Singh, “Near-optimal reinforcement learning in polynomial time,” Machine Learning, vol. 49, no. 2-3, pp. 209-232, 2002.\n\n[6] R. I. Brafman and M. Tennenholtz, “R-max - a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213-231, 2002.\n\n[7] P. L. Bartlett and A. 
Tewari, “REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs,” in UAI, 2009.\n\n[8] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563-1600, 2010.\n\n[9] S. Filippi, O. Cappé, and A. Garivier, “Optimism in reinforcement learning and Kullback-Leibler divergence,” in Allerton, pp. 115-122, 2010.\n\n[10] C. Dann and E. Brunskill, “Sample complexity of episodic fixed-horizon reinforcement learning,” in NIPS, 2015.\n\n[11] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285-294, 1933.\n\n[12] M. Strens, “A Bayesian framework for reinforcement learning,” in ICML, 2000.\n\n[13] I. Osband, D. Russo, and B. Van Roy, “(More) efficient reinforcement learning via posterior sampling,” in NIPS, 2013.\n\n[14] R. Fonteneau, N. Korda, and R. Munos, “An optimistic posterior sampling strategy for Bayesian reinforcement learning,” in BayesOpt2013, 2013.\n\n[15] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized Markov decision processes,” in COLT, 2015.\n\n[16] Y. Abbasi-Yadkori and C. Szepesvári, “Bayesian optimal control of smoothly parameterized systems,” in UAI, 2015.\n\n[17] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning,” EWRL, 2016.\n\n[18] S. L. Scott, “A modern Bayesian look at the multi-armed bandit,” Applied Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639-658, 2010.\n\n[19] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in NIPS, 2011.\n\n[20] I. Osband and B. 
Van Roy, “Posterior sampling for reinforcement learning without episodes,” arXiv preprint arXiv:1608.02731, 2016.\n\n[21] D. Russo and B. Van Roy, “Learning to optimize via posterior sampling,” Mathematics of Operations Research, vol. 39, no. 4, pp. 1221-1243, 2014.\n\n[22] A. L. Strehl and M. L. Littman, “An analysis of model-based interval estimation for Markov decision processes,” Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309-1331, 2008.", "award": [], "sourceid": 874, "authors": [{"given_name": "Yi", "family_name": "Ouyang", "institution": "University of California, Berkeley"}, {"given_name": "Mukul", "family_name": "Gagrani", "institution": "University of Southern California"}, {"given_name": "Ashutosh", "family_name": "Nayyar", "institution": "University of Southern California"}, {"given_name": "Rahul", "family_name": "Jain", "institution": "University of Southern California"}]}