{"title": "Near-optimal Reinforcement Learning in Factored MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 604, "page_last": 612, "abstract": "Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\\Omega(\\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \\Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \\emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \\emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).", "full_text": "Near-optimal Reinforcement Learning in Factored MDPs\n\nIan Osband\nStanford University\niosband@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nAny reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\\Omega(\\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \\Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. 
We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).\n\n1 Introduction\n\nWe consider a reinforcement learning agent that takes sequential actions within an uncertain environment with an aim to maximize cumulative reward [1]. We model the environment as a Markov decision process (MDP) whose dynamics are not fully known to the agent. The agent can learn to improve future performance by exploring poorly-understood states and actions, but might improve its short-term rewards through a policy which exploits its existing knowledge. Efficient reinforcement learning balances exploration with exploitation to earn high cumulative reward.\n\nThe vast majority of work on efficient reinforcement learning has focused upon the tabula rasa setting, where little prior knowledge is available about the environment beyond its state and action spaces. In this setting several algorithms have been designed to attain sample complexity polynomial in the number of states $S$ and actions $A$ [2, 3]. Stronger bounds on regret, the difference between an agent\u2019s cumulative reward and that of the optimal controller, have also been developed. The strongest results of this kind establish $\\tilde{O}(S\\sqrt{AT})$ regret for particular algorithms [4, 5, 6], which is close to the lower bound $\\Omega(\\sqrt{SAT})$ [4]. However, in many settings of interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that even this level of regret is unacceptable.\n\nIn many practical problems the agent will have some prior understanding of the environment beyond tabula rasa. For example, in a large production line with $m$ machines in sequence, each with $K$ possible states, we may know that over a single time-step each machine can only be influenced by its direct neighbors. 
Such simple observations can reduce the dimensionality of the learning problem exponentially, but cannot easily be exploited by a tabula rasa algorithm. Factored MDPs (FMDPs) [7], whose transitions can be represented by a dynamic Bayesian network (DBN) [8], are one effective way to represent these structured MDPs compactly.\n\nSeveral algorithms have been developed that exploit the known DBN structure to achieve sample complexity polynomial in the parameters of the FMDP, which may be exponentially smaller than $S$ or $A$ [9, 10, 11]. However, these polynomial bounds include several high order terms. We present two algorithms, UCRL-Factored and PSRL, with the first near-optimal regret bounds for factored MDPs. UCRL-Factored is an optimistic algorithm that modifies the confidence sets of UCRL2 [4] to take advantage of the network structure. PSRL is motivated by the old heuristic of Thompson sampling [12] and has been previously shown to be efficient in non-factored MDPs [13, 6]. These algorithms are described fully in Section 6.\n\nBoth algorithms make use of an approximate FMDP planner in internal steps. However, even where an FMDP can be represented concisely, solving for the optimal policy may take exponentially long in the most general case [14]. Our focus in this paper is upon the statistical aspect of the learning problem and, like earlier discussions, we do not specify which computational methods are used [10]. Our results serve as a reduction of the reinforcement learning problem to finding an approximate solution for a given FMDP. In many cases of interest, effective approximate planning methods for FMDPs do exist. Investigating and extending these methods is an ongoing subject of research [15, 16, 17, 18].\n\n2 Problem formulation\n\nWe consider the problem of learning to optimize a random finite horizon MDP $M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$ in repeated finite episodes of interaction. 
$\\mathcal{S}$ is the state space, $\\mathcal{A}$ is the action space, $R^M(s, a)$ is the reward distribution over $\\mathbb{R}$ in state $s$ with action $a$, $P^M(\\cdot|s, a)$ is the transition probability over $\\mathcal{S}$ from state $s$ with action $a$, $\\tau$ is the time horizon, and $\\rho$ is the initial state distribution. We define the MDP and all other random variables we will consider with respect to a probability space $(\\Omega, \\mathcal{F}, \\mathbb{P})$.\n\nA deterministic policy $\\mu$ is a function mapping each state $s \\in \\mathcal{S}$ and $i = 1, \\ldots, \\tau$ to an action $a \\in \\mathcal{A}$. For each MDP $M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$ and policy $\\mu$, we define a value function\n\n$$V^M_{\\mu,i}(s) := \\mathbb{E}_{M,\\mu}\\left[\\sum_{j=i}^{\\tau} \\overline{R}^M(s_j, a_j) \\,\\Big|\\, s_i = s\\right],$$\n\nwhere $\\overline{R}^M(s, a)$ denotes the expected reward realized when action $a$ is selected while in state $s$, and the subscripts of the expectation operator indicate that $a_j = \\mu(s_j, j)$ and $s_{j+1} \\sim P^M(\\cdot|s_j, a_j)$ for $j = i, \\ldots, \\tau$. A policy $\\mu$ is optimal for the MDP $M$ if $V^M_{\\mu,i}(s) = \\max_{\\mu'} V^M_{\\mu',i}(s)$ for all $s \\in \\mathcal{S}$ and $i = 1, \\ldots, \\tau$. We will associate with each MDP $M$ a policy $\\mu^M$ that is optimal for $M$.\n\nThe reinforcement learning agent interacts with the MDP over episodes that begin at times $t_k = (k - 1)\\tau + 1$, $k = 1, 2, \\ldots$. At each time $t$, the agent selects an action $a_t$, observes a scalar reward $r_t$, and then transitions to $s_{t+1}$. Let $H_t = (s_1, a_1, r_1, \\ldots, s_{t-1}, a_{t-1}, r_{t-1})$ denote the history of observations made prior to time $t$. A reinforcement learning algorithm is a deterministic sequence $\\{\\pi_k \\mid k = 1, 2, \\ldots\\}$ of functions, each mapping $H_{t_k}$ to a probability distribution $\\pi_k(H_{t_k})$ over policies which the agent will employ during the $k$th episode. 
We define the regret incurred by a reinforcement learning algorithm $\\pi$ up to time $T$ to be:\n\n$$\\mathrm{Regret}(T, \\pi, M^*) := \\sum_{k=1}^{\\lceil T/\\tau \\rceil} \\Delta_k,$$\n\nwhere $\\Delta_k$ denotes regret over the $k$th episode, defined with respect to the MDP $M^*$ by\n\n$$\\Delta_k := \\sum_{s \\in \\mathcal{S}} \\rho(s)\\left(V^{M^*}_{\\mu^*,1}(s) - V^{M^*}_{\\mu_k,1}(s)\\right)$$\n\nwith $\\mu^* = \\mu^{M^*}$ and $\\mu_k \\sim \\pi_k(H_{t_k})$. Note that regret is not deterministic since it can depend on the random MDP $M^*$, the algorithm\u2019s internal random sampling and, through the history $H_{t_k}$, on previous random transitions and random rewards. We will assess and compare algorithm performance in terms of regret and its expectation.\n\n3 Factored MDPs\n\nIntuitively a factored MDP is an MDP whose rewards and transitions exhibit some conditional independence structure. To formalize this definition we must introduce some more notation common to the literature [11].\n\nDefinition 1 (Scope operation for factored sets $\\mathcal{X} = \\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_n$).\nFor any subset of indices $Z \\subseteq \\{1, 2, \\ldots, n\\}$ let us define the scope set $\\mathcal{X}[Z] := \\bigotimes_{i \\in Z} \\mathcal{X}_i$. Further, for any $x \\in \\mathcal{X}$ define the scope variable $x[Z] \\in \\mathcal{X}[Z]$ to be the value of the variables $x_i \\in \\mathcal{X}_i$ with indices $i \\in Z$. For singleton sets $Z$ we will write $x[i]$ for $x[\\{i\\}]$ in the natural way.\n\nLet $\\mathcal{P}_{\\mathcal{X},\\mathcal{Y}}$ be the set of functions mapping elements of a finite set $\\mathcal{X}$ to probability mass functions over a finite set $\\mathcal{Y}$. $\\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathbb{R}}$ will denote the set of functions mapping elements of a finite set $\\mathcal{X}$ to $\\sigma$-sub-Gaussian probability measures over $(\\mathbb{R}, \\mathcal{B}(\\mathbb{R}))$ with mean bounded in $[0, C]$. 
For reinforcement learning we will write $\\mathcal{X}$ for $\\mathcal{S} \\times \\mathcal{A}$ and consider factored reward and factored transition functions which are drawn from within these families.\n\nDefinition 2 (Factored reward functions $R \\in \\mathcal{R} \\subseteq \\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathbb{R}}$).\nThe reward function class $\\mathcal{R}$ is factored over $\\mathcal{S} \\times \\mathcal{A} = \\mathcal{X} = \\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_n$ with scopes $Z_1, \\ldots, Z_l$ if and only if, for all $R \\in \\mathcal{R}$, $x \\in \\mathcal{X}$ there exist functions $\\{R_i \\in \\mathcal{P}^{C,\\sigma}_{\\mathcal{X}[Z_i],\\mathbb{R}}\\}_{i=1}^{l}$ such that\n\n$$\\mathbb{E}[r] = \\sum_{i=1}^{l} \\mathbb{E}[r_i]$$\n\nfor $r \\sim R(x)$ equal to $\\sum_{i=1}^{l} r_i$, with each $r_i \\sim R_i(x[Z_i])$ individually observed.\n\nDefinition 3 (Factored transition functions $P \\in \\mathcal{P} \\subseteq \\mathcal{P}_{\\mathcal{X},\\mathcal{S}}$).\nThe transition function class $\\mathcal{P}$ is factored over $\\mathcal{S} \\times \\mathcal{A} = \\mathcal{X} = \\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_n$ and $\\mathcal{S} = \\mathcal{S}_1 \\times \\cdots \\times \\mathcal{S}_m$ with scopes $Z_1, \\ldots, Z_m$ if and only if, for all $P \\in \\mathcal{P}$, $x \\in \\mathcal{X}$, $s \\in \\mathcal{S}$ there exist some $\\{P_i \\in \\mathcal{P}_{\\mathcal{X}[Z_i],\\mathcal{S}_i}\\}_{i=1}^{m}$ such that\n\n$$P(s|x) = \\prod_{i=1}^{m} P_i\\left(s[i] \\mid x[Z_i]\\right)$$\n\nA factored MDP (FMDP) is then defined to be an MDP with both factored rewards and factored transitions. Writing $\\mathcal{X} = \\mathcal{S} \\times \\mathcal{A}$, an FMDP is fully characterized by the tuple\n\n$$M = \\left(\\{\\mathcal{S}_i\\}_{i=1}^{m}; \\{\\mathcal{X}_i\\}_{i=1}^{n}; \\{Z^R_i\\}_{i=1}^{l}; \\{R_i\\}_{i=1}^{l}; \\{Z^P_i\\}_{i=1}^{m}; \\{P_i\\}_{i=1}^{m}; \\tau; \\rho\\right),$$\n\nwhere $Z^R_i$ and $Z^P_i$ are the scopes for the reward and transition functions respectively in $\\{1, \\ldots, n\\}$ for $\\mathcal{X}_i$. 
We assume that the size of all scopes $|Z_i| \\le \\zeta \\ll n$ and factors $|\\mathcal{X}_i| \\le K$ so that the domains of $R_i$ and $P_i$ are of size at most $K^\\zeta$.\n\n4 Results\n\nOur first result shows that we can bound the expected regret of PSRL.\n\nTheorem 1 (Expected regret for PSRL in factored MDPs).\nLet $M^*$ be factored with graph structure $G = \\left(\\{\\mathcal{S}_i\\}_{i=1}^{m}; \\{\\mathcal{X}_i\\}_{i=1}^{n}; \\{Z^R_i\\}_{i=1}^{l}; \\{Z^P_i\\}_{i=1}^{m}; \\tau\\right)$. If $\\phi$ is the distribution of $M^*$ and $\\Psi$ is the span of the optimal value function then we can bound the regret of PSRL:\n\n$$\\mathbb{E}\\left[\\mathrm{Regret}(T, \\pi^{PS}_\\tau, M^*)\\right] \\le \\sum_{i=1}^{l} \\left\\{5\\tau C |\\mathcal{X}[Z^R_i]| + 12\\sigma \\sqrt{|\\mathcal{X}[Z^R_i]| T \\log\\left(4 l |\\mathcal{X}[Z^R_i]| k T\\right)}\\right\\} + 2\\sqrt{T} + 4 + \\mathbb{E}[\\Psi]\\left(1 + \\frac{4}{T - 4}\\right) \\sum_{j=1}^{m} \\left\\{5\\tau |\\mathcal{X}[Z^P_j]| + 12 \\sqrt{|\\mathcal{X}[Z^P_j]| |\\mathcal{S}_j| T \\log\\left(4 m |\\mathcal{X}[Z^P_j]| k T\\right)}\\right\\} \\tag{1}$$\n\nWe have a similar result for UCRL-Factored that holds with high probability.\n\nTheorem 2 (High probability regret for UCRL-Factored in factored MDPs).\nLet $M^*$ be factored with graph structure $G = \\left(\\{\\mathcal{S}_i\\}_{i=1}^{m}; \\{\\mathcal{X}_i\\}_{i=1}^{n}; \\{Z^R_i\\}_{i=1}^{l}; \\{Z^P_i\\}_{i=1}^{m}; \\tau\\right)$. 
If $D$ is the diameter of $M^*$, then for any $M^*$ we can bound the regret of UCRL-Factored:\n\n$$\\mathrm{Regret}(T, \\pi^{UC}_\\tau, M^*) \\le \\sum_{i=1}^{l} \\left\\{5\\tau C |\\mathcal{X}[Z^R_i]| + 12\\sigma \\sqrt{|\\mathcal{X}[Z^R_i]| T \\log\\left(12 l |\\mathcal{X}[Z^R_i]| k T / \\delta\\right)}\\right\\} + 2\\sqrt{T} + CD\\sqrt{2T \\log(6/\\delta)} + CD \\sum_{j=1}^{m} \\left\\{5\\tau |\\mathcal{X}[Z^P_j]| + 12 \\sqrt{|\\mathcal{X}[Z^P_j]| |\\mathcal{S}_j| T \\log\\left(12 m |\\mathcal{X}[Z^P_j]| k T / \\delta\\right)}\\right\\} \\tag{2}$$\n\nwith probability at least $1 - \\delta$.\n\nBoth algorithms give bounds $\\tilde{O}\\left(\\Xi \\sum_{j=1}^{m} \\sqrt{|\\mathcal{X}[Z^P_j]| |\\mathcal{S}_j| T}\\right)$ where $\\Xi$ is a measure of MDP connectedness: expected span $\\mathbb{E}[\\Psi]$ for PSRL and scaled diameter $CD$ for UCRL-Factored. The span of an MDP is the maximum difference in value of any two states under the optimal policy, $\\Psi(M^*) := \\max_{s, s' \\in \\mathcal{S}} \\{V^{M^*}_{\\mu^*,1}(s) - V^{M^*}_{\\mu^*,1}(s')\\}$. The diameter of an MDP is the maximum number of expected timesteps to get between any two states, $D(M^*) = \\max_{s \\ne s'} \\min_{\\mu} T^{\\mu}_{s \\to s'}$. PSRL\u2019s bounds are tighter since $\\Psi(M) \\le CD(M)$ and may be exponentially smaller.\n\nHowever, UCRL-Factored has stronger probabilistic guarantees than PSRL since its bounds hold with high probability for any MDP $M^*$, not just in expectation. There is an optimistic algorithm REGAL [5] which formally replaces the UCRL2 $D$ with $\\Psi$ and retains the high probability guarantees. An analogous extension to REGAL-Factored is possible; however, no practical implementation of that algorithm exists even with an FMDP planner.\n\nThe algebra in Theorems 1 and 2 can be overwhelming. For clarity, we present a symmetric problem instance for which we can produce a cleaner single-term upper bound. 
Let $Q$ be shorthand for the simple graph structure with $l + 1 = m$, $C = \\sigma = 1$, $|\\mathcal{S}_i| = |\\mathcal{X}_i| = K$ and $|Z^R_i| = |Z^P_j| = \\zeta$ for $i = 1, \\ldots, l$ and $j = 1, \\ldots, m$; we will write $J = K^\\zeta$.\n\nCorollary 1 (Clean bounds for PSRL in a symmetric problem).\nIf $\\phi$ is the distribution of $M^*$ with structure $Q$ then we can bound the regret of PSRL:\n\n$$\\mathbb{E}\\left[\\mathrm{Regret}(T, \\pi^{PS}_\\tau, M^*)\\right] \\le 15 m \\tau \\sqrt{JKT \\log(2mJT)} \\tag{3}$$\n\nCorollary 2 (Clean bounds for UCRL-Factored in a symmetric problem).\nFor any MDP $M^*$ with structure $Q$ we can bound the regret of UCRL-Factored:\n\n$$\\mathrm{Regret}(T, \\pi^{UC}_\\tau, M^*) \\le 15 m \\tau \\sqrt{JKT \\log(12mJT/\\delta)} \\tag{4}$$\n\nwith probability at least $1 - \\delta$.\n\nBoth algorithms satisfy bounds of $\\tilde{O}(\\tau m \\sqrt{JKT})$ which is exponentially tighter than can be obtained by any $Q$-naive algorithm. For a factored MDP with $m$ independent components with $S$ states and $A$ actions the bound $\\tilde{O}(mS\\sqrt{AT})$ is close to the lower bound $\\Omega(m\\sqrt{SAT})$ and so the bound is near optimal. The corollaries follow directly from Theorems 1 and 2 as shown in Appendix B.\n\n5 Confidence sets\n\nOur analysis will rely upon the construction of confidence sets based around the empirical estimates for the underlying reward and transition functions. The confidence sets are constructed to contain the true MDP with high probability. This technique is common to the literature, but we will exploit the additional graph structure $G$ to sharpen the bounds.\n\nConsider a family of functions $\\mathcal{F} \\subseteq \\mathcal{M}_{\\mathcal{X},(\\mathcal{Y},\\Sigma_{\\mathcal{Y}})}$ which takes $x \\in \\mathcal{X}$ to a probability distribution over $(\\mathcal{Y}, \\Sigma_{\\mathcal{Y}})$. We will write $\\mathcal{M}_{\\mathcal{X},\\mathcal{Y}}$ unless we wish to stress a particular $\\sigma$-algebra.\n\nDefinition 4 (Set widths).\nLet $\\mathcal{X}$ be a finite set, and let $(\\mathcal{Y}, \\Sigma_{\\mathcal{Y}})$ be a measurable space. 
The width of a set $\\mathcal{F} \\subseteq \\mathcal{M}_{\\mathcal{X},\\mathcal{Y}}$ at $x \\in \\mathcal{X}$ with respect to a norm $\\|\\cdot\\|$ is\n\n$$w_{\\mathcal{F}}(x) := \\sup_{\\overline{f}, \\underline{f} \\in \\mathcal{F}} \\|(\\overline{f} - \\underline{f})(x)\\|$$\n\nOur confidence set sequence $\\{\\mathcal{F}_t \\subseteq \\mathcal{F} : t \\in \\mathbb{N}\\}$ is initialized with a set $\\mathcal{F}$. We adapt our confidence set to the observations $y_t \\in \\mathcal{Y}$ which are drawn from the true function $f^* \\in \\mathcal{F}$ at measurement points $x_t \\in \\mathcal{X}$ so that $y_t \\sim f^*(x_t)$. Each confidence set is then centered around an empirical estimate $\\hat{f}_t \\in \\mathcal{M}_{\\mathcal{X},\\mathcal{Y}}$ at time $t$, defined by\n\n$$\\hat{f}_t(x) = \\frac{1}{n_t(x)} \\sum_{\\tau < t : x_\\tau = x} \\delta_{y_\\tau},$$\n\nwhere $n_t(x)$ is the number of times $x$ appears in $(x_1, \\ldots, x_{t-1})$ and $\\delta_{y_t}$ is the probability mass function over $\\mathcal{Y}$ that assigns all probability to the outcome $y_t$.\n\nOur sequence of confidence sets depends on our choice of norm $\\|\\cdot\\|$ and a non-decreasing sequence $\\{d_t : t \\in \\mathbb{N}\\}$. For each $t$, the confidence set is defined by:\n\n$$\\mathcal{F}_t = \\mathcal{F}_t(\\|\\cdot\\|, x_1^{t-1}, d_t) := \\left\\{ f \\in \\mathcal{F} \\;\\Big|\\; \\|(f - \\hat{f}_t)(x_i)\\| \\le \\sqrt{\\frac{d_t}{n_t(x_i)}} \\;\\; \\forall i = 1, \\ldots, t-1 \\right\\}.$$\n\nHere $x_1^{t-1}$ is shorthand for $(x_1, \\ldots, x_{t-1})$ and we interpret $n_t(x_i) = 0$ as a null constraint. The following result shows that we can bound the sum of confidence widths through time.\n\nTheorem 3 (Bounding the sum of widths).\nFor all finite sets $\\mathcal{X}$, measurable spaces $(\\mathcal{Y}, \\Sigma_{\\mathcal{Y}})$, function classes $\\mathcal{F} \\subseteq \\mathcal{M}_{\\mathcal{X},\\mathcal{Y}}$ with uniformly bounded widths $w_{\\mathcal{F}}(x) \\le C_{\\mathcal{F}}$ $\\forall x \\in \\mathcal{X}$ and non-decreasing sequences $\\{d_t : t \\in \\mathbb{N}\\}$:\n\n$$\\sum_{k=1}^{L} \\sum_{i=1}^{\\tau} w_{\\mathcal{F}_k}(x_{t_k+i}) \\le 4\\left(\\tau C_{\\mathcal{F}} |\\mathcal{X}| + 1\\right) + 4\\sqrt{2 d_T |\\mathcal{X}| T} \\tag{5}$$\n\nProof. The proof follows from elementary counting arguments on $n_t(x)$ and the pigeonhole principle. A full derivation is given in Appendix A.\n\n6 Algorithms\n\nWith our notation established, we are now able to introduce our algorithms for efficient learning in Factored MDPs. 
PSRL and UCRL-Factored proceed in episodes of fixed policies. At the start of the $k$th episode they produce a candidate MDP $M_k$ and then proceed with the policy which is optimal for $M_k$. In PSRL, $M_k$ is generated by a sample from the posterior for $M^*$, whereas UCRL-Factored chooses $M_k$ optimistically from the confidence set $\\mathcal{M}_k$.\n\nBoth algorithms require prior knowledge of the graphical structure $G$ and an approximate planner for FMDPs. We will write $\\Gamma(M, \\epsilon)$ for a planner which returns an $\\epsilon$-optimal policy for $M$. We will write $\\tilde{\\Gamma}(\\mathcal{M}, \\epsilon)$ for a planner which returns an $\\epsilon$-optimal policy for the most optimistic realization from a family of MDPs $\\mathcal{M}$. Given $\\Gamma$ it is possible to obtain $\\tilde{\\Gamma}$ through extended value iteration, although this might become computationally intractable [4].\n\nPSRL remains identical to earlier treatment [13, 6] provided $G$ is encoded in the prior $\\phi$. UCRL-Factored is a modification to UCRL2 that can exploit the graph and episodic structure. We write $\\mathcal{R}^i_t(d^{R_i}_t)$ and $\\mathcal{P}^j_t(d^{P_j}_t)$ as shorthand for the confidence sets $\\mathcal{R}^i_t(|\\mathbb{E}[\\cdot]|, x_1^{t-1}[Z^R_i], d^{R_i}_t)$ and $\\mathcal{P}^j_t(\\|\\cdot\\|_1, x_1^{t-1}[Z^P_j], d^{P_j}_t)$ generated from initial sets $\\mathcal{R}^i_1 = \\mathcal{P}^{C,\\sigma}_{\\mathcal{X}[Z^R_i],\\mathbb{R}}$ and $\\mathcal{P}^j_1 = \\mathcal{P}_{\\mathcal{X}[Z^P_j],\\mathcal{S}_j}$.\n\nWe should note that UCRL2 was designed to obtain regret bounds even in MDPs without episodic reset. This is accomplished by imposing artificial episodes which end whenever the number of visits to a state-action pair is doubled [4]. It is simple to extend UCRL-Factored\u2019s guarantees to this setting using this same strategy. This will not work for PSRL since our current analysis requires that the episode length is independent of the sampled MDP. 
Nevertheless, there has been good empirical performance using this method for MDPs without episodic reset in simulation [6].\n\nAlgorithm 1\nPSRL (Posterior Sampling)\n1: Input: Prior $\\phi$ encoding $G$, $t = 1$\n2: for episodes $k = 1, 2, \\ldots$ do\n3:   sample $M_k \\sim \\phi(\\cdot|H_t)$\n4:   compute $\\mu_k = \\Gamma(M_k, \\sqrt{\\tau/k})$\n5:   for timesteps $j = 1, \\ldots, \\tau$ do\n6:     sample and apply $a_t = \\mu_k(s_t, j)$\n7:     observe $r_t$ and $s_{t+1}$\n8:     $t = t + 1$\n9:   end for\n10: end for\n\nAlgorithm 2\nUCRL-Factored (Optimism)\n1: Input: Graph structure $G$, confidence $\\delta$, $t = 1$\n2: for episodes $k = 1, 2, \\ldots$ do\n3:   $d^{R_i}_t = 4\\sigma^2 \\log\\left(4 l |\\mathcal{X}[Z^R_i]| k / \\delta\\right)$ for $i = 1, \\ldots, l$\n4:   $d^{P_j}_t = 4 |\\mathcal{S}_j| \\log\\left(4 m |\\mathcal{X}[Z^P_j]| k / \\delta\\right)$ for $j = 1, \\ldots, m$\n5:   $\\mathcal{M}_k = \\{M \\mid G,\\; R_i \\in \\mathcal{R}^i_t(d^{R_i}_t),\\; P_j \\in \\mathcal{P}^j_t(d^{P_j}_t) \\;\\forall i, j\\}$\n6:   compute $\\mu_k = \\tilde{\\Gamma}(\\mathcal{M}_k, \\sqrt{\\tau/k})$\n7:   for timesteps $u = 1, \\ldots, \\tau$ do\n8:     sample and apply $a_t = \\mu_k(s_t, u)$\n9:     observe $r^1_t, \\ldots, r^l_t$ and $s^1_{t+1}, \\ldots, s^m_{t+1}$\n10:     $t = t + 1$\n11:   end for\n12: end for\n\n7 Analysis\n\nFor our common analysis of PSRL and UCRL-Factored we will let $\\tilde{M}_k$ refer generally to either the sampled MDP used in PSRL or the optimistic MDP chosen from $\\mathcal{M}_k$, with associated policy $\\tilde{\\mu}_k$. We introduce the Bellman operator $\\mathcal{T}^M_\\mu$, which for any MDP $M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$, stationary policy $\\mu : \\mathcal{S} \\to \\mathcal{A}$ and value function $V : \\mathcal{S} \\to \\mathbb{R}$, is defined by\n\n$$\\mathcal{T}^M_\\mu V(s) := \\overline{R}^M(s, \\mu(s)) + \\sum_{s' \\in \\mathcal{S}} P^M(s'|s, \\mu(s)) V(s').$$\n\nThis returns the expected value of state $s$ where we follow the policy $\\mu$ under the laws of $M$, for one time step. 
We will streamline our discussion of $P^M$, $R^M$, $V^M_{\\mu,i}$ and $\\mathcal{T}^M_\\mu$ by simply writing $*$ in place of $M^*$ or $\\mu^*$ and $k$ in place of $\\tilde{M}_k$ or $\\tilde{\\mu}_k$ where appropriate; for example $V^*_{k,i} := V^{M^*}_{\\tilde{\\mu}_k,i}$. We will also write $x_{k,i} := (s_{t_k+i}, \\mu_k(s_{t_k+i}))$.\n\nWe now break down the regret by adding and subtracting the imagined near optimal reward of policy $\\tilde{\\mu}_k$, which is known to the agent. For clarity of analysis we consider only the case of $\\rho(s') = \\mathbb{1}\\{s' = s\\}$ but this changes nothing for our consideration of finite $\\mathcal{S}$.\n\n$$\\Delta_k = V^*_{*,1}(s) - V^*_{k,1}(s) = \\left(V^k_{k,1}(s) - V^*_{k,1}(s)\\right) + \\left(V^*_{*,1}(s) - V^k_{k,1}(s)\\right) \\tag{6}$$\n\n$V^*_{*,1} - V^k_{k,1}$ relates the optimal rewards of the MDP $M^*$ to those near optimal for $\\tilde{M}_k$. We can bound this difference by the planning accuracy $\\sqrt{1/k}$ for PSRL in expectation, since $M^*$ and $M_k$ are equal in law, and for UCRL-Factored in high probability by optimism.\n\nWe decompose the first term through repeated application of dynamic programming:\n\n$$\\left(V^k_{k,1} - V^*_{k,1}\\right)(s_{t_k+1}) = \\sum_{i=1}^{\\tau} \\left(\\mathcal{T}^k_{k,i} - \\mathcal{T}^*_{k,i}\\right) V^k_{k,i+1}(s_{t_k+i}) + \\sum_{i=1}^{\\tau} d_{t_k+i}. \\tag{7}$$\n\nHere $d_{t_k+i} := \\sum_{s \\in \\mathcal{S}} \\left\\{P^*(s|x_{k,i})(V^*_{k,i+1} - V^k_{k,i+1})(s)\\right\\} - (V^*_{k,i+1} - V^k_{k,i+1})(s_{t_k+i})$ is a martingale difference bounded by $\\Psi_k$, the span of $V^k_{k,i}$. For UCRL-Factored we can use optimism to say that $\\Psi_k \\le CD$ [4] and apply the Azuma-Hoeffding inequality to say that:\n\n$$\\mathbb{P}\\left(\\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} d_{t_k+i} > CD\\sqrt{2T \\log(2/\\delta)}\\right) \\le \\delta \\tag{8}$$\n\nThe remaining term is the one step Bellman error of the imagined MDP $\\tilde{M}_k$. Crucially this term only depends on states and actions $x_{k,i}$ which are actually observed. 
We can now use the H\u00f6lder inequality to bound\n\n$$\\sum_{i=1}^{\\tau} \\left(\\mathcal{T}^k_{k,i} - \\mathcal{T}^*_{k,i}\\right) V^k_{k,i+1}(s_{t_k+i}) \\le \\sum_{i=1}^{\\tau} |\\overline{R}^k(x_{k,i}) - \\overline{R}^*(x_{k,i})| + \\frac{1}{2} \\Psi_k \\|P^k(\\cdot|x_{k,i}) - P^*(\\cdot|x_{k,i})\\|_1 \\tag{9}$$\n\n7.1 Factorization decomposition\n\nWe aim to exploit the graphical structure $G$ to create more efficient confidence sets $\\mathcal{M}_k$. It is clear from (9) that we may upper bound the deviations of $\\overline{R}^*$, $\\overline{R}^k$ factor-by-factor using the triangle inequality. Our next result, Lemma 1, shows we can also do this for the transition functions $P^*$ and $P^k$. This is the key result that allows us to build confidence sets around each factor $P^*_j$ rather than $P^*$ as a whole.\n\nLemma 1 (Bounding factored deviations).\nLet the transition function class $\\mathcal{P} \\subseteq \\mathcal{P}_{\\mathcal{X},\\mathcal{S}}$ be factored over $\\mathcal{X} = \\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_n$ and $\\mathcal{S} = \\mathcal{S}_1 \\times \\cdots \\times \\mathcal{S}_m$ with scopes $Z_1, \\ldots, Z_m$. Then, for any $P, \\tilde{P} \\in \\mathcal{P}$ we may bound their $L_1$ distance by the sum of the differences of their factorizations:\n\n$$\\|P(x) - \\tilde{P}(x)\\|_1 \\le \\sum_{i=1}^{m} \\|P_i(x[Z_i]) - \\tilde{P}_i(x[Z_i])\\|_1$$\n\nProof. We begin with the simple claim that for any $\\alpha_1, \\alpha_2, \\beta_1, \\beta_2 \\in (0, 1]$:\n\n$$|\\alpha_1 \\alpha_2 - \\beta_1 \\beta_2| = \\alpha_2 \\left|\\alpha_1 - \\frac{\\beta_1 \\beta_2}{\\alpha_2}\\right| \\le \\alpha_2 \\left(|\\alpha_1 - \\beta_1| + \\left|\\beta_1 - \\frac{\\beta_1 \\beta_2}{\\alpha_2}\\right|\\right) \\le \\alpha_2 |\\alpha_1 - \\beta_1| + \\beta_1 |\\alpha_2 - \\beta_2|$$\n\nThis result also holds for any $\\alpha_1, \\alpha_2, \\beta_1, \\beta_2 \\in [0, 1]$, where the cases involving 0 can be verified case by case.\n\nWe now consider the probability distributions $p, \\tilde{p}$ over $\\{1, \\ldots, d_1\\}$ and $q, \\tilde{q}$ over $\\{1, \\ldots, d_2\\}$. We let $Q = pq^T$, $\\tilde{Q} = \\tilde{p}\\tilde{q}^T$ be the joint probability distributions over $\\{1, \\ldots, d_1\\} \\times \\{1, \\ldots, d_2\\}$. 
Using the claim above we bound the $L_1$ deviation $\\|Q - \\tilde{Q}\\|_1$ by the deviations of their factors:\n\n$$\\|Q - \\tilde{Q}\\|_1 = \\sum_{i=1}^{d_1} \\sum_{j=1}^{d_2} |p_i q_j - \\tilde{p}_i \\tilde{q}_j| \\le \\sum_{i=1}^{d_1} \\sum_{j=1}^{d_2} \\left(q_j |p_i - \\tilde{p}_i| + \\tilde{p}_i |q_j - \\tilde{q}_j|\\right) = \\|p - \\tilde{p}\\|_1 + \\|q - \\tilde{q}\\|_1$$\n\nWe conclude the proof by applying this $m$ times to the factored transitions $P$ and $\\tilde{P}$.\n\n7.2 Concentration guarantees for $\\mathcal{M}_k$\n\nWe now want to show that the true MDP lies within $\\mathcal{M}_k$ with high probability. Note that posterior sampling will also allow us to then say that the sampled $M_k$ is within $\\mathcal{M}_k$ with high probability too. In order to show this, we first present a concentration result for the $L_1$ deviation of empirical probabilities.\n\nLemma 2 ($L_1$ bounds for the empirical transition function).\nFor all finite sets $\\mathcal{X}$, finite sets $\\mathcal{Y}$, function classes $\\mathcal{P} \\subseteq \\mathcal{P}_{\\mathcal{X},\\mathcal{Y}}$, then for any $x \\in \\mathcal{X}$, $\\epsilon > 0$ the deviation of the true distribution $P^*$ from the empirical estimate after $t$ samples $\\hat{P}_t$ is bounded:\n\n$$\\mathbb{P}\\left(\\|P^*(x) - \\hat{P}_t(x)\\|_1 \\ge \\epsilon\\right) \\le \\exp\\left(|\\mathcal{Y}| \\log(2) - \\frac{n_t(x) \\epsilon^2}{2}\\right)$$\n\nProof. 
This is a relaxation of the result proved by Weissman et al. [19].\n\nLemma 2 ensures that for any $x \\in \\mathcal{X}$, $\\mathbb{P}\\left(\\|P^*_j(x) - \\hat{P}_{j,t}(x)\\|_1 \\ge \\sqrt{\\frac{2|\\mathcal{S}_j|}{n_t(x)} \\log\\left(\\frac{2}{\\delta'}\\right)}\\right) \\le \\delta'$. We then define $d^{P_j}_{t_k} = 2|\\mathcal{S}_j| \\log(2/\\delta'_{k,j})$ with $\\delta'_{k,j} = \\delta/(2m|\\mathcal{X}[Z^P_j]|k^2)$. Now using a union bound we conclude $\\mathbb{P}\\left(P^*_j \\in \\mathcal{P}^j_t(d^{P_j}_{t_k}) \\;\\forall k \\in \\mathbb{N}, j = 1, \\ldots, m\\right) \\ge 1 - \\delta$.\n\nLemma 3 (Tail bounds for sub $\\sigma$-gaussian random variables).\nIf $\\{\\epsilon_i\\}$ are all independent and sub $\\sigma$-gaussian then $\\forall \\beta \\ge 0$:\n\n$$\\mathbb{P}\\left(\\frac{1}{n}\\left|\\sum_{i=1}^{n} \\epsilon_i\\right| > \\beta\\right) \\le \\exp\\left(\\log(2) - \\frac{n\\beta^2}{2\\sigma^2}\\right)$$\n\nA similar argument now ensures that $\\mathbb{P}\\left(\\overline{R}^*_i \\in \\mathcal{R}^i_t(d^{R_i}_{t_k}) \\;\\forall k \\in \\mathbb{N}, i = 1, \\ldots, l\\right) \\ge 1 - \\delta$, and so\n\n$$\\mathbb{P}\\left(M^* \\in \\mathcal{M}_k \\;\\forall k \\in \\mathbb{N}\\right) \\ge 1 - 2\\delta \\tag{10}$$\n\n7.3 Regret bounds\n\nWe now have all the necessary intermediate results to complete our proof. We begin with the analysis of PSRL. Using equation (10) and the fact that $M^*$, $M_k$ are equal in law by posterior sampling, we can say that $\\mathbb{P}(M^*, M_k \\in \\mathcal{M}_k \\;\\forall k \\in \\mathbb{N}) \\ge 1 - 4\\delta$. The contributions from regret in planning function $\\Gamma$ are bounded by $\\sum_{k=1}^{m} \\sqrt{\\tau/k} \\le 2\\sqrt{T}$. From here we take equation (9), Lemma 1 and Theorem 3 to say that for any $\\delta > 0$:\n\n$$\\mathbb{E}\\left[\\mathrm{Regret}(T, \\pi^{PS}_\\tau, M^*)\\right] \\le 4\\delta T + 2\\sqrt{T} + \\sum_{i=1}^{l} \\left\\{4(\\tau C |\\mathcal{X}[Z^R_i]| + 1) + 4\\sqrt{2 d^{R_i}_T |\\mathcal{X}[Z^R_i]| T}\\right\\} + \\sup_{k=1,\\ldots,L} \\left(\\mathbb{E}[\\Psi_k \\mid M_k, M^* \\in \\mathcal{M}_k]\\right) \\times \\sum_{j=1}^{m} \\left\\{4(\\tau |\\mathcal{X}[Z^P_j]| + 1) + 4\\sqrt{2 d^{P_j}_T |\\mathcal{X}[Z^P_j]| T}\\right\\}$$\n\nLet $A = \\{M^*, M_k \\in \\mathcal{M}_k\\}$. Since $\\Psi_k \\ge 0$ and by posterior sampling $\\mathbb{E}[\\Psi_k] = \\mathbb{E}[\\Psi]$ for all $k$:\n\n$$\\mathbb{E}[\\Psi_k \\mid A] \\le \\mathbb{P}(A)^{-1} \\mathbb{E}[\\Psi] \\le \\left(1 - \\frac{4\\delta}{k^2}\\right)^{-1} \\mathbb{E}[\\Psi] = \\left(1 + \\frac{4\\delta}{k^2 - 4\\delta}\\right) \\mathbb{E}[\\Psi] \\le \\left(1 + \\frac{4\\delta}{1 - 4\\delta}\\right) \\mathbb{E}[\\Psi].$$\n\nPlugging in $d^{R_i}_T$ and $d^{P_j}_T$ and setting $\\delta = 1/T$ completes the proof of Theorem 1. The analysis of UCRL-Factored and Theorem 2 follows similarly from (8) and (10). Corollaries 1 and 2 follow from substituting the structure $Q$ and upper bounding the constant and logarithmic terms. 
This is presented in detail in Appendix B.\n\n8 Conclusion\n\nWe present the first algorithms with near-optimal regret bounds in factored MDPs. Many practical problems for reinforcement learning will have extremely large state and action spaces; our results allow us to obtain meaningful performance guarantees even in previously intractably large systems. However, our analysis leaves several important questions unaddressed. First, we assume access to an approximate FMDP planner that may be computationally prohibitive in practice. Second, we assume that the graph structure is known a priori but there are other algorithms that seek to learn this from experience [20, 21]. Finally, we might consider dimensionality reduction in large MDPs more generally, where either the rewards, transitions or optimal value function are known to belong in some function class $\\mathcal{F}$ to obtain bounds that depend on the dimensionality of $\\mathcal{F}$.\n\nAcknowledgments\n\nOsband is supported by Stanford Graduate Fellowships courtesy of PACCAR inc. This work was supported in part by Award CMMI-0968707 from the National Science Foundation.\n\nReferences\n\n[1] Apostolos Burnetas and Michael Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222\u2013255, 1997.\n\n[2] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209\u2013232, 2002.\n\n[3] Ronen Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. 
The Journal of Machine Learning Research, 3:213\u2013231, 2003.\n\n[4] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. The Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n\n[5] Peter Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35\u201342. AUAI Press, 2009.\n\n[6] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. Advances in Neural Information Processing Systems, 2013.\n\n[7] Craig Boutilier, Richard Dearden, and Mois\u00e9s Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49\u2013107, 2000.\n\n[8] Zoubin Ghahramani. Learning dynamic Bayesian networks. In Adaptive processing of sequences and data structures, pages 168\u2013197. Springer, 1998.\n\n[9] Alexander Strehl. Model-based reinforcement learning in factored-state MDPs. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pages 103\u2013110. IEEE, 2007.\n\n[10] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, volume 16, pages 740\u2013747, 1999.\n\n[11] Istv\u00e1n Szita and Andr\u00e1s L\u0151rincz. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1001\u20131008. ACM, 2009.\n\n[12] William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\n[13] Malcolm Strens. A Bayesian framework for reinforcement learning. 
In Proceedings of the 17th International Conference on Machine Learning, pages 943\u2013950, 2000.\n\n[14] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. J. Artif. Intell. Res. (JAIR), 19:399\u2013468, 2003.\n\n[15] Daphne Koller and Ronald Parr. Policy iteration for factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 326\u2013334. Morgan Kaufmann Publishers Inc., 2000.\n\n[16] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored MDPs. In IJCAI, volume 1, pages 673\u2013682, 2001.\n\n[17] Karina Valdivia Delgado, Scott Sanner, and Leliane Nunes De Barros. Efficient solutions to factored MDPs with imprecise transition probabilities. Artificial Intelligence, 175(9):1498\u20131527, 2011.\n\n[18] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. arXiv preprint arXiv:1207.1415, 2012.\n\n[19] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.\n\n[20] Alexander Strehl, Carlos Diuk, and Michael Littman. Efficient structure learning in factored-state MDPs. In AAAI, volume 7, pages 645\u2013650, 2007.\n\n[21] Carlos Diuk, Lihong Li, and Bethany R Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 249\u2013256. ACM, 2009.\n", "award": [], "sourceid": 413, "authors": [{"given_name": "Ian", "family_name": "Osband", "institution": "Stanford"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}