{"title": "Optimistic Planning in Markov Decision Processes Using a Generative Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1035, "page_last": 1043, "abstract": "We consider the problem of online planning in a Markov decision process with discounted rewards for any given initial state. We consider the PAC sample complexity problem of computing, with probability $1-\\delta$, an $\\epsilon$-optimal action using the smallest possible number of calls to the generative model (which provides reward and next-state samples). We design an algorithm, called StOP (for Stochastic-Optimistic Planning), based on the optimism in the face of uncertainty\" principle. StOP can be used in the general setting, requires only a generative model, and enjoys a complexity bound that only depends on the local structure of the MDP.\"", "full_text": "Optimistic planning in Markov decision processes\n\nusing a generative model\n\nBal\u00b4azs Sz\u00a8or\u00b4enyi\n\nINRIA Lille - Nord Europe,\n\nSequeL project, France /\n\nMTA-SZTE Research Group on\nArti\ufb01cial Intelligence, Hungary\n\nGunnar Kedenburg\n\nINRIA Lille - Nord Europe,\n\nSequeL project, France\n\ngunnar.kedenburg@inria.fr\n\nbalazs.szorenyi@inria.fr\n\nRemi Munos\u2217\n\nINRIA Lille - Nord Europe,\n\nSequeL project, France\n\nremi.munos@inria.fr\n\nAbstract\n\nWe consider the problem of online planning in a Markov decision process with\ndiscounted rewards for any given initial state. We consider the PAC sample com-\nplexity problem of computing, with probability 1\u2212\u03b4, an \ufffd-optimal action using the\nsmallest possible number of calls to the generative model (which provides reward\nand next-state samples). We design an algorithm, called StOP (for Stochastic-\nOptimistic Planning), based on the \u201coptimism in the face of uncertainty\u201d princi-\nple. 
StOP can be used in the general setting, requires only a generative model, and enjoys a complexity bound that only depends on the local structure of the MDP.

1 Introduction

1.1 Problem formulation

In a Markov decision process (MDP), an agent navigates in a state space X by making decisions from some action set U. The dynamics of the system are determined by transition probabilities P : X × U × X → [0, 1] and reward probabilities R : X × U × [0, 1] → [0, 1], as follows: when the agent chooses action u in state x, then, with probability R(x, u, r), it receives reward r, and with probability P(x, u, x′) it makes a transition to a next state x′. This happens independently of all previous actions, states and rewards—that is, the system possesses the Markov property. See [20, 2] for a general introduction to MDPs. We do not assume that the transition or reward probabilities are fully known. Instead, we assume access to the MDP via a generative model (e.g. simulation software), which, for a state-action pair (x, u), returns a reward sample r ~ R(x, u, ·) and a next-state sample x′ ~ P(x, u, ·). We also assume the number of possible next states to be bounded by N ∈ ℕ.

We would like to find an agent that implements a policy which maximizes the expected cumulative discounted reward E[Σ_{t=0}^{∞} γ^t r_t], which we will also refer to as the return. Here, r_t is the reward received at time t and γ ∈ (0, 1) is the discount factor. Further, we take an online planning approach, where at each time step, the agent uses the generative model to perform a simulated search (planning) in the set of policies, starting from the current state. As a result of this search, the agent takes a single action.
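The generative-model access just described can be sketched as a minimal interface (a toy illustration; the class, method names, and example MDP are ours, not part of the paper):

```python
import random

class GenerativeModel:
    """Toy generative model for a small MDP (hypothetical example).

    For each state-action pair it stores a next-state distribution
    (at most N successors) and a reward distribution on [0, 1].
    """

    def __init__(self, transitions, rewards, seed=0):
        # transitions[(x, u)] = list of (next_state, probability) pairs
        # rewards[(x, u)] = callable rng -> reward in [0, 1]
        self.transitions = transitions
        self.rewards = rewards
        self.rng = random.Random(seed)

    def sample(self, x, u):
        """One call to the model: returns (r, x') with r ~ R(x,u,.) and x' ~ P(x,u,.)."""
        states, probs = zip(*self.transitions[(x, u)])
        x_next = self.rng.choices(states, weights=probs, k=1)[0]
        r = self.rewards[(x, u)](self.rng)
        return r, x_next

# A two-state example: action 0 is "safe", action 1 is "risky".
transitions = {
    ("A", 0): [("A", 1.0)],
    ("A", 1): [("A", 0.5), ("B", 0.5)],
    ("B", 0): [("B", 1.0)],
    ("B", 1): [("B", 1.0)],
}
rewards = {
    ("A", 0): lambda rng: 0.5,
    ("A", 1): lambda rng: rng.random(),  # stochastic reward in [0, 1]
    ("B", 0): lambda rng: 1.0,
    ("B", 1): lambda rng: 0.0,
}
model = GenerativeModel(transitions, rewards)
r, x_next = model.sample("A", 1)
```

Each call to `sample` corresponds to one unit of sample complexity; the bounds discussed below count exactly these calls.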
An expensive global search for the optimal policy in the whole MDP is avoided.

*Current affiliation: Google DeepMind

To quantify the performance of our algorithm, we consider a PAC (Probably Approximately Correct) setting, where, given ε > 0 and δ ∈ (0, 1), our algorithm returns, with probability 1 − δ, an ε-optimal action (i.e. such that the loss of performing this action and then following an optimal policy, instead of following an optimal policy from the beginning, is at most ε). The number of calls to the generative model required by the planning algorithm is referred to as its sample complexity. The sample and computational complexities of the planning algorithm introduced here depend on local properties of the MDP, such as the quantity of near-optimal policies starting from the initial state, rather than global features like the MDP's size.

1.2 Related work

The online planning approach and, in particular, its ability to get rid of the dependency on the global features of the MDP in the complexity bounds (mentioned above, and detailed further below) is the driving force behind the Monte Carlo Tree Search algorithms [16, 8, 11, 18].¹ The theoretical analysis of this approach is still far from complete. Some of the earlier algorithms use strong assumptions, others are applicable only in restricted cases, or don't adapt to the complexity of the problem. In this paper we build on ideas used in previous works, and aim at fixing these issues.

A first related work is the sparse sampling algorithm of [14]. It builds a uniform look-ahead tree of a given depth (which depends on the precision ε), using for each transition a finite number of samples obtained from a generative model. An estimate of the value function is then built using empirical averaging instead of expectations in the dynamic programming back-up scheme.
This results in an algorithm with (problem-independent) sample complexity of order

(1/(ε²(1 − γ)⁴)) · (1/((1 − γ)³ε))^{(log K + log[1/(ε(1 − γ)²)])/log(1/γ)}

(neglecting some poly-logarithmic dependence), where K is the number of actions. In terms of ε, this bound scales as exp(O([log(1/ε)]²)), which is non-polynomial in 1/ε.² Another disadvantage of the algorithm is that the expansion of the look-ahead tree is uniform; it does not adapt to the MDP.

An algorithm which addresses this appears in [21]. It avoids evaluating some unnecessary branches of the look-ahead tree of the sparse sampling algorithm. However, the provided sample bound does not improve on the one in [14], and it is possible to show that the bound is tight (for both algorithms). In fact, the sample complexity turns out to be super-polynomial even in the pure Monte Carlo setting (i.e., when K = 1): (1/ε)^{2+(log C)/log(1/γ)} for some constant C > 1.

Close to our contribution are the planning algorithms [13, 3, 5, 15] (see also the survey [18]) that follow the so-called "optimism in the face of uncertainty" principle for online planning. This principle has been extensively investigated in the multi-armed bandit literature (see e.g. [17, 1, 4]). In the planning problem, this approach translates to prioritizing the most promising part of the policy space during exploration. In [13, 3, 5], the sample complexity depends on a measure of the quantity of near-optimal policies, which gives a better understanding of the real hardness of the problem than the uniform bound in [14].

The case of deterministic dynamics and rewards is considered in [13]. The proposed algorithm has sample complexity of order (1/ε)^{log(κ)/log(1/γ)}, where κ ∈ [1, K] measures (as a branching factor) the quantity of nodes of the planning tree that belong to near-optimal policies.
If all policies are very good, many nodes need to be explored in order to distinguish the optimal policies from the rest; therefore, κ is close to the number of actions K, resulting in the minimax bound of (1/ε)^{log K/log(1/γ)}. Now if there is structure in the rewards (e.g. when sub-optimal policies can be eliminated by observing the first rewards along the sequence), then the proportion of near-optimal policies is low, so κ can be small and the bound is much better. In [3], the case of stochastic rewards has been considered. However, in that work the performance is not compared to the optimal (closed-loop) policy, but to the best open-loop policy (i.e. one which does not depend on the state but only on the sequence of actions). In that situation, the sample complexity is of order (1/ε)^{max(2, log(κ)/log(1/γ))}.

The deterministic and open-loop settings are relatively simple, since any policy can be identified with a sequence of actions. In the general MDP case however, a policy corresponds to an exponentially

¹A similar planning approach has been considered in the control literature, such as the model-predictive control [6], or in the AI community, such as the A* heuristic search [19] and the AO* variant [12].
²A problem-independent lower bound for the sample complexity, of order (1/ε)^{1/log(1/γ)}, is provided too.

wide tree, where several branches need to be explored. The closest work to ours in this respect is [5]. However, it makes the (strong) assumption that a full model of the rewards and transitions is available. The sample complexity achieved is again (1/ε)^{log(κ)/log(1/γ)}, but where κ ∈ (1, NK] is defined as the branching factor of the set of nodes that simultaneously (1) belong to near-optimal policies, and (2) whose "contribution" to the value function at the initial state is non-negligible.

1.3 The main results of the paper

Our main contribution is a planning algorithm, called StOP (for Stochastic Optimistic Planning), that achieves a polynomial sample complexity in terms of ε (which can be regarded as the leading parameter in this problem), and which is, in terms of this complexity, competitive with other algorithms that can exploit more specifics of their respective domains. It benefits from possible reward or transition probability structures, and does not require any special restriction or knowledge about the MDP besides having access to a generative model. The sample complexity bound is more involved than in previous works, but can be upper-bounded by:

(1/ε)^{2 + log(κ)/log(1/γ) + o(1)}.   (1)

The important quantity κ ∈ [1, KN] plays the role of a branching factor of the set of important states S^{ε,*} (defined precisely later) that "contribute" in a significant way to near-optimal policies. These states have a non-negligible probability of being reached when following some near-optimal policy. This measure is similar (but with some differences, illustrated below) to the κ introduced in the analysis of OP-MDP in [5]. Comparing the two, (1) contains an additional constant of 2 in the exponent. This is a consequence of the fact that the rewards are random and that we do not have access to the true probabilities, only to a generative model generating transition and reward samples.

In order to provide intuition about the bound, let us consider several specific cases (the derivation of these bounds can be found in Section E):

• Worst case. When there is no structure at all, then S^{ε,*} may potentially be the set of all possible reachable nodes (up to some depth which depends on ε), and its branching factor is κ = KN.
The sample complexity is thus of order (neglecting logarithmic factors) (1/ε)^{2+log(KN)/log(1/γ)}. This is the same complexity that a uniform planning algorithm would achieve. Indeed, uniform planning would build a tree of depth h with branching factor KN, where from each state-action pair one would generate m rewards and next-state samples. Then, dynamic programming would be used with the empirical Bellman operator built from the samples. Using the Chernoff-Hoeffding bound, the estimation error is of the order (neglecting logarithms and the (1 − γ) dependence) of 1/√m. So for a desired error ε we need to choose h of order log(1/ε)/log(1/γ), and m of order 1/ε², leading to a sample complexity of order m(KN)^h = (1/ε)^{2+log(KN)/log(1/γ)}. (See also [15].) Note that in the worst-case sense there is no uniformly better strategy than uniform planning, which is achieved by StOP. However, StOP can also do much better in specific settings, as illustrated next.

• Case with K₀ > 1 actions at the initial state, K₁ = 1 actions for all other states, and arbitrary transition probabilities. Now each branch corresponds to a single policy. In that case one has κ = 1 (even though N > 1), and the sample complexity of StOP is of order Õ(log(1/δ)/ε²) with high probability³. This is the same rate as a Monte-Carlo evaluation strategy would achieve by sampling O(log(1/δ)/ε²) random trajectories of length log(1/ε)/log(1/γ). Notice that this result is surprisingly different from OP-MDP, which has a complexity of order (1/ε)^{log N/log(1/γ)} (in the case when κ = N, i.e., when all transitions are uniform).
Indeed, in the case of uniform transition probabilities, OP-MDP would sample the nodes in a breadth-first-search way, thus achieving this minimax-optimal complexity. This does not contradict the Õ(log(1/δ)/ε²) bound for StOP (and Monte-Carlo), since this bound applies to an individual problem and holds with high probability, whereas the bound for OP-MDP is deterministic and holds uniformly over all problems of this type.

³We emphasize the dependence on δ here since we want to compare this high-probability bound to the deterministic bound of OP-MDP.

Here we see the potential benefit of using StOP instead of OP-MDP, even though StOP only uses a generative model of the MDP whereas OP-MDP requires a full model.

• Highly structured policies. This situation holds when there is a substantial gap between near-optimal policies and other sub-optimal policies. For example, if along an optimal policy all immediate rewards are 1, whereas as soon as one deviates from it all rewards are < 1, then only a small proportion of the nodes (the ones that contribute to near-optimal policies) will be expanded by the algorithm. In such cases, κ is very close to 1 and, in the limit, we recover the previous case when K = 1 and the sample complexity is O((1/ε)²).

• Deterministic MDPs. Here N = 1 and we have that κ ∈ [1, K]. When there is structure in the rewards (like in the previous case), then κ = 1 and we obtain a rate Õ(1/ε²).
Now when the MDP is almost deterministic, in the sense that N > 1 but from any state-action pair there is one next-state probability which is close to 1, then we have almost the same complexity as in the deterministic case (since the nodes that have a small probability of being reached will not contribute to the set of important nodes S^{ε,*}, which characterizes κ).

• Multi-armed bandit. We essentially recover the result of the Action Elimination algorithm [9] for the PAC setting.

Thus we see that in the worst case StOP is minimax-optimal and, in addition, StOP is able to benefit from situations when there is some structure either in the rewards or in the transition probabilities. We stress that StOP achieves the above-mentioned results having no knowledge about κ.

1.4 The structure of the paper

Section 2 describes the algorithm and introduces all the necessary notions. Section 3 presents the consistency and sample complexity results. Section 4 discusses run-time efficiency, and in Section 5 we make some concluding remarks. Finally, the supplementary material provides the missing proofs, the analysis of the special cases, and the necessary fixes for the issues with the run-time complexity.

2 StOP: Stochastic Optimistic Planning

Recall that N ∈ ℕ denotes the number of possible next states. That is, for each state x ∈ X and each action u available at x, it holds that P(x, u, x′) = 0 for all but at most N states x′ ∈ X. Throughout this section, the state of interest is denoted by x₀, the requested accuracy by ε, and the confidence parameter by δ₀. That is, the problem to be solved is to output an action u which is, with probability at least (1 − δ₀), at least ε-optimal in x₀.

The algorithm and the analysis make use of the notion of an (infinite) planning tree, policies and trajectories.
These notions are introduced in the next subsection.

2.1 Planning trees and trajectories

The infinite planning tree Π∞ for a given MDP is a rooted and labeled infinite tree. Its root is denoted s₀ and is labeled by the state of interest, x₀ ∈ X. Nodes on even levels are called action nodes (the root is an action node), and have K_d children each on the d-th level of action nodes: each action u is represented by exactly one child, labeled u. Nodes on odd levels are called transition nodes and have N children each: if the label of the parent (action) node is x, and the label of the transition node itself is u, then for each x′ ∈ X with P(x, u, x′) > 0 there is a corresponding child, labeled x′. There may be children with probability zero, but no duplicates.

An infinite policy is a subtree of Π∞ with the same root, where each action node has exactly one child and each transition node has N children. It corresponds to an agent having fixed all its possible future actions. A (partial) policy Π is a finite subtree of Π∞, again with the same root, but where the action nodes have at most one child, each transition node has N children, and all leaves⁴ are on the same level. The number of transition nodes on any path from the root to a leaf is denoted d(Π) and is called the depth of Π. A partial policy corresponds to the agent having its possible future actions planned for d(Π) steps. There is a natural partial order over these policies: a policy Π′ is called a descendant policy of a policy Π if Π is a subtree of Π′.

⁴Note that leaves are, by definition, always action nodes.
If, additionally, it holds that d(Π′) = d(Π) + 1, then Π is called the parent policy of Π′, and Π′ the child policy of Π.

A (random) trajectory, or rollout, for some policy Π is a realization τ := (x_t, u_t, r_t)_{t=0}^{T} of the stochastic process that belongs to the policy. A random path is generated from the root by always following, from a non-leaf action node with label x_t, its unique child in Π, then setting u_t to the label of this node; from there, drawing first a label x_{t+1} from P(x_t, u_t, ·), one follows the child with label x_{t+1}. The reward r_t is drawn from the distribution determined by R(x_t, u_t, ·). The value of the rollout τ (also called return or payoff in the literature) is v(τ) := Σ_{t=0}^{T} r_t γ^t, and the value of the policy Π is v(Π) := E[v(τ)] = E[Σ_{t=0}^{T} r_t γ^t]. For an action u available at x₀, denote by v(u) the maximum of the values of the policies having u as the label of the child of root s₀. Denote by v* the maximum of these v(u) values. Using this notation, the task of the algorithm is to return, with high probability, an action u with v(u) ≥ v* − ε.

2.2 The algorithm

StOP (Algorithm 1; see Figure 1 in the supplementary material for an illustration) maintains for each action u available at x₀ a set of active policies Active(u). Initially, it holds that Active(u) = {Π_u}, where Π_u is the shallowest partial policy with the child of the root being labeled u. Also, for each policy Π that becomes a member of an active set, the algorithm maintains high-confidence lower and upper bounds on the value v(Π) of the policy, denoted ν(Π) and b(Π), respectively.

In each round t, an optimistic policy Π†_{t,u} := argmax_{Π ∈ Active(u)} b(Π) is determined for each action u.
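The rollout value v(τ) = Σ_{t=0}^{T} γ^t r_t defined above reduces to a one-line helper (the function name is ours, for illustration only):

```python
def rollout_value(rewards, gamma):
    """Discounted value v(tau) = sum_t gamma^t * r_t of a finite rollout."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# A depth-3 rollout with all rewards equal to 1 and gamma = 0.5:
v = rollout_value([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```

Averaging such rollout values over m sampled trajectories yields the empirical estimate of v(Π) used by the algorithm.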
Based on these optimistic policies, the current optimistic action u†_t := argmax_u b(Π†_{t,u}) and the secondary action u††_t := argmax_{u ≠ u†_t} b(Π†_{t,u}) are computed. A policy Π_t to explore is then chosen: if the policy that belongs to the secondary action is at least as deeply developed as the policy that belongs to the optimistic action, the optimistic one is chosen for exploration, otherwise the secondary one. Note that a smaller depth is equivalent to a larger gap between lower and upper bound, and vice versa⁵. The set Active(u_t) is then updated by replacing the policy Π_t by its child policies. Accordingly, the upper and lower bounds for these policies are computed. The algorithm terminates when ν(Π†_t) + ε ≥ max_{u ≠ u†_t} b(Π†_{t,u}); that is, when, with high confidence, no policies starting with an action different from u†_t have the potential to have a significantly higher value.

2.2.1 Number and length of trajectories needed for one partial policy

Fix some integer d > 0 and let Π be a partial policy of depth d. Let, furthermore, Π′ be an infinite policy that is a descendant of Π. Note that

0 ≤ v(Π′) − v(Π) ≤ γ^d/(1 − γ).   (2)

The value of Π is a γ^d/(1 − γ)-accurate approximation of the value of Π′. On the other hand, having m trajectories for Π, their average reward v̂(Π) can be used as an estimate of the value v(Π) of Π.
From the Hoeffding bound, this estimate has, with probability at least (1 − δ), accuracy ((1 − γ^d)/(1 − γ)) √(ln(1/δ)/(2m)). With m := m(d, δ) := ⌈(ln(1/δ)/2) · ((1 − γ^d)/γ^d)²⌉ trajectories, ((1 − γ^d)/(1 − γ)) √(ln(1/δ)/(2m)) ≤ γ^d/(1 − γ) holds, so with probability at least (1 − δ),

b(Π) := v̂(Π) + γ^d/(1 − γ) + ((1 − γ^d)/(1 − γ)) √(ln(1/δ)/(2m)) ≤ v̂(Π) + 2γ^d/(1 − γ)

and

ν(Π) := v̂(Π) − ((1 − γ^d)/(1 − γ)) √(ln(1/δ)/(2m)) ≥ v̂(Π) − γ^d/(1 − γ)

bound v(Π′) from above and below, respectively. This choice balances the inaccuracy of estimating v(Π′) based on v(Π) and the inaccuracy of estimating v(Π). Let d* := d*(ε, γ) := ⌈ln(6/((1 − γ)ε))/ln(1/γ)⌉, the smallest integer satisfying 3γ^{d*}/(1 − γ) ≤ ε/2. Note that if d(Π) = d* for any given policy Π, then b(Π) − ν(Π) ≤ ε/2. Because of this, it follows (see Lemma 3 in the supplementary material) that d* is the maximal depth to which the algorithm ever has to develop a policy.

⁵This approach of using secondary actions is based on the UGapE algorithm [10].

Algorithm 1 StOP(s₀, δ₀, ε, γ)
1: for all u available from x₀ do    ▹ initialize
2:   Π_u := smallest policy with the child of s₀ labeled u    ▹ d(Π_u) = 1
3:   δ₁ := (δ₀/d*) · (K₀)⁻¹
4:   (ν(Π_u), b(Π_u)) := BoundValue(Π_u, δ₁)
5:   Active(u) := {Π_u}    ▹ the set of active policies that follow u in s₀
6: for round t = 1, 2, . . . do
7:   for all u available at x₀ do
8:     Π†_{t,u} := argmax_{Π ∈ Active(u)} b(Π)
9:   Π†_t := Π†_{t,u†_t}, where u†_t := argmax_u b(Π†_{t,u})    ▹ optimistic action and policy
10:  Π††_t := Π†_{t,u††_t}, where u††_t := argmax_{u ≠ u†_t} b(Π†_{t,u})    ▹ secondary action and policy
11:  if ν(Π†_t) + ε ≥ max_{u ≠ u†_t} b(Π†_{t,u}) then    ▹ termination criterion
12:    return u†_t
13:  if d(Π††_t) ≥ d(Π†_t) then    ▹ select the policy to evaluate
14:    u_t := u†_t and Π_t := Π†_t
15:  else
16:    u_t := u††_t and Π_t := Π††_t    ▹ action and policy to explore
17:  Active(u_t) := Active(u_t) \ {Π_t}
18:  δ := (δ₀/d*) · ∏_{ℓ=0}^{d(Π_t)−1} (K_ℓ)^{−N^ℓ}    ▹ ∏_{ℓ=0}^{d−1} (K_ℓ)^{N^ℓ} = # of policies of depth at most d
19:  for all child policies Π′ of Π_t do
20:    (ν(Π′), b(Π′)) := BoundValue(Π′, δ)
21:    Active(u_t) := Active(u_t) ∪ {Π′}

2.2.2 Samples and sample trees

Algorithm StOP aims to aggressively reuse every sample for each transition node and every sample for each state-action pair, in order to keep the sample complexity as low as possible. Each time the value of a partial policy is evaluated, all samples that are available for any part of it from previous rounds are reused.
That is, if m trajectories are necessary for assessing the value of some policy Π, and there are m′ complete trajectories available and m″ that end at some inner node of Π, then StOP (more precisely, another algorithm, Sample, called from StOP) samples rewards (using SampleReward) and transitions (using SampleTransition) to generate continuations for the m″ incomplete trajectories and to generate (m − m′ − m″) new trajectories, as described in Section 2.1, where

• SampleReward(s), for some action node s, samples a reward from the distribution R(x, u, ·), where u is the label of the parent of s and x is the label of the grandparent of s, and

• SampleTransition(s), for some transition node s, samples a next state from the distribution P(x, u, ·), where u is the label of s and x is the label of the parent of s.

To compensate for the sharing of the samples, the confidences of the estimates are increased, so that with probability at least (1 − δ₀), all of them are valid⁶. The samples are organized as a collection of sample trees, where a sample tree T is a (finite) subtree of Π∞ with the property that each transition node has exactly one child, and that each action node s is associated with some reward r_T(s). Note that the intersection of a policy Π and a sample tree T is always a path. Denote this path by τ(T, Π) and note that it necessarily starts from the root and ends either in a leaf or in an internal node of Π. In the former case, this path can be interpreted as a complete trajectory for Π, and in the latter case, as an initial segment. Accordingly, when the value of a new policy Π needs to be estimated/bounded, it is computed as v̂(Π) := (1/m) Σ_{i=1}^{m} v(τ(T_i, Π)) (see Algorithm 2: BoundValue), where T₁, . . . , T_m are sample trees constructed by the algorithm.
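The trajectory count m(d, δ), the maximal depth d*, and the confidence interval computed by BoundValue can be sketched numerically as follows (helper names are ours; the formulas follow the Hoeffding-based choices of Section 2.2.1):

```python
import math

def num_trajectories(d, delta, gamma):
    """m(d, delta): smallest m for which the Hoeffding radius
    (1 - gamma^d)/(1 - gamma) * sqrt(ln(1/delta)/(2m))
    drops below the depth-d truncation error gamma^d/(1 - gamma)."""
    return math.ceil((math.log(1.0 / delta) / 2.0) * ((1.0 - gamma ** d) / gamma ** d) ** 2)

def max_depth(eps, gamma):
    """d*(eps, gamma): smallest integer d with 3 * gamma^d / (1 - gamma) <= eps / 2."""
    return math.ceil(math.log(6.0 / ((1.0 - gamma) * eps)) / math.log(1.0 / gamma))

def bound_value(v_hat, d, delta, gamma, m):
    """Confidence interval (nu, b) around the empirical mean v_hat of m rollouts."""
    radius = (1.0 - gamma ** d) / (1.0 - gamma) * math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    tail = gamma ** d / (1.0 - gamma)  # value mass beyond depth d, cf. (2)
    return v_hat - radius, v_hat + tail + radius
```

With m = m(d, δ), the radius is at most the tail term, so the interval width is at most 3γ^d/(1 − γ), which is what makes d* the deepest policy the algorithm ever needs to develop.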
For terseness, these sample trees are considered to be global variables, and are constructed and maintained using algorithm Sample (Algorithm 3).

⁶In particular, the confidence is set to 1 − δ_{d(Π)} for policy Π, where δ_d = (δ₀/d*) ∏_{ℓ=0}^{d−1} (K_ℓ)^{−N^ℓ} is δ₀ divided by the number of policies of depth at most d, and by the largest possible depth—see Section 2.2.1.

Algorithm 2 BoundValue(Π, δ)
Ensure: with probability at least (1 − δ), the interval [ν(Π), b(Π)] contains v(Π)
1: m := ⌈(ln(1/δ)/2) · ((1 − γ^{d(Π)})/γ^{d(Π)})²⌉
2: Sample(Π, s₀, m)    ▹ ensure that at least m trajectories exist for Π
3: v̂(Π) := (1/m) Σ_{i=1}^{m} v(τ(T_i, Π))    ▹ empirical estimate of v(Π)
4: ν(Π) := v̂(Π) − ((1 − γ^{d(Π)})/(1 − γ)) √(ln(1/δ)/(2m))    ▹ Hoeffding bound
5: b(Π) := v̂(Π) + γ^{d(Π)}/(1 − γ) + ((1 − γ^{d(Π)})/(1 − γ)) √(ln(1/δ)/(2m))    ▹ . . . and (2)
6: return (ν(Π), b(Π))

Algorithm 3 Sample(Π, s, m)
Ensure: there are m sample trees T₁, . . . , T_m that contain a complete trajectory for Π (i.e. τ(T_i, Π) ends in a leaf of Π for i = 1, . . . , m)
1: for i := 1, . . . , m do
2:   if sample tree T_i does not yet exist then
3:     let T_i be a new sample tree of depth 0
4:   let s be the last node of τ(T_i, Π)
5:   while s is not a leaf of Π do    ▹ s is an action node
6:     let s′ be the child of s in Π and add it to T_i as a new child of s    ▹ s′ is a transition node
7:     s″ := SampleTransition(s′)
8:     add s″ to T_i as a new child of s′
9:     r_{T_i}(s″) := SampleReward(s″)
10:    s := s″

3 Analysis

Recall that v* denotes the maximal value of any (possibly infinite) policy tree. The following theorem formalizes the consistency result for StOP (see the proof in Section C).

Theorem 1. With probability at least (1 − δ₀), StOP returns an action with value at least v* − ε.

Before stating the sample complexity result, some further notation needs to be introduced. Let u* denote an optimal action available at state x₀; that is, v(u*) = v*. Define for u ≠ u*

P^ε_u := {Π : Π follows u from s₀ and v(Π) + 3γ^{d(Π)}/(1 − γ) ≥ v* − 3γ^{d(Π)}/(1 − γ) + ε},

and also define

P^ε_{u*} := {Π : Π follows u* from s₀, v(Π) + 3γ^{d(Π)}/(1 − γ) ≥ v* and v(Π) − 6γ^{d(Π)}/(1 − γ) + ε ≤ max_{u ≠ u*} v(u)}.

Then P^ε := P^ε_{u*} ∪ ⋃_{u ≠ u*} P^ε_u is the set of "important" policies that potentially need to be evaluated in order to determine an ε-optimal action. (See also Lemma 8 in the supplementary material.)

Let now p(s) denote the product of the probabilities of the transitions on the path from s₀ to s. That is, for any policy tree Π containing s, a trajectory for Π goes through s with probability p(s). When estimating the value of some policy Π of depth d, the expected number of trajectories going through some node s of it is p(s)m(d, δ_d).
The sample complexity therefore has to take into consideration, for each node s (at least for the ones with "high" p(s) value), the maximum ℓ(s) = max{d(Π) : Π ∈ P^ε contains s} of the depths of the relevant policies it is included in. Therefore, the expected number of trajectories going through s in a given run of StOP is

p(s) · m(ℓ(s), δ_{ℓ(s)}) = p(s) ⌈(ln(1/δ_{ℓ(s)})/2) · ((1 − γ^{ℓ(s)})/γ^{ℓ(s)})²⌉.   (3)

If (3) is "large" for some s, it can be used to deduce a high-confidence upper bound on the number of times s gets sampled. To this end, let S^ε denote the set of nodes of the trees in P^ε, let N^ε denote the smallest positive integer N satisfying N ≥ |{s ∈ S^ε : p(s) · m(ℓ(s), δ_{ℓ(s)}) ≥ (8/3) ln(2N/δ₀)}| (obviously N^ε ≤ |S^ε|), and define

S^{ε,*} := {s ∈ S^ε : p(s) · m(ℓ(s), δ_{ℓ(s)}) ≥ (8/3) ln(2N^ε/δ₀)}.
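The membership test defining S^{ε,*} is easy to state in code (a sketch; the helper names are ours, and the value m(ℓ(s), δ_{ℓ(s)}) is assumed to be supplied by the caller):

```python
import math

def reach_probability(path_probs):
    """p(s): product of the transition probabilities on the path from s0 to s."""
    p = 1.0
    for q in path_probs:
        p *= q
    return p

def is_important(path_probs, m_ls, n_eps, delta0):
    """Membership test for S^{eps,*}:
    p(s) * m(l(s), delta_{l(s)}) >= (8/3) * ln(2 * N^eps / delta0)."""
    return reach_probability(path_probs) * m_ls >= (8.0 / 3.0) * math.log(2.0 * n_eps / delta0)
```

Nodes failing this test are reached too rarely for their visit counts to concentrate, which is why they are handled by a separate argument.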
S^ε is the set of "important" nodes (just as P^ε is the set of "important" policies), and S^(ε,∗) consists of the important nodes that, with high probability, are not sampled more than twice as often as expected. (This high probability is 1 − δ_0/(2N^ε) per node, according to the Bernstein bound, so these upper bounds hold jointly with probability at least 1 − δ_0/2, as N^ε = |S^(ε,∗)|. See also Appendix D.)

For s′ ∈ S^ε \ S^(ε,∗), the number of times s′ gets sampled has a variance that is too high compared to its expected value (3), so in this case a different approach is needed in order to derive high-confidence upper bounds. To this end, for a transition node s, let p°(s) := p°(s, ε) := Σ{ p(s′) : s′ is a child of s with p(s′) · m(ℓ(s′), δ_ℓ(s′)) < (8/3) ln(2N^ε/δ_0) }, and define

    B(s) := B(s, ε) := 0, if p°(s) ≤ δ_0 / (2N^ε m(ℓ(s), δ_ℓ(s))),
    B(s) := B(s, ε) := max( 6 ln(2N^ε/δ_0), 2p°(s) m(ℓ(s), δ_ℓ(s)) ), otherwise.

As will be shown in the proof of Theorem 2 (in Section D), this is a high-confidence upper bound on the number of trajectories that go through some child s′ ∈ S^ε \ S^(ε,∗) of some s ∈ S^(ε,∗).

Theorem 2.
With probability at least 1 − 2δ_0, StOP outputs a policy of value at least v∗ − ε after generating at most

    Σ_{s∈S^(ε,∗)} ( 2p(s) m(ℓ(s), δ_ℓ(s)) + B(s) ) Σ_{d=d(s)+1}^{ℓ(s)} Π_{ℓ=d(s)+1}^{d} K_ℓ

samples, where d(s) = min{ d(Π) : s appears in policy Π } is the depth of node s.

Finally, the bound discussed in Section 1 is obtained by setting κ := lim sup_{ε→0} max(κ_1, κ_2), where

    κ_1 := κ_1(ε, δ_0, γ) := ( Σ_{s∈S^(ε,∗)} (ε²(1−γ)² / ln(1/δ_0)) · 2p(s) m(ℓ(s), δ_ℓ(s)) )^(1/d∗)

and

    κ_2 := κ_2(ε, δ_0, γ) := ( (ε²(1−γ)² / ln(1/δ_0)) · Σ_{s∈S^(ε,∗)} B(s) Σ_{d=d(s)}^{ℓ(s)} Π_{ℓ=d(s)}^{d} K_ℓ )^(1/d∗).

4 Efficiency

StOP, as presented in Algorithm 1, is not efficiently executable. First of all, whenever it evaluates an optimistic policy, it enumerates all of its child policies, which typically has exponential time complexity. Besides that, the sample trees are also treated in an inefficient way. An efficient version of StOP with all these issues fixed is presented in Appendix F of the supplementary material.

5 Concluding remarks

In this work, we have presented and analyzed our algorithm, StOP. To the best of our knowledge, StOP is currently the only algorithm for optimal (i.e.
closed loop) online planning with a generative model that provably benefits from local structure in both the rewards and the transition probabilities. It assumes no knowledge about this structure other than access to the generative model, and does not impose any restrictions on the system dynamics.

One should note, though, that the current version of StOP does not support domains with infinite N. The sparse sampling algorithm in [14] can easily handle such problems (at the cost of a sample complexity that is non-polynomial in 1/ε), whereas StOP has a much better sample complexity when N is finite. An interesting problem for future research is to design adaptive planning algorithms with sample complexity independent of N ([21] presents such an algorithm, but the complexity bound provided there is the same as the one in [14]).

Acknowledgments

This work was supported by the French Ministry of Higher Education and Research, and by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327 (project CompLACS). The second author acknowledges the support of the BMBF project ALICE (01IB10003B).

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[2] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2001.
[3] S. Bubeck and R. Munos. Open loop optimistic planning. In Conference on Learning Theory, 2010.
[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[5] Lucian Buşoniu and Rémi Munos. Optimistic planning for Markov decision processes.
In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS-12), pages 182–189, 2012.
[6] E. F. Camacho and C. Bordons. Model Predictive Control. Springer-Verlag, 2004.
[7] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[8] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of Computers and Games 2006. Springer-Verlag, 2006.
[9] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for reinforcement learning. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 162–169, 2003.
[10] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 3221–3229, 2012.
[11] Sylvain Gelly, Yizao Wang, Rémi Munos, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Research Report RR-6062, INRIA, 2006.
[12] Eric A. Hansen and Shlomo Zilberstein. A heuristic search algorithm for Markov decision problems. In Proceedings of the Bar-Ilan Symposium on the Foundation of Artificial Intelligence, Ramat Gan, Israel, June 1999.
[13] J.-F. Hren and R. Munos. Optimistic planning of deterministic systems. In Recent Advances in Reinforcement Learning, pages 151–164. Springer LNAI 5323, European Workshop on Reinforcement Learning, 2008.
[14] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markovian decision processes.
Machine Learning, 49:193–208, 2002.
[15] Gunnar Kedenburg, Raphael Fonteneau, and Rémi Munos. Aggregating optimistic planning trees for solving Markov decision processes. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2382–2390. Curran Associates, Inc., 2013.
[16] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML-06, number 4212 in LNCS, pages 282–293. Springer, 2006.
[17] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[18] Rémi Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.
[19] N. J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing, 1980.
[20] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[21] Thomas J. Walsh, Sergiu Goschin, and Michael L. Littman. Integrating sample-based planning and model-based reinforcement learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 612–617. AAAI Press, 2010.