{"title": "Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning", "book": "Advances in Neural Information Processing Systems", "page_first": 4680, "page_last": 4688, "abstract": "We study the sampling-based planning problem in Markov decision processes (MDPs) that we can access only through a generative model, usually referred to as Monte-Carlo planning. Our objective is to return a good estimate of the optimal value function at any state while minimizing the number of calls to the generative model, i.e. the sample complexity. We propose a new algorithm, TrailBlazer, able to handle MDPs with a finite or an infinite number of transitions from state-action to next states. TrailBlazer is an adaptive algorithm that exploits possible structures of the MDP by exploring only a subset of states reachable by following near-optimal policies. We provide bounds on its sample complexity that depend on a measure of the quantity of near-optimal states. The algorithm behavior can be considered as an extension of Monte-Carlo sampling (for estimating an expectation) to problems that alternate maximization (over actions) and expectation (over next states). Finally, another appealing feature of TrailBlazer is that it is simple to implement and computationally efficient.", "full_text": "Blazing the trails before beating the path:\nSample-ef\ufb01cient Monte-Carlo planning\n\nJean-Bastien Grill\n\nMichal Valko\n\nSequeL team, INRIA Lille - Nord Europe, France\n\njean-bastien.grill@inria.fr\n\nmichal.valko@inria.fr\n\nR\u00e9mi Munos\n\nGoogle DeepMind, UK\u2217\nmunos@google.com\n\nAbstract\n\nYou are a robot and you live in a Markov decision process (MDP) with a \ufb01nite or an\nin\ufb01nite number of transitions from state-action to next states. You got brains and so\nyou plan before you act. Luckily, your roboparents equipped you with a generative\nmodel to do some Monte-Carlo planning. The world is waiting for you and you\nhave no time to waste. 
You want your planning to be efficient. Sample-efficient.\nIndeed, you want to exploit the possible structure of the MDP by exploring only a\nsubset of states reachable by following near-optimal policies. You want guarantees\non sample complexity that depend on a measure of the quantity of near-optimal\nstates. You want something that is an extension of Monte-Carlo sampling (for\nestimating an expectation) to problems that alternate maximization (over actions)\nand expectation (over next states). But you do not want to StOP with exponential\nrunning time, you want something simple to implement and computationally\nefficient. You want it all and you want it now. You want TrailBlazer.\n\n1 Introduction\n\nWe consider the problem of sampling-based planning in a Markov decision process (MDP) when a\ngenerative model (oracle) is available. This approach, also called Monte-Carlo planning or Monte-Carlo\ntree search (see e.g., [12]), has been popularized in the game of computer Go [7, 8, 15] and\nhas shown impressive performance in many other high-dimensional control and game problems [4]. In\nthe present paper, we provide a sample complexity analysis of a new algorithm called TrailBlazer.\nOur assumption about the MDP is that we possess a generative model which can be called from any\nstate-action pair to generate rewards and transition samples. Since making a call to this generative\nmodel has a cost, be it a numerical cost expressed in CPU time (in simulated environments) or a\nfinancial cost (in real domains), our goal is to use this model as parsimoniously as possible.\nFollowing dynamic programming [2], planning can be reduced to an approximation of the (optimal)\nvalue function, defined as the maximum of the expected sum of discounted rewards: E[\u2211_{t\u22650} \u03b3^t r_t],\nwhere \u03b3 \u2208 [0, 1) is a known discount factor. 
Indeed, if an \u03b5-optimal approximation of the value\nfunction at any state-action pair is available, then the policy corresponding to selecting in each state\nthe action with the highest approximated value will be O(\u03b5/(1 \u2212 \u03b3))-optimal [3].\nConsequently, in this paper, we focus on a near-optimal approximation of the value function for\na single given state (or state-action pair). In order to assess the performance of our algorithm we\nmeasure its sample complexity, defined as the number of oracle calls, given that we guarantee its\nconsistency, i.e., that with probability at least 1 \u2212 \u03b4, TrailBlazer returns an \u03b5-approximation of the\nvalue function as required by the probably approximately correct (PAC) framework.\n\n\u2217on leave from SequeL team, INRIA Lille - Nord Europe, France\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe use a tree representation to represent the set of states that are reachable from any initial state.\nThis tree alternates maximum (MAX) nodes (corresponding to actions) and average (AVG) nodes\n(corresponding to the random transition to next states). We assume the number K of actions is finite.\nHowever, the number N of possible next states is either finite or infinite (which may be the case\nwhen the state space is infinite), and we will report results in both the finite N and the infinite case.\nThe root node of this planning tree represents the current state (or a state-action) of the MDP and its\nvalue is the maximum (over all policies defined at MAX nodes) of the corresponding expected sum of\ndiscounted rewards. Notice that by using a tree representation, we do not use the property that some\nstate of the MDP can be reached by different paths (sequences of state-action pairs). Therefore, this state\nwill be represented by different nodes in the tree. 
We could potentially merge such duplicates to form\na graph instead. However, for simplicity, we choose not to merge these duplicates and keep a tree,\nwhich could make the planning problem harder. To sum up, our goal is to return, with probability\n1 \u2212 \u03b4, an \u03b5-accurate value of the root node of this planning tree while using as few calls\nto the oracle as possible. Our contribution is an algorithm called TrailBlazer whose sampling\nstrategy depends on the specific structure of the MDP and for which we provide sample complexity\nbounds in terms of a new problem-dependent measure of the quantity of near-optimal nodes. Before\ndescribing our contribution in more detail we first relate our setting to existing work.\n\n1.1 Related work\n\nIn this section we focus on the dependency between \u03b5 and the sample complexity, and all bounds of\nthe form 1/\u03b5^c are up to a poly-logarithmic multiplicative factor not indicated for clarity. Kocsis and\nSzepesv\u00e1ri [12] introduced the UCT algorithm (upper-confidence bounds for trees). UCT is efficient\nin computer Go [7, 8, 15] and a number of other control and game problems [4]. UCT is based on\ngenerating trajectories by selecting in each MAX node the action that has the highest upper-confidence\nbound (computed according to the UCB algorithm of Auer et al. [1]). UCT converges asymptotically to\nthe optimal solution, but its sample complexity can be worse than doubly-exponential in (1/\u03b5) for\nsome MDPs [13]. One reason for this is that the algorithm can expand very deeply the apparently\nbest branches but may lack sufficient exploration, especially when a narrow optimal path is hidden in\na suboptimal branch. As a result, this approach works well in some problems with a specific structure\nbut may be much worse than uniform sampling in other problems.\nOn the other hand, a uniform planning approach is safe for all problems. Kearns et al. 
[11] generate a\nsparse look-ahead tree based on expanding all MAX nodes and sampling a finite number of children\nfrom AVG nodes up to a fixed depth that depends on the desired accuracy \u03b5. Their sample complexity\nis\u00b2 of the order of (1/\u03b5)^{log(1/\u03b5)}, which is non-polynomial in 1/\u03b5. This bound is better than that for\nUCT in a worst-case sense. However, as their look-ahead tree is built in a uniform and non-adaptive\nway, this algorithm fails to benefit from a potentially favorable structure of the MDP.\nAn improved version of this sparse-sampling algorithm by Walsh et al. [17] cuts suboptimal branches\nin an adaptive way but unfortunately does not come with an improved bound and stays non-polynomial\neven in the simple Monte Carlo setting for which K = 1.\nAlthough the sample complexity is certainly non-polynomial in the worst case, it can be polynomial\nin some specific problems. First, for the case of finite N, the sample complexity is polynomial\nand Sz\u00f6r\u00e9nyi et al. [16] show that a uniform sampling algorithm has complexity at most\n(1/\u03b5)^{2+log(KN)/log(1/\u03b3)}. Notice that the product KN represents the branching factor of the look-ahead\nplanning tree. This bound could be improved for problems with specific reward structure or\ntransition smoothness. 
In order to do this, we need to design a non-uniform, adaptive algorithm that\ncaptures the possible structure of the MDP when available, while making sure that in the worst case,\nwe do not perform worse than a uniform sampling algorithm.\nThe case of deterministic dynamics (N = 1) and rewards considered by Hren and Munos [10] has a\ncomplexity of order (1/\u03b5)^{(log \u03ba)/log(1/\u03b3)}, where \u03ba \u2208 [1, K] is the branching factor of the subset of\nnear-optimal nodes.\u00b3 The case of stochastic rewards has been considered by Bubeck and Munos [5]\nbut with the difference that the goal was not to approximate the optimal value function but the value\nof the best open-loop policy, which consists of a sequence of actions independent of states. Their\nsample complexity is (1/\u03b5)^{max(2, (log \u03ba)/log(1/\u03b3))}.\n\n\u00b2 neglecting exponential dependence in \u03b3\n\u00b3 nodes that need to be considered in order to return a near-optimal approximation of the value at the root\n\nIn the case of general MDPs, Bu\u015foniu and Munos [6] consider the case of a fully known model of\nthe MDP. For any state-action, the model returns the expected reward and the set of all next states\n(assuming N is finite) with their corresponding transition probabilities. In that case, the complexity is\n(1/\u03b5)^{(log \u03ba)/log(1/\u03b3)}, where \u03ba \u2208 [0, KN] can again be interpreted as a branching factor of the subset\nof near-optimal nodes. These approaches use the optimism in the face of uncertainty principle whose\napplications to planning have been studied by Munos [13]. TrailBlazer is different. It\nis not optimistic by design: To avoid voracious demand for samples it does not balance the upper-confidence\nbounds of all possible actions. This is crucial for polynomial sample complexity in the\ninfinite case. 
All of Section 3 shines many rays of intuitive light on this single and powerful idea.\nThe work that is most related to ours is StOP by Sz\u00f6r\u00e9nyi et al. [16] which considers the planning\nproblem in MDPs with a generative model. Their complexity bound is of the order of\n(1/\u03b5)^{2+(log \u03ba)/log(1/\u03b3)+o(1)}, where \u03ba \u2208 [0, KN] is a problem-dependent quantity. However, their \u03ba,\ndefined as lim_{\u03b5\u21920} max(\u03ba_1, \u03ba_2) (in their Theorem 2), is somewhat difficult to interpret as a measure of\nthe quantity of near-optimal nodes. Moreover, StOP is not computationally efficient as it requires\nidentifying the optimistic policy, which requires computing an upper bound on the value of any possible\npolicy, whose number is exponential in the number of MAX nodes, which itself is exponential in the\nplanning horizon. Although they suggest (in their Appendix F) a computational improvement, this\nversion is not analyzed. Finally, unlike in the present paper, StOP does not consider the case N = \u221e\nof an unbounded number of states.\n\n1.2 Our contributions\n\nOur main result is TrailBlazer, an algorithm with a bound on the number of samples required to\nreturn a high-probability \u03b5-approximation of the root node whether the number of next states N is\nfinite or infinite. The bounds use a problem-dependent quantity (\u03ba or d) that measures the quantity of\nnear-optimal nodes. We now summarize the results.\nFinite number of next states (N < \u221e): The sample complexity of TrailBlazer is of the order\nof\u2074 (1/\u03b5)^{max(2, log(N\u03ba)/log(1/\u03b3)+o(1))}, where \u03ba \u2208 [1, K] is related to the branching factor of the set\nof near-optimal nodes (precisely defined later).\nInfinite number of next states (N = \u221e): The complexity of TrailBlazer is (1/\u03b5)^{2+d}, where d\nis a measure of the difficulty to identify the near-optimal nodes. 
Notice that d can be finite even if\nthe planning problem is very challenging.\u2075 We also state our contributions in specific settings in\ncomparison to previous work.\n\n\u2022 For the case N < \u221e, we improve over the best-known previous worst-case bound with\nan exponent (to 1/\u03b5) of max(2, log(NK)/log(1/\u03b3)) instead of 2 + log(NK)/log(1/\u03b3)\nreported by Sz\u00f6r\u00e9nyi et al. [16].\n\u2022 For the case N = \u221e, we identify properties of the MDP (when d = 0) under which the\nsample complexity is of order 1/\u03b5^2. This is the case when there are non-vanishing action-gaps\u2076\nfrom any state along near-optimal policies or when the probability of transitioning to\nnodes with gap \u2206 is upper bounded by \u2206^2. This complexity bound is as good as Monte-Carlo\nsampling and for this reason TrailBlazer is a natural extension of Monte-Carlo\nsampling (where all nodes are AVG) to stochastic control problems (where MAX and AVG\nnodes alternate). Also, no previous algorithm reported a polynomial bound when N = \u221e.\n\u2022 In MDPs with deterministic transitions (N = 1) but stochastic rewards our bound is\n(1/\u03b5)^{max(2, (log \u03ba)/log(1/\u03b3))}, which is similar to the bound achieved by Bubeck and Munos [5]\nin a similar setting (open-loop policies).\n\u2022 In the evaluation case without control (K = 1) TrailBlazer behaves exactly as Monte-Carlo\nsampling (thus achieves a complexity of 1/\u03b5^2), even in the case N = \u221e.\n\u2022 Finally, TrailBlazer is easy to implement and is numerically efficient.\n\n\u2074 neglecting logarithmic terms in \u03b5 and \u03b4\n\u2075 since when N = \u221e the actual branching factor of the set of reachable nodes is infinite\n\u2076 defined as the difference in values of best and second-best actions\n\n2 Monte-Carlo planning with a generative model\nSetup We operate on a planning tree T. Each node\nof T from the root down is alternately either an\naverage (AVG) or a maximum (MAX) node. 
For any\nnode s, C[s] is the set of its children. We consider\ntrees T for which the cardinality of C[s] for any MAX\nnode s is bounded by K. The cardinality N of C[s]\nfor any AVG node s can be either finite, N < \u221e,\nor infinite. We consider both cases. TrailBlazer\napplies to both situations. We provide performance\nguarantees for a general case and possibly tighter,\nN-dependent guarantees in the case of N < \u221e. We assume that we have a generative model of the\ntransitions and rewards: Each AVG node s is associated with a transition, a random variable \u03c4_s \u2208 C[s],\nand a reward, a random variable r_s \u2208 [0, 1].\n\nFigure 1: TrailBlazer\nInput: \u03b4, \u03b5\nSet: \u03b7 \u2190 \u03b3^{1/max(2, log(1/\u03b5))}\nSet: \u03bb \u2190 2 log(\u03b5(1 \u2212 \u03b3))^2 \u00b7 log(log(K)/(1 \u2212 \u03b7)) / log(\u03b3/\u03b7)\nSet: m \u2190 (log(1/\u03b4) + \u03bb)/((1 \u2212 \u03b3)^2 \u03b5^2)\nUse: \u03b4 and \u03b7 as global parameters\nOutput: \u00b5 \u2190 call the root with parameters (m, \u03b5/2)\n\nObjective For any node s, we define the value function\nV[s] as the optimum over policies \u03c0 (giving a\nsuccessor to all MAX nodes) of the sum of discounted\nexpected rewards playing policy \u03c0,\n\nV[s] = sup_\u03c0 E[\u2211_{t\u22650} \u03b3^t r_{s_t} | s_0 = s, \u03c0],\n\nwhere \u03b3 \u2208 (0, 1) is the discount factor. If s is an AVG\nnode, V satisfies the following Bellman equation,\n\nV[s] = E[r_s] + \u03b3 \u2211_{s'\u2208C[s]} p(s'|s) V[s'].\n\nIf s is a MAX node, then V[s] = max_{s'\u2208C[s]} V[s'].\nThe planner has access to the oracle which can be\ncalled for any AVG node s to either get a reward r or a\ntransition \u03c4 which are two independent random variables\nidentically distributed as r_s and \u03c4_s respectively.\nWith the notation above, our goal is to estimate the\nvalue V[s_0] of the root node s_0 using the smallest\npossible number of oracle calls. More precisely,\ngiven any \u03b4 and \u03b5, we want to output a value \u00b5_{\u03b5,\u03b4} such\nthat P[|\u00b5_{\u03b5,\u03b4} \u2212 V[s_0]| > \u03b5] \u2264 \u03b4 using the smallest\npossible number of oracle calls n_{\u03b5,\u03b4}. The number of calls is the sample complexity of the algorithm.\n\nFigure 2: AVG node\nInput: m, \u03b5\nInitialization: {Only executed on first call}\n    SampledNodes \u2190 \u2205\n    r \u2190 0\nRun:\nif \u03b5 \u2265 1/(1 \u2212 \u03b3) then\n    Output: 0\nend if\nif |SampledNodes| > m then\n    ActiveNodes \u2190 SampledNodes(1 : m)\nelse\n    while |SampledNodes| < m do\n        \u03c4 \u2190 {new sample of next state}\n        SampledNodes.append(\u03c4)\n        r \u2190 r + [new sample of reward]\n    end while\n    ActiveNodes \u2190 SampledNodes\nend if {At this point, |ActiveNodes| = m}\nfor all unique nodes s \u2208 ActiveNodes do\n    k \u2190 #occurrences of s in ActiveNodes\n    \u03bd \u2190 call s with parameters (k, \u03b5/\u03b3)\n    \u00b5 \u2190 \u00b5 + \u03bdk/m\nend for\nOutput: \u03b3\u00b5 + r/|SampledNodes|\n\n2.1 Blazing the trails with TrailBlazer\nTo fulfill the above objective, our TrailBlazer constructs a planning tree T which is, at any\ntime, a finite subset of the potentially infinite tree. 
Only the already visited nodes are in T and\nexplicitly represented in memory. Taking the object-oriented paradigm, each node of T is a persistent\nobject with its own memory which can receive and perform calls respectively from and to other\nnodes. A node can potentially be called several times (with different parameters) during the run of\nTrailBlazer and may reuse (some of) its stored (transition and reward) samples. In particular, after\nnode s receives a call from its parent, node s may perform internal computation by calling its own\nchildren in order to return a real value to its parent.\nPseudocode of TrailBlazer is in Figure 1 along with the subroutines for MAX nodes in Figure 3 and\nAVG nodes in Figure 2. A node (MAX or AVG) is called with two parameters m and \u03b5, which represent\nsome requested properties of the returned value: m controls the desired variance and \u03b5 the desired\nmaximum bias. We now describe the MAX and AVG node subroutines.\n\nMAX nodes A MAX node s keeps a lower and an\nupper bound of its children values which with high\nprobability simultaneously hold at all times. It sequentially\ncalls its children with different parameters\nin order to get more and more precise estimates\nof their values. Whenever the upper bound of one\nchild becomes lower than the maximum lower bound,\nthis child is discarded. This process can stop in two\nways: 1) The set L of the remaining children has shrunk\nenough that there is a single child b* left. In\nthis case, s calls b* with the same parameters that s\nreceived and uses the output of b* as its own output.\n2) The precision we have on the value of the remaining\nchildren is high enough. In this case, s returns\nthe highest estimate of the children in L. Note that\nthe MAX node is eliminating actions to identify the\nbest. 
Any other best-arm identification algorithm for\nbandits can be adapted instead.\n\nFigure 3: MAX node\nInput: m, \u03b5\nL \u2190 all children of the node\n\u2113 \u2190 1\nwhile |L| > 1 and U \u2265 (1 \u2212 \u03b7)\u03b5 do\n    U \u2190 (2/(1 \u2212 \u03b3)) \u221a((log(K\u2113/(\u03b4\u03b5)) + \u03b3/(\u03b7 \u2212 \u03b3) + \u03bb + 1)/\u2113)\n    for b \u2208 L do\n        \u00b5_b \u2190 call b with (\u2113, U\u03b7/(1 \u2212 \u03b7))\n    end for\n    L \u2190 {b : \u00b5_b + 2U/(1 \u2212 \u03b7) \u2265 sup_j [\u00b5_j \u2212 2U/(1 \u2212 \u03b7)]}\n    \u2113 \u2190 \u2113 + 1\nend while\nif |L| > 1 then\n    Output: \u00b5 \u2190 max_{b\u2208L} \u00b5_b\nelse {L = {b*}}\n    b* \u2190 arg max_{b\u2208L} \u00b5_b\n    \u00b5 \u2190 call b* with (m, \u03b7\u03b5)\n    Output: \u00b5\nend if\n\nAVG nodes Every AVG node s keeps a list of all the\nchildren that it already sampled and a reward estimate r \u2208 R. Note that the list may contain the same\nchild multiple times (this is particularly true for N < \u221e). After receiving a call with parameters\n(m, \u03b5), s checks if \u03b5 \u2265 1/(1 \u2212 \u03b3). If this condition holds, then it returns zero. If not, s considers\nthe first m sampled children and potentially samples more children from the generative model if\nneeded. For every child s' in this list, s calls it with parameters (k, \u03b5/\u03b3), where k is the number of\ntimes a transition toward this child was sampled. It returns r + \u03b3\u00b5, where r is here the average of the\ncollected reward samples and \u00b5 is the average of all the children estimates.\n\nAnytime algorithm TrailBlazer is naturally anytime. It can be called with slowly decreasing \u03b5,\nsuch that m is always increased only by 1, without having to throw away any previously collected\nsamples. 
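The bookkeeping that allows this reuse can be sketched in a few lines of Python. This is only an illustration of the AVG node logic of Figure 2 under simplifying assumptions, not the authors' implementation: the class name AvgNode, the callables sample_transition and sample_reward (standing in for the generative model), and the child_value lookup (standing in for a recursive call to a child node) are all hypothetical, and counting one oracle call per transition and one per reward is an assumption of the sketch.

```python
import random
from collections import Counter


class AvgNode:
    '''Toy AVG node that keeps its samples between calls (hypothetical sketch).'''

    def __init__(self, sample_transition, sample_reward, child_value, gamma):
        self.sample_transition = sample_transition  # assumed oracle: draws a next state
        self.sample_reward = sample_reward          # assumed oracle: draws a reward in [0, 1]
        self.child_value = child_value              # stand-in for calling a child node
        self.gamma = gamma
        self.sampled_nodes = []                     # persists across calls
        self.reward_sum = 0.0
        self.oracle_calls = 0

    def call(self, m, eps):
        if eps >= 1.0 / (1.0 - self.gamma):
            return 0.0                              # requested bias is met by returning zero
        # Query the oracle only if fewer than m samples are stored.
        while len(self.sampled_nodes) < m:
            self.sampled_nodes.append(self.sample_transition())
            self.reward_sum += self.sample_reward()
            self.oracle_calls += 2
        active = self.sampled_nodes[:m]             # ActiveNodes <- SampledNodes(1 : m)
        mu = 0.0
        for child, k in Counter(active).items():
            # In the full algorithm the child would be called with (k, eps / gamma).
            mu += self.child_value(child) * k / m
        return self.gamma * mu + self.reward_sum / len(self.sampled_nodes)


rng = random.Random(0)
node = AvgNode(
    sample_transition=lambda: rng.choice(['a', 'b']),
    sample_reward=lambda: rng.random(),
    child_value={'a': 1.0, 'b': 0.0}.get,
    gamma=0.9,
)
node.call(10, 0.5)                      # first call: 10 transitions and 10 rewards are drawn
calls_after_first = node.oracle_calls
node.call(4, 0.5)                       # smaller m: the first 4 stored samples are reused
node.call(12, 0.25)                     # larger m: only 2 additional samples are drawn
```

In this sketch a call with a smaller m costs no oracle calls at all, and a call with a larger m pays only for the missing samples, which is the behaviour the anytime property relies on.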
Executing TrailBlazer with \u03b5(cid:48) and then with \u03b5 < \u03b5(cid:48) leads to the same amount of\ncomputation as immediately running TrailBlazer with \u03b5.\n\nPractical considerations The parameter \u03bb exists so the behavior depends only on the randomness\nof oracle calls and the parameters (m, \u03b5) that the node has been called with. This is a desirable\nproperty because it opens the possibility to extend the algorithm to more general settings, for instance\nif we have also MIN nodes. However, for practical purposes, we may set \u03bb = 0 and modify the\nde\ufb01nition of U in Figure 3 by replacing K with the number of oracle calls made so far globally.\n\n3 Cogs whirring behind\n\nBefore diving into the analysis we explain the ideas behind TrailBlazer and the choices made.\n\nTree-based algorithm The number of policies the planner can consider is exponential in the\nnumber of states. This leads to two major challenges. First, reducing the problem to multi-arm\nbandits on the set of the policies would hurt. When a reward is collected from a state, all the policies\nwhich could reach that state are affected. Therefore, it is useful to share the information between the\npolicies. The second challenge is computational as it is infeasible to keep all policies in memory.\nThese two problems immediately vanish with just how TrailBlazer is formulated. Contrary to\nSz\u00f6r\u00e9nyi et al. [16], we do not represent the policies explicitly or update them simultaneously to\nshare the information, but we store all the information directly in the planning tree we construct.\nIndeed, by having all the nodes being separate entities that store their own information, we can share\ninformation between policies without explicitly having to enforce it.\nWe steel ourselves for the detailed understanding with the following two arguments. 
They shed light\nfrom two different angles on the very same key point: Do not refine more paths than you need to!\n\nDelicate treatment of uncertainty First, we give intuition about the two parameters which measure\nthe requested precision of a call. The output estimate \u00b5 of any call with parameters (m, \u03b5)\nsatisfies the following property (conditioned on a high-probability event),\n\n\u2200\u03bb, E[e^{\u03bb(\u00b5 \u2212 V[s])}] \u2264 exp(\u03b1 + \u03b5|\u03bb| + \u03c3^2\u03bb^2/2), with \u03c3^2 = O(1/m) and constant \u03b1. (1)\n\nThis looks awfully like the definition of \u00b5 being uncentered sub-Gaussian, except that instead of \u03bb in\nthe exponential function, there is |\u03bb| and there is a \u03bb-independent constant \u03b1. Inequality 1 implies\nthat the absolute value of the bias of the output estimate \u00b5 is bounded by \u03b5,\n\n|E[\u00b5] \u2212 V[s]| \u2264 \u03b5.\n\nAs in the sub-Gaussian case, the quadratic term \u03c3^2\u03bb^2/2 is a variance term. Therefore, \u03b5 controls the\nmaximum bias of \u00b5 and 1/m controls its sub-variance. In some cases, getting a high-variance or\na low-variance estimate matters less as it is going to be averaged later with other independent estimates\nby an ancestor AVG node. In this case we prefer to query for a high variance rather than a low one, in\norder to decrease sample complexity.\nFrom \u03c3 and \u03b5 it is possible to deduce a confidence bound on |\u00b5 \u2212 V[s]| by typically summing\nthe bias \u03b5 and a term proportional to the standard deviation \u03c3 = O(1/\u221am). Previous approaches\n[16, 5] consider a single parameter, representing the width of this high-probability confidence interval.\nTrailBlazer is different. In TrailBlazer, the nodes can perform high-variance and low-bias\nqueries but can also query for both low-variance and low-bias. 
TrailBlazer treats these two types\nof queries differently. This is the whetstone of TrailBlazer and the reason why it is not optimistic.\n\nRefining few paths In this part we explain the condition |SampledNodes| > m in Figure 2, which\nis crucial for our approach and results. First, notice that as long as TrailBlazer encounters only AVG\nnodes, it behaves just like Monte-Carlo sampling: without the MAX nodes we would be just doing\na simple averaging of trajectories. However, when TrailBlazer encounters a MAX node it locally\nuses more samples around this MAX node, temporarily moving away from a Monte-Carlo behavior.\nThis enables TrailBlazer to compute the best action at this MAX node. Nevertheless, once this\nbest action is identified with high probability, the algorithm should behave again like Monte-Carlo\nsampling. Therefore, TrailBlazer forgets the additional nodes, sampled just because of the MAX\nnode, and only keeps in memory the first m ones. This is done with the following line in Figure 2,\n\nActiveNodes \u2190 SampledNodes(1 : m).\n\nAgain, while additional transitions were useful for some MAX node parents to decide which action\nto pick, they are discarded once this choice is made. Note that they can become useful again if an\nancestor becomes unsure about which action to pick and needs more precision to make a choice. This\nis an important difference between TrailBlazer and some previous approaches like UCT where all\nthe already sampled transitions are equally refined. This treatment enables us to provide polynomial\nbounds on the sample complexity for some special cases even in the infinite case (N = \u221e).\n\n4 TrailBlazer is good and cheap: consistency and sample complexity\n\nIn this section, we start with our consistency result, stating that TrailBlazer outputs a correct value\nin a PAC (probably approximately correct) sense. 
Later, we define a measure of the problem difficulty\nwhich we use to state our sample-complexity results. We remark that the following consistency result\nholds whether the state space is finite or infinite.\nTheorem 1. For all \u03b5 and \u03b4, the output \u00b5_{\u03b5,\u03b4} of TrailBlazer called on the root s_0 with (\u03b5, \u03b4) satisfies\n\nP[|\u00b5_{\u03b5,\u03b4} \u2212 V[s_0]| > \u03b5] < \u03b4.\n\n4.1 Definition of the problem difficulty\n\nWe now define a measure of problem difficulty that we use to provide our sample complexity\nguarantees. We define a set of near-optimal nodes such that exploring only this set is enough to\ncompute an optimal policy. Let s' be a MAX node of tree T. For any of its descendants s, let\nc_{\u2192s}(s') \u2208 C[s'] be the child of s' in the path between s' and s. For any MAX node s, we define\n\n\u2206_{\u2192s}(s') = max_{x\u2208C[s']} V[x] \u2212 V[c_{\u2192s}(s')].\n\n\u2206_{\u2192s}(s') is the difference of the sum of discounted rewards starting from s' between an agent playing\noptimally and one playing first the action toward s and then optimally.\nDefinition 1 (near-optimality). We say that a node s of depth h is near-optimal, if for any even\ndepth h',\n\n\u2206_{\u2192s}(s_{h'}) \u2264 16 \u03b3^{(h\u2212h')/2} / (\u03b3(1 \u2212 \u03b3)),\n\nwith s_{h'} the ancestor of s of even depth h'. Let N_h be the set of all near-optimal nodes of depth h.\nRemark 1. Notice that the subset of near-optimal nodes contains all required information to get\nthe value of the root. In the case N = \u221e, when p(s|s') = 0 for all s and s', then our definition of\nnear-optimal nodes leads to the smallest subset in a sense we make precise in Appendix C. We prove that\nwith probability 1 \u2212 \u03b4, TrailBlazer only explores near-optimal nodes. 
Therefore, the size of the\nsubset of near-optimal nodes directly reflects the sample complexity of TrailBlazer.\n\nIn Appendix C, we discuss the drawbacks of other potential definitions of near-optimality.\n\n4.2 Sample complexity in the finite case\n\nWe first state our result where the set of the AVG children nodes is finite and bounded by N.\nDefinition 2. We define \u03ba \u2208 [1, K] as the smallest number such that\n\n\u2203C \u2200h, |N_{2h}| \u2264 C N^h \u03ba^h.\n\nNotice that since the total number of nodes of depth 2h is bounded by (KN)^h, \u03ba is upper-bounded\nby K, the maximum number of children of a MAX node. However, \u03ba can be as low as 1 in cases when the set\nof near-optimal nodes is small.\nTheorem 2. There exists C > 0 and K such that for all \u03b5 > 0 and \u03b4 > 0, with probability 1 \u2212 \u03b4,\nthe sample-complexity of TrailBlazer (the number of calls to the generative model before the\nalgorithm terminates) is\n\nn(\u03b5, \u03b4) \u2264 C (1/\u03b5)^{max(2, log(N\u03ba)/log(1/\u03b3) + o(1))} (log(1/\u03b4) + log(1/\u03b5))^\u03b1,\n\nwhere \u03b1 = 5 when log(N\u03ba)/log(1/\u03b3) \u2265 2 and \u03b1 = 3 otherwise.\n\nThis provides a problem-dependent sample-complexity bound, which already in the worst case\n(\u03ba = K) improves over the best-known worst-case bound \u00d5((1/\u03b5)^{2+log(KN)/log(1/\u03b3)}) [16]. This\nbound gets better as \u03ba gets smaller and is minimal when \u03ba = 1. This is, for example, the case when\nthe gap (see definition given in Equation 2) at MAX nodes is uniformly lower-bounded by some \u2206 > 0.\nIn this case, this theorem provides a bound of order (1/\u03b5)^{max(2, log(N)/log(1/\u03b3))}. However, we will\nshow in Remark 2 that we can further improve this bound to (1/\u03b5)^2.\n\n4.3 Sample complexity in the infinite case\nSince the previous bound depends on N, it does not apply to the infinite case with N = \u221e. 
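To make this dependence on N concrete, the exponents can be evaluated numerically. The Python sketch below compares the worst-case exponent of Theorem 2 (taking kappa = K and dropping the o(1) term) with the exponent 2 + log(KN)/log(1/gamma) of the earlier uniform-sampling bound [16]. The function names and the values chosen for K, N and gamma are illustrative assumptions, and logarithmic factors are ignored.

```python
import math


def trailblazer_exponent(n_next, kappa, gamma):
    '''Exponent of 1/eps in Theorem 2, ignoring the o(1) term.'''
    return max(2.0, math.log(n_next * kappa) / math.log(1.0 / gamma))


def uniform_exponent(n_next, k_actions, gamma):
    '''Exponent of 1/eps in the uniform-sampling bound of [16].'''
    return 2.0 + math.log(k_actions * n_next) / math.log(1.0 / gamma)


# Worst case kappa = K = 2 with gamma = 0.5, for a growing number of next states N.
for n_next in (1, 2, 10, 100):
    ours = trailblazer_exponent(n_next, kappa=2, gamma=0.5)
    prev = uniform_exponent(n_next, k_actions=2, gamma=0.5)
    print(n_next, round(ours, 3), round(prev, 3))
```

For these values the first exponent never exceeds the second, but both grow without bound as N increases, which is why an analysis that does not go through N is needed for the infinite case.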
We now\nprovide a sample complexity result in the case N = \u221e. However, notice that when N is bounded,\nthen both results apply.\nWe first define the gap \u2206(s) for any MAX node s as the difference between the best and second best arm,\n\n\u2206(s) = V[i*] \u2212 max_{i\u2208C[s], i\u2260i*} V[i] with i* = arg max_{i\u2208C[s]} V[i]. (2)\n\nFor any even integer h, we define a random variable S_h taking values among MAX nodes of depth h,\nin the following way. First, from every AVG node from the root to nodes of depth h, we draw a single\ntransition to one of its children according to the corresponding transition probabilities. This defines\na subtree with K^{h/2} nodes of depth h and we choose S_h to be one of them uniformly at random.\nFurthermore, for any even integer h' < h we denote by S^h_{h'} the MAX node ancestor of S_h of depth h'.\n\nDefinition 3. We define d \u2265 0 as the smallest d such that for all \u03be there exists a > 0 for which for\nall even h > 0,\n\nE[ K^{h/2} 1(S_h \u2208 N_h) \u220f_{0\u2264h'<h, h'\u22610 (mod 2)} (\u03be/\u03b3^{h\u2212h'})^{1(\u2206(S^h_{h'}) < 16 \u03b3^{(h\u2212h')/2}/(\u03b3(1\u2212\u03b3)))} ] \u2264 a \u03b3^{\u2212dh}.\n\nIf no such d exists, we set d = \u221e.\nThis definition of d takes into account the size of the near-optimality set (just like \u03ba) but unlike \u03ba it\nalso takes into account the difficulty to identify the near-optimal paths.\nIntuitively, the expected number of oracle calls performed by a given AVG node s is proportional to:\n(1/\u03b5^2) \u00d7 (the product of the inverted squared gaps of the set of MAX nodes in the path from the root to\ns) \u00d7 (the probability of reaching s by following a policy which always tries to reach s).\nTherefore, a near-optimal path with a 
larger number of small MAX node gaps can be considered\ndifficult. By assigning a larger weight to difficult nodes, we are able to give a better characterization\nof the actual complexity of the problem and provide polynomial guarantees on the sample complexity\nfor N = \u221e when d is finite.\nTheorem 3. If d is finite then there exists C > 0 such that for all \u03b5 > 0 and \u03b4 > 0, the expected\nsample complexity of TrailBlazer satisfies\n\n$$\\mathbb{E}[n(\\varepsilon, \\delta)] \\le C\\, \\frac{(\\log(1/\\delta) + \\log(1/\\varepsilon))^3}{\\varepsilon^{2+d}}.$$\n\nNote that this result holds in expectation only, contrary to Theorem 2, which holds with high probability.\nWe now give an example for which d = 0, followed by a special case of it.\nLemma 1. If there exist c > 0 and b > 2 such that for any near-optimal AVG node s,\n\n$$\\mathbb{P}\\left[\\Delta(\\tau_s) \\le x\\right] \\le c\\, x^b,$$\n\nwhere the random variable $\\tau_s$ is a successor state of s drawn from the MDP\u2019s transition probabilities,\nthen d = 0 and consequently the sample complexity is of order 1/\u03b5\u00b2.\nRemark 2. If there exists $\\Delta_{\\min} > 0$ such that for any near-optimal MAX node s, $\\Delta(s) \\ge \\Delta_{\\min}$, then\nd = 0 and the sample complexity is of order 1/\u03b5\u00b2. Indeed, in this case $\\mathbb{P}[\\Delta(\\tau_s) \\le x] \\le (x/\\Delta_{\\min})^b$\nfor any b > 2, so d = 0 by Lemma 1.\n\n5 Conclusion\n\nWe provide a new Monte-Carlo planning algorithm, TrailBlazer, that works for MDPs where the\nnumber of next states N can be either finite or infinite. TrailBlazer is easy to implement and\nis numerically efficient. It comes with a PAC consistency guarantee and two problem-dependent\nsample-complexity guarantees expressed in terms of a measure (defined by \u03ba) of the quantity of\nnear-optimal nodes or a measure (defined by d) of the difficulty of identifying the near-optimal paths.\nThe sample complexity of TrailBlazer improves over previous worst-case guarantees. 
What\u2019s\nmore, TrailBlazer exploits MDPs with specific structure by exploring only a fraction of the whole\nsearch space when either \u03ba or d is small. In particular, we showed that if the near-optimal nodes\nhave non-vanishing action-gaps, then the sample complexity is \u00d5(1/\u03b5\u00b2), which is the same rate as\nMonte-Carlo sampling. This is good evidence that TrailBlazer is a natural extension of\nMonte-Carlo sampling to stochastic control problems.\n\nAcknowledgements The research presented in this paper was supported by the French Ministry of Higher Education\nand Research, the Nord-Pas-de-Calais Regional Council, a doctoral grant of \u00c9cole Normale Sup\u00e9rieure in Paris,\nthe Inria and Carnegie Mellon University associated-team project EduBand, and the French National Research Agency\nprojects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).\n\nReferences\n\n[1] Peter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235\u2013256, 2002.\n\n[2] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.\n\n[3] Dimitri Bertsekas and John Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.\n\n[4] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1\u201343, 2012.\n\n[5] S\u00e9bastien Bubeck and R\u00e9mi Munos. Open-loop optimistic planning. In Conference on Learning Theory, 2010.\n\n[6] Lucian Bu\u015foniu and R\u00e9mi Munos. Optimistic planning for Markov decision processes. In International Conference on Artificial Intelligence and Statistics, 2012.\n\n[7] R\u00e9mi Coulom. 
Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, 4630:72\u201383, 2007.\n\n[8] Sylvain Gelly, Wang Yizao, R\u00e9mi Munos, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, Inria, 2006. URL https://hal.inria.fr/inria-00117266.\n\n[9] Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Neural Information Processing Systems, 2012.\n\n[10] Jean-Francois Hren and R\u00e9mi Munos. Optimistic Planning of Deterministic Systems. In European Workshop on Reinforcement Learning, 2008.\n\n[11] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In International Conference on Artificial Intelligence and Statistics, 1999.\n\n[12] Levente Kocsis and Csaba Szepesv\u00e1ri. Bandit-based Monte-Carlo planning. In European Conference on Machine Learning, 2006.\n\n[13] R\u00e9mi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1\u2013130, 2014.\n\n[14] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Neural Information Processing Systems, 2010.\n\n[15] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[16] Bal\u00e1zs Sz\u00f6r\u00e9nyi, Gunnar Kedenburg, and R\u00e9mi Munos. Optimistic planning in Markov decision processes using a generative model. 
In Neural Information Processing Systems, 2014.\n\n[17] Thomas J. Walsh, Sergiu Goschin, and Michael L. Littman. Integrating sample-based planning and model-based reinforcement learning. In AAAI Conference on Artificial Intelligence, 2010.\n", "award": [], "sourceid": 2335, "authors": [{"given_name": "Jean-Bastien", "family_name": "Grill", "institution": "Inria Lille - Nord Europe"}, {"given_name": "Michal", "family_name": "Valko", "institution": "Inria Lille - Nord Europe"}, {"given_name": "Remi", "family_name": "Munos", "institution": "Google DeepMind"}]}