{"title": "Single-Agent Policy Tree Search With Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 3201, "page_last": 3211, "abstract": "We introduce two novel tree search algorithms that use a policy to guide\nsearch. The first algorithm is a best-first enumeration that uses a cost\nfunction that allows us to provide an upper bound on the number of nodes\nto be expanded before reaching a goal state. We show that this best-first\nalgorithm is particularly well suited for ``needle-in-a-haystack'' problems.\nThe second algorithm, which is based on sampling, provides an\nupper bound on the expected number of nodes to be expanded before\nreaching a set of goal states. We show that this algorithm is better\nsuited for problems where many paths lead to a goal. We validate these tree\nsearch algorithms on 1,000 computer-generated levels of Sokoban, where the\npolicy used to guide search comes from a neural network trained using A3C. Our\nresults show that the policy tree search algorithms we introduce are\ncompetitive with a state-of-the-art domain-independent planner that uses\nheuristic search.", "full_text": "Single-Agent Policy Tree Search With Guarantees\n\nLaurent Orseau\n\nDeepMind, London, UK\nlorseau@google.com\n\nLevi H. S. Lelis\u2217\n\nUniversidade Federal de Vi\u00e7osa, Brazil\n\nlevi.lelis@ufv.br\n\nTor Lattimore\n\nDeepMind, London, UK\nlattimore@google.com\n\nTh\u00e9ophane Weber\n\nDeepMind, London, UK\ntheophane@google.com\n\nAbstract\n\nWe introduce two novel tree search algorithms that use a policy to guide search.\nThe \ufb01rst algorithm is a best-\ufb01rst enumeration that uses a cost function that allows\nus to prove an upper bound on the number of nodes to be expanded before reaching\na goal state. We show that this best-\ufb01rst algorithm is particularly well suited for\n\u201cneedle-in-a-haystack\u201d problems. 
The second algorithm is based on sampling and\nwe prove an upper bound on the expected number of nodes it expands before\nreaching a set of goal states. We show that this algorithm is better suited for\nproblems where many paths lead to a goal. We validate these tree search algorithms\non 1,000 computer-generated levels of Sokoban, where the policy used to guide the\nsearch comes from a neural network trained using A3C. Our results show that the\npolicy tree search algorithms we introduce are competitive with a state-of-the-art\ndomain-independent planner that uses heuristic search.\n\n1\n\nIntroduction\n\nMonte-Carlo tree search (MCTS) algorithms [Coulom, 2007, Browne et al., 2012] have been recently\napplied with great success to several problems such as Go, Chess, and Shogi [Silver et al., 2016, 2017].\nSuch algorithms are well adapted to stochastic and adversarial domains, due to their sampling nature\nand the convergence guarantee to min-max values. However, the sampling procedure used in MCTS\nalgorithms is not well-suited for other kinds of problems [Nakhost, 2013], such as deterministic\nsingle-agent problems where the objective is to \ufb01nd any solution at all. In particular, if the reward is\nvery sparse\u2014for example the agent is rewarded only at the end of the task\u2014MCTS algorithms revert\nto uniform search. In practice such algorithms can be guided by a heuristic but, to the best of our\nknowledge, no bound is known that depends on the quality of the heuristic. For such cases one may\nuse instead other traditional search approaches such as A* [Hart et al., 1968] and Greedy Best-First\nSearch (GBFS) [Doran and Michie, 1966], which are guided by a heuristic cost function.\nIn this paper we tackle single-agent problems from the perspective of policy-guided search. One\nmay view policy-guided search as a special kind of heuristic search in which a policy, instead of a\nheuristic function, is provided as input to the search algorithm. 
As a policy is a probability distribution\nover sequences of actions, this allows us to provide theoretical guarantees that cannot be offered\nby value (e.g., reward-based) functions: we can bound the number of node expansions\u2014roughly\nspeaking, the search time\u2014depending on the probability of the sequences of actions that reach\nthe goal. We propose two different algorithms with different strengths and weaknesses. The \ufb01rst\nalgorithm, called LevinTS, is based on Levin search [Levin, 1973] and we derive a strict upper\nbound on the number of nodes to search before \ufb01nding the least-cost solution. The second algorithm,\n\n\u2217This work was carried out while L. H. S. Lelis was at the University of Alberta, Canada.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcalled LubyTS, is based on the scheduling of Luby et al. [1993] for randomized algorithms and\nwe prove an upper bound on the expected number of nodes to search before reaching any solution\nwhile taking advantage of the potential multiplicity of the solutions. LevinTS and LubyTS are the\n\ufb01rst policy tree search algorithms with such guarantees. Empirical results on the PSPACE-hard\ndomain of Sokoban [Culberson, 1999] show that LubyTS and in particular LevinTS guided by a\npolicy learned with A3C [Mnih et al., 2016] are competitive with a state-of-the-art planner that uses\nGBFS [Hoffmann and Nebel, 2001]. Although we focus on deterministic environments, LevinTS\nand LubyTS can be extended to stochastic environments with a known model.\nLevinTS and LubyTS bring important research areas closer together. Namely, areas that traditionally\nrely on heuristic-guided tree search with guarantees such as classical planning and areas devoted\nto learn control policies such as reinforcement learning. 
We expect future work to explore closer relations between these areas, such as the use of LevinTS and LubyTS as part of classical planning systems.

2 Notation and background

We write N1 = {1, 2, . . .}. Let S be a (possibly uncountable) set of states, and let A be a finite set of actions. The environment starts in an initial state s0 ∈ S. During an interaction step (or just step) the environment in state s ∈ S receives an action a ∈ A from the searcher and transitions deterministically according to a transition function T : S × A → S to the state s′ = T(s, a). The state of the environment after a sequence of actions a1:t is written T(a1:t), which is a shorthand for the recursive application of the transition function T from the initial state s0 to each action of a1:t, where a1:t is the sequence of actions a1, a2, . . . , at. Let S^g ⊆ S be a set of goal states. When the environment transitions to one of the goal states, the problem is solved and the interaction stops. We consider tree search algorithms and define the set of nodes in the tree as the set of sequences of actions N := A* ∪ A∞. The root node n0 is the empty sequence of actions. Hence a sequence of actions a1:t of length t is uniquely identified by a node n ∈ N and we define d0(n) = d0(a1:t) := t (the usual depth d(n) of the node is recovered with d(n) = d0(n) − 1). Several sequences of actions (hence several nodes) can lead to the same state of the environment, and we write N(s) := {n ∈ N : T(n) = s} for the set of nodes with the same state. We define the set of children C(n) of a node n ∈ N as C(n) := {na | a ∈ A}, where na denotes the sequence of actions n followed by the action a. We define the target set N^g ⊆ N as the set of nodes such that the corresponding states are goal states: N^g := {n : T(n) ∈ S^g}.
The searcher does not know the target set in advance and only recognizes a goal state when the environment transitions to one. If n1 = a1:t and n2 = a1:t at+1:k with k > t, then we say that a1:t is a prefix of a1:t at+1:k and that n1 is an ancestor of n2 (and n2 is a descendant of n1).
A search tree T ⊆ N is a set of sequences of actions (nodes) such that (i) for all nodes n ∈ T, T also contains all the ancestors of n and (ii) if n ∈ T ∩ N^g, then the tree contains no descendant of n. The leaves L(T) of the tree T are the set of nodes n ∈ T such that T contains no descendant of n. A policy π assigns probabilities to sequences of actions under the constraints that π(n0) = 1 and ∀n ∈ N, π(n) = Σ_{n′∈C(n)} π(n′). If n′ is a descendant of n, we define the conditional probability π(n′|n) := π(n′)/π(n). The policy is assumed to be provided as input to the search algorithm.
Let TS be a generic tree search algorithm defined as follows. At any expansion step k ≥ 1, let Vk be the set of nodes that have been expanded (visited) before (excluding) step k, and let the fringe set Fk := ∪_{n∈Vk} C(n) \ Vk be the set of not-yet-expanded children of expanded nodes, with V1 := ∅ and F1 := {n0}. At iteration k, the search algorithm TS chooses a node nk ∈ Fk for expansion: if nk ∈ N^g, then the algorithm terminates with success. Otherwise, Vk+1 := Vk ∪ {nk} and iteration k + 1 starts. At any expansion step, the set of expanded nodes is a search tree. Let nk be the node expanded by TS at step k. Then we define the search time N(TS, N^g) := min_{k>0} {k : nk ∈ N^g} as the number of node expansions before reaching any node of the target set N^g.
A policy is Markovian if the probability of an action depends only on the current state of the environment, that is, for all n1 and n2 with T(n1) = T(n2), ∀a ∈ A : π(a|n1) = π(a|n2).
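To make the generic TS loop concrete, here is a minimal Python sketch (an illustration under stated assumptions, not the paper's implementation): nodes are tuples of actions, the toy environment's state is the node itself, and the selection rule is best-first on a pluggable cost function; as an example we use the cost d0(n)/π(n) that Section 3 assigns to LevinTS, with a uniform policy over two actions.

```python
import heapq
import itertools

def generic_ts(cost, transition, is_goal, actions, max_expansions=100_000):
    """Generic TS: nodes are tuples of actions; repeatedly expand the
    fringe node of minimal cost until a goal state is popped.
    Returns (goal_node_or_None, number_of_expansions)."""
    root = ()
    tie = itertools.count()  # insertion-order tie-breaker for the heap
    fringe = [(cost(root), next(tie), root)]
    expansions = 0
    while fringe and expansions < max_expansions:
        _, _, n = heapq.heappop(fringe)
        expansions += 1                 # this is the search time N(TS, N^g)
        if is_goal(transition(n)):
            return n, expansions
        for a in actions:               # children C(n) = {na : a in A}
            child = n + (a,)
            heapq.heappush(fringe, (cost(child), next(tie), child))
    return None, expansions

# Toy instance (an assumption for illustration): the state IS the node,
# the single goal is the action sequence (1, 1, 1), and the policy is
# uniform over two actions, so pi(n) = 2 ** -len(n).
pi = lambda n: 0.5 ** len(n)
levin_cost = lambda n: len(n) / pi(n)   # d0(n) / pi(n), cf. Section 3

node, steps = generic_ts(levin_cost, lambda n: n,
                         lambda s: s == (1, 1, 1), actions=(0, 1))
```

With this cost, the search enumerates all nodes of cost at most d0(ng)/π(ng) = 3 · 2^3 = 24 in the worst case; here it finds the goal after 15 expansions, within the bound of Theorem 3.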
In this paper we consider both Markovian and non-Markovian policies. For some function cost : N → R over nodes, we define the cost of a state s as cost(s) := min_{n∈N(s)} cost(n). Then we say that a tree search algorithm with a cost function cost(n) expands states in best-first order if for all states s1 and s2, if cost(s1) < cost(s2), then s1 is visited before s2. We say that a state is expanded at its lowest cost if for all states s, the first node n ∈ N(s) to be expanded has cost cost(n) = cost(s).

Algorithm 1: Levin tree search.

 1  def LevinTS()
 2      V := ∅
 3      F := {n0}
 4      while F ≠ ∅
 5          n := argmin_{n∈F} d0(n)/π(n)
 6          F := F \ {n}
 7          s := T(n)
 8          if s ∈ S^g
 9              return true
10          if is_Markov(π)
11              if ∃n′ ∈ V : (T(n′) = s) ∧ (π(n′) ≥ π(n))
12                  # s has already been visited with
13                  # a higher probability: State cut
14                  continue
15          V := V ∪ {n}
16          F := F ∪ C(n)
17      return false

Algorithm 2: Sampling and execution of a single trajectory.

def sample_traj(depth)
    n := n0
    for d := 0 to depth
        if T(n) ∈ S^g
            return true
        a ∼ π(·|n)
        n := na
    return false

Figure 1: A 'chain-and-bin' tree: below the left child of the root, an infinite chain of nodes of probability 1/2 each; below the right child, an infinite binary tree in which each node splits its probability equally between its two children.

3 Levin tree search: policy-guided enumeration

First, we show that merely expanding nodes by decreasing order of their probabilities can fail to reach a goal state of non-zero probability.
Theorem 1. The version of TS that chooses at iteration k the node nk := argmax_{n∈Fk} π(n) may never expand any node of the target set N^g, even if ∀n ∈ N^g, π(n) > 0.

Proof. Consider the tree in Fig. 1.
Under the left child of the root is an infinite 'chain' in which each node has probability 1/2. Under the right child of the root is an infinite binary tree in which each node has two children, each of conditional probability 1/2, and thus each node has probability 2^{−d}. Before testing a node of depth at least 2 in the right-hand-side binary tree (with probability at most 1/4), the search expands infinitely many nodes of probability 1/2. Defining the target set as any set of nodes with individual probability at most 1/4 proves the claim.

To solve this problem, we draw inspiration from Levin search [Levin, 1973, Trakhtenbrot, 1984], which (in a different domain) penalizes the probability with computation time. Here, we take computation time to mean the depth of a node. The new Levin tree search (LevinTS) algorithm is a version of TS in which nodes are expanded in order of increasing costs d0(n)/π(n) (see Algorithm 1). LevinTS also performs state cuts (see Lines 10–15 of Algorithm 1). That is, LevinTS does not expand node n representing state s if (i) the policy π is Markovian, (ii) it has already expanded another node n′ that also represents s, and (iii) π(n′) ≥ π(n). By performing state cuts only if these three conditions are met, we can show that LevinTS expands states in best-first order.
Theorem 2. LevinTS expands states in best-first order and at their lowest cost first.

Proof. Let us first consider the case where the policy is non-Markovian. Then, LevinTS does not perform state cuts (see Line 10 of Algorithm 1). Let n1 and n2 be two arbitrary different nodes (sequences of actions), with cost(n1) < cost(n2). Let n12 be the closest common ancestor of n1 and n2; it must exist since at least the root is one of their common ancestors.
Then all nodes on the path from n12 to n1 have cost less than cost(n1), and thus less than cost(n2), due to the monotonicity of d0 and π and thus of cost, which implies by recursion from n12 that all these nodes, and thus also n1, are expanded before n2. Hence, if T(n1) = T(n2), this proves that all states are visited first at their lowest cost. Furthermore, if T(n1) ≠ T(n2), this proves that states of lower cost are visited first.
Now, if the policy is Markovian, then we need to show that state cuts do not prevent best-first order and lowest cost. Let n1 and n2 be two nodes representing the same state s, where n1 is expanded before n2. Assume that no cut has been performed before n2 is expanded. First, since no cuts were performed, we have from the non-Markovian case that d0(n1)/π(n1) ≤ d0(n2)/π(n2). Secondly, consider a sequence of actions a1:k taken after state s, and let n1k = n1 a1:k be the node reached after taking a1:k starting from n1, and similarly for n2k. Since the environment is deterministic, this sequence leads to the same state sk, whether starting from n1 or from n2. Since the policy is Markovian, π(n1k|n1) = π(n2k|n2). Then, from condition (iii) of state cuts, if π(n1) ≥ π(n2),

  d0(n1k)/π(n1k) = d0(n1)/π(n1) · 1/π(n1k|n1) + k/(π(n1) π(n1k|n1))
                 ≤ d0(n2)/π(n2) · 1/π(n1k|n1) + k/(π(n2) π(n1k|n1)) = d0(n2k)/π(n2k) ,

so the state sk has a lower or equal cost below n1 than below n2. Since this holds for any such a1:k, n2 can be safely cut, and by recurrence all cuts preserve the best-first ordering and lowest costs of states.
The rest of the proof is as in the non-Markovian case.

LevinTS's cost function allows us to provide the following guarantee, which is an adaptation of Levin search's theorem [Solomonoff, 1984] to tree search problems.
Theorem 3. Let N^g be a set of target nodes, then LevinTS with a policy π ensures that the number of node expansions N(LevinTS, N^g) before reaching any of the target nodes is bounded by

  N(LevinTS, N^g) ≤ min_{n∈N^g} d0(n)/π(n) .

Proof. From Theorem 2, the first state of S^g to be expanded is the one of lowest cost, and through one of its nodes ng of lowest cost, that is, with cost c := min_{n∈N^g} d0(n)/π(n). Let Tc be the current search tree when ng is being expanded. Then all nodes in Tc that have been expanded up to now have at most cost c. Therefore at all leaves n ∈ L(Tc) of the current search tree, d0(n)/π(n) ≤ c. Since each node is expanded at most once (each sequence of actions is tried at most once), the number of nodes expanded by LevinTS until node ng is at most

  N(LevinTS, N^g) ≤ Σ_{n∈L(Tc)} d0(n) ≤ Σ_{n∈L(Tc)} π(n) c ≤ c = min_{n∈N^g} d0(n)/π(n) ,

where the first inequality is because each leaf of depth d0 has at most d0 ancestors, the second inequality follows from d0(n)/π(n) ≤ c, and the last inequality is because Σ_{n∈L(Tc)} π(n) ≤ 1, which follows from Σ_{n′∈C(n)} π(n′) = π(n), that is, each parent node splits its probability among its children, and the root has probability 1.

The upper bound of Theorem 3 is tight within a small factor for a tree like in Fig.
1, and is almost an equality when the tree splits at the root into multiple chains.

4 Luby tree search: policy-guided unbounded sampling

Multi-sampling. When a good upper bound dmax is known on the depth of a subset of the target nodes with large cumulative probability, a simple idea is to sample trajectories according to π (see Algorithm 2) of that maximum depth dmax until a solution is found, if one exists. Call this strategy multiTS (see Algorithm 3). We can then provide the following straightforward guarantee.
Theorem 4. The expected number of node expansions before reaching a node in N^g is bounded by

  E[N(multiTS(∞, dmax), N^g)] ≤ dmax / π+_dmax ,   where π+_dmax := Σ_{n∈N^g : d0(n)≤dmax} π(n) .

Proof. Remembering that a tree search algorithm does not expand children of target nodes, the result follows from observing that E[N(multiTS, N^g)] is the expectation of a geometric distribution with success probability π+_dmax, where each failed trial takes exactly dmax node expansions and the successful trial takes at most dmax node expansions.

Algorithm 3: Sampling of nsims trajectories of fixed depth dmax ∈ N1.

def multiTS(nsims, dmax)
    for k := 1 to nsims
        if sample_traj(dmax)
            return true
    return false

Algorithm 4: Sampling of nsims trajectories of depths that follow A6519, with optional coefficient dmin ∈ N1.

def LubyTS(nsims, dmin=1)
    for k := 1 to nsims
        if sample_traj(dmin * A6519(k))
            return true
    return false

This strategy can have an important advantage over LevinTS if there are many target nodes within depth bounded by dmax with small individual probability but large cumulative probability. The drawback is that if no target node has a depth shorter than the bound dmax, this strategy will never find a solution (the expectation is infinite), even if the target nodes have high probability according to the policy π. Ensuring that such target nodes can always be found leads to the LubyTS algorithm.

LubyTS. Suppose we are given a randomized program ρ that has an unknown distribution p over halting times (where halting means solving an underlying problem). We want to define a strategy that can restart the program multiple times and run it each time with a different allowed running time, so that it halts in as little cumulative time as possible in expectation. Luby et al. [1993] prove that the optimal strategy is to run ρ for running times of a fixed length tp optimized for p; then either the program halts within tp steps, or it is forced to stop and is restarted for another tp steps, and so on. This strategy has an expected running time of ℓp, with Lp/4 ≤ ℓp ≤ Lp = min_{t∈N1} t/q(t), where q is the cumulative distribution function of p. Luby et al. [1993] also devise a universal restarting strategy based on a special sequence2 of running times:

1 1 2 1 1 2 4 1 1 2 1 1 2 4 8 1 1 2 1 1 2 4 1 1 2 1 1 2 4 8 16 1 1 2 . . .

They prove that the expected running time of this strategy is bounded by 192 ℓp (log2 ℓp + 5) and also prove a lower bound of (1/8) ℓp log2 ℓp for any universal restarting strategy. We propose to use instead the sequence3 A6519:

1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 32 1 2 . . .

which is simpler to compute and for which we can prove the following tighter upper bound.
Theorem 5. For all distributions p over halting times, the expected running time of the restarting strategy based on A6519 is bounded by min_t ( t + (t/q(t)) (log2(t/q(t)) + 6.1) ), where q is the cumulative distribution of p.

The proof is provided in Appendix B. We can easily import the strategy described above into the tree search setting (see Algorithm 4), and provide the following result.
Theorem 6.
Let N^g be the set of target nodes, then LubyTS(∞, 1) with a policy π ensures that the expected number of node expansions before reaching a target node is bounded by

  E[N(LubyTS(∞, 1), N^g)] ≤ min_{d∈N1} ( d + (d/π+_d) (log2(d/π+_d) + 6.1) ) ,   where π+_d := Σ_{n∈N^g : d0(n)≤d} π(n)

is the cumulative probability of the target nodes with depth at most d.

Proof. This is a straightforward application of Theorem 5: the randomized program samples a sequence of actions from the policy π, the running time t becomes the depth d0(n) of a node n, the probability distribution p over halting times becomes the probability of reaching a target node of depth t, p(t) = Σ_{n∈N^g : d0(n)=t} π(n), and the cumulative distribution function q becomes π+_d.

2 https://oeis.org/A182105.
3 https://oeis.org/A006519. Gary Detlefs (ibid) notes that it can be computed with A6519(n) := ((n XOR (n − 1)) + 1)/2 or with A6519(n) := n AND −n, where −n is n's two's complement.

Compared to Theorem 4, the cost of adapting to an unknown depth is an additional factor log(d/π+_d). The proof of Theorem 5 suggests that the term log d is due to not knowing the lower bound on d, and the term −log π+_d is due to not knowing the upper bound. If a good lower bound dmin on the average solution length is known, one can also multiply A6519(n) by dmin to avoid sampling trajectories that are too short, as in Algorithm 4; this may lessen the factor log d while still guaranteeing that a solution can be found if one of positive probability exists. In particular, in the tree search domain, the sequence A6519 samples trajectories of depth 1 half of the time, which is wasteful. Conversely, in general it is not possible to cap d at some upper bound, as this may prevent finding a solution as for multiTS.
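Footnote 3's bit trick and the restart schedule of Algorithm 4 are easy to check in code. The sketch below is an illustration, not the paper's implementation: the trajectory sampler is a hypothetical stub that can only succeed once its depth budget covers an assumed solution depth of 12, and then succeeds with probability 1/2 (playing the role of π+_d in Theorem 6).

```python
import random

def A6519(k):
    """Lowest set bit of k: 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 ... (footnote 3)."""
    return k & -k

def luby_ts(sample_traj, nsims, dmin=1):
    """Algorithm 4: sample trajectories whose depth budgets follow
    dmin * A6519(k), stopping at the first success."""
    for k in range(1, nsims + 1):
        if sample_traj(dmin * A6519(k)):
            return True
    return False

# Hypothetical trajectory sampler (an assumption for illustration):
# a trajectory can only reach a goal if its budget covers depth 12,
# and then does so with probability 0.5.
rng = random.Random(0)
def stub_traj(depth):
    return depth >= 12 and rng.random() < 0.5

found = luby_ts(stub_traj, nsims=256)
```

With dmin = 1, budgets of at least 12 occur only at k = 16, 32, 48, . . . , so half the trials are wasted at depth 1; passing dmin > 1, as Algorithm 4 allows, shifts the whole schedule up when a good lower bound on the solution length is known.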
Hence the factor −log π+_d remains, which is unfortunate since π+_d can easily be exponentially small with d.

5 Strengths and weaknesses of LevinTS and LubyTS

Consider a "needle-in-the-haystack" problem represented by a perfect, full and infinite binary search tree where all nodes n have probability π(n) = 2^{−d(n)}. Suppose that the set N^g of target nodes contains a single node ng at some depth d. According to Theorem 3, LevinTS needs to expand no more than d0(ng) 2^{d(ng)} nodes before expanding ng. For this particular tree, the number of expansions is closer to 2^{d(ng)+1}, since there are only at most 2^{d(ng)+1} − 1 nodes with cost lower or equal to cost(ng). Theorem 6 and the matching-order lower bound of Luby et al. [1993] suggest LubyTS may expand in expectation O(d(ng)^2 2^{d(ng)}) nodes to reach ng. This additional factor of d(ng)^2 compared to LevinTS is a non-negligible price for needle-in-a-haystack searches. For multiTS, if the depth bound dmax is larger than d0(ng), then the expected search time is at most and close to dmax 2^{d(ng)}, which is a factor d(ng) faster than LubyTS, unless dmax ≫ d(ng).
Now suppose that the set of target nodes is composed of 2^{d−1} nodes, all at depth d. Since all nodes at a given depth have the same probability, LevinTS will expand at least 2^d and at most 2^{d+1} nodes before expanding any of the target nodes. By contrast, because the cumulative probability of the target nodes at depth d is 1/2, LubyTS finds a solution in O(d log d) node expansions, which is an exponential gain over LevinTS. For multiTS it would be dmax, which can be worse than d log d due to the need for a large enough dmax.
LevinTS can perform state cuts if the policy is Markovian, which can substantially reduce the algorithm's search effort.
For example, suppose that in the binary tree above every left child represents the same state as the root and thus is cut off from the search tree, leaving in effect only 2d nodes up to any depth d. If the target set contains only one node at some depth d, even when following a uniform policy, LevinTS expands only those 2d nodes. By contrast, LubyTS expands in expectation more than 2^d nodes. LevinTS has a memory requirement that grows linearly with the number of nodes expanded, as well as a log factor in the computation time due to the need to maintain a priority queue to sort the nodes by cost. By contrast, LubyTS and multiTS have a memory requirement that grows linearly with the solution depth, as they only need to store in memory the trajectory sampled. LevinTS's memory cost could be alleviated with an iterative deepening [Korf, 1985] variant with a transposition table [Reinefeld and Marsland, 1994].

6 Mixing policies and avoiding zero probabilities

For both LevinTS and LubyTS, if the provided policy π incorrectly assigns a probability too close to 0 to some sequences of actions, then the algorithm may never find the solution. To mitigate such outcomes, it is possible to 'mix' the policy with the uniform policy so that the former behaves slightly more like the latter. There are several ways to achieve this, each with its own pros and cons.

Bayes mixing of policies. If π1 and π2 are two policies, we can build their Bayes average π12 with prior weights α ∈ [0, 1] and 1 − α, such that for all sequences of actions a1:t, π12(a1:t) = α π1(a1:t) + (1 − α) π2(a1:t).
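As a minimal sketch, the Bayes average can be implemented directly from its definition; the two toy conditional policies and their names below are assumptions for illustration only. The per-step conditional of the mixture is the ratio of consecutive sequence probabilities, and since π12(n) ≥ α π1(n) for every node n, a cost of the form d0(n)/π12(n) exceeds d0(n)/π1(n) by at most a factor 1/α.

```python
def seq_prob(cond, seq):
    """Probability of the action sequence a1:t as the product of the
    per-step conditionals cond(a, prefix)."""
    p = 1.0
    for t in range(len(seq)):
        p *= cond(seq[t], seq[:t])
    return p

def bayes_mix(cond1, cond2, alpha):
    """pi12(a1:t) = alpha*pi1(a1:t) + (1 - alpha)*pi2(a1:t); the mixture's
    conditional is the ratio pi12(a1:t) / pi12(a1:t-1)."""
    def mix_seq(seq):
        return alpha * seq_prob(cond1, seq) + (1 - alpha) * seq_prob(cond2, seq)
    def mix_cond(a, prefix):
        return mix_seq(prefix + (a,)) / mix_seq(prefix)
    return mix_seq, mix_cond

# Two hypothetical Markovian policies over actions {0, 1}.
cond1 = lambda a, prefix: 0.9 if a == 0 else 0.1
cond2 = lambda a, prefix: 0.5
mix_seq, mix_cond = bayes_mix(cond1, cond2, alpha=0.5)

# Chain rule consistency: multiplying the mixed conditionals along a
# sequence recovers the mixture probability of that sequence.
seq = (0, 0, 1)
chain = 1.0
for t in range(len(seq)):
    chain *= mix_cond(seq[t], seq[:t])
```

Note that the conditional of the mixture is not simply the α-average of the two conditionals: the effective weight on each policy shifts towards whichever policy better predicted the prefix so far.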
The conditional probability of the next action is then the ratio π12(at|a1:t−1) = π12(a1:t)/π12(a1:t−1) = (α π1(a1:t) + (1 − α) π2(a1:t)) / (α π1(a1:t−1) + (1 − α) π2(a1:t−1)), which can be computed online as the sequence is extended. One can instead mix the conditional probabilities at each step, π̃(at|a1:t−1) := εt π1(at|a1:t−1) + (1 − εt) π2(at|a1:t−1): with a constant εt = ε, the price for following π2 during t steps is at most (1 − ε)^{−t}, which remains small as long as t < 1/ε. If no good bound on d is known, one can use a more adaptive coefficient εt := 1 − (t/(t + 1))^γ for some γ > 0, so that Π_{k=1}^{t} (1 − εk) = Π_{k=1}^{t} (k/(k + 1))^γ = 1/(t + 1)^γ, which means that the maximum price to pay to use only the policy π2 for all the t steps is at most (t + 1)^γ, and the price to pay each step the policy π1 is used is approximately (t + 1)/γ. The optimal value of ε can also be learned automatically using an algorithm such as Soft-Bayes [Orseau et al., 2017] where the 'experts' are the provided policies, but this may have a large probability overhead for this setup.

7 Experiments: computer-generated Sokoban

We test our algorithms on 1,000 computer-generated levels of Sokoban [Racanière et al., 2017] of 10x10 grid cells and 4 boxes.4 For the policy, we use a neural network pre-trained with A3C (details on the architecture and the learning procedure are in Appendix A). We picked the best performing network out of 4 runs with different learning rates. Once the network is trained, we compare the different algorithms using the same network's fixed Markovian policy. Note that for each new level, the goal states (and thus the target set) are different, whereas the policy does not change (but still depends on the state). We test the following algorithms and parameters: LubyTS(256, 1), LubyTS(256, 32), LubyTS(512, 32), multiTS(1, 200), multiTS(100, 200), multiTS(200, 200), LevinTS. Excluding the small values (i.e., nsims = 1 and dmin = 1), the parameters were chosen to obtain a total number of expansions within the same order of magnitude. In addition to the policy trained with A3C, we tested LevinTS, LubyTS, and multiTS with a variant of the policy in which we add 1% of noise to the probabilities output by the neural network.
That is, these variants use the policy π̃(a|n) = (1 − ε)π(a|n) + ε/4, where π is the network's policy and ε = 0.01, to guide their search. These variants are marked with the symbol (*) in the table of results. We compare our policy tree search methods with a version of the LAMA planner [Richter and Westphal, 2010] that uses the lazy version of GBFS with preferred operators and queue alternation with the FF heuristic.

4 The levels are available at https://github.com/deepmind/boxoban-levels/unfiltered/test.

Table 1: Comparison of different solvers on the 1000 computer-generated levels of Sokoban. For randomized solvers (shown at the top part of the table), the results are aggregated over 5 random seeds (± indicates standard deviation). (*) Uses π̃ with ε = 0.01.

Algorithm              Solved     Avg. length  Max. length        Total expansions
Uniform                88         19           59                 94,423,278
LubyTS(256, 1)         753 ± 5    41.0 ± 0.6   228 ± 18.6         638,481 ± 2,434
LubyTS(256, 32)        870 ± 2    48.4 ± 0.9   1,638.4 ± 540.7    6,246,293 ± 73,382
LubyTS(512, 32)        884 ± 4    54.8 ± 4.2   3,266.6 ± 1,287.8  11,515,937 ± 211,524
LubyTS(512, 32) (*)    896 ± 2    50.7 ± 2.5   1,975.6 ± 904.5    10,730,753 ± 164,410
MultiTS(1, 200)        669 ± 5    41.3 ± 0.6   196.4 ± 2.2        93,768 ± 925
MultiTS(100, 200)      866 ± 4    47.8 ± 0.5   199.4 ± 0.5        3,260,536 ± 57,185
MultiTS(200, 200)      881 ± 1    47.9 ± 0.7   196.4 ± 2.3        5,768,680 ± 116,152
MultiTS(200, 200) (*)  895 ± 3    48.8 ± 0.4   198.8 ± 1          5,389,534 ± 45,085
LevinTS                1,000      39.8         106                6,602,666
LevinTS (*)            1,000      39.5         106                5,026,200
LAMA                   1,000      51.6         185                3,151,325

Figure 3: Node expansions for Sokoban on log-scale. The level indices (x-axis) are sorted independently for each solver from the easiest to the hardest level.
For clarity a typical run has been chosen for randomized solvers; see Table 1 for standard deviations.

This version of LAMA is implemented in Fast Downward [Helmert, 2006], a domain-independent solver. We used this version of LAMA because it was shown to perform better than other state-of-the-art planners on Sokoban problems [Xie et al., 2012]. Moreover, similarly to our methods, LAMA searches for a solution of small depth rather than a solution of minimal depth.
Table 1 presents the number of levels solved ("Solved"), average solution length ("Avg. length"), longest solution length ("Max. length"), and total number of nodes expanded ("Total expansions"). The top part of the table shows the sampling-based randomized algorithms. In addition to the average values, we present the standard deviation of five independent runs of these algorithms. Since LevinTS and LAMA are deterministic, we present a single run of these approaches. Fig. 3 shows the number of nodes expanded per level by each method when the levels are independently sorted for each approach from the easiest to the hardest Sokoban level in terms of node expansions. The Uniform searcher (LevinTS with a uniform policy) with a maximum of 100,000 node expansions per level, and still with state cuts, can solve no more than 9% of the levels, which shows that the problem is not trivial. For most of the levels, LevinTS (with the A3C policy) expands many fewer nodes than LAMA, but has to expand many more nodes on the last few levels. On 998 instances, the cumulative number of expansions taken by LevinTS is ~2.7e6 nodes while LAMA expands ~3.1e6 nodes. These numbers contrast with the number of expansions required by LevinTS (6.6e6) and LAMA (3.15e6) to solve all 1,000 levels.
The addition of noise to the policy reduces the number of nodes expanded by LevinTS while solving harder instances, at the cost of increasing the number of nodes expanded for easier problems (see the lines of the two versions of LevinTS crossing at the right-hand side of Fig. 3). Overall, noise reduces the total number of nodes LevinTS expands from 6.6e6 to 5e6 (see Table 1). LevinTS has to expand a large number of nodes for a small number of levels, likely due to the training procedure used to derive the policy. That is, the policy is learned only from the 65% easiest levels, solved after sampling single trajectories; harder levels are never solved during training. Nevertheless, LevinTS can still solve harder instances by compensating for the lack of policy guidance with search.
The sampling-based methods have a hard time reaching 90% success, but still improve by more than 20% over sampling a single trajectory. LubyTS(256, 32) improves substantially over LubyTS(256, 1) since many solutions have length around 30 steps. LubyTS(256, 32) is as good as MultiTS(100, 200), which uses a hand-tuned upper bound on the length of the solutions.
The solutions found by LevinTS are noticeably shorter (in terms of number of moves) than those found by LAMA. It is remarkable that LevinTS can find shorter solutions and expand fewer nodes than LAMA for most of the levels. This is likely due to the combination of good search guidance through the policy for most of the problems and LevinTS's systematic search procedure. By contrast, due to its sampling-based approach, LubyTS tends to find very long solutions.
Racanière et al. [2017] report different neural-network-based solvers applied to a long sequence of Sokoban levels generated by the same system used in our experiments (although we use a different random seed to generate the levels, we believe they are of the same complexity).
Racanière et al.'s primary goal was not to produce an efficient solver per se, but to demonstrate how an integrated neural-based learning and planning system can be robust to model errors and more efficient than an MCTS baseline. Their MCTS approach solves 87% of the levels within approximately 30e6 node expansions (25,000 per level for 870 levels, and 500 simulations of 120 steps for the remaining 130 levels). Although LevinTS had much stronger results in our experiments, we note that Racanière et al.'s implementation of MCTS commits to an action every 500 node expansions. By contrast, in our experimental setup, we assume that LevinTS solves the problem before committing to an action. This difference makes the results not directly comparable. Racanière et al.'s second solver (I2A) is a hybrid model-free and model-based planner that uses an LSTM-based recurrent neural network with more learning components than our approaches. I2A reaches 95% success within an estimated total of 5.3e6 node expansions (4,000 on average over 950 levels, and 30,000 steps for each of the remaining 50 unsolved levels; this counts the internal planning steps). For comparison, LevinTS with 1% noise solves all the levels within the same total number of node expansions (999 levels for LevinTS without noise). Moreover, LevinTS solves 95% of the levels within a total of fewer than 160,000 node expansions, which is approximately 168 on average per solved level, compared to the reported 4,000 for I2A. It is also not clear how long it would take I2A to solve the remaining 5%.

8 Conclusions and future works

We introduced two novel tree search algorithms for single-agent problems that are guided by a policy: LevinTS and LubyTS. Both algorithms have guarantees on the number of nodes that they expand before reaching a solution (strictly for LevinTS, in expectation for LubyTS).
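As background for LubyTS's in-expectation guarantee: its restart schedule is based on the universal restart sequence of Luby et al. [1993]. A minimal sketch of that sequence (1, 1, 2, 1, 1, 2, 4, ...), assuming the standard recursive definition; how LubyTS scales these lengths is described in the algorithm's presentation earlier in the paper, not reproduced here:

```python
def luby(i):
    """i-th term (1-indexed) of the Luby restart sequence: 1, 1, 2, 1, 1, 2, 4, ..."""
    k = 1
    while (1 << k) - 1 < i:      # find the smallest k with 2^k - 1 >= i
        k += 1
    if i == (1 << k) - 1:        # end of a block: emit 2^(k-1)
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise recurse into the repeated prefix

print([luby(i) for i in range(1, 16)])
# [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Each power of two reappears infinitely often, which is what makes restart schedules built on this sequence optimal up to a logarithmic factor for Las Vegas algorithms.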
LevinTS and LubyTS depart from the traditional heuristic approach to tree search by employing a policy instead of a heuristic function to guide the search, while still offering important guarantees.
The results on the computer-generated Sokoban problems using a pre-trained neural network show that, through tree search, these algorithms can largely improve upon the score the network achieves during training. Our results also showed that LevinTS is able to solve most of the levels used in our experiment while expanding many fewer nodes than a state-of-the-art heuristic search planner. In addition, LevinTS was able to find considerably shorter solutions than the planner.
The policy can be learned by various means or it can even be handcrafted. In this paper we used reinforcement learning to learn the policy. However, the bounds offered by the algorithms could also serve directly as metrics to be optimized while learning a policy; this is a research direction we are interested in investigating in future work.

Acknowledgements  The authors wish to thank Peter Sunehag, Andras Gyorgy, Rémi Munos, Joel Veness, Arthur Guez, Marc Lanctot, André Grahl Pereira, and Michael Bowling for helpful discussions pertaining to this research. Financial support for this research was in part provided by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, pages 72–83. Springer Berlin Heidelberg, 2007.

J. C. Culberson. Sokoban is PSPACE-complete. In Fun With Algorithms, pages 65–76, 1999.

J. E. Doran and D. Michie.
Experiments with the graph traverser program. In Proceedings of the Royal Society of London A, volume 294, pages 235–259. The Royal Society, 1966.

P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100–107, 1968.

M. Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191–246, 2006.

J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.

R. E. Korf. Depth-first iterative-deepening. Artificial Intelligence, 27(1):97–109, 1985.

L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.

M. Luby, A. Sinclair, and D. Zuckerman. Optimal speedup of Las Vegas algorithms. Information Processing Letters, 47(4):173–180, 1993.

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1928–1937. PMLR, 2016.

H. Nakhost. Random Walk Planning: Theory, Practice, and Application. PhD thesis, University of Alberta, 2013.

L. Orseau, T. Lattimore, and S. Legg. Soft-Bayes: Prod for mixtures of experts with log-loss. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, volume 76 of Proceedings of Machine Learning Research, pages 372–399. PMLR, 2017.

S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. Jimenez Rezende, A. Puigdomènech Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Hassabis, D. Silver, and D. Wierstra. Imagination-augmented agents for deep reinforcement learning.
In Advances in Neural Information Processing Systems 30, pages 5690–5701, 2017.

A. Reinefeld and T. A. Marsland. Enhanced iterative-deepening search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7):701–710, 1994.

S. Richter and M. Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks. Journal of Artificial Intelligence Research, 39(1):127–177, 2010.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017. URL http://arxiv.org/abs/1712.01815.

R. J. Solomonoff. Optimum sequential search. Oxbridge Research, 1984.

T. Tieleman and G. Hinton. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

B. A. Trakhtenbrot. A survey of Russian approaches to Perebor (brute-force searches) algorithms. Annals of the History of Computing, 6(4):384–400, 1984.

F. Xie, H. Nakhost, and M. Müller. Planning via random walk-driven local search.
In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, pages 315–322, 2012.