{"title": "Monte-Carlo Planning in Large POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2164, "page_last": 2172, "abstract": "This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, Rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10x10 Battleship and Partially Observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.", "full_text": "Monte-Carlo Planning in Large POMDPs\n\nDavid Silver\n\nMIT, Cambridge, MA 02139\ndavidstarsilver@gmail.com\n\nJoel Veness\n\nUNSW, Sydney, Australia\n\njveness@gmail.com\n\nAbstract\n\nThis paper introduces a Monte-Carlo algorithm for online planning in large\nPOMDPs. The algorithm combines a Monte-Carlo update of the agent\u2019s\nbelief state with a Monte-Carlo tree search from the current belief state.\nThe new algorithm, POMCP, has two important properties. First, Monte-\nCarlo sampling is used to break the curse of dimensionality both during\nbelief state updates and during planning. 
Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.

1 Introduction

Monte-Carlo tree search (MCTS) is a new approach to online planning that has provided exceptional performance in large, fully observable domains. It has outperformed previous planning approaches in challenging games such as Go [5], Amazons [10] and General Game Playing [4]. The key idea is to evaluate each state in a search tree by the average outcome of simulations from that state. MCTS provides several major advantages over traditional search methods. It is a highly selective, best-first search that quickly focuses on the most promising regions of the search space. It breaks the curse of dimensionality by sampling state transitions instead of considering all possible state transitions. It only requires a black box simulator, and can be applied in problems that are too large or too complex to represent with explicit probability distributions. It uses random simulations to estimate the potential for long-term reward, so that it plans over large horizons, and is often effective without any search heuristics or prior domain knowledge [8].
If exploration is controlled appropriately then MCTS converges to the optimal policy. In addition, it is anytime, computationally efficient, and highly parallelisable.

In this paper we extend MCTS to partially observable environments (POMDPs). Full-width planning algorithms, such as value iteration [6], scale poorly for two reasons, sometimes referred to as the curse of dimensionality and the curse of history [12]. In a problem with n states, value iteration reasons about an n-dimensional belief state. Furthermore, the number of histories that it must evaluate is exponential in the horizon. The basic idea of our approach is to use Monte-Carlo sampling to break both curses, by sampling start states from the belief state, and by sampling histories using a black box simulator.

Our search algorithm constructs, online, a search tree of histories. Each node of the search tree estimates the value of a history by Monte-Carlo simulation. For each simulation, the start state is sampled from the current belief state, and state transitions and observations are sampled from a black box simulator. We show that if the belief state is correct, then this simple procedure converges to the optimal policy for any finite horizon POMDP. In practice we can execute hundreds of thousands of simulations per second, which allows us to construct extensive search trees that cover many possible contingencies. In addition, Monte-Carlo simulation can be used to update the agent's belief state. As the search tree is constructed, we store the set of sample states encountered by the black box simulator in each node of the search tree. We approximate the belief state by the set of sample states corresponding to the actual history.
Our algorithm, Partially Observable Monte-Carlo Planning (POMCP), efficiently uses the same set of Monte-Carlo simulations for both tree search and belief state updates.

2 Background

2.1 POMDPs

In a Markov decision process (MDP) the environment's dynamics are fully determined by its current state s_t. For any state s ∈ S and any action a ∈ A, the transition probabilities P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a) determine the next-state distribution, and the reward function R^a_s = E[r_{t+1} | s_t = s, a_t = a] determines the expected reward. In a partially observable Markov decision process (POMDP), the state cannot be directly observed by the agent. Instead, the agent receives an observation o ∈ O, determined by observation probabilities Z^a_{s'o} = Pr(o_{t+1} = o | s_{t+1} = s', a_t = a). The initial state s_0 ∈ S is determined by a probability distribution I_s = Pr(s_0 = s). A history is a sequence of actions and observations, h_t = {a_1, o_1, ..., a_t, o_t} or h_t a_{t+1} = {a_1, o_1, ..., a_t, o_t, a_{t+1}}. The agent's action-selection behaviour can be described by a policy, π(h, a), that maps a history h to a probability distribution over actions, π(h, a) = Pr(a_{t+1} = a | h_t = h). The return R_t = Σ_{k=t}^∞ γ^{k−t} r_k is the total discounted reward accumulated from time t onwards, where γ is a discount factor specified by the environment. The value function V^π(h) is the expected return when following policy π, V^π(h) = E_π[R_t | h_t = h]. The optimal value function is the maximum value function achievable by any policy, V*(h) = max_π V^π(h). In any POMDP there is at least one optimal policy π*(h, a) that achieves the optimal value function. The belief state is the probability distribution over states given history h, B(s, h) = Pr(s_t = s | h_t = h).

2.2 Online Planning in POMDPs

Online POMDP planners use forward search, from the current history or belief state, to form a local approximation to the optimal value function. The majority of online planners are based on point-based value iteration [12, 13]. These algorithms use an explicit model of the POMDP probability distributions, M = ⟨P, R, Z, I⟩. They construct a search tree of belief states, using a heuristic best-first expansion procedure. Each value in the search tree is updated by a full-width computation that takes account of all possible actions, observations and next states. This approach can be combined with an offline planning method to produce a branch-and-bound procedure [13]. Upper or lower bounds on the value function are computed offline, and are propagated up the tree during search. If the POMDP is small, or can be factored into a compact representation, then full-width planning with explicit models can be very effective.

Monte-Carlo planning is a very different paradigm for online planning in POMDPs [2, 7]. The agent uses a simulator G as a generative model of the POMDP. Given a state and action, the simulator provides a sample of a successor state, observation and reward, (s_{t+1}, o_{t+1}, r_{t+1}) ~ G(s_t, a_t), and it can also be reset to a start state s. The simulator is used to generate sequences of states, observations and rewards. These simulations are used to update the value function, without ever looking inside the black box describing the model's dynamics. In addition, Monte-Carlo methods have a sample complexity that is determined only by the underlying difficulty of the POMDP, rather than by the size of the state space or observation space [7].
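The generative-model interface G(s, a) → (s', o, r) described above can be sketched in a few lines. The two-door "tiger"-style dynamics below are a hypothetical toy example chosen only to illustrate the interface; the class name, the accuracy parameter and the reward values are illustrative assumptions, not part of the paper:

```python
import random

# A minimal sketch of the black-box generative interface G described above:
# given a state and an action, return a sampled (next state, observation, reward).
# The concrete dynamics are a hypothetical two-state "tiger"-style toy POMDP.

class ToyPOMDPSimulator:
    LISTEN, OPEN_LEFT, OPEN_RIGHT = 0, 1, 2

    def __init__(self, accuracy=0.85, seed=None):
        self.accuracy = accuracy          # P(correct observation | listen)
        self.rng = random.Random(seed)

    def sample_initial_state(self):
        """Sample s0 ~ I: the tiger is behind the left (0) or right (1) door."""
        return self.rng.randint(0, 1)

    def step(self, state, action):
        """The generative model G(s, a) -> (s', o, r)."""
        if action == self.LISTEN:
            # Noisy observation of the hidden state; the state is unchanged.
            correct = self.rng.random() < self.accuracy
            obs = state if correct else 1 - state
            return state, obs, -1.0
        # Opening a door ends the episode; the hidden state is re-sampled.
        reward = -100.0 if (action - 1) == state else 10.0
        return self.sample_initial_state(), 2, reward   # observation 2 = "reset"

sim = ToyPOMDPSimulator(seed=0)
s = sim.sample_initial_state()
s2, o, r = sim.step(s, ToyPOMDPSimulator.LISTEN)
```

A planner that only calls `sample_initial_state` and `step` never needs the explicit distributions P, Z, I, which is exactly the black-box property the text relies on.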
In principle, this makes them an appealing choice for large POMDPs. However, prior Monte-Carlo planners have been limited to fixed horizon, depth-first search [7] (also known as sparse sampling), or to simple rollout methods with no search tree [2], and have not so far proven to be competitive with best-first, full-width planning methods.

2.3 Rollouts

In fully observable MDPs, Monte-Carlo simulation provides a simple method for evaluating a state s. Sequences of states are generated by an MDP simulator, starting from s and using a random rollout policy, until a terminal state or discount horizon is reached. The value of state s is estimated by the mean return of N simulations from s, V(s) = (1/N) Σ_{i=1}^N R_i, where R_i is the return from the beginning of the ith simulation. Monte-Carlo simulation can be turned into a simple control algorithm by evaluating all legal actions and selecting the action with highest evaluation [15]. Monte-Carlo simulation can be extended to partially observable MDPs [2] by using a history-based rollout policy π_rollout(h, a). To evaluate candidate action a in history h, simulations are generated from ha using a POMDP simulator and the rollout policy. The value of ha is estimated by the mean return of N simulations from ha.

2.4 Monte-Carlo Tree Search

Monte-Carlo tree search [3] uses Monte-Carlo simulation to evaluate the nodes of a search tree in a sequentially best-first order. There is one node in the tree for each state s, containing a value Q(s, a) and a visitation count N(s, a) for each action a, and an overall count N(s) = Σ_a N(s, a). Each node is initialised to Q(s, a) = 0, N(s, a) = 0. The value Q(s, a) is estimated by the mean return from s of all simulations in which action a was selected from state s.
Each simulation starts from the current state s_t, and is divided into two stages: a tree policy that is used while within the search tree; and a rollout policy that is used once simulations leave the scope of the search tree. The simplest version of MCTS uses a greedy tree policy during the first stage, which selects the action with the highest value; and a uniform random rollout policy during the second stage. After each simulation, one new node is added to the search tree, containing the first state visited in the second stage. The UCT algorithm [8] improves the greedy action selection in MCTS. Each state of the search tree is viewed as a multi-armed bandit, and actions are chosen by using the UCB1 algorithm [1]. The value of an action is augmented by an exploration bonus that is highest for rarely tried actions, Q⊕(s, a) = Q(s, a) + c √(log N(s) / N(s, a)). The scalar constant c determines the relative ratio of exploration to exploitation; when c = 0 the UCT algorithm acts greedily within the tree. Once all actions from state s are represented in the search tree, the tree policy selects the action maximising the augmented action-value, argmax_a Q⊕(s, a). Otherwise, the rollout policy is used to select actions. For suitable choice of c, the value function constructed by UCT converges in probability to the optimal value function, Q(s, a) →p Q*(s, a), ∀s ∈ S, a ∈ A [8]. UCT can be extended to use domain knowledge, for example heuristic knowledge or a value function computed offline [5]. New nodes are initialised using this knowledge, Q(s, a) = Q_init(s, a), N(s, a) = N_init, where Q_init(s, a) is an action value function and N_init indicates its quality. Domain knowledge narrowly focuses the search on promising states without altering asymptotic convergence.

3 Monte-Carlo Planning in POMDPs

Partially Observable Monte-Carlo Planning (POMCP) consists of a UCT search that selects actions at each time-step; and a particle filter that updates the agent's belief state.

3.1 Partially Observable UCT (PO-UCT)

We extend the UCT algorithm to partially observable environments by using a search tree of histories instead of states. The tree contains a node T(h) = ⟨N(h), V(h)⟩ for each represented history h. N(h) counts the number of times that history h has been visited. V(h) is the value of history h, estimated by the mean return of all simulations starting with h. New nodes are initialised to ⟨V_init(h), N_init(h)⟩ if domain knowledge is available, and to ⟨0, 0⟩ otherwise. We assume for now that the belief state B(s, h) is known exactly. Each simulation starts in an initial state that is sampled from B(·, h_t). As in the fully observable algorithm, the simulations are divided into two stages. In the first stage of simulation, when child nodes exist for all children, actions are selected by UCB1, V⊕(ha) = V(ha) + c √(log N(h) / N(ha)). Actions are then selected to maximise this augmented value, argmax_a V⊕(ha). In the second stage of simulation, actions are selected by a history-based rollout policy π_rollout(h, a) (e.g. uniform random action selection). After each simulation, precisely one new node is added to the tree, corresponding to the first new history encountered during that simulation.

Figure 1: An illustration of POMCP in an environment with 2 actions, 2 observations, 50 states, and no intermediate rewards. The agent constructs a search tree from multiple simulations, and evaluates each history by its mean return (left). The agent uses the search tree to select a real action a, and observes a real observation o (middle). The agent then prunes the tree and begins a new search from the updated history hao (right).

3.2 Monte-Carlo Belief State Updates

In small state spaces, the belief state can be updated exactly by Bayes' theorem, B(s', hao) = Σ_{s∈S} Z^a_{s'o} P^a_{ss'} B(s, h) / Σ_{s''∈S} Σ_{s∈S} Z^a_{s''o} P^a_{ss''} B(s, h). The majority of POMDP planning methods operate in this manner [13]. However, in large state spaces even a single Bayes update may be computationally infeasible. Furthermore, a compact representation of the transition or observation probabilities may not be available. To plan efficiently in large POMDPs, we approximate the belief state using an unweighted particle filter, and use a Monte-Carlo procedure to update particles based on sample observations, rewards, and state transitions. Although weighted particle filters are used widely to represent belief states, an unweighted particle filter can be implemented particularly efficiently with a black box simulator, without requiring an explicit model of the POMDP, and provides excellent scalability to larger problems.

We approximate the belief state for history h_t by K particles, B^i_t ∈ S, 1 ≤ i ≤ K. Each particle corresponds to a sample state, and the belief state is the sum of all particles, B̂(s, h_t) = (1/K) Σ_{i=1}^K δ(s, B^i_t), where δ(s, s') is the Kronecker delta function. At the start of the algorithm, K particles are sampled from the initial state distribution, B^i_0 ~ I, 1 ≤ i ≤ K. After a real action a_t is executed, and a real observation o_t is observed, the particles are updated by Monte-Carlo simulation. A state s is sampled from the current belief state B̂(s, h_t), by selecting a particle at random from B_t. This particle is passed into the black box simulator, to give a successor state s' and observation o', (s', o', r) ~ G(s, a_t). If the sample observation matches the real observation, o' = o_t, then the new particle s' is added to B_{t+1}. This process repeats until K particles have been added. This approximation to the belief state approaches the true belief state with sufficient particles, lim_{K→∞} B̂(s, h_t) = B(s, h_t), ∀s ∈ S. As with many particle filter approaches, particle deprivation is possible for large t. In practice we combine the belief state update with particle reinvigoration. For example, new particles can be introduced by adding artificial noise to existing particles.

3.3 Partially Observable Monte-Carlo

POMCP combines Monte-Carlo belief state updates with PO-UCT, and shares the same simulations for both Monte-Carlo procedures. Each node in the search tree, T(h) = ⟨N(h), V(h), B(h)⟩, contains a set of particles B(h) in addition to its count N(h) and value V(h). The search procedure is called from the current history h_t. Each simulation begins from a start state that is sampled from the belief state B(h_t).
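The rejection-sampling belief update described above can be sketched in a few lines. The helper names (`update_belief`, `toy_step`) and the deterministic chain dynamics are illustrative assumptions made for the sketch; the paper's domains are far larger and stochastic:

```python
import random

# A minimal sketch of the unweighted particle-filter update: pass particles
# through the black-box simulator and keep successors whose sampled
# observation matches the real observation.

def update_belief(particles, real_action, real_obs, simulator_step, K, rng,
                  max_tries=100000):
    """Approximate B(., h a o) with K particles, given particles for B(., h)."""
    new_particles = []
    tries = 0
    while len(new_particles) < K and tries < max_tries:
        s = rng.choice(particles)                        # s ~ B_hat(., h_t)
        s2, o, _r = simulator_step(s, real_action, rng)  # (s', o, r) ~ G(s, a)
        if o == real_obs:                                # reject on mismatch
            new_particles.append(s2)
        tries += 1
    return new_particles

# Toy dynamics for illustration: the state is an integer position, an action
# moves it +1 or -1, and the observation is (new position) mod 2, noise-free.
def toy_step(s, a, rng):
    s2 = s + (1 if a == "right" else -1)
    return s2, s2 % 2, -1.0

rng = random.Random(1)
belief = [0, 1, 2, 3]        # uncertain over four positions
belief = update_belief(belief, "right", 1, toy_step, K=50, rng=rng)
# Only particles whose successor position is odd survive the update.
```

The `max_tries` cap is a practical guard for the sketch: if the real observation is very unlikely under the current particles, the loop would otherwise run indefinitely, which is one symptom of the particle deprivation discussed above.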
Algorithm 1 Partially Observable Monte-Carlo Planning

procedure Search(h)
    repeat
        if h = empty then
            s ~ I
        else
            s ~ B(h)
        end if
        Simulate(s, h, 0)
    until Timeout()
    return argmax_b V(hb)
end procedure

procedure Simulate(s, h, depth)
    if γ^depth < ε then
        return 0
    end if
    if h ∉ T then
        for all a ∈ A do
            T(ha) ← ⟨N_init(ha), V_init(ha), ∅⟩
        end for
        return Rollout(s, h, depth)
    end if
    a ← argmax_b V(hb) + c √(log N(h) / N(hb))
    (s', o, r) ~ G(s, a)
    R ← r + γ · Simulate(s', hao, depth + 1)
    B(h) ← B(h) ∪ {s}
    N(h) ← N(h) + 1
    N(ha) ← N(ha) + 1
    V(ha) ← V(ha) + (R − V(ha)) / N(ha)
    return R
end procedure

procedure Rollout(s, h, depth)
    if γ^depth < ε then
        return 0
    end if
    a ~ π_rollout(h, ·)
    (s', o, r) ~ G(s, a)
    return r + γ · Rollout(s', hao, depth + 1)
end procedure

Simulations are performed using the partially observable UCT algorithm, as described above. For every history h encountered during simulation, the belief state B(h) is updated to include the simulation state. When search is complete, the agent selects the action a_t with greatest value, and receives a real observation o_t from the world. At this point, the node T(h_t a_t o_t) becomes the root of the new search tree, and the belief state B(h_t a_t o_t) determines the agent's new belief state. The remainder of the tree is pruned, as all other histories are now impossible.
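Algorithm 1 can be rendered compactly in Python. This is a minimal sketch under assumed interfaces (a `step(s, a, rng)` generative model and tuple-encoded histories); it omits particle reinvigoration and the preferred-action initialisation, and it is not the authors' released implementation:

```python
import math
import random

# A compact Python sketch of the POMCP search loop of Algorithm 1.
# The simulator interface and the tiny deterministic chain domain at the
# bottom are illustrative assumptions, not the paper's benchmark problems.

class POMCP:
    def __init__(self, actions, step, gamma=0.95, c=1.0, eps=0.01, seed=0):
        self.actions, self.step = actions, step   # step(s, a, rng) -> (s', o, r)
        self.gamma, self.c, self.eps = gamma, c, eps
        self.rng = random.Random(seed)
        self.N, self.V, self.B = {}, {}, {}       # counts, values, particles

    def search(self, belief, n_sims=1000):
        h = ()                                    # the current history is the root
        for _ in range(n_sims):
            self.simulate(self.rng.choice(belief), h, 0)
        return max(self.actions, key=lambda a: self.V.get(h + (a,), -1e9))

    def rollout(self, s, depth):
        if self.gamma ** depth < self.eps:
            return 0.0
        s2, _o, r = self.step(s, self.rng.choice(self.actions), self.rng)
        return r + self.gamma * self.rollout(s2, depth + 1)

    def simulate(self, s, h, depth):
        if self.gamma ** depth < self.eps:
            return 0.0
        if h + (self.actions[0],) not in self.N:  # expand: initialise children
            for a in self.actions:
                self.N[h + (a,)], self.V[h + (a,)] = 0, 0.0
            self.N[h] = 0
            return self.rollout(s, depth)
        def ucb(a):                               # UCB1 over child nodes
            ha = h + (a,)
            if self.N[ha] == 0:
                return float("inf")               # untried actions first
            return self.V[ha] + self.c * math.sqrt(
                math.log(self.N[h] + 1) / self.N[ha])
        a = max(self.actions, key=ucb)
        s2, o, r = self.step(s, a, self.rng)
        R = r + self.gamma * self.simulate(s2, h + (a, o), depth + 1)
        self.B.setdefault(h, []).append(s)        # reuse simulation for beliefs
        self.N[h] += 1
        ha = h + (a,)
        self.N[ha] += 1
        self.V[ha] += (R - self.V[ha]) / self.N[ha]
        return R

# Tiny deterministic chain: an action of +1/-1 moves the state within [0, 3],
# the observation is the state itself, and state 3 yields reward 1.
def chain_step(s, a, rng):
    s2 = max(0, min(3, s + a))
    return s2, s2, (1.0 if s2 == 3 else 0.0)

planner = POMCP(actions=[-1, 1], step=chain_step, gamma=0.8, c=2.0, eps=0.05)
best = planner.search(belief=[2], n_sims=500)
```

Histories are encoded as tuples of alternating actions and observations, so each node of the search tree is addressed by its history exactly as in PO-UCT; the root belief is a plain list of particle states.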
The complete POMCP algorithm is described in Algorithm 1 and Figure 1.

4 Convergence

The UCT algorithm converges to the optimal value function in fully observable MDPs [8]. This suggests two simple ways to apply UCT to POMDPs: either by converting every belief state into an MDP state, or by converting every history into an MDP state, and then applying UCT directly to the derived MDP. However, the first approach is computationally expensive in large POMDPs, where even a single belief state update can be prohibitively costly. The second approach requires a history-based simulator that can sample the next history given the current history, which is usually more costly and harder to encode than a state-based simulator. The key innovation of the PO-UCT algorithm is to apply a UCT search to a history-based MDP, but using a state-based simulator to efficiently sample states from the current beliefs. In this section we prove that given the true belief state B(s, h), PO-UCT also converges to the optimal value function. We prove convergence for POMDPs with finite horizon T; this can be extended to the infinite horizon case as suggested in [8].

Lemma 1. Given a POMDP M = (S, A, P, R, Z), consider the derived MDP with histories as states, M̃ = (H, A, P̃, R̃), where P̃^a_{h,hao} = Σ_{s∈S} Σ_{s'∈S} B(s, h) P^a_{ss'} Z^a_{s'o} and R̃^a_h = Σ_{s∈S} B(s, h) R^a_s. Then the value function Ṽ^π(h) of the derived MDP is equal to the value function V^π(h) of the POMDP, ∀π Ṽ^π(h) = V^π(h).

Proof. By backward induction on the Bellman equation, starting from the horizon:
V^π(h) = Σ_{a∈A} Σ_{s∈S} B(s, h) π(h, a) (R^a_s + γ Σ_{s'∈S} Σ_{o∈O} P^a_{ss'} Z^a_{s'o} V^π(hao))
= Σ_{a∈A} π(h, a) (R̃^a_h + γ Σ_{o∈O} P̃^a_{h,hao} Ṽ^π(hao)) = Ṽ^π(h).

Let D^π(h_T) be the POMDP rollout distribution. This is the distribution of histories generated by sampling an initial state s_t ~ B(s, h_t), and then repeatedly sampling actions from policy π(h, a) and sampling states, observations and rewards from M, until terminating at time T. Let D̃^π(h_T) be the derived MDP rollout distribution. This is the distribution of histories generated by starting at h_t and then repeatedly sampling actions from policy π and sampling state transitions and rewards from M̃, until terminating at time T.

Lemma 2. For any rollout policy π, the POMDP rollout distribution is equal to the derived MDP rollout distribution, ∀π D^π(h_T) = D̃^π(h_T).

Proof. By forward induction from h_t: D^π(hao) = D^π(h) π(h, a) Σ_{s∈S} Σ_{s'∈S} B(s, h) P^a_{ss'} Z^a_{s'o} = D̃^π(h) π(h, a) P̃^a_{h,hao} = D̃^π(hao).

Theorem 1. For suitable choice of c, the value function constructed by PO-UCT converges in probability to the optimal value function, V(h) →p V*(h), for all histories h that are prefixed by h_t. As the number of visits N(h) approaches infinity, the bias of the value function, E[V(h) − V*(h)], is O(log N(h)/N(h)).

Proof. By Lemma 2, the PO-UCT simulations can be mapped into UCT simulations in the derived MDP. By Lemma 1, the analysis of UCT in [8] can then be applied to PO-UCT.

5 Experiments

We applied POMCP to the benchmark rocksample problem, and to two new problems: battleship and pocman. For each problem we ran POMCP 1000 times, or for up to 12 hours of total computation time. We evaluated the performance of POMCP by the average total discounted reward. In the smaller rocksample problems, we compared POMCP to the best full-width online planning algorithms. However, the other problems were too large to run these algorithms. To provide a performance benchmark in these cases, we evaluated the performance of simple Monte-Carlo simulation without any tree. This PO-rollout algorithm used Monte-Carlo belief state updates, as described in section 3.2. It then simulated n/|A| rollouts for each legal action, and selected the action with highest average return.

The exploration constant for POMCP was set to c = R_hi − R_lo, where R_hi was the highest return achieved during sample runs of POMCP with c = 0, and R_lo was the lowest return achieved during sample rollouts. The discount horizon was set to 0.01 (about 90 steps when γ = 0.95). On the larger battleship and pocman problems, we combined POMCP with particle reinvigoration. After each real action and observation, additional particles were added to the belief state, by applying a domain specific local transformation to existing particles. When n simulations were used, n/16 new particles were added to the belief set. We also introduced domain knowledge into the search algorithm, by defining a set of preferred actions Ap.
In each problem, we applied POMCP both with and without preferred actions. When preferred actions were used, the rollout policy selected actions uniformly from Ap, and each new node T(ha) in the tree was initialised to V_init(ha) = R_hi, N_init(ha) = 10 for preferred actions a ∈ Ap, and to V_init(ha) = R_lo, N_init(ha) = 0 for all other actions. Otherwise, the rollout policy selected actions uniformly among all legal actions, and each new node T(ha) was initialised to V_init(ha) = 0, N_init(ha) = 0 for all a ∈ A.

The rocksample (n, k) problem [14] simulates a Mars explorer robot in an n × n grid containing k rocks. The task is to determine which rocks are valuable, take samples of valuable rocks, and leave the map to the east when sampling is complete. When provided with an exactly factored representation, online full-width planners have been successful in rocksample (7, 8) [13], and an offline full-width planner has been successful in the much larger rocksample (11, 11) problem [11]. We applied POMCP to three variants of rocksample: (7, 8), (11, 11), and (15, 15), without factoring the problem. When using preferred actions, the number of valuable and unvaluable observations was counted for each rock. Actions that sampled rocks with more valuable observations were preferred. If all remaining rocks had a greater number of unvaluable observations, then the east action was preferred. The results of applying POMCP to rocksample, with various levels of prior knowledge, are shown in Figure 2. These results are compared with prior work in Table 1. On rocksample (7, 8), the performance of POMCP with preferred actions was close to the best prior online planning methods combined with offline solvers. On rocksample (11, 11), POMCP matched the performance of the state-of-the-art offline solver SARSOP, using 4 seconds of online computation compared with 1000 seconds of offline computation [11].
Unlike prior methods, POMCP also provided scalable performance on rocksample (15, 15), a problem with over 7 million states.

Rocksample     (7, 8)        (11, 11)      (15, 15)
States |S|     12,544        247,808       7,372,800
AEMS2          21.37 ±0.22   N/A           N/A
HSVI-BFS       21.46 ±0.22   N/A           N/A
SARSOP         21.39 ±0.01   21.56 ±0.11   N/A
Rollout        9.46 ±0.27    8.70 ±0.29    7.56 ±0.25
POMCP          20.71 ±0.21   20.01 ±0.23   15.32 ±0.28

Table 1: Comparison of Monte-Carlo planning with full-width planning on rocksample. POMCP and the rollout algorithm used prior knowledge in their rollouts. The online planners used knowledge computed offline by PBVI; results are from [13]. Each online algorithm was given 1 second per action. The full-width, offline planner SARSOP was given approximately 1000 seconds of offline computation; results are from [9]. All full-width planners were provided with an exactly factored representation of the POMDP; the Monte-Carlo planners do not factor the representation. The full-width planners could not be run on the larger problems.

In the battleship POMDP, 5 ships are placed at random into a 10 × 10 grid, subject to the constraint that no ship may be placed adjacent or diagonally adjacent to another ship. The ships have different sizes of 5 × 1, 4 × 1, 3 × 1 and 2 × 1 respectively. The goal is to find and sink all ships. However, the agent cannot observe the location of the ships. At each step, the agent can fire upon one cell of the grid, and receives an observation of 1 if a ship was hit, and 0 otherwise. There is a −1 reward per time-step, a terminal reward of +100 for hitting every cell of every ship, and no discounting (γ = 1). It is illegal to fire twice on the same cell. If it was necessary to fire on all cells of the grid, the total reward is 0; otherwise the total reward indicates the number of steps better than the worst case.
There are 100 actions, 2 observations, and approximately 10^18 states in this challenging POMDP. Particle reinvigoration is particularly important in this problem. Each reinvigoration step applied one of three local transformations: 2 ships of different sizes swapped location; 2 smaller ships were swapped into the location of 1 larger ship; or 1 to 4 ships were moved to a new location, selected uniformly at random, and the result was accepted if the new configuration was legal. Without preferred actions, all legal actions were considered. When preferred actions were used, impossible cells for ships were deduced automatically, by marking off the cells diagonally adjacent to each hit. These cells were never selected in the tree or during rollouts. The performance of POMCP, with and without preferred actions, is shown in Figure 2. POMCP was able to sink all ships more than 50 moves faster, on average, than random play, and more than 25 moves faster than randomly selecting amongst preferred actions (which corresponds to the simple strategy used by many humans when playing the Battleship game). Using preferred actions, POMCP achieved better results with less search; however, even without preferred actions, POMCP was able to deduce the diagonal constraints from its rollouts, and performed almost as well given more simulations per move. Interestingly, the search tree provided only a small benefit over the PO-rollout algorithm, because the differences between the values of actions were small but the variance of the returns was high.

In our final experiment we introduce a partially observable version of the video game PacMan. In this task, pocman, the agent must navigate a 17 × 19 maze and eat the food pellets that are randomly distributed across the maze.
Four ghosts roam the maze, initially according to a randomised strategy: at each junction point they select a direction, without doubling back, with probability proportional to the number of food pellets in line of sight in that direction. Normally, if PocMan touches a ghost then he dies and the episode terminates. However, four power pills are available, which last for 15 steps and enable PocMan to eat any ghosts he touches. If a ghost is within Manhattan distance 5 of PocMan, it chases him aggressively, or runs away if he is under the effect of a power pill. The PocMan agent receives a reward of −1 at each step, +10 for each food pellet, +25 for eating a ghost and −100 for dying. The discount factor is γ = 0.95. The PocMan agent receives ten observation bits at every time step, corresponding to his senses of sight, hearing, touch and smell. He receives four observation bits indicating whether he can see a ghost in each cardinal direction, set to 1 if there is a ghost in his direct line of sight. He receives one observation bit indicating whether he can hear a ghost, set to 1 if he is within Manhattan distance 2 of a ghost. He receives four observation bits indicating whether he can feel a wall in each of the cardinal directions, set to 1 if he is adjacent to a wall. Finally, he receives one observation bit indicating whether he can smell food, set to 1 if he is adjacent or diagonally adjacent to any food.

Figure 2: Performance of POMCP in rocksample (11,11) and (15,15), battleship and pocman. Each point shows the mean discounted return from 1000 runs or 12 hours of total computation. The search time for POMCP with preferred actions is shown on the top axis.

The pocman problem has approximately 10^56 states, 4 actions, and 1024 observations. For particle reinvigoration, 1 or 2 ghosts were teleported to a randomly selected new location.
The new particle was accepted if it was consistent with the last observation. When using preferred actions, if PocMan was under the effect of a power pill, then he preferred to move in directions where he saw ghosts; otherwise, PocMan preferred to move in directions where he did not see ghosts, excluding the direction he had just come from. The performance of POMCP in pocman, with and without preferred actions, is shown in Figure 2. Using preferred actions, POMCP achieved an average undiscounted return of over 300, compared to 230 for the PO-rollout algorithm. Without domain knowledge, POMCP still achieved an average undiscounted return of 260, compared to 130 for simple rollouts. A real-time demonstration of POMCP using preferred actions is available online, along with source code for POMCP (http://www.cs.ucl.ac.uk/staff/D.Silver/web/Applications.html).

6 Discussion

Traditionally, POMDP planning has focused on small problems that have few states or can be neatly factorised into a compact representation. However, real-world problems are often large and messy, with enormous state spaces and probability distributions that cannot be conveniently factorised. In these challenging POMDPs, Monte-Carlo simulation provides an effective mechanism both for tree search and for belief state updates, breaking the curse of dimensionality and allowing much greater scalability than has previously been possible. Unlike previous approaches to Monte-Carlo planning in POMDPs, the PO-UCT algorithm provides a computationally efficient best-first search that focuses its samples in the most promising regions of the search space. The POMCP algorithm uses these same samples to provide a rich and effective belief state update. The battleship and pocman problems provide two examples of large POMDPs which cannot easily be factored and are intractable to prior algorithms for POMDP planning.
POMCP was able to achieve high performance in these challenging problems with just a few seconds of online computation.

[Figure 2: four panels, Rocksample (11,11), Rocksample (15,15), Battleship and PocMan, each plotting average return against simulations per move (bottom axis) and search time in seconds (top axis), comparing POMCP and PO-rollouts with basic and preferred knowledge; SARSOP is also shown on Rocksample (11,11).]

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] D. Bertsekas and D. Castañon. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5(1):89–108, 1999.

[3] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In 5th International Conference on Computers and Games, pages 72–83, 2006.

[4] H. Finnsson and Y. Björnsson. Simulation-based approach to general game playing. In 23rd AAAI Conference on Artificial Intelligence, pages 259–264, 2008.

[5] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In 24th International Conference on Machine Learning, pages 273–280, 2007.

[6] L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[7] M. Kearns, Y. Mansour, and A. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

[8] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In 17th European Conference on Machine Learning, pages 282–293, 2006.

[9] H. Kurniawati, D. Hsu, and W. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.

[10] R. Lorentz. Amazons discover Monte-Carlo. In Computers and Games, pages 13–24, 2008.

[11] S. Ong, S. Png, D. Hsu, and W. Lee. POMDPs for robotic tasks with mixed observability. In Robotics: Science and Systems, 2009.

[12] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.

[13] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.

[14] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In 20th Conference on Uncertainty in Artificial Intelligence, 2004.

[15] G. Tesauro and G. Galperin. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems 9, pages 1068–1074, 1996.