{"title": "Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1646, "page_last": 1654, "abstract": "Monte-Carlo tree search is drawing great interest in the domain of planning under uncertainty, particularly when little or no domain knowledge is available. One of the central problems is the trade-off between exploration and exploitation. In this paper we present a novel Bayesian mixture modelling and inference based Thompson sampling approach to addressing this dilemma. The proposed Dirichlet-NormalGamma MCTS (DNG-MCTS) algorithm represents the uncertainty of the accumulated reward for actions in the MCTS search tree as a mixture of Normal distributions and inferences on it in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions. Thompson sampling is used to select the best action at each decision node. Experimental results show that our proposed algorithm has achieved the state-of-the-art comparing with popular UCT algorithm in the context of online planning for general Markov decision processes.", "full_text": "Bayesian Mixture Modeling and Inference based\nThompson Sampling in Monte-Carlo Tree Search\n\nAijun Bai\n\nUniv. of Sci. & Tech. of China\n\nbaj@mail.ustc.edu.cn\n\nFeng Wu\n\nUniversity of Southampton\nfw6e11@ecs.soton.ac.uk\n\nAbstract\n\nXiaoping Chen\n\nUniv. of Sci. & Tech. of China\n\nxpchen@ustc.edu.cn\n\nMonte-Carlo tree search (MCTS) has been drawing great interest in recent years\nfor planning and learning under uncertainty. One of the key challenges is the\ntrade-off between exploration and exploitation. 
To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling, and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inference on the mixture in a Bayesian setting by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions, and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm outperforms the state-of-the-art UCT approach on several benchmark problems.\n\n1 Introduction\n\nMarkov decision processes (MDPs) provide a general framework for planning and learning under uncertainty. We consider the problem of online planning in MDPs without prior knowledge of the underlying transition probabilities. Monte-Carlo tree search (MCTS) can find near-optimal policies in our domains by combining tree search methods with sampling techniques. The key idea is to iteratively evaluate each state in a best-first search tree by the mean outcome of simulation samples. It is model-free and requires only a black-box simulator (generative model) of the underlying problems. To date, great success has been achieved by MCTS in a variety of domains, such as game play [1, 2], planning under uncertainty [3, 4, 5], and Bayesian reinforcement learning [6, 7].\nWhen applying MCTS, one of the fundamental challenges is the so-called exploration versus exploitation dilemma: an agent must not only exploit by selecting the best action based on the current information, but should also keep exploring other actions for possible higher future payoffs. 
Thompson sampling is one of the earliest heuristics to address this dilemma in multi-armed bandit problems (MABs), following the principle of randomized probability matching [8]. The basic idea is to select actions stochastically, based on their probabilities of being optimal. It has recently been shown to perform very well in MABs both empirically [9] and theoretically [10]. It has been proved that the Thompson sampling algorithm achieves logarithmic expected regret, which is asymptotically optimal for MABs. Compared to the UCB1 heuristic [3], the main advantage of Thompson sampling is that it allows more robust convergence under a wide range of problem settings.\nIn this paper, we borrow the idea of Thompson sampling and propose the Dirichlet-NormalGamma MCTS (DNG-MCTS) algorithm \u2014 a novel Bayesian mixture modeling and inference based Thompson sampling approach for online planning in MDPs. In this algorithm, we use a mixture of Normal distributions to model the unknown distribution of the accumulated reward of performing a particular action in the MCTS search tree. In the setting of online planning for MDPs, a conjugate prior exists in the form of a combination of Dirichlet and NormalGamma distributions. By choosing the conjugate prior, it is then relatively simple to compute the posterior distribution after each accumulated reward is observed by simulation in the search tree. Thompson sampling is then used to select the action to be performed by simulation at each decision node. We have tested our DNG-MCTS algorithm and compared it with the popular UCT algorithm on several benchmark problems. Experimental results show that our proposed algorithm outperforms the state-of-the-art for online planning in general MDPs. Furthermore, we show the convergence of our algorithm, confirming its technical soundness.\nThe remainder of this paper is organized as follows. 
In Section 2, we briefly introduce the necessary background. Section 3 presents our main results \u2014 the DNG-MCTS algorithm. We show experimental results on several benchmark problems in Section 4. Finally, Section 5 concludes the paper with a summary of our contributions and future work.\n\n2 Background\n\nIn this section, we briefly review the MDP model, the MAB problem, the MCTS framework, and the UCT algorithm as the basis of our algorithm. Some related work is also presented.\n\n2.1 MDPs and MABs\n\nFormally, an MDP is defined as a tuple \u27e8S, A, T, R\u27e9, where S is the state space, A is the action space, T(s\u2032|s, a) is the probability of reaching state s\u2032 if action a is applied in state s, and R(s, a) is the reward received by the agent. A policy is a decision rule mapping from states to actions and specifying which action should be taken in each state. The aim of solving an MDP is to find the optimal policy \u03c0 that maximizes the expected reward defined as V_\u03c0(s) = E[\u2211_{t=0}^{H} \u03b3^t R(s_t, \u03c0(s_t))], where H is the planning horizon, \u03b3 \u2208 (0, 1] is the discount factor, s_t is the state at time step t and \u03c0(s_t) is the action selected by policy \u03c0 in state s_t.\nIntuitively, an MAB can be seen as an MDP with only one state s and a stochastic reward function R(s, a) := X_a, where X_a is a random variable following an unknown distribution f_{X_a}(x). At each time step t, one action a_t must be chosen and executed. A stochastic reward X_{a_t} is then received accordingly. 
The goal is to find a sequence of actions that minimizes the cumulative regret defined as R_T = E[\u2211_{t=1}^{T} (X_{a*} \u2212 X_{a_t})], where a* is the true best action.\n\n2.2 MCTS and UCT\n\nTo solve MDPs, MCTS iteratively evaluates a state by: (1) selecting an action based on a given action selection strategy; (2) performing the selected action by Monte-Carlo simulation; (3) recursively evaluating the resulting state if it is already in the search tree, or else inserting it into the search tree and running a rollout policy by simulation. This process is applied to descend through the search tree until some termination conditions are reached. The simulation result is then back-propagated through the selected nodes to update their statistics.\nThe UCT algorithm is a popular approach based on MCTS for planning under uncertainty [3]. It treats each state of the search tree as an MAB, and selects the action that maximizes the UCB1 heuristic Q\u0304(s, a) + c\u221a(log N(s)/N(s, a)), where Q\u0304(s, a) is the mean return of action a in state s from all previous simulations, N(s, a) is the visitation count of action a in state s, N(s) = \u2211_{a\u2208A} N(s, a) is the overall count, and c is the exploration constant that determines the relative ratio of exploration to exploitation. It is proved that with an appropriate choice of c the probability of selecting the optimal action converges to 1 as the number of samples grows to infinity.\n\n2.3 Related Work\n\nThe fundamental assumption of our algorithm is modeling the unknown distribution of the accumulated reward for each state-action pair in the search tree as a mixture of Normal distributions. A similar assumption has been made in [11], where a Normal distribution over the rewards is assumed. Compared to their approach, as we will show in Section 3, our assumption of a Normal mixture is more realistic for our problems. 
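As a concrete illustration of the two action-selection rules reviewed above, the following minimal Python sketch (not part of the original paper) runs Thompson sampling and UCB1 on a toy three-armed Gaussian bandit. The arm means, the N(0, 1) prior on each arm's mean, and the known unit reward variance are illustrative simplifications; the paper's own model also learns the precision.

```python
import math, random

def run(select, T=3000, seed=0):
    # Pull arms for T rounds with the given selection rule; return pull counts.
    rng = random.Random(seed)
    mu_true = [0.0, 0.5, 1.0]            # illustrative arm means; rewards ~ N(mu, 1)
    n, s = [0, 0, 0], [0.0, 0.0, 0.0]    # pull counts and reward sums
    for a in range(3):                   # pull each arm once to initialize
        n[a] += 1
        s[a] += rng.gauss(mu_true[a], 1.0)
    for _ in range(T - 3):
        a = select(n, s, rng)
        n[a] += 1
        s[a] += rng.gauss(mu_true[a], 1.0)
    return n

def thompson(n, s, rng):
    # Randomized probability matching: draw each arm's mean from its posterior
    # (N(0, 1) prior on the mean, known unit reward variance) and play the argmax.
    def draw(a):
        prec = 1.0 + n[a]                # posterior precision = prior + observations
        return rng.gauss(s[a] / prec, 1.0 / math.sqrt(prec))
    return max(range(3), key=draw)

def ucb1(n, s, rng, c=1.0):
    # Deterministic optimism: empirical mean + c * sqrt(log N / n_a).
    N = sum(n)
    return max(range(3), key=lambda a: s[a] / n[a] + c * math.sqrt(math.log(N) / n[a]))

counts_ts, counts_ucb = run(thompson), run(ucb1)
print(counts_ts, counts_ucb)             # both rules should favour the best arm (index 2)
```

Both rules concentrate pulls on the best arm; the difference is that Thompson sampling explores by randomizing over posterior draws, while UCB1 explores through a deterministic bonus.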
Tesauro et al. [12] developed a Bayesian UCT approach to MCTS using Gaussian approximation. Specifically, their method propagates probability distributions of rewards from leaf nodes up to the root node by applying the MAX (or MIN) extremum distribution operator at the interior nodes. Then, it uses modified UCB1 heuristics to select actions on the basis of the interior distributions. However, the extremum distribution operation at decision nodes is very time-consuming because it must consider all the child nodes. In contrast, we treat each decision node in the search tree as an MAB, maintain a posterior distribution over the accumulated reward for each applicable action separately, and then select the best action using Thompson sampling.\n\n3 The DNG-MCTS Algorithm\n\nThis section presents our main results \u2014 a Bayesian mixture modeling and inference based Thompson sampling approach for MCTS (DNG-MCTS).\n\n3.1 The Assumptions\n\nFor a given MDP policy \u03c0, let X_{s,\u03c0} be a random variable that denotes the accumulated reward of following policy \u03c0 starting from state s, and let X_{s,a,\u03c0} denote the accumulated reward of first performing action a in state s and then following policy \u03c0 thereafter. Our assumptions are: (1) X_{s,\u03c0} is sampled from a Normal distribution, and (2) X_{s,a,\u03c0} can be modeled as a mixture of Normal distributions. These are realistic approximations for our problems for the following reasons.\nGiven policy \u03c0, an MDP reduces to a Markov chain {s_t} with finite state space S and transition function T(s\u2032|s, \u03c0(s)). Suppose that the resulting chain {s_t} is ergodic, that is, it is possible to go from every state to every other state (not necessarily in one move). Let w denote the stationary distribution of {s_t}. 
According to the central limit theorem on Markov chains [13, 14], for any bounded function f on the state space S, we have:\n\n(1/\u221an)(\u2211_{t=0}^{n} f(s_t) \u2212 n\u03bc) \u2192 N(0, \u03c3\u00b2) as n \u2192 \u221e,   (1)\n\nwhere \u03bc = E_w[f] and \u03c3 is a constant depending only on f and w. This indicates that the sum of f(s_t) follows N(n\u03bc, n\u03c3\u00b2) as n grows to infinity. It is then natural to approximate the distribution of \u2211_{t=0}^{n} f(s_t) by a Normal distribution if n is sufficiently large.\nConsidering finite-horizon MDPs with horizon H, if \u03b3 = 1, then X_{s_0,\u03c0} = \u2211_{t=0}^{H} R(s_t, \u03c0(s_t)) is a sum of f(s_t) = R(s_t, \u03c0(s_t)). Thus, X_{s_0,\u03c0} is approximately normally distributed for each s_0 \u2208 S if H is sufficiently large. On the other hand, if \u03b3 \u2260 1, X_{s_0,\u03c0} = \u2211_{t=0}^{H} \u03b3^t R(s_t, \u03c0(s_t)) can be rewritten as a linear combination of \u2211_{t=0}^{n} f(s_t) for n = 0 to H as follows:\n\nX_{s_0,\u03c0} = (1 \u2212 \u03b3) \u2211_{n=0}^{H\u22121} \u03b3^n \u2211_{t=0}^{n} f(s_t) + \u03b3^H \u2211_{t=0}^{H} f(s_t).   (2)\n\nNotice that a linear combination of independent or correlated normally distributed random variables is still normally distributed. If H is sufficiently large and \u03b3 is close to 1, it is reasonable to approximate X_{s_0,\u03c0} as a Normal distribution. 
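The Markov-chain central limit theorem invoked in Eq. (1) can be checked empirically. The sketch below is illustrative and not from the paper: it simulates an ergodic two-state chain whose stationary distribution w and mean \u03bc = E_w[f] are computed in closed form, and standardizes the partial sums.

```python
import math, random

# Empirical sanity check of the Markov-chain CLT: for an ergodic two-state
# chain, (1/sqrt(n)) * (sum_t f(s_t) - n*mu) should behave like a centred
# Normal with some fixed variance sigma^2 for large n.
P = {0: [(0, 0.9), (1, 0.1)],            # illustrative transition probabilities
     1: [(0, 0.5), (1, 0.5)]}
f = {0: 1.0, 1: 0.0}                     # a bounded function on the states
w = {0: 5.0 / 6.0, 1: 1.0 / 6.0}         # stationary distribution (solves w P = w)
mu = sum(w[s] * f[s] for s in w)         # mu = E_w[f] = 5/6

def standardized_sum(n, rng):
    s, total = 0, 0.0
    for _ in range(n):
        total += f[s]
        r, acc = rng.random(), 0.0       # sample the next state
        for nxt, p in P[s]:
            acc += p
            if r < acc:
                s = nxt
                break
    return (total - n * mu) / math.sqrt(n)

rng = random.Random(1)
draws = [standardized_sum(2000, rng) for _ in range(100)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 3), round(var, 3))     # mean near 0; variance near a constant
```

The empirical mean of the standardized sums is close to 0 and their variance stabilizes around a constant \u03c3\u00b2, consistent with Eq. (1).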
Therefore, we assume that X_{s,\u03c0} is normally distributed in both cases.\nIf the policy \u03c0 is not fixed and may change over time (e.g., the derived policy of an online algorithm before it converges), the real distribution of X_{s,\u03c0} is actually unknown and could be very complex. However, if the algorithm is guaranteed to converge in the limit (as explained in Section 3.5, this holds for our DNG-MCTS algorithm), it is convenient and reasonable to approximate X_{s,\u03c0} by a Normal distribution.\nNow consider the accumulated reward of first performing action a in s and following policy \u03c0 thereafter. By definition, X_{s,a,\u03c0} = R(s, a) + \u03b3X_{s\u2032,\u03c0}, where s\u2032 is the next state distributed according to T(s\u2032|s, a). Let Y_{s,a,\u03c0} be a random variable defined as Y_{s,a,\u03c0} = (X_{s,a,\u03c0} \u2212 R(s, a))/\u03b3. We can see that the pdf of Y_{s,a,\u03c0} is a convex combination of the pdfs of X_{s\u2032,\u03c0} for each s\u2032 \u2208 S. Specifically, we have f_{Y_{s,a,\u03c0}}(y) = \u2211_{s\u2032\u2208S} T(s\u2032|s, a) f_{X_{s\u2032,\u03c0}}(y). Hence it is straightforward to model the distribution of Y_{s,a,\u03c0} as a mixture of Normal distributions if X_{s\u2032,\u03c0} is assumed to be normally distributed for each s\u2032 \u2208 S. Since X_{s,a,\u03c0} is a linear function of Y_{s,a,\u03c0}, X_{s,a,\u03c0} is also a mixture of Normal distributions under our assumptions.\n\n3.2 The Modeling and Inference Methods\n\nIn Bayesian settings, the unknown distribution of a random variable X can be modeled as a parametric likelihood function L(x|\u03b8) depending on the parameters \u03b8. Given a prior distribution P(\u03b8) and a set of past observations Z = {x1, x2, . . 
.}, the posterior distribution of \u03b8 can then be obtained using Bayes' rule: P(\u03b8|Z) \u221d \u220f_i L(x_i|\u03b8) P(\u03b8).\nAssumption (1) implies that it suffices to model the distribution of X_{s,\u03c0} as a Normal likelihood N(\u03bc_s, 1/\u03c4_s) with unknown mean \u03bc_s and precision \u03c4_s. The precision is defined as the reciprocal of the variance, \u03c4 = 1/\u03c3\u00b2. This is chosen for the mathematical convenience of introducing the NormalGamma distribution as a conjugate prior. A NormalGamma distribution is defined by the hyper-parameters \u27e8\u03bc0, \u03bb, \u03b1, \u03b2\u27e9 with \u03bb > 0, \u03b1 \u2265 1 and \u03b2 \u2265 0. It is said that (\u03bc, \u03c4) follows a NormalGamma distribution NormalGamma(\u03bc0, \u03bb, \u03b1, \u03b2) if the pdf of (\u03bc, \u03c4) has the form\n\nf(\u03bc, \u03c4 | \u03bc0, \u03bb, \u03b1, \u03b2) = (\u03b2^\u03b1/\u0393(\u03b1)) \u221a(\u03bb/(2\u03c0)) \u03c4^{\u03b1\u22121/2} e^{\u2212\u03b2\u03c4} e^{\u2212\u03bb\u03c4(\u03bc\u2212\u03bc0)\u00b2/2}.   (3)\n\nBy definition, the marginal distribution over \u03c4 is a Gamma distribution, \u03c4 ~ Gamma(\u03b1, \u03b2), and the conditional distribution over \u03bc given \u03c4 is a Normal distribution, \u03bc ~ N(\u03bc0, 1/(\u03bb\u03c4)).\nLet us briefly recall the posterior of (\u03bc, \u03c4). Suppose X is normally distributed with unknown mean \u03bc and precision \u03c4, x ~ N(\u03bc, 1/\u03c4), and that the prior distribution of (\u03bc, \u03c4) is a NormalGamma distribution, (\u03bc, \u03c4) ~ NormalGamma(\u03bc0, \u03bb0, \u03b10, \u03b20). After observing n independent samples of X, denoted {x1, x2, . . . 
, xn}, by Bayes' theorem the posterior distribution of (\u03bc, \u03c4) is also a NormalGamma distribution, (\u03bc, \u03c4) ~ NormalGamma(\u03bc_n, \u03bb_n, \u03b1_n, \u03b2_n), where\n\n\u03bc_n = (\u03bb0\u03bc0 + nx\u0304)/(\u03bb0 + n),  \u03bb_n = \u03bb0 + n,  \u03b1_n = \u03b10 + n/2,  \u03b2_n = \u03b20 + (ns + \u03bb0n(x\u0304 \u2212 \u03bc0)\u00b2/(\u03bb0 + n))/2,\n\nwhere x\u0304 = (1/n) \u2211_{i=1}^{n} x_i is the sample mean and s = (1/n) \u2211_{i=1}^{n} (x_i \u2212 x\u0304)\u00b2 is the sample variance.\nBased on Assumption (2), the distribution of Y_{s,a,\u03c0} can be modeled as a mixture of Normal distributions, Y_{s,a,\u03c0} = (X_{s,a,\u03c0} \u2212 R(s, a))/\u03b3 ~ \u2211_{s\u2032\u2208S} w_{s,a,s\u2032} N(\u03bc_{s\u2032}, 1/\u03c4_{s\u2032}), where the mixture weights w_{s,a,s\u2032} = T(s\u2032|s, a) satisfy w_{s,a,s\u2032} \u2265 0 and \u2211_{s\u2032\u2208S} w_{s,a,s\u2032} = 1, and are previously unknown in Monte-Carlo settings. A natural representation of these unknown weights is via Dirichlet distributions, since the Dirichlet distribution is the conjugate prior of a general discrete probability distribution. For state s and action a, a Dirichlet distribution, denoted Dir(\u03c1_{s,a}) where \u03c1_{s,a} = (\u03c1_{s,a,s_1}, \u03c1_{s,a,s_2}, \u00b7\u00b7\u00b7), gives the posterior distribution of T(s\u2032|s, a) for each s\u2032 \u2208 S if the transition to s\u2032 has been observed \u03c1_{s,a,s\u2032} \u2212 1 times. After observing a transition (s, a) \u2192 s\u2032, the posterior distribution is also Dirichlet and can simply be updated as \u03c1_{s,a,s\u2032} \u2190 \u03c1_{s,a,s\u2032} + 1.\nTherefore, to model the distributions of X_{s,\u03c0} and X_{s,a,\u03c0} we only need to maintain a set of hyper-parameters \u27e8\u03bc_{s,0}, \u03bb_s, \u03b1_s, \u03b2_s\u27e9 and \u03c1_{s,a} for each state s and action a encountered in the MCTS search tree, and update them using Bayes' rule.\nNow we turn to the question of how to choose the priors by initializing the hyper-parameters. 
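The conjugate updates above translate directly into code. The helper names below are hypothetical; the paper specifies only the formulas.

```python
# Direct transcription of the conjugate NormalGamma and Dirichlet updates;
# the function names are illustrative helpers, not from the paper.

def normalgamma_update(mu0, lam, alpha, beta, xs):
    # Batch posterior of (mu, tau) after observing samples xs of a Normal variable.
    n = len(xs)
    xbar = sum(xs) / n
    s = sum((x - xbar) ** 2 for x in xs) / n      # biased sample variance, as in the text
    mu_n = (lam * mu0 + n * xbar) / (lam + n)
    lam_n = lam + n
    alpha_n = alpha + n / 2
    beta_n = beta + (n * s + lam * n * (xbar - mu0) ** 2 / (lam + n)) / 2
    return mu_n, lam_n, alpha_n, beta_n

def dirichlet_update(rho, s_next):
    # Posterior count vector after observing one transition (s, a) -> s_next.
    rho = dict(rho)
    rho[s_next] = rho.get(s_next, 0.0) + 1.0
    return rho

# Example: a weak prior, then three observations.
post = normalgamma_update(0.0, 1.0, 1.0, 1.0, [1.0, 2.0, 3.0])
print(tuple(round(v, 6) for v in post))   # → (1.5, 4.0, 2.5, 3.5)
```

With prior (\u03bc0, \u03bb0, \u03b10, \u03b20) = (0, 1, 1, 1) and data {1, 2, 3}: x\u0304 = 2, s = 2/3, giving \u03bc_n = 1.5, \u03bb_n = 4, \u03b1_n = 2.5 and \u03b2_n = 1 + (2 + 3)/2 = 3.5, matching the update formulas term by term.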
While the impact of the prior tends to be negligible in the limit, its choice is important, especially when only a small amount of data has been observed. In general, priors should reflect available knowledge of the hidden model.\nIn the absence of any knowledge, uninformative priors may be preferred. According to the principle of indifference, uninformative priors assign equal probabilities to all possibilities. For NormalGamma priors, we want the sampled distribution of \u03bc given \u03c4, i.e., N(\u03bc0, 1/(\u03bb\u03c4)), to be as flat as possible. This implies an infinite variance 1/(\u03bb\u03c4) \u2192 \u221e, so that \u03bb\u03c4 \u2192 0. Recall that \u03c4 follows a Gamma distribution Gamma(\u03b1, \u03b2) with expectation E[\u03c4] = \u03b1/\u03b2, so in expectation we have \u03bb\u03b1/\u03b2 \u2192 0. Considering the parameter space (\u03bb > 0, \u03b1 \u2265 1, \u03b2 \u2265 0), we can choose \u03bb small enough, \u03b1 = 1, and \u03b2 sufficiently large to approximate this condition. Second, we want the sampled distribution to be centered on the axis, so \u03bc0 = 0 seems to be a good choice. It is worth noting that intuitively \u03b2 should not be set too large, or the convergence process may be very slow. For Dirichlet priors, it is common to set \u03c1_{s,a,s\u2032} = \u03b4, where \u03b4 is a sufficiently small positive number, for each s \u2208 S, a \u2208 A and s\u2032 \u2208 S encountered in the search tree, in order to have uninformative priors.\nOn the other hand, if some prior knowledge is available, informative priors may be preferred. By exploiting domain knowledge, a state node can be initialized with informative priors indicating its priority over other states. In DNG-MCTS, this is done by setting the hyper-parameters based on subjective estimation for states. 
According to the interpretation of the hyper-parameters of the NormalGamma distribution in terms of pseudo-observations, if one has a prior mean of \u03bc0 from \u03bb samples and a prior precision of \u03b1/\u03b2 from 2\u03b1 samples, the prior distribution over \u03bc and \u03c4 is NormalGamma(\u03bc0, \u03bb, \u03b1, \u03b2), providing a straightforward way to initialize the hyper-parameters if some prior knowledge (such as historical data of past observations) is available. Specifying detailed priors based on prior knowledge for particular domains is beyond the scope of this paper. The ability to include prior information provides important flexibility and can be considered an advantage of the approach.\n\n3.3 The Action Selection Strategy\n\nIn DNG-MCTS, the action selection strategy is derived using Thompson sampling. Specifically, in general Bayesian settings, action a is chosen with probability:\n\nP(a) = \u222b 1[a = argmax_{a\u2032} E[X_{a\u2032}|\u03b8_{a\u2032}]] \u220f_{a\u2032} P_{a\u2032}(\u03b8_{a\u2032}|Z) d\u03b8,   (4)\n\nwhere 1 is the indicator function, \u03b8_a is the hidden parameter prescribing the underlying distribution of the reward for applying a, E[X_a|\u03b8_a] = \u222b x L_a(x|\u03b8_a) dx is the expectation of X_a given \u03b8_a, and \u03b8 = (\u03b8_{a_1}, \u03b8_{a_2}, . . . ) is the vector of parameters for all actions. Fortunately, this can be efficiently approximated by sampling. 
To this end, a set of parameters \u03b8_a is sampled according to the posterior distribution P_a(\u03b8_a|Z) for each a \u2208 A, and the action a* = argmax_a E[X_a|\u03b8_a] with the highest expectation is selected.\nIn our implementation, at each decision node s of the search tree, we sample the mean \u03bc_{s\u2032} and the mixture weights w_{s,a,s\u2032} according to NormalGamma(\u03bc_{s\u2032,0}, \u03bb_{s\u2032}, \u03b1_{s\u2032}, \u03b2_{s\u2032}) and Dir(\u03c1_{s,a}) respectively for each possible next state s\u2032 \u2208 S. The expectation of X_{s,a,\u03c0} is then computed as R(s, a) + \u03b3 \u2211_{s\u2032\u2208S} w_{s,a,s\u2032} \u03bc_{s\u2032}. The action with the highest expectation is then selected to be performed in simulation.\n\n3.4 The Main Algorithm\n\nThe main process of DNG-MCTS is outlined in Figure 1. It is worth noting that the function ThompsonSampling has a boolean parameter sampling. If sampling is true, the Thompson sampling method is used to select the best action as explained in Section 3.3; otherwise, a greedy action is returned with respect to the current expected transition probabilities and accumulated rewards of the next states, which are E[w_{s,a,s\u2032}] = \u03c1_{s,a,s\u2032}/\u2211_{x\u2208S} \u03c1_{s,a,x} and E[X_{s,\u03c0}] = \u03bc_{s,0} respectively.\nAt each iteration, the function DNG-MCTS uses Thompson sampling to recursively select actions to be executed by simulation from the root node to leaf nodes through the existing search tree T. It inserts each newly visited node into the tree, plays a default rollout policy from the new node, and propagates the simulated outcome to update the hyper-parameters of visited states and actions. Noting that the rollout policy is only played once for each new node at each iteration, the set of past observations Z in the algorithm has size n = 1.\nThe function OnlinePlanning is the overall procedure interacting with the real environment. 
It is called with the current state s, a search tree T (initially empty), and the maximal horizon H. It repeatedly calls the function DNG-MCTS until some resource budget is reached (e.g., the computation times out or the maximal number of iterations is reached), at which point a greedy action to be performed in the environment is returned to the agent.\n\n3.5 The Convergence Property\n\nFor Thompson sampling in stationary MABs (i.e., the underlying reward function will not change), it has been proved that: (1) the probability of selecting any suboptimal action a at the current step is bounded by a linear function of the probability of selecting the optimal action; (2) the coefficient in this linear function decreases exponentially fast with the number of selections of the optimal action [15]. Thus, the probability of selecting the optimal action in an MAB is guaranteed to converge to 1 in the limit using Thompson sampling.\n\nOnlinePlanning(s : state, T : tree, H : max horizon)\n  Initialize (\u03bc_{s,0}, \u03bb_s, \u03b1_s, \u03b2_s) for each s \u2208 S\n  Initialize \u03c1_{s,a} for each s \u2208 S and a \u2208 A\n  repeat\n    DNG-MCTS(s, T, H)\n  until resource budgets reached\n  return ThompsonSampling(s, H, False)\n\nDNG-MCTS(s : state, T : tree, h : horizon)\n  if h = 0 or s is terminal then\n    return 0\n  else if node \u27e8s, h\u27e9 is not in tree T then\n    Add node \u27e8s, h\u27e9 to T\n    Play rollout policy by simulation for h steps\n    Observe the outcome r\n    return r\n  else\n    a \u2190 ThompsonSampling(s, h, True)\n    Execute a by simulation\n    Observe next state s\u2032 and reward R(s, a)\n    r \u2190 R(s, a) + \u03b3 DNG-MCTS(s\u2032, T, h \u2212 1)\n    \u03b1_s \u2190 \u03b1_s + 0.5\n    \u03b2_s \u2190 \u03b2_s + (\u03bb_s(r \u2212 \u03bc_{s,0})\u00b2/(\u03bb_s + 1))/2\n    \u03bc_{s,0} \u2190 (\u03bb_s\u03bc_{s,0} + r)/(\u03bb_s + 1)\n    \u03bb_s \u2190 \u03bb_s + 1\n    \u03c1_{s,a,s\u2032} \u2190 \u03c1_{s,a,s\u2032} + 1\n    return r\n\nThompsonSampling(s : state, h : horizon, sampling : boolean)\n  foreach a \u2208 A do\n    q_a \u2190 QValue(s, a, h, sampling)\n  return argmax_a q_a\n\nQValue(s : state, a : action, h : horizon, sampling : boolean)\n  r \u2190 0\n  if sampling = True then\n    Sample weights (w_{s\u2032})_{s\u2032\u2208S} according to Dir(\u03c1_{s,a})\n  else\n    w_{s\u2032} \u2190 \u03c1_{s,a,s\u2032}/\u2211_{n\u2208S} \u03c1_{s,a,n} for each s\u2032 \u2208 S\n  foreach s\u2032 \u2208 S do\n    r \u2190 r + w_{s\u2032} Value(s\u2032, h \u2212 1, sampling)\n  return R(s, a) + \u03b3r\n\nValue(s : state, h : horizon, sampling : boolean)\n  if h = 0 or s is terminal then\n    return 0\n  else if sampling = True then\n    Sample (\u03bc, \u03c4) according to NormalGamma(\u03bc_{s,0}, \u03bb_s, \u03b1_s, \u03b2_s)\n    return \u03bc\n  else\n    return \u03bc_{s,0}\n\nFigure 1: Dirichlet-NormalGamma based Monte-Carlo Tree Search\n\nThe distribution of X_{s,\u03c0} is determined by the transition function and the Q values given the policy \u03c0. When the Q values converge, the distribution of X_{s,\u03c0} becomes stationary with the optimal policy. For the leaf nodes (level H) of the search tree, Thompson sampling will converge to the optimal actions with probability 1 in the limit since the MABs are stationary. When all the leaf nodes converge, the distributions of the return values from them will not change, so the MABs of the nodes at level H \u2212 1 become stationary as well. Thus, Thompson sampling will also converge to the optimal actions for nodes at level H \u2212 1. Recursively, this holds for all the upper-level nodes. Therefore, we conclude that DNG-MCTS can find the optimal policy for the root node if unbounded computational resources are given.\n\n4 Experiments\n\nWe have tested our DNG-MCTS algorithm and compared the results with UCT in three common MDP benchmark domains, namely the Canadian traveler problem, racetrack and sailing. These problems are modeled as cost-based MDPs. 
That is, a cost function c(s, a) is used instead of the reward function R(s, a), and the min operator is used in the Bellman equation instead of the max operator. Similarly, the objective of solving a cost-based MDP is to find an optimal policy that minimizes the expected accumulated cost for each state. Notice that algorithms developed for reward-based MDPs can be straightforwardly transformed and applied to cost-based MDPs by simply using the min operator instead of max in the Bellman update routines. Accordingly, the min operator is used in the function ThompsonSampling of our transformed DNG-MCTS algorithm. We implemented our code and conducted the experiments on the basis of MDP-engine, an open-source software package with a collection of problem instances and base algorithms for MDPs.1\n\n1 MDP-engine can be publicly accessed via https://code.google.com/p/mdp-engine/\n\nTable 1: CTP problems with 20 nodes. The second column indicates the belief size of the transformed MDP for each problem instance. UCTB and UCTO are the two domain-specific UCT implementations [18]. DNG-MCTS and UCT run for 10,000 iterations. Boldface entries are best in the whole table; gray cells mark the best among domain-independent implementations for each group. The data for UCTB, UCTO and UCT are taken from [16].\n\nprob.   belief     | domain-specific UCT  | random rollout      | optimistic rollout\n                   | UCTB      UCTO       | UCT       DNG       | UCT       DNG\n20-1    20 \u00d7 3^49  | 210.7\u00b17   169.0\u00b16    | 216.4\u00b13   223.9\u00b14   | 180.7\u00b13   177.1\u00b13\n20-2    20 \u00d7 3^49  | 176.4\u00b14   148.9\u00b13    | 178.5\u00b12   178.1\u00b12   | 160.8\u00b12   155.2\u00b12\n20-3    20 \u00d7 3^51  | 150.7\u00b17   132.5\u00b16    | 169.7\u00b14   159.5\u00b14   | 144.3\u00b13   140.1\u00b13\n20-4    20 \u00d7 3^49  | 264.8\u00b19   235.2\u00b17    | 264.1\u00b14   266.8\u00b14   | 238.3\u00b13   242.7\u00b14\n20-5    20 \u00d7 3^52  | 123.2\u00b17   111.3\u00b15    | 139.8\u00b14   133.4\u00b14   | 123.9\u00b13   122.1\u00b13\n20-6    20 \u00d7 3^49  | 165.4\u00b16   133.1\u00b13    | 178.0\u00b13   169.8\u00b13   | 167.8\u00b12   141.9\u00b12\n20-7    20 \u00d7 3^50  | 191.6\u00b16   148.2\u00b14    | 211.8\u00b13   214.9\u00b14   | 174.1\u00b12   166.1\u00b13\n20-8    20 \u00d7 3^51  | 160.1\u00b17   134.5\u00b15    | 218.5\u00b14   202.3\u00b14   | 152.3\u00b13   151.4\u00b13\n20-9    20 \u00d7 3^50  | 235.2\u00b16   173.9\u00b14    | 251.9\u00b13   246.0\u00b13   | 185.2\u00b12   180.4\u00b12\n20-10   20 \u00d7 3^49  | 180.8\u00b17   167.0\u00b15    | 185.7\u00b13   188.9\u00b14   | 178.5\u00b13   170.5\u00b13\ntotal              | 1858.9    1553.6     | 2014.4    1983.68   | 1705.9    1647.4\n\nIn each benchmark problem, we (1) ran the transformed algorithms for a number of iterations from the current state, (2) applied the best action based on the resulting action-values, (3) repeated the loop until terminating conditions were met (e.g., a goal state is reached or the maximal number of running steps is reached), and (4) reported the total discounted cost. The performance of the algorithms is evaluated by the average total discounted cost over 1,000 independent runs. In all experiments, (\u03bc_{s,0}, \u03bb_s, \u03b1_s, \u03b2_s) is initialized to (0, 0.01, 1, 100), and \u03c1_{s,a,s\u2032} is initialized to 0.01 for all s \u2208 S, a \u2208 A and s\u2032 \u2208 S. 
For fair comparison, we also use the same settings as in [16]: for each decision node, (1) only applicable actions are selected, (2) applicable actions are forced to be selected once before any of them is selected twice or more, and (3) the exploration constant for the UCT algorithm is set to the current mean action-value Q(s, a, d).\nThe Canadian traveler problem (CTP) is a path-finding problem with imperfect information over a graph whose edges may be blocked with given prior probabilities [17]. A CTP can be modeled as a deterministic POMDP, i.e., the only source of uncertainty is the initial belief. When transformed to an MDP, the size of the belief space is n \u00d7 3^m, where n is the number of nodes and m is the number of edges. This problem has a discount factor \u03b3 = 1. The aim is to navigate to the goal state as quickly as possible. It has recently been addressed by an anytime variation of AO*, named AOT [16], and by two domain-specific implementations of UCT, named UCTB and UCTO, which take advantage of the specific MDP structure of the CTP and use a more informed base policy [18]. In this experiment, we used the same 10 problem instances with 20 nodes as in their papers.\nWhen running DNG-MCTS and UCT on these CTP instances, the number of iterations for each decision was set to 10,000, identical to [16]. Two types of default rollout policy were tested: the random policy, which selects actions with equal probabilities, and the optimistic policy, which assumes traversability for unknown edges and selects actions according to estimated cost. The results are shown in Table 1. Similar to [16], we include the results of UCTB and UCTO as a reference. From the table, we can see that DNG-MCTS outperformed the domain-independent version of UCT with the random rollout policy in several instances, and performed particularly better than UCT with the optimistic rollout policy. 
Although DNG-MCTS is not as good as the domain-specific UCTO, it is competitive with the general UCT algorithm in this domain.

The racetrack problem simulates a car race [19], where a car starts in a set of initial states and moves towards the goal. At each time step, the car can choose to accelerate in one of eight directions. When moving, its acceleration succeeds with probability 0.9 and fails with probability 0.1. We tested DNG-MCTS and UCT with the random rollout policy and horizon H = 100 on the barto-big instance, which has a state space of size |S| = 22534. The discount factor is γ = 0.95 and the optimal cost is known to be 21.38. We report the curve of the average cost as a function of the number of iterations in Figure 2a. Each data point in the figure was averaged over 1,000 runs, each of which was allowed to run for at most 100 steps. It can be seen from the figure that DNG-MCTS converged faster than UCT in terms of sample complexity in this domain.

(a) Racetrack-barto-big with random policy    (b) Sailing-100 × 100 with random policy

Figure 2: Performance curves for Racetrack and Sailing

The sailing domain is adopted from [3]. In this domain, a sailboat navigates to a destination on an 8-connected grid. The direction of the wind changes over time according to prior transition probabilities. The goal is to reach the destination as quickly as possible by choosing, at each grid location, a neighbouring location to move to. The discount factor in this domain is γ = 0.95 and the maximum horizon is set to H = 100. We ran DNG-MCTS and UCT with the random rollout policy on a 100 × 100 instance of this domain. This instance has 80,000 states and the optimal cost is 26.08. The performance curve is shown in Figure 2b.
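At every decision node of these runs, DNG-MCTS selects an action by drawing one sample from each action's NormalGamma posterior and taking the best. A minimal sketch of that selection step, using only Python's standard library (the function names are ours, not from the implementation; the prior (0, 0.01, 1, 100) matches the initialization used in our experiments):

```python
import random

def sample_normalgamma(mu0, lam, alpha, beta, rng):
    """Draw (mu, tau) from a NormalGamma(mu0, lam, alpha, beta) distribution."""
    tau = rng.gammavariate(alpha, 1.0 / beta)        # precision ~ Gamma(alpha, rate beta)
    mu = rng.gauss(mu0, (1.0 / (lam * tau)) ** 0.5)  # mean ~ Normal(mu0, 1 / (lam * tau))
    return mu, tau

def thompson_select(posteriors, rng, minimize=True):
    """Thompson sampling over actions: sample a mean value per action, pick the best.

    `posteriors` maps each action to its (mu0, lam, alpha, beta) parameters.
    In cost-based domains such as CTP, racetrack and sailing, lower is better."""
    sampled = {a: sample_normalgamma(*p, rng)[0] for a, p in posteriors.items()}
    pick = min if minimize else max
    return pick(sampled, key=sampled.get)
```

As the posterior of an action concentrates with more observed returns, its sampled means spread less, so exploration decays naturally without a tuned exploration constant.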
A trend similar to the racetrack problem can be observed in the graph: DNG-MCTS converged faster than UCT in terms of sample complexity.

Regarding computational complexity, although the total computation time of our algorithm is linear in the total sample size, which is at most width × depth (where width is the number of iterations and depth is the maximal horizon), our approach does require more computation than plain UCT. Specifically, we observed that most of the computation time of DNG-MCTS is spent sampling from distributions in Thompson sampling. Thus, DNG-MCTS usually consumes more time than UCT in a single iteration. Based on our experimental results on the benchmark problems, DNG-MCTS typically needs about 2 to 4 times as much computation time per iteration as UCT (depending on the problem and the stage of the algorithm). However, if simulations are expensive (e.g., computational physics in a 3D environment, where the cost of executing the simulation steps greatly exceeds the time needed by the action-selection steps in MCTS), DNG-MCTS can obtain much better performance than UCT in terms of overall computational cost, because it is expected to have lower sample complexity.

5 Conclusion

In this paper, we proposed the DNG-MCTS algorithm, a novel Bayesian modeling and inference based Thompson sampling approach using MCTS for MDP online planning. The basic assumption of DNG-MCTS is to model the uncertainty of the accumulated reward for each state-action pair as a mixture of Normal distributions. We presented the overall Bayesian framework for representing, updating, and propagating probability distributions over rewards in the MCTS search tree, and for making decisions based on them.
Our experimental results confirmed that, compared with the general UCT algorithm, DNG-MCTS produced competitive results in the CTP domain, and converged faster, in terms of sample complexity, in the racetrack and sailing domains. In the future, we plan to extend our basic assumption to more complex distributions and to test our algorithm on real-world applications.

Acknowledgements

This work is supported in part by the National Hi-Tech Project of China under grant 2008AA01Z150 and the Natural Science Foundation of China under grants 60745002 and 61175057. Feng Wu is supported in part by the ORCHID project (http://www.orchid.ac.uk). We are grateful to the anonymous reviewers for their constructive comments and suggestions.

References

[1] S. Gelly and D. Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.

[2] Mark H. M. Winands, Yngvi Björnsson, and J. Saito. Monte Carlo tree search in Lines of Action. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):239–250, 2010.

[3] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293, 2006.

[4] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.

[5] Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. Online planning for ad hoc autonomous agent teams. In International Joint Conference on Artificial Intelligence, pages 439–445, 2011.

[6] Arthur Guez, David Silver, and Peter Dayan.
Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1034–1042, 2012.

[7] John Asmuth and Michael L. Littman. Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In Uncertainty in Artificial Intelligence, pages 19–26, 2011.

[8] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.

[9] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[10] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213, 2012.

[11] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In AAAI Conference on Artificial Intelligence, pages 761–768, 1998.

[12] Gerald Tesauro, V. T. Rajan, and Richard Segal. Bayesian inference in Monte-Carlo tree search. In Uncertainty in Artificial Intelligence, pages 580–588, 2010.

[13] Galin L. Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299–320, 2004.

[14] Anirban DasGupta. Asymptotic Theory of Statistics and Probability. Springer, 2008.

[15] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

[16] Blai Bonet and Hector Geffner. Action selection for MDPs: Anytime AO* vs. UCT. In AAAI Conference on Artificial Intelligence, pages 1749–1755, 2012.

[17] Christos H. Papadimitriou and Mihalis Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127–150, 1991.

[18] Patrick Eyerich, Thomas Keller, and Malte Helmert.
High-quality policies for the Canadian traveler's problem. In AAAI Conference on Artificial Intelligence, pages 51–58, 2010.

[19] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.