{"title": "Variance Reduction in Monte-Carlo Tree Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1836, "page_last": 1844, "abstract": "Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can't Stop and Dominion.", "full_text": "Variance Reduction in Monte-Carlo Tree Search\n\nJoel Veness\n\nUniversity of Alberta\n\nMarc Lanctot\n\nUniversity of Alberta\n\nMichael Bowling\n\nUniversity of Alberta\n\nveness@cs.ualberta.ca\n\nlanctot@cs.ualberta.ca\n\nbowling@cs.ualberta.ca\n\nAbstract\n\nMonte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning\ntechnique for decision-making in single-agent and adversarial environments. The\nstochastic nature of the Monte-Carlo simulations introduces errors in the value es-\ntimates, both in terms of bias and variance. Whilst reducing bias (typically through\nthe addition of domain knowledge) has been studied in the MCTS literature, com-\nparatively little effort has focused on reducing variance. This is somewhat sur-\nprising, since variance reduction techniques are a well-studied area in classical\nstatistics. 
In this paper, we examine the application of some standard techniques\nfor variance reduction in MCTS, including common random numbers, antithetic\nvariates and control variates. We demonstrate how these techniques can be applied\nto MCTS and explore their ef\ufb01cacy on three different stochastic, single-agent set-\ntings: Pig, Can\u2019t Stop and Dominion.\n\n1\n\nIntroduction\n\nMonte-Carlo Tree Search (MCTS) has become a popular approach for decision making in large\ndomains. The fundamental idea is to iteratively construct a search tree, whose internal nodes contain\nvalue estimates, by using Monte-Carlo simulations. These value estimates are used to direct the\ngrowth of the search tree and to estimate the value under the optimal policy from each internal node.\nThis general approach [6] has been successfully adapted to a variety of challenging problem settings,\nincluding Markov Decision Processes, Partially Observable Markov Decision Processes, Real-Time\nStrategy games, Computer Go and General Game Playing [15, 22, 2, 9, 12, 10].\nDue to its popularity, considerable effort has been made to improve the ef\ufb01ciency of Monte-Carlo\nTree Search. Noteworthy enhancements include the addition of domain knowledge [12, 13], paral-\nlelization [7], Rapid Action Value Estimation (RAVE) [11], automated parameter tuning [8] and roll-\nout policy optimization [21]. Somewhat surprisingly however, the application of classical variance\nreduction techniques to MCTS has remained unexplored. In this paper we survey some common\nvariance reduction ideas and show how they can be used to improve the ef\ufb01ciency of MCTS.\nFor our investigation, we studied three stochastic games: Pig [16], Can\u2019t Stop [19] and Dominion\n[24]. We found that substantial increases in performance can be obtained by using the appropriate\ncombination of variance reduction techniques. 
To the best of our knowledge, our work constitutes\nthe \ufb01rst investigation of classical variance reduction techniques in the context of MCTS. By showing\nsome examples of these techniques working in practice, as well as discussing the issues involved in\ntheir application, this paper aims to bring this useful set of techniques to the attention of the wider\nMCTS community.\n\n2 Background\n\nWe begin with a short overview of Markov Decision Processes and online planning using Monte-\nCarlo Tree Search.\n\n1\n\n\f2.1 Markov Decision Processes\n\nA Markov Decision Process (MDP) is a popular formalism [4, 23] for modeling sequential decision\nmaking problems. Although more general setups exist, it will be suf\ufb01cient to limit our attention to\nthe case of \ufb01nite MDPs. Formally, a \ufb01nite MDP is a triplet (S,A,P0), where S is a \ufb01nite, non-\nempty set of states, A is a \ufb01nite, non-empty set of actions and P0 is the transition probability kernel\nthat assigns to each state-action pair (s, a) \u2208 S \u00d7A a probability measure over S \u00d7R that we denote\nby P0(\u00b7| s, a). S and A are known as the state space and action space respectively. Without loss\nof generality, we assume that the state always contains the current time index t \u2208 N. The transition\nprobability kernel gives rise to the state transition kernel P(s, a, s(cid:48)) := P0({s(cid:48)} \u00d7 R| s, a), which\ngives the probability of transitioning from state s to state s(cid:48) if action a is taken in s. An agent\u2019s\nbehavior can be described by a policy that de\ufb01nes, for each state s \u2208 S, a probability measure over\nA denoted by \u03c0(\u00b7| s). At each time t, the agent communicates an action At \u223c \u03c0(\u00b7| St) to the system\nin state St \u2208 S. The system then responds with a state-reward pair (St+1, Rt+1) \u223c P0(\u00b7| St, At),\nwhere St+1 \u2208 S and Rt+1 \u2208 R. 
We will assume that each reward lies within [rmin, rmax] \u2282 R and that the system executes for only a finite number of steps n \u2208 N so that t \u2264 n. Given a sequence of random variables At, St+1, Rt+1, . . . , An\u22121, Sn, Rn describing the execution of the system up to time n from a state st, the return from st is defined as X_{s_t} := \sum_{i=t+1}^{n} R_i. The return X_{s_t,a_t} with respect to a state-action pair (st, at) \u2208 S \u00d7 A is defined similarly, with the added constraint that At = at. An optimal policy, denoted by \u03c0\u2217, is a policy that maximizes the expected return E[X_{s_t}] for all states st \u2208 S. A deterministic optimal policy always exists for this class of MDPs.\n\n2.2 Online Monte-Carlo Planning in MDPs\n\nIf the state space is small, an optimal action can be computed offline for each state using techniques such as exhaustive Expectimax Search [18] or Q-Learning [23]. Unfortunately, state spaces too large for these approaches are regularly encountered in practice. One way to deal with this is to use online planning. This involves repeatedly using search to compute an approximation to the optimal action from the current state. This effectively amortizes the planning effort across multiple time steps, and implicitly focuses the approximation effort on the relevant parts of the state space.\nA popular way to construct an online planning algorithm is to use a depth-limited version of an exhaustive search technique (such as Expectimax Search) in conjunction with iterative deepening [18]. Although this approach works well in domains with limited stochasticity, it scales poorly in highly stochastic MDPs. This is because of the exhaustive enumeration of all possible successor states at chance nodes. This enumeration severely limits the maximum search depth that can be obtained given reasonable time constraints. 
Depth-limited exhaustive search is generally outperformed by\nMonte-Carlo planning techniques in these situations.\nA canonical example of online Monte-Carlo planning is 1-ply rollout-based planning [3]. It com-\nbines a default policy \u03c0 with a one-ply lookahead search. At each time t < n, given a starting\nstate st, for each at \u2208 A and with t < i < n, E [Xst,at | Ai \u223c \u03c0(\u00b7| Si)] is estimated by gen-\nerating trajectories St+1, Rt+1, . . . , An\u22121, Sn, Rn of agent-system interaction. From these tra-\njectories, sample means \u00afXst,at are computed for all at \u2208 A. The agent then selects the action\nAt := argmaxat\u2208A \u00afXst,a, and observes the system response (St+1, Rt+1). This process is then\nrepeated until time n. Under some mild assumptions, this technique is provably superior to exe-\ncuting the default policy [3]. One of the main advantages of rollout based planning compared with\nexhaustive depth-limited search is that a much larger search horizon can be used. The disadvan-\ntage however is that if \u03c0 is suboptimal, then E [Xst,a | Ai \u223c \u03c0(\u00b7| Si)] < E [Xst,a | Ai \u223c \u03c0\u2217(\u00b7| Si)]\nfor at least one state-action pair (st, a) \u2208 S \u00d7 A, which implies that at least some value estimates\nconstructed by 1-ply rollout-based planning are biased. This can lead to mistakes which cannot be\ncorrected through additional sampling. The bias can be reduced by incorporating more knowledge\ninto the default policy, however this can be both dif\ufb01cult and time consuming.\nMonte-Carlo Tree Search algorithms improve on this procedure, by providing a means to construct\nasymptotically consistent estimates of the return under the optimal policy from simulation trajecto-\nries. The UCT algorithm [15] in particular has been shown to work well in practice. Like rollout-\nbased planning, it uses a default policy to generate trajectories of agent-system interaction. 
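The 1-ply rollout-based planner described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the simulator interface `step(state, action) -> (next_state, reward, done)` and all names here are our assumptions.

```python
def rollout_plan(state, actions, step, default_policy, num_rollouts=100):
    """1-ply rollout planning: estimate each action's expected return by
    sampling trajectories that follow the default policy thereafter, then
    pick the action with the highest sample-mean return."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(num_rollouts):
            s, r, done = step(state, a)   # take the candidate action once
            ret = r
            while not done:               # then follow the default policy
                s, r, done = step(s, default_policy(s))
                ret += r
            total += ret
        if total / num_rollouts > best_value:
            best_action, best_value = a, total / num_rollouts
    return best_action
```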
However now the construction of a search tree is also interleaved within this process, with nodes corresponding to states and edges corresponding to state-action pairs. Initially, the search tree consists of a single node, which represents the current state st at time t. One or more simulations are then performed. We will use T_m \u2282 S to denote the set of states contained within the search tree after m \u2208 N simulations. Associated with each state-action pair (s, a) \u2208 S \u00d7 A is an estimate \bar{X}^m_{s,a} of the return under the optimal policy and a count T^m_{s,a} \u2208 N representing the number of times this state-action pair has been visited after m simulations, with T^0_{s,a} := 0 and \bar{X}^0_{s,a} := 0.\nEach simulation can be broken down into four phases: selection, expansion, rollout and backup. Selection involves traversing a path from the root node to a leaf node in the following manner: for each non-leaf, internal node representing some state s on this path, the UCB [1] criterion is applied to select an action until a leaf node corresponding to state sl is reached. If U(B^m_s) denotes the uniform distribution over the set of unexplored actions B^m_s := {a \u2208 A : T^m_{s,a} = 0}, and T^m_s := \sum_{a \in A} T^m_{s,a}, UCB at state s selects\n\nA^{m+1}_s := argmax_{a \in A} \left\{ \bar{X}^m_{s,a} + c \sqrt{\log(T^m_s)/T^m_{s,a}} \right\},    (1)\n\nif B^m_s = \emptyset, or A^{m+1}_s \sim U(B^m_s) otherwise. The ratio of exploration to exploitation is controlled by the positive constant c \u2208 R. In the case of more than one maximizing action, ties are broken uniformly at random. Provided sl is non-terminal, the expansion phase is then executed, by selecting an action Al \u223c \u03c0(\u00b7| sl), observing a successor state Sl+1 = sl+1, and then adding a node to the search tree so that T_{m+1} = T_m \u222a {sl+1}. 
Higher values of c increase the level of exploration, which in turn leads to more shallow and symmetric tree growth. The rollout phase is then invoked, which for l < i < n, executes actions Ai \u223c \u03c0(\u00b7| Si). At this point, a complete agent-system execution trajectory (at, st+1, rt+1, . . . , an\u22121, sn, rn) from st has been realized. The backup phase then assigns, for t \u2264 k < n,\n\n\bar{X}^{m+1}_{s_k,a_k} \leftarrow \bar{X}^m_{s_k,a_k} + \frac{1}{T^m_{s_k,a_k}+1} \left( \sum_{i=t+1}^{n} r_i - \bar{X}^m_{s_k,a_k} \right),    T^{m+1}_{s_k,a_k} \leftarrow T^m_{s_k,a_k} + 1,\n\nto each (sk, ak) \u2208 T_{m+1} occurring on the realized trajectory. Notice that for all (s, a) \u2208 S \u00d7 A, the value estimate \bar{X}^m_{s,a} corresponds to the average return of the realized simulation trajectories passing through state-action pair (s, a). After the desired number of simulations k has been performed in state st, the action with the highest expected return at := argmax_{a \in A} \bar{X}^k_{s_t,a} is selected. With an appropriate [15] value of c, as m \u2192 \u221e, the value estimates converge to the expected return under the optimal policy. However, due to the stochastic nature of the UCT algorithm, each value estimate \bar{X}^m_{s,a} is subject to error, in terms of both bias and variance, for finite m. While previous work (see Section 1) has focused on improving these estimates by reducing bias, little attention has been given to improvements via variance reduction. The next section describes how the accuracy of UCT's value estimates can be improved by adapting classical variance reduction techniques to MCTS.\n\n3 Variance Reduction in MCTS\n\nThis section describes how three variance reduction techniques \u2014 control variates, common random numbers and antithetic variates \u2014 can be applied to the UCT algorithm. 
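To fix a concrete picture of the UCB selection rule and the incremental-mean backup described in Section 2.2, the following minimal sketch may help; the `Node` container and all names are our hypothetical assumptions, not the paper's implementation.

```python
import math
import random

class Node:
    """Per-state statistics: visit counts T_{s,a} and value estimates X_{s,a}."""
    def __init__(self, actions):
        self.actions = actions
        self.counts = {a: 0 for a in actions}
        self.values = {a: 0.0 for a in actions}

def ucb_select(node, c):
    """UCB action selection: any untried action is taken first (uniformly at
    random); otherwise maximise value estimate plus exploration bonus."""
    untried = [a for a in node.actions if node.counts[a] == 0]
    if untried:
        return random.choice(untried)
    total = sum(node.counts.values())
    return max(node.actions,
               key=lambda a: node.values[a]
                             + c * math.sqrt(math.log(total) / node.counts[a]))

def backup(path, rewards):
    """Incremental-mean backup: each (node, action) on the realised path is
    moved toward the return accumulated from that point onward."""
    ret = 0.0
    for (node, a), r in zip(reversed(path), reversed(rewards)):
        ret += r
        node.counts[a] += 1
        node.values[a] += (ret - node.values[a]) / node.counts[a]
```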
Each subsection begins with a short overview of each variance reduction technique, followed by a description of how UCT can be modified to efficiently incorporate it. Whilst we restrict our attention to planning in MDPs using the UCT algorithm, the ideas and techniques we present are quite general. For example, similar modifications could be made to the Sparse Sampling [14] or AMS [5] algorithms for planning in MDPs, or to the POMCP algorithm [22] for planning in POMDPs. In what follows, given an independent and identically distributed sample (X1, X2, . . . , Xn), the sample mean is denoted by \bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i. Provided E[X] exists, \bar{X} is an unbiased estimator of E[X] with variance n^{-1} Var[X].\n\n3.1 Control Variates\n\nAn improved estimate of E[X] can be constructed if we have access to an additional statistic Y that is correlated with X, provided that \u00b5Y := E[Y] exists and is known. To see this, note that if Z := X + c(Y \u2212 E[Y]), then \bar{Z} is an unbiased estimator of E[X], for any c \u2208 R. Y is called the control variate. One can show that Var[Z] is minimised for c\u2217 := \u2212Cov[X, Y]/Var[Y]. Given a sample (X1, Y1), (X2, Y2), . . . , (Xn, Yn) and setting c = c\u2217, the control variate enhanced estimator\n\n\bar{X}_{cv} := \frac{1}{n} \sum_{i=1}^{n} [X_i + c^{*}(Y_i - \mu_Y)]    (2)\n\nis obtained, with variance\n\nVar[\bar{X}_{cv}] = \frac{1}{n} \left( Var[X] - \frac{Cov[X, Y]^2}{Var[Y]} \right).\n\nThus the total variance reduction is dependent on the strength of correlation between X and Y. For the optimal value of c, the variance reduction obtained by using Z in place of X is 100 \u00d7 Corr[X, Y]^2 percent. 
In practice, both Var[Y] and Cov[X, Y] are unknown and need to be estimated from data. One solution is to use the plug-in estimator C_n := \u2212\widehat{Cov}[X, Y]/\widehat{Var}[Y], where \widehat{Cov}[\u00b7,\u00b7] and \widehat{Var}[\u00b7] denote the sample covariance and sample variance respectively. This estimate can be constructed offline using an independent sample or be estimated online. Although replacing c\u2217 with an online estimate of C_n in Equation 2 introduces bias, this modified estimator is still consistent [17]. Thus online estimation is a reasonable choice for large n; we revisit the issue of small n later. Note that \bar{X}_{cv} can be efficiently computed with respect to C_n by maintaining \bar{X} and \bar{Y} online, since \bar{X}_{cv} = \bar{X} + C_n(\bar{Y} - \mu_Y).\n\nApplication to UCT. Control variates can be applied recursively, by redefining the return X_{s,a} for every state-action pair (s, a) \u2208 S \u00d7 A to\n\nZ_{s,a} := X_{s,a} + c_{s,a} (Y_{s,a} - E[Y_{s,a}]),    (3)\n\nprovided E[Y_{s,a}] exists and is known for all (s, a) \u2208 S \u00d7 A, and Y_{s,a} is a function of the random variables At, St+1, Rt+1, . . . , An\u22121, Sn, Rn that describe the complete execution of the system after action a is performed in state s. Notice that a separate control variate will be introduced for each state-action pair. Furthermore, as E[Z_{s_t,a_t} | Ai \u223c \u03c0(\u00b7| Si)] = E[X_{s_t,a_t} | Ai \u223c \u03c0(\u00b7| Si)], for all policies \u03c0, for all (st, at) \u2208 S \u00d7 A and for all t < i < n, the inductive argument [15] used to establish the asymptotic consistency of UCT still applies when control variates are introduced in this fashion.\nFinding appropriate control variates whose expectations are known in advance can prove difficult. This situation is further complicated in UCT where we seek a set of control variates {Y_{s,a}} for all (s, a) \u2208 S \u00d7 A. 
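The plug-in control-variate estimator described above can be sketched as follows; this is an illustrative stand-alone version (Equation 2 with the coefficient estimated from the same sample), with names of our choosing.

```python
def control_variate_mean(xs, ys, mu_y):
    """Control-variate enhanced mean: X_bar + C_n * (Y_bar - mu_y), where
    C_n = -Cov_hat[X, Y] / Var_hat[Y] is the plug-in coefficient estimated
    from the paired sample (xs, ys). mu_y is the known mean of Y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    c = -cov_xy / var_y
    return x_bar + c * (y_bar - mu_y)
```

When X and Y are perfectly correlated, the estimator recovers the exact mean regardless of sampling noise in Y, which is the limiting case of the 100 × Corr[X, Y]² percent variance reduction quoted above.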
Drawing inspiration from advantage sum estimators [25], we now provide a general class of control variates designed for application in UCT. Given a realization of a random simulation trajectory St = st, At = at, St+1 = st+1, At+1 = at+1, . . . , Sn = sn, consider control variates of the form\n\nY_{s_t,a_t} := \sum_{i=t}^{n-1} \left( I[b(S_{i+1})] - P[b(S_{i+1}) | S_i = s_i, A_i = a_i] \right),    (4)\n\nwhere b : S \u2192 {true, false} denotes a boolean function of state and I denotes the binary indicator function. In this case, the expectation\n\nE[Y_{s_t,a_t}] = \sum_{i=t}^{n-1} \left( E[I[b(S_{i+1})] | S_i = s_i, A_i = a_i] - P[b(S_{i+1}) | S_i = s_i, A_i = a_i] \right) = 0,\n\nfor all (st, at) \u2208 S \u00d7 A. Thus, using control variates of this form simplifies the task to specifying a state property that is strongly correlated with the return, such that P[b(S_{i+1}) | S_i = s_i, A_i = a_i] is known for all (si, ai) \u2208 S \u00d7 A, for all t \u2264 i < n. This considerably reduces the effort required to find an appropriate set of control variates for UCT.\n\n3.2 Common Random Numbers\n\nConsider comparing the expectation E[Y] to E[Z], where both Y := g(X) and Z := h(X) are functions of a common random variable X. This can be framed as estimating the value of \u03b4Y,Z := E[g(X)] \u2212 E[h(X)]. If the expectations E[g(X)] and E[h(X)] were estimated from two independent samples X1 and X2, the estimator \u02c6g(X1) \u2212 \u02c6h(X2) would be obtained, with variance Var[\u02c6g(X1) \u2212 \u02c6h(X2)] = Var[\u02c6g(X1)] + Var[\u02c6h(X2)]. Note that no covariance term appears since X1 and X2 are independent samples. The technique of common random numbers suggests setting X1 = X2 if Cov[\u02c6g(X1), \u02c6h(X2)] is positive. 
This gives the estimator \u02c6\u03b4Y,Z(X1) := \u02c6g(X1) \u2212 \u02c6h(X1), with variance Var[\u02c6g(X1)] + Var[\u02c6h(X1)] \u2212 2Cov[\u02c6g(X1), \u02c6h(X1)], which is an improvement whenever Cov[\u02c6g(X1), \u02c6h(X1)] is positive. This technique cannot be applied indiscriminately however, since a variance increase will result if the estimates are negatively correlated.\n\nApplication to UCT. Rather than directly reducing the variance of the individual return estimates, common random numbers can instead be applied to reduce the variance of the estimated differences in return \bar{X}^m_{s,a} - \bar{X}^m_{s,a'}, for each pair of distinct actions a, a' \u2208 A in a state s. This has the benefit of reducing the effect of variance in both determining the action at := argmax_{a \in A} \bar{X}^m_{s,a} selected by UCT in state st and the actions argmax_{a \in A} \bar{X}^m_{s,a} + c \sqrt{\log(T^m_s)/T^m_{s,a}} selected by UCB as the search tree is constructed.\nAs each estimate \bar{X}^m_{s,a} is a function of realized simulation trajectories originating from state-action pair (s, a), a carefully chosen subset of the stochastic events determining the realized state transitions now needs to be shared across future trajectories originating from s so that Cov[\bar{X}^m_{s,a}, \bar{X}^m_{s,a'}] is positive for all m \u2208 N and for all distinct pairs of actions a, a' \u2208 A. Our approach is to use the same chance outcomes to determine the trajectories originating from state-action pairs (s, a) and (s, a') if T^i_{s,a} = T^j_{s,a'}, for any a, a' \u2208 A and i, j \u2208 N. This can be implemented by using T^m_{s,a} to index into a list of stored stochastic outcomes E_s defined for each state s. By only adding a new outcome to E_s when T_{s,a} exceeds the number of elements in E_s, the list of common chance outcomes can be efficiently generated online. 
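The bookkeeping for the per-state outcome lists just described can be sketched as follows; the class and method names are our hypothetical choices.

```python
class SharedOutcomes:
    """Per-state list E_s of stored chance outcomes. The visit count T_{s,a}
    indexes into the list, so trajectories with equal visit counts for
    different actions reuse the same chance events (common random numbers).
    A new outcome is drawn only when the index runs past the end of E_s."""
    def __init__(self, sample_outcome):
        self.events = {}            # state -> list of stored outcomes
        self.sample = sample_outcome

    def outcome(self, state, visit_index):
        es = self.events.setdefault(state, [])
        while len(es) <= visit_index:
            es.append(self.sample())  # extend E_s lazily
        return es[visit_index]
```

Calling `outcome(s, T[s][a])` for each action `a` then guarantees that the i-th trajectory from (s, a) and the i-th trajectory from (s, a') see the same chance event.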
This idea can be applied recursively, provided that the shared chance events from the current state do not conflict with those defined at any possible ancestor state.\n\n3.3 Antithetic Variates\n\nConsider estimating E[X] with \u02c6h(X, Y) := \frac{1}{2}(\u02c6h_1(X) + \u02c6h_2(Y)), the average of two unbiased estimates \u02c6h_1(X) and \u02c6h_2(Y), computed from two identically distributed samples X = (X1, X2, . . . , Xn) and Y = (Y1, Y2, . . . , Yn). The variance of \u02c6h(X, Y) is\n\nVar[\u02c6h(X, Y)] = \frac{1}{4} \left( Var[\u02c6h_1(X)] + Var[\u02c6h_2(Y)] \right) + \frac{1}{2} Cov[\u02c6h_1(X), \u02c6h_2(Y)].    (5)\n\nThe method of antithetic variates exploits this identity, by deliberately introducing a negative correlation between \u02c6h_1(X) and \u02c6h_2(Y). The usual way to do this is to construct X and Y from pairs of sample points (Xi, Yi) such that Cov[h_1(Xi), h_2(Yi)] < 0 for all i \u2264 n. So that \u02c6h_2(Y) remains an unbiased estimate of E[X], care needs to be taken when making Y depend on X.\n\nApplication to UCT. Like the technique of common random numbers, antithetic variates can be applied to UCT by modifying the way simulation trajectories are sampled. Whenever a node representing (si, ai) \u2208 S \u00d7 A is visited during the backup phase of UCT, the realized trajectory si+1, ri+1, ai+1, . . . , sn, rn from (si, ai) is now stored in memory if T^m_{s_i,a_i} mod 2 \u2261 0. The next time this node is visited during the selection phase, the previous trajectory is used to predetermine one or more antithetic events that will (partially) drive subsequent state transitions for the current simulation trajectory. After this, the memory used to store the previous simulation trajectory is reclaimed. 
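For a single die, the record-then-replay antithetic scheme just described can be sketched as follows. The pairing 1↔6, 2↔5, 3↔4 is our illustrative choice for a generic dice setting, not necessarily the mapping used in the paper, and `simulate` is an assumed interface that runs one trajectory given a roll source.

```python
import random

ANTITHETIC = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}  # lucky <-> unlucky pairing

def paired_rollouts(simulate, num_pairs, rng=random):
    """Average returns over antithetic pairs: the first trajectory draws
    fresh die rolls and records them; the second trajectory replays the
    antithetic partner of each recorded roll, inducing negative correlation
    between the two returns."""
    total = 0.0
    for _ in range(num_pairs):
        recorded = []
        def fresh():
            r = rng.randint(1, 6)
            recorded.append(r)       # remember rolls for the partner run
            return r
        total += simulate(fresh)
        replay = iter(recorded)
        total += simulate(lambda: ANTITHETIC[next(replay)])
    return total / (2 * num_pairs)
```

With a return that is just the sum of the rolls, each antithetic pair sums to a constant, so the paired estimator has zero variance; in less extreme cases the variance shrinks according to Equation 5.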
This technique can be applied to all state-action pairs inside the tree, provided that the antithetic events determined by any state-action pair do not overlap with the antithetic events defined by any possible ancestor.\n\n4 Empirical Results\n\nThis section begins with a description of our test domains, and how our various variance reduction ideas can be applied to them. We then investigate the performance of UCT when enhanced with various combinations of these techniques.\n\n4.1 Test Domains\n\nPig is a turn-based jeopardy dice game that can be played with one or more players [20]. Players roll two dice each turn and keep a turn total. At each decision point, they have two actions, roll and stop. If they decide to stop, they add their turn total to their total score. Normally, dice rolls add to the player's turn total, with the following exceptions: if a single 1 is rolled, the turn total will be reset and the turn ended; if a double 1 is rolled, then the player's turn will end along with their total score being reset to 0. These possibilities make the game highly stochastic.\nCan't Stop is a dice game where the goal is to obtain three complete columns by reaching the highest level in each of the 2-12 columns [19]. This is done by repeatedly rolling 4 dice and playing zero or more pairing combinations. Once a pairing combination is played, a marker is placed on the associated column and moved upwards. Only three distinct columns can be used during any given turn. If the dice are rolled and no legal pairing combination can be made, the player loses all of the progress made towards completing columns on this turn. After rolling and making a legal pairing, a player can choose to lock in their progress by ending their turn. 
A key component of the game involves correctly assessing the risk associated with not being able to make a legal dice pairing given the current board configuration.\nDominion is a popular turn-based, deck-building card game [24]. It involves acquiring cards by spending the money cards in your current deck. Bought cards have certain effects that allow you to buy more cards, get more money, draw more cards, and earn victory points. The goal is to get as many victory points as possible.\nIn all cases, we used solitaire variants of the games where the aim is to maximize the number of points given a fixed number of turns. All of our domains can be represented as finite MDPs. The game of Pig contains approximately 2.4 \u00d7 10^6 states. Can't Stop and Dominion are significantly more challenging, containing in excess of 10^24 and 10^30 states respectively.\n\n4.2 Application of Variance Reduction Techniques\n\nWe now describe the application of each technique to the games of Pig, Can't Stop and Dominion.\n\nControl Variates. The control variates used for all domains were of the form specified by Equation 4 in Section 3.1. In Pig, we used a boolean function that returned true if we had just performed the roll action and obtained at least one 1. This control variate has an intuitive interpretation, since we would expect the return from a single trajectory to be an underestimate if it contained more rolls with a 1 than expected, and an overestimate if it contained fewer rolls with a 1 than expected. In Can't Stop, we used a similarly inspired boolean function that returned true if we could not make a legal pairing from our most recent roll of the 4 dice. In Dominion, we used a boolean function that returned whether we had just played an action that let us randomly draw a hand with 8 or more money to spend. This is a significant occurrence, as 8 money is needed to buy a Province, the highest scoring card in the game. 
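Returning to the Pig control variate above, an Equation-4 style variate can be sketched as follows. The trajectory encoding is our assumption; the probability that two dice show at least one 1 is 1 − 25/36 = 11/36, which makes the variate zero-mean by construction.

```python
def pig_control_variate(trajectory):
    """Equation-4 style control variate for Pig: for each roll action in a
    realised trajectory, accumulate I[at least one 1 was rolled] minus its
    known probability 11/36. `trajectory` is a list of (action, dice) pairs,
    where dice is the (d1, d2) outcome of a roll (ignored for 'stop')."""
    P_AT_LEAST_ONE_1 = 11.0 / 36.0
    y = 0.0
    for action, dice in trajectory:
        if action == "roll":
            y += (1.0 if 1 in dice else 0.0) - P_AT_LEAST_ONE_1
    return y
```

Adding `c_{s,a} * y` to the realised return, as in Equation 3, then discounts unluckier-than-average trajectories and boosts luckier-than-average ones.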
Strong play invariably requires purchasing as many Provinces as possible.\nWe used a mixture of online and offline estimation to determine the values of c_{s,a} to use in Equation 3. If T^m_{s,a} \u2265 50, the online estimate \u2212\widehat{Cov}[X_{s,a}, Y_{s,a}]/\widehat{Var}[Y_{s,a}] was used. When T^m_{s,a} < 50, the constants 6.0, 6.0 and \u22120.7 were used for Pig, Can't Stop and Dominion respectively. These constants were obtained by computing offline estimates of \u2212\widehat{Cov}[X_{s,a}, Y_{s,a}]/\widehat{Var}[Y_{s,a}] across a representative sample of game situations. This combination gave better performance than either scheme in isolation.\n\nCommon Random Numbers. To apply the ideas in Section 3.2, we need to specify the future chance events to be shared across all of the trajectories originating from each state. Since a player's final score in Pig is strongly dependent on their dice rolls, it is natural to consider sharing one or more future dice roll outcomes. By exploiting the property in Pig that each roll event is independent of the current state, our implementation shares a batch of roll outcomes large enough to drive a complete simulation trajectory. So that these chance events don't conflict, we limited the sharing of roll events to just the root node. A similar technique was used in Can't Stop. We found this scheme to be superior to sharing a smaller number of future roll outcomes and applying the ideas in Section 3.2 recursively. In Dominion, stochasticity is caused by drawing cards from the top of a deck that is periodically shuffled. Here we implemented common random numbers by recursively sharing pre-shuffled deck configurations across the actions at each state. The motivation for this kind of sharing is that it should reduce the chance of one action appearing better than another simply because of \u201cluckier\u201d shuffles.\n\nAntithetic Variates. 
To apply the ideas in Section 3.3, we need to describe how the antithetic events are constructed from previous simulation trajectories. In Pig, a negative correlation between the returns of pairs of simulation trajectories can be induced by forcing the roll outcomes in the second trajectory to oppose those occurring in the first trajectory. Exploiting the property that the relative worth of each pair of dice outcomes is independent of state, a list of antithetic roll outcomes can be constructed by mapping each individual roll outcome in the first trajectory to its antithetic partner. For example, a lucky roll was paired with a correspondingly unlucky roll. A similar idea is used in Can't Stop, however the situation is more complicated, since the relative worth of each chance event varies from state to state.\n\nFigure 1: The estimated variance of the value estimates for the Roll action and estimated differences between actions on turn 1 in Pig.\n\nOur solution was to develop a state-dependent heuristic ranking function, which would assign an index between 0 and 1295 to the 6^4 distinct chance events for a given state. Chance events that are favorable in the current state are assigned low indexes, while unfavorable events are assigned high index values. When simulating a non-antithetic trajectory, the ranking for each chance event is recorded. Later when the antithetic trajectory needs to be simulated, the previously recorded rank indexes are used to compute the relevant antithetic event for the current state. This approach can be applied in a wide variety of domains where the stochastic outcomes can be ordered by how \u201clucky\u201d they are, e.g., suppliers' price fluctuations, rare catastrophic events, or higher than average click-through-rates. For Dominion, a number of antithetic mappings were tried, but none provided any substantial reduction in variance. 
The complexity of how cards can be played to draw more cards from one's deck makes a good or bad shuffle intricately dependent on the exact composition of cards in one's deck, of which there are intractably many possibilities with no obvious symmetries.\n\n4.3 Experimental Setup\n\nEach variance reduction technique is evaluated in combination with the UCT algorithm, with varying levels of search effort. In Pig, the default (rollout) policy plays the roll and stop actions with probability 0.8 and 0.2 respectively. In Can't Stop, the default policy will end the turn if a column has just been finished, otherwise it will choose to re-roll with probability 0.85. In Dominion, the default policy incorporates some simple domain knowledge that favors obtaining higher cost cards and avoiding redundant actions. The UCB constant c in Equation 1 was set to 100.0 for both Pig and Dominion and 5500.0 for Can't Stop.\n\n4.4 Evaluation\n\nWe performed two sets of experiments. The first is used to gain a deeper understanding of the role of bias and variance in UCT. The next set of results is used to assess the overall performance of UCT when augmented with our variance reduction techniques.\n\nBias versus Variance. When assessing the quality of an estimator using mean squared error (MSE), it is well known that the estimation error can be decomposed into two terms, bias and variance. Therefore, when assessing the potential impact of variance reduction, it is important to know just how much of the estimation error is caused by variance as opposed to bias. Since the game of Pig has approximately 2.4 \u00d7 10^6 states, we can solve it offline using Expectimax Search. This allows us to compute the expected return E[X_{s_1} | \u03c0\u2217] of the optimal action (roll) at the starting state s1. 
We use this value to compute both the bias-squared and variance components of the MSE for the estimated return of the roll action at s_1 when using UCT without variance reduction. This is shown in the leftmost graph of Figure 1. It seems that the dominating term in the MSE is the bias-squared. This is misleading, however, since the absolute error is not the only factor in determining which action is selected by UCT. More important is the difference between the estimated returns of the two actions, since UCT ultimately chooses the action with the largest estimated return. As Pig has just two actions, we can also compute the MSE of the estimated difference in return between rolling and stopping using UCT without variance reduction. This is shown by the rightmost graph in Figure 1. Here we see that variance is the dominating component (the bias is within ±2) when the number of simulations is less than 1024. The role of bias and variance will of course vary from domain to domain, but this result suggests that variance reduction may play an important role when trying to determine the best action.

(Figure 1 panels, left to right: "MSE and Bias² of Roll Value Estimator vs. Simulations in UCT" and "MSE and Bias² in Value Difference Estimator vs. Simulations in UCT"; both plot MSE and Bias² against log₂(Simulations).)

Figure 2: Performance Results for Pig, Can't Stop, and Dominion with 95% confidence intervals shown. Values on the vertical axis of each graph represent the average score.

Search Performance. Figure 2 shows the results of our variance reduction methods on Pig, Can't Stop and Dominion. Each data point for Pig, Can't Stop and Dominion is obtained by averaging the scores obtained across 50,000, 10,000 and 10,000 games respectively.
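The 95% confidence intervals attached to such averages follow from the standard normal approximation, which is reasonable at these sample sizes (a generic sketch, not tied to the experimental code):

```python
import math

def mean_ci95(scores):
    """Mean of per-game scores together with the half-width of a 95%
    normal-approximation confidence interval for that mean."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width
```

Since the half-width shrinks as 1/sqrt(n), distinguishing small performance differences in highly stochastic domains requires tens of thousands of games.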
Such a large number of games is needed to obtain statistically significant results due to the highly stochastic nature of each domain. 95% confidence intervals are shown for each data point. In Pig, the best approach consistently outperforms the base version of UCT, even when the latter is given twice the number of simulations. In Can't Stop, the best approach gave a performance increase roughly equivalent to using base UCT with 50-60% more simulations. The results also show a clear benefit to using variance reduction techniques in the challenging game of Dominion: here the best combination of techniques leads to an improvement roughly equivalent to using 25-40% more simulations. The use of antithetic variates in both Pig and Can't Stop gave a measurable increase in performance; however, the technique was less effective than either control variates or common random numbers. Control variates were particularly helpful across all domains, and even more effective when combined with common random numbers.

5 Discussion

Although our UCT modifications are designed to be lightweight, some additional overhead is unavoidable. Common random numbers and antithetic variates increase the space complexity of UCT by a multiplicative constant. Control variates typically increase the time complexity of each value backup by a constant. These factors need to be taken into consideration when evaluating the benefits of variance reduction for a particular domain. Note that surprising results are possible; for example, if generating the underlying chance events is expensive, using common random numbers or antithetic variates can even reduce the computational cost of each simulation. Ultimately, the effectiveness of variance reduction in MCTS is both domain and implementation specific.
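For reference, the classical control variate estimator underlying these results can be sketched as follows (a textbook illustration with hypothetical names; the control variates used in our experiments are domain specific and folded into the UCT value backup):

```python
def control_variate_mean(xs, zs, z_mean):
    """Estimate E[X] using a control variate Z with known mean z_mean.
    The coefficient b = Cov(X, Z) / Var(Z) minimizes the variance of the
    adjusted estimator x_bar - b * (z_bar - E[Z])."""
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    cov = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / (n - 1)
    var_z = sum((z - z_bar) ** 2 for z in zs) / (n - 1)
    b = cov / var_z if var_z > 0 else 0.0
    return x_bar - b * (z_bar - z_mean)
```

The more strongly Z correlates with X, the larger the variance reduction; estimating b from the same samples introduces a small bias that vanishes as the number of samples grows.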
That said, we would expect our techniques to be useful in many situations, especially in noisy domains or when each simulation is computationally expensive. In our experiments, the overhead of every technique was dominated by the cost of simulating to the end of the game.

6 Conclusion

This paper describes how control variates, common random numbers and antithetic variates can be used to improve the performance of Monte-Carlo Tree Search by reducing variance. Our main contribution is to describe how the UCT algorithm can be modified to efficiently incorporate these techniques in practice. In particular, we provide a general approach that significantly reduces the effort needed to recursively apply control variates. Using these methods, we demonstrated substantial performance improvements on the highly stochastic games of Pig, Can't Stop and Dominion. Our work should be of particular interest to those using Monte-Carlo planning in highly stochastic or resource-limited settings.

(Figure 2 panels, left to right: Pig, Can't Stop, and Dominion MCTS performance results, plotted against the number of simulations; curves: Base, AV, CRN, CV, and CVCRN, with no AV curve shown for Dominion.)