{"title": "Monte-Carlo Tree Search by Best Arm Identification", "book": "Advances in Neural Information Processing Systems", "page_first": 4897, "page_last": 4906, "abstract": "Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and are promising the resolution of a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth, that operate by summarizing all deeper levels of the tree into confidence intervals at depth one, and applying a best arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.", "full_text": "Monte-Carlo Tree Search by Best Arm Identi\ufb01cation\n\nCNRS & Univ. Lille, UMR 9189 (CRIStAL), Inria SequeL\n\nEmilie Kaufmann\n\nLille, France\n\nemilie.kaufmann@univ-lille1.fr\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica,\n\nScience Park 123, 1098 XG Amsterdam, The Netherlands\n\nwmkoolen@cwi.nl\n\nAbstract\n\nRecent advances in bandit tools and techniques for sequential learning are steadily\nenabling new applications and are promising the resolution of a range of challeng-\ning related problems. We study the game tree search problem, where the goal is to\nquickly identify the optimal move in a given game tree by sequentially sampling its\nstochastic payoffs. We develop new algorithms for trees of arbitrary depth, that op-\nerate by summarizing all deeper levels of the tree into con\ufb01dence intervals at depth\none, and applying a best arm identi\ufb01cation procedure at the root. 
We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.\n\n1 Introduction\n\nWe consider two-player zero-sum turn-based interactions, in which the sequence of possible successive moves is represented by a maximin game tree $\\mathcal{T}$. This tree models the possible action sequences by a collection of MAX nodes, that correspond to states in the game in which player A should take action, MIN nodes, for states in the game in which player B should take action, and leaves which specify the payoff for player A. The goal is to determine the best action at the root for player A. For deterministic payoffs this search problem is primarily algorithmic, with several powerful pruning strategies available [20]. We look at problems with stochastic payoffs, which in addition present a major statistical challenge.\n\nSequential identification questions in game trees with stochastic payoffs arise naturally as robust versions of bandit problems. They are also a core component of Monte Carlo tree search (MCTS) approaches for solving intractably large deterministic tree search problems, where an entire sub-tree is represented by a stochastic leaf in which randomized play-out and/or evaluations are performed [4]. A play-out consists of finishing the game with some simple, typically random, policy and observing the outcome for player A.\n\nFor example, MCTS is used within the AlphaGo system [21], and the evaluation of a leaf position combines supervised learning and (smart) play-outs. While MCTS algorithms for Go have now reached expert human level, such algorithms remain very costly, in that many (expensive) leaf evaluations or play-outs are necessary to output the next action to be taken by the player. 
In this paper, we focus on the sample complexity of Monte-Carlo Tree Search methods, about which very little is known. For this purpose, we work under a simplified model for MCTS already studied by [22], and that generalizes the depth-two framework of [10].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 A simple model for Monte-Carlo Tree Search\n\nWe start by fixing a game tree $\\mathcal{T}$, in which the root is a MAX node. Letting $\\mathcal{L}$ be the set of leaves of this tree, for each $\\ell \\in \\mathcal{L}$ we introduce a stochastic oracle $\\mathcal{O}_\\ell$ that represents the leaf evaluation or play-out performed when this leaf is reached by an MCTS algorithm. In this model, we do not try to optimize the evaluation or play-out strategy, but we rather assume that the oracle $\\mathcal{O}_\\ell$ produces i.i.d. samples from an unknown distribution whose mean $\\mu_\\ell$ is the value of the position $\\ell$. To ease the presentation, we focus on binary oracles (indicating the win or loss of a play-out), in which the oracle $\\mathcal{O}_\\ell$ is a Bernoulli distribution with unknown mean $\\mu_\\ell$ (the probability of player A winning the game in the corresponding state). 
Our algorithms can be used without modification in case the oracle is a distribution bounded in $[0, 1]$.\n\nFor each node $s$ in the tree, we denote by $\\mathcal{C}(s)$ the set of its children and by $\\mathcal{P}(s)$ its parent. The root is denoted by $s_0$. The value (for player A) of any node $s$ is recursively defined by $V_\\ell = \\mu_\\ell$ if $\\ell \\in \\mathcal{L}$ and\n\n$$V_s = \\begin{cases} \\max_{c \\in \\mathcal{C}(s)} V_c & \\text{if } s \\text{ is a MAX node,} \\\\\\\\ \\min_{c \\in \\mathcal{C}(s)} V_c & \\text{if } s \\text{ is a MIN node.} \\end{cases}$$\n\nThe best move is the action at the root with highest value,\n\n$$s^* = \\operatorname{argmax}_{s \\in \\mathcal{C}(s_0)} V_s.$$\n\nTo identify $s^*$ (or an $\\epsilon$-close move), an MCTS algorithm sequentially selects paths in the game tree and calls the corresponding leaf oracle. At round $t$, a leaf $L_t \\in \\mathcal{L}$ is chosen by this adaptive sampling rule, after which a sample $X_t \\sim \\mathcal{O}_{L_t}$ is collected. We consider here the same PAC learning framework as [22, 10], in which the strategy also requires a stopping rule, after which leaves are no longer evaluated, and a recommendation rule that outputs upon stopping a guess $\\hat{s}_\\tau \\in \\mathcal{C}(s_0)$ for the best move of player A.\n\nGiven a risk level $\\delta$ and some accuracy parameter $\\epsilon \\geq 0$ our goal is to have a recommendation $\\hat{s}_\\tau \\in \\mathcal{C}(s_0)$ whose value is within $\\epsilon$ of the value of the best move, with probability larger than $1 - \\delta$, that is\n\n$$\\mathbb{P}\\left(V(s_0) - V(\\hat{s}_\\tau) \\leq \\epsilon\\right) \\geq 1 - \\delta.$$\n\nAn algorithm satisfying this property is called $(\\epsilon, \\delta)$-correct. The main challenge is to design $(\\epsilon, \\delta)$-correct algorithms that use as few leaf evaluations $\\tau$ as possible.\n\nRelated work. The model we introduce for Monte-Carlo Tree Search is very reminiscent of a stochastic bandit model. In those, an agent repeatedly selects one out of several probability distributions, called arms, and draws a sample from the chosen distribution. Bandit models have been studied since the 1930s [23], mostly with a focus on regret minimization, where the agent aims to maximize the sum of the samples collected, which are viewed as rewards [18]. 
In the context of MCTS, a sample corresponds to a win or a loss in one play-out, and maximizing the number of successful play-outs (that correspond to simulated games) may be at odds with identifying quickly the next best action to take at the root. In that sense, our best action identification problem is closer to a so-called Best Arm Identification (BAI) problem.\n\nThe goal in the standard BAI problem is to find quickly and accurately the arm with highest mean. The BAI problem in the fixed-confidence setting [7] is the special case of our simple model for a tree of depth one. For deeper trees, rather than finding the best arm (i.e. leaf), we are interested in finding the best action at the root. As the best root action is a function of the means of all leaves, this is a more structured problem.\n\nBandit algorithms, and more recently BAI algorithms, have been successfully adapted to tree search. Building on the UCB algorithm [2], a regret minimizing algorithm, variants of the UCT algorithm [17] have been used for MCTS in growing trees, leading to successful AIs for games. However, there are only very weak theoretical guarantees for UCT. Moreover, observing that maximizing the number of successful play-outs is not the target, recent work rather tried to leverage tools from the BAI literature. In [19, 6] Sequential Halving [14] is used for exploring game trees. The latter algorithm is a state-of-the-art algorithm for the fixed-budget BAI problem [1], in which the goal is to identify the best arm with the smallest probability of error based on a given budget of draws. The proposed SHOT (Sequential Halving applied tO Trees) algorithm [6] is compared empirically to the UCT approach of [17], showing improvements in some cases. 
A hybrid approach mixing SHOT and UCT is also studied [19], still without sample complexity guarantees.\n\nIn the fixed-confidence setting, [22] develop the first sample complexity guarantees in the model we consider. The proposed algorithm, FindTopWinner, is based on uniform sampling and eliminations, an approach that may be related to the Successive Eliminations algorithm [7] for fixed-confidence BAI in bandit models. FindTopWinner proceeds in rounds, in which the leaves that have not been eliminated are sampled repeatedly until the precision of their estimates has doubled. Then the tree is pruned of every node whose estimated value differs significantly from the estimated value of its parent, which leads to the possible elimination of several leaves. For depth-two trees, [10] propose an elimination procedure that is not round-based. In this simpler setting, an algorithm that exploits confidence intervals is also developed, inspired by the LUCB algorithm for fixed-confidence BAI [13]. Some variants of the proposed M-LUCB algorithm appear to perform better in simulations than elimination based algorithms. We now investigate this trend further in deeper trees, both in theory and in practice.\n\nOur Contribution. In this paper, we propose a generic architecture, called BAI-MCTS, that builds on a Best Arm Identification (BAI) algorithm and on confidence intervals on the node values in order to solve the best action identification problem in a tree of arbitrary depth. In particular, we study two specific instances, UGapE-MCTS and LUCB-MCTS, that rely on confidence-based BAI algorithms [8, 13]. We prove that these are $(\\epsilon, \\delta)$-correct and give a high-probability upper bound on their sample complexity. 
Both our theoretical and empirical results improve over the elimination-based state-of-the-art algorithm, FindTopWinner [22].\n\n2 BAI-MCTS algorithms\n\nWe present a generic class of algorithms, called BAI-MCTS, that combines a BAI algorithm with an exploration of the tree based on confidence intervals on the node values. Before introducing the algorithm and two particular instances, we first explain how to build such confidence intervals, and also introduce the central notion of representative child and representative leaf.\n\n2.1 Confidence intervals and representative nodes\n\nFor each leaf $\\ell \\in \\mathcal{L}$, using the past observations from this leaf we may build a confidence interval\n\n$$\\mathcal{I}_\\ell(t) = [L_\\ell(t), U_\\ell(t)],$$\n\nwhere $U_\\ell(t)$ (resp. $L_\\ell(t)$) is an Upper Confidence Bound (resp. a Lower Confidence Bound) on the value $V(\\ell) = \\mu_\\ell$. The specific confidence interval we shall use will be discussed later.\n\nThese confidence intervals are then propagated upwards in the tree using the following construction. For each internal node $s$, we recursively define $\\mathcal{I}_s(t) = [L_s(t), U_s(t)]$ with\n\n$$L_s(t) = \\begin{cases} \\max_{c \\in \\mathcal{C}(s)} L_c(t) & \\text{for a MAX node } s, \\\\\\\\ \\min_{c \\in \\mathcal{C}(s)} L_c(t) & \\text{for a MIN node } s, \\end{cases} \\qquad U_s(t) = \\begin{cases} \\max_{c \\in \\mathcal{C}(s)} U_c(t) & \\text{for a MAX node } s, \\\\\\\\ \\min_{c \\in \\mathcal{C}(s)} U_c(t) & \\text{for a MIN node } s. \\end{cases}$$\n\nNote that these intervals are the tightest possible on the parent under the sole assumption that the child confidence intervals are all valid. A similar construction was used in the OMS algorithm of [3] in a different context. It is easy to convince oneself (or prove by induction, see Appendix B.1) that the accuracy of the confidence intervals is preserved under this construction, as stated below.\n\nProposition 1. Let $t \\in \\mathbb{N}$. 
One has $\\bigcap_{\\ell \\in \\mathcal{L}} \\left(\\mu_\\ell \\in \\mathcal{I}_\\ell(t)\\right) \\Rightarrow \\bigcap_{s \\in \\mathcal{T}} \\left(V_s \\in \\mathcal{I}_s(t)\\right)$.\n\nWe now define the representative child $c_s(t)$ of an internal node $s$ as\n\n$$c_s(t) = \\begin{cases} \\operatorname{argmax}_{c \\in \\mathcal{C}(s)} U_c(t) & \\text{if } s \\text{ is a MAX node,} \\\\\\\\ \\operatorname{argmin}_{c \\in \\mathcal{C}(s)} L_c(t) & \\text{if } s \\text{ is a MIN node,} \\end{cases}$$\n\nand the representative leaf $\\ell_s(t)$ of a node $s \\in \\mathcal{T}$, which is the leaf obtained when going down the tree by always selecting the representative child:\n\n$$\\ell_s(t) = s \\text{ if } s \\in \\mathcal{L}, \\qquad \\ell_s(t) = \\ell_{c_s(t)}(t) \\text{ otherwise.}$$\n\nThe confidence intervals in the tree represent the statistically plausible values in each node, hence the representative child can be interpreted as an \u201coptimistic move\u201d in a MAX node and a \u201cpessimistic move\u201d in a MIN node (assuming we play against the best possible adversary). This is reminiscent of the behavior of the UCT algorithm [17]. The construction of the confidence intervals and associated representative children are illustrated in Figure 1.\n\nFigure 1: Construction of confidence interval and representative child (in red) for a MAX node. Panels: (a) Children, (b) Parent.\n\nInput: a BAI algorithm\nInitialization: $t = 0$.\nwhile not BAIStop($\\{s \\in \\mathcal{C}(s_0)\\}$) do\n    $R_{t+1}$ = BAIStep($\\{s \\in \\mathcal{C}(s_0)\\}$)\n    Sample the representative leaf $L_{t+1} = \\ell_{R_{t+1}}(t)$\n    Update the information about the arms. $t = t + 1$.\nend\nOutput: BAIReco($\\{s \\in \\mathcal{C}(s_0)\\}$)\n\nFigure 2: The BAI-MCTS architecture\n\n2.2 The BAI-MCTS architecture\n\nIn this section we present the generic BAI-MCTS algorithm, whose sampling rule combines two ingredients: a best arm identification step which selects an action at the root, followed by a confidence based exploration step, that goes down the tree starting from this depth-one node in order to select the representative leaf for evaluation.\n\nThe structure of a BAI-MCTS algorithm is presented in Figure 2. 
The algorithm depends on a Best Arm Identification (BAI) algorithm, and uses the three components of this algorithm:\n\n\u2022 the sampling rule BAIStep($\\mathcal{S}$) selects an arm in the set $\\mathcal{S}$\n\u2022 the stopping rule BAIStop($\\mathcal{S}$) returns True if the algorithm decides to stop\n\u2022 the recommendation rule BAIReco($\\mathcal{S}$) selects an arm as a candidate for the best arm\n\nIn BAI-MCTS, the arms are the depth-one nodes, hence the information needed by the BAI algorithm to make a decision (e.g. BAIStep for choosing an arm, or BAIStop for stopping) is information about depth-one nodes, that has to be updated at the end of each round (last line in the while loop). Different BAI algorithms may require different information, and we now present two instances that rely on confidence intervals (and empirical estimates) for the value of the depth-one nodes.\n\n2.3 UGapE-MCTS and LUCB-MCTS\n\nSeveral Best Arm Identification algorithms may be used within BAI-MCTS, and we now present two variants, that are respectively based on the UGapE [8] and the LUCB [13] algorithms. These two algorithms are very similar in that they exploit confidence intervals and use the same stopping rule, however the LUCB algorithm additionally uses the empirical means of the arms, which within BAI-MCTS requires defining an estimate $\\hat{V}_s(t)$ of the value of the depth-one nodes.\n\nThe generic structure of the two algorithms is similar. At round $t + 1$ two promising depth-one nodes are computed, that we denote by $b_t$ and $c_t$. Among these two candidates, the node whose confidence interval is the largest (that is, the most uncertain node) is selected:\n\n$$R_{t+1} = \\operatorname{argmax}_{i \\in \\{b_t, c_t\\}} \\left[ U_i(t) - L_i(t) \\right].$$\n\nThen, following the BAI-MCTS architecture, the representative leaf of $R_{t+1}$ (computed by going down the tree) is sampled: $L_{t+1} = \\ell_{R_{t+1}}(t)$. 
The algorithm stops whenever the confidence intervals of the two promising arms overlap by less than $\\epsilon$:\n\n$$\\tau = \\inf\\left\\{ t \\in \\mathbb{N} : U_{c_t}(t) - L_{b_t}(t) < \\epsilon \\right\\},$$\n\nand it recommends $\\hat{s}_\\tau = b_\\tau$.\n\nIn both algorithms that we detail below $b_t$ represents a guess for the best depth-one node, while $c_t$ is an \u201coptimistic\u201d challenger, that has the maximal possible value among the other depth-one nodes. Both nodes need to be explored enough in order to discover the best depth-one action quickly.\n\nUGapE-MCTS. In UGapE-MCTS, introducing for each depth-one node the index\n\n$$B_s(t) = \\max_{s' \\in \\mathcal{C}(s_0) \\setminus \\{s\\}} U_{s'}(t) - L_s(t),$$\n\nthe promising depth-one nodes are defined as\n\n$$b_t = \\operatorname{argmin}_{a \\in \\mathcal{C}(s_0)} B_a(t) \\quad \\text{and} \\quad c_t = \\operatorname{argmax}_{b \\in \\mathcal{C}(s_0) \\setminus \\{b_t\\}} U_b(t).$$\n\nLUCB-MCTS. In LUCB-MCTS, the promising depth-one nodes are defined as\n\n$$b_t = \\operatorname{argmax}_{a \\in \\mathcal{C}(s_0)} \\hat{V}_a(t) \\quad \\text{and} \\quad c_t = \\operatorname{argmax}_{b \\in \\mathcal{C}(s_0) \\setminus \\{b_t\\}} U_b(t),$$\n\nwhere $\\hat{V}_s(t) = \\hat{\\mu}_{\\ell_s(t)}(t)$ is the empirical mean of the representative leaf of node $s$. Note that several alternative definitions of $\\hat{V}_s(t)$ may be proposed (such as the middle of the confidence interval $\\mathcal{I}_s(t)$, or $\\max_{a \\in \\mathcal{C}(s)} \\hat{V}_a(t)$), but our choice is crucial for the analysis of LUCB-MCTS, given in Appendix C.\n\n3 Analysis of UGapE-MCTS\n\nIn this section we first prove that UGapE-MCTS and LUCB-MCTS are both $(\\epsilon, \\delta)$-correct. 
Then we give in Theorem 3 a high-probability upper bound on the number of samples used by UGapE-MCTS. A similar upper bound is obtained for LUCB-MCTS in Theorem 9, stated in Appendix C.\n\n3.1 Choosing the Confidence Intervals\n\nFrom now on, we assume that the confidence intervals on the leaves are of the form\n\n$$L_\\ell(t) = \\hat{\\mu}_\\ell(t) - \\sqrt{\\frac{\\beta(N_\\ell(t), \\delta)}{2 N_\\ell(t)}} \\quad \\text{and} \\quad U_\\ell(t) = \\hat{\\mu}_\\ell(t) + \\sqrt{\\frac{\\beta(N_\\ell(t), \\delta)}{2 N_\\ell(t)}}. \\qquad (1)$$\n\n$\\beta(s, \\delta)$ is some exploration function, that can be tuned to have a $\\delta$-PAC algorithm, as expressed in the following lemma, whose proof can be found in Appendix B.2.\n\nLemma 2. If $\\delta \\leq \\max(0.1 |\\mathcal{L}|, 1)$, for the choice\n\n$$\\beta(s, \\delta) = \\ln(|\\mathcal{L}| / \\delta) + 3 \\ln\\ln(|\\mathcal{L}| / \\delta) + (3/2) \\ln(\\ln s + 1) \\qquad (2)$$\n\nboth UGapE-MCTS and LUCB-MCTS satisfy $\\mathbb{P}\\left(V(s^*) - V(\\hat{s}_\\tau) \\leq \\epsilon\\right) \\geq 1 - \\delta$.\n\nAn interesting practical feature of these confidence intervals is that they only depend on the local number of draws $N_\\ell(t)$, whereas most of the BAI algorithms use exploration functions that depend on the number of rounds $t$. 
Hence the only confidence intervals that need to be updated at round $t$ are those of the ancestors of the selected leaf, which can be done recursively.\n\nMoreover, $\\beta(s, \\delta)$ scales with $\\ln(\\ln(s))$, and not $\\ln(s)$, leveraging some tools recently introduced to obtain tighter confidence intervals [12, 15]. The union bound over $\\mathcal{L}$ (that may be an artifact of our current analysis) however makes the exploration function of Lemma 2 still a bit over-conservative and in practice, we recommend the use of $\\beta(s, \\delta) = \\ln(\\ln(es) / \\delta)$.\n\nFinally, similar correctness results (with slightly larger exploration functions) may be obtained for confidence intervals based on the Kullback-Leibler divergence (see [5]), which are known to lead to better performance in standard best arm identification problems [16] and also depth-two tree search problems [10]. However, the sample complexity analysis is much more intricate, hence we stick to the above Hoeffding-based confidence intervals for the next section.\n\n3.2 Complexity term and sample complexity guarantees\n\nWe first introduce some notation. Recall that $s^*$ is the optimal action at the root, identified with the depth-one node satisfying $V(s^*) = V(s_0)$, and define the second-best depth-one node as $s^*_2 = \\operatorname{argmax}_{s \\in \\mathcal{C}(s_0) \\setminus \\{s^*\\}} V_s$. Recall $\\mathcal{P}(s)$ denotes the parent of a node $s$ different from the root. Introducing furthermore the set $\\mathrm{Anc}(s)$ of all the ancestors of a node $s$, we define the complexity term by\n\n$$H^*_\\epsilon(\\mu) := \\sum_{\\ell \\in \\mathcal{L}} \\frac{1}{\\Delta_\\ell^2 \\vee \\Delta_*^2 \\vee \\epsilon^2}, \\quad \\text{where} \\quad \\Delta_* := V(s^*) - V(s^*_2), \\quad \\Delta_\\ell := \\max_{s \\in \\mathrm{Anc}(\\ell) \\setminus \\{s_0\\}} \\left| V_s - V(\\mathcal{P}(s)) \\right|. \\qquad (3)$$\n\nThe intuition behind these squared terms in the denominator is the following. We will sample a leaf $\\ell$ until we either prune it (by determining that it or one of its ancestors is a bad move), prune everyone else (this happens for leaves below the optimal arm) or reach the required precision $\\epsilon$.\n\nTheorem 3. 
Let $\\delta \\leq \\min(1, 0.1 |\\mathcal{L}|)$. UGapE-MCTS using the exploration function (2) is such that, with probability larger than $1 - \\delta$, $\\left(V(s^*) - V(\\hat{s}_\\tau) < \\epsilon\\right)$ and, letting $\\Delta_{\\ell,\\epsilon} = \\Delta_\\ell \\vee \\Delta_* \\vee \\epsilon$,\n\n$$\\tau \\leq 8 H^*_\\epsilon(\\mu) \\ln\\frac{|\\mathcal{L}|}{\\delta} + \\sum_{\\ell} \\frac{16}{\\Delta_{\\ell,\\epsilon}^2} \\ln\\ln\\frac{1}{\\Delta_{\\ell,\\epsilon}^2} + 8 H^*_\\epsilon(\\mu) \\left[ 3 \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} + 2 \\ln\\ln\\left( 8e \\ln\\frac{|\\mathcal{L}|}{\\delta} + 24e \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} \\right) \\right] + 1.$$\n\nRemark 4. If $\\beta(N_a(t), \\delta)$ is changed to $\\beta(t, \\delta)$, one can still prove $(\\epsilon, \\delta)$-correctness and furthermore upper bound the expectation of $\\tau$. However the algorithm becomes less efficient to implement, since after each leaf observation, ALL the confidence intervals have to be updated. In practice, this change lowers the probability of error but does not significantly affect the number of play-outs used.\n\n3.3 Comparison with previous work\n\nTo the best of our knowledge\u00b9, the FindTopWinner algorithm [22] is the only algorithm from the literature designed to solve the best action identification problem in any-depth trees. The number of play-outs of this algorithm is upper bounded with high probability by\n\n$$\\sum_{\\ell : \\Delta_\\ell > 2\\epsilon} \\left( \\frac{32}{\\Delta_\\ell^2} \\ln\\frac{16 |\\mathcal{L}|}{\\Delta_\\ell \\delta} + 1 \\right) + \\sum_{\\ell : \\Delta_\\ell \\leq 2\\epsilon} \\left( \\frac{8}{\\epsilon^2} \\ln\\frac{8 |\\mathcal{L}|}{\\epsilon \\delta} + 1 \\right).$$\n\nOne can first note the improvement in the constant in front of the leading term in $\\ln(1/\\delta)$, as well as the presence of the $\\ln\\ln(1/\\Delta_{\\ell,\\epsilon}^2)$ second order, that is unavoidable in a regime in which the gaps are small [12]. The most interesting improvement is in the control of the number of draws of $2\\epsilon$-optimal leaves (such that $\\Delta_\\ell \\leq 2\\epsilon$). 
In UGapE-MCTS, the number of draws of such leaves is at most of order $(\\epsilon \\vee \\Delta_*^2)^{-1} \\ln(1/\\delta)$, which may be significantly smaller than $\\epsilon^{-1} \\ln(1/\\delta)$ if there is a gap in the best and second best value. Moreover, unlike FindTopWinner and M-LUCB [10] in the depth two case, UGapE-MCTS can also be used when $\\epsilon = 0$, with provable guarantees.\n\nRegarding the algorithms themselves, one can note that M-LUCB, an extension of LUCB suited for depth-two trees, does not belong to the class of BAI-MCTS algorithms. Indeed, it has a \u201creversed\u201d structure, first computing the representative leaf for each depth-one node: $\\forall s \\in \\mathcal{C}(s_0),\\ R_{s,t} = \\ell_s(t)$ and then performing a BAI step over the representative leaves: $\\tilde{L}_{t+1} = \\mathrm{BAIStep}(R_{s,t}, s \\in \\mathcal{C}(s_0))$. This alternative architecture can also be generalized to deeper trees, and was found to have empirical performance similar to BAI-MCTS. M-LUCB, which will be used as a benchmark in Section 4, also distinguishes itself from LUCB-MCTS by the fact that it uses an exploration rate that depends on the global time $\\beta(t, \\delta)$ and that $b_t$ is the empirical maximin arm (which can be different from the arm maximizing $\\hat{V}_s$). This alternative choice is not yet supported by theoretical guarantees in deeper trees.\n\nFinally, the exploration step of the BAI-MCTS algorithm bears some similarity with the UCT algorithm [17], as it goes down the tree choosing alternatively the move that yields the highest UCB or the lowest LCB. However, the behavior of BAI-MCTS is very different at the root, where the first move is selected using a BAI algorithm. Another key difference is that BAI-MCTS relies on exact confidence intervals: each interval $\\mathcal{I}_s(t)$ is shown to contain with high probability the corresponding value $V_s$, whereas UCT uses more heuristic confidence intervals, based on the number of visits of the parent node, and aggregating all the samples from descendant nodes. Using UCT in our setting is not obvious as it would require defining a suitable stopping rule, hence we don\u2019t include a comparison with this algorithm in Section 4. A hybrid comparison between UCT and FindTopWinner is proposed in [22], providing UCT with the random number of samples used by the fixed-confidence algorithm. It is shown that FindTopWinner has the advantage for hard trees that require many samples. Our experiments show that our algorithms in turn always dominate FindTopWinner.\n\n\u00b9 In a recent paper, [11] independently proposed the LUCBMinMax algorithm, that differs from UGapE-MCTS and LUCB-MCTS only by the way the best guess $b_t$ is picked. The analysis is very similar to ours, but features some refined complexity measure, in which $\\Delta_\\ell$ (that is the maximal distance between consecutive ancestors of the leaf, see (3)) is replaced by the maximal distance between any ancestors of that leaf. Similar results could be obtained for our two algorithms following the same lines.\n\n3.4 Proof of Theorem 3.\n\nLetting $\\mathcal{E}_t = \\bigcap_{\\ell \\in \\mathcal{L}} \\left(\\mu_\\ell \\in \\mathcal{I}_\\ell(t)\\right)$ and $\\mathcal{E} = \\bigcap_{t \\in \\mathbb{N}} \\mathcal{E}_t$, we upper bound $\\tau$ assuming the event $\\mathcal{E}$ holds, using the following key result, which is proved in Appendix D.\n\nLemma 5. 
Let $t \\in \\mathbb{N}$. $\\mathcal{E}_t \\cap (\\tau > t) \\cap (L_{t+1} = \\ell) \\Rightarrow N_\\ell(t) \\leq \\frac{8 \\beta(N_\\ell(t), \\delta)}{\\Delta_\\ell^2 \\vee \\Delta_*^2 \\vee \\epsilon^2}$.\n\nAn intuition behind this result is the following. First, using that the selected leaf $\\ell$ is a representative leaf, it can be seen that the confidence intervals from $s_D = \\ell$ to $s_0$ are nested (Lemma 11). Hence if $\\mathcal{E}_t$ holds, $V(s_k) \\in \\mathcal{I}_\\ell(t)$ for all $k = 1, \\dots, D$, which allows us to lower bound the width of this interval (and thus upper bound $N_\\ell(t)$) as a function of the $V(s_k)$ (Lemma 12). Then Lemma 13 exploits the mechanism of UGapE to further relate this width to $\\Delta_*$ and $\\epsilon$.\n\nAnother useful tool is the following lemma, that allows us to leverage the particular form of the exploration function $\\beta$ to obtain an explicit upper bound on $N_\\ell(\\tau)$.\n\nLemma 6. Let $\\beta(s) = C + \\frac{3}{2} \\ln(1 + \\ln(s))$ and define $S = \\sup\\{s \\geq 1 : a \\beta(s) \\geq s\\}$. Then $S \\leq aC + 2a \\ln(1 + \\ln(aC))$.\n\nThis result is a consequence of Theorem 16 stated in Appendix F, that uses the fact that for $C \\geq -\\ln(0.1)$ and $a \\geq 8$, it holds that\n\n$$\\frac{3}{2} \\cdot \\frac{C (1 + \\ln(aC))}{C (1 + \\ln(aC)) - \\frac{3}{2}} \\leq 1.7995564 \\leq 2.$$\n\nOn the event $\\mathcal{E}$, letting $\\tau_\\ell$ be the last instant before $\\tau$ at which the leaf $\\ell$ has been played before stopping, one has $N_\\ell(\\tau - 1) = N_\\ell(\\tau_\\ell)$ that satisfies by Lemma 5\n\n$$N_\\ell(\\tau_\\ell) \\leq \\frac{8 \\beta(N_\\ell(\\tau_\\ell), \\delta)}{\\Delta_\\ell^2 \\vee \\Delta_*^2 \\vee \\epsilon^2}.$$\n\nApplying Lemma 6 with $a = a_\\ell = \\frac{8}{\\Delta_\\ell^2 \\vee \\Delta_*^2 \\vee \\epsilon^2}$ and $C = \\ln\\frac{|\\mathcal{L}|}{\\delta} + 3 \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta}$ leads to\n\n$$N_\\ell(\\tau - 1) \\leq a_\\ell \\left( C + 2 \\ln(1 + \\ln(a_\\ell C)) \\right).$$\n\nLetting $\\Delta_{\\ell,\\epsilon} = \\Delta_\\ell \\vee \\Delta_* \\vee \\epsilon$ and summing over arms, we find\n\n$$\\begin{aligned} \\tau &= 1 + \\sum_{\\ell} N_\\ell(\\tau - 1) \\\\\\\\ &\\leq 1 + \\sum_{\\ell} \\frac{8}{\\Delta_{\\ell,\\epsilon}^2} \\left[ \\ln\\frac{|\\mathcal{L}|}{\\delta} + 3 \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} + 2 \\ln\\ln\\left( \\frac{8e}{\\Delta_{\\ell,\\epsilon}^2} \\left( \\ln\\frac{|\\mathcal{L}|}{\\delta} + 3 \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} \\right) \\right) \\right] \\\\\\\\ &\\leq 1 + \\sum_{\\ell} \\frac{8}{\\Delta_{\\ell,\\epsilon}^2} \\left[ \\ln\\frac{|\\mathcal{L}|}{\\delta} + 2 \\ln\\ln\\frac{1}{\\Delta_{\\ell,\\epsilon}^2} \\right] + 8 H^*_\\epsilon(\\mu) \\left[ 3 \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} + 2 \\ln\\ln\\left( 8e \\ln\\frac{|\\mathcal{L}|}{\\delta} + 24e \\ln\\ln\\frac{|\\mathcal{L}|}{\\delta} \\right) \\right]. \\end{aligned}$$\n\nTo conclude the proof, we remark that from the proof of Lemma 2 (see Appendix B.2) it follows that on $\\mathcal{E}$, $V(s^*) - V(\\hat{s}_\\tau) < \\epsilon$ and that $\\mathcal{E}$ holds with probability larger than $1 - \\delta$.\n\n
4 Experimental Validation\n\nIn this section we evaluate the performance of our algorithms in three experiments. We evaluate on the depth-two benchmark tree from [10], a new depth-three tree and the random tree ensemble from [22]. We compare to the FindTopWinner algorithm from [22] in all experiments, and in the depth-two experiment we include the M-LUCB algorithm from [10]. Its relation to BAI-MCTS is discussed in Section 3.3. For our BAI-MCTS algorithms and for M-LUCB we use the exploration rate $\\beta(s, \\delta) = \\ln\\frac{|\\mathcal{L}|}{\\delta} + \\ln(\\ln(s) + 1)$ (a stylized version of Lemma 2 that works well in practice), and we use the KL refinement of the confidence intervals (1). To replicate the experiment from [22], we supply all algorithms with $\\delta = 0.1$ and $\\epsilon = 0.01$. For comparing with [10] we run all algorithms with $\\epsilon = 0$ and $\\delta = 0.1 |\\mathcal{L}|$ (undoing the conservative union bound over leaves. This excessive choice, which might even exceed one, does not cause a problem, as the algorithms depend on $\\delta / |\\mathcal{L}| = 0.1$). In none of our experiments does the observed error rate exceed 0.1.\n\nFigure 3 shows the benchmark tree from [10, Section 5] and the performance of four algorithms on it. We see that the special-purpose depth-two M-LUCB performs best, very closely followed by both our new arbitrary-depth LUCB-MCTS and UGapE-MCTS methods. All three use significantly fewer samples than FindTopWinner. 
Figure 4 (displayed in Appendix A for the sake of readability) shows a full 3-way tree of depth 3 with leaves drawn uniformly from $[0, 1]$. Again our algorithms outperform the previous state of the art by an order of magnitude. Finally, we replicate the experiment from [22, Section 4]. To make the comparison as fair as possible, we use the proven exploration rate from (2). On 10K full 10-ary trees of depth 3 with Bernoulli leaf parameters drawn uniformly at random from $[0, 1]$ the average numbers of samples are: LUCB-MCTS 141811, UGapE-MCTS 142953 and FindTopWinner 2254560. To closely follow the original experiment, we do apply the union bound over leaves to all algorithms, which are run with $\\epsilon = 0.01$ and $\\delta = 0.1$. We did not observe any error from any algorithm (even though we allow 10%). Our BAI-MCTS algorithms deliver an impressive 15-fold reduction in samples.\n\nFigure 3: The $3 \\times 3$ tree of depth 2 that is the benchmark in [10]. Shown below the leaves are the average numbers of pulls for 4 algorithms: LUCB-MCTS (0.89% errors, 2460 samples), UGapE-MCTS (0.94%, 2419), FindTopWinner (0%, 17097) and M-LUCB (0.14%, 2399). All counts are averages over 10K repetitions with $\\epsilon = 0$ and $\\delta = 0.1 \\cdot 9$.\n\n5 Lower bounds and discussion\n\nGiven a tree $\\mathcal{T}$, a MCTS model is parameterized by the leaf values, $\\mu := (\\mu_\\ell)_{\\ell \\in \\mathcal{L}}$, which determine the best root action: $s^* = s^*(\\mu)$. For $\\mu \\in [0, 1]^{|\\mathcal{L}|}$, we define $\\mathrm{Alt}(\\mu) = \\{\\lambda \\in [0, 1]^{|\\mathcal{L}|} : s^*(\\lambda) \\neq s^*(\\mu)\\}$.\n\nUsing the same technique as [9] for the classic best arm identification problem, one can establish the following (non explicit) lower bound. 
The proof is given in Appendix E.\n\n[Figure 3 tree values: root 0.45; depth-one nodes 0.45, 0.35, 0.30; leaves (0.45, 0.50, 0.55), (0.35, 0.40, 0.60), (0.30, 0.47, 0.52).]\n\nTheorem 7. Assume $\\epsilon = 0$. Any $\\delta$-correct algorithm satisfies\n\n$$\\mathbb{E}_\\mu[\\tau] \\geq T^*(\\mu)\\, d(\\delta, 1 - \\delta), \\quad \\text{where} \\quad T^*(\\mu)^{-1} := \\sup_{w \\in \\Sigma_{|\\mathcal{L}|}} \\inf_{\\lambda \\in \\mathrm{Alt}(\\mu)} \\sum_{\\ell \\in \\mathcal{L}} w_\\ell\\, d(\\mu_\\ell, \\lambda_\\ell) \\qquad (4)$$\n\nwith $\\Sigma_k = \\{w \\in [0, 1]^k : \\sum_{i=1}^k w_i = 1\\}$, and $d(x, y) = x \\ln(x/y) + (1 - x) \\ln((1 - x)/(1 - y))$ is the binary Kullback-Leibler divergence.\n\nThis result is however not directly amenable for comparison with our upper bounds, as the optimization problem defined in Theorem 7 is not easy to solve. Note that $d(\\delta, 1 - \\delta) \\geq \\ln(1/(2.4\\delta))$ [15], thus our upper bounds have the right dependency in $\\delta$. For depth-two trees with $K$ (resp. $M$) actions for player A (resp. B), we can moreover prove the following result, that suggests an intriguing behavior.\n\nLemma 8. Assume $\\epsilon = 0$ and consider a tree of depth two with $\\mu = (\\mu_{i,j})_{1 \\leq i \\leq K, 1 \\leq j \\leq M}$ such that $\\forall (i, j)$, $\\mu_{1,1} > \\mu_{i,1}$, $\\mu_{i,1} < \\mu_{i,j}$. The supremum in the definition of $T^*(\\mu)^{-1}$ can be restricted to\n\n$$\\tilde{\\Sigma}_{K,M} := \\{w \\in \\Sigma_{K \\times M} : w_{i,j} = 0 \\text{ if } i \\geq 2 \\text{ and } j \\geq 2\\}$$\n\nand\n\n$$T^*(\\mu)^{-1} = \\max_{w \\in \\tilde{\\Sigma}_{K,M}} \\min_{i = 2, \\dots, K,\\; a = 1, \\dots, M} \\left[ w_{1,a}\\, d\\left(\\mu_{1,a}, \\frac{w_{1,a}\\mu_{1,a} + w_{i,1}\\mu_{i,1}}{w_{1,a} + w_{i,1}}\\right) + w_{i,1}\\, d\\left(\\mu_{i,1}, \\frac{w_{1,a}\\mu_{1,a} + w_{i,1}\\mu_{i,1}}{w_{1,a} + w_{i,1}}\\right) \\right].$$\n\nIt can be extracted from the proof of Theorem 7 (see Appendix E) that the vector $w^*(\\mu)$ that attains the supremum in (4) represents the average proportions of selections of leaves by any algorithm matching the lower bound. Hence, the sparsity pattern of Lemma 8 suggests that matching algorithms should draw many of the leaves much less than $O(\\ln(1/\\delta))$ times. This hints at the exciting prospect of optimal stochastic pruning, at least in the asymptotic regime $\\delta \\to 0$.\n\nAs an example, we numerically solve the lower bound optimization problem (which is a concave maximization problem) for $\\mu$ corresponding to the benchmark tree displayed in Figure 3 to obtain\n\n$$T^*(\\mu) = 259.9 \\quad \\text{and} \\quad w^* = (0.3633, 0.1057, 0.0532), (0.3738, 0, 0), (0.1040, 0, 0).$$\n\nWith $\\delta = 0.1$ we find $d(\\delta, 1 - \\delta) = 0.8 \\ln 9 \\approx 1.76$ and the lower bound is $\\mathbb{E}_\\mu[\\tau] \\geq 456.9$. We see that there is a potential improvement of at least a factor 4.\n\nFuture directions. An (asymptotically) optimal algorithm for BAI called Track-and-Stop was developed by [9]. It maintains the empirical proportions of draws close to $w^*(\\hat{\\mu})$, adding forced exploration to ensure $\\hat{\\mu} \\to \\mu$. We believe that developing this line of ideas for MCTS would result in a major advance in the quality of tree search algorithms. The main challenge is developing efficient solvers for the general optimization problem (4). For now, even the sparsity pattern revealed by Lemma 8 for depth two does not give rise to efficient solvers. We also do not know how this sparsity pattern evolves for deeper trees, let alone how to compute $w^*(\\mu)$.\n\nAcknowledgments. Emilie Kaufmann acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-16-CE40-0002 (project BADASS). Wouter Koolen acknowledges support from the Netherlands Organization for Scientific Research (NWO) under Veni grant 639.021.439.\n\nReferences\n\n[1] J-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits.
In Proceedings of the 23rd Conference on Learning Theory, 2010.\n\n[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235\u2013256, 2002.\n\n[3] L. Bu\u015foniu, R. Munos, and E. P\u00e1ll. An analysis of optimistic, best-first search for minimax sequential decision making. In ADPRL14, 2014.\n\n[4] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1\u201349, 2012.\n\n[5] O. Capp\u00e9, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516\u20131541, 2013.\n\n[6] T. Cazenave. Sequential halving applied to trees. IEEE Transactions on Computational Intelligence and AI in Games, 7(1):102\u2013105, 2015.\n\n[7] E. Even-Dar, S. Mannor, and Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079\u20131105, 2006.\n\n[8] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Advances in Neural Information Processing Systems, 2012.\n\n[9] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory (COLT), 2016.\n\n[10] A. Garivier, E. Kaufmann, and W.M. Koolen. Maximin action identification: A new bandit framework for games. In Proceedings of the 29th Conference On Learning Theory, 2016.\n\n[11] Ruitong Huang, Mohammad M. Ajallooeian, Csaba Szepesv\u00e1ri, and Martin M\u00fcller. Structured best arm identification with fixed confidence.
In 28th International Conference on Algorithmic Learning Theory (ALT), 2017.\n\n[12] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil\u2019UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory, 2014.\n\n[13] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.\n\n[14] Z. Karnin, T. Koren, and O. Somekh. Almost optimal Exploration in multi-armed bandits. In International Conference on Machine Learning (ICML), 2013.\n\n[15] E. Kaufmann, O. Capp\u00e9, and A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1\u201342, 2016.\n\n[16] E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In Proceedings of the 26th Conference On Learning Theory, 2013.\n\n[17] L. Kocsis and C. Szepesv\u00e1ri. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML\u201906, pages 282\u2013293, Berlin, Heidelberg, 2006. Springer-Verlag.\n\n[18] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\n[19] T. Pepels, T. Cazenave, M. Winands, and M. Lanctot. Minimizing simple and cumulative regret in Monte-Carlo tree search. In Computer Games Workshop, ECAI, 2014.\n\n[20] Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie de Bruin. Best-first fixed-depth minimax algorithms. Artificial Intelligence, 87(1):255\u2013293, 1996.\n\n[21] David Silver, Aja Huang, Chris J.
Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484\u2013489, 2016.\n\n[22] K. Teraoka, K. Hatano, and E. Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, pages 392\u2013398, 2014.\n\n[23] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285\u2013294, 1933.", "award": [], "sourceid": 2523, "authors": [{"given_name": "Emilie", "family_name": "Kaufmann", "institution": "CNRS & CRIStAL (SequeL)"}, {"given_name": "Wouter", "family_name": "Koolen", "institution": "Centrum Wiskunde & Informatica, Amsterdam"}]}