{"title": "Modelling Uncertainty in the Game of Go", "book": "Advances in Neural Information Processing Systems", "page_first": 1353, "page_last": 1360, "abstract": null, "full_text": "Modelling Uncertainty in the Game of Go\n\nDavid H. Stern (Department of Physics, Cambridge University, dhs26@cam.ac.uk)\nThore Graepel (Microsoft Research, Cambridge, U.K., thoreg@microsoft.com)\nDavid J. C. MacKay (Department of Physics, Cambridge University, mackay@mrao.cam.ac.uk)\n\nAbstract\n\nGo is an ancient oriental game whose complexity has defeated attempts to automate it. We suggest using probability in a Bayesian sense to model the uncertainty arising from the vast complexity of the game tree. We present a simple conditional Markov random field model for predicting the pointwise territory outcome of a game. The topology of the model reflects the spatial structure of the Go board. We describe a version of the Swendsen-Wang process for sampling from the model during learning and apply loopy belief propagation for rapid inference and prediction. The model is trained on several hundred records of professional games. Our experimental results indicate that the model successfully learns to predict territory despite its simplicity.\n\n1 Introduction\n\nThe game of Go originated in China over 4000 years ago. Its rules are simple (see www.gobase.org for an introduction). Two players, Black and White, take turns to place stones on the intersections of an N × N grid (usually N = 19, but smaller boards are in use as well). All the stones of each player are identical. Players place their stones in order to create territory by occupying or surrounding areas of the board. The player with the most territory at the end of the game is the winner. A stone is captured if it has been completely surrounded (in the horizontal and vertical directions) by stones of the opponent's colour. 
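This capture condition can be checked mechanically by flood-filling a chain of stones and counting its liberties (adjacent empty points); a minimal Python sketch, with a board encoding and function names of our own devising rather than anything from the paper:

```python
# Sketch: find the chain containing a stone and count its liberties.
# A board is a dict mapping (row, col) -> 'B', 'W' or '.' on an n x n grid.
# This encoding is illustrative only, not taken from the paper.

def neighbours(p, n):
    r, c = p
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < n and 0 <= c + dc < n]

def chain_and_liberties(board, p, n):
    """Flood fill from stone p; return (chain, liberties)."""
    colour = board[p]
    chain, libs, stack = {p}, set(), [p]
    while stack:
        q = stack.pop()
        for r in neighbours(q, n):
            if board[r] == '.':
                libs.add(r)                      # empty neighbour = liberty
            elif board[r] == colour and r not in chain:
                chain.add(r)
                stack.append(r)
    return chain, libs
```

A chain whose liberty set is empty is captured, and all of its stones are removed together.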
Stones in a contiguous `chain' have the common fate property: they are captured all together or not at all [1].\n\nThe game that emerges from these simple rules has a complexity that defeats attempts to apply minimax search. The best Go programs play only at the level of weak amateur Go players, and Go is therefore considered to be a serious AI challenge not unlike Chess in the 1960s. There are two main reasons for this state of affairs: firstly, the high branching factor of Go (typically 200 to 300 potential moves per position) prevents the expansion of a game tree to any useful depth. Secondly, it is difficult to produce an evaluation function for Go positions. A Go stone has no intrinsic value; its value is determined by its relationships with other stones. Go players evaluate positions using visual pattern recognition and qualitative intuitions which are difficult to formalise.\n\nMost Go programs rely on a large number of hand-tailored rules and expert knowledge [2]. Some machine learning techniques have been applied to Go with limited success. Schraudolph, Dayan and Sejnowski [3] trained a multi-layer perceptron to evaluate board positions by temporal difference learning. Enzenberger [4] improved on this by structuring the topologies of his neural networks according to the relationships between stones on the board. Graepel et al. [1] made use of the common fate property of chains to construct an efficient graph-based representation of the board. They trained a Support Vector Machine to use this representation to solve Go problems.\n\nOur starting point is the uncertainty about the future course of the game that arises from the vast complexity of the game tree. We propose to explicitly model this uncertainty using probability in a Bayesian sense. The Japanese have a word, aji, much used by Go players. Taken literally it means `taste'. 
Taste lingers, and likewise the influence of a Go stone lingers (even if it appears weak or dead) because of the uncertainty of the effect it may have in the future. We use a probabilistic model that takes the current board position and predicts, for every intersection of the board, whether it will be Black or White territory. Given such a model the score of the game can be predicted and hence an evaluation function produced. The model is a conditional Markov random field [5] which incorporates the spatial structure of the Go board.\n\n2 Models for Predicting Territory\n\nConsider the Go board as an undirected graph G = (N, E) with N = Nx × Ny nodes n ∈ N representing vertices on the board and edges e ∈ E connecting vertically and horizontally neighbouring points. We denote a position as the vector c ∈ {Black, White, Empty}^N with c_n = c(n), and similarly the final territory outcome of the game as s ∈ {+1, -1}^N with s_n = s(n). For convenience we score from the point of view of Black, so elements of s representing Black territory are valued +1 and elements representing White territory are valued -1. Go players will note that we are adopting the Chinese method of scoring empty as well as occupied intersections. The distribution we wish to model is P(s|c), that is, the distribution over final territory outcomes given the current position. Such a model would be useful for several reasons.\n\n- Most importantly, the detailed outcomes provide us with a simple evaluation function for Go positions via the expected score, u(c) := Σ_s Σ_i s_i P(s|c). An alternative (and probably better) evaluation function is given by the probability of winning, which takes the form P(Black wins) = P(Σ_i s_i > komi), where komi refers to the winning threshold for Black.\n- Connectivity of stones is vital because stones can draw strength from other stones. Connectivity could be measured by the correlation between nodes under the distribution P(s|c). 
This would allow us to segment the board into `groups' of stones to reduce complexity.\n- It would also be useful to observe cases where we have an anti-correlation between nodes in the territory prediction. Japanese refer to such cases as miai, in which only one of two desired results can be achieved at the expense of the other - a consequence of moving in turns.\n- The fate of a group of Go stones could be estimated from the distribution P(s|c) by marginalising out the nodes not involved.\n\nThe way stones exert long range influence can be considered recursive. A stone influences its neighbours, who influence their neighbours and so on. A simple model which exploits this idea is to consider the Go board itself as an undirected graphical model in the form of a Conditional Random Field (CRF) [5]. We factorize the distribution as\n\nP(s|c) = (1/Z(c, θ)) Π_{f ∈ F} ψ_f(s_f, c_f, θ_f) = (1/Z(c, θ)) exp( Σ_{f ∈ F} log ψ_f(s_f, c_f, θ_f) ).   (1)\n\nThe simplest form of this model has one factor for each pair of neighbouring nodes i, j, so ψ_f(s_f, c_f, θ_f) = ψ_f(s_i, s_j, c_i, c_j, θ_f).\n\nBoltzmann5 For our first model we decompose the factors into `coupling' terms and `external field' terms as follows:\n\nP(s|c) = (1/Z(c, θ)) exp( Σ_{(i,j) ∈ F} { w(c_i, c_j) s_i s_j + h(c_i) s_i + h(c_j) s_j } ).   (2)\n\nThis gives a Boltzmann machine whose connections have the grid topology of the board. The couplings between territory-outcome nodes depend on the current board position local to those nodes and the external field at each node is determined by the state of the board at that location. We assume that Go positions with their associated territory positions are symmetric with respect to colour reversal, so ψ_f(s_i, s_j, c_i, c_j, θ_f) = ψ_f(-s_i, -s_j, -c_i, -c_j, θ_f). Pairwise connections are also invariant to direction reversal, so ψ_f(s_i, s_j, c_i, c_j, θ_f) = ψ_f(s_j, s_i, c_j, c_i, θ_f). 
It follows that the model described in (2) can be specified by just five parameters:\n\nw_chains = w(Black, Black) = w(White, White),\nw_inter-chain = w(Black, White) = w(White, Black),\nw_chain-empty = w(Empty, White) = w(Empty, Black),\nw_empty = w(Empty, Empty),\nh_stones = h(Black) = -h(White),\n\nand h(Empty) is set to zero by symmetry. We will refer to this model as Boltzmann5. This simple model is interesting because all these parameters are readily interpreted. For example, we would expect w_chains to take on a large positive value since chains have common fate.\n\nBoltzmannLiberties A feature that has particular utility for evaluating Go positions is the number of liberties associated with a chain of stones. A liberty of a chain is an empty vertex adjacent to it. The number of liberties indicates a chain's safety because the opponent would have to occupy all the liberties to capture the chain. Our second model takes this information into account:\n\nP(s|c) = (1/Z(c, θ)) exp( Σ_{(i,j) ∈ F} w(c_i, c_j, s_i, s_j, l_i, l_j) ),   (3)\n\nwhere l_i is element i of a vector l ∈ {1, 2, 3, 4 or more}^N giving the liberty count of each vertex on the Go board. A group with four or more liberties is considered relatively safe. Again we can apply symmetry arguments and end up with 78 parameters. We will refer to this model as BoltzmannLiberties.\n\nWe trained the two models using board positions from a database of 22,000 games between expert Go players (the GoGoD database, April 2003, URL: http://www.gogod.demon.co.uk). The territory outcomes of a subset of these games\n\n[Figure 1 appears here: panels (a) Gibbs Sampling and (b) Swendsen-Wang]\n\nFigure 1: Comparing ordinary Gibbs with Swendsen-Wang sampling for Boltzmann5. 
Shown are the differences between the running averages and the exact marginals for each of the 361 nodes, plotted as a function of the number of whole-board samples.\n\nwere determined using the Go program GnuGo to analyse their final positions. Each training example comprised a board position c with its associated territory outcome s. Training was performed by maximising the likelihood ln P(s|c), summed over training examples, by gradient ascent. In order to calculate the likelihood gradient it is necessary to perform inference to obtain the marginal expectations of the potentials.\n\n3 Inference Methods\n\nIt is possible to perform exact inference on the model by variable elimination [6]. Eliminating nodes one diagonal at a time gave an efficient computation. The cost of exact inference was still too high for general use but it was used to compare other inference methods.\n\nSampling The standard method for sampling from a Boltzmann machine is Gibbs sampling, where each node is updated one at a time, conditional on the others. However, Gibbs sampling mixes slowly for spin systems with strong correlations. A generalisation of the Swendsen-Wang process [7] alleviates this problem. The original Swendsen-Wang algorithm samples from a ferromagnetic Ising model with no external field by adding an additional set of `bond' nodes d, one attached to each factor (edge) in the original graph. Each of these nodes can either be in the state `bond' or `no bond'. The new factor potentials ψ_f(s_f, c_f, d_f, θ_f) are chosen such that if a bond exists between a pair of spins then they are forced to be in the same state. Conditional on the bonds, each cluster has an equal probability of having all its spins in the `up' state or all in the `down' state. The algorithm samples from P(s|d, c, θ) and P(d|s, c, θ) in turn (flipping clusters and forming bonds respectively). It can be generalised to models with arbitrary couplings and biases [7, 8]. 
The new factor potentials ψ_f(s_f, c_f, d_f, θ_f) have the following effect: if the coupling is positive then, when the d node is in the `bond' state, it forces the two spins to be in the same state; if the coupling is negative the `bond' state forces the two spins to be opposite. The probability of each cluster being in each state depends on the sum of the biases involved. Figure 1 shows that the mixing rate of the sampling process is improved by using Swendsen-Wang, allowing us to find accurate marginals for a single position in a couple of seconds.\n\n(GnuGo: URL http://www.gnu.org/software/gnugo/gnugo.html)\n\nLoopy Belief Propagation In order to perform very rapid (approximate) inference we used the loopy belief propagation (BP) algorithm [9]; the results are examined in Section 4. This algorithm is similar to an influence function [10], as often used by Go programmers to segment the board into Black and White territory, and for this reason is laid out below.\n\nFor each board vertex j ∈ N, create a data structure called a node containing:\n\n1. A(j), the set of nodes corresponding to the neighbours of vertex j,\n2. a set of new messages m_ij^new(s_j) ∈ M^new, one for each i ∈ A(j),\n3. a set of old messages m_ij^old(s_j) ∈ M^old, one for each i ∈ A(j),\n4. a belief b_j(s_j).\n\nrepeat\n  for all j ∈ N do\n    for all i ∈ A(j) do\n      for all s_j ∈ {Black, White} do\n        let variable SUM := 0\n        for all s_i ∈ {Black, White} do\n          SUM := SUM + ψ_(i,j)(s_i, s_j) Π_{q ∈ A(i)\\j} m_qi^old(s_i)\n        end for\n        m_ij^new(s_j) := SUM\n      end for\n    end for\n  end for\n  for all messages m_xy^new(s_y) ∈ M^new do\n    m_xy^new(s_y) := λ m_xy^old(s_y) + (1 - λ) m_xy^new(s_y)\n  end for\n  M^old := M^new\nuntil completed I iterations (typically I = 50)\n\nBelief Update:\nfor all j ∈ N do\n  for all s_j ∈ {Black, White} do\n    b_j(s_j) := Π_{q ∈ A(j)} m_qj^new(s_j)\n  end for\nend for\n\nHere, λ (typically 0.5) damps any oscillations. 
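Assuming a small n × n grid, territory states {+1, -1}, and a user-supplied pairwise potential psi(i, j, si, sj) (all illustrative choices of ours, not the paper's implementation), the damped message-passing loop above can be sketched in Python:

```python
import itertools

def loopy_bp(n, psi, iters=50, lam=0.5):
    """Damped loopy BP on an n x n grid with pairwise potentials.

    psi(i, j, si, sj) returns the factor potential for edge (i, j);
    states are +1 (Black) and -1 (White). Illustrative sketch only.
    """
    nodes = list(itertools.product(range(n), range(n)))
    def nbrs(p):
        r, c = p
        return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < n and 0 <= c + dc < n]
    states = (+1, -1)
    # old[(i, j)][s] = message from i to j, evaluated at s_j = s
    old = {(i, j): {s: 1.0 for s in states} for j in nodes for i in nbrs(j)}
    for _ in range(iters):
        new = {}
        for j in nodes:
            for i in nbrs(j):
                msg = {}
                for sj in states:
                    total = 0.0
                    for si in states:
                        prod = 1.0
                        for q in nbrs(i):
                            if q != j:
                                prod *= old[(q, i)][si]
                        total += psi(i, j, si, sj) * prod
                    msg[sj] = total
                # damp against the old message, then normalise
                msg = {s: lam * old[(i, j)][s] + (1 - lam) * msg[s] for s in states}
                z = sum(msg.values())
                new[(i, j)] = {s: msg[s] / z for s in states}
        old = new
    # belief update: product of incoming messages, normalised per node
    beliefs = {}
    for j in nodes:
        b = {s: 1.0 for s in states}
        for q in nbrs(j):
            for s in states:
                b[s] *= old[(q, j)][s]
        z = b[+1] + b[-1]
        beliefs[j] = {s: b[s] / z for s in states}
    return beliefs
```

Messages are normalised after damping, a standard numerical stabilisation that the pseudocode above leaves implicit.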
ψ_(i,j)(s_i, s_j) is the factor potential (see (1)) and in the case of Boltzmann5 takes the form ψ_(i,j)(s_i, s_j) = exp(w(c_i, c_j) s_i s_j + h(c_i) s_i + h(c_j) s_j). Now the probability of each vertex being Black or White territory is found by normalising the beliefs at each node. For example, P(s_j = Black) = b_j(Black)/Z where Z = b_j(Black) + b_j(White). The accuracy of the loopy BP approximation appears to be improved by using it during the parameter learning stage in cases where it is to be used in evaluation.\n\n4 Results for Territory Prediction\n\nSome Learnt Parameters Here are some parameters learnt for the Boltzmann5 model (2). This model was trained on 290 positions from expert Go games at move 80. Training was performed by maximum likelihood as described in Section 2.\n\n[Figure 2 appears here: panels (a) Boltzmann5 (Exact) and (b) Boltzmann5 (Loopy BP)]\n\nFigure 2: Comparing territory predictions for a Go position from a professional game at move 90. The circles represent stones. The small black and white squares at each vertex represent the average territory prediction at that vertex, from -1 (maximum white square) to +1 (maximum black square).\n\nh_stones = 0.265\nw_empty = 0.427\nw_chain-empty = 0.442\nw_chains = 2.74\nw_inter-chain = 0.521\n\nThe values of these parameters can be interpreted. For example, w_chains corresponds to the correlation between the likely territory outcome of two adjacent vertices in a chain of connected stones. The high value of this parameter derives from the `common fate' property of chains as described in Section 1.\n\nInterestingly, the value of the parameter w_empty (corresponding to the coupling between territory predictions of neighbouring vertices in empty space) is 0.427, which is close to the critical coupling for an Ising model, 0.441.\n\nTerritory Predictions Figure 2 gives examples of territory predictions generated by Boltzmann5. 
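For illustration, the learnt Boltzmann5 parameters listed above can be assembled into the colour-symmetric tables w(·, ·) and h(·) of the potential in (2); in this Python sketch the encoding ('B', 'W', 'E') and names are ours, while the numbers are the quoted learnt values:

```python
import math

# Learnt Boltzmann5 parameters quoted in the text.
W = {
    ('B', 'B'): 2.74,  ('W', 'W'): 2.74,   # w_chains
    ('B', 'W'): 0.521, ('W', 'B'): 0.521,  # w_inter-chain
    ('E', 'B'): 0.442, ('E', 'W'): 0.442,  # w_chain-empty
    ('B', 'E'): 0.442, ('W', 'E'): 0.442,
    ('E', 'E'): 0.427,                     # w_empty
}
H = {'B': 0.265, 'W': -0.265, 'E': 0.0}    # h_stones; h(Empty) = 0

def pair_potential(ci, cj, si, sj):
    """psi_(i,j) for Boltzmann5: exp(w(ci,cj) si sj + h(ci) si + h(cj) sj)."""
    return math.exp(W[(ci, cj)] * si * sj + H[ci] * si + H[cj] * sj)
```

The tables satisfy the colour-reversal symmetry assumed in Section 2: swapping Black and White while negating both territory variables leaves the potential unchanged, and the large w_chains makes agreeing outcomes along a chain far more probable than disagreeing ones.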
In comparison, Figure 3 shows the prediction of BoltzmannLiberties and a territory prediction from The Many Faces of Go [2]. Go players confirm that the territory predictions produced by the models are reasonable, even around loose groups of Black and White stones. Compare Figures 2 (a) and 3 (a); when liberty counts are included as features, the model can more confidently identify which of the two small chains competing in the bottom right of the board is dead. Comparing Figures 2 (a) and (b), loopy BP appears to give over-confident predictions in the top right of the board where few stones are present. However, it is a good approximation where many stones are present (bottom left).\n\nComparing Models and Inference Methods Figure 4 shows cross-entropies between model territory predictions and true final territory outcomes for a dataset of expert games. As we progress through a game, predictions become more accurate (not surprising) but the spread of the accuracy increases, possibly due to incorrect assessment of the life-and-death status of groups. Swendsen-Wang performs better than loopy BP, which may suffer from its over-confidence. BoltzmannLiberties performs better than Boltzmann5 (when using Swendsen-Wang), the difference in performance increasing later in the game when liberty counts become more useful.\n\n[Figure 3 appears here: panels (a) BoltzmannLiberties (Exact) and (b) Many Faces of Go]\n\nFigure 3: Diagram (a) is produced by exact inference (training was also by loopy BP). Diagram (b) shows the territory predicted by The Many Faces of Go (MFG) [2]. MFG uses a rule-based expert system and its prediction for each vertex has three possible values: `White', `Black' or `unknown/neutral'.\n\n5 Modelling Move Selection\n\nIn order to produce a Go playing program we are interested in modelling the selection of moves. A measure of performance of such a model is the likelihood it assigns to professional moves, as measured by\n\nL = Σ_games Σ_moves log P(move|model).   (4)\n\nWe can obtain a probability over moves by choosing a Gibbs distribution with the negative energy replaced by the evaluation function,\n\nP(move|model, w) = exp(β u(c', w)) / Z(w),   (5)\n\nwhere u(c', w) is an evaluation function evaluated at the board position c' resulting from a given move. The inverse temperature parameter β determines the degree to which the move made depends on its evaluation. The territory predictions from the models Boltzmann5 and BoltzmannLiberties can be combined with the evaluation function of Section 2 to produce position evaluators.\n\n6 Conclusions\n\nWe have presented a probabilistic framework for modelling uncertainty in the game of Go. A simple model which incorporates the spatial structure of a board position can perform well at predicting the territory outcomes of Go games. The models described here could be improved by extracting more features from board positions and increasing the size of the factors (see (1)).\n\n[Figure 4 appears here: box plots of cross entropy for Swendsen-Wang and Loopy BP at moves 20, 80 and 150]\n\nFigure 4: Cross entropies -(1/N) Σ_n [s̄_n log ŝ_n + (1 - s̄_n) log(1 - ŝ_n)] between actual territory outcomes s̄_n and predicted territory probabilities ŝ_n for 327 Go positions. Sampling is compared with loopy BP (training and testing). Three board positions were analysed for each game (moves 20, 80 and 150). The Boltzmann5 (B5) and BoltzmannLiberties (BLib) models are compared.\n\nAcknowledgements We thank I. Murray for helpful discussions on sampling and T. Minka for general advice about probabilistic inference. This work was supported by a grant from Microsoft Research UK.\n\nReferences\n\n[1] Thore Graepel, Mike Goutrie, Marco Kruger, and Ralf Herbrich. Learning on graphs in the game of Go. 
In Proceedings of the International Conference on Artificial Neural Networks, ICANN 2001, 2001.\n\n[2] David Fotland. Knowledge representation in the Many Faces of Go. URL: ftp://www.joy.ne.jp/welcome/igs/Go/computer/mfg.tex.Z, 1993.\n\n[3] Nicol N. Schraudolph, Peter Dayan, and Terrence J. Sejnowski. Temporal difference learning of position evaluation in the game of Go. In Advances in Neural Information Processing Systems 6, pages 817-824, San Francisco, 1994. Morgan Kaufmann.\n\n[4] Markus Enzenberger. The integration of a priori knowledge into a Go playing neural network. URL: http://www.markus-enzenberger.de/neurogo.html, 1996.\n\n[5] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. Int. Conf. on Machine Learning, 2001.\n\n[6] Fabio Gagliardi Cozman. Generalizing variable elimination in Bayesian networks. In Proceedings of the IBERAMIA/SBIA 2000 Workshops, pages 27-32, 2000.\n\n[7] R. H. Swendsen and J.-S. Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86-88, 1987.\n\n[8] Robert G. Edwards and Alan D. Sokal. Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review D, 38(6), 1988.\n\n[9] Yair Weiss. Belief propagation and revision in networks with loops. Technical report, AI Lab Memo, MIT, Cambridge, 1998.\n\n[10] A. L. Zobrist. Feature Extractions and Representations for Pattern Recognition and the Game of Go. PhD thesis, Graduate School of the University of Wisconsin, 1970.\n", "award": [], "sourceid": 2688, "authors": [{"given_name": "David", "family_name": "Stern", "institution": null}, {"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}]}