{"title": "TrueSkill Through Time: Revisiting the History of Chess", "book": "Advances in Neural Information Processing Systems", "page_first": 337, "page_last": 344, "abstract": null, "full_text": "TrueSkill Through Time: Revisiting the History of Chess\n\nPierre Dangauthier\nINRIA Rhone Alpes\nGrenoble, France\npierre.dangauthier@imag.fr\n\nRalf Herbrich\nMicrosoft Research Ltd.\nCambridge, UK\nrherb@microsoft.com\n\nTom Minka\nMicrosoft Research Ltd.\nCambridge, UK\nminka@microsoft.com\n\nThore Graepel\nMicrosoft Research Ltd.\nCambridge, UK\nthoreg@microsoft.com\n\nAbstract\n\nWe extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players\u2019 lifetime skill development as well as the ability to compare the skills of different players across time. 
Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player\u2019s ability to force a draw provides significantly better predictive power.\n\n1 Introduction\n\nCompetitive games and sports can benefit from statistical skill ratings for use in matchmaking as well as for providing criteria for the admission to tournaments. From a historical perspective, skill ratings also provide information about the general development of skill within the discipline or for a particular group of interest. Also, they can give a fascinating narrative about the key players in a given discipline, allowing a glimpse at their rise and fall or their struggle against their contemporaries.\n\nIn order to provide good estimates of the current skill level of players, skill rating systems have traditionally been designed as filters that combine a new game outcome with knowledge about a player\u2019s skill from the past to obtain a new estimate. In contrast, when taking a historical view we would like to infer the skill of a player at a given point in the past when both their past as well as their future achievements are known.\n\nThe best known such filter-based skill rating system is the Elo system [3] developed by Arpad Elo in 1959 and adopted by the World Chess Federation FIDE in 1970 [4]. Elo models the probability of the game outcome as $P(1 \\text{ wins over } 2 \\mid s_1, s_2) := \\Phi\\left(\\frac{s_1 - s_2}{\\sqrt{2}\\,\\beta}\\right)$ where $s_1$ and $s_2$ are the skill ratings of each player, $\\Phi$ denotes the cumulative distribution function of a zero-mean unit-variance Gaussian and $\\beta$ is the assumed variability of performance around skill. Denote the game outcomes by $y = +1$ if player 1 wins, $y = -1$ if player 2 wins and $y = 0$ if a draw occurs. 
Then the resulting (linearised) Elo update is given by $s_1 \\leftarrow s_1 + y\\Delta$, $s_2 \\leftarrow s_2 - y\\Delta$ and\n\n$$\\Delta = \\underbrace{\\alpha\\beta\\sqrt{\\pi}}_{K\\text{-Factor}} \\left(\\frac{y + 1}{2} - P(1 \\text{ wins over } 2 \\mid s_1, s_2)\\right),$$\n\nwhere $0 < \\alpha < 1$ determines how much the filter weighs the new evidence versus the old estimate.\n\nThe TrueSkill rating system [6] improves on the Elo system in a number of ways. TrueSkill\u2019s current belief about a player\u2019s skill is represented by a Gaussian distribution with mean $\\mu$ and variance $\\sigma^2$. As a consequence, TrueSkill does not require a provisional rating period and converges to the true skills of players very quickly. Also, in contrast to Elo, TrueSkill explicitly models the probability of draws. Crucially for its application in the Xbox Live online gaming system (see [6] for details) it can also infer skills from games with more than two participating entities and infers individual players\u2019 skills from the outcomes of team games.\n\nAs a skill rating and matchmaking system TrueSkill operates as a filter as discussed above. However, due to its fully probabilistic formulation it is possible to extend TrueSkill to perform smoothing on a time series of player skills. In this paper we extend TrueSkill to provide accurate estimates of the past skill levels of players at any point in time taking into account both their past and their future achievements. We carry out a large-scale analysis of about 3.5 million games of chess played over the last 150 years.\n\nThe paper is structured as follows. In Section 2 we review previous work on historical chess ratings. In Section 3 we present two models for historical ratings through time, one assuming a fixed draw margin and one estimating the draw margin per player per year. We indicate how large scale approximate message passing (EP) can be used to efficiently perform inference in these huge models. 
In Section 4 we present experimental results on a huge data set from ChessBase with over 3.5 million games and gain some fascinating chess-specific insights from the data.\n\n2 Previous Work on Historical Chess Ratings\n\nEstimating players\u2019 skills in retrospect allows one to take into account more information and hence can be expected to lead to more precise estimates. The pioneer in this field was Arpad Elo himself, when he encountered the necessity of initializing the skill values of the Elo system when it was first deployed. To that end he fitted a smooth curve to skill estimates from five-year periods; however, little is known about the details of his method [3].\n\nProbably best known in the chess community is the Chessmetrics system [8], which aims at improving the Elo scores by attempting to obtain a better fit with the observed data. Although constructed in a very thoughtful manner, Chessmetrics is not a statistically well-founded method and is a filtering algorithm that disregards information from future games.\n\nThe first approach to the historical rating problem with a solid statistical foundation was developed by Mark Glickman, chairman of the USCF Rating Committee. Glicko 1 & 2 [5] are Bayesian rating systems that address a number of drawbacks of the Elo system while still being based on the Bradley-Terry paired-comparison method [1] used by modern Elo. Glickman models skills as Gaussian variables whose variances indicate the reliability of the skill estimate, an idea later adopted in the TrueSkill model as well. Glicko 2 adds volatility measures, indicating the degree of expected fluctuation in a player\u2019s rating. 
After an initial estimate, past estimates are smoothed by propagating information back in time.\n\nThe second statistically well-founded approach is Rod Edwards\u2019s Edo Historical Chess Ratings [2], which are also based on the Bradley-Terry model but have been applied only to historical games from the 19th century. In order to model skill dynamics Edwards considers the same player at different times as several distinct players, whose skills are linked together by a set of virtual games which are assumed to end in draws. While Edo incorporates a dynamics model via virtual games and returns uncertainty measures in terms of the estimator\u2019s variance, it is not a full Bayesian model: it neither provides posterior distributions over skills nor explicitly models draws.\n\nIn light of the above previous work on historical chess ratings the goal of this paper is to introduce a fully probabilistic model of chess ratings through time which explicitly accounts for draws and provides posterior distributions of skills that reflect the reliability of the estimate at every point in time.\n\n3 Models for Ranking through Time\n\nThis paper strongly builds on the original TrueSkill paper [6]. Although TrueSkill is applicable to the case of multiple team games, we will only consider the two player case for this application to chess. It should be clear, however, that the methods presented can equally well be used for games with any number of teams competing.\n\nConsider a game such as chess in which a number of, say, $N$ players $\\{1, \\ldots, N\\}$ are competing over a period of $T$ time steps, say, years. Denote the series of game outcomes between two players $i$ and $j$ in year $t$ by $y_{ij}^t(k) \\in \\{+1, -1, 0\\}$ where $k \\in \\{1, \\ldots, K_{ij}^t\\}$ and $K_{ij}^t$ denotes the number of game outcomes available for that pair of players in that year. 
Furthermore, let $y = +1$ if player $i$ wins, $y = -1$ if player $j$ wins and $y = 0$ in case of a draw.\n\n3.1 Vanilla TrueSkill\n\nIn the Vanilla TrueSkill system, each player $i$ is assumed to have an unknown skill $s_i^t \\in \\mathbb{R}$ at time $t$. We assume that a game outcome $y_{ij}^t(k)$ is generated as follows. For each of the two players $i$ and $j$ performances $p_{ij}^t(k)$ and $p_{ji}^t(k)$ are drawn according to $p(p_{ij}^t(k) \\mid s_i^t) = \\mathcal{N}(p_{ij}^t(k); s_i^t, \\beta^2)$. The outcome $y_{ij}^t(k)$ of the game between players $i$ and $j$ is then determined as\n\n$$y_{ij}^t(k) := \\begin{cases} +1 & \\text{if } p_{ij}^t(k) > p_{ji}^t(k) + \\varepsilon \\\\ -1 & \\text{if } p_{ji}^t(k) > p_{ij}^t(k) + \\varepsilon \\\\ 0 & \\text{if } |p_{ij}^t(k) - p_{ji}^t(k)| \\le \\varepsilon \\end{cases},$$\n\nwhere the parameter $\\varepsilon > 0$ is the draw margin. In order to infer the unknown skills $s_i^t$ the TrueSkill model assumes a factorising Gaussian prior $p(s_i^0) = \\mathcal{N}(s_i^0; \\mu_0, \\sigma_0^2)$ over skills and a Gaussian drift of skills between time steps given by $p(s_i^t \\mid s_i^{t-1}) = \\mathcal{N}(s_i^t; s_i^{t-1}, \\tau^2)$. The model can be well described as a factor graph (see Figure 1, left) which clarifies the factorisation assumptions of the model and allows one to develop efficient (approximate) inference algorithms based on message passing (for details see [6]).\n\nIn the Vanilla TrueSkill algorithm, denoting the winning player by $W$ and the losing player by $L$ and dropping the time index for now, approximate Bayesian inference (Gaussian density filtering [7]) leads to the following update equations for $\\mu_W$, $\\mu_L$, $\\sigma_W$ and $\\sigma_L$:\n\n$$\\mu_W \\leftarrow \\mu_W + \\frac{\\sigma_W^2}{c_{ij}} \\cdot v\\left(\\frac{\\mu_W - \\mu_L}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right) \\quad \\text{and} \\quad \\sigma_W \\leftarrow \\sigma_W \\sqrt{1 - \\frac{\\sigma_W^2}{c_{ij}^2} \\cdot w\\left(\\frac{\\mu_W - \\mu_L}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right)},$$\n\n$$\\mu_L \\leftarrow \\mu_L - \\frac{\\sigma_L^2}{c_{ij}} \\cdot v\\left(\\frac{\\mu_W - \\mu_L}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right) \\quad \\text{and} \\quad \\sigma_L \\leftarrow \\sigma_L \\sqrt{1 - \\frac{\\sigma_L^2}{c_{ij}^2} \\cdot w\\left(\\frac{\\mu_W - \\mu_L}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right)}.$$\n\nThe overall variance is $c_{ij}^2 = 2\\beta^2 + \\sigma_W^2 + \\sigma_L^2$ and the two functions $v$ and $w$ are given by\n\n$$v(t, \\alpha) := \\frac{\\mathcal{N}(t - \\alpha; 0, 1)}{\\Phi(t - \\alpha)} \\quad \\text{and} \\quad w(t, \\alpha) := v(t, \\alpha) \\cdot \\left(v(t, \\alpha) + (t - \\alpha)\\right).$$\n\nFor the case of a draw we have the following update equations:\n\n$$\\mu_i \\leftarrow \\mu_i + \\frac{\\sigma_i^2}{c_{ij}} \\cdot \\tilde{v}\\left(\\frac{\\mu_i - \\mu_j}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right) \\quad \\text{and} \\quad \\sigma_i \\leftarrow \\sigma_i \\sqrt{1 - \\frac{\\sigma_i^2}{c_{ij}^2} \\cdot \\tilde{w}\\left(\\frac{\\mu_i - \\mu_j}{c_{ij}}, \\frac{\\varepsilon}{c_{ij}}\\right)},$$\n\nand similarly for player $j$. Defining $d := \\alpha - t$ and $s := \\alpha + t$, the functions $\\tilde{v}$ and $\\tilde{w}$ are given by\n\n$$\\tilde{v}(t, \\alpha) := \\frac{\\mathcal{N}(-s; 0, 1) - \\mathcal{N}(d; 0, 1)}{\\Phi(d) - \\Phi(-s)} \\quad \\text{and} \\quad \\tilde{w}(t, \\alpha) := \\tilde{v}^2(t, \\alpha) + \\frac{d\\,\\mathcal{N}(d; 0, 1) + s\\,\\mathcal{N}(s; 0, 1)}{\\Phi(d) - \\Phi(-s)}.$$\n\nIn order to approximate the skill parameters $\\mu_i^t$ and $\\sigma_i^t$ for all players $i \\in \\{1, \\ldots, N\\}$ at all times $t \\in \\{0, \\ldots, T\\}$ the Vanilla TrueSkill algorithm initialises each skill belief with $\\mu_i^0 \\leftarrow \\mu_0$ and $\\sigma_i^0 \\leftarrow \\sigma_0$. It then proceeds through the years $t \\in \\{1, \\ldots, T\\}$ in order, goes through the game outcomes $y_{ij}^t(k)$ in random order and updates the skill beliefs according to the equations above.\n\n3.2 TrueSkill through Time (TTT)\n\nThe Vanilla TrueSkill algorithm suffers from two major disadvantages:\n\n1. Inference within a given year $t$ depends on the random order chosen for the updates. Since no knowledge is assumed about game outcomes within a given year, the results of inference should be independent of the order of games within a year.\n\n2. Information across years is only propagated forward in time. More concretely, if player A beats player B and player B later turns out to be very strong (i.e., as evidenced by him beating very strong player C repeatedly), then Vanilla TrueSkill cannot propagate that information backwards in time to correct player A\u2019s skill estimate upwards.\n\nBoth problems can be addressed by extending the Gaussian density filtering to running full expectation propagation (EP) until convergence [7]. The basic idea is to update repeatedly on the same game outcomes but making sure that the effect of the previous update on that game outcome is removed before the new effect is added. 
This way, the model remains the same but the inferences are less approximate.\n\nMore specifically, we go through the game outcomes $y_{ij}^t$ within a year $t$ several times until convergence. The update for a game outcome $y_{ij}^t(k)$ is performed in the same way as before but saving the upward messages $m_{f(p_{ij}^t(k), s_i^t) \\to s_i^t}(s_i^t)$ which describe the effect of the updated performance $p_{ij}^t(k)$ on the underlying skill $s_i^t$. When game outcome $y_{ij}^t(k)$ comes up for update again, the new downward message $m_{f(p_{ij}^t(k), s_i^t) \\to p_{ij}^t(k)}(p_{ij}^t(k))$ can be calculated by\n\n$$m_{f(p_{ij}^t(k), s_i^t) \\to p_{ij}^t(k)}(p_{ij}^t(k)) = \\int_{-\\infty}^{\\infty} f(p_{ij}^t(k), s_i^t) \\, \\frac{p(s_i^t)}{m_{f(p_{ij}^t(k), s_i^t) \\to s_i^t}(s_i^t)} \\, ds_i^t,$$\n\nthus effectively dividing out the earlier upward message to avoid double counting. The integral above is easily evaluated since the messages as well as the marginals $p(s_i^t)$ have been assumed Gaussian. The new downward message serves as the effective prior belief on the performance $p_{ij}^t(k)$. At convergence, the dependency of the inferred skills on the order of game outcomes vanishes.\n\nThe second problem is addressed by performing inference for TrueSkill through time (TTT), i.e. by repeatedly smoothing forward and backward in time. The first forward pass of TTT is identical to the inference pass of Vanilla TrueSkill except that the forward messages $m_{f(s_i^{t-1}, s_i^t) \\to s_i^t}(s_i^t)$ are stored. They represent the influence of skill estimate $s_i^{t-1}$ at time $t - 1$ on skill estimate $s_i^t$ at time $t$. In the backward pass, these messages are then used to calculate the new backward messages $m_{f(s_i^{t-1}, s_i^t) \\to s_i^{t-1}}(s_i^{t-1})$, which effectively serve as the new prior for time step $t - 1$,\n\n$$m_{f(s_i^{t-1}, s_i^t) \\to s_i^{t-1}}(s_i^{t-1}) = \\int_{-\\infty}^{\\infty} f(s_i^{t-1}, s_i^t) \\, \\frac{p(s_i^t)}{m_{f(s_i^{t-1}, s_i^t) \\to s_i^t}(s_i^t)} \\, ds_i^t.$$\n\nThis procedure is repeated forward and backward along the time series of skills until convergence. The backward passes make it possible to propagate information from the future into the past.\n\n[Figure 1: three factor graphs, omitted in this text version.]\n\nFigure 1: Factor graphs of single game outcomes for TTT (left) and TTT-D. In the left graph there are three types of variables: skills $s$, performances $p$, performance differences $d$. In the TTT-D graphs there are two additional types: draw margins $\\varepsilon$ and winning thresholds $u$. The graphs only require three different types of factors: the factor labelled $\\tau$ takes the form $\\mathcal{N}(\\cdot; \\cdot, \\tau^2)$, factor $> 0$ takes the form $\\mathbb{I}(\\cdot > 0)$ and factor $\\pm$ takes the form $\\mathbb{I}(\\cdot \\pm \\cdot = \\cdot)$.\n\n3.3 TTT with Individual Draw Margins (TTT-D)\n\nFrom exploring the data it is known that the probability of draw not only increases markedly through the history of chess, but is also positively correlated with playing skill and even varies considerably across individual players. We would thus like to extend the TrueSkill model to incorporate another player-specific parameter which indicates a player\u2019s ability to force a draw. Suppose each player $i$ at every time-step $t$ is characterised by an unknown skill $s_i^t \\in \\mathbb{R}$ and a player-specific draw margin $\\varepsilon_i^t > 0$. Again, performances $p_{ij}^t(k)$ and $p_{ji}^t(k)$ are drawn according to $p(p_{ij}^t(k) \\mid s_i^t) = \\mathcal{N}(p_{ij}^t(k); s_i^t, \\beta^2)$. In this model a game outcome $y_{ij}^t(k)$ between players $i$ and $j$ at time $t$ is generated as follows:\n\n$$y_{ij}^t(k) = \\begin{cases} +1 & \\text{if } p_{ij}^t(k) > p_{ji}^t(k) + \\varepsilon_j^t \\\\ -1 & \\text{if } p_{ji}^t(k) > p_{ij}^t(k) + \\varepsilon_i^t \\\\ 0 & \\text{if } -\\varepsilon_i^t \\le p_{ij}^t(k) - p_{ji}^t(k) \\le \\varepsilon_j^t \\end{cases}.$$\n\nIn addition to the Gaussian assumption about player skills as in the Vanilla TrueSkill model of Section 3.1 we assume a factorising Gaussian distribution for the player-specific draw margins $p(\\varepsilon_i^0) = \\mathcal{N}(\\varepsilon_i^0; \\nu_0, \\varsigma_0^2)$ and a Gaussian drift of draw margins between time steps given by $p(\\varepsilon_i^t \\mid \\varepsilon_i^{t-1}) = \\mathcal{N}(\\varepsilon_i^t; \\varepsilon_i^{t-1}, \\varsigma^2)$. The factor graph for the case of win/loss is shown in Figure 1 (centre) and for the case of a draw in Figure 1 (right). Note that the positivity of the player-specific draw margins at each time step $t$ is enforced by a factor $> 0$.\n\nInference in the TTT-D model is again performed by expectation propagation, both within a given year $t$ as well as across years in a forward-backward manner. 
Note that in this model the current belief about the skill of a player is represented by four numbers: $\\mu_i^t$ and $\\sigma_i^t$ for the skill and $\\nu_i^t$ and $\\varsigma_i^t$ for the player-specific draw margin. Players with a high value of $\\nu_i^t$ can be thought of as having the ability to achieve a draw against strong players, while players with a high value of $\\mu_i^t$ have the ability to achieve a win.\n\n[Figure 2, left panel: histogram with x-axis Year (1850 to 2004) and y-axis Frequency ($\\times 10^5$); plots omitted in this text version.]\n\nFigure 2: (Left) Distribution over number of recorded match outcomes played per year in the ChessBase database. (Right) The log-evidence $\\log P(y \\mid \\beta, \\tau)$ for the TTT model as a function of the variation of player performance, $\\beta$, and skill dynamics, $\\tau$. The maximizing parameter settings are indicated by a black dot.\n\n4 Experiments and Results\n\nOur experiments are based on a data-set of chess match outcomes collected by ChessBase\u00b9. This database is the largest top-class annotated database in the world and covers more than 3.5 million chess games from 1560 to 2006 played between approximately 200,000 unique players. From this database, we selected all the matches between 1850 (the birth of modern Chess) and 2006. This results in 3,505,366 games between 206,059 unique players. Note that a large proportion of games was collected between 1987 and 2006 (see Figure 2 (left)).\n\nOur implementation of the TrueSkill through Time algorithms was done in F#\u00b2 and builds a factor graph with approximately 11,700,000 variables and 15,200,000 factors (TTT) or 18,500,000 variables and 27,600,000 factors (TTT-D). The whole schedule allocates no more than 6 GB (TTT) or 11 GB (TTT-D) and converges in less than 10 minutes (TTT) or 20 minutes (TTT-D) of CPU time on a standard Pentium 4 machine. 
The code for this analysis will be made publicly available.\n\nIn the first experiment, we built the TTT model for the above mentioned collection of Chess games. The draw margin was chosen such that the a-priori probability of draw between two equally skilled players matches the overall draw probability of 30.3%. Moreover, the model has a translational invariance in the skills and a scale invariance in $\\beta/\\sigma_0$ and $\\tau/\\sigma_0$. Thus, we fixed $\\mu_0 = 1200$, $\\sigma_0 = 400$ and computed the log-evidence $L := \\log P(y \\mid \\beta, \\tau)$ for varying values of $\\beta$ and $\\tau$ (see Figure 2 (right)). The plots show that the model is very robust to setting these two parameters except if $\\beta$ is chosen too small. Interestingly, the log-evidence is neither largest for $\\tau \\gg 0$ (complete de-coupling) nor for $\\tau \\to 0$ (constant skill over life-time) indicating that it is important to model the dynamics of Chess players. Note that the log-evidence is $L_{\\text{TTT}} = -3{,}953{,}997$, larger than that of the naive model ($L_{\\text{naive}} = -4{,}228{,}005$) which always predicts 30.3% for a draw and correspondingly for win/loss\u00b3. In a second experiment, we picked the optimal values $(\\beta^*, \\tau^*) = (480, 60)$ for TTT and optimised the remaining prior and dynamics parameters of TTT-D to arrive at a model with a log-evidence of $L_{\\text{TTT-D}} = -3{,}661{,}813$.\n\nIn Figure 3 we have plotted the skill evolution for some well-known players of the last 150 years when fitting the TTT model ($\\mu^t$, $\\sigma^t$ are shown). In Figure 4 the skill evolution of the same players is plotted when fitting the TTT-D model; the dashed lines show $\\mu^t + \\varepsilon^t$ whereas the solid lines display $\\mu^t$; for comparison we added the $\\mu^t$ of the TTT model as dotted lines.\n\n\u00b9 For more information, see http://www.bcmchess.co.uk/softdatafrcb.html.\n\u00b2 For more details, see http://research.microsoft.com/fsharp/fsharp.aspx.\n\u00b3 Leakage due to approximate inference.\n\nFigure 3: Skill evolution of top Chess players with TTT; see text for details.\n\nAs a first observation, the uncertainties always grow towards the beginning and end of a career since they are not constrained by past/future years. In fact, for Bobby Fischer the uncertainty grows very large in his 20 years of inactivity (1972\u20131992). Moreover, there seems to be a noticeable increase in overall skill since the 1980s. Looking at Figure 4 we see that players have different abilities to force a draw; the strongest player to do so is Boris Spassky (1937\u2013). This ability got stronger after 1975, which explains why the model with a fixed draw margin estimates Spassky\u2019s skill as larger.\n\nLooking at individual players we see that Paul Morphy (1837\u20131884), \u201cThe Pride and Sorrow of Chess\u201d, is particularly strong when comparing his skill to those of his contemporaries in the next 80 years. He is considered to have been the greatest chess master of his time, and this is well supported by our analysis. \u201cBobby\u201d Fischer (1943\u2013) tied with Boris Spassky at the age of 17 and later defeated Spassky in the \u201cMatch of the Century\u201d in 1972. Again, this is well supported by our model. Note how the uncertainty grows during the 20 years of inactivity (1972\u20131992) but starts to shrink again in light of the (future) re-match of Spassky and Fischer in 1992 (which Fischer won). Also, Fischer is the only one of these players whose $\\varepsilon^t$ decreased over time: when he was active, he was known for the large margin by which he won!\n\nFinally, Garry Kasparov (1963\u2013) is considered the strongest Chess player of all time. This is well supported by our analysis. 
In fact, based on our analysis Kasparov is still considerably stronger than Vladimir Kramnik (1975\u2013) but a contender for the crown of strongest player in the world is Viswanathan Anand (1969\u2013), a former FIDE world champion.\n\n5 Conclusion\n\nWe have extended the Bayesian rating system TrueSkill to provide player ratings through time on a unified scale. In addition, we introduced a new model that tracks player-specific draw margins and thus models the game outcomes even more precisely. The resulting factor graph model for our large ChessBase database of game outcomes has 18.5 million nodes and 27.6 million factors, thus constituting one of the largest non-trivial Bayesian models ever tackled. Full approximate inference takes a mere 20 minutes in our F# implementation and thus demonstrates the efficiency of EP in appropriately structured factor graphs.\n\n[Figure 4 plot: skill estimate (y-axis, 1500 to 3500) against year (x-axis, 1850 to 2006) for Anderssen, Morphy, Steinitz, Eichborn, Lasker, Capablanca, Botvinnik, Fischer, Spassky, Karpov, Kasparov, Kramnik and Anand; legend: Skill (Variable Draw Margin), Skill + Draw Margin, Skill (Fixed Draw Margin).]\n\nFigure 4: Skill evolution of top Chess players with TTT-D; see text for details.\n\nOne of the key questions provoked by this work concerns the comparability of skill estimates across different eras of chess history. Can we directly compare Fischer\u2019s rating in 1972 with Kasparov\u2019s in 1991? 
Edwards [2] points out that we would not be able to detect any skill improvement if two players of equal skill were to learn about a skill-improving breakthrough in chess theory at the same time but would only play against each other. However, this argument does not rule out the possibility that with more players and chess knowledge flowing less perfectly the improvement may be detectable. After all, we do see a marked improvement in the average skill of the top players.\n\nIn future work, we would like to address the issue of skill calibration across years further, e.g., by introducing a latent variable for each year that serves as the prior for new players joining the pool. Also, it would be interesting to model the effect of playing white rather than black.\n\nReferences\n\n[1] H. A. David. The method of paired comparisons. Oxford University Press, New York, 1988.\n\n[2] R. Edwards. Edo historical chess ratings. http://members.shaw.ca/edo1/.\n\n[3] A. E. Elo. The rating of chess players: Past and present. Arco Publishing, New York, 1978.\n\n[4] M. E. Glickman. A comprehensive guide to chess ratings. Amer. Chess Journal, 3:59\u2013102, 1995.\n\n[5] M. E. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48:377\u2013394, 1999.\n\n[6] R. Herbrich, T. Minka, and T. Graepel. TrueSkill(TM): A Bayesian skill rating system. In Advances in Neural Information Processing Systems 20, 2007.\n\n[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.\n\n[8] J. Sonas. Chessmetrics. http://db.chessmetrics.com/.", "award": [], "sourceid": 931, "authors": [{"given_name": "Pierre", "family_name": "Dangauthier", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Tom", "family_name": "Minka", "institution": null}, {"given_name": "Thore", "family_name": "Graepel", "institution": null}]}