{"title": "Learning to Play the Game of Chess", "book": "Advances in Neural Information Processing Systems", "page_first": 1069, "page_last": 1076, "abstract": null, "full_text": "

Learning to Play the Game of Chess

Sebastian Thrun
University of Bonn
Department of Computer Science III
Römerstr. 164, D-53117 Bonn, Germany
E-mail: thrun@carbon.informatik.uni-bonn.de

Abstract

This paper presents NeuroChess, a program which learns to play chess from the final outcome of games. NeuroChess learns chess board evaluation functions, represented by artificial neural networks. It integrates inductive neural network learning, temporal differencing, and a variant of explanation-based learning. Performance results illustrate some of the strengths and weaknesses of this approach.

1 Introduction

Throughout the last decades, the game of chess has been a major testbed for research on artificial intelligence and computer science. Most of today's chess programs rely on intensive search to generate moves. To evaluate boards, fast evaluation functions are employed which are usually carefully designed by hand, sometimes augmented by automatic parameter tuning methods [1]. Building a chess machine that learns to play solely from the final outcome of games (win/loss/draw) is a challenging open problem in AI.

In this paper, we are interested in learning to play chess from the final outcome of games. One of the earliest approaches, which learned solely by playing itself, is Samuel's famous checker player program [10]. His approach employed temporal difference learning (in short: TD) [14], which is a technique for recursively learning an evaluation function. Recently, Tesauro reported the successful application of TD to the game of Backgammon, using artificial neural network representations [16]. While his TD-Gammon approach plays grandmaster-level backgammon, recent attempts to reproduce these results in the context of Go [12] and chess have been less successful. For example, Schafer [11] reports a system just like Tesauro's TD-Gammon, applied to learning to play certain chess endgames. Gherrity [6] presented a similar system which he applied to entire chess games. Both approaches learn purely inductively from the final outcome of games. Tadepalli [15] applied a lazy version of explanation-based learning [5, 7] to endgames in chess. His approach learns from the final outcome, too, but unlike the inductive neural network approaches listed above it learns analytically, by analyzing and generalizing experiences in terms of chess-specific knowledge.

The level of play reported for all these approaches is still below the level of GNU-Chess, a publicly available chess tool which has frequently been used as a benchmark. This illustrates the hardness of the problem of learning to play chess from the final outcome of games.

This paper presents NeuroChess, a program that learns to play chess from the final outcome of games. The central learning mechanism is the explanation-based neural network (EBNN) algorithm [9, 8]. Like Tesauro's TD-Gammon approach, NeuroChess constructs a neural network evaluation function for chess boards using TD. In addition, a neural network version of explanation-based learning is employed, which analyzes games in terms of a previously learned neural network chess model. This paper describes the NeuroChess approach, discusses several training issues in the domain of chess, and presents results which elucidate some of its strengths and weaknesses.

2 Temporal Difference Learning in the Domain of Chess

Temporal difference learning (TD) [14] comprises a family of approaches to prediction in cases where the event to be predicted may be delayed by an unknown number of time steps. In the context of game playing, TD methods have frequently been applied to learn functions which predict the final outcome of games. Such functions are used as board evaluation functions.

The goal of TD(0), a basic variant of TD which is currently employed in the NeuroChess approach, is to find an evaluation function, V, which ranks chess boards according to their goodness: if the board s is more likely to be a winning board than the board s', then V(s) > V(s'). To learn such a function, TD transforms entire chess games, denoted by a sequence of chess boards s_0, s_1, s_2, \ldots, s_{t_{final}}, into training patterns for V. The TD(0) learning rule works in the following way. Assume without loss of generality we are learning white's evaluation function. Then the target value for the final board is given by

    V^{target}(s_{t_{final}}) = \begin{cases} +1 & \text{if } s_{t_{final}} \text{ is a win for white} \\ \phantom{+}0 & \text{if } s_{t_{final}} \text{ is a draw} \\ -1 & \text{if } s_{t_{final}} \text{ is a loss for white} \end{cases}    (1)

and the targets for the intermediate chess boards s_0, s_1, s_2, \ldots, s_{t_{final}-2} are given by

    V^{target}(s_t) = \gamma \cdot V(s_{t+2})    (2)

This update rule constructs V recursively. At the end of the game, V evaluates the final outcome of the game (Eq. (1)). In between, when the assignment of V-values is less obvious, V is trained based on the evaluation two half-moves later (Eq. (2)). The constant γ (with 0 ≤ γ ≤ 1) is a so-called discount factor. It decays V exponentially in time and hence favors early over late success. Notice that in NeuroChess V is represented by an artificial neural network, which is trained to fit the target values V^{target} obtained via Eqs. (1) and (2) (cf. [6, 11, 12, 16]).

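The following minimal Python sketch shows how Eqs. (1) and (2) turn one recorded game into training patterns for V. The feature mapping, the evaluation network, and all names below are illustrative placeholders (a toy linear-tanh evaluator on random features), not the NeuroChess implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    GAMMA = 0.98        # discount factor reported for NeuroChess (Section 4)
    NUM_FEATURES = 175  # size of the hand-designed feature vector (Section 5)

    # Hypothetical stand-ins for the hand-designed feature mapping and the
    # neural network evaluation function V.
    def board_features(board):
        return rng.standard_normal(NUM_FEATURES)

    weights = np.zeros(NUM_FEATURES)

    def V(features):
        return float(np.tanh(weights @ features))

    def td0_training_patterns(boards, outcome):
        """Turn one complete game (half-move sequence s_0, ..., s_final,
        evaluated from white's perspective) into TD(0) training pairs.
        outcome: +1 win for white, 0 draw, -1 loss for white."""
        feats = [board_features(b) for b in boards]
        patterns = [(feats[-1], float(outcome))]                   # Eq. (1)
        for t in range(len(boards) - 2):                           # s_0 .. s_{final-2}
            patterns.append((feats[t], GAMMA * V(feats[t + 2])))   # Eq. (2)
        return patterns

    # Example: a 10-half-move game that white eventually won.
    print(len(td0_training_patterns(["board_%d" % i for i in range(10)], +1)))

In NeuroChess the resulting (feature, target) pairs are fitted by Back-propagation; in this sketch any regression procedure could consume them.
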
3 Explanation-Based Neural Network Learning

In a domain as complex as chess, pure inductive learning techniques, such as neural network Back-Propagation, suffer from enormous training times. To illustrate why, consider the situation of a knight fork, in which the opponent's knight attacks our queen and king simultaneously. Suppose in order to save our king we have to move it, and hence sacrifice our queen. To learn the badness of a knight fork, NeuroChess has to discover that certain board features (like the position of the queen relative to the knight) are important, whereas others (like the number of weak pawns) are not. Purely inductive learning algorithms such as Back-propagation figure out the relevance of individual features by observing statistical correlations in the training data. Hence, quite a few versions of a knight fork have to be experienced in order to generalize accurately. In a domain as complex as chess, such an approach might require unreasonably large amounts of training data.

Explanation-based methods (EBL) [5, 7, 15] generalize more accurately from less training data. They rely instead on the availability of domain knowledge, which they use for explaining and generalizing training examples. For example, in the explanation of a knight fork, EBL methods employ knowledge about the game of chess to figure out that the position of the queen is relevant, whereas the number of weak pawns is not. Most current approaches to EBL require that the domain knowledge be represented by a set of symbolic rules. Since NeuroChess relies on neural network representations, it employs a neural network version of EBL, called explanation-based neural network learning (EBNN) [9]. In the context of chess, EBNN works in the following way: the domain-specific knowledge is represented by a separate neural network, called the chess model M. M maps arbitrary chess boards s_t to the corresponding expected board s_{t+2} two half-moves later. It is trained prior to learning V, using a large database of grand-master chess games. Once trained, M captures important knowledge about temporal dependencies of chess board features in high-quality chess play.

EBNN exploits M to bias the board evaluation function V. It does this by extracting slope constraints for the evaluation function V at all non-final boards, i.e., all boards for which V is updated by Eq. (2). Let

    \partial V^{target}(s_t) / \partial s_t    with    t \in \{0, 1, 2, \ldots, t_{final} - 2\}    (3)

denote the target slope of V at s_t, which, because V^{target}(s_t) is set to \gamma V(s_{t+2}) according to Eq. (2), can be rewritten as

    \partial V^{target}(s_t) / \partial s_t = \gamma \cdot (\partial V(s_{t+2}) / \partial s_{t+2}) \cdot (\partial s_{t+2} / \partial s_t)    (4)

using the chain rule of differentiation. The rightmost term in Eq. (4) measures how infinitesimally small changes of the chess board s_t influence the chess board s_{t+2}. It can be approximated by the chess model M:

    \partial V^{target}(s_t) / \partial s_t \approx \gamma \cdot (\partial V(s_{t+2}) / \partial s_{t+2}) \cdot (\partial M(s_t) / \partial s_t)    (5)

The right expression is only an approximation to the left side, because M is a trained neural network and thus its first derivative might be erroneous. Notice that both expressions on the right hand side of Eq. (5) are derivatives of neural network functions, which are easy to compute since neural networks are differentiable.

The result of Eq. (5) is an estimate of the slope of the target function V at s_t. This slope adds important shape information to the target values constructed via Eq. (2). As depicted in Fig. 1, functions can be fit more accurately if in addition to target values the slopes of these values are known. Hence, instead of just fitting the target values V^{target}(s_t), NeuroChess also fits these target slopes. This is done using the Tangent-Prop algorithm [13].

Figure 1: Fitting values and slopes in EBNN. Let V be the target function for which three examples (s_1, V(s_1)), (s_2, V(s_2)), and (s_3, V(s_3)) are known. Based on these points the learner might generate the hypothesis V'. If the slopes \partial V(s_1)/\partial s_1, \partial V(s_2)/\partial s_2, and \partial V(s_3)/\partial s_3 are also known, the learner can do much better: V''.

The complete NeuroChess learning architecture is depicted in Fig. 2. The target slopes provide a first-order approximation to the relevance of each chess board feature in the goodness of a board position. They can be interpreted as biasing the network V based on chess-specific domain knowledge, embodied in M. For the relation of EBNN and EBL and the accommodation of inaccurate slopes in EBNN see [8].

Figure 2: Learning an evaluation function in NeuroChess. Boards at time t (white to move), t+1 (black to move), and t+2 (white to move) are mapped into a high-dimensional feature vector, which forms the input for both the evaluation network V and the predictive chess model network M (165 hidden units). The evaluation network is trained by Back-propagation and the TD(0) procedure. Both networks are employed for analyzing training examples in order to derive target slopes for V.

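As an illustration of Eq. (5), the sketch below extracts a target slope from two toy stand-in networks for V and M. It uses numerical Jacobians for brevity, whereas EBNN computes these derivatives analytically by backpropagation; the network shapes and names are assumptions made for this example only.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 175          # feature dimension used in Section 5
    GAMMA = 0.98

    # Toy stand-ins for the two networks (illustrative, not the NeuroChess nets):
    # V maps a feature vector to a scalar evaluation, M predicts the feature
    # vector two half-moves ahead.
    W_v = 0.01 * rng.standard_normal(N)
    W_m = np.eye(N) + 0.01 * rng.standard_normal((N, N))

    def V(x):
        return float(np.tanh(W_v @ x))

    def M(x):
        return np.tanh(W_m @ x)

    def jacobian(f, x, eps=1e-5):
        """Numerical Jacobian of f at x; EBNN obtains these derivatives
        analytically, which is cheap because the networks are differentiable."""
        y0 = np.atleast_1d(f(x))
        J = np.zeros((y0.size, x.size))
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            J[:, i] = (np.atleast_1d(f(x + d)) - y0) / eps
        return J

    def target_slope(s_t):
        """Eq. (5): dV_target/ds_t ~= gamma * dV/ds_{t+2} * dM/ds_t."""
        s_t2 = M(s_t)                      # model's prediction of s_{t+2}
        dV = jacobian(V, s_t2)             # shape (1, N)
        dM = jacobian(M, s_t)              # shape (N, N)
        return GAMMA * (dV @ dM).ravel()   # shape (N,)

    s = rng.standard_normal(N)
    print(target_slope(s).shape)           # (175,)

The resulting slope vector is what Tangent-Prop [13] fits alongside the scalar target from Eq. (2).
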
4 Training Issues

In this section we will briefly discuss some training issues that are essential for learning good evaluation functions in the domain of chess. This list of points has mainly been produced through practical experience with the NeuroChess and related TD approaches. It illustrates the importance of a careful design of the input representation, the sampling rule, and the parameter setting in a domain as complex as chess.

Sampling. The vast majority of chess boards are, loosely speaking, not interesting. If, for example, the opponent leads by more than a queen and a rook, one is most likely to lose. Without an appropriate sampling method there is the danger that the learner spends most of its time learning from uninteresting examples. Therefore, NeuroChess interleaves self-play and expert play for guiding the sampling process. More specifically, after presenting a random number of expert moves generated from a large database of grand-master games, NeuroChess completes the game by playing itself. This sampling mechanism has been found to be of major importance to learn a good evaluation function in a reasonable amount of time.

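A toy sketch of this sampling scheme is given below, assuming a hypothetical database of expert games, a placeholder self-play move generator, and a placeholder termination test; none of these helpers come from the original system.

    import random

    random.seed(0)

    # Hypothetical stand-ins: a database of expert games (each a list of moves)
    # and a self-play move generator backed by the current evaluation network.
    expert_games = [["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"],
                    ["d2d4", "d7d5", "c2c4", "e7e6"]]

    def self_play_move(position):
        # Placeholder: NeuroChess would search here with the current network V.
        return "search-selected-move"

    def game_over(position):
        # Placeholder termination test.
        return len(position) >= 8

    def sample_training_game():
        """Interleave expert play and self-play as described above: replay a
        random-length prefix of an expert game, then let the current program
        finish the game by playing itself."""
        game = random.choice(expert_games)
        cut = random.randint(0, len(game))    # random number of expert moves
        position = list(game[:cut])           # expert prefix
        while not game_over(position):
            position.append(self_play_move(position))
        return position

    print(sample_training_game())
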
Quiescence. In the domain of chess certain boards are harder to evaluate than others. For example, in the middle of an ongoing material exchange, evaluation functions often fail to produce a good assessment. Thus, most chess programs search selectively. A common criterion for determining the depth of search is called quiescence. This criterion basically detects material threats and deepens the search correspondingly. NeuroChess' search engine does the same. Consequently, the evaluation function V is only trained using quiescent boards.

Smoothness. Obviously, using the raw, canonical board description as input representation is a poor choice. This is because small changes on the board can cause a huge difference in value, contrasting the smooth nature of neural network representations. Therefore, NeuroChess maps chess board descriptions into a set of board features. These features were carefully designed by hand.

Discounting. The variable γ in Eq. (2) allows values to be discounted over time. Discounting has frequently been used to bound otherwise infinite sums of pay-off. One might be inclined to think that in the game of chess no discounting is needed, as values are bounded by definition. Indeed, without discounting the evaluation function predicts the probability for winning, in the ideal case. In practice, however, random disturbances of the evaluation function can seriously hurt learning, for reasons given in [4, 17]. Empirically we found that learning failed completely when no discount factor was used. Currently, NeuroChess uses γ = 0.98.

Learning rate. TD approaches minimize a Bellman equation [2]. In the NeuroChess domain, a close-to-optimal approximation of the Bellman equation is the constant function V(s) ≡ 0. This function violates the Bellman equation only at the end of games (Eq. (1)), which is rare if complete games are considered. To prevent this, we amplified the learning rate for final values by a factor of 20, which was experimentally found to produce sufficiently non-constant evaluation functions.

Software architecture. Training is performed completely asynchronously on up to 20 workstations simultaneously. One of the workstations acts as a weight server, keeping track of the most recent weights and biases of the evaluation network. The other workstations can dynamically establish links to the weight server and contribute to the process of weight refinement. The main process also monitors the state of all other workstations and restarts processes when necessary. Training examples are stored in local ring buffers (1000 items per workstation).

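The following sketch illustrates the weight-server idea with Python threads standing in for workstations; the class name, the toy gradient update, and the locally generated example are all assumptions for illustration, not the original distributed implementation.

    import threading
    import numpy as np

    class WeightServer:
        """Minimal sketch of the weight-server idea: one process keeps the
        authoritative network weights; workers fetch the latest copy, compute
        an update locally, and send it back."""
        def __init__(self, num_weights):
            self._lock = threading.Lock()
            self._weights = np.zeros(num_weights)

        def fetch(self):
            with self._lock:
                return self._weights.copy()

        def apply_update(self, delta):
            with self._lock:
                self._weights += delta

    def worker(server, steps=100, lr=0.01):
        rng = np.random.default_rng()
        for _ in range(steps):
            w = server.fetch()
            # Placeholder "training example" drawn from a local ring buffer;
            # in NeuroChess this would be a (features, target, slope) tuple.
            x = rng.standard_normal(w.size)
            target = 0.0
            grad = (w @ x - target) * x        # squared-error gradient, toy model
            server.apply_update(-lr * grad)

    server = WeightServer(num_weights=175)
    threads = [threading.Thread(target=worker, args=(server,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(server.fetch()[:5])
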
5 Results

In this section we will present results obtained with the NeuroChess architecture. Prior to learning an evaluation function, the model M (175 input, 165 hidden, and 175 output units) is trained using a database of 120,000 expert games. NeuroChess then learns an evaluation network V (175 input units, 0 to 80 hidden units, and one output unit). To evaluate the level of play, NeuroChess plays against GNU-Chess in regular time intervals. Both players employ the same search mechanism, which is adopted from GNU-Chess. Thus far, experiments lasted for 2 days to 2 weeks on 1 to 20 SUN Sparc Stations.

Figure 3: NeuroChess against GNU-Chess; the figure shows the complete game record (67 moves) and the final board. NeuroChess plays white. Parameters: both players searched to depth 3, which could be extended by quiescence search to at most 11. The evaluation network had no hidden units. Approximately 90% of the training boards were sampled from expert play.

A typical game is depicted in Fig. 3. This game has been chosen because it illustrates both the strengths and the shortcomings of the NeuroChess approach. The opening of NeuroChess is rather weak. In the first three moves NeuroChess moves its queen to the center of the board.¹ NeuroChess then escapes an attack on its queen in move 4, gets an early pawn advantage in move 12, attacks black's queen pertinaciously through moves 15 to 23, and successfully exchanges a rook. In move 33, it captures a strategically important pawn, which, after chasing black's king for a while and sacrificing a knight for no apparent reason, finally leads to a new queen (move 63). Four moves later black is mate. This game is prototypical. As can be seen from this and various other games, NeuroChess has learned successfully to protect its material, to trade material, and to protect its king. It has not learned, however, to open a game in a coordinated way, and it also frequently fails to play short endgames even if it has a material advantage (this is due to the short planning horizon). Most importantly, it still plays incredibly poor openings, which are often responsible for a draw or a loss. Poor openings do not surprise, however, as TD propagates values from the end of a game to the beginning.

¹ This is because in the current version NeuroChess still heavily uses expert games for sampling. Whenever a grand-master moves its queen to the center of the board, the queen is usually safe, and there is indeed a positive correlation between having the queen in the center and winning in the database. NeuroChess falsely deduces that having the queen in the center is good. This effect disappears when the level of self-play is increased, but this comes at the expense of drastically increased training time, since self-play requires search.

Table 1 shows a performance comparison of NeuroChess versus GNU-Chess, with and without the explanation-based learning strategy. This table illustrates that NeuroChess wins approximately 13% of all games against GNU-Chess, if both use the same search engine. It also illustrates the utility of explanation-based learning in chess.

Table 1: Performance of NeuroChess vs. GNU-Chess during training. The numbers show the total number of games won against GNU-Chess, using the same number of games for testing as for training. This table also shows the importance of the explanation-based learning strategy in EBNN. Parameters: both learners used the original GNU-Chess features, the evaluation network had 80 hidden units, and search was cut at depth 2 or 4, respectively (no quiescence extensions).

    # of games   GNU depth 2, NeuroChess depth 2     GNU depth 4, NeuroChess depth 2
                 Back-propagation      EBNN          Back-propagation      EBNN
     100                 1                0                  0                0
     200                 6                2                  0                0
     500                35               13                  1                0
    1000                73               85                  2                1
    1500               130              135                  3                3
    2000               190              215                  3                8
    2400               239              316                  3               11

6 Discussion

This paper presents NeuroChess, an approach for learning to play chess from the final outcomes of games. NeuroChess integrates TD, inductive neural network learning, and a neural network version of explanation-based learning. The latter component analyzes games using knowledge that was previously learned from expert play. Particular care has been taken in the design of an appropriate feature representation, sampling methods, and parameter settings. Thus far, NeuroChess has successfully managed to beat GNU-Chess in several hundreds of games. However, the level of play still compares poorly to GNU-Chess and human chess players.

Despite the initial success, NeuroChess faces two fundamental problems which both might well be in the way of excellent chess play. Firstly, training time is limited, and it is to be expected that excellent chess skills develop only with excessive training time. This is particularly the case if only the final outcomes are considered. Secondly, with each step of TD-learning NeuroChess loses information. This is partially because the features used for describing chess boards are incomplete, i.e., knowledge about the feature values alone does not suffice to determine the actual board exactly. But, more importantly, neural networks do not have the discriminative power to assign arbitrary values to all possible feature combinations. It is therefore unclear that a TD-like approach will ever, for example, develop good chess openings.

Another problem of the present implementation is related to the trade-off between knowledge and search. It has been well recognized that the ultimate cost in chess is determined by the time it takes to generate a move. Chess programs can generally invest their time in search, or in the evaluation of chess boards (search-knowledge trade-off) [3]. Currently, NeuroChess does a poor job, because it spends most of its time computing board evaluations. Computing a large neural network function takes two orders of magnitude longer than evaluating an optimized linear evaluation function (like that of GNU-Chess). VLSI neural network technology offers a promising perspective to overcome this critical shortcoming of sequential neural network simulations.

Acknowledgment

The author gratefully acknowledges the guidance and advice by Hans Berliner, who provided the features for representing chess boards, and without whom the current level of play would be much worse. He also thanks Tom Mitchell for his suggestions on the learning methods, and Horst Aurisch for his help with GNU-Chess and the database.

References

[1] Thomas S. Anantharaman. A Statistical Study of Selective Min-Max Search in Computer Chess. PhD thesis, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, 1990. Technical Report CMU-CS-90-173.

[2] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[3] Hans J. Berliner, Gordon Goetsch, Murray S. Campbell, and Carl Ebeling. Measuring the performance potential of chess programs. Artificial Intelligence, 43:7-20, 1990.

[4] Justin A. Boyan. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, San Mateo, CA, 1995. Morgan Kaufmann. (To appear.)

[5] Gerald DeJong and Raymond Mooney. Explanation-based learning: An alternative view. Machine Learning, 1(2):145-176, 1986.

[6] Michael Gherrity. A Game-Learning Machine. PhD thesis, University of California, San Diego, 1993.

[7] Tom M. Mitchell, Rich Keller, and Smadar Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1(1):47-80, 1986.

[8] Tom M. Mitchell and Sebastian Thrun. Explanation based learning: A comparison of symbolic and neural network approaches. In Paul E. Utgoff, editor, Proceedings of the Tenth International Conference on Machine Learning, pages 197-204, San Mateo, CA, 1993. Morgan Kaufmann.

[9] Tom M. Mitchell and Sebastian Thrun. Explanation-based neural network learning for robot control. In S. J. Hanson, J. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 287-294, San Mateo, CA, 1993. Morgan Kaufmann.

[10] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3:210-229, 1959.

[11] Johannes Schafer. Erfolgsorientiertes Lernen mit Tiefensuche in Bauernendspielen. Technical report, Universität Karlsruhe, 1993. (In German.)

[12] Nikolaus Schraudolph, Peter Dayan, and Terrence J. Sejnowski. Using the TD(lambda) algorithm to learn an evaluation function for the game of Go. In Advances in Neural Information Processing Systems 6, San Mateo, CA, 1994. Morgan Kaufmann.

[13] Patrice Simard, Bernard Victorri, Yann LeCun, and John Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903, San Mateo, CA, 1992. Morgan Kaufmann.

[14] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3, 1988.

[15] Prasad Tadepalli. Planning in games using approximately learned macros. In Proceedings of the Sixth International Workshop on Machine Learning, pages 221-223, Ithaca, NY, 1989. Morgan Kaufmann.

[16] Gerald J. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8, 1992.

[17] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Erlbaum Associates.
", "award": [], "sourceid": 1007, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}