{"title": "Solitaire: Man Versus Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 1553, "page_last": 1560, "abstract": null, "full_text": " Solitaire: Man Versus Machine\n\n\n\n Xiang Yan, Persi Diaconis, Paat Rusmevichientong, Benjamin Van Roy\n\n\n Stanford University\n {xyan,persi.diaconis,bvr}@stanford.edu\n\n\n Cornell University\n paatrus@orie.cornell.edu\n\n\n\n\n Abstract\n\n In this paper, we use the rollout method for policy improvement to\n analyze a version of Klondike solitaire. This version, sometimes called\n thoughtful solitaire, has all cards revealed to the player, but then follows\n the usual Klondike rules. A strategy that we establish, using iterated\n rollouts, wins about twice as many games on average as an expert human\n player does.\n\n\n1 Introduction\n\nThough proposed more than fifty years ago [1, 7], the effectiveness of the policy\nimprovement algorithm remains a mystery. For discounted or average reward Markov decision\nproblems with n states and two possible actions per state, the tightest known worst-case\nupper bound in terms of n on the number of iterations taken to find an optimal policy is\nO(2^n/n) [9]. This is also the tightest known upper bound for deterministic Markov\ndecision problems. It is surprising, however, that there are no known examples of Markov\ndecision problems with two possible actions per state for which more than n + 2 iterations\nare required. A more intriguing fact is that even for problems with a large number of states\n(say, in the millions), an optimal policy is often delivered after only half a dozen or so\niterations.\n\nIn problems where n is enormous (say, a googol), this may appear to be a moot point\nbecause each iteration requires Ω(n) compute time. In particular, a policy is represented\nby a table with one action per state and each iteration improves the policy by updating\neach entry of this table. 
In such large problems, one might resort to a suboptimal\nheuristic policy, taking the form of an algorithm that accepts a state as input and generates an\naction as output. An interesting recent development in dynamic programming is the\nrollout method. Pioneered by Tesauro and Galperin [13, 2], the rollout method leverages the\npolicy improvement concept to amplify the performance of any given heuristic. Unlike the\nconventional policy improvement algorithm, which computes an optimal policy off-line so\nthat it may later be used in decision-making, the rollout method performs its computations\non-line at the time when a decision is to be made. When making a decision, rather than\napplying the heuristic policy directly, the rollout method computes an action that would\nresult from an iteration of policy improvement applied to the heuristic policy. This does\nnot require Ω(n) compute time since only one entry of the table is computed.\n\nThe way in which actions are generated by the rollout method may be considered an\nalternative heuristic that improves on the original. One might consider applying the rollout\nmethod to this new heuristic. Another heuristic would result, again with improved\nperformance. Iterated a sufficient number of times, this process would lead to an optimal policy.\nHowever, iterating is usually not an option. Computational requirements grow\nexponentially in the number of iterations, and the first iteration, which improves on the original\nheuristic, is already computationally intensive. For this reason, prior applications of the\nrollout method have involved only one iteration [3, 4, 5, 6, 8, 11, 12, 13]. For example, in\nthe interesting study of Backgammon by Tesauro and Galperin [13], moves were generated\nin five to ten seconds by the rollout method running on configurations of sixteen to\nthirty-two nodes in a network of IBM SP1 and SP2 parallel-RISC supercomputers with parallel\nspeedup efficiencies of 90%. 
A second iteration of the rollout method would have been\ninfeasible, requiring about six orders of magnitude more time per move.\n\nIn this paper, we apply the rollout method to a version of solitaire, modeled as a\ndeterministic Markov decision problem with over 52! states. Determinism drastically reduces\ncomputational requirements, making it possible to consider iterated rollouts (see footnote 1). With five\niterations, a game played by our Java implementation takes about one hour and forty-five minutes on\naverage on a SUN Blade 2000 machine with two 900MHz CPUs, and the probability of\nwinning exceeds that of a human expert by about a factor of two. Our study represents an\nimportant contribution both to the study of the rollout method and to the study of solitaire.\n\n\n\n2 Solitaire\n\n\nIt is one of the embarrassments of applied mathematics that we cannot determine the odds\nof winning the common game of solitaire. Many people play this game every day, yet\nsimple questions such as 'What is the chance of winning?', 'How does this chance depend on\nthe version I play?', and 'What is a good strategy?' remain beyond mathematical analysis.\n\nAccording to Parlett [10], solitaire came into existence when fortune-telling with cards\ngained popularity in the eighteenth century. Many variations of solitaire exist today, such\nas Klondike, Freecell, and Carpet. Popularized by Microsoft Windows, Klondike has\nprobably become the most widely played.\n\nKlondike is played with a standard deck of cards: there are four suits (Spades, Clubs,\nHearts, and Diamonds) each made up of thirteen cards ranked 1 through 13: Ace, 2, 3, ...,\n10, Jack, Queen, and King. During the game, each card resides in one of thirteen stacks (see footnote 2):\nthe pile, the talon, four suit stacks and seven build stacks. 
Each suit stack corresponds to a\nparticular suit and build stacks are labeled 1 through 7.\n\nAt the beginning of the game, cards are dealt so that there is one card in the first build stack,\ntwo cards in the second build stack, ..., and seven cards in the seventh build stack. The top\ncard on each of the seven build stacks is turned face-up while the rest of the cards in the\nbuild stacks face down. The other twenty-four cards, forming the pile, face down as well.\nThe talon is initially empty.\n\nThe goal of the game is to move all cards into the suit stacks, aces first, then twos, and so\non, with each suit stack evolving as an ordered increasing arrangement of cards of the same\nsuit. The figure below shows a typical mid-game configuration.\n\n\n\n Footnote 1: Backgammon is stochastic because play is influenced by the roll of dice.\n Footnote 2: In some solitaire literature, stacks are referred to as piles.\n\nWe will study a version of solitaire in which the identity of each card at each position is\nrevealed to the player at the beginning of the game but the usual Klondike rules still apply.\nThis version is played by a number of serious solitaire players as a much more difficult\nversion than standard Klondike. Parlett [10] offers further discussion. We call this game\nthoughtful solitaire and now spell out the rules.\n\nOn each turn, the player can move cards from one stack to another in the following manner:\n\n Face-up cards of a build stack, called a card block, can be moved to the top of\n another build stack provided that the build stack to which the block is being moved\n accepts the block. Note that all face-up cards on the source stack must be moved\n together. After the move, these cards would then become the top cards of the stack\n to which they are moved, and their ordering is preserved. The card originally\n immediately beneath the card block, now the top card in its stack, is turned\n face-up. 
In the event that all cards in the source stack are moved, the player has an\n empty stack (see footnote 3).\n\n The top face-up card of a build stack can be moved to the top of a suit stack,\n provided that the suit stack accepts the card.\n\n The top card of a suit stack can be moved to the top of a build stack, provided that\n the build stack accepts the card.\n\n If the pile is not empty, a move can deal its top three cards to the talon, which\n maintains its cards in a first-in-last-out order. If the pile becomes empty, the player\n can redeal all the cards on the talon back to the pile in one card move. A redeal\n preserves the ordering of cards. The game allows an unlimited number of redeals.\n\n A card on the top of the talon can be moved to the top of a build stack or a suit\n stack, provided that the stack to which the card is being moved accepts the card.\n\n A build stack can only accept an incoming card block if the top card on the build\n stack is adjacent to and braided with the bottom card of the block. A card is\n adjacent to another card of rank r if it is of rank r + 1. A card is braided with a\n card of suit s if its suit is of a color different from s. Additionally, if a build stack\n is empty, it can only accept a card block whose bottom card is a King.\n\n A suit stack can only accept an incoming card of its corresponding suit. If a suit\n stack is empty, it can only accept an Ace. If it is not empty, the incoming card\n must be adjacent to the current top card of the suit stack.\n\n Footnote 3: It would seem to some that since the identity of all cards is revealed to the player, whether a\ncard is face-up or face-down is irrelevant. We retain this property of cards as it is still important in\ndescribing the rules and formulating our strategy.\n\nAs stated earlier, the objective is to end up with all cards on suit stacks. 
If this event occurs,\nthe game is won.\n\n\n3 Expert Play\n\nWe were introduced to thoughtful solitaire by a senior American mathematician (former\npresident of the American Mathematical Society and indeed a famous combinatorialist)\nwho had spent a number of years studying the game. He finds this version of solitaire much\nmore thought-provoking and challenging than the standard Klondike. For instance, while\nthe latter is usually played quickly, our esteemed expert averages about 20 minutes for each\ngame of thoughtful solitaire. He carefully played and recorded 2,000 games, achieving a\nwin rate of 36.6%.\n\nWith this background, it is natural to wonder how well an optimal player can perform at\nthoughtful solitaire. As we will illustrate, our best strategy offers a win rate of about 70%.\n\n\n4 Machine Play\n\nWe have developed two strategies that play thoughtful solitaire. Both are based on the\nfollowing general procedure:\n\n 1. Identify the set of legal moves.\n\n 2. Select and execute a legal move.\n\n 3. If all cards are on suit stacks, declare victory and terminate.\n\n 4. If the new card configuration repeats a previous one, declare loss and terminate (see footnote 4).\n\n 5. Otherwise, repeat from step 1.\n\nThe only nontrivial task in this procedure is selection from the legal moves. We will first\ndescribe a heuristic strategy for selecting a legal move based on a card configuration.\nAfterwards, we will discuss the use of rollouts.\n\n\n4.1 A Heuristic Strategy\n\nOur heuristic strategy is based on part of the Microsoft Windows Klondike scoring system:\n\n The player starts the game with an initial score of 0.\n\n Footnote 4: One straightforward way to determine if a card configuration has previously occurred is to store\nall encountered card configurations. 
Instead of doing so, however, we notice that there are three kinds\nof moves that could lead us into an infinite loop: pile-talon moves, moves that could juggle a card\nblock between two build stacks, and moves that could juggle a card block between a build stack and\na suit stack. Hence, to simplify our strategy, we disable moves of the second kind. Our heuristic will\nalso practically disable the third kind. For the first kind, we record whether any card move other than a\npile-talon move has occurred since the last redeal. If not, we detect an infinite loop and declare loss.\n\n Whenever a card is moved from a build stack to a suit stack, the player gains 5\n points.\n\n Whenever a card is moved from the talon to a build stack, the player gains 5 points.\n\n Whenever a card is moved from a suit stack to a build stack, the player loses 10\n points.\n\nIn our heuristic strategy, we assign a score to each card move based on the above scoring\nsystem. We assign the score zero to any moves not covered by the above rules. When\nselecting a move, we choose among those that maximize the score.\n\nIntuitively, this heuristic seems reasonable. The player has incentive to move cards from\nthe talon to a build stack and from a build stack to a suit stack. One important element\nthat the heuristic fails to capture, however, is what move to make when multiple moves\nmaximize the score. 
Such decisions, especially during the early phases of a game, are\ncrucial.\n\nTo select among moves that maximize score, we break the tie by assigning the following\npriorities:\n\n If the card move is from a build stack to another build stack, one of the following\n two assignments of priority occurs:\n\n If the move turns an originally face-down card face-up, we assign this move\n a priority of k + 1, where k is the number of originally face-down cards on\n the source stack before the move takes place.\n If the move empties a stack, we assign this move a priority of 1.\n\n If the card move is from the talon to a build stack, one of the following three\n assignments of priority occurs:\n\n If the card being moved is not a King, we assign the move priority 1.\n If the card being moved is a King and its matching Queen is in the pile, in\n the talon, in a suit stack, or is face-up in a build stack, we assign the move\n priority 1.\n If the card being moved is a King and its matching Queen is face-down in a\n build stack, we assign the move priority -1.\n\n For card moves not covered by the description above, we assign a priority of\n 0.\n\nIn addition to introducing priorities, we modify the Windows Klondike scoring system\nfurther by adding the following change: in a card move, if the card being moved is a King\nand its matching Queen is face-down in a build stack, we assign the move a score of 0.\n\nNote that given our assignment of scores and priorities, we practically disable card moves\nfrom a suit stack to a build stack. 
Because such moves have a negative score and a card\nmove from the pile to the talon or from the talon to the pile has zero score and is almost\nalways available, our strategy would always choose the pile-talon move over the moves\nfrom a suit stack to a build stack.\n\nIn the case when multiple moves equal in priority maximize the score, we randomly select\na move among them.\n\nThe introduction of priority improves our original game-playing strategy in two ways.\nFirst, when we encounter a situation where we can move either one of two blocks on two separate\nbuild stacks atop the top card of a third build stack, we prefer moving the block whose stack\nhas more face-down cards. Intuitively, such a move would strive to balance the number of\nface-down cards in stacks. Our experiments show that this heuristic significantly improves\nsuccess rate. Second, our prioritization scheme makes us more\ndeliberate in which King to select to enter an empty build stack. For instance, consider a\nsituation where the King of Hearts and the King of Spades, both on the pile, are vying for\nan empty build stack and there is a face-up Queen of Diamonds on a build stack. We should\ncertainly move the King of Spades to the empty build stack so that the Queen of Diamonds\ncan be moved on top of it. Whereas our prioritization warrants such consideration, our\noriginal heuristic does not.\n\n\n4.2 Rollouts\n\n\nConsider a strategy h that maps a card configuration x to a legal move h(x). What we\ndescribed in the previous section was one example of a strategy h. In this section, we will\ndiscuss the rollout method as a procedure for amplifying the performance of any strategy.\nGiven a strategy h, this procedure generates an improved strategy h', called a rollout\nstrategy. This idea was originally proposed by Tesauro and Galperin [13] and builds on the\npolicy improvement algorithm of dynamic programming [1, 7].\n\nConsider a card configuration x. 
A strategy h would make a move h(x). A rollout strategy\nwould make a move h'(x), determined as follows:\n\n\n 1. For each legal move a, simulate the remainder of the game, taking move a and\n then employing strategy h thereafter.\n\n 2. If any of these simulations leads to victory, choose one of them randomly and let\n h'(x) be the corresponding move a (see footnote 5).\n\n 3. If none of the simulations leads to victory, let h'(x) = h(x).\n\n\nWe can then iterate this procedure to generate a further improved strategy h'' that is a\nrollout strategy relative to h'. It is easy to prove that after a finite number of such iterations,\nwe would arrive at an optimal strategy [2]. However, the computation time required grows\nexponentially in the number of iterations, so this may not be practical. Nevertheless, one\nmight try a few iterations and hope that this offers the bulk of the mileage.\n\n\n\n5 Results\n\n\nWe implemented in Java the heuristic strategy and the procedure for computing rollout\nstrategies. Simulation results are provided in the following table and chart. We randomly\ngenerated a large number of games and played them with our algorithms in an effort to\napproximate the success probability with the percentage of games actually won. To determine\na sufficient number of games to simulate, we used the Central Limit Theorem to compute\nthe confidence bounds on success probability for each algorithm with a confidence level of\n99%. For the original heuristic and 1 through 3 rollout iterations, we managed to achieve\nconfidence bounds of [-1.4%, 1.4%]. For 4 and 5 rollout iterations, due to time constraints,\nwe simulated fewer games and obtained weaker confidence bounds. Interestingly,\nhowever, after 5 rollout iterations, the resulting strategy wins almost twice as frequently as our\nesteemed mathematician.\n\n\n\n\n Footnote 5: Note that at this stage, we could record all moves made in this simulation and declare victory.\nThat is how our program is implemented. 
However, we leave step 2 as stated for the sake of clarity\nin presentation.\n\n Player        Success Rate  Games Played  Average Time Per Game  99% Confidence Bounds\n Human expert  36.6%         2,000         20 minutes             ±2.78%\n Heuristic     13.05%        10,000        0.021 seconds          ±0.882%\n 1 rollout     31.20%        10,000        0.67 seconds           ±1.20%\n 2 rollouts    47.60%        10,000        7.13 seconds           ±1.30%\n 3 rollouts    56.83%        10,000        1 minute 36 seconds    ±1.30%\n 4 rollouts    60.51%        1,000         18 minutes 7 seconds   ±4.00%\n 5 rollouts    70.20%        200           1 hour 45 minutes      ±8.34%\n\n\n\n\n\n6 Future Challenges\n\n\nOne limitation of our rollout method lies in its recursive nature. Although it is clearly\nformulated and hence easily implemented in software, the algorithm does not provide a\nsimple and explicit strategy for human players to make decisions.\n\nOne possible direction for further exploration would be to compute a value function,\nmapping the state of the game to an estimate of whether or not the game can be won. Certainly,\nthis function could not be represented exactly, but we could try approximating it in terms\nof a linear combination of features of the game state, as is common in the approximate\ndynamic programming literature [2].\n\nWe have also attempted proving an upper bound for the success rate of thoughtful\nsolitaire by enumerating sets of initial card configurations that would force loss. Currently,\nthe tightest upper bound we can rigorously prove is 98.81%. Speed optimization of our\nsoftware implementation is under way. If the success rate bound is improved and we are\nable to run additional rollout iterations, we may produce a verifiable near-optimal strategy\nfor thoughtful solitaire.\n\n\n\nAcknowledgment\n\n\nThis material is based upon work supported by the National Science Foundation under\nGrant ECS-9985229.\n\nReferences\n\n[1] R. Bellman. Applied Dynamic Programming. Princeton University Press, 1957.\n\n[2] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific,\n 1996.\n\n[3] D. P. 
Bertsekas, J. N. Tsitsiklis, and C. Wu. Rollout Algorithms for Combinatorial\n Optimization. Journal of Heuristics, 3:245-262, 1997.\n\n[4] D. P. Bertsekas and D. A. Castañón. Rollout Algorithms for Stochastic Scheduling\n Problems. Journal of Heuristics, 5:89-108, 1999.\n\n[5] D. Bertsimas and R. Demir. An Approximate Dynamic Programming Approach to\n Multi-dimensional Knapsack Problems. Management Science, 4:550-565, 2002.\n\n[6] D. Bertsimas and I. Popescu. Revenue Management in a Dynamic Network\n Environment. Transportation Science, 37:257-277, 2003.\n\n[7] R. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.\n\n[8] A. McGovern, E. Moss, and A. Barto. Building a Basic Block Instruction Scheduler\n Using Reinforcement Learning and Rollouts. Machine Learning, 49:141-160, 2002.\n\n[9] Y. Mansour and S. Singh. On the Complexity of Policy Iteration. In Fifteenth\n Conference on Uncertainty in Artificial Intelligence, 1999.\n\n[10] D. Parlett. A History of Card Games. Oxford University Press, 1991.\n\n[11] N. Secomandi. Analysis of a Rollout Approach to Sequencing Problems with\n Stochastic Routing Applications. Journal of Heuristics, 9:321-352, 2003.\n\n[12] N. Secomandi. A Rollout Policy for the Vehicle Routing Problem with Stochastic\n Demands. Operations Research, 49:796-802, 2001.\n\n[13] G. Tesauro and G. Galperin. On-line Policy Improvement Using Monte-Carlo Search.\n In Advances in Neural Information Processing Systems, 9:1068-1074, 1996.\n", "award": [], "sourceid": 2568, "authors": [{"given_name": "Xiang", "family_name": "Yan", "institution": null}, {"given_name": "Persi", "family_name": "Diaconis", "institution": null}, {"given_name": "Paat", "family_name": "Rusmevichientong", "institution": null}, {"given_name": "Benjamin", "family_name": "Roy", "institution": null}]}