{"title": "No-Press Diplomacy: Modeling Multi-Agent Gameplay", "book": "Advances in Neural Information Processing Systems", "page_first": 4474, "page_last": 4485, "abstract": "Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal. Reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players. We present DipNet, a neural-network-based policy model for No Press Diplomacy. The model was trained on a new dataset of more than 150,000 human games. Our model is trained by supervised learning (SL) from expert trajectories, which is then used to initialize a reinforcement learning (RL) agent trained through self-play. Both the SL and the RL agent demonstrate state-of-the-art No Press performance by beating popular rule-based bots.", "full_text": "No Press Diplomacy: Modeling Multi-Agent\n\nGameplay\n\nPhilip Paquette 1\n\npcpaquette@gmail.com\n\nYuchen Lu 1\n\nluyuchen.paul@gmail.com\n\nSteven Bocco 1\n\nstevenbocco@gmail.com\n\nMax O. Smith 3\n\nmax.olan.smith@gmail.com\n\nSatya Ortiz-Gagn\u00e9 1\n\ns.ortizgagne@gmail.com\n\nJonathan K. Kummerfeld 3\n\njkummerf@umich.edu\n\nSatinder Singh 3\n\nbaveja@umich.edu\n\nJoelle Pineau 2\n\njpineau@cs.mcgill.ca\n\nAaron Courville 1\n\naaron.courville@gmail.com\n\nAbstract\n\nDiplomacy is a seven-player non-stochastic, non-cooperative game, where agents\nacquire resources through a mix of teamwork and betrayal. Reliance on trust and\ncoordination makes Diplomacy the \ufb01rst non-cooperative multi-agent benchmark\nfor complex sequential social dilemmas in a rich environment. In this work, we\nfocus on training an agent that learns to play the No Press version of Diplomacy\nwhere there is no dedicated communication channel between players. We present\nDipNet, a neural-network-based policy model for No Press Diplomacy. The model\nwas trained on a new dataset of more than 150,000 human games. Our model is\ntrained by supervised learning (SL) from expert trajectories, which is then used to\ninitialize a reinforcement learning (RL) agent trained through self-play. Both the\nSL and RL agents demonstrate state-of-the-art No Press performance by beating\npopular rule-based bots.\n\n1\n\nIntroduction\n\nDiplomacy is a seven-player game where players attempt to acquire a majority of supply centers\nacross Europe. To acquire supply centers, players can coordinate their units with other players\nthrough dialogue or signaling. Coordination can be risky, because players can lie and even betray\neach other. Reliance on trust and negotiation makes Diplomacy the \ufb01rst non-cooperative multi-agent\nbenchmark for complex sequential social dilemmas in a rich environment.\nSequential social dilemmas (SSD) are situations where one individual experiences con\ufb02ict between\nself- and collective-interest over repeated interactions [1]. In Diplomacy, players are faced with a\nSSD in each phase of the game. Should I help another player? Do I betray them? Will I need their\nhelp later? The actions they choose will be visible to the other players and in\ufb02uence how other\nplayers interact with them later in the game. 
The outcomes of each interaction are non-stochastic.\nThis characteristic sets Diplomacy apart from previous benchmarks where players could additionally\n\n1 Mila, University of Montreal\n2 Mila, McGill University\n3 University of Michigan\n\u00a7 Dataset and code can be found at https://github.com/diplomacy/research\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Orders submitted in S1901M\n\n(b) Map adjacencies.\n\nFigure 1: The standard Diplomacy map\n\nrely on chance to win [2, 3, 4, 5, 6]. Instead, players must put their faith in other players and not in\nthe game\u2019s mechanics (e.g. having a player role a critical hit).\nDiplomacy is also one of the \ufb01rst SSD games to feature a rich environment. A single player may\nhave up to 34 units, with each unit having an average of 26 possible actions. This astronomical action\nspace makes planning and search intractable. Despite this, thinking at multiple time scales is an\nimportant aspect of Diplomacy. Agents need to be able to form a high-level long-term strategy (e.g.\nwith whom to form alliances) and have a very short-term execution plan for their strategy (e.g. what\nunits should I move in the next turn). Agents must also be able to adapt their plans, and beliefs about\nothers (e.g. trustworthiness) depending on how the game unfolds.\nIn this work, we focus on training an agent that learns to play the No Press version of Diplomacy.\nThe No Press version does not allow agents to communicate with each other using an explicit\ncommunication channel. Communication between agents still occurs through signalling in actions [7,\n2]. This allows us to \ufb01rst focus on the key problem of having an agent that has learned the game\nmechanics, without introducing the additional complexity of learning natural language and learning\ncomplex interactions between agents.\nWe present DipNet, a fully end-to-end trained neural-network-based policy model for No Press\nDiplomacy. To train our architecture, we collect the \ufb01rst large scale dataset of Diplomacy, containing\nmore than 150,000 games. We also develop a game engine that is compatible with DAIDE [8],\na research framework developed by the Diplomacy research community, and that enables us to\ncompare with previous rule-based state-of-the-art bots from the community [9]. Our agent is trained\nwith supervised learning over the expert trajectories. Its parameters are then used to initialize a\nreinforcement learning agent trained through self-play.\nIn order to better evaluate the performance of agents, we run a tournament among different variants\nof the model as well as baselines, and compute the TrueSkill score [10]. Our tournament shows\nthat both our supervised learning (SL) and reinforcement learning (RL) agents consistently beat\nbaseline rule-based agents. In order to further demonstrate the affect of architecture design, we\nperform an ablation study with different variants of the model, and \ufb01nd that our architecture has\nhigher prediction accuracy for support orders even in longer sequences. This ability suggests that\nour model is able to achieve tactical coordination with multiple units. Finally we perform a coalition\nanalysis by computing the ratio of cross-power support, which is one of the main methods for players\nto cooperate with each other. 
Our results suggest that our architecture is able to issue more effective\ncross-power orders.\n\n2 No Press Diplomacy: Game Overview\n\nDiplomacy is a game where seven European powers (Austria, England, France, Germany, Italy,\nRussia, and Turkey) are competing over supply centers in Europe at the beginning of the 20th century.\n\n2\n\n\fThere are 34 supply centers in the game scattered across 75 provinces (board positions, including\nwater). A power interacts with the game by issuing orders to army and \ufb02eet units. The game is split\ninto years (starting in 1901) and each year has 5 phases: Spring Movement, Spring Retreat, Fall\nMovement, Fall Retreat, and Winter Adjustment.\n\nMovements. There are 4 possible orders during a movement phase: Hold, Move, Support, and\nConvoy. A hold order is used by a unit to defend the province it is occupying. Hold is the default\norder for a unit if no orders are submitted. A move order is used by a unit to attack an adjacent\nprovince. Armies can move to any adjacent land or coastal province, while \ufb02eets can move to water\nor coastal provinces by following a coast.\nSupport orders can be given by any power to increase the attack strength of a moving unit or to\nincrease the defensive strength of a unit holding, supporting, or convoying. Supporting a moving unit\nis only possible if the unit issuing the support order can reach the destination of the supported move\n(e.g. Marseille can support Paris moving to Burgundy, because an army in Marseille could move to\nBurgundy). If the supporting unit is attacked, its support is unsuccessful.\nIt is possible for an army unit to move over several water locations in one phase and attack another\nprovince by being convoyed by several \ufb02eets. A matching convoy order by the convoying \ufb02eets and a\nvalid path of non-dislodged \ufb02eets (explained below) is required for the convoy to be successful.\n\nRetreats.\nIf an attack is successful and there is a unit in the conquered province, the unit is dislodged\nand is given a chance to retreat. There are 2 possible orders during a retreat phase: Retreat and\nDisband. A retreat order is the equivalent of a move order, but only happens during the retreat phase.\nA unit can only retreat to a location that is 1) unoccupied, 2) adjacent, and 3) not a standoff location\n(i.e. left vacant because of a failed attack). A disband order indicates that the unit at the speci\ufb01ed\nprovince should be removed from the board. A dislodged unit is automatically disbanded if either\nthere are no possible retreat locations, it fails to submit a retreat order during the retreat phase, or two\nunits retreat to the same location.\n\nAdjustments. The adjustment phase happens once every year. During that phase, supply centers\nchange ownership if a unit from one power occupies a province with a supply center owned by\nanother power. There are three possible orders during an adjustment phase: Build, Disband, and\nWaive. If a power has more units than supply centers, it needs to disband units. If a power has more\nsupply centers than units, it can build additional units to match its number of supply centers. Units\ncan only be built in a power\u2019s original supply centers (e.g. Berlin, Kiel, and Munich for Germany),\nand the power must still control the chosen province and it must be unoccupied. 
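To make the build rule concrete, the check for legal build sites could look like the following sketch (our own illustrative Python, not part of the released game engine; the state fields are assumed names):

def legal_build_sites(power, state):
    # Builds are only allowed when a power has more supply centers than units.
    surplus = len(state.centers[power]) - len(state.units[power])
    if surplus <= 0:
        return [], 0
    # A power may build only in its original (home) supply centers, and only if
    # it still controls that center and the province is currently unoccupied.
    occupied = {u.location for units in state.units.values() for u in units}
    sites = [sc for sc in state.home_centers[power]
             if sc in state.centers[power] and sc not in occupied]
    # At most `surplus` builds may be placed; any remaining builds are waived.
    return sites, surplus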
A power can also\ndecide to waive builds, leaving them with fewer units than supply centers.\n\nCommunication in a No Press game\nIn a No Press game, even if there are no messages, players\ncan communicate between one another by using orders as signals [7]. For example, a player can\ndeclare war by positioning their units in an offensive manner, they can suggest possible moves with\nsupport and convoy orders, propose alliances with support orders, propose a draw by convoying\nunits to Switzerland, and so on. Sometimes even invalid orders can be used as communication, e.g.,\nRussia could order their army in St. Petersburg to support England\u2019s army in Paris moving to London.\nThis invalid order could communicate that France should attack England, even though Paris and\nSt. Petersburg are not adjacent to London.\n\nVariants There are three important variants of the game: Press, Public Press, and No Press. In\na Press game, players are allowed to communicate with one another privately. In a Public Press\ngame, all messages are public announcements and can be seen by all players. In a No Press game,\nplayers are not allowed to send any messages. In all variants, orders are written privately and become\npublic simultaneously, after adjudication. There are more than 100 maps available to play the game\n(ranging from 2 to 17 players), though the original Europe map is the most played, and as a result is\nthe focus of this work. The \ufb01nal important variation is check, where invalid orders can be submitted\n(but are then not applied), versus no-check where only valid orders are submitted. This distinction is\nimportant, because it determines the inclusion of a side-channel for communication through invalid\norders.\n\n3\n\n\fGame end. The game ends when a power is able to reach a majority of the supply centers (18/34 on\nthe standard map), or when players agree to a draw. When a power is in the lead, it is fairly common\nfor other players to collaborate to prevent the leading player from making further progress and to\nforce a draw.\n\nScoring system. Points in a diplomacy game are usually computed either with 1) a draw-based\nscoring system (points in a draw are shared equally among all survivors), or 2) a supply-center count\nscoring system (points in a draw are proportional to the number of supply centers). Players in a\ntournament are usually ranked with a modi\ufb01ed Elo or TrueSkill system [10][11][12][13].\n\n3 Previous Work\n\nIn recent years, there has been a de\ufb01nite trend toward the use of games of increasingly complexity as\nbenchmarks for AI research including: Atari [14], Go [15][16], Capture the Flag [17], Poker [3][4],\nStarcraft [6], and DOTA [5]. However, most of these games do not focus on communication. The\nbenchmark most similar to our No Press Diplomacy setting is Hanabi [2], a card game that involves\nboth communication and action. However Hanabi is fully cooperative, whereas in Diplomacy, ad hoc\ncoalitions form and degenerate dynamically throughout the evolution of the game. We believe this\nmakes Diplomacy unique and deserving of special attention.\nPrevious work on Diplomacy has focused on building rule-based agents with substantial feature\nengineering. DipBlue [18] is a rule-based agent that can negotiate and reason about trust. It was\ndeveloped for the DipGame platform [19], a DAIDE-compatible framework [8] that also introduced\na language hierarchy. 
DBrane [20] is a search-based bot that uses branch-and-bound search, with state evaluation to truncate as appropriate. Another work, most similar to ours, uses self-play to learn a game strategy leveraging patterns of board states [21]. Our work is the first attempt to use a data-driven method on a large-scale dataset.

Our work is also related to the learning-to-cooperate literature. In classical game theory, the Iterated Prisoner's Dilemma (IPD) has been the main focus for SSDs, and a tit-for-tat strategy has been shown to be highly effective [22]. Recent work [23] has proposed an algorithm that takes into account the impact of one agent's policy on the update of the other agents. The resulting algorithm was able to achieve reciprocity and cooperation in both the IPD and a more complex coin game with deep neural networks. There is also a line of work on solving social dilemmas with deep RL, which has shown that enhanced cooperation and meaningful communication can be promoted via causal inference [24], inequity aversion [25], and understanding the consequences of intentions [26]. However, most of this work has only been applied to simple settings. It is still an open question whether these methods could scale up to a complex domain like Diplomacy.

Our work is also related to behavioral game theory, which extends game theory to account for human cognitive biases and limitations [27]. Such behavior is observed in Diplomacy when players make non-optimal moves due to ill-conceived betrayals or personal vengeance against a perceived slight.

4 DipNet: A Generative Model of Unit Orders

4.1 Input Representation

Our model takes two inputs: the current board state and the previous phase's orders. To represent the board state, we encode for each province: the type of province, whether there is a unit on that province, which power owns the unit, whether a unit can be built or removed in that province, the dislodged unit type and power, and who owns the supply center, if the province has one. If a fleet is on a coast (e.g. on the North Coast of Spain), we also record the unit information in the coast's parent province.

Previous orders are encoded in a way that helps infer which powers are allies and enemies. For instance, for the order 'A MAR S A PAR - BUR' (army in Marseille supports the army in Paris moving to Burgundy), we would encode: 1) 'Army' as the unit type, 2) the power owning 'A MAR', 3) 'support' as the order type, 4) the power owning 'A PAR' (i.e. the friendly power), 5) the power, if any, having either a unit on BUR or owning the BUR supply center (i.e. the opponent power), and 6) the owner of the BUR supply center, if it exists. Based on our empirical findings, orders from the last movement phase are enough to infer the current relationship between the powers. Our representation scheme is shown in Figure 2, with one vector per province.

4.2 Graph Convolution Network with FiLM

To take advantage of the adjacency information on the Diplomacy map, we propose to use a graph convolution-based encoder [28]. Suppose x^l_{bo} ∈ R^{81×d^l} is the board state embedding produced by layer l and x^l_{po} ∈ R^{81×d^l} is the corresponding embedding of the previous orders, where x^0_{bo} and x^0_{po} are the input representations described in Section 4.1. We will now describe the process for encoding the board state; the process for the previous-order embedding is the same. Suppose A is the normalized 81 × 81 map adjacency matrix.
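Before stating the update equations, the encoder block can be summarized with a rough PyTorch-style sketch (our own illustration under assumed names and shapes, not the released implementation):

import torch
import torch.nn as nn

class GraphConvFiLMBlock(nn.Module):
    # One encoder block: neighbor aggregation through the normalized 81 x 81
    # adjacency matrix A, batch normalization over the feature dimension whose
    # scale and shift (FiLM) are produced from an embedding of the player's
    # power and the current season, then ReLU and a residual connection when
    # the input and output widths match.
    def __init__(self, d_in, d_out, d_cond):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.bn = nn.BatchNorm1d(d_out, affine=False)   # conditional BN: scale/shift come from FiLM
        self.film = nn.Linear(d_cond, 2 * d_out)        # produces (gamma, beta)
        self.use_residual = (d_in == d_out)

    def forward(self, x, A, power_season):
        # x: [batch, 81, d_in]; A: [81, 81]; power_season: [batch, d_cond]
        y = torch.matmul(A, self.proj(x))                 # aggregate neighbor information
        y = self.bn(y.transpose(1, 2)).transpose(1, 2)    # BatchNorm over the feature dimension
        gamma, beta = self.film(power_season).chunk(2, dim=-1)
        z = y * gamma.unsqueeze(1) + beta.unsqueeze(1)    # broadcast across provinces
        out = torch.relu(z)
        return out + x if self.use_residual else out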
We first aggregate neighbor information by

    y^l_{bo} = BatchNorm(A x^l_{bo} W_{bo} + b_{bo}),

where W_{bo} ∈ R^{d^l_{bo} × d^{l+1}_{bo}}, b_{bo} ∈ R^{d^{l+1}_{bo}}, y^l_{bo} ∈ R^{81 × d^{l+1}_{bo}}, and BatchNorm operates on the last dimension. We perform conditional batch normalization using FiLM [29, 30], which has been shown to be an effective method of fusing multimodal information in many domains [31]. Batch normalization is conditioned on the player's power p and the current season s (Spring, Fall, Winter):

    γ_{bo}, β_{bo} = f^l_{bo}([p; s]),    z^l_{bo} = y^l_{bo} ⊙ γ_{bo} + β_{bo},    (1)

where f^l_{bo} is a linear transformation, γ_{bo}, β_{bo} ∈ R^{d^{l+1}}, and both the addition and the multiplication are broadcast across provinces. Finally, we add a ReLU and residual connections [32] where possible:

    x^{l+1}_{bo} = ReLU(z^l_{bo}) + x^l_{bo}   if d^l_{bo} = d^{l+1}_{bo}
    x^{l+1}_{bo} = ReLU(z^l_{bo})              if d^l_{bo} ≠ d^{l+1}_{bo}

The board state and the previous orders are both encoded through L of these blocks, and there is no weight sharing. Concatenation is performed at the end, giving h_{enc} = [x^L_{bo}, x^L_{po}], where h^i_{enc} is the final embedding of the province with index i. We choose L = 16 in our experiments.

4.3 Decoder

In order to achieve coordination between units, sequential decoding is required. However, there is no natural sequential ordering of the units. We hypothesize that orders are usually given to clusters of nearby units, and that processing neighbouring units together is therefore effective. We use a top-left to bottom-right ordering based on topological sorting, aiming to prevent jumping across the map during decoding.

Suppose i_t is the index of the province requiring an order at time t. We use an LSTM to decode its order o_t by

    h^t_{dec} = LSTM(h^{t-1}_{dec}, [h^{i_t}_{enc}; o_{t-1}]).    (2)

We then apply a mask so that only orders that are valid for that location on the current board can be selected:

    o_t = MaskedSoftmax(h^t_{dec}).    (3)

Figure 2: Encoding of the board state and previous orders.

Figure 3: DipNet architecture.

5 Datasets and Game Engine

Our dataset is generated by aggregating 156,468 anonymized human games. We also develop an open-source game engine for this dataset to standardize its format and rule out invalid orders. The dataset contains 33,279 No Press games, 1,290 Public Press games, 105,266 Press games (messages are not included), and 16,633 games not played on the standard map. We are going to release the dataset along with the game engine5. Detailed dataset statistics are shown in Table 1.

The game engine is also integrated with the Diplomacy Artificial Intelligence Development Environment (DAIDE) [8], an AI framework from the Diplomacy community. This enables us to compare with several state-of-the-art rule-based bots [9, 18] that have been developed on DAIDE. DAIDE also defines a progression of 14 symbolic language levels (from 0 to 130) for negotiation and communication, which could be useful for research on Press Diplomacy. Each level defines which tokens agents are allowed to exchange. For instance, a No Press bot would be considered level 0, while a level 20 bot can propose peace, alliances, and orders.

6 Experiments

6.1 Supervised Learning

We first present our supervised learning results.
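Throughout this section, greedy decoding means stepping through the units in the fixed ordering of Section 4.3 and taking the argmax of the masked softmax of Eq. (3) at each step. A minimal sketch of that loop (our own illustrative code, with the per-location sets of valid orders assumed to come from the game engine) is:

import torch
import torch.nn as nn
import torch.nn.functional as F

def greedy_decode(h_enc, province_order, valid_orders, cell, out_proj, order_emb):
    # h_enc: [81, d_enc] final province embeddings from the encoder.
    # province_order: province indices in top-left-to-bottom-right order.
    # valid_orders[i]: LongTensor of order-vocabulary ids allowed at province i.
    # cell: nn.LSTMCell; out_proj: nn.Linear to the order vocabulary;
    # order_emb: nn.Embedding of previously decoded orders.
    h = torch.zeros(1, cell.hidden_size)
    c = torch.zeros(1, cell.hidden_size)
    prev = torch.zeros(1, order_emb.embedding_dim)   # "no previous order"
    orders = []
    for i in province_order:
        inp = torch.cat([h_enc[i].unsqueeze(0), prev], dim=-1)
        h, c = cell(inp, (h, c))
        scores = out_proj(h).squeeze(0)
        mask = torch.full_like(scores, float('-inf'))
        mask[valid_orders[i]] = 0.0                   # keep only valid orders
        choice = int(F.softmax(scores + mask, dim=-1).argmax())
        orders.append(choice)
        prev = order_emb(torch.tensor([choice]))
    return orders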
Our test set is composed of the last 5% of games\nsorted by game id in alphabetical order. To measure the impact of each model component, we ran\nan ablation study. The results are presented in Table 2. We evaluate the model with both greedy\ndecoding and teacher forcing. We measure the accuracy of each unit-order (e.g. \u2018A PAR - BUR\u2019), and\nthe accuracy of the complete set of orders for a power (e.g. \u2018A PAR - BUR\u2019, \u2018F BRE - MAO\u2018). We\n\n5Researchers can request access to the dataset by contacting webdipmod@gmail.com. An executive summary\n\ndescribing the research purpose and execution of a con\ufb01dentiality agreement are required.\n\nTable 1: Dataset statistics\n\nSurvival rate for opponents\n\nAustria\nEngland\nFrance\nGermany\nItaly\nRussia\nTurkey\nTotal\n\nWin% Draw% Defeated%\n4.3%\n4.6%\n6.1%\n5.3%\n3.6%\n6.6%\n7.2%\n39.9%\n\n33.4%\n43.7%\n43.8%\n35.9%\n36.5%\n35.2%\n43.1%\n60.1%\n\nAUS\n48.1% 100%\n29.1%\n25.7%\n40.4%\n40.2%\n39.8%\n26.0%\n\nENG\n79%\n47% 100%\n40%\n44%\n15%\n25%\n9%\n37%\n\nFRA\n62%\n30%\n26% 100%\n26%\n65%\n52%\n78%\n59%\n\nGER\n55%\n16%\n22%\n39% 100%\n56%\n77%\n71%\n65%\n\nITA\n40%\n49%\n45%\n61%\n61% 100%\n38%\n56%\n49%\n\nRUS\n29%\n33%\n59%\n27%\n56%\n63% 100%\n23%\n51%\n\nTUR\n15%\n80%\n77%\n80%\n25%\n42%\n31% 100%\n50%\n64%\n\n6\n\n\fTable 2: Evaluation of supervised models: Predicting human orders.\n\nModel\n\nDipNet\nUntrained\nWithout FiLM\nMasked Decoder (No Board)\nBoard State Only\nAverage Embedding\n\nAccuracy per unit-order\nTeacher forcing Greedy\n47.5%\n6.4%\n47.0%\n26.5%\n45.6%\n46.2%\n\n61.3%\n6.6%\n60.7%\n47.8%\n60.3%\n59.9%\n\nAccuracy for all orders\n\nTeacher forcing Greedy\n23.5%\n4.2%\n22.9%\n14.7%\n23.0%\n23.2%\n\n23.5%\n4.2%\n22.9%\n14.7%\n22.9%\n23.2%\n\n\ufb01nd that our untrained model with a masked decoder performs better than the random model, which\nsuggests the effectiveness of masking out invalid orders. We observe a small drop in performance\nwhen we only provide the board state. We also observe a performance drop when we use the average\nembedding over all locations as input to the LSTM decoder (rather than using attention based on the\nlocation the current order is being generated for).\nTo further demonstrate the difference between these variants we focus on the model\u2019s ability to predict\nsupport orders, which are a crucial element for successful unit coordination. Table 3 shows accuracy\non this order type, separated based on the position of the unit in the prediction sequence. We can\nsee that although the performance of different variants of the model are close to each other when\npredicting support for the \ufb01rst unit, the difference is larger when predicting support for the 16th unit.\nThis indicates that our architecture helps DipNet maintain tactical coordination across multiple units.\n\nTable 3: Comparison of the models\u2019 ability to predict support orders with greedy decoding.\n\nSupport Accuracy\n\n1st location\n\n16th location\n\nDipNet\nBoard State Only\nWithout FiLM\nAverage Embedding\n\n40.3%\n38.5%\n40.0%\n39.1%\n\n32.2%\n25.9%\n30.3%\n27.9%\n\n6.2 Reinforcement Learning and Self-play\n\nWe train DipNet with self-play (same model for all powers, with shared updates) using an A2C\narchitecture [14] with n-step (n=15) returns for approximately 20,000 updates (approx. 1 million\nsteps). 
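For reference, the n-step return targets used by such an actor-critic update can be computed as in the sketch below (our own illustration of the standard n-step bootstrap; the discount factor is an assumption, as it is not stated here):

def n_step_returns(rewards, values, bootstrap_value, n=15, gamma=0.99):
    # rewards[t] and values[t] for t = 0..T-1 along one trajectory segment;
    # bootstrap_value is the critic's estimate V(s_T) at the segment boundary.
    # G_t = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k}),
    # with k = min(n, T - t).
    T = len(rewards)
    v = list(values) + [bootstrap_value]
    targets = []
    for t in range(T):
        k = min(n, T - t)
        g = sum((gamma ** j) * rewards[t + j] for j in range(k))
        targets.append(g + (gamma ** k) * v[t + k])
    return targets  # advantage for the policy term is targets[t] - values[t]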
As a reward function, we use the average of (1) a local reward function (+1/-1 when a\nsupply center is gained or lost (updated every phase and not just in Winter)), and (2) a terminal\nreward function (for a solo victory, the winner gets 34 points; for a draw, the 34 points are divided\n\nTable 4: Diplomacy agents comparison when played against each other, with one agent controlling\none power and the other six powers controlled by copies of the other agent.\n\nAgent A (1x) Agent B (6x)\nSL DipNet\nSL DipNet\nSL DipNet\nSL DipNet\nSL DipNet\nRandom\nGreedyBot\nDumbbot\nAlbert 6.0\nRL DipNet\n\nRandom\nGreedyBot\nDumbbot\nAlbert 6.0\nRL DipNet\nSL DipNet\nSL DipNet\nSL DipNet\nSL DipNet\nSL DipNet\n\nTrueSkill A-B % Win % Most SC % Survived % Defeated\n0.0%\n0.0%\n0.6%\n23.1%\n52.1%\n95.6%\n91.5%\n95.0%\n81.3%\n39.6%\n\n28.1 - 19.7\n28.1 - 20.9\n28.1 - 19.2\n28.1 - 24.5\n28.1 - 27.4\n19.7 - 28.1\n20.9 - 28.1\n19.2 - 28.1\n24.5 - 28.1\n27.4 - 28.1\n\n100.0%\n97.8%\n74.8%\n28.9%\n6.2%\n0.0%\n0.0%\n0.0%\n5.8%\n14.0%\n\n0.0%\n1.2%\n9.2%\n5.3%\n0.3%\n0.0%\n0.0%\n0.1%\n0.4%\n3.5%\n\n0.0%\n1.0%\n15.4%\n42.8%\n41.4%\n4.4%\n8.5%\n5.0%\n12.6%\n42.9%\n\n# Games\n1,000\n1,000\n950\n208\n1,000\n1,000\n1,000\n950\n278\n1,000\n\n7\n\n\fproportionally to the number of supply centers). The policy is pre-trained using DipNet SL described\nabove. We also used a value function pre-trained on human games by predicting the \ufb01nal rewards.\nThe opponents we have used to evaluate our agents were: (1) Random. This agent selects an action\nper unit uniformly at random from the list of valid orders. (2) GreedyBot. This agent greedily tries\nto conquer neighbouring supply centers and is not able to support any attacks. (3) Dumbbot [33].\nThis rule-based bot computes a value for each province, ranks orders using computed province values\nand uses rules to maintain coordination. (4) Albert Level 0 [9]. Albert is the current state-of-the-art\nagent. It evaluates the probability of success of orders, and builds alliances and trust between powers,\neven without messages. To evaluate performance, we run a 1-vs-6 tournament where each game is\nstructured with one power controlled by one agent and the other six controlled by copies of another\nagent. We also run another tournament where each player is randomly sampled from our model\npools and compute TrueSkill scores for these models [10]. We report both the 1-vs-6 results and\nthe TrueSkill scores in Table 4. From the TrueSkill score we can see both the SL (28.1) and RL\n(27.4) versions of DipNet consistently beat the baseline models as well as Albert (24.5), the previous\nstate-of-art bot. Although there is no signi\ufb01cant difference in TrueSkill between SL and RL, the\nperformance of RL vs 6 SL is better than SL vs 6 RL with an increasing win rate.\n\n6.3 Coalition Analysis\n\nIn the No Press games, cross-power support is the major method for players to signal and coordinate\nwith each other for mutual bene\ufb01t. In light of this, we propose a coalition analysis method to further\nunderstand agents\u2019 behavior. 
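The two statistics we use, defined formally below, count how often supports are aimed at a foreign power and how often those supports actually change the adjudicated outcome. A sketch of how they could be tallied from saved games (our own code, with the per-phase log fields assumed) is:

def support_ratios(phases):
    # phases: per-phase logs, each exposing its support orders with the
    # supporting power, the supported power, and a flag saying whether the
    # adjudicated attack/defense would have failed without that support
    # (obtainable by re-running the engine with the support removed).
    n_support = n_x = n_eff_x = 0
    for phase in phases:
        for s in phase.support_orders:
            n_support += 1
            if s.supporting_power != s.supported_power:
                n_x += 1
                if s.was_decisive:
                    n_eff_x += 1
    x_ratio = n_x / max(n_support, 1)
    eff_x_ratio = n_eff_x / max(n_x, 1)
    return x_ratio, eff_x_ratio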
We de\ufb01ne a cross-power support (X-support) as being when a power\nsupports a foreign power, and we de\ufb01ne an effective cross-power support as being a cross-power\norder support without which the supported attack or defense would fail:\n\nX-support-ratio =\n\n#X-support\n#support\n\n,\n\nEff-X-support-ratio =\n\n#Effective X-support\n\n#X-support\n\nThe X-support-ratio re\ufb02ects how frequently the support order is used for cooperation/communication,\nwhile the Eff-X-support-ratio re\ufb02ects the ef\ufb01ciency or utility of cooperation. We launch 1000 games\nwith our model variants for all powers and compute this ratio for each one. Our results are shown in\nTable 5.\nFor human games, across different game variants, there is only minor variations in the X-support-\nratio, but the Eff-X-support-ratio varies substantially. This shows that when people are allowed to\ncommunicate, their effectiveness in cooperation increases, which is consistent with previous results\nthat cheap talk promotes cooperation for agents with aligned interests [34, 35]. In terms of agent\nvariants, although RL and SL models show similar TrueSkill scores, their behavior is very different.\nRL agents seem to be less effective at cooperation but have more frequent cross-power support.\nThis decrease in effective cooperation is also consistent with past observations that naive policy\ngradient methods fail to learn cooperative strategies in a non-cooperative setting such as the iterated\nprisoner dilemma [36]. Ablations of the SL model have a similar X-support-ratio, but suffer from\na loss in Eff-X-support-ratio. This further suggests that our DipNet architecture can help agents\ncooperate more effectively. The Masked Decoder has a very high X-support-ratio, suggesting that the\nmarginal distribution of support is highest among agent games, however, it suffers from an inability\nto effectively cooperate (i.e. very small Eff-X-support-ratio). This is also expected since the Masked\nDecoder has no board information to understand the effect of supports.\n\n7 Conclusion\n\nIn this work, we present DipNet, a fully end-to-end policy for the strategy board game No Press\nDiplomacy. We collect a large dataset of human games to evaluate our architecture. We train our agent\nwith both supervised learning and reinforcement learning self-play. Our tournament results suggest\nthat DipNet is able to beat state-of-the-art rule-based bots in the No Press setting. Our ablation study\nand coalition analysis demonstrate that DipNet can effectively coordinate units and cooperate with\nother players. We propose Diplomacy as a new multi-agent benchmark for dynamic cooperation\nemergence in a rich environment. Probably the most interesting result to emerge from our analysis is\nthe difference between the SL agent (trained on human data) and the RL agent (trained with self-play).\nOur coalition analysis suggests that the supervised agent was able to learn to coordinate support\n\n8\n\n\fTable 5: Coalition formation: Diplomacy agents comparison\n\nX-support-ratio Eff-X-support-ratio\n\nHuman Game\n\nNo Press\nPublic Press\nPress\n\nAgents Games RL DipNet\nSL DipNet\nBoard State Only\nWithout FiLM\nMasked Decoder (No Board)\n\n14.7%\n11.8%\n14.4%\n9.1%\n7.4%\n7.3%\n6.7%\n12.1%\n\n7.7%\n12.1%\n23.6%\n5.3%\n10.2%\n7.5%\n7.9%\n0.62%\n\norders while this behaviour appears to deteriorate during self-play training. 
We believe that the most\nexciting path for future research for Diplomacy playing agents is in the exploration of methods such\nas LOLA [36] that are better able to discover collaborative strategies among self-interested agents.\n\nAcknowledgments\n\nWe would like to thank Kestas Kuliukas, T. Nguyen (Zultar), Joshua M., Timothy Jones, and the\nwebdiplomacy team for their help on the dataset and on model evaluation. We would also like to\nthank Florian Strub, Claire Lasserre, and Nissan Pow for helpful discussions. Moreover, we would\nlike to thank Mario Huys and Manus Hand for developing DPjudge, that was used to develop our\ngame engine. Finally, we would like to thank John Newbury and Jason van Hal for helpful discussions\non DAIDE, Compute Canada for providing the computing resources to run the experiments, and\nSamsung for providing access to the DGX-1 to run our experiments.\n\nReferences\n[1] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent\nreinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference\non Autonomous Agents and MultiAgent Systems, pages 464\u2013473. International Foundation for\nAutonomous Agents and Multiagent Systems, 2017.\n\n[2] Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song,\nEmilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi\nchallenge: A new frontier for ai research. arXiv preprint arXiv:1902.00506, 2019.\n\n[3] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus\n\nbeats top professionals. Science, 359(6374):418\u2013424, 2018.\n\n[4] Matej Morav\u02c7c\u00edk, Martin Schmid, Neil Burch, Viliam Lis`y, Dustin Morrill, Nolan Bard, Trevor\nDavis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level\narti\ufb01cial intelligence in heads-up no-limit poker. Science, 356(6337):508\u2013513, 2017.\n\n[5] OpenAI. Openai \ufb01ve. https://blog.openai.com/openai-five/, 2018.\n\n[6] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wo-\njciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo\nEwalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin\nDalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor\nCai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen,\nYuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul,\nTimothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaS-\ntar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/\nalphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.\n\n[7] Simon Szykman. Dp w1995a: Communication in no-press diplomacy. http://uk.diplom.\n\norg/pouch/Zine/W1995A/Szykman/Syntax.html, 1995. Accessed: 2019-05-01.\n\n9\n\n\f[8] David Norman. Daide - diplomacy arti\ufb01cial intelligence development environment. http:\n\n//www.daide.org.uk/, 2013. Accessed: 2019-05-01.\n\n[9] Jason van Hal. Diplomacy ai - albert. https://sites.google.com/site/diplomacyai/,\n\n2013. Accessed: 2019-05-01.\n\n[10] Ralf Herbrich, Tom Minka, and Thore Graepel. TrueskillTM: a bayesian skill rating system. In\n\nAdvances in neural information processing systems, pages 569\u2013576, 2007.\n\n[11] Tony Nichols. A player rating system for diplomacy. 
http://www.stabbeurfou.org/docs/\nAccessed:\n\narticles/en/DP_S1998R_Diplomacys_New_Rating_System.html, 1998.\n2019-05-01.\n\n[12] WebDiplomacy.\n\nhttps://sites.google.com/view/\nwebdipinfo/ghost-ratings/ghost-ratings-explained, 2019. Accessed: 2019-05-01.\n\nGhost-ratings explained.\n\n[13] super dipsy. Site scoring system. https://www.playdiplomacy.com/forum/viewtopic.\n\nphp?f=565&t=34913, 2013. Accessed: 2019-05-01.\n\n[14] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli-\ncrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In International conference on machine learning, pages 1928\u20131937,\n2016.\n\n[15] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-\nche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.\nMastering the game of go with deep neural networks and tree search. Nature, 529(7587):484,\n2016.\n\n[16] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur\nGuez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of\ngo without human knowledge. Nature, 550(7676):354, 2017.\n\n[17] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia\nCastaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al.\nHuman-level performance in \ufb01rst-person multiplayer games with population-based deep rein-\nforcement learning. arXiv preprint arXiv:1807.01281, 2018.\n\n[18] Andr\u00e9 Ferreira, Henrique Lopes Cardoso, and Luis Paulo Reis. Dipblue: A diplomacy agent\nwith strategic and trust reasoning. In ICAART 2015-7th International Conference on Agents\nand Arti\ufb01cial Intelligence, Proceedings, 2015.\n\n[19] Angela Fabregues, David Navarro, Alejandro Serrano, and Carles Sierra. Dipgame: A testbed\nfor multiagent systems. In Proceedings of the 9th International Conference on Autonomous\nAgents and Multiagent Systems: volume 1-Volume 1, pages 1619\u20131620. International Foundation\nfor Autonomous Agents and Multiagent Systems, 2010.\n\n[20] Dave Jonge and Jordi Gonz\u00e1lez Sabat\u00e9. Negotiations over large agreement spaces, 2015.\n\n[21] Ari Shapiro, Gil Fuchs, and Robert Levinson. Learning a game strategy using pattern-weights\nand self-play. In International Conference on Computers and Games, pages 42\u201360. Springer,\n2002.\n\n[22] Robert Axelrod and William D Hamilton.\n\n211(4489):1390\u20131396, 1981.\n\nThe evolution of cooperation.\n\nScience,\n\n[23] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to\ncommunicate with deep multi-agent reinforcement learning. In Advances in Neural Information\nProcessing Systems, pages 2137\u20132145, 2016.\n\n[24] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A Ortega,\nDJ Strouse, Joel Z Leibo, and Nando de Freitas. Intrinsic social motivation via causal in\ufb02uence\nin multi-agent rl. arXiv preprint arXiv:1810.08647, 2018.\n\n10\n\n\f[25] Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Due\u00f1ez-Guzman, Anto-\nnio Garc\u00eda Casta\u00f1eda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity\naversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Infor-\nmation Processing Systems, pages 3326\u20133336, 2018.\n\n[26] Alexander Peysakhovich and Adam Lerer. 
Consequentialist conditional cooperation in social\n\ndilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.\n\n[27] Colin F Camerer, Teck-Hua Ho, and Juin Kuan Chong. Behavioural game theory: thinking,\nlearning and teaching. In Advances in Understanding Strategic Behaviour, pages 120\u2013180.\nSpringer, 2004.\n\n[28] Thomas N Kipf and Max Welling. Semi-supervised classi\ufb01cation with graph convolutional\n\nnetworks. arXiv preprint arXiv:1609.02907, 2016.\n\n[29] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film:\nVisual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on\nArti\ufb01cial Intelligence, 2018.\n\n[30] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to\naccelerate training of deep neural networks. In Advances in Neural Information Processing\nSystems, pages 901\u2013909, 2016.\n\n[31] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron\n\nCourville, and Yoshua Bengio. Feature-wise transformations. Distill, 3(7):e11, 2018.\n\n[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[33] David Norman. Daide - clients. http://www.daide.org.uk/clients.html, 2013. Ac-\n\ncessed: 2019-05-01.\n\n[34] Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark.\n\nEmergent communication through negotiation. 2018.\n\n[35] Vincent P Crawford and Joel Sobel. Strategic information transmission. Econometrica: Journal\n\nof the Econometric Society, pages 1431\u20131451, 1982.\n\n[36] Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and\nIn Proceedings of the 17th\nIgor Mordatch. Learning with opponent-learning awareness.\nInternational Conference on Autonomous Agents and MultiAgent Systems, pages 122\u2013130.\nInternational Foundation for Autonomous Agents and Multiagent Systems, 2018.\n\n11\n\n\fA Tournament and TrueSkill Score\n\nTo compute TrueSkill, we ran a tournament where we randomly sampled a model for each power.\nFor each game, we computed the ranks by elimination order (\ufb01rst power eliminated is 7th, second\neliminated is 6th, ...), and the surviving powers by number of supply centers. We computed our\nTrueskill ratings using 1,378 games. We used the available python package6, and the default TrueSkill\nenvironment con\ufb01guration. The initial TrueSkill \u03c3 is set to 8.33, and after 1,378 games the \u03c3 is 0.64,\nwhich shows that the scores have converged.\nNote that in our current evaluation settings we do not consider the existing power imbalance in\nthe game, e.g., winning as Austria is harder than winning as France. Using a more sophisticated\nevaluation which includes the prior on the role of players is an interesting topic for future work.\n\nB Effects of Graph Convolution Layers\n\nWe tested the affect of graph convolution layers by varying the number of layers in Table 6. After 8\nlayers of GCN there is no further improvement. 
We think this could be related to the fact that in the\nstandard map of Diplomacy the most distant locations are connected by paths of length 8.\n\nTable 6: Effect of GCN Layers\n\nModel\n\nDipNet\n8 GCN Layers\n4 GCN Layers\n2 GCN Layers\n\nAccuracy per unit-order\nTeacher forcing Greedy\n47.5%\n47.4%\n47.2%\n45.9%\n\n61.3%\n61.2%\n61.1%\n60.3%\n\nAccuracy for all orders\n\nTeacher forcing Greedy\n23.5%\n23.4%\n23.2%\n23.2%\n\n23.5%\n23.4%\n23.2%\n23.2%\n\nC Effects of Decoding Granularity\n\nWe experimented with different decoding granularity. Instead of decoding each unit order as an\natomic option (e.g. \u2019A PAR H\u2019), we can decode as a sequence (e.g. [\u2019A\u2019, \u2019PAR\u2019, \u2019H\u2019]). We call the\n\ufb01rst one unit-based and the latter token-based. We \ufb01nd that although the token-based model had better\nperformance in terms of token accuracy, it had lower unit accuracy. We also try the transformer-based\ndecoder. The results are in Table 7\n\nModel\n\nLSTM (order-based)\nTransformer (order-based)\nLSTM (token-based)\nTransformer (token-based)\n\nTable 7: Comparison of different decoding granularity\n\nAccuracy per unit-order\nTeacher forcing Greedy\n47.5%\n47.5%\n46.7%\n45.4%\n\n61.3%\n60.7%\n60.3%\n58.4%\n\nAccuracy for all orders\n\nAccuracy per token\n\nTeacher forcing Greedy\n23.5%\n23.4%\n23.2%\n22.3%\n\n23.5%\n23.4%\n23.2%\n22.3%\n\nTeacher forcing Greedy\n74.4%\n74.4%\n73.8%\n72.9%\n\n82.1%\n81.9%\n90.6%\n90.0%\n\n6https://trueskill.org/\n\n12\n\n\f", "award": [], "sourceid": 2513, "authors": [{"given_name": "Philip", "family_name": "Paquette", "institution": "Universit\u00e9 de Montr\u00e9al - MILA"}, {"given_name": "Yuchen", "family_name": "Lu", "institution": "University of Montreal"}, {"given_name": "SETON STEVEN", "family_name": "BOCCO", "institution": "MILA"}, {"given_name": "Max", "family_name": "Smith", "institution": "University of Michigan"}, {"given_name": "Satya", "family_name": "O.-G.", "institution": "MILA"}, {"given_name": "Jonathan", "family_name": "Kummerfeld", "institution": "University of Michigan"}, {"given_name": "Joelle", "family_name": "Pineau", "institution": "McGill University"}, {"given_name": "Satinder", "family_name": "Singh", "institution": "University of Michigan"}, {"given_name": "Aaron", "family_name": "Courville", "institution": "U. Montreal"}]}