MAVEN: Multi-Agent Variational Exploration

Anuj Mahajan*†    Tabish Rashid†    Mikayel Samvelyan‡    Shimon Whiteson†

Abstract

Centralised training with decentralised execution is an important setting for cooperative deep multi-agent reinforcement learning due to communication constraints during execution and computational tractability in training. In this paper, we analyse value-based methods that are known to have superior performance in complex environments [43]. We specifically focus on QMIX [40], the current state-of-the-art in this domain.
We show that the representational constraints on the joint action-values introduced by QMIX and similar methods lead to provably poor exploration and suboptimality. Furthermore, we propose a novel approach called MAVEN that hybridises value and policy-based methods by introducing a latent space for hierarchical control. The value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. This allows MAVEN to achieve committed, temporally extended exploration, which is key to solving complex multi-agent tasks. Our experimental results show that MAVEN achieves significant performance improvements on the challenging SMAC domain [43].

1 Introduction

Cooperative multi-agent reinforcement learning (MARL) is a key tool for addressing many real-world problems, such as the coordination of robot swarms [22] and autonomous cars [6]. However, two key challenges stand between cooperative MARL and such real-world applications. First, scalability is limited by the fact that the size of the joint action space grows exponentially in the number of agents. Second, while the training process can typically be centralised, partial observability and communication constraints often mean that execution must be decentralised, i.e., each agent can condition its actions only on its local action-observation history, a setting known as centralised training with decentralised execution (CTDE).

While both policy-based [13] and value-based [40, 48, 46] methods have been developed for CTDE, the current state of the art, as measured on SMAC, a suite of StarCraft II micromanagement benchmark tasks [43], is a value-based method called QMIX [40]. QMIX tries to address the challenges mentioned above by learning factored value functions.
By decomposing the joint value function into factors that depend only on individual agents, QMIX can cope with large joint action spaces. Furthermore, because such factors are combined in a way that respects a monotonicity constraint, each agent can select its action based only on its own factor, enabling decentralised execution. However, this decentralisation comes at a price, as the monotonicity constraint restricts QMIX to suboptimal value approximations. QTRAN [44], another recent method, performs this trade-off differently by formulating multi-agent learning as an optimisation problem with linear constraints and relaxing it with L2 penalties for tractability.

In this paper, we shed light on a problem unique to decentralised MARL that arises due to inefficient exploration. Inefficient exploration hurts decentralised MARL not only in the way it hurts single-agent RL [33] (by increasing sample inefficiency [31, 30]), but also by interacting with the representational constraints necessary for decentralisation to push the algorithm towards suboptimal policies. Single-agent RL can avoid convergence to suboptimal policies using various strategies, such as increasing the exploration rate (ε) or policy variance, ensuring optimality in the limit. However, we show, both theoretically and empirically, that the same is not possible in decentralised MARL.

Furthermore, we show that committed exploration can be used to solve the above problem. In committed exploration [36], exploratory actions are performed over extended time steps in a coordinated manner.

*Correspondence to: anuj.mahajan@cs.ox.ac.uk
†Dept. of Computer Science, University of Oxford
‡Russian-Armenian University

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Committed exploration is key even in single-agent exploration but is especially important in MARL, as many problems involve long-term coordination, requiring exploration to discover temporally extended joint strategies for maximising reward. Unfortunately, none of the existing methods for CTDE are equipped with committed exploration.

To address these limitations, we propose a novel approach called multi-agent variational exploration (MAVEN) that hybridises value and policy-based methods by introducing a latent space for hierarchical control. MAVEN's value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. Thus, fixing the latent variable, each joint action-value function can be thought of as a mode of joint exploratory behaviour that persists over an entire episode. Furthermore, MAVEN uses mutual information maximisation between the trajectories and latent variables to learn a diverse set of such behaviours. This allows MAVEN to achieve committed exploration while respecting the representational constraints. We demonstrate the efficacy of our approach by showing significant performance improvements on the challenging SMAC domain.

2 Background

We model the fully cooperative multi-agent task as a Dec-POMDP [34], which is formally defined as a tuple G = ⟨S, U, P, r, Z, O, n, γ⟩. S is the state space of the environment. At each time step t, every agent i ∈ A ≡ {1, ..., n} chooses an action u^i ∈ U, which together form the joint action u ∈ U ≡ U^n. P(s′ | s, u) : S × U × S → [0, 1] is the state transition function. r(s, u) : S × U → R is the reward function shared by all agents, and γ ∈ [0, 1) is the discount factor. We consider partially observable settings, where each agent does not have access to the full state and instead samples observations z ∈ Z according to the observation function O(s, i) : S × A → Z.
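The tuple just defined can be collected into a small container; the following is a minimal, illustrative sketch (the class and field names are ours, not the paper's), with a trivial one-state, two-agent instance:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative container for the Dec-POMDP tuple G = <S, U, P, r, Z, O, n, gamma>.
# Function types mirror the definitions in the text; names are assumptions.
@dataclass
class DecPOMDP:
    states: Sequence          # S
    actions: Sequence         # U, the per-agent action set
    transition: Callable      # P(s' | s, u): S x U^n x S -> [0, 1]
    reward: Callable          # r(s, u): S x U^n -> R, shared by all agents
    observations: Sequence    # Z
    observe: Callable         # O(s, i): S x A -> Z
    n_agents: int             # n
    gamma: float              # discount factor in [0, 1)

# A trivial instance: one state, two agents, reward 1 iff both pick action 1.
game = DecPOMDP(
    states=["s0"],
    actions=[0, 1],
    transition=lambda s_next, s, u: 1.0,   # single state: always stays in s0
    reward=lambda s, u: float(u == (1, 1)),
    observations=["z0"],
    observe=lambda s, i: "z0",
    n_agents=2,
    gamma=0.95,
)
print(game.reward("s0", (1, 1)))  # -> 1.0
```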
The action-observation history for an agent i is τ^i ∈ T ≡ (Z × U)*, on which it can condition its policy π^i(u^i | τ^i) : T × U → [0, 1]. We use u^{-i} to denote the actions of all the agents other than i and follow a similar convention for the policies π^{-i}. The joint policy π has an action-value function: Q^π(s_t, u_t) = E_{s_{t+1:∞}, u_{t+1:∞}}[Σ_{k=0}^∞ γ^k r_{t+k} | s_t, u_t]. The goal of the problem is to find the optimal action-value function Q*. During centralised training, the learning algorithm has access to the action-observation histories of all agents and the full state. However, each agent can only condition on its own local action-observation history τ^i during decentralised execution (hence the name CTDE). For CTDE methods factoring action values over agents, we represent the individual agent utilities by q_i, i ∈ A. An important concept for such methods is decentralisability (see IGM in [44]), which asserts that ∃q_i such that ∀s, u:

    argmax_u Q*(s, u) = (argmax_{u^1} q_1(τ^1, u^1), ..., argmax_{u^n} q_n(τ^n, u^n)),    (1)

Fig. 1 gives the classification of MARL problems. While the containment is strict in the partially observable setting, it can be shown that all tasks are decentralisable given full observability and sufficient representational capacity.

Figure 1: Classification of MARL problems.

QMIX [40] is a value-based method that learns a monotonic approximation Q_qmix of the joint action-value function. Figure 8 in Appendix A illustrates its overall setup. QMIX factors the joint action-value Q_qmix into a monotonic nonlinear combination of individual utilities q_i of each agent, which are learnt via a utility network. A mixer network with nonnegative weights is responsible for combining the agents' utilities for their chosen actions u^i into Q_qmix(s, u).
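A toy mixer in this spirit (our construction, not the paper's network) makes the decomposition concrete: with nonnegative mixing weights, the centralised joint argmax coincides with independent per-agent argmaxes, as Eq. (1) requires. The utility values below are illustrative numbers:

```python
import itertools

# Toy monotonic mixing: joint value is a nonnegative weighted sum of
# per-agent utilities, so the argmax decomposes across agents.
q1 = {0: 0.1, 1: 0.7, 2: 0.4}   # utility q1(tau1, u1), illustrative numbers
q2 = {0: 0.5, 1: 0.2, 2: 0.9}   # utility q2(tau2, u2)
w = (2.0, 3.0)                  # nonnegative mixer weights -> monotonicity

def q_mix(u):
    # dQmix/dqi = w[i] >= 0, the monotonicity constraint from the text
    return w[0] * q1[u[0]] + w[1] * q2[u[1]]

# Centralised joint argmax over all |U|^n joint actions ...
joint = max(itertools.product(q1, q2), key=q_mix)
# ... matches the decentralised per-agent argmax, n*|U| evaluations in total:
decentralised = (max(q1, key=q1.get), max(q2, key=q2.get))
print(joint, decentralised)  # (1, 2) (1, 2)
```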
This nonnegativity ensures that ∂Q_qmix(s, u)/∂q_i(s, u^i) ≥ 0, which in turn guarantees Eq. (1). This decomposition allows for an efficient, tractable maximisation, as it can be performed in O(n|U|) time as opposed to O(|U|^n). Additionally, it allows for easy decentralisation, as each agent can independently perform an argmax. During learning, the QMIX agents use ε-greedy exploration over their individual utilities to ensure sufficient exploration. For VDN [46], the factorisation is further restricted to be just the sum of utilities: Q_vdn(s, u) = Σ_i q_i(s, u^i).

QTRAN [44] is another value-based method. Theorem 1 in the QTRAN paper guarantees optimal decentralisation by using linear constraints between agent utilities and joint action values, but it imposes O(|S||U|^n) constraints on the optimisation problem involved, where |·| gives set size. This is computationally intractable to solve in discrete state-action spaces and impossible in continuous state-action spaces. The authors propose two algorithms (QTRAN-base and QTRAN-alt) which relax these constraints using two L2 penalties. While QTRAN tries to avoid QMIX's limitations, we found that it performs poorly in practice on complex MARL domains (see Section 5), as it deviates from the exact solution due to these relaxations.

3 Analysis

In this section, we analyse the policy learnt by QMIX in the case where it cannot represent the true optimal action-value function. Our analysis is not restricted to QMIX and can easily be extended to similar algorithms, like VDN [46], whose representation class is a subset of QMIX's. Intuitively, monotonicity implies that the optimal action of agent i does not depend on the actions of the other agents. This motivates us to characterise the class of Q-functions that cannot be represented by QMIX, which we call nonmonotonic Q-functions.

Definition 1 (Nonmonotonicity).
For any state s ∈ S and agent i ∈ A, given the actions of the other agents u^{-i} ∈ U^{n-1}, the Q-values Q(s, (u^i, u^{-i})) form an ordering over the action space of agent i. Define C(i, u^{-i}) := {(u^i_1, ..., u^i_{|U|}) | Q(s, (u^i_j, u^{-i})) ≥ Q(s, (u^i_{j+1}, u^{-i})), j ∈ {1, ..., |U|−1}, u^i_j ∈ U, j ≠ j′ ⟹ u^i_j ≠ u^i_{j′}} as the set of all possible such orderings over the action-values. The joint action-value function is nonmonotonic if ∃i ∈ A, u^{-i}_1 ≠ u^{-i}_2 s.t. C(i, u^{-i}_1) ∩ C(i, u^{-i}_2) = ∅.

A simple example of a nonmonotonic Q-function is given by the payoff matrix of the two-player, three-action matrix game shown in Table 1(a). Table 1(b) shows the values learned by QMIX under uniform visitation, i.e., when all state-action pairs are explored equally.

    (a)     A      B      C          (b)     A      B      C
    A     10.4    0.0   10.0         A     6.08   6.08   8.95
    B      0.0   10.0   10.0         B     6.00   5.99   8.87
    C     10.0   10.0   10.0         C     8.99   8.99  11.87

Table 1: (a) An example of a nonmonotonic payoff matrix; (b) QMIX values under uniform visitation.

Of course, the fact that QMIX cannot represent the optimal value function does not imply that the policy it learns must be suboptimal. However, the following analysis establishes the suboptimality of such policies.

Theorem 1 (Uniform visitation). For n-player, k(≥3)-action matrix games (|A| = n, |U| = k), under uniform visitation, Q_qmix learns a δ-suboptimal policy for any time horizon T, for any 0 < δ ≤ R[√(a(b+1)/(a+b)) − 1], for the payoff matrix (n-dimensional) given by the following template, where b = Σ_{s=1}^{k−2} C(n+s−1, s), a = k^n − (b+1), and R > 0: the joint action (1, ..., 1) receives payoff R + δ, the b joint actions u ≠ (1, ..., 1) with Σ_i (u^i − 1) ≤ k − 2 receive payoff 0, and the remaining a joint actions receive payoff R. For n = 2, k = 3 the template is

    ⎡ R+δ   0    R ⎤
    ⎢  0    R    R ⎥
    ⎣  R    R    R ⎦

(cf. Table 1(a) with R = 10, δ = 0.4).

Proof: see Appendix B.1.

We next consider ε-greedy visitation, in which each agent uses an ε-greedy policy and ε decreases over time.
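Before analysing ε-greedy visitation, note that the nonmonotonicity of a small game like Table 1(a) can be checked mechanically: monotonicity would require agent 1's ordering over its own actions to be independent of agent 2's action, but here even the greedy action flips. A minimal sketch (helper names are ours):

```python
# The Table 1(a) payoff; rows are agent 1's actions (A, B, C), columns agent 2's.
Q = [
    [10.4,  0.0, 10.0],  # A
    [ 0.0, 10.0, 10.0],  # B
    [10.0, 10.0, 10.0],  # C
]
acts = "ABC"

def best_response(j):
    # Agent 1's greedy action when agent 2 plays column j.
    col = [Q[i][j] for i in range(3)]
    return acts[col.index(max(col))]

# Agent 1's best action depends on agent 2's action, so no monotonic
# factorisation can represent this payoff exactly:
print(best_response(0), best_response(1))  # A B
```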
Below we provide a probabilistic bound on the maximum possible value of δ for which Q_qmix learns a δ-suboptimal policy for any time horizon T.

Theorem 2 (ε-greedy visitation). For n-player, k(≥3)-action matrix games, under ε-greedy visitation ε(t), Q_qmix learns a δ-suboptimal policy for any time horizon T with probability ≥ 1 − (exp(−Tεδ²/2) + (k^n − 1)·exp(−Tεδ²/(2(k^n − 1)²))), for any 0 < δ ≤ R[√(a(b·2^{−(1/2)(a+b)} + 1)^{−1}) − 1], for the payoff matrix given by the template above, where b = Σ_{s=1}^{k−2} C(n+s−1, s), a = k^n − (b+1), R > 0, and ε = ε(T).

Proof: see Appendix B.2.

The reliance of QMIX on ε-greedy action selection prevents it from engaging in committed exploration [36], in which a precise sequence of actions must be chosen in order to reach novel, interesting parts of the state space. Moreover, Theorems 1 and 2 imply that the agents can latch onto suboptimal behaviour early on, due to the monotonicity constraint. Theorem 2 in particular provides a surprising result: for a fixed time budget T, increasing QMIX's exploration rate lowers its probability of learning the optimal action, due to its representational limitations. Intuitively, this is because the monotonicity constraint can prevent the Q-network from correctly remembering the true value of the optimal action (currently perceived as suboptimal). We hypothesise that the lack of a principled exploration strategy, coupled with these representational limitations, can often lead to catastrophically poor exploration, which we confirm empirically.

4 Methodology

In this section, we propose multi-agent variational exploration (MAVEN), a new method that overcomes the detrimental effects of QMIX's monotonicity constraint on exploration. MAVEN does so by learning a diverse ensemble of monotonic approximations with the help of a latent space.
Its architecture consists of value-based agents that condition their behaviour on the shared latent variable z controlled by a hierarchical policy, which offloads exploration from ε-greedy action selection to committed exploration in the latent space. Thus, fixing z, each joint action-value function is a monotonic approximation to the optimal action-value function that is learnt with Q-learning. Furthermore, each such approximation can be seen as a mode of committed joint exploratory behaviour. The latent policy over z can then be seen as exploring the space of joint behaviours and can be trained using any policy learning method. Intuitively, the z space should map to diverse modes of behaviour. Fig. 2 illustrates the complete setup for MAVEN.

Figure 2: Architecture for MAVEN.

We first focus on the left-hand side of the diagram, which describes the learning framework for the latent space policy and the joint action values. We parametrise the hierarchical policy by θ, the agent utility network by η, the hypernet map from the latent variable z used to condition the utilities by φ, and the mixer net by ψ. η can be associated with a feature extraction module per agent, and φ can be associated with the task of modifying the utilities for a particular mode of exploration. We model the hierarchical policy π_z(· | s_0; θ) as a transformation of a simple random variable x ∼ p(x) through a neural network parameterised by θ; thus z ∼ g_θ(x, s_0), where s_0 is the initial state. Natural choices for p(x) are uniform for discrete z, and uniform or normal for continuous z.

We next provide a coordinate ascent scheme for optimising the parameters. Fixing z gives a joint action-value function Q(u, s; z, φ, η, ψ), which implicitly defines a greedy deterministic policy π_A(u | s; z, φ, η, ψ) (we drop the parameter dependence wherever it is inferable, for clarity of presentation).
This gives the corresponding Q-learning loss:

    L_QL(φ, η, ψ) = E_{π_A}[(Q(u_t, s_t; z) − [r(s_t, u_t) + γ max_{u_{t+1}} Q(u_{t+1}, s_{t+1}; z)])²],

where t is the time step. Next, fixing φ, η, ψ, the hierarchical policy π_z(· | s_0; θ) is trained on the cumulative trajectory reward R(τ, z | φ, η, ψ) = Σ_t r_t, where τ is the joint trajectory. Thus, the hierarchical policy objective for z, freezing the parameters φ, η, ψ, is given by:

    J_RL(θ) = ∫ R(τ_A | z) p_θ(z | s_0) ρ(s_0) dz ds_0.

Algorithm 1: MAVEN

    Initialise parameter vectors φ, ψ, η, ν, θ; learning rate α; D ← {}
    for each episodic iteration i do
        s_0 ∼ ρ(s_0), x ∼ p(x), z ∼ g_θ(x; s_0)
        for each environment step t do
            u_t ∼ π_A(u | s_t; z, φ, η, ψ)
            s_{t+1} ∼ P(s_{t+1} | s_t, u_t)
            D ← D ∪ {(s_t, u_t, r(s_t, u_t), r^z_aux(τ_i), s_{t+1})}
        end for
        for each gradient step do
            φ ← φ + α ∇̂_φ(λ_MI J_V − λ_QL L_QL)   (hypernet update)
            η ← η + α ∇̂_η(λ_MI J_V − λ_QL L_QL)   (feature update)
            ψ ← ψ + α ∇̂_ψ(λ_MI J_V − λ_QL L_QL)   (mixer update)
            ν ← ν + α ∇̂_ν λ_MI J_V                (variational update)
            θ ← θ + α ∇̂_θ J_RL                    (latent space update)
        end for
    end for

However, the formulation so far does not encourage diverse behaviour corresponding to different values of z, and all the values of z could collapse to the same joint behaviour. To prevent this, we introduce a mutual information (MI) objective between the observed trajectories τ = {(u_t, s_t)}, which are representative of the joint behaviour, and the latent variable z. The actions u_t in the trajectory are represented as a stack of agent utilities, and σ is an operator that returns a per-agent Boltzmann policy w.r.t. the utilities at each time step t, ensuring the MI objective is differentiable and helping train the network parameters (φ, η, ψ). We use an RNN [20] to encode the entire trajectory and then maximise MI(σ(τ), z).
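The σ operator just described can be sketched in a few lines: at each time step it maps each agent's utilities to a Boltzmann (softmax) distribution, which is differentiable in the utilities. The utility numbers below are illustrative, not network outputs:

```python
import math

# Minimal sketch of a per-agent Boltzmann (softmax) policy over utilities,
# in the spirit of the sigma operator from the text.
def boltzmann(utilities, temperature=1.0):
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp((u - m) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# sigma applied to one time step of a 2-agent trajectory:
step_utilities = [[1.0, 2.0, 0.5],   # agent 1's utilities over its 3 actions
                  [0.0, 0.0, 3.0]]   # agent 2's utilities
sigma_t = [boltzmann(u) for u in step_utilities]
print([round(sum(p), 6) for p in sigma_t])  # [1.0, 1.0] -- valid distributions
```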
Intuitively, the MI objective encourages visitation of diverse trajectories τ while at the same time making them identifiable given z, thus elegantly separating the z space into different exploration modes. The MI objective is:

    J_MI = H(σ(τ)) − H(σ(τ) | z) = H(z) − H(z | σ(τ)),

where H is the entropy. However, neither the entropy of σ(τ) nor the conditional of z given σ(τ) is tractable for nontrivial mappings, which makes directly using MI infeasible. Therefore, we introduce a variational distribution q_ν(z | σ(τ)) [50, 3], parameterised by ν, as a proxy for the posterior over z, which provides a lower bound on J_MI (see Appendix B.3):

    J_MI ≥ H(z) + E_{σ(τ), z}[log q_ν(z | σ(τ))].

We refer to the right-hand side of the above inequality as the variational MI objective J_V(ν, φ, η, ψ). The lower bound matches the exact MI when the variational distribution equals p(z | σ(τ)), the true posterior of z. The right-hand side of Fig. 2 gives the network architectures corresponding to the variational MI loss. Since

    E_{τ, z}[log q_ν(z | σ(·))] = −E_τ[KL(p(z | σ(·)) || q_ν(z | σ(·)))] − H(z | σ(·)),

the nonnegativity of the KL divergence implies that a bad variational approximation can hurt performance, as it induces a gap between the true objective and the lower bound [32, 2]. This problem is especially important if z is chosen to be continuous, since for discrete distributions the posterior can be represented exactly as long as the output dimensionality of q_ν is at least the number of categories of z. The problem can be addressed by various state-of-the-art developments in amortised variational inference [42, 41]. The variational approximation can also be seen as a discriminator/critic that induces an auxiliary reward field r^z_aux(τ) = log q_ν(z | σ(τ)) − log p(z) on the trajectory space.
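A sketch of this auxiliary reward, assuming a uniform prior over discrete z and a hand-made stand-in for the discriminator q_ν (not a trained network):

```python
import math

N_MODES = 4
log_p_z = math.log(1.0 / N_MODES)   # uniform prior over discrete z

def q_posterior(traj_feature):
    # Stand-in for q_nu(z | sigma(tau)): a softmax over hand-made scores.
    scores = [traj_feature * k for k in range(N_MODES)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def r_aux(z, traj_feature):
    # r_aux^z(tau) = log q(z | sigma(tau)) - log p(z)
    return math.log(q_posterior(traj_feature)[z]) - log_p_z

# A trajectory the discriminator attributes to mode 3 earns a positive bonus
# for z = 3 and a negative one for z = 0, pushing the modes apart:
print(r_aux(3, traj_feature=2.0) > 0 > r_aux(0, traj_feature=2.0))  # True
```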
Thus the overall objective becomes:

    max_{φ, η, ψ, ν, θ} J_RL(θ) + λ_MI J_V(ν, φ, η, ψ) − λ_QL L_QL(φ, η, ψ),

where λ_MI, λ_QL are positive multipliers. For training (see Algorithm 1), at the beginning of each episode we sample an x and obtain z, then unroll the policy until termination, and train φ, η, ψ, ν on the Q-learning loss corresponding to the greedy policy for the current exploration mode and on the variational MI reward. The hierarchical policy parameters θ can be trained on the true task return using any policy optimisation algorithm. At test time, we sample z at the start of an episode and then perform a decentralised argmax on the corresponding Q-function to select actions. Thus, MAVEN achieves committed exploration while respecting QMIX's representational constraints.

5 Experimental Results

We now empirically evaluate MAVEN on various new and existing domains.

5.1 m-step matrix games

To test how nonmonotonicity and exploration interact, we introduce a simple m-step matrix game. The initial state is nonmonotonic, zero rewards lead to termination, and the differentiating states are located at the terminal ends; there are m − 2 intermediate states. Fig. 3(a) illustrates the m-step matrix game for m = 10. The optimal policy is to take the top-left joint action and finally take the bottom-right action, giving an optimal total payoff of m + 3. As m increases, it becomes increasingly difficult to discover the optimal policy using ε-dithering, and a committed approach becomes necessary. Additionally, the initial state's nonmonotonicity provides inertia against switching the policy to the other direction. Fig. 3(b) plots median returns for m = 10. QMIX gets stuck in a suboptimal policy with payoff 10, while MAVEN successfully learns the true optimal policy with payoff 13.
This example shows how representational constraints can hurt performance if they are left unmoderated.

Figure 3: (a) the m-step matrix game for the m = 10 case; (b) median return of MAVEN and QMIX on the 10-step matrix game over 100k training steps, averaged over 20 random initialisations (2nd and 3rd quartiles shaded).

5.2 StarCraft II

StarCraft Multi-Agent Challenge. We consider a challenging set of cooperative StarCraft II maps from the SMAC benchmark [43], which Samvelyan et al. have classified as Easy, Hard, and Super Hard. Our evaluation procedure is similar to [40, 43]. We pause training every 100,000 time steps and run 32 evaluation episodes with decentralised greedy action selection. After training, we report the median test win rate (percentage of episodes won) along with the 2nd and 3rd quartiles (shaded in plots). We use grid search to tune hyperparameters. Appendix C.1 contains additional experimental details. We compare MAVEN, QTRAN, QMIX, COMA [13] and IQL [48] on several SMAC maps. Here we present the results for two Super Hard maps, corridor and 6h_vs_8z, and an Easy map, 2s3z. The corridor map, in which 6 Zealots face 24 enemy Zerglings, requires agents to make effective use of the terrain features and block enemy attacks from different directions. A properly coordinated exploration scheme applied to this map would help the agents discover a suitable unit positioning quickly and improve performance. 6h_vs_8z requires fine-grained "focus fire" by the allied Hydralisks. 2s3z requires agents to learn "focus fire" and interception. Figs. 4(a) to 4(c) show the median win rates for the different algorithms on the maps; additional plots can be found in Appendix C.2.

Figure 4: The performance of various algorithms on three SMAC maps: (a) corridor (Super Hard), (b) 6h_vs_8z (Super Hard), (c) 2s3z (Easy).
The plots show that MAVEN performs substantially better than all alternative approaches on the Super Hard maps, with performance similar to QMIX on Hard and Easy maps. Thus MAVEN performs better as difficulty increases. Furthermore, QTRAN does not yield satisfactory performance on most SMAC maps (0% win rate). The map on which it performs best is 2s3z (Fig. 4(c)), an Easy map, where it is still worse than QMIX and MAVEN. We believe this is because QTRAN enforces decentralisation using only relaxed L2 penalties that are insufficient for challenging domains.

Exploration and Robustness. Although SMAC domains are challenging, they are not specially designed to test state-action space exploration, as the units involved start engaging immediately after spawning. We thus introduce a new SMAC map designed specifically to assess the effectiveness of multi-agent exploration techniques and their ability to adapt to changes in the environment. The 2_corridors map features two Marines facing an enemy Zealot. At the beginning of training, the agents can make use of two corridors to attack the enemy (see Fig. 5(a)). Halfway through training, the short corridor is blocked. This requires the agents to adapt accordingly and use the long corridor in a coordinated way to attack the enemy. Fig. 5(b) presents the win rate for MAVEN and QMIX on 2_corridors when the gate to the short corridor is closed after 5 million steps. While QMIX fails to recover after the closure, MAVEN swiftly adapts to the change in the environment and starts using the long corridor. MAVEN's latent space allows it to explore in a committed manner and associate use of the long corridor with a value of z.

Figure 5: State exploration and policy robustness: (a) 2_corridors; (b) shorter corridor closed at 5 million steps; (c) zealot_cave; (d) zealot_cave depth 3; (e) zealot_cave depth 4.
Furthermore, it facilitates recall of the behaviour once the short corridor becomes unavailable, which QMIX struggles with due to its representational constraints. We also introduce another new map called zealot_cave to test state exploration, featuring a tree-structured cave with a Zealot at all but the leaf nodes (see Fig. 5(c)). The agents consist of 2 Marines who need to learn 'kiting' to reach all the way to the leaf nodes, and they get extra reward only if they always take the right branch except at the final intersection. The depth of the cave offers control over the task difficulty. Figs. 5(d) and 5(e) give the average reward received by the different algorithms for cave depths of 3 and 4. MAVEN outperforms all algorithms compared.

Representability. The optimal action-value function lies outside the representation class of the CTDE algorithm used for most interesting problems. One way to tackle this issue is to find local approximations to the optimal value function and choose the best local approximation given the observation. We hypothesise that MAVEN enables application of this principle by mapping the latent space z to local approximations and using the hierarchical policy to choose the best such approximation given the initial state s_0, thus offering better representational capacity while respecting the constraints required for decentralisation. To demonstrate this, we plot the t-SNE [29] of the initial states and colour them according to the latent variable sampled for them by the hierarchical policy at different time steps during training. The top row of Fig. 6 gives the time evolution of the plots for 3s5z, which shows that MAVEN learns to associate the initial state clusters with the same latent value, thus partitioning the state-action space with distinct joint behaviours.
Another interesting plot, in the bottom row for micro_corridor, demonstrates how MAVEN's latent space allows transitions to more rewarding joint behaviour, which existing methods would struggle to accomplish.

Figure 6: t-SNE plots of s_0 labelled with z (16 categories), from initial (left) to final (right); top: 3s5z, bottom: micro_corridor.

Figure 7: (a) & (b) investigate the uniform hierarchical policy; (c) & (d) investigate the effects of the MI loss.

Ablations. We perform several ablations on the micro_corridor scenario with Z = 16 to determine the importance of each component of MAVEN. We first consider using a fixed uniform hierarchical policy over z. Fig. 7(a) shows that MAVEN with a uniform policy over z performs worse than with a learned policy. Interestingly, using a uniform hierarchical policy and no variational MI loss to encourage diversity results in a further drop in performance, as shown in Fig. 7(b). Thus, sufficient diversification of the observed trajectories via an explicit mechanism is important for finding good policies and ensuring sample efficiency. The setting of Fig. 7(b) is similar to Bootstrapped DQN [36], which has no incentive to produce diverse behaviour other than the differing initialisations depending on z; thus, all the latent variable values can collapse to the same joint behaviour. If we are able to learn a hierarchical policy over z, we can focus our computation and environment samples on the more promising latent values, which allows for better final performance; Fig. 7(c) shows improved performance relative to Fig. 7(b), providing some evidence for this claim. Next, we consider how the different choices of variational MI loss (per time step vs. per trajectory) affect performance in Fig. 7(d). Intuitively, the per-time-step loss promotes more spread-out exploration, as it forces the discriminator to learn the inverse map to the latent variable at each step.
It thus tends to distribute its exploration budget at each step uniformly, whereas the trajectory loss allows the joint behaviours to be similar for extended durations and to take diversifying actions at only a few time steps in a trajectory, keeping its spread fairly narrow. However, we found that in most scenarios the two losses perform similarly. See Appendix C.2 for additional plots and ablation results.

6 Related Work

In recent years there has been considerable work extending MARL from small discrete state spaces that can be handled by tabular methods [51, 5] to high-dimensional, continuous state spaces that require the use of function approximators [13, 28, 39]. To tackle the computational intractability arising from the exponential blow-up of the state-action space, Guestrin et al. [16, 17] use coordination graphs to factor large MDPs for multi-agent systems and propose inter-agent communication arising from message passing on the graphs. Similarly, [45, 11, 23] model inter-agent communication explicitly. In CTDE [26], [47] extend Independent Q-Learning [48] to use DQN to learn Q-values for each agent independently. [12, 35] tackle the instability that arises from training the agents independently. Lin et al. [27] first learn a centralised controller to solve the task, and then train the agents to imitate its behaviour. Sunehag et al. [46] propose Value Decomposition Networks (VDN), which learn the joint-action Q-values by factoring them as the sum of each agent's Q-values. QMIX [40] extends VDN to allow the joint-action Q-values to be a monotonic combination of each agent's Q-values that can vary depending on the state. Section 4 outlines how MAVEN builds upon QMIX. QTRAN [44] approaches the suboptimality vs. decentralisation trade-off differently by introducing relaxed L2 penalties in the RL objective. [15] maximise the empowerment between one agent's actions and the other's future state in a competitive setting. Zheng et al.
[52] allow each agent to condition its policy on a shared continuous latent variable. In contrast to our setting, they consider the fully observable centralised control setting and do not attempt to enforce diversity across the shared latent variable. Aumann [1] proposes the concept of a correlated equilibrium in non-cooperative multi-agent settings, in which each agent conditions its policy on some shared variable that is sampled every episode.
In the single-agent setting, Osband et al. [36] learn an ensemble of Q-value functions (which all share weights except for the final few layers) that are trained on their own sampled trajectories to approximate a posterior over Q-values via the statistical bootstrapping method. MAVEN without the MI loss and with a uniform policy over z is then equivalent to each agent using a Bootstrapped DQN. [37] extend Bootstrapped DQN to include a prior. [7] consider the setting of concurrent RL, in which multiple agents interact with their own environments in parallel. They aim to achieve more efficient exploration of the state-action space by seeding each agent's parametric distributions over MDPs with different seeds, whereas MAVEN aims to achieve this by maximising the mutual information between z and a trajectory.
Yet another direction of related work lies in defining intrinsic rewards for single-agent hierarchical RL that enable learning of diverse behaviours for the low-level layers of the hierarchical policy. Florensa et al. [10] use hand-designed state features and train the lower layers of the policy by maximising MI, and then tune the policy network's upper layers for specific tasks. Similarly, [14, 8] learn a mixture of diverse behaviours using deep neural networks to extract state features and use MI maximisation between them and the behaviours to learn useful skills without a reward function. 
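The MI-maximisation idea shared by these skill-discovery methods can be made concrete with the standard variational lower bound I(Z; S) >= E[log q(z|s)] - log p(z), where a learned discriminator q(z|s) stands in for the intractable posterior. The following is a minimal NumPy sketch, not any of the cited implementations; the linear-softmax discriminator and the clustered toy states are assumptions made purely for exposition.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mi_lower_bound(states, zs, W, n_z):
    """Variational lower bound on I(Z; S) with a uniform prior p(z) = 1/n_z:
    E[log q(z|s)] - log p(z) <= I(Z; S), where q is a linear-softmax
    discriminator standing in for the intractable posterior."""
    q = softmax(states @ W)                    # q(.|s) for each sample
    log_q = np.log(q[np.arange(len(zs)), zs])  # log q(z_i | s_i)
    return log_q.mean() + np.log(n_z)          # + log n_z = -log p(z)

# Toy data (an assumption for illustration): states cluster by latent value,
# so a discriminator that recovers z from s drives the bound towards its
# maximum, log n_z; an uninformative discriminator gives a bound of zero.
rng = np.random.default_rng(0)
n_z, n = 4, 400
zs = rng.integers(0, n_z, size=n)
states = 5.0 * np.eye(n_z)[zs] + rng.normal(scale=0.1, size=(n, n_z))

informed = mi_lower_bound(states, zs, np.eye(n_z), n_z)            # near log 4
uninformed = mi_lower_bound(states, zs, np.zeros((n_z, n_z)), n_z)  # 0
```

DIAYN-style methods feed this quantity to an RL algorithm as an intrinsic reward, whereas MAVEN maximises an analogous bound over whole trajectories of states and actions directly by gradient ascent.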
MAVEN differs from DIAYN [8] in its use case, and also enforces action diversification because MI is maximised jointly with the states and actions in a trajectory. Hence, the agents jointly learn to solve the task in many different ways; this is how MAVEN prevents suboptimality from representational constraints, whereas DIAYN is concerned only with discovering new states. Furthermore, DIAYN trains on the diversity rewards using RL, whereas we train on them directly via gradient ascent. Haarnoja et al. [18] use normalising flows [41] to learn hierarchical latent-space policies using max-entropy RL [49, 53, 9], which is related to MI maximisation but ignores the variational posterior over latent-space behaviours. In a similar vein, [21, 38] use auxiliary rewards to modify the RL objective towards a better tradeoff between exploration and exploitation.

7 Conclusion and Future Work
In this paper, we analysed the effects of representational constraints on exploration under CTDE. We also introduced MAVEN, an algorithm that enables committed exploration while obeying such constraints. As immediate future work, we aim to develop a theoretical analysis, similar to the one for QMIX, for other CTDE algorithms. We would also like to carry out empirical evaluations of MAVEN when z is continuous. To address the intractability introduced by the use of continuous latent variables, we propose the use of state-of-the-art methods from variational inference [24, 42, 41, 25]. Yet another interesting direction would be to condition the latent distribution on the joint state space at each time step and transmit it across the agents to obtain a low-communication-cost, centralised execution policy, and to compare its merits to existing methods [45, 11, 23].

8 Acknowledgements
AM is generously funded by the Oxford-Google DeepMind Graduate Scholarship and Drapers Scholarship. TR is funded by the Engineering and Physical Sciences Research Council (EP/M508111/1, EP/N509711/1). 
This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA and a cloud credit grant from the Oracle Cloud Innovation Accelerator.

References
[1] Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.
[2] David Barber, A Taylan Cemgil, and Silvia Chiappa. Bayesian time series models. Cambridge University Press, 2011.
[3] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[5] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
[6] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An Overview of Recent Progress in the Study of Distributed Multi-agent Coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2013.
[7] Maria Dimakopoulou and Benjamin Van Roy. Coordinated exploration in concurrent reinforcement learning. In International Conference on Machine Learning, pages 1270–1278, 2018.
[8] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[9] Matthew Fellows, Anuj Mahajan, Tim GJ Rudner, and Shimon Whiteson. VIREL: A variational inference framework for reinforcement learning. arXiv preprint arXiv:1811.01132, 2018.
[10] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. 
arXiv preprint arXiv:1704.03012, 2017.
[11] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
[12] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1146–1155, 2017.
[13] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[14] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
[15] Christian Guckelsberger, Christoph Salge, and Julian Togelius. New and surprising ways to be mean: Adversarial NPCs with coupled empowerment minimisation. arXiv preprint arXiv:1806.01387, 2018.
[16] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent Planning with Factored MDPs. In Advances in Neural Information Processing Systems, pages 1523–1530. MIT Press, 2002.
[17] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
[18] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
[19] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents, 2015.
[20] Sepp Hochreiter and Jürgen Schmidhuber. 
Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
[22] Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. Guided Deep Reinforcement Learning for Swarm Systems. In AAMAS 2017 Autonomous Robots and Multirobot Systems (ARMS) Workshop, 2017.
[23] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pages 7254–7264, 2018.
[24] Diederik P Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR, 2014.
[25] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[26] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
[27] Alex Tong Lin, Mark J Debord, Katia Estabridis, Gary Hewer, and Stanley Osher. CESMA: Centralized expert supervises multi-agents. arXiv preprint arXiv:1902.02311, 2019.
[28] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
[29] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[30] Anuj Mahajan and Theja Tulabandhula. 
Symmetry detection and exploitation for function approximation in deep RL. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 1619–1621. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
[31] Anuj Mahajan and Theja Tulabandhula. Symmetry learning for function approximation in reinforcement learning. arXiv preprint arXiv:1706.02999, 2017.
[32] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
[33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[34] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer, 2016.
[35] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep Decentralized Multi-task Multi-Agent RL under Partial Observability. In Proceedings of the 34th International Conference on Machine Learning, pages 2681–2690, 2017.
[36] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
[37] Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
[38] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
[39] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games. 
arXiv preprint arXiv:1703.10069, 2017.
[40] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pages 4295–4304, 2018.
[41] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[42] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[43] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019.
[44] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.
[45] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
[46] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
[47] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 2017.
[48] Ming Tan. 
Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
[49] Emanuel Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.
[50] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[51] Erfu Yang and Dongbing Gu. Multiagent reinforcement learning for multi-robot systems: A survey. Technical report, University of Strathclyde, 2004.
[52] Stephan Zheng and Yisong Yue. Structured exploration via hierarchical variational policy networks. OpenReview, 2018.
[53] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.