{"title": "A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8130, "page_last": 8140, "abstract": "Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. \nIn this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.", "full_text": "A Structured Prediction Approach for Generalization\nin Cooperative Multi-Agent Reinforcement Learning\n\nNicolas Carion\nFacebook, Paris\n\nLamsade, Univ. 
Paris Dauphine\n\nalcinos@fb.com\n\nGabriel Synnaeve\nFacebook, NYC\n\ngab@fb.com\n\nAlessandro Lazaric\n\nFacebook, Paris\nlazaric@fb.com\n\nNicolas Usunier\nFacebook, Paris\nusunier@fb.com\n\nAbstract\n\nEffective coordination is crucial to solve multi-agent collaborative (MAC) prob-\nlems. While centralized reinforcement learning methods can optimally solve small\nMAC instances, they do not scale to large problems and they fail to generalize\nto scenarios different from those seen during training. In this paper, we consider\nMAC problems with some intrinsic notion of locality (e.g., geographic proximity)\nsuch that interactions between agents and tasks are locally limited. By leverag-\ning this property, we introduce a novel structured prediction approach to assign\nagents to tasks. At each step, the assignment is obtained by solving a centralized\noptimization problem (the inference procedure) whose objective function is pa-\nrameterized by a learned scoring model. We propose different combinations of\ninference procedures and scoring models able to represent coordination patterns of\nincreasing complexity. The resulting assignment policy can be ef\ufb01ciently learned\non small problem instances and readily reused in problems with more agents and\ntasks (i.e., zero-shot generalization). We report experimental results on a toy search\nand rescue problem and on several target selection scenarios in StarCraft: Brood\nWar1, in which our model signi\ufb01cantly outperforms strong rule-based baselines on\ninstances with 5 times more agents and tasks than those seen during training.\n\nIntroduction\n\n1\nMulti-agent collaboration (MAC) problems often decompose into several intermediate tasks that need\nto be completed to achieve a global goal. 
A common measure of size, or dif\ufb01culty, of MAC problems\nis the number of agents and tasks: more tasks usually require longer-term planning, the joint action\nspace grows exponentially with the number of agents, and the joint state space is exponential in both\nthe numbers of tasks and agents. While general-purpose reinforcement learning (RL) methods [26]\nare theoretically able to solve (centralized) MAC problems, their learning (e.g., estimating the optimal\naction-value function) and computational (e.g., deriving the greedy policy from an action-value\nfunction) complexity grows exponentially with the dimension of the problem. A way to address this\nlimitation is to learn in problems with few agents and a small planning horizon and then generalize\nthe solution to more complex instances. Unfortunately, standard RL methods are not able to perform\nany meaningful generalization to scenarios different from those seen during training. In this paper\nwe study problems whose structure can be exploited to learn policies in small instances that can be\nef\ufb01ciently generalized across scenarios of different size.\n\n1StarCraft and its expansion StarCraft: Brood War are trademarks of Blizzard EntertainmentTM.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWell-known MAC problems that are solved by a suitable sequence of agent-task assignments include\nsearch and rescue, predator-prey problems, \ufb02eet coordination, or managing units in video games. In\nall these problems, the dynamics describing the interaction of the \u201cobjects\u201d in the environment (i.e.,\nagents and tasks) is regulated by constraints that may greatly simplify the problem. A typical example\nis local proximity, where objects\u2019 actions may only affect nearby objects (e.g., in the predator-prey, the\nprey\u2019s movements only depend on nearby agents). 
Similarly, constraints may be related to assignment\nproximity, as agents may only interact with agents assigned to the same task.\nThe structure of problems with constrained interaction has been exploited to simplify the learning of\nvalue functions [e.g., 20] or dynamics of the environment [e.g., 11]. These approaches effectively\ngeneralize from easier to more dif\ufb01cult instances: we may train on small environments where the\nsample complexity is practical and generalize to large problems without ever training on them (zero-\nshot generalization). The main drawback is that when generalizing value functions or the dynamics,\nthe optimal (or greedy) policy still needs to be recomputed at each new instance, which usually\nrequires solving an optimization problem with complexity exponential in the number of objectives\n(e.g., maximizing the action-value function over the joint action space).\nIn this paper, we build on the observation that in MAC problems with constrained interaction, optimal\npolicies (or good approximations) can be effectively represented as a combination of coordination\npatterns that can be expressed as reactive rules, such as creating subgroups of agents to solve a\nsingle task, avoiding redundancy, or combinations of both. We decompose agents\u2019 policies into a\nhigh-level agent-task assignment policy and a low-level policy that prescribes the actual actions\nagents should take to solve the assigned task. As the most critical aspect of MAC problems is\nthe coordination between agents, we assume low-level policies are provided in advance and we\nfocus on learning effective high-level policies. To leverage the structure of the assignment policy,\nwe propose a structured prediction approach, where agents are assigned to tasks as a result of an\noptimization problem. 
In particular, we distinguish between the coordination inference procedure\n(i.e., the optimization problem itself) and scoring models (the objective function) that provide a score\nto agent-task and task-task pairs. In its most complex instance, we de\ufb01ne a quadratic inference\nprocedure with linear constraints, where the objective function uses learned pairwise scores between\nagents and tasks for the linear part of the objective, and between different tasks for the quadratic\npart. With this structure we address the intrinsic exponential complexity of learning in large MAC\nproblems through zero-shot generalization: 1) the parameters of the scoring model can be learned in\nsmall instances, thus keeping the learning complexity low; 2) the coordination inference procedure\ncan be generalized to an arbitrary number of agents and tasks, as its computational complexity is\npolynomial in the number of agents and tasks. We study the effectiveness of this approach on a\nsearch and rescue problem and different battle scenarios in \u201cStarCraft: Brood War\u201d. We show that the\nlinear part of the optimization problem (i.e., using agent-task scores) represents simple coordination\npatterns such as assigning agents to their closest tasks, while the quadratic part (i.e., using task-task\nscores) may capture longer-term coordination such as spreading the different agents to tasks that are\nfar away from each other or, on the contrary, creating groups of agents that focus on a single task.\n2 Related Work\nMulti-agent reinforcement learning has been extensively studied, mostly in problems of decentralized\ncontrol and limited communication (see Busoniu et al. [2] for a survey). By contrast, this paper\nfocuses on centralized control under full state observation.\nOur work is closely related to generalization in relational Markov Decision Processes [11] and\ndecomposition approaches in loosely and weakly coupled MDPs [24, 17, 10, 27, 20]. 
The work on\nrelational MDPs and the related object-oriented MDPs and \ufb01rst-order MDPs [11, 6, 22] focuses on\nlearning and planning in environments where the state/action space is compactly described in terms\nof objects (e.g., agents) that interact with each other, without prior knowledge of the actual number\nof objects involved. Most of the work in this direction is devoted to either ef\ufb01ciently estimating the\nenvironment dynamics, or approximating planning in new problem instances. While the types of\nenvironments and problems we target are similar, we focus here on model-free learning of policies\nthat generalize to new (and larger) problem instances without replanning.\nLoosely or weakly coupled MDPs are another form of structured MDPs, which decompose into\nsmaller MDPs with nearly independent dynamics. These works mostly follow a decomposition\napproach in which global action-value functions are broken down into independent parts that are either\nlearned individually, or serve as a guide for an effective parameterization for function approximation.\n\n\fThe policy parameterization we develop follows the task decomposition approach of Proper and\nTadepalli [20], but the policy structures we propose are different. Proper and Tadepalli [20] develop\npolicies based on pairwise interaction terms between tasks and agents similar to our quadratic model,\nbut the pairwise terms are based on interactions dictated by the dynamics of the environment (e.g.,\nagent actions that directly impact the effect of other actions), aiming at a better estimation of the value\nfunction of low-level actions of the agents once an assignment is \ufb01xed, whereas our quadratic term\naims at assessing the long-term value of an assignment.\nMany deep reinforcement learning algorithms have been recently proposed to solve MAC problems\nwith a variable number of agents, using different variations of communication and attention over\ngraphs [25, 7, 32, 31, 13, 12, 16, 23]. 
However, most of these algorithms focus on \ufb01xed-size action\nspaces, and little evidence has been given that these approaches generalize to larger problem instances\n[28, 8, 32]. Rashid et al. [21] and Lin et al. [15] address the problem of learning (deep) decentralized\npolicies with a centralized critic during learning in structured environments. While they do not\naddress the problem of generalization, nor the problem of learning a centralized controller, we use\ntheir idea of a separate critic computed based on the full state information during training.\n\n3 Multi-agent Task Assignment\n\nWe formalize a general MAC problem. To keep notation simple, we present a \ufb01xed-size description,\nbut the end goal is to design policies that can be applied to environments of arbitrary size.\nAs customary in reinforcement learning, the objective of solving the tasks is encoded through a\nreward function that needs to be maximized over the long run by the coordinated actions of all agents.\nAn environment with m tasks and n agents is modeled as an MDP \u27e8S^m, X^n, A^n, r, p\u27e9, where S is\nthe set of possible states of each task (indexed by j = 1, . . . , m), and X and A are the sets of states\nand actions of each agent (indexed by i = 1, . . . , n). We denote the joint states/actions by s \u2208 S^m,\nx \u2208 X^n, and a \u2208 A^n. The reward function is de\ufb01ned as r : S^m \u00d7 X^n \u00d7 A^n \u2192 R and the stochastic\ndynamics is p : S^m \u00d7 X^n \u00d7 A^n \u2192 \u2206(S^m \u00d7 X^n), where \u2206 is the probability simplex over the\n(next) joint state set. A joint deterministic policy is de\ufb01ned as a mapping \u03c0 : S^m \u00d7 X^n \u2192 A^n. We\nconsider the episodic discounted setting where the action-value function is de\ufb01ned as Q\u03c0(s, x, a) = E\u03c0[r(s, x, a) + \u2211_{t=1}^{T} \u03b3^t r(st, xt, at)], where \u03b3 \u2208 [0, 1), at = \u03c0(st, xt) for all t \u2265 1, st and xt are\nsampled from p, and T is the time by when all tasks have been solved. The goal is to learn a policy \u03c0\nclose to the optimal \u03c0\u2217 = arg max\u03c0 Q\u03c0 that we can easily generalize to larger environments.\nTask decomposition. Following a task decomposition approach similar to [27] and [20], we consider\nhierarchical policies that \ufb01rst assign each agent to a task, and where actions are given by a lower-\nlevel policy that only depends on the state of individual agents and the task they are assigned to.\nDenoting by B = {\u03b2 \u2208 {0, 1}^{n\u00d7m} : \u2211_{j=1}^{m} \u03b2ij = 1} the set of assignment matrices of agents to\ntasks, an assignment policy \ufb01rst chooses \u02c6\u03b2(s, x) \u2208 B. In the second step, the action for each agent\nis chosen according to a lower-level policy \u02dc\u03c0. Using \u03c0i(s, x) to denote the action of agent i and\n\u02c6\u03b2i(s, x) \u2208 {1, ..., m} for the task assigned to agent i, we have \u03c0i(s, x) = \u02dc\u03c0(s_{\u02c6\u03b2i(s,x)}, xi), where sj\nand xi are respectively the internal states of task j and agent i in the full state (s, x). In the following,\nwe focus on learning high-level assignment policies responsible for the collaborative behavior, while\nwe assume that the lower-level policy \u02dc\u03c0 is known and \ufb01xed.\n\n4 A Structured Prediction Approach\n\nIn this section we introduce a novel method for centralized coordination. 
We propose a structured\nprediction approach in which the agent-task assignment is chosen by solving an optimization problem.\nOur method is composed of two components: a coordination inference procedure, which de\ufb01nes\nthe shape of the optimization problem and thus the type of coordination between agents and tasks,\nand a scoring model, which receives as input the state of agents and tasks and returns the parameters\nof the objective function of the optimization. The combination of these two components de\ufb01nes an\nagent-task assignment policy \u02c6\u03b2 that is then passed to the low-level policy \u02dc\u03c0 (that we assume \ufb01xed),\nwhich returns the actual actions executed by the agents. Finally, we use a learning algorithm to\nlearn the parameters of the scoring model itself in order to maximize the performance of \u02c6\u03b2. The\noverall scheme of this method is illustrated in Fig. 1.\n\n\f[Figure 1 diagram: the scoring model (h\u03b8, g\u03b8) feeds the coordination inference procedure (Argmax, LP or Quad), which outputs \u02c6\u03b2(s, x) to the low-level policies \u02dc\u03c0; together with the environment, these form the \u201cmeta-environment\u201d.]\n\nFigure 1: Illustration of the approach, where the agent-task assignment is computed by a coordination\ninference procedure (CIP) which receives as input agent-task (h) and task-task (g) scores computed\nby a scoring model parametrized by \u03b8. The assignment \u02c6\u03b2 is then passed to \ufb01xed low-level policies that\nreturn the actions played by each agent. The learning algorithm tunes \u03b8 and performs \u201cmeta-actions\u201d\nh\u03b8 and g\u03b8 onto the \u201cmeta-environment\u201d composed of the inference procedure, the low-level policies,\nand the actual environment.\n\n4.1 Coordination Inference Procedures\nThe collaborative behaviors that we can represent are tied to the speci\ufb01c form of the objective function\nand its constraints. The formulations we propose are motivated by collaboration patterns important\nfor long-term performance, such as creating subgroups of agents, or spreading agents across tasks.\nGreedy assignment. The simplest form of assignment is to give a score to each agent-task pair and\nthen assign each agent to the task with the highest score, ignoring other agents at inference time.\nIn this approach, which we refer to as the AMAX strategy, a model h\u03b8(s, x, i, j) \u2208 R parameterized by \u03b8\nreceives as input the full state and returns the score of agent i for task j. The associated policy is then\n\n\u02c6\u03b2AMAX(s, x, \u03b8) = arg max_{\u03b2\u2208B} \u2211_{i,j} \u03b2i,j h\u03b8(s, x, i, j),   (1)\n\nwhich corresponds to assigning each agent i to the task j with largest score h\u03b8(s, x, i, j). As a result,\nthe complexity of coordination is reduced from O(m^n) (i.e., considering all possible agent-to-task\nassignments) down to a linear complexity O(nm) (once the function h\u03b8 has been evaluated on the\nfull state). We also notice that AMAX bears a strong resemblance to the strategy used in [20], where\nthe scores are replaced by approximate value functions computed for any agent-task pair.2\nLinear Program assignment. Since AMAX ignores interactions between agents, it tends to perform\npoorly in scenarios where a task has a high score for all agents (i.e., h(s, x, i, j) is large for a given\nj and for all i). 
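As a concrete illustration, the AMAX inference reduces to an independent argmax per agent once the score matrix has been evaluated. A minimal sketch (the score values below are made up for illustration; in the paper h\u03b8 is a learned model):

```python
import numpy as np

def amax_assignment(h):
    """AMAX inference: h is an (n_agents, m_tasks) matrix of scores
    h_theta(s, x, i, j); each agent independently takes the task with
    the highest score, ignoring the other agents."""
    return h.argmax(axis=1)

# Illustrative scores for 3 agents and 2 tasks.
h = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.7, 0.6]])
print(amax_assignment(h))  # → [0 0 0]: task 0 dominates, so every agent picks it
```

Note how a single dominating task absorbs every agent, which is precisely the failure mode that motivates the capacity constraints introduced next.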
In this situation, all agents are assigned to the same task, implicitly assuming that\nthe \u201cvalue\u201d of solving a task is additive in the number of agents assigned to it (i.e., if n agents are\nassigned to the same task then we could collect a reward n times larger). While this may be the\ncase when the number of agents assigned to the same task is small, in many practical scenarios this\neffect tends to saturate as more agents are assigned to a single task. A simple way to overcome this\nundesirable behavior is to impose a restriction on the number of agents assigned to a task. We can\nformalize this intuition by introducing \u00b5i,j(s, x) as the contribution of an agent i to a given task j,\nand uj(s, x) as the capacity of the task j. In the simplest case, we may know the maximum number\nof agents nj that is necessary to solve each task j, and we can set the capacity of each task to be nj,\nand all the contributions \u00b5i,j to be 1. Depending on the problem, the capacities and contributions\nare either prior knowledge or learned as a function of the state. Formally, denoting by B(s, x) the\nconstrained assignment space\n\nB(s, x) = {\u03b2 \u2208 {0, 1}^{n\u00d7m} | \u2200i, \u2211_{j=1}^{m} \u03b2i,j \u2264 1; \u2200j, \u2211_{i=1}^{n} \u00b5i,j(s, x)\u03b2i,j \u2264 uj(s, x)},   (2)\n\nthe resulting policy infers the assignment by solving an integer linear program\n\n\u02c6\u03b2LP(s, x, \u03b8) = arg max_{\u03b2\u2208B(s,x)} \u2211_{i,j} \u03b2i,jh\u03b8(s, x, i, j),   (3)\n\n2An alternative approach is to sample assignments proportionally to h\u03b8(s, x, i, j). Preliminary empirical\ntests of this procedure performed worse than AMAX and thus we do not report its results.\n\n\fNotice that even with the additional constraints in (2), some agents may not be assigned to any task,\nhence the inequality \u2211_{j=1}^{m} \u03b2i,j \u2264 1 instead of a strict equality.\nIn order to optimize (3) ef\ufb01ciently, we trade off accuracy for speed by solving its linear relaxation\nusing an ef\ufb01cient LP library [1], and retrieving a valid assignment using greedy rounding: let\nus denote by \u03b2\u2217i,j the solution of the relaxed ILP; we iterate over agents i in descending order of\nmaxj \u03b2\u2217i,j, and assign each agent to the task of maximum score that is not already saturated.\nQuadratic Program assignment. The linear program above avoids straightforward drawbacks of\na greedy assignment policy, but is unable to represent grouping patterns that are important in the\nlong run in coordination and collaboration problems. For instance, it may be convenient to \u201cspread\u201d\nagents among unrelated tasks, or, on the contrary, to group agents together on a single task (up to the\nconstraints) and then move to other tasks in a sequential fashion. Such grouping patterns can be well\nrepresented with a quadratic objective function of the form\n\n\u02c6\u03b2QUAD(s, x, \u03b8) = arg max_{\u03b2\u2208B(s,x)} [\u2211_{i,j} \u03b2i,jh\u03b8(s, x, i, j) + \u2211_{i,j,k,l} \u03b2i,j\u03b2k,lg\u03b8(s, x, j, l)],   (4)\n\nwhere g\u03b8(s, x, j, l) plays the role of a (signed) distance between two tasks and B(s, x) is the same set of\nconstraints as in (2). In the extreme case where g\u03b8(s, x, \u00b7, \u00b7) is a diagonal matrix, the quadratic part of\nthe objective favors agents working on the same task (if the diagonal terms are positive) or, on the\ncontrary, on different tasks (if the terms are negative). 
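To make the role of the quadratic term concrete, here is a toy computation of the objective in Eq. (4) with the linear scores zeroed out. The paper optimizes this objective with a continuous relaxation; the brute-force enumeration below (made-up tiny sizes) only illustrates what a diagonal g prefers:

```python
import numpy as np
from itertools import product

def quad_objective(beta, h, g):
    """Objective of Eq. (4): sum_ij beta_ij h_ij + sum_ijkl beta_ij beta_kl g_jl.
    The quadratic term only depends on the per-task occupancy counts."""
    occ = beta.sum(axis=0)              # number of agents on each task
    return (beta * h).sum() + occ @ g @ occ

def best_assignment(g, n=2, m=2):
    """Brute-force argmax over all m^n assignments (tiny instances only)."""
    h = np.zeros((n, m))                # zero linear scores isolate the quadratic term
    cands = [np.stack(rows) for rows in product(np.eye(m), repeat=n)]
    best = max(cands, key=lambda b: quad_objective(b, h, g))
    return best.argmax(axis=1)          # task index chosen by each agent

print(best_assignment(-np.eye(2)))      # negative diagonal: agents spread over tasks
print(best_assignment(+np.eye(2)))      # positive diagonal: agents group on one task
```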
In general, a negative g\u03b8(s, x, j, l) penalizes assigning agents\nto j and l at the same time step, with a strength depending on |g\u03b8(s, x, j, l)|. For instance, in\nthe search and rescue problem, this captures the idea that agents should spread to explore the map.\nAs for the LP, we optimize a continuous relaxation of (4) using the same rounding procedure. The\nobjective function may not be concave, because there is no reason for g\u03b8(s, x, \u00b7, \u00b7) to be negative\nsemi-de\ufb01nite. In practice, we use the Frank-Wolfe algorithm [9] to deal with the linear constraints;\nthe algorithm is guaranteed to converge to a local maximum and was ef\ufb01cient in our experiments.\n\n4.2 Scoring Models\n\nIn order to allow the coordination policy \u02c6\u03b2 to generalize to instances of different size, the h\u03b8 and g\u03b8\nfunctions should be able to compute scores for agent/task and task/task pairs independently of\ntheir actual number. In order to make the presentation concrete, in the following we illustrate\ndifferent scoring models in the case where the agents and tasks are objects located in a \ufb01xed-size 2D\ngrid, and are characterized by an internal state. The position on the grid is part of this internal state.3\n\nDirect Model (DM). The \ufb01rst option is to use a fully decomposable approach (direct model), where\nthe score for the pair (i, j) only depends on the internal states of agent i and task j: h\u03b8(s, x, i, j) =\n\u02dch\u03b8(sj, xi) for some function \u02dch : S \u00d7 X \u2192 R. This model only uses the features of the pair of objects\nto compute the score. Precisely, \u02dch\u03b8(sj, xi) is obtained by concatenating the feature vectors of agent i\nand task j, and by feeding them to a fully-connected network of moderate depth. 
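A minimal sketch of such a direct model, with made-up feature sizes and a single hidden layer (the paper's actual network is deeper and its exact architecture is in the appendix, not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: task internal state (ds), agent internal state (dx).
ds, dx, hidden = 4, 4, 16
W1 = rng.normal(size=(ds + dx, hidden)) * 0.1
W2 = rng.normal(size=(hidden,)) * 0.1

def h_direct(s_j, x_i):
    """Direct model: the score of pair (i, j) depends only on the internal
    states s_j and x_i, concatenated and passed through a small MLP."""
    z = np.concatenate([s_j, x_i])
    return float(np.tanh(z @ W1) @ W2)

score = h_direct(np.ones(ds), np.zeros(dx))  # a single scalar score for one pair
```

Because the score of a pair never looks at the other agents or tasks, the same learned weights apply unchanged to instances of any size.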
In the quadratic\nprogram strategy, the function g\u03b8 follows the same structure as h (but uses different weights).\nWhile this approach is computationally ef\ufb01cient, if used in the simple AMAX procedure (1), it leads\nto a policy that ignores interactions between agents altogether and is thus unable to represent effective\ncollaboration patterns. As a result, the direct model should be paired with more sophisticated inference\nprocedures to achieve more complex coordination patterns. On the other hand, as it computes scores\nby ignoring surrounding agents and tasks, once learned on small instances, it can be directly applied\n(i.e., zero-shot generalization) to larger instances independently of the number of agents and tasks.\n\nGeneral Model. An alternative approach is to take h\u03b8 as a highly expressive function of the\nfull state. The main challenge in this case is to de\ufb01ne an architecture that can output scores for\na variable number of agents and tasks. In the case where agents/tasks are in a 2D grid, we can\nde\ufb01ne a positional embedding model (PEM) (see App. ?? for more details) that computes scores\nfollowing ideas similar to non-local networks [29]. We use a deep convolutional neural network\nthat outputs k feature planes at the same resolution as the input. This implies that each cell is\nassociated with k values that we treat as an embedding of the position.\n\n3Notice that the direct model illustrated below does not leverage this speci\ufb01c scenario, which, on the other\nhand, is needed to de\ufb01ne the general model.\n\n\fWe divide this embedding into two sub-embeddings of size k/2, to account for the two kinds of entities: the \ufb01rst k/2 values\nrepresent an embedding of an agent, and the remaining ones represent an embedding of a task. 
To\ncompute the score between two entities, we concatenate the embeddings of both entities and the input\nfeatures of both of them, and run that through a fully connected model, using the same topology as\ndescribed for the direct model.\nBy leveraging the full state, this model can capture non-local interactions between agents and tasks\n(unlike the direct model) depending on the receptive \ufb01eld of the convolutional network. Larger\nreceptive \ufb01elds allow the model to learn more sophisticated scoring functions and thus better policies.\nFurthermore, it can be applied to a variable number of agents and tasks, as a position contains at most\none agent and one task. Nonetheless, as it depends on the full state, the application to larger instances\nmeans that the model may be tested on data points outside the support of the training distribution. As\na result, the scores computed on larger instances may not be accurate, thus leading to policies that\ncan hardly generalize to more complex instances.\n\n4.3 Learning Algorithm\n\nAs illustrated in Fig. 1, the learning algorithm optimizes a policy parametrized by \u03b8 that returns as\nactions the scores h\u03b8 and g\u03b8, while the combination of the assignment \u02c6\u03b2 returned by the optimization,\nthe low-level policy \u02dc\u03c0, and the environment plays the role of a \u201cmeta-environment\u201d. While any policy\ngradient algorithm could be used to optimize \u03b8, in the experiments we use a synchronous Advantage-\nActor-Critic algorithm [18], which requires computing a state-value function. As advocated by Rashid\net al. [21] in the context of learning decentralized policies, we use a global value function that takes\nthe whole state as input. We use a CNN similar to that of the PEM, followed by a spatial pooling\nand a linear layer to output the value. 
This value function is used only during training, hence its\nparametrization does not impact the potential generalization of the policy (more details in App. ??).\nCorrelated exploration. Reinforcement learning requires some form of exploration scheme. Many\nalgorithms using decompositional approaches for MAC problems [10, 27, 20, 21] rely on variants of\nQ-learning or SARSA and directly randomize the low-level actions taken by the agents. However, this\napproach is not applicable to our framework. In our case, the randomization is applied to the scores\n(denoted as H\u03b8(s, x, i, j) and G\u03b8(s, x, j, l)) before passing them to the inference procedure. We cannot\nuse simple Gaussian noise, since at the beginning of training, when the scoring model is random,\nit would cause the agents to be assigned to different tasks at each step, thus preventing them from\nsolving any task and getting any positive reward. To alleviate this problem, we temporally correlate\nthe consecutive realizations of H\u03b8 and G\u03b8 using auto-correlated noise as studied in [e.g., 30, 14],\nso that the actual sequence of assignments executed by the agents is also correlated. To correlate\nthe parameters over p steps, at time t, we sample Ht,\u03b8(i, j) according to (dropping the dependence\non (st, xt) for clarity): N(ht,\u03b8(i, j) + \u2211_{t\u2032=t\u2212p}^{t\u22121} (Ht\u2032,\u03b8(i, j) \u2212 ht\u2032,\u03b8(i, j)), \u03c3_p). This is equivalent to\ncorrelating the sampling noise over a sliding window of size p. During the update of the model, we\nignore the correlation, and assume that the actions were sampled according to N(ht,\u03b8(i, j), \u03c3).\n\n5 Experiments\nWe report results in two different problems: search and rescue and target selection in StarCraft. 
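Returning to the correlated exploration scheme of Sec. 4.3: one reading of the sliding-window interpretation (our interpretation; the values of \u03c3 and p below are hypothetical) is to perturb the scores at each step with the sum of the last p fresh noise draws:

```python
import numpy as np

def correlated_scores(h_seq, sigma=0.1, p=5, seed=0):
    """Exploration noise correlated over a sliding window of p steps:
    the perturbed scores H_t re-use the noise drawn during the previous
    p steps, so consecutive perturbed scores (and hence the assignments
    they induce) change slowly."""
    rng = np.random.default_rng(seed)
    fresh = []                      # one fresh Gaussian draw per step
    out = []
    for h_t in h_seq:
        fresh.append(rng.normal(scale=sigma, size=np.shape(h_t)))
        out.append(h_t + sum(fresh[-p:]))
    return out

# 20 steps of constant scores for 2 agents x 2 tasks.
H = correlated_scores([np.zeros((2, 2))] * 20)
```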
Both\nexperiments are designed to test the generalization performance of our method: we learn the scoring\nmodels on small instances and the learned policy is tested on larger instances with no additional\nre-training. We test different combinations of coordination inference procedures and scoring models.\nAmong the inference procedures, AMAX should be considered as a basic baseline, while we expect\nLP to express some interesting coordination patterns. The QUAD is expected to achieve the best\nperformance in the training instance, although its more complex coordination patterns may not\ngeneralize well to larger instances. Among the scoring models, PEM should be able to\ncapture dependencies between agents and tasks in a single instance but may fail to generalize when\ntested on instances with a number of agents and tasks not seen at training time. On the other hand,\nthe simpler DM should generalize better if paired with a good coordination inference procedure.\nThe PEM + AMAX combination roughly corresponds to independent A2C learning and can be seen\nas the standard approach baseline, and we also provide strong hand-crafted baselines. Most previous\napproaches did not aim at achieving effective generalization, and often relied on \ufb01xed-size action spaces,\nrendering direct comparison impractical.\n\n\fTable 1: Search and Rescue. Average number of steps to solve the validation episodes, depending on\nthe train scenario. \u2206 denotes the improvement over baseline. Best results are in bold, with an asterisk\nwhen they are statistically (p < 0.0001) better than the second best. Results like \"10.3(1.1%)\"\nmean that the evaluation failed in 1.1% of the test scenarios, and had an average score of 10.3 on\nthe remaining 98.9%. 
In case of evaluation failures, the reported improvements over baseline are\nindicative (reported in italics between parentheses).\n\nTrain (n\u00d7m) Test Baseline Topline AMAX-PEM LP-PEM QUAD-PEM AMAX-DM LP-DM QUAD-DM\n11.55\n9.32*\n7.85*\n11.78*\n9.36*\n7.95*\n25%\n29%\n28%\n\n11.44\n12.09\n10.69\n9.67\n9.86 10.3(1.1%)\n13.23(1%)\n12.94\n10.43\n10.24\n9.51\n9.37\n21%\n20%\n(18%)\n17%\n(19%)\n18%\n\n2 \u00d7 4\n5 \u00d7 10\n8 \u00d7 15\n2 \u00d7 4\n5 \u00d7 10\n8 \u00d7 15\nIn domain \u2206\nOut of domain \u2206\nTotal \u2206\n\n11.98\n13.36\n15.8(0.7%)\n12.05\n9.84\n8.60\n22%\n(22%)\n(18%)\n\n13.78\n12.49\n11.06\n13.84\n12.26\n10.57\n7%\n7%\n7%\n\n11.98\n10.24\n9.71\n12.22\n10.12\n8.63\n21%\n21%\n21%\n\n14.34\n13.61\n11.8\n14.34\n13.61\n11.8\n\n10.28\n7.19\nn.a\n10.28\n7.19\nn.a\n\n0% 38%\n\n2 \u00d7 4\n\n5 \u00d7 10\n\nlower\n\nis\n\nbetter\nlower\n\nis\n\nbetter\nhigher\n\nis\n\nbetter\n\n5.1 Search and Rescue\nSetting. We consider a search and rescue problem on a grid environment of 16 by 16 cells. Each\ninstance is characterized by a set of n ambulances (i.e., agents) and m victims (i.e., tasks). The goal\nis that all the victims are picked up by one of the ambulances as quickly as possible. This problem\ncan be seen as a Multi-vehicle Routing Problem (MVR), which makes it NP-hard.\nThe reward is \u22120.01 per time-step until the end of an episode (when all the victims have been picked\nup). The learning task is challenging because the reward is uninformative and coupled; it is dif\ufb01cult\nfor an agent to assign credit to the solution of an individual task (i.e., picking up a victim). The\nassignment policy \u02c6\u03b2 matches ambulances to victims, while the low-level policy \u02dc\u03c0 takes an action\nto reduce the distance between the ambulance and its assigned victim. 
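Such a low-level policy can be sketched as a one-cell move toward the assigned victim (the action encoding below is hypothetical; the environment's actual action set is not specified here):

```python
def step_toward(agent_xy, victim_xy):
    """Low-level policy sketch: move the ambulance one grid cell along
    each axis toward its assigned victim."""
    (ax, ay), (vx, vy) = agent_xy, victim_xy
    dx = (vx > ax) - (vx < ax)   # -1, 0 or +1 depending on the x-gap
    dy = (vy > ay) - (vy < ay)
    return (ax + dx, ay + dy)

print(step_toward((0, 0), (3, 1)))  # → (1, 1)
```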
In this environment, only one ambulance is needed to pick up a particular victim, hence the saturation u_j(s, x) is set to 1. We trained our models on two instances (n = 2, m = 4 and n = 5, m = 10) and we test them on the training scenarios, as well as on instances with a larger number of victims and ambulances. At test time, we evaluate the policies on a fixed set of 1000 random episodes (with different starting positions). The agents use the same variance and number of correlated steps as they had during training. The results are summarized in Tab. 1, where we report the average number of steps required to complete the episodes. We also report the results of a greedy baseline policy that always assigns each ambulance to the closest victim, and of a topline policy that solves each instance optimally (see App. ?? for more details). Because of its computational cost, the topline for the biggest instance (8 × 15) is not available. In the last rows of the table, we aggregate the average improvements over the baseline (100 · (baseline − method)/baseline). The in-domain scores correspond to the scores obtained when the test instance matches the train instance. Conversely, the out-of-domain scores correspond to the performance on unseen instances. Note that no model was trained on 8 × 15.
Results. Firstly, the PEM scoring model tends to overfit to the train scenario, leading to poor generalization (i.e., in some configurations it fails to solve the problem). On the other hand, for the DM, generalization is very stable. Regarding the inference procedures⁴, AMAX tends to perform at least as well as the greedy baseline, by learning how to compute the relevant distance function between an ambulance and a victim. The LP strategy can rely on the same distance function and perform better, since it enforces coordination and avoids sending more than one ambulance to the same victim.
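With the saturations u_j = 1, this LP inference reduces to a matching that maximizes the sum of learned agent-task scores under the constraint that each task receives at most one agent. A minimal exact version for small instances is sketched below; it brute-forces the injective assignments, standing in for the linear-program solver used in the paper (whose optimum is integral on this bipartite matching polytope):

```python
from itertools import permutations

def lp_inference(scores):
    """Maximize sum_i scores[i][j_i] where each agent i takes one task j_i
    and each task receives at most one agent (saturation u_j = 1).

    Exact brute-force sketch for small instances, in place of an LP solver.
    """
    n, m = len(scores), len(scores[0])
    assert n <= m, "needs at least as many tasks as agents"
    best_val, best_assign = float("-inf"), None
    for perm in permutations(range(m), n):  # perm[i] = task chosen for agent i
        val = sum(scores[i][perm[i]] for i in range(n))
        if val > best_val:
            best_val, best_assign = val, list(perm)
    return best_assign, best_val
```

For instance, with scores [[5, 1], [4, 3]], the per-agent argmax (AMAX) sends both agents to task 0, while the matching assigns agent 0 to task 0 and agent 1 to task 1, for a total score of 8.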
Finally, the QUAD strategy is able to learn long-term strategies, and in particular how to spread the ambulances efficiently across the map (e.g., if two victims are very close, it is wasteful to assign two distinct ambulances to them, since one ambulance can efficiently pick up both victims sequentially while the other deals with farther victims) (see App. ?? for more discussion).

⁴We study their performance when paired with DM.

5.2 Target Selection in StarCraft
Setting. We focus on a specific sub-problem in StarCraft: battles between two groups of units. This setting, often referred to as micromanagement, has already been studied in the literature, using a mixture of scripted behaviours and search [4, 3, 19, 5], or using RL techniques [28, 8]. In these battles, a crucial aspect of the policy is to assign a target enemy unit (the task) to each of our units (the agents), in a coordinated way. Since we focus on the agent-task assignment (the high-level policy β̂), we use a simple low-level policy for the agents (neither learnt nor scripted) relying on the built-in "attack" command of the game, which moves each unit towards its target and shoots as soon as the target is in range. This contrasts with previous works, which usually allow more complex movement patterns (e.g., retreating while the weapon is reloading). While such low-level policies could be integrated in our framework, we preferred to use the simplest "attack" policy to better assess the impact of the high-level coordination.
In this problem, the capacity u_j(s, x) of a task j is defined as the remaining health of the enemy unit, and the contribution µ_{i,j}(s, x) of an agent i to this task is defined as the amount of damage dealt by unit i to the enemy j.
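The capacity and contribution just defined induce a per-task feasibility check of the form Σ_i µ_{i,j} x_{i,j} ≤ u_j. A toy illustration with hypothetical health and damage values (not taken from the game):

```python
def overkill_targets(assignment, damage, health):
    """Return enemies that receive more total damage than needed to kill them.

    assignment: target index per allied unit (agent i -> enemy j)
    damage[i][j]: contribution mu_{i,j}; health[j]: capacity u_j.
    """
    dealt = [0] * len(health)
    for i, j in enumerate(assignment):
        dealt[j] += damage[i][j]
    return [j for j, h in enumerate(health) if dealt[j] > h]
```

For example, if three units each deal 10 damage and enemy 0 has 15 health, sending all three at enemy 0 violates the constraint, whereas splitting one unit onto enemy 0 and two onto a 20-health enemy 1 does not.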
These constraints are meant to avoid dealing more damage to an enemy than necessary to kill it, a phenomenon known as over-killing.
Given the poor results of PEM in the previous experiment, we only train DM with all the possible inference procedures. Each unit is represented by its features: whether it is an enemy, its position, velocity, current health, range, cool-down (number of frames before the next possible attack), and a one-hot encoding of its type. This amounts to 8 to 10 features per unit, depending on the scenario. For training, we sample 100 sets of hyper-parameters for each combination of model/scenario, and train them for 8 hours on an Nvidia Volta. In this experiment, we found that the training algorithm is relatively sensitive to the random seed. To better assess the performances, we re-trained the best set of hyper-parameters for each model/scenario on 10 random seeds, for 18 hours. We report the median performance over all seeds, to alleviate the effect of outliers. The results are aggregated in Tab. 2. Although the number of units is a good indicator of the difficulty of the environment, whether the numbers of units are balanced in both teams dramatically changes the "dynamics" of the game. For instance, zh10v12 is unbalanced and thus much more difficult than zh11v11, which is balanced. The performance of the baseline can be seen as a relatively accurate estimate of the difficulty of a scenario. See App. ?? for a description of the heuristics used for comparison, and App. ?? for a description of the training scenarios. We also provide more detailed results in App. ??.
Results. As StarCraft is a real-time game, a first concern is the runtime of our algorithm. In the biggest experiment, involving 80 units vs. 82, our algorithm returned actions in slightly more than 500 ms (5 ms for the forward pass of the model, 500 ms to solve the QUAD inference).
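Most of that budget goes to the quadratic inference. One standard way to solve such a relaxed quadratic assignment is conditional gradient (Frank-Wolfe [9]). The sketch below is our own simplification, not the paper's exact model: it uses a per-task coupling c_j (Σ_i x_{i,j})² in place of the general quadratic term (a positive c_j rewards focus-firing, which also makes the objective non-concave, so this is a heuristic ascent).

```python
def quad_inference(s, c, iters=50):
    """Frank-Wolfe ascent on F(x) = sum_ij s[i][j] x_ij + sum_j c[j] (sum_i x_ij)^2,
    with each row of x constrained to the probability simplex (one task per agent).
    Heuristic ascent sketch; with c[j] > 0 (focus-fire bonus) F is non-concave."""
    n, m = len(s), len(s[0])
    x = [[1.0 / m] * m for _ in range(n)]  # uniform relaxed start
    for k in range(iters):
        col = [sum(x[i][j] for i in range(n)) for j in range(m)]
        grad = [[s[i][j] + 2.0 * c[j] * col[j] for j in range(m)]
                for i in range(n)]
        # Linear maximization oracle: one-hot at the best gradient entry per agent.
        v = [[0.0] * m for _ in range(n)]
        for i in range(n):
            v[i][grad[i].index(max(grad[i]))] = 1.0
        gamma = 2.0 / (k + 2.0)  # standard Frank-Wolfe step size
        x = [[(1 - gamma) * x[i][j] + gamma * v[i][j] for j in range(m)]
             for i in range(n)]
    return [max(range(m), key=lambda j: x[i][j]) for i in range(n)]  # round
```

With equal pairwise scores and a positive focus bonus on one task, all agents concentrate on that task, which is the long-term coordination pattern discussed below.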
Given the frequency at which we take actions (every 6 frames), such timings allow real-time play in StarCraft.
Amongst the scenarios, the Wraith settings (wNvM) are the ones where the assumption of independence between the tasks holds best, since in this case there are no collisions between units. These scenarios also require good coordination, since it is important to focus fire on the same unit. During these battles, both armies tend to overlap totally, hence it becomes almost impossible to use surrogate coordination principles such as targeting the closest unit. In this case, the quadratic part of the score function is crucial to learn focus-firing, and the results show that without the ability to represent such a long-term coordination pattern, both LP and AMAX fail to reach the same level of performance. Notably, the coordination pattern learned by QUAD generalizes well, outperforming the best heuristics in instances as much as 5 times the size of the training instance.
The other settings, Marine (mNvM) and Zergling-Hydralisk (zhNvM), break the independence assumption because the units now have collisions. It is even worse for the Zerglings, since they are melee units. The coordination patterns are then harder to learn for the QUAD model, and they generalize poorly. However, these scenarios with collisions also tend to require less long-term coordination, and the immediate coordination patterns learned by the LP model are enough to significantly outperform the heuristics, even when transferring to unseen instances.

Table 2: StarCraft. Average win-rate of different methods (best in bold). See confidence intervals in the full table in App. ??.

Train    Test     AMAX  LP    QUAD  Best heuristic
m10v10   m5v5     0.88  0.90  0.83  0.84
m10v10   m10v10   0.77  0.94  0.83  0.82
m10v10   m10v11   0.25  0.52  0.28  0.29
m10v10   m15v15   0.75  0.92  0.69  0.77
m10v10   m15v16   0.40  0.68  0.32  0.43
m10v10   m30v30   0.69  0.74  0.06  0.36
w15v17   w15v17   0.81  0.53  0.89  0.30
w15v17   w30v34   0.90  0.76  0.99  0.37
w15v17   w30v35   0.60  0.56  0.94  0.24
w15v17   w60v67   0.07  0.33  0.72  0.13
w15v17   w60v68   0.01  0.21  0.52  0.07
w15v17   w80v82   0.32  0.11  0.36  0.03
zh10v10  zh10v10  0.86  0.90  0.83  0.84
zh10v10  zh10v11  0.30  0.46  0.24  0.40
zh10v10  zh10v12  0.03  0.06  0.01  0.06
zh10v10  zh11v11  0.87  0.87  0.75  0.80
zh10v10  zh12v12  0.85  0.82  0.64  0.75

6 Conclusion
In this paper we proposed a structured approach to multi-agent coordination. Unlike previous work, it uses an optimization procedure to compute the assignment of agents to tasks and define suitable coordination patterns. The parameterization of this optimization procedure is seen as the continuous output of an RL-trained model. We showed the effectiveness of this method on two challenging problems, in particular its ability to generalize from small to large instances.

References
[1] The GLOP linear solver. https://developers.google.com/optimization/lp/glop.
[2] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.
[3] David Churchill and Michael Buro. Portfolio greedy search and simulation for large-scale combat in StarCraft. In Computational Intelligence in Games (CIG), 2013 IEEE Conference on, pages 1-8. IEEE, 2013.
[4] David Churchill, Abdallah Saffidine, and Michael Buro. Fast heuristic search for RTS game combat scenarios. In AIIDE, pages 112-117, 2012.
[5] David Churchill, Zeming Lin, and Gabriel Synnaeve. An analysis of model-based heuristic search techniques for StarCraft combat scenarios. In AIIDE, 2017.
[6] Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for efficient reinforcement learning.
In Proceedings of the 25th international conference on Machine learning, pages\n240\u2013247. ACM, 2008.\n\n[7] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to commu-\nnicate with deep multi-agent reinforcement learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2137\u20132145. Curran\nAssociates, Inc., 2016.\n\n[8] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson.\nCounterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence,\n2018.\n\n[9] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics\n\n(NRL), 3(1-2):95\u2013110, 1956.\n\n[10] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML,\n\nvolume 2, pages 227\u2013234, 2002.\n\n[11] Carlos Guestrin, Daphne Koller, Chris Gearhart, and Neal Kanodia. Generalizing plans to new environments\nin relational mdps. In Proceedings of the 18th International Joint Conference on Arti\ufb01cial Intelligence,\nIJCAI\u201903, pages 1003\u20131010, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.\n\n[12] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. In\nS. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in\nNeural Information Processing Systems 31, pages 7265\u20137275. Curran Associates, Inc., 2018.\n\n[13] Jiechuan Jiang, I\u00f1igo Fern\u00e1ndez del Amo, and Zongqing Lu. Graph convolutional reinforcement learning\n\nfor multi-agent cooperation. CoRR, abs/1810.09202, 2018.\n\n[14] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David\n\nSilver, and Daan Wierstra. 
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[15] Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 1774-1783, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219993.

[16] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6379-6390. Curran Associates, Inc., 2017.

[17] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled Markov decision processes. pages 165-172, 1998.

[18] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.

[19] Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293-311, 2013.

[20] Scott Proper and Prasad Tadepalli. Solving Multiagent Assignment Markov Decision Processes. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '09, pages 681-688, Richland, SC, 2009. International Foundation for Autonomous Agents and Multiagent Systems.
ISBN 978-0-9817381-6-1.

[21] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

[22] Scott Sanner and Craig Boutilier. Practical linear value-approximation techniques for first-order MDPs. arXiv preprint arXiv:1206.6879, 2012.

[23] Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Individualized controlled continuous communication model for multiagent cooperative and competitive tasks. In International Conference on Learning Representations, 2019.

[24] Satinder P. Singh and David Cohn. How to dynamically merge Markov decision processes. In Advances in Neural Information Processing Systems, pages 1057-1063, 1998.

[25] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244-2252, 2016.

[26] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[27] Gerald Tesauro. Online resource allocation using decompositional reinforcement learning. In AAAI, volume 5, pages 886-891, 2005.

[28] Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.

[29] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[30] Pawel Wawrzynski. Control policy with autocorrelated noise in reinforcement learning for robotics. International Journal of Machine Learning and Computing, 5(2):91, 2015.

[31] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang.
Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.

[32] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.