{"title": "Credit Assignment For Collective Multiagent RL With Global Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 8102, "page_last": 8113, "abstract": "Scaling decision theoretic planning to large multiagent systems is challenging due to uncertainty and partial observability in the environment. We focus on a multiagent planning model subclass, relevant to urban settings, where agent interactions are dependent on their \"collective influence\" on each other, rather than their identities. Unlike previous work, we address a general setting where system reward is not decomposable among agents. We develop collective actor-critic RL approaches for this setting, and address the problem of multiagent credit assignment, and computing low variance policy gradient estimates that result in faster convergence to high quality solutions. We also develop difference rewards based credit assignment methods for the collective setting. Empirically our new approaches provide significantly better solutions than previous methods in the presence of global rewards on two real world problems modeling taxi fleet optimization and multiagent patrolling, and a synthetic grid navigation domain.", "full_text": "

Credit Assignment For Collective Multiagent RL With Global Rewards

Duc Thien Nguyen, Akshat Kumar, Hoong Chuin Lau
{dtnguyen.2014,akshatkumar,hclau}@smu.edu.sg
School of Information Systems, Singapore Management University
80 Stamford Road, Singapore 178902

Abstract

Scaling decision theoretic planning to large multiagent systems is challenging due to uncertainty and partial observability in the environment. We focus on a multiagent planning model subclass, relevant to urban settings, where agent interactions depend on their "collective influence" on each other, rather than their identities. Unlike previous work, we address a general setting where the system reward is not decomposable among agents. 
We develop collective actor-critic RL approaches for this setting, and address the problem of multiagent credit assignment and of computing low-variance policy gradient estimates that result in faster convergence to high quality solutions. We also develop difference-rewards-based credit assignment methods for the collective setting. Empirically, our new approaches provide significantly better solutions than previous methods in the presence of global rewards on two real-world problems modeling taxi fleet optimization and multiagent patrolling, and on a synthetic grid navigation domain.

1 Introduction

Sequential multiagent decision making allows multiple agents operating in an uncertain, partially observable environment to take coordinated decisions towards a long-term goal [15]. Decentralized partially observable MDPs (Dec-POMDPs) have emerged as a rich framework for cooperative multiagent planning [8], and are applicable to several domains such as multiagent robotics [4], multiagent patrolling [19], and vehicle fleet optimization [42]. Scalability remains challenging due to NEXP-hard complexity even for two-agent systems [8]. To address this complexity, various models have been explored in which agent interactions are limited by design, enforcing conditional and contextual independencies such as transition and observation independence among agents [7, 24] (where agents are coupled primarily via joint rewards), event-driven interactions [6], and weakly coupled agents [34, 44]. However, their impact remains limited due to narrow application scope.

Recent multiagent planning research has focused on models where agent interactions depend primarily on the agents' "collective influence" on each other rather than on their identities [42, 33, 30, 25, 26]. 
Such models are widely applicable in urban system optimization problems due to the insight that urban systems are often composed of a large number of nearly identical agents, such as taxis in transportation and vessels in a maritime traffic setting [2]. In our work, we focus on the collective Dec-POMDP model (CDec-POMDP), which formalizes such collective multiagent planning [25] and also generalizes "anonymity planning" models [42]. The CDec-POMDP model is based on the ideas of partial exchangeability [13, 27] and collective graphical models [31, 35]. Partial exchangeability in probabilistic inference is complementary to the notion of conditional and contextual independence, and combining all of them leads to a larger class of tractable models and inference algorithms [27].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

When only access to a domain simulator is available, without an exact model definition, several multiagent RL (MARL) approaches have been developed, such as independent Q-learning [38], counterfactual multiagent policy gradients and actor-critic methods [16, 23], multiagent Q-learning [29], SARSA-based MARL for Dec-POMDPs [14], and MARL with limited communication [47, 48]. However, most of these approaches are limited to tens of agents, in contrast to the collective setting with thousands of agents, which is the setting we target. Closely related to the collective setting we address, special MARL subclasses have been proposed to model and control population-level behaviour of agents, such as mean field RL (MFRL) [46] and mean field games (MFGs) [45, 17]. MFGs are used to learn the behaviour of a population of agents in an inverse RL setting. 
The MFRL framework does not explicitly address credit assignment, and it also requires agents to maintain individual state-action trajectories, which may not be scalable to thousands of agents, as is the case in our tested domains.

We focus on the problem of learning agent policies in a MARL setting for CDec-POMDPs. We address the crucial challenge of multiagent credit assignment in the collective setting when joint actions generate a team reward that may not be decomposable among agents. Joint team rewards make it difficult for agents to deduce their individual contribution to the team's success. Such team reward settings have been recognised as particularly challenging in the MARL literature [9, 16], and are common in disaster rescue domains (ambulance dispatch, police patrolling) where the penalty for not attending to a victim is awarded to the whole team, in team games such as StarCraft [16], and in traffic control [39]. Previous work on CDec-POMDPs developed an actor-critic RL approach for the case when the joint reward is additively decomposable among agents [25], and is unable to address non-decomposable team rewards. Therefore, we develop multiple actor-critic approaches for the general team reward setting where some (or all) joint-reward components may be non-decomposable among agents. We address two crucial issues: multiagent credit assignment, and computing low-variance policy gradient estimates for faster convergence to high quality solutions even with thousands of agents. As a baseline approach, we first extend the notion of difference rewards [41, 16], which are a popular way to perform credit assignment, to the collective setting. Difference rewards (DRs) provide a conceptual framework for credit assignment; there is no general computational technique to compute DRs in different settings. A naive extension of previous DR methods in the deep multiagent RL setting [16] is infeasible for large domains. 
Therefore, we develop novel approximation schemes that can compute DRs in the collective case even with thousands of agents.

We show empirically that DRs can result in high-variance policy gradient estimates, and are unable to provide high quality solutions when the agent population is small. We therefore develop a new approach called mean collective actor critic (MCAC) that works significantly better than DRs and MFRL across a range of agent population sizes, from 5 to 8000 agents. MCAC analytically marginalizes out the actions of agents by using an approximation of the critic. This results in low-variance gradient estimates, allows credit assignment at the level of gradients, and empirically performs better than DR-based approaches.

We test our approaches on two real-world problems motivated by supply-demand taxi matching (with 8000 taxis, or agents) and police patrolling for incident response in a city. We use real-world data for both of these problems to construct our models. We also test on a synthetic grid navigation domain. Thanks to the techniques for credit assignment and low-variance policy gradients, our approaches converge to high quality solutions significantly faster than the standard policy gradient method and the previous best approach [26]. For the police patrolling domain, our approach provides better quality than a strong baseline static allocation approach that is computed using a math program [10].

2 Collective Decentralized POMDP Model

We describe the CDec-POMDP model [25]. The model extends the statistical notion of partial exchangeability to multiagent planning [13, 27]. Previous works have mostly explored only conditional and contextual independences in multiagent models [24, 44]. CDec-POMDPs combine both conditional independences and partial exchangeability to solve much larger instances of multiagent decision making.

Definition 1 ([27]). Let X = {X1, . . .
, Xn} be a set of random variables, and let $x$ denote an assignment to $X$. Let $D_i$ denote the domain of the variable $X_i$, and let $T: \times_{i=1}^{n} D_i \to S$ be a statistic of $X$, where $S$ is a finite set. The set of random variables $X$ is partially exchangeable w.r.t. the statistic $T$ if and only if $T(x) = T(x')$ implies $\Pr(x) = \Pr(x')$.

In the CDec-POMDP model, agent identities do not matter; different model components are affected only by an agent's local state-action and a statistic of the other agents' states-actions. There are $M$ agents in the environment. An agent $m$ can be in one of the states $i \in S$. We also assume a global state component $d \in D$. The joint state space is $\times_{m=1}^{M} S \times D$. The component $d$ typically models variables common to all the agents, such as demand in the supply-demand matching case or the location of incidents in emergency response. Let $s_t, a_t$ denote the joint state-action of agents at time $t$. The joint-state transition probability is:

$$P(s_{t+1}, d_{t+1} \mid s_t, d_t, a_t) = P_g\big(d_{t+1} \mid d_t, T(s_t, a_t)\big) \prod_{m=1}^{M} P_l\big(s^m_{t+1} \mid s^m_t, a^m_t, T(s_t, a_t), d_t\big)$$

where $s^m_t, a^m_t$ denote agent $m$'s local state and action components, and $T$ is a statistic of the corresponding random variables (defined later). We assume that the local state transition function $P_l$ is the same for all the agents. Such an expression conveys that only the statistic $T$ of the joint state-action and an agent's local state-action are sufficient to predict the agent's next state.

Observation function: We assume a decentralized and partially observable setting in which each agent receives only a partial observation about the environment. 
Let the current joint-state be $(s_t, d_t)$ after the last joint-action; the observation for agent $m$ is then given by the function $o_t(s^m_t, d_t, T(s_t))$. In the taxi supply-demand case, the observation for a taxi in zone $z$ can be the local demand in zone $z$ and the counts of other taxis in $z$ and the neighbouring zones of $z$. No agent has a complete view of the system.

The reward function is $r(s_t, d_t, a_t) = \sum_m r_l(s^m_t, a^m_t, d_t, T(s_t, a_t)) + r_g(d_t, T(s_t, a_t))$, where $r_l$ is the local reward for individual agents, and $r_g$ is the non-decomposable global reward. Given that the reward function $r_l$ is the same for all the agents, we can further simplify it as $\sum_{i,j} n(i,j)\, r_l(i, j, d_t, T(s_t, a_t)) + r_g(d_t, T(s_t, a_t))$, where $n(i,j)$ is the number of agents in state $i$ taking action $j$ given the joint state-action $(s_t, a_t)$. We assume that the initial state distribution, $b_o(i)\ \forall i \in S$, is the same for all the agents; the initial distribution over global states is $b^g_o(d)\ \forall d$.

The model components defined above can also differentiate among agents by using the notion of agent types, which can be included in an agent's state space $S$; each agent can receive its type as part of its observation. In the extreme case, each agent would be of a different type, representing a fairly general multiagent planning problem. However, the main benefit of the model lies in settings where the number of agent types is much smaller than the number of agents.

We consider a finite-horizon problem with $H$ time steps. Each agent has a non-stationary reactive policy that takes as input the agent's current state $i$ and the observation $o$, and outputs the probability of the next action $j$ as $\pi^m_t(j \mid i, o)$. Such a policy is analogous to finite-state controllers in POMDPs and Dec-POMDPs [28, 3]. Let $\pi = \langle \pi^1, \ldots, \pi^M \rangle$ denote the joint-policy. 
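Because agent identities do not matter, the model's statistic reduces a joint state-action to a table of counts. A minimal sketch of computing $n(i,j)$ from per-agent lists (the states, actions, and their values below are invented for illustration):

```python
from collections import Counter

def count_statistic(states, actions):
    """Count table n(i, j): number of agents in state i taking action j.

    `states` and `actions` are per-agent lists; agent identity never enters
    the statistic, only how many agents fall in each (state, action) bucket.
    """
    return Counter(zip(states, actions))

# Hypothetical 5-agent example with states {0, 1} and actions {'stay', 'move'}:
states = [0, 0, 1, 1, 1]
actions = ['stay', 'move', 'move', 'move', 'stay']
n = count_statistic(states, actions)
```

Any permutation of the agents yields the same table `n`, which is exactly the partial-exchangeability property the model exploits.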
The goal is to optimize the value $V(\pi) = \sum_{t=1}^{H} \mathbb{E}[r_t \mid b_o, b^d_o]$.

Global rewards: The key difference from previous works [25, 26] is that in our model we have a global reward signal $r_g$ that is not decomposable among individual agents, which is crucial for modeling real-world applications. Consider the real-world multiagent patrolling problem in figure 1. A set of homogeneous police patrol cars (or agents) are stationed in predefined geographic regions to respond to incidents that may arise over a shift (say 7AM to 7PM). When an incident comes in, the central command unit dispatches the closest patrol car to the incident location. The dispatched car becomes unavailable for some amount of time (including travel and incident service time). To cover for the engaged car, other available patrol cars from nearby zones may need to reallocate themselves so that no zones are left vulnerable. The reward in this system depends on the response time to incidents (e.g., the threshold to attend to urgent incidents is 10 min, non-urgent 20 min). The goal is to compute a reallocation policy for agents that minimizes the number of unsatisfied incidents, those where the response time exceeded the specified threshold. To model this objective, we award a penalty of -10 whenever the response time requirement of an incident is not met, and 0 otherwise. In this domain, the delay penalty is non-decomposable among patrol cars. It is not reasonable to attribute the penalty for an incident to its assigned agent because the delay is due to the intrinsic system-wide supply-demand mismatch. Furthermore, individual agent penalties may even discourage agents from going to nearby critical sectors, which is undesirable (we observed this empirically). 

[Figure 1: Solid black lines define 24 patrolling zones of a city district]

Indeed, in this domain all rewards are global; therefore, previous approaches that require local rewards for agents are not applicable. This is precisely the gap our work targets, and it significantly increases the applicability of multiagent decision making to real-world applications.

Statistic for Planning: We now describe the statistic $T$, which increases the scalability and the generalization ability of solution approaches. For a given joint state-action $(s_t, a_t)$, we define $T(s_t, a_t) = (n_t(i,j)\ \forall i \in S, j \in A)$, where each entry $n_t(i,j) = \sum_m \mathbb{I}\{(s^m_t, a^m_t) = (i,j)\}$ counts the number of agents that are in state $i$ and take action $j$. We can similarly define $T(s_t) = (n_t(i)\ \forall i \in S)$, which counts the number of agents in each state $i$. For clarity, we denote $T(s_t, a_t)$, the state-action count table, as $n^{sa}_t$, and the state count table as $n^s_t$. Given a transition $(s_t, a_t, s_{t+1})$, we define the count table $n^{sas}_t = (n^{sas}_t(i,j,i')\ \forall i, i' \in S, j \in A)$, which counts the number of agents in state $i$ that took action $j$ and transitioned to the next state $i'$. The complete count table is denoted as $n_t = (n^s_t, n^{sa}_t, n^{sas}_t)$.

In collective planning settings, the agent population size is typically very large ($\approx 8000$ for our real-world experiments). Given such a large population, it is infeasible to compute a unique policy for each agent. Therefore, similar to previous works [43, 26], our goal is to compute a homogeneous stochastic policy $\pi_t(j \mid i, o_t(i, d_t, n^s_t))$ that outputs the probability of each action $j$ for an agent in state $i$ receiving the local observation $o_t$, which depends on the state variable $d_t$ and the state counts $n^s_t$ at time $t$. As the policy $\pi$ depends on count-based observations, it represents an expressive class of policies. Let $n_{1:H}$ be the combined vector of count tables over all the time steps. 
Let $\Omega_{1:H}$ be the space of consistent count tables satisfying the constraints:

$$\forall t:\quad \sum_{i \in S} n^s_t(i) = M; \quad \sum_{j \in A} n^{sa}_t(i,j) = n^s_t(i)\ \forall i \in S; \quad \sum_{i' \in S} n^{sas}_t(i,j,i') = n^{sa}_t(i,j)\ \forall i \in S, j \in A \quad (1)$$

$$\sum_{i \in S, j \in A} n^{sas}_t(i,j,i') = n^s_{t+1}(i')\ \forall i' \in S \quad (2)$$

Count tables $n_{1:H}$ are the sufficient statistic for planning in CDec-POMDPs.

Theorem 1 ([25]). The joint value function of a policy $\pi$ over horizon $H$, given by the expectation of the joint reward $r$, $V(\pi) = \sum_{t=1}^{H} \mathbb{E}[r_t]$, can be computed by the expectation over counts:

$$V(\pi) = \sum_{n_{1:H} \in \Omega_{1:H},\, d_{1:H}} P(n_{1:H}, d_{1:H}; \pi) \left[ \sum_{t=1}^{H} r_t\big(n^{sa}_t, d_t\big) \right]$$

where $r_t(n^{sa}_t, d_t) = \sum_{i,j} n^{sa}_t(i,j)\, r_l(i, j, d_t, n^{sa}_t) + r_g(d_t, n^{sa}_t)$.

We show in the appendix how to sample from the distribution $P(n_{1:H}, d_{1:H}; \pi)$ directly, without sampling individual state-action trajectories.

Scalability to large agent populations: Since sampling individual agent trajectories is not required, the count-based computation can scale to thousands of agents. These benefits also extend to computing policy gradients, which likewise depend only on the counts $n$. Furthermore, different data structures and function approximators, such as the policy $\pi$ and the action-value function, depend only on the counts $n$. Such a setting is computationally tractable because if we change only the agent population, the dimensions of the count table remain fixed; only the counts of agents in the different buckets (e.g., $n(i)$, $n(i,j)$) change. 
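The flow-consistency constraints (1) and (2) above can be checked mechanically for a single time step. A minimal sketch, assuming dict-based count tables keyed by states/actions (the 3-agent example data is invented):

```python
from itertools import product

def consistent(ns, nsa, nsas, ns_next, S, A, M):
    """Check the per-step count-table constraints (1)-(2): state counts sum
    to M, n^sa marginalizes to n^s, and n^sas marginalizes to n^sa at time t
    and to n^s at time t+1 (agent flow is conserved)."""
    return (
        sum(ns[i] for i in S) == M
        and all(sum(nsa[i, j] for j in A) == ns[i] for i in S)
        and all(sum(nsas[i, j, k] for k in S) == nsa[i, j] for i, j in product(S, A))
        and all(sum(nsas[i, j, k] for i, j in product(S, A)) == ns_next[k] for k in S)
    )

# Hypothetical 3-agent example: two states, one action.
S, A, M = [0, 1], [0], 3
ns      = {0: 2, 1: 1}
nsa     = {(0, 0): 2, (1, 0): 1}
nsas    = {(0, 0, 0): 1, (0, 0, 1): 1, (1, 0, 0): 0, (1, 0, 1): 1}
ns_next = {0: 1, 1: 2}
```

A count-based simulator only ever needs to produce tables passing this check; individual trajectories are never materialized.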
Such count-based formulations also extend the generalization ability of RL approaches, as multiple joint state-actions $(s_t, a_t, d_t)$ can give rise to the same statistic $n$. Our goal is to compute the optimal policy $\pi$ that maximizes $V(\pi)$.

Learning framework: We follow a centralized-learning, decentralized-execution RL framework. Such centralized learning is possible in the presence of domain simulators [16, 23]. We assume access only to a domain simulator that provides count samples $n$ and the team reward $r$. During centralized training, we have access to all the count-based information, which helps define a centralized action-value function, resulting in faster convergence to good solutions. During policy execution, agents execute individual policies without accessing centralized functions. In single-agent RL, the agent experiences the tuple $(s_t, a_t, s_{t+1}, r_t)$ by interacting with the environment. In the collective case, given that the sufficient statistic is counts, we simulate and learn at the abstraction of counts. The experience tuple for the centralized learner is $(n^s_t, d_t, n^{sa}_t, n^{sas}_t, d_{t+1}, r_t)$. The current joint-state statistic is $(n^s_t, d_t)$; observations are generated for agents from this statistic and fed into policies. The output is the state-action counts $n^{sa}_t$. As a result of this joint action, agents transition to new states, and their joint transitions are recorded in the table $n^{sas}_t$. Given the constraint set $\Omega$ in (1), the next count table $n^s_{t+1}$ is computed from $n^{sas}_t$; $d_{t+1}$ is the next global state. The joint reward is $r_t(n^{sa}_t, d_t)$. The appendix shows how the simulator generates such count-based samples.

Actor-critic based MARL: We follow an actor-critic (AC) based policy gradient approach [21]. The policy $\pi$ is parameterized using $\theta$. The parameters are adjusted to maximize the objective $J(\theta) = \mathbb{E}[\sum_{t=1}^{H} r_t]$ by taking steps in the direction of $\nabla_\theta J(\theta)$, which is shown in [26] to be:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{H} \mathbb{E}_{d_t, n^{sa}_t \mid b_o, b^d_o, \pi}\left[ Q^\pi_t(n^{sa}_t, d_t) \left( \sum_{i \in S, j \in A} n^{sa}_t(i,j)\, \nabla_\theta \log \pi_t(j \mid i, o(i, d_t, n^s_t)) \right) \right] \quad (3)$$

where $Q^\pi_t$ is the expected return $\mathbb{E}[\sum_{T=t}^{H} r_T \mid d_t, n^{sa}_t]$. The above expression can be evaluated by sampling counts $n$. In the AC approach, the policy $\pi$ is termed the actor. We can estimate $Q^\pi_t$ using empirical returns, but this has high variance. To remedy this, AC methods often use a function approximator for $Q^\pi$ (say $Q_w$), termed the critic. We consider the critic $Q_w$ to be a continuous function (e.g., a deep neural network) instead of a function defined only for integer inputs. This allows us to compute the derivative of $Q_w$ with respect to all the input variables, which will be useful later. The critic can be learned from empirical returns using temporal-difference learning. We next show several techniques for estimating the collective policy gradient $\nabla_\theta J(\theta)$ that help with the credit assignment problem and provide low-variance gradient estimates even for a very large number of agents.

3 Difference Rewards Based Credit Assignment

Difference rewards provide a powerful way to perform credit assignment when there are many agents, and have been explored extensively in the MARL literature [41, 1, 39, 40, 12]. Difference rewards (DRs) are shaped rewards that help individual agents filter out the noise from the global reward signal (which includes effects of other agents' actions) and assess their individual contribution to the global reward. 
As such, there is no general technique to compute DRs for different problems. We therefore develop novel methods to approximately compute two popular types of DRs, wonderful life utility (WLU) and aristocratic utility (AU) [41], for the collective case.

Wonderful Life Utility (WLU): Let $s, a$ denote the joint state-action and $r(s,a)$ the system reward. The WLU-based DR for an agent $m$ is $r_m = r(s, a) - r(s, a_{-m})$, where $a_{-m}$ is the joint-action without agent $m$. The WLU DR compares the global reward to the reward received when agent $m$ is not in the system. Agent $m$ can use this shaped reward $r_m$ for its individual learning. However, extracting such shaped rewards from the simulator is very challenging and is not feasible for a large number of agents. Therefore, we apply this reasoning to the critic (or action-value function approximator) $Q_w(n^{sa}, d)$. Analogous to WLU, we define the WLQ (wonderful life Q-function) for an agent $m$ as $Q_m = Q_w(n^{sa}, d) - Q_w(n^{sa}_{-m}, d)$, where $n^{sa}_{-m}$ is the state-action count table without agent $m$.

For a given $(n^{sa}, d)$, we show how to estimate $Q_m$. Assume that agent $m$ is in some state $i \in S$ and performs action $j \in A$. As agents do not have identities, we use $Q_{ij}$ to denote the WLQ for any agent in state-action $(i,j)$. Let $e_{ij}$ be a vector with the same dimension as $n^{sa}$; all entries of $e_{ij}$ are zero except a value of 1 at the index corresponding to state-action $(i,j)$. We have $Q_{ij} = Q_w(n^{sa}, d) - Q_w(n^{sa} - e_{ij}, d)$. Typically, the critic $Q_w$ is represented using a neural network; we normalize all count inputs to the network (denoted $\tilde{n}^{sa} = n^{sa}/M$) using the total agent population $M$. 
We now estimate the WLQ assuming that $M$ is large. With $\Delta = 1/M$:

$$Q_{ij} = Q_w\big(\tilde{n}^{sa}, d\big) - Q_w\big(\tilde{n}^{sa} - \Delta\, e_{ij}, d\big) = -\left[ Q_w\big(\tilde{n}^{sa} - \Delta\, e_{ij}, d\big) - Q_w\big(\tilde{n}^{sa}, d\big) \right] \approx -\left( -\Delta\, \frac{\partial Q_w}{\partial \tilde{n}^{sa}(i,j)}\big(\tilde{n}^{sa}, d\big) \right) \quad (4)$$

where the approximation uses the definition of the total differential and becomes exact as $\Delta = 1/M \to 0$. Therefore:

$$Q_{ij} \approx \frac{1}{M}\, \frac{\partial Q_w}{\partial \tilde{n}^{sa}(i,j)}\big(\tilde{n}^{sa}, d\big) \quad (5)$$

Thus, upon experiencing the tuple $(n^s_t, d_t, n^{sa}_t, n^{sas}_t, d_{t+1}, r_t)$, the global reward $r_t$ is used to train the global critic $Q_w$. An agent $m$ in state-action $(i,j)$ accumulates the gradient term $Q_{ij} \nabla_\theta \log \pi_t(j \mid i, o(i, d_t, n^s_t))$ as per the standard policy gradient result [37] (notice that the policy $\pi$ is the same for all the agents). Given that there are $n^{sa}_t(i,j)$ agents performing action $j$ in state $i$, the total accumulated gradient based on the WLQ updates (5) by all the agents over all time steps is:

$$\nabla^{wlq}_\theta J(\theta) = \sum_{t=1}^{H} \mathbb{E}_{d_t, n^{sa}_t \mid b_o, b^d_o}\left[ \sum_{i \in S, j \in A} n^{sa}_t(i,j)\, Q^t_{ij}\, \nabla_\theta \log \pi_t(j \mid i, o(i, d_t, n^s_t)) \right] \quad (6)$$

We can estimate $\nabla^{wlq}_\theta J(\theta)$ by sampling the counts and the state $d_t$ for all the time steps.

Aristocratic Utility (AU): For a given joint state-action $(s, a)$, the AU-based DR for an agent $m$ is defined as $r_m = r(s, a) - \sum_{a_m} \pi_m(a_m \mid o_m(s))\, r(s, a_{-m} \cup a_m)$, where $a_{-m} \cup a_m$ is the joint-action in which agent $m$'s action in $a$ is replaced with $a_m$; $o_m$ is the observation of agent $m$; $\pi_m$ is the probability of action $a_m$. The AU marginalizes over all the actions of agent $m$, keeping the other agents' actions fixed. We next apply AU-based reasoning to the critic $Q_w$. For a given $(n^{sa}, d)$, we define $A_{ij}$ as the counterfactual advantage function for an agent in state $i$ taking action $j$:

$$A_{ij} = Q_w(n^{sa}, d) - \sum_{j'} \pi\big(j' \mid i, o(i, d, n^s)\big)\, Q_w\big(n^{sa} - e_{ij} + e_{ij'}, d\big) \quad (7)$$

where the vectors $e_{ik}$ are defined as for the WLQ. Such advantages have been used by [16]. However, in our setting, computing them naively is prohibitively expensive as the number of agents is large (in the thousands). Therefore, we use the same technique as for the WLQ, normalizing the counts and computing the differentials $\lim_{\Delta = 1/M \to 0} \big[ Q_w(\tilde{n}^{sa}, d) - Q_w(\tilde{n}^{sa} + \Delta \cdot (e_{ij'} - e_{ij}), d) \big]$; the final estimate is (proof in appendix):

$$A^t_{ij}(n^{sa}_t, d_t) \approx \frac{1}{M}\left[ \frac{\partial Q_w}{\partial \tilde{n}^{sa}(i,j)}(n^{sa}_t, d_t) - \sum_{j'} \pi\big(j' \mid i, o(i, d_t, n^s_t)\big)\, \frac{\partial Q_w}{\partial \tilde{n}^{sa}(i,j')}(n^{sa}_t, d_t) \right] \quad (8)$$

Crucially, the above computation is independent of the agent population $M$, and is thus highly scalable. Using the same reasoning as for the WLQ, the gradient $\nabla^{au}_\theta$ is exactly the same as (6) with $Q^t_{ij}$ replaced by the advantages $A^t_{ij}$ in (8). Empirically, we observed that using the advantages $A^t_{ij}$ resulted in better quality, because the additional term $\sum_{j'}$ in $A_{ij}$ acts as a baseline and reduces variance.

4 Mean Collective Actor Critic: Credit Assignment and Low-Variance Gradients

Notice that computing the gradients $\nabla^{au}_\theta, \nabla^{wlq}_\theta$ for DRs requires taking an expectation over the state-action counts $n^{sa}$ (see (6)), which can have high variance. 
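The normalized-count approximations (5) and (8) can be sanity-checked on a toy smooth critic. This is an illustrative sketch only; the quadratic critic, its random weights, and the single-state action block are all invented for the example:

```python
import numpy as np

M = 1000                                  # agent population
A = 3                                     # number of actions in one state's count block
rng = np.random.default_rng(0)
W = rng.normal(size=(A, A))
W = W + W.T                               # toy symmetric quadratic critic weights

def Q(n_tilde):
    """Hypothetical smooth critic over normalized counts n~ = n / M."""
    return float(n_tilde @ W @ n_tilde)

n = rng.multinomial(M, [0.5, 0.3, 0.2]).astype(float)
grad = 2.0 * W @ (n / M)                  # analytic gradient of Q at n~

# WLQ, Eq. (5): exact one-agent difference vs (1/M) * dQ/dn~(i, j).
j = 1
e = np.eye(A)[j]
wlq_exact = Q(n / M) - Q((n - e) / M)
wlq_approx = grad[j] / M

# Counterfactual advantage, Eq. (8), here for a uniform policy pi(. | i).
pi = np.full(A, 1.0 / A)
adv_approx = (grad[j] - pi @ grad) / M
```

For a quadratic critic the gap between `wlq_exact` and `wlq_approx` is of order $1/M^2$, consistent with the total-differential argument behind (4)-(5).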
Furthermore, the DR approximation is accurate only when the agent population $M$ is large; for smaller populations we empirically observed a drop in solution quality using DRs. We next show how to address these limitations by developing a new approach, mean collective actor critic (MCAC), which is robust across a range of population sizes and empirically works better than DRs in several problems.

• We develop an alternative formulation of the policy gradient (3) that allows us to analytically marginalize out the state-action counts $n^{sa}_t$. By analytically computing the expectation over counts, the variance of the gradient estimates can be reduced, as also shown for MDPs in [11, 5].
• We show that a factored critic structure is particularly suited for credit assignment, and also allows analytical gradient computation by using results from collective graphical models [22].
• However, a factored critic is not effectively learnable with global rewards. Our key insight is to learn a global critic which is not factorizable among agents. Instead of computing gradients from this critic, we estimate gradients from its first-order Taylor approximation, which fortunately is factored among agents, and fits well with the previous two results.

Variance reduction of the gradient using expectation: Before reformulating the gradient expression (3), we first define $P^\pi(n^{sa}_t \mid n^s_t, d_t)$ as the collective distribution of the action counts given the action probabilities $\pi$ and the state counts:

$$P^\pi(n^{sa}_t \mid n^s_t, d_t) = \prod_{i \in S} \left[ \frac{n^s_t(i)!}{\prod_{j \in A} n^{sa}_t(i,j)!} \prod_{j \in A} \pi\big(j \mid i, o(i, d_t, n^s_t)\big)^{n^{sa}_t(i,j)} \right] \quad (9)$$

The above is a multinomial distribution: for each state $i$, we perform $n^s_t(i)$ trials independently (one for each of the $n^s_t(i)$ agents). Each trial's outcome is an action $j \in A$ with probability $\pi(j \mid i, o(i, d_t, n^s_t))$.

Proposition 1. The collective policy gradient in (3) can be reformulated as:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{H} \mathbb{E}_{n^s_t, d_t \mid b_o, b^d_o}\left[ \sum_{n^{sa}} Q^\pi_t(n^{sa}, d_t)\, \nabla_\theta P^\pi(n^{sa} \mid n^s_t, d_t) \right] \quad (10)$$

The proof is provided in the appendix. In the above expression we sample $(n^s_t, d_t)$ and analytically marginalize out the state-action counts $n^{sa}_t$, which results in lower variance than using (3) directly to estimate gradients. In the AC approach, we use a critic to approximate $Q^\pi$. However, not all types of critics enable this analytical marginalization over the state-action counts.

Critic design for multiagent credit assignment: We now investigate the special structure required of the critic $Q_w$ that enables the analytical computation required in (10), and also helps with multiagent credit assignment. One solution studied in several previous works is a linear decomposition of the critic among agents [36, 18, 20]: $Q_w(s_t, d_t, a_t) = \sum_{m=1}^{M} f^m_w\big(s^m_t, a^m_t, o(s^m_t, d_t, n^s_t)\big)$.

Such a factored critic structure is particularly suited for credit assignment, as we are explicitly learning $f^m_w$ as agent $m$'s contribution to the global critic value. Crucially, we also show that the policy gradient computed using such a critic is itself factored among agents, which is essentially credit assignment at the level of gradients. In the collective setting, counts are the sufficient statistic for planning, and we assume a homogeneous stochastic policy. 
Therefore, the critic\nsimpli\ufb01es as: Qw(nsa\nde\ufb01nition of fw that may depend on entire state counts ns\nGaussian approximation of collective graphical models [22].\nTheorem 2. A linear critic, Qw(nsa\nt (i, j)fw\nb only depends on (dt, ns\n\nt)(cid:1). The next result uses a more general\n(cid:0)i, j, dt, ns\n\nt), has the expected policy gradient under the policy \u03c0\u03b8 as:\n\n(cid:0)i, j, o(i, dt, ns\n\nt. Proof (in appendix) uses results from\n\n(cid:1)+b(dt, ns\n\nt) where function\n\nt (i, j)fw\n\ni,j nsa\n\ni,j nsa\n\nt\n\nQw(nsa, dt)\u2207\u03b8P \u03c0(nsa | ns\n\nt(i)\u2207\u03b8\u03c0\u03b8\nns\n\nt (j|i, o(i, dt, ns\n\nt))fw\n\n(cid:0)i, j, dt, ns\n\nt\n\n(cid:1)\n\n(11)\n\nt , dt) =(cid:80)\n(cid:88)\n\nt, dt) =\n\ni\u2208S,j\u2208A\n\n(cid:88)\n\nnsa\n\nLearning the critic from global rewards: The factored critic used in theorem 2 has two major\ndisadvantages. First, learning the factored critic from global returns is not effective as crediting\nempirical returns into contributions from different agents is dif\ufb01cult to learn without requiring too\nmany samples. Second, the critic components fw are based on an agent\u2019s local state, action while\nignoring other agents\u2019 policy and actions which increases the inaccuracy as both local and global\nrewards are affected by other agents\u2019 actions.\nOur key insight is that instead of learning a decomposable critic, we learn a global critic which is\nnot factorized among agents. This addresses the problem of learning from global rewards; as the\ncritic is de\ufb01ned over the input from all the agents (count tables n in our case). However, instead\nof computing policy gradients directly from the global critic, we compute gradients from a linear\napproximation to the global critic using \ufb01rst-order Taylor approximation. Actor update using linear\napproximation of the critic is studied previously for MDPs in [11, 32]. 
Given a small step size, the linear approximation is sufficient to estimate the direction of the policy gradient to move towards a higher value. For our case, a linear critic addresses both the credit assignment problem and low variance gradient estimates. Consider the global critic Q_w(n^{sa}_t, d_t) and its first-order Taylor expansion at the mean value of the action counts n^{\star sa}_t = \mathbb{E}[n^{sa}_t \mid n^s_t, d_t] = \langle n^s_t(i)\,\pi(j \mid i, o(i, d_t, n^s_t)) \;\forall i, j \rangle, with \pi as the current policy:

Q_w(n^{sa}_t, d_t) \approx Q_w(n^{\star sa}_t, d_t) + (n^{sa}_t - n^{\star sa}_t)^\top \big(\nabla_{n^{sa}} Q_w(n^{sa}, d_t)\big|_{n^{sa} = n^{\star sa}_t}\big)    (12)

Upon re-arranging the above, it fits the critic structure in theorem 2:

Q_w(n^{sa}_t, d_t) \approx \sum_{i,j} n^{sa}_t(i,j) \frac{\partial Q_w}{\partial n^{sa}(i,j)}(n^{\star sa}_t, d_t) + \Big[ Q_w(n^{\star sa}_t, d_t) - (n^{\star sa}_t)^\top \big(\nabla_{n^{sa}} Q_w(n^{sa}, d_t)\big|_{n^{sa} = n^{\star sa}_t}\big) \Big]

Using theorem 2 and proposition 1, we have (proof in the appendix):

Corollary 1. Using the first-order Taylor approximation of the critic at the expected state-action counts n^{\star sa}_t = \mathbb{E}[n^{sa}_t \mid n^s_t, d_t; \pi], the collective policy gradient is:

\nabla_\theta J(\theta) \approx \sum_{t=1}^{H} \mathbb{E}_{n^s_t, d_t \mid b_o, b_d} \Big[ \sum_{i \in S, j \in A} n^s_t(i) \, \nabla_\theta \pi_t\big(j \mid i, o_t(i, d_t, n^s_t)\big) \frac{\partial Q_w}{\partial n^{sa}(i,j)}(n^{\star sa}_t, d_t) \Big]    (13)

[Figure 2: Different metrics on the taxi problem with different penalty weights w: (a) objective value; (b) objective value; (c) avg. profit/taxi/day; (d) avg. unserved demand.]

Intuitively, the terms \partial Q_w / \partial n(i,j) facilitate credit assignment; such terms also occur in the DR based formulations (section 3).
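The linearization step can be sketched numerically: form the expected counts n^{\star sa}, differentiate a global (non-factored) critic there, and use the resulting per-(i, j) terms as credit signals, as in (12)-(13). The critic below is an invented smooth function of the counts, standing in for a learned network; all numbers are toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A = 3, 2
ns = np.array([8, 4, 6])                 # agent counts per state (toy values)
pi = rng.dirichlet(np.ones(A), size=S)   # homogeneous policy pi(j | i)

def global_critic(nsa):
    # A hypothetical non-factored critic over the whole count table; any
    # smooth function of the counts works for the linearization.
    return np.tanh(nsa / 10.0).sum() + 0.01 * (nsa ** 2).sum()

# Expected state-action counts n*^sa(i, j) = n^s(i) * pi(j | i).
n_star = ns[:, None] * pi

# Central finite differences give dQ_w / d n^sa(i, j) at n_star; each entry is
# the per-(state, action) credit signal that drives the gradient in (13).
eps = 1e-5
credit = np.zeros_like(n_star)
for i in range(S):
    for j in range(A):
        e = np.zeros_like(n_star)
        e[i, j] = eps
        credit[i, j] = (global_critic(n_star + e) - global_critic(n_star - e)) / (2 * eps)

# First-order Taylor model (12) around n_star, evaluated at a sampled count table.
nsa = np.stack([rng.multinomial(ns[i], pi[i]) for i in range(S)]).astype(float)
taylor = global_critic(n_star) + ((nsa - n_star) * credit).sum()
print(abs(taylor - global_critic(nsa)))  # small: the linearization is accurate nearby
```

In the actual method the critic is differentiable, so `credit` would come from backpropagation rather than finite differences; the role of each entry is the same.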
When this term has a high value, it implies that a higher count of agents in state i taking action j would increase the overall critic value Q. This encourages more agents to take action j in state i. Each term \partial Q_w / \partial n(i,j) is evaluated at the expected state-action counts n^{\star sa}_t, which in turn depend on the policy and actions of other agents. Thus, it overcomes the second limitation of the factored critic in theorem 2, where the terms f_w ignore the policies and actions of other agents.

5 Experiments

We test the aristocratic utility based approach (called 'CCAC' or collective counterfactual AC) that uses gradient estimates (8), and the mean collective AC ('MCAC') that uses (13). We test against (a) the standard AC approach, which fits the critic using global rewards and computes gradients from the global critic; (b) the factored actor critic ('fAfC') approach of [26], the previous best approach for CDec-POMDPs with decomposable rewards; and (c) the average flow based solver ('AverageFlow') of [42]. In some domains (specifically the taxi problem), we have both local and global rewards. The local rewards are incorporated in 'fAfC' as before; for global rewards, we change the training procedure of the critic in 'fAfC' (the different AC updates are shown in the appendix). We test on two real world domains: taxi supply-demand matching [43], and the police patrolling problem [10].

Taxi Supply-Demand Matching: The dataset consists of taxi demands (GPS traces of the taxi movement and their hired/unhired status) in an Asian city over 1 year. The fleet contains 8000 taxis (or agents), with the city divided into 81 zones. Environment dynamics are similar to [43].
The environment is uncertain (due to stochastic demand) and partially observable, as each taxi observes the count of other taxis and the demand in the current zone and geographically connected neighbor zones, and decides its next action (stay, or move to a neighboring zone). Over the plan horizon of 48 half-hour intervals, the goal is to compute policies that enable strategic movement of taxis to optimize the total fleet profit. Individual rewards model the revenue each taxi receives. Global rewards model quality-of-service (QoS) by giving a high positive reward when the ratio of available taxis to the current demand in a zone is greater than some threshold, and a negative reward when the ratio is below the set QoS. We selected the 15 busiest zones for such global rewards. To enforce the QoS level \alpha = 95\% for each zone i and time t, we add penalty terms \min\big(0, w \times (\hat{d}_t(i) - \alpha d_t(i))\big), where w is the penalty weight, \hat{d}_t(i) is the total served demand at time t, and d_t(i) is the total demand at time t. We test the effect of the QoS penalty by using weights w \in [0, 10.0]. We normalize all trip payments to (0, 1), which implies that the penalty for missing a customer over the QoS threshold is roughly w times the negative of the maximum reward for serving a customer.

Figure 2(a) shows the quality comparisons (higher is better) between MCAC (CCAC is almost identical to MCAC) and fAfC with varying penalty w. It shows that with increasing w, fAfC becomes significantly worse than MCAC. We next investigate the reason. Figure 2(b) summarizes quality comparisons among all approaches for three settings of w. The results confirm that both MCAC and CCAC provide similar quality, and are the best performing among all approaches. 'AverageFlow' and 'AC' are much worse off due to the presence of global rewards.
As the weight w increases from 0 to 10, the difference between CCAC/MCAC and fAfC increases significantly. This is because a higher w puts more emphasis on optimizing global rewards. Figure 2(d) shows the unserved demand below the QoS threshold, i.e., (\alpha \cdot d_t(i) - \hat{d}_t(i)), averaged over all 15 zones and all time steps (AC and AverageFlow are omitted as their high numbers distort the figure). When the penalty increased from w = 0 to 1 in figure 2(c), MCAC/CCAC still maintained similar individual profits, but their unserved demand decreased significantly (by 32%), as shown in figure 2(d). Thus, CCAC/MCAC maintain individual profits while still reducing the global penalty, and are therefore effective with global rewards. In contrast, the unserved demand of fAfC does not decrease much from w = 0 to w = 1, 10, because the QoS penalty constitutes global rewards, whereas 'fAfC' is optimized for decomposable rewards.

[Figure 3: (a)-(c) Police patrolling problem: (a) algorithm convergence, (b) objective value, (c) unsatisfied percentage; (d)-(h) synthetic grid patrolling with varying population M and grid sizes.]

Police Patrolling: The problem is introduced in section 2. There are 24 city zones and 16 patrol cars (or agents). We have access to real world data about all incidents for 31 days in 24 zones. Roughly 50-60 incidents happen per day (7AM-7PM shift).
The goal is to compute a reallocation policy for agents such that the number of incidents with response time more than the threshold is minimized (further details in the appendix). This domain has only global rewards. Therefore, we compare MCAC, CCAC and AC (fAfC and AverageFlow are unable to model this domain). As a baseline, we compare against a static allocation of patrol cars that is optimized using a stochastic math program [10], denoted 'MIP'. Figure 3(a) shows the convergence results. MCAC performs much better than CCAC. This is because this problem is sparse, with sparse count tables n^{sa}, resulting in higher gradient variance for CCAC; MCAC marginalizes out n^{sa}, and thus has lower variance. Figure 3(b) shows overall objective comparisons (higher is better) among all three approaches. It confirms that MCAC is the best approach. MCAC has 7.8% incidents where response time was more than the threshold, versus 9.32% for MIP (figure 3(c)). Even this improvement is significant, as it allows ≈25 more incidents to be served within the threshold over 31 days (assuming 55 avg. incidents/day). In emergency scenarios, improving response time even by a few minutes is potentially life saving.

To further compare CCAC and MCAC, we created a synthetic grid patrolling problem, also inspired by police patrolling, where we vary grid sizes and agent population (domain details in the appendix). Figures 3(d-h) show convergence plots. In these problems, CCAC performs much worse (even worse than AC), as these problems are sparse with sparse state-action counts n^{sa}. This makes its gradient variance higher than MCAC's, which again performs best.
To summarize, when the population size is large and state-action counts are dense (as in the taxi problem with M = 8000), both CCAC and MCAC give similar quality; but for small population sizes (as in grid patrolling with M = 5), MCAC is more robust than CCAC and AC.

6 Summary

We developed several approaches for credit assignment in collective multiagent RL. We extended the notion of difference rewards to the collective setting and showed how to compute them efficiently even for very large agent populations. To further reduce the gradient variance, we developed a number of results that analytically marginalize out agents' actions from the gradient expression. This approach, called MCAC, was more robust than the difference rewards based approach across a number of problem settings and consistently provided better quality over varying agent population sizes.

7 Acknowledgments

This research project is supported by the National Research Foundation Singapore under its Corp Lab @ University scheme and Fujitsu Limited. The first author is also supported by an A*STAR graduate scholarship.

References

[1] Adrian K. Agogino and Kagan Tumer. Unifying temporal and structural credit assignment problems. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 980–987, 2004.

[2] Lucas Agussurja, Akshat Kumar, and Hoong Chuin Lau. Resource-constrained scheduling for maritime traffic management.
In AAAI Conference on Artificial Intelligence, pages 6086–6093, 2018.

[3] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Autonomous Agents and Multi-Agent Systems, 21(3):293–320, 2010.

[4] Christopher Amato, George Konidaris, Gabriel Cruz, Christopher A. Maynor, Jonathan P. How, and Leslie Pack Kaelbling. Planning for decentralized control of multiple robots under uncertainty. In IEEE International Conference on Robotics and Automation (ICRA), pages 1241–1248, 2015.

[5] Kavosh Asadi, Cameron Allen, Melrose Roderick, Abdel-Rahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. arXiv preprint arXiv:1709.00503, 2017.

[6] Raphen Becker, Shlomo Zilberstein, and Victor Lesser. Decentralized Markov decision processes with event-driven interactions. In International Conference on Autonomous Agents and Multiagent Systems, pages 302–309, 2004.

[7] Raphen Becker, Shlomo Zilberstein, Victor Lesser, and Claudia V. Goldman. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455, 2004.

[8] Daniel S. Bernstein, Rob Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27:819–840, 2002.

[9] Yu-Han Chang, Tracey Ho, and Leslie P. Kaelbling. All learning is local: Multi-agent learning in global reward games. In Advances in Neural Information Processing Systems, pages 807–814, 2004.

[10] J. Chase, J. Du, N. Fu, T. V. Le, and H. C. Lau. Law enforcement resource optimization with response time guarantees. In IEEE Symposium Series on Computational Intelligence, pages 1–7, 2017.

[11] Kamil Ciosek and Shimon Whiteson. Expected policy gradients.
In AAAI Conference on Artificial Intelligence, 2018.

[12] Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 165–172, 2014.

[13] P. Diaconis and D. Freedman. De Finetti's generalizations of exchangeability. Studies in Inductive Logic and Probability, 2:233–249, 1980.

[14] Jilles Steeve Dibangoye and Olivier Buffet. Learning to act in decentralized partially observable MDPs. In International Conference on Machine Learning, pages 1241–1250, 2018.

[15] Ed Durfee and Shlomo Zilberstein. Multiagent planning, control, and execution. In Gerhard Weiss, editor, Multiagent Systems, chapter 11, pages 485–546. MIT Press, Cambridge, MA, USA, 2013.

[16] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence, 2018.

[17] Diogo A. Gomes, Joana Mohr, and Rafael Rigao Souza. Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93(3):308–328, 2010.

[18] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems, pages 1523–1530, 2002.

[19] Tarun Gupta, Akshat Kumar, and Praveen Paruchuri. Planning and learning for decentralized MDPs with event driven rewards. In AAAI Conference on Artificial Intelligence, pages 6186–6194, 2018.

[20] Jelle R. Kok and Nikos Vlassis. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(Sep):1789–1828, 2006.

[21] Vijay R. Konda and John N. Tsitsiklis. On actor-critic algorithms.
SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.

[22] Liping Liu, Daniel Sheldon, and Thomas Dietterich. Gaussian approximation of collective graphical models. In International Conference on Machine Learning, pages 1602–1610, 2014.

[23] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.

[24] R. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI Conference on Artificial Intelligence, pages 133–139, 2005.

[25] Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Collective multiagent sequential decision making under uncertainty. In AAAI Conference on Artificial Intelligence, pages 3036–3043, 2017.

[26] Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Policy gradient with value function approximation for collective multiagent planning. In Advances in Neural Information Processing Systems, pages 4322–4332, 2017.

[27] Mathias Niepert and Guy Van den Broeck. Tractability through exchangeability: A new perspective on efficient probabilistic inference. In AAAI Conference on Artificial Intelligence, pages 2467–2475, 2014.

[28] Pascal Poupart and Craig Boutilier. Bounded finite state controllers. In Neural Information Processing Systems, pages 823–830, 2003.

[29] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4292–4301, 2018.

[30] Philipp Robbel, Frans A. Oliehoek, and Mykel J. Kochenderfer.
Exploiting anonymity in approximate linear programming: Scaling to large multiagent MDPs. In AAAI Conference on Artificial Intelligence, pages 2537–2543, 2016.

[31] Daniel R. Sheldon and Thomas G. Dietterich. Collective graphical models. In Advances in Neural Information Processing Systems, pages 1161–1169, 2011.

[32] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.

[33] Ekhlas Sonu, Yingke Chen, and Prashant Doshi. Individual planning in agent populations: Exploiting anonymity and frame-action hypergraphs. In International Conference on Automated Planning and Scheduling, pages 202–210, 2015.

[34] Matthijs T. J. Spaan and Francisco S. Melo. Interaction-driven Markov games for decentralized multiagent planning under uncertainty. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 525–532, 2008.

[35] Tao Sun, Daniel Sheldon, and Akshat Kumar. Message passing for collective graphical models. In International Conference on Machine Learning, pages 853–861, 2015.

[36] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

[37] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In International Conference on Neural Information Processing Systems, pages 1057–1063, 1999.

[38] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents.
In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.

[39] Kagan Tumer and Adrian Agogino. Distributed agent-based air traffic flow management. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 255:1–255:8, 2007.

[40] Kagan Tumer and Adrian K. Agogino. Multiagent learning for black box system reward functions. Advances in Complex Systems, 12(4-5):475–492, 2009.

[41] Kagan Tumer, Adrian K. Agogino, and David H. Wolpert. Learning sequences of actions in collectives of autonomous agents. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 378–385, 2002.

[42] Pradeep Varakantham, Yossiri Adulyasak, and Patrick Jaillet. Decentralized stochastic planning with anonymity in interactions. In AAAI Conference on Artificial Intelligence, pages 2505–2512, 2014.

[43] Pradeep Reddy Varakantham, Shih-Fen Cheng, Geoff Gordon, and Asrar Ahmed. Decision support for agent populations in uncertain and congested environments. In AAAI Conference on Artificial Intelligence, pages 1471–1477, 2012.

[44] Stefan J. Witwicki and Edmund H. Durfee. Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In International Conference on Automated Planning and Scheduling, pages 185–192, 2010.

[45] Jiachen Yang, Xiaojing Ye, Rakshit Trivedi, Huan Xu, and Hongyuan Zha. Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations, 2018.

[46] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576, 2018.

[47] Chongjie Zhang and Victor R. Lesser. Coordinated multi-agent reinforcement learning in networked distributed POMDPs.
In AAAI Conference on Artificial Intelligence, 2011.

[48] Chongjie Zhang and Victor R. Lesser. Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 1101–1108, 2013.