{"title": "Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1184, "page_last": 1193, "abstract": "We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. \n We assume each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency, we derive a primal-dual decentralized optimization method and obtain a principled and data-efficient iterative algorithm named {\\em value propagation}. We prove a non-asymptotic convergence rate of $\\mathcal{O}(1/T)$ with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, non-linear function approximation, fully decentralized setting.", "full_text": "Value Propagation for Decentralized Networked Deep\n\nMulti-agent Reinforcement Learning\n\nChao Qu \u22171, Shie Mannor2, Huan Xu3,4, Yuan Qi1, Le Song1,4, and Junwu Xiong1\n\n1Ant Financial Services Group\n\n4Georgia Institute of Technology\n\n2 Technion\n\n3Alibaba Group\n\nAbstract\n\nWe consider the networked multi-agent reinforcement learning (MARL) problem\nin a fully decentralized setting, where agents learn to coordinate to achieve joint\nsuccess. This problem is widely encountered in many areas including traf\ufb01c control,\ndistributed control, and smart grids. We assume each agent is located at a node\nof a communication network and can exchange information only with its neigh-\nbors. 
Using softmax temporal consistency, we derive a primal-dual decentralized optimization method and obtain a principled and data-efficient iterative algorithm named value propagation. We prove a non-asymptotic convergence rate of O(1/T) with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, non-linear function approximation, fully decentralized setting.

1 Introduction

Multi-agent systems have applications in a wide range of areas such as robotics, traffic control, distributed control, telecommunications, and economics. In these areas, it is often difficult or simply impossible to predefine agents' behaviour to achieve satisfactory results, and multi-agent reinforcement learning (MARL) naturally arises [Bu et al., 2008, Tan, 1993]. For example, El-Tantawy et al. [2013] model a traffic signal control problem as a multi-player stochastic game and solve it with MARL. MARL generalizes reinforcement learning by considering a set of agents (decision makers) sharing a common environment. However, multi-agent reinforcement learning is a challenging problem, since the agents interact with both the environment and each other. For instance, independent Q-learning (treating other agents as part of the environment) often fails, as the multi-agent setting breaks the theoretical convergence guarantee of Q-learning and makes the learning process unstable [Tan, 1993]. Rashid et al. [2018], Foerster et al. [2018], and Lowe et al. [2017] alleviate this problem with a centralized network (i.e., centralized training with decentralized execution).
Its communication pattern is illustrated in the left panel of Figure 1. Despite the great success of (partially) centralized MARL approaches, there are various scenarios, such as sensor networks [Rabbat and Nowak, 2004] and intelligent transportation systems [Adler and Blue, 2002], where a central agent does not exist or is too expensive to use. In addition, privacy and security are requirements of many real-world multi-agent systems (as in many modern machine learning problems) [Abadi et al., 2016, Kurakin et al., 2016]. For instance, in federated learning [McMahan et al., 2016], the learning task is solved by a loose federation of participating devices (agents) without the need to centrally store the data, which significantly reduces

∗luoji.qc@antfin.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

privacy and security risk by limiting the attack surface to only the device. In the agreement problem [DeGroot, 1974, Mo and Murray, 2017], a group of agents may want to reach consensus on a subject without leaking their individual goals or opinions to others. Obviously, centralized MARL violates these privacy and security requirements. To this end, we and others have advocated fully decentralized approaches, which are useful for many applications including unmanned vehicles [Fax and Murray, 2002], power grids [Callaway and Hiskens, 2011], and sensor networks [Cortes et al., 2004]. In these approaches, a network models the interactions between agents (see the right panel of Figure 1). In particular, we consider a fully cooperative setting where each agent makes its own decision based on its local reward and messages received from its neighbors. Thus each agent preserves the privacy of its own goal and policy.
At the same time, through message-passing, all agents reach consensus on maximizing the cumulative reward averaged over all agents; see Equation (3).

In this paper, we propose a new fully decentralized networked multi-agent deep reinforcement learning algorithm. Using softmax temporal consistency [Nachum et al., 2017, Dai et al., 2018] to connect value and policy updates, we derive a new two-step primal-dual decentralized reinforcement learning algorithm, inspired by a primal decentralized optimization method [Hong et al., 2017].² In the first step of each iteration, each agent computes its local policy, value gradients, and dual gradients, and then updates only its policy parameters. In the second step, each agent propagates to its neighbors messages based on its value function (and dual function) and then updates its own value function. Hence we name the algorithm value propagation. It preserves privacy in the sense that no individual reward function is required for the network-wide collaboration. We approximate the local policy, value function, and dual function of each agent by deep neural networks, which enables automatic feature generation and end-to-end learning.

Contributions: [1] We propose the value propagation algorithm and prove that it converges at rate O(1/T) even with non-linear deep neural network function approximation. To the best of our knowledge, it is the first deep MARL algorithm with a non-asymptotic convergence guarantee. At the same time, value propagation can use off-policy updates, making it data efficient. When reduced to the single-agent case, it provides a proof for [Dai et al., 2018] in the realistic setting; see the remarks on Algorithm 1 in Section 3.3.
[2] The objective function in our problem is in a primal-dual decentralized optimization form (see (8)), while the objective function in [Hong et al., 2017] is a primal problem. When our method reduces to a pure primal analysis, we extend [Hong et al., 2017] to the stochastic and biased gradient setting, which may be of independent interest to the optimization community. In the practical implementation, we extend ADAM to the decentralized setting to accelerate training.

Figure 1: Centralized network vs. decentralized network. Each blue node in the figure corresponds to an agent. In the centralized network (left), the red central node collects information from all agents, while in the decentralized network (right), agents exchange information with their neighbors.

2 Preliminaries

MDP A Markov Decision Process (MDP) can be described by a 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$: $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P = (P(s'|s,a))_{s,s' \in \mathcal{S}, a \in \mathcal{A}}$ are the transition probabilities, $R = (R(s,a))_{s \in \mathcal{S}, a \in \mathcal{A}}$ are the real-valued immediate rewards, and $\gamma \in (0,1)$ is the discount factor. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi$, where $\pi(s_t, a_t)$ is the conditional probability density at $a_t$ associated with the policy. Define $V^*(s) = \max_\pi \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s]$ to be the optimal value function. It is known that $V^*$ is the unique fixed point of the Bellman optimality operator, $V(s) = (\mathcal{T}V)(s) := \max_a R(s,a) + \gamma \mathbb{E}_{s'|s,a}[V(s')]$. The optimal policy $\pi^*$ is related to $V^*$ by $\pi^*(s,a) = \arg\max_a \{R(s,a) + \gamma \mathbb{E}_{s'|s,a} V^*(s')\}$.

²The objective in Hong et al. [2017] is a primal optimization problem with a constraint.
Thus they introduce a Lagrange-multiplier-like method to solve it (so they call it a primal-dual method). Our objective function is a primal-dual optimization problem with a constraint.

Softmax Temporal Consistency Nachum et al. [2017] establish a connection between value- and policy-based reinforcement learning through a relationship between softmax temporal value consistency and policy optimality under entropy regularization. In particular, the soft Bellman optimality condition is

$$V_\lambda(s) = \max_{\pi(s,\cdot)} \big( \mathbb{E}_{a \sim \pi(s,\cdot)}[R(s,a) + \gamma \mathbb{E}_{s'|s,a} V_\lambda(s')] + \lambda H(\pi, s) \big), \quad (1)$$

where $H(\pi, s) = -\sum_{a \in \mathcal{A}} \pi(s,a) \log \pi(s,a)$ and $\lambda \ge 0$ controls the degree of regularization. When $\lambda = 0$, the equation above reduces to the standard Bellman optimality condition. An important property of soft Bellman optimality is so-called temporal consistency, which leads to Path Consistency Learning.

Proposition 1. [Nachum et al., 2017]. Assume $\lambda > 0$. Let $V^*_\lambda$ be the fixed point of (1) and $\pi^*_\lambda$ be the corresponding policy that attains the maximum on the RHS of (1). Then $(V^*_\lambda, \pi^*_\lambda)$ is the unique $(V, \pi)$ pair that satisfies the following equation for all $(s,a) \in \mathcal{S} \times \mathcal{A}$: $V(s) = R(s,a) + \gamma \mathbb{E}_{s'|s,a} V(s') - \lambda \log \pi(s,a)$.

A straightforward way to apply temporal consistency is to optimize $\min_{V,\pi} \mathbb{E}_{s,a}\big(R(s,a) + \gamma \mathbb{E}_{s'|s,a} V(s') - \lambda \log \pi(s,a) - V(s)\big)^2$. Dai et al.
[2018] get around the double-sampling problem of the above formulation by introducing the primal-dual form

$$\min_{V,\pi} \max_{\rho} \; \mathbb{E}_{s,a,s'}[(\delta(s,a,s') - V(s))^2] - \eta\, \mathbb{E}_{s,a,s'}[(\delta(s,a,s') - \rho(s,a))^2], \quad (2)$$

where $\delta(s,a,s') = R(s,a) + \gamma V(s') - \lambda \log \pi(s,a)$ and $0 \le \eta \le 1$ controls the trade-off between bias and variance.

In the following discussion, we use $\|\cdot\|$ to denote the Euclidean norm of a vector, $A'$ stands for the transpose of $A$, and $\odot$ denotes the entry-wise product between two vectors.

3 Value Propagation

In this section, we present our multi-agent reinforcement learning algorithm, value propagation. To begin with, we extend the MDP model to the networked multi-agent MDP model, following the definition in [Zhang et al., 2018]. Let $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ be an undirected graph with $|\mathcal{N}| = N$ agents (nodes), where $\mathcal{E}$ represents the set of edges. $(i,j) \in \mathcal{E}$ means that agents $i$ and $j$ can communicate with each other through this edge. A networked multi-agent MDP is characterized by a tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \mathcal{P}, \{R_i\}_{i \in \mathcal{N}}, \mathcal{G}, \gamma)$: $\mathcal{S}$ is the global state space shared by all agents (it could be partially observed, i.e., each agent observes its own state $\mathcal{S}_i$; see our experiment), $\mathcal{A}_i$ is the action space of agent $i$, $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}_i$ is the joint action space, $\mathcal{P}$ is the transition probability, and $R_i$ denotes the local reward function of agent $i$. We assume rewards are observed only locally to preserve the privacy of each agent's goal. At each time step, agents observe $s_t$ and make the decision $a_t = (a^1_t, a^2_t, \ldots, a^N_t)$. Then each agent receives its own reward $R_i(s_t, a_t)$, and the environment switches to the new state $s_{t+1}$ according to the transition probability.
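Before moving to the multi-agent setting, the single-agent soft Bellman equation (1) and the temporal consistency of Proposition 1 can be sanity-checked numerically. The sketch below iterates the soft Bellman operator on a toy MDP (all sizes, the discount factor, and λ are illustrative assumptions, not the paper's experimental setup) and verifies that the consistency residual vanishes for every action, not just the greedy one:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, lam = 6, 3, 0.9, 0.2

# Toy MDP: random transitions P[s, a, s'] and rewards R[s, a] (illustrative only).
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Soft Bellman fixed-point iteration: the maximizer over pi(s, .) in Eq. (1) has
# the closed form V(s) = lam * log sum_a exp((R(s,a) + gamma * E[V(s')]) / lam).
V = np.zeros(nS)
for _ in range(2000):
    Q = R + gamma * P @ V                 # Q[s, a] = R(s,a) + gamma * E_{s'|s,a} V(s')
    V = lam * np.log(np.exp(Q / lam).sum(axis=1))

# The entropy-regularized optimal policy is the softmax of Q / lam.
pi = np.exp(Q / lam); pi /= pi.sum(axis=1, keepdims=True)

# Proposition 1: V(s) = R(s,a) + gamma*E[V(s')] - lam*log pi(s,a) for ALL (s, a).
residual = np.abs(Q - lam * np.log(pi) - V[:, None]).max()
print(residual)  # ~0
```

Note that the residual is small for every state-action pair, which is exactly the path-consistency condition exploited by PCL-style objectives.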
Furthermore, since each agent makes its decisions independently, it is reasonable to assume that the policy $\pi(s,a)$ factorizes, i.e., $\pi(s,a) = \prod_{i=1}^{N} \pi_i(s, a^i)$ [Zhang et al., 2018]. We call our method a fully decentralized method, since each reward is received locally, each action is executed locally by an agent, and each critic (value function) is trained locally.

3.1 Multi-Agent Softmax Temporal Consistency

The goal of the agents is to learn a policy that maximizes the long-term reward averaged over the agents, i.e.,

$$\max_\pi \; \mathbb{E}\Big(\frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{\infty} \gamma^t R_i(s_t, a_t)\Big). \quad (3)$$

In the following, we adapt temporal consistency to the multi-agent setting. Let

$$V_\lambda(s) = \max_{\pi(s,\cdot)} \Big( \mathbb{E}_{a \sim \pi(s,\cdot)}\Big(\frac{1}{N}\sum_{i=1}^{N} R_i(s,a) + \gamma \mathbb{E}_{s'|s,a} V_\lambda(s')\Big) + \lambda H(\pi, s) \Big),$$

let $V^*_\lambda$ be the optimal value function, and let $\pi^*_\lambda(s,a) = \prod_{i=1}^{N} \pi^{i*}_\lambda(s, a^i)$ be the corresponding policy. Applying soft temporal consistency, we obtain that for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, $(V^*_\lambda, \pi^*_\lambda)$ is the unique $(V,\pi)$ pair that satisfies

$$V(s) = \frac{1}{N}\sum_{i=1}^{N} R_i(s,a) + \gamma \mathbb{E}_{s'|s,a} V(s') - \lambda \sum_{i=1}^{N} \log \pi_i(s, a^i). \quad (4)$$

An optimization problem inspired by (4) would be

$$\min_{\{\pi_i\}_{i=1}^N, V} \; \mathbb{E}\Big(V(s) - \frac{1}{N}\sum_{i=1}^{N} R_i(s,a) - \gamma \mathbb{E}_{s'|s,a} V(s') + \lambda \sum_{i=1}^{N} \log \pi_i(s, a^i)\Big)^2. \quad (5)$$

There are two potential issues with this formulation. First, due to the inner conditional expectation, it would require two independent samples of $s'$ to obtain an unbiased estimate of the gradient with respect to $V$ [Dann et al., 2014].
Second, $V(s)$ is a global variable over the network and thus cannot be updated in a decentralized way.

For the first issue, we introduce the primal-dual form of (5) as in [Dai et al., 2018]. Using the fact that $x^2 = \max_\nu (2\nu x - \nu^2)$ and the interchangeability principle [Shapiro et al., 2009], we have

$$\min_{V, \{\pi_i\}_{i=1}^N} \max_\nu \; 2\,\mathbb{E}_{s,a,s'}\Big[\nu(s,a)\,\frac{1}{N}\sum_{i=1}^{N}\big(R_i(s,a) + \gamma V(s') - V(s) - \lambda N \log \pi_i(s,a^i)\big)\Big] - \mathbb{E}_{s,a}[\nu^2(s,a)].$$

Changing the variable $\nu(s,a) = \rho(s,a) - V(s)$, the objective function becomes

$$\min_{V, \{\pi_i\}_{i=1}^N} \max_\rho \; \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - V(s))\Big)^2\Big] - \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - \rho(s,a))\Big)^2\Big], \quad (6)$$

where $\delta_i = R_i(s,a) + \gamma V(s') - \lambda N \log \pi_i(s,a^i)$.

3.2 Decentralized Formulation

So far the problem is still in a centralized form, and we now turn to reformulating it in a decentralized way. We assume that the policy, the value function, and the dual variable $\rho$ all lie in parametric function classes. In particular, each agent's policy is $\pi_i(s,a^i) := \pi_{\theta_{\pi_i}}(s,a^i)$ and $\pi_\theta(s,a) = \prod_{i=1}^{N} \pi_{\theta_{\pi_i}}(s,a^i)$. The value function $V_{\theta_v}(s)$ is characterized by the parameter $\theta_v$, while $\theta_\rho$ represents the parameter of $\rho(s,a)$.
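The substitution $x^2 = \max_\nu(2\nu x - \nu^2)$ used in the derivation above is what removes the double-sampling bias: squaring a conditional expectation with a single sample is biased by the conditional variance, while the linear-in-$x$ dual form admits an unbiased single-sample estimate. A quick numerical check (the distribution and constants are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# X stands in for the Bellman residual whose conditional MEAN we want to square.
mu, sigma, n = 0.7, 0.5, 200_000
x = rng.normal(mu, sigma, n)

# Naive single-sample estimate of (E[X])^2: averages x^2, overshooting by Var(X).
naive = (x ** 2).mean()

# Dual form: for fixed nu, 2*nu*x - nu^2 is linear in x, so a single sample is
# unbiased; at the inner maximizer nu* = E[X] the value equals (E[X])^2.
nu = x.mean()
dual = (2 * nu * x - nu ** 2).mean()

print(naive - mu ** 2)  # ~ sigma^2 = 0.25, the double-sampling bias
print(dual - mu ** 2)   # ~ 0
```

The change of variable $\nu = \rho - V$ then simply re-centers this dual variable so that each term of (6) is an expectation of a squared residual, matching the form of (2).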
Similar to [Dai et al., 2018], we optimize a slightly different version of (6):

$$\min_{\theta_v, \{\theta_{\pi_i}\}_{i=1}^N} \max_{\theta_\rho} \; \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - V(s))\Big)^2\Big] - \eta\, \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - \rho(s,a))\Big)^2\Big], \quad (7)$$

where $0 \le \eta \le 1$ controls the bias-variance trade-off. When $\eta = 0$, it reduces to the pure primal form.

We now consider the second issue, namely that $V(s)$ is a global variable. To address this problem, we introduce a local copy of the value function, $V_i(s)$, for each agent $i$. The algorithm includes a consensus update step that drives these local copies to agree, i.e., $V_1(s) = V_2(s) = \ldots = V_N(s) = V(s)$, or equivalently $\theta_{v_1} = \theta_{v_2} = \ldots = \theta_{v_N}$, where $\theta_{v_i}$ is the parameter of $V_i$. Notice that in (7) there is also a global dual variable $\rho$ in the primal-dual form. Therefore, we likewise introduce a local copy of the dual variable, $\rho_i(s,a)$, to cast (7) as a decentralized optimization problem. The final objective function we need to optimize is

$$\min_{\{\theta_{v_i}, \theta_{\pi_i}\}_{i=1}^N} \max_{\{\theta_{\rho_i}\}_{i=1}^N} \; L(\theta_V, \theta_\pi, \theta_\rho) = \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - V_i(s))\Big)^2\Big] - \eta\, \mathbb{E}_{s,a,s'}\Big[\Big(\frac{1}{N}\sum_{i=1}^{N}(\delta_i(s,a,s') - \rho_i(s,a))\Big)^2\Big]$$
$$\text{s.t. } \theta_{v_1} = \ldots = \theta_{v_N}, \quad \theta_{\rho_1} = \ldots = \theta_{\rho_N}, \quad (8)$$

where $\delta_i = R_i(s,a) + \gamma V_i(s') - \lambda N \log \pi_i(s,a^i)$. We are now ready to present the value propagation algorithm. In the following, for notational simplicity, we assume the parameter of each agent is a scalar, i.e., $\theta_{\rho_i}, \theta_{\pi_i}, \theta_{v_i} \in \mathbb{R}$.
We pack the parameters together and slightly abuse notation by writing $\theta_\rho = [\theta_{\rho_1}, \ldots, \theta_{\rho_N}]'$, $\theta_\pi = [\theta_{\pi_1}, \ldots, \theta_{\pi_N}]'$, $\theta_V = [\theta_{v_1}, \ldots, \theta_{v_N}]'$. Similarly, we pack the stochastic gradients $g(\theta_\rho) = [g(\theta_{\rho_1}), \ldots, g(\theta_{\rho_N})]'$, $g(\theta_V) = [g(\theta_{v_1}), \ldots, g(\theta_{v_N})]'$.

3.3 Value propagation algorithm

Solving (8), even without the constraints, is not easy when both the primal and dual parts are approximated by deep neural networks. An ideal approach would be to optimize the inner dual problem and find the solution $\theta^*_\rho = \arg\max_{\theta_\rho} L(\theta_V, \theta_\pi, \theta_\rho)$ such that $\theta_{\rho_1} = \ldots = \theta_{\rho_N}$, and then run (decentralized) stochastic gradient descent on the primal problem

$$\min_{\{\theta_{v_i}, \theta_{\pi_i}\}_{i=1}^N} L(\theta_V, \theta_\pi, \theta^*_\rho) \quad \text{s.t. } \theta_{v_1} = \ldots = \theta_{v_N}. \quad (9)$$

In practice, however, one tricky issue is that we cannot obtain the exact solution $\theta^*_\rho$ of the dual problem. Thus, in Algorithm 1 we run (decentralized) stochastic gradient steps on the dual problem for $T_{dual}$ iterations and obtain an approximate solution $\tilde{\theta}_\rho$. In our analysis, we take the error $\varepsilon$ generated by this inexact solution into consideration and analyze its effect on the convergence.
In particular, since $\nabla_{\theta_V} L(\theta_V, \theta_\pi, \tilde{\theta}_\rho) \ne \nabla_{\theta_V} L(\theta_V, \theta_\pi, \theta^*_\rho)$, the primal gradient is biased, and the results in [Dai et al., 2018, Hong et al., 2017] do not apply to this problem.

In the dual update we perform the consensus update $\theta^{t+1}_\rho = \frac{1}{2} D^{-1} L^+ \theta^t_\rho + \frac{\alpha_\rho}{2} D^{-1} g(\theta^t_\rho) - \frac{\alpha_\rho}{2} D^{-1} A' \mu^t_\rho$ using the stochastic gradient of each agent, where $\mu_\rho$ is an auxiliary variable that incorporates the communication, $D$ is the degree matrix, $A$ is the node-edge incidence matrix, and $L^+$ is the sign-less graph Laplacian. We defer the detailed definitions and the derivation of this algorithm to Appendix A.1 and Appendix A.5 due to space limitations. After updating the dual parameters, we optimize the primal parameters $\theta_v, \theta_\pi$. Similarly, we use a mini-batch of data from the replay buffer and then perform a consensus update on $\theta_v$. The same remarks on $\rho$ also hold for the primal parameter $\theta_v$. Notice that we do not need a consensus update on $\theta_\pi$, since each agent's policy $\pi_i(s,a^i)$ differs from the others'. This update rule is adapted from a primal decentralized optimization algorithm [Hong et al., 2017]. Note that even in the pure primal case, Hong et al. [2017] only consider batch gradients, while our algorithm and analysis cover the stochastic and biased gradient case. In the practical implementation, we use a decentralized momentum method and multi-step temporal consistency to accelerate training; see details in Appendix A.2 and Appendix A.3.

Remarks on Algorithm 1. (1) In the single-agent case, Dai et al. [2018] assume the dual problem can be solved exactly, and thus they analyze a simple pure primal problem. However, such an assumption is unrealistic, especially when the dual variable is represented by a deep neural network. Our multi-agent analysis considers the inexact solution.
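To make the consensus step concrete, the sketch below builds the graph matrices for a small illustrative network and applies the gradient-free part of the update, $\frac{1}{2} D^{-1} L^+ \theta$. The example graph and the convention $L^+ = D + W$ (with $W$ the adjacency matrix) are our assumptions; the paper's exact definitions are deferred to its Appendix A.1:

```python
import numpy as np

# Illustrative 4-node communication graph (a cycle), NOT from the paper.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, E = 4, len(edges)

W = np.zeros((N, N))                       # adjacency matrix
for i, j in edges:
    W[i, j] = W[j, i] = 1
D = np.diag(W.sum(axis=1))                 # degree matrix
L_plus = D + W                             # sign-less graph Laplacian (assumed convention)
A = np.zeros((E, N))                       # node-edge incidence matrix: +1 / -1 per edge
for e, (i, j) in enumerate(edges):
    A[e, i], A[e, j] = 1, -1

theta = np.array([1.0, 4.0, 2.0, 3.0])     # local copies theta_{v_i}, one scalar per agent
mixed = 0.5 * np.linalg.inv(D) @ L_plus @ theta

# Each agent's new value blends its own parameter with its neighbors' (only local
# information is needed, since row i of L+ is nonzero only on i and its neighbors):
print(mixed)
# An already-consensual vector is left unchanged by this mixing step:
print(0.5 * np.linalg.inv(D) @ L_plus @ np.ones(N))  # all ones
```

This also illustrates remark (2): row $i$ of $D^{-1}L^+$ touches only agent $i$ and its neighbors, so the update is computable from purely local communication.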
This is much harder than the analysis in [Dai et al., 2018], since now the primal gradient is biased. (2) The update of each agent needs only the information of the agent itself and its neighbors; this can be seen from the definitions of $D$, $A$, $L^+$ in the appendix. (3) The topology of the graph $\mathcal{G}$ affects the convergence speed. In particular, the rate depends on $\sigma_{\min}(A'A)$ and $\sigma_{\min}(D)$, which are related to the spectral gap of the network.

4 Theoretical Result

In this section, we give the convergence result for Algorithm 1. We first make two mild assumptions on the function approximators $f(\theta)$ of $V_i(s)$, $\pi_i(s,a^i)$, $\rho_i(s,a)$.

Assumption 1. i) The function approximator $f(\theta)$ is differentiable and has Lipschitz continuous gradient, i.e., $\|\nabla f(\theta_1) - \nabla f(\theta_2)\| \le L \|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2 \in \mathbb{R}^K$. This is commonly assumed in non-convex optimization. ii) The function approximator $f(\theta)$ is lower bounded. This is easily satisfied when the parameter is bounded, i.e., $\|\theta\| \le C$ for some positive constant $C$.

In the following, we give the theoretical analysis of Algorithm 1 in the same setting as [Antos et al., 2008, Dai et al., 2018], where samples are prefixed and come from a single $\beta$-mixing off-policy sample path. We denote $\hat{L}(\theta_V, \theta_\pi) = \max_{\theta_\rho} L(\theta_V, \theta_\pi, \theta_\rho)$, s.t. $\theta_{\rho_1} = \ldots = \theta_{\rho_N}$.

Theorem 1. Let the function approximators of $V_i(s)$, $\pi_i(s,a^i)$ and $\rho_i(s,a)$ satisfy Assumption 1, and denote the total number of training steps by $T$.
We solve the inner dual problem to an approximate solution $\tilde{\theta}_\rho = (\tilde{\theta}_{\rho_1}, \ldots, \tilde{\theta}_{\rho_N})'$ such that $\|\nabla_{\theta_V} L(\theta_V, \theta_\pi, \tilde{\theta}_\rho) - \nabla_{\theta_V} L(\theta_V, \theta_\pi, \theta^*_\rho)\| \le c_1/\sqrt{T}$ and $\|\nabla_{\theta_\pi} L(\theta_V, \theta_\pi, \tilde{\theta}_\rho) - \nabla_{\theta_\pi} L(\theta_V, \theta_\pi, \theta^*_\rho)\| \le c_2/\sqrt{T}$. Assume the variance of the stochastic gradients $g(\theta_V)$, $g(\theta_\pi)$ and $g(\theta_\rho)$ (each estimated from a single sample) is bounded by $\sigma^2$, the size of the mini-batch is $\sqrt{T}$, and the step sizes satisfy $\alpha_\pi, \alpha_v, \alpha_\rho \propto \frac{1}{L}$. Then value propagation in Algorithm 1 converges to a stationary solution of $\hat{L}(\theta_V, \theta_\pi)$ at rate $O(1/T)$.

Algorithm 1 Value Propagation

Input: Environment ENV, learning rates $\alpha_\pi, \alpha_v, \alpha_\rho$, discount factor $\gamma$, number of steps $T_{dual}$ to train the dual parameters $\theta_{\rho_i}$, replay buffer capacity $B$, node-edge incidence matrix $A \in \mathbb{R}^{E \times N}$, degree matrix $D$, signless graph Laplacian $L^+$.
Initialize $\theta_{v_i}, \theta_{\pi_i}, \theta_{\rho_i}$, and $\mu^0_\rho = 0$, $\mu^0_V = 0$.
for $t = 1, \ldots, T$ do
  Sample a trajectory $s_{0:\tau} \sim \pi(s,a) = \prod_{i=1}^N \pi_i(s, a^i)$ and add it to the replay buffer.
  1. Update the dual parameters $\theta_{\rho_i}$
  Repeat the following dual update $T_{dual}$ times:
    Randomly sample a mini-batch of transitions $(s_t, \{a^i_t\}_{i=1}^N, s_{t+1}, \{r^i_t\}_{i=1}^N)$ from the replay buffer.
    for agent $i = 1$ to $N$ do
      Calculate the stochastic gradient $g(\theta^t_{\rho_i})$ of $-\eta(\delta_i(s_t, a_t, s_{t+1}) - \rho_i(s_t, a_t))^2$ w.r.t. $\theta^t_{\rho_i}$.
    end for
    // Consensus update on $\theta_\rho := [\theta_{\rho_1}, \ldots, \theta_{\rho_N}]'$:
    $\theta^{t+1}_\rho = \frac{1}{2} D^{-1} L^+ \theta^t_\rho + \frac{\alpha_\rho}{2} D^{-1} g(\theta^t_\rho) - \frac{\alpha_\rho}{2} D^{-1} A' \mu^t_\rho$, $\quad \mu^{t+1}_\rho = \mu^t_\rho + \frac{1}{\alpha_\rho} A \theta^{t+1}_\rho$.
  2. Update the primal parameters $\theta_{v_i}, \theta_{\pi_i}$
  Randomly sample a mini-batch of transitions $(s_t, \{a^i_t\}_{i=1}^N, s_{t+1}, \{r^i_t\}_{i=1}^N)$ from the replay buffer.
  for agent $i = 1$ to $N$ do
    Calculate the stochastic gradients $g(\theta^t_{v_i})$, $g(\theta^t_{\pi_i})$ of $(\delta_i(s_t, a_t, s_{t+1}) - V_i(s_t))^2 - \eta(\delta_i(s_t, a_t, s_{t+1}) - \rho_i(s_t, a_t))^2$ w.r.t. $\theta^t_{v_i}, \theta^t_{\pi_i}$.
  end for
  // Gradient descent on $\theta_{\pi_i}$: $\theta^{t+1}_{\pi_i} = \theta^t_{\pi_i} - \alpha_\pi g(\theta^t_{\pi_i})$ for each agent $i$.
  // Consensus update on $\theta_V := [\theta_{v_1}, \ldots, \theta_{v_N}]'$:
  $\theta^{t+1}_V = \frac{1}{2} D^{-1} L^+ \theta^t_V - \frac{\alpha_v}{2} D^{-1} g(\theta^t_V) - \frac{\alpha_v}{2} D^{-1} A' \mu^t_V$, $\quad \mu^{t+1}_V = \mu^t_V + \frac{1}{\alpha_v} A \theta^{t+1}_V$.
end for

Remarks: (1) The convergence criterion and its dependence on the network structure are involved; we defer their definitions to the proof section in the appendix (Equation (44)). (2) We require that the approximate dual solution $\tilde{\theta}_\rho$ is not far from $\theta^*_\rho$, so that the estimates of the primal gradients of $\theta_v$ and $\theta_\pi$ are not far from the true ones (the distance is less than $O(1/\sqrt{T})$). When the inner dual problem is concave, we can obtain such an approximate solution easily using the vanilla decentralized stochastic gradient method after at most $T$ steps.
If the dual problem is non-convex, our proof still shows that the dual problem converges to a stationary solution at rate O(1/T). (3) In the theoretical analysis, estimating the stochastic gradient from a mini-batch (rather than from a single sample) is common in non-convex analysis; see [Ghadimi and Lan, 2016]. In practice, a mini-batch of samples is also commonly used in training deep neural networks.

5 Related work

Among related work on MARL, the setting of [Zhang et al., 2018] is closest to ours: the authors propose a fully decentralized multi-agent actor-critic algorithm to maximize the expected time-averaged reward $\lim_{T\to\infty} \frac{1}{T} \mathbb{E} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} r^i_t$. They provide an asymptotic convergence analysis in the on-policy, linear function approximation setting. In our work, we consider the discounted reward setup, i.e., Equation (3). Our algorithm covers both the on-policy and the off-policy settings and thus can exploit data more efficiently. Furthermore, we provide a convergence rate of $O(\frac{1}{T})$ in the non-linear function approximation setting, which is much stronger than the result in [Zhang et al., 2018]. Littman [1994] proposed the framework of Markov games, which applies to collaborative and competitive settings [Lauer and Riedmiller, 2000, Hu and Wellman, 2003]. These early works considered the tabular case and thus cannot be applied to real problems with large state spaces. Recent works [Foerster et al., 2016, 2018, Rashid et al., 2018, Raileanu et al., 2018, Jiang et al., 2018, Lowe et al., 2017] have exploited powerful deep learning techniques and obtained promising empirical results. However, most of them lack theoretical guarantees, while our work provides a convergence analysis.

Figure 2: Results on a randomly sampled MDP. Left: value functions of different agents in value propagation.
In the figure, the value functions of the three agents are similar, which means the agents reach consensus on their value functions. Middle: cumulative reward of value propagation (with different η) and centralized PCL with 10 agents. Right: results with 20 agents.

We emphasize that most research on MARL follows the paradigm of centralized training with decentralized execution: during training, these methods place no constraint on communication, while our work assumes a decentralized network structure.

6 Experimental result

The goal of our experiments is two-fold: to better understand the effect of each component of the proposed algorithm, and to evaluate the efficiency of value propagation in the off-policy setting. To this end, we first perform an ablation study on a simple random MDP problem, and then evaluate performance on the cooperative navigation task [Lowe et al., 2017]. The settings of the experiments are similar to those in [Zhang et al., 2018]. Some implementation details are deferred to Appendix A.4 due to space constraints.

6.1 Ablation Study

In this experiment, we test the effect of several components of our algorithm, such as the consensus update and the dual formulation, on a random MDP problem. In particular, we answer the following three questions. (1) Can agents reach consensus through message-passing in value propagation even when each agent only knows its local reward? (2) How much performance does the decentralized approach sacrifice compared with the centralized one? (3) What is the effect of the dual part of our formulation (0 ≤ η ≤ 1, where η = 0 corresponds to the pure primal form)?

We compare value propagation with centralized PCL. Centralized PCL means that a central node collects the rewards of all agents and can therefore optimize the objective function (5) directly using the single-agent PCL algorithm [Nachum et al., 2017, Dai et al., 2018].
Ideally, value propagation should converge to the same long-term reward as that achieved by centralized PCL. In the experiment, we consider a multi-agent RL problem with N = 10 and N = 20 agents, where each agent has two actions. A discrete MDP is randomly generated with |S| = 32 states. The transition probabilities are distributed uniformly with a small additive constant to ensure ergodicity of the MDP, i.e., $P(s'|a,s) \propto p^a_{ss'} + 10^{-5}$, $p^a_{ss'} \sim U[0,1]$. For each agent $i$ and each state-action pair $(s,a)$, the reward $R_i(s,a)$ is uniformly sampled from $[0,4]$.

In the left panel of Figure 2, we verify that the value functions $v_i(s)$ in value propagation reach consensus through message-passing by the end of training. In particular, we randomly choose three agents $i, j, k$ and plot their value functions over 20 randomly picked states. It is easy to see that the value functions $v_i(s), v_j(s), v_k(s)$ over these states are almost the same. This is accomplished by the consensus update in value propagation. In the middle and right panels of Figure 2, we compare value propagation with centralized PCL and evaluate the effect of the dual part of value propagation. In particular, we pick η = 0, 0.01, 0.1, 1 in the experiment, where η = 0 corresponds to the pure primal formulation. When η is too large (η = 1), the algorithm has large variance, while at η = 0 the algorithm has some bias; thus value propagation with η = 0.1 or 0.01 gives better results. We also see that value propagation (η = 0.1, 0.01) and centralized PCL converge to almost the same value, although there is a gap between the centralized and decentralized algorithms.
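The random MDP used in this ablation can be generated in a few lines. The sketch below follows the description above (transitions proportional to U[0,1] plus a small constant for ergodicity, per-agent rewards uniform on [0,4]); the function and argument names are our own, since the paper does not specify an implementation:

```python
import numpy as np

def make_random_mdp(n_states=32, n_actions=2, n_agents=10, seed=0):
    """Random networked MDP as in the ablation: P(s'|a,s) ∝ U[0,1] + 1e-5,
    R_i(s,a) ~ U[0,4] for each agent i. Illustrative sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(size=(n_states, n_actions, n_states)) + 1e-5
    P /= P.sum(axis=2, keepdims=True)        # normalize rows into distributions
    R = rng.uniform(0.0, 4.0, size=(n_agents, n_states, n_actions))
    return P, R

P, R = make_random_mdp()
print(P.shape, R.shape)                  # (32, 2, 32) (10, 32, 2)
print(np.allclose(P.sum(axis=2), 1.0))   # True: valid transition kernel
```

The additive $10^{-5}$ keeps every transition probability strictly positive, which is what makes the sampled chain ergodic under any policy.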
The centralized PCL converges faster than value propagation, since it does not need time to diffuse the reward information over the network.

Figure 3: Results on the Cooperative Navigation task. Left: value functions of three randomly picked agents (out of 16 in total) in value propagation; they reach consensus. Middle: cumulative reward of value propagation (η = 0.01 and η = 0.1), MA-AC, and PCL without communication, with N = 8 agents. Right: results with N = 16 agents. Our algorithm outperforms MA-AC and PCL without communication. Compared with the middle panel, the number of agents increases in the right panel, so the problem becomes harder (more collisions): agents achieve lower cumulative reward (averaged over agents) and need more time to find a good policy.

6.2 Cooperative Navigation task

The aim of this section is to demonstrate that value propagation outperforms decentralized multi-agent actor-critic (MA-AC) [Zhang et al., 2018], independent Q-learning [Tan, 1993], and multi-agent PCL without communication. Here, PCL without communication means that each agent maintains its own estimates of the policy π^i(s, a^i) and value function V^i(s), but there is no communication graph. Note that this differs from the centralized PCL in Section 6.1, which has a central node collecting all reward information and thus needs no further communication.
Note that the original MA-AC is designed for the average-reward setting, so we adapt it to the discounted case to fit our setting. We test value propagation in the environment of the Cooperative Navigation task [Lowe et al., 2017], where agents need to reach a set of L landmarks through physical movement. We modify this environment to fit our setting: a reward is given when an agent reaches its own landmark, and a penalty is received when agents collide with each other. Since the positions of the landmarks differ, the reward function of each agent is different. We test both the globally observed and the partially observed state. In particular, we assume the environment is a rectangular region of size 2 × 2 and there are N = 8 or N = 16 agents. Each agent has a single target landmark, i.e., L = N, randomly located in the region. Each agent has five actions, corresponding to moving up, down, left, or right by 0.1 units, or staying in place. With probability 0.95 the agent moves in the direction given by its action, and otherwise it moves in a random direction. The maximum length of each epoch is 500 steps. When an agent is close enough to its landmark, e.g., within distance 0.1, we consider it to have reached the target, and it receives reward +5. When two agents are within distance 0.1 of each other, we treat this as a collision and each of them receives a penalty of −1. The state consists of the positions of the agents. The communication graph is generated as in Section 6.1 with connectivity ratio 4/N. In the partially observed case, the actor of each agent can only observe its own and its neighbors' states.
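Under the rules above, one transition of a simplified version of this environment can be sketched as follows (assuming NumPy; `step` and `MOVES` are our own names, not the benchmark's actual code):

```python
import numpy as np

# Five actions: up, down, left, right (each 0.1 units), or stay.
MOVES = np.array([[0.0, 0.1], [0.0, -0.1], [-0.1, 0.0], [0.1, 0.0], [0.0, 0.0]])

def step(pos, landmarks, actions, rng, p_follow=0.95):
    """One transition: move agents, then score landmark arrivals and collisions."""
    n = pos.shape[0]
    for i in range(n):
        a = actions[i]
        if rng.random() >= p_follow:   # slip: take a random action instead
            a = rng.integers(0, 5)
        pos[i] = np.clip(pos[i] + MOVES[a], 0.0, 2.0)  # stay in the 2 x 2 region
    rewards = np.zeros(n)
    for i in range(n):
        # +5 for being within 0.1 of the agent's own landmark
        if np.linalg.norm(pos[i] - landmarks[i]) < 0.1:
            rewards[i] += 5.0
        # -1 to each agent in a pairwise collision (distance < 0.1)
        for j in range(i + 1, n):
            if np.linalg.norm(pos[i] - pos[j]) < 0.1:
                rewards[i] -= 1.0
                rewards[j] -= 1.0
    return pos, rewards
```

Each agent's local reward then depends on its own landmark and on nearby agents, which is what makes communication over the graph useful.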
We report the results in Figure 3.
In the left panel of Figure 3, we see that the value function v^i(s) reaches consensus in value propagation. In the middle and right panels of Figure 3, we compare value propagation with PCL without communication, independent Q-learning, and MA-AC. In PCL without communication, each agent maintains its own policy, value function, and dual function, trained with the SBEED algorithm [Dai et al., 2018] with η = 0.01. Since there is no communication between agents, agents may intuitively have more collisions during learning than in value propagation; a similar argument holds for independent Q-learning. Indeed, in the middle and right panels, we see that value propagation learns the policy much faster than PCL without communication. We also observe that value propagation outperforms MA-AC. One possible reason is that value propagation is an off-policy method, so we can apply experience replay, which exploits data more efficiently than the on-policy method MA-AC. We also test the performance of value propagation when the state information of the actor is partially observed (labeled "Partial Value Propagation" in Figure 3). Since the agent has limited information, its performance is worse than in the fully observed case, but it is still better than PCL without communication (with fully observed state).

References

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy.
In Proceedings of the 2016 ACM SIGSAC\nConference on Computer and Communications Security, pages 308\u2013318. ACM, 2016.\n\nJeffrey L Adler and Victor J Blue. A cooperative multi-agent transportation management and route\nguidance system. Transportation Research Part C: Emerging Technologies, 10(5-6):433\u2013454,\n2002.\n\nAndr\u00e1s Antos, Csaba Szepesv\u00e1ri, and R\u00e9mi Munos. Learning near-optimal policies with bellman-\nresidual minimization based \ufb01tted policy iteration and a single sample path. Machine Learning, 71\n(1):89\u2013129, 2008.\n\nStephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms.\n\nIEEE transactions on information theory, 52(6):2508\u20132530, 2006.\n\nLucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement\nlearning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),\n38(2):156\u2013172, 2008.\n\nDuncan S Callaway and Ian A Hiskens. Achieving controllability of electric loads. Proceedings of\n\nthe IEEE, 99(1):184\u2013199, 2011.\n\nFederico S Cattivelli, Cassio G Lopes, and Ali H Sayed. Diffusion recursive least-squares for\ndistributed estimation over adaptive networks. IEEE Transactions on Signal Processing, 56(5):\n1865\u20131877, 2008.\n\nJorge Cortes, Sonia Martinez, Timur Karatas, and Francesco Bullo. Coverage control for mobile\n\nsensing networks. IEEE Transactions on robotics and Automation, 20(2):243\u2013255, 2004.\n\nBo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed:\nIn International\n\nConvergent reinforcement learning with nonlinear function approximation.\nConference on Machine Learning, pages 1133\u20131142, 2018.\n\nChristoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A\n\nsurvey and comparison. The Journal of Machine Learning Research, 15(1):809\u2013883, 2014.\n\nMorris H DeGroot. Reaching a consensus. 
Journal of the American Statistical Association, 69(345):118–121, 1974.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3):1140–1150, 2013.

J Alexander Fax and Richard M Murray. Information flow and cooperative control of vehicle formations. In IFAC World Congress, volume 22, 2002.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

Mingyi Hong. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543, 2016.

Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pages 1529–1538, 2017.

Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games.
Journal of\n\nmachine learning research, 4(Nov):1039\u20131069, 2003.\n\nJiechuan Jiang, Chen Dun, and Zongqing Lu. Graph convolutional reinforcement learning for\n\nmulti-agent cooperation. arXiv preprint arXiv:1810.09202, 2018.\n\nDiederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\nAlexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv\n\npreprint arXiv:1611.01236, 2016.\n\nMartin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in coop-\nerative multi-agent systems. In In Proceedings of the Seventeenth International Conference on\nMachine Learning. Citeseer, 2000.\n\nMichael L Littman. Markov games as a framework for multi-agent reinforcement learning. In\n\nMachine Learning Proceedings 1994, pages 157\u2013163. Elsevier, 1994.\n\nRyan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent\nactor-critic for mixed cooperative-competitive environments. In Advances in Neural Information\nProcessing Systems, pages 6379\u20136390, 2017.\n\nH Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-ef\ufb01cient\n\nlearning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.\n\nYilin Mo and Richard M Murray. Privacy preserving average consensus. IEEE Transactions on\n\nAutomatic Control, 62(2):753\u2013765, 2017.\n\nO\ufb01r Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between\nvalue and policy based reinforcement learning. In Advances in Neural Information Processing\nSystems, pages 2775\u20132785, 2017.\n\nMichael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In Proceedings\nof the 3rd international symposium on Information processing in sensor networks, pages 20\u201327.\nACM, 2004.\n\nRoberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. 
Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.

T Tieleman and G Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Technical report. Available online: https://zh.coursera.org/learn....

Lin Xiao, Stephen Boyd, and Sanjay Lall. A scheme for robust distributed sensor fusion based on average consensus. In Information Processing in Sensor Networks (IPSN 2005), Fourth International Symposium on, pages 63–70. IEEE, 2005.

Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. Fully decentralized multi-agent reinforcement learning with networked agents. International Conference on Machine Learning, 2018.