{"title": "Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9649, "page_last": 9660, "abstract": "Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.", "full_text": "Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

Hoi-To Wai, The Chinese University of Hong Kong, Shatin, Hong Kong, htwai@se.cuhk.edu.hk
Zhuoran Yang, Princeton University, Princeton, NJ, USA, zy6@princeton.edu
Zhaoran Wang, Northwestern University, Evanston, IL, USA, zhaoranwang@gmail.com
Mingyi Hong, University of Minnesota, Minneapolis, MN, USA, mhong@umn.edu

Abstract

Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents.
Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy.

In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.

1 Introduction

Reinforcement learning combined with deep neural networks has recently achieved superhuman performance on various challenging tasks such as video games and board games [34, 45]. In these tasks, an agent uses deep neural networks to learn from the environment and adaptively makes optimal decisions. Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging, since each agent interacts with not only the environment but also other agents. In this paper, we study collaborative MARL with local rewards. In this setting, all the agents share a joint state whose transition dynamics are determined jointly by the local actions of individual agents. However, each agent only observes its own reward, which may differ from that of other agents. The agents aim to collectively maximize the global sum of local rewards.
To collaboratively make globally optimal decisions, the agents need to exchange local information. Such a setting of MARL is ubiquitous in large-scale applications such as sensor networks [42, 9], swarm robotics [23, 8], and power grids [3, 13].

A straightforward idea is to set up a central node that collects and broadcasts the reward information, and assigns the action of each agent. This reduces the multi-agent problem to a single-agent one. However, the central node is often unscalable and even infeasible in large-scale applications. Moreover, such a central node is a single point of failure, which is susceptible to malicious and adversarial attacks. In addition, the agents are likely to be reluctant to reveal their local reward information due to privacy concerns [5, 27], which makes the central node unattainable.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To make MARL more scalable and robust, we propose a decentralized scheme for exchanging local information, where each agent only communicates with its neighbors over a network. In particular, we study the policy evaluation problem, which aims to learn a global value function of a given policy. We focus on minimizing a Fenchel duality-based reformulation of the mean squared projected Bellman error in the model-free setting with infinite horizon, batch trajectory, and linear function approximation.

At the core of the proposed algorithm is a "double averaging" update scheme, in which the algorithm performs one average over space (across agents, to ensure consensus) and one over time (across observations along the trajectory).
In detail, each agent locally tracks an estimate of the full gradient and incrementally updates it using two sources of information: (i) the stochastic gradient evaluated at a new pair of joint state and action along the trajectory, together with the corresponding local reward, and (ii) the local estimates of the full gradient tracked by its neighbors. Based on the updated estimate of the full gradient, each agent then updates its local copy of the primal parameter. By iteratively propagating the local information through the network, the agents reach global consensus and collectively attain the desired primal parameter, which gives an optimal approximation of the global value function.

Related Work The study of MARL in the context of Markov games dates back to [28]. See also [29, 24, 21] and recent works on collaborative MARL [51, 1]. However, most of these works consider the tabular setting, which suffers from the curse of dimensionality. To address this issue, under the collaborative MARL framework, [53] and [25] study actor-critic algorithms and policy evaluation with linear function approximation, respectively. However, their analysis is asymptotic in nature and relies largely on two-time-scale stochastic approximation via ordinary differential equations [2], which is tailored to the continuous-time setting. Meanwhile, most works on collaborative MARL impose the simplifying assumption that the local rewards are identical across agents, which makes it unnecessary to exchange local information. More recently, [17–19, 31, 37] study deep MARL that uses deep neural networks as function approximators. However, most of these works focus on empirical performance and lack theoretical guarantees. Also, they do not emphasize the efficient exchange of information across agents.
In addition to MARL, another line of related works studies multi-task reinforcement learning (MTRL), in which an agent aims to solve multiple reinforcement learning problems with shared structures [52, 39, 32, 33, 48].

The primal-dual formulation of reinforcement learning is studied in [30, 32, 33, 26, 10, 7, 50, 12, 11, 15], among others. Except for [32, 33], discussed above, most of these works study the single-agent setting. Among them, [26, 15] are most related to our work. Specifically, they develop variance reduction-based algorithms [22, 14, 43] to achieve a geometric rate of convergence in the setting with batch trajectory. In comparison, our algorithm is based on the aforementioned double averaging update scheme, which updates the local estimates of the full gradient using both the estimates of neighbors and new states, actions, and rewards. In the single-agent setting, our algorithm is closely related to stochastic average gradient (SAG) [43] and stochastic incremental gradient (SAGA) [14], with the difference that our objective function is a finite-sum convex-concave saddle-point problem.

Our work is also related to prior work in the broader contexts of primal-dual and multi-agent optimization. For example, [38] apply variance reduction techniques to convex-concave saddle-point problems to achieve a geometric rate of convergence. However, their algorithm is centralized, and it is unclear whether their approach is readily applicable to the multi-agent setting. Another line of related works studies multi-agent optimization, for example, [49, 36, 6, 44, 41]. However, these works mainly focus on the general setting where the objective function is a sum of convex local cost functions.
To the best of our knowledge, our work is the first to address decentralized convex-concave saddle-point problems with sampled observations that arise from MARL.

Contribution Our contribution is threefold: (i) We reformulate the multi-agent policy evaluation problem using Fenchel duality and propose a decentralized primal-dual optimization algorithm with a double averaging update scheme. (ii) We establish the global geometric rate of convergence for the proposed algorithm, making it the first algorithm to achieve fast linear convergence for MARL. (iii) Our proposed algorithm and analysis are of independent interest for solving a broader class of decentralized convex-concave saddle-point problems with sampled observations.

Organization In §2 we introduce the problem formulation of MARL. In §3 we present the proposed algorithm and lay out the convergence analysis. In §4 we illustrate the empirical performance of the proposed algorithm. We defer the detailed proofs to the supplementary material.

Notation Unless otherwise specified, for a vector x, ‖x‖ denotes its Euclidean norm; for a matrix X, ‖X‖ denotes its spectral norm, i.e., its largest singular value.

2 Problem Formulation

In this section, we introduce the background of MARL, which is modeled as a multi-agent Markov decision process (MDP). Under this model, we formulate the policy evaluation problem as a primal-dual convex-concave optimization problem.

Multi-agent MDP Consider a group of N agents. We are interested in the multi-agent MDP

(S, {A_i}_{i=1}^N, P^a, {R_i}_{i=1}^N, γ),

where S is the state space and A_i is the action space of agent i. We write s ∈ S and a := (a_1, ..., a_N) ∈ A_1 × ··· × A_N for the joint state and action, respectively.
The function R_i(s, a) is the local reward received by agent i after the agents take joint action a at state s, and γ ∈ (0, 1) is the discount factor. Both s and a are available to all agents, whereas the reward R_i is private to agent i.

In contrast to a single-agent MDP, the agents are coupled together through the state transition matrix P^a ∈ R^{|S|×|S|}, whose (s, s')-th element is the probability of transiting from s to s' after the joint action a is taken. This scenario arises in large-scale applications such as sensor networks [42, 9], swarm robotics [23, 8], and power grids [3, 13], which strongly motivates the development of a multi-agent RL strategy. Moreover, under the collaborative setting, the goal is to maximize the collective return of all agents. If there exists a central controller that collects the rewards of, and assigns the action to, each individual agent, the problem reduces to a classical MDP with action space A and global reward function R_c(s, a) = N^{-1} Σ_{i=1}^N R_i(s, a). Thus, without such a central controller, it is essential for the agents to collaborate with each other so as to solve the multi-agent problem based solely on local information.

Furthermore, a joint policy, denoted by π, specifies the rule of making sequential decisions for the agents. Specifically, π(a|s) is the conditional probability of taking joint action a given the current state s. We define the reward function of joint policy π as an average of the local rewards:

R_c^π(s) := (1/N) Σ_{i=1}^N R_i^π(s),  where  R_i^π(s) := E_{a∼π(·|s)}[R_i(s, a)].   (1)

That is, R_c^π(s) is the expected value of the average of the local rewards when the agents follow policy π at state s. Besides, any fixed policy π induces a Markov chain over S, whose transition matrix is denoted by P^π.
The (s, s')-th element of P^π is given by

[P^π]_{s,s'} = Σ_{a∈A} π(a|s) · [P^a]_{s,s'}.

When this Markov chain is aperiodic and irreducible, it induces a stationary distribution μ_π over S.

Policy Evaluation A central problem in reinforcement learning is policy evaluation, which refers to learning the value function of a given policy. This problem appears as a key component in both value-based methods such as policy iteration, and policy-based methods such as actor-critic algorithms [46]. Thus, efficient estimation of the value functions in multi-agent MDPs enables us to extend the successful approaches of single-agent RL to the setting of MARL.

Specifically, for any given joint policy π, the value function of π, denoted by V^π : S → R, is defined as the expected value of the discounted cumulative reward when the multi-agent MDP is initialized with a given state and the agents follow policy π afterwards. For any state s ∈ S, we define

V^π(s) := E[ Σ_{p=1}^∞ γ^p R_c^π(s_p) | s_1 = s, π ].   (2)

To simplify the notation, we define the vector V^π ∈ R^{|S|} by stacking up V^π(s) in (2) for all s. By definition, V^π satisfies the Bellman equation

V^π = R_c^π + γ P^π V^π,   (3)

where R_c^π is obtained by stacking up (1), and [P^π]_{s,s'} := E_π[P^a_{s,s'}] is the expected transition matrix. Moreover, it can be shown that V^π is the unique solution of (3).

When the number of states is large, it is impossible to store V^π. Instead, our goal is to learn an approximate version of the value function via function approximation.
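As a quick numerical sanity check, the Bellman equation (3) pins down V^π uniquely whenever γ < 1: stacking the rewards and transitions, V^π = (I − γP^π)^{-1} R_c^π. A minimal sketch on a small synthetic chain (the transition matrix and rewards below are random stand-ins, not derived from any particular MDP):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Row-stochastic stand-in for the induced transition matrix P^pi
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_states)  # stand-in for the stacked rewards R_c^pi

# Since gamma < 1, (I - gamma P) is invertible and (3) has a unique solution
V = np.linalg.solve(np.eye(n_states) - gamma * P, R)
assert np.allclose(V, R + gamma * P @ V)  # Bellman equation: V = R + gamma P V
```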
Specifically, we approximate V^π(s) using the family of linear functions

{ V_θ(s) := φ(s)^T θ : θ ∈ R^d },

where θ ∈ R^d is the parameter and φ : S → R^d is a known dictionary consisting of d features, e.g., a feature mapping induced by a neural network. To simplify the notation, we define Φ := (...; φ(s)^T; ...) ∈ R^{|S|×d} and let V_θ ∈ R^{|S|} be the vector constructed by stacking up {V_θ(s)}_{s∈S}. With function approximation, our problem reduces to finding a θ ∈ R^d such that V_θ ≈ V^π. Specifically, we seek a θ that minimizes the mean squared projected Bellman error (MSPBE)

MSPBE*(θ) := (1/2) ‖ Π_Φ ( V_θ − γ P^π V_θ − R_c^π ) ‖²_D + ρ ‖θ‖²,   (4)

where D = diag[{μ_π(s)}_{s∈S}] ∈ R^{|S|×|S|} is a diagonal matrix constructed from the stationary distribution of π, Π_Φ : R^{|S|} → R^{|S|} is the projection onto the subspace {Φθ : θ ∈ R^d}, defined as Π_Φ = Φ(Φ^T D Φ)^{-1} Φ^T D, and ρ ≥ 0 is a free parameter controlling the regularization on θ. For any positive semidefinite matrix A, we define ‖v‖_A = √(v^T A v) for any vector v. By
By\ndirect computation, when \u03a6(cid:62)D\u03a6 is invertible, the MSPBE de\ufb01ned in (4) can be written as\n\n\u221a\n\nD\n\nc\n\n+ \u03c1(cid:107)\u03b8(cid:107)2 =\n\nV\u03b8 \u2212 \u03b3P \u03c0V\u03b8 \u2212 R\u03c0\n\nc\n\nMSPBE(cid:63)(\u03b8) =\n\n(\u03a6(cid:62)D\u03a6)\u22121\n\nwhere we de\ufb01ne A := E(cid:2)\u03c6(sp)(cid:0)\u03c6(sp) \u2212 \u03b3\u03c6(sp+1)(cid:1)(cid:62)(cid:3), C := E(cid:2)\u03c6(sp)\u03c6(cid:62)(sp)(cid:3), and b :=\nE(cid:2)R\u03c0\nc (sp)\u03c6(sp)(cid:3). Here the expectations in A, b, and C are all taken with respect to (w.r.t. )\n\nthe stationary distribution \u00b5\u03c0. Furthermore, when A is full rank and C is positive de\ufb01nite, it can be\nshown that the MSPBE in (5) has a unique minimizer.\nTo obtain a practical optimization problem, we replace the expectations above by their sampled\naverages from M samples. In speci\ufb01c, for a given policy \u03c0, a \ufb01nite state-action sequence {sp, ap}M\np=1\nis simulated from the multi-agent MDP using joint policy \u03c0. We also observe sM +1, the next state of\nsM . Then we construct the sampled versions of A, b, C, denoted respectively by \u02c6A, \u02c6C, \u02c6b, as\n\nC\u22121\n\n+ \u03c1(cid:107)\u03b8(cid:107)2,\n(5)\n\n(cid:13)(cid:13)(cid:13)A\u03b8\u2212 b\n\n(cid:13)(cid:13)(cid:13)2\n\n1\n2\n\n(cid:17)(cid:13)(cid:13)(cid:13)2\n\n(cid:80)M\n\np=1 Cp, \u02c6b := 1\n\nM\n\np=1 bp, with\n\n, Cp := \u03c6(sp)\u03c6(cid:62)(sp), bp := Rc(sp, ap)\u03c6(sp) ,\n\n(6)\n\n(cid:80)M\n\n(cid:80)M\nAp := \u03c6(sp)(cid:0)\u03c6(sp) \u2212 \u03b3\u03c6(sp+1)(cid:1)(cid:62)\n\np=1 Ap, \u02c6C := 1\n\nwhere Rc(sp, ap) := N\u22121(cid:80)N\n\n\u02c6A := 1\nM\n\nM\n\n(cid:13)(cid:13)(cid:13) \u02c6A\u03b8 \u2212 \u02c6b\n\n(cid:13)(cid:13)(cid:13)2\n\n1\n2\n\ni=1 Ri(sp, ap) is the average of the local rewards received by each\nagent when taking action ap at state sp. 
Here we assume that M is sufficiently large so that Ĉ is invertible and Â is full rank. Using the terms defined in (6), we obtain the empirical MSPBE

MSPBE(θ) := (1/2) ‖ Âθ − b̂ ‖²_{Ĉ^{-1}} + ρ ‖θ‖²,   (7)

which converges to MSPBE*(θ) as M → ∞. Letting θ̂ be a minimizer of the empirical MSPBE, our estimate of V^π is given by Φθ̂. Since the rewards {R_i(s_p, a_p)}_{i=1}^N are private to each agent, it is impossible for any single agent to compute R_c(s_p, a_p) and minimize the empirical MSPBE (7) independently.

Multi-agent, Primal-dual, Finite-sum Optimization Recall that under the multi-agent MDP, the agents are able to observe the states and the joint actions, but can only observe their local rewards. Thus, each agent is able to compute Â and Ĉ defined in (6), but is unable to obtain b̂. To resolve this issue, for any i ∈ {1, ..., N} and any p ∈ {1, ..., M}, we define b_{p,i} := R_i(s_p, a_p) φ(s_p) and b̂_i := M^{-1} Σ_{p=1}^M b_{p,i}, which are known to agent i only. By direct computation, it is easy to verify that minimizing MSPBE(θ) in (7) is equivalent to solving

min_{θ∈R^d} (1/N) Σ_{i=1}^N MSPBE_i(θ),  where  MSPBE_i(θ) := (1/2) ‖ Âθ − b̂_i ‖²_{Ĉ^{-1}} + ρ ‖θ‖².   (8)

The equivalence can be seen by comparing the optimality conditions of the two optimization problems. Importantly, (8) is a multi-agent optimization problem [36] whose objective is to minimize a sum of N local functions coupled together by the common parameter θ. Here MSPBE_i(θ) is private to agent i and the parameter θ is shared by all agents.
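To make the finite-sum structure concrete, the following sketch builds Â, Ĉ, and the local b̂_i of (6)–(8) from synthetic features and rewards (all data are random stand-ins), and checks the equivalence claim: the objectives in (7) and (8) differ only by a θ-independent constant, so their gradients, and hence their minimizers, coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, N, rho, gamma = 3, 200, 4, 0.1, 0.9

phi = rng.random((M + 1, d))   # stand-in features phi(s_p), p = 1, ..., M+1
R_loc = rng.random((N, M))     # stand-in local rewards R_i(s_p, a_p)

# Sampled matrices of eq. (6); b_hat uses the global reward R_c = average of R_i
A_hat = np.mean([np.outer(phi[p], phi[p] - gamma * phi[p + 1]) for p in range(M)], axis=0)
C_hat = np.mean([np.outer(phi[p], phi[p]) for p in range(M)], axis=0)
b_i = np.stack([(R_loc[i, :, None] * phi[:M]).mean(axis=0) for i in range(N)])
b_hat = b_i.mean(axis=0)

C_inv = np.linalg.inv(C_hat)

# Gradients of (7) and of the average objective in (8) agree at every theta,
# since averaging the residuals A theta - b_i over i recovers A theta - b_hat.
theta = rng.random(d)
grad_7 = A_hat.T @ C_inv @ (A_hat @ theta - b_hat) + 2 * rho * theta
grad_8 = np.mean([A_hat.T @ C_inv @ (A_hat @ theta - b_i[i]) for i in range(N)],
                 axis=0) + 2 * rho * theta
assert np.allclose(grad_7, grad_8)
```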
As inspired by [35, 30, 15], using Fenchel duality, we obtain the conjugate form of MSPBE_i(θ), i.e.,

(1/2) ‖ Âθ − b̂_i ‖²_{Ĉ^{-1}} + ρ ‖θ‖² = max_{w_i∈R^d} ( w_i^T ( Âθ − b̂_i ) − (1/2) w_i^T Ĉ w_i ) + ρ ‖θ‖².   (9)

Observe that each of Â, Ĉ, b̂_i can be expressed as a finite sum of matrices/vectors. By (9), problem (8) is equivalent to a multi-agent, primal-dual, finite-sum optimization problem:

min_{θ∈R^d}  max_{w_i∈R^d, i=1,...,N}  (1/(NM)) Σ_{i=1}^N Σ_{p=1}^M J_{i,p}(θ, w_i),  with
J_{i,p}(θ, w_i) := w_i^T A_p θ − b_{p,i}^T w_i − (1/2) w_i^T C_p w_i + ρ ‖θ‖².   (10)

Hereafter, the global objective function is denoted by J(θ, {w_i}_{i=1}^N) := (1/(NM)) Σ_{i=1}^N Σ_{p=1}^M J_{i,p}(θ, w_i), which is convex w.r.t. the primal variable θ and concave w.r.t. the dual variables {w_i}_{i=1}^N.

It is worth noting that the challenges in solving (10) are threefold. First, to obtain a saddle-point solution (θ, {w_i}_{i=1}^N), any algorithm for (10) needs to update the primal and dual variables simultaneously, which can be difficult as the objective function need not be strongly convex with respect to θ. In this case, it is nontrivial to compute a solution efficiently. Second, the objective function of (10) consists of a sum of M functions, with potentially M ≫ 1, so that conventional primal-dual methods [4] can no longer be applied due to the increased complexity.
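The identity (9) is the standard Fenchel dual of a quadratic: the inner maximization is attained at w_i* = Ĉ^{-1}(Âθ − b̂_i), at which point the maximum recovers the Ĉ^{-1}-weighted squared residual. A quick numerical check with random stand-in matrices (only assuming the stand-in for Ĉ is positive definite):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.random((d, d))
C0 = rng.random((d, d))
C = C0 @ C0.T + np.eye(d)              # positive-definite stand-in for C_hat
b, theta = rng.random(d), rng.random(d)

r = A @ theta - b                      # residual A theta - b_i
lhs = 0.5 * r @ np.linalg.solve(C, r)  # (1/2) ||A theta - b_i||^2 in the C^{-1} norm

w_star = np.linalg.solve(C, r)         # maximizer of w^T r - (1/2) w^T C w
rhs = w_star @ r - 0.5 * w_star @ C @ w_star
assert np.isclose(lhs, rhs)            # the two sides of (9), up to the common rho term
```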
Lastly, since θ is shared by all the agents, when solving (10) the N agents need to reach a consensus on θ without sharing their local functions; e.g., J_{i,p}(·) has to remain unknown to all agents except agent i due to privacy concerns. Although finite-sum convex optimization problems with shared variables are well studied, new algorithms and theory are needed for convex-concave saddle-point problems. Next, we propose a novel decentralized first-order algorithm that tackles these difficulties and converges to a saddle-point solution of (10) at a linear rate.

3 Primal-dual Distributed Incremental Aggregated Gradient Method

We are ready to introduce our algorithm for solving the optimization problem in (10). Since θ is shared by all the N agents, the agents need to exchange information so as to reach a consensual solution. Let us first specify the communication model. We assume that the N agents communicate over a network specified by a connected and undirected graph G = (V, E), with V = [N] = {1, ..., N} and E ⊆ V × V being its vertex set and edge set, respectively. Over G, it is possible to define a doubly stochastic matrix W such that W_ij = 0 if (i, j) ∉ E and W1 = W^T 1 = 1; note that λ := λ_max(W − N^{-1} 1 1^T) < 1 since G is connected. Notice that the edges in G may be formed independently of the coupling between agents in the MDP induced by the stochastic policy π.

We handle problem (10) by judiciously combining the techniques of dynamic consensus [41, 54] and stochastic (or incremental) average gradient (SAG) [20, 43], which have been developed independently in the control and machine learning communities, respectively. From a high-level viewpoint, our method utilizes a gradient estimator that tracks the gradient over space (across N agents) and time (across M samples).
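One standard way to realize such a doubly stochastic W on a connected graph is Metropolis–Hastings weighting; the sketch below builds W for a ring of N = 6 agents (the ring topology is an illustrative choice, not one prescribed by the text) and verifies the mixing property λ < 1:

```python
import numpy as np

N = 6
edges = [(i, (i + 1) % N) for i in range(N)]  # connected ring graph
deg = np.full(N, 2)                           # every node has two neighbors

W = np.zeros((N, N))
for i, j in edges:                            # Metropolis-Hastings weights
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
np.fill_diagonal(W, 1.0 - W.sum(axis=1))

# W is symmetric and doubly stochastic, with W_ij = 0 off the edge set
assert np.allclose(W, W.T) and np.allclose(W @ np.ones(N), np.ones(N))

# Mixing: lambda = lambda_max(W - (1/N) 1 1^T) < 1 since the graph is connected
lam = np.linalg.norm(W - np.ones((N, N)) / N, 2)
assert lam < 1
```

For this ring, every weight is 1/3 and λ = 2/3, so local averaging contracts disagreement geometrically.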
To proceed with our development while explaining the intuition, we first investigate a centralized and batch algorithm for solving (10).

Centralized Primal-dual Optimization Consider the primal-dual gradient updates. For any t ≥ 1, at the t-th iteration, we update the primal and dual variables by

θ^{t+1} = θ^t − γ_1 ∇_θ J(θ^t, {w^t_i}_{i=1}^N),   w^{t+1}_i = w^t_i + γ_2 ∇_{w_i} J(θ^t, {w^t_i}_{i=1}^N),  i ∈ [N],   (11)

where γ_1, γ_2 > 0 are step sizes. This is a simple application of a gradient descent/ascent update to the primal/dual variables. As shown by Du et al. [15], when Â is full rank and Ĉ is invertible, the Jacobian matrix of the primal-dual optimality condition is full rank. Thus, within a certain range of step sizes (γ_1, γ_2), the recursion (11) converges linearly to the optimal solution of (10).

Proposed Method The primal-dual gradient method in (11) serves as a reasonable template for developing an efficient decentralized algorithm for (10). Let us focus on the update of the primal variable θ in (11), which is the more challenging part since θ is shared by all N agents. To evaluate the gradient w.r.t. θ, we observe that (a) agent i does not have access to the functions {J_{j,p}(·), j ≠ i} of the other agents; and (b) computing the gradient requires summing up the contributions from M samples. As M ≫ 1, doing so is undesirable since the computation complexity would be O(Md). We circumvent these issues by utilizing a double gradient tracking scheme for the primal θ-update and an incremental update scheme for the local dual w_i-update in the following primal-dual distributed incremental aggregated gradient (PD-DistIAG) method.
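Before decentralizing, the centralized recursion (11) can be run on a small synthetic single-agent instance of the saddle problem (random full-rank stand-ins for Â, Ĉ, b̂; the step sizes and iteration count below are ad hoc choices for this toy instance). The true saddle point has the closed form θ* = (Â^T Ĉ^{-1} Â + 2ρI)^{-1} Â^T Ĉ^{-1} b̂, with w* = Ĉ^{-1}(Âθ* − b̂):

```python
import numpy as np

rng = np.random.default_rng(3)
d, rho = 3, 0.5
A = rng.random((d, d)) + np.eye(d)  # full-rank stand-in for A_hat
C0 = rng.random((d, d))
C = C0 @ C0.T + np.eye(d)           # positive-definite stand-in for C_hat
b = rng.random(d)

# J(theta, w) = w^T A theta - b^T w - (1/2) w^T C w + rho ||theta||^2
theta, w = np.zeros(d), np.zeros(d)
g1 = g2 = 0.05
for _ in range(10000):              # simultaneous gradient descent/ascent, eq. (11)
    theta, w = (theta - g1 * (A.T @ w + 2 * rho * theta),
                w + g2 * (A @ theta - b - C @ w))

theta_star = np.linalg.solve(A.T @ np.linalg.solve(C, A) + 2 * rho * np.eye(d),
                             A.T @ np.linalg.solve(C, b))
assert np.allclose(theta, theta_star, atol=1e-5)
assert np.allclose(w, np.linalg.solve(C, A @ theta_star - b), atol=1e-5)
```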
Here each agent i ∈ [N] maintains a local copy of the primal parameter, {θ^t_i}_{t≥1}. We construct sequences {s^t_i}_{t≥1} and {d^t_i}_{t≥1} to track the gradients with respect to θ and w_i, respectively. Similar to (11), in the t-th iteration we update the dual variable via a gradient update using d^t_i. As for the primal variable, to achieve consensus, each θ^{t+1}_i is obtained by first combining {θ^t_j}_{j∈[N]} using the weight matrix W, and then updating in the direction of s^t_i. The details of our method are presented in Algorithm 1.

Algorithm 1 PD-DistIAG Method for Multi-agent, Primal-dual, Finite-sum Optimization

Input: Initial estimates {θ^1_i, w^1_i}_{i∈[N]}, initial gradient estimators s^0_i = d^0_i = 0 for all i ∈ [N], initial counters τ^0_p = 0 for all p ∈ [M], and step sizes γ_1, γ_2 > 0.
for t ≥ 1 do
  The agents pick a common sample indexed by p_t ∈ {1, ..., M}.
  Update the counter variables as
    τ^t_{p_t} = t,  τ^t_p = τ^{t−1}_p for all p ≠ p_t.   (12)
  for each agent i ∈ {1, ..., N} do
    Update the gradient surrogates by
    s^t_i = Σ_{j=1}^N W_ij s^{t−1}_j + (1/M) [ ∇_θ J_{i,p_t}(θ^t_i, w^t_i) − ∇_θ J_{i,p_t}(θ^{τ^{t−1}_{p_t}}_i, w^{τ^{t−1}_{p_t}}_i) ],   (13)
    d^t_i = d^{t−1}_i + (1/M) [ ∇_{w_i} J_{i,p_t}(θ^t_i, w^t_i) − ∇_{w_i} J_{i,p_t}(θ^{τ^{t−1}_{p_t}}_i, w^{τ^{t−1}_{p_t}}_i) ],   (14)
    where ∇_θ J_{i,p}(θ^0_i, w^0_i) = 0 and ∇_{w_i} J_{i,p}(θ^0_i, w^0_i) = 0 for all p ∈ [M] at initialization.
    Perform the primal-dual updates using s^t_i, d^t_i as surrogates for the gradients w.r.t. θ and w_i:
    θ^{t+1}_i = Σ_{j=1}^N W_ij θ^t_j − γ_1 s^t_i,   w^{t+1}_i = w^t_i + γ_2 d^t_i.   (15)
  end for
end for

Let us explain the intuition behind the PD-DistIAG method through the update (13). Recall that the global gradient desired at iteration t is ∇_θ J(θ^t, {w^t_i}_{i=1}^N), which represents a double average, one over space (across agents) and one over time (across samples). In (13), the first summand on the right-hand side computes a local average among the neighbors of agent i, thereby tracking the global gradient over space. This is akin to the gradient tracking technique in the context of distributed optimization [41]. The remaining terms on the right-hand side of (13) follow an incremental update rule akin to the SAG method [43], involving a swap-in/swap-out operation on the stored gradients. This achieves tracking of the global gradient over time.

To gain insight into why the scheme works, note that s^t_i and d^t_i are surrogates for the primal and dual gradients. Moreover, using (12), the counter variable can alternatively be represented as τ^t_p = max{ℓ ≥ 0 : ℓ ≤ t, p_ℓ = p}. In other words, τ^t_p is the iteration at which the p-th sample was last visited by the agents prior to iteration t, and if the p-th sample has never been visited, then τ^t_p = 0. For any t ≥ 1, define g_θ(t) := (1/N) Σ_{i=1}^N s^t_i. The following lemma shows that g_θ(t) is a double average of the primal gradient: it averages the local gradients across the agents and, within each local gradient, it averages the past gradients over all the samples evaluated up till iteration t + 1.
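The updates (12)–(15) can be sketched end-to-end on a tiny synthetic instance. Everything below is a stand-in (hand-picked features so that the stand-ins for Â and Ĉ are well conditioned, a 4-agent ring for W, ad hoc step sizes); the run checks that all local copies θ^t_i reach consensus on the centralized minimizer of (7):

```python
import numpy as np

rng = np.random.default_rng(4)
d, M, N, rho, gamma = 2, 4, 4, 0.5, 0.9

# Deterministic stand-in features (chosen so A_hat is full rank, C_hat is PD)
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 1.0], [1.0, 0.5]])
R_loc = rng.random((N, M))                              # local rewards R_i(s_p, a_p)

A_p = np.stack([np.outer(phi[p], phi[p] - gamma * phi[p + 1]) for p in range(M)])
C_p = np.stack([np.outer(phi[p], phi[p]) for p in range(M)])
b_pi = R_loc[:, :, None] * phi[None, :M, :]             # b_{p,i} = R_i(s_p,a_p) phi(s_p)

# Reference solution from the centralized empirical MSPBE (7)
A_hat, C_hat, b_hat = A_p.mean(0), C_p.mean(0), b_pi.mean((0, 1))
C_inv = np.linalg.inv(C_hat)
theta_star = np.linalg.solve(A_hat.T @ C_inv @ A_hat + 2 * rho * np.eye(d),
                             A_hat.T @ C_inv @ b_hat)

# Doubly stochastic W for a 4-agent ring (all weights 1/3)
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = W[i, i] = 1 / 3

theta = np.zeros((N, d)); w = np.zeros((N, d))
s = np.zeros((N, d)); dual_sur = np.zeros((N, d))       # surrogates s_i^t, d_i^t
gth_old = np.zeros((N, M, d)); gw_old = np.zeros((N, M, d))
g1 = g2 = 0.02
for t in range(30000):
    p = t % M                                           # cyclic selection, satisfies A1
    gth = w @ A_p[p] + 2 * rho * theta                  # rows: grad_theta J_{i,p}
    gw = theta @ A_p[p].T - b_pi[:, p] - w @ C_p[p].T   # rows: grad_{w_i} J_{i,p}
    s = W @ s + (gth - gth_old[:, p]) / M               # eq. (13): space + time tracking
    dual_sur = dual_sur + (gw - gw_old[:, p]) / M       # eq. (14): time tracking only
    gth_old[:, p], gw_old[:, p] = gth, gw               # SAG-style swap-in/swap-out
    theta = W @ theta - g1 * s                          # eq. (15), primal
    w = w + g2 * dual_sur                               # eq. (15), dual

assert np.allclose(theta, theta_star[None, :], atol=1e-4)  # consensus + optimality
```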
This shows that the network-wide average of {s^t_i}_{i=1}^N always tracks the double average of the local and past gradients, i.e., the gradient estimate g_θ(t) is "unbiased" with respect to the network-wide average.

Lemma 1 For all t ≥ 1, Algorithm 1 satisfies

g_θ(t) = (1/(NM)) Σ_{i=1}^N Σ_{p=1}^M ∇_θ J_{i,p}(θ^{τ^t_p}_i, w^{τ^t_p}_i).   (16)

Proof. We prove the statement by induction. For the base case t = 1, using (13) and the update rule specified in the algorithm, we have

g_θ(1) = (1/N) Σ_{i=1}^N (1/M) ∇_θ J_{i,p_1}(θ^1_i, w^1_i) = (1/(NM)) Σ_{i=1}^N Σ_{p=1}^M ∇_θ J_{i,p}(θ^{τ^1_p}_i, w^{τ^1_p}_i),   (17)

where we use the fact that ∇_θ J_{i,p}(θ^{τ^1_p}_i, w^{τ^1_p}_i) = ∇_θ J_{i,p}(θ^0_i, w^0_i) = 0 for all p ≠ p_1. For the induction step, suppose (16) holds up to iteration t. Since W is doubly stochastic, (13) implies

g_θ(t+1) = (1/N) Σ_{i=1}^N { Σ_{j=1}^N W_ij s^t_j + (1/M) [ ∇_θ J_{i,p_{t+1}}(θ^{t+1}_i, w^{t+1}_i) − ∇_θ J_{i,p_{t+1}}(θ^{τ^t_{p_{t+1}}}_i, w^{τ^t_{p_{t+1}}}_i) ] }
= g_θ(t) + (1/(NM)) Σ_{i=1}^N [ ∇_θ J_{i,p_{t+1}}(θ^{t+1}_i, w^{t+1}_i) − ∇_θ J_{i,p_{t+1}}(θ^{τ^t_{p_{t+1}}}_i, w^{τ^t_{p_{t+1}}}_i) ].   (18)

Notice that τ^{t+1}_{p_{t+1}} = t + 1 and τ^{t+1}_p = τ^t_p for all p ≠ p_{t+1}. The induction hypothesis in (16) can thus be written as

g_θ(t) = (1/(NM)) Σ_{i=1}^N [ Σ_{p≠p_{t+1}} ∇_θ J_{i,p}(θ^{τ^{t+1}_p}_i, w^{τ^{t+1}_p}_i) ] + (1/(NM)) Σ_{i=1}^N ∇_θ J_{i,p_{t+1}}(θ^{τ^t_{p_{t+1}}}_i, w^{τ^t_{p_{t+1}}}_i).   (19)

Finally, combining (18) and (19), we obtain that (16) holds for iteration t + 1. This, together with the base case (17), establishes Lemma 1. Q.E.D.

As for the dual update (14), observe that the variable w_i is local to agent i. Therefore its gradient surrogate, d^t_i, involves only the tracking step over time [cf. (14)], i.e., it only averages the gradient over samples. Combined with Lemma 1, this shows that the PD-DistIAG method uses gradient surrogates that are averages over samples despite the disparities across agents. Since the average over samples is done in a similar spirit to the SAG method, the proposed method is expected to converge linearly.

Storage and Computation Complexities Let us comment on the computational and storage complexity of the PD-DistIAG method. First of all, since the method requires access to the previously evaluated gradients, each agent has to store 2M such vectors in memory to avoid re-evaluating them; that is, each agent needs to store a total of 2Md real numbers. On the other hand, the per-iteration computation complexity for each agent is only O(d), as each iteration only requires evaluating the gradient at one sample, as delineated in (14)–(15).

Communication Overhead The PD-DistIAG method described in Algorithm 1 requires an information exchange round [of s^t_i and θ^t_i] among the agents at every iteration. From an implementation standpoint, this may incur significant communication overhead when d ≫ 1, and it is especially ineffective when the progress made in successive updates of the algorithm is not significant.
A natural remedy is to perform multiple local updates at each agent using different samples, without exchanging information with the neighbors; in this way, the communication overhead can be reduced. This modification to the PD-DistIAG method can be described generally using a time-varying weight matrix $W(t)$ such that $W(t) = I$ for most iterations. The convergence of the PD-DistIAG method in this scenario is left for future work.

3.1 Convergence Analysis

The PD-DistIAG method is built using the techniques of (a) primal-dual batch gradient descent, (b) gradient tracking for distributed optimization, and (c) stochastic average gradient, each of which has been independently shown to attain linear convergence under certain conditions; see [41, 43, 20, 15]. Naturally, the PD-DistIAG method is also anticipated to converge at a linear rate. To see this, let us consider the following condition on the sample selection rule of PD-DistIAG:

A1 Each sample is selected at least once in every $M$ iterations, i.e., $|t - \tau_p^t| \leq M$ for all $p \in [M]$, $t \geq 1$.

The assumption requires that every sample is visited infinitely often. For example, this can be enforced by a cyclic selection rule, i.e., $p_t = (t \bmod M) + 1$, or by a random sampling scheme without replacement (i.e., random shuffling) from the pool of $M$ samples. Finally, it is possible to relax the assumption such that a sample only needs to be selected once every $K$ iterations, with $K \geq M$; the present assumption is made solely for ease of presentation.
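As a quick illustration (a toy check, not from the paper; the helper `max_staleness` implements $\tau_p^t$ directly from its definition), the two selection rules mentioned above can be compared by measuring the worst-case staleness $\max_{t,p} |t - \tau_p^t|$:

```python
import random

M, T = 8, 10 * 8   # number of samples, horizon (a multiple of M)

def max_staleness(schedule):
    """Return max over t and p of |t - tau_p^t|, where tau_p^t is the last
    iteration <= t at which sample p was selected (0 if never selected)."""
    tau = {p: 0 for p in range(1, M + 1)}
    worst = 0
    for t, p in enumerate(schedule, start=1):
        tau[p] = t
        worst = max(worst, max(t - tau[q] for q in tau))
    return worst

# Cyclic rule: p_t = (t mod M) + 1.
cyclic = [(t % M) + 1 for t in range(T)]

# Random shuffling: sample without replacement, reshuffle every epoch.
rng = random.Random(0)
shuffled = []
for _ in range(T // M):
    epoch = list(range(1, M + 1))
    rng.shuffle(epoch)
    shuffled += epoch

print(max_staleness(cyclic))    # stays below M for the cyclic rule
print(max_staleness(shuffled))  # at most 2*M - 1 across epoch boundaries
```

The cyclic rule keeps the staleness strictly below $M$; with random shuffling, a sample drawn first in one epoch and last in the next can reach staleness $2M - 1$, which is covered by the relaxed condition with $K \geq M$ mentioned above.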
Moreover, to ensure that the solution to (10) is unique, we consider:

A2 The sampled correlation matrix $\widehat{A}$ is full rank, and the sampled covariance $\widehat{C}$ is non-singular.

The following theorem confirms the linear convergence of PD-DistIAG:

Theorem 1 Under A1 and A2, denote by $(\theta^\star, \{w_i^\star\}_{i=1}^N)$ the primal-dual optimal solution to the optimization problem in (10). Set the step sizes as $\gamma_2 = \beta\gamma_1$ with $\beta := 8\big(\rho + \lambda_{\max}(\widehat{A}^\top \widehat{C}^{-1} \widehat{A})\big)/\lambda_{\min}(\widehat{C})$, and define $\theta(t) := \frac{1}{N}\sum_{i=1}^N \theta_i^t$ as the average of the parameters. If the primal step size $\gamma_1$ is sufficiently small, then there exists a constant $0 < \sigma < 1$ such that

$$\big\|\theta(t) - \theta^\star\big\|^2 + \frac{1}{\beta N}\sum_{i=1}^N \big\|w_i^t - w_i^\star\big\|^2 = O(\sigma^t), \qquad \frac{1}{N}\sum_{i=1}^N \big\|\theta_i^t - \theta(t)\big\| = O(\sigma^t) .$$

If $N, M \gg 1$ and the graph is geometric, a sufficient condition for convergence is to set $\gamma_1 = O(1/\max\{N^2, M^2\})$, and the resultant rate is $\sigma = 1 - O(1/\max\{MN^2, M^3\})$.

The result above shows the desirable convergence properties of the PD-DistIAG method: the primal-dual solution $(\theta(t), \{w_i^t\}_{i=1}^N)$ converges to $(\theta^\star, \{w_i^\star\}_{i=1}^N)$ at a linear rate, and the consensus error of the local parameters $\theta_i^t$ also converges to zero linearly. A distinguishing feature of our analysis is that it handles the worst-case convergence of the proposed method, rather than the expected convergence rate popular for stochastic/incremental gradient methods.

Proof Sketch Our proof is divided into three steps.
The \ufb01rst step studies the progress made by the\nalgorithm in one iteration, taking into account the non-idealities due to imperfect tracking of the\ngradient over space and time. This leads to the characterization of a Lyapunov vector. The second step\nanalyzes the coupled system of one iteration progress made by the Lyapunov vector. An interesting\nfeature of it is that it consists of a series of independently delayed terms in the Lyapunov vector. The\nlatter is resulted from the incremental update schemes employed in the method. Here, we study a\nsuf\ufb01cient condition for the coupled and delayed system to converge linearly. The last step is to derive\ncondition on the step size \u03b31 where the suf\ufb01cient convergence condition is satis\ufb01ed.\nSpeci\ufb01cally, we study the progress of the Lyapunov functions:\n\n(cid:107)(cid:98)v(t)(cid:107)2 := \u0398\ni \u2212 \u03b8(t)(cid:107)2,\nThat is,(cid:98)v(t) is a vector whose squared norm is equivalent to a weighted distance to the optimal\n\nprimal-dual solution, Ec(t) and Eg(t) are respectively the consensus errors of the primal parameter\nand of the primal aggregated gradient. 
These functions form a non-negative vector which evolves as:\n\n, Ec(t) := 1\n\ni=1 (cid:107)\u03b8t\n\n\u03c4 t\np\nj , w\n\nNM\n\ni=1\n\nN\n\n.\n\ni\n\nj=1\n\n(cid:13)(cid:13)2(cid:17)\n(cid:13)(cid:13)wt\n(cid:80)N\n(cid:80)M\ni \u2212 w(cid:63)\np=1 \u2207\u03b8Jj,p(\u03b8\n\uf8eb\uf8ed max(t\u22122M )+\u2264q\u2264t (cid:107)(cid:98)v(q)(cid:107)\n\nmax(t\u22122M )+\u2264q\u2264t Ec(q)\nmax(t\u22122M )+\u2264q\u2264t Eg(q)\n\nN\n\ni=1\n\ni \u2212 1\n\nEg(t) := 1\n\n(cid:16)(cid:13)(cid:13)\u03b8(t) \u2212 \u03b8(cid:63)(cid:13)(cid:13)2\n+ (1/\u03b2N )(cid:80)N\n(cid:113)(cid:80)N\n(cid:13)(cid:13)st\n\uf8f6\uf8f8 \u2264 Q(\u03b31)\n\uf8eb\uf8ed (cid:107)(cid:98)v(t + 1)(cid:107)\n\uf8eb\uf8ed 1 \u2212 \u03b31a0 + \u03b32\n\nEc(t + 1)\nEg(t + 1)\n\nQ(\u03b31) =\n\n0\n\n\u03c4 t\np\n\n(cid:113)(cid:80)N\nj )(cid:13)(cid:13)2\n\uf8f6\uf8f8 ,\n\uf8f6\uf8f8 .\n\nwhere the matrix Q(\u03b31) \u2208 R3\u00d73 is de\ufb01ned by (exact form given in the supplementary material)\n\n(20)\n\n(21)\n\n1 a1\n\n\u03b31a2\n\n\u03bb\n\n0\n\u03b31\n\na4 + \u03b31a5 \u03bb + \u03b31a6\n\n\u03b31a3\n\nIn the above, \u03bb := \u03bbmax(W \u2212 (1/N )11(cid:62)) < 1, and a0, ..., a6 are some non-negative constants\nthat depends on the problem parameters N, M, the spectral properties of A, C, etc., with a0 being\npositive. If we focus only on the \ufb01rst row of the inequality system, we obtain\n\n(cid:107)(cid:98)v(t + 1)(cid:107) \u2264(cid:0)1 \u2212 \u03b31a0 + \u03b32\n\n1 a1)\n\nmax\n\n(t\u22122M )+\u2264q\u2264t\n\n(cid:107)(cid:98)v(q)(cid:107) + \u03b31a2\n\nmax\n\n(t\u22122M )+\u2264q\u2264t\n\nEc(q) .\n\nIn fact, when the contribution from Ec(q) can be ignored, then applying [16, Lemma 3] shows that\n1 a1 < 0, which is possible as a0 > 0. Therefore, if Ec(t)\nalso converges linearly, then it is anticipated that Eg(t) would do so as well. 
In other words, the linear\n\n(cid:107)(cid:98)v(t + 1)(cid:107) converges linearly if \u2212\u03b31a0 + \u03b32\nconvergence of (cid:107)(cid:98)v(t)(cid:107), Ec(t) and Eg(t) are all coupled in the inequality system (20).\nradius of Q(\u03b31) in (21) is strictly less than one, then each of the Lyapunov functions, (cid:107)(cid:98)v(t)(cid:107), Ec(t),\n\nFormalizing the above observations, Lemma 1 in the supplementary material shows a suf\ufb01cient\ncondition on \u03b31 for linear convergence. Speci\ufb01cally, if there exists \u03b31 > 0 such that the spectral\nEg(t), would enjoy linear convergence. Furthermore, Lemma 2 in the supplementary material gives\nan existence proof for such an \u03b31 to exist. This concludes the proof.\n\n8\n\n\fRemark While delayed inequality system has been studied in [16, 20] for optimization algorithms,\nthe coupled system in (20) is a non-trivial generalization of the above. Importantly, the challenge\nhere is due to the asymmetry of the system matrix Q and the maximum over the past sequences on\nthe right hand side are taken independently. To the best of our knowledge, our result is the \ufb01rst to\ncharacterize the (linear) convergence of such coupled and delayed system of inequalities.\nExtension Our analysis and algorithm may in fact be applied to solve general problems that involves\nmulti-agent and \ufb01nite-sum optimization, e.g.,\n\n(cid:80)N\n\n(cid:80)M\n\nmin\u03b8\u2208Rd J(\u03b8) := 1\nNM\n\ni=1\n\np=1 Ji,p(\u03b8) .\n\n(22)\n\nFor instance, these problems may arise in multi-agent empirical risk minimization, where data\nsamples are kept independently by agents. Our analysis, especially with convergence for inequality\nsystems of the form (20), can be applied to study a similar double averaging algorithm with just the\nprimal variable. In particular, we only require the sum function J(\u03b8) to be strongly convex, and the\nobjective functions Ji,p(\u00b7) to be smooth in order to achieve linear convergence. 
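The contraction behavior of the coupled and delayed system (20) can be illustrated numerically. In the sketch below the constants $\lambda$ and $a_0, \ldots, a_6$ are chosen by hand for illustration (they are not derived from any actual problem instance): the three recursions are iterated with each right-hand side taking its own independent maximum over the window $(t-2M)_+ \leq q \leq t$, and geometric decay is observed once the spectral radius of $Q(\gamma_1)$ is below one.

```python
import numpy as np

M = 5                                      # delay window parameter
lam = 0.6                                  # lambda = lambda_max(W - 11^T/N) < 1
a = [1.0, 0.5, 0.3, 0.2, 0.05, 0.1, 0.1]   # a0..a6: illustrative constants
g1 = 0.3                                   # primal step size gamma_1

# Q(gamma_1) following the structure of (21).
Q = np.array([
    [1 - g1 * a[0] + g1**2 * a[1], g1 * a[2],       0.0],
    [0.0,                          lam,             g1],
    [a[4] + g1 * a[5],             lam + g1 * a[6], g1 * a[3]],
])
assert np.abs(np.linalg.eigvals(Q)).max() < 1   # sufficient condition

T = 1000
v = np.ones((T + 1, 3))            # v[t] = (||v_hat(t)||, Ec(t), Eg(t))
for t in range(T):
    lo = max(t - 2 * M, 0)         # window (t - 2M)_+ <= q <= t
    # each component takes its own (independent) maximum over the window
    v[t + 1] = Q @ v[lo:t + 1].max(axis=0)

# linear (geometric) convergence despite the delays
assert np.all(v[T] < v[T - (2 * M + 1)])
print("final Lyapunov values:", v[T])
```

Asymptotically the worst-case iterates of this system decay at rate $\rho(Q(\gamma_1))^{1/(2M+1)}$ per step, so the delay window slows the rate but does not destroy linear convergence, in line with the discussion above.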
We believe that such an extension is of independent interest to the community. At the time of submission, a recent work [40] applied a related double averaging distributed algorithm to a stochastic version of (22); however, their convergence rate is sub-linear as they considered a stochastic optimization setting.

4 Numerical Experiments

To verify the performance of the proposed method, we conduct an experiment on the mountaincar dataset [46] under a setting similar to [15]. To collect the dataset, we ran Sarsa with $d = 300$ features to obtain the policy, and then generated trajectories of actions and states according to this policy with $M$ samples. For each sample $p$, we generate the local reward $R_i(s_p, a_p)$ by assigning a random portion of the reward to each agent such that the average of the local rewards equals $R^c(s_p, a_p)$. We compare our method to several centralized methods: PDBG, the primal-dual gradient descent method in (11); GTD2 [47]; and SAGA [15]. Notably, SAGA achieves linear convergence while only requiring an incremental update step of low complexity. For PD-DistIAG, we simulate a communication network with $N = 10$ agents, connected on an Erdős–Rényi graph generated with connectivity 0.2; for the step sizes, we set $\gamma_1 = 0.005/\lambda_{\max}(\widehat{A})$ and $\gamma_2 = 5 \times 10^{-3}$.

Figure 1: Experiment with the mountaincar dataset. For this problem, we have $d = 300$, $M = 5000$ samples, and there are $N = 10$ agents. (Left) Graph topology. (Middle) $\rho = 0.01$. (Right) $\rho = 0$.

Figure 1 compares the optimality gap in terms of MSPBE of the different algorithms against the epoch number, defined as $t/M$. For PD-DistIAG, the optimality gap is measured on the average objective, i.e., it is $\frac{1}{N}\sum_{i=1}^N \mathrm{MSPBE}(\theta_i^t) - \mathrm{MSPBE}(\theta^\star)$.
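A communication network of this kind can be set up as follows. This is a sketch under assumptions: the construction of the mixing matrix $W$ is not specified in the text, so Metropolis-style weights, one standard way to obtain a doubly stochastic matrix from an undirected graph, are used here.

```python
import numpy as np

def erdos_renyi_connected(n, p, rng):
    """Draw Erdos-Renyi adjacency matrices until the graph is connected."""
    while True:
        A = np.triu(rng.random((n, n)) < p, k=1)
        A = (A | A.T).astype(float)
        # connectivity check: (I + A)^n has no zero entry iff connected
        if (np.linalg.matrix_power(np.eye(n) + A, n) > 0).all():
            return A

def metropolis_weights(A):
    """Doubly stochastic mixing matrix W from an adjacency matrix."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # put remaining mass on the diagonal
    return W

rng = np.random.default_rng(42)
A = erdos_renyi_connected(10, 0.2, rng)   # N = 10 agents, connectivity 0.2
W = metropolis_weights(A)

assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
# lambda = lambda_max(W - 11^T / N) from the analysis; < 1 when connected
lam = np.abs(np.linalg.eigvals(W - np.ones((10, 10)) / 10)).max()
print("lambda =", round(lam, 3))
assert lam < 1
```

Since the Metropolis weights are symmetric with strictly positive diagonal, $W$ is doubly stochastic and $\lambda < 1$ whenever the sampled graph is connected, which is the condition the convergence analysis relies on.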
As seen in the middle panel, when the regularization factor is high with $\rho > 0$, the convergence speed of PD-DistIAG is comparable to that of SAGA; meanwhile, with $\rho = 0$ (right panel), PD-DistIAG converges at a slower speed than SAGA. Nevertheless, in both cases the PD-DistIAG method converges faster than the other methods except SAGA. Additional experiments comparing the performance under different topologies and regularization parameters are presented in the supplementary material.

Conclusion In this paper, we have studied the policy evaluation problem in multi-agent reinforcement learning. Utilizing Fenchel duality, a double averaging scheme is proposed to tackle the primal-dual, multi-agent, and finite-sum optimization problem that arises. The proposed PD-DistIAG method demonstrates linear convergence under reasonable assumptions.

Acknowledgement The authors would like to thank the three anonymous reviewers for their useful comments. HTW's work was supported by the grant NSF CCF-BSF 1714672. MH's work has been supported in part by NSF-CMMI 1727757 and AFOSR 15RT0767.

References

[1] G. Arslan and S. Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2017.

[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

[3] D. S. Callaway and I. A. Hiskens. Achieving controllability of electric loads. Proceedings of the IEEE, 99(1):184–199, 2011.

[4] A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.

[5] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization.
Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[6] J. Chen and A. H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.

[7] Y. Chen and M. Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.

[8] P. Corke, R. Peterson, and D. Rus. Networked robots: Flying robot navigation using a sensor net. Robotics Research, pages 234–243, 2005.

[9] J. Cortes, S. Martinez, T. Karatas, and F. Bullo. Coverage control for mobile sensing networks. IEEE Transactions on Robotics and Automation, 20(2):243–255, 2004.

[10] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.

[11] B. Dai, A. Shaw, N. He, L. Li, and L. Song. Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282, 2017.

[12] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, J. Chen, and L. Song. Smoothed dual embedding control. arXiv preprint arXiv:1712.10285, 2017.

[13] E. Dall'Anese, H. Zhu, and G. B. Giannakis. Distributed optimal power flow for smart microgrids. IEEE Transactions on Smart Grid, 4(3):1464–1475, 2013.

[14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[15] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. arXiv preprint arXiv:1702.07944, 2017.

[16] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. A delayed proximal gradient method with linear convergence rate. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2014.

[17] J.
Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

[18] J. Foerster, N. Nardelli, G. Farquhar, P. Torr, P. Kohli, S. Whiteson, et al. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.

[19] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multi-agent Systems, pages 66–83, 2017.

[20] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.

[21] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.

[22] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[23] J. Kober and J. Peters. Reinforcement learning in robotics: A survey. In Reinforcement Learning, pages 579–610. Springer, 2012.

[24] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning, 2000.

[25] D. Lee, H. Yoon, and N. Hovakimyan. Primal-dual algorithm for distributed reinforcement learning: Distributed GTD2. arXiv preprint arXiv:1803.08031, 2018.

[26] X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. arXiv preprint arXiv:1610.04674, 2016.

[27] A. Lin and Q. Ling. Decentralized and privacy-preserving low-rank matrix completion. Preprint, 2014.

[28] M. L. Littman.
Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 157–163, 1994.

[29] M. L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.

[30] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In UAI, pages 504–513, 2015.

[31] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.

[32] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2015.

[33] S. V. Macua, A. Tukiainen, D. G.-O. Hernández, D. Baldazo, E. M. de Cote, and S. Zazo. Diff-DAC: Distributed actor-critic for multitask deep reinforcement learning. arXiv preprint arXiv:1710.10363, 2017.

[34] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[35] A. Nedić and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79–110, 2003.

[36] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[37] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690, 2017.

[38] B. Palaniappan and F. Bach.
Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.

[39] E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multi-task and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.

[40] S. Pu and A. Nedić. Distributed stochastic gradient tracking methods. arXiv preprint arXiv:1805.11454, 2018.

[41] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 2017.

[42] M. Rabbat and R. Nowak. Distributed optimization in sensor networks. In International Symposium on Information Processing in Sensor Networks, pages 20–27, 2004.

[43] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[44] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

[45] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

[46] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[47] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616, 2009.

[48] Y. W. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multi-task reinforcement learning. arXiv preprint arXiv:1707.04175, 2017.

[49] J. Tsitsiklis, D. Bertsekas, and M. Athans.
Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[50] M. Wang. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.

[51] X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pages 1603–1610, 2003.

[52] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In International Conference on Machine Learning, pages 1015–1022, 2007.

[53] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1802.08757, 2018.

[54] M. Zhu and S. Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.