{"title": "A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers", "book": "Advances in Neural Information Processing Systems", "page_first": 8625, "page_last": 8634, "abstract": "The alternating direction method of multipliers (ADMM) has recently received tremendous interest for distributed large scale optimization in machine learning, statistics, multi-agent networks and related applications. In this paper, we propose a new parallel multi-block stochastic ADMM for distributed stochastic optimization, where each node is only required to perform simple stochastic gradient descent updates. The proposed ADMM is fully parallel, can solve problems with arbitrary block structures, and has a convergence rate comparable to or better than existing state-of-the-art ADMM methods for stochastic optimization. Existing stochastic (or deterministic) ADMMs require each node to exchange its updated primal variables across nodes at each iteration and hence cause a significant amount of communication overhead. Existing ADMMs require roughly the same number of inter-node communication rounds as the number of in-node computation rounds. In contrast, the number of communication rounds required by our new ADMM is only the square root of the number of computation rounds.", "full_text": "A Communication Ef\ufb01cient Stochastic Multi-Block\n\nAlternating Direction Method of Multipliers\n\nHao Yu\nAmazon\n\neeyuhao@gmail.com\n\nAbstract\n\nThe alternating direction method of multipliers (ADMM) has recently received\ntremendous interest for distributed large scale optimization in machine learning,\nstatistics, multi-agent networks and related applications. In this paper, we propose a\nnew parallel multi-block stochastic ADMM for distributed stochastic optimization,\nwhere each node is only required to perform simple stochastic gradient descent\nupdates. 
The proposed ADMM is fully parallel, can solve problems with arbitrary\nblock structures, and has a convergence rate comparable to or better than existing\nstate-of-the-art ADMM methods for stochastic optimization. Existing stochastic\n(or deterministic) ADMMs require each node to exchange its updated primal\nvariables across nodes at each iteration and hence cause signi\ufb01cant amounts of\ncommunication overhead. Existing ADMMs require roughly the same number of\ninter-node communication rounds as the number of in-node computation rounds.\nIn contrast, the number of communication rounds required by our new ADMM is\nonly the square root of the number of computation rounds.\n\n1 Introduction\n\nFix integer N \u2265 2. Consider multi-block linearly constrained stochastic convex programs given by:\n\nmin_{xi \u2208 Xi, \u2200i} f(x) = \u2211_{i=1}^{N} fi(xi)  s.t.  \u2211_{i=1}^{N} Ai xi = b,    (1)\n\nwhere xi \u2208 R^{di}, Ai \u2208 R^{m\u00d7di}, b \u2208 R^{m}, Xi \u2286 R^{di} are closed convex sets, and fi(xi) = E\u03be[fi(xi; \u03be)] are convex functions. To have a compact representation of (1), we de\ufb01ne x = [x1; x2; . . . ; xN] \u2208 R^{\u2211_{i=1}^{N} di}, X = \u220f_{i=1}^{N} Xi, f(x) = \u2211_{i=1}^{N} fi(xi) and A = [A1, A2, . . . , AN] \u2208 R^{m\u00d7\u2211_{i=1}^{N} di}. Note that constraint \u2211_{i=1}^{N} Ai xi = b can now be written as Ax = b.\n\nThe problem (1) captures many important applications in machine learning, network scheduling,\nstatistics and \ufb01nance. For example, (stochastic) linear programs that are too huge to be solved over a\nsingle node can be written as (1). To solve such large scale linear programs in a distributed manner,\nwe can save each Ai and fi(\u00b7) at a separate node and let each node iteratively solve smaller\nsubproblems (with necessary inter-node communication). 
Another important application of formulation\n(1) is the distributed consensus training of a machine learning model over N nodes [15, 17, 23],\ndescribed as follows:\n\u2022 In an online training setup, i.i.d. realizations of fi(\u00b7; \u03be) are sampled at each node. In an of\ufb02ine\ntraining setup, fi(xi) = E\u03be[fi(xi; \u03be)] are approximated by (1/Ni) \u2211_{j=1}^{Ni} fij(xi), where Ni is the\nnumber of training samples at node i and each fij(\u00b7) represents one training sample.\n\u2022 To enforce that all N nodes train the same model, our constraint Ax = b is given by xi = xj\nfor all i \u2260 j \u2208 {1, 2, . . . , N}. (In fact, we only need such constraints for pairs (i, j) that form\na connected graph over all nodes.)\n\nThe Alternating Direction Method of Multipliers (ADMM) is an effective and popular method to\nsolve linearly constrained convex programs, especially distributed consensus optimization [28, 5],\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fsince it often yields distributed implementations with low complexity [4]. Conventional ADMMs are\ndeveloped for the special case of problem (1) with N = 2 and/or deterministic fi(xi). To solve a\ntwo-block problem (1) where f1 is a stochastic function and f2 is a deterministic function, previous\nworks [21, 25, 31, 1] have developed stochastic (two-block) ADMMs. It is unclear whether these\nmethods can be extended to solve the case N \u2265 3. In fact, even for\nproblem (1) where all fi(xi) are deterministic, [6] proves that the classical (two-block) ADMM, on\nwhich the stochastic versions in [21, 25] are built, converges for N = 2 but diverges for N \u2265 3. To\nsolve stochastic convex program (1) with N \u2265 3, randomized block coordinate updated ADMMs with\nO(1/\u03b5^2) convergence are developed in [27, 11]. 
Due to the challenging stochastic objective functions,\nthe convergence rate of stochastic ADMMs is fundamentally slower than that of deterministic ADMMs, i.e.,\nO(1/\u03b5^2) vs. O(1/\u03b5) [13, 7, 11]. The O(1/\u03b5^2) convergence is optimal since it is optimal even for\nunconstrained stochastic convex optimization without strong convexity [20]. However, in distributed\nimplementations of ADMMs, each node has to pass its most recent xi value to its neighbors or a\nfusion center and then update the dual variable \u03bb. Existing stochastic ADMM methods [21, 25, 11]\nrequire a communication step immediately after each xi computation step. In practice, inter-node\ncommunication over TCP/IP is much slower than in-node memory computations and often requires\nadditional set-up time, such that communication overhead is the performance bottleneck of most\ndistributed optimization methods.\nAs a consequence, communication ef\ufb01cient optimization has recently attracted a lot of research interest\n[29, 14, 24, 15, 17, 18, 23]. Work [17] proposes a primal-dual method that can solve problem (1)\nwith stochastic objective functions using O(1/\u03b5^2) computation iterations and O(1/\u03b5) communication\niterations. However, the method in [17] requires each objective function fi(\u00b7) to satisfy the stringent\ncondition that there exists M such that fi(u) \u2264 fi(v) + \u27e8d, u \u2212 v\u27e9 + M\u2225u \u2212 v\u2225 for any u, v and\nd \u2208 \u2202fi(v). Such a condition is more stringent than smoothness when u and v are far apart from\neach other. For example, the simple scalar smooth function f(x) = x^2 does not satisfy this condition\nover X = R. Work [18] proposes a communication ef\ufb01cient method to solve deterministic convex\nprograms based on the quadratic penalty method and can obtain an \u03b5-optimal solution with O(1/\u03b5^{2+\u03b4})\ncomputation rounds (\u03b4 is a positive constant) and O(1/\u03b5) communication rounds. 
For distributed\nconsensus optimization over a network, which can be formulated as a special case of problem (1)\nwhere Ai and b are chosen to ensure all xi are identical, mixing or local averaging based methods\nwith fast convergence (and low communication overhead) have recently been developed in [26, 22, 23, 19].\nOur Contributions: This paper proposes a new communication ef\ufb01cient stochastic multi-block\nADMM which performs communication rounds less frequently than computation rounds. For stochastic\nconvex programs with general convex objective functions, our algorithm can achieve an \u03b5-solution\nwith O(1/\u03b5^2) computation1 rounds and O(1/\u03b5) communication rounds. That is, our communication\nef\ufb01cient ADMM has the same computation convergence rate as the ADMM in [11] but only requires\nthe square root of the number of communication rounds required by the method in [11]. For stochastic convex\nprograms with strongly convex objective functions, our algorithm can achieve an \u03b5-accuracy solution\nwith \u02dcO(1/\u03b5) computation rounds and \u02dcO(1/\u221a\u03b5) communication rounds2. The fast computation\nconvergence (and even faster communication convergence) for strongly convex stochastic programs is\nnot attained by the ADMM in [11]. When applying our new multi-block ADMM to the special case\nof two-block problems, our algorithm has the same computation convergence as existing two-block\nstochastic ADMM methods in [21, 25, 31, 1]. However, the number of communication rounds used\nby our ADMM is only the square root of that required by these previous methods.\nNotations: This paper uses \u2225A\u2225 to denote the spectral norm of matrix A; \u2225z\u2225 to denote the Euclidean\nnorm of vector z; and \u27e8y, z\u27e9 = y^T z to denote the inner product of vectors y and z. 
If a symmetric\nmatrix Q is positive semi-de\ufb01nite (Q \u2ab0 0), then we de\ufb01ne \u2225z\u2225_Q^2 = z^T Q z for any vector z.\n2 Formulation and New Algorithm\nFollowing the convention in [8], a function h(x) is said to be convex with modulus \u00b5, or equivalently,\n\u00b5-convex, if h(x) \u2212 (\u00b5/2)\u2225x\u2225^2 is convex. The \u00b5-convex de\ufb01nition uni\ufb01es the conventional de\ufb01nitions of\nconvexity and strong convexity. That is, a general convex function, which is not necessarily strongly\nconvex, is convex with modulus \u00b5 = 0; and a strongly convex function is convex with modulus \u00b5 > 0.\nThroughout this paper, convex program (1) is assumed to satisfy the following standard assumption:\n\n1A computation round of our algorithm is just a single iteration of the SGD update.\n2A logarithm factor log(1/\u03b5) is hidden in the notation \u02dcO(\u00b7).\n\n2\n\n\fAssumption 1. Convex program (1) has a saddle point (x\u2217, \u03bb\u2217). That is, x\u2217 is an optimal solution\nand \u03bb\u2217 \u2208 R^m is a Lagrange multiplier attaining strong duality q(\u03bb\u2217) = f(x\u2217), where q(\u03bb\u2217) \u225c\ninf_{xi \u2208 Xi, \u2200i} {f(x) + \u27e8\u03bb\u2217, Ax \u2212 b\u27e9} is the Lagrangian dual function.\nNote that strong duality in Assumption 1 is often stated as its equivalent \u201cKKT conditions\u201d, e.g., in\n[7]. A mild suf\ufb01cient condition for Assumption 1 to hold is that (1) has at least one feasible point and the\ndomain of each fi(xi) includes Xi in its interior [3].\nAssume that unbiased subgradients Gi(xi; \u03be) satisfying E\u03be[Gi(xi; \u03be)] = \u2202fi(xi), \u2200xi \u2208 Xi,\ncan be sampled for each function fi(xi). Denote the stacked column vector G(x; \u03be) \u225c\n[G1(x1; \u03be)^T, . . . , GN(xN; \u03be)^T]^T \u2208 R^{\u2211_{i=1}^{N} di}. 
We have E\u03be[G(x; \u03be)] = \u2202f(x).\n\nConsider the communication ef\ufb01cient stochastic multi-block ADMM described in Algorithm 1.\nSince fi(xi) are stochastic, \u03c6_i^(t)(xi) de\ufb01ned in (2) is fundamentally unknown. However, each \u03c6_i^(t)(xi)\nis \u03bd(t)-convex and its unbiased stochastic subgradient is available as long as we have unbiased\nstochastic subgradients of fi(xi). The sub-procedure STO-LOCAL involved in Algorithm 1 is a\nsimple stochastic subgradient descent (SGD) procedure (with particular choices of parameters, starting\npoints and averaging schemes) to minimize \u03c6_i^(t)(\u00b7) over set Xi and is described in Algorithm 2.\n\nAlgorithm 1 Two-Layer Communication Ef\ufb01cient ADMM\n1: Input: algorithm parameters T, {\u03c1(t)}_{t\u22651}, {\u03bd(t)}_{t\u22651} and {K(t)}_{t\u22651}.\n2: Initialize arbitrary y_i^(0) \u2208 Xi, \u2200i; r(0) = \u2211_{i=1}^{N} Ai y_i^(0) \u2212 b; \u03bb(0) = 0; and t = 1.\n3: while t \u2264 T do\n4: Each node i de\ufb01nes\n\n\u03c6_i^(t)(xi) \u225c fi(xi) + \u03c1(t) \u27e8r(t\u22121) + (1/\u03c1(t)) \u03bb(t\u22121), Ai xi \u2212 b/N\u27e9 + (\u03bd(t)/2) \u2225xi \u2212 y_i^(t\u22121)\u2225^2    (2)\n\nand in parallel updates x_i^(t), y_i^(t) using the local sub-procedure Algorithm 2 via\n\n(x_i^(t), y_i^(t)) = STO-LOCAL(\u03c6_i^(t)(\u00b7), Xi, y_i^(t\u22121), K(t))    (3)\n\n5: Each node i passes x_i^(t) and y_i^(t) between nodes or to a parameter server. 
Update \u03bb(t) and r(t) via\n\n\u03bb(t) = \u03bb(t\u22121) + \u03c1(t) (\u2211_{i=1}^{N} Ai x_i^(t) \u2212 b),    (4)\n\nr(t) = \u2211_{i=1}^{N} Ai y_i^(t) \u2212 b.    (5)\n\n6: Update t \u2190 t + 1.\n7: end while\n8: Output: x(T) = (1/\u2211_{t=1}^{T} \u03c1(t)) \u2211_{t=1}^{T} \u03c1(t) x(t)    (6)\n\nAlgorithm 2 STO-LOCAL(\u03c6(z), Z, zinit, K)\n1: Input: \u00b5: strong convexity modulus of \u03c6(z); algorithm parameters: k0 > 0; \u03b3(k) = 2/(\u00b5(k + k0)), \u2200k \u2208 {1, 2, . . . , K}.\n2: Initialize z(0) = zinit and k = 1.\n3: while k \u2264 K do\n4: Observe an unbiased gradient \u03b6(k) such that E[\u03b6(k)] = \u2202\u03c6(z(k\u22121)) and update z(k) via\n\nz(k) = P_Z[z(k\u22121) \u2212 \u03b3(k) \u03b6(k)],\n\nwhere P_Z[\u00b7] is the projection onto Z.\n5: end while\n6: Output: (\u1e91, z(K)), where \u1e91 is the time average of {z(0), . . . , z(K)} de\ufb01ned in Lemma 1 or 2.\n\n3\n\n\fWe now justify why Algorithm 1 is a two-layer ADMM method. (See Supplement 6.1 for a more\ndetailed discussion.)\n\u2022 The Lagrange multiplier update (4) is identical to that used in existing ADMM methods or other\nLagrangian based methods. It is helpful to enforce the linear constraint.\n\u2022 At \ufb01rst sight, the primal update in Algorithm 1 is quite different from existing deterministic\nADMMs in [10, 4, 7], which require solving an \u201cargmin\u201d problem, or stochastic ADMMs in\n[21, 25, 11], which perform a single gradient descent step. 
However, with a simple manipulation,\nit is not dif\ufb01cult to show that the function \u03c6_i^(t)(xi) in (2) is similar to the \u201cargmin\u201d target in the\nproximal Jacobi ADMM method [7], with the distinction that the proximal term \u2225xi \u2212 y_i^(t\u22121)\u2225^2 is\nregarding a newly introduced variable y_i^(t\u22121) rather than x_i^(t\u22121).\n\nRecall that the fastest stochastic ADMMs in [21, 25, 11] can solve general convex problem (1) (with\nN = 2) with O(1/\u221aT) convergence. That is, to obtain a solution with \u03b5 errors for both the objective\nvalue and the constraint violation, the ADMMs in [21, 25, 11] require O(1/\u03b5^2) computation steps,\neach of which uses a single gradient evaluation and variable update. The ADMMs in [21, 25, 11] have a\nsingle-layer structure and hence are communication inef\ufb01cient in the sense that each computation step\ninvolves a communication step. Thus, the communication complexity of these stochastic ADMMs\nis also O(1/\u03b5^2). Compared with existing ADMMs in [21, 25, 11], Algorithm 1 has a two-layer\nstructure where each outer-layer step involves a single inter-node communication step given by (4)-(5)\nand calls the sub-procedure, i.e., Algorithm 2, STO-LOCAL(\u03c6_i^(t)(\u00b7), Xi, y_i^(t\u22121), K(t)), which is run by\neach node locally and in parallel and hence does not incur any inter-node communication overhead.\nSince each call of Algorithm 2 incurs K(t) SGD updates, T iterations of Algorithm 1 use \u2211_{t=1}^{T} K(t)\ncomputation steps. 
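To make the two-layer structure concrete, the following is a minimal, self-contained sketch of the outer loop (one communication per round, via the dual update (4) and residual update (5)) wrapped around the inner STO-LOCAL sub-procedure of Algorithm 2. It is an illustration under simplifying assumptions, not the paper's implementation: the surrogate's linear term is transcribed only up to an additive constant (which changes neither its gradient nor its minimizer), the gradients here are deterministic, and all names (`sto_local`, `two_layer_admm`, `mu_f`, and so on) are ours.

```python
import numpy as np

def sto_local(grad_phi, project, z_init, mu, K, k0=1):
    # Algorithm 2 (STO-LOCAL): projected SGD with step size 2/(mu*(k + k0))
    # and the weighted averaging of Lemma 1 over iterates z^(0), ..., z^(K-1).
    z = z_init.copy()
    w_sum, w_tot = k0 * z, float(k0)
    for k in range(1, K + 1):
        z = project(z - (2.0 / (mu * (k + k0))) * grad_phi(z))
        if k < K:                      # Lemma 1 averages z^(0), ..., z^(K-1)
            w_sum += (k + k0) * z
            w_tot += k + k0
    return w_sum / w_tot, z            # (averaged iterate x_i, last iterate y_i)

def two_layer_admm(grads, As, b, Xproj, mu_f, rho, nu, T, K):
    # Outer layer of Algorithm 1 with constant rho, nu: each round does ONE
    # communication, i.e., updates (4)-(5), after K local SGD steps per node.
    N = len(As)
    y = [np.zeros(A.shape[1]) for A in As]
    lam = np.zeros(b.shape[0])
    r = sum(A @ yi for A, yi in zip(As, y)) - b
    x_bar = [np.zeros_like(yi) for yi in y]
    for t in range(1, T + 1):
        s = lam + rho * r              # shared message lambda + rho * r
        xs = []
        for i in range(N):
            def grad_phi(z, i=i):      # grad of f_i + <s, A_i z> + (nu/2)||z - y_i||^2
                return grads[i](z) + As[i].T @ s + nu * (z - y[i])
            x_i, y_i = sto_local(grad_phi, Xproj, y[i], mu_f + nu, K)
            xs.append(x_i)
            y[i] = y_i
        lam += rho * (sum(A @ x for A, x in zip(As, xs)) - b)   # Eq. (4)
        r = sum(A @ yi for A, yi in zip(As, y)) - b             # Eq. (5)
        for i in range(N):
            x_bar[i] += (xs[i] - x_bar[i]) / t   # running average; Eq. (6) for constant rho
    return x_bar, y, lam
```

On a toy two-block consensus instance, min (x1)^2 + (x2 - 4)^2 subject to x1 = x2 (solution x1 = x2 = 2), the last iterates, the dual variable and the averaged output all settle near the optimum for reasonable choices such as rho = 0.5, nu = 8, T = 200, K = 50.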
We shall show that to achieve an \u03b5-accurate solution for general convex problem (1), Algorithm 1 uses\nT = O(1/\u03b5) communication rounds and \u2211_{t=1}^{T} K(t) = O(1/\u03b5^2) computation steps.\nThat is, Algorithm 1 is as fast as the existing fastest stochastic ADMMs but uses only the square root of the\nnumber of communication rounds in [21, 25, 11].\nNote that inter-node communication in Algorithm 1 can be either centralized or decentralized. To use\ncentralized communication, we can let all nodes pass their x_i^(t) to a parameter server, where (4)-(5) are\nexecuted, and then pull the updated \u03bb(t) and r(t) from the server. It is possible to implement (4)-(5)\nusing decentralized communication by exploiting the structure of matrix A = [A1, A2, . . . , AN].\nFor example, consider distributed machine learning in a line network where Ax = b is given by N \u2212 1\nequality constraints xi \u2212 xi+1 = 0, i \u2208 {1, 2, . . . , N \u2212 1}. In this case, \u03bb_i^(t) and r_i^(t) only depend on\nx_i^(t) and x_{i+1}^(t) and are only used to update x_i^(t+1) and x_{i+1}^(t+1). Thus, to implement Algorithm 1, each\nnode only needs to send its local x_i^(t) to, and pull \u03bb_j^(t) and r_j^(t) from, its neighbors in the line network.\n2.1 Basic Facts of Algorithm 2\nEach iteration of Algorithm 1 calls Algorithm 2, which essentially applies SGD with carefully\ndesigned step-size rules to the newly introduced objective functions \u03c6_i^(t)(\u00b7). This subsection provides\nsome useful insights into SGD for strongly convex stochastic minimization.\nIt is known that SGD can have O(1/\u03b5) convergence for strongly convex minimization. The next two\nlemmas summarize the convergence of the SGD in Algorithm 2. When characterizing the O(1/\u03b5) rate, our\nlemmas also include a push-back term involving the last-iteration solution. This term ensures that when\nthe SGD solution from Algorithm 2 is used in the outer-level ADMM dynamics, the accumulated\nerror of our \ufb01nal solution does not explode. 
It also explains why we use y_i^(t\u22121), which is the last-iteration solution from the SGD sub-procedure,\nrather than the conventional x_i^(t\u22121), to de\ufb01ne \u03c6_i^(t)(xi).\nLemma 1 ([16]). Assume \u03c6(z) is a \u00b5-convex function (\u00b5 > 0) over set Z and there exists a constant\nB such that the unbiased subgradient \u03b6(k) used in Algorithm 2 satis\ufb01es E[\u2225\u03b6(k)\u2225^2] \u2264 B^2, \u2200k \u2208\n{1, 2, . . . , K}. If we take k0 = 1 in Algorithm 2, then for all z \u2208 Z, we have\n\nE[\u03c6(\u1e91)] \u2264 \u03c6(z) \u2212 (\u00b5/2) E[\u2225z(K) \u2212 z\u2225^2] + 2B^2/(\u00b5(K + 1)),    (7)\n\nwhere the push-back term \u2212(\u00b5/2) E[\u2225z(K) \u2212 z\u2225^2] is referred to as (7)-term (I) and \u1e91 =\n(1/\u2211_{k=0}^{K\u22121}(k + k0)) \u2211_{k=0}^{K\u22121}(k + k0) z(k).\n\n4\n\n\fRemark 1. It was \ufb01rst shown in [16] that Algorithm 2 with k0 = 1 (vanilla SGD with a particular\naveraging scheme) has O(1/\u03b5) convergence for non-smooth strongly convex problems. Note that (7)\nholds for all z \u2208 Z (not necessarily the minimizer of \u03c6(\u00b7)). The push-back term (7)-term (I) is often\nignored in convergence rate analyses of SGD but is important for our analysis of Algorithm 1.\nRecall that a function h(x) is said to be L-smooth if its gradient \u2207h(x) is Lipschitz with modulus\nL. The next lemma is new and extends Lemma 1 to smooth minimization such that the error term\ndepends only on the variance of the stochastic gradients (using a different averaging scheme).\nLemma 2. 
Assume \u03c6(z) is an L-smooth and \u00b5-convex function (\u00b5 > 0) with condition number\n\u03ba = L/\u00b5, and there exists \u03c3 > 0 such that the unbiased gradient \u03b6(k) (at point z(k\u22121)) in Algorithm 2\nsatis\ufb01es E[\u2225\u03b6(k) \u2212 \u2207\u03c6(z(k\u22121))\u2225^2] \u2264 \u03c3^2, \u2200k \u2208 {1, 2, . . . , K}. If we take integer k0 > 2\u03ba, then for\nany z \u2208 Z, we have\n\nE[\u03c6(\u1e91)] \u2264 \u03c6(z) + [\u00b5(k0^2 \u2212 k0)/(2K(K + 2k0 \u2212 1))] (E[\u2225z \u2212 z(0)\u2225^2] \u2212 E[\u2225z \u2212 z(K)\u2225^2]) \u2212 (\u00b5/2) E[\u2225z \u2212 z(K)\u2225^2] + 2k0\u03c3^2/((K + 2k0 \u2212 1)\u00b5),    (8)\n\nwhere \u1e91 = (1/\u2211_{k=1}^{K}(k + k0 \u2212 1)) \u2211_{k=1}^{K}(k + k0 \u2212 1) z(k).\n\nProof. See Supplement 6.6.\n3 Performance Analysis of Algorithm 1\nThis section shows that Algorithm 1 can achieve an \u03b5-accuracy solution using O(1/\u03b5^2) computation\nrounds and O(1/\u03b5) communication rounds for general convex stochastic programs; or using \u02dcO(1/\u03b5)\ncomputation rounds and \u02dcO(1/\u221a\u03b5) communication rounds for strongly convex stochastic programs.\n3.1 General objective functions (possibly non-smooth, non-strongly convex)\nTheorem 1. Consider convex program (1) under Assumption 1. Let (x\u2217, \u03bb\u2217) be any saddle point\nde\ufb01ned in Assumption 1. 
Assume that\n\u2022 The constraint set X is bounded, i.e., there exists constant R > 0 such that (cid:107)x(cid:107) \u2264 R,\u2200x \u2208 X .\n\u2022 The function f (x) has unbiased stochastic subgradients with a bounded second order moment, i.e.,\nFor all T \u2265 1, if we choose any \ufb01xed \u03c1(t) = \u03c1 > 0, \u03bd(t) = \u03bd \u2265 8\u03c1(cid:107)A(cid:107)2, K (t) = K \u2265 T in\noutput, then\n\nAlgorithm 1 and the sub-procedure STO-LOCAL (Algorithm 2) uses(cid:98)z de\ufb01ned in Lemma 1 as the\n\nthere exists constant D > 0 such that E\u03be[(cid:107)G(x; \u03be)(cid:107)2] \u2264 D2,\u2200x \u2208 X .\n\n) be any saddle point\n\n\u2217\n\nE[f (x(T ))] \u2264 f (x\n\n\u2217\n\n) +\n\n(cid:107)x\n\n\u2217 \u2212 y(0)(cid:107)2 +\n\nC\n2\u03bdT\n\n(cid:80)T\n\nE[(cid:107)Ax(T ) \u2212 b(cid:107)] \u2264 1\nT\n\n(cid:114)\n\n\u03bd\n\n\u03bd\n\n\u03bd\n\n\u03c1\u03bd(cid:107)x\u2217 \u2212 y(0)(cid:107)2 + 24\u03c1D2\n\n+ 24(\u03c1)3(cid:107)A(cid:107)2 ((cid:107)A(cid:107)R+(cid:107)b(cid:107))2\n\nt=1 x(t); Q = (2(cid:107)\u03bb\u2217(cid:107) +\n\nof computation rounds is(cid:80)T\n\n+ 96\u03bd\u03c1R2/(cid:0)1 \u2212\n(cid:1)(cid:1)2 is an absolute constant (irrelevant to T ); and C \u2206= 4(cid:107)A(cid:107)2Q+12D2+12\u03c12(cid:107)A(cid:107)2((cid:107)A(cid:107)R+\n\n(cid:114)\nwhere x(T ) = 1\nt\n8\u03c1(cid:107)A(cid:107)2\n(cid:107)b(cid:107))2 + 48\u03bd2R2 is also an absolute constant.\nProof. See Supplement 6.7.\nRemark 2. After T outer-level rounds, Algorithm 1 yields a solution with error O(1/T ). Note that\nthe number of communication rounds is equal to the number of outer-level rounds and the number\nt=1 K (t) = O(T 2) when K (t) = T,\u2200t. Thus, to obtain an \u0001-solution,\nAlgorithm 1 uses O(1/\u0001) communication rounds and O(1/\u00012) computation rounds.\nRemark 3. 
If we choose \u03bd(t) = \u03bd = 8\u03c1\u2225A\u2225^2 in Theorem 1 and further analyze the dependence on\n\u2225A\u2225 in (9)-(10), we have E[f(x(T))] \u2264 f(x\u2217) + O(\u03c1\u2225A\u2225^2/T) and E[\u2225Ax(T) \u2212 b\u2225] \u2264 O((1/T)(1/\u03c1 +\n\u2225A\u2225)). If \u2225A\u2225 is large, to balance the dependence on \u2225A\u2225 in (9)-(10), we shall choose \u03c1 = 1/\u2225A\u2225 such\nthat the error terms in both (9) and (10) are of order O(\u2225A\u2225/T). In general, \u03c1 can be controlled to trade\noff between the objective error and the constraint error. For the distributed consensus optimization considered in\n[26, 22, 23, 19] (assuming di = 1 without loss of generality), we can choose any A, b that suf\ufb01ce to\nensure the consistency of local solutions, e.g., Null{A} = Span{1} and b = 0. Our method does not\nnecessarily require A = I \u2212 W with a stochastic matrix W encoding the network topology, as some\nmethods in [26, 22, 23, 19] do. Nevertheless, even when using A = I \u2212 W, our communication overhead\ncan possibly have a better dependence on W. 
Note that a stochastic matrix W ensures (cid:107)A(cid:107) \u2264 2.\nThe convergence in [26, 22, 23, 19] (using a doubly stochastic or symmetric PSD W for mixing)\nfurther depends on 1/(1 \u2212 max{|\u03bb2(W)|,|\u03bbN (W)|}) or the eigen-gap \u03bb1(W)/\u03bbN\u22121(W), which\ncan be much larger than constant 2 when some eigenvalues are extreme.\n\n(9)\n\n(10)\n\n\u03bd\n\u221a\n2T\nQ\n\u03c1\n\n5\n\n\f(cid:80)T\n\nwhere x(T ) = 1\nt\n\nt=1 x(t) .\n\npositive integer k0 \u2265 2(1 + L\n\u2217\nE[f (x(T ))] \u2264 f (x\n\n1\n\n) +\n\nT (T + 1)\n\n(cid:16)\n(cid:16) 4(cid:107)\u03bb\u2217(cid:107)\n\nc1(cid:107)x\n\nt=1 \u03c1(t)x(t); and c1\n\nE[(cid:107)Ax(T ) \u2212 b(cid:107)] \u2264\n\n2\n\nT (T + 1)\n\n\u03c1\n\n(cid:80)T\n\n1(cid:80)T\n\nwhere x(T ) =\n(2k0\u22121)(cid:107)A(cid:107)2 are two constants.\n\nt=1 \u03c1(t)\n\n4k0\u03c32\n\n+\n\n\u2217 \u2212 y(0)(cid:107)2 +\n\nc2\n\u03c1\n\n(cid:107)x\n\nlog(T + 1)\n\n\u2217 \u2212 y(0)(cid:107) +\n\n(cid:112)c2 log(T + 1)\n\u221a\nc1\u221a\n\u03c1\n\u2206= \u03c1(cid:107)A(cid:107)2 + (\u03c1(cid:107)A(cid:107)2+\u00b5)(k2\n2(2k0\u22121)2\n\n\u03c1\n\n(cid:17)\n\n(13)\n\n(14)\n\n0\u2212k0)\n\nand c2\n\n\u2206=\n\n3.2 Smooth objective functions\nFor unconstrained stochastic smooth minimization, the constant factor in the SGD convergence\nrate is determined by the variance that can be signi\ufb01cantly less than the second order moment for\nnon-smooth stochastic minimization[20]. Such a property enable us to speed up SGD by averaging\nmultiple i.i.d. stochastic gradients, e.g., mini-batch SGD. In this subsection, we show that Algorithm\n1 has a similar property when f (\u00b7) in problem (1) is smooth.\nTheorem 2. Consider convex program (1) with \u00b5-convex (possibly \u00b5 = 0) objective function under\nAssumption 1. 
Let (x\u2217, \u03bb\n\u2022 The function f (x) is L-smooth.\n\u2022 The function f (x) has unbiased stochastic gradients with a bounded variance, i.e., there exists\n\nconstant \u03c3 > 0 such that E\u03be[(cid:107)G(x; \u03be) \u2212 \u2207f (x)(cid:107)2] \u2264 \u03c32,\u2200x \u2208 X .\n\nIf the sub-procedure STO-LOCAL (Algorithm 2) uses(cid:98)z de\ufb01ned in Lemma 2 as the output, then\n\n) be any saddle point de\ufb01ned in Assumption 1. Assume that\n\nAlgorithm 1 ensures:\n\u2022 General Convex (\u00b5 = 0): For all T \u2265 1, if we choose any \ufb01xed \u03c1(t) = \u03c1 > 0, \u03bd(t) = \u03bd \u2265 \u03c1(cid:107)A(cid:107)2,\n\n\u2217\n\nK (t) = K = T and positive integer k0 \u2265 2 L+\u03bd\n\u03bd , then we have\n(cid:115)\n\u2217 \u2212 y(0)(cid:107)2 +\n\nE[f (x(T ))] \u2264 f (x\n\n\u03bd(k0 + 1)\n\n(cid:107)x\n\n1\nT\n\n) +\n\n\u2217\n\n1\nT\n\n2k0\u03c32\n\n\u03bd\n\nE[(cid:107)Ax(T ) \u2212 b(cid:107)] \u2264 1\nT\n\n\u03bd(k0 + 1)\n\n(cid:107)x\n\n\u2217 \u2212 y(0)(cid:107) + 2\n\n2\u03c1\n\n(cid:16) 2\n\n\u03c1\n\n4\n\u2217(cid:107) +\n\n(cid:107)\u03bb\n\n(cid:115)\n\n(cid:17)\n\nk0\u03c32\n\n\u03c1\u03bd\n\n(11)\n\n(12)\n\n\u2022 Strongly Convex (\u00b5 > 0): For all T \u2265 1, if we choose \u03c1 \u2264 \u00b5\n\n3(cid:107)A(cid:107)2 , \u03c1(t) = t\u03c1, \u03bd(t) = t\u03c1(cid:107)A(cid:107)2,\n\n\u00b5 ) and K (t) = (2k0 \u2212 1)t, then we have\n\n(cid:17)\n\nT 2\n\n2 T (T + 1) = O(T 2), Algorithm 1 requires \u02dcO( 1\n\nProof. See Supplement 6.8.\nRemark 4. If f (x) in convex program (1) is strongly convex, Algorithm 1 can obtain a solution\n(cid:80)T\nwith error O( log(T )\n) after T outer-level rounds. 
Recall that the number of communication rounds\nis equal to the number of outer-level rounds and the number of computation rounds is equal to\n\u2211_{t=1}^{T} K(t) = ((2k0 \u2212 1)/2) T(T + 1) = O(T^2). Thus, Algorithm 1 requires \u02dcO(1/\u221a\u03b5) communication rounds and\n\u02dcO(1/\u03b5) computation rounds to obtain an \u03b5-solution.\n3.3 Non-smooth strongly convex objective functions\nThere is a fourth case, where the stochastic objective function f(x) is strongly convex but possibly\nnon-smooth, which is not covered in the previous subsections. In this case, we assume the following condition\n(originally introduced in [17]): There exists a constant M > 0 such that\n\nf(x) \u2264 f(y) + \u27e8d, x \u2212 y\u27e9 + M\u2225x \u2212 y\u2225,    (15)\n\nfor all x, y \u2208 X and d \u2208 \u2202f(y). This condition is assumed throughout [17] to develop a different\ncommunication ef\ufb01cient primal-dual method. Supplement 6.9 shows this condition is almost as\nuseful as smoothness and, under this condition, our communication ef\ufb01cient ADMM can achieve\nan \u03b5-accuracy solution with \u02dcO(1/\u03b5) computation rounds and \u02dcO(1/\u221a\u03b5) communication rounds for\nnon-smooth strongly convex stochastic optimization.\n4 Experiments\n4.1 Distributed Stochastic Optimization with Noisy Stochastic Gradient Information\n\nConsider simple stochastic optimization given by\n\n6\n\n\fmin \u2211_{i=1}^{3} E_{ci}[\u2225xi \u2212 ci\u2225_2^2]    (16)\n\ns.t. x1 = x2, x2 = x3    (17)\n\nxi \u2208 [\u22121, 1]^3, \u2200i \u2208 {1, 2, 3},    (18)\n\nwhere ci \u223c N(\u00afci, \u03c3i^2 I) follow normal distributions with \u00afc1 = [\u22122.0871, \u22120.3702, 0.2302]^T, \u03c31 =\n0.1, \u00afc2 = [\u22120.5556, \u22120.4413, 0.2869]^T, \u03c32 = 0.2, \u00afc3 = [\u22121.4991, \u22121.8286, \u22122.0477]^T and \u03c33 =\n0.1. 
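The optimal solution of this toy problem can also be sanity-checked in closed form: since E_c[\u2225x \u2212 c\u2225^2] = \u2225x \u2212 c\u0304\u2225^2 + 3\u03c3^2, the deterministic equivalent under the consensus constraints (17) reduces to minimizing the sum of \u2225x \u2212 c\u0304_i\u2225^2 over the box [\u22121, 1]^3, which is solved coordinate-wise by clipping the mean of the c\u0304_i to the box. The snippet below is an independent check of the unique solution reported in the next paragraph, not part of the paper's code.

```python
import numpy as np

# Mean vectors of the three nodes' distributions, as given in the text.
c_bar = np.array([
    [-2.0871, -0.3702,  0.2302],
    [-0.5556, -0.4413,  0.2869],
    [-1.4991, -1.8286, -2.0477],
])

# E_c||x - c||^2 = ||x - c_bar||^2 + 3*sigma^2, so with x1 = x2 = x3 = x the
# problem is min_x sum_i ||x - c_bar_i||^2 over [-1, 1]^3, solved
# coordinate-wise by clipping the average of the means to the box.
x_star = np.clip(c_bar.mean(axis=0), -1.0, 1.0)
print(x_star)   # close to [-1, -0.8800, -0.5102]
```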
Solving this problem with Algorithm 1 only requires each node to access samples of its local ci and\ndoes not use the true values \u00afci and \u03c3i, which are fundamentally unavailable. However, by assuming the\nknowledge of \u00afci and \u03c3i, we can convert this stochastic optimization to a deterministic problem and\nuse CVXPY [9] to obtain the unique solution x\u2217_1 = x\u2217_2 = x\u2217_3 = [\u22121, \u22120.88003599, \u22120.51020207]^T,\nsuch that we can evaluate the performance of Algorithm 1. Since the objective function is smooth and\nstrongly convex, by Theorem 2, using time-varying parameters in Algorithm 1 yields faster convergence.\nWe run Algorithm 1 with constant \u03c1, \u03bd according to Theorem 1\u00b3 and with time-varying \u03c1(t), \u03bd(t)\naccording to Theorem 2, respectively. Note that if an algorithm has O(1/\u03b5^\u03b2) convergence, then its\nerror should decay like O(1/t^{1/\u03b2}), where t is the iteration index.\nFigure 1 plots the distance to x\u2217 versus the computation round index or the communication round\nindex in a log-log scale. It also plots baseline curves 1/t^{1/\u03b2} corresponding to the O(1/\u03b5^\u03b2) convergence\nproven in the theorems. Note that in a log-log scale, the curves 1/t^{1/\u03b2} become straight lines with\nslopes \u22121/\u03b2. That is, if our algorithm has the proven convergence rate, the error curves should\neventually be parallel to the corresponding baselines for large t. In Figure 1, we observe that the numerical\nresults are consistent with the theoretical rates proven in our theorems. This simple experiment veri\ufb01es\nthe correctness of our theorems. Our multi-core implementation of Algorithm 1 uses Python 3.7\nand MPI4PY. In an experiment on a machine with a multi-core Intel Xeon E5-2682 2.5GHz processor,\neach computation round takes 0.3ms and each communication round takes 43.7ms. 
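Taking the measured per-round costs above at face value, a back-of-the-envelope calculation illustrates the wall-clock impact of needing only the square root as many communication rounds as computation rounds. The round counts below (10^4 computation rounds) are hypothetical, chosen only for illustration, and computation and communication are assumed not to overlap.

```python
T_COMP_MS = 0.3    # measured cost of one computation round (ms)
T_COMM_MS = 43.7   # measured cost of one communication round (ms)

n_comp = 10_000                      # hypothetical number of SGD computation rounds
n_comm_single = n_comp               # single-layer ADMM: one communication per computation
n_comm_two = int(n_comp ** 0.5)      # two-layer ADMM: square root as many communications

t_single = n_comp * T_COMP_MS + n_comm_single * T_COMM_MS   # about 440 s
t_two = n_comp * T_COMP_MS + n_comm_two * T_COMM_MS         # about 7.4 s
print(t_single / t_two)              # roughly a 60x wall-clock reduction
```

The computation time (3 s) is identical in both cases; essentially the entire gap comes from the communication term, which is exactly the regime the next sentence describes.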
Note that communication becomes relatively more expensive as more parallel nodes/cores are involved.\n\nFigure 1: Performance of Algorithm 1 to solve stochastic optimization (16)-(18): (a)&(b) convergence w.r.t. # of computation rounds; (c)&(d) convergence w.r.t. # of communication rounds.\n\n3Since f(x) is also smooth, using constant \u03c1, \u03bd according to Theorem 2 can give a similar (slightly better)\nperformance. Theoretically, by using K(t) = t rather than K(t) = T for a \ufb01xed T, the rate is slightly worse, i.e.,\nO(log(T)/T) vs. O(1/T). However, we \ufb01nd the performance degradation for large-T regions is negligible\nwhen using K(t) = t. In contrast, using K(t) = t enables the algorithm to converge faster for small t. We use\nK(t) = t when performing the numerical experiments in this paper.\n\n7\n\n\f4.2 Distributed l1 Regularized Logistic Regression\nConsider a distributed l1 regularized logistic regression problem (over 10 nodes) given by:\n\nmin (1/10) \u2211_{i=1}^{10} [ (1/Ni) \u2211_{j=1}^{Ni} log(1 + exp(bij a_{ij}^T xi)) + \u00b5\u2225xi\u2225_1 ]    (19)\n\nwith each optimization variable xi \u2208 R^d. Each node contains Ni training pairs (aij, bij), where\naij \u2208 R^d is a feature vector and bij \u2208 {\u22121, 1} is the corresponding label. To ensure that all nodes yield a\nconsistent model, consensus constraints are needed to enforce that all xi are equal. Note that conventional\ntwo-block ADMMs must introduce a dummy block (server node) z and add constraints xi = z. (See,\ne.g., [4, 21, 25].) However, such an ADMM method requires all nodes to pass the updated xi value\nto the (server) node corresponding to the z block and hence can turn the z node into a communication\nbottleneck in large networks. 
In contrast, using a multi-block ADMM method allows arbitrary linear constraints, e.g., the constraints xi = xi+1, ∀i, which ensure that all xi are equal, and the corresponding multi-block ADMM only uses communication between adjacent blocks. Alternatively, in a line network where only one-hop transmission is allowed, our ADMM naturally yields a protocol that is faithful to the network communication restriction. In general, given an arbitrary network communication topology, our multi-block ADMM can always yield an implementable distributed protocol by adding constraints xi = xj for links (i, j) existing in the network.
We generate a problem instance in a way similar to [4]. Our problem instance uses d = 100, Ni = 10^5 for all i, and µ = 0.002. Each feature vector aij is generated from a standard normal distribution. We choose a true weight vector xtrue ∈ R^d with 10 non-zero entries drawn from a standard normal distribution and then generate the labels bij = sign(aij^T xtrue + ni), where the noise ni ∼ N(0, σi²) with fixed constants σi randomly generated from the uniform distribution Unif[0, 1]. Figure 2 compares Algorithm 1 with the RPDBUS ADMM proposed in [11], where the number of communication rounds is the same as that of computation rounds, and with DCS in [17], where the number of communication rounds is the square root of that of computation rounds. We observe that Algorithm 1 has the fastest convergence with respect to both computation and communication.

Figure 2: Distributed l1 regularized logistic regression: (a)&(b) performance w.r.t. # of computation rounds; (c)&(d) performance w.r.t. # of communication rounds.

5 Conclusions
This paper proposes a new communication efficient multi-block ADMM for linearly constrained stochastic optimization.
This method is as fast as (or faster than) existing stochastic ADMMs, but the associated communication overhead is only the square root of that required by existing ADMMs.

References
[1] Samaneh Azadi and Suvrit Sra. Towards an optimal stochastic alternating direction method of multipliers. In International Conference on Machine Learning (ICML), pages 620–628, 2014.
[2] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[3] Dimitri P. Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[5] Tsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, and Xiangfeng Wang. Asynchronous distributed ADMM for large-scale optimization—part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118–3130, 2016.
[6] Caihua Chen, Bingsheng He, Yinyu Ye, and Xiaoming Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155:57–79, 2016.
[7] Wei Deng, Ming-Jun Lai, Zhimin Peng, and Wotao Yin. Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing, 71(2):712–736, 2017.
[8] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
[9] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[10] Jonathan Eckstein and Dimitri P. Bertsekas.
On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
[11] Xiang Gao, Yangyang Xu, and Shuzhong Zhang. Randomized primal-dual proximal block coordinate updates. arXiv:1605.05969, 2016.
[12] Bingsheng He, Hong-Kun Xu, and Xiaoming Yuan. On the proximal Jacobian decomposition of ALM for multiple-block separable convex minimization problems and its relationship to ADMM. Journal of Scientific Computing, 66(3):1204–1217, 2016.
[13] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[14] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems (NIPS), 2014.
[15] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv:1610.02527, 2016.
[16] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.
[17] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv:1701.03961, 2017.
[18] Huan Li, Cong Fang, and Zhouchen Lin. Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization. arXiv:1711.10802, 2017.
[19] Angelia Nedić, Alex Olshevsky, and Michael G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization.
Proceedings of the IEEE, 106(5):953–976, 2018.
[20] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[21] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning (ICML), 2013.
[22] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. arXiv:1805.11454, 2018.
[23] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[24] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. arXiv:1611.02189, 2016.
[25] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning (ICML), 2013.
[26] César A. Uribe, Soomin Lee, Alexander Gasnikov, and Angelia Nedić. Optimal algorithms for distributed optimization. arXiv:1712.00232, 2017.
[27] Huahua Wang, Arindam Banerjee, and Zhi-Quan Luo. Parallel direction method of multipliers. Advances in Neural Information Processing Systems (NIPS), 2014.
[28] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In IEEE Conference on Decision and Control (CDC), pages 5445–5450, 2012.
[29] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. Advances in Neural Information Processing Systems (NIPS), 2013.
[30] Hao Yu and Michael J. Neely.
A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):759–783, 2017.
[31] Wenliang Zhong and James Kwok. Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning (ICML), pages 46–54, 2014.