{"title": "Bregman Alternating Direction Method of Multipliers", "book": "Advances in Neural Information Processing Systems", "page_first": 2816, "page_last": 2824, "abstract": "The mirror descent algorithm (MDA) generalizes gradient descent by using a Bregman divergence to replace squared Euclidean distance. In this paper, we similarly generalize the alternating direction method of multipliers (ADMM) to Bregman ADMM (BADMM), which allows the choice of different Bregman divergences to exploit the structure of problems. BADMM provides a unified framework for ADMM and its variants, including generalized ADMM, inexact ADMM and Bethe ADMM. We establish the global convergence and the $O(1/T)$ iteration complexity for BADMM. In some cases, BADMM can be faster than ADMM by a factor of $O(n/\\ln n)$ where $n$ is the dimensionality. In solving the linear program of mass transportation problem, BADMM leads to massive parallelism and can easily run on GPU. BADMM is several times faster than highly optimized commercial software Gurobi.", "full_text": "Bregman Alternating Direction Method of Multipliers

Huahua Wang, Arindam Banerjee
Dept of Computer Science & Engg, University of Minnesota, Twin Cities
{huwang,banerjee}@cs.umn.edu

Abstract

The mirror descent algorithm (MDA) generalizes gradient descent by using a Bregman divergence to replace squared Euclidean distance. In this paper, we similarly generalize the alternating direction method of multipliers (ADMM) to Bregman ADMM (BADMM), which allows the choice of different Bregman divergences to exploit the structure of problems. BADMM provides a unified framework for ADMM and its variants, including generalized ADMM, inexact ADMM and Bethe ADMM. We establish the global convergence and the O(1/T) iteration complexity for BADMM. In some cases, BADMM can be faster than ADMM by a factor of O(n/ln n), where n is the dimensionality.
In solving the linear program of the mass transportation problem, BADMM leads to massive parallelism and can easily run on GPU. BADMM is several times faster than the highly optimized commercial software Gurobi.

1 Introduction

In recent years, the alternating direction method of multipliers (ADMM) [4] has been successfully used in a broad spectrum of applications, ranging from image processing [11, 14] to applied statistics and machine learning [26, 25, 12]. ADMM considers the problem of minimizing composite objective functions subject to an equality constraint:

min_{x∈X, z∈Z} f(x) + g(z)   s.t.  Ax + Bz = c ,     (1)

where f and g are convex functions, A ∈ R^{m×n1}, B ∈ R^{m×n2}, c ∈ R^{m×1}, x ∈ X ⊆ R^{n1×1}, z ∈ Z ⊆ R^{n2×1}, and X and Z are nonempty closed convex sets. f and g can be non-smooth functions, including indicator functions of convex sets. For further understanding of ADMM, we refer the reader to the comprehensive review by [4] and references therein. Many machine learning problems can be cast into the framework of minimizing a composite objective [22, 10], where f is a loss function such as the hinge or logistic loss, and g is a regularizer, e.g., the ℓ_1 norm, ℓ_2 norm, nuclear norm or total variation. The functions and constraints usually have different structures. Therefore, it is useful and sometimes necessary to split and solve them separately, which is exactly the forte of ADMM.

In each iteration, ADMM updates the splitting variables separately and alternately by solving the partial augmented Lagrangian of (1), where only the equality constraint is considered:

L_ρ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − c⟩ + (ρ/2)‖Ax + Bz − c‖_2^2 ,     (2)

where y ∈ R^m is the dual variable, ρ > 0 is the penalty parameter, and the quadratic penalty term penalizes the violation of the equality constraint. ADMM consists of the following three updates:

x_{t+1} = argmin_{x∈X} f(x) + ⟨y_t, Ax + Bz_t − c⟩ + (ρ/2)‖Ax + Bz_t − c‖_2^2 ,     (3)
z_{t+1} = argmin_{z∈Z} g(z) + ⟨y_t, Ax_{t+1} + Bz − c⟩ + (ρ/2)‖Ax_{t+1} + Bz − c‖_2^2 ,     (4)
y_{t+1} = y_t + ρ(Ax_{t+1} + Bz_{t+1} − c) .     (5)

Since the computational complexity of the y update (5) is trivial, the computational complexity of ADMM is determined by the x and z updates (3)-(4), which amount to solving proximal minimization problems using the quadratic penalty term. Inexact ADMM [26, 4] and generalized ADMM [8] have been proposed to solve the updates inexactly by linearizing the functions and adding additional quadratic terms. Recently, online ADMM [25] and Bethe-ADMM [12] add an additional Bregman divergence on the x update by keeping or linearizing the quadratic penalty term ‖Ax + Bz − c‖_2^2. As far as we know, all existing ADMMs use quadratic penalty terms.

A large body of literature shows that replacing the quadratic term by a Bregman divergence in gradient-type methods can greatly boost their performance in solving constrained optimization problems. First, the use of a Bregman divergence can effectively exploit the structure of problems [6, 2, 10], e.g., in computerized tomography [3], clustering problems and exponential family distributions [1]. Second, in some cases, the gradient descent method with Kullback-Leibler (KL) divergence can outperform the method with the quadratic term by a factor of O(√(n/ln n)), where n is the dimensionality of the problem [2, 3]. The mirror descent algorithm (MDA) and composite objective mirror descent (COMID) [10] use a Bregman divergence to replace the quadratic term in gradient descent or the proximal gradient method [7].
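To make the updates (3)-(5) concrete, the following numpy sketch (our illustration, not from the paper) runs ADMM for the special case A = I, B = −I, c = 0 with f(x) = (1/2)‖x − d‖_2^2 and g(z) = λ‖z‖_1, where both the x and z updates have closed forms; the data d, λ and ρ are arbitrary.

```python
import numpy as np

def admm_l1(d, lam, rho=1.0, iters=500):
    """ADMM (3)-(5) for: min_x 1/2||x - d||^2 + lam*||z||_1  s.t.  x = z."""
    n = len(d)
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # x update (3): quadratic objective, closed form
        x = (d + rho * z - y) / (1.0 + rho)
        # z update (4): soft-thresholding (the prox of the l1 norm)
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual update (5)
        y = y + rho * (x - z)
    return x, z

d = np.array([1.5, -0.2, 0.7])
x, z = admm_l1(d, lam=0.5)

# The minimizer of 1/2||x - d||^2 + 0.5*||x||_1 is soft-thresholding of d,
# so both ADMM blocks should agree with it after enough iterations.
expected = np.sign(d) * np.maximum(np.abs(d) - 0.5, 0.0)
assert np.allclose(x, expected, atol=1e-3)
assert np.allclose(z, expected, atol=1e-3)
```

The example also illustrates the point made above: all the per-iteration work sits in the x and z updates, while the dual update is a cheap vector addition.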
The proximal point method with D-functions (PMD) [6, 5] and Bregman proximal minimization (BPM) [20] generalize the proximal point method by using a generalized Bregman divergence to replace the quadratic term.

For ADMM, although its convergence is well understood, it is still unknown whether the quadratic penalty term in ADMM can be replaced by a Bregman divergence. The proof of global convergence of ADMM can be found in [13, 4]. Recently, it has been shown that ADMM converges at a rate of O(1/T) [25, 17], where T is the number of iterations. For strongly convex functions, the dual objective of an accelerated version of ADMM can converge at a rate of O(1/T^2) [15]. Under suitable assumptions, such as strongly convex functions or a sufficiently small step size for the dual variable update, ADMM can achieve a linear convergence rate [8, 19]. However, as pointed out by [4], "There is currently no proof of convergence known for ADMM with nonquadratic penalty terms."

In this paper, we propose Bregman ADMM (BADMM), which uses Bregman divergences to replace the quadratic penalty term in ADMM, answering the question raised in [4]. More specifically, the quadratic penalty term in the x and z updates (3)-(4) will be replaced by a Bregman divergence in BADMM. We also introduce a generalized version of BADMM where two additional Bregman divergences are added to the x and z updates. The generalized BADMM (BADMM for short) provides a unified framework for solving (1), which allows one to choose a suitable Bregman divergence so that the x and z updates can be solved efficiently. BADMM includes ADMM and its variants as special cases. In particular, BADMM replaces all quadratic terms in generalized ADMM [8] with Bregman divergences. By choosing a proper Bregman divergence, we also show that inexact ADMM [26] and Bethe ADMM [12] can be considered as special cases of BADMM.
BADMM generalizes ADMM in a manner similar to how MDA generalizes gradient descent and how PMD generalizes proximal methods. In BADMM, the x and z updates can take the form of MDA or PMD. We establish the global convergence and the O(1/T) iteration complexity for BADMM. In some cases, we show that BADMM can outperform ADMM by a factor of O(n/ln n). We evaluate the performance of BADMM in solving the linear program of the mass transportation problem [18]. Since BADMM exploits the structure of the problem, it leads to closed-form solutions which amount to elementwise operations and can be done in parallel. BADMM is faster than ADMM and can even be orders of magnitude faster than the highly optimized commercial software Gurobi when implemented on GPU.

The rest of the paper is organized as follows. In Section 2, we propose Bregman ADMM and discuss several special cases of BADMM. In Section 3, we establish the convergence of BADMM. In Section 4, we consider illustrative applications of BADMM, and conclude in Section 5.

2 Bregman Alternating Direction Method of Multipliers

Let φ : Ω → R be a continuously differentiable and strictly convex function on the relative interior of a convex set Ω. Denote by ∇φ(y) the gradient of φ at y. We define the Bregman divergence B_φ : Ω × ri(Ω) → R_+ induced by φ as

B_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩ .

Since φ is strictly convex, B_φ(x, y) ≥ 0, where equality holds if and only if x = y. More details about Bregman divergences can be found in [6, 1]. Note that the definition of the Bregman divergence has been generalized to nondifferentiable functions [20, 23]. In this paper, our discussion uses the definition of the classical Bregman divergence.
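To make the definition concrete, the following numpy sketch (ours, not from the paper) evaluates B_φ for a generator φ and its gradient, and checks the two standard examples of the squared Euclidean distance and the KL divergence; the points x and y are arbitrary simplex vectors.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = 1/2 ||x||_2^2  gives  B_phi(x, y) = 1/2 ||x - y||_2^2
sq = lambda v: 0.5 * np.dot(v, v)
sq_grad = lambda v: v

# phi(x) = sum_i x_i log x_i (negative entropy) gives, for x, y on the simplex,
# B_phi(x, y) = sum_i x_i log(x_i / y_i)  (the KL divergence)
negent = lambda v: np.sum(v * np.log(v))
negent_grad = lambda v: np.log(v) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

assert np.isclose(bregman(sq, sq_grad, x, y), 0.5 * np.sum((x - y) ** 2))
assert np.isclose(bregman(negent, negent_grad, x, y), np.sum(x * np.log(x / y)))
assert bregman(negent, negent_grad, x, y) >= 0.0  # B_phi >= 0 by strict convexity
```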
Two of the most commonly used examples are the squared Euclidean distance B_φ(x, y) = (1/2)‖x − y‖_2^2 and the KL divergence B_φ(x, y) = ∑_{i=1}^n x_i log(x_i / y_i).

Assuming B_φ(c − Ax, Bz) is well defined, we replace the quadratic penalty term in the partial augmented Lagrangian (2) by a Bregman divergence as follows:

L^φ_ρ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − c⟩ + ρ B_φ(c − Ax, Bz) .     (6)

Unfortunately, we cannot derive Bregman ADMM (BADMM) updates by simply solving L^φ_ρ(x, z, y) alternatingly as ADMM does, because Bregman divergences are not necessarily convex in the second argument. More specifically, given (z_t, y_t), x_{t+1} can be obtained by solving min_{x∈X} L^φ_ρ(x, z_t, y_t), where the quadratic penalty term (1/2)‖Ax + Bz_t − c‖_2^2 of ADMM in (3) is replaced with B_φ(c − Ax, Bz_t) in the x update of BADMM. However, given (x_{t+1}, y_t), we cannot obtain z_{t+1} by solving min_{z∈Z} L^φ_ρ(x_{t+1}, z, y_t), since the term B_φ(c − Ax_{t+1}, Bz) need not be convex in z. This observation motivates a closer look at the role of the quadratic term in ADMM. In standard ADMM, the quadratic augmentation term added to the Lagrangian is just a penalty term that ensures the new updates do not violate the equality constraint significantly. Staying with these goals, we propose the z update augmentation term of BADMM to be B_φ(Bz, c − Ax_{t+1}), instead of the quadratic penalty term (1/2)‖Ax_{t+1} + Bz − c‖_2^2 in (4). Then, we get the following updates for BADMM:

x_{t+1} = argmin_{x∈X} f(x) + ⟨y_t, Ax + Bz_t − c⟩ + ρ B_φ(c − Ax, Bz_t) ,     (7)
z_{t+1} = argmin_{z∈Z} g(z) + ⟨y_t, Ax_{t+1} + Bz − c⟩ + ρ B_φ(Bz, c − Ax_{t+1}) ,     (8)
y_{t+1} = y_t + ρ(Ax_{t+1} + Bz_{t+1} − c) .     (9)

Compared to ADMM (3)-(5), BADMM simply uses a Bregman divergence to replace the quadratic penalty term in the x and z updates. It is worth noting that the same Bregman divergence B_φ is used in the x and z updates.

We consider the special case where A = −I, B = I, c = 0. Then (7) reduces to

x_{t+1} = argmin_{x∈X} f(x) + ⟨y_t, −x + z_t⟩ + ρ B_φ(x, z_t) .     (10)

If φ is a quadratic function, the constrained problem (10) requires the projection onto the constraint set X. However, in some cases, by choosing a proper Bregman divergence, (10) can be solved efficiently or even has a closed-form solution. For example, assuming f is a linear function and X is the unit simplex, choosing B_φ to be the KL divergence leads to the exponentiated gradient [2, 3, 21]. Interestingly, if the z update is also an exponentiated gradient, we have alternating exponentiated gradients. In Section 4, we will show that the mass transportation problem can be cast into this scenario.

While the updates (7)-(8) use the same Bregman divergence, efficiently solving the x and z updates may not be feasible, especially when the structures of the original functions f, g, the function φ used for augmentation, and the constraint sets X, Z are rather different. For example, if f(x) is a logistic function in (10), the update will not have a closed-form solution even if B_φ is the KL divergence and X is the unit simplex.
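For the favorable case just described, namely linear f(x) = ⟨w, x⟩ on the unit simplex with B_φ the KL divergence, the x update (10) has the closed form x_{t+1} ∝ z_t ⊙ exp(−(w − y_t)/ρ), an exponentiated-gradient step. A minimal numpy sketch (the data w, y_t, z_t and ρ here are illustrative, not from the paper):

```python
import numpy as np

def x_update_kl(w, y, z, rho):
    """Closed form of (10) for linear f(x) = <w, x>, X = unit simplex,
    B_phi = KL divergence: an exponentiated-gradient step."""
    u = z * np.exp(-(w - y) / rho)
    return u / u.sum()

rng = np.random.default_rng(0)
n = 6
w = rng.uniform(size=n)       # illustrative linear cost
y = rng.uniform(size=n)       # illustrative dual variable
z = np.full(n, 1.0 / n)       # current iterate z_t on the simplex
rho = 1.0

x = x_update_kl(w, y, z, rho)

# Objective of (10) with A = -I, B = I, c = 0 (dropping the constant <y, z_t>)
obj = lambda v: w @ v - y @ v + rho * np.sum(v * np.log(v / z))

assert np.isclose(x.sum(), 1.0) and np.all(x > 0)  # x stays on the simplex
for _ in range(100):                               # x beats random feasible points
    v = rng.dirichlet(np.ones(n))
    assert obj(x) <= obj(v) + 1e-9
```

No projection step is needed: the multiplicative form keeps the iterate strictly positive and the normalization keeps it on the simplex, which is exactly what makes this update attractive compared with the Euclidean case.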
To address such concerns, we propose a generalized version of BADMM.

2.1 Generalized BADMM

To allow the use of different Bregman divergences in the x and z updates (7)-(8) of BADMM, the generalized BADMM simply introduces an additional Bregman divergence for each update. The generalized BADMM has the following updates:

x_{t+1} = argmin_{x∈X} f(x) + ⟨y_t, Ax + Bz_t − c⟩ + ρ B_φ(c − Ax, Bz_t) + ρ_x B_{ϕ_x}(x, x_t) ,     (11)
z_{t+1} = argmin_{z∈Z} g(z) + ⟨y_t, Ax_{t+1} + Bz − c⟩ + ρ B_φ(Bz, c − Ax_{t+1}) + ρ_z B_{ϕ_z}(z, z_t) ,     (12)
y_{t+1} = y_t + τ(Ax_{t+1} + Bz_{t+1} − c) ,     (13)

where ρ > 0, τ > 0, ρ_x ≥ 0, ρ_z ≥ 0. Note that we allow the use of a different step size τ in the dual variable update [8, 19]. There are three Bregman divergences in the generalized BADMM. While the Bregman divergence B_φ is shared by the x and z updates, the x update has its own Bregman divergence B_{ϕ_x} and the z update has its own Bregman divergence B_{ϕ_z}. The two additional Bregman divergences in generalized BADMM are variable specific, and can be chosen to make sure that the x_{t+1}, z_{t+1} updates are efficient. If all three Bregman divergences are quadratic functions, the generalized BADMM reduces to the generalized ADMM [8]. We prove the convergence of generalized BADMM in Section 3, which yields the convergence of BADMM with ρ_x = ρ_z = 0.

In the following, we illustrate how to choose a proper Bregman divergence B_{ϕ_x} so that the x update can be solved efficiently, e.g., in closed form, noting that the same arguments apply to the z update. Consider the first three terms in (11) as s(x) + h(x), where s(x) denotes a simple term and h(x) is the problematic term which needs to be linearized for an efficient x update. We illustrate the idea with several examples later in the section.
Now, we have

x_{t+1} = argmin_{x∈X} s(x) + h(x) + ρ_x B_{ϕ_x}(x, x_t) ,     (14)

where efficient updates are difficult due to the mismatch in structure between h and X. The goal is to 'linearize' the function h by using the fact that the Bregman divergence B_h(x, x_t) captures all the higher-order (beyond linear) terms in h(x), so that

h(x) − B_h(x, x_t) = h(x_t) + ⟨x − x_t, ∇h(x_t)⟩     (15)

is a linear function of x. Let ψ be another convex function such that one can efficiently solve min_{x∈X} s(x) + ψ(x) + ⟨x, b⟩ for any constant b. Assuming ϕ_x(x) = ψ(x) − (1/ρ_x) h(x) is continuously differentiable and strictly convex, we construct a Bregman divergence based proximal term for the original problem so that

argmin_{x∈X} s(x) + h(x) + ρ_x B_{ϕ_x}(x, x_t) = argmin_{x∈X} s(x) + ⟨∇h(x_t), x − x_t⟩ + ρ_x B_ψ(x, x_t) ,     (16)

where the latter problem can be solved efficiently, by our assumption. To ensure that ϕ_x is continuously differentiable and strictly convex, we can use the following condition:

Proposition 1 If h is smooth and has Lipschitz continuous gradients with constant ν under a p-norm, and ψ is ν/ρ_x-strongly convex w.r.t. the p-norm, then ϕ_x = ψ − (1/ρ_x)h is convex.

This condition has been widely used in gradient-type methods, including MDA and COMID. Note that the convergence analysis of generalized BADMM in Section 3 holds for any additional Bregman divergence based proximal terms, and does not rely on such specific choices.
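The identity (15) is just the definition of B_h rearranged; the following short numpy check (our illustration, with an arbitrary smooth h) confirms that subtracting B_h(·, x_t) from h leaves exactly the first-order Taylor expansion of h at x_t:

```python
import numpy as np

# An arbitrary smooth convex h: h(x) = log-sum-exp(x)
h = lambda v: np.log(np.sum(np.exp(v)))
h_grad = lambda v: np.exp(v) / np.sum(np.exp(v))

def bregman_h(x, xt):
    """B_h(x, xt): the higher-order (beyond linear) part of h around xt."""
    return h(x) - h(xt) - h_grad(xt) @ (x - xt)

rng = np.random.default_rng(1)
xt = rng.normal(size=4)
for _ in range(10):
    x = rng.normal(size=4)
    lhs = h(x) - bregman_h(x, xt)          # h with its higher-order terms removed
    rhs = h(xt) + h_grad(xt) @ (x - xt)    # first-order Taylor expansion at xt
    assert np.isclose(lhs, rhs)
```

This is why swapping B_{ϕ_x} for B_ψ in (16) exactly cancels the nonlinear part of h while leaving its linearization behind.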
Using the above idea, one can 'linearize' different parts of the x update to yield an efficient update. We consider three special cases, respectively focusing on linearizing the function f(x), linearizing the Bregman divergence based augmentation term B_φ(c − Ax, Bz_t), and linearizing both terms, along with examples for each case.

Case 1: Linearization of smooth function f: Let h(x) = f(x) in (16); we have

x_{t+1} = argmin_{x∈X} ⟨∇f(x_t), x − x_t⟩ + ⟨y_t, Ax⟩ + ρ B_φ(c − Ax, Bz_t) + ρ_x B_ψ(x, x_t) ,     (17)

where ∇f(x_t) is the gradient of f(x) at x_t.

Example 1 Consider the following ADMM form for the sparse logistic regression problem [16, 4]:

min_x h(x) + λ‖z‖_1   s.t.  x = z ,     (18)

where h(x) is the logistic function. If we use ADMM to solve (18), the x update is as follows [4]:

x_{t+1} = argmin_x h(x) + ⟨y_t, x − z_t⟩ + (ρ/2)‖x − z_t‖_2^2 ,     (19)

which is a ridge-regularized logistic regression problem, and one needs an iterative algorithm like L-BFGS to solve it. Instead, if we linearize h(x) at x_t and set B_ψ to be a quadratic function, then

x_{t+1} = argmin_x ⟨∇h(x_t), x − x_t⟩ + ⟨y_t, x − z_t⟩ + (ρ/2)‖x − z_t‖_2^2 + (ρ_x/2)‖x − x_t‖_2^2 ,     (20)

and the x update has a simple closed-form solution.

Case 2: Linearization of the quadratic penalty term: In ADMM, B_φ(c − Ax, Bz_t) = (1/2)‖Ax + Bz_t − c‖_2^2. Let h(x) = (ρ/2)‖Ax + Bz_t − c‖_2^2. Then ∇h(x_t) = ρAᵀ(Ax_t + Bz_t − c), and we have

x_{t+1} = argmin_{x∈X} f(x) + ⟨y_t + ρ(Ax_t + Bz_t − c), Ax⟩ + ρ_x B_ψ(x, x_t) .     (21)

This case mainly addresses the difficulty caused by the ‖Ax‖_2^2 term, which makes the x update nonseparable, whereas the linearized version can be solved with separable (parallel) updates. Several problems have benefited from the linearization of the quadratic term [8], e.g., when f is the ℓ_1 loss function [16], and projection onto the unit simplex or the ℓ_1 ball [9].

Case 3: Mirror descent: In some settings, we want to linearize both the function f and the quadratic augmentation term B_φ(c − Ax, Bz_t) = (1/2)‖Ax + Bz_t − c‖_2^2. Let h(x) = f(x) + ⟨y_t, Ax⟩ + (ρ/2)‖Ax + Bz_t − c‖_2^2; we have

x_{t+1} = argmin_{x∈X} ⟨∇h(x_t), x⟩ + ρ_x B_ψ(x, x_t) .     (22)

Note that (22) is an MDA-type update. Further, one can do a similar exercise with a general Bregman divergence based augmentation term B_φ(c − Ax, Bz_t), although there has to be a good motivation for going this route.

Example 2 [Bethe-ADMM [12]] Given an undirected graph G = (V, E), where V is the vertex set and E is the edge set, assume that a random discrete variable X_i associated with node i ∈ V can take K values. In a pairwise MRF, the joint distribution of a set of discrete random variables X = {X_1, ..., X_n} (n is the number of nodes in the graph) is defined in terms of nodes and cliques [24]. Consider solving the following graph-structured linear program (LP):

min_μ l(μ)   s.t.  μ ∈ L(G) ,     (23)

where l(μ) is a linear function of μ and L(G) is the so-called local polytope [24] determined by the marginalization and normalization (MN) constraints for each node and edge in the graph G:

L(G) = { μ ≥ 0 , ∑_{x_i} μ_i(x_i) = 1 , ∑_{x_j} μ_{ij}(x_i, x_j) = μ_i(x_i) } ,     (24)

where μ_i, μ_{ij} are pseudo-marginal distributions of node i and edge ij respectively. The LP in (23) contains O(nK + |E|K^2) variables and that order of constraints. In particular, (23) serves as an LP relaxation of the MAP inference problem in a pairwise MRF if l(μ) is defined as follows:

l(μ) = ∑_i ∑_{x_i} θ_i(x_i) μ_i(x_i) + ∑_{ij∈E} ∑_{x_i, x_j} θ_{ij}(x_i, x_j) μ_{ij}(x_i, x_j) ,     (25)

where θ_i, θ_{ij} are the potential functions of node i and edge ij respectively.

For a grid graph (e.g., an image) of size 1000×1000, (23) contains millions of variables and constraints, posing a challenge to LP solvers. An efficient way is to decompose the graph into trees such that

min_{μ_τ} ∑_τ c_τ l_τ(μ_τ)   s.t.  μ_τ ∈ T_τ , μ_τ = m_τ ,     (26)

where T_τ denotes the MN constraints (24) in the tree τ, μ_τ is a vector of pseudo-marginals of nodes and edges in the tree τ, m is a global variable which contains all trees, m_τ corresponds to the tree τ in the global variable, and c_τ is the weight for sharing variables. The augmented Lagrangian is

L_ρ(μ_τ, m, λ_τ) = ∑_τ c_τ l_τ(μ_τ) + ⟨λ_τ, μ_τ − m_τ⟩ + (ρ/2)‖μ_τ − m_τ‖_2^2 ,     (27)

which leads to the following update for μ_τ^{t+1} in ADMM:

μ_τ^{t+1} = argmin_{μ_τ∈T_τ} c_τ l_τ(μ_τ) + ⟨λ_τ^t, μ_τ⟩ + (ρ/2)‖μ_τ − m_τ^t‖_2^2 .     (28)

(28) is difficult to solve due to the MN constraints in the tree.
Let h(μ_τ) be the objective of (28). Linearizing h(μ_τ) and adding a Bregman divergence in (28), we have:

μ_τ^{t+1} = argmin_{μ_τ∈T_τ} ⟨∇h(μ_τ^t), μ_τ⟩ + ρ_x B_ψ(μ_τ, μ_τ^t)
          = argmin_{μ_τ∈T_τ} ⟨∇h(μ_τ^t) − ρ_x ∇ψ(μ_τ^t), μ_τ⟩ + ρ_x ψ(μ_τ) .

If ψ(μ_τ) is the negative Bethe entropy of μ_τ, the update of μ_τ^{t+1} becomes the Bethe entropy problem [24] and can be solved exactly using the sum-product algorithm in linear time for any tree.

3 Convergence Analysis of BADMM

We need the following assumption in establishing the convergence of BADMM:

Assumption 1 (a) f : R^{n1} → R ∪ {+∞} and g : R^{n2} → R ∪ {+∞} are closed, proper and convex. (b) An optimal solution exists. (c) The Bregman divergence B_φ is defined on an α-strongly convex function φ with respect to a p-norm ‖·‖_p, i.e., B_φ(u, v) ≥ (α/2)‖u − v‖_p^2, where α > 0.

Assume that {x*, z*, y*} satisfies the KKT conditions of the Lagrangian of (1) (ρ = 0 in (2)), i.e.,

−Aᵀy* ∈ ∂f(x*) , −Bᵀy* ∈ ∂g(z*) , Ax* + Bz* − c = 0 ,     (29)

and x* ∈ X, z* ∈ Z. Note that x ∈ X and z ∈ Z are always satisfied in (11) and (12). Let f′(x_{t+1}) ∈ ∂f(x_{t+1}) and g′(z_{t+1}) ∈ ∂g(z_{t+1}).
For x* ∈ X, z* ∈ Z, the optimality conditions of (11) and (12) are

⟨f′(x_{t+1}) + Aᵀ{y_t + ρ(−∇φ(c − Ax_{t+1}) + ∇φ(Bz_t))} + ρ_x(∇ϕ_x(x_{t+1}) − ∇ϕ_x(x_t)), x_{t+1} − x*⟩ ≤ 0 ,
⟨g′(z_{t+1}) + Bᵀ{y_t + ρ(∇φ(Bz_{t+1}) − ∇φ(c − Ax_{t+1}))} + ρ_z(∇ϕ_z(z_{t+1}) − ∇ϕ_z(z_t)), z_{t+1} − z*⟩ ≤ 0 .

If Ax_{t+1} + Bz_{t+1} = c, then y_{t+1} = y_t. Further, if B_{ϕ_x}(x_{t+1}, x_t) = 0 and B_{ϕ_z}(z_{t+1}, z_t) = 0, then the KKT conditions in (29) will be satisfied. Therefore, we have the following sufficient conditions for the KKT conditions:

B_{ϕ_x}(x_{t+1}, x_t) = 0 , B_{ϕ_z}(z_{t+1}, z_t) = 0 ,     (30a)
Ax_{t+1} + Bz_t − c = 0 , Ax_{t+1} + Bz_{t+1} − c = 0 .     (30b)

For the exact BADMM (ρ_x = ρ_z = 0 in (11) and (12)), the optimality conditions are (30b), which are equivalent to the optimality conditions of ADMM [4], i.e., Bz_{t+1} − Bz_t = 0, Ax_{t+1} + Bz_{t+1} − c = 0. Define the residual of the optimality conditions (30) at iteration (t + 1) as:

R(t+1) = (ρ_x/ρ) B_{ϕ_x}(x_{t+1}, x_t) + (ρ_z/ρ) B_{ϕ_z}(z_{t+1}, z_t) + B_φ(c − Ax_{t+1}, Bz_t) + γ‖Ax_{t+1} + Bz_{t+1} − c‖_2^2 ,     (31)

where γ > 0. If R(t+1) = 0, the optimality conditions (30a) and (30b) are satisfied. It is therefore sufficient to show the convergence of BADMM by showing that R(t+1) converges to zero. The following theorem establishes the global convergence of BADMM.

Theorem 1 Let the sequence {x_t, z_t, y_t} be generated by BADMM (11)-(13), and let {x*, z*, y*} satisfy (29) with x* ∈ X, z* ∈ Z. Let Assumption 1 hold and τ ≤ (ασ − 2γ)ρ, where σ = min{1, m^{2/p−1}} and 0 < γ < ασ/2. Then R(t+1) converges to zero and {x_t, z_t, y_t} converges to a KKT point {x*, z*, y*}.

Remark 1 (a) If 0 < p ≤ 2, then σ = 1 and τ ≤ (α − 2γ)ρ. The case 0 < p ≤ 2 includes two widely used Bregman divergences, i.e., the Euclidean distance and the KL divergence. For the KL divergence on the unit simplex, we have α = 1, p = 1 in Assumption 1(c), i.e., KL(u, v) ≥ (1/2)‖u − v‖_1^2 [2]. (b) Since we often set B_φ to be a quadratic function (p = 2), the three special cases in Section 2.1 can choose the step size τ = (α − 2γ)ρ. (c) If p > 2, σ will be small, leading to a small step size τ, which may not be necessary in practice. It would be interesting to see whether a larger step size can be used for any p > 0.

The following theorem establishes an O(1/T) iteration complexity for the objective and the residual of constraints in an ergodic sense.

Theorem 2 Let the sequences {x_t, z_t, y_t} be generated by BADMM (11)-(13). Set τ ≤ (ασ − 2γ)ρ, where σ = min{1, m^{2/p−1}} and 0 < γ < ασ/2.
Let x̄_T = (1/T)∑_{t=1}^T x_t, z̄_T = (1/T)∑_{t=1}^T z_t, and y_0 = 0. For any x* ∈ X, z* ∈ Z and (x*, z*, y*) satisfying the KKT conditions (29), we have

f(x̄_T) + g(z̄_T) − (f(x*) + g(z*)) ≤ D_1 / T ,     (32)
‖Ax̄_T + Bz̄_T − c‖_2^2 ≤ D(w*, w_0) / (γT) ,     (33)

where D_1 = ρ B_φ(Bz*, Bz_0) + ρ_x B_{ϕ_x}(x*, x_0) + ρ_z B_{ϕ_z}(z*, z_0) and D(w*, w_0) = (1/(2τρ))‖y* − y_0‖_2^2 + B_φ(Bz*, Bz_0) + (ρ_x/ρ) B_{ϕ_x}(x*, x_0) + (ρ_z/ρ) B_{ϕ_z}(z*, z_0).

We consider one special case of BADMM where B = I and X, Z are the unit simplex. Let B_φ be the KL divergence. For z* ∈ Z ⊂ R^{n2×1}, choosing z_0 = e/n_2, we have B_φ(z*, z_0) = ∑_{i=1}^{n2} z*_i ln(z*_i / z_{i,0}) = ∑_{i=1}^{n2} z*_i ln z*_i + ln n_2 ≤ ln n_2. Similarly, if ρ_x > 0, by choosing x_0 = e/n_1, we have B_{ϕ_x}(x*, x_0) ≤ ln n_1. Setting α = 1, σ = 1 and γ = 1/4 in Theorem 2 yields the following result:

Corollary 1 Let the sequences {x_t, z_t, y_t} be generated by Bregman ADMM (11)-(13) and y_0 = 0. Assume B = I, and X and Z are the unit simplex. Let B_φ, B_{ϕ_x}, B_{ϕ_z} be KL divergences. Let x̄_T = (1/T)∑_{t=1}^T x_t, z̄_T = (1/T)∑_{t=1}^T z_t. Set τ = 3ρ/4. For any x* ∈ X, z* ∈ Z and (x*, z*, y*) satisfying the KKT conditions (29), we have

f(x̄_T) + g(z̄_T) − (f(x*) + g(z*)) ≤ (ρ ln n_2 + ρ_x ln n_1 + ρ_z ln n_2) / T ,     (34)
‖Ax̄_T + Bz̄_T − c‖_2^2 ≤ ( (2/(τρ))‖y* − y_0‖_2^2 + 4 ln n_2 + (4ρ_x/ρ) ln n_1 + (4ρ_z/ρ) ln n_2 ) / T .     (35)

Remark 2 (a) [2] shows that MDA yields a similar O(ln n) bound, where n is the dimensionality of the problem. If the diminishing step size of MDA is proportional to √(ln n), the bound is O(√(ln n)). Therefore, MDA is faster than the gradient descent method by a factor of O((n/ln n)^{1/2}). (b) In ADMM, B_φ(z*, z_0) = (1/2)‖z* − z_0‖_2^2 = (1/2)∑_{i=1}^n ‖z*_i − z_{i,0}‖_2^2 ≤ n. Therefore, BADMM is faster than ADMM by a factor of O(n/ln n) in an ergodic sense.

4 Experimental Results

In this section, we use BADMM to solve the mass transportation problem [18]:

min ⟨C, X⟩   s.t.  Xe = a , Xᵀe = b , X ≥ 0 ,     (36)

where ⟨C, X⟩ denotes Tr(CᵀX), C ∈ R^{m×n} is a cost matrix, X ∈ R^{m×n}, a ∈ R^{m×1}, b ∈ R^{n×1}, and e is a column vector of ones. The mass transportation problem (36) is a linear program and thus can be solved by the simplex method.

We now show that (36) can be solved by ADMM and BADMM. We first introduce a variable Z to split the constraints into two simplices such that Δ_x = {X | X ≥ 0, Xe = a} and Δ_z = {Z | Z ≥ 0, Zᵀe = b}. (36) can be rewritten in the following ADMM form:

min ⟨C, X⟩   s.t.  X ∈ Δ_x , Z ∈ Δ_z , X = Z .     (37)

(37) can be solved by ADMM, which requires the Euclidean projection onto the simplices Δ_x and Δ_z, although the projection can be done efficiently [9]. We use BADMM to solve (37):
X \u2208 \u2206x, Z \u2208 \u2206z, X = Z .\n\nmin (cid:104)C, X(cid:105)\n\nXt+1 = argminX\u2208\u2206x(cid:104)C, X(cid:105) + (cid:104)Yt, X(cid:105) + \u03c1KL(X, Zt) ,\nZt+1 = argminZ\u2208\u2206z(cid:104)Yt,\u2212Z(cid:105) + \u03c1KL(Z, Xt+1) ,\nYt+1 = Yt + \u03c1(Xt+1 \u2212 Zt+1) .\n\n(38)\n(39)\n(40)\n\nbj\n\n(41)\n\nBoth (38) and (39) have closed-form solutions, i.e.,\n\nX t+1\n\nij =\n\n(cid:80)n\n\nij exp(\u2212 Cij +Y t\nZ t\nj=1 Z t\n\n)\nij exp(\u2212 Cij +Y t\n\n\u03c1\n\nij\n\n\u03c1\n\nij\n\nai , Z t+1\n\nij =\n\n)\n\n(cid:80)m\n\nX t+1\n\nij\n\nexp(\n\ni=1 X t+1\n\nij\n\nY t\nij\n\u03c1 )\nY t\nij\n\u03c1 )\n\nexp(\n\nwhich are exponentiated graident updates and can be done in O(mn). Besides the sum operation\n(O(ln n) or O(ln m)),\n(41) amounts to elementwise operation and thus can be done in parallel.\nAccording to Corollary 1, BADMM can be faster than ADMM by a factor of O(n/ ln n).\nWe compare BADMM with ADMM and a commercial LP solver Gurobi on the mass transportation\nproblem (36) with m = n and a = b = e. C is randomly generated from the uniform distribution.\nWe run the experiments 5 times and the average is reported. We choose the \u2018best\u2019parameter for\nBADMM (\u03c1 = 0.001) and ADMM (\u03c1 = 0.001). The stopping condition is either when the number\nof iterations exceeds 2000 or when the primal-dual residual is less than 10\u22124.\nBADMM vs ADMM: Figure 1 compares BADMM and ADMM with different dimensions n =\n{1000, 2000, 4000} running on a single CPU. Figure 1(a) plots the primal and dual residual against\n\n7\n\n\f(a) m = n = 1000\n\n(b) m = n = 2000\n\n(c) m = n = 4000\n\nFigure 1: Comparison BADMM and ADMM. BADMM converges faster than ADMM. (a): the\nprimal and dual residual agaist the runtime. (b): the primal and dual residual over iterations. 
(c):\nThe convergence of objective value against the runtime.\n\nTable 1: Comparison of BADMM (GPU) with Gurobi in solving mass transportation problem\nBADMM (GPU)\nobjective\ntime (s)\n\nnumber of variables\n\nGurobi (Laptop)\n\nGurobi (Server)\n\nm \u00d7 n\n\nobjective\n\nobjective\n\n1.69\n1.61\n1.65\n\n-\n\n0.54\n22.15\n117.75\n303.54\n\n1.69\n1.61\n1.65\n1.63\n\n(210)2 > 1 million\n\n(5 \u00d7 210)2 > 25 million\n(10 \u00d7 210)2 > 0.1 billion\n(15 \u00d7 210)2 > 0.2 billion\n\ntime (s)\n\n4.22\n377.14\n\n-\n-\n\n1.69\n1.61\n\n-\n-\n\ntime (s)\n\n2.66\n92.89\n1235.34\n\n-\n\nthe runtime when n = 1000, and Figure 1(b) plots the convergence of primal and dual residual over\niteration when n = 2000. BADMM converges faster than ADMM. Figure 1(c) plots the convergence\nof objective value against the runtime when n = 4000. BADMM converges faster than ADMM even\nwhen the initial point is further from the optimum.\nBADMM vs Gurobi: Gurobi (http://www.gurobi.com/) is a highly optimized commercial software\nwhere linear programming solvers have been ef\ufb01ciently implemented. We run Gurobi on two set-\ntings: a Mac laptop with 8G memory and a server with 86G memory, respectively. For comparison,\nBADMM is run in parallel on a Tesla M2070 GPU with 5G memory and 448 cores1. We experi-\nment with large scale problems and use m = n = {1, 5, 10, 15} \u00d7 210. Table 1 shows the runtime\nand the objective values of BADMM and Gurobi, where a \u2018-\u2019 indicates the algorithm did not termi-\nnate. In spite of Gurobi being one of the most optimized LP solvers, BADMM running in parallel\nis several times faster than Gurobi. In fact, for larger values of n, Gurobi did not terminate even\non the 86G server, whereas BADMM was ef\ufb01cient even with just 5G memory! 
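As a concrete illustration, the closed-form updates (41) can be sketched directly in NumPy. This is a minimal sketch, not the paper's GPU implementation: the function name `badmm_transport`, the small problem size m = n = 20, the normalized marginals a = b = e/n, and the choice ρ = 0.05 are assumptions made for the example (the experiments above use a = b = e and ρ = 0.001).

```python
import numpy as np

def badmm_transport(C, a, b, rho=0.05, iters=5000, tol=1e-6):
    """Sketch of the BADMM updates (38)-(40) via the closed forms (41).

    X-update: rows of Z * exp(-(C + Y)/rho), rescaled so row i sums to a_i.
    Z-update: columns of X * exp(Y/rho), rescaled so column j sums to b_j.
    Y-update: dual ascent on the coupling constraint X = Z.
    """
    m, n = C.shape
    Z = np.outer(np.ones(m) / m, b)   # strictly positive start in Delta_z
    Y = np.zeros((m, n))
    for _ in range(iters):
        # (38): exponentiated-gradient step onto Delta_x, row-wise
        W = Z * np.exp(-(C + Y) / rho)
        X = a[:, None] * W / W.sum(axis=1, keepdims=True)
        # (39): exponentiated-gradient step onto Delta_z, column-wise
        V = X * np.exp(Y / rho)
        Z = b[None, :] * V / V.sum(axis=0, keepdims=True)
        # (40): dual update penalizing the violation of X = Z
        Y += rho * (X - Z)
        if np.abs(X - Z).max() < tol:  # primal residual as stopping rule
            break
    return X, Z

rng = np.random.default_rng(0)
m = n = 20
C = rng.uniform(size=(m, n))  # uniform random cost, as in the experiments
a = np.ones(m) / m            # normalized marginals (assumption for the demo)
b = np.ones(n) / n
X, Z = badmm_transport(C, a, b)
print(np.sum(C * X))          # transport objective <C, X>
```

Every operation inside the loop is an elementwise exponential or product plus a row/column sum, which is what makes the update suitable for massive parallelism on a GPU.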
The memory consumption of Gurobi increases rapidly with the increase of n, especially at the scales we consider. When n = 5 × 2^10, the memory required by Gurobi surpassed the memory of the laptop, leading to the rapid increase in time. A similar situation was also observed on the server with 86G memory when n = 10 × 2^10. In contrast, the memory required by BADMM is O(n²): even when n = 15 × 2^10 (more than 0.2 billion parameters), BADMM can still run on a single GPU with only 5G memory. The results clearly illustrate the promise of BADMM. With more careful implementation and code optimization, BADMM has the potential to solve large-scale problems efficiently in parallel with a small memory footprint.

5 Conclusions

In this paper, we generalized the alternating direction method of multipliers (ADMM) to Bregman ADMM, similar to how mirror descent generalizes gradient descent. BADMM defines a unified framework for ADMM, generalized ADMM, inexact ADMM and Bethe ADMM. The global convergence and the O(1/T) iteration complexity of BADMM are also established. In some cases, BADMM is faster than ADMM by a factor of O(n/ln n). BADMM is also faster than highly optimized commercial software in solving the linear program of the mass transportation problem.

Acknowledgment

The research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, IIS-0916750, and by NASA grant NNX12AQ39A. H.W. and A.B. acknowledge the technical support from the University of Minnesota Supercomputing Institute. H.W. acknowledges the support of DDF (2013-2014) from the University of Minnesota. A.B.
acknowledges support from IBM and Yahoo.

^1 GPU code is available at https://github.com/anteagle/GPU_BADMM_MT

References
[1] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.
[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
[3] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12:79–108, 2001.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[5] Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1998.
[6] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3:538–543, 1993.
[7] P. Combettes and J. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer (Ed.), pages 185–212, 2011.
[8] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. ArXiv, 2012.
[9] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
[10] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari.
Composite objective mirror descent. In COLT, 2010.
[11] M. A. T. Figueiredo and J. M. Bioucas-Dias. Restoration of Poissonian images using alternating direction optimization. IEEE Transactions on Image Processing, 19:3133–3145, 2010.
[12] Q. Fu, H. Wang, and A. Banerjee. Bethe-ADMM for tree decomposition based parallel MAP inference. In UAI, 2013.
[13] D. Gabay. Applications of the method of multipliers to variational inequalities. In Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, M. Fortin and R. Glowinski, eds., North-Holland: Amsterdam, 1983.
[14] T. Goldstein, X. Bresson, and S. Osher. Geometric applications of the split Bregman method: segmentation and surface reconstruction. Journal of Scientific Computing, 45(1):272–293, 2010.
[15] T. Goldstein, B. O'Donoghue, and S. Setzer. Fast alternating direction optimization methods. CAM Report 12-35, UCLA, 2012.
[16] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[17] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50:700–709, 2012.
[18] F. L. Hitchcock. The distribution of a product from several sources to numerous localities. Journal of Mathematical Physics, 20:224–230, 1941.
[19] M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. ArXiv, 2012.
[20] K. C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35:1142–1168, 1995.
[21] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[22] Y. Nesterov. Gradient methods for minimizing composite objective function.
Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
[23] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In ICML, 2012.
[24] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.
[25] H. Wang and A. Banerjee. Online alternating direction method. In ICML, 2012.
[26] J. Yang and Y. Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. ArXiv, 2009.