{"title": "Parallel Direction Method of Multipliers", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "We consider the problem of minimizing block-separable convex functions subject to linear constraints. While the Alternating Direction Method of Multipliers (ADMM) for two-block linear constraints has been intensively studied both theoretically and empirically, in spite of some preliminary work, effective generalizations of ADMM to multiple blocks is still unclear. In this paper, we propose a parallel randomized block coordinate method named Parallel Direction Method of Multipliers (PDMM) to solve the optimization problems with multi-block linear constraints. PDMM randomly updates some blocks in parallel, behaving like parallel randomized block coordinate descent. We establish the global convergence and the iteration complexity for PDMM with constant step size. We also show that PDMM can do randomized block coordinate descent on overlapping blocks. Experimental results show that PDMM performs better than state-of-the-arts methods in two applications, robust principal component analysis and overlapping group lasso.", "full_text": "Parallel Direction Method of Multipliers\n\nHuahua Wang , Arindam Banerjee , Zhi-Quan Luo\n\n{huwang,banerjee}@cs.umn.edu, luozq@umn.edu\n\nUniversity of Minnesota, Twin Cities\n\nAbstract\n\nWe consider the problem of minimizing block-separable (non-smooth) convex\nfunctions subject to linear constraints. While the Alternating Direction Method of\nMultipliers (ADMM) for two-block linear constraints has been intensively studied\nboth theoretically and empirically, in spite of some preliminary work, effective\ngeneralizations of ADMM to multiple blocks is still unclear. In this paper, we\npropose a parallel randomized block coordinate method named Parallel Direction\nMethod of Multipliers (PDMM) to solve optimization problems with multi-block\nlinear constraints. 
At each iteration, PDMM randomly updates some blocks in parallel, behaving like parallel randomized block coordinate descent. We establish the global convergence and the iteration complexity for PDMM with constant step size. We also show that PDMM can do randomized block coordinate descent on overlapping blocks. Experimental results show that PDMM performs better than state-of-the-art methods in two applications, robust principal component analysis and overlapping group lasso.

1 Introduction

In this paper, we consider the minimization of block-separable convex functions subject to linear constraints, with the canonical form:

min_{xj ∈ Xj} f(x) = Σ_{j=1}^J fj(xj) ,  s.t.  Ax = Σ_{j=1}^J A^c_j xj = a ,    (1)

where the objective function f(x) is a sum of J block-separable (nonsmooth) convex functions, A^c_j ∈ R^{m×nj} is the j-th column block of A ∈ R^{m×n} where n = Σ_j nj, xj ∈ R^{nj×1} is the j-th block coordinate of x, Xj is a local convex constraint on xj, and a ∈ R^{m×1}. The canonical form can be extended to handle linear inequalities by introducing slack variables, i.e., writing Ax ≤ a as Ax + z = a, z ≥ 0.

A variety of machine learning problems can be cast into the linearly-constrained optimization problem (1) [8, 4, 24, 5, 6, 21, 11]. For example, in robust Principal Component Analysis (RPCA) [5], one attempts to recover a low rank matrix L and a sparse matrix S from an observation matrix M, i.e., the linear constraint is M = L + S. Further, in the stable version of RPCA [29], a noisy matrix Z is taken into consideration, and the linear constraint has three blocks, i.e., M = L + S + Z. Problem (1) can also include composite minimization problems which solve a sum of a loss function and a set of nonsmooth regularization functions.
Due to the increasing interest in structural sparsity [1], composite regularizers have become widely used, e.g., overlapping group lasso [28]. As the blocks are overlapping in this class of problems, it is difficult to apply block coordinate descent methods for large scale problems [16, 18], which assume block-separability. By simply splitting blocks and introducing equality constraints, the composite minimization problem can also be formulated as (1) [2].

A classical approach to solving (1) is to relax the linear constraints using the (augmented) Lagrangian, i.e.,

Lρ(x, y) = f(x) + ⟨y, Ax − a⟩ + (ρ/2)‖Ax − a‖²₂ ,    (2)

where ρ ≥ 0 is called the penalty parameter. We call x the primal variable and y the dual variable. (2) usually leads to primal-dual algorithms which update the primal and dual variables alternately. While the dual update is simply dual gradient ascent, the primal update solves a minimization problem of (2) given y. If ρ = 0, the primal update can be solved in a parallel block coordinate fashion [3, 19], leading to the dual ascent method. While the dual ascent method can achieve massive parallelism, a careful choice of step size and some strict conditions are required for convergence, particularly when f is nonsmooth. To achieve better numerical efficiency and convergence behavior compared to the dual ascent method, it is favorable to set ρ > 0 in the augmented Lagrangian (2), which we call the method of multipliers. However, (2) is then no longer separable, and solving the entire augmented Lagrangian (2) exactly is computationally expensive. In [20], randomized block coordinate descent (RBCD) [16, 18] is used to solve (2) exactly, but this leads to a double-loop algorithm along with the dual step.
More recent results show that (2) can be solved inexactly by just sweeping the coordinates once using the alternating direction method of multipliers (ADMM) [12, 2]. This paper attempts to develop a parallel randomized block coordinate variant of ADMM.

When J = 2, ADMM has been widely used to solve the augmented Lagrangian (2) in many applications [2]. Encouraged by the success of ADMM with two blocks, ADMM has also been extended to solve problems with multiple blocks [15, 14, 10, 17, 13, 7]. The variants of ADMM can be mainly divided into two categories. The first category considers Gauss-Seidel ADMM (GSADMM) [15, 14], which solves (2) in a cyclic block coordinate manner. In [13], a back substitution step was added so that the convergence of ADMM for multiple blocks can be proved. In some cases, it has been shown that ADMM might not converge for multiple blocks [7]. In [14], a block successive upper bound minimization method of multipliers (BSUMM) is proposed to solve the problem (1). The convergence of BSUMM is established under some fairly strict conditions: (i) certain local error bounds hold; (ii) the step size is either sufficiently small or decreasing. However, in general, Gauss-Seidel ADMM with multiple blocks is not well understood and its iteration complexity is largely open. The second category considers Jacobian variants of ADMM [26, 10, 17], which solve (2) in a parallel block coordinate fashion. In [26, 17], (1) is solved by using two-block ADMM with splitting variables (sADMM). [10] considers a proximal Jacobian ADMM (PJADMM) obtained by adding proximal terms. A randomized block coordinate variant of ADMM named RBSUMM was proposed in [14]. However, RBSUMM can only randomly update one block.
Moreover, the\nconvergence of RBSUMM is established under the same conditions as BSUMM and its iteration\ncomplexity is unknown.\nIn this paper, we propose a parallel randomized block coordinate method named parallel direction\nmethod of multipliers (PDMM) which randomly picks up any number of blocks to update in parallel,\nbehaving like randomized block coordinate descent [16, 18]. Like the dual ascent method, PDMM\nsolves the primal update in a parallel block coordinate fashion even with the augmentation term.\nMoreover, PDMM inherits the merits of the method of multipliers and can solve a fairly large class\nof problems, including nonsmooth functions. Technically, PDMM has three aspects which make it\ndistinct from such state-of-the-art methods. First, if block coordinates of the primal x is solved ex-\nactly, PDMM uses a backward step on the dual update so that the dual variable makes conservative\nprogress. Second, the sparsity of A and the number of randomized blocks are taken into consider-\nation to determine the step size of the dual update. Third, PDMM can randomly update arbitrary\nnumber of primal blocks in parallel. Moreover, we show that sADMM and PJADMM are the two ex-\ntreme cases of PDMM. The connection between sADMM and PJADMM through PDMM provides\nbetter understanding of dual backward step. PDMM can also be used to solve overlapping groups in\na randomized block coordinate fashion. Interestingly, the corresponding problem for RBCD [16, 18]\nwith overlapping blocks is still an open problem. We establish the global convergence and O(1/T )\niteration complexity of PDMM with constant step size. We evaluate the performance of PDMM in\ntwo applications: robust principal component analysis and overlapping group lasso.\nThe rest of the paper is organized as follows: We introduce PDMM in Section 2, and establish\nconvergence results in Section 3. We evaluate the performance of PDMM in Section 4 and conclude\nin Section 5. 
The technical analysis and detailed proofs are provided in the supplement.

Notations: Assume that A ∈ R^{m×n} is divided into I × J blocks. Let A^r_i ∈ R^{mi×n} be the i-th row block of A, A^c_j ∈ R^{m×nj} be the j-th column block of A, and A_ij ∈ R^{mi×nj} be the ij-th block of A. Let y_i ∈ R^{mi×1} be the i-th block of y ∈ R^{m×1}. Let N(i) be the set of nonzero blocks A_ij in the i-th row block A^r_i and d_i = |N(i)| be the number of nonzero blocks. Let K̃_i = min{d_i, K}, where K is the number of blocks randomly chosen by PDMM, and let T be the number of iterations.

2 Parallel Direction Method of Multipliers

Consider a direct Jacobi version of ADMM which updates all blocks in parallel:

x^{t+1}_j = argmin_{xj ∈ Xj} Lρ(xj, x^t_{k≠j}, y^t) ,    (3)
y^{t+1} = y^t + τρ(Ax^{t+1} − a) ,    (4)

where τ is a shrinkage factor for the step size of the dual gradient ascent update. However, empirical results show that it is almost impossible to make the direct Jacobi updates (3)-(4) converge even when τ is extremely small. [15, 10] also noticed that the direct Jacobi updates may not converge.

To address the problem in (3) and (4), we propose a backward step on the dual update. Moreover, instead of updating all blocks, the blocks xj will be updated in a parallel randomized block coordinate fashion. We call the algorithm Parallel Direction Method of Multipliers (PDMM). PDMM first randomly selects K blocks, denoted by the set Jt, at time t, then executes the following iterates:

x^{t+1}_{jt} = argmin_{x_{jt} ∈ X_{jt}} Lρ(x_{jt}, x^t_{k≠jt}, ŷ^t) + η_{jt} B_{φjt}(x_{jt}, x^t_{jt}) ,  jt ∈ Jt,    (5)
y^{t+1}_i = y^t_i + τ_i ρ(A_i x^{t+1} − a_i) ,    (6)
ŷ^{t+1}_i = y^{t+1}_i − ν_i ρ(A_i x^{t+1} − a_i) ,    (7)

where τ_i > 0, 0 ≤ ν_i < 1, η_{jt} ≥ 0, and B_{φjt}(x_{jt}, x^t_{jt}) is a Bregman divergence.
Note x^{t+1} = (x^{t+1}_{Jt}, x^t_{k∉Jt}) in (6) and (7). (6) and (7) update all dual blocks. We show that PDMM can also do randomized dual block coordinate ascent in an extended work [25]. Let K̃_i = min{d_i, K}. τ_i and ν_i can take the following values:

τ_i = K / (K̃_i(2J − K)) ,  ν_i = 1 − 1/K̃_i .    (8)

In the x_{jt}-update (5), a Bregman divergence is added so that exact PDMM and its inexact variants can be analyzed in a unified framework [23, 11]. In particular, if η_{jt} = 0, (5) is an exact update. If η_{jt} > 0, by choosing a suitable Bregman divergence, (5) can be solved by various inexact updates, often yielding a closed form for the x_{jt} update (see Section 2.1).

To better understand PDMM, we discuss the following three aspects which play roles in choosing τ_i and ν_i: the dual backward step (7), the sparsity of A, and the choice of randomized blocks.

Dual Backward Step: We attribute the failure of the Jacobi updates (3)-(4) to the following observation in (3), which can be rewritten as:

x^{t+1}_j = argmin_{xj ∈ Xj} fj(xj) + ⟨y^t + ρ(Ax^t − a), A^c_j xj⟩ + (ρ/2)‖A^c_j(xj − x^t_j)‖²₂ .    (9)

In the primal xj update, the quadratic penalty term implicitly adds a full gradient ascent step to the dual variable, i.e., y^t + ρ(Ax^t − a), which we call implicit dual ascent. The implicit dual ascent along with the explicit dual ascent (4) may lead to overly aggressive progress on the dual variable, particularly when the number of blocks is large. Based on this observation, we introduce an intermediate variable ŷ^t to replace y^t in (9) so that the implicit dual ascent in (9) makes conservative progress, e.g., ŷ^t + ρ(Ax^t − a) = y^t + (1 − ν)ρ(Ax^t − a), where 0 < ν < 1.
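To make the iterates (5)-(7) concrete, here is a minimal sketch (not the authors' code; the function name and toy problem are illustrative) that runs exact PDMM on min ½Σj (xj − cj)² s.t. Σj xj = a, where each block update (5) has a closed form and the step sizes follow (8):

```python
import random

def pdmm_toy(c, a, rho=1.0, K=None, iters=60, seed=0):
    """Exact PDMM sketch for: min 0.5*sum_j (x_j - c_j)^2  s.t.  sum_j x_j = a.

    There is a single row block whose nonzero blocks are all J columns, so
    d_i = J, K~_i = min(d_i, K), and Eq. (8) gives the step sizes below.
    """
    rng = random.Random(seed)
    J = len(c)
    K = J if K is None else K
    Kt = min(J, K)                        # K~_i = min{d_i, K}, here d_i = J
    tau = K / (Kt * (2 * J - K))          # dual ascent step size, Eq. (8)
    nu = 1.0 - 1.0 / Kt                   # dual backward step size, Eq. (8)
    x = [0.0] * J
    y = yhat = 0.0
    for _ in range(iters):
        Jt = rng.sample(range(J), K)      # randomly pick K blocks to update
        s = sum(x)                        # blocks not in Jt keep their x^t
        x_new = list(x)
        for j in Jt:
            # block update (5): argmin 0.5*(xj-cj)^2 + yhat*xj
            #                        + rho/2*(xj + sum_{k!=j} x_k^t - a)^2
            x_new[j] = (c[j] - yhat + rho * (a - (s - x[j]))) / (1.0 + rho)
        x = x_new
        r = sum(x) - a                    # primal residual A x^{t+1} - a
        y = y + tau * rho * r             # dual ascent step (6)
        yhat = y - nu * rho * r           # dual backward step (7)
    return x, y
```

With all blocks updated (K = J), this is the parallel all-blocks case; passing K < J exercises the randomized variant.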
ŷ^t is the result of a 'backward step' on the dual variable, i.e., ŷ^t = y^t − νρ(Ax^t − a).

Moreover, one can show that τ and ν have also been implicitly used when using two-block ADMM with splitting variables (sADMM) to solve (1) [17, 26]. Section 2.2 shows sADMM is a special case of PDMM. The connection helps in understanding the role of the two parameters τ_i, ν_i in PDMM. Interestingly, the step sizes τ_i and ν_i can be improved by considering the block sparsity of A and the number of random blocks K to be updated.

Sparsity of A: Assume A is divided into I × J blocks. While xj can be updated in parallel, the matrix multiplication Ax in the dual update (4) requires synchronization to gather messages from all block coordinates jt ∈ Jt. For updating the i-th block of the dual y_i, we need A_i x^{t+1} = Σ_{jt ∈ Jt} A_{ijt} x^{t+1}_{jt} + Σ_{k ∉ Jt} A_{ik} x^t_k, which aggregates "messages" from all x_{jt}. If A_{ijt} is a block of zeros, there is no "message" from x_{jt} to y_i. More precisely, A_i x^{t+1} = Σ_{jt ∈ Jt ∩ N(i)} A_{ijt} x^{t+1}_{jt} + Σ_{k ∉ Jt} A_{ik} x^t_k, where N(i) denotes the set of nonzero blocks in the i-th row block A_i. N(i) can be considered as the set of neighbors of the i-th dual block y_i and d_i = |N(i)| is the degree of the i-th dual block y_i. If A is sparse, d_i could be far smaller than J. According to (8), a low d_i will lead to bigger step sizes τ_i for the dual update and smaller step sizes for the dual backward step (7).
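The dependence of the step sizes (8) on the block sparsity pattern is easy to compute directly; a small helper (illustrative, not from the paper) evaluated on the dense single-row-block pattern A = [I I I] used later for RPCA (d_i = 3, J = 3) reproduces (τ_i, ν_i) = (1/5, 0), (1/4, 1/2), (1/3, 2/3) for K = 1, 2, 3:

```python
def pdmm_step_sizes(block_pattern, J, K):
    """Step sizes from Eq. (8), one (tau_i, nu_i) pair per row block of A.

    block_pattern[i] is the set N(i) of column-block indices j with A_ij != 0,
    so d_i = |N(i)|; J is the number of column blocks and K the number of
    blocks PDMM updates per iteration.
    """
    sizes = []
    for row in block_pattern:
        d_i = len(row)                    # degree of the i-th dual block y_i
        Kt = min(d_i, K)                  # K~_i = min{d_i, K}
        tau = K / (Kt * (2 * J - K))      # dual ascent step size
        nu = 1.0 - 1.0 / Kt               # dual backward step size
        sizes.append((tau, nu))
    return sizes

# one row block with all three column blocks nonzero, i.e. A = [I I I]
assert pdmm_step_sizes([{0, 1, 2}], 3, 2) == [(0.25, 0.5)]
```

A sparse pattern with small d_i yields a larger τ_i and a smaller ν_i, matching the discussion above.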
Further, as shown in Section 2.3, when using PDMM with all blocks to solve composite minimization with overlapping blocks, PDMM can use τ_i = 0.5, which is much larger than 1/J in sADMM.

Randomized Blocks: The number of blocks to be randomly chosen also has an effect on τ_i, ν_i. If randomly choosing one block (K = 1), then ν_i = 0, τ_i = 1/(2J − 1), and the dual backward step (7) vanishes. As K increases, ν_i increases from 0 to 1 − 1/d_i and τ_i increases from 1/(2J − 1) to 1/d_i. If updating all blocks (K = J), τ_i = 1/d_i, ν_i = 1 − 1/d_i. PDMM does not necessarily choose any K combination of J blocks. The J blocks can be randomly partitioned into J/K groups where each group has K blocks; then PDMM randomly picks some groups. A simple way is to permute the J blocks and choose K blocks cyclically.

2.1 Inexact PDMM

If η_{jt} > 0, there is an extra Bregman divergence term in (5), which can serve two purposes. First, choosing a suitable Bregman divergence can lead to an efficient solution for (5). Second, if η_{jt} is sufficiently large, the dual update can use a large step size (τ_i = 1) and the backward step (7) can be removed (ν_i = 0), leading to the same updates as PJADMM [10] (see Section 2.2).

Given a continuously differentiable and strictly convex function ψ_{jt}, its Bregman divergence is defined as

B_{ψjt}(x_{jt}, x^t_{jt}) = ψ_{jt}(x_{jt}) − ψ_{jt}(x^t_{jt}) − ⟨∇ψ_{jt}(x^t_{jt}), x_{jt} − x^t_{jt}⟩ ,    (10)

where ∇ψ_{jt} denotes the gradient of ψ_{jt}. Rearranging the terms yields

ψ_{jt}(x_{jt}) − B_{ψjt}(x_{jt}, x^t_{jt}) = ψ_{jt}(x^t_{jt}) + ⟨∇ψ_{jt}(x^t_{jt}), x_{jt} − x^t_{jt}⟩ ,    (11)

which is exactly the linearization of ψ_{jt}(x_{jt}) at x^t_{jt}.
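The Bregman divergence (10) and the linearization view (11) can be checked numerically. The scalar sketch below (illustrative only) also verifies the condition used for the inexact updates: with ψ having a σ-Lipschitz gradient and ϕ = ½‖·‖², the combination B_ϕ − (1/η)B_ψ stays nonnegative once η ≥ σ (here ψ is the logistic function, σ = 1/4):

```python
import math

def bregman(f, grad_f, x, x0):
    """B_f(x, x0) = f(x) - f(x0) - <grad_f(x0), x - x0>, for scalar x."""
    return f(x) - f(x0) - grad_f(x0) * (x - x0)

# psi: logistic loss, whose gradient is (1/4)-Lipschitz; phi = (1/2) x^2
psi = lambda x: math.log(1.0 + math.exp(x))
grad_psi = lambda x: 1.0 / (1.0 + math.exp(-x))
phi = lambda x: 0.5 * x * x
grad_phi = lambda x: x

eta = 0.3   # any eta >= sigma = 1/4 keeps B_phi - (1/eta) B_psi nonnegative
for x, x0 in [(1.3, -0.4), (-2.0, 0.7), (0.1, 0.2)]:
    gap = bregman(phi, grad_phi, x, x0) - (1.0 / eta) * bregman(psi, grad_psi, x, x0)
    assert gap >= 0.0
```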
Therefore, if solving (5) exactly becomes difficult due to some problematic terms, we can use the Bregman divergence to linearize these problematic terms so that (5) can be solved efficiently. More specifically, in (5), we can choose φ_{jt} = ϕ_{jt} − (1/η_{jt})ψ_{jt}, assuming ψ_{jt} is the problematic term. Using the linearity of the Bregman divergence,

B_{φjt}(x_{jt}, x^t_{jt}) = B_{ϕjt}(x_{jt}, x^t_{jt}) − (1/η_{jt}) B_{ψjt}(x_{jt}, x^t_{jt}) .    (12)

For instance, if f_{jt} is a logistic function, solving (5) exactly requires an iterative algorithm. Setting ψ_{jt} = f_{jt}, ϕ_{jt} = ½‖·‖²₂ in (12) and plugging into (5) yields

x^{t+1}_{jt} = argmin_{x_{jt} ∈ X_{jt}} ⟨∇f_{jt}(x^t_{jt}), x_{jt}⟩ + ⟨ŷ^t, A^c_{jt} x_{jt}⟩ + (ρ/2)‖A^c_{jt} x_{jt} + Σ_{k≠jt} A^c_k x^t_k − a‖²₂ + (η_{jt}/2)‖x_{jt} − x^t_{jt}‖²₂ ,

which has a closed-form solution. Similarly, if the quadratic penalty term (ρ/2)‖A^c_{jt} x_{jt} + Σ_{k≠jt} A^c_k x^t_k − a‖²₂ is a problematic term, we can set ψ_{jt}(x_{jt}) = (ρ/2)‖A^c_{jt} x_{jt} + Σ_{k≠jt} A^c_k x^t_k − a‖²₂; then B_{ψjt}(x_{jt}, x^t_{jt}) = (ρ/2)‖A^c_{jt}(x_{jt} − x^t_{jt})‖²₂ can be used to linearize the quadratic penalty term.

In (12), the nonnegativeness of B_{φjt} implies that B_{ϕjt} ≥ (1/η_{jt}) B_{ψjt}. This condition can be satisfied as long as ϕ_{jt} is more convex than ψ_{jt}. Technically, we assume that ϕ_{jt} is σ/η_{jt}-strongly convex and ψ_{jt} has Lipschitz continuous gradient with constant σ, which has been shown in [23].

2.2 Connections to Related Work

Consider the case when all blocks are used in PDMM.
There are also two other methods which update all blocks in parallel. If solving the primal updates exactly, two-block ADMM with splitting variables (sADMM) is considered in [17, 26]. We show that sADMM is a special case of PDMM when setting τ_i = 1/J and ν_i = 1 − 1/J (Appendix B in [25]). If the primal updates are solved inexactly, [10] considers a proximal Jacobian ADMM (PJADMM) obtained by adding proximal terms, where the convergence rate is improved to o(1/T) given sufficiently large proximal terms. We show that PJADMM [10] is also a special case of PDMM (Appendix C in [25]). sADMM and PJADMM are two extreme cases of PDMM. The connection between sADMM and PJADMM through PDMM provides a better understanding of the three methods and of the role of the dual backward step. If the primal update is solved exactly, which makes sufficient progress, the dual update should take a small step, e.g., sADMM. On the other hand, if the primal update makes small progress by adding proximal terms, the dual update can take a full gradient step, e.g., PJADMM. While sADMM is a direct derivation of ADMM, PJADMM introduces more terms and parameters.

In addition to PDMM, RBSUMM [14] can also randomly update one block. The convergence of RBSUMM requires certain local error bounds to hold and a decreasing step size. Moreover, the iteration complexity of RBSUMM is still unknown. In contrast, PDMM converges at a rate of O(1/T) with a constant step size.

2.3 Randomized Overlapping Block Coordinate Descent

Consider the composite minimization problem of a sum of a loss function ℓ(w) and composite regularizers gj(wj):

min_w ℓ(w) + Σ_{j=1}^L gj(wj) ,    (13)

which considers L overlapping groups wj ∈ R^{b×1}. Let J = L + 1 and xJ = w. For 1 ≤ j ≤ L, denote xj = wj; then xj = U_j^T xJ, where U_j ∈ R^{n×b} consists of columns of an identity matrix and extracts the coordinates of xJ.
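The group-splitting construction just described (copies x_j = U_jᵀ x_J tied back to w by linear constraints) can be sketched for concrete overlapping groups; the helper name below is hypothetical, and the code confirms that every row of the resulting constraint matrix has exactly two nonzero entries, i.e., d_i = 2:

```python
def build_splitting(n, groups):
    """Hypothetical helper: build A = [I, -U^T] for the splitting reformulation.

    groups: list of index lists over w's n coordinates; each group g gets its
    own copied block x_g = U_g^T w, and A x = 0 enforces the copies. Rows of A
    range over the stacked copies; columns over [x_1; ...; x_L; w].
    """
    m = sum(len(g) for g in groups)          # total number of copied coordinates
    A = [[0] * (m + n) for _ in range(m)]
    r = 0
    for g in groups:
        for idx in g:
            A[r][r] = 1                      # identity block: selects the copy
            A[r][m + idx] = -1               # -U^T block: subtracts w[idx]
            r += 1
    return A

# two groups of size 3 over 5 coordinates, overlapping at coordinate 2
A = build_splitting(5, [[0, 1, 2], [2, 3, 4]])
# every row touches exactly two column blocks, so d_i = 2 as claimed
assert all(sum(1 for v in row if v != 0) == 2 for row in A)
```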
Denote U = [U1, · · · , UL] ∈ R^{n×(bL)} and A = [I_{bL}, −U^T], where bL denotes b × L. By letting fj(xj) = gj(wj) and fJ(xJ) = ℓ(w), (13) can be written as:

min_x Σ_{j=1}^J fj(xj)  s.t.  Ax = 0 ,    (14)

where x = [x1; · · · ; xL; xL+1]. (14) can be solved by PDMM in a randomized block coordinate fashion. In A, each row block of b rows has only two nonzero blocks, i.e., d_i = 2. Therefore, τ_i = K/(2(2J − K)), ν_i = 0.5. In particular, if K = J, τ_i = ν_i = 0.5. In contrast, sADMM uses τ_i = 1/J ≪ 0.5 and ν_i = 1 − 1/J > 0.5 if J is large.

Remark 1 (a) ADMM [2] can solve (14) where the equality constraint is xj = U_j^T xJ. (b) In this setting, Gauss-Seidel ADMM (GSADMM) and BSUMM [14] are the same as ADMM. BSUMM should converge with constant stepsize ρ (not necessarily sufficiently small), although the theory of BSUMM does not include this special case.

3 Theoretical Results

We establish the convergence results for PDMM under fairly simple assumptions:

Assumption 1
(1) fj : R^{nj} → R ∪ {+∞} are closed, proper, and convex.
(2) A KKT point of the Lagrangian (ρ = 0 in (2)) of Problem (1) exists.

Assumption 1 is the same as that required by ADMM [2, 22]. Assume that {x*_j ∈ Xj, y*_i} satisfies the KKT conditions of the Lagrangian (ρ = 0 in (2)), i.e.,

−A^{cT}_j y* ∈ ∂fj(x*_j) ,    (15)
Ax* − a = 0 .    (16)

During iterations, (16) is satisfied if Ax^{t+1} = a. Let f′_j(x^{t+1}_j) ∈ ∂fj(x^{t+1}_j), where ∂fj is the subdifferential of fj. When Ax^{t+1} = a, y^{t+1} = y^t.
For x*_j ∈ Xj, the optimality condition for the xj update (5) is

⟨f′_j(x^{t+1}_j) + A^{cT}_j[y^t + (1 − ν)ρ(Ax^t − a) + A^c_j(x^{t+1}_j − x^t_j)] + ηj(∇φj(x^{t+1}_j) − ∇φj(x^t_j)), x^{t+1}_j − x*_j⟩ ≤ 0 .

If A^c_j(x^{t+1}_j − x^t_j) = 0, then Ax^t − a = 0. When ηj ≥ 0, further assuming B_{φj}(x^{t+1}_j, x^t_j) = 0, (15) will be satisfied. Note x*_j ∈ Xj is always satisfied in (5) in PDMM. Overall, the KKT conditions (15)-(16) are satisfied if the following optimality conditions are satisfied by the iterates:

Ax^{t+1} = a ,  A^c_j(x^{t+1}_j − x^t_j) = 0 ,    (17)
B_{φj}(x^{t+1}_j, x^t_j) = 0 .    (18)

The above optimality conditions are sufficient for the KKT conditions. (17) are the optimality conditions for the exact PDMM; (18) is needed only when ηj > 0.

Let z_ij = A_ij xj ∈ R^{mi×1}, z^r_i = [z^T_{i1}, · · · , z^T_{iJ}]^T ∈ R^{miJ×1} and z = [(z^r_1)^T, · · · , (z^r_I)^T]^T ∈ R^{Jm×1}. Define the residual of the optimality conditions (17)-(18) as

R(x^{t+1}) = (ρ/2) Σ_{i=1}^I β_i ‖A^r_i x^{t+1} − a_i‖²₂ + (ρ/2) ‖z^{t+1} − z^t‖²_{Pt} + Σ_{j=1}^J ηj B_{φj}(x^{t+1}_j, x^t_j) ,    (19)

where Pt is some positive semi-definite matrix and β_i = K/(J K̃_i). If R(x^{t+1}) → 0, (17)-(18) will be satisfied and thus PDMM converges to the KKT point {x*, y*}. Define the current iterate v^t = (x^t_j, y^t_i) and h(v*, v^t) as a distance from v^t to a KKT point v* = (x*_j ∈ Xj, y*_i):

h(v*, v^t) = Σ_{i=1}^I (K/(2τ_iρJ)) ‖y*_i − y^{t−1}_i‖²₂ + L̃ρ(x^t, y^t) + (ρ/2) ‖z* − z^t‖²_Q + Σ_{j=1}^J ηj B_{φj}(x*_j, x^t_j) ,    (20)

where Q is a positive semi-definite matrix and, with γ_i = 2(J − K)/(K̃_i(2J − K)) + 1/d_i − K/(J K̃_i),

L̃ρ(x^t, y^t) = f(x^t) − f(x*) + Σ_{i=1}^I { (K/J) ⟨y^t_i, A^r_i x^t − a_i⟩ + ((γ_i − τ_i)ρ/2) ‖A^r_i x^t − a_i‖²₂ } .    (21)

The following lemma shows that h(v*, v^t) ≥ 0.

Lemma 1 Let v^t = (x^t_j, y^t_i) be generated by PDMM (5)-(7) and h(v*, v^t) be defined in (20). Setting ν_i = 1 − 1/K̃_i and τ_i = K/(K̃_i(2J − K)), we have

h(v*, v^t) ≥ (ρ/2) Σ_{i=1}^I ζ_i ‖A^r_i x^t − a_i‖²₂ + (ρ/2) ‖z* − z^t‖²_Q + Σ_{j=1}^J ηj B_{φj}(x*_j, x^t_j) ≥ 0 ,    (22)

where ζ_i = (J − K)/(K̃_i(2J − K)) + 1/d_i − K/(J K̃_i) ≥ 0. Moreover, if h(v*, v^t) = 0, then A^r_i x^t = a_i, z^t = z* and B_{φj}(x*_j, x^t_j) = 0. Thus, (15)-(16) are satisfied.

In PDMM, y^{t+1} depends on x^{t+1}, which in turn depends on Jt. x^t and y^t are independent of Jt. x^t depends on the observed realizations of the random variable ξ_{t−1} = {J1, · · · , J_{t−1}}. The following theorem shows that h(v*, v^t) decreases monotonically and thus establishes the global convergence of PDMM.

Theorem 1 (Global Convergence) Let v^t = (x^t_j, y^t_i) be generated by PDMM (5)-(7) and v* = (x*_j ∈ Xj, y*_i) be a KKT point satisfying (15)-(16). Setting ν_i = 1 − 1/K̃_i and τ_i = K/(K̃_i(2J − K)), we have

0 ≤ E_{ξt} h(v*, v^{t+1}) ≤ E_{ξ_{t−1}} h(v*, v^t) ,  E_{ξt} R(x^{t+1}) → 0 .    (23)

The following theorem establishes the iteration complexity of PDMM in an ergodic sense.

Theorem 2 (Iteration Complexity) Let (x^t_j, y^t_i) be generated by PDMM (5)-(7) and x̄^T = (1/T) Σ_{t=1}^T x^t. Setting ν_i = 1 − 1/K̃_i and τ_i = K/(K̃_i(2J − K)), we have

E f(x̄^T) − f(x*) ≤ (J/(KT)) { Σ_{i=1}^I (1/(2β_iρ)) ‖y*_i‖²₂ + L̃ρ(x^1, y^1) + (ρ/2) ‖z* − z^1‖²_Q + Σ_{j=1}^J ηj B_{φj}(x*_j, x^1_j) } ,

E Σ_{i=1}^I β_i ‖A^r_i x̄^T − a_i‖²₂ ≤ 2 h(v*, v^0) / (ρT) ,

where β_i = K/(J K̃_i), Q is a positive semi-definite matrix, and the expectation is over Jt.

Figure 1: Comparison of the convergence of PDMM with ADMM methods in RPCA.

Table 1: The best results of PDMM with tuning parameters τ_i, ν_i in RPCA.

           time (s)   iteration   residual (×10⁻⁵)   objective (log)
PDMM1      118.83     40          3.60               8.07
PDMM2      137.46     34          5.51               8.07
PDMM3      147.82     31          6.54               8.07
GSADMM     163.09     28          6.84               8.07
RBSUMM     206.96     141         8.55               8.07
sADMM      731.51     139         9.73               8.07

Remark 2 PDMM converges at the same rate as ADMM and its variants.
In Theorem 2, PDMM can achieve the fastest convergence by setting J = K = 1, τ_i = 1, ν_i = 0, i.e., the entire matrix A is considered as a single block, indicating that PDMM reduces to the method of multipliers. In this case, however, the resulting subproblem may be difficult to solve, as discussed in Section 1. Therefore, the number of blocks in PDMM depends on the trade-off between the number of subproblems and how efficiently each subproblem can be solved.

4 Experimental Results

In this section, we evaluate the performance of PDMM in solving robust principal component analysis (RPCA) and overlapping group lasso [28]. We compared PDMM with ADMM [2] or GSADMM (no theory guarantee), sADMM [17, 26], and RBSUMM [14]. Note GSADMM includes BSUMM [14]. All experiments are implemented in Matlab and run sequentially. We run the experiments 10 times and report the average results. The stopping criterion is either when the residual is smaller than 10⁻⁴ or when the number of iterations exceeds 2000.

RPCA: RPCA is used to obtain a low rank and sparse decomposition of a given matrix A corrupted by noise [5, 17]:

min ‖X1‖²_F + γ2‖X2‖₁ + γ3‖X3‖* ,  s.t.  A = X1 + X2 + X3 ,    (24)

where A ∈ R^{m×n}, X1 is a noise matrix, X2 is a sparse matrix and X3 is a low rank matrix. A = L + S + V is generated in the same way as [17]¹. In this experiment, m = 1000, n = 5000 and the rank is 100. The number appended to PDMM denotes the number of blocks (K) chosen in PDMM, e.g., PDMM1 randomly updates one block.

Figure 1 compares the convergence results of PDMM with ADMM methods. In PDMM, ρ = 1 and τ_i, ν_i are chosen according to (8), i.e., (τ_i, ν_i) = {(1/5, 0), (1/4, 1/2), (1/3, 2/3)} for PDMM1, PDMM2 and PDMM3 respectively. We choose the 'best' results for GSADMM (ρ = 1), RBSUMM (ρ = 1, α = ρ·11/√(t+10)) and sADMM (ρ = 1).
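In PDMM applied to (24), each block update reduces to a proximal map of the corresponding regularizer; two of the three are elementary (the nuclear-norm block for X3 requires singular value thresholding and is omitted here). A minimal sketch with hypothetical helper names, operating on plain nested lists:

```python
def prox_l1(V, t):
    """Soft-thresholding: argmin_X t*||X||_1 + 0.5*||X - V||_F^2, elementwise."""
    return [[(abs(v) - t if abs(v) > t else 0.0) * (1.0 if v >= 0 else -1.0)
             for v in row] for row in V]

def prox_sq_frobenius(V, t):
    """argmin_X t*||X||_F^2 + 0.5*||X - V||_F^2 = V / (1 + 2t), elementwise."""
    return [[v / (1.0 + 2.0 * t) for v in row] for row in V]

# entries above the threshold shrink toward zero; the rest vanish
assert prox_l1([[3.0, -0.5, 1.0]], 1.0) == [[2.0, 0.0, 0.0]]
```

Inside PDMM, V would be the residual term a given block sees from the other blocks and the (backward-stepped) dual variable; t folds in ρ and the γ weights.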
PDMMs perform better than RBSUMM and sADMM. Note the publicly available code of sADMM¹ does not have a dual update, i.e., τ_i = 0. sADMM should be the same as PDMM3 if τ_i = 1/3. Since τ_i = 0, sADMM is the slowest algorithm. Without tuning the parameters of PDMM, GSADMM converges faster than PDMM. Note PDMM can run in parallel but GSADMM only runs sequentially. PDMM3 is faster than the two randomized versions of PDMM since the costs of the extra iterations in PDMM1 and PDMM2 surpass the savings at each iteration. For the two randomized one-block coordinate methods, PDMM1 converges faster than RBSUMM in terms of both the number of iterations and runtime.

The effect of τ_i, ν_i: We tuned the parameters τ_i, ν_i in PDMMs. Three randomized methods (RBSUMM, PDMM1 and PDMM2) choose the blocks cyclically instead of randomly. Table 1 compares the 'best' results of PDMM with other ADMM methods. In PDMM, (τ_i, ν_i) = {(1/2, 0), (1/3, 1/2), (1/2, 1/2)}. GSADMM converges with the smallest number of iterations, but PDMMs can converge faster than GSADMM in terms of runtime. The computation per iteration in GSADMM is slightly higher than PDMM3 because GSADMM updates the sum X1 + X2 + X3 but PDMM3 can reuse the sum. Therefore, if the numbers of iterations of the two methods are close, PDMM3 can be faster than GSADMM. PDMM1 and PDMM2 can be faster than PDMM3. By simply updating one block, PDMM1 is the fastest algorithm and achieves the lowest residual.

¹http://www.stanford.edu/~boyd/papers/prox_algs/matrix_decomp.html

Figure 2: Comparison of convergence of PDMM and other methods in overlapping group Lasso.

Overlapping Group Lasso: We consider solving the overlapping group lasso problem [28]:

min_w (1/2)‖Aw − b‖²₂ + λ Σ_{g∈G} d_g ‖w_g‖₂ ,    (25)

where A ∈ R^{m×n}, w ∈ R^{n×1}, w_g ∈ R^{b×1} is the vector of the overlapping group indexed by g, and d_g is some positive weight of group g ∈ G. As shown in Section 2.3, (25) can be rewritten in the form (14). The data is generated in the same way as [27, 9]: the elements of A are sampled from a normal distribution, b = Ax + ε with noise ε sampled from a normal distribution, and xj = (−1)^j exp(−(j − 1)/100). In this experiment, m = 5000, the number of groups is L = 100, and d_g = 1/L, λ = L/5 in (25). The size of each group is 100 and the overlap is 10. The total number of blocks in PDMM and sADMM is J = 101. τ_i, ν_i in PDMM are computed according to (8).

In Figure 2, the first two figures plot the convergence of the objective in terms of the number of iterations and time. PDMM uses all 101 blocks and is the fastest algorithm. ADMM is the same as GSADMM in this problem, but is slower than PDMM. Since sADMM does not consider the sparsity, it uses τ_i = 1/(J+1), ν_i = 1 − 1/(J+1), leading to slow convergence. The two accelerated methods, PA-APG [27] and S-APG [9], are slower than PDMM and ADMM.

The effect of K: The third figure shows PDMM with different numbers of blocks K. Although the complexity of each iteration is lowest when K = 1, PDMM takes many more iterations than in the other cases and thus takes the longest time. As K increases, PDMM converges faster and faster. When K = 20, the runtime is already the same as using all blocks. When K > 21, PDMM takes less time to converge than using all blocks.
The runtime of PDMM decreases as K increases from 21 to 61. However, the speedup from 61 to 81 is negligible. We tried different sets of parameters for RBSUMM, ρ(i² + 1)/(i + t) (0 ≤ i ≤ 5, ρ = 0.01, 0.1, 1), as well as sufficiently small step sizes, but could not see the convergence of the objective within 5000 iterations. Therefore, the results are not included here.

5 Conclusions

We proposed a randomized block coordinate variant of ADMM named Parallel Direction Method of Multipliers (PDMM) to solve the class of problems of minimizing block-separable convex functions subject to linear constraints. PDMM considers the sparsity and the number of blocks to be updated when setting the step size. We show that two existing Jacobian ADMM methods are special cases of PDMM. We also use PDMM to solve overlapping block problems. The global convergence and the iteration complexity are established with constant step size. Experiments on robust PCA and overlapping group lasso show that PDMM is faster than existing methods.

Acknowledgment

H. W. and A. B. acknowledge the support of NSF via IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, IIS-0916750, and NASA grant NNX12AQ39A. H. W. acknowledges the support of DDF (2013-2014) from the University of Minnesota. A. B. acknowledges support from IBM and Yahoo. Z.-Q. Luo is supported in part by the US AFOSR via grant number FA9550-12-1-0340 and the National Science Foundation via grant number DMS-1015346.

[Figure 2 panels: objective vs. time (s) and objective vs. iteration for PA-APG, S-APG, PDMM, ADMM and sADMM; residual (log) vs. time (s) for PDMM with K = 1, 21, 41, 61, 81, 101.]

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.

[2] S. Boyd, E.
Chu, N. Parikh, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106:594–607, 2011.

[5] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58:1–37, 2011.

[6] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. Annals of Statistics, 40:1935–1967, 2012.

[7] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Preprint, 2013.

[8] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43:129–159, 2001.

[9] X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6:719–752, 2012.

[10] W. Deng, M. Lai, Z. Peng, and W. Yin. Parallel multi-block ADMM with O(1/k) convergence. ArXiv, 2014.

[11] Q. Fu, H. Wang, and A. Banerjee. Bethe-ADMM for tree decomposition based parallel MAP inference. In UAI, 2013.

[12] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers and Mathematics with Applications, 2:17–40, 1976.

[13] B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313–340, 2012.

[14] M. Hong, T. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z. Luo.
A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. Preprint, 2013.

[15] M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. ArXiv, 2012.

[16] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[17] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1:123–231, 2014.

[18] P. Richtarik and M. Takac. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.

[19] N. Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.

[20] R. Tappenden, P. Richtarik, and B. Buke. Separable approximations and decomposition methods for the augmented Lagrangian. Preprint, 2013.

[21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.

[22] H. Wang and A. Banerjee. Online alternating direction method. In ICML, 2012.

[23] H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. In NIPS, 2014.

[24] H. Wang, A. Banerjee, C. Hsieh, P. Ravikumar, and I. Dhillon. Large scale distributed sparse precision estimation. In NIPS, 2013.

[25] H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. ArXiv, 2014.

[26] X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. Preprint, 2013.

[27] Y. Yu. Better approximation and faster algorithm using the proximal average. In NIPS, 2013.

[28] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection.
Annals of Statistics, 37:3468–3497, 2009.

[29] Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma. Stable principal component pursuit. In IEEE International Symposium on Information Theory, 2010.