{"title": "Constrained convex minimization via model-based excessive gap", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 729, "abstract": "We introduce a model-based excessive gap technique to analyze first-order primal- dual methods for constrained convex minimization. As a result, we construct first- order primal-dual methods with optimal convergence rates on the primal objec- tive residual and the primal feasibility gap of their iterates separately. Through a dual smoothing and prox-center selection strategy, our framework subsumes the augmented Lagrangian, alternating direction, and dual fast-gradient methods as special cases, where our rates apply.", "full_text": "Constrained convex minimization\n\nvia model-based excessive gap\n\nQuoc Tran-Dinh and Volkan Cevher\n\n\u00b4Ecole Polytechnique F\u00b4ed\u00b4erale de Lausanne (EPFL), CH1015-Lausanne, Switzerland\n\nLaboratory for Information and Inference Systems (LIONS)\n{quoc.trandinh, volkan.cevher}@epfl.ch\n\nAbstract\n\nWe introduce a model-based excessive gap technique to analyze \ufb01rst-order primal-\ndual methods for constrained convex minimization. As a result, we construct new\nprimal-dual methods with optimal convergence rates on the objective residual and\nthe primal feasibility gap of their iterates separately. Through a dual smoothing\nand prox-function selection strategy, our framework subsumes the augmented La-\ngrangian, and alternating methods as special cases, where our rates apply.\n\nf (cid:63) := min\n\nIntroduction\n\n1\nIn [1], Nesterov introduced a primal-dual technique, called the excessive gap, for constructing and\nanalyzing \ufb01rst-order methods for nonsmooth and unconstrained convex optimization problems. 
This paper builds upon the same idea to construct and analyze algorithms for the following class of constrained convex problems, which captures a surprisingly broad set of applications [2, 3, 4, 5]:

f⋆ := min_x { f(x) : Ax = b, x ∈ X },   (1)

where f : Rn → R ∪ {+∞} is a proper, closed and convex function; X ⊆ Rn is a nonempty, closed and convex set; and A ∈ Rm×n and b ∈ Rm are given.

In the sequel, we show how Nesterov's excessive gap relates to the smoothed gap function for a variational inequality that characterizes the optimality condition of (1). In the light of this connection, we enforce a simple linear model on the excessive gap, and use it to develop efficient first-order methods to numerically approximate an optimal solution x⋆ of (1). Then, we rigorously characterize how the following structural assumptions on (1) affect their computational efficiency:

Structure 1: Decomposability. We say that problem (1) is p-decomposable if its objective function f and its feasible set X can be represented as follows:

f(x) := Σ_{i=1}^p f_i(x_i), and X := Π_{i=1}^p X_i,   (2)

where x_i ∈ R^{n_i}, X_i ⊆ R^{n_i}, f_i : R^{n_i} → R ∪ {+∞} is proper, closed and convex for i = 1, …, p, and Σ_{i=1}^p n_i = n. Decomposability naturally arises in machine learning applications such as group-sparse linear recovery, consensus optimization, and the dual formulation of empirical risk minimization problems [5]. As an important example, the composite convex minimization problem min_{x1} {f1(x1) + f2(Kx1)} can be cast into (1) with a 2-decomposable structure by using the intermediate variable x2 = Kx1 to represent Ax = b. Decomposable structure immediately supports parallel and distributed implementations in synchronous hardware architectures.

Structure 2: Proximal tractability.
By proximal tractability, we mean that the computation of the following operation with a given proper, closed and convex function g is "efficient" (e.g., by a closed-form solution or by a polynomial-time algorithm) [6]:

prox_g(z) := argmin_w { g(w) + (1/2)‖w − z‖² }.   (3)

When the constraint z ∈ Z is present, we consider the proximal operator of g(·) + δ_Z(·) instead of g, where δ_Z is the indicator function of Z. Many smooth and non-smooth convex functions have tractable proximal operators, such as norms and the projection onto a simple set [3, 7, 4, 5].

Scalable algorithms for (1) and their limitations. We can obtain scalable numerical solutions of (1) when we augment the objective f with simple penalty functions on the constraints. Despite the fundamental difficulties in choosing the penalty parameter, this approach enhances our computational capabilities as well as numerical robustness, since we can apply modern proximal-gradient, alternating direction, and primal-dual methods. Unfortunately, existing approaches invariably feature one or both of the following two limitations:

Limitation 1: Non-ideal convergence characterizations. Ideally, the convergence rate characterization of a first-order algorithm for solving (1) must simultaneously establish rates for its iterates xk ∈ X both on the objective residual f(xk) − f⋆ and on the primal feasibility gap ‖Axk − b‖ of its linear constraints. The constraint feasibility is critical so that the primal convergence rate has any significance.
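As a brief aside on the proximal tractability assumption (Structure 2), two textbook instances of (3) with closed forms are the ℓ1-norm (soft-thresholding) and the indicator of a box (projection). A minimal sketch in NumPy; the functions and test values are ours, chosen only for illustration:

```python
import numpy as np

def prox_l1(z, lam=1.0):
    # Soft-thresholding: closed-form prox of g(w) = lam * ||w||_1 in (3).
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_box(z, lo=-1.0, hi=1.0):
    # Prox of the indicator delta_Z of the box Z = [lo, hi]^n is the projection onto Z.
    return np.clip(z, lo, hi)
```

For example, `prox_l1` shrinks each coordinate toward zero by `lam`, zeroing small entries; `prox_box` simply clips.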
Rates on a joint measure of the objective residual and the feasibility gap are not necessarily meaningful, since (1) is a constrained problem and f(xk) − f⋆ can easily be negative at all times, in contrast to the unconstrained setting, where we trivially have f(xk) − f⋆ ≥ 0.

Hitherto, the convergence results of state-of-the-art methods are far from ideal; see Table 1 in [28]. Most algorithms have guarantees in the ergodic sense [8, 9, 10, 11, 12, 13, 14] with non-optimal rates, which diminishes their practical performance; they rely on special function properties to improve convergence rates on the function and feasibility [12, 15], which reduces the scope of their applicability; they provide rates on dual functions [16], or on a weighted primal residual and feasibility score [13], which does not necessarily imply convergence of the primal residual or the feasibility; or they obtain a convergence rate on the gap function value sequence composed of both the primal and dual variables via variational inequality and gap function characterizations [8, 10, 11], where the rate is scaled by a diameter parameter of the dual feasible set, which is not necessarily bounded.

Limitation 2: Computational inflexibility. Recent theoretical developments customize algorithms to special function classes for scalability, such as convex functions with globally Lipschitz gradients and strong convexity. Unfortunately, these algorithms often require knowledge of function class parameters (e.g., the Lipschitz constant and the strong convexity parameter); they do not address the full scope of (1) (e.g., with self-concordant [barrier] functions or fully non-smooth decompositions); and they often have complicated algorithmic implementations with backtracking steps, which can create computational bottlenecks. These issues are compounded by their penalty parameter selection, which can significantly decrease numerical efficiency [17].
Moreover, they lack a natural ability to handle p-decomposability in a parallel fashion at optimal rates.

Our specific contributions. To this end, this paper addresses the question: "Is it possible to efficiently solve (1) using only the proximal tractability assumption, with rigorous global convergence rates on the objective residual and the primal feasibility gap?" The answer is indeed positive, provided that there exists a solution in a bounded feasible set X. Surprisingly, we can still leverage favorable function classes for fast convergence, such as strongly convex functions, and exploit p-decomposability at optimal rates.

Our characterization is radically different from existing results, such as in [18, 8, 19, 9, 10, 11, 12, 13]. Specifically, we unify primal-dual methods [20, 21], smoothing (both for Bregman distances and for augmented Lagrangian functions) [22, 21], and the excessive gap function technique [1] in one framework. As a result, we develop an efficient algorithmic framework for solving (1), which covers the augmented Lagrangian method [23, 24], the [preconditioned] alternating direction method of multipliers ([P]ADMM) [8], and fast dual descent methods [18] as special cases.

Based on the new technique, we establish rigorous convergence rates for a few well-known primal-dual methods, which are optimal (in the sense of first-order black-box models [25]) given our particular assumptions. We also discuss adaptive strategies for trading off between the objective residual |f(xk) − f⋆| and the feasibility gap ‖Axk − b‖, which enhance practical performance. Finally, we describe how strong convexity of f can be exploited, and numerically illustrate the theoretical results.

2 Preliminaries

2.1. A semi-Bregman distance.
Given a nonempty, closed and convex set Z ⊆ R^{nz}, a nonnegative, continuous and µb-strongly convex function b is called a µb-proximity function (or prox-function) of Z if Z ⊆ dom(b). Then zc := argmin_{z∈Z} b(z) exists and is unique; it is called the center point of b. Given a smooth µb-prox-function b of Z (with µb = 1), we define db(z, ẑ) := b(ẑ) − b(z) − ∇b(z)ᵀ(ẑ − z), for all z, ẑ ∈ dom(b), as the Bregman distance between z and ẑ given b. As an example, with b(z) := (1/2)‖z‖²₂, we have db(z, ẑ) = (1/2)‖z − ẑ‖²₂, which is the Euclidean distance.

In order to unify both the Bregman distance and augmented Lagrangian smoothing methods, we introduce a new semi-Bregman distance db(Sx, Sxc) between x and xc, given a matrix S. Since S is not necessarily square, we use the prefix "semi" for this measure. We also denote by:

D^S_X := sup{ db(Sx, Sxc) : x, xc ∈ X },   (4)

the semi-diameter of X. If X is bounded, then 0 ≤ D^S_X < +∞.

2.2. The dual problem of (1). Let L(x, y) := f(x) + yᵀ(Ax − b) be the Lagrange function of (1), where y ∈ Rm is the vector of Lagrange multipliers. The dual problem of (1) is defined as:

g⋆ := max_{y∈Rm} g(y),   (5)

where g is the dual function, defined as:

g(y) := min_{x∈X} { f(x) + yᵀ(Ax − b) }.   (6)

For y ∈ Rm, let us denote by x⋆(y) a solution of (6). Corresponding to x⋆(y), we also define the domain of g as dom(g) := {y ∈ Rm : x⋆(y) exists}. If f is continuous on X and X is bounded, then x⋆(y) exists for all y ∈ Rm. Unfortunately, g is nonsmooth, and numerical solutions of (5) are difficult [25]. In general, we have g(y) ≤ f(x), which is the weak-duality condition in convex optimization.
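The weak-duality inequality can be sanity-checked numerically. The snippet below builds a toy instance of (1) (f(x) = ½‖x‖², X a box, a single equality constraint; all choices are ours, not from the paper) and verifies g(y) ≤ f(x) for a feasible x over a grid of multipliers:

```python
import numpy as np

# Toy instance of (1): f(x) = 0.5*||x||^2, X = [-10, 10]^2, A = [[1, 1]], b = [2].
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

def f(x):
    return 0.5 * np.dot(x, x)

def g(y):
    # Dual function (6): the inner minimizer over X has the closed form
    # x*(y) = proj_X(-A^T y) because f is the squared norm (separable quadratic).
    x = np.clip(-(A.T @ y), -10.0, 10.0)
    return f(x) + y @ (A @ x - b)

x_feas = np.array([1.0, 1.0])                 # a feasible point: A x_feas = b
duals = [g(np.array([t])) for t in np.linspace(-5.0, 5.0, 101)]
assert max(duals) <= f(x_feas) + 1e-9         # weak duality: g(y) <= f(x)
```

On this instance the dual maximum (attained near y = −1) actually equals f⋆ = 1, so strong duality also holds, consistent with the Slater-type assumption discussed next.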
To guarantee strong duality, i.e., f⋆ = g⋆ for (1) and (5), we need an assumption:

Assumption A.1. The solution set X⋆ of (1) is nonempty. The function f is proper, closed and convex. In addition, either X is a polytope or the Slater condition holds, i.e., {x ∈ Rn : Ax = b} ∩ relint(X) ≠ ∅, where relint(X) is the relative interior of X.

Under Assumption A.1, the solution set Y⋆ of (5) is also nonempty and bounded. Moreover, strong duality holds, i.e., f⋆ = g⋆. Any point (x⋆, y⋆) ∈ X⋆ × Y⋆ is a primal-dual solution to (1) and (5), and is also a saddle point of L, i.e., L(x⋆, y) ≤ L(x⋆, y⋆) ≤ L(x, y⋆) for all (x, y) ∈ X × Rm.

2.3. Mixed-variational inequality formulation and the smoothed gap function. We use w := [x, y] ∈ Rn × Rm to denote the primal-dual variable, F(w) := [Aᵀy; b − Ax] to denote a partial Karush-Kuhn-Tucker (KKT) mapping, and W := X × Rm. Then, we can write the optimality condition of (1) as:

f(x) − f(x⋆) + F(w⋆)ᵀ(w − w⋆) ≥ 0, for all w ∈ W,   (7)

which is known as a mixed-variational inequality (MVIP) [26]. If we define:

G(w⋆) := max_{w∈W} { f(x⋆) − f(x) + F(w⋆)ᵀ(w⋆ − w) },   (8)

then G is known as the Auslender gap function of (7) [27].
By the de\ufb01nition of F , we can see that:\n\n(cid:8)f (x(cid:63)) \u2212 f (x) \u2212 (Ax \u2212 b)T y(cid:63)(cid:9) = f (x(cid:63)) \u2212 g(y(cid:63)) \u2265 0.\n\nG(w(cid:63)) := max\n[x,y]\u2208W\n\nIt is clear that G(w(cid:63)) = 0 if and only if w(cid:63) := [x(cid:63), y(cid:63)] \u2208 W (cid:63) := X (cid:63) \u00d7Y (cid:63)\u2014i.e., the strong duality.\nSince G is generally nonsmooth, we strictly smooth it by adding an augmented term:\n\nd\u03b3\u03b2(w) \u2261 d\u03b3\u03b2(x, y) := \u03b3db(Sx, Sxc) + (\u03b2/2)(cid:107)y(cid:107)2,\n\n(9)\nwhere db is a Bregman distance, S is a given matrix, and \u03b3, \u03b2 > 0 are two smoothness parameters.\nThe smoothed gap function for G is de\ufb01ned as:\n\n(cid:8)f (\u00afx) \u2212 f (x) + F ( \u00afw)T ( \u00afw \u2212 w) \u2212 d\u03b3\u03b2(w)(cid:9) ,\n\nG\u03b3\u03b2( \u00afw) := max\nw\u2208W\n\nwhere F is de\ufb01ned in (7). By the de\ufb01nition of G and G\u03b3\u03b2, we can easily show that:\n\nG\u03b3\u03b2( \u00afw) \u2264 G( \u00afw) \u2264 G\u03b3\u03b2( \u00afw) + max{d\u03b3\u03b2(w) : w \u2208 W},\n\nwhich is key to develop the algorithm in the next section.\nProblem (10) is convex, and its solution w(cid:63)\n\n(cid:40)x(cid:63)\n\n\u03b3\u03b2( \u00afw) can be computed as:\n\n(cid:8)f (x)+yT (Ax\u2212b)+\u03b3db(Sx, Sxc)(cid:9)\n\n\u03b3(\u00afy) := argmin\nx\u2208X\n\u03b2(\u00afx) := \u03b2\u22121(A\u00afx \u2212 b).\ny(cid:63)\n\nw(cid:63)\n\n\u03b3\u03b2( \u00afw) := [x(cid:63)\n\n\u03b3(\u00afy), y(cid:63)\n\n\u03b2(\u00afx)] \u21d4\n\n3\n\n(10)\n\n(11)\n\n(12)\n\n\fIn this case, the following concave function:\n\n(cid:8)f (x) + yT (Ax \u2212 b) + \u03b3db(Sx, Sxc)(cid:9) ,\n\ng\u03b3(y) := min\nx\u2208X\n\n(13)\n\ncan be considered as a smooth approximation of the dual function g de\ufb01ned by (6).\n2.4. Bregman distance smoother vs. augmented Lagrangian smoother. Depending on the choice\nof S and xc, we deal with two smoothers as follows:\n\n1. 
1. If we choose S = I, the identity matrix, and xc as the center point of b, then we obtain the Bregman distance smoother.
2. If we choose S = A, and xc ∈ X such that Axc = b, then we obtain the augmented Lagrangian smoother.

Clearly, with both smoothing techniques, the function g_γ is smooth and concave. Its gradient is Lipschitz continuous with the Lipschitz constant L^g_γ := γ⁻¹‖A‖² and L^g_γ := γ⁻¹, respectively.

3 Construction and analysis of a class of first-order primal-dual algorithms

3.1. Model-based excessive gap technique for (1). Recall that G(w⋆) = 0 iff w⋆ = [x⋆, y⋆] is a primal-dual optimal solution of (1)-(5). The goal is to construct a sequence {w̄k} such that G(w̄k) → 0, which implies that {w̄k} converges to w⋆. As suggested by (11), if we can construct two sequences {w̄k} and {(γk, βk)} such that G_{γkβk}(w̄k) → 0⁺ as γkβk ↓ 0⁺, then G(w̄k) → 0.

Inspired by Nesterov's excessive gap idea in [1], we construct the following model-based excessive gap condition for (1) in order to achieve our goal.

Definition 1 (Model-based Excessive Gap). Given w̄k ∈ W and (γk, βk) > 0, a new point w̄k+1 ∈ W and (γk+1, βk+1) > 0 with γk+1βk+1 < γkβk is said to reduce the primal-dual gap if:

G_{k+1}(w̄k+1) ≤ (1 − τk)G_k(w̄k) − ψk,   (14)

where G_k := G_{γkβk}, τk ∈ [0, 1) and ψk ≥ 0.

From Definition 1, if {w̄k} and {(γk, βk)} satisfy (14), then by induction we have G_k(w̄k) ≤ ω_k G_0(w̄0) − Ψ_k, where ω_k := Π_{j=0}^{k−1}(1 − τj) and Ψ_k := ψk + Σ_{j=1}^{k} Π_{i=j}^{k}(1 − τi) ψ_{j−1}.

Lemma 1 ([28]). Let G_{γβ} be defined by (10).
Let {w̄k} ⊂ W and {(γk, βk)} ⊂ R²₊₊ be sequences that satisfy (14), and suppose G_0(w̄0) ≤ 0. Then the objective residual |f(x̄k) − f⋆| and the primal feasibility ‖Ax̄k − b‖ of (1) can be bounded as:

−D⋆_Y ‖Ax̄k − b‖ ≤ f(x̄k) − f⋆ ≤ γk D^S_X and ‖Ax̄k − b‖ ≤ 2βk D⋆_Y + √(2 γk βk D^S_X),   (15)

where D⋆_Y := min{‖y⋆‖ : y⋆ ∈ Y⋆} is the norm of a minimum-norm dual solution.

Hence, we can derive algorithms based on (γk, βk) with a predictable convergence rate via (15). In the sequel, we manipulate τk and ψk to do just that in order to preserve (14) à la Nesterov [1]. Finally, we say that x̄k ∈ X is an ε-solution of (1) if |f(x̄k) − f⋆| ≤ ε and ‖Ax̄k − b‖ ≤ ε.

3.2. Initial points. We first show how to compute an initial point w̄0 such that G_0(w̄0) ≤ 0.

Lemma 2 ([28]). Given xc ∈ X, the point w̄0 := [x̄0, ȳ0] ∈ W computed by:

x̄0 = x⋆_{γ0}(0_m) := argmin_{x∈X} { f(x) + γ0 db(Sx, Sxc) }, and ȳ0 = y⋆_{β0}(x̄0) := β0⁻¹(Ax̄0 − b),   (16)

satisfies G_{γ0β0}(w̄0) ≤ 0 provided that β0γ0 ≥ L̄g, where L̄g is the factor in the Lipschitz constant of ∇g_γ, with g_γ given by (13) (i.e., L^g_γ = γ⁻¹L̄g).

3.3. An algorithmic template. Algorithm 1 combines the above ingredients for solving (1). We observe that the key computational step of Algorithm 1 is Step 3, where we update [x̄k+1, ȳk+1].
In the algorithm, we provide two update schemes, (1P2D) and (2P1D), based on the updates of the primal or dual variables. The primal step x⋆_{γk}(ȳk) is calculated via (12). At line 3 of (2P1D), the operator prox^S_{βf} is computed as:

prox^S_{βf}(x̂, ŷ) := argmin_{x∈X} { f(x) + ŷᵀA(x − x̂) + β⁻¹ db(Sx, Sx̂) },   (17)

where we overload the notation of the proximal operator prox defined by (3).

Algorithm 1: (A primal-dual algorithmic template using model-based excessive gap)
Inputs: Fix γ0 > 0. Choose c0 ∈ (−1, 1].
Initialization:
1: Compute a0 := 0.5(1 + c0 + √(4(1 − c0) + (1 + c0)²)), τ0 := a0⁻¹, and β0 := γ0⁻¹ L̄g (c.f. the text).
2: Compute [x̄0, ȳ0] by (16) as in Lemma 2.
For k = 0 to kmax, perform the following steps:
3: If the stopping criterion holds, then terminate. Otherwise, use one of the following schemes:

(2P1D):
  x̂k := (1 − τk)x̄k + τk x⋆_{γk}(ȳk),
  ŷk := β⁻¹_{k+1}(Ax̂k − b),
  x̄k+1 := prox^S_{βk+1 f}(x̂k, ŷk),
  ȳk+1 := (1 − τk)ȳk + τk ŷk.

(1P2D):
  ȳ⋆_k := β⁻¹_k(Ax̄k − b),
  ŷk := (1 − τk)ȳk + τk ȳ⋆_k,
  x̄k+1 := (1 − τk)x̄k + τk x⋆_{γk}(ŷk),
  ȳk+1 := ŷk + (γk/L̄g)(A x⋆_{γk}(ŷk) − b).

4: Update βk+1 := (1 − τk)βk and γk+1 := (1 − ckτk)γk. Update ck+1 from ck (optional).
5: Update ak+1 := 0.5(1 + ck+1 + √(4a²_k + (1 − ck+1)²)) and set τk+1 := a⁻¹_{k+1}.
End For
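To make the template concrete, here is a loose sketch of the (1P2D) scheme with the Bregman smoother (S = I, Euclidean db, xc = 0) on a toy equality-constrained QP. The instance, the fixed choice γk ≡ 0.1 with c0 = ck = 0, and the dual step length γk/L̄g (read off from the Lipschitz constant γ⁻¹‖A‖² of ∇gγ) are our assumptions for illustration, not the paper's tuned settings:

```python
import numpy as np

# Toy instance of (1): f(x) = 0.5*||x||^2, X = [-10, 10]^2, A = [[1, 1]], b = [2].
# Its solution is x* = [1, 1] with f* = 1.
A = np.array([[1.0, 1.0]]); b = np.array([2.0])
Lg = np.linalg.norm(A, 2) ** 2                 # Lbar_g = ||A||^2 for S = I

def x_gamma(y, gamma):
    # x*_gamma(y) in (12) for this f, with db(x, 0) = 0.5*||x||^2:
    # argmin over X of 0.5*||x||^2 + y^T(Ax - b) + gamma*db(x, 0), clipped to X.
    return np.clip(-(A.T @ y) / (1.0 + gamma), -10.0, 10.0)

gamma = 0.1                                    # kept constant (ck = 0)
beta = Lg / gamma                              # beta0 * gamma0 >= Lbar_g (Lemma 2)
a = 0.5 * (1.0 + np.sqrt(5.0))                 # a0 with c0 = 0
tau = 1.0 / a

xb = x_gamma(np.zeros(1), gamma)               # initial point (16): here xb = 0
yb = (A @ xb - b) / beta

for _ in range(2000):                          # (1P2D) updates of Algorithm 1
    y_hat = (1.0 - tau) * yb + tau * (A @ xb - b) / beta
    xs = x_gamma(y_hat, gamma)
    xb = (1.0 - tau) * xb + tau * xs
    yb = y_hat + (gamma / Lg) * (A @ xs - b)   # dual gradient step on g_gamma
    beta *= (1.0 - tau)                        # Step 4 (gamma unchanged, ck = 0)
    a = 0.5 * (1.0 + np.sqrt(4.0 * a * a + 1.0))
    tau = 1.0 / a                              # Step 5

feas = np.linalg.norm(A @ xb - b)              # primal feasibility gap
res = abs(0.5 * xb @ xb - 1.0)                 # objective residual |f - f*|
```

On this instance the iterates approach x⋆ = [1, 1], with the feasibility gap shrinking as βk decreases, in line with Lemma 1.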
At Step 2 of Algorithm 1, if we choose S := I, i.e., db(Sx, Sxc) := db(x, xc) with xc the center point of b, then we set L̄g := ‖A‖². If S := A, i.e., db(Sx, Sxc) := (1/2)‖Ax − b‖², then we set L̄g := 1.

Theorem 1 characterizes three variants of Algorithm 1; its proof can be found in [28].

Theorem 1. Let {(x̄k, ȳk)} be the sequence generated by Algorithm 1 after k iterations. Then:

a) If S = A, i.e., using the augmented Lagrangian smoother, γ0 := √L̄g = 1, and ck := 0, then the (1P2D) update satisfies:

(1P2D): −(1/2)‖Ax̄k − b‖² − D⋆_Y ‖Ax̄k − b‖ ≤ f(x̄k) − f⋆ ≤ 0 and ‖Ax̄k − b‖ ≤ 8D⋆_Y/(k + 1)².

Consequently, the worst-case complexity of Algorithm 1 to achieve an ε-solution x̄k is O(ε^{−1/2}).

b) If S = I, i.e., using the Bregman distance smoother, γ0 := √L̄g = ‖A‖, and ck := 1, then, for the (2P1D) scheme, we have:

(2P1D): −D⋆_Y ‖Ax̄k − b‖ ≤ f(x̄k) − f⋆ ≤ ‖A‖ D^I_X/(k + 1) and ‖Ax̄k − b‖ ≤ ‖A‖(2D⋆_Y + √(2D^I_X))/(k + 1).

c) Similarly, if γ0 := 2√2‖A‖/(K + 1) and ck := 0 for all k = 0, 1, …, K, then, for the (1P2D) scheme, we have:

(1P2D): −D⋆_Y ‖Ax̄K − b‖ ≤ f(x̄K) − f⋆ ≤ 2√2‖A‖ D^I_X/(K + 1) and ‖Ax̄K − b‖ ≤ 2√2‖A‖(D⋆_Y + √(D^I_X))/(K + 1).

Hence, the worst-case complexity to achieve an ε-solution x̄k of (1) in either b) or c) is O(ε⁻¹).

The (1P2D) scheme has a close relationship to some well-known primal-dual methods we describe below. Unfortunately, (1P2D) has the drawback of fixing the total number of iterations a priori, which (2P1D) avoids at the expense of one more proximal operator calculation at each iteration.

3.4. Impact of strong convexity. We can improve the above schemes when f ∈ F_µ, i.e., f is strongly convex with parameter µ_f > 0. The dual function g given in (6) is then smooth with Lipschitz-continuous gradient, with L^g_f := µ_f⁻¹‖A‖². Let us illustrate this when S = I, using the (1P2D) scheme:

(1P2Dµ):
  ŷk := (1 − τk)ȳk + τk β⁻¹_k(Ax̄k − b),
  x̄k+1 := (1 − τk)x̄k + τk x⋆(ŷk),
  ȳk+1 := ŷk + (1/L^g_f)(Ax⋆(ŷk) − b).

Here, x⋆(ŷk) is the solution of (6) at ŷk. We can still choose the starting point as in (16) with β0 := L^g_f. The parameters βk and τk at Steps 4 and 5 of Algorithm 1 are updated as βk+1 := (1 − τk)βk and τk+1 := (τk/2)(√(τk² + 4) − τk), with τ0 := (√5 − 1)/2. The following corollary illustrates the convergence of Algorithm 1 using (1P2Dµ); see [28] for the detailed proof.

Corollary 1. Let f ∈ F_µ and {(x̄k, ȳk)}_{k≥0} be generated by Algorithm 1 using (1P2Dµ). Then:

−D⋆_Y ‖Ax̄k − b‖ ≤ f(x̄k) − f⋆ ≤ 0 and ‖Ax̄k − b‖ ≤ 4‖A‖²D⋆_Y/(µ_f (k + 2)²).

Moreover, we also have ‖x̄k − x⋆‖ ≤ 4‖A‖D⋆_Y/((k + 2)µ_f).

It is important to note that, when f ∈ F_µ, we only have one smoothness parameter β and, hence, we do not need to fix the number of iterations a priori (compared with [18]).

4 Algorithmic enhancements through existing methods

Our framework can be applied to develop other variants of popular primal-dual methods for (1), including alternating minimization algorithms and alternating direction methods of multipliers. We illustrate in this section three variants of Algorithm 1. We also borrow adaptation heuristics from other algorithms to enhance our practical performance.

4.1. Proximal-based decomposition method.
We can choose x^k_c := x⋆_{γk−1}(ŷk−1). This makes the (1P2D) scheme of Algorithm 1 similar to the proximal-based decomposition algorithm in [30], which employs the proximal term db(·, x̂⋆_{k−1}) with the Bregman distance db.

4.2. ADMM. Let f and X be 2-decomposable, i.e., f(x) := f1(x1) + f2(x2) and X := X1 × X2. We can apply the (1P2D) scheme of Algorithm 1 to this case with f1 replaced by f_{γ,1}(·) := f1(·) + (γ/2)‖A1(· − x^c_1)‖² for a fixed x^c_1 ∈ X1. For this variant, we substitute the primal step of computing x⋆_γ(ŷk) = [x⋆_{γ,1}(ŷk), x⋆_{γ,2}(ŷk)] in (1P2D) by the following alternating step:

x⋆_{γ,1}(ŷk) := argmin_{x1∈X1} { f1(x1) + (ŷk)ᵀA1x1 + (ρk/2)‖A1x1 + A2x̂^k_2 − b‖² },
x⋆_{γ,2}(ŷk) := argmin_{x2∈X2} { f2(x2) + (ŷk)ᵀA2x2 + (ηk/2)‖A1x⋆_{γ,1}(ŷk) + A2x2 − b‖² }.   (18)

Here, ρk and ηk are two penalty parameters, and x̂^k_2 is the previous iterate of x⋆_{γ,2}(ŷk). The update of the parameters, as well as the complete algorithm and its convergence, can be found in [29].

4.3. Primal-dual hybrid gradient (PDHG). When A1 and A2 are not orthogonal, one can linearize the quadratic terms in both steps of (18) to obtain a new preconditioned ADMM (PADMM) algorithm that employs the proximal operators of f1 and f2 instead of two general convex subproblems. In this case, the (1P2D) scheme with (18) leads to a new variant of PADMM in [8] or of PDHG in [9]. Details of the complete algorithm can be found in [29].

4.4. Enhancements of our schemes. For the PADMM and ADMM methods, a great deal of adaptation techniques has been proposed to enhance their convergence. We can view some of these techniques in the light of the model-based excessive gap condition. For instance, Algorithm 1 decreases the smoothed gap function G_{γkβk} as illustrated in Definition 1. The actual decrease is then given by f(x̄k) − f⋆ ≤ γk(D^S_X − Ψk/γk). In practice, Dk := D^S_X − Ψk/γk can be dramatically smaller than D^S_X in the early iterations.
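For intuition on the alternating step (18): its classical specialization, the two-block scaled ADMM (with f1 strongly convex and fixed penalties ρk = ηk = ρ), can be run on a tiny consensus problem min {½(x1 − 3)² + |x2| : x1 − x2 = 0}. The instance, ρ = 1, and the iteration count are our choices:

```python
import numpy as np

# Toy 2-decomposable instance: f1(x1) = 0.5*(x1-3)^2, f2(x2) = |x2|, x1 = x2.
# The minimizer of 0.5*(z-3)^2 + |z| is z* = 2 (soft-thresholding of 3 by 1).
rho = 1.0                       # penalty parameter
x1, x2, u = 0.0, 0.0, 0.0       # u is the scaled dual variable y/rho

for _ in range(100):
    # x1-step: argmin 0.5*(x1-3)^2 + (rho/2)*(x1 - x2 + u)^2  (closed form)
    x1 = (3.0 + rho * (x2 - u)) / (1.0 + rho)
    # x2-step: argmin |x2| + (rho/2)*(x1 - x2 + u)^2  (soft-thresholding)
    v = x1 + u
    x2 = np.sign(v) * max(abs(v) - 1.0 / rho, 0.0)
    # dual update on the coupling residual x1 - x2
    u += x1 - x2
```

Both blocks converge to z⋆ = 2; the two subproblems in (18) play exactly the roles of the x1- and x2-steps here, with the multiplier term (ŷk)ᵀAixi folded into the scaled dual variable u.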
This implies that increasing γk can improve practical performance. Such a strategy indeed forms the basis of many adaptation techniques in PADMM and in ADMM. Specifically, if γk increases, then τk also increases and βk decreases. Since βk measures the primal feasibility gap Fk := ‖Ax̄k − b‖ due to Lemma 1, we should only increase γk if the feasibility gap Fk is relatively high. Indeed, when xc = x^c_k is updated adaptively, we can compute the dual feasibility gap as Hk := γk‖A1ᵀA2((x̂⋆2)^{k+1} − (x̂⋆2)^k)‖. Then, if Fk ≥ sHk for some s > 0, we increase γk+1 := cγk for some c > 1 (we use ck = c := 1.05 in practice). We can also decrease the parameter γk in (1P2D) by γk+1 := (1 − ckτk)γk, where ck := db(Sx⋆_{γk}(ŷk), Sxc)/D^S_X ∈ [0, 1], after or during the update of (x̄k+1, ȳk+1) as in (2P1D), if we know an estimate of D^S_X.

5 Numerical illustrations

5.1. Theoretical vs. practical bounds. We demonstrate the empirical performance of Algorithm 1 w.r.t. its theoretical bounds via a basic non-overlapping sparse-group basis pursuit problem:

min_{x∈Rn} { Σ_{i=1}^{ng} wi ‖x_{gi}‖₂ : Ax = b, ‖x‖∞ ≤ ρ },   (19)

where ρ > 0 is the signal magnitude, and the gi and wi are the group indices and weights, respectively.

Figure 1: Actual performance vs. theoretical bounds: [top row] the decomposable Bregman distance smoother (S = I); [bottom row] the augmented Lagrangian smoother (S = A).

In this test, we fix xc = 0n and db(x, xc) := (1/2)‖x‖². Since ρ is given, we can evaluate DX numerically.
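A synthetic instance of (19) of the kind used below can be generated along the following lines (a sketch: we use contiguous groups and unit weights for brevity, whereas the paper draws the group indices randomly):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
m, ng = n // 3, n // 8                    # m = 341 measurements, ng = 128 groups
groups = np.arange(n).reshape(ng, -1)     # 128 contiguous groups of 8 coordinates
w = np.ones(ng)                           # group weights w_i

# Ground truth: only floor(ng/8) = 16 groups are active (group-sparse signal).
active = rng.choice(ng, ng // 8, replace=False)
x_nat = np.zeros(n)
for i in active:
    x_nat[groups[i]] = rng.standard_normal(groups[i].size)

A = rng.standard_normal((m, n))           # iid standard Gaussian sensing matrix
b = A @ x_nat                             # consistent right-hand side

# Group-lasso objective of (19) at the ground truth.
f_nat = sum(w[i] * np.linalg.norm(x_nat[groups[i]]) for i in range(ng))
```

By construction x_nat is feasible for (19) (with ρ at least ‖x_nat‖∞), so f⋆ ≤ f_nat.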
By solving (19) with the SDPT3 interior-point solver [31] up to accuracy 10⁻⁸, we can numerically estimate D⋆_Y and f⋆. In the (2P1D) scheme, we set γ0 = β0 = √L̄g, while, in the (1P2D) scheme, we set γ0 := 2√2‖A‖(K + 1)⁻¹ with K := 10⁴, and we generate the theoretical bounds defined in Theorem 1.

We test the performance of the four variants using synthetic data: n = 1024, m = ⌊n/3⌋ = 341, ng = ⌊n/8⌋ = 128, and x♮ is a ⌊ng/8⌋-group-sparse vector. The matrix A is generated randomly using iid standard Gaussian entries, and b := Ax♮. The group indices gi are also generated randomly (i = 1, …, ng).

The empirical performance of the two variants (2P1D) and (1P2D) of Algorithm 1 is shown in Figure 1. The basic algorithm refers to the case where x^c_k := xc = 0n and the parameters are not tuned. Hence, each iteration of the basic (1P2D) requires only 1 proximal calculation and applies A and Aᵀ once each, while each iteration of the basic (2P1D) uses 2 proximal calculations and applies A twice and Aᵀ once. In contrast, the tuned (2P1D) and (1P2D) variants require one more application of Aᵀ per iteration for the adaptive parameter updates.

As can be seen from Figure 1 (row 1), the empirical performance of the basic variants roughly follows the O(1/k) convergence rate in terms of |f(x̄k) − f⋆| and ‖Ax̄k − b‖. The deviations from the bound are due to the increasing sparsity of the iterates, which improves empirical convergence. With a kick-factor of ck = −0.02/τk and adaptive x^c_k, both tuned variants (2P1D) and (1P2D) significantly outperform the theoretical predictions.
Indeed, they approach x(cid:63) up to 10\u221213 accuracy,\ni.e., (cid:107)\u00afxk \u2212 x(cid:63)(cid:107) \u2264 10\u221213 after a few hundreds of iterations.\nSimilarly, Figure 1 (row 2) illustrates the actual performance vs. the theoretical bounds O(1/k2) by\nusing the augmented Lagrangian smoother. Here, we solve the subproblems (13) and (17) by using\nFISTA [32] up to 10\u22128 accuracy as suggested in [28]. In this case, the theoretical bounds and the\nactual performance of the basis variants are very close to each other both in terms of |f (\u00afxk) \u2212 f (cid:63)|\nand (cid:107)A\u00afxk \u2212 b(cid:107). When the parameter \u03b3k is updated, the algorithms exhibit a better performance.\n5.2. Binary linear support vector machine. This example is concerned with the following binary\nlinear support vector machine problem:\n\nmin\nx\u2208Rn\n\n(cid:96)j(yj, wT\n\n(20)\nwhere (cid:96)j(\u00b7,\u00b7) is the Hinge loss function given by (cid:96)j(s, \u03c4 ) := max{0, 1 \u2212 s\u03c4} = [1 \u2212 s\u03c4 ]+, wj is\nthe column of a given matrix W \u2208 Rm\u00d7n, b \u2208 Rn is the intercept vector, y \u2208 {\u22121, +1}m is a\nclassi\ufb01er vector g is a given regularization function, e.g., g(x) := (\u03bb/2)(cid:107)x(cid:107)2 for the (cid:96)2-regularizer\nor g(x) := \u03bb(cid:107)x(cid:107)1 for the (cid:96)1-regularizer, where \u03bb > 0 is a regularization parameter.\nBy introducing a slack variable r = Wx \u2212 b, we can write (20) in terms of (1) as:\n\nj=1\n\n(cid:8)F (x) :=\n\n(cid:88)m\n\nj x \u2212 bj) + g(x)(cid:9),\n\n(cid:110)(cid:88)m\n\nmin\n\nx\u2208Rn,r\u2208Rm\n\nj=1\n\n7\n\n(cid:96)j(yj, rj) + g(x) : Wx \u2212 r = b\n\n(21)\n\n(cid:111)\n\n.\n\n020004000600080001000010\u22125100105#iterations|f(xk)\u2212f\u2217|inlog-scale 020004000600080001000010\u22121010\u22125100105#iterationskAxk\u2212bkinlog-scale 020004000600080001000010\u22125100105#iterations|f(xk)\u2212f\u2217|inlog-scale 
[Figure 1: |f(x̄_k) − f⋆| and ‖Ax̄_k − b‖ (log scale) vs. the number of iterations, comparing the theoretical bounds with the basic and tuned (2P1D) and (1P2D) variants.]
Now, we apply the (1P2D) variant to solve (21). We test this algorithm on (21) and compare it with LibSVM [33] using two problems from the LibSVM data set available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The first problem is a1a, which has p = 119 features and N = 1605 data points, while the second problem is news20, which has p = 1,355,191 features and N = 19,996 data points.
We compare Algorithm 1 and the LibSVM solver in terms of the final value F(x_k) of the original objective function F, the computational time, and the classification accuracy

    ca_λ := 1 − N⁻¹ Σ_{j=1}^{N} [sign((Wx_k − r)_j) ≠ y_j]

on both the training and test data sets. We randomly select 30% of the data in a1a and news20 to form a test set, and use the remaining 70% for training. We perform 10 runs and compute the average results. These average results are plotted in Fig. 2 for the two problems, respectively.
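The accuracy metric ca_λ can be computed as in the following sketch (our illustration, with synthetic W, r, x_k, y standing in for the actual datasets and iterates; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 20

# Synthetic stand-ins for the data matrix, slack iterate, primal iterate, labels.
W = rng.standard_normal((N, n))
r = rng.standard_normal(N)
x_k = rng.standard_normal(n)
y = rng.choice([-1.0, 1.0], size=N)

# ca_lambda := 1 - N^{-1} * sum_j [ sign((W x_k - r)_j) != y_j ].
# np.sign returns 0 on exact ties, which then count as misclassified;
# with continuous data this happens with probability zero.
pred = np.sign(W @ x_k - r)
ca = 1.0 - np.mean(pred != y)
```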
The upper and lower bounds show the maximum and minimum values over these 10 runs.

Figure 2: The average performance results of the two algorithms on the a1a (first row) and news20 (second row) problems.

As can be seen from these results, both solvers give roughly the same objective values and classification accuracy on these two problems, while the computational time of (1P2D) is much lower than that of LibSVM. We note that LibSVM becomes slower as the parameter λ becomes smaller due to its active-set strategy. The (1P2D) algorithm is almost independent of the regularization parameter λ, in contrast to active-set methods. In addition, the performance of (1P2D) can be further improved by exploiting its parallelization ability, which has not yet been fully exploited in our implementation.

6 Conclusions
We propose a model-based excessive gap (MEG) technique for constructing and analyzing first-order primal-dual methods that numerically approximate an optimal solution of the constrained convex optimization problem (1). Thanks to a combination of smoothing strategies and MEG, we propose, to the best of our knowledge, the first primal-dual algorithmic schemes for (1) that theoretically obtain optimal convergence rates directly, without averaging the iterates, and that seamlessly handle the p-decomposability structure. In addition, our analysis techniques can be simply adapted to handle the inexact oracles produced by approximately solving the primal subproblems (cf. [28]), which is important for the augmented Lagrangian versions with lower iteration counts. We expect a deeper understanding of MEG and different smoothing strategies to help us tailor adaptive update strategies for our schemes (as well as several other connected and well-known schemes) in order to further improve the empirical performance.
Acknowledgments.
This work is supported in part by the European Commission under the grants MIRG-268398 and ERC Future Proof, and by the Swiss Science Foundation under the grants SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633.

References
[1] Y. Nesterov, "Excessive gap technique in nonsmooth convex minimization," SIAM J. Optim., vol. 16, no. 1, pp. 235–249, 2005.
[2] D. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.
[3] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky, "The convex geometry of linear inverse problems," Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Tech. Report, 2012.
[4] M. B. McCoy, V. Cevher, Q. Tran-Dinh, A. Asaei, and L. Baldassarre, "Convexity in source separation: Models, geometry, and algorithms," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 87–95, 2014.
[5] M. J. Wainwright, "Structured regularizers for high-dimensional problems: Statistical and computational issues," Annual Review of Statistics and Its Application, vol. 1, pp. 233–253, 2014.
[6] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2013.
[7] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Model. Simul., vol. 4, pp. 1168–1200, 2005.
[8] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.
[9] T. Goldstein, E. Esser, and R. Baraniuk, "Adaptive primal-dual hybrid gradient methods for saddle point problems," Tech. Report, pp. 1–26, 2013. Online at: http://arxiv.org/pdf/1305.0546v1.pdf.
[10] B. He and X. Yuan, "On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers," Numer. Math., DOI 10.1007/s00211-014-0673-6, 2014.
[11] B. He and X. Yuan, "On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method," SIAM J. Numer. Anal., vol. 50, pp. 700–709, 2012.
[12] Y. Ouyang, Y. Chen, G. Lan, and E. J. Pasiliao, "An accelerated linearized alternating direction method of multiplier," Tech. Report, 2014.
[13] R. Shefi and M. Teboulle, "Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization," SIAM J. Optim., vol. 24, no. 1, pp. 269–297, 2014.
[14] H. Wang and A. Banerjee, "Bregman alternating direction method of multipliers," Tech. Report, pp. 1–18, 2013. Online at: http://arxiv.org/pdf/1306.3203v1.pdf.
[15] H. Ouyang, N. He, L. Q. Tran, and A. Gray, "Stochastic alternating direction method of multipliers," JMLR W&CP, vol. 28, pp. 80–88, 2013.
[16] T. Goldstein, B. O'Donoghue, and S. Setzer, "Fast alternating direction optimization methods," SIAM J. Imaging Sci., vol. 7, no. 3, pp. 1588–1623, 2014.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[18] A. Beck and M. Teboulle, "A fast dual proximal gradient algorithm for convex minimization and applications," Oper. Res. Lett., vol. 42, no. 1, pp. 1–6, 2014.
[19] W. Deng and W. Yin, "On the global and linear convergence of the generalized alternating direction method of multipliers," Rice University CAAM, Tech. Rep. TR12-14, 2012.
[20] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, 1996.
[21] R. T. Rockafellar, "Augmented Lagrangians and applications of the proximal point algorithm in convex programming," Mathematics of Operations Research, vol. 1, pp. 97–116, 1976.
[22] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, no. 1, pp. 127–152, 2005.
[23] G. Lan and R. Monteiro, "Iteration-complexity of first-order augmented Lagrangian methods for convex programming," Math. Program., DOI 10.1007/s10107-015-0861-x, 2015.
[24] V. Nedelcu, I. Necoara, and Q. Tran-Dinh, "Computational complexity of inexact gradient augmented Lagrangian methods: Application to constrained MPC," SIAM J. Control Optim., vol. 52, no. 5, pp. 3109–3134, 2014.
[25] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004, vol. 87.
[26] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems. New York: Springer-Verlag, 2003, vol. 1–2.
[27] A. Auslender, Optimisation: Méthodes Numériques. Paris: Masson, 1976.
[28] Q. Tran-Dinh and V. Cevher, "A primal-dual algorithmic framework for constrained convex minimization," Tech. Report, LIONS, pp. 1–54, 2014.
[29] Q. Tran-Dinh and V. Cevher, "Optimal-rate and tuning-free alternating algorithms for constrained convex optimization," Tech. Report, LIONS, 2015.
[30] G. Chen and M. Teboulle, "A proximal-based decomposition method for convex minimization problems," Math. Program., vol. 64, pp. 81–101, 1994.
[31] K.-C. Toh, M. Todd, and R. Tütüncü, "On the implementation and usage of SDPT3 – a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," NUS Singapore, Tech. Report, 2010.
[32] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[33] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 27, pp. 1–27, 2011.