{"title": "ADMM without a Fixed Penalty Parameter: Faster Convergence with New Adaptive Penalization", "book": "Advances in Neural Information Processing Systems", "page_first": 1267, "page_last": 1277, "abstract": "Alternating direction method of multipliers (ADMM) has received tremendous interest for solving numerous problems in machine learning, statistics and signal processing. However, it is known that the performance of ADMM and many of its variants is very sensitive to the penalty parameter of a quadratic penalty applied to the equality constraints. Although several approaches have been proposed for dynamically changing this parameter during the course of optimization, they do not yield theoretical improvement in the convergence rate and are not directly applicable to stochastic ADMM. In this paper, we develop a new ADMM and its linearized variant with a new adaptive scheme to update the penalty parameter. Our methods can be applied under both deterministic and stochastic optimization settings for structured non-smooth objective function. The novelty of the proposed scheme lies at that it is adaptive to a local sharpness property of the objective function, which marks the key difference from previous adaptive scheme that adjusts the penalty parameter per-iteration based on certain conditions on iterates. On theoretical side, given the local sharpness characterized by an exponent $\\theta\\in(0, 1]$, we show that the proposed ADMM enjoys an improved iteration complexity of $\\widetilde O(1/\\epsilon^{1-\\theta})$\\footnote{$\\widetilde O()$ suppresses a logarithmic factor.} in the deterministic setting and an iteration complexity of $\\widetilde O(1/\\epsilon^{2(1-\\theta)})$ in the stochastic setting without smoothness and strong convexity assumptions. The complexity in either setting improves that of the standard ADMM which only uses a fixed penalty parameter. 
On the practical side, we demonstrate that the proposed algorithms converge comparably to, if not much faster than, ADMM with a fine-tuned fixed penalty parameter.", "full_text": "ADMM without a Fixed Penalty Parameter: Faster Convergence with New Adaptive Penalization\n\nYi Xu\u2020, Mingrui Liu\u2020, Qihang Lin\u2021, Tianbao Yang\u2020\n\u2020Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA\n\u2021Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA\n{yi-xu, mingrui-liu, qihang-lin, tianbao-yang}@uiowa.edu\n\nAbstract\n\nAlternating direction method of multipliers (ADMM) has received tremendous interest for solving numerous problems in machine learning, statistics and signal processing. However, it is known that the performance of ADMM and many of its variants is very sensitive to the penalty parameter of a quadratic penalty applied to the equality constraints. Although several approaches have been proposed for dynamically changing this parameter during the course of optimization, they do not yield a theoretical improvement in the convergence rate and are not directly applicable to stochastic ADMM. In this paper, we develop a new ADMM and its linearized variant with a new adaptive scheme for updating the penalty parameter. Our methods can be applied under both deterministic and stochastic optimization settings for structured non-smooth objective functions. The novelty of the proposed scheme lies in the fact that it is adaptive to a local sharpness property of the objective function, which marks the key difference from previous adaptive schemes that adjust the penalty parameter per-iteration based on certain conditions on the iterates. 
On the theoretical side, given the local sharpness characterized by an exponent $\\theta \\in (0, 1]$, we show that the proposed ADMM enjoys an improved iteration complexity of $\\widetilde O(1/\\epsilon^{1-\\theta})$^1 in the deterministic setting and an iteration complexity of $\\widetilde O(1/\\epsilon^{2(1-\\theta)})$ in the stochastic setting, without smoothness and strong convexity assumptions. The complexity in either setting improves that of the standard ADMM, which only uses a fixed penalty parameter. On the practical side, we demonstrate that the proposed algorithms converge comparably to, if not much faster than, ADMM with a fine-tuned fixed penalty parameter.\n\n1 Introduction\n\nOur problem of interest is the following convex optimization problem that commonly arises in machine learning, statistics and signal processing:\n\n$$\\min_{x\\in\\Omega} F(x) \\triangleq f(x) + \\psi(Ax) \\qquad (1)$$\n\nwhere $\\Omega \\subseteq \\mathbb R^d$ is a closed convex set, $f: \\mathbb R^d \\to \\mathbb R$ and $\\psi: \\mathbb R^m \\to \\mathbb R$ are proper lower-semicontinuous convex functions, and $A \\in \\mathbb R^{m\\times d}$ is a matrix. In this paper, we consider solving (1) by the alternating direction method of multipliers (ADMM) in two paradigms, namely deterministic optimization and stochastic optimization. In both paradigms, ADMM has been employed widely for solving regularized statistical learning problems like (1) due to its capability of tackling the sophisticated structured regularization term $\\psi(Ax)$ in (1) (e.g., the generalized lasso $\\|Ax\\|_1$), which is often an obstacle for applying other methods such as the proximal gradient method.\n\n^1 $\\widetilde O()$ suppresses a logarithmic factor.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn the following, we describe the standard ADMM and its variants for solving (1) in different optimization paradigms. 
It is worth mentioning that all algorithms presented in this paper can be easily extended to handle a more general term $\\psi(\\mathcal A(x) + c)$, where $\\mathcal A$ is a linear mapping.\n\nTo apply ADMM, the original problem (1) is first cast into an equivalent constrained optimization problem via decoupling:\n\n$$\\min_{x\\in\\Omega, y\\in\\mathbb R^m} f(x) + \\psi(y), \\quad \\text{s.t. } y = Ax. \\qquad (2)$$\n\nAn augmented Lagrangian function for (2) is defined as\n\n$$L(x, y, \\lambda) = f(x) + \\psi(y) - \\lambda^\\top(Ax - y) + \\frac{\\beta}{2}\\|Ax - y\\|_2^2, \\qquad (3)$$\n\nwhere $\\beta$ is a constant called the penalty parameter and $\\lambda \\in \\mathbb R^m$ is a dual variable. Then, the standard ADMM solves problem (1) by executing the following three steps in each iteration:\n\n$$x_{\\tau+1} = \\arg\\min_{x\\in\\Omega} L(x, y_\\tau, \\lambda_\\tau) = \\arg\\min_{x\\in\\Omega} f(x) + \\frac{\\beta}{2}\\left\\|(Ax - y_\\tau) - \\frac{1}{\\beta}\\lambda_\\tau\\right\\|_2^2, \\qquad (4)$$\n\n$$y_{\\tau+1} = \\arg\\min_{y\\in\\mathbb R^m} L(x_{\\tau+1}, y, \\lambda_\\tau) = \\arg\\min_{y\\in\\mathbb R^m} \\psi(y) + \\frac{\\beta}{2}\\left\\|(Ax_{\\tau+1} - y) - \\frac{1}{\\beta}\\lambda_\\tau\\right\\|_2^2, \\qquad (5)$$\n\n$$\\lambda_{\\tau+1} = \\lambda_\\tau - \\beta(Ax_{\\tau+1} - y_{\\tau+1}). \\qquad (6)$$\n\nWhen $A$ is not an identity matrix, solving the subproblem (4) above for $x_{\\tau+1}$ might be difficult. To alleviate the issue, linearized ADMM [33, 34, 8] has been proposed, which solves the following problem instead of (4):\n\n$$x_{\\tau+1} = \\arg\\min_{x\\in\\Omega} f(x) + \\frac{\\beta}{2}\\left\\|(Ax - y_\\tau) - \\frac{1}{\\beta}\\lambda_\\tau\\right\\|_2^2 + \\frac{1}{2}\\|x - x_\\tau\\|_G^2, \\qquad (7)$$\n\nwhere $\\|x\\|_G = \\sqrt{x^\\top G x}$ and $G \\in \\mathbb R^{d\\times d}$ is a positive semi-definite matrix. 
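To make the three updates (4)-(6) concrete, here is a minimal runnable sketch for a toy instance in which both subproblems have closed forms: we assume f(x) = 0.5*||x - b||^2, psi = delta*||.||_1 and Omega = R^d (these choices, and all function names, are illustrative and not from the paper). The x-step then reduces to a linear solve and the y-step to soft-thresholding.

```python
import numpy as np

def soft_threshold(v, k):
    # prox of k*||.||_1: elementwise shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm(A, b, beta=1.0, delta=0.1, t=200):
    """Standard ADMM (steps (4)-(6)) for the toy instance
    min_x 0.5*||x - b||^2 + delta*||A x||_1, i.e. f(x) = 0.5*||x - b||^2."""
    m, d = A.shape
    x = np.zeros(d); y = A @ x; lam = np.zeros(m)
    # x-step (4) has the closed form (I + beta*A^T A) x = b + A^T (beta*y + lam)
    M = np.eye(d) + beta * A.T @ A
    for _ in range(t):
        x = np.linalg.solve(M, b + A.T @ (beta * y + lam))
        # y-step (5): prox of psi evaluated at A x - lam/beta
        y = soft_threshold(A @ x - lam / beta, delta / beta)
        # dual step (6)
        lam = lam - beta * (A @ x - y)
    return x, y, lam
```

For a non-identity A the x-step still requires solving a d-by-d linear system at every iteration, which is exactly the difficulty that motivates the linearized variant (7).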
By setting $G = \\gamma I - \\beta A^\\top A \\succeq 0$, the term $x^\\top A^\\top A x$ in (7) vanishes. It has been established that both standard ADMM and linearized ADMM have an $O(1/t)$ convergence rate for solving (2) [8], where $t$ is the number of iterations. Under a minor condition, this result implies an $O(1/\\epsilon)$ iteration complexity for solving the original problem (1) (see Corollary 1).\n\nIn addition, we consider ADMM for solving (1) in stochastic optimization with\n\n$$f(x) = \\mathrm E_\\xi[f(x; \\xi)] \\qquad (8)$$\n\nwhere $\\xi$ is a random variable. This formulation captures many risk minimization problems in machine learning, where $\\xi$ denotes a data point sampled from a distribution and $f(x; \\xi)$ denotes the loss of the model $x$ on the data $\\xi$. It also covers as a special case the empirical loss, where $f(x) = \\frac{1}{n}\\sum_{i=1}^n f(x; \\xi_i)$ with $n$ being the number of samples. For these problems, computing $f(x)$ itself might be prohibitive (e.g., when $n$ is very large) or even impossible. To address this issue, one usually considers the stochastic optimization paradigm, where it is assumed that $f(x; \\xi)$ and its subgradient $\\partial f(x; \\xi)$ can be efficiently computed. To solve the stochastic optimization problem, stochastic ADMM algorithms have been proposed [21, 23], which update $y_{\\tau+1}$ and $\\lambda_{\\tau+1}$ the same as in (5) and (6), respectively, but update $x_{\\tau+1}$ as\n\n$$x_{\\tau+1} = \\arg\\min_{x\\in\\Omega} f(x_\\tau; \\xi_\\tau) + \\partial f(x_\\tau; \\xi_\\tau)^\\top(x - x_\\tau) + \\frac{\\beta}{2}\\left\\|(Ax - y_\\tau) - \\frac{1}{\\beta}\\lambda_\\tau\\right\\|_2^2 + \\frac{1}{2\\eta_\\tau}\\|x - x_\\tau\\|_{G_\\tau}^2, \\qquad (9)$$\n\nwhere $\\xi_\\tau$ is a random sample, $\\eta_\\tau$ is a stepsize, and $G_\\tau = \\gamma I - \\beta\\eta_\\tau A^\\top A \\succeq I$ [23] or $G_\\tau = I$ [21]. Other stochastic variants of ADMM for general convex optimization were also proposed in [23, 35]. These works have established an $O(1/\\sqrt t)$ convergence rate of stochastic ADMM for solving (2) with $f(x)$ being (8). 
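A minimal sketch of the stochastic x-update (9) with the simpler choice G_tau = I from [21] (the toy dimensions and helper names below are ours): setting the gradient of the objective in (9) to zero shows the update is a small linear solve.

```python
import numpy as np

def sadmm_x_update(x, y, lam, xi, A, beta, eta, grad_f):
    """One stochastic x-update (9) with G_tau = I (the choice in [21]):
    x+ = argmin_x g^T x + (beta/2)*||A x - y - lam/beta||^2 + ||x - x_tau||^2/(2*eta),
    where g is a (stochastic) subgradient of f(.; xi) at the current x."""
    d = x.shape[0]
    g = grad_f(x, xi)                       # stochastic subgradient
    M = beta * (A.T @ A) + np.eye(d) / eta  # quadratic coefficient of the subproblem
    rhs = x / eta - g + A.T @ (beta * y + lam)
    return np.linalg.solve(M, rhs)
```

As a sanity check, for vanishingly small beta the update degenerates to a plain stochastic subgradient step x - eta*g, and for general beta the returned point makes the gradient of the subproblem vanish.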
Under a minor condition, we can also show that these stochastic ADMM algorithms suffer from a higher iteration complexity of $O(1/\\epsilon^2)$ for finding an $\\epsilon$-optimal solution to the original problem (1) (see Corollary 3).\n\nAlthough variants of ADMM with fast convergence rates have been developed under smoothness, strong convexity and other regularity conditions (e.g., the matrix $A$ having full rank), the best iteration complexities of deterministic ADMM and stochastic ADMM for general convex optimization remain $O(1/\\epsilon)$ and $O(1/\\epsilon^2)$, respectively. On the other hand, many studies have reported that the performance of ADMM is very sensitive to the penalty parameter $\\beta$. How to address or alleviate this issue has attracted many studies and remains an active topic. In particular, it remains an open question how to quantify the improvement in ADMM's theoretical convergence by using adaptive penalty parameters. Of course, the answer to this question depends on the adaptive scheme being used. Almost all previous works focus on self-adaptive schemes that update the penalty parameter during the course of optimization according to the historical iterates (e.g., by balancing the primal residual and the dual residual). However, there is hitherto no quantifiable improvement in terms of convergence rate or iteration complexity for these self-adaptive schemes.\n\nIn this paper, we focus on the design of adaptive penalization for both deterministic and stochastic ADMM and show that, with the proposed adaptive updating scheme for the penalty parameter, the theoretical convergence properties of ADMM can be improved without imposing any smoothness and strong convexity assumptions on the objective function. 
The key difference between the proposed adaptive scheme and previous self-adaptive schemes is that the proposed penalty parameter is adaptive to a local sharpness property of the objective function, namely the local error bound (see Definition 1). Given the exponent constant $\\theta \\in (0, 1]$ that characterizes this local sharpness property, we show that the proposed deterministic ADMM enjoys an improved iteration complexity of $\\widetilde O(1/\\epsilon^{1-\\theta})$ and the proposed stochastic ADMM enjoys an iteration complexity of $\\widetilde O(1/\\epsilon^{2(1-\\theta)})$, both of which improve the complexity of their standard counterparts, which only use a fixed penalty parameter. To the best of our knowledge, this is the first evidence that an adaptive penalty parameter used in ADMM can lead to provably lower iteration complexities. We call the proposed ADMM algorithms locally adaptive ADMM because of their adaptivity to the problem's local property.\n\n2 Related Work\n\nSince there is a tremendous amount of studies on ADMM, the review below mainly focuses on ADMMs with a variable penalty parameter. A convergence rate of $O(1/t)$ was first shown for both the standard and linearized variants of ADMM [8, 19, 9] on general non-smooth and non-strongly convex problems. Later, smoothness and strong convexity assumptions were introduced to develop faster convergence rates of ADMMs [22, 3, 11, 6]. Stochastic ADMM was considered in [21, 23], with a convergence rate of $O(1/\\sqrt t)$ for general convex problems and $\\widetilde O(1/t)$ for strongly convex problems. Recently, many variance reduction techniques have been borrowed into stochastic ADMM to achieve improved convergence rates for finite-sum optimization problems where $f(x) = \\frac{1}{n}\\sum_{i=1}^n f_i(x)$, under the smoothness and strong convexity assumptions [37, 36, 24]. Nevertheless, most of these aforementioned works focus on using a constant penalty parameter.\n\nHe et al. 
[10] analyzed ADMM with self-adaptive penalty parameters. The motivation for their self-adaptive penalty is to balance the order of the primal residual and the dual residual. However, the convergence of ADMM with a self-adaptive penalty is not guaranteed unless the adaptive scheme is turned off after a number of iterations. Additionally, their self-adaptive rule requires computing the deterministic subgradient of $f(x)$, so it is not appropriate for stochastic optimization. Tian & Yuan [25] proposed a variant of ADMM with variable penalty parameters. Their analysis and algorithm require smoothness of $\\psi(Ax)$ and full column rank of the matrix $A$. Zhou et al. [15] focused on solving low-rank representation by linearized ADMM and also proposed a non-decreasing self-adaptive penalty scheme. However, their scheme is only applicable to an equality constraint $Ax + By = c$ with $c \\neq 0$. Recently, Xu et al. [31] proposed a self-adaptive penalty scheme for ADMM based on the Barzilai and Borwein gradient methods. The convergence of their ADMM relies on the analysis in He et al. [10] and thus requires the penalty parameter to be fixed after a number of iterations. In contrast, our adaptive scheme for the penalty parameter differs from the previous methods in the following aspects: (i) it is adaptive to the local sharpness property of the problem; (ii) it allows the penalty parameter to increase to infinity as the algorithm proceeds; (iii) it can be employed for both deterministic and stochastic ADMMs as well as their linearized versions. It is also notable that the presented algorithms and their convergence theory share many similarities with recent developments leveraging the local error bound condition [32, 30, 29], where similar iteration complexities have been established. 
However, we would like to emphasize that the newly proposed ADMM algorithms are more effective at tackling problems with structured regularizers (e.g., generalized lasso) than the methods in [32, 30, 29], and have the additional unique feature of using an adaptive penalty parameter.\n\n3 Preliminaries\n\nRecall the problem of our interest:\n\n$$\\min_{x\\in\\Omega} F(x) \\triangleq f(x) + \\psi(Ax), \\qquad (10)$$\n\nwhere $\\Omega \\subseteq \\mathbb R^d$ is a closed convex set, $f: \\mathbb R^d \\to (-\\infty, +\\infty]$ and $\\psi: \\mathbb R^m \\to (-\\infty, +\\infty]$ are proper lower-semicontinuous convex functions, and $A \\in \\mathbb R^{m\\times d}$ is a matrix. Let $\\Omega_*$ and $F_*$ denote the optimal set of (10) and the optimal value, respectively. We present some assumptions that will be used in the paper.\n\nAssumption 1. For the convex optimization problem (10), we assume: (a) there exist known $x_0 \\in \\Omega$ and $\\epsilon_0 \\ge 0$ such that $F(x_0) - F_* \\le \\epsilon_0$; (b) $\\Omega_*$ is a non-empty convex compact set; (c) there exists a constant $\\rho$ such that $\\|\\partial\\psi(y)\\|_2 \\le \\rho$ for all $y$; (d) $\\psi$ is defined everywhere.\n\nFor a positive semi-definite matrix $G$, the $G$-norm is defined as $\\|x\\|_G = \\sqrt{x^\\top G x}$. Let $B(x, r) = \\{u \\in \\mathbb R^d : \\|u - x\\|_2 \\le r\\}$ denote the Euclidean ball centered at $x$ with radius $r$. We denote by $\\mathrm{dist}(x, \\Omega_*)$ the distance between $x$ and the set $\\Omega_*$, i.e., $\\mathrm{dist}(x, \\Omega_*) = \\min_{v\\in\\Omega_*} \\|x - v\\|_2$. We denote by $S_\\epsilon$ the $\\epsilon$-sublevel set of $F(x)$, i.e., $S_\\epsilon = \\{x \\in \\Omega : F(x) \\le F_* + \\epsilon\\}$.\n\nLocal Sharpness. Below, we introduce a condition, namely the local error bound condition, to characterize the local sharpness property of the objective function.\n\nDefinition 1 (Local error bound (LEB)). 
A function $F(x)$ is said to satisfy a local error bound condition on the $\\epsilon$-sublevel set if there exist $\\theta \\in (0, 1]$ and $c > 0$ such that for any $x \\in S_\\epsilon$,\n\n$$\\mathrm{dist}(x, \\Omega_*) \\le c(F(x) - F_*)^\\theta. \\qquad (11)$$\n\nRemark: We will refer to $\\theta$ as the local sharpness parameter. A recent study [1] has shown that the local error bound condition is equivalent to the famous Kurdyka-\u0141ojasiewicz (KL) property [13], which states that under a transformation $\\psi(s) = cs^\\theta$, the function $F(x)$ can be made sharp around the optimal solutions, i.e., the norm of the subgradient of the transformed function $\\psi(F(x) - F_*)$ is lower bounded by the constant 1. Note that by allowing $\\theta = 0$ in the above condition we can capture a full spectrum of functions. However, a broad family of functions have a sharper upper bound, i.e., a non-zero constant $\\theta$ in the above condition. For example, for functions that are semi-algebraic and continuous, the above inequality is known to hold on any compact set (c.f. [1] and references therein). The value of $\\theta$ has been revealed for many functions (c.f. [18, 14, 20, 1, 32]).\n\n4 Locally Adaptive ADMM for Deterministic Optimization\n\nSince the proposed locally adaptive ADMM algorithm builds upon the standard ADMM, we first present the detailed steps of ADMM in Algorithm 1. Note that if we set $G = 0 \\in \\mathbb R^{d\\times d}$, it gives the standard ADMM; and if we use $G = \\gamma I - \\beta A^\\top A \\succeq 0$, it gives the linearized variant, which can make the computation of $x_{\\tau+1}$ easier. To ensure $G \\succeq 0$, the minimum valid value for $\\gamma$ in the linearized variant is $\\beta\\|A\\|_2^2$. 
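This positive semi-definiteness requirement is easy to verify numerically. The following sketch (illustrative values only) checks that gamma = beta*||A||_2^2, i.e. beta times the squared spectral norm of A, makes G = gamma*I - beta*A^T A positive semi-definite, and that any smaller gamma does not.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
beta = 2.5

# minimal valid gamma: beta times the squared spectral norm of A
gamma = beta * np.linalg.norm(A, 2) ** 2
G = gamma * np.eye(3) - beta * A.T @ A

# smallest eigenvalue of G is (numerically) zero, so G is PSD
min_eig = np.linalg.eigvalsh(G).min()
# shrinking gamma by any factor breaks positive semi-definiteness
min_eig_bad = np.linalg.eigvalsh(0.99 * gamma * np.eye(3) - beta * A.T @ A).min()
```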
To present the convergence result of ADMM (Algorithm 1), we first introduce some notations:\n\n$$u = \\begin{pmatrix} x \\\\ y \\\\ \\lambda \\end{pmatrix}, \\quad F(u) = \\begin{pmatrix} -A^\\top\\lambda \\\\ \\lambda \\\\ Ax - y \\end{pmatrix}, \\quad \\widehat x_t = \\frac{1}{t}\\sum_{\\tau=1}^t x_\\tau, \\quad \\widehat y_t = \\frac{1}{t}\\sum_{\\tau=1}^t y_\\tau, \\quad \\widehat\\lambda_t = \\frac{1}{t}\\sum_{\\tau=1}^t \\lambda_\\tau, \\quad \\widehat u_t = \\frac{1}{t}\\sum_{\\tau=1}^t u_\\tau.$$\n\nWe recall the convergence result of [8] for the equality constrained problem (2), which does not assume any smoothness, strong convexity or other regularity conditions.\n\nAlgorithm 1 ADMM($x_0, \\beta, t$)\n1: Input: $x_0 \\in \\Omega$, the penalty parameter $\\beta$, the number of iterations $t$\n2: Initialize: $x_1 = x_0$, $y_1 = Ax_1$, $\\lambda_1 = 0$, $\\gamma = \\beta\\|A\\|_2^2$, and $G = \\gamma I - \\beta A^\\top A$ or $G = 0$\n3: for $\\tau = 1, \\ldots, t$ do\n4: Update $x_{\\tau+1}$ by (7), $y_{\\tau+1}$ by (5)\n5: Update $\\lambda_{\\tau+1}$ by (6)\n6: end for\n7: Output: $\\widehat x_t = \\sum_{\\tau=1}^t x_\\tau/t$\n\nAlgorithm 2 LA-ADMM($x_0, \\beta_1, K, t$)\n1: Input: $x_0 \\in \\Omega$, the number of stages $K$, the number of iterations $t$ per stage, and the initial value of the penalty parameter $\\beta_1$\n2: for $k = 1, \\ldots, K$ do\n3: Let $x_k$ = ADMM($x_{k-1}, \\beta_k, t$)\n4: Update $\\beta_{k+1} = 2\\beta_k$\n5: end for\n6: Output: $x_K$\n\nProposition 1 (Theorem 4.1 in [8]). For any $x \\in \\Omega$, $y \\in \\mathbb R^m$ and $\\lambda \\in \\mathbb R^m$, we have\n\n$$f(\\widehat x_t) + \\psi(\\widehat y_t) - [f(x) + \\psi(y)] + (\\widehat u_t - u)^\\top F(u) \\le \\frac{\\|x - x_1\\|_G^2}{2t} + \\frac{\\beta\\|y - y_1\\|_2^2}{2t} + \\frac{\\|\\lambda - \\lambda_1\\|_2^2}{2\\beta t}.$$\n\nRemark: The above result establishes a convergence rate for the variational inequality pertaining to (2). When $t \\to \\infty$, $(\\widehat x_t, \\widehat y_t)$ converges to the optimal solutions of (2) at a rate of $O(1/t)$.\n\nSince our goal is to solve problem (1), next we present a corollary exhibiting the convergence of ADMM for solving the original problem (1). All omitted proofs can be found in the supplement.\n\nCorollary 1. Suppose Assumptions 1.c and 1.d hold. Let $\\widehat x_t$ be the output of ADMM. For any $x \\in \\Omega$, we have\n\n$$F(\\widehat x_t) - F(x) \\le \\frac{\\|x - x_0\\|_G^2}{2t} + \\frac{\\beta\\|A\\|_2^2\\|x - x_0\\|_2^2}{2t} + \\frac{\\rho^2}{2\\beta t}.$$\n\nRemark: For the standard ADMM with $G = 0$, the first term on the R.H.S. vanishes. For the linearized ADMM with $G = \\gamma I - \\beta A^\\top A \\succeq 0$, we can bound $\\|x - x_0\\|_G^2 \\le \\gamma\\|x - x_0\\|_2^2$. One can also derive a theoretically optimal value of $\\beta$ by setting $x = x_* \\in \\Omega_*$ and minimizing the upper bound, which results in $\\beta = \\frac{\\rho}{\\|A\\|_2\\|x_* - x_0\\|_2}$ for the standard ADMM or $\\beta = \\frac{\\rho}{\\sqrt{2}\\|A\\|_2\\|x_* - x_0\\|_2}$ for the linearized ADMM. Finally, the above result implies that the iteration complexity of standard and linearized ADMM for finding an $\\epsilon$-optimal solution of (1) is $O\\left(\\frac{\\rho\\|A\\|_2\\|x_* - x_0\\|_2}{\\epsilon}\\right)$.\n\nNext, we present our locally adaptive ADMM and our main result in this section regarding its iteration complexity. The proposed algorithm is described in Algorithm 2, which is referred to as LA-ADMM. The algorithm runs in multiple stages by calling ADMM at each stage with a warm start and a constant number of iterations $t$. 
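The stage-wise structure of Algorithm 2 can be sketched as follows, reusing a toy inner problem f(x) = 0.5*||x - b||^2, psi = delta*||.||_1 (an illustrative choice of ours, not the paper's experimental setup): each stage warm-starts from the previous stage's output and the penalty parameter is doubled after each stage.

```python
import numpy as np

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_stage(x0, A, b, delta, beta, t):
    """Inner solver: standard ADMM on the toy problem
    min_x 0.5*||x - b||^2 + delta*||A x||_1, warm-started at x0.
    Returns the averaged iterate, as in Algorithm 1."""
    m, d = A.shape
    x = x0.copy(); y = A @ x; lam = np.zeros(m)
    M = np.eye(d) + beta * A.T @ A
    xs = []
    for _ in range(t):
        x = np.linalg.solve(M, b + A.T @ (beta * y + lam))
        y = soft_threshold(A @ x - lam / beta, delta / beta)
        lam = lam - beta * (A @ x - y)
        xs.append(x)
    return np.mean(xs, axis=0)

def la_admm(x0, A, b, delta, beta1, K, t):
    """LA-ADMM (Algorithm 2): K stages with warm start, beta doubled per stage."""
    x, beta = x0, beta1
    for _ in range(K):
        x = admm_stage(x, A, b, delta, beta, t)
        beta *= 2.0
    return x
```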
The penalty parameter $\\beta_k$ is increased by a constant factor larger than 1 (e.g., 2) after each stage and has an initial value dependent on $\\rho$, $\\|A\\|_2$, $\\epsilon_0$, $\\theta$ and the targeted accuracy $\\epsilon$. The convergence result of LA-ADMM employing $G = \\gamma I - \\beta A^\\top A$ is established below. A slightly better result in terms of a constant factor can be established for employing $G = 0$.\n\nTheorem 2. Suppose Assumption 1 holds and $F(x)$ obeys a local error bound condition on the $\\epsilon$-sublevel set. Let $\\beta_1 = \\frac{2\\rho\\epsilon^{1-\\theta}}{\\|A\\|_2\\epsilon_0}$, $K = \\lceil\\log_2(\\epsilon_0/\\epsilon)\\rceil$ and $t = \\left\\lceil\\frac{8\\rho\\|A\\|_2\\max(1, c^2)}{\\epsilon^{1-\\theta}}\\right\\rceil$. Then we have $F(x_K) - F_* \\le 2\\epsilon$. The iteration complexity of LA-ADMM for achieving a $2\\epsilon$-optimal solution is $\\widetilde O(1/\\epsilon^{1-\\theta})$.\n\nRemark: There are two levels of adaptivity of the penalty parameter to the local sharpness. First, the initial value $\\beta_1$ in Algorithm 2 depends on the local sharpness parameter $\\theta$. Second, the time interval at which the penalty parameter is increased is determined by the value of $t$, which is also dependent on $\\theta$. Compared to the iteration complexity $O(1/\\epsilon)$ of vanilla ADMM, LA-ADMM can enjoy a lower iteration complexity.\n\n5 Locally Adaptive ADMM for Stochastic Optimization\n\nIn this section, we consider the stochastic optimization problem\n\n$$\\min_{x\\in\\Omega} F(x) \\triangleq \\mathrm E_\\xi[f(x; \\xi)] + \\psi(Ax), \\qquad (12)$$\n\nwhere $\\xi$ is a random variable and $f(x; \\xi): \\mathbb R^d \\to (-\\infty, +\\infty]$ is a proper lower-semicontinuous convex function for each realization of $\\xi$. For this problem, in addition to Assumption 1, we make the following assumption for our development, which is standard for many previous stochastic gradient methods.\n\nAssumption 2. For the stochastic optimization problem (12), we assume that there exists a constant $R$ such that $\\|\\partial f(x; \\xi)\\|_2 \\le R$ almost surely for any $x \\in \\Omega$.\n\nAlgorithm 3 SADMM($x_0, \\eta, \\beta, t, \\Omega$)\n1: Input: $x_0 \\in \\mathbb R^d$, a step size $\\eta$, the penalty parameter $\\beta$, the number of iterations $t$, and a domain $\\Omega$\n2: Initialize: $x_1 = x_0$, $y_1 = Ax_1$, $\\lambda_1 = 0$\n3: for $\\tau = 1, \\ldots, t$ do\n4: Update $x_{\\tau+1}$ by (9) and $y_{\\tau+1}$ by (5)\n5: Update $\\lambda_{\\tau+1}$ by (6)\n6: end for\n7: Output: $\\widehat x_t = \\sum_{\\tau=1}^t x_\\tau/t$\n\nAlgorithm 4 LA-SADMM($x_0, \\eta_1, \\beta_1, D_1, K, t$)\n1: Input: $x_0 \\in \\mathbb R^d$, the number of stages $K$, the number of iterations $t$ per stage, the initial step size $\\eta_1$, the initial penalty parameter $\\beta_1$, and the initial radius $D_1$\n2: for $k = 1, \\ldots, K$ do\n3: Let $x_k$ = SADMM($x_{k-1}, \\eta_k, \\beta_k, t, B_k \\cap \\Omega$)\n4: Update $\\eta_{k+1} = \\eta_k/2$, $\\beta_{k+1} = 2\\beta_k$, and $D_{k+1} = D_k/2$\n5: end for\n6: Output: $x_K$\n\nWe present a framework of stochastic ADMM (SADMM) in Algorithm 3. The convergence results of stochastic ADMM with different choices of $G_\\tau$ for solving the equivalent constrained optimization problem have been established in [21, 23, 35]. Below, we will focus on $G_\\tau = \\gamma I - \\eta\\beta A^\\top A \\succeq I$ because it leads to a computationally more efficient update for $x_{\\tau+1}$ than the other two choices for high-dimensional problems. 
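With G_tau = gamma*I - eta*beta*A^T A, one can check that the x^T A^T A x terms in (9) cancel, so the x-update reduces to a plain gradient-style step, x_{tau+1} = x_tau - (eta/gamma)*(g + beta*A^T(A x_tau - y_tau - lambda_tau/beta)), followed by a projection onto Omega when Omega is not all of R^d. This is our own derivation of why the update is cheap; a minimal sketch (illustrative values):

```python
import numpy as np

def linearized_sadmm_x_update(x, y, lam, g, A, beta, eta, gamma):
    """x-update (9) with G_tau = gamma*I - eta*beta*A^T A: the A^T A quadratic
    cancels, leaving a single matrix-vector product per iteration (project onto
    Omega afterwards if Omega != R^d)."""
    return x - (eta / gamma) * (g + beta * A.T @ (A @ x - y - lam / beta))
```

The design benefit is that no d-by-d linear system is solved, which is what makes this choice of G_tau attractive for high-dimensional problems.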
Using $G_\\tau = I$ will yield a similar convergence bound except for a constant term, and using the idea of AdaGrad for computing $G_\\tau$ will lead to the same order of convergence in the worst case; we postpone its exploration to future work. The corollary below will be used in our analysis.\n\nCorollary 3. Suppose Assumptions 1.c, 1.d and 2 hold. Let $G_\\tau = \\gamma I - \\eta\\beta A^\\top A \\succeq I$ in Algorithm 3. For any $x \\in \\Omega$,\n\n$$F(\\widehat x_t) - F(x) \\le \\frac{\\eta R^2}{2} + \\frac{\\gamma\\|x_1 - x\\|_2^2}{2\\eta t} + \\frac{\\beta\\|A\\|_2^2\\|x_1 - x\\|_2^2}{2t} + \\frac{\\rho^2}{2\\beta t} + \\frac{\\rho\\|A\\|_2\\|x_1 - x_{t+1}\\|_2}{t} + \\frac{1}{t}\\sum_{\\tau=1}^t (\\mathrm E[g_\\tau] - g_\\tau)^\\top(x_\\tau - x).$$\n\nRemark: Taking expectation on both sides yields the convergence bound in expectation. We can also use a large-deviation analysis to bound the last term and obtain convergence with high probability. In particular, by setting $\\eta \\propto 1/\\sqrt t$, the above result implies an $O(1/\\sqrt t)$ convergence rate, i.e., an $O(1/\\epsilon^2)$ iteration complexity, of stochastic ADMM.\n\nNext, we discuss our locally adaptive stochastic ADMM (LA-SADMM) algorithm in Algorithm 4. The key idea is similar to LA-ADMM, i.e., calling SADMM in multiple stages with a warm start. The step size $\\eta_k$ in each call of SADMM is fixed and decreases by a constant fraction after each stage. The penalty parameter is updated similarly to that in LA-ADMM but with a different initial value. A key difference from LA-ADMM is that we employ a domain shrinking approach to modify the domain of the solutions $x_{\\tau+1}$ at each stage. 
For the $k$-th stage, the domain for $x$ is the intersection of $\\Omega$ and $B_k = B(x_{k-1}, D_k)$, where the latter is a ball of radius $D_k$ centered at $x_{k-1}$ (the initial solution of the $k$-th stage). The radius $D_k$ decreases geometrically between stages. The purpose of the domain shrinking approach is to tackle the last term of the upper bound in Corollary 3, so that it decreases geometrically as the stage number increases. A similar idea has been adopted in [29, 7, 5]. Note that within each call of SADMM, we can use the three choices of $G_\\tau$ mentioned before. Below we only present the convergence result of the variant with $G_\\tau = \\gamma I - \\eta_k\\beta_k A^\\top A$.\n\nTheorem 4. Suppose Assumptions 1 and 2 hold and $F(x)$ obeys the local error bound condition on $S_\\epsilon$. Given $\\delta \\in (0, 1)$, let $\\tilde\\delta = \\delta/K$, $K = \\lceil\\log_2(\\epsilon_0/\\epsilon)\\rceil$, $\\eta_1 = \\frac{\\epsilon_0}{6R^2}$, $\\beta_1 = \\frac{6R^2}{\\|A\\|_2^2\\epsilon_0}$, $D_1 \\ge \\frac{c\\epsilon_0}{\\epsilon^{1-\\theta}}$, let $t$ be the smallest integer such that $t \\ge \\max\\left\\{\\frac{6912 R^2\\log(1/\\tilde\\delta)D_1^2}{\\epsilon_0^2}, \\frac{\\rho^2\\|A\\|_2^2}{R^2}, \\frac{12\\rho\\|A\\|_2 D_1}{\\epsilon_0}\\right\\}$, and let $G_\\tau = 2I - \\eta_1\\beta_1 A^\\top A \\succeq I$. Then LA-SADMM guarantees that, with probability $1 - \\delta$, we have $F(x_K) - F_* \\le 2\\epsilon$. The iteration complexity of LA-SADMM for achieving a $2\\epsilon$-optimal solution with high probability $1 - \\delta$ is $\\widetilde O(\\log(1/\\delta)/\\epsilon^{2(1-\\theta)})$, provided $D_1 = O(\\frac{c\\epsilon_0}{\\epsilon^{1-\\theta}})$.\n\nRemark: Interestingly, unlike that in LA-ADMM, the initial value $\\beta_1$ does not depend on $\\theta$. The adaptivity of the penalty parameters lies in the time interval $t$, which determines when the value of $\\beta$ is increased. The difference comes from the first two terms in Corollary 3.\n\nBefore ending this section, we discuss two points. First, both Theorem 2 and Theorem 4 exhibit the dependence of the two algorithms on the parameter $c$ (e.g., through $t$ in Algorithm 2 and $D_1$ in Algorithm 4), which is usually unknown. Nevertheless, this issue can be easily addressed by using another level of restarting with an increasing sequence of $t$ and $D_1$, similar to the practical variants in [29, 32]. Due to the limit of space, we only present the algorithms in Algorithm 5 and Algorithm 6, with their formal guarantees presented in the supplement.\n\nAlgorithm 5 LA-ADMM with Restarting\n1: Input: $t_1$, $\\beta_1^{(1)}$\n2: Initialization: $x^{(0)}$\n3: for $s = 1, 2, \\ldots$ do\n4: $x^{(s)}$ = LA-ADMM($x^{(s-1)}, \\beta_1^{(s)}, K, t_s$)\n5: $t_{s+1} = t_s 2^{1-\\theta}$, $\\beta_1^{(s+1)} = \\beta_1^{(s)}/2^{1-\\theta}$\n6: end for\n7: Output: $x^{(S)}$\n\nAlgorithm 6 LA-SADMM with Restarting ($\\epsilon \\le \\epsilon_0/2$)\n1: Input: $t_1$, $D_1^{(1)}$\n2: Initialization: $x^{(0)}$, $\\eta_1 = \\frac{\\epsilon_0}{6R^2}$, $\\beta_1 = \\frac{6R^2}{\\|A\\|_2^2\\epsilon_0}$\n3: for $s = 1, 2, \\ldots$ do\n4: $x^{(s)}$ = LA-SADMM($x^{(s-1)}, \\eta_1, \\beta_1, D_1^{(s)}, K, t_s$)\n5: $t_{s+1} = t_s 2^{2(1-\\theta)}$, $D_1^{(s+1)} = D_1^{(s)} 2^{1-\\theta}$\n6: end for\n7: Output: $x^{(S)}$\n\n
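The outer restarting loop can be sketched generically as follows, using Algorithm 5's schedule as extracted (Algorithm 6 is analogous, growing t_s by 2^(2(1-theta)) and D_1^(s) by 2^(1-theta)); the inner LA-ADMM solver is passed in as a callable, and the function name and this framing are our own reconstruction.

```python
def la_admm_restart(x0, beta1, t1, K, S, theta, la_admm):
    """Outer restarting loop of Algorithm 5: grow the per-stage iteration budget
    and shrink the initial penalty geometrically, at rates set by the local
    sharpness parameter theta."""
    x, beta, t = x0, beta1, t1
    for _ in range(S):
        x = la_admm(x, beta, K, round(t))
        t *= 2.0 ** (1.0 - theta)     # t_{s+1} = t_s * 2^(1-theta)
        beta /= 2.0 ** (1.0 - theta)  # beta_1^(s+1) = beta_1^(s) / 2^(1-theta)
    return x
```

With theta = 0 (the conservative setting discussed next, usable when theta is unknown), the budget simply doubles and the initial penalty halves at every restart.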
The conclusion is that, under mild conditions, as long as $\\beta_1^{(1)}$ in Algorithm 5 is sufficiently small and $t_1$ and $D_1^{(1)}$ in Algorithm 6 are sufficiently large, the iteration complexities remain $\\widetilde O(1/\\epsilon^{1-\\theta})$ and $\\widetilde O(1/\\epsilon^{2(1-\\theta)})$ when $\\theta$ in the LEB condition is known. Second, these variants can even be employed when the local sharpness parameter $\\theta$ is unknown, by simply setting it to 0, and they still enjoy reduced iteration complexities in terms of a multiplicative factor compared to vanilla ADMMs. Detailed results are included in the supplement.\n\n6 Applications and Experiments\n\nIn this section, we present experimental results of the proposed algorithms on three tasks, namely generalized LASSO, robust regression with a low-rank regularizer (RR-LR), and learning a low-rank representation. For generalized LASSO, our experiment focuses on comparing the proposed LA-SADMM with SADMM. For the latter two tasks, we focus on comparing the proposed LA-ADMM with previous linearized ADMM with and without self-adaptive penalty parameters.\n\nWe first consider generalized LASSO, which finds applications in many problems in statistics and machine learning [28]. The objective of generalized LASSO can be expressed as:\n\n$$\\min_{x\\in\\mathbb R^d} F(x) = \\frac{1}{n}\\sum_{i=1}^n \\ell(x^\\top a_i, b_i) + \\delta\\|Ax\\|_1 \\qquad (13)$$\n\nwhere $(a_i, b_i)$, $i = 1, \\ldots, n$, are pairs of training data, $\\delta \\ge 0$ is a regularization parameter, $A \\in \\mathbb R^{m\\times d}$ is a specified matrix, and $\\ell(z, b)$ is a convex loss function in terms of $z$. 
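Evaluating objective (13) directly is straightforward; the sketch below (toy data and helper names are ours) spells it out for two common convex losses, the hinge loss and the absolute loss.

```python
import numpy as np

def generalized_lasso_objective(x, a, b, A, delta, loss):
    """Objective (13): F(x) = (1/n) * sum_i loss(a_i^T x, b_i) + delta * ||A x||_1.
    Here a is the (n, d) matrix whose rows are the training points a_i."""
    z = a @ x
    return np.mean(loss(z, b)) + delta * np.abs(A @ x).sum()

hinge = lambda z, b: np.maximum(0.0, 1.0 - b * z)  # SVM loss, b in {1, -1}
absolute = lambda z, b: np.abs(z - b)              # absolute (robust) loss
```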
The above formulation includes many formulations as special cases, e.g., the standard LASSO where A = I ∈ R^{d×d} [26], the fused LASSO that penalizes the ℓ_1 norm of both the coefficients and their successive differences [27], the graph-guided fused LASSO (GGLASSO) where A = F ∈ R^{m×d} encodes some graph information about features [12], and the sparse graph-guided fused LASSO (S-GGLASSO) where the penalty δ‖Ax‖_1 becomes δ_2‖x‖_1 + δ_1‖Fx‖_1 [21].
Let us first discuss the local sharpness parameter of generalized LASSO with different loss functions. Consider first piecewise linear loss functions such as the hinge loss ℓ(z, b) = max(0, 1 − bz), the absolute loss ℓ(z, b) = |z − b|, and the ε-insensitive loss ℓ(z, b) = max(|z − b| − ε, 0). Then the objective is a polyhedral function. According to the results in [32], the local sharpness parameter is θ = 1. It then implies that both LA-ADMM and LA-SADMM enjoy linear convergence for solving problem (13) with a piecewise linear loss function. To the best of our knowledge, these are the first linear convergence results of ADMM without smoothness and strong convexity conditions. One can also consider piecewise quadratic losses such as the square loss ℓ(z, b) = (z − b)^2 for b ∈ R and the squared hinge loss ℓ(z, b) = max(0, 1 − bz)^2 for b ∈ {1, −1}. According to [14], a problem with a convex piecewise quadratic loss has a local sharpness parameter θ = 1/2, implying iteration complexities of Õ(1/√ε) and Õ(1/ε) for LA-ADMM and LA-SADMM, respectively.

Figure 1: Comparison of different algorithms for solving different tasks. Panels: (a) SVM + GGLASSO, (b) SVM + GGLASSO, (c) RR + LR, (d) SVM + S-GGLASSO, (e) SVM + S-GGLASSO, (f) LRR. RR + LR represents robust regression with a low-rank regularizer.
LRR represents low-rank representation.

For more examples with different values of θ, we refer readers to [32, 30, 29, 17].

SVM Classification with GGLASSO and S-GGLASSO Regularizers  To generate the A matrix, we first need to construct a dependency graph of features. We follow [21] and generate a dependency graph by sparse inverse covariance selection [4]. Specifically, we obtain an estimator of the inverse covariance matrix, denoted by Σ̂^{-1}, via sparse inverse covariance estimation with the graphical lasso [4]. For each nonzero entry Σ̂^{-1}_{ij}, where i, j ∈ {1, . . . , d}, i ≠ j, an edge between i and j is created. Denote by G ≡ {V, E} the resulting graph, where V is a set of d vertices corresponding to the d features in the data, and E = {e_1, . . . , e_m} is the set of m edges between elements of V, each e_i consisting of a tuple of two elements. Then the k-th row of A has two non-zero elements corresponding to the k-th edge e_k = (i, j) ∈ E, with A_{k,i} = 1 and A_{k,j} = −1. We choose two medium-scale data sets from the libsvm website, namely the w8a data (n = 49749, d = 300) and the gisette data (n = 6000, d = 5000), to conduct the experiment. In the process of estimating the inverse covariance matrix, we choose the penalty parameter to be 0.01, which renders the percentage of non-zero elements of the A matrix around 3% for the w8a data and 1% for the gisette data. We compare the performance of the LA-SADMM algorithm with SADMM [23], where in SADMM we use G_τ = γI − βη_τ A^⊤A ⪰ I with η_τ ∝ η_1/√τ. For fairness, we set the same initial solution with all zero entries. We fix the values of the regularization parameters (δ in GGLASSO and δ_1, δ_2 in S-GGLASSO) to 1/n, where n is the number of samples. For SADMM, we tune both η_1 and β from {10^{−5:1:5}}.
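The row construction of A described above (A_{k,i} = 1, A_{k,j} = −1 for edge e_k = (i, j)) can be sketched as follows; the 3-feature edge list in the usage line is hypothetical, and a sparse matrix format would be preferable at the scale of gisette:

```python
import numpy as np

def edge_incidence_matrix(edges, d):
    """Build A in R^{m x d}: for the k-th edge e_k = (i, j),
    row k has A[k, i] = 1 and A[k, j] = -1."""
    A = np.zeros((len(edges), d))
    for k, (i, j) in enumerate(edges):
        A[k, i] = 1.0
        A[k, j] = -1.0
    return A

# hypothetical dependency graph on 3 features with edges (0,1) and (1,2)
A = edge_incidence_matrix([(0, 1), (1, 2)], d=3)
```

With this A, the penalty δ‖Ax‖_1 penalizes differences between coefficients of features connected in the dependency graph.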
For LA-SADMM, we set the initial step size and penalty parameter to their theoretical values in Theorem 4, and select D_1 from {100, 1000}. The value of t in LA-SADMM is set to 10^5 and 5 × 10^4 for w8a and gisette, respectively. The results comparing the objective values versus the number of iterations are presented in Figure 1 (a, b, d, e). We can see that LA-SADMM exhibits much faster convergence than SADMM.

Robust Regression with a Low-rank Regularizer  The objective function is F(X) = λ‖X‖_* + ‖AX − C‖_1. We can form an equality constraint Y = AX − C and solve the problem by linearized ADMM. The value of the local sharpness parameter of this problem is still an open problem. We compare the proposed LA-ADMM, the vanilla linearized ADMM with a fixed penalty parameter (ADMM), the linearized ADMM with the self-adaptive penalty proposed in [15] (ADMM-AP), and the linearized ADMM with residual balancing in [10, 2] (ADMM-RB). We construct a synthetic data set where A ∈ R^{1000×100} is generated following a Gaussian distribution with mean 0 and standard deviation 1. To construct C ∈ R^{1000×50}, we first generate X ∈ R^{100×50}, retain only its top 20 components, denoted by X̂, and then let C = AX̂ + ε, where ε is a Gaussian noise matrix with mean zero and standard deviation 0.01.
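The synthetic data just described can be generated as in the sketch below; truncating X to its top 20 singular components is our reading of "retain only its top 20 components", and the random seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
A = rng.normal(0.0, 1.0, size=(1000, 100))   # Gaussian design matrix
X = rng.normal(size=(100, 50))
# keep only the top-20 singular components of X (rank-20 truncation)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = (U[:, :20] * s[:20]) @ Vt[:20, :]
# observations: low-rank signal plus small Gaussian noise
C = A @ X_hat + rng.normal(0.0, 0.01, size=(1000, 50))
```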
We set \u03bb = 100. For the vanilla linearized\nADMM, we try different penalty parameters from {10\u22123:1:3} and report the best performance\n(using \u03b2 = 0.01) and worst performance (using \u03b2 = 0.001). To demonstrate the capability of\nadaptive ADMM, we choose \u03b2 = 0.001 as the initial step size for LA-ADMM and ADMM-AP.\nOther parameters of ADMM-AP is the same as suggested in the original paper. For LA-ADMM, we\nimplement its restarting variant (Algorithm 5), and start with the number of inner iterations t = 2 and\nincrease its value by a factor 2 after 10 stages, and also increase the value of \u03b2 by 10 times after each\nstage. The results are reported in Figure 1 (c), from which we can see that LA-ADMM performs\ncomparably with ADMM with the best penalty parameter and also better than ADMM-AP. We also\ninclude the results in terms of running time in the supplement.\nLow-rank Representation [16] The objective function is F (X) = \u03bb(cid:107)X(cid:107)\u2217 + (cid:107)AX \u2212 A(cid:107)2,1, where\nA \u2208 Rn\u00d7d is a data matrix. We used the shape image 3 and set \u03bb = 10. For the vanilla linearized\nADMM, we try different penalty parameters from {10\u22123:1:3} and report the best performance (using\n\u03b2 = 0.1) and worst performance (using \u03b2 = 0.01). To demonstrate the capability of adaptive\nADMM, we choose \u03b2 = 0.01 as the initial step size for LA-ADMM and ADMM-AP. Other\nparameters of ADMM-AP is the same as suggested in the original paper. For LA-ADMM, we\nstart with the number of inner iterations t = 20 and increase its value by a factor 2 after 2 stages,\nand also increase the value of \u03b2 by 2 times after each stage. The results are reported in Figure 1\n(f), from which we can see that LA-ADMM performs comparably with ADMM with the best\npenalty parameter and also better than ADMM-AP. We can see from the \ufb01gure that the results of\nADMM-worst and ADMM-AP are quite similar. 
We also include the results in terms of running time in the supplement.

7 Conclusion

In this paper, we have presented a new theory of (linearized) ADMM for both deterministic and stochastic optimization with adaptive penalty parameters. The new adaptive scheme is different from previous self-adaptive schemes and is adaptive to the local sharpness of the problem. We have established faster convergence of the proposed ADMM algorithms with penalty parameters adaptive to the local sharpness parameter. Experimental results have demonstrated the superior performance of the proposed stochastic and deterministic adaptive ADMM.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. Y. Xu, M. Liu and T. Yang are partially supported by National Science Foundation (IIS-1463988, IIS-1545995). Y. Xu would like to thank Yan Yan for useful discussions on the low-rank representation experiments.

References
[1] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[3] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[5] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms.
SIAM Journal on Optimization, 23(4):2061–2089, 2013.

³http://pages.cs.wisc.edu/~swright/TVdenoising/

[6] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.
[7] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.
[8] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[9] B. He and X. Yuan. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.
[10] B. S. He, H. Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.
[11] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, pages 1–35, 2016.
[12] S. Kim, K.-A. Sohn, and E. P. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
[13] K. Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l'institut Fourier, 48(3):769–783, 1998.
[14] G. Li. Global error bounds for piecewise convex polynomials. Mathematical Programming, 137(1-2):37–64, 2013.
[15] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in Neural Information Processing Systems (NIPS), pages 612–620, 2011.
[16] G. Liu, Z. Lin, and Y. Yu.
Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 663–670, 2010.
[17] M. Liu and T. Yang. Adaptive accelerated gradient converging methods under Hölderian error bound condition. CoRR, abs/1611.07609, 2017.
[18] Z.-Q. Luo and J. F. Sturm. Error bound for quadratic systems. Applied Optimization, 33:383–404, 2000.
[19] R. D. Monteiro and B. F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.
[20] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. CoRR, abs/1504.06298, 2015.
[21] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML), 28:80–88, 2013.
[22] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.
[23] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 392–400, 2013.
[24] T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 736–744, 2014.
[25] W. Tian and X. Yuan. Faster alternating direction method of multipliers with a worst-case O(1/n²) convergence rate. 2016.
[26] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.
[27] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight.
Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[28] R. J. Tibshirani, J. Taylor, et al. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371, 2011.
[29] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821–3830, 2017.
[30] Y. Xu, Y. Yan, Q. Lin, and T. Yang. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ε). In Advances in Neural Information Processing Systems 29 (NIPS), pages 1208–1216, 2016.
[31] Z. Xu, M. A. T. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. CoRR, abs/1605.07246, 2016.
[32] T. Yang and Q. Lin. RSG: Beating subgradient method without smoothness and strong convexity. CoRR, abs/1512.03107, 2016.
[33] X. Zhang, M. Burger, X. Bresson, and S. Osher. Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM Journal on Imaging Sciences, 3(3):253–276, 2010.
[34] X. Zhang, M. Burger, and S. Osher. A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46(1):20–46, 2011.
[35] P. Zhao, J. Yang, T. Zhang, and P. Li. Adaptive stochastic alternating direction method of multipliers. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 69–77, 2015.
[36] S. Zheng and J. T. Kwok. Fast-and-light stochastic ADMM. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016.
[37] W. Zhong and J. T.-Y. Kwok. Fast stochastic alternating direction method of multipliers.
In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 46–54, 2014.