{"title": "Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1458, "page_last": 1466, "abstract": "We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the second term. We show that the basic proximal-gradient method, the basic proximal-gradient method with a strong convexity assumption, and the accelerated proximal-gradient method achieve the same convergence rates as in the error-free case, provided the errors decrease at an appropriate rate.  Our experimental results on a structured sparsity problem indicate that sequences of errors with these appealing theoretical properties can lead to practical performance improvements.", "full_text": "Convergence Rates of Inexact Proximal-Gradient\n\nMethods for Convex Optimization\n\nMark Schmidt\n\nmark.schmidt@inria.fr\n\nNicolas Le Roux\n\nnicolas@le-roux.name\n\nFrancis Bach\n\nfrancis.bach@ens.fr\n\nINRIA - SIERRA Project Team\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\n\nAbstract\n\nWe consider the problem of optimizing the sum of a smooth convex function and\na non-smooth convex function using proximal-gradient methods, where an error\nis present in the calculation of the gradient of the smooth term or in the proxim-\nity operator with respect to the non-smooth term. We show that both the basic\nproximal-gradient method and the accelerated proximal-gradient method achieve\nthe same convergence rate as in the error-free case, provided that the errors de-\ncrease at appropriate rates. 
Using error sequences with these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.\n\n1 Introduction\n\nIn recent years the importance of taking advantage of the structure of convex optimization problems has become a topic of intense research in the machine learning community. This is particularly true of techniques for non-smooth optimization, where taking advantage of the structure of non-smooth terms seems to be crucial to obtaining good performance. Proximal-gradient methods and accelerated proximal-gradient methods [1, 2] are among the most important methods for taking advantage of the structure of many of the non-smooth optimization problems that arise in practice. In particular, these methods address composite optimization problems of the form\n\nminimize_{x ∈ Rd} f(x) := g(x) + h(x), (1)\n\nwhere g and h are convex functions but only g is smooth. One of the most well-studied instances of this type of problem is ℓ1-regularized least squares [3, 4],\n\nminimize_{x ∈ Rd} (1/2)‖Ax − b‖² + λ‖x‖₁,\n\nwhere we use ‖·‖ to denote the standard ℓ2-norm. Proximal-gradient methods are an appealing approach for solving these types of non-smooth optimization problems because of their fast theoretical convergence rates and strong practical performance. While classical subgradient methods only achieve an error level on the objective function of O(1/√k) after k iterations, proximal-gradient methods have an error of O(1/k), while accelerated proximal-gradient methods further reduce this to O(1/k²) [1, 2]. 
That is, accelerated proximal-gradient methods for non-smooth convex optimization achieve the same optimal convergence rate that accelerated gradient methods achieve for smooth optimization.\n\nEach iteration of a proximal-gradient method requires the calculation of the proximity operator,\n\nprox_L(y) = argmin_{x ∈ Rd} (L/2)‖x − y‖² + h(x), (2)\n\nwhere L is the Lipschitz constant of the gradient of g. We can efficiently compute an analytic solution to this problem for several notable choices of h, including the case of ℓ1-regularization and disjoint group ℓ1-regularization [5, 6]. However, in many scenarios the proximity operator may not have an analytic solution, or it may be very expensive to compute this solution exactly. This includes important problems such as total-variation regularization and its generalizations like the graph-guided fused-LASSO [7, 8], nuclear-norm regularization and other regularizers on the singular values of matrices [9, 10], and different formulations of overlapping group ℓ1-regularization with general groups [11, 12]. 
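For the analytic cases mentioned above, the proximity operator reduces to a componentwise formula: for h(x) = λ‖x‖₁, problem (2) is solved by soft-thresholding with threshold λ/L. A minimal sketch in Python (the function name is ours, not from the paper):

```python
def prox_l1(y, lam, L):
    """Proximity operator (2) for h(x) = lam * ||x||_1.

    Solves argmin_x (L/2) * ||x - y||^2 + lam * ||x||_1, whose minimizer
    is obtained by componentwise soft-thresholding with threshold lam / L.
    """
    t = lam / L
    return [max(abs(yi) - t, 0.0) * (1.0 if yi > 0 else -1.0) for yi in y]
```

For example, with L = 1 and λ = 1 the point y = (3, −0.5) is mapped to (2, 0): coordinates below the threshold are set exactly to zero, which is what makes these methods attractive for sparse problems.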
Despite the difficulty of computing the exact proximity operator for these regularizers, efficient methods have been developed to compute approximate proximity operators in all of these cases: accelerated projected-gradient and Newton-like methods that work with a smooth dual problem have been used to compute approximate proximity operators in the context of total-variation regularization [7, 13]; Krylov subspace methods and low-rank representations have been used to compute approximate proximity operators in the context of nuclear-norm regularization [9, 10]; and variants of Dykstra's algorithm (and related dual methods) have been used to compute approximate proximity operators in the context of overlapping group ℓ1-regularization [12, 14, 15].\n\nIt is known that proximal-gradient methods that use an approximate proximity operator converge under only weak assumptions [16, 17]; we briefly review this and other related work in the next section. However, despite the many recent works showing impressive empirical performance of (accelerated) proximal-gradient methods that use an approximate proximity operator [7, 13, 9, 10, 14, 15], until recently there was no theoretical analysis of how the error in the calculation of the proximity operator affects the convergence rate of proximal-gradient methods. In this work we show in several contexts that, provided the error in the proximity operator calculation is controlled in an appropriate way, inexact proximal-gradient strategies achieve the same convergence rates as the corresponding exact methods. In particular, in Section 4 we first consider convex objectives and analyze the inexact proximal-gradient (Proposition 1) and accelerated proximal-gradient (Proposition 2) methods. 
We then analyze these two algorithms for strongly convex objectives (Propositions 3 and 4). Note that in these analyses we also consider the possibility that there is an error in the calculation of the gradient of g. We then present an experimental comparison of various inexact proximal-gradient strategies in the context of solving a structured sparsity problem (Section 5).\n\n2 Related Work\n\nThe algorithm we focus on in this paper is the proximal-gradient method\n\nxk = prox_L[yk−1 − (1/L)(g′(yk−1) + ek)], (3)\n\nwhere ek is the error in the calculation of the gradient and the proximity problem (2) is solved inexactly, so that xk has an error of εk in terms of the proximal objective function (2). In the basic proximal-gradient method we choose yk = xk, while in the accelerated proximal-gradient method we choose yk = xk + βk(xk − xk−1), where the sequence (βk) is chosen appropriately.\n\nThere is a substantial amount of work on methods that use an exact proximity operator but have an error in the gradient calculation, corresponding to the special case where εk = 0 but ek is non-zero. For example, when the ek are independent, zero-mean, and finite-variance random variables, proximal-gradient methods achieve the (optimal) error level of O(1/√k) [18, 19]. This is different from the scenario we analyze in this paper, since we assume neither unbiased nor independent errors, but instead consider a sequence of errors converging to 0. 
This leads to faster convergence rates, and makes our analysis applicable to the case of deterministic (and even adversarial) errors. Several authors have recently analyzed the case of a fixed deterministic error in the gradient, and shown that accelerated gradient methods achieve the optimal convergence rate up to some accuracy that depends on the fixed error level [20, 21, 22], while the earlier work of [23] analyzes the gradient method in the context of a fixed error level. This contrasts with our analysis, where by allowing the error to change at every iteration we can achieve convergence to the optimal solution. Also, we can tolerate a large error in early iterations when we are far from the solution, which may lead to substantial computational gains. Other authors have analyzed the convergence rate of the gradient and projected-gradient methods with a decreasing sequence of errors [24, 25], but this analysis does not consider the important class of accelerated gradient methods. In contrast, the analysis of [22] allows a decreasing sequence of errors (though convergence rates in this context are not explicitly mentioned) and considers the accelerated projected-gradient method. However, the authors of this work only consider the case of an exact projection step, and they assume the availability of an oracle that yields global lower and upper bounds on the function. This non-intuitive oracle leads to a novel analysis of smoothing methods, but leads to slower convergence rates than proximal-gradient methods. The analysis of [21] considers errors in both the gradient and projection operators for accelerated projected-gradient methods, but this analysis requires that the domain of the function is compact. 
None of these works consider proximal-gradient methods.\n\nIn the context of proximal-point algorithms, there is a substantial literature on using inexact proximity operators with a decreasing sequence of errors, dating back to the seminal work of Rockafellar [26]. Accelerated proximal-point methods with a decreasing sequence of errors have also been examined, beginning with [27]. However, unlike proximal-gradient methods, where the proximity operator is only computed with respect to the non-smooth function h, proximal-point methods require the calculation of the proximity operator with respect to the full objective function. In the context of composite optimization problems of the form (1), this requires the calculation of the proximity operator with respect to g + h. Since it ignores the structure of the problem, this proximity operator may be as difficult to compute (even approximately) as the minimizer of the original problem.\n\nConvergence of inexact proximal-gradient methods can be established with only weak assumptions on the method used to approximately solve (2). For example, we can establish that inexact proximal-gradient methods converge under some closedness assumptions on the mapping induced by the approximate proximity operator, and the assumption that the algorithm used to compute the inexact proximity operator achieves sufficient descent on problem (2) compared to the previous iteration xk−1 [16]. Convergence of inexact proximal-gradient methods can also be established under the assumption that the norms of the errors are summable [17]. However, these prior works did not consider the rate of convergence of inexact proximal-gradient methods, nor did they consider accelerated proximal-gradient methods. 
Indeed, the authors of [7] chose to use the non-accelerated variant of the proximal-gradient algorithm since even convergence of the accelerated proximal-gradient method had not been established under an inexact proximity operator.\n\nWhile preparing the final version of this work, [28] independently gave an analysis of the accelerated proximal-gradient method with an inexact proximity operator and a decreasing sequence of errors (assuming an exact gradient). Further, their analysis leads to a weaker dependence on the errors than in our Proposition 2. However, while we only assume that the proximal problem can be solved up to a certain accuracy, they make the much stronger assumption that the inexact proximity operator yields an εk-subdifferential of h [28, Definition 2.1]. Our analysis can be modified to give an improved dependence on the errors under this stronger assumption. In particular, the terms in √εi disappear from the expressions of Ak, Ãk and Âk appearing in the propositions, leading to the optimal convergence rate with a slower decay of εi. More details may be found in [29].\n\n3 Notation and Assumptions\n\nIn this work, we assume that the smooth function g in (1) is convex and differentiable, and that its gradient g′ is Lipschitz-continuous with constant L, meaning that for all x and y in Rd we have\n\n‖g′(x) − g′(y)‖ ≤ L‖x − y‖.\n\nThis is a standard assumption in differentiable optimization; see [30, §2.1.1]. If g is twice-differentiable, this corresponds to the assumption that the eigenvalues of its Hessian are bounded above by L. 
In Propositions 3 and 4 only, we will also assume that g is µ-strongly convex (see [30, §2.1.3]), meaning that for all x and y in Rd we have\n\ng(y) ≥ g(x) + ⟨g′(x), y − x⟩ + (µ/2)‖y − x‖².\n\nIn contrast to these assumptions on g, we will only assume that h in (1) is a lower semi-continuous proper convex function (see [31, §1.2]), but will not assume that h is differentiable or Lipschitz-continuous. This allows h to be any real-valued convex function, but also allows for the possibility that h is an extended real-valued convex function. For example, h could be the indicator function of a convex set, and in this case the proximity operator becomes the projection operator.\n\nWe will use xk to denote the parameter vector at iteration k, and x∗ to denote a minimizer of f. We assume that such an x∗ exists, but do not assume that it is unique. We use ek to denote the error in the calculation of the gradient at iteration k, and we use εk to denote the error in the proximal objective function achieved by xk, meaning that\n\n(L/2)‖xk − y‖² + h(xk) ≤ εk + min_{x ∈ Rd} { (L/2)‖x − y‖² + h(x) }, (4)\n\nwhere y = yk−1 − (1/L)(g′(yk−1) + ek). Note that the proximal optimization problem (2) is strongly convex, and in practice we are often able to obtain such bounds via a duality gap (e.g., see [12] for the case of overlapping group ℓ1-regularization).\n\n4 Convergence Rates of Inexact Proximal-Gradient Methods\n\nIn this section we present the analysis of the convergence rates of inexact proximal-gradient methods as a function of the sequence of solution accuracies to the proximal problems (εk) and the sequence of magnitudes of the errors in the gradient calculations (‖ek‖). 
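Combining recursion (3) with the accuracy requirement (4), the inexact proximal-gradient method can be illustrated on a toy instance. The sketch below is our own illustration, not from the paper: it takes g(x) = (1/2)‖x − c‖² so that L = 1, uses the exact soft-thresholding prox for h(x) = λ‖x‖₁ (so εk = 0), and injects a synthetic gradient error with ‖ek‖ = 1/k², a summable sequence:

```python
def soft_threshold(y, t):
    """Exact prox of t * ||.||_1 at y (componentwise soft-thresholding)."""
    return [max(abs(yi) - t, 0.0) * (1.0 if yi > 0 else -1.0) for yi in y]


def inexact_proximal_gradient(c, lam, iters):
    """Recursion (3) with yk = xk for g(x) = 0.5 * ||x - c||^2 (so L = 1,
    gradient x - c) and h(x) = lam * ||x||_1, with a gradient error ek of
    norm 1 / k^2 placed in the first coordinate."""
    x = [0.0] * len(c)
    for k in range(1, iters + 1):
        e = [1.0 / k ** 2] + [0.0] * (len(c) - 1)  # synthetic error ek
        grad = [xi - ci + ei for xi, ci, ei in zip(x, c, e)]
        x = soft_threshold([xi - gi for xi, gi in zip(x, grad)], lam)
    return x


def objective(x, c, lam):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) \
        + lam * sum(abs(xi) for xi in x)
```

With c = (3, −0.5) and λ = 1 the exact minimizer is (2, 0); since Σ‖ek‖ is finite, the inexact iterates still approach it, in line with Proposition 1 below.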
We shall use (H) to denote the set of four assumptions which will be made for each proposition:\n\n• g is convex and has an L-Lipschitz-continuous gradient;\n• h is a lower semi-continuous proper convex function;\n• the function f = g + h attains its minimum at a certain x∗ ∈ Rd;\n• xk is an εk-optimal solution to the proximal problem (2) in the sense of (4).\n\nWe first consider the basic proximal-gradient method in the convex case:\n\nProposition 1 (Basic proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with yk = xk. Then, for all k ≥ 1, we have\n\nf((1/k) Σ_{i=1}^k xi) − f(x∗) ≤ (L/2k) (‖x0 − x∗‖ + 2Ak + √(2Bk))², (5)\n\nwith Ak = Σ_{i=1}^k (‖ei‖/L + √(2εi/L)), Bk = Σ_{i=1}^k εi/L.\n\nThe proof may be found in [29]. Note that while we have stated the proposition in terms of the function value achieved by the average of the iterates, it trivially also holds for the iterate that achieves the lowest function value. This result implies that the well-known O(1/k) convergence rate for the gradient method without errors still holds when both (‖ek‖) and (√εk) are summable. A sufficient condition to achieve this is that ‖ek‖ decreases as O(1/k^(1+δ)) while εk decreases as O(1/k^(2+δ′)) for any δ, δ′ > 0. Note that a faster convergence of these two errors will not improve the convergence rate, but will yield a better constant factor.\n\nIt is interesting to consider what happens if (‖ek‖) or (√εk) is not summable. For instance, if ‖ek‖ and √εk decrease as O(1/k), then Ak grows as O(log k) (note that Bk is always smaller than Ak) and the convergence of the function values is in O(log² k / k). Finally, a necessary condition to obtain convergence is that the partial sums Ak and Bk need to be in o(√k).\n\nWe now turn to the case of an accelerated proximal-gradient method. We focus on a basic variant of the algorithm where βk is set to (k − 1)/(k + 2) [32, Eq. (19) and (27)]:\n\nProposition 2 (Accelerated proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with yk = xk + ((k − 1)/(k + 2))(xk − xk−1). Then, for all k ≥ 1, we have\n\nf(xk) − f(x∗) ≤ (2L/(k + 1)²) (‖x0 − x∗‖ + 2Ãk + √(2B̃k))², (6)\n\nwith Ãk = Σ_{i=1}^k i(‖ei‖/L + √(2εi/L)), B̃k = Σ_{i=1}^k i²εi/L.\n\nIn this case, we require the series (k‖ek‖) and (k√εk) to be summable to achieve the optimal O(1/k²) rate, which is an (unsurprisingly) stronger constraint than in the basic case. A sufficient condition is for ‖ek‖ and √εk to decrease as O(1/k^(2+δ)) for any δ > 0. Note that, as opposed to Proposition 1, which is stated for the average iterate, this bound is for the last iterate xk.\n\nAgain, it is interesting to see what happens when the summability assumption is not met. 
First, if ‖ek‖ or √εk decreases at a rate of O(1/k), then Ãk grows as O(log k) (note that B̃k is always smaller than Ãk), yielding a convergence rate of O(log² k / k²) for f(xk) − f(x∗). Also, and perhaps more interestingly, if ‖ek‖ or √εk decreases at a rate of O(1/k²), then k(‖ek‖ + √εk) decreases as O(1/k) and Eq. (6) does not guarantee convergence of the function values. More generally, the form of Ãk and B̃k indicates that errors have a greater effect on the accelerated method than on the basic method. Hence, as also discussed in [22], unlike in the error-free case the accelerated method may not necessarily be better than the basic method, because it is more sensitive to errors in the computation.\n\nIn the case where g is strongly convex it is possible to obtain linear convergence rates that depend on the ratio γ = µ/L, as opposed to the sublinear convergence rates discussed above. In particular, we obtain the following convergence rate on the iterates of the basic proximal-gradient method:\n\nProposition 3 (Basic proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with yk = xk. Then, for all k ≥ 1, we have:\n\n‖xk − x∗‖ ≤ (1 − γ)^k (‖x0 − x∗‖ + Āk), (7)\n\nwith Āk = Σ_{i=1}^k (1 − γ)^(−i) (‖ei‖/L + √(2εi/L)).\n\nA consequence of this proposition is that we obtain a linear rate of convergence even in the presence of errors, provided that ‖ek‖ and √εk decrease linearly to 0. 
If they do so at a rate of Q′ < (1 − γ), then the convergence rate of ‖xk − x∗‖ is linear with constant (1 − γ), as in the error-free algorithm. If we have Q′ > (1 − γ), then the convergence of ‖xk − x∗‖ is linear with constant Q′. If we have Q′ = (1 − γ), then ‖xk − x∗‖ converges to 0 as O(k(1 − γ)^k) = o([(1 − γ) + δ′]^k) for all δ′ > 0.\n\nFinally, we consider the accelerated proximal-gradient algorithm when g is strongly convex. We focus on a basic variant of the algorithm where βk is set to (1 − √γ)/(1 + √γ) [30, §2.2.1]:\n\nProposition 4 (Accelerated proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with yk = xk + ((1 − √γ)/(1 + √γ))(xk − xk−1). Then, for all k ≥ 1, we have\n\nf(xk) − f(x∗) ≤ (1 − √γ)^k (√(2(f(x0) − f(x∗))) + Âk√(2/µ) + √(B̂k))², (8)\n\nwith Âk = Σ_{i=1}^k (‖ei‖ + √(2Lεi))(1 − √γ)^(−i/2), B̂k = Σ_{i=1}^k εi(1 − √γ)^(−i).\n\nNote that while we have stated the result in terms of function values, we obtain an analogous result on the iterates, because by strong convexity of f we have (µ/2)‖xk − x∗‖² ≤ f(xk) − f(x∗).\n\nThis proposition implies that we obtain a linear rate of convergence in the presence of errors provided that ‖ek‖² and εk decrease linearly to 0. If they do so at a rate Q′ < (1 − √γ), then the constant is (1 − √γ), while if Q′ > (1 − √γ) then the constant will be Q′. Thus, the accelerated inexact proximal-gradient method will have a faster convergence rate than the exact basic proximal-gradient method provided that Q′ < (1 − γ). Oddly, in our analysis of the strongly convex case, the accelerated method is less sensitive to errors than the basic method. However, unlike the basic method, the accelerated method requires knowing µ in addition to L. If µ is misspecified, then the convergence rate of the accelerated method may be slower than that of the basic method.\n\n5 Experiments\n\nWe tested the basic inexact proximal-gradient and accelerated proximal-gradient methods on the CUR-like factorization optimization problem introduced in [33] to approximate a given matrix W:\n\nmin_X (1/2)‖W − WXW‖F² + λrow Σ_{i=1}^{nr} ‖X^i‖p + λcol Σ_{j=1}^{nc} ‖X_j‖p,\n\nwhere X^i denotes the i-th row and X_j the j-th column of X. Under an appropriate choice of p, this optimization problem yields a matrix X with sparse rows and sparse columns, meaning that entire rows and columns of the matrix X are set to exactly zero. In [33], the authors used an accelerated proximal-gradient method and chose p = ∞, since under this choice the proximity operator can be computed exactly. However, this choice has the undesirable effect that it also encourages all values in the same row (or column) to have the same magnitude. The more natural choice of p = 2 was not explored, since in this case there is no known algorithm to exactly compute the proximity operator. Our experiments focused on the case of p = 2. 
In this case, it is possible to very quickly compute an approximate proximity operator using the block coordinate descent (BCD) algorithm presented in [12], which is equivalent to the proximal variant of Dykstra's algorithm introduced by [34]. In our implementation of the BCD method, we alternate between computing the proximity operator with respect to the rows and with respect to the columns. Since the BCD method allows us to compute a duality gap when solving the proximal problem, we can run the method until the duality gap is below a given error threshold εk to find an xk+1 satisfying (4).\n\nIn our experiments, we used the four data sets examined by [33]¹ and we chose λrow = .01 and λcol = .01, which yielded approximately 25-40% non-zero entries in X (depending on the data set). Rather than assuming we are given the Lipschitz constant L, on the first iteration we set L to 1 and, following [2], we double our estimate anytime g(xk) > g(yk−1) + ⟨g′(yk−1), xk − yk−1⟩ + (L/2)‖xk − yk−1‖². We tested three different ways to terminate the approximate proximal problem, each parameterized by a parameter α:\n\n• εk = 1/k^α: running the BCD algorithm until the duality gap is below 1/k^α.\n• εk = α: running the BCD algorithm until the duality gap is below α.\n• n = α: running the BCD algorithm for a fixed number of iterations α.\n\nNote that all three strategies lead to global convergence in the case of the basic proximal-gradient method, the first two give a convergence rate up to some fixed optimality tolerance, and in this paper we have shown that the first one (for large enough α) yields a convergence rate for an arbitrary optimality tolerance. Note that the iterates produced by the BCD iterations are sparse, so we expected the algorithms to spend the majority of their time solving the proximity problem. 
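The doubling rule for estimating L described above can be sketched as follows; this is our own one-dimensional illustration with h = 0 (so the prox is the identity), not the paper's implementation:

```python
def backtrack_step(g, g_grad, prox, y, L):
    """One proximal-gradient step with the doubling rule: double the
    estimate L until g(x) <= g(y) + g'(y) * (x - y) + (L/2) * (x - y)^2."""
    while True:
        x = prox(y - g_grad(y) / L, L)
        if g(x) <= g(y) + g_grad(y) * (x - y) + (L / 2) * (x - y) ** 2:
            return x, L
        L *= 2


# Toy smooth term g(x) = 2 * x^2, whose true Lipschitz constant is 4.
g = lambda x: 2.0 * x ** 2
g_grad = lambda x: 4.0 * x
identity_prox = lambda v, L: v  # h = 0, so the prox is the identity
```

Starting from y = 1 with the initial estimate L = 1, the rule doubles the estimate twice, accepts the step at L = 4 (the true Lipschitz constant), and returns the minimizer x = 0.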
Thus, we used the function value against the number of BCD iterations as a measure of performance. We plot the results after 500 BCD iterations for the first two data sets for the proximal-gradient method in Figure 1, and for the accelerated proximal-gradient method in Figure 2. The results for the other two data sets are similar, and are included in [29]. In these plots, the first column varies α using the choice εk = 1/k^α, the second column varies α using the choice εk = α, and the third column varies α using the choice n = α. We also include one of the best methods from the first column in the second and third columns as a reference.\n\nIn the context of proximal-gradient methods the choice of εk = 1/k³, which is one choice that achieves the fastest convergence rate according to our analysis, gives the best performance across all four data sets. However, in these plots we also see that reasonable performance can be achieved by any of the three strategies above, provided that α is chosen carefully. For example, choosing n = 3 or choosing εk = 10^(−6) both give reasonable performance. However, these are only empirical observations for these data sets, and they may be ineffective for other data sets or if we change the number of iterations, while we have given theoretical justification for the choice εk = 1/k³.\n\nSimilar trends are observed for the case of accelerated proximal-gradient methods, though the choice of εk = 1/k³ (which no longer achieves the fastest convergence rate according to our analysis) no longer dominates the other methods in the accelerated setting. For the SRBCT data set the choice εk = 1/k⁴, which is a choice that achieves the fastest convergence rate up to a poly-logarithmic factor, yields better performance than εk = 1/k³. Interestingly, the only choice that yields the fastest possible convergence rate (εk = 1/k⁵) had reasonable performance but did not give the best performance on any data set. This seems to reflect the trade-off between performing inner BCD iterations to achieve a small duality gap and performing outer gradient iterations to decrease the value of f. Also, the constant terms which were not taken into account in the analysis do play an important role here, due to the relatively small number of outer iterations performed.\n\n¹ The datasets are freely available at http://www.gems-system.org.\n\nFigure 1: Objective function against number of proximal iterations for the proximal-gradient method with different strategies for terminating the approximate proximity calculation. The top row is for the 9 Tumors data, the bottom row is for the Brain Tumor1 data.\n\nFigure 2: Objective function against number of proximal iterations for the accelerated proximal-gradient method with different strategies for terminating the approximate proximity calculation. The top row is for the 9 Tumors data, the bottom row is for the Brain Tumor1 data.\n\n[Plots for Figures 1 and 2: objective value (log scale) over 100-500 BCD iterations, comparing εk = 1/k^α for α = 1, ..., 5, fixed gaps εk ∈ {10^(−2), 10^(−4), 10^(−6), 10^(−10)}, and fixed inner iteration counts n ∈ {1, 2, 3, 5}.]\n\n
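The gap-based termination strategy εk = 1/k^α can be sketched on a prox problem with a simple dual. For h(x) = λ‖x‖₁, problem (2) has the dual max {⟨u, y⟩ − ‖u‖²/(2L) : ‖u‖∞ ≤ λ} with primal candidate x(u) = y − u/L, so projected dual ascent with the duality gap as a stopping test returns an x satisfying (4) with the requested εk. This is our own illustration; the paper's inner solver is instead the BCD method of [12] for overlapping group norms:

```python
def approx_prox_l1(y, lam, L, eps, step=0.5):
    """Approximately solve (2) for h = lam * ||.||_1 by projected dual
    ascent, stopping once the duality gap is below eps (so (4) holds)."""
    u = [0.0] * len(y)  # dual variable, constrained to ||u||_inf <= lam
    while True:
        x = [yi - ui / L for yi, ui in zip(y, u)]  # primal candidate x(u)
        primal = sum(L / 2 * (xi - yi) ** 2 + lam * abs(xi)
                     for xi, yi in zip(x, y))
        dual = sum(ui * yi - ui ** 2 / (2 * L) for ui, yi in zip(u, y))
        if primal - dual <= eps:  # duality gap certifies eps-optimality
            return x
        # projected gradient ascent step on the concave dual
        u = [min(max(ui + step * L * (yi - ui / L), -lam), lam)
             for ui, yi in zip(u, y)]
```

For an outer iteration counter k one would call approx_prox_l1(y, lam, L, 1.0 / k ** 3), matching the εk = 1/k³ schedule highlighted above; on this toy problem the returned point agrees with exact soft-thresholding up to the requested gap.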
6 Discussion\n\nAlternatives to inexact proximal methods for solving structured sparsity problems are smoothing methods [35] and alternating direction methods [36]. However, a major disadvantage of both of these approaches is that the iterates are not sparse, so they cannot take advantage of the sparsity of the problem when running the algorithm. In contrast, the method proposed in this paper has the appealing property that it tends to generate sparse iterates. Further, the accelerated smoothing method only has a convergence rate of O(1/k), and the performance of alternating direction methods is often sensitive to the exact choice of their penalty parameter. On the other hand, while our analysis suggests using a sequence of errors like O(1/k^α) for α large enough, the practical performance of inexact proximal-gradient methods will be sensitive to the exact choice of this sequence.\n\nAlthough we have illustrated the use of our results in the context of a structured sparsity problem, inexact proximal-gradient methods are also used in other applications such as total-variation [7, 8] and nuclear-norm [9, 10] regularization. This work provides a theoretical justification for using inexact proximal-gradient methods in these and other applications, and suggests some guidelines for practitioners that do not want to lose the appealing convergence rates of these methods. 
Further, although our experiments and much of our discussion focus on errors in the calculation of the proximity operator, our analysis also allows for an error in the calculation of the gradient. This may also be useful in a variety of contexts. For example, errors in the calculation of the gradient arise when fitting undirected graphical models and using an iterative method to approximate the gradient of the log-partition function [37]. Other examples include using a reduced set of training examples within kernel methods [38] or subsampling to solve semidefinite programming problems [39].

In our analysis, we assume that the smoothness constant L is known, but it would be interesting to extend methods for estimating L in the exact case [2] to the case of inexact algorithms. In the context of accelerated methods for strongly convex optimization, our analysis also assumes that µ is known, and it would be interesting to explore variants that do not make this assumption. We also note that if the basic proximal-gradient method is given knowledge of µ, then our analysis can be modified to obtain a faster linear convergence rate of (1 − γ)/(1 + γ) instead of (1 − γ) for strongly-convex optimization, using a step size of 2/(µ + L); see Theorem 2.1.15 of [30]. Finally, we note that there has been recent interest in inexact proximal Newton-like methods [40], and it would be interesting to analyze the effect of errors on the convergence rates of these methods.

Acknowledgements  Mark Schmidt, Nicolas Le Roux, and Francis Bach are supported by the European Research Council (SIERRA-ERC-239993).

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[2] Y. Nesterov. Gradient methods for minimizing composite objective function.
CORE Discussion Papers, (2007/76), 2007.

[3] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.

[4] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[5] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.

[7] J. Fadili and G. Peyré. Total variation projection with first order schemes. IEEE Transactions on Image Processing, 20(3):657–669, 2011.

[8] X. Chen, S. Kim, Q. Lin, J.G. Carbonell, and E.P. Xing. Graph-structured multi-task regression and an efficient optimization method for general fused Lasso. arXiv:1005.3579v1, 2010.

[9] J.-F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4), 2010.

[10] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1):321–353, 2011.

[11] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. ICML, 2009.

[12] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. JMLR, 12:2297–2334, 2011.

[13] A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. ICML, 2011.

[14] J. Liu and J. Ye. Fast overlapping group Lasso. arXiv:1009.0306v1, 2010.

[15] M. Schmidt and K. Murphy.
Convex structure learning in log-linear models: Beyond pairwise potentials. AISTATS, 2010.

[16] M. Patriksson. A unified framework of descent algorithms for nonlinear programs and variational inequalities. PhD thesis, Department of Mathematics, Linköping University, Sweden, 1995.

[17] P.L. Combettes. Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization, 53(5-6):475–504, 2004.

[18] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. JMLR, 10:2873–2898, 2009.

[19] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. JMLR, 10:777–801, 2009.

[20] A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.

[21] M. Baes. Estimate sequence methods: extensions and approximations. IFOR internal report, ETH Zurich, 2009.

[22] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. CORE Discussion Papers, (2011/02), 2011.

[23] A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.

[24] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46-47(1):157–178, 1993.

[25] M.P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. arXiv:1104.2373, 2011.

[26] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

[27] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

[28] S. Villa, S. Salzo, L. Baldassarre, and A. Verri.
Accelerated and inexact forward-backward algorithms. Optimization Online, 2011.

[29] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv:1109.2415v2, 2011.

[30] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.

[31] D.P. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009.

[32] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2008.

[33] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720, 2011.

[34] H.H. Bauschke and P.L. Combettes. A Dykstra-like algorithm for two monotone operators. Pacific Journal of Optimization, 4(3):383–391, 2008.

[35] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[36] P.L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H.H. Bauschke, R.S. Burachik, P.L. Combettes, V. Elser, D.R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.

[37] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. AISTATS, 2003.

[38] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[39] A. d'Aspremont. Subsampling algorithms for semidefinite programming. arXiv:0803.1990v5, 2009.

[40] M. Schmidt, D. Kim, and S. Sra. Projected Newton-type methods in machine learning. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning.
MIT Press, 2011.