{"title": "Better Approximation and Faster Algorithm Using the Proximal Average", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 466, "abstract": "It is a common practice to approximate \u201ccomplicated\u201d functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool, the proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims.", "full_text": "Better Approximation and Faster Algorithm Using the Proximal Average\n\nYaoliang Yu\n\nDepartment of Computing Science, University of Alberta, Edmonton AB T6G 2E8, Canada\n\nyaoliang@cs.ualberta.ca\n\nAbstract\n\nIt is a common practice to approximate \u201ccomplicated\u201d functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool, the proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. 
Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims.\n\n1 Introduction\n\nIn many scientific areas, an important methodology that has withstood the test of time is the approximation of \u201ccomplicated\u201d functions by those that are easier to handle. For instance, Taylor\u2019s expansion in calculus [1], essentially a polynomial approximation of differentiable functions, has fundamentally changed analysis, and mathematics more broadly. Approximations are also ubiquitous in optimization algorithms, e.g. various gradient-type algorithms approximate the objective function with a quadratic upper bound. In some (if not all) cases, there are multiple ways to make the approximation, and one usually has this freedom of choice. It is perhaps not hard to convince oneself that no approximation works best in all scenarios. And one would probably also agree that a specific form of approximation should be favored if it suits our ultimate goal well. Despite all this common sense, smooth approximations still dominate in optimization algorithms, bypassing some recent advances on optimizing nonsmooth functions [2, 3]. Part of the reason, we believe, is the lack of new technical tools.\n\nWe consider the composite minimization problem where the objective consists of a smooth loss function and a sum of nonsmooth functions. Such problems have received increasing attention due to the rise of structured sparsity [4], notably the overlapping group lasso [5], the graph-guided fused lasso [6] and some others. These structured regularizers, although they greatly enhance our modeling capability, introduce significant new computational challenges as well. 
Popular gradient-type algorithms dealing with such composite problems include the generic subgradient method [7], (accelerated) proximal gradient (APG) [2, 3], and the smoothed accelerated proximal gradient (S-APG) [8]. The subgradient method is applicable to any nonsmooth function, although the convergence rate is rather slow. APG, being a recent advance, can handle simple functions [9] but for more complicated structured regularizers, an inner iterative procedure is needed, resulting in an overall convergence rate that could be as slow as the subgradient method [10]. Lastly, S-APG simply runs APG on a smooth approximation of the original objective, resulting in a much improved convergence rate.\n\nOur work is inspired by the recent advance on nonsmooth optimization [2, 3], of which the building block is the proximal map of the nonsmooth function. This proximal map is available in closed-form for simple functions but can be quite expensive for more complicated functions such as a sum of nonsmooth functions we consider here. A key observation we make is that oftentimes the proximal map for each individual summand can be easily computed, therefore a bold idea is to simply use the sum of proximal maps, pretending that the proximal map is a linear operator. Somewhat surprisingly, this naive choice, when combined with APG, results in a novel proximal algorithm that is strictly better than S-APG, while keeping per-step complexity unchanged. We justify our method via a new tool from convex analysis, the proximal average [11]. In essence, instead of smoothing the nonsmooth function, we use a nonsmooth approximation whose proximal map is cheap to evaluate; after all, this is all we need to run APG.\n\nWe formally state our problem in Section 2, along with the proposed algorithm. After recalling the relevant tools from convex analysis in Section 3 we provide the theoretical justification of our method in Section 4. 
Related works are discussed in Section 5. We test the proposed algorithm in Section 6 and conclude in Section 7.\n\n2 Problem Formulation\n\nWe are interested in solving the following composite minimization problem:\n\n$$\\min_{x \\in \\mathbb{R}^d} \\; \\ell(x) + \\bar f(x), \\quad \\text{where} \\quad \\bar f(x) = \\sum_{k=1}^K \\alpha_k f_k(x). \\tag{1}$$\n\nHere $\\ell$ is convex with $L_0$-Lipschitz continuous gradient w.r.t. the Euclidean norm $\\|\\cdot\\|$, and $\\alpha_k \\ge 0$, $\\sum_k \\alpha_k = 1$. The usual regularization constant that balances the two terms in (1) is absorbed into the loss $\\ell$. For the functions $f_k$, we assume\n\nAssumption 1. Each $f_k$ is convex and $M_k$-Lipschitz continuous w.r.t. the Euclidean norm $\\|\\cdot\\|$.\n\nThe abbreviation $\\overline{M^2} = \\sum_{k=1}^K \\alpha_k M_k^2$ is adopted throughout.\n\nWe are interested in the general case where the functions $f_k$ need not be differentiable. As mentioned in the introduction, a generic scheme that solves (1) is the subgradient method [7], of which each step requires merely an arbitrary subgradient of the objective. With a suitable stepsize, the subgradient method converges$^1$ in at most $O(1/\\epsilon^2)$ steps where $\\epsilon > 0$ is the desired accuracy. Although general, the subgradient method is exceedingly slow, making it unsuitable for many practical applications.\n\n$^1$ In this paper we satisfy ourselves with convergence in terms of function values, although with additional assumptions/efforts it is possible to argue for convergence in terms of the iterates.\n\nAnother recent algorithm for solving (1) is the (accelerated) proximal gradient (APG) [2, 3], of which each iteration needs to compute the proximal map of the nonsmooth part $\\bar f$ in (1):\n\n$$\\mathsf{P}^{1/L_0}_{\\bar f}(x) = \\operatorname*{argmin}_y \\; \\tfrac{L_0}{2} \\|x - y\\|^2 + \\bar f(y).$$\n\n(Recall that $L_0$ is the Lipschitz constant of the gradient of the smooth part $\\ell$ in (1).) Provided that the proximal map can be computed in constant time, it can be shown that APG converges within $O(1/\\sqrt{\\epsilon})$ complexity, significantly better than the subgradient method. For some simple functions, the proximal map indeed is available in closed-form, see [9] for a nice survey. However, for more complicated functions such as the one we consider here, the proximal map itself is expensive to compute and an inner iterative subroutine is required. Somewhat disappointingly, recent analysis has shown that such a two-loop procedure can be as slow as the subgradient method [10].\n\nYet another approach, popularized by Nesterov [8], is to approximate each nonsmooth component $f_k$ with a smooth function and then run APG. By carefully balancing the approximation and the convergence requirement of APG, the smoothed accelerated proximal gradient (S-APG) proposed in [8] converges in at most $O(\\sqrt{1/\\epsilon^2 + 1/\\epsilon})$ steps, again much better than the subgradient method. The main point of this paper is to further improve S-APG, in perhaps a surprisingly simple way.\n\nThe key assumption that we will exploit is the following:\n\nAssumption 2. Each proximal map $\\mathsf{P}^\\mu_{f_k}$ can be computed \u201ceasily\u201d for any $\\mu > 0$.\n\nWe prefer to leave the exact meaning of \u201ceasily\u201d unspecified, but roughly speaking, the proximal map should be no more expensive than computing the gradient of the smooth part $\\ell$ so that it does not become the bottleneck. Both Assumption 1 and Assumption 2 are satisfied in many important applications (examples will follow). As will also become clear later, these assumptions are exactly those needed by S-APG.\n\nUnfortunately, in general, there is no known efficient way to reduce the proximal map of the average $\\bar f$ to the proximal maps of its individual components $f_k$, therefore the fast scheme APG is not readily applicable. The main difficulty, of course, is due to the nonlinearity of the proximal map $\\mathsf{P}^\\mu_f$, when treated as an operator on the function $f$. Despite this fact, we will \u201cnaively\u201d pretend that the proximal map is linear and use\n\n$$\\mathsf{P}^\\mu_{\\bar f} \\overset{?}{\\approx} \\sum_{k=1}^K \\alpha_k \\mathsf{P}^\\mu_{f_k}. \\tag{2}$$\n\nUnder this approximation, the fast scheme APG can be applied. We give one particular realization (PA-APG) in Algorithm 1 based on the FISTA in [2]. A simpler (though slower) version (PA-PG) based on ISTA [2] is also provided in Algorithm 2. Clearly both algorithms are easily parallelizable if $K$ is large. We remark that any other variation of APG, e.g. [8], is equally well applicable. Of course, when $K = 1$, our algorithm reduces to the corresponding APG scheme.\n\nAlgorithm 1: PA-APG.\n1: Initialize $x_0 = y_1$, $\\mu$, $\\eta_1 = 1$.\n2: for $t = 1, 2, \\ldots$ do\n3:   $z_t = y_t - \\mu \\nabla \\ell(y_t)$,\n4:   $x_t = \\sum_k \\alpha_k \\mathsf{P}^\\mu_{f_k}(z_t)$,\n5:   $\\eta_{t+1} = \\frac{1 + \\sqrt{1 + 4\\eta_t^2}}{2}$,\n6:   $y_{t+1} = x_t + \\frac{\\eta_t - 1}{\\eta_{t+1}} (x_t - x_{t-1})$.\n7: end for\n\nAlgorithm 2: PA-PG.\n1: Initialize $x_0$, $\\mu$.\n2: for $t = 1, 2, \\ldots$ do\n3:   $z_t = x_{t-1} - \\mu \\nabla \\ell(x_{t-1})$,\n4:   $x_t = \\sum_k \\alpha_k \\mathsf{P}^\\mu_{f_k}(z_t)$.\n5: end for\n\nAt this point, one might be suspicious about the usefulness of the \u201cnaive\u201d approximation in (2). Before addressing this well-deserved question, let us first point out two important applications where Assumption 1 and Assumption 2 are naturally satisfied.\n\nExample 1 (Overlapping group lasso, [5]). 
In this example, $f_k(x) = \\|x_{g_k}\\|$ where $g_k$ is a group (subset) of variables and $x_g$ denotes a copy of $x$ with all variables not contained in the group $g$ set to 0. This group regularizer has proven quite useful in high-dimensional statistics with the capability of selecting meaningful groups of features [5]. In the general case where the groups could overlap as needed, $\\mathsf{P}^\\mu_{\\bar f}$ cannot be computed easily.\n\nClearly each $f_k$ is convex and 1-Lipschitz continuous w.r.t. $\\|\\cdot\\|$, i.e., $M_k = 1$ in Assumption 1. Moreover, the proximal map $\\mathsf{P}^\\mu_{f_k}$ is simply a re-scaling of the variables in group $g_k$, that is\n\n$$[\\mathsf{P}^\\mu_{f_k}(x)]_j = \\begin{cases} x_j, & j \\notin g_k \\\\ (1 - \\mu/\\|x_{g_k}\\|)_+ \\, x_j, & j \\in g_k \\end{cases}, \\tag{3}$$\n\nwhere $(\\lambda)_+ = \\max\\{\\lambda, 0\\}$. Therefore, both of our assumptions are met.\n\nExample 2 (Graph-guided fused lasso, [6]). This example is an enhanced version of the fused lasso [12], with some graph structure exploited to improve feature selection in biostatistic applications [6]. Specifically, given some graph whose nodes correspond to the feature variables, we let $f_{ij}(x) = |x_i - x_j|$ for every edge $(i, j) \\in E$. For a general graph, the proximal map of the regularizer $\\bar f = \\sum_{(i,j) \\in E} \\alpha_{ij} f_{ij}$, with $\\alpha_{ij} \\ge 0$, $\\sum_{(i,j) \\in E} \\alpha_{ij} = 1$, is not easily computable.\n\nSimilarly to the above, each $f_{ij}$ is Lipschitz continuous w.r.t. the Euclidean norm, with constant $\\sqrt{2}$ since $|x_i - x_j| \\le \\sqrt{2}\\,\\|x\\|$. Moreover, the proximal map $\\mathsf{P}^\\mu_{f_{ij}}$ is easy to compute: it moves $x_i$ and $x_j$ towards each other by the same amount,\n\n$$[\\mathsf{P}^\\mu_{f_{ij}}(x)]_s = \\begin{cases} x_s, & s \\notin \\{i, j\\} \\\\ x_s - \\mathrm{sign}(x_i - x_j) \\min\\{\\mu, |x_i - x_j|/2\\}, & s = i \\\\ x_s + \\mathrm{sign}(x_i - x_j) \\min\\{\\mu, |x_i - x_j|/2\\}, & s = j \\end{cases}. \\tag{4}$$\n\nAgain, both our assumptions are satisfied.\n\nNote that in both examples we could have incorporated weights into the component functions $f_k$ or $f_{ij}$, which amounts to changing $\\alpha_k$ or $\\alpha_{ij}$ accordingly. 
We also remark that there are other applications that fall into our consideration, but for illustration purposes we shall content ourselves with the above two examples. More conveniently, both examples have been tried with S-APG [13], and thus constitute a natural benchmark for our new algorithm.\n\n3 Technical Tools\n\nTo justify our new algorithm, we need a few technical tools from convex analysis [14]. Let our domain $H$ be a real Hilbert space with the inner product $\\langle \\cdot, \\cdot \\rangle$ and the induced norm $\\|\\cdot\\|$. Denote $\\Gamma_0$ as the set of all lower semicontinuous proper convex functions $f : H \\to \\mathbb{R} \\cup \\{\\infty\\}$. It is well-known that the Fenchel conjugation\n\n$$f^*(y) = \\sup_x \\; \\langle x, y \\rangle - f(x)$$\n\nis a bijection and involution on $\\Gamma_0$ (i.e. $(f^*)^* = f$). For convenience, throughout we let $q = \\tfrac{1}{2} \\|\\cdot\\|^2$ ($q$ for \u201cquadratic\u201d). Note that $q$ is the only function which coincides with its Fenchel conjugate. Another convention that we borrow from convex analysis is to write $(f\\mu)(x) = \\mu f(\\mu^{-1} x)$ for $\\mu > 0$. We easily verify $(\\mu f)^* = f^* \\mu$ and also $(f\\mu)^* = \\mu f^*$.\n\nFor any $f \\in \\Gamma_0$, we define its Moreau envelope (with parameter $\\mu > 0$) [14, 15]\n\n$$\\mathsf{M}^\\mu_f(x) = \\min_y \\; \\tfrac{1}{2\\mu} \\|x - y\\|^2 + f(y), \\tag{5}$$\n\nand correspondingly the proximal map\n\n$$\\mathsf{P}^\\mu_f(x) = \\operatorname*{argmin}_y \\; \\tfrac{1}{2\\mu} \\|x - y\\|^2 + f(y). \\tag{6}$$\n\nSince $f$ is closed convex and $\\|\\cdot\\|^2$ is strongly convex, the proximal map is well-defined and single-valued. As mentioned before, the proximal map is the key component of fast schemes such as APG. We summarize some nice properties of the Moreau envelope and the proximal map as:\n\nProposition 1. Let $\\mu, \\lambda > 0$, $f \\in \\Gamma_0$, and $\\mathrm{Id}$ be the identity map, then\ni). $\\mathsf{M}^\\mu_f \\in \\Gamma_0$ and $(\\mathsf{M}^\\mu_f)^* = f^* + \\mu q$;\nii). $\\mathsf{M}^\\mu_f \\le f$, $\\inf_x \\mathsf{M}^\\mu_f(x) = \\inf_x f(x)$, and $\\operatorname*{argmin}_x \\mathsf{M}^\\mu_f(x) = \\operatorname*{argmin}_x f(x)$;\niii). $\\mathsf{M}^\\mu_f$ is differentiable with $\\nabla \\mathsf{M}^\\mu_f = \\tfrac{1}{\\mu} (\\mathrm{Id} - \\mathsf{P}^\\mu_f)$;\niv). $\\mathsf{M}^\\mu_{\\lambda f} = \\lambda \\mathsf{M}^{\\lambda\\mu}_f$ and $\\mathsf{P}^\\mu_{\\lambda f} = \\mathsf{P}^{\\lambda\\mu}_f = (\\mathsf{P}^\\mu_{f \\lambda^{-1}}) \\lambda$;\nv). $\\mathsf{M}^\\lambda_{\\mathsf{M}^\\mu_f} = \\mathsf{M}^{\\lambda+\\mu}_f$ and $\\mathsf{P}^\\lambda_{\\mathsf{M}^\\mu_f} = \\tfrac{\\mu}{\\lambda+\\mu} \\mathrm{Id} + \\tfrac{\\lambda}{\\lambda+\\mu} \\mathsf{P}^{\\lambda+\\mu}_f$;\nvi). $\\mu \\mathsf{M}^\\mu_f + (\\mathsf{M}^{1/\\mu}_{f^*}) \\mu = q$ and $\\mathsf{P}^\\mu_f + (\\mathsf{P}^{1/\\mu}_{f^*}) \\mu = \\mathrm{Id}$.\n\ni) is the well-known duality between infimal convolution and summation. ii), albeit being trivial, is the driving force behind the proximal point algorithm [16]. iii) justifies the \u201cniceness\u201d of the Moreau envelope and connects it with the proximal map. iv) and v) follow from simple algebra. And lastly vi), known as Moreau\u2019s identity [15], plays an important role in the early development of convex analysis. We remind the reader that $(\\mathsf{M}^\\mu_f)^*$ in general is different from $\\mathsf{M}^\\mu_{f^*}$.\n\nFix $\\mu > 0$. Let $\\mathsf{SC}_\\mu \\subseteq \\Gamma_0$ denote the class of $\\mu$-strongly convex functions, that is, functions $f$ such that $f - \\mu q$ is convex. Similarly, let $\\mathsf{SS}_\\mu \\subseteq \\Gamma_0$ denote the class of finite-valued functions whose gradient is $\\mu$-Lipschitz continuous (w.r.t. the norm $\\|\\cdot\\|$). A well-known duality between strong convexity and smoothness is that for $f \\in \\Gamma_0$, we have $f \\in \\mathsf{SC}_\\mu$ iff $f^* \\in \\mathsf{SS}_{1/\\mu}$, cf. [17, Theorem 18.15]. Based on this duality, we have the next result which turns out to be critical. (Proof in Appendix A)\n\nProposition 2. Fix $\\mu > 0$. 
The Moreau envelope map $\\mathsf{M}^\\mu : \\Gamma_0 \\to \\mathsf{SS}_{1/\\mu}$ that sends $f \\in \\Gamma_0$ to $\\mathsf{M}^\\mu_f$ is bijective, increasing, and concave on any convex subset of $\\Gamma_0$ (under the pointwise order).\n\nIt is clear that $\\mathsf{SS}_{1/\\mu}$ is a convex subset of $\\Gamma_0$, which motivates the definition of the proximal average, the key object to us. Fix constants $\\alpha_k \\ge 0$ with $\\sum_{k=1}^K \\alpha_k = 1$. Recall that $\\bar f = \\sum_k \\alpha_k f_k$ with each $f_k \\in \\Gamma_0$, i.e. $\\bar f$ is the convex combination of the component functions $\\{f_k\\}$ under the weight $\\{\\alpha_k\\}$. Note that we always assume $\\bar f \\in \\Gamma_0$ (the exception $\\bar f \\equiv \\infty$ is clearly uninteresting).\n\nDefinition 1 (Proximal Average, [11, 15]). Denote $\\mathbf{f} = (f_1, \\ldots, f_K)$ and $\\mathbf{f}^* = (f_1^*, \\ldots, f_K^*)$. The proximal average $\\mathsf{A}^\\mu_{\\mathbf{f},\\alpha}$, or simply $\\mathsf{A}^\\mu$ when the component functions and weights are clear from context, is the unique function $h \\in \\Gamma_0$ such that $\\mathsf{M}^\\mu_h = \\sum_{k=1}^K \\alpha_k \\mathsf{M}^\\mu_{f_k}$.\n\nIndeed, the existence of the proximal average follows from the surjectivity of $\\mathsf{M}^\\mu$ while the uniqueness follows from the injectivity of $\\mathsf{M}^\\mu$, both proven in Proposition 2. The main property of the proximal average, as seen from its definition, is that its Moreau envelope is the convex combination of the Moreau envelopes of the component functions. By iii) of Proposition 1 we immediately obtain\n\n$$\\mathsf{P}^\\mu_{\\mathsf{A}^\\mu} = \\sum_{k=1}^K \\alpha_k \\mathsf{P}^\\mu_{f_k}. \\tag{7}$$\n\nRecall that the right-hand side is exactly the approximation we employed in Section 2.\n\nInterestingly, using the properties we summarized in Proposition 1, one can show that the Fenchel conjugate of the proximal average, denoted as $(\\mathsf{A}^\\mu)^*$, enjoys a similar property [11]:\n\n$$\\Big[ \\mathsf{M}^{1/\\mu}_{(\\mathsf{A}^\\mu)^*} \\Big] \\mu = q - \\mu \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} = q - \\mu \\sum_{k=1}^K \\alpha_k \\mathsf{M}^\\mu_{f_k} = \\sum_{k=1}^K \\alpha_k (q - \\mu \\mathsf{M}^\\mu_{f_k}) = \\sum_{k=1}^K \\alpha_k \\big[ (\\mathsf{M}^{1/\\mu}_{f_k^*}) \\mu \\big] = \\Big[ \\sum_{k=1}^K \\alpha_k \\mathsf{M}^{1/\\mu}_{f_k^*} \\Big] \\mu,$$\n\nthat is, $\\mathsf{M}^{1/\\mu}_{(\\mathsf{A}^\\mu_{\\mathbf{f},\\alpha})^*} = \\sum_{k=1}^K \\alpha_k \\mathsf{M}^{1/\\mu}_{f_k^*} = \\mathsf{M}^{1/\\mu}_{\\mathsf{A}^{1/\\mu}_{\\mathbf{f}^*,\\alpha}}$, therefore by the injective property established in Proposition 2:\n\n$$(\\mathsf{A}^\\mu_{\\mathbf{f},\\alpha})^* = \\mathsf{A}^{1/\\mu}_{\\mathbf{f}^*,\\alpha}. \\tag{8}$$\n\nFrom its definition it is also possible to derive an explicit formula for the proximal average (although for our purpose only the existence is needed):\n\n$$\\mathsf{A}^\\mu_{\\mathbf{f},\\alpha} = \\Big( \\Big( \\sum_{k=1}^K \\alpha_k \\mathsf{M}^\\mu_{f_k} \\Big)^* - \\mu q \\Big)^* = \\Big( \\sum_{k=1}^K \\alpha_k \\mathsf{M}^{1/\\mu}_{f_k^*} \\Big)^* - q\\mu, \\tag{9}$$\n\nwhere the second equality is obtained by conjugating (8) and applying the first equality to the conjugate. 
By the concavity and monotonicity of $\\mathsf{M}^\\mu$, we have the inequality\n\n$$\\mathsf{M}^\\mu_{\\bar f} \\ge \\sum_{k=1}^K \\alpha_k \\mathsf{M}^\\mu_{f_k} = \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} \\iff \\bar f \\ge \\mathsf{A}^\\mu. \\tag{10}$$\n\nThe above results (after Definition 1) are due to [11], although our treatment is slightly different. It is well-known that as $\\mu \\to 0$, $\\mathsf{M}^\\mu_f \\to f$ pointwise [14], which, under the Lipschitz assumption, can be strengthened to uniform convergence (Proof in Appendix B):\n\nProposition 3. Under Assumption 1 we have $0 \\le \\bar f - \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} \\le \\frac{\\mu \\overline{M^2}}{2}$.\n\nFor the proximal average, [11] showed that $\\mathsf{A}^\\mu \\to \\bar f$ pointwise, which again can be strengthened to uniform convergence (proof follows from (10) and Proposition 3 since $\\mathsf{A}^\\mu \\ge \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$):\n\nProposition 4. Under Assumption 1 we have $0 \\le \\bar f - \\mathsf{A}^\\mu \\le \\frac{\\mu \\overline{M^2}}{2}$.\n\nAs it turns out, S-APG approximates the nonsmooth function $\\bar f$ with the smooth function $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$ while our algorithm operates on the nonsmooth approximation $\\mathsf{A}^\\mu$ (note that it can be shown that $\\mathsf{A}^\\mu$ is smooth iff some component $f_i$ is smooth). By (10) and ii) in Proposition 1 we have\n\n$$\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} \\le \\mathsf{A}^\\mu \\le \\bar f, \\tag{11}$$\n\nmeaning that the proximal average $\\mathsf{A}^\\mu$ is a better under-approximation of $\\bar f$ than $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$.\n\nLet us compare the proximal average $\\mathsf{A}^\\mu$ with the smooth approximation $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$ on a 1-D example.\n\nExample 3. Let $f_1(x) = |x|$, $f_2(x) = \\max\\{x, 0\\}$. Clearly both are 1-Lipschitz continuous. Moreover, $\\mathsf{P}^\\mu_{f_1}(x) = \\mathrm{sign}(x)(|x| - \\mu)_+$, $\\mathsf{P}^\\mu_{f_2}(x) = (x - \\mu)_+ + x - (x)_+$,\n\n$$\\mathsf{M}^\\mu_{f_1}(x) = \\begin{cases} \\frac{x^2}{2\\mu}, & |x| \\le \\mu \\\\ |x| - \\mu/2, & \\text{otherwise} \\end{cases}, \\quad \\text{and} \\quad \\mathsf{M}^\\mu_{f_2}(x) = \\begin{cases} 0, & x \\le 0 \\\\ \\frac{x^2}{2\\mu}, & 0 \\le x \\le \\mu \\\\ x - \\mu/2, & \\text{otherwise} \\end{cases}.$$\n\nFinally, using (9) we obtain (with $\\alpha_1 = \\alpha$, $\\alpha_2 = 1 - \\alpha$)\n\n$$\\mathsf{A}^\\mu(x) = \\begin{cases} x, & x \\ge 0 \\\\ \\frac{\\alpha}{1-\\alpha} \\frac{x^2}{2\\mu}, & (\\alpha - 1)\\mu \\le x \\le 0 \\\\ -\\alpha x - (1 - \\alpha) \\frac{\\alpha\\mu}{2}, & x \\le (\\alpha - 1)\\mu \\end{cases}.$$\n\nFigure 1 depicts the case $\\alpha = 0.5$ with different values of the smoothing parameter $\\mu$.\n\nFigure 1 (three panels: $\\alpha = 0.5$ with $\\mu = 10$, $\\mu = 5$ and $\\mu = 1$): See Example 3 for context. As predicted $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} \\le \\mathsf{A}^\\mu \\le \\bar f$. Observe that the proximal average $\\mathsf{A}^\\mu$ remains nondifferentiable at 0 while $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$ is smooth everywhere. For $x \\ge 0$, $f_1 = f_2 = \\bar f = \\mathsf{A}^\\mu$ (the red circled line), thus the proximal average $\\mathsf{A}^\\mu$ is a strictly tighter approximation than smoothing. When $\\mu$ is small (right panel), $\\bar f \\approx \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu} \\approx \\mathsf{A}^\\mu$.\n\n4 Theoretical Justification\n\nGiven our development in the previous section, it is now clear that our proposed algorithm aims at solving the approximation\n\n$$\\min_x \\; \\ell(x) + \\mathsf{A}^\\mu(x). \\tag{12}$$\n\nThe next important piece is to show how a careful choice of $\\mu$ would lead to a strictly better convergence rate than S-APG.\n\nRecall that using APG to solve (12) requires computing the following proximal map in each iteration:\n\n$$\\mathsf{P}^{1/L_0}_{\\mathsf{A}^\\mu}(x) = \\operatorname*{argmin}_y \\; \\tfrac{L_0}{2} \\|x - y\\|^2 + \\mathsf{A}^\\mu(y),$$\n\nwhich, unfortunately, is not yet amenable to efficient computation, due to the mismatch of the constants $1/L_0$ and $\\mu$ (recall that in the decomposition (7) the superscript and subscript must both be $\\mu$). In general, there is no known explicit formula that would reduce $\\mathsf{P}^{1/L_0}_f$ to $\\mathsf{P}^\\mu_f$ for different positive constants $1/L_0$ and $\\mu$ [17, p. 338], see also iv) in Proposition 1. Our fix is almost trivial: If necessary, we use a bigger Lipschitz constant $L_0 = 1/\\mu$ so that we can compute the proximal map easily. This is indeed legitimate since $L_0$-Lipschitz implies $L$-Lipschitz for any $L \\ge L_0$. Said differently, all we need is to tune down the stepsize a little bit in APG. We state formally the convergence property of our algorithm as (Proof in Appendix C):\n\nTheorem 1. Fix the accuracy $\\epsilon > 0$. Under Assumption 1 and the choice $\\mu = \\min\\{1/L_0, 2\\epsilon/\\overline{M^2}\\}$, after at most $\\sqrt{\\frac{2}{\\mu\\epsilon}} \\|x_0 - x\\|$ steps, the output of Algorithm 1, say $\\tilde x$, satisfies\n\n$$\\ell(\\tilde x) + \\bar f(\\tilde x) \\le \\ell(x) + \\bar f(x) + 2\\epsilon.$$\n\nThe same guarantee holds for Algorithm 2 after at most $\\frac{1}{2\\mu\\epsilon} \\|x_0 - x\\|^2$ steps.\n\nNote that if we could reduce $\\mathsf{P}^{1/L_0}_{\\mathsf{A}^\\mu}$ efficiently to $\\mathsf{P}^\\mu_{\\mathsf{A}^\\mu}$, we would end up with the optimal (overall) rate $O(\\sqrt{1/\\epsilon})$, even though we approximate the nonsmooth function $\\bar f$ by the proximal average $\\mathsf{A}^\\mu$. In other words, approximation itself does not lead to an inferior rate. It is our incapability to (efficiently) relate proximal maps that leads to the sacrifice in convergence rates.\n\n5 Discussions\n\nTo ease our discussion with related works, let us first point out a fact that is not always explicitly recognized, that is, S-APG essentially relies on approximating the nonsmooth function $\\bar f$ with $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$. Indeed, consider first the case $K = 1$. The smoothing idea introduced in [8] purports the superficial max-structure assumption, that is, $f(x) = \\max_{y \\in C} \\langle x, y \\rangle - h(y)$ where $C$ is some bounded convex set and $h \\in \\Gamma_0$. As is well-known (also easily verified from the definition), $f \\in \\Gamma_0$ is $M$-Lipschitz continuous (w.r.t. the norm $\\|\\cdot\\|$) iff $\\mathrm{dom}\\, f^* \\subseteq B_{\\|\\cdot\\|}(0, M)$, the ball centered at the origin with radius $M$. Thus the function $f \\in \\Gamma_0$ admits the max-structure iff it is Lipschitz continuous, i.e., satisfying our Assumption 1, in which case $h = f^*$ and $C = \\mathrm{dom}\\, f^*$. [8] proceeded to add some \u201cdistance\u201d function $d$ to obtain the approximation $f_\\mu(x) = \\max_{y \\in C} \\langle x, y \\rangle - f^*(y) - \\mu d(y)$. For simplicity, we will only consider $d = q$, thus $f_\\mu = (f^* + \\mu q)^* = \\mathsf{M}^\\mu_f$. The other assumption of S-APG [8] is that $f_\\mu$ and the maximizer in its expression can be easily computed, which is precisely our Assumption 2. Finally for the general case where $\\bar f$ is an average of $K$ nonsmooth functions, the smoothing technique is applied in a component by component way, i.e., approximate $\\bar f$ with $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$.\n\nFor comparison, let us recall that S-APG finds a $2\\epsilon$ accurate solution in at most $O\\big(\\sqrt{L_0 + \\overline{M^2}/(2\\epsilon)} \\, \\sqrt{1/\\epsilon}\\big)$ steps since the Lipschitz constant of the gradient of $\\ell + \\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$ is upper bounded by $L_0 + \\overline{M^2}/(2\\epsilon)$ (under the choice of $\\mu$ in Theorem 1). This is strictly worse than the complexity $O\\big(\\sqrt{\\max\\{L_0, \\overline{M^2}/(2\\epsilon)\\}} \\, \\sqrt{1/\\epsilon}\\big)$ of our approach. In other words, we have managed to remove the secondary term in the complexity bound of S-APG. We should emphasize that this strict improvement is obtained under exactly the same assumptions and with an algorithm as simple (if not simpler) as S-APG. In some sense it is quite remarkable that the seemingly \u201cnaive\u201d approximation that pretends the linearity of the proximal map not only can be justified but also leads to a strictly better result.\n\nLet us further explain how the improvement is possible. 
As mentioned, S-APG approximates $\\bar f$ with the smooth function $\\mathsf{M}^\\mu_{\\mathsf{A}^\\mu}$. This smooth approximation is beneficial if our capability is limited to smooth functions. Put differently, S-APG implicitly treats applying the fast gradient algorithms as the ultimate goal. However, the recent advances on nonsmooth optimization have broadened the range of fast schemes: It is not smoothness but the proximal map that allows fast convergence. Just as APG improves upon the subgradient method, our approach, with the ultimate goal to enable efficient computation of the proximal map, improves upon S-APG. Another lesson we wish to point out is that unnecessary \u201cover-smoothing\u201d, as in S-APG, does hurt the performance since it always increases the Lipschitz constant. To summarize, smoothing is not free and it should be used only when truly needed.\n\nLastly, we note that our algorithm shares some similarity with forward-backward splitting procedures and alternating direction methods [9, 18, 19], although a detailed examination will not be given here. Due to space limits, we refer further extensions and improvements to [20, Chapter 3].\n\n6 Experiments\n\nWe compare the proposed algorithm with S-APG on two important problems: overlapping group lasso and graph-guided fused lasso. See Example 1 and Example 2 for details about the nonsmooth function $\\bar f$. We note that S-APG has been demonstrated with superior performance on both problems in [13], therefore we will only concentrate on comparing with it. Bear in mind that the purpose of our experiment is to verify the theoretical improvement as discussed in Section 5. We are not interested in fine tuning parameters here (despite its practical importance), thus for a fair comparison, we use the same desired accuracy $\\epsilon$, Lipschitz constant $L_0$ and other parameters for all methods. 
Since both our method and S-APG have the same per-step complexity, we will simply run them for a maximum number of iterations (after which saturation is observed) and report all the intermediate objective values.\n\nFigure 2: Objective value vs. iteration on overlapping group lasso (three panels, for $2\\epsilon = 2/L_0$, $1/L_0$ and $0.5/L_0$; curves: PA-PG, S-PG, PA-APG, S-APG).\n\nFigure 3: Objective value vs. iteration on graph-guided fused lasso (three panels, for $2\\epsilon = 2/L_0$, $1/L_0$ and $0.5/L_0$; curves: PA-PG, S-PG, PA-APG, S-APG).\n\nOverlapping Group Lasso: Following [13] we generate the data as follows: We set $\\ell(x) = \\frac{1}{2\\lambda K} \\|Ax - b\\|^2$ where $A \\in \\mathbb{R}^{n \\times d}$ has entries sampled from i.i.d. normal distributions, $x_j = (-1)^j \\exp(-(j - 1)/100)$, and $b = Ax + \\xi$ with the noise $\\xi$ sampled from a zero mean and unit variance normal distribution. Finally, the groups in the regularizer $\\bar f$ are defined as\n\n$$\\{\\{1, \\ldots, 100\\}, \\{91, \\ldots, 190\\}, \\ldots, \\{d - 99, \\ldots, d\\}\\},$$\n\nwhere $d = 90K + 10$. That is, there are $K$ groups, each containing 100 variables, and the groups overlap by 10 consecutive variables. We adopt the uniform weight $\\alpha_k = 1/K$ and set $\\lambda = K/5$. Figure 2 shows the results for $n = 5000$ and $K = 50$, with three different accuracy parameters. For completeness, we also include the results for the non-accelerated versions (PA-PG and S-PG). Clearly, accelerated algorithms are much faster than their non-accelerated cousins. Observe that our algorithms (PA-APG and PA-PG) converge consistently faster than S-APG and S-PG, respectively, with a big margin in the favorable case (middle panel). Again we emphasize that this improvement is achieved without any overhead.\n\nGraph-guided Fused Lasso: We generate $\\ell$ similarly as above. Following [13], the graph edges $E$ are obtained by thresholding the correlation matrix. The case $n = 5000$, $d = 1000$, $\\lambda = 15$ is shown in Figure 3, under three different desired accuracies. Again, we observe that accelerated algorithms are faster than non-accelerated versions and our algorithms consistently converge faster.\n\n7 Conclusions\n\nWe have considered the composite minimization problem which consists of a smooth loss and a sum of nonsmooth regularizers. Different from smoothing, we considered a seemingly naive nonsmooth approximation which simply pretends the linearity of the proximal map. Based on the proximal average, a new tool from convex analysis, we proved that the new approximation leads to a novel algorithm that strictly improves the state-of-the-art. Experiments on both overlapping group lasso and graph-guided fused lasso verified the superiority of the proposed method. An interesting question arising from this work, also under our current investigation, is in what sense a certain approximation is optimal. We also plan to apply our algorithm to other practical problems.\n\nAcknowledgement\n\nThe author thanks Bob Williamson and Xinhua Zhang from NICTA Canberra for their hospitality during the author\u2019s visit when this work was performed; Warren Hare and Yves Lucet from UBC Okanagan for drawing his attention to the proximal average; and the reviewers for their valuable comments.\n\nReferences\n\n[1] Walter Rudin. Principles of mathematical analysis. McGraw-Hill, 3rd edition, 1976.\n\n[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. 
SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n[3] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming,\nSeries B, 140:125\u2013161, 2013.\n[4] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity\nthrough convex optimization. Statistical Science, 27(4):450\u2013468, 2012.\n[5] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for\ngrouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468\u20133497, 2009.\n[6] Seyoung Kim and Eric P. Xing. Statistical estimation of correlated genome associations to a\nquantitative trait network. PLoS Genetics, 5(8):1\u201318, 2009.\n[7] Naum Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer, 1985.\n[8] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming,\n103(1):127\u2013152, 2005.\n[9] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing.\nIn Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages\n185\u2013212. Springer, 2011.\n[10] Silvia Villa, Saverio Salzo, Luca Baldassarre, and Alessandro Verri. Accelerated and inexact\nforward-backward algorithms. SIAM Journal on Optimization, 23(3):1607\u20131633, 2013.\n[11] Heinz H. Bauschke, Rafal Goebel, Yves Lucet, and Xianfu Wang. The proximal average:\nBasic theory. SIAM Journal on Optimization, 19(2):766\u2013785, 2008.\n[12] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and\nsmoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67:91\u2013108,\n2005.\n[13] Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, and Eric P. Xing. Smoothing proximal\ngradient method for general structured sparse regression. 
The Annals of Applied Statistics, 6(2):719\u2013752, 2012.\n[14] Ralph Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Springer, 1998.\n[15] Jean J. Moreau. Proximit\u00e9 et dualit\u00e9 dans un espace Hilbertien. Bulletin de la Soci\u00e9t\u00e9\nMath\u00e9matique de France, 93:273\u2013299, 1965.\n[16] Ralph Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal\non Control and Optimization, 14(5):877\u2013898, 1976.\n[17] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory\nin Hilbert Spaces. Springer, 1st edition, 2011.\n[18] Hua Ouyang, Niao He, Long Q. Tran, and Alexander Gray. Stochastic alternating direction\nmethod of multipliers. In International Conference on Machine Learning, 2013.\n[19] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction\nmultiplier method. In International Conference on Machine Learning, 2013.\n[20] Yaoliang Yu. Fast Gradient Algorithms for Structured Sparsity. PhD thesis, University of\nAlberta, 2013.", "award": [], "sourceid": 295, "authors": [{"given_name": "Yao-Liang", "family_name": "Yu", "institution": "University of Alberta"}]}