{"title": "Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/\\epsilon)$", "book": "Advances in Neural Information Processing Systems", "page_first": 1208, "page_last": 1216, "abstract": "In this paper, we develop a novel {\\bf ho}moto{\\bf p}y {\\bf s}moothing (HOPS) algorithm for solving a family of non-smooth problems that is composed of a non-smooth term with an explicit max-structure and a smooth term or a simple non-smooth term whose proximal mapping is easy to compute. The best known iteration complexity for solving such non-smooth optimization problems is $O(1/\\epsilon)$ without any assumption on the strong convexity. In this work, we will show that the proposed HOPS achieved a lower iteration complexity of $\\tilde O(1/\\epsilon^{1-\\theta})$ with $\\theta\\in(0,1]$ capturing the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method and runs in stages, which gradually decreases the smoothing parameter in a stage-wise manner until it yields a sufficiently good approximation of the original function. We show that HOPS enjoys a linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piece-wise linear loss function and $\\ell_1$ norm regularizer, finding a point in a polyhedron, cone programming, etc). 
Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and the primal-dual style of first-order methods.", "full_text": "Homotopy Smoothing for Non-Smooth Problems\n\nwith Lower Complexity than O(1/ε)\n\nYi Xu†∗, Yan Yan‡∗, Qihang Lin♮, Tianbao Yang†\n\n† Department of Computer Science, University of Iowa, Iowa City, IA 52242\n\n‡ QCIS, University of Technology Sydney, NSW 2007, Australia\n\n♮ Department of Management Sciences, University of Iowa, Iowa City, IA 52242\n\n{yi-xu, qihang-lin, tianbao-yang}@uiowa.edu, yan.yan-3@student.uts.edu.au\n\nAbstract\n\nIn this paper, we develop a novel homotopy smoothing (HOPS) algorithm for solving a family of non-smooth problems that is composed of a non-smooth term with an explicit max-structure and a smooth term or a simple non-smooth term whose proximal mapping is easy to compute. The best known iteration complexity for solving such non-smooth optimization problems is O(1/ε) without any assumption on strong convexity. In this work, we show that the proposed HOPS achieves a lower iteration complexity of Õ(1/ε^{1−θ})¹ with θ ∈ (0, 1] capturing the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without a strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method and runs in stages, which gradually decreases the smoothing parameter in a stage-wise manner until it yields a sufficiently good approximation of the original function. 
We show that HOPS enjoys linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piece-wise linear loss function and ℓ1 norm regularizer, finding a point in a polyhedron, cone programming, etc.). Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and the primal-dual style of first-order methods.\n\n1 Introduction\n\nIn this paper, we consider the following optimization problem:\n\nmin_{x∈Ω1} F(x) ≜ f(x) + g(x)   (1)\n\nwhere g(x) is a convex (but not necessarily smooth) function, Ω1 is a closed convex set, and f(x) is a convex but non-smooth function which can be explicitly written as\n\nf(x) = max_{u∈Ω2} ⟨Ax, u⟩ − φ(u)   (2)\n\nwhere Ω2 ⊂ R^m is a closed convex bounded set, A ∈ R^{m×d}, φ(u) is a convex function, and ⟨·,·⟩ is the scalar product. This family of non-smooth optimization problems has applications in numerous domains, e.g., machine learning and statistics [7], image processing [6], cone programming [11], etc. Several first-order methods have been developed for solving such non-smooth optimization\n\n∗The first two authors make equal contributions. The work of Y. 
Yan was done when he was a visiting student at the Department of Computer Science of the University of Iowa.\n\n¹Õ(·) suppresses a logarithmic factor.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nproblems, including the primal-dual methods [15, 6] and Nesterov's smoothing algorithm [16]², and they can achieve O(1/ε) iteration complexity for finding an ε-optimal solution, which is faster than the corresponding black-box lower complexity bounds by an order of magnitude.\n\nIn this paper, we propose a novel homotopy smoothing (HOPS) algorithm for solving the problem in (1) that achieves a lower iteration complexity than O(1/ε). In particular, the iteration complexity of HOPS is given by Õ(1/ε^{1−θ}), where θ ∈ (0, 1] captures the local sharpness (defined shortly) of the objective function around the optimal solutions. The proposed HOPS algorithm builds on Nesterov's smoothing technique, i.e., approximating the non-smooth function f(x) by a smooth function and optimizing the smoothed function to a desired accuracy level.\n\nThe striking difference between HOPS and Nesterov's smoothing algorithm is that Nesterov uses a fixed small smoothing parameter that renders a sufficiently accurate approximation of the non-smooth function f(x), while HOPS adopts a homotopy strategy for setting the value of the smoothing parameter. It starts from a relatively large smoothing parameter and gradually decreases the smoothing parameter in a stage-wise manner until the smoothing parameter reaches a level that gives a sufficiently good approximation of the non-smooth objective function. The benefit of using a homotopy strategy is that a larger smoothing parameter yields a smaller smoothness constant and hence a lower iteration complexity for the smoothed problems in earlier stages. 
For smoothed problems in later stages with larger smoothness constants, warm-start can help reduce the number of iterations to converge. As a result, solving a series of smoothed approximations with a smoothing parameter decreasing from large to small and with warm-start is faster than solving one smoothed approximation with a very small smoothing parameter. To the best of our knowledge, this is the first work that rigorously analyzes such a homotopy smoothing algorithm and establishes its theoretical guarantee on lower iteration complexities. The keys to our analysis of the lower iteration complexity are (i) to leverage a global error inequality (Lemma 1) [21] that bounds the distance of a solution to the ε-sublevel set by a multiple of the functional distance; and (ii) to explore a local error bound condition to bound the multiplicative factor.\n\n2 Related Work\n\nIn this section, we review some related work for solving the considered family of non-smooth optimization problems.\n\nIn the seminal paper by Nesterov [16], he proposed a smoothing technique for a family of structured non-smooth optimization problems as in (1) with g(x) being a smooth function and f(x) given in (2). By adding a strongly convex prox function in terms of u with a smoothing parameter μ into the definition of f(x), one can obtain a smoothed approximation of the original objective function. Then he developed an accelerated gradient method with an O(1/t²) convergence rate for the smoothed objective function, with t being the number of iterations, which implies an O(1/t) convergence rate for the original objective function by setting μ ≈ c/t with c being a constant. 
The smoothing technique has been exploited to solve problems in machine learning, statistics, and cone programming [7, 11, 24].\n\nThe primal-dual style of first-order methods treats the problem as a convex-concave minimization problem, i.e.,\n\nmin_{x∈Ω1} max_{u∈Ω2} g(x) + ⟨Ax, u⟩ − φ(u)\n\nNemirovski [15] proposed a mirror prox method, which has a convergence rate of O(1/t) by assuming that both g(x) and φ(u) are smooth functions. Chambolle & Pock [6] designed first-order primal-dual algorithms, which tackle g(x) and φ(u) using proximal mappings and achieve the same convergence rate of O(1/t) without assuming smoothness of g(x) and φ(u). When g(x) or φ(u) is strongly convex, their algorithms achieve an O(1/t²) convergence rate. The effectiveness of their algorithms was demonstrated on imaging problems. Recently, the primal-dual style of first-order methods has been employed to solve non-smooth optimization problems in machine learning where both the loss function and the regularizer are non-smooth [22]. Lan et al. [11] also considered Nemirovski's prox method for solving cone programming problems.\n\nThe key condition for us to develop an improved convergence is closely related to local error bounds (LEB) [17] and, more generally, the Kurdyka-Łojasiewicz property [12, 4]. The LEB characterizes\n\n²The algorithm in [16] was developed for handling a smooth component g(x); it can be extended to handling a non-smooth component g(x) whose proximal mapping is easy to compute.\n\nthe relationship between the distance of a local solution to the optimal set and the optimality gap of the solution in terms of objective value. The Kurdyka-Łojasiewicz property characterizes whether a function can be made "sharp" by some transformation. 
Recently, these conditions/properties have been explored for feasible descent methods [13], non-smooth optimization [8], and gradient and subgradient methods [10, 21]. It is notable that our local error bound condition is different from the one used in [13, 25], which bounds the distance of a point to the optimal set by the norm of the projected or proximal gradient at that point instead of the functional distance, and consequently requires some smoothness assumption on the objective function. By contrast, the local error bound condition in this paper covers a much broader family of functions and thus is more general. Recent work [14, 23] has shown that the error bound in [13, 25] is a special case of our considered error bound with θ = 1/2. Two closely related works leveraging an error bound similar to ours are discussed in order. Gilpin et al. [8] considered two-person zero-sum games, which are a special case of (1) with g(x) and φ(u) being zeros and Ω1 and Ω2 being polytopes. The present work is a non-trivial generalization of their work that leads to improved convergence for a much broader family of non-smooth optimization problems. In particular, their result is just a special case of our result, where the constant θ that captures the local sharpness is one for problems whose epigraph is a polytope. Recently, Yang & Lin [21] proposed a restarted subgradient method by exploring the local error bound condition or, more generally, the Kurdyka-Łojasiewicz property, resulting in an Õ(1/ε^{2(1−θ)}) iteration complexity with the same constant θ. 
In contrast, our result is an improved iteration complexity of Õ(1/ε^{1−θ}).\n\nIt is worth emphasizing that the proposed homotopy smoothing technique is different from recently proposed homotopy methods for sparse learning (e.g., the ℓ1 regularized least-squares problem [20]), though a homotopy strategy on an involved parameter is also employed to boost the convergence. In particular, the involved parameter in the homotopy methods for sparse learning is the regularization parameter of the ℓ1 regularization, while the parameter in the present work is the introduced smoothing parameter. In addition, the benefit of starting from a relatively large regularization parameter in sparse learning is the sparsity of the solution, which makes it possible to explore the restricted strong convexity for proving faster convergence. We do not make such an assumption on the data, and we are mostly interested in the case where both f(x) and g(x) are non-smooth. Finally, we note that a similar homotopy (a.k.a. continuation) strategy is employed in Nesterov's smoothing algorithm for solving an ℓ1 norm minimization problem subject to a constraint for recovering a sparse solution [3]. However, we would like to draw readers' attention to the fact that they did not provide any theoretical guarantee on the iteration complexity of the homotopy strategy, and consequently their implementation is ad hoc without guidance from theory. More importantly, our developed algorithms and theory apply to a much broader family of problems.\n\n3 Preliminaries\n\nWe present some preliminaries in this section. Let ‖x‖ denote the Euclidean norm on the primal variable x. A function h(x) is L-smooth in terms of ‖·‖ if ‖∇h(x) − ∇h(y)‖ ≤ L‖x − y‖. Let ‖u‖+ denote a norm on the dual variable, which is not necessarily the Euclidean norm. 
Denote by ω+(u) a 1-strongly convex function of u in terms of ‖·‖+.\n\nFor the optimization problem in (1), we let Ω∗ and F∗ denote the set of optimal solutions and the optimal value, respectively, and make the following assumption throughout the paper.\n\nAssumption 1. For the convex minimization problem (1), we assume: (i) there exist x0 ∈ Ω1 and ε0 ≥ 0 such that F(x0) − min_{x∈Ω1} F(x) ≤ ε0; (ii) f(x) is characterized as in (2), where φ(u) is a convex function; (iii) there exists a constant D such that max_{u∈Ω2} ω+(u) ≤ D²/2; (iv) Ω∗ is a non-empty convex compact set.\n\nNote that: 1) Assumption 1(i) assumes that the objective function is lower bounded; 2) Assumption 1(iii) assumes that Ω2 is a bounded set, which is also required in [16]. In addition, for brevity we assume that g(x) is simple enough³ such that the proximal mapping defined below is easy to compute, similar to [6]:\n\nP_{λg}(x) = min_{z∈Ω1} (1/2)‖z − x‖² + λg(z)   (3)\n\n³If g(x) is smooth, this assumption can be relaxed. 
We will defer the discussion and result on a smooth function g(x) to the supplement.\n\nRelying on the proximal mapping, the key updates in the optimization algorithms presented below take the following form:\n\nΠ^c_{v,λg}(x) = arg min_{z∈Ω1} (c/2)‖z − x‖² + ⟨v, z⟩ + λg(z)   (4)\n\nFor any x ∈ Ω1, let x∗ denote the closest optimal solution in Ω∗ to x measured in terms of ‖·‖, i.e., x∗ = arg min_{z∈Ω∗} ‖z − x‖², which is unique because Ω∗ is a non-empty convex compact set. We denote by L_ε the ε-level set of F(x) and by S_ε the ε-sublevel set of F(x), respectively, i.e.,\n\nL_ε = {x ∈ Ω1 : F(x) = F∗ + ε}, S_ε = {x ∈ Ω1 : F(x) ≤ F∗ + ε}   (5)\n\nIt follows from [18] (Corollary 8.7.1) that the sublevel set S_ε is bounded for any ε ≥ 0, and so is the level set L_ε because Ω∗ is bounded. Define dist(L_ε, Ω∗) to be the maximum distance of points on the level set L_ε to the optimal set Ω∗, i.e.,\n\ndist(L_ε, Ω∗) = max_{x∈L_ε} dist(x, Ω∗) ≜ max_{x∈L_ε} min_{z∈Ω∗} ‖x − z‖\n\nBecause L_ε and Ω∗ are bounded, dist(L_ε, Ω∗) is also bounded. 
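To make the mappings (3) and (4) concrete, consider the choice g(x) = λ‖x‖1 with Ω1 = R^d (the ℓ1 norm regularizer used later in the paper): both mappings then have closed forms via soft-thresholding. The following minimal Python sketch is ours for illustration only; the names prox_l1 and prox_update are not from the paper.

```python
import numpy as np

def prox_l1(x, lam):
    """P_{lam g}(x) for g(z) = ||z||_1 and Omega_1 = R^d, i.e.
    argmin_z 0.5 * ||z - x||^2 + lam * ||z||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_update(x, v, lam, c):
    """The update (4): argmin_z (c/2) * ||z - x||^2 + <v, z> + lam * ||z||_1.
    Completing the square reduces it to soft-thresholding of x - v/c."""
    return prox_l1(x - v / c, lam / c)
```

For other simple g (e.g., ‖x‖∞ or an indicator of a box) the same reduction of (4) to a shifted proximal mapping of g applies.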
Let x†_ε denote the closest point in the ε-sublevel set to x, i.e.,\n\nx†_ε = arg min_{z∈S_ε} ‖z − x‖²   (6)\n\nIt is easy to show that x†_ε ∈ L_ε when x ∉ S_ε (using the KKT condition).\n\n4 Homotopy Smoothing\n\n4.1 Nesterov's Smoothing\n\nWe first present Nesterov's smoothing technique and accelerated proximal gradient methods for solving the smoothed problem, since the proposed algorithm builds upon these techniques. The idea of smoothing is to construct a smooth function f_μ(x) that well approximates f(x). Nesterov considered the following function\n\nf_μ(x) = max_{u∈Ω2} ⟨Ax, u⟩ − φ(u) − μω+(u)\n\nIt was shown in [16] that f_μ(x) is smooth w.r.t. ‖·‖ and its smoothness parameter is given by L_μ = (1/μ)‖A‖², where ‖A‖ is defined by ‖A‖ = max_{‖x‖≤1} max_{‖u‖+≤1} ⟨Ax, u⟩. Denote by\n\nu_μ(x) = arg max_{u∈Ω2} ⟨Ax, u⟩ − φ(u) − μω+(u)\n\nThe gradient of f_μ(x) is computed by ∇f_μ(x) = A⊤u_μ(x). Then\n\nf_μ(x) ≤ f(x) ≤ f_μ(x) + μD²/2   (7)\n\nFrom the inequality above, we can see that when μ is very small, f_μ(x) gives a good approximation of f(x). This motivates us to solve the following composite optimization problem\n\nmin_{x∈Ω1} F_μ(x) ≜ f_μ(x) + g(x)\n\nMany works have studied such an optimization problem [2, 19] and the best convergence rate is given by O(L_μ/t²), where t is the total number of iterations. 
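As a worked instance of the smoothing above, take f(x) = ‖Ax‖1 = max_{‖u‖∞≤1} ⟨Ax, u⟩ (so φ = 0) with prox-function ω+(u) = ‖u‖²/2; then D² = m and the inner maximization has a closed form. This sketch is our illustration, not code from the paper.

```python
import numpy as np

def smoothed_abs(A, x, mu):
    """Nesterov smoothing of f(x) = ||Ax||_1 = max_{||u||_inf <= 1} <Ax, u>
    (phi = 0) with omega_+(u) = ||u||^2 / 2.
    The inner maximization has the closed form u_mu(x) = clip(Ax / mu, -1, 1)."""
    z = A @ x
    u = np.clip(z / mu, -1.0, 1.0)
    f_mu = u @ z - 0.5 * mu * (u @ u)   # smoothed value f_mu(x)
    grad = A.T @ u                       # gradient A^T u_mu(x)
    return f_mu, grad
```

One can verify the sandwich relation (7) directly: f_μ(x) ≤ ‖Ax‖1 ≤ f_μ(x) + μm/2, since D² = m here.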
We present a variant of accelerated proximal gradient (APG) methods in Algorithm 1 that works even with ‖x‖ replaced by a general norm as long as its square is strongly convex. We make several remarks about Algorithm 1: (i) the variant here is similar to Algorithm 3 in [19] and the algorithm proposed in [16], except that the prox function d(x) is replaced by ‖x − x0‖²/2 in updating the sequence of z_k, which is assumed to be σ1-strongly convex w.r.t. ‖·‖; (ii) if ‖·‖ is simply the Euclidean norm, a simplified algorithm with only one update in (4) can be used (e.g., FISTA [2]); (iii) if L_μ is difficult to compute, we can use the backtracking trick (see [2, 19]).\n\nThe following theorem states the convergence result for APG.\n\nTheorem 2. ([19]) Let θ_k = 2/(k+2), α_k = 2/(k+1), k ≥ 0, or α_{k+1} = θ_{k+1} = (√(θ_k⁴ + 4θ_k²) − θ_k²)/2, k ≥ 0. For any x ∈ Ω1, we have\n\nF_μ(x_t) − F_μ(x) ≤ 2L_μ‖x − x0‖²/t²   (8)\n\nAlgorithm 1 An Accelerated Proximal Gradient Method: APG(x0, t, L_μ)\n1: Input: the number of iterations t, the initial solution x0, and the smoothness constant L_μ\n2: Let θ0 = 1, V_{−1} = 0, Γ_{−1} = 0, z0 = x0\n3: Let α_k and θ_k be two sequences given in Theorem 2.\n4: for k = 0, . . . , t − 1 do\n5: Compute y_k = (1 − θ_k)x_k + θ_k z_k\n6: Compute v_k = ∇f_μ(y_k), V_k = V_{k−1} + v_k/α_k, and Γ_k = Γ_{k−1} + 1/α_k\n7: Compute z_{k+1} = Π^{L_μ/σ1}_{V_k, Γ_k g}(x0) and x_{k+1} = Π^{L_μ}_{v_k, g}(y_k)\n8: end for\n9: Output: x_t\n\nCombining the above convergence result with the relation in (7), we can establish the iteration complexity of Nesterov's smoothing algorithm for solving the original problem (1).\n\nCorollary 3. 
For any x ∈ Ω1, we have\n\nF(x_t) − F(x) ≤ μD²/2 + 2L_μ‖x − x0‖²/t²   (9)\n\nIn particular, in order to have F(x_t) ≤ F∗ + ε, it suffices to set μ ≤ ε/D² and t ≥ 2D‖A‖‖x0 − x∗‖/ε, where x∗ is an optimal solution to (1).\n\n4.2 Homotopy Smoothing\n\nFrom the convergence result in (9), we can see that in order to obtain a very accurate solution, we have to set μ - the smoothing parameter - to be a very small value, which will cause the blow-up of the second term because L_μ ∝ 1/μ. On the other hand, if μ is set to be a relatively large value, then t can be set to be a relatively small value to match the first term on the R.H.S. of (9), which may lead to an insufficiently accurate solution. It seems that O(1/ε) is unbeatable. However, we can adopt a homotopy strategy, i.e., start from a relatively large value of μ and optimize the smoothed function with a certain number of iterations t such that the second term in (9) matches the first term, which gives F(x_t) − F(x∗) ≤ O(μ). Then we can reduce the value of μ by a constant factor b > 1 and warm-start the optimization process from x_t. The key observation is that although μ decreases and L_μ increases, the other term ‖x∗ − x_t‖ is also reduced compared to ‖x∗ − x0‖, which could cancel the blow-up effect caused by the increased L_μ. As a result, we expect to use the same number of iterations to optimize the smoothed function with a smaller μ such that F(x_{2t}) − F(x∗) ≤ O(μ/b).\n\nTo formalize our observation, we need the following key lemma.\n\nLemma 1 ([21]). 
For any x ∈ Ω1 and ε > 0, we have\n\n‖x − x†_ε‖ ≤ (dist(x†_ε, Ω∗)/ε)(F(x) − F(x†_ε))\n\nwhere x†_ε ∈ S_ε is the closest point in the ε-sublevel set to x as defined in (6).\n\nThe lemma is proved in [21]. We include its proof in the supplement. If we apply the above bound in (9), we will see in the proof of the main theorem (Theorem 5) that the number of iterations t for solving each smoothed problem is roughly O(dist(L_ε, Ω∗)/ε), which is lower than O(1/ε) in light of the local error bound condition given below.\n\nDefinition 4 (Local error bound (LEB)). A function F(x) is said to satisfy a local error bound condition if there exist θ ∈ (0, 1] and c > 0 such that for any x ∈ S_ε\n\ndist(x, Ω∗) ≤ c(F(x) − F∗)^θ   (10)\n\nRemark: In the next subsection, we will discuss the relationship with other types of conditions and show that a broad family of non-smooth functions (including almost all commonly seen functions in machine learning) obey the local error bound condition. The exponent constant θ can be considered as a local sharpness measure of the function. Figure 1 illustrates the sharpness of F(x) = |x|^p for p = 1, 1.5, and 2 around the optimal solutions and the corresponding θ.\n\nWith the local error bound condition, we can see that dist(L_ε, Ω∗) ≤ cε^θ, θ ∈ (0, 1]. Now, we are ready to present the homotopy smoothing algorithm and its convergence guarantee under the local error bound condition. The HOPS algorithm is presented in Algorithm 2, which starts from a relatively large smoothing parameter μ = μ1 and gradually reduces μ by a factor of b > 1 after running a number t of iterations of APG with warm-start. The iteration complexity of HOPS is established in Theorem 5. We include the proof in the supplement.\n\nAlgorithm 2 HOPS for solving (1)\n1: Input: m, t, x0 ∈ Ω1, ε0, D² and b > 1.\n2: Let μ1 = ε0/(bD²)\n3: for s = 1, . . . , m do\n4: Let x_s = APG(x_{s−1}, t, L_{μ_s})\n5: Update μ_{s+1} = μ_s/b\n6: end for\n7: Output: x_m\n\nFigure 1: Illustration of the local sharpness of three functions (|x| with θ = 1, |x|^{1.5} with θ = 2/3, |x|² with θ = 1/2) and the corresponding θ in the LEB condition.\n\nTheorem 5. Suppose Assumption 1 holds and F(x) obeys the local error bound condition. Let HOPS run with t = O(2bcD‖A‖/ε^{1−θ}) ≥ 2bcD‖A‖/ε^{1−θ} iterations for each stage, and m = ⌈log_b(ε0/ε)⌉. Then F(x_m) − F∗ ≤ 2ε. Hence, the iteration complexity for achieving a 2ε-optimal solution is (2bcD‖A‖/ε^{1−θ})⌈log_b(ε0/ε)⌉ in the worst case.\n\n4.3 Local error bounds and Applications\n\nIn this subsection, we discuss the local error bound condition and its applications in non-smooth optimization problems.\n\nThe Hoffman's bound and finding a point in a polyhedron. A polyhedron can be expressed as P = {x ∈ R^d; B1x ≤ b1, B2x = b2}. Hoffman's bound [17] is expressed as\n\ndist(x, P) ≤ c(‖(B1x − b1)+‖ + ‖B2x − b2‖), ∃c > 0   (11)\n\nwhere [s]+ = max(0, s). 
This can be considered as the error bound for the polyhedron feasibility problem, i.e., finding an x ∈ P, which is equivalent to\n\nmin_{x∈R^d} F(x) ≜ ‖(B1x − b1)+‖ + ‖B2x − b2‖ = max_{u∈Ω2} ⟨B1x − b1, u1⟩ + ⟨B2x − b2, u2⟩\n\nwhere u = (u1⊤, u2⊤)⊤ and Ω2 = {u | u1 ⪰ 0, ‖u1‖ ≤ 1, ‖u2‖ ≤ 1}. If there exists an x ∈ P, then F∗ = 0. Thus Hoffman's bound in (11) implies a local error bound (10) with θ = 1. Therefore, HOPS has a linear convergence for finding a feasible solution in a polyhedron. If we let ω+(u) = (1/2)‖u‖², then D² = 2, so that the iteration complexity is 2√2 bc max(‖B1‖, ‖B2‖)⌈log_b(ε0/ε)⌉.\n\nCone programming. Let U, V denote two vector spaces. Given a linear operator E : U → V∗ ⁴, a closed convex set Ω ⊆ U, a vector e ∈ V∗, and a closed convex cone K ⊆ V, the general constrained cone linear system (cone programming) consists of finding a vector x ∈ Ω such that Ex − e ∈ K∗. Lan et al. [11] have considered Nesterov's smoothing algorithm for solving the cone programming problem with O(1/ε) iteration complexity. The problem can be cast into a non-smooth optimization problem:\n\nmin_{x∈Ω} F(x) ≜ dist(Ex − e, K∗) = max_{‖u‖≤1, u∈−K} ⟨Ex − e, u⟩\n\nAssume that e ∈ Range(E) − K∗; then F∗ = 0. Burke et al. [5] have considered the error bound for such problems, and their results imply that there exists c > 0 such that dist(x, Ω∗) ≤ c(F(x) − F∗) as long as ∃x ∈ Ω, s.t. Ex − e ∈ int(K∗), where Ω∗ denotes the optimal solution set. 
Therefore, HOPS also has a linear convergence for cone programming. Considering that both U and V are Euclidean spaces, we set ω+(u) = (1/2)‖u‖², then D² = 1. Thus, the iteration complexity of HOPS for finding a 2ε-solution is 2bc‖E‖⌈log_b(ε0/ε)⌉.\n\nNon-smooth regularized empirical loss (REL) minimization in machine learning. The REL consists of a sum of loss functions on the training data and a regularizer, i.e.,\n\nmin_{x∈R^d} F(x) ≜ (1/n) Σ_{i=1}^n ℓ(x⊤a_i, y_i) + λg(x)\n\n⁴V∗ represents the dual space of V. The notations and descriptions are adopted from [11].\n\nwhere (a_i, y_i), i = 1, . . . , n denote pairs of a feature vector and a label of the training data. Non-smooth loss functions include the hinge loss ℓ(z, y) = max(0, 1 − yz) and the absolute loss ℓ(z, y) = |z − y|, which can be written in the max structure of (2). Non-smooth regularizers include, e.g., g(x) = ‖x‖1 and g(x) = ‖x‖∞. These loss functions and regularizers are essentially piecewise linear functions, whose epigraph is a polyhedron. The error bound condition has been developed for such kinds of problems [21]. In particular, if F(x) has a polyhedral epigraph, then there exists c > 0 such that dist(x, Ω∗) ≤ c(F(x) − F∗) for any x ∈ R^d. It then implies that HOPS has an O(log(ε0/ε)) iteration complexity for solving a non-smooth REL minimization with a polyhedral epigraph. Yang et al. 
[22] have also considered such non-smooth problems, but they only have an O(1/ε) iteration complexity. When F(x) is essentially locally strongly convex [9] in terms of ‖·‖ such that⁵\n\ndist²(x, Ω∗) ≤ (2/σ)(F(x) − F∗), ∀x ∈ S_ε   (12)\n\nthen we can see that the local error bound holds with θ = 1/2, which implies that the iteration complexity of HOPS is Õ(1/√ε), which is up to a logarithmic factor the same as the result in [6] for a strongly convex function. However, here only local strong convexity is sufficient, and there is no need to develop a different algorithm and a different analysis from the non-strongly convex case as done in [6]. For example, one can consider F(x) = ‖Ax − y‖_p^p = Σ_{i=1}^n |a_i⊤x − y_i|^p, p ∈ (1, 2), which satisfies (12) according to [21].\n\nThe Kurdyka-Łojasiewicz (KL) property. The definition of the KL property is given below.\n\nDefinition 6. The function F(x) is said to have the KL property at x∗ ∈ Ω∗ if there exist η ∈ (0, ∞], a neighborhood U of x∗ and a continuous concave function ϕ : [0, η) → R+ such that i) ϕ(0) = 0, ϕ is continuous on (0, η), ii) for all s ∈ (0, η), ϕ′(s) > 0, iii) and for all x ∈ U ∩ {x : F(x∗) < F(x) < F(x∗) + η}, the KL inequality ϕ′(F(x) − F(x∗))‖∂F(x)‖ ≥ 1 holds.\n\nThe function ϕ is called the desingularizing function of F at x∗, which makes the function F(x) sharp by reparameterization. 
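Before turning to desingularizing functions, the local error bound (10) can be checked directly on the one-dimensional examples of Figure 1: for F(x) = |x|^p the optimal set is {0} with F∗ = 0, and dist(x, Ω∗) = |x| = (F(x) − F∗)^{1/p}, so the LEB holds with c = 1 and θ = 1/p (θ = 1, 2/3, 1/2 for p = 1, 1.5, 2). A minimal numerical check of this claim (our illustration, not the paper's code):

```python
import numpy as np

# F(x) = |x|**p has optimal set {0} and F* = 0, hence
# dist(x, Omega*) = |x| = (F(x) - F*)**(1/p): the LEB (10) holds
# with c = 1 and theta = 1/p, matching Figure 1.
for p, theta in [(1.0, 1.0), (1.5, 2.0 / 3.0), (2.0, 0.5)]:
    x = np.linspace(-0.1, 0.1, 201)   # points near the optimal solution
    F = np.abs(x) ** p
    assert np.all(np.abs(x) <= F ** theta + 1e-12), (p, theta)
```

Note that the smaller θ is, the flatter (less sharp) F is around its minimizers, which is exactly what raises the iteration complexity Õ(1/ε^{1−θ}).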
An important desingularizing function is of the form ϕ(s) = cs^{1−β} for some c > 0 and β ∈ [0, 1), which gives the KL inequality ‖∂F(x)‖ ≥ (1/(c(1−β)))(F(x) − F(x∗))^β. It has been established that the KL property is satisfied by a wide class of non-smooth functions definable in an o-minimal structure [4]. Semialgebraic functions and (globally) subanalytic functions are, for instance, definable in their respective classes. While the definition of the KL property involves a neighborhood U and a constant η, in practice many convex functions satisfy the above property with U = R^d and η = ∞ [1]. The proposition below shows that a function with the KL property with a desingularizing function ϕ(s) = cs^{1−β} obeys the local error bound condition in (10) with θ = 1 − β ∈ (0, 1], which implies an iteration complexity of Õ(1/ε^{1−θ}) of HOPS for optimizing such a function.\n\nProposition 1. (Theorem 5 [10]) Let F(x) be a proper, convex and lower-semicontinuous function that satisfies the KL property at x∗ and let U be a neighborhood of x∗. For all x ∈ U ∩ {x : F(x∗) < F(x) < F(x∗) + η}, if ‖∂F(x)‖ ≥ (1/(c(1−β)))(F(x) − F(x∗))^β, then dist(x, Ω∗) ≤ c(F(x) − F∗)^{1−β}.\n\n4.4 Primal-Dual Homotopy Smoothing (PD-HOPS)\n\nFinally, we note that the required number of iterations per stage t for finding an ε-accurate solution depends on an unknown constant c and sometimes θ. Thus, an inappropriate setting of t may lead to a less accurate solution. In practice, it can be tuned to obtain the fastest convergence. A way to eschew the tuning is to consider a primal-dual homotopy smoothing (PD-HOPS). 
Basically, we also apply the homotopy smoothing to the dual problem:

$$\max_{u \in \Omega_2} \Phi(u) \triangleq -\phi(u) + \min_{x \in \Omega_1} \langle A^\top u, x\rangle + g(x).$$

Denote by $\Phi_*$ the optimal value of the above problem. Under some mild conditions, it is easy to see that $\Phi_* = F_*$. By extending the analysis and the result to the dual problem, we can obtain that $F(x_s) - F_* \leq \epsilon + \epsilon_s$ and $\Phi_* - \Phi(u_s) \leq \epsilon + \epsilon_s$ after the $s$-th stage with a sufficient number of iterations per stage. As a result, we get $F(x_s) - \Phi(u_s) \leq 2(\epsilon + \epsilon_s)$. Therefore, we can use the duality gap $F(x_s) - \Phi(u_s)$ as a certificate to monitor the progress of optimization: once the above inequality holds, we start the next stage. Then with at most $m = \lceil \log_b(\epsilon_0/\epsilon) \rceil$ epochs we get $F(x_m) - \Phi(u_m) \leq 2(\epsilon + \epsilon_m) \leq 4\epsilon$. Similarly, we can show that PD-HOPS enjoys an $\tilde{O}(\max\{1/\epsilon^{1-\theta}, 1/\epsilon^{1-\tilde\theta}\})$ iteration complexity, where $\tilde\theta$ is the exponent constant in the local error bound of the objective function of the dual problem. For example, for linear classification problems with a piecewise linear loss and an $\ell_1$ norm regularizer we have $\theta = 1$ and $\tilde\theta = 1$, and PD-HOPS enjoys a linear convergence. Due to the limitation of space, we defer the details of PD-HOPS and its analysis to the supplement.

⁵This is true if $g(x)$ is strongly convex or locally strongly convex.

Table 1: Comparison of different optimization algorithms by the number of iterations and running time in seconds (mean ± standard deviation) for achieving a solution that satisfies $F(x) - F_* \leq \epsilon$.

          Linear Classification                  Image Denoising                                Matrix Decomposition
          ε = 10^-4          ε = 10^-5           ε = 10^-3             ε = 10^-4                ε = 10^-3          ε = 10^-4
PD        9861 (1.58±0.02)   27215 (4.33±0.06)   8078 (22.01±0.51)     34292 (94.26±2.67)       2523 (4.02±0.10)   3441 (5.65±0.20)
APG-D     4918 (2.44±0.22)   28600 (11.19±0.26)  179204 (924.37±59.67) 1726043 (9032.69±539.01) 1967 (6.85±0.08)   8622 (30.36±0.11)
APG-F     3277 (1.33±0.01)   19444 (7.69±0.07)   14150 (40.90±2.28)    91380 (272.45±14.56)     1115 (3.76±0.06)   4151 (9.16±0.10)
HOPS-D    1012 (0.44±0.02)   4101 (1.67±0.01)    3542 (13.77±0.13)     4501 (17.38±0.10)        224 (1.36±0.02)    313 (1.51±0.03)
HOPS-F    1009 (0.46±0.02)   4102 (1.69±0.04)    2206 (6.99±0.15)      3905 (16.52±0.08)        230 (0.91±0.01)    312 (1.23±0.01)
PD-HOPS   846 (0.36±0.01)    3370 (1.27±0.02)    2538 (7.97±0.13)      3605 (11.39±0.10)        124 (0.45±0.01)    162 (0.64±0.01)

5 Experimental Results
In this section, we present some experimental results to demonstrate the effectiveness of HOPS and PD-HOPS by comparing them with two state-of-the-art algorithms: the first-order Primal-Dual (PD) method [6] and Nesterov's smoothing with Accelerated Proximal Gradient (APG) methods. For APG, we implement two variants, where APG-D refers to the variant with the dual-averaging style of update on one sequence of points (i.e., Algorithm 1) and APG-F refers to the variant in the FISTA style [2]. Similarly, we implement the two variants for HOPS. We conduct experiments for solving three problems: (1) an $\ell_1$-norm regularized hinge loss for linear classification on the w1a dataset⁶; (2) a total-variation-based ROF model for image denoising on the Cameraman picture⁷; (3) a nuclear norm regularized absolute error minimization for low-rank and sparse matrix decomposition on synthetic data.
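As an illustration of the first task, the sketch below shows the $\ell_1$-regularized hinge loss objective and a Nesterov-smoothed surrogate obtained by writing $\max(0, s) = \max_{u \in [0,1]} u \cdot s$ and subtracting $\frac{\mu}{2} u^2$, which admits a closed-form maximizer. The data and helper names (`hinge_l1`, `smoothed_hinge`) are ours, not the paper's code; the per-sample smoothing gap is at most $\mu/2$.

```python
# Sketch (not the paper's implementation) of the l1-regularized hinge
# loss from task (1) and its Nesterov-smoothed approximation.
import numpy as np

def hinge_l1(x, A, y, lam):
    """Non-smooth objective: mean hinge loss + l1 penalty."""
    return np.mean(np.maximum(0.0, 1.0 - y * (A @ x))) + lam * np.abs(x).sum()

def smoothed_hinge(s, mu):
    """max_{u in [0,1]} u*s - (mu/2)*u^2, solved in closed form."""
    u = np.clip(s / mu, 0.0, 1.0)        # optimal dual variable
    return u * s - 0.5 * mu * u ** 2

def hinge_l1_smoothed(x, A, y, lam, mu):
    s = 1.0 - y * (A @ x)
    return np.mean(smoothed_hinge(s, mu)) + lam * np.abs(x).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
y = np.sign(rng.standard_normal(50))
x = rng.standard_normal(10)
# The smoothed objective underestimates F by at most mu/2
# (since the dual variable u lives in [0, 1]).
for mu in (1.0, 0.1, 0.01):
    gap = hinge_l1(x, A, y, 0.1) - hinge_l1_smoothed(x, A, y, 0.1, mu)
    assert 0.0 <= gap <= mu / 2 + 1e-12
```

Decreasing $\mu$ tightens the approximation but increases the $O(1/\mu)$ gradient Lipschitz constant of the surrogate, which is exactly the trade-off the homotopy schedule manages.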
More details about the formulations and experimental setup can be found in the supplement.
To make a fair comparison, we stop each algorithm when the optimality gap is less than a given $\epsilon$ and count the number of iterations and the running time that each algorithm requires. The optimal value is obtained by running PD with a sufficiently large number of iterations such that the duality gap is very small. We present the comparison of different algorithms on the different tasks in Table 1, where for PD-HOPS we only report the results of using the faster variant of APG, i.e., APG-F. We repeat each algorithm 10 times for solving a particular problem and report the averaged running time in seconds and the corresponding standard deviations. The running time of PD-HOPS only accounts for the time for updating the primal variable, since the updates for the dual variable are fully decoupled from the primal updates and can be carried out in parallel. From the results, we can see that (i) the HOPS variants converge consistently faster than their APG counterparts, especially when $\epsilon$ is small; (ii) PD-HOPS chooses the number of iterations at each epoch automatically, yielding faster convergence than HOPS with manual tuning; (iii) both HOPS and PD-HOPS are significantly faster than PD.

6 Conclusions
In this paper, we have developed a homotopy smoothing (HOPS) algorithm for solving a family of structured non-smooth optimization problems with formal guarantees on the iteration complexity. We show that the proposed HOPS can achieve a lower iteration complexity of $\tilde{O}(1/\epsilon^{1-\theta})$ with $\theta \in (0, 1]$ for obtaining an $\epsilon$-optimal solution under a mild local error bound condition. The experimental results on three different tasks demonstrate the effectiveness of HOPS.

Acknowledgements
We thank the anonymous reviewers for their helpful comments. Y. Xu and T.
Yang are partially supported by the National Science Foundation (IIS-1463988, IIS-1545995).

⁶https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
⁷http://pages.cs.wisc.edu/~swright/TVdenoising/

References
[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res., 35:438–457, 2010.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2:183–202, 2009.
[3] S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Img. Sci., 4:1–39, 2011.
[4] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim., 17:1205–1223, 2006.
[5] J. V. Burke and P. Tseng. A unified analysis of Hoffman's bound via Fenchel duality. SIAM J. Optim., 6(2):265–282, 1996.
[6] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Img. Vis., 40:120–145, 2011.
[7] X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured sparse regression. Ann. Appl. Stat., 6(2):719–752, 2012.
[8] A. Gilpin, J. Peña, and T. Sandholm. First-order algorithm with log(1/ε) convergence for ε-equilibrium in two-person zero-sum games. Math. Program., 133(1-2):279–298, 2012.
[9] R. Goebel and R. T. Rockafellar. Local strong convexity and local Lipschitz continuity of the gradient of convex functions. J. Convex Anal., 15(2):263–270, 2008.
[10] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter.
From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.
[11] G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Math. Program., 126(1):1–29, 2011.
[12] S. Łojasiewicz. Ensembles semi-analytiques. Institut des Hautes Études Scientifiques, 1965.
[13] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res., 46(1):157–178, 1993.
[14] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. CoRR, abs/1504.06298, 2015.
[15] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 15(1):229–251, 2005.
[16] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
[17] J. Pang. Error bounds in mathematical programming. Math. Program., 79:299–332, 1997.
[18] R. Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970.
[19] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim., 2008.
[20] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim., 23(2):1062–1091, 2013.
[21] T. Yang and Q. Lin. RSG: beating subgradient method without smoothness and strong convexity. CoRR, abs/1512.03107, 2016.
[22] T. Yang, M. Mahdavi, R. Jin, and S. Zhu. An efficient primal-dual prox method for non-smooth optimization. Machine Learning, 2014.
[23] H. Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. arXiv:1606.00269, 2016.
[24] X.
Zhang, A. Saha, and S. Vishwanathan. Smoothing multivariate performance measures. JMLR, 13:3623–3680, 2012.
[25] Z. Zhou and A. M.-C. So. A unified approach to error bounds for structured convex optimization problems. CoRR, abs/1512.03518, 2015.