{"title": "Geometric Descent Method for Convex Composite Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 636, "page_last": 644, "abstract": "In this paper, we extend the geometric descent method recently proposed by Bubeck, Lee and Singh to tackle nonsmooth and strongly convex composite problems. We prove that our proposed algorithm, dubbed geometric proximal gradient method (GeoPG), converges with a linear rate $(1-1/\\sqrt{\\kappa})$ and thus achieves the optimal rate among first-order methods, where $\\kappa$ is the condition number of the problem. Numerical results on linear regression and logistic regression with elastic net regularization show that GeoPG compares favorably with Nesterov's accelerated proximal gradient method, especially when the problem is ill-conditioned.", "full_text": "Geometric Descent Method for\nConvex Composite Minimization\n\nShixiang Chen1, Shiqian Ma2, and Wei Liu3\n\n1Department of SEEM, The Chinese University of Hong Kong, Hong Kong\n\n2Department of Mathematics, UC Davis, USA\n\n3Tencent AI Lab, China\n\nAbstract\n\nIn this paper, we extend the geometric descent method recently proposed by Bubeck,\nLee and Singh [1] to tackle nonsmooth and strongly convex composite problems.\n\u221a\nWe prove that our proposed algorithm, dubbed geometric proximal gradient method\n(GeoPG), converges with a linear rate (1 \u2212 1/\n\u03ba) and thus achieves the optimal\nrate among \ufb01rst-order methods, where \u03ba is the condition number of the problem.\nNumerical results on linear regression and logistic regression with elastic net\nregularization show that GeoPG compares favorably with Nesterov\u2019s accelerated\nproximal gradient method, especially when the problem is ill-conditioned.\n\n1\n\nIntroduction\n\nRecently, Bubeck, Lee and Singh proposed a geometric descent method (GeoD) for minimizing a\nsmooth and strongly convex function [1]. 
They showed that GeoD achieves the same optimal rate as Nesterov's accelerated gradient method (AGM) [2, 3]. In this paper, we provide an extension of GeoD that minimizes a nonsmooth function in the composite form:

$$\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + h(x), \qquad (1.1)$$

where $f$ is $\alpha$-strongly convex and $\beta$-smooth (i.e., $\nabla f$ is Lipschitz continuous with Lipschitz constant $\beta$), and $h$ is a closed nonsmooth convex function with a simple proximal mapping. Commonly seen examples of $h$ include the $\ell_1$ norm, the $\ell_2$ norm, the nuclear norm, and so on.

If $h$ vanishes, then the objective function of (1.1) becomes smooth and strongly convex. In this case, it is known that AGM converges with a linear rate $(1 - 1/\sqrt{\kappa})$, which is optimal among all first-order methods, where $\kappa = \beta/\alpha$ is the condition number of the problem. However, AGM lacks a clear geometric intuition, making it difficult to interpret. Recently, there has been much work on attempting to explain AGM or to design new algorithms with the same optimal rate (see [4, 5, 1, 6, 7]). In particular, the GeoD method proposed in [1] has a clear geometric intuition in the flavor of the ellipsoid method [8]. The follow-up work [9, 10] attempted to improve the performance of GeoD by exploiting gradient information from the past with a "limited-memory" idea. Moreover, Drusvyatskiy, Fazel and Roy [10] showed how to extend the suboptimal version of GeoD (with convergence rate $(1 - 1/\kappa)$) to solve the composite problem (1.1). However, it was not clear how to extend the optimal version of GeoD to address (1.1), and the authors posed this as an open question. In this paper, we settle this question by proposing a geometric proximal gradient (GeoPG) algorithm that solves the composite problem (1.1). We further show how to incorporate various techniques to improve the performance of the proposed algorithm.

Notation.
We use $B(c, r^2) = \{x \mid \|x - c\|^2 \le r^2\}$ to denote the ball with center $c$ and radius $r$. We use $\mathrm{Line}(x, y)$ to denote the line that connects $x$ and $y$, i.e., $\{x + s(y - x), \, s \in \mathbb{R}\}$. For fixed $t \in (0, 1/\beta]$, we denote $x^+ := \mathrm{Prox}_{th}(x - t\nabla f(x))$, where the proximal mapping $\mathrm{Prox}_h(\cdot)$ is defined as $\mathrm{Prox}_h(x) := \mathrm{argmin}_z \; h(z) + \frac{1}{2}\|z - x\|^2$. The proximal gradient of $F$ is defined as $G_t(x) := (x - x^+)/t$. It should be noted that $x^+ = x - tG_t(x)$. We also denote $x^{++} := x - G_t(x)/\alpha$. Note that both $x^+$ and $x^{++}$ depend on $t$; we omit $t$ whenever there is no ambiguity.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The rest of this paper is organized as follows. In Section 2, we briefly review the GeoD method for solving smooth and strongly convex problems. In Section 3, we present our GeoPG algorithm for solving the nonsmooth problem (1.1) and analyze its convergence rate. We address two practical issues of the proposed method in Section 4 and incorporate two techniques, backtracking and limited memory, to cope with these issues. In Section 5, we report numerical results comparing GeoPG with Nesterov's accelerated proximal gradient method on linear regression and logistic regression problems with elastic net regularization. Finally, we conclude the paper in Section 6.

2 Geometric Descent Method for Smooth Problems

The GeoD method [1] solves (1.1) with $h \equiv 0$, in which case the problem reduces to the smooth and strongly convex problem $\min f(x)$. We denote its optimal solution and optimal value by $x^*$ and $f^*$, respectively. Throughout this section, we fix $t = 1/\beta$, which together with $h \equiv 0$ implies that $x^+ = x - \nabla f(x)/\beta$ and $x^{++} = x - \nabla f(x)/\alpha$. We first briefly describe the basic idea of the suboptimal GeoD.
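As a concrete instance of the notation above, take $h = \mu\|\cdot\|_1$, whose proximal mapping is componentwise soft-thresholding. The sketch below computes $x^+$, $G_t(x)$ and $x^{++}$ for a least-squares $f$; the matrix, the parameter values, and the function names are illustrative assumptions, not part of the paper:

```python
import numpy as np

def prox_l1(v, tau):
    # Prox_{tau*||.||_1}(v): componentwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_quantities(A, b, mu, alpha, x, t):
    """x+, G_t(x) and x++ for f(x) = 0.5*||Ax - b||^2, h = mu*||.||_1."""
    grad = A.T @ (A @ x - b)                 # gradient of the smooth part f
    x_plus = prox_l1(x - t * grad, t * mu)   # x+ = Prox_{th}(x - t*grad f(x))
    G = (x - x_plus) / t                     # proximal gradient G_t(x)
    x_pp = x - G / alpha                     # x++ = x - G_t(x)/alpha
    return x_plus, G, x_pp
```

By construction the identity $x^+ = x - tG_t(x)$ holds. For the example choice $A = \mathrm{diag}(1, 2)$ one has $\alpha = 1$ and $\beta = 4$, so any $t \in (0, 1/4]$ is a valid step size.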
Since $f$ is $\alpha$-strongly convex, the following inequality holds:

$$f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2 \le f(y), \quad \forall x, y \in \mathbb{R}^n. \qquad (2.1)$$

By letting $y = x^*$ in (2.1), it is easy to obtain

$$x^* \in B\big(x^{++}, \|\nabla f(x)\|^2/\alpha^2 - 2(f(x) - f^*)/\alpha\big), \quad \forall x \in \mathbb{R}^n. \qquad (2.2)$$

Note that the $\beta$-smoothness of $f$ implies

$$f(x^+) \le f(x) - \|\nabla f(x)\|^2/(2\beta), \quad \forall x \in \mathbb{R}^n. \qquad (2.3)$$

Combining (2.2) and (2.3) yields $x^* \in B\big(x^{++}, (1 - 1/\kappa)\|\nabla f(x)\|^2/\alpha^2 - 2(f(x^+) - f^*)/\alpha\big)$. As a result, suppose that initially we have a ball $B(x_0, R_0^2)$ that contains $x^*$; then it follows that

$$x^* \in B\big(x_0, R_0^2\big) \cap B\big(x_0^{++}, (1 - 1/\kappa)\|\nabla f(x_0)\|^2/\alpha^2 - 2(f(x_0^+) - f^*)/\alpha\big). \qquad (2.4)$$

Some simple algebraic calculations show that the squared radius of the minimum enclosing ball of the right-hand side of (2.4) is no larger than $R_0^2(1 - 1/\kappa)$, i.e., there exists some $x_1 \in \mathbb{R}^n$ such that $x^* \in B\big(x_1, R_0^2(1 - 1/\kappa)\big)$. Therefore, the squared radius of the initial ball shrinks by a factor of $(1 - 1/\kappa)$. Repeating this process yields a linearly convergent sequence $\{x_k\}$ with convergence rate $(1 - 1/\kappa)$: $\|x_k - x^*\|^2 \le (1 - 1/\kappa)^k R_0^2$.

The optimal GeoD (with linear convergence rate $(1 - 1/\sqrt{\kappa})$) maintains two balls containing $x^*$ in each iteration, whose centers are $c_k$ and $x_{k+1}^{++}$, respectively. More specifically, suppose that in the $k$-th iteration we have $c_k$ and $x_k$; then $c_{k+1}$ and $x_{k+1}$ are obtained as follows. First, $x_{k+1}$ is the minimizer of $f$ on $\mathrm{Line}(c_k, x_k^+)$. Second, $c_{k+1}$ (resp. $R_{k+1}^2$) is the center (resp.
squared radius) of the ball (given by Lemma 2.1) that contains

$$B\big(c_k, R_k^2 - \|\nabla f(x_{k+1})\|^2/(\alpha^2\kappa)\big) \cap B\big(x_{k+1}^{++}, (1 - 1/\kappa)\|\nabla f(x_{k+1})\|^2/\alpha^2\big).$$

Calculating $c_{k+1}$ and $R_{k+1}$ is easy, and we refer to Algorithm 1 of [1] for details. By applying Lemma 2.1 with $x_A = c_k$, $x_B = x_{k+1}^{++}$, $r_A = R_k$, $r_B = \|\nabla f(x_{k+1})\|/\alpha$, $\epsilon = 1/\kappa$ and $\delta = \frac{2}{\alpha}(f(x_k^+) - f(x^*))$, we obtain $R_{k+1}^2 = (1 - 1/\sqrt{\kappa})R_k^2$, which further implies $\|x^* - c_k\|^2 \le (1 - 1/\sqrt{\kappa})^k R_0^2$, i.e., the optimal GeoD converges with the linear rate $(1 - 1/\sqrt{\kappa})$.

Lemma 2.1 (see [1, 10]). Fix centers $x_A, x_B \in \mathbb{R}^n$ and squared radii $r_A^2, r_B^2 > 0$. Also fix $\epsilon \in (0, 1)$ and suppose $\|x_A - x_B\|^2 \ge r_B^2$. Then there exists a new center $c \in \mathbb{R}^n$ such that for any $\delta > 0$ we have

$$B\big(x_A, r_A^2 - \epsilon r_B^2 - \delta\big) \cap B\big(x_B, r_B^2(1 - \epsilon) - \delta\big) \subset B\big(c, (1 - \sqrt{\epsilon})r_A^2 - \delta\big).$$

3 Geometric Descent Method for Nonsmooth Convex Composite Problems

Drusvyatskiy, Fazel and Roy [10] extended the suboptimal GeoD to solve the composite problem (1.1). However, it was not clear how to extend the optimal GeoD to solve problem (1.1). We resolve this question in this section.

The following lemma is useful to our analysis. Its proof is in the supplementary material.

Lemma 3.1. Given a point $x \in \mathbb{R}^n$ and a step size $t \in (0, 1/\beta]$, denote $x^+ = x - tG_t(x)$.
The following inequality holds for any $y \in \mathbb{R}^n$:

$$F(y) \ge F(x^+) + \langle G_t(x), y - x \rangle + \frac{t}{2}\|G_t(x)\|^2 + \frac{\alpha}{2}\|y - x\|^2. \qquad (3.1)$$

3.1 GeoPG Algorithm

In this subsection, we describe our proposed geometric proximal gradient method (GeoPG) for solving (1.1). Throughout Sections 3.1 and 3.2, $t \in (0, 1/\beta]$ is a fixed scalar. The key observation for designing GeoPG is that in the $k$-th iteration one has to find $x_k$ lying on $\mathrm{Line}(x_{k-1}^+, c_{k-1})$ such that the following two inequalities hold:

$$F(x_k^+) \le F(x_{k-1}^+) - \frac{t}{2}\|G_t(x_k)\|^2, \quad \text{and} \quad \|x_k^{++} - c_{k-1}\|^2 \ge \frac{1}{\alpha^2}\|G_t(x_k)\|^2. \qquad (3.2)$$

Intuitively, the first inequality in (3.2) requires a reduction in function value from $x_{k-1}^+$ to $x_k^+$, and the second inequality requires the centers of the two balls to be far enough apart that Lemma 2.1 can be applied.

The following lemma gives a sufficient condition for (3.2). Its proof is in the supplementary material.

Lemma 3.2. (3.2) holds if $x_k$ satisfies

$$\langle x_k^+ - x_k, \, x_{k-1}^+ - x_k \rangle \le 0, \quad \text{and} \quad \langle x_k^+ - x_k, \, x_k - c_{k-1} \rangle \ge 0. \qquad (3.3)$$

Therefore, we only need to find $x_k$ such that (3.3) holds. To do so, we define the following functions for given $x$, $c$ ($x \ne c$) and $t \in (0, 1/\beta]$:

$$\phi_{t,x,c}(z) = \langle z^+ - z, x - c \rangle, \; \forall z \in \mathbb{R}^n, \quad \text{and} \quad \bar{\phi}_{t,x,c}(s) = \phi_{t,x,c}\big(x + s(c - x)\big), \; \forall s \in \mathbb{R}.$$

These functions have the following properties, whose proofs can be found in the supplementary material.

Lemma 3.3. (i) $\phi_{t,x,c}(z)$ is Lipschitz continuous.
(ii) $\bar{\phi}_{t,x,c}(s)$ is strictly monotonically increasing.

We are now ready to describe how to find $x_k$ such that (3.3) holds. This is summarized in Lemma 3.4, where we abbreviate $\bar{\phi} := \bar{\phi}_{t,x_{k-1}^+,c_{k-1}}$.

Lemma 3.4. The following two procedures find $x_k$ satisfying (3.3).

(i) If $\bar{\phi}(1) \le 0$, then (3.3) holds by setting $x_k := c_{k-1}$; if $\bar{\phi}(0) \ge 0$, then (3.3) holds by setting $x_k := x_{k-1}^+$; if $\bar{\phi}(1) > 0$ and $\bar{\phi}(0) < 0$, then there exists $s \in [0, 1]$ such that $\bar{\phi}(s) = 0$, and (3.3) holds by setting $x_k := x_{k-1}^+ + s(c_{k-1} - x_{k-1}^+)$.

(ii) If $\bar{\phi}(0) \ge 0$, then (3.3) holds by setting $x_k := x_{k-1}^+$; if $\bar{\phi}(0) < 0$, then there exists $s \ge 0$ such that $\bar{\phi}(s) = 0$, and (3.3) holds by setting $x_k := x_{k-1}^+ + s(c_{k-1} - x_{k-1}^+)$.

Proof. Case (i) follows directly from the intermediate value theorem. Case (ii) follows from the monotonicity and continuity of $\bar{\phi}_{t,x_{k-1}^+,c_{k-1}}$ established in Lemma 3.3.

It is indeed very easy to find $x_k$ satisfying the two cases in Lemma 3.4, since we are dealing with a univariate Lipschitz continuous function $\bar{\phi}_{t,x,c}(s)$. Specifically, for case (i) of Lemma 3.4, we can use the bisection method to find a zero of $\bar{\phi}_{t,x_{k-1}^+,c_{k-1}}$ in the closed interval $[0, 1]$. In practice, we found that the Brent-Dekker method [11, 12] performs much better than the bisection method, so we use the Brent-Dekker method in our numerical experiments.
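For case (i), a plain bisection on $\bar{\phi}$ already suffices (the paper's experiments use the Brent-Dekker method; bisection is shown here only for brevity). The instance below, with $f(z) = \frac{1}{2}\|z\|^2$ (so $\alpha = \beta = 1$) and $h = \mu\|\cdot\|_1$, is an illustrative assumption, not the authors' code:

```python
import numpy as np

def prox_l1(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def find_xk(x, c, t, mu):
    """Case (i) of Lemma 3.4 for f(z) = 0.5*||z||^2, h = mu*||.||_1.

    phi_bar(s) = <z+ - z, x - c> with z = x + s*(c - x); it is
    monotonically increasing, so a sign change on [0, 1] can be
    bracketed and bisected.
    """
    def phi_bar(s):
        z = x + s * (c - x)
        z_plus = prox_l1(z - t * z, t * mu)   # grad f(z) = z here
        return np.dot(z_plus - z, x - c)

    if phi_bar(0.0) >= 0.0:
        return x                  # x_k := x_{k-1}^+
    if phi_bar(1.0) <= 0.0:
        return c                  # x_k := c_{k-1}
    lo, hi = 0.0, 1.0             # phi_bar(lo) < 0 < phi_bar(hi)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi_bar(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return x + 0.5 * (lo + hi) * (c - x)
```

Whichever branch fires, the returned point satisfies the two inner-product conditions (3.3) up to the bisection tolerance.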
For case (ii) of Lemma 3.4, we can use the semi-smooth Newton method to find a zero of $\bar{\phi}_{t,x_{k-1}^+,c_{k-1}}$ in the interval $[0, +\infty)$. In our numerical experiments, we implemented the global semi-smooth Newton method [13, 14] and obtained very encouraging results. These two procedures are described in Algorithms 1 and 2, respectively. Based on the discussion above, we know that the $x_k$ generated by either algorithm satisfies (3.3) and hence (3.2).

We are now ready to present our GeoPG algorithm for solving (1.1) as Algorithm 3.

Algorithm 1: The first procedure for finding $x_k$ from given $x_{k-1}^+$ and $c_{k-1}$.
1: if $\langle (x_{k-1}^+)^+ - x_{k-1}^+, \, x_{k-1}^+ - c_{k-1} \rangle \ge 0$ then
2:   set $x_k := x_{k-1}^+$;
3: else if $\langle c_{k-1}^+ - c_{k-1}, \, x_{k-1}^+ - c_{k-1} \rangle \le 0$ then
4:   set $x_k := c_{k-1}$;
5: else
6:   use the Brent-Dekker method to find $s \in [0, 1]$ such that $\bar{\phi}_{t,x_{k-1}^+,c_{k-1}}(s) = 0$, and set $x_k := x_{k-1}^+ + s(c_{k-1} - x_{k-1}^+)$;
7: end if

Algorithm 2: The second procedure for finding $x_k$ from given $x_{k-1}^+$ and $c_{k-1}$.
1: if $\langle (x_{k-1}^+)^+ - x_{k-1}^+, \, x_{k-1}^+ - c_{k-1} \rangle \ge 0$ then
2:   set $x_k := x_{k-1}^+$;
3: else
4:   use the global semi-smooth Newton method [13, 14] to find a root $s \in [0, +\infty)$ of $\bar{\phi}_{t,x_{k-1}^+,c_{k-1}}(s)$, and set $x_k := x_{k-1}^+ + s(c_{k-1} - x_{k-1}^+)$;
5: end if

3.2 Convergence Analysis of GeoPG

We are now ready to present our main convergence result for GeoPG.

Theorem 3.5.
Given an initial point $x_0$ and a step size $t \in (0, 1/\beta]$, set $R_0^2 = \frac{\|G_t(x_0)\|^2}{\alpha^2}(1 - \alpha t)$. Suppose that the sequence $\{(x_k, c_k, R_k)\}$ is generated by Algorithm 3, and that $x^*$ is the optimal solution of (1.1) with optimal objective value $F^*$. For any $k \ge 0$, one has $x^* \in B(c_k, R_k^2)$ and $R_{k+1}^2 \le (1 - \sqrt{\alpha t})R_k^2$, and thus

$$\|x^* - c_k\|^2 \le (1 - \sqrt{\alpha t})^k R_0^2, \quad \text{and} \quad F(x_{k+1}^+) - F^* \le \frac{\alpha}{2}(1 - \sqrt{\alpha t})^k R_0^2. \qquad (3.4)$$

Note that when $t = 1/\beta$, (3.4) implies the linear convergence rate $(1 - 1/\sqrt{\kappa})$.

Proof. We prove a stronger result by induction: for every $k \ge 0$, one has

$$x^* \in B\big(c_k, R_k^2 - 2(F(x_k^+) - F^*)/\alpha\big). \qquad (3.5)$$

Let $y = x^*$ in (3.1). We have $\|x^* - x^{++}\|^2 \le (1 - \alpha t)\|G_t(x)\|^2/\alpha^2 - 2(F(x^+) - F^*)/\alpha$, implying

$$x^* \in B\big(x^{++}, \|G_t(x)\|^2(1 - \alpha t)/\alpha^2 - 2(F(x^+) - F^*)/\alpha\big). \qquad (3.6)$$

Setting $x = x_0$ in (3.6) shows that (3.5) holds for $k = 0$. We now assume that (3.5) holds for some $k \ge 0$ and prove that it holds for $k + 1$. Combining (3.5) and the first inequality of (3.2) yields

$$x^* \in B\big(c_k, R_k^2 - t\|G_t(x_{k+1})\|^2/\alpha - 2(F(x_{k+1}^+) - F^*)/\alpha\big). \qquad (3.7)$$

By setting $x = x_{k+1}$ in (3.6), we obtain

$$x^* \in B\big(x_{k+1}^{++}, \|G_t(x_{k+1})\|^2(1 - \alpha t)/\alpha^2 - 2(F(x_{k+1}^+) - F^*)/\alpha\big). \qquad (3.8)$$

We now apply Lemma 2.1 to (3.7) and (3.8).
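Computationally, applying Lemma 2.1 amounts to finding the minimum enclosing ball of the intersection of two balls. A self-contained geometric sketch of that two-ball computation follows (our own derivation for illustration, not necessarily the computation in Algorithm 1 of [1]; it assumes the balls intersect):

```python
import numpy as np

def min_enclosing_ball(c1, R1sq, c2, R2sq):
    """Minimum enclosing ball of B(c1, R1sq) ∩ B(c2, R2sq).

    Returns (center, squared_radius), assuming the intersection is
    nonempty. If the hyperplane of the intersection sphere falls
    outside the segment [c1, c2], the widest cross-section of the
    lens is a full cross-section of one ball, so that ball is the
    answer; otherwise the answer is the ball through the
    intersection sphere.
    """
    d = np.linalg.norm(c2 - c1)
    if d == 0.0:
        return (c1, min(R1sq, R2sq))
    a = (d * d + R1sq - R2sq) / (2.0 * d)   # signed distance c1 -> plane
    if a <= 0.0:
        return (c1, R1sq)
    if a >= d:
        return (c2, R2sq)
    u = (c2 - c1) / d
    return (c1 + a * u, R1sq - a * a)       # ball through the circle
```

With the particular radii in (3.7) and (3.8), Lemma 2.1 guarantees that the squared radius returned by such a computation contracts by at least the factor $(1 - \sqrt{\alpha t})$.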
Specifically, we set $x_A = c_k$, $x_B = x_{k+1}^{++}$, $r_A = R_k$, $r_B = \|G_t(x_{k+1})\|/\alpha$, $\epsilon = \alpha t$ and $\delta = \frac{2}{\alpha}(F(x_{k+1}^+) - F^*)$, and note that $\|x_A - x_B\|^2 \ge r_B^2$ because of the second inequality of (3.2). Then Lemma 2.1 indicates that there exists $c_{k+1}$ such that

$$x^* \in B\big(c_{k+1}, (1 - \sqrt{\alpha t})R_k^2 - 2(F(x_{k+1}^+) - F^*)/\alpha\big), \qquad (3.9)$$

i.e., (3.5) holds for $k + 1$ with $R_{k+1}^2 \le (1 - \sqrt{\alpha t})R_k^2$. Note that $c_{k+1}$ is the center of the minimum enclosing ball of the intersection of the two balls in (3.7) and (3.8), and can be computed in the same way as in Algorithm 1 of [1]. From (3.9) we obtain $\|x^* - c_{k+1}\|^2 \le (1 - \sqrt{\alpha t})R_k^2 \le (1 - \sqrt{\alpha t})^{k+1}R_0^2$. Moreover, (3.7) indicates that $F(x_{k+1}^+) - F^* \le \frac{\alpha}{2}R_k^2 \le \frac{\alpha}{2}(1 - \sqrt{\alpha t})^k R_0^2$.

Algorithm 3: GeoPG: geometric proximal gradient descent for convex composite minimization.
Require: Parameters $\alpha$, $\beta$, initial point $x_0$ and step size $t \in (0, 1/\beta]$.
1: Set $c_0 = x_0^{++}$, $R_0^2 = \|G_t(x_0)\|^2(1 - \alpha t)/\alpha^2$;
2: for $k = 1, 2, \ldots$ do
3:   Use Algorithm 1 or 2 to find $x_k$;
4:   Set $x_A := x_k^{++} = x_k - G_t(x_k)/\alpha$, and $R_A^2 = \|G_t(x_k)\|^2(1 - \alpha t)/\alpha^2$;
5:   Set $x_B := c_{k-1}$, and $R_B^2 = R_{k-1}^2 - 2(F(x_{k-1}^+) - F(x_k^+))/\alpha$;
6:   Compute $B(c_k, R_k^2)$: the minimum enclosing ball of $B(x_A, R_A^2) \cap B(x_B, R_B^2)$, which can be done using Algorithm 1 in [1];
7: end for

4 Practical Issues

4.1 GeoPG with Backtracking

In practice, the Lipschitz constant $\beta$ may be unknown. In this subsection, we describe a backtracking strategy for GeoPG in which $\beta$ is not needed.
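Before deriving the test, here is a minimal stand-alone sketch of the backtracking loop itself; the quadratic smooth part, the $\ell_1$ prox, and all parameter values are illustrative assumptions:

```python
import numpy as np

def prox_l1(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def backtrack(f, grad_f, mu, x, t, eta=0.5):
    """Shrink t until f(x+) <= f(x) - t*<grad f(x), G> + (t/2)*||G||^2.

    The test is guaranteed to pass once t <= 1/beta, so the loop
    terminates with t >= eta/beta.
    """
    g = grad_f(x)
    while True:
        x_plus = prox_l1(x - t * g, t * mu)
        G = (x - x_plus) / t
        if f(x_plus) <= f(x) - t * np.dot(g, G) + 0.5 * t * np.dot(G, G) + 1e-12:
            return t, x_plus
        t *= eta

# illustrative smooth part: f(x) = 0.5 * x' diag(1, 10) x, so beta = 10
D = np.array([1.0, 10.0])
f = lambda x: 0.5 * np.dot(D * x, x)
grad_f = lambda x: D * x
t, x_plus = backtrack(f, grad_f, mu=0.1, x=np.array([1.0, 1.0]), t=1.0)
```

In GeoPG-B the accepted step size $t$ is then reused to find $x_k$ by Algorithm 1 or 2.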
From the $\beta$-smoothness of $f$, we have

$$f(x^+) \le f(x) - t\langle \nabla f(x), G_t(x) \rangle + \frac{t}{2}\|G_t(x)\|^2, \qquad (4.1)$$

which holds whenever $t \in (0, 1/\beta]$; note that inequality (3.1) relies on (4.1). If $\beta$ is unknown, we can backtrack on $t$ until (4.1) holds, which is common practice for proximal gradient methods, e.g., [15-17]. Note that the key step in our analysis of GeoPG is to guarantee that the two inequalities in (3.2) hold. According to Lemma 3.2, the second inequality in (3.2) holds as long as we use Algorithm 1 or Algorithm 2 to find $x_k$, and this does not require knowledge of $\beta$. However, the first inequality in (3.2) requires $t \le 1/\beta$, because its proof in Lemma 3.2 needs (3.1). Thus, we need to backtrack on $t$ until (4.1) is satisfied, and then use the same $t$ to find $x_k$ by Algorithm 1 or Algorithm 2. Our GeoPG algorithm with backtracking (GeoPG-B) is described in Algorithm 4.

Algorithm 4: GeoPG with Backtracking (GeoPG-B)
Require: Parameters $\alpha$, $\gamma \in (0, 1)$, $\eta \in (0, 1)$, initial step size $t_0 > 0$ and initial point $x_0$.
Repeat $t_0 := \eta t_0$ until (4.1) holds for $t = t_0$;
Set $c_0 = x_0^{++}$, $R_0^2 = \frac{\|G_{t_0}(x_0)\|^2}{\alpha^2}(1 - \alpha t_0)$;
for $k = 1, 2, \ldots$
do
  if no backtracking was performed in the $(k-1)$-th iteration then
    Set $t_k := t_{k-1}/\gamma$;
  else
    Set $t_k := t_{k-1}$;
  end if
  Compute $x_k$ by Algorithm 1 or Algorithm 2 with $t = t_k$;
  while $f(x_k^+) > f(x_k) - t_k\langle \nabla f(x_k), G_{t_k}(x_k) \rangle + \frac{t_k}{2}\|G_{t_k}(x_k)\|^2$ do
    Set $t_k := \eta t_k$ (backtracking);
    Compute $x_k$ by Algorithm 1 or Algorithm 2 with $t = t_k$;
  end while
  Set $x_A := x_k^{++} = x_k - G_{t_k}(x_k)/\alpha$, $R_A^2 = \frac{\|G_{t_k}(x_k)\|^2}{\alpha^2}(1 - \alpha t_k)$;
  Set $x_B := c_{k-1}$, $R_B^2 = R_{k-1}^2 - \frac{2}{\alpha}(F(x_{k-1}^+) - F(x_k^+))$;
  Compute $B(c_k, R_k^2)$: the minimum enclosing ball of $B(x_A, R_A^2) \cap B(x_B, R_B^2)$;
end for

Note that the sequence $\{t_k\}$ generated in Algorithm 4 is uniformly bounded away from 0, because (4.1) always holds once $t_k \le 1/\beta$. As a result, $t_k \ge t_{\min} := \min_{i=0,\ldots,k} t_i \ge \eta/\beta$. It is easy to see that in the $k$-th iteration of Algorithm 4, $x^*$ is contained in two balls:

$$x^* \in B\big(c_{k-1}, R_{k-1}^2 - t_k\|G_{t_k}(x_k)\|^2/\alpha - 2(F(x_k^+) - F^*)/\alpha\big),$$
$$x^* \in B\big(x_k^{++}, \|G_{t_k}(x_k)\|^2(1 - \alpha t_k)/\alpha^2 - 2(F(x_k^+) - F^*)/\alpha\big).$$

Therefore, we have the following convergence result for Algorithm 4, whose proof is similar to that of Theorem 3.5 and is omitted for succinctness.

Theorem 4.1. Suppose that $\{(x_k, c_k, R_k, t_k)\}$ is generated by Algorithm 4.
For any $k \ge 0$, one has $x^* \in B(c_k, R_k^2)$ and $R_{k+1}^2 \le (1 - \sqrt{\alpha t_k})R_k^2$, and thus $\|x^* - c_k\|^2 \le \prod_{i=0}^{k-1}(1 - \sqrt{\alpha t_i})\,R_0^2 \le (1 - \sqrt{\alpha t_{\min}})^k R_0^2$.

4.2 GeoPG with Limited Memory

The basic idea of GeoD is that in each iteration we maintain two balls $B(y_1, r_1^2)$ and $B(y_2, r_2^2)$ that both contain $x^*$, and then compute the minimum enclosing ball of their intersection, which is expected to be smaller than both $B(y_1, r_1^2)$ and $B(y_2, r_2^2)$. One intuitive idea that can possibly improve the performance of GeoD is to maintain more balls from the past, because their intersection should be smaller than the intersection of only two balls. This idea was proposed in [9] and [10]. Specifically, Bubeck and Lee [9] suggested keeping all the balls from past iterations and computing the minimum enclosing ball of their intersection. For a given bounded set $Q$, the center of its minimum enclosing ball is known as the Chebyshev center, defined as the solution to the following problem:

$$\min_y \max_{x \in Q} \|y - x\|^2 = \min_y \max_{x \in Q} \|y\|^2 - 2y^\top x + \mathrm{Tr}(xx^\top). \qquad (4.2)$$

Problem (4.2) is not easy to solve for a general set $Q$. However, when $Q := \cap_{i=1}^m B(y_i, r_i^2)$, Beck [18] proved that the relaxed Chebyshev center (RCC) [19], which is a convex quadratic program, is equivalent to (4.2) if $m < n$. Therefore, we can solve (4.2) by solving the convex quadratic program (RCC):

$$\min_y \max_{(x,\Delta) \in \Gamma} \|y\|^2 - 2y^\top x + \mathrm{Tr}(\Delta) = \max_{(x,\Delta) \in \Gamma} \min_y \|y\|^2 - 2y^\top x + \mathrm{Tr}(\Delta) = \max_{(x,\Delta) \in \Gamma} -\|x\|^2 + \mathrm{Tr}(\Delta), \qquad (4.3)$$
where $\Gamma = \{(x, \Delta) : x \in Q, \; \Delta \succeq xx^\top\}$. If $Q = \cap_{i=1}^m B(c_i, r_i^2)$, then the dual of (4.3) is

$$\min_\lambda \; \|C\lambda\|^2 - \sum_{i=1}^m \lambda_i\|c_i\|^2 + \sum_{i=1}^m \lambda_i r_i^2, \quad \text{s.t.} \quad \sum_{i=1}^m \lambda_i = 1, \; \lambda_i \ge 0, \; i = 1, \ldots, m, \qquad (4.4)$$

where $C = [c_1, \ldots, c_m]$ and $\lambda_i$, $i = 1, \ldots, m$, are the dual variables. Beck [18] proved that the optimal solutions of (4.2) and (4.4) are linked by $x^* = C\lambda^*$ if $m < n$.

We can now give our limited-memory GeoPG algorithm (L-GeoPG) as Algorithm 5.

Algorithm 5: L-GeoPG: Limited-memory GeoPG
Require: Parameters $\alpha$, $\beta$, memory size $m > 0$ and initial point $x_0$.
1: Set $c_0 = x_0^{++}$, $r_0^2 = R_0^2 = \|G_t(x_0)\|^2(1 - 1/\kappa)/\alpha^2$, and $t = 1/\beta$;
2: for $k = 1, 2, \ldots$ do
3:   Use Algorithm 1 or 2 to find $x_k$;
4:   Compute $r_k^2 = \|G_t(x_k)\|^2(1 - 1/\kappa)/\alpha^2$;
5:   Compute $B(c_k, R_k^2)$: an enclosing ball of the intersection of $B(c_{k-1}, R_{k-1}^2)$ and $Q_k := \cap_{i=k-m+1}^k B(x_i^{++}, r_i^2)$ (if $k \le m$, then set $Q_k := \cap_{i=1}^k B(x_i^{++}, r_i^2)$). This is done by setting $c_k = C\lambda^*$, where $\lambda^*$ is the optimal solution of (4.4);
6: end for

Remark 4.2. Backtracking can also be incorporated into L-GeoPG. We denote the resulting algorithm by L-GeoPG-B.

L-GeoPG has the same linear convergence rate as GeoPG, as shown in Theorem 4.3.

Theorem 4.3. Consider the L-GeoPG algorithm. For any $k \ge 0$, one has $x^* \in B(c_k, R_k^2)$ and $R_k^2 \le (1 - 1/\sqrt{\kappa})R_{k-1}^2$, and thus $\|x^* - c_k\|^2 \le (1 - 1/\sqrt{\kappa})^k R_0^2$.

Proof. Note that $Q_k := \cap_{i=k-m+1}^k B(x_i^{++}, r_i^2) \subset B(x_k^{++}, r_k^2)$. Thus, the minimum enclosing ball of $B(c_{k-1}, R_{k-1}^2) \cap B(x_k^{++}, r_k^2)$ is an enclosing ball of $B(c_{k-1}, R_{k-1}^2) \cap Q_k$.
The proof then follows from the proof of Theorem 3.5, and we omit it for brevity.

5 Numerical Experiments

In this section, we compare our GeoPG algorithm with Nesterov's accelerated proximal gradient (APG) method on two nonsmooth problems: linear regression and logistic regression, both with elastic net regularization. Because of the elastic net term, the strong convexity parameter $\alpha$ is known. However, we assume that $\beta$ is unknown and implement backtracking for both GeoPG and APG; i.e., we test GeoPG-B and APG-B (APG with backtracking). We do not aim to compare with other efficient algorithms for these two problems; our main purpose here is to illustrate the performance of the new first-order method GeoPG. Further improvement of this algorithm and comparison with other state-of-the-art methods will be a future research topic.

The initial points were set to zero. To obtain the optimal objective function value $F^*$, we ran APG-B and GeoPG-B for a sufficiently long time, and the smaller function value returned by the two algorithms was selected as $F^*$. APG-B was terminated when $(F(x_k) - F^*)/F^* \le \mathrm{tol}$, and GeoPG-B was terminated when $(F(x_k^+) - F^*)/F^* \le \mathrm{tol}$, where $\mathrm{tol} = 10^{-8}$ is the accuracy tolerance. The parameters used in backtracking were set to $\eta = 0.5$ and $\gamma = 0.9$. In GeoPG-B, we used Algorithm 2 to find $x_k$, because we found that Algorithm 2 performs slightly better than Algorithm 1 in practice. In the experiments, we ran Algorithm 2 until the absolute value of $\bar{\phi}$ was smaller than $10^{-8}$. The code was written in Matlab and run on a standard PC with a 3.20 GHz Intel i5 processor and 16 GB of memory.
In all reported figures, the x-axis denotes CPU time (in seconds) and the y-axis denotes $(F(x_k^+) - F^*)/F^*$.

5.1 Linear regression with elastic net regularization

In this subsection, we compare GeoPG-B and APG-B on linear regression with elastic net regularization, a popular problem in machine learning and statistics [20]:

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2p}\|Ax - b\|^2 + \frac{\alpha}{2}\|x\|^2 + \mu\|x\|_1, \qquad (5.1)$$

where $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, and $\alpha, \mu > 0$ are weighting parameters.

We conducted tests on two real datasets downloaded from the LIBSVM repository: a9a and RCV1. The results are reported in Figure 1. In particular, we tested $\alpha = 10^{-8}$ and $\mu = 10^{-3}, 10^{-4}, 10^{-5}$. Note that since $\alpha$ is very small, the problems are likely to be ill-conditioned. We see from Figure 1 that GeoPG-B is faster than APG-B on these real datasets, which indicates that GeoPG-B is preferable to APG-B.
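For reference, the composite splitting of (5.1) that both solvers exploit, with smooth part $f(x) = \frac{1}{2p}\|Ax - b\|^2 + \frac{\alpha}{2}\|x\|^2$ and nonsmooth part $h(x) = \mu\|x\|_1$, can be assembled as below. The synthetic data, the sizes, and the plain proximal-gradient baseline are illustrative assumptions (the experiments above use LIBSVM data and GeoPG-B/APG-B instead):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
alpha, mu = 1e-8, 1e-3

f = lambda x: 0.5 / p * np.linalg.norm(A @ x - b) ** 2 + 0.5 * alpha * np.dot(x, x)
grad_f = lambda x: A.T @ (A @ x - b) / p + alpha * x
prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * mu, 0.0)
F = lambda x: f(x) + mu * np.linalg.norm(x, 1)

# Smoothness constant of f and a valid fixed step size t = 1/beta.
beta = np.linalg.eigvalsh(A.T @ A / p).max() + alpha
t = 1.0 / beta

# Plain proximal gradient as a baseline; GeoPG and APG accelerate this.
x = np.zeros(n)
F0 = F(x)
for _ in range(200):
    x = prox_h(x - t * grad_f(x), t)
```

With $t = 1/\beta$, each proximal-gradient step decreases $F$ monotonically, which is the behavior the accelerated methods improve upon.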
In the supplementary material, we show more numerical results for varying $\alpha$, which further confirm that GeoPG-B is faster than APG-B when the problems are more ill-conditioned.

Figure 1: GeoPG-B and APG-B for solving (5.1) with $\alpha = 10^{-8}$. (a) Dataset a9a; (b) Dataset RCV1.

5.2 Logistic regression with elastic net regularization

In this subsection, we compare the performance of GeoPG-B and APG-B on the following logistic regression problem with elastic net regularization:

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{p}\sum_{i=1}^p \log\big(1 + \exp(-b_i \cdot a_i^\top x)\big) + \frac{\alpha}{2}\|x\|^2 + \mu\|x\|_1, \qquad (5.2)$$

where $a_i \in \mathbb{R}^n$ and $b_i \in \{\pm 1\}$ are the feature vector and class label of the $i$-th sample, respectively, and $\alpha, \mu > 0$ are weighting parameters.

We tested GeoPG-B and APG-B for solving (5.2) on three real datasets from LIBSVM: a9a, RCV1 and Gisette. The results are reported in Figure 2. In particular, we tested $\alpha = 10^{-8}$ and $\mu = 10^{-3}, 10^{-4}, 10^{-5}$.
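For completeness, the smooth part of (5.2) and its gradient can be written as below, with $\log(1 + e^m)$ and the sigmoid evaluated through a numerically stable primitive; the synthetic data and all values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 40, 10
A = rng.standard_normal((p, n))          # rows are the feature vectors a_i
b = rng.choice([-1.0, 1.0], size=p)      # labels b_i
alpha = 1e-8

def f(x):
    m = -b * (A @ x)                     # margins -b_i * a_i' x
    # log(1 + exp(m)) via logaddexp, stable for large |m|
    return np.mean(np.logaddexp(0.0, m)) + 0.5 * alpha * np.dot(x, x)

def grad_f(x):
    m = -b * (A @ x)
    sigma = np.exp(-np.logaddexp(0.0, -m))   # 1/(1 + exp(-m)), stable
    return -(A.T @ (b * sigma)) / p + alpha * x
```

A quick finite-difference check confirms the gradient formula matches the objective.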
Figure 2 shows that with the same $\mu$, GeoPG-B is much faster than APG-B. More numerical results are provided in the supplementary material, and they also indicate that GeoPG-B is much faster than APG-B, especially when the problems are more ill-conditioned.

Figure 2: GeoPG-B and APG-B for solving (5.2) with $\alpha = 10^{-8}$. Left: dataset a9a; Middle: dataset RCV1; Right: dataset Gisette.

5.3 Numerical results of L-GeoPG-B

In this subsection, we test GeoPG with limited memory (Algorithm 5) on problem (5.2) with the Gisette dataset. Since we still need the backtracking technique, we actually tested L-GeoPG-B. The results with different memory sizes $m$ are reported in Figure 3. Note that $m = 0$ corresponds to the original GeoPG-B without memory. The subproblem (4.4) was solved using the function "quadprog" in Matlab. Figure 3 shows that, roughly speaking, L-GeoPG-B performs better with larger memory sizes, and in most cases the performance of L-GeoPG-B with $m = 100$ is the best among the reported results. This indicates that the limited-memory idea indeed helps improve the performance of GeoPG.

Figure 3: L-GeoPG-B for solving (5.2) on the dataset Gisette with $\alpha = 10^{-8}$. Left: $\mu = 10^{-3}$; Middle: $\mu = 10^{-4}$; Right: $\mu = 10^{-5}$.

6 Conclusions

In this paper, we proposed the GeoPG algorithm for solving nonsmooth convex composite problems, an extension of the recent GeoD method, which can only handle smooth problems. We proved that GeoPG enjoys the same optimal rate as Nesterov's accelerated gradient method for strongly convex problems. The backtracking technique was adopted to deal with the case where the Lipschitz constant is unknown. Limited-memory GeoPG was also developed to improve the practical performance of GeoPG. Numerical results on linear regression and logistic regression with elastic net regularization demonstrated the efficiency of GeoPG.
It would be interesting to see how to extend\nGeoD and GeoPG to tackle non-strongly convex problems, and how to further accelerate the running\ntime of GeoPG. We leave these questions in future work.\nAcknowledgements. Shixiang Chen is supported by CUHK Research Postgraduate Student Grant\nfor Overseas Academic Activities. Shiqian Ma is supported by a startup funding in UC Davis.\n\n8\n\n010203040506070809010\u22121010\u2212810\u2212610\u2212410\u22122100102CPU(s)(Fk\u2212F*)/F* GeoPG\u2212B: \u00b5 =10\u22123APG\u2212B: \u00b5 =10\u22123GeoPG\u2212B: \u00b5 =10\u22124APG\u2212B: \u00b5 =10\u22124GeoPG\u2212B: \u00b5 =10\u22125APG\u2212B: \u00b5 =10\u2212501234567810\u22121010\u2212810\u2212610\u2212410\u22122100102CPU(s)(Fk\u2212F*)/F* GeoPG\u2212B: \u00b5 =10\u22123APG\u2212B: \u00b5 =10\u22123GeoPG\u2212B: \u00b5 =10\u22124APG\u2212B: \u00b5 =10\u22124GeoPG\u2212B: \u00b5 =10\u22125APG\u2212B: \u00b5 =10\u22125050010001500200025003000350010\u22121010\u2212810\u2212610\u2212410\u22122100102104CPU(s)(Fk\u2212F*)/F* GeoPG\u2212B: \u00b5 =10\u22123APG\u2212B: \u00b5 =10\u22123GeoPG\u2212B: \u00b5 =10\u22124APG\u2212B: \u00b5 =10\u22124GeoPG\u2212B: \u00b5 =10\u22125APG\u2212B: \u00b5 =10\u2212505010015020025010\u22121010\u2212810\u2212610\u2212410\u22122100102CPU(s)(Fk\u2212F*)/F* memorysize=0memorysize=5memorysize=20memorysize=10005010015020025030035010\u22121010\u2212810\u2212610\u2212410\u22122100102CPU(s)(Fk\u2212F*)/F* memorysize=0memorysize=5memorysize=20memorysize=10005010015020025030035040010\u22121010\u2212810\u2212610\u2212410\u22122100102104CPU(s)(Fk\u2212F*)/F* memorysize=0memorysize=5memorysize=20memorysize=100\fReferences\n[1] S. Bubeck, Y.-T. Lee, and M. Singh. A geometric alternative to Nesterov\u2019s accelerated gradient\n\ndescent. arXiv preprint arXiv:1506.08187, 2015.\n\n[2] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of\n\nconvergence O(1/k2). Dokl. Akad. Nauk SSSR, 269:543\u2013547, 1983.\n\n[3] Y. E. 
Nesterov.\n\nIntroductory lectures on convex optimization: A basic course. Applied\n\nOptimization. Kluwer Academic Publishers, Boston, MA, 2004. ISBN 1-4020-7553-7.\n\n[4] W. Su, S. Boyd, and E. J. Cand\u00e8s. A differential equation for modeling Nesterov\u2019s accelerated\n\ngradient method: Theory and insights. In NIPS, 2014.\n\n[5] H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont. Fast convergence of inertial dynamics\n\nand algorithms with asymptotic vanishing viscosity. Mathematical Programming, 2016.\n\n[6] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via\n\nintegral quadratic constraints. SIAM Journal on Optimization, 26(1):57\u201395, 2016.\n\n[7] A. Wibisono, A. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in\n\noptimization. Proceedings of the National Academy of Sciences, 133:E7351\u2013E7358, 2016.\n\n[8] R. G. Bland, D. Goldfarb, and M. J. Todd. The ellipsoid method: A survey. Operations\n\nResearch, 29:1039\u20131091, 1981.\n\n[9] S. Bubeck and Y.-T. Lee. Black-box optimization with a politician. ICML, 2016.\n\n[10] D. Drusvyatskiy, M. Fazel, and S. Roy. An optimal \ufb01rst order method based on optimal quadratic\n\naveraging. SIAM Journal on Optimization, 2016.\n\n[11] R. P. Brent. An algorithm with guaranteed convergence for \ufb01nding a zero of a function. In\nAlgorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice-Hall, 1973.\n\n[12] T. J. Dekker. Finding a zero by means of successive linear interpolation. In Constructive Aspects\n\nof the Fundamental Theorem of Algebra. London: Wiley-Interscience, 1969.\n\n[13] M. Gerdts, S. Horn, and S. Kimmerle. Line search globalization of a semismooth Newton\nmethod for operator equations in Hilbert spaces with applications in optimal control. Journal of\nIndustrial And Management Optimization, 13(1):47\u201362, 2017.\n\n[14] E. Hans and T. Raasch. 
Global convergence of damped semismooth Newton methods for L1\n\nTikhonov regularization. Inverse Problems, 31(2):025005, 2015.\n\n[15] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse\n\nproblems. SIAM J. Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[16] K. Scheinberg, D. Goldfarb, and X. Bai. Fast \ufb01rst-order methods for composite convex\noptimization with backtracking. Foundations of Computational Mathematics, 14(3):389\u2013417,\n2014.\n\n[17] Y. E. Nesterov. Gradient methods for minimizing composite functions. Mathematical Program-\n\nming, 140(1):125\u2013161, 2013.\n\n[18] A. Beck. On the convexity of a class of quadratic mappings and its application to the problem of\n\ufb01nding the smallest ball enclosing a given intersection of balls. Journal of Global Optimization,\n39(1):113\u2013126, 2007.\n\n[19] Y. C. Eldar, A. Beck, and M. Teboulle. A minimax Chebyshev estimator for bounded error\n\nestimation. IEEE Transactions on Signal Processing, 56(4):1388\u20131397, 2008.\n\n[20] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the\n\nRoyal Statistical Society, Series B, 67(2):301\u2013320, 2005.\n\n9\n\n\f", "award": [], "sourceid": 440, "authors": [{"given_name": "Shixiang", "family_name": "Chen", "institution": "The Chinese University of HongKong"}, {"given_name": "Shiqian", "family_name": "Ma", "institution": "UC Davis"}, {"given_name": "Wei", "family_name": "Liu", "institution": "Tencent AI Lab"}]}