{"title": "Constant Nullspace Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional Settings", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1016, "abstract": "State of the art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. A key facet of these statistical estimation problems is that these are typically not strongly convex under a high-dimensional sampling regime when the Hessian matrix becomes rank-deficient. Under vanilla convexity however, proximal optimization methods attain only a sublinear rate. In this paper, we investigate a novel variant of strong convexity, which we call Constant Nullspace Strong Convexity (CNSC), where we require that the objective function be strongly convex only over a constant subspace. As we show, the CNSC condition is naturally satisfied by high-dimensional statistical estimators. We then analyze the behavior of proximal methods under this CNSC condition: we show global linear convergence of Proximal Gradient and local quadratic convergence of Proximal Newton Method, when the regularization function comprising the statistical estimator is decomposable. We corroborate our theory via numerical experiments, and show a qualitative difference in the convergence rates of the proximal algorithms when the loss function does satisfy the CNSC condition.", "full_text": "Constant Nullspace Strong Convexity and\n\nFast Convergence of Proximal Methods under\n\nHigh-Dimensional Settings\n\nIan E.H. Yen\n\nCho-Jui Hsieh\n\nPradeep Ravikumar\n\nInderjit Dhillon\n\nDepartment of Computer Science\n\n{ianyen,cjhsieh,pradeepr,inderjit}@cs.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nState of the art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. 
A key facet of these\nstatistical estimation problems is that these are typically not strongly convex un-\nder a high-dimensional sampling regime when the Hessian matrix becomes rank-\nde\ufb01cient. Under vanilla convexity however, proximal optimization methods attain\nonly a sublinear rate. In this paper, we investigate a novel variant of strong con-\nvexity, which we call Constant Nullspace Strong Convexity (CNSC), where we re-\nquire that the objective function be strongly convex only over a constant subspace.\nAs we show, the CNSC condition is naturally satis\ufb01ed by high-dimensional sta-\ntistical estimators. We then analyze the behavior of proximal methods under this\nCNSC condition: we show global linear convergence of Proximal Gradient and lo-\ncal quadratic convergence of Proximal Newton Method, when the regularization\nfunction comprising the statistical estimator is decomposable. We corroborate our\ntheory via numerical experiments, and show a qualitative difference in the con-\nvergence rates of the proximal algorithms when the loss function does satisfy the\nCNSC condition.\n\n1 Introduction\n\nThere has been a growing interest in high-dimensional statistical problems, where the number of\nparameters d is comparable to or even larger than the sample size n, spurred in part by many modern\nscience and engineering applications. It is now well understood that in order to guarantee statistical\nconsistency it is key to impose low-dimensional structure, such as sparsity, or low-rank structure,\non the high-dimensional statistical model parameters. A strong line of research has thus developed\nclasses of regularized M-estimators that leverage such structural constraints, and come with strong\nstatistical guarantees even under high-dimensional settings [13]. 
These state of the art regularized\nM-estimators typically take the form of convex non-smooth programs.\nA facet of computational consequence with these high-dimensional sampling regimes is that these\nM-estimation problems, even when convex, are typically not strongly convex. For instance, for\nthe \u21131-regularized least squares estimator (LASSO), the Hessian is rank de\ufb01cient when n < d. In\nthe absence of additional assumptions however, optimization methods to solve general non-smooth\nnon-strongly convex programs can only achieve a sublinear convergence rate [19, 21]; faster rates\ntypically require strong convexity [1, 20]. In the past few years, an effort has thus been made to\nimpose additional assumptions that are stronger than mere convexity, and yet weaker than strong\nconvexity; and proving faster rates of convergence of optimization methods under these assump-\ntions. Typically these assumptions take the form of a restricted variant of strong convexity, which\nincidentally mirror those assumed for statistical guarantees as well, such as the Restricted Isometry\n\n1\n\n\fProperty or Restricted Eigenvalue property. A caveat with these results however is that these statisti-\ncally motivated assumptions need not hold in general, or require suf\ufb01ciently large number of samples\nto hold with high probability. Moreover, the standard optimization methods have to be modi\ufb01ed in\nsome manner to leverage these assumptions [5, 7, 17]. Another line of research exploits a local error\nbound to establish asymptotic linear rate of convergence for a special form of non-strongly convex\nfunctions [16, 8, 6]. However, these do not provide \ufb01nite-iteration convergence bounds, due to the\npotentially large number of iterations spent on early stage.\nIn this paper, we consider a novel simple condition, which we term Constant Nullspace Strong\nConvexity (CNSC). 
This assumption is motivated not from statistical considerations, but from the\nalgebraic form of standard M-estimators; indeed as we show, standard M-estimation problems even\nunder high-dimensional settings naturally satisfy the CNSC condition. Under this CNSC condition,\nwe then investigate the convergence rates of the class of proximal optimization methods; speci\ufb01cally\nthe Proximal Gradient method (Prox-GD) [14, 15, 18] and the Proximal Newton method (Prox-\nNewton) [1, 2, 9]. These proximal methods are very amenable to regularized M-estimation prob-\nlems: they do not treat the M-estimation problem as a black-box convex non-smooth problem, but\ninstead leverage the composite nature of the objective of the form F (x) = h(x)+f (x), where h(x)\nis a possibly non-smooth convex function while f (x) is a convex smooth function with Lipschitz-\ncontinuous gradient. We show that under our CNSC condition, Proximal Gradient achieves global\nlinear convergence when the non-smooth component is a decomposable norm. We also show that\nProximal Newton, under the CNSC condition, achieves local quadratic convergence as long as the\nnon-smooth component is Lipschitz-continuous. Note that in the absence of strong convexity, but\nunder no additional assumptions beyond convexity, the proximal methods can only achieve sublin-\near convergence as noted earlier. We have thus identi\ufb01ed an algebraic facet of the M-estimators\nthat explains the strong computational performance of standard proximal optimization methods in\npractical settings in solving high-dimensional statistical estimation problems.\nThe paper is organized as follows.\nIn Section 2, we de\ufb01ne the CNSC condition and introduce\nthe Proximal Gradient and Proximal Newton methods. Then we prove global linear convergence\nof Prox-GD and local quadratic convergence of Prox-Newton in Section 3 and 4 respectively. In\nSection 5, we corroborate our theory via experiments on real high-dimensional data set. 
We leave the proofs of all lemmas to the appendix.\n\n2 Preliminaries\n\nWe are interested in composite optimization problems of the form\n\nmin_{x∈R^d} F (x) = h(x) + f (x), (1)\n\nwhere h(x) is a possibly non-smooth convex function and f (x) is a twice differentiable convex function with its Hessian matrix H(x) = ∇²f (x) satisfying\n\nmI ⪯ H(x) ⪯ MI, ∀x ∈ R^d, (2)\n\nwhere for strongly convex f (x) we have m > 0; otherwise, for convex but not strongly convex f (x), we have m = 0.\n\n2.1 Constant Nullspace Strong Convexity (CNSC)\n\nBefore defining our strong convexity variant of Constant Nullspace Strong Convexity (CNSC), we first provide some intuition by considering the following large class of statistical estimation problems in high-dimensional machine learning, where f (x) takes the form\n\nf (x) = Σ_{i=1}^n L(a_iᵀx, y_i), (3)\n\nwhere L(u, y) is a non-negative loss function that is convex in its first argument, a_i is the observed feature vector and y_i is the observed response of the i-th sample. The Hessian matrix of (3) takes the form\n\nH(x) = AᵀD(Ax)A, (4)\n\n2\n\n\fwhere A is an n by d design (data) matrix with A_{i,:} = a_iᵀ and D(Ax) is a diagonal matrix with D_{ii}(x) = L″(a_iᵀx, y_i), where the double-derivative in L″(u, y) is with respect to the first argument. It is easy to see that in high-dimensional problems with d > n, (4) is not positive definite, so that strong convexity does not hold. However, for a strictly convex loss function L(·, y), we have L″(u, y) > 0 and\n\nvᵀH(x)v = 0 iff Av = 0. (5)\n\nAs a consequence, vᵀH(x)v > 0 as long as v does not lie in the nullspace of A; that is, the Hessian H(x) might satisfy the strong convexity bound in the above restricted sense. We generalize this concept as follows. 
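The rank-deficiency observation above is easy to check numerically. The sketch below is not from the paper: it assumes synthetic Gaussian data and the least-squares loss (so D = I and H = AᵀA), and verifies that H has rank at most n when d > n, that the quadratic form vanishes on Null(A), and that restricted strong convexity holds on Row(A) with constant m equal to the minimum positive eigenvalue of AᵀA:

```python
import numpy as np

# Numerical check (a sketch, not from the paper): for the least-squares
# loss, D = I and the Hessian is H = A^T A.  With d > n the Hessian is
# rank-deficient, the quadratic form vanishes on Null(A), and strong
# convexity holds on Row(A) with constant m = lambda_min^+(A^T A).
rng = np.random.default_rng(0)
n, d = 5, 20
A = rng.standard_normal((n, d))          # synthetic n x d design matrix
H = A.T @ A                              # Hessian of f(x) = 0.5*||Ax - b||^2

eigvals = np.linalg.eigvalsh(H)
rank = int(np.sum(eigvals > 1e-10))      # at most n, so H is singular

_, _, Vt = np.linalg.svd(A)
v_null = Vt[n:].T @ rng.standard_normal(d - n)   # a vector in Null(A)
v_row = Vt[:n].T @ rng.standard_normal(n)        # a vector in Row(A)
m = eigvals[eigvals > 1e-10].min()               # smallest positive eigenvalue

print(rank)                                             # rank <= n < d
print(abs(v_null @ H @ v_null) < 1e-8)                  # no curvature along Null(A)
print(v_row @ H @ v_row >= m * (v_row @ v_row) - 1e-8)  # restricted strong convexity
```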
We first define the following notation: given a subspace T , we let Π_T(·) denote the orthogonal projection onto T , and let T⊥ denote the orthogonal subspace to T .\nAssumption 1 (Constant Nullspace Strong Convexity). A twice-differentiable f (x) satisfies Constant Nullspace Strong Convexity with respect to T (CNSC-T ) iff there is a constant vector space T s.t. f (x) depends only on z = Π_T(x) and its Hessian matrix satisfies\n\nvᵀH(z)v ≥ m‖v‖², ∀v ∈ T , (6)\n\nfor some m > 0 and ∀z ∈ T , and\n\nH(z)v = 0, ∀v ∈ T⊥. (7)\n\nFrom the motivating section above, the condition can be seen to hold for a wide range of loss functions, such as those arising from linear regression models, as well as generalized linear models (e.g. logistic regression, Poisson regression, multinomial regression, etc.)¹. For L″(u, y) ≥ m_L > 0, we have m = m_L λ_min(AᵀA) > 0 as the constant in (6), where λ_min(AᵀA) is the minimum positive eigenvalue of AᵀA.\nThen by the assumption, any point x can be decomposed as x = z + y, where z = Π_T(x), y = Π_{T⊥}(x), so that the difference between the gradients at two points can be written as\n\ng(x_1) − g(x_2) = ∫_0^1 H(sΔx + x_2)Δx ds = ∫_0^1 H(sΔz + z_2)Δz ds = H̃(z_1, z_2)Δz, (8)\n\nwhere Δx = x_1 − x_2, Δz = z_1 − z_2, and H̃(z_1, z_2) = ∫_0^1 H(sΔz + z_2) ds is the average Hessian matrix along the path from z_2 to z_1. It is easy to verify that H̃(z_1, z_2) satisfies inequalities (2), (6) and equality (7) for all z_1, z_2 ∈ T by applying the inequalities (equality) to each individual Hessian matrix being integrated. We then have the following theorem, which shows the uniqueness of z̄ at the optimum.\nTheorem 1 (Optimality Condition). For f (x) satisfying CNSC-T ,\n\n1. 
x̄ is an optimal solution of (1) iff −g(x̄) = ρ̄ for some ρ̄ ∈ ∂h(x̄).\n2. The optimal ρ̄ and z̄ = Π_T(x̄) are unique.\n\nProof. The first statement is true since x̄ is an optimal solution iff 0 ∈ ∂h(x̄) + ∇f (x̄). To prove the second statement, suppose x̄_1 = z̄_1 + ȳ_1 and x̄_2 = z̄_2 + ȳ_2 are both optimal. Let Δx = x̄_1 − x̄_2 and Δz = z̄_1 − z̄_2. Since h(x) is convex, −g(x̄_1) ∈ ∂h(x̄_1) and −g(x̄_2) ∈ ∂h(x̄_2) should satisfy\n\n⟨−g(x̄_1) + g(x̄_2), Δx⟩ ≥ 0.\n\nHowever, since f (x) satisfies CNSC-T , by (8),\n\n⟨−g(x̄_1) + g(x̄_2), Δx⟩ = ⟨−H̃(z̄_1, z̄_2)Δz, Δx⟩ = −ΔzᵀH̃(z̄_1, z̄_2)Δz ≤ −m‖Δz‖_2²\n\nfor some m > 0. The two inequalities can simultaneously hold only if Δz = 0. Therefore, z̄ is unique at the optimum, and thus g(x̄) = g(0) + H̃(z̄, 0)z̄ and ρ̄ = −g(x̄) are also unique.\n\nIn the next two sections, we review the Proximal Gradient Method (Prox-GD) and Proximal Newton Method (Prox-Newton), and introduce some tools that will be used in our analysis.\n\n1 Note that for many generalized linear models, the second derivative L″(u, y) of the loss function approaches 0 as |u| → ∞. 
However, this cannot happen as long as there is a penalty term h(x) which goes to infinity as x diverges, and which then serves as a finite constraint bound on x.\n\n3\n\n\f2.2 Proximal Gradient Method\n\nThe Prox-GD algorithm comprises a gradient descent step\n\nx_{t+1/2} = x_t − (1/M) g(x_t)\n\nfollowed by a proximal step\n\nx_{t+1} = prox_M^h(x_{t+1/2}) = argmin_x h(x) + (M/2)‖x − x_{t+1/2}‖_2², (9)\n\nwhere ‖·‖_2 means the Frobenius norm if x is a matrix. For simplicity, we will denote prox_M^h(·) as prox(·) in the following discussion when it is clear from the context. In the Prox-GD algorithm, it is assumed that (9) can be computed efficiently, which is true for most decomposable regularizers. Here we introduce some properties of the proximal operator that facilitate our analysis.\nLemma 1. Define Δ_P x = x − prox(x). The following properties hold for the proximal operation (9).\n\n1. MΔ_P x ∈ ∂h(prox(x)).\n2. ‖prox(x_1) − prox(x_2)‖_2² ≤ ‖x_1 − x_2‖_2² − ‖Δ_P x_1 − Δ_P x_2‖_2².\n\n2.3 Proximal Newton Method\n\nIn this section, we introduce the Proximal Newton method, which has been shown to be considerably more efficient than first-order methods in many applications [1], including Sparse Inverse Covariance Estimation [2] and ℓ1-regularized Logistic Regression [9, 10]. Each step of Prox-Newton solves a local quadratic approximation\n\nx_t⁺ = argmin_x h(x) + (1/2)(x − x_t)ᵀH_t(x − x_t) + g_tᵀ(x − x_t) (10)\n\nto find a search direction x_t⁺ − x_t, and then conducts a line search procedure to find a step size t such that\n\nf (x_{t+1}) = f (x_t + t(x_t⁺ − x_t))\n\nmeets a sufficient decrease condition. Note that unlike the Prox-GD update (9), in most cases (10) requires an iterative procedure to solve. 
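The Prox-GD recursion (9) is simple to instantiate. The following is a minimal sketch, not the authors' code: for h(x) = λ‖x‖₁ the proximal step reduces to coordinate-wise soft-thresholding, and the names `A`, `b`, `lam` and the synthetic data are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the Prox-GD iteration (9) for h(x) = lam*||x||_1
# (illustrative, not the authors' code).  The proximal step is
# coordinate-wise soft-thresholding, the prox operator of the l1 norm.
def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_l1(A, b, lam, iters=500):
    M = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)            # gradient of f(x) = 0.5*||Ax - b||^2
        x = soft_threshold(x - g / M, lam / M)   # gradient step, then prox step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))       # n < d: f is not strongly convex
x_true = np.zeros(200)
x_true[:5] = 1.0
b = A @ x_true
x_hat = prox_grad_l1(A, b, lam=0.01)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))    # a sparse iterate
```

The step size 1/M with M the largest eigenvalue of AᵀA matches the bound H(x) ⪯ MI in (2) for the least-squares loss; with a strictly convex loss this problem satisfies CNSC even though n < d.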
For example, if h(x) is the ℓ1-norm, then a coordinate descent algorithm is usually employed to solve (10) as a LASSO subproblem [1, 2, 9, 10].\nThe convergence of Newton-type methods comprises two phases [1, 3]. In the first phase, it is possible that a step size t < 1 is chosen, while in the second phase, which occurs when x_t is close enough to the optimum, step size t = 1 is always chosen and each step leads to quadratic convergence. In this paper, we focus on the quadratic convergence phase, and refer readers to [21] for a global analysis of Prox-Newton without a strong convexity assumption. In the quadratic convergence phase, we have x_{t+1} = x_t⁺ and the update can be written as\n\nx_{t+1} = prox_{H_t}(x_t + Δx_t^{nt}), H_t Δx_t^{nt} = −g_t, (11)\n\nwhere Δx_t^{nt} is the Newton step when h(x) is absent, and the proximal operator prox_H(·) is defined for any PSD matrix H as\n\nprox_H(x) = argmin_v h(v) + (1/2)‖v − x‖_H². (12)\n\nNote that while we use ‖x‖_H² to denote xᵀHx, we only require H to be PSD instead of PD. Therefore, ‖x‖_H is not a true norm, and (12) might have multiple solutions, where prox_H(x) refers to any one of them. In the following, we show that prox_H(·) has similar properties to those of prox(·) in the previous section.\nLemma 2. Define Δ_P x = x − prox_H(x). The following properties hold for the proximal operator:\n\n1. HΔ_P x ∈ ∂h(prox_H(x)).\n2. ‖prox_H(x_1) − prox_H(x_2)‖_H² ≤ ‖x_1 − x_2‖_H².\n\n4\n\n\f3 Linear Convergence of Proximal Gradient Method\nIn this section, we analyze convergence of the Proximal Gradient Method for h(x) = λ‖x‖, where ‖·‖ is a decomposable norm defined as follows.\nDefinition 1 (Decomposable Norm). 
‖·‖ is a decomposable norm if there are orthogonal subspaces {M_i}_{i=1}^J with R^d = M_1 ⊕ ··· ⊕ M_J such that, for any point x ∈ R^d that can be written as x = Σ_{j∈E} c_j a_j, where c_j > 0 and a_j ∈ M_j, ‖a_j‖* = 1, we have\n\n‖x‖ = Σ_{j∈E} c_j, and ∂‖x‖ = {ρ | Π_{M_j}(ρ) = a_j, ∀j ∈ E; ‖Π_{M_j}(ρ)‖* ≤ 1, ∀j ∉ E}, (13)\n\nwhere ‖·‖* is the dual norm of ‖·‖.\nThe above definition includes several well-known examples such as the ℓ1-norm ‖x‖_1 and the group-ℓ1 norm ‖X‖_{1,2}. For the ℓ1-norm, M_j corresponds to vectors with only the j-th coordinate not equal to 0, and E is the set of non-zero coordinates of x. For the group-ℓ1 norm, M_j corresponds to vectors with only the j-th group not equal to 0, and E is the set of non-zero groups of X. Under this definition, we can profile the set of optimal solutions as follows.\nLemma 3 (Optimal Set). Let Ē be the active set at the optimum and Ē⁺ = {j | ‖Π_{M_j}(ρ̄)‖* = λ} be its augmented set (which is unique since ρ̄ is unique) such that Π_{M_j}(ρ̄) = λ ā_j, j ∈ Ē⁺. The optimal solutions of (1) form a polyhedral set\n\nX̄ = {x | Π_T(x) = z̄ and x ∈ Ō}, (14)\n\nwhere Ō = {x | x = Σ_{j∈Ē⁺} c_j ā_j, c_j ≥ 0, j ∈ Ē⁺} is the set of x with ρ̄ ∈ ∂h(x).\nGiven that the optimal set is a polyhedron, we can then employ the following lemma to bound the distance of an iterate x_t to the optimal set X̄.\nLemma 4 (Hoffman's bound). Consider a polyhedral set S = {x | Ax ≤ b, Ex = c}. For any point x ∈ R^d, there is an x̄ ∈ S such that\n\n‖x − x̄‖_2 ≤ θ(S) ‖([Ax − b]_+ ; Ex − c)‖_2, (15)\n\nwhere θ(S) is a positive constant that depends only on A and E.\nThe above bound first appeared in [11], and was employed in [4] to prove linear convergence of the Feasible Descent method for a class of convex smooth functions. A proof of the ℓ2-norm version (15) can be found in [4, Lemma 4.3]. By applying (15) to the set X̄, the distance of a point x to X̄ can be bounded by its infeasible amounts with respect to the two constraints Π_T(x) = z̄ and x ∈ Ō, where the latter can be bounded according to the following lemma when c_j = ⟨x, ā_j⟩ ≥ 0, ∀j ∈ Ē⁺.\nLemma 5. Let Ā = span(ā_1, ā_2, . . . , ā_{|Ē⁺|}). Suppose ‖x‖ ≤ R and Π_{M_j}(x) = 0 for j ∉ Ē⁺. Then\n\nλ²‖x − Π_Ā(x)‖_2² ≤ R²‖ρ − ρ̄‖_2²,\n\nwhere ρ ∈ ∂h(x) and ρ̄ is as defined in Theorem 1.\nNow we are ready to prove the main theorem of this section.\nTheorem 2 (Linear Convergence of Prox-GD). Let X̄ be the set of optimal solutions of problem (1), and let x̄ = Π_X̄(x) be the solution closest to x. Denote d_λ = min_{j∉Ē⁺} (λ − ‖Π_{M_j}(ρ̄)‖*) > 0. For the sequence {x_t}_{t=0}^∞ produced by the Proximal Gradient Method, we have:\n\n(a) If x_{t+1} satisfies the condition that\n\n∃j ∉ Ē⁺ : Π_{M_j}(x_{t+1}) ≠ 0 or ∃j ∈ Ē⁺ : ⟨x_{t+1}, ā_j⟩ < 0, (16)\n\nwe then have:\n\n‖x_{t+1} − x̄_{t+1}‖_2² ≤ (1 − α)‖x_t − x̄_t‖_2², α = d_λ² / (M²‖x_0 − x̄_0‖_2²). (17)\n\n5\n\n\f(b) If x_{t+1} does not satisfy the condition in (16) but x_t does, then\n\n‖x_{t+1} − x̄_{t+1}‖_2² ≤ (1 − α)‖x_{t−1} − x̄_{t−1}‖_2², α = d_λ² / (M²‖x_0 − x̄_0‖_2²). (18)\n\n(c) If neither x_{t+1} nor x_t satisfies the condition in (16), then\n\n‖x_{t+2} − x̄_{t+2}‖_2² ≤ (1/(1 + β))‖x_t − x̄_t‖_2², β = m / (M θ(X̄)²), (19)\n\nwhere we recall that θ(X̄) is the constant determined by the polyhedron X̄ from Hoffman's Bound (15).\n\nProof. Since x̄_t is an optimal solution, we have x̄_t = prox(x̄_t − g(x̄_t)/M). Let Δx_t = x_t − x̄_t, ρ_t = M(x_{t+1/2} − x_{t+1}) ∈ ∂h(x_{t+1}) and H̃ = H̃(z_t, z̄_t). By Lemma 1, each iterate of Prox-GD has\n\n‖x_t − x̄_t‖_2² − ‖x_{t+1} − x̄_{t+1}‖_2² ≥ ‖x_t − x̄_t‖_2² − ‖x_{t+1} − x̄_t‖_2²\n= ‖Δx_t‖_2² − ‖prox(x_t − g(x_t)/M) − prox(x̄_t − g(x̄_t)/M)‖_2²\n≥ ‖Δx_t‖_2² − ‖(x_t − g(x_t)/M) − (x̄_t − g(x̄_t)/M)‖_2² + ‖ρ_t − ρ̄‖_2²/M². (20)\n\nSince g(x_t) − g(x̄_t) = H̃Δx_t from (8), we have\n\n‖x_t − x̄_t‖_2² − ‖x_{t+1} − x̄_{t+1}‖_2² ≥ ‖Δx_t‖_2² − ‖Δx_t − H̃Δx_t/M‖_2² + ‖ρ_t − ρ̄‖_2²/M²\n≥ Δx_tᵀ(H̃/M)Δx_t + ‖ρ_t − ρ̄‖_2²/M²\n≥ m‖Δz_t‖_2²/M + ‖ρ_t − ρ̄‖_2²/M². (21)\n\nThe second inequality holds since 2H̃/M − H̃²/M² = (H̃/M)(2I − H̃/M) ⪰ H̃/M. The inequality tells us that ‖x_t − x̄_t‖_2 − ‖x_{t+1} − x̄_{t+1}‖_2 ≥ 0; that is, the distance to the optimal set ‖x_t − x̄_t‖ is monotonically non-increasing. To get a tighter bound, we consider two cases.\nCase 1: Π_{M_j}(x_t) ≠ 0 for some j ∉ Ē⁺, or ⟨x_t, ā_j⟩ < 0 for some j ∈ Ē⁺.\nIn this case, suppose there is j ∉ Ē⁺ with Π_{M_j}(x_t) ≠ 0; then²\n\n‖ρ_t − ρ̄‖_2² ≥ ‖Π_{M_j}(ρ_t) − Π_{M_j}(ρ̄)‖*² ≥ (‖Π_{M_j}(ρ_t)‖* − ‖Π_{M_j}(ρ̄)‖*)² ≥ d_λ². (22)\n\nOn the other hand, if ⟨x_t, ā_j⟩ < 0 for some j ∈ Ē⁺, then we have ⟨a_j, ā_j⟩ < 0 for Π_{M_j}(ρ_t) = λ a_j. Therefore\n\n‖ρ_t − ρ̄‖_2² ≥ ‖Π_{M_j}(ρ_t) − Π_{M_j}(ρ̄)‖_2² ≥ λ²‖a_j − ā_j‖_2² = λ²(2 − 2⟨a_j, ā_j⟩) > 2λ².\n\nIn either case we have\n\n‖x_t − x̄_t‖_2² − ‖x_{t+1} − x̄_{t+1}‖_2² ≥ ‖ρ_t − ρ̄‖_2²/M² ≥ (d_λ² / (M²‖x_0 − x̄_0‖_2²)) ‖x_t − x̄_t‖_2², (23)\n\nwhere the last inequality uses the monotonicity of ‖x_t − x̄_t‖_2.\nCase 2: Both x_t and x_{t+1} do not fall in Case 1.\nGiven ⟨x_t, ā_j⟩ ≥ 0, ∀j ∈ Ē⁺ and Π_{M_j}(x_t) = 0, ∀j ∉ Ē⁺, a point x belongs to the set Ō defined in Lemma 3 iff ‖x − Π_Ā(x)‖_2² = 0. The condition can also be scaled as (λ² / (mMR²))‖x − Π_Ā(x)‖_2² = 0, where R is a bound on ‖x_t‖ that holds for all t, which must exist as long as the regularization parameter λ > 0 in h(x) = λ‖x‖.\nBy Lemma 4, the distance of the point x_t to the polyhedral set X̄ is bounded by its infeasible amount\n\n‖x_t − x̄_t‖_2² ≤ θ(X̄)² ( ‖z_t − z̄‖_2² + (λ² / (mMR²)) ‖x_t − Π_Ā(x_t)‖_2² ), (24)\n\nwhere z_t = Π_T(x_t).\n\n²From our definition of decomposable norm, if a vector v belongs to a single subspace M_j, then ‖v‖ = ‖v‖* = ‖v‖_2. The reason is: by the definition, if v ∈ M_j, then v = c_j a_j for some c_j > 0, a_j ∈ M_j, ‖a_j‖* = 1, and it has decomposable norm ‖v‖ = c_j. However, we also have ‖v‖* = ‖c_j a_j‖* = c_j‖a_j‖* = c_j = ‖v‖. The norm equals its dual norm only if it is the ℓ2-norm.\n\n6\n\n\fApplying (24) to (21) for iteration t + 1, we have\n\n‖x_{t+1} − x̄_{t+1}‖_2² − ‖x_{t+2} − x̄_{t+2}‖_2² ≥ (m / (M θ(X̄)²)) ‖Δx_{t+1}‖_2² − (λ² / (M²R²)) ‖x_{t+1} − Π_Ā(x_{t+1})‖_2² + ‖ρ_{t+1} − ρ̄‖_2²/M².\n\nFor iteration t, we have\n\n‖x_t − x̄_t‖_2² − ‖x_{t+1} − x̄_{t+1}‖_2² ≥ (m/M)‖Δz_t‖_2² + ‖ρ_t − ρ̄‖_2²/M².\n\nBy Lemma 5, adding the two inequalities gives\n\n‖x_t − x̄_t‖_2² − ‖x_{t+2} − x̄_{t+2}‖_2² ≥ (m / (M θ(X̄)²)) ‖Δx_{t+1}‖_2² + (m/M)‖Δz_t‖_2² ≥ (m / (M θ(X̄)²)) ‖Δx_{t+1}‖_2² ≥ (m / (M θ(X̄)²)) ‖Δx_{t+2}‖_2²,\n\nwhich yields the desired result (19) after rearrangement.\nWe note that the descent in the first two cases is actually even stronger than stated above: from the proofs, the distance can be seen to decrease by a fixed constant. This is faster than superlinear convergence, since the final solution could then be obtained in a finite number of steps.\n\n4 Quadratic Convergence of Proximal Newton Method\n\nThe key idea of the proof is to re-formulate the Prox-Newton update (10) as\n\nz_{t+1} = argmin_{z∈T} h(z + ŷ(z)) + g_tᵀ(z − z_t) + (1/2)‖z − z_t‖_{H_t}², (25)\n\nwhere\n\nŷ(z) = argmin_{y∈T⊥} h(z + y), (26)\n\nso that we can focus our convergence analysis on z = Π_T(x) as follows.\nLemma 6 (Optimality Condition). 
For any matrix H satisfying CNSC-T , the update\n\nΔx = argmin_d h(x + d) + g(x)ᵀd + (1/2)‖d‖_H² (27)\n\nhas\n\nF (x + tΔx) − F (x) ≤ −t‖Δz‖_H² + O(t²), (28)\n\nwhere Δz = Π_T(Δx). Furthermore, if x is an optimal solution, Δx = 0 satisfies (27).\n\nThe following lemma then states that, for Prox-Newton, the function suboptimality is bounded by the distance in the T space alone.\nLemma 7. Suppose h(x) and f (x) are Lipschitz-continuous with Lipschitz constants L_h and L_f. In the quadratic convergence phase (defined in Theorem 3), the Proximal Newton Method has\n\nF (x_t) − F (x̄) ≤ L‖z_t − z̄‖, (29)\n\nwhere L = max{L_h, L_f} and z_t = Π_T(x_t), z̄ = Π_T(x̄).\nBy the above lemma, we have F (x_t) − F (x̄) ≤ Lε as long as ‖z_t − z̄‖ ≤ ε. Therefore, it suffices to show quadratic convergence of ‖z_t − z̄‖ to guarantee that F (x_t) − F (x̄) doubles its precision after each iteration.\nTheorem 3 (Quadratic Convergence of Prox-Newton). For f (x) satisfying CNSC-T with Lipschitz-continuous second derivative ∇²f (x), the Proximal Newton update (10) has\n\n‖z_{t+1} − z̄‖ ≤ (L_H/(2m))‖z_t − z̄‖²,\n\nwhere z̄ = Π_T(x̄), z_t = Π_T(x_t), and L_H is the Lipschitz constant for ∇²f (x).\n\n7\n\n\fProof. Let x̄ be an optimal solution of (1). 
By Lemma 6, for any PSD matrix H the update Δx̄ = 0 satisfies (27), which means\n\nx̄ = prox_{H_t}(x̄ + Δx̄^{nt}), H_t Δx̄^{nt} = −g(x̄). (30)\n\nThen by the non-expansiveness of the proximal operation (Lemma 2), we have\n\n‖x_{t+1} − x̄‖_{H_t} = ‖prox_{H_t}(x_t + Δx_t^{nt}) − prox_{H_t}(x̄ + Δx̄^{nt})‖_{H_t} ≤ ‖(x_t + Δx_t^{nt}) − (x̄ + Δx̄^{nt})‖_{H_t} = ‖(x_t − x̄) + (Δx_t^{nt} − Δx̄^{nt})‖_{H_t} = ‖(z_t − z̄) + (Δz_t^{nt} − Δz̄^{nt})‖_{H_t}. (31)\n\nSince for z ∈ T , ‖H_t z‖_2 ≥ √m ‖z‖_{H_t}, (31) leads to\n\n‖x_{t+1} − x̄‖_{H_t} ≤ (1/√m) ‖H_t(z_t − z̄) + H_t(Δz_t^{nt} − Δz̄^{nt})‖_2 = (1/√m) ‖H_t(z_t − z̄) − (g_t − ḡ)‖_2 ≤ (L_H/(2√m)) ‖z_t − z̄‖_2², (32)\n\nwhere the last inequality follows from the Lipschitz-continuity of ∇²f (x). 
Since z_{t+1}, z̄ ∈ T , we have\n\n‖x_{t+1} − x̄‖_{H_t} = ‖z_{t+1} − z̄‖_{H_t} ≥ √m ‖z_{t+1} − z̄‖_2. (33)\n\nFinally, combining (33) with (32),\n\n‖z_{t+1} − z̄‖_2 ≤ (L_H/(2m)) ‖z_t − z̄‖_2²,\n\nwhere the quadratic convergence phase occurs when ‖z_t − z̄‖ < 2m/L_H.\n\n5 Numerical Experiments\n\nIn this section, we study the convergence behavior of the Proximal Gradient method and the Proximal Newton method on a high-dimensional real data set with and without the CNSC condition. In particular, two loss functions \u2014 the logistic loss L(u, y) = log(1 + exp(−yu)) and the ℓ2-hinge loss L(u, y) = max(1 − yu, 0)² \u2014 are used in (3) with ℓ1-regularization h(x) = λ‖x‖_1, where both losses are smooth but only the logistic loss has the strict convexity that implies the CNSC condition. For the Proximal Newton method we employ a randomized coordinate descent algorithm to solve the subproblem (10) as in [9]. Figure 1 shows the convergence of the objective value relative to the optimum on rcv1.1k, a subset of a document classification data set with dimension d = 10,192 and number of samples n = 1000. From the figure one can clearly observe the linear convergence of Prox-GD and the quadratic convergence of Prox-Newton on the problem satisfying CNSC, contrasted with the qualitatively different behavior on the problem without CNSC.\n\nFigure 1: objective value (relative to optimum) of the Proximal Gradient method (left) and the Proximal Newton method (right) with logistic loss and ℓ2-hinge loss.\n\nAcknowledgement\nThis research was supported by NSF grants CCF-1320746 and CCF-1117055. C.-J.H. acknowledges support from an IBM PhD fellowship. P.R. 
acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033.\n\n8\n\n\fReferences\n[1] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. In NIPS, 2012.\n\n[2] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance estimation using quadratic approximation. In NIPS, 2011.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, U.K., 2003.\n\n[4] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Technical report, Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2013.\n\n[5] A. Agarwal, S. Negahban, and M. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In NIPS, 2010.\n\n[6] K. Hou, Z. Zhou, A. M.-S. So, and Z.-Q. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In NIPS, 2013.\n\n[7] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the l1-regularized least-squares problem. In ICML, 2012.\n\n[8] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. B, 117, 2009.\n\n[9] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999-2030, 2012.\n\n[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.\n\n[11] Alan J. Hoffman. On approximate solutions of systems of linear inequalities. 
Journal of Research of the National Bureau of Standards, 1952.\n\n[12] A. Tewari, P. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.\n\n[13] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, 2009.\n\n[14] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.\n\n[15] S. Becker, J. Bobin, and E. J. Candes. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 2011.\n\n[16] Z. Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46-47:157-178, 1993.\n\n[17] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.\n\n[18] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In ICML, 2009.\n\n[19] Y. E. Nesterov. Gradient methods for minimizing composite objective function. CORE report, 2007.\n\n[20] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, New York, 2004.\n\n[21] K. Scheinberg and X. Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. COR@L Technical Report at Lehigh University. arXiv:1311.6547, 2013.\n\n9\n\n\f", "award": [], "sourceid": 612, "authors": [{"given_name": "Ian En-Hsu", "family_name": "Yen", "institution": "UT-Austin"}, {"given_name": "Cho-Jui", "family_name": "Hsieh", "institution": "UT Austin"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "UT Austin"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas"}]}