{"title": "On Quadratic Convergence of DC Proximal Newton Algorithm in Nonconvex Sparse Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2742, "page_last": 2752, "abstract": "We propose a DC proximal Newton algorithm for solving nonconvex regularized sparse learning problems in high dimensions. Our proposed algorithm integrates the proximal newton algorithm with multi-stage convex relaxation based on the difference of convex (DC) programming, and enjoys both strong computational and statistical guarantees. Specifically, by leveraging a sophisticated characterization of sparse modeling structures (i.e., local restricted strong convexity and Hessian smoothness), we prove that within each stage of convex relaxation, our proposed algorithm achieves (local) quadratic convergence, and eventually obtains a sparse approximate local optimum with optimal statistical properties after only a few convex relaxations. Numerical experiments are provided to support our theory.", "full_text": "On Quadratic Convergence of DC Proximal Newton\n\nAlgorithm in Nonconvex Sparse Learning\n\nXingguo Li1,4 Lin F. Yang2\u21e4 Jason Ge2\n\nJarvis Haupt1 Tong Zhang3 Tuo Zhao4\u2020\n\n1University of Minnesota\n\n2Princeton University 3Tencent AI Lab 4Georgia Tech\n\nAbstract\n\nWe propose a DC proximal Newton algorithm for solving nonconvex regularized\nsparse learning problems in high dimensions. Our proposed algorithm integrates\nthe proximal newton algorithm with multi-stage convex relaxation based on the\ndifference of convex (DC) programming, and enjoys both strong computational and\nstatistical guarantees. 
Specifically, by leveraging a sophisticated characterization of sparse modeling structures (i.e., local restricted strong convexity and Hessian smoothness), we prove that within each stage of convex relaxation, our proposed algorithm achieves (local) quadratic convergence, and eventually obtains a sparse approximate local optimum with optimal statistical properties after only a few convex relaxations. Numerical experiments are provided to support our theory.

1 Introduction

We consider a high dimensional regression or classification problem: Given n independent observations {x_i, y_i}_{i=1}^n ⊂ R^d × R sampled from a joint distribution D(X, Y), we are interested in learning the conditional distribution P(Y | X) from the data. A popular modeling approach is the Generalized Linear Model (GLM) [20], which assumes

P(Y | X; θ*) ∝ exp( (Y X^⊤θ* − ψ(X^⊤θ*)) / c(σ) ),

where c(σ) is a scaling parameter and ψ is the cumulant function. A natural approach to estimating θ* is Maximum Likelihood Estimation (MLE) [25], which essentially minimizes the negative log-likelihood of the data given the parameters. However, MLE often performs poorly in parameter estimation in high dimensions due to the curse of dimensionality [6].

To address this issue, machine learning researchers and statisticians follow Occam's razor principle and propose sparse modeling approaches [3, 26, 30, 32]. These approaches assume that θ* is a sparse vector with only s* nonzero entries, where s* < n ≪ d. This implies that many variables in X are essentially irrelevant to modeling, which is natural in many real world applications such as genomics and medical imaging [7, 21]. Many empirical results have corroborated the success of sparse modeling in high dimensions.
Specifically, many sparse modeling approaches obtain a sparse estimator of θ* by solving the following regularized optimization problem,

θ̂ = argmin_{θ ∈ R^d} L(θ) + R_{λ_tgt}(θ),    (1)

where L : R^d → R is the convex negative log-likelihood (or pseudo-likelihood) function, R_{λ_tgt} : R^d → R is a sparsity-inducing decomposable regularizer, i.e., R_{λ_tgt}(θ) = Σ_{j=1}^d r_{λ_tgt}(θ_j) with r_{λ_tgt} : R → R, and λ_tgt > 0 is the regularization parameter. Many existing sparse modeling approaches can be cast as special examples of (1), such as sparse linear regression [30], sparse logistic regression [32], and sparse Poisson regression [26].

* The work was done while the author was at Johns Hopkins University.
† The authors acknowledge support from DARPA YFA N66001-14-1-4047, NSF Grant IIS-1447639, and a Doctoral Dissertation Fellowship from the University of Minnesota. Correspondence to: Xingguo Li and Tuo Zhao.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Given a convex regularizer, e.g., R_{λ_tgt}(θ) = λ_tgt ||θ||_1 [30], we can obtain global optima in polynomial time and characterize their statistical properties. However, convex regularizers incur large estimation bias. To address this issue, several nonconvex regularizers have been proposed, including the minimax concave penalty (MCP, [39]), the smoothly clipped absolute deviation (SCAD, [8]), and the capped ℓ1-regularization [40]. The obtained estimator (e.g., a hypothetical global optimum of (1)) can achieve faster statistical rates of convergence than its convex counterparts [9, 16, 22, 34].

Despite these superior statistical guarantees, nonconvex regularizers raise a greater computational challenge than convex regularizers in high dimensions.
Popular iterative algorithms for convex optimization, such as proximal gradient descent [2, 23] and coordinate descent [17, 29], no longer have strong global convergence guarantees for nonconvex optimization. Establishing statistical properties of the estimators obtained by these algorithms is therefore very challenging, which explains why theoretical studies on computational and statistical guarantees for nonconvex regularized sparse modeling approaches were so limited until the recent rise of a new area named "statistical optimization". Specifically, machine learning researchers have started to incorporate certain structures of sparse modeling (e.g., restricted strong convexity, large regularization effect) into algorithmic design and convergence analysis for optimization. This has motivated several recent advances: [16] propose proximal gradient algorithms for a family of nonconvex regularized estimators with linear convergence to an approximate local optimum with suboptimal statistical guarantees; [34, 43] further propose homotopy proximal gradient and coordinate gradient descent algorithms with linear convergence to a local optimum and optimal statistical guarantees; [9, 41] propose a multistage convex relaxation-based (also known as Difference of Convex (DC) programming) proximal gradient algorithm, which can guarantee an approximate local optimum with optimal statistical properties. Their computational analysis further shows that within each stage of the convex relaxation, the proximal gradient algorithm achieves (local) linear convergence to a unique sparse global optimum of the relaxed convex subproblem.

The aforementioned approaches only consider first order algorithms, such as proximal gradient descent and proximal coordinate gradient descent.
Second order algorithms with theoretical guarantees are still largely missing for high dimensional nonconvex regularized sparse modeling approaches, but this has not suppressed the enthusiasm for applying heuristic second order algorithms to real world problems. Some evidence has already corroborated their superior computational performance over first order algorithms (e.g., glmnet [10]). This further motivates our attempt to understand second order algorithms in high dimensions.

In this paper, we study a multistage convex relaxation-based proximal Newton algorithm for nonconvex regularized sparse learning. This algorithm is not only highly efficient in practice, but also enjoys strong computational and statistical guarantees in theory. Specifically, by leveraging a sophisticated characterization of local restricted strong convexity and Hessian smoothness, we prove that within each stage of convex relaxation, our proposed algorithm maintains the solution sparsity and achieves (local) quadratic convergence, which is a significant improvement over the (local) linear convergence of the proximal gradient algorithm in [9] (see more details in later sections). This eventually allows us to obtain an approximate local optimum with optimal statistical properties after only a few relaxations. Numerical experiments are provided to support our theory. To the best of our knowledge, this is the first second order approach for high dimensional sparse learning with convex/nonconvex regularizers that has strong statistical and computational guarantees.

Notations: Given a vector v ∈ R^d, we denote the ℓ_p-norm as ||v||_p = (Σ_{j=1}^d |v_j|^p)^{1/p} for a real p > 0, the number of nonzero entries as ||v||_0 = Σ_j 1(v_j ≠ 0), and v_{\j} = (v_1, ..., v_{j−1}, v_{j+1}, ..., v_d)^⊤ ∈ R^{d−1} as the subvector with the j-th entry removed. Given an index set A ⊆ {1, ..., d}, A⊥ = {j | j ∈ {1, ..., d}, j ∉ A} is the complementary set of A.
We use v_A to denote the subvector of v indexed by A. Given a matrix A ∈ R^{d×d}, we use A_{*j} (A_{k*}) to denote the j-th column (k-th row), and λ_max(A) (λ_min(A)) the largest (smallest) eigenvalue of A. We define ||A||_F^2 = Σ_j ||A_{*j}||_2^2 and ||A||_2 = √(λ_max(A^⊤A)). We denote A_{\i\j} as the submatrix of A with the i-th row and the j-th column removed, A_{\i j} (A_{i\j}) as the j-th column (i-th row) of A with its i-th (j-th) entry removed, and A_{AA} as the submatrix of A with both rows and columns indexed by A. If A is a PSD matrix, we define ||v||_A = √(v^⊤Av) as the induced seminorm for a vector v. We use the conventional notation O(·), Ω(·), Θ(·) to denote limiting behavior, ignoring constants, and O_P(·) to denote limiting behavior in probability. C_1, C_2, ... denote generic positive constants.

2 DC Proximal Newton Algorithm

Throughout the rest of the paper, we assume: (1) L(θ) is nonstrongly convex and twice continuously differentiable, e.g., the negative log-likelihood function of the generalized linear model (GLM); (2) L(θ) takes an additive form, i.e., L(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ), where each ℓ_i(θ) is associated with an observation (x_i, y_i) for i = 1, ..., n. Taking GLM as an example, we have ℓ_i(θ) = ψ(x_i^⊤θ) − y_i x_i^⊤θ, where ψ is the cumulant function.

For nonconvex regularization, we use the capped ℓ1 regularizer [40] defined as

R_{λ_tgt}(θ) = Σ_{j=1}^d r_{λ_tgt}(θ_j) = λ_tgt Σ_{j=1}^d min{|θ_j|, β λ_tgt},

where β > 0 is an additional tuning parameter. Our algorithm and theory can also be extended to the SCAD and MCP regularizers in a straightforward manner [8, 39].
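For concreteness, the additive GLM loss above can be instantiated for sparse logistic regression, where ψ(u) = log(1 + e^u). The following is a minimal Python sketch of our own (not the paper's implementation, which is in C); the function names are illustrative:

```python
import math

def ell_i(theta, x_i, y_i):
    """Per-observation GLM loss ell_i(theta) = psi(x_i' theta) - y_i * x_i' theta,
    with the logistic cumulant psi(u) = log(1 + exp(u))."""
    u = sum(a * b for a, b in zip(x_i, theta))
    # psi(u) = log(1 + exp(u)), written stably for large |u|
    psi = max(u, 0.0) + math.log1p(math.exp(-abs(u)))
    return psi - y_i * u

def loss(theta, X, y):
    """Average negative log-likelihood L(theta) = (1/n) * sum_i ell_i(theta)."""
    return sum(ell_i(theta, xi, yi) for xi, yi in zip(X, y)) / len(y)
```

At θ = 0 every observation contributes ψ(0) = log 2, which is a quick sanity check of the implementation.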
As shown in Figure 1, r_{λ_tgt}(θ_j) can be decomposed as the difference of two convex functions [5], i.e.,

r_λ(θ_j) = λ|θ_j| − λ max{|θ_j| − βλ, 0},

where both terms on the right hand side are convex in θ_j. This motivates us to apply the difference of convex (DC) programming approach to solve the nonconvex problem. We then introduce the DC proximal Newton algorithm, which contains three components: the multistage convex relaxation, the warm initialization, and the proximal Newton algorithm.

Figure 1: The capped ℓ1 regularizer is the difference of two convex functions. This allows us to relax the nonconvex regularizer based on the concave duality.

(I) The multistage convex relaxation is essentially a sequential optimization framework [40]. At the (K+1)-th stage, we have the output solution θ̂^{K} from the previous stage. For notational simplicity, we define a regularization vector λ^{K+1} = (λ_1^{K+1}, ..., λ_d^{K+1})^⊤, where λ_j^{K+1} = λ_tgt · 1(|θ̂_j^{K}| ≤ β λ_tgt) for all j = 1, ..., d. Let ⊙ be the Hadamard (entrywise) product. We solve a convex relaxation of (1) at θ = θ̂^{K} as follows,

θ̄^{K+1} = argmin_{θ ∈ R^d} F_{λ^{K+1}}(θ), where F_{λ^{K+1}}(θ) = L(θ) + ||λ^{K+1} ⊙ θ||_1,    (2)

where ||λ^{K+1} ⊙ θ||_1 = Σ_{j=1}^d λ_j^{K+1} |θ_j|. One can verify that ||λ^{K+1} ⊙ θ||_1 is essentially a convex relaxation of R_{λ_tgt}(θ) at θ = θ̂^{K} based on the concave duality in DC programming. We emphasize that θ̄^{K} denotes the unique sparse global optimum of (2) (the uniqueness will be elaborated in later sections), and θ̂^{K} denotes the output solution of (2) when we terminate the iteration at the K-th convex relaxation stage.
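The DC decomposition and the stage-wise reweighting above can be sketched in a few lines of Python (our own illustration; function names are not from the paper's code):

```python
def capped_l1(t, lam, beta):
    """Capped ell_1 penalty r(t) = lam * min(|t|, beta*lam)."""
    return lam * min(abs(t), beta * lam)

def dc_parts(t, lam, beta):
    """The two convex pieces whose difference recovers r:
    r(t) = lam*|t| - lam*max(|t| - beta*lam, 0)."""
    return lam * abs(t), lam * max(abs(t) - beta * lam, 0.0)

def relaxed_weights(theta_hat, lam, beta):
    """Stage-(K+1) weights lam_j = lam * 1(|theta_hat_j| <= beta*lam):
    coordinates whose current estimates already exceed beta*lam in
    magnitude are no longer penalized at the next stage."""
    return [lam if abs(t) <= beta * lam else 0.0 for t in theta_hat]
```

The reweighting is what reduces the estimation bias: once a coordinate looks like a strong signal, its ℓ1 weight is dropped to zero in the next relaxation.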
The stopping criterion will be explained later.

(II) The warm initialization is the first stage of DC programming, where we solve the ℓ1 regularized counterpart of (1),

θ̄^{1} = argmin_{θ ∈ R^d} L(θ) + λ_tgt ||θ||_1.    (3)

This is an intuitive choice for sparse statistical recovery, since the ℓ1 regularized estimator gives us a good initialization that is sufficiently close to θ*. Note that this is equivalent to (2) with λ_j^{1} = λ_tgt for all j = 1, ..., d, which can be viewed as the convex relaxation of (1) at θ̂^{0} = 0 for the first stage.

(III) The proximal Newton algorithm proposed in [12] is then applied to solve the convex subproblem (2) at each stage, including the warm initialization (3). For notational simplicity, we omit the stage index {K} for all intermediate updates of θ, and only use (t) as the iteration index within the K-th stage for all K ≥ 1. Specifically, at the K-th stage, given θ^(t) at the t-th iteration of the proximal Newton algorithm, we consider a quadratic approximation of (2) at θ^(t) as follows,

Q(θ; θ^(t), λ^{K}) = L(θ^(t)) + (θ − θ^(t))^⊤ ∇L(θ^(t)) + (1/2) ||θ − θ^(t)||²_{∇²L(θ^(t))} + ||λ^{K} ⊙ θ||_1,    (4)

where ||θ − θ^(t)||²_{∇²L(θ^(t))} = (θ − θ^(t))^⊤ ∇²L(θ^(t)) (θ − θ^(t)). We then take θ^(t+1/2) = argmin_θ Q(θ; θ^(t), λ^{K}). Since L(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ) takes an additive form, we can avoid directly computing the d-by-d Hessian matrix in (4).
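For logistic loss, the usual way to see this reduction is the classic iteratively reweighted least squares (IRLS) identity: the Hessian is (1/n) X^⊤WX with diagonal W, so the quadratic model (4) becomes a weighted lasso. A sketch under that assumption (the formulas below are the standard Newton working weights/responses, not quoted from the paper):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def working_response(X, y, theta):
    """Newton working weights w_i = psi''(x_i' theta) and responses
    z_i = x_i' theta + (y_i - psi'(x_i' theta)) / w_i for logistic loss,
    so that the quadratic model equals
    (1/n) * sum_i w_i * (z_i - x_i' theta)^2 + penalty + constant."""
    w, z = [], []
    for xi, yi in zip(X, y):
        u = sum(a * b for a, b in zip(xi, theta))
        p = sigmoid(u)
        wi = max(p * (1.0 - p), 1e-10)  # guard against vanishing curvature
        w.append(wi)
        z.append(u + (yi - p) / wi)
    return w, z
```

Only the n weights and responses need be stored, never the d-by-d Hessian.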
Alternatively, in order to reduce the memory usage when d is large, we rewrite (4) as a regularized weighted least squares problem,

Q(θ; θ^(t)) = (1/n) Σ_{i=1}^n w_i (z_i − x_i^⊤θ)² + ||λ^{K} ⊙ θ||_1 + constant,    (5)

where the w_i's and z_i's are easy-to-compute constants depending on θ^(t), the ℓ_i(θ^(t))'s, the x_i's, and the y_i's.

Remark 1. Existing literature has shown that (5) can be efficiently solved by coordinate descent algorithms in conjunction with the active set strategy [43]. See more details in [10] and Appendix B.

For the first stage (i.e., the warm initialization), we require an additional backtracking line search procedure to guarantee the descent of the objective value [12]. Specifically, we denote

Δθ^(t) = θ^(t+1/2) − θ^(t).

Then we start from η_t = 1 and use backtracking line search to find the optimal η_t ∈ (0, 1] such that the Armijo condition [1] holds. Specifically, given a constant μ ∈ (0.9, 1), we update η_t = μ^q from q = 0 and find the smallest integer q such that

F_{λ^{1}}(θ^(t) + η_t Δθ^(t)) ≤ F_{λ^{1}}(θ^(t)) + α η_t γ_t,

where α ∈ (0, 1/2) is a fixed constant and

γ_t = ∇L(θ^(t))^⊤ Δθ^(t) + ||λ^{1} ⊙ (θ^(t) + Δθ^(t))||_1 − ||λ^{1} ⊙ θ^(t)||_1.

We then set θ^(t+1) = θ^(t) + η_t Δθ^(t). We terminate the iterations when the following approximate KKT condition holds:

ω_{λ^{1}}(θ^(t)) := min_{ξ ∈ ∂||θ^(t)||_1} ||∇L(θ^(t)) + λ^{1} ⊙ ξ||_∞ ≤ ε,

where ε is a predefined precision parameter. We then set the output solution θ̂^{1} = θ^(t). Note that θ̂^{1} is then used as the initial solution for the second stage of convex relaxation (2).
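The approximate KKT residual ω_λ(θ) has a closed form, since the inner minimization over subgradients ξ decouples coordinatewise; a small Python sketch (our own illustration):

```python
def kkt_residual(grad, lam, theta):
    """omega_lam(theta) = min over subgradients xi of ||grad + lam .* xi||_inf.
    For theta_j != 0 the subgradient of |theta_j| is sign(theta_j); for
    theta_j = 0 the best xi_j in [-1, 1] shrinks the j-th entry toward zero."""
    r = 0.0
    for g, l, t in zip(grad, lam, theta):
        if t > 0:
            v = abs(g + l)
        elif t < 0:
            v = abs(g - l)
        else:
            v = max(abs(g) - l, 0.0)
        r = max(r, v)
    return r
```

The stopping rule then reads `kkt_residual(grad, lam, theta) <= eps`; at an exact optimum the residual is zero.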
The proximal Newton algorithm with backtracking line search is summarized in Algorithm 2 in the Appendix.

Such a backtracking line search procedure is not necessary at the K-th stage for K ≥ 2. In other words, we simply take η_t = 1 and θ^(t+1) = θ^(t+1/2) for all t ≥ 0 when K ≥ 2. This leads to more efficient updates for the proximal Newton algorithm from the second stage of convex relaxation (2) onward. We summarize our proposed DC proximal Newton algorithm in Algorithm 1 in the Appendix.

3 Computational and Statistical Theories

Before we present our theoretical results, we first introduce some preliminaries, including important definitions and assumptions. We define the largest and smallest s-sparse eigenvalues as follows.

Definition 2. We define the largest and smallest s-sparse eigenvalues of ∇²L(θ) as

ρ_s^+ = sup_{||v||_0 ≤ s} (v^⊤ ∇²L(θ) v) / (v^⊤ v)  and  ρ_s^− = inf_{||v||_0 ≤ s} (v^⊤ ∇²L(θ) v) / (v^⊤ v)

for any positive integer s. We define κ_s = ρ_s^+ / ρ_s^− as the s-sparse condition number.

The sparse eigenvalue (SE) conditions are widely studied in high dimensional sparse modeling problems, and are closely related to restricted strong convexity/smoothness properties and restricted eigenvalue properties [22, 27, 33, 44]. For notational convenience, given a parameter θ ∈ R^d and a real constant R > 0, we define a neighborhood of θ with radius R as B(θ, R) := {φ ∈ R^d | ||φ − θ||_2 ≤ R}. Our first assumption is on the sparse eigenvalues of the Hessian matrix over a sparse domain.

Assumption 1.
Given θ ∈ B(θ*, R) for a generic constant R, there exists a generic constant C_0 such that ∇²L(θ) satisfies SE with parameters 0 < ρ⁻_{s*+2s̃} ≤ ρ⁺_{s*+2s̃} < +∞, where s̃ ≥ C_0 κ²_{s*+2s̃} s* and κ_{s*+2s̃} = ρ⁺_{s*+2s̃} / ρ⁻_{s*+2s̃}.

Assumption 1 requires that L(θ) has a finite largest and a positive smallest sparse eigenvalue, given that θ is sufficiently sparse and close to θ*. Analogous conditions are widely used in high dimensional analysis [13, 14, 34, 35, 43], such as the restricted strong convexity/smoothness of L(θ) (RSC/RSS, [6]). Given any θ, θ′ ∈ R^d, the RSC/RSS quantity can be defined as δ(θ′, θ) := L(θ′) − L(θ) − ∇L(θ)^⊤(θ′ − θ). For notational simplicity, we define S = {j | θ*_j ≠ 0} and S⊥ = {j | θ*_j = 0}. The following proposition connects the SE property to the RSC/RSS property.

Proposition 3. Given θ, θ′ ∈ B(θ*, R) with ||θ_{S⊥}||_0 ≤ s̃ and ||θ′_{S⊥}||_0 ≤ s̃, L(θ) satisfies

(1/2) ρ⁻_{s*+2s̃} ||θ′ − θ||²_2 ≤ δ(θ′, θ) ≤ (1/2) ρ⁺_{s*+2s̃} ||θ′ − θ||²_2.

The proof of Proposition 3 is provided in [6], and is therefore omitted. Proposition 3 implies that L(θ) is essentially strongly convex, but only over a sparse domain (see Figure 2). The second assumption requires ∇²L(θ) to be smooth over the sparse domain.

Assumption 2 (Local Restricted Hessian Smoothness).
Recall that s̃ is defined in Assumption 1. There exist generic constants L_{s*+2s̃} and R such that for any θ, θ′ ∈ B(θ*, R) with ||θ_{S⊥}||_0 ≤ s̃ and ||θ′_{S⊥}||_0 ≤ s̃, we have

sup_{v ∈ Ω, ||v||_2 = 1} v^⊤(∇²L(θ′) − ∇²L(θ))v ≤ L_{s*+2s̃} ||θ − θ′||_2, where Ω = {v | supp(v) ⊆ (supp(θ) ∪ supp(θ′))}.

Assumption 2 guarantees that ∇²L(θ) is Lipschitz continuous within a neighborhood of θ* over a sparse domain. The local restricted Hessian smoothness is parallel to the local Hessian smoothness used for analyzing the proximal Newton method in low dimensions [12].

Figure 2: An illustrative two dimensional example of restricted strong convexity. L(θ) is not strongly convex. But if we restrict θ to be sparse (black curve), L(θ) behaves like a strongly convex function.

In our analysis, we set the radius R as R := ρ⁻_{s*+2s̃} / (2 L_{s*+2s̃}), which is the radius of the region centered at the unique global minimizer of (2) for quadratic convergence of the proximal Newton algorithm. This is parallel to the radius in low dimensions [12], except that we restrict the parameters to the sparse domain. The third assumption requires the choice of λ_tgt to be appropriate.

Assumption 3. Given the true modeling parameter θ*, there exists a generic constant C_1 such that λ_tgt = C_1 √(log d / n) ≥ 4 ||∇L(θ*)||_∞. Moreover, for a large enough n, we have √s* λ_tgt ≤ C_2 R ρ⁻_{s*+2s̃}, where 2R = ρ⁻_{s*+2s̃} / L_{s*+2s̃}.

Assumption 3 guarantees that the regularization is sufficiently large to eliminate irrelevant coordinates, so that the obtained solution is sufficiently sparse [4, 22]. In addition, λ_tgt cannot be too large, which guarantees that the estimator is close enough to the true model parameter. The above assumptions are deterministic.
We will verify them under the GLM in the statistical analysis. Our last assumption is on the predefined precision parameter ε.

Assumption 4. For each stage of solving the convex relaxation subproblem (2), for all K ≥ 1, there exists a generic constant C_3 such that ε satisfies ε = C_3/√n ≤ λ_tgt/8.

Assumption 4 guarantees that the output solution θ̂^{K} at each stage, for all K ≥ 1, has sufficient precision, which is critical for our convergence analysis of the multistage convex relaxation.

3.1 Computational Theory

We first characterize the convergence of the first stage of our proposed DC proximal Newton algorithm, i.e., the warm initialization for solving (3).

Theorem 4 (Warm Initialization, K = 1). Suppose that Assumptions 1 ∼ 4 hold. After sufficiently many iterations T < ∞, the following results hold for all t ≥ T:

||θ^(t) − θ*||_2 ≤ R  and  F_{λ^{1}}(θ^(t)) ≤ F_{λ^{1}}(θ*) + (15 λ²_tgt s*) / (4 ρ⁻_{s*+2s̃}),

which further guarantee

η_t = 1, ||θ^(t)_{S⊥}||_0 ≤ s̃, and ||θ^(t+1) − θ̄^{1}||_2 ≤ (L_{s*+2s̃} / (2 ρ⁻_{s*+2s̃})) ||θ^(t) − θ̄^{1}||²_2,

where θ̄^{1} is the unique sparse global minimizer of (3) satisfying ||θ̄^{1}_{S⊥}||_0 ≤ s̃ and ω_{λ^{1}}(θ̄^{1}) = 0. Moreover, we need at most

T + log log(3 ρ⁺_{s*+2s̃} / ε)

iterations to terminate the proximal Newton algorithm for the warm initialization (3), where the output solution θ̂^{1} satisfies

||θ̂^{1}_{S⊥}||_0 ≤ s̃,  ω_{λ^{1}}(θ̂^{1}) ≤ ε,  and  ||θ̂^{1} − θ*||_2 ≤ 18 λ_tgt √s* / ρ⁻_{s*+2s̃}.

The proof of Theorem 4 is provided in Appendix C.1. Theorem 4 implies: (I) The objective value is sufficiently small after a finite number T of iterations of the proximal Newton algorithm, which further guarantees sparse solutions and good computational performance in all follow-up iterations.
(II) The solution enters the ball B(θ*, R) after a finite number T of iterations. Combined with the sparsity of the solution, this further guarantees that the solution enters the region of quadratic convergence. Thus the backtracking line search terminates immediately and outputs η_t = 1 for all t ≥ T. (III) The total number of iterations is at most O(T + log log(1/ε)) to achieve the approximate KKT condition ω_{λ^{1}}(θ^(t)) ≤ ε, which serves as the stopping criterion of the warm initialization (3).

Given these good properties of the output solution θ̂^{1} obtained from the warm initialization, we can further show that our proposed DC proximal Newton algorithm achieves better computational performance in all follow-up stages (i.e., K ≥ 2) than in the first stage. This is characterized by the following theorem. For notational simplicity, we omit the stage index {K} for the intermediate updates within each stage of the multistage convex relaxation.

Theorem 5 (Stage K, K ≥ 2). Suppose Assumptions 1 ∼ 4 hold. Then for all iterations t = 1, 2, ... within each stage K ≥ 2, we have

||θ^(t)_{S⊥}||_0 ≤ s̃  and  ||θ^(t) − θ*||_2 ≤ R,

which further guarantee

η_t = 1, ||θ^(t+1) − θ̄^{K}||_2 ≤ (L_{s*+2s̃} / (2 ρ⁻_{s*+2s̃})) ||θ^(t) − θ̄^{K}||²_2, and F_{λ^{K}}(θ^(t+1)) < F_{λ^{K}}(θ^(t)),

where θ̄^{K} is the unique sparse global minimizer of (2) at the K-th stage satisfying ||θ̄^{K}_{S⊥}||_0 ≤ s̃ and ω_{λ^{K}}(θ̄^{K}) = 0. Moreover, we need at most

log log(3 ρ⁺_{s*+2s̃} / ε)

iterations to terminate the proximal Newton algorithm for the K-th stage of convex relaxation (2), where the output solution θ̂^{K} satisfies ||θ̂^{K}_{S⊥}||_0 ≤ s̃, ω_{λ^{K}}(θ̂^{K}) ≤ ε, and

||θ̂^{K} − θ*||_2 ≤ C_2 ( ||∇L(θ*)_S||_2 + λ_tgt √( Σ_{j∈S} 1(|θ*_j| ≤ β λ_tgt) ) + ε √s* ) + C_3 · 0.7^{K−1} ||θ̂^{1} − θ*||_2,

for some generic constants C_2 and C_3.

The proof of Theorem 5 is provided in Appendix C.2. A geometric interpretation of the local quadratic convergence of our proposed algorithm is provided in Figure 3. From the second stage of convex relaxation (2), i.e., K ≥ 2, Theorem 5 implies: (I) Within each stage, the algorithm maintains a sparse solution throughout all iterations t ≥ 1. The sparsity further guarantees that the SE property and the restricted Hessian smoothness hold, which are necessary conditions for the fast convergence of the proximal Newton algorithm. (II) The solution is maintained in the region B(θ*, R) for all t ≥ 1. Combined with the sparsity of the solution, this implies that the solution enters the region of quadratic convergence. This guarantees that we only need to set the step size η_t = 1, and the objective value is monotonically decreasing, without the sophisticated backtracking line search procedure.
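The practical meaning of quadratic convergence is how few iterations the error recursion e_{t+1} = c·e_t² needs: the exponent doubles each step, so reaching precision ε takes O(log log(1/ε)) iterations. A toy Python sketch of this arithmetic (our own illustration, with c standing in for the contraction factor L/(2ρ⁻)):

```python
def quadratic_iters(e0, c, eps):
    """Count iterations of the contraction e_{t+1} = c * e_t**2 until
    e_t <= eps; requires c*e0 < 1, i.e., starting inside the region of
    quadratic convergence."""
    assert c * e0 < 1.0, "must start inside the region of quadratic convergence"
    e, t = e0, 0
    while e > eps:
        e = c * e * e
        t += 1
    return t
```

Doubling the target precision (e.g., from 1e-16 to 1e-32) adds only a single extra iteration, in contrast to linear convergence, where the iteration count scales with log(1/ε).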
Thus, the proximal Newton algorithm enjoys the same fast convergence as in low dimensional optimization problems [12].

Figure 3: A geometric interpretation of local quadratic convergence: the warm initialization enters the region of quadratic convergence (orange region) after finitely many iterations, and the follow-up stages remain in the region of quadratic convergence. The final estimator θ̂^{K̃} has a better estimation error than the estimator θ̂^{1} obtained from the convex warm initialization.

(III) With the quadratic convergence rate, the number of iterations is at most O(log log(1/ε)) to attain the approximate KKT condition ω_{λ^{K}}(θ^(t)) ≤ ε, which is the stopping criterion at each stage.

3.2 Statistical Theory

Recall that our computational theory relies on deterministic assumptions (Assumptions 1 ∼ 3). However, these assumptions involve data, which are sampled from a statistical distribution. Therefore, in the following lemma, we verify that these assumptions hold with high probability under a mild data generation process (i.e., GLM) in high dimensions.

Lemma 6. Suppose that the x_i's are i.i.d. sampled from a zero-mean distribution with covariance matrix Cov(x_i) = Σ such that ∞ > c_max ≥ λ_max(Σ) ≥ λ_min(Σ) ≥ c_min > 0, and for any v ∈ R^d, v^⊤x_i is sub-Gaussian with variance at most a||v||²_2, where c_max, c_min, and a are generic constants. Moreover, for some constant M > 0, at least one of the following two conditions holds: (I) the Hessian of the cumulant function is uniformly bounded, ||ψ''||_∞ ≤ M; or (II) the covariates are bounded, ||x_i||_∞ ≤ 1, and E[max_{|u| ≤ 1} [ψ''(x^⊤θ* + u)]^p] ≤ M for some p > 2. Then Assumptions 1 ∼ 3 hold with high probability.

The proof of Lemma 6 is provided in Appendix F.
Given that these assumptions hold with high probability, we know that the proximal Newton algorithm attains a quadratic rate of convergence within each stage of convex relaxation with high probability. We then establish the statistical rate of convergence of the obtained estimator in parameter estimation.

Theorem 7. Suppose the observations are generated from a GLM satisfying the conditions in Lemma 6, with n large enough that n ≥ C_4 s* log d, and β = C_5/c_min is the constant defined in Section 2, for generic constants C_4 and C_5. Then, with high probability, the output solution θ̂^{K} satisfies

||θ̂^{K} − θ*||_2 ≤ C_6 ( √(s*/n) + √(s_0 log d / n) ) + C_7 · 0.7^K √(s* log d / n)

for generic constants C_6 and C_7, where s_0 = Σ_{j∈S} 1(|θ*_j| ≤ β λ_tgt).

Theorem 7 is a direct result of combining Theorem 5 and the analysis in [40]. As can be seen, s_0 is essentially the number of nonzero θ_j's with magnitudes smaller than β λ_tgt, which are often considered "weak" signals. Theorem 7 essentially implies that, by exploiting the multi-stage convex relaxation framework, our proposed DC proximal Newton algorithm gradually reduces the estimation bias for "strong" signals, and eventually obtains an estimator with better statistical properties than the ℓ1-regularized estimator. Specifically, let K̃ be the smallest integer such that after K̃ stages of convex relaxation we have C_7 · 0.7^{K̃} √(s* log d / n) ≤ C_6 max{√(s*/n), √(s_0 log d / n)}, which is equivalent to requiring K̃ = O(log log d). This implies that the total number of proximal Newton updates is at most O((T + log log(1/ε)) · (1 + log log d)).
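The stage count K̃ can be computed directly from the bound in Theorem 7; a small Python sketch of our own (the constants C_6, C_7 default to 1 purely for illustration):

```python
import math

def stages_needed(s_star, s0, d, n, C6=1.0, C7=1.0):
    """Smallest K with C7 * 0.7**K * sqrt(s* log d / n) at or below the
    target C6 * max(sqrt(s*/n), sqrt(s0 log d / n)) of Theorem 7."""
    target = C6 * max(math.sqrt(s_star / n), math.sqrt(s0 * math.log(d) / n))
    K = 1
    while C7 * 0.7 ** K * math.sqrt(s_star * math.log(d) / n) > target:
        K += 1
    return K
```

With all-strong signals (s_0 = 0) the required K grows like log log d, i.e., extremely slowly in the dimension; with all-weak signals (s_0 = s*) a single stage already suffices, consistent with the ℓ1 estimator being rate-optimal in that regime.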
In addition, the obtained estimator attains the optimal statistical properties in parameter estimation:

||θ̂^{K̃} − θ*||_2 ≤ O_P( √(s*/n) + √(s_0 log d / n) )  v.s.  ||θ̂^{1} − θ*||_2 ≤ O_P( √(s* log d / n) ).    (6)

Recall that θ̂^{1} is obtained by the warm initialization (3). As illustrated in Figure 3, this implies that the statistical rate in (6) for ||θ̂^{K̃} − θ*||_2, obtained from the multistage convex relaxation of the nonconvex regularized problem (1), is a significant improvement over ||θ̂^{1} − θ*||_2 obtained from the convex problem (3). Especially when s_0 is small, i.e., most of the nonzero θ_j's are strong signals, our result approaches the oracle bound³ O_P(√(s*/n)) [8], as illustrated in Figure 4.

Figure 4: An illustration of the statistical rates of convergence in parameter estimation (estimation error versus the percentage of strong signals). Our obtained estimator has an error bound O_P(√(s*/n) + √(s_0 log d / n)) between the oracle bound O_P(√(s*/n)) and the slow bound O_P(√(s* log d / n)) from the convex problem in general. When the percentage of strong signals increases, i.e., s_0 decreases, our result approaches the oracle bound.

4 Experiments

We compare our DC Proximal Newton (DC+PN) algorithm with two competing algorithms for solving the nonconvex regularized sparse logistic regression problem: the accelerated proximal gradient algorithm (APG) implemented in the SPArse Modeling Software (SPAMS, coded in C++ [18]), and the accelerated coordinate descent (ACD) algorithm implemented in the R package gcdnet (coded in Fortran, [36]). We further optimize the active set strategy in gcdnet to boost its computational performance. To integrate these two algorithms with the multistage convex relaxation framework, we revise their source code.

To further boost the computational efficiency at each stage of the convex relaxation, we apply pathwise optimization [10] for all algorithms in practice.
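The pathwise scheme uses a geometric sequence of regularization parameters; a minimal Python sketch of how such a path can be built (our own illustration, not the paper's code):

```python
def lambda_path(lam0, lam_tgt, M):
    """Geometric sequence lam[m] = alpha**m * lam[0] for m = 0..M, with the
    shrinkage factor alpha in (0, 1) chosen so that lam[M] = lam_tgt."""
    alpha = (lam_tgt / lam0) ** (1.0 / M)
    return [lam0 * alpha ** m for m in range(M + 1)]
```

Each problem along the path is solved warm-started from the previous solution, so every subproblem starts close to its optimum.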
Specifically, at each stage, we use a geometrically decreasing sequence of regularization parameters $\{\lambda_{[m]} = \alpha^m \lambda_{[0]}\}_{m=1}^{M}$, where $\lambda_{[0]}$ is the smallest value such that the corresponding solution is zero, $\alpha \in (0, 1)$ is a shrinkage parameter, and $\lambda_{\rm tgt} = \lambda_{[M]}$. For each $\lambda_{[m]}$, we apply the corresponding algorithm (DC+PN, DC+APG, or DC+ACD) to solve the nonconvex regularized problem (1). Moreover, we initialize the solution for the new regularization parameter $\lambda_{[m+1]}$ using the output solution obtained with $\lambda_{[m]}$. Such a pathwise optimization scheme has achieved tremendous success in practice [10, 15, 42]. We refer to [43] for a detailed discussion of pathwise optimization.

Figure 4: An illustration of the statistical rates of convergence in parameter estimation. Our obtained estimator has an error bound between the oracle bound and the slow bound from the convex problem in general. When the percentage of strong signals increases, i.e., $s_0$ decreases, our result approaches the oracle bound.

Our comparison contains five datasets: two real datasets, "madelon" ($n = 2000$, $d = 500$, [11]) and "gisette" ($n = 2000$, $d = 5000$, [11]), and three simulated datasets, "sim_1k" ($d = 1000$), "sim_5k" ($d = 5000$), and "sim_10k" ($d = 10000$), with sample size $n = 1000$ for all three simulated datasets. We set $\lambda_{\rm tgt} = 0.25\sqrt{\log d / n}$ and = 0.2 for all settings here. We generate each row of the design matrix $X$ independently from a $d$-dimensional normal distribution $N(0, \Sigma)$, where $\Sigma_{jk} = 0.5^{|j-k|}$ for $j, k = 1, \ldots, d$. We generate $y \sim \text{Bernoulli}\big(1/[1 + \exp(-X\theta^*)]\big)$, where $\theta^*$ has all zero entries except for 20 randomly selected entries, which are independently sampled from Uniform(0, 1). The stopping criteria for all algorithms are tuned such that they attain similar optimization errors.

All three algorithms are compared in wall clock time.
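The simulated-data setup and the geometric regularization path described above can be sketched as follows. This is a minimal sketch: the random seed and the number of path points $M$ are our own choices, and taking $\lambda_{[0]}$ as the max-norm of the logistic-loss gradient at zero is the standard choice that makes the $\ell_1$-penalized solution exactly zero, assumed here rather than quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 1000, 1000, 20  # the "sim_1k" setting

# Design: rows drawn i.i.d. from N(0, Sigma) with Sigma_jk = 0.5^{|j-k|}.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Sparse truth: all zeros except 20 random entries ~ Uniform(0, 1).
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.uniform(0, 1, size=s)

# Logistic responses y ~ Bernoulli(1 / (1 + exp(-X theta*))).
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))

# Geometric regularization path lambda[m] = alpha^m * lambda[0],
# ending at lambda_tgt = 0.25 * sqrt(log d / n).
lam_tgt = 0.25 * np.sqrt(np.log(d) / n)
lam0 = np.abs(X.T @ (y - 0.5)).max() / n  # logistic-loss gradient at 0
M = 20                                    # number of path points (our choice)
alpha = (lam_tgt / lam0) ** (1.0 / M)     # shrinkage so that lambda[M] = lam_tgt
lams = lam0 * alpha ** np.arange(M + 1)
```

Each $\lambda_{[m+1]}$ subproblem would then be warm-started from the solution obtained at $\lambda_{[m]}$, with the multi-stage convex relaxation applied at each path point.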
Our DC+PN algorithm is implemented in C with double precision and called from R via a wrapper. All experiments are performed on a computer with a 2.6GHz Intel Core i7 and 16GB RAM. For each algorithm and dataset, we repeat the algorithm 10 times and report the average value and standard deviation of the wall clock time in Table 1. As can be seen, our DC+PN algorithm significantly outperforms the competing algorithms. We remark that for increasing $d$, the superiority of DC+PN over DC+ACD becomes less significant, as the Newton method is more sensitive to ill-conditioned problems. This can be mitigated by using a denser sequence of $\{\lambda_{[m]}\}$ along the solution path.

We then illustrate the quadratic convergence of our DC+PN algorithm within each stage of the convex relaxation using the "sim" dataset. Specifically, we plot the gap towards the optimal objective $\mathcal{F}^{\{K\}}(\theta^{\{K\}})$ of the $K$-th stage versus the wall clock time in Figure 5. We see that our proposed DC proximal Newton algorithm achieves quadratic convergence, which is consistent with our theory.

³The oracle bound assumes that we know which variables are relevant in advance. It is not a realistic bound, but is included only for comparison purposes.

Table 1: Quantitative timing comparisons for the nonconvex regularized sparse logistic regression.
The average values and the standard deviations (in parentheses) of the timing performance (in seconds) over 10 random trials are presented.

            madelon           gisette           sim_1k            sim_5k            sim_10k
DC+PN       1.51 (±0.01) s    5.35 (±0.11) s    1.07 (±0.02) s    4.53 (±0.06) s    8.82 (±0.04) s
DC+ACD      5.83 (±0.03) s    18.92 (±2.25) s   9.46 (±0.09) s    16.20 (±0.24) s   19.1 (±0.56) s
DC+APG      1.60 (±0.03) s    207 (±2.25) s     17.8 (±1.23) s    111 (±1.28) s     222 (±5.79) s

All three algorithms attain an objective value of 0.52 on madelon and 0.01 on the other four datasets.

Figure 5: Timing comparisons in wall clock time on (a) simulated data and (b) gisette data. Our proposed DC proximal Newton algorithm demonstrates superior quadratic convergence and significantly outperforms the DC proximal gradient algorithm.

5 Discussions

We provide further discussion of the superior performance of our DC proximal Newton algorithm. Existing multi-stage convex relaxation based first order algorithms have two major drawbacks:

(I) The first order algorithms incur significant computational overhead in each iteration. For example, for GLM, computing gradients requires frequently evaluating the cumulant function and its derivatives. This often involves extensive non-arithmetic operations, such as the log(·) and exp(·) functions that naturally appear in the cumulant function and its derivatives, which are computationally expensive. To the best of our knowledge, even if we use efficient numerical methods for calculating exp(·) [28, 19], the computation still needs at least 10–30 times more CPU cycles than basic arithmetic operations, e.g., multiplications.
Our proposed DC proximal Newton algorithm cannot avoid calculating the cumulant function and its derivatives when computing quadratic approximations. The computation, however, is much less intense, since the convergence is quadratic.

(II) The first order algorithms are computationally expensive due to step size selection. Although for certain GLMs, e.g., sparse logistic regression, we can choose the step size parameter as $\eta = 4\,\Lambda_{\max}^{-1}\big(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\big)$, such a step size often leads to poor empirical performance. In contrast, as our theoretical analysis and experiments suggest, the proposed DC proximal Newton algorithm needs very few line search steps, which saves much computational effort.

Some recent works on proximal Newton or inexact proximal Newton methods also demonstrate local quadratic convergence guarantees [37, 38]. However, the conditions there are much more stringent than the SE property in terms of the dependence on the problem dimensions. Specifically, their quadratic convergence can only be guaranteed in a much smaller neighborhood. For example, the constant nullspace strong convexity in [37], which plays the role of the smallest sparse eigenvalue $\rho_{s^*+2\widetilde{s}}$ in our analysis, is as small as $1/d$. Note that $\rho_{s^*+2\widetilde{s}}$ can be (almost) independent of $d$ in our case [6]. Therefore, instead of a constant radius as in our analysis, they can only guarantee quadratic convergence in a region with radius $\mathcal{O}(1/d)$, which is very small in high dimensions. A similar issue exists in [38]: the region of quadratic convergence is too small.

References

[1] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

[2] Amir Beck and Marc Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems.
IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.

[3] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

[4] Peter J Bickel, Yaacov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[5] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[6] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

[7] Ani Eloyan, John Muschelli, Mary Beth Nebel, Han Liu, Fang Han, Tuo Zhao, Anita D Barber, Suresh Joel, James J Pekar, Stewart H Mostofsky, et al. Automated diagnoses of attention deficit hyperactive disorder using magnetic resonance imaging. Frontiers in Systems Neuroscience, 6, 2012.

[8] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[9] Jianqing Fan, Han Liu, Qiang Sun, and Tong Zhang. TAC for sparse learning: Simultaneous control of algorithmic complexity and statistical error. arXiv preprint arXiv:1507.01037, 2015.

[10] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[11] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2005.

[12] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal Newton-type methods for minimizing composite functions.
SIAM Journal on Optimization, 24(3):1420–1443, 2014.

[13] Xingguo Li, Jarvis Haupt, Raman Arora, Han Liu, Mingyi Hong, and Tuo Zhao. A first order free lunch for sqrt-lasso. arXiv preprint arXiv:1605.07950, 2016.

[14] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In International Conference on Machine Learning, pages 917–925, 2016.

[15] Xingguo Li, Tuo Zhao, Tong Zhang, and Han Liu. The picasso package for nonconvex regularized m-estimation in high dimensions in R. Technical Report, 2015.

[16] Po-Ling Loh and Martin J Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 2015. To appear.

[17] Zhi-Quan Luo and Paul Tseng. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.

[18] Julien Mairal, Francis Bach, Jean Ponce, et al. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85–283, 2014.

[19] A Cristiano I Malossi, Yves Ineichen, Costas Bekas, and Alessandro Curioni. Fast exponential computation on SIMD architectures. Proc. of HIPEAC-WAPCO, Amsterdam, NL, 2015.

[20] Peter McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.

[21] Benjamin M Neale, Yan Kou, Li Liu, Avi Ma'Ayan, Kaitlin E Samocha, Aniko Sabo, Chiao-Feng Lin, Christine Stevens, Li-San Wang, Vladimir Makarov, et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature, 485(7397):242–245, 2012.

[22] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers.
Statistical Science, 27(4):538–557, 2012.

[23] Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

[24] Yang Ning, Tianqi Zhao, and Han Liu. A likelihood ratio framework for high dimensional semiparametric regression. arXiv preprint arXiv:1412.2295, 2014.

[25] Johann Pfanzagl. Parametric statistical theory. Walter de Gruyter, 1994.

[26] Maxim Raginsky, Rebecca M Willett, Zachary T Harmany, and Roummel F Marcia. Compressed sensing performance bounds under Poisson noise. IEEE Transactions on Signal Processing, 58(8):3990–4002, 2010.

[27] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(8):2241–2259, 2010.

[28] Nicol N Schraudolph. A fast, compact approximation of the exponential function. Neural Computation, 11(4):853–862, 1999.

[29] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for $\ell_1$-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.

[30] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[31] Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. Strong rules for discarding predictors in Lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.

[32] Sara A van de Geer. High-dimensional generalized linear models and the Lasso. The Annals of Statistics, 36(2):614–645, 2008.

[33] Sara A van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

[34] Zhaoran Wang, Han Liu, and Tong Zhang.
Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201, 2014.

[35] Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.

[36] Yi Yang and Hui Zou. An efficient algorithm for computing the HHSVM and its generalizations. Journal of Computational and Graphical Statistics, 22(2):396–415, 2013.

[37] Ian En-Hsu Yen, Cho-Jui Hsieh, Pradeep K Ravikumar, and Inderjit S Dhillon. Constant nullspace strong convexity and fast convergence of proximal methods under high-dimensional settings. In Advances in Neural Information Processing Systems, pages 1008–1016, 2014.

[38] Man-Chung Yue, Zirui Zhou, and Anthony Man-Cho So. Inexact regularized proximal Newton method: provable convergence guarantees for non-smooth convex minimization without strong convexity. arXiv preprint arXiv:1605.07522, 2016.

[39] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[40] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010.

[41] Tong Zhang et al. Multi-stage convex relaxation for feature selection. Bernoulli, 19(5B):2277–2293, 2013.

[42] Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13:1059–1062, 2012.

[43] Tuo Zhao, Han Liu, and Tong Zhang. Pathwise coordinate optimization for sparse learning: Algorithm and theory. arXiv preprint arXiv:1412.7477, 2014.

[44] Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices.
arXiv preprint arXiv:0912.4045, 2009.