{"title": "Fast global convergence rates of gradient methods for high-dimensional statistical recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 37, "page_last": 45, "abstract": "Many statistical $M$-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension $d$ to grow with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov's first-order method~\\cite{Nesterov07} has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter $\\theta^*$ and the optimal solution $\\widehat{\\theta}$. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso ($\\ell_1$-regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. 
Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation.", "full_text": "Fast global convergence of gradient methods for high-dimensional statistical recovery

Alekh Agarwal1, Sahand N. Negahban1, Martin J. Wainwright1,2
Department of Electrical Engineering and Computer Science1 and Department of Statistics2
University of California, Berkeley
Berkeley, CA 94720-1776
{alekh,sahand_n,wainwrig}@eecs.berkeley.edu

Abstract

Many statistical M-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov's first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter $\theta^*$ and the optimal solution $\hat{\theta}$. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods, which yielded only sublinear rates. Our analysis applies to a wide range of M-estimators and statistical models, including sparse linear regression using the Lasso ($\ell_1$-regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. 
Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation.

1 Introduction

High-dimensional data sets present challenges that are both statistical and computational in nature. On the statistical side, recent years have witnessed a flurry of results on consistency and rates for various estimators under high-dimensional scaling, meaning that the data dimension d and other structural parameters are allowed to grow with the sample size n. These results typically involve some assumption regarding the underlying structure of the parameter space, including sparse vectors, low-rank matrices, or structured regression functions, as well as some regularity conditions on the data-generating process. On the computational side, many estimators for statistical recovery are based on solving convex programs. Examples of such M-estimators include $\ell_1$-regularized quadratic programming (the Lasso), second-order cone programs for sparse non-parametric regression, and semidefinite programming relaxations for low-rank matrix recovery.

In parallel, a line of recent work (e.g., [3, 7, 6, 5, 12, 18]) focuses on polynomial-time algorithms for solving these types of convex programs. Several authors [2, 6, 1] have used variants of Nesterov's accelerated gradient method [12] to obtain algorithms with a global sublinear rate of convergence. For the special case of compressed sensing (sparse regression with incoherent design), some authors have established fast convergence rates in a local sense---once the iterates are close enough to the optimum [3, 5]. Other authors have studied finite convergence of greedy algorithms (e.g., [18]). If an algorithm identifies the support set of the optimal solution, the problem is effectively reduced to a lower-dimensional subspace, and thus fast convergence can be guaranteed in a local sense. 
Also in application to compressed sensing, Garg and Khandekar [4] showed that a thresholded gradient algorithm converges rapidly up to some tolerance; we discuss this result in more detail following our Corollary 2 on this special case of sparse linear models.

Unfortunately, for general convex programs with only Lipschitz conditions, the best convergence rates in a global sense using first-order methods are sublinear. Much faster global rates---in particular, at a linear or geometric rate---can be achieved if global regularity conditions like strong convexity and smoothness are imposed [11]. However, a challenging aspect of statistical estimation in high dimensions is that the underlying optimization problems can never be globally strongly convex when d > n in typical cases (since the d × d Hessian matrix is rank-deficient), and global smoothness conditions cannot hold when d/n → +∞.

In this paper, we analyze a simple variant of the composite gradient method due to Nesterov [12] in application to the optimization problems that underlie regularized M-estimators. Our main contribution is to establish a form of global geometric convergence for this algorithm that holds for a broad class of high-dimensional statistical problems. We do so by leveraging the notion of restricted strong convexity, used in recent work by Negahban et al. [8] to derive various bounds on the statistical error in high-dimensional estimation. Our analysis consists of two parts. We first establish that for optimization problems underlying such M-estimators, appropriately modified notions of restricted strong convexity (RSC) and smoothness (RSM) suffice to establish global linear convergence of a first-order method. 
Our second contribution is to prove that for the iterates generated by our first-order method, these RSC/RSM assumptions do indeed hold with high probability for a broad class of statistical models, among them sparse linear regression, group-sparse regression, matrix completion, and estimation in generalized linear models. We note in passing that our notion of RSC is related to but slightly different from its previous use for bounding statistical error [8], and hence we cannot use these existing results directly.

An interesting aspect of our results is that we establish global geometric convergence only up to the statistical precision of the problem, meaning the typical Euclidean distance $\|\hat{\theta} - \theta^*\|$ between the true parameter $\theta^*$ and the estimate $\hat{\theta}$ obtained by solving the optimization problem. Note that this is very natural from the statistical perspective, since it is the true parameter $\theta^*$ itself (as opposed to the solution $\hat{\theta}$ of the M-estimator) that is of primary interest, and our analysis allows us to approach it as closely as is statistically possible. Overall, our results reveal an interesting connection between the statistical and computational properties of M-estimators---that is, the properties of the underlying statistical model that make it favorable for estimation also render it more amenable to optimization procedures.

The remainder of the paper is organized as follows. In the following section, we give a precise description of the M-estimators considered here, provide definitions of restricted strong convexity and smoothness, and describe their link to the notion of statistical precision. Section 3 gives a statement of our main result, as well as its corollaries when specialized to various statistical models. Section 4 provides some simulation results that confirm the accuracy of our theoretical predictions. 
Due to space constraints, we refer the reader to the full-length version of our paper for technical details.

2 Problem formulation and optimization algorithm

In this section, we begin by describing the class of regularized M-estimators to which our analysis applies, as well as the optimization algorithms that we analyze. Finally, we describe the assumptions that underlie our main result.

A class of regularized M-estimators: Given a random variable $Z \sim \mathbb{P}$ taking values in some set $\mathcal{Z}$, let $Z_1^n = \{Z_1, \ldots, Z_n\}$ be a collection of n observations drawn i.i.d. from $\mathbb{P}$. Assuming that $\mathbb{P}$ lies within some indexed family $\{\mathbb{P}_\theta, \theta \in \Omega\}$, the goal is to recover an estimate of the unknown true parameter $\theta^* \in \Omega$ generating the data. In order to do so, we consider the regularized M-estimator
$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \Omega} \big\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\}, \qquad (1)$$
where $\mathcal{L} : \Omega \times \mathcal{Z}^n \mapsto \mathbb{R}$ is a loss function, and $\mathcal{R} : \Omega \mapsto \mathbb{R}_+$ is a non-negative regularizer on the parameter space. Throughout this paper, we assume that the loss function $\mathcal{L}$ is convex and differentiable, and that the regularizer $\mathcal{R}$ is a norm. In order to assess the quality of an estimate, we measure the error $\|\hat{\theta}_{\lambda_n} - \theta^*\|$ in some norm induced by an inner product $\langle \cdot, \cdot \rangle$ on the parameter space. Typical choices are the standard Euclidean inner product and $\ell_2$-norm for vectors; the trace inner product and the Frobenius norm for matrices; and the $L^2(\mathbb{P})$ inner product and norm for non-parametric regression. As described in more detail in Section 3.2, a variety of estimators---among them the Lasso, structured non-parametric regression in RKHS, and low-rank matrix recovery---can be cast in this form (1). 
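As a concrete instance of the general form (1), the Lasso objective can be written in a few lines of Python. This is our own illustrative sketch, not code from the paper; the function name `objective` and the array-based interface are assumptions.

```python
import numpy as np

def objective(theta, X, y, lam):
    """Regularized M-estimator (1) for the Lasso:
    squared loss L(theta; Z_1^n) plus l1 penalty lam * R(theta)."""
    n = X.shape[0]
    loss = np.sum((y - X @ theta) ** 2) / (2 * n)   # L(theta; Z_1^n)
    return loss + lam * np.linalg.norm(theta, 1)    # + lambda_n * R(theta)
```

Here `lam` plays the role of $\lambda_n$; minimizing this function subject to the ball constraint $\theta \in B_{\mathcal{R}}(\rho)$ recovers the constrained Lasso analyzed in Section 3.2.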
When the data $Z_1^n$ are clear from the context, we frequently use the shorthand $\mathcal{L}(\cdot)$ for $\mathcal{L}(\cdot; Z_1^n)$.

Composite objective minimization: In general, we expect the loss function $\mathcal{L}$ to be differentiable, while the regularizer $\mathcal{R}$ can be non-differentiable. Nesterov [12] proposed a simple first-order method to exploit this type of structure, and our focus is a slight variant of this procedure. In particular, given some initialization $\theta^0 \in \Omega$, consider the update for $t = 0, 1, 2, \ldots$,
$$\theta^{t+1} = \arg\min_{\theta \in B_{\mathcal{R}}(\rho)} \big\{ \langle \nabla\mathcal{L}(\theta^t), \theta \rangle + \lambda_n \mathcal{R}(\theta) + \tfrac{\gamma_u}{2} \|\theta - \theta^t\|^2 \big\}, \qquad (2)$$
where $\gamma_u > 0$ is a parameter related to the smoothness of the loss function, and
$$B_{\mathcal{R}}(\rho) := \big\{ \theta \in \Omega \mid \mathcal{R}(\theta) \le \rho \big\} \qquad (3)$$
is the ball of radius $\rho$ in the norm defined by the regularizer. The only difference from Nesterov's method is the additional constraint $\theta \in B_{\mathcal{R}}(\rho)$, which is required for control of early iterates in the high-dimensional setting. Parts of our theory apply to arbitrary choices of the radius $\rho$; for obtaining results that are statistically order-optimal, a setting $\rho = \Theta(\mathcal{R}(\theta^*))$ with $\theta^* \in B_{\mathcal{R}}(\rho)$ is sufficient, so that fairly conservative upper bounds on $\mathcal{R}(\theta^*)$ are adequate.

Structural conditions in high dimensions: It is known that under global smoothness and strong convexity assumptions, the procedure (2) enjoys a globally geometric convergence rate, meaning that there is some $\alpha \in (0, 1)$ such that $\|\theta^t - \hat{\theta}\| = O(\alpha^t)$ for all iterations $t = 0, 1, 2, \ldots$ (e.g., see Theorem 5 in Nesterov [12]). Unfortunately, in the high-dimensional setting (d > n), it is usually impossible to guarantee strong convexity of the problem (1) in a global sense. 
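For concreteness, when $\mathcal{R}$ is the $\ell_1$-norm and the ball constraint is inactive (effectively $\rho = \infty$), the update (2) has a closed form: soft-thresholding a gradient step. The sketch below is our own illustration under that simplification; handling a finite radius $\rho$ additionally requires adjusting the threshold level so that the constraint is satisfied.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_gradient_step(theta_t, grad, lam, gamma_u):
    """One update of the form (2) with R = l1 and rho = infinity:
    argmin_theta <grad, theta> + lam*||theta||_1
                 + (gamma_u/2)*||theta - theta_t||^2."""
    return soft_threshold(theta_t - grad / gamma_u, lam / gamma_u)
```

Completing the square shows the minimizer is the prox of $(\lambda_n/\gamma_u)\|\cdot\|_1$ evaluated at the gradient step $\theta^t - \nabla\mathcal{L}(\theta^t)/\gamma_u$, which is exactly the soft-thresholding above.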
For instance, when the data are drawn i.i.d., the loss function consists of a sum of n terms. The resulting d × d Hessian matrix $\nabla^2 \mathcal{L}(\theta; Z_1^n)$ is often a sum of n rank-one terms, and hence rank-degenerate whenever n < d. However, as we show in this paper, in order to obtain fast convergence rates for an optimization method, it is sufficient that (a) the objective is strongly convex and smooth in a restricted set of directions, and (b) the algorithm approaches the optimum $\hat{\theta}$ only along these directions.

Let us now formalize this intuition. Consider the first-order Taylor series expansion of the loss function around the point $\theta_0$ in the direction of $\theta$:
$$\mathcal{T}_{\mathcal{L}}(\theta; \theta_0) := \mathcal{L}(\theta) - \mathcal{L}(\theta_0) - \langle \nabla\mathcal{L}(\theta_0), \theta - \theta_0 \rangle. \qquad (4)$$

Definition 1 (Restricted strong convexity (RSC)). We say the loss function $\mathcal{L}$ satisfies the RSC condition with strictly positive parameters $(\gamma_\ell, \kappa_\ell, \delta)$ if
$$\mathcal{T}_{\mathcal{L}}(\theta; \theta_0) \ge \frac{\gamma_\ell}{2} \|\theta - \theta_0\|^2 - \kappa_\ell \delta^2 \quad \text{for all } \theta, \theta_0 \in B_{\mathcal{R}}(\rho). \qquad (5)$$

In order to gain intuition for this definition, first consider the degenerate setting $\delta = \kappa_\ell = 0$. In this case, imposing the condition (5) for all $\theta \in \Omega$ is equivalent to the usual definition of strong convexity on the optimization set. In contrast, when the pair $(\delta, \kappa_\ell)$ are strictly positive, the condition (5) applies only to a limited set of vectors. In particular, when $\theta_0$ is set equal to the optimum $\hat{\theta}$, and we assume that $\theta$ belongs to the set
$$C := B_{\mathcal{R}}(\rho) \cap \big\{ \theta \in \Omega \mid \|\theta - \hat{\theta}\|^2 \ge \tfrac{4\kappa_\ell}{\gamma_\ell} \delta^2 \big\},$$
then condition (5) implies that $\mathcal{T}_{\mathcal{L}}(\theta; \hat{\theta}) \ge \frac{\gamma_\ell}{4} \|\theta - \hat{\theta}\|^2$ for all $\theta \in C$. 
Thus, for any feasible $\theta$ that is not too close to the optimum $\hat{\theta}$, we are guaranteed strong convexity in the direction $\theta - \hat{\theta}$.

We now specify an analogous notion of restricted smoothness:

Definition 2 (Restricted smoothness (RSM)). We say the loss function $\mathcal{L}$ satisfies the RSM condition with strictly positive parameters $(\gamma_u, \kappa_u, \delta)$ if
$$\mathcal{T}_{\mathcal{L}}(\theta; \hat{\theta}) \le \frac{\gamma_u}{2} \|\theta - \hat{\theta}\|^2 + \kappa_u \delta^2 \quad \text{for all } \theta \in B_{\mathcal{R}}(\rho). \qquad (6)$$

Note that the tolerance parameter $\delta$ is the same as that in the definition (5). The additional term $\kappa_u \delta^2$ is not present in analogous smoothness conditions in the optimization literature, but it is essential in our set-up.

Loss functions and statistical precision: In order for these definitions to be sensible and of practical interest, it remains to clarify two issues. First, for what types of loss function and regularizer pairs can we expect RSC/RSM to hold? Second, what is the smallest tolerance $\delta$ with which they can hold? Past work by Negahban et al. [8] has introduced the class of decomposable regularizers; it includes various regularizers frequently used in M-estimation, among them $\ell_1$-norm regularization, block-sparse regularization, nuclear norm regularization, and various combinations of such norms. Negahban et al. [8] showed that versions of RSC with respect to $\theta^*$ hold for suitable loss functions combined with a decomposable regularizer. The definition of RSC given here is related but slightly different: instead of control in a neighborhood of the true parameter $\theta^*$, we need control over the iterates of an algorithm approaching the optimum $\hat{\theta}$. 
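To make the Taylor-error quantity (4) concrete, the following sketch (our own illustration, using the least-squares loss of Section 3.2) computes $\mathcal{T}_{\mathcal{L}}(\theta; \theta_0)$ directly from its definition. For this particular loss the quantity equals $\|X(\theta - \theta_0)\|_2^2 / (2n)$ exactly, so RSC/RSM amount to restricted eigenvalue-type bounds on the sample covariance $X^T X / n$.

```python
import numpy as np

def taylor_error(theta, theta0, X, y):
    """T_L(theta; theta0) from (4), for the squared loss
    L(t) = ||y - X t||^2 / (2n)."""
    n = X.shape[0]
    L = lambda t: np.sum((y - X @ t) ** 2) / (2 * n)
    grad0 = X.T @ (X @ theta0 - y) / n          # gradient of L at theta0
    return L(theta) - L(theta0) - grad0 @ (theta - theta0)
```

Checking the RSC inequality (5) empirically then reduces to comparing this value against $\frac{\gamma_\ell}{2}\|\theta - \theta_0\|^2 - \kappa_\ell \delta^2$ over directions of interest.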
Nonetheless, it can also be shown that our form of RSC (and also RSM) holds with high probability for decomposable regularizers, and this fact underlies the corollaries stated in Section 3.2.

With regards to the choice of tolerance parameter $\delta$, as our results will clarify, it makes little sense to be concerned with choices that are substantially smaller than the statistical precision of the model. There are various ways in which statistical precision can be defined; one natural one is $\epsilon_{\mathrm{stat}}^2 := \mathbb{E}[\|\hat{\theta}_{\lambda_n} - \theta^*\|^2]$, where the expectation is taken over the randomness in the data-dependent loss function.1 The statistical precision of various M-estimators under high-dimensional scaling is now relatively well understood, and in the sequel, we will encounter various models for which the RSM/RSC conditions hold with tolerance equal to the statistical precision.

3 Global geometric convergence and its consequences

In this section, we first state the main result of our paper, and discuss some of its consequences. We illustrate its application to several statistical models in Section 3.2.

1As written, statistical precision also depends on the choice of $\lambda_n$, but our theory will involve specific choices of $\lambda_n$ that are order-optimal.

3.1 Guarantee of geometric convergence

Recall that $\hat{\theta}_{\lambda_n}$ denotes any optimal solution to the problem (1). Our main theorem guarantees that if the RSC/RSM conditions hold with tolerance $\delta$, then the algorithm (2) is guaranteed to have a geometric rate of convergence up to this tolerance. The theorem statement involves the objective function $\phi(\theta) = \mathcal{L}(\theta) + \lambda_n \mathcal{R}(\theta)$.

Theorem 1 (Geometric convergence). Suppose that the loss function satisfies conditions (RSC) and (RSM) with a tolerance $\delta$ and parameters $(\gamma_\ell, \gamma_u, \kappa_\ell, \kappa_u)$. 
Then the sequence $\{\theta^t\}_{t=0}^{\infty}$ generated by the updates (2) satisfies
$$\|\theta^t - \hat{\theta}\|^2 \le c_0 \Big(1 - \frac{\gamma_\ell}{4\gamma_u}\Big)^t + c_1 \delta^2 \quad \text{for all } t = 0, 1, 2, \ldots, \qquad (7)$$
where $c_0 := \frac{2(\phi(0) - \phi(\hat{\theta}))}{\gamma_\ell}$, and $c_1 := \frac{8\gamma_u}{\gamma_\ell^2} \big( \frac{4\gamma_\ell \kappa_\ell}{\gamma_u} + \kappa_u \big)$.

Remarks: Note that the bound (7) consists of two terms: the first term decays exponentially fast with the contraction coefficient $\alpha := 1 - \frac{\gamma_\ell}{4\gamma_u}$. The second term is an additive offset, which becomes relevant only for t large enough that $\|\theta^t - \hat{\theta}\|^2 = O(\delta^2)$. Thus, the result guarantees a globally geometric rate of convergence up to the tolerance parameter $\delta$. Previous work has focused primarily on the case of sparse linear regression. For this special case, certain methods are known to be globally convergent at sublinear rates (e.g., [2]), meaning rates of the type $O(1/t^2)$. The geometric rate of convergence guaranteed by Theorem 1 is exponentially faster. Other work on sparse regression [3, 5] has provided geometric rates of convergence that hold once the iterates are close to the optimum. In contrast, Theorem 1 guarantees geometric convergence as long as the iterates are not too close to the optimum $\hat{\theta}$.

In Section 3.2, we describe a number of concrete models for which the (RSC) and (RSM) conditions hold with $\delta \asymp \epsilon_{\mathrm{stat}}$, which leads to the following result.

Corollary 1. Suppose that the loss function satisfies conditions (RSC) and (RSM) with tolerance $\delta = O(\epsilon_{\mathrm{stat}})$ and parameters $(\gamma_\ell, \gamma_u, \kappa_\ell, \kappa_u)$. 
Then
$$T = O\Big( \frac{\log(1/\epsilon_{\mathrm{stat}})}{\log\big(4\gamma_u/(4\gamma_u - \gamma_\ell)\big)} \Big) \qquad (8)$$
steps of the updates (2) ensure that $\|\theta^T - \theta^*\|^2 = O(\epsilon_{\mathrm{stat}}^2)$.

In the setting of statistical recovery, since the true parameter $\theta^*$ is of primary interest, there is little point in optimizing to a tolerance beyond the statistical precision. To the best of our knowledge, this result---where fast convergence happens while the optimization error is larger than the statistical precision---is the first of its type, and makes for an interesting contrast with other local convergence results.

3.2 Consequences for specific statistical models

We now consider the consequences of Theorem 1 for some specific statistical models. In contrast to the previous deterministic results, these corollaries hold with high probability.

Sparse linear regression: First, we consider the case of sparse least-squares regression. Given an unknown regression vector $\theta^* \in \mathbb{R}^d$, suppose that we make n i.i.d. observations of the form $y_i = \langle x_i, \theta^* \rangle + w_i$, where $w_i$ is zero-mean noise. For this model, each observation is of the form $Z_i = (x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$. In a variety of applications, it is natural to assume that $\theta^*$ is sparse. For a parameter $q \in [0, 1]$ and radius $R_q > 0$, let us define the $\ell_q$ "ball"
$$B_q(R_q) := \big\{ \theta \in \mathbb{R}^d \mid \sum_{j=1}^{d} |\theta_j|^q \le R_q \big\}. \qquad (9)$$
Note that q = 0 corresponds to the case of "hard sparsity", for which any vector $\theta \in B_0(R_0)$ is supported on a set of cardinality at most $R_0$. 
For $q \in (0, 1]$, membership in $B_q(R_q)$ enforces a decay rate on the ordered coefficients, thereby modelling approximate sparsity.

In order to estimate the unknown regression vector $\theta^* \in B_q(R_q)$, we consider the usual Lasso program, with the quadratic loss function $\mathcal{L}(\theta; Z_1^n) := \frac{1}{2n} \sum_{i=1}^{n} (y_i - \langle x_i, \theta \rangle)^2$ and the $\ell_1$-norm regularizer $\mathcal{R}(\theta) := \|\theta\|_1$. We consider the Lasso in application to a random design model, in which each predictor vector $x_i \sim N(0, \Sigma)$; we assume that $\max_{j=1,\ldots,d} \Sigma_{jj} \le 1$ for standardization, and that the condition number $\kappa(\Sigma)$ is finite.

Corollary 2 (Sparse vector recovery). Suppose that the observation noise $w_i$ is zero-mean and sub-Gaussian with parameter $\sigma$, that $\theta^* \in B_q(R_q)$, and that we use the Lasso program with $\lambda_n = 2\sigma\sqrt{\frac{\log d}{n}}$. Then there are universal positive constants $c_i$, $i = 0, 1, 2, 3$, such that with probability at least $1 - \exp(-c_3 n \lambda_n^2)$, the iterates (2) with $\rho^2 = \Theta\big(\sigma^2 R_q (\tfrac{n}{\log d})^{q/2}\big)$ satisfy
$$\|\theta^t - \hat{\theta}\|_2^2 \le c_0 \Big(1 - \frac{c_2}{\kappa(\Sigma)}\Big)^t + \underbrace{c_1 \sigma^2 R_q \Big(\frac{\log d}{n}\Big)^{1 - q/2}}_{\epsilon_{\mathrm{stat}}^2} \quad \text{for all } t = 0, 1, 2, \ldots. \qquad (10)$$

It is worth noting that the form of the statistical error $\epsilon_{\mathrm{stat}}$ given in the bound (10) is known to be minimax-optimal up to constant factors [13]. In related work, Garg and Khandekar [4] showed that for the special case of design matrices that satisfy the restricted isometry property (RIP), a thresholded gradient method has geometric convergence up to the tolerance $\|w\|_2/\sqrt{n} \approx \sigma$. However, this tolerance is independent of the sample size, and far larger than the statistical error $\epsilon_{\mathrm{stat}}$ if $n > \log d$; moreover, severe conditions like RIP are not needed to ensure fast convergence. 
In particular, Corollary 2 guarantees geometric convergence up to $\epsilon_{\mathrm{stat}}$ for many random matrices that violate RIP. The proof of Corollary 2 involves exploiting some random matrix theory results [14] in order to verify that the RSC/RSM conditions hold with high probability (see the full-length version for details).

Matrix regression with rank constraints: For a pair of matrices $A, B \in \mathbb{R}^{m \times m}$, we use $\langle\langle A, B \rangle\rangle = \mathrm{trace}(A^T B)$ to denote the trace inner product. Suppose that we are given n i.i.d. observations of the form $y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i$, where $w_i$ is zero-mean noise with variance $\sigma^2$, and $X_i \in \mathbb{R}^{m \times m}$ is an observation matrix. The parameter space is $\Omega = \mathbb{R}^{m \times m}$ and each observation is of the form $Z_i = (X_i, y_i) \in \mathbb{R}^{m \times m} \times \mathbb{R}$. In many contexts, it is natural to assume that $\Theta^*$ is exactly or approximately low rank; applications include collaborative filtering and matrix completion [7, 15], compressed sensing [16], and multitask learning [19, 10, 17]. In order to model such behavior, we let $\sigma(\Theta^*) \in \mathbb{R}^m$ denote the vector of singular values of $\Theta^*$ (padded with zeros as necessary), and impose the constraint $\sigma(\Theta^*) \in B_q(R_q)$. We then consider the M-estimator based on the quadratic loss $\mathcal{L}(\Theta; Z_1^n) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \langle\langle X_i, \Theta \rangle\rangle)^2$ combined with the nuclear norm $\mathcal{R}(\Theta) = \|\sigma(\Theta)\|_1$ as the regularizer.

Various problems can be cast within this framework of matrix regression:
• Matrix completion: In this case, the observation $y_i$ is a noisy version of a randomly selected entry $\Theta^*_{a(i), b(i)}$ of the unknown matrix. It is a special case with $X_i = E_{a(i) b(i)}$, the matrix with a one in position $(a(i), b(i))$ and zeros elsewhere.
• Compressed sensing: In this case, the observation matrices $X_i$ are dense, drawn from some random ensemble, the simplest being $X_i \in \mathbb{R}^{m \times m}$ with i.i.d. 
$N(0, 1)$ entries.
• Multitask regression: In this case, the matrix $\Theta^*$ is likely to be non-square, with the column size $m_2$ corresponding to the dimension of the response variable, and $m_1$ to the number of predictors. Imposing a low-rank constraint on $\Theta^*$ is equivalent to requiring that the regression vectors (the columns of the matrix) lie close to a lower-dimensional subspace. See the papers [10, 17] for more details on reformulating this problem as an instance of matrix regression.

For each of these problems, it is possible to show that suitable forms of the RSC/RSM conditions hold with high probability. For the case of matrix completion, the paper [9] establishes a form of RSC useful for controlling statistical error; this argument can be suitably modified to establish the related notions of RSC/RSM required for ensuring fast algorithmic convergence. Similar statements apply to the settings of compressed sensing and multi-task regression. For these matrix regression problems, consider the statistical precision
$$\epsilon_{\mathrm{mat}}^2 \asymp \begin{cases} R_q \big(\frac{m \log m}{n}\big)^{1 - q/2} & \text{for matrix completion,} \\ R_q \big(\frac{m}{n}\big)^{1 - q/2} & \text{otherwise,} \end{cases}$$
rates that (up to logarithmic factors) are known to be minimax-optimal [9, 17]. As dictated by this statistical theory, the regularization parameter should be chosen as $\lambda_n = c\sigma\sqrt{\frac{m \log m}{n}}$ for matrix completion, and $\lambda_n = c\sigma\sqrt{\frac{m}{n}}$ otherwise, where c > 0 is a universal positive constant. The following result applies to matrix regression problems for which the RSC/RSM conditions hold with tolerance $\delta = \epsilon_{\mathrm{mat}}$.

Corollary 3 (Low-rank matrix recovery). Suppose that $\sigma(\Theta^*) \in B_q(R_q)$, and the observation noise is zero-mean $\sigma$-sub-Gaussian. 
Then there are universal positive constants $c_1, c_2, c_3$ such that with probability at least $1 - \exp(-c_3 n \lambda_n^2)$, the iterates (2) with $\rho = \Theta\big(\frac{\epsilon_{\mathrm{mat}}}{\lambda_n}\big)$ satisfy
$$|||\Theta^t - \Theta^*|||_F^2 \le c_0 \nu^t + c_1 \epsilon_{\mathrm{mat}}^2 \quad \text{for all } t = 0, 1, 2, \ldots.$$
Here the contraction coefficient $\nu \in (0, 1)$ is a universal constant, independent of $(n, m, R_q)$, depending only on the parameters $(\gamma_\ell, \gamma_u)$. We refer the reader to the full-length version for the specific form it takes for different variants of matrix regression.

4 Simulations

In this section, we provide some experimental results that confirm the accuracy of our theoretical predictions. In particular, these results verify the predicted linear rates of convergence under the conditions of Corollaries 2 and 3.

Sparse regression: We consider a random ensemble of problems, in which each design vector $x_i \in \mathbb{R}^d$ is generated i.i.d. according to the recursion $x_i(1) = z_1$ and $x_i(j) = z_j + \upsilon x_i(j-1)$ for $j = 2, \ldots, d$, where the $z_j$ are $N(0, 1)$, and $\upsilon \in [0, 1)$ is a correlation parameter. The singular values of the resulting covariance matrix $\Sigma$ satisfy the bounds $\sigma_{\min}(\Sigma) \ge 1/(1 + \upsilon)^2$ and $\sigma_{\max}(\Sigma) \le \frac{2}{(1 - \upsilon)^2 (1 + \upsilon)}$. Note that $\Sigma$ has a finite condition number for all $\upsilon \in [0, 1)$; for $\upsilon = 0$, it is the identity, but it becomes ill-conditioned as $\upsilon \to 1$. We recall that in this setting $y_i = \langle x_i, \theta^* \rangle + w_i$, where $w_i \sim N(0, 1)$ and $\theta^* \in B_q(R_q)$. We study the convergence properties for sample sizes $n = \alpha s \log d$ using different values of $\alpha$. We note that the per-iteration cost of our algorithm is $n \times d$. 
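The design-generation recursion just described can be sketched as follows; this is our own code, and the function name and interface are assumptions, not part of the paper.

```python
import numpy as np

def correlated_design(n, d, upsilon, rng):
    """Generate n design vectors via the recursion
    x(1) = z_1,  x(j) = z_j + upsilon * x(j-1)  for j = 2, ..., d,
    with z_j i.i.d. N(0, 1)."""
    Z = rng.normal(size=(n, d))
    X = np.empty((n, d))
    X[:, 0] = Z[:, 0]
    for j in range(1, d):
        X[:, j] = Z[:, j] + upsilon * X[:, j - 1]
    return X
```

For $\upsilon = 0$ the columns are i.i.d. standard normal, matching $\Sigma = I$; for larger $\upsilon$, the eigenvalues of the empirical covariance $X^T X / n$ can be checked against the stated bounds on $\sigma_{\min}(\Sigma)$ and $\sigma_{\max}(\Sigma)$.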
All our results are averaged over 10 random trials.

Our first experiment is based on taking the correlation parameter $\upsilon = 0$, and the $\ell_q$-ball parameter $q = 0$, corresponding to exact sparsity. We then measure convergence rates for $\alpha \in \{1, 1.25, 5, 25\}$ with $d = 40000$ and $s = (\log d)^2$. As shown in Figure 1(a), the procedure fails to converge for $\alpha = 1$: with this setting, the sample size n is too small for conditions (RSC) and (RSM) to hold, so that a constant step size leads to oscillations. For $\alpha$ sufficiently large to ensure RSC/RSM, we observe geometric convergence of the error $\|\theta^t - \hat{\theta}\|_2$, and the convergence rate is faster for $\alpha = 25$ than for $\alpha = 5$, since the RSC/RSM constants improve with larger sample size.

On the other hand, we expect the convergence rates to be slower when the condition number of $\Sigma$ is worse; to address this issue, we ran the same set of experiments with the correlation parameter $\upsilon = 0.5$. As shown in Figure 1(b), in sharp contrast to the case $\upsilon = 0$, we no longer observe geometric convergence for $\alpha = 1.25$, since the conditioning of $\Sigma$ with $\upsilon = 0.5$ is much poorer than that of the identity matrix. Finally, we also expect optimization to become harder as the sparsity parameter $q \in [0, 1]$ is increased away from zero. For larger q, larger sample sizes are required to verify the RSC/RSM conditions. Figure 1(c) shows that even with $\upsilon = 0$, setting $\alpha = 5$ is required for geometric convergence.

Low-rank matrices: We also performed experiments with two different versions of low-rank matrix regression, each with $m^2 = 160^2$. The first setting is a version of compressed sensing with matrices $X_i \in \mathbb{R}^{160 \times 160}$ with i.i.d. 
N(0, 1) entries, and we set q = 0 and formed a matrix Θ^* with rank R_0 = ⌈log m⌉.

[Figure 1: Plot of the log optimization error log(‖θ^t − θ̂‖_2) versus iteration number for the sparse linear regression problem, with d = 40000, s = (log d)^2, and n = αs log d, for α ∈ {1, 1.25, 5, 25}. Panel (a): exact sparsity (q = 0) with Σ = I (i.e., υ = 0). Panel (b): non-identity covariance with υ = 0.5. Panel (c): υ = 0 and q = 1.]

We then performed a series of trials with sample size n = αR_0 m, with the parameter α ∈ {1, 5, 25}. The per-iteration cost in this case is n × m^2. As seen in Figure 2(a), the general behavior of the convergence rates in this problem is the same as for the sparse linear regression problem: the method fails to converge when α is too small, and converges geometrically (with a progressively faster rate) as α increases. Figure 2(b) shows that matrix completion also enjoys geometric convergence, for both exactly low-rank (q = 0) and approximately low-rank matrices.

[Figure 2: (a) Plot of the log Frobenius error log(|||Θ^t − Θ̂|||_F) versus iteration number for matrix compressed sensing with matrix size m = 160, rank R_0 = ⌈log(160)⌉, and sample sizes n = αR_0 m. For α = 1 the algorithm oscillates, whereas geometric convergence is obtained for α ∈ {5, 25}, consistent with the theoretical prediction. (b) Plot of the log Frobenius error log(|||Θ^t − Θ̂|||_F) versus iteration number for matrix completion with approximately low-rank matrices (q ∈ {0, 0.5, 1}), showing geometric convergence.]

5 Discussion

We have shown that even though high-dimensional M-estimators in statistics are neither strongly convex nor smooth, simple first-order methods can still enjoy global guarantees of geometric convergence. The key insight is that strong convexity and smoothness need only hold in restricted senses, and moreover, these restricted conditions are satisfied with high probability for many statistical models and decomposable regularizers used in practice. Examples include sparse linear regression with ℓ1-regularization, various statistical models with group-sparse regularization, and matrix regression with nuclear norm constraints.
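The geometric convergence observed in the sparse-regression experiments can be reproduced with a plain (non-accelerated) composite gradient method, i.e. ISTA with a soft-thresholding prox step. The sketch below is an illustrative small-scale simulation, not the authors' code: the problem sizes (d = 500 rather than 40000), noise level, and regularization choice λ = 2σ√(log d / n) are assumptions made here for speed.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, step, iters):
    """Composite gradient (ISTA) iterations for the Lasso:
    theta^{t+1} = soft_threshold(theta^t - step * grad, step * lam)."""
    n, d = X.shape
    theta = np.zeros(d)
    history = [theta.copy()]
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
        history.append(theta.copy())
    return theta, history

rng = np.random.default_rng(0)
d = 500                                  # illustrative small-scale choice
s = int(np.ceil(np.log(d) ** 2))         # s = (log d)^2, as in the experiments
alpha = 5                                # sample-size multiplier n = alpha * s * log d
n = int(alpha * s * np.log(d))
X = rng.standard_normal((n, d))          # Sigma = I, i.e. upsilon = 0
theta_star = np.zeros(d)
theta_star[:s] = rng.standard_normal(s)  # exactly s-sparse, i.e. q = 0
sigma = 0.1                              # assumed noise level
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(d) / n)          # standard Lasso regularization level
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]  # constant step 1/L
theta_hat, hist = ista(X, y, lam, step, iters=500)

# Optimization error ||theta^t - theta_hat||_2 over the first 100 iterations
errs = [np.linalg.norm(th - theta_hat) for th in hist[:100]]
```

With α = 5 the sample size is large enough for the restricted conditions to hold with high probability, and the error trace in `errs` decays at a geometric rate; rerunning with `alpha = 1` (so n < d) typically yields much slower convergence, mirroring Figure 1(a).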
Overall, our results highlight that the properties of M-estimators favorable for fast rates in a statistical sense can also be used to establish fast rates for optimization algorithms.

Acknowledgements: AA, SN, and MJW were partially supported by grant AFOSR-09NL184; SN and MJW acknowledge additional funding from NSF-CDI-0941742.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[2] S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. Technical report, Stanford University, 2009.

[3] K. Bredies and D. A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14:813–837, 2008.

[4] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, New York, NY, USA, 2009. ACM.

[5] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.

[6] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In International Conference on Machine Learning, New York, NY, USA, 2009. ACM.

[7] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Technical Report UILU-ENG-09-2214, Univ. Illinois, Urbana-Champaign, July 2009.

[8] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS Conference, Vancouver, Canada, December 2009. Full-length version arXiv:1010.2731v1.

[9] S. Negahban and M. J. Wainwright.
Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Technical report, UC Berkeley, August 2010. arXiv:1009.2118.

[10] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, to appear. Originally posted as arXiv:0912.5100.

[11] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, New York, 2004.

[12] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.

[13] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. Technical Report arXiv:0910.2042, UC Berkeley, Department of Statistics, 2009.

[14] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue conditions for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, August 2010.

[15] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 2010. Posted as arXiv:0910.0651v2.

[16] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[17] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. Technical Report arXiv:0912.5338v2, Université de Paris, January 2010.

[18] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, December 2007.

[19] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression.
Journal of the Royal Statistical Society, Series B, 69(3):329–346, 2007.