{"title": "Support Recovery for Orthogonal Matching Pursuit: Upper and Lower bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 10814, "page_last": 10824, "abstract": "This paper studies the problem of sparse regression where the goal is to learn a sparse vector that best optimizes a given objective function. Under the assumption that the objective function satisfies restricted strong convexity (RSC), we analyze orthogonal matching pursuit (OMP), a greedy algorithm that is used heavily in applications, and obtain support recovery result as well as a tight generalization error bound for OMP. Furthermore, we obtain lower bounds for OMP, showing that both our results on support recovery and generalization error are tight up to logarithmic factors. To the best of our knowledge, these support recovery and generalization bounds are the first such matching upper and lower bounds (up to logarithmic factors) for {\\em any} sparse regression algorithm under the RSC assumption.", "full_text": "Support Recovery for Orthogonal Matching Pursuit:\n\nUpper and Lower bounds\n\nRaghav Somani\u2217\n\nMicrosoft Research, India\nt-rasom@microsoft.com\n\nChirag Gupta\u2217\u2020\n\nMachine Learning Department,\n\nCarnegie Mellon University\nchiragg@andrew.cmu.edu\n\nPrateek Jain\n\nMicrosoft Research, India\nprajain@microsoft.com\n\nPraneeth Netrapalli\n\nMicrosoft Research, India\npraneeth@microsoft.com\n\nAbstract\n\nWe study the problem of sparse regression where the goal is to learn a sparse\nvector that best optimizes a given objective function. Under the assumption that\nthe objective function satis\ufb01es restricted strong convexity (RSC), we analyze\northogonal matching pursuit (OMP), a greedy algorithm that is used heavily in\napplications, and obtain a support recovery result as well as a tight generalization\nerror bound for the OMP estimator. 
Further, we show a lower bound for OMP,\ndemonstrating that both our results on support recovery and generalization error\nare tight up to logarithmic factors. To the best of our knowledge, these are the \ufb01rst\nsuch tight upper and lower bounds for any sparse regression algorithm under the\nRSC assumption.\n\n1\n\nIntroduction\n\nThe goal in sparse regression is to \ufb01nd the optimal sparse vector that minimizes a given objective\nfunction. Sparse regression is an important problem in statistical machine learning since sparse\nmodels lead to better generalization guarantees when the feature dimension is high or data is less,\neg, high-dimensional statistics [19], bioinformatics [18], etc. Sparse models also have a smaller\nmemory footprint and are thus useful for resource-constrained machine learning [9]. For simplicity\nof exposition, we focus on the problem of sparse linear regression (SLR), which is a representative\nproblem in this domain. Results for this problem typically extend easily to the general case. Given\nA \u2208 Rn\u00d7d and y, the goal of SLR is to recover a sparse vector \u00afx that minimizes (cid:107)Ax \u2212 y(cid:107)2\n2.\nThe unconditional version of sparse regression can be shown to be NP-hard via a reduction to 3-set\ncover [14]. However, the problem has been studied heavily under a variety of assumptions such\nas incoherence [7], null-space property [8], restricted isometry property (RIP) or restricted strong\nconvexity (RSC) [4, 15]. 
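As a concrete (and deliberately tiny) illustration of why the unconditional problem is hard, the following sketch solves SLR by exhaustive search over supports; all dimensions, the random data, and the helper `ls_on_support` are illustrative choices, not part of the paper:

```python
import itertools
import numpy as np

# Minimal sketch of the sparse linear regression (SLR) objective.
# Exhaustive search enumerates all (d choose s*) candidate supports,
# which is exactly what makes the unconditional problem intractable
# as d grows. Sizes here are illustrative only.

rng = np.random.default_rng(0)
n, d, s_star = 50, 8, 2

A = rng.standard_normal((n, d))
x_bar = np.zeros(d)
x_bar[[1, 5]] = [1.0, -2.0]     # ground-truth 2-sparse vector
y = A @ x_bar                   # noiseless measurements

def ls_on_support(A, y, S):
    """Least squares restricted to columns in S; returns squared residual."""
    coef, *_ = np.linalg.lstsq(A[:, list(S)], y, rcond=None)
    return np.linalg.norm(A[:, list(S)] @ coef - y) ** 2

# Brute force over all supports of size s* (exponential in general).
best_S = min(itertools.combinations(range(d), s_star),
             key=lambda S: ls_on_support(A, y, S))
```

In the noiseless setting the true support attains zero residual, so the exhaustive search recovers it; the algorithms discussed in this paper aim to avoid this combinatorial search.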
RSC, in particular, is one of the weakest and most popular assumptions for\nsparse regression problems and has been studied in the context of various algorithms [27, 11, 1, 13].\nIn this paper, we study the SLR problem under RSC condition.\nTypically SLR is studied with one of two goals: a) support recovery, i.e., recovering support (or\n\nfeatures) of \u00afx and b) bounding generalization error(cid:0)(cid:107)A(x\u2212\u00afx)(cid:107)2\n\n2/n(cid:1) which bounds excess error on\n\nunseen test points if each row of A is sampled from a \ufb01xed distribution. In general, support recovery\n\n\u2217Equal contribution\n\u2020Work done in part while Chirag Gupta was a Research Fellow at Microsoft Research, India\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fis a more fundamental and challenging problem as a strong support recovery result usually tends to\nprovide strong generalization error bound.\nExisting sparse regression algorithms can be broadly categorized into three categories: a) (cid:96)1 min-\nimization or LASSO based algorithms [6, 5, 1], b) non-convex penalty based methods [2, 11, 13],\nc) greedy methods [22, 17, 27, 12]. In this work, we focus on OMP which is a greedy method that\nincrementally adds elements to support based on the amount of reduction in training error. Owing to\nits simplicity, \ufb02exibility, and strong practical performance, OMP is one of the most celebrated and\npractically used algorithms for sparse regression.\nOMP has been shown to provide support recovery in noiseless settings, i.e., when y = A\u00afx, under\nvarious conditions like incoherence [8], null-space property, RIP/RSC [28] etc. In the noisy setting,\nwhile the generalization error of OMP has been studied [28] under RSC, these bounds do not match\nknown lower bounds [29] in terms of the restricted strong convexity constant. 
In fact, the tightest\nknown generalization error upper bound for polynomial time algorithms is a factor of restricted strong\nconvexity constant worse than the known lower bound [29, 28, 11, 30]. Furthermore, strong support\nrecovery results under RSC are also known only for a non-convex SCAD/MCP penalty based method\n[13]. For greedy methods, there have been several recent works [20, 21, 25] that consider the problem\nof support recovery. However, none of these works give strong results for this problem.\nIn this work, we signi\ufb01cantly improve upon these support recovery results for OMP. We show that\nif the smallest element of \u00afx is larger than an appropriate noise level, then OMP recovers the full\nsupport of \u00afx (see Theorem 3.1). As noted in remarks 3 and 4 below the theorem, our result has a\nbetter dependence on the restricted condition number than the ones in [20, 21, 25]. The proof of\nTheorem 3.1 exploits the fact that if a certain element of \u00afx is not included in the current support set,\nthen a single step of OMP should lead to a large additive decrease in the error. In addition, we present\na generalization error analysis for OMP.\nFinally, we provide matching lower bounds for our support recovery and generalization error results.\nTo this end, we construct a design matrix that ensures that OMP picks incorrect indices until a large\nnumber of elements are added to the support set (see Theorems 4.2, 4.3). As the support set size has\nto increase arbitrarily for recovery, this also implies poor generalization error (see Theorem 4.3).\nWe note that our lower bound results are unconditional and are directly applicable to OMP. In contrast,\nexisting lower bounds such as [29] obtain a lower bound for generalization error of any polynomial\ntime algorithm assuming NP (cid:54)\u2282 P/poly. 
Moreover, these lower bound results are restricted to\nalgorithms which recover exactly s\u2217-sparse vectors, where s\u2217 = |supp(\u00afx)| and hence do not apply to\nOMP if it adds more than s\u2217 elements to the support set, which is the more meaningful scenario to\nconsider. Moreover, if each element of \u00afx is large, then the claim of [29] is almost vacuous as one can\nrecover the support exactly which is the main problem in SLR. In that case, while the generalization\nerror lower bound of [29] holds, it does not preclude the OMP algorithm from recovering the correct\nsupport (see Section 4).\n\nNotation: Matrices are typically written in bold capital letters (such as A and \u03a3), vectors are\ntypically written in bold small case letters (such as x and \u03b7) and universal constants independent\nof problem parameters are written as C1, C2, etc. For a matrix A, Ai represents its ith column\nand AS represent the sub-matrix of A with columns in the index set S. \u03c1+\ns (AT A) are\nrestricted smoothness and restricted strong convexity constants of the matrix A (de\ufb01ned below).\n\ns and (cid:101)\u03bas when used without parameter,\ns (AT A) and(cid:101)\u03bas(AT A) respectively. The non-zero element of \u00afx with the\n\n(cid:101)\u03bas(AT A) := \u03c1+\n\nrepresent \u03c1+\nleast absolute value is denoted as \u00afxmin.\n\ns (AT A), \u03c1\u2212\n\ns (AT A), \u03c1\u2212\n\n1 (AT A)/\u03c1\u2212\n\ns (AT A) for all s > 0. \u03c1+\n\ns , \u03c1\u2212\n\n2 Preliminaries and Setting\n\nIn this section, we will present some preliminaries and the problem setting considered in this paper.\nBroadly, we are interested in sparse estimation problems where we are given a function Q(\u00b7) and we\nwish to solve minx:(cid:107)x(cid:107)0\u2264s\u2217 Q(x). This problem is in general NP-hard even when Q(\u00b7) is a quadratic\nfunction. So, we consider this problem under restricted strong convexity (RSC) and restricted\nsmoothness (RSS) assumptions. 
While part of our results apply to this general setting, for simplicity\nof presentation, we focus on the case where Q(\u00b7) is a quadratic. More concretely, in the sparse linear\nregression problem where we are given a measurement matrix A \u2208 Rn\u00d7d and response y \u2208 Rn\n\n2\n\n\fAlgorithm 1 Orthogonal Matching Pursuit (OMP)\n1: procedure OMP(s)\n2:\n3:\n4:\n\ni rk\u22121|\n\nS0 = \u03c6, x0 = 0, r0 = y\nfor k = 1, 2, . . . , s do\nj \u2190 arg max\n|AT\ni(cid:54)\u2208Sk\u22121\nSk \u2190 Sk\u22121 \u222a {j}\nxk \u2190 arg min\nsupp(x)\u2286Sk\nrk \u2190 y \u2212 Axk\n\n(cid:107)Ax \u2212 y(cid:107)2\n\n2\n\n5:\n6:\n\n7:\nend for\n8:\nreturn xs\n9:\n10: end procedure\n\nand we wish to solve min(cid:107)x(cid:107)0\u2264s\u2217 (cid:107)Ax \u2212 y(cid:107)2\nrestricted strong convexity and restricted smoothness properties [4]:\nDe\ufb01nition 2.1 (Restricted strong convexity (RSC)). A is said to be restricted strongly convex at\nlevel s with parameter \u03c1\u2212\n\ns if for every x and z such that (cid:107)x \u2212 z(cid:107)0 \u2264 s, we have\n\n2. We assume that the measurement matrix A satis\ufb01es\n\nDe\ufb01nition 2.2 (Restricted smoothness (RSS)). A is said to be restricted smooth at level s with\nparameter \u03c1+\n\n2 \u2265 \u03c1\u2212\n\n(cid:107)Ax \u2212 Az(cid:107)2\n\ns (cid:107)x \u2212 z(cid:107)2\n2 .\ns if for every x and z such that (cid:107)z \u2212 x(cid:107)0 \u2264 s, we have\ns (cid:107)x \u2212 z(cid:107)2\n2 .\n\n(cid:107)Ax \u2212 Az(cid:107)2\n\n2 \u2264 \u03c1+\n\nThe above de\ufb01nitions capture the standard strong convexity and smoothness properties but only in\nsparse directions. Similarly, we can de\ufb01ne a notion of restricted condition number.\nDe\ufb01nition 2.3 (Restricted condition number). The restricted condition number at level s of a matrix\nA is de\ufb01ned as\n\n(cid:101)\u03bas(AT A) =\n\n\u03c1+\n1\n\u03c1\u2212\n\ns\n\n.\n\n(2.1)\n\ns , and(cid:101)\u03bas respectively. 
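The definitions above can be made concrete by brute-force computation on a small matrix. The sketch below (illustrative sizes; the helper name `restricted_constants` is ours) computes the RSC/RSS constants as the extreme eigenvalues of A_S^T A_S over all supports of size s, together with the restricted condition number of (2.1):

```python
import itertools
import numpy as np

# Brute-force evaluation of the restricted strong convexity (rho^-_s) and
# restricted smoothness (rho^+_s) constants of Definitions 2.1-2.3.
# Feasible only for tiny d; intended purely to make the definitions concrete.

def restricted_constants(A, s):
    """Extreme eigenvalues of A_S^T A_S over all supports S with |S| = s."""
    d = A.shape[1]
    rho_minus, rho_plus = np.inf, 0.0
    for S in itertools.combinations(range(d), s):
        eigs = np.linalg.eigvalsh(A[:, list(S)].T @ A[:, list(S)])
        rho_minus = min(rho_minus, eigs[0])   # eigvalsh sorts ascending
        rho_plus = max(rho_plus, eigs[-1])
    return rho_minus, rho_plus

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 6))

rho_minus_3, rho_plus_3 = restricted_constants(A, 3)
rho_plus_1 = restricted_constants(A, 1)[1]

# Restricted condition number as in (2.1): rho^+_1 / rho^-_s.
kappa_tilde_3 = rho_plus_1 / rho_minus_3
```

By eigenvalue interlacing, rho^-_s is at most the smallest column norm squared and rho^+_1 is at most rho^+_s, so the computed restricted condition number is always at least 1.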
For our lower bound matrices in Section 4 we show that\n\nThroughout this paper, we assume that A satis\ufb01es the above properties and denote the corresponding\nparameters as \u03c1\u2212\nthese properties are satis\ufb01ed.\nDe\ufb01nition 2.4 ((cid:96)\u221e \u2212 norm). We de\ufb01ne the (cid:96)\u221e \u2212 norm of a matrix A:\n\ns , \u03c1+\n\n(cid:107)A(cid:107)\u221e := max\n(cid:107)x(cid:107)\u221e=1\n\n(cid:107)Ax(cid:107)\u221e\n\n(2.2)\n\nWe work under the generative model where \u00afx is an s\u2217-sparse vector supported on S\u2217, that generates\nthe data. More concretely, we assume that the measurements y are generated as noisy linear\nmeasurements of \u00afx:\n\n(2.3)\nwhere each element of \u03b7 is a mean zero sub-Gaussian random variable with parameter \u03c3. This means\nthat for some constant C, we have,\n\ny = A\u00afx + \u03b7,\n\nP{|\u03b7i| > t} \u2264 C exp(cid:0)\u2212t2/2\u03c32(cid:1) .\n\nThe non-zero element of \u00afx with the least absolute value is denoted as \u00afxmin.\nIn this problem setting, there are two critical questions:\n\n1. Support recovery: The goal here is to recover the support of \u00afx after observing y and\nA. This question can also be posed as estimating \u00afx in the (cid:96)\u221e norm i.e., \ufb01nd \u02c6x such that\n(cid:107)\u02c6x \u2212 \u00afx(cid:107)\u221e is small.\n2. Generalization error: Here, the goal is to compute an \u02c6x such that (cid:107)A(\u02c6x \u2212 \u00afx)(cid:107)2 is small.\nThis quantity is essentially the generalization error when the learned \u02c6x is used to make\nprediction over test data generated from same distribution as training data A and y.\n\n3\n\n\fApart from(cid:101)\u03ba(\u00b7), we also use \u03bas(\u00b7) = \u03c1+\n\nTable 1: Comparison between our results and several prior results on support recovery for Sparse\nLinear Regression. 
HTP refers to Hard Thresholding Pursuit, PHT refers to Partial Hard Thresholding,\nand IHT referes to Iterative Hard Thresholding. These are all thresholding based greedy algorithms.\ns (\u00b7). All values are correct upto constants; we have\nskipped order notation in the interest of succinctness. Support expansion refers to the value of s in\nthe paper. The |\u00afxmin| column refers to the condition for support recovery guarantee. All support\nrecovery happens with some probability \u03b4, and we incur polynomial factors of log(d/\u03b4) in the |\u00afxmin|\ncondition. We skip these in the interest of succinctness.\n\ns (\u00b7)/\u03c1\u2212\n\nRelated Work\n\nSupport expansion (s)\n\nYuan et al. [25] [HTP]\n\nShen et al. [20] [HTP]\n\nShen et al. [21] [PHT(r)]\n\nJain et al. [11] [IHT]\nZhang [28] [OMP]\nTheorem 3.1 [OMP]\n\ns\u2217 + \u03ba2\n\n2ss\u2217\n\u03ba2\n2ss\u2217\n\u03ba2\n2s min{s\u2217, r}\n2s+s\u2217 s\u2217\n(cid:101)\u03bas+s\u2217 s\u2217 log \u03bas+s\u2217\n\u03ba2\n(cid:101)\u03bas+s\u2217 s\u2217 log \u03bas+s\u2217\n\n|\u00afxmin| lower bound\n\n\u03c1+\n1 s\n\n\u03c1+\n1 s\n\n\u03c3\n\n\u03c3\n\n\u03c1\n\n\u03c3\n\n\u221a\n\n\u221a\n\n\u221a\ns\u221a\n\u221a\n\u2212\n2s\n\u03ba2s\n\u2212\n\u221a\ns+s\u2217\n\u03c1\n\u03ba2s\n\u2212\n\u03c1\n2s\n\u2013\n\u221a\n\u2013\n\u03b3 \u00b7 \u03c3\n\n\u03c1+\n1\n\u2212\ns+s\u2217\n\n\u03c1\n\nWe note that in both the above problems we are allowed to output \u02c6x that may have s \u2265 s\u2217 elements\nin the support. This is a standard and crucial relaxation needed to provide strong guarantees under\nweak assumptions for SLR. This work considers orthogonal matching pursuit (OMP) [16, 23] for\nsolving both of the above problems. OMP is one of the most popular methods for sparse optimization\nand it is essentially a greedy method that incrementally estimates the support of \u00afx by adding one\nelement at a time. 
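Algorithm 1 translates directly into code. The following is a minimal sketch with illustrative synthetic data (sizes, seed, and amplitudes are our choices); the loop is exactly the paper's greedy selection plus least-squares refit:

```python
import numpy as np

# Direct transcription of Algorithm 1 (OMP). Each iteration greedily adds
# the column most correlated with the current residual, then re-fits least
# squares on the grown support.

def omp(A, y, s):
    d = A.shape[1]
    S = []                      # S_0 = empty set
    x = np.zeros(d)
    r = y.copy()                # r_0 = y
    for _ in range(s):
        corr = np.abs(A.T @ r)
        corr[S] = -np.inf       # restrict argmax to i not in S_{k-1}
        S.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
        x = np.zeros(d)
        x[S] = coef             # x_k = argmin over supp(x) in S_k
        r = y - A @ x           # r_k
    return x, S

rng = np.random.default_rng(2)
n, d, s_star = 200, 30, 4
A = rng.standard_normal((n, d)) / np.sqrt(n)   # columns of roughly unit norm
x_bar = np.zeros(d)
true_support = [3, 7, 11, 19]
x_bar[true_support] = [2.0, -1.5, 1.0, 3.0]
y = A @ x_bar + 0.01 * rng.standard_normal(n)  # small sub-Gaussian noise

# Allow support expansion (s > s*), as in the paper's guarantees.
x_hat, S_hat = omp(A, y, 2 * s_star)
```

On such a well-conditioned Gaussian design with |x̄_min| far above the noise level, the recovered support contains the true support, matching the regime of Theorem 3.1.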
See Algorithm 1 for a pseudo-code of OMP for SLR.\nIn Section 3 we show our upper bounds for the performance of OMP with respect to both the problems\nabove, under the RSS/RSC conditions. In Section 4, we provide a matching lower bound (upto\nlogarithmic factors) which shows that there exist certain sparse linear regression problems on which\nOMP cannot perform signi\ufb01cantly better than the error bounds given by our analysis. In Section 5 we\nshow some simple simulations to ground our results.\n\n3 Upper bounds for OMP\n\nWe \ufb01rst present our key contribution which is a support recovery bound for OMP under RSC/RSS.\nTheorem 3.1 (Support Recovery for OMP). Let A \u2208 Rn\u00d7d and \u00afx \u2208 Rd be a s\u2217-sparse vector. Let\ny = A\u00afx + \u03b7 and let \u02c6xs be the output of OMP after s iterations, where\n\nand (cid:101)\u03bas+s\u2217\n(cid:13)(cid:13)(cid:13)AT\n\nis\nS\u2217\\SAS(AT\n\nS AS)\u22121(cid:13)(cid:13)(cid:13)\u221e\n\nthe\n\n(cid:32)\n\n(cid:33)\n\n5\u03c1+\ns+s\u2217\n\u03c1\u2212\ns+s\u2217\n\ns \u2265 C1(cid:101)\u03bas+s\u2217 s\u2217 \u00b7 log\n\u2264 \u03b3 where S = supp(\u02c6xs). Then, for every \u03b4 \u2208 (cid:0)0, e\u221268(cid:1), if\n|\u00afxmin| \u2265(cid:16)\n\nrestricted condition number\n\n(cid:17) \u03c3\n\n(De\ufb01nition 2.3).\n\nMoreover,\n\n(cid:114)\n\n(3.1)\n\n\u221a\n\nlet\n\n,\n\n1 +\n\n2 (1 + \u03b3)\n\n\u03c1\u2212\ns+s\u2217\n\n\u03c1+\n1 log\nand s + s\u2217 \u2265 log (1/\u03b4), then S\u2217 \u2286 supp(\u02c6xs) and (cid:107)\u02c6xs \u2212 \u00afx(cid:107)\u221e \u2264 \u03c3\nleast 1 \u2212 7\u03b4. Here C1 = 664 is a universal constant.\nRemark 1: \u03c1\u2212\nrestricted strong convexity of the normalized objective 1\nn. Similarly,\n\ns+s\u2217 is the RSC constant of the (cid:107)Ax \u2212 y(cid:107)2\n(cid:113)\n\ns+s\u2217 is n times the\nn(cid:107)Ax \u2212 y(cid:107)2\n2 whose scale is independent of\n\u221a\nn. Thus |\u00afxmin| essentially scales as 1/\nn.\n\n2 objective. 
Hence \u03c1\u2212\n\nlog (s/\u03b4) with probability at\n\n\u221a\n\n\u03c1+\n1 hides a\n\n\u2212\n\u03c1\ns\n\n,\n\nd\n\u03b4\n\n(cid:113) 2\n\n4\n\n\fRemark 2: The \u03b3 parameter in the above theorem is somewhat similar to the standard incoherence\nparameter [24], although the incoherence parameter can be signi\ufb01cantly larger than \u03b3. Further,\nexisting results for OMP [26] require the incoherence parameter to be strictly less than 1 while our\nanalysis holds for arbitrary values of \u03b3. Thus, our results apply to more general design matrices A.\nRemark 3: Our assumption on |\u00afxmin| is better at least by a factor of\nassumptions made in recent work that analyzes OMP for support recovery [20, 21, 25] (see Table 1).\nRemark 4: To the best of our knowledge, [13] is the only known support recovery result for LASSO\nunder RSC, that provides strong guarantees as our result above. However, the non-convex penalty\nbased algorithm of [13] might produce iterates which are dense, so intermediate steps can be more\nexpensive than sparsity preserving OMP. Furthermore, while qualitatively, our bound is similar to\nthe bound of [13], their proof requires n \u2265 (cid:107)\u00afx(cid:107)2\n1 log d which, na\u00efvely, for many problems with\nimbalanced non-zero elements of \u00afx can be as large as (s\u2217)2.\n\n\u221a(cid:101)\u03ba than corresponding\n\nProof Sketch of Theorem 3.1 (see Appendix B.2 for details): Theorem 3.2 (stated below) guar-\nantees that OMP has a very small objective value after a certain number of support expansion steps.\nThis guarantees small generalization error (Theorem 3.3), but not support recovery. To guarantee\nsupport recovery, our proof critically exploits a novel observation (Lemma B.4 in Appendix B.2) that\nif at any iteration of OMP, full support recovery has not happened, then OMP decreases function\nvalue by a \ufb01xed, but small, additive constant. 
Theorem 3.2 allows us to say that even this small\nconstant decrement cannot happen for too long since the objective value is already small. Overall,\nthis means that support recovery must happen soon after we have small objective value.\nLet s be the iteration index that is suf\ufb01cient to satisfy the conditions for Theorem 3.2. From\nTheorem 3.2 we have with probability at least 1 \u2212 2\u03b4,\n2 \u2264 (cid:107)A\u00afx \u2212 y(cid:107)2\n\u2264 (cid:107)\u03b7(cid:107)2\n\n\u03c32s\u03c1+\n1 log(d/\u03b4)\n\u03c1+\ns+s\u2217\n\n(cid:107)Axs \u2212 y(cid:107)2\n\nSuppose any one of the support index has not been recovered (that is, |S\u2217\\S| > 0) then if j \u2208 (S\u2217\\S)c\nis selected by OMP in its (s + 1)th iteration, we have by step 4 of Algorithm 1,\n\n2 + 40\u03c32s log(d/\u03b4)\n\n2 + 40\n\n(3.2)\n\n.\n\nS\u2217\\Srs\n\n(cid:13)(cid:13)(cid:13)AT\n\nj rs|.\n\u2264 |AT\nIn Lemma B.4, we lower bound the LHS of (3.3) as follows:\n\n(cid:13)(cid:13)(cid:13)\u221e\n(cid:113)\n2 (1 + \u03b3)(cid:1)\nwith probability at least 1 \u2212 2\u03b4. Since |\u00afxmin| \u2265(cid:0)1 +\n(cid:0)AT\n(cid:1)2\n\n2(1 + \u03b3)\u03c3\n\u221a\n\n(3.3) with (3.4) gives,\n\ns+s\u2217|\u00afxmin| \u2212\n\n(cid:13)(cid:13)(cid:13)AT\n\n(cid:13)(cid:13)(cid:13)\u221e\n\n\u2265 \u03c1\u2212\n\nS\u2217\\Srs\n\n\u03c32 log\n\n\u221a\n\n.\n\nj rs\n\nd\n\u03b4\n\n\u2264 1\n\u03c1+\n1\n\n\u03c1+\n1 log(d/\u03b4),\n\n(cid:113)\n\n\u03c3\n\u2212\ns+s\u2217\n\n\u03c1\n\n\u03c1+\n1 log (d/\u03b4), combining\n\n(3.3)\n\n(3.4)\n\n(3.5)\n\n(3.6)\n\n(3.7)\n\n(3.9)\n\nThis gives us an additive decrease in the function value:\n\n(cid:107)Axs+1 \u2212 y(cid:107)2\n\nxj\n\n2 \u2264 min\n= (cid:107)Axs \u2212 y(cid:107)2\n\n(cid:107)Ajxj \u2212 rs(cid:107)2\n2 \u2212 1\n\u03c1+\n1\n\n2\n\n(cid:0)AT\n\nj rs\n\n(cid:1)2 \u2264 (cid:107)Axs \u2212 y(cid:107)2\n\n2 \u2212 \u03c32 log(d/\u03b4)\n\nSuppose that for another l iterations, the full support is not recovered. 
Then,

\\|Ax_{s+l} - y\\|_2^2 \\le \\|Ax_s - y\\|_2^2 - \\sigma^2 l \\log(d/\\delta). \\quad (3.7)

Further it can be shown that the function value at iteration s + l cannot be too small,

\\|Ax_{s+l} - y\\|_2^2 \\ge \\|\\eta\\|_2^2 - \\sigma^2 (s + l + s^*) - 4\\sigma^2 (s + l + s^*) \\sqrt{\\log(d/\\delta)}, \\quad (3.8)

with probability at least 1 - \\delta. Therefore combining (3.8) and (3.2) and plugging them in (3.7), we finally get,

l \\le 80s + s + s^* = O(s). \\quad (3.9)

Therefore with good probability, OMP recovers the full support in O(s) iterations. See Appendix B.2 for details.
We now bound the training error for OMP after running a certain number of iterations (which are fewer than the number of iterations required for support recovery as shown in Theorem 3.1). The proof of this theorem follows via a modification of the proof of Lemma A.5 in [28]. See Appendix B.1 for the proof.
Theorem 3.2 (Training Error for OMP). Consider the setting of Theorem 3.1. Also, let

s \\ge 8 \\tilde{\\kappa}_{s+s^*} s^* \\cdot \\log\\left( 5\\rho^+_{s+s^*} / \\rho^-_{s+s^*} \\right).

Then with probability 1 - 2\\delta, the output \\hat{x}_s of OMP after s steps satisfies:

\\frac{1}{n} \\|A\\hat{x}_s - y\\|_2^2 \\le \\frac{1}{n} \\|A\\bar{x} - y\\|_2^2 + 40 \\cdot \\frac{\\rho^+_1}{\\rho^+_{s+s^*}} \\cdot \\frac{\\sigma^2 s \\log(d/\\delta)}{n}. \\quad (3.10)

Given good objective value decrease, we can show a tight generalization error on the output of OMP. While in general support recovery is the main goal of a sparse regression algorithm, in several problem scenarios one might not care about support recovery and focus only on the accuracy of the learned predictor. See Appendix B.3 for the proof.
Theorem 3.3 (Generalization Error for OMP). Consider the setting of Theorem 3.1.
Let \u02c6xs be the\noutput of OMP after s iterations. For any constant C1 \u2265 8, there exists a constant C2(\u2264 9C1) such\nthat if s satis\ufb01es,\n\nthen with probability at least 1 \u2212 4\u03b4,\n\n(cid:32)\n\n(cid:33)\n\nC1(cid:101)\u03bas+s\u2217 s\u2217 \u00b7 log\ns \u2212 \u00afx)(cid:13)(cid:13)2\n(cid:13)(cid:13)A(\u02c6xOMP\n\n1\nn\n\n5\u03c1+\ns+s\u2217\n\u03c1\u2212\ns+s\u2217\n\n2 \u2264 C2\n\n\u2265 s \u2265 8(cid:101)\u03bas+s\u2217 s\u2217 \u00b7 log\n\u03c32(cid:101)\u03bas+s\u2217 s\u2217\n\n(cid:32)\n\n\u00b7 log\n\nn\n\n5\u03c1+\ns+s\u2217\n\u03c1\u2212\ns+s\u2217\n\n5\u03c1+\ns+s\u2217\n\u03c1\u2212\ns+s\u2217\n\n(cid:33)\n\n(cid:32)\n\n(cid:33)\n\n,\n\n\u00b7 log\n\nd\n\u03b4\n\n.\n\n(3.11)\n\n3.1 Gaussian ensemble\n\nFinally, we instantiate the above theorems for a Gaussian ensemble, i.e., when A is sampled from a\nGaussian distribution N (0, \u03a3). We denote the maximum and the minimum singular values of \u03a3 as\n\u03c3max and \u03c3min and the condition number of \u03a3 as \u03ba(\u03a3). To the best of our knowledge, the following\nis the best known generalization error guarantee in this setting in terms of the dependence on \u03ba(\u03a3).\nCorollary 3.3.1 (Gaussian ensemble: generalization error). Let the rows of the matrix A \u2208 Rn\u00d7d be\nsampled from N (0, \u03a3) where \u03a3ii \u2264 1 \u2200 i \u2208 [d] and \u00afx be a s\u2217-sparse vector. Let \u02c6xs be the output of\nOMP after s iterations and S = supp(\u02c6xs) be the support recovered, where,\n\ns = C2\u03ba(\u03a3) \u00b7 log (45\u03ba(\u03a3)) s\u2217, n > 4C1\n\n, and s + s\u2217 \u2265 log\nfor any \u03b4 > 0. 
Then with probability at least 1 \u2212 4\u03b4 \u2212 e\u2212C0n, the following holds:\n\ns log d\n\u03c3min(\u03a3)\n\n1\n\u03b4\n\n,\n\n(cid:13)(cid:13)A(\u02c6xOMP\n\ns \u2212 \u00afx)(cid:13)(cid:13)2\n\n1\nn\n\n2 \u2264 C3\n\n\u03c32\u03ba(\u03a3)s\u2217\n\nn\n\n\u00b7 log (45\u03ba (\u03a3)) \u00b7 log\n\nd\n\u03b4\n\nHere C0, C1, C3 and C4 are universal constants independent of any problem parameters.\n\nNote the linear dependence of generalization error on \u03ba(\u03a3). This matches the lower bound of [29],\nalthough technically the bound does not apply to OMP as s > s\u2217. The proof follows directly from\nTheorem 3.3 along with standard concentration results. See Appendix B.3 for details.\nWe now present support recovery result for Gaussian ensembles. For simplicity, we consider the case\nwhen A is sampled from N (0, I). This can also be extended to N (0, \u03a3) but involves cumbersome\nlinear algebraic computations, which we avoid for simplicity.\n\n6\n\n\fCorollary 3.3.2 (Gaussian ensemble: support recovery). Let the rows of the matrix A \u2208 Rn\u00d7d be\nsampled from N (0, Id\u00d7d) and \u00afx be a s\u2217-sparse vector. Suppose further that |\u00afxmin| \u2265 23\u03c3\n.\nLet \u02c6xs be the output of OMP after s iterations and S = supp(\u02c6xs) be the support recovered, where,\n\nn\n\ns \u2265 C1s\u2217, n > C2(s\u2217)2 log\n\n1\n\u03b4\nfor any \u03b4 > 0. Then S\u2217 \u2286 supp(\u02c6xs) and (cid:107)\u02c6xs \u2212 \u00afx(cid:107)\u221e \u2264 2\u03c3\nwith probability at least\n1 \u2212 e\u2212C0n \u2212 9\u03b4. Here C0, C1 and C2 are universal constants independent of any problem parameter.\nThis matches the bounds of [13] up to constants. The proof directly follows from Theorem 3.1 along\nwith standard Gaussian concentration results. 
See Appendix B.3 for details.\n\n, and s + s\u2217 \u2265 log\n\nd\n\u03b4\n\nn\n\n,\n\n(cid:113) 2 log(s/\u03b4)\n\n(cid:113) log(d/\u03b4)\n\n4 Lower bounds for OMP\n\nIn this section, we provide lower bounds on the performance of OMP, both in terms of support\nrecovery and generalization error. These bounds show that:\n\n\u2022 The imperative quantities we make assumptions on in the upper bound section, viz:(cid:101)\u03bas+s\u2217\n\nand \u03b3 are relevant and meaningful.\n\n\u2022 Given bounds on these quantities, our results are tight, up to logarithmic factors.\n\nTo provide these lower bounds, we construct matrices M(\u0001) that are parametrized by \u0001. We \ufb01x \u00afx to\n\nbe an s\u2217-sparse vector such that: (cid:26)\u00afxi =(cid:112)1/s\u2217\n\nif 1 \u2264 i \u2264 s\u2217,\nif s\u2217 < i.\n\n\u00afxi = 0\n\n(4.1)\nThus, S\u2217 := supp(\u00afx) = {1, 2, . . . , s\u2217}. All our lower bound theorems use this \ufb01xed vector which is\nindependent of the noise level \u03c3. Our results are thus stronger than a typical minimax rate in which \u00afx\ncan be scaled based on \u03c3. For instance, the lower bounds of [29], [30] use such a strategy. Also, the\nsupport is distributed evenly across the \u00afxi\u2019s (4.1). Thus, we show that even large elements are not\nrecovered.\nWe now de\ufb01ne M(\u0001) \u2208 Rn\u00d7d for a given \u0001 \u2208 [0, 1], any s\u2217 \u2264 d \u2264 n in the following manner: M(\u0001)\n1:s\u2217\n= n, \u2200 i \u2208 [s\u2217]. 
For i \u2208 [d]\\ [s\u2217], each column\nare random orthogonal vectors such that\nvector is de\ufb01ned as,\n\n(cid:13)(cid:13)(cid:13)M(\u0001)\n(cid:114) 1 \u2212 \u0001\n\ni\n\n(cid:13)(cid:13)(cid:13)2\ns\u2217(cid:88)\n\n2\n\ns\u2217\n\nj=1\n\nM(\u0001)\n\ni =\n\n\u221a\n\n\u0001 gi,\n\nM(\u0001)\nj +\ni gj = 0 for all i (cid:54)= j.\n\n(4.2)\n\ni M(\u0001)\n\n2 = n, gT\n\n1:s\u2217 = 0 and gT\n\nwhere gi is such that (cid:107)gi(cid:107)2\nThe intuition behind this construction is that OMP would prefer the average direction M(\u0001)\nany of the correct directions M(\u0001)\nthe other orthogonal vectors of the matrix.\nThe parameter \u0001 is set carefully to ensure that the condition number of the matrix does not increase\ntoo much, so that M(\u0001) satis\ufb01es the constraints of Theorem 3.3 and Theorem 3.1 (upto constants).\nThis is captured in the next lemma:\nLemma 4.1. The matrix M(\u0001) satis\ufb01es\n\n, where i \u2208 S\u2217. Thus, we add a scaled version of M(\u0001)\n\nS\u2217 \u00afx over\nS\u2217 \u00afx to each of\n\ni\n\n(cid:0)M(\u0001)(cid:1) \u2264 4(1 + 2(1 \u2212 \u0001)s) = O(s)\n\u2022 (cid:101)\u03bas\n(cid:17)\u22121(cid:13)(cid:13)(cid:13)(cid:13)\u221e\n(cid:13)(cid:13)(cid:13)(cid:13)M(\u0001)T\n\nS\u2217\\SM(\u0001)\n\nS M(\u0001)\n\nM(\u0001)T\n\n(cid:16)\n\n\u2264\n\n\u2022\n\nS\n\nS\n\n1\u221a\ns\u2217(1\u2212\u0001)\n\nfor S \u2229 S\u2217 = \u03c6.\n\nWe now use the above construction to show that in the noiseless case, i.e., when y = M(\u0001)\u00afx, OMP\nfails to recover any of the support elements in S\u2217 for some \u0001. Similarly, we show that in the noisy\ncase, support recovery fails and hence the generalization error of OMP is also large and matches the\nupper bound provided in Theorem 3.3. 
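The construction (4.1)-(4.2) can be instantiated numerically. The sketch below (illustrative sizes; `make_M` is our own helper name) builds M(ε) from √n-scaled orthonormal columns and checks its defining properties, including the fact that every off-support column is more correlated with y than any true support column, which is what misleads the greedy selection:

```python
import numpy as np

# Numerical instantiation of the lower-bound family M(eps) of (4.2) with
# the fixed x_bar of (4.1). Sizes are illustrative. Each off-support
# column mixes the average support direction with a fresh orthogonal
# direction g_i, while keeping all squared column norms equal to n.

def make_M(n, d, s_star, eps, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    B = np.sqrt(n) * Q                 # orthogonal columns, squared norm n
    M = np.empty((n, d))
    M[:, :s_star] = B[:, :s_star]      # the true support columns
    avg = B[:, :s_star].sum(axis=1)    # sum of support columns
    for i in range(s_star, d):         # off-support columns per (4.2)
        M[:, i] = np.sqrt((1 - eps) / s_star) * avg + np.sqrt(eps) * B[:, i]
    return M

n, d, s_star = 60, 20, 5
eps = 1 - 3 / (2 * s_star)             # the value used in Theorem 4.2
rng = np.random.default_rng(3)
M = make_M(n, d, s_star, eps, rng)

x_bar = np.zeros(d)
x_bar[:s_star] = 1 / np.sqrt(s_star)   # the fixed vector of (4.1)
y = M @ x_bar                          # noiseless measurements
col_norms_sq = (M ** 2).sum(axis=0)
```

A short calculation (cross terms vanish by orthogonality) gives squared column norms (1-ε)n + εn = n for the off-support columns, and off-support correlations n√(1-ε) versus n/√(s*) for true columns, so with ε = 1 - 3/(2s*) the off-support correlations win by a factor √(3/2).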
Proofs for this section can be found in Appendix C.

4.1 Noiseless case

For the deterministic noiseless case, i.e., \\sigma = 0, we consider the matrix M(\\epsilon) for \\epsilon = 1 - 3/(2s^*) and show that OMP does not select any element of the true support until it has added essentially all of the remaining columns.
Theorem 4.2. For every value of d, n and s^* where s^* \\le d \\le n, there exists a design matrix A \\in R^{n \\times d} and an s^*-sparse vector \\bar{x} (defined in (4.2), (4.1)) such that the following holds true for OMP when applied to the sparse linear regression problem with y = A\\bar{x} and when OMP is executed for s \\le d - s^* iterations:

\u2022 \\tilde{\\kappa}_s(A) \\le 16(s/s^*) and \\gamma \\le \\sqrt{2/3}.

\u2022 The support set S recovered by OMP after s iterations is disjoint from S^*, i.e., S^* \\cap S = \\phi.

Our support recovery result in Theorem 3.1 requires s \\ge C \\tilde{\\kappa}_{s+s^*} s^*, and one natural question is whether running OMP for this many iterations is necessary for recovering the actual support. This theorem guarantees that it is indeed the case, i.e., if the design matrix A is ill-conditioned then OMP has to work with support sets of size s \\ge \\tilde{\\kappa}_{s+s^*} s^*. This in turn implies that the number of rows in A (i.e., sample complexity) should also scale with \\tilde{\\kappa}_{s+s^*}.

Note that the lower bound results of [29], [30] do not provide any insights for how the sample complexity of an algorithm should scale with \\kappa_{s+s^*} for support recovery. In fact for this problem their results are vacuous if |\\bar{x}_{min}| is reasonably large. For instance, with the \\bar{x} defined in (4.1) and the design matrix proposed by [29], OMP can recover the true support of \\bar{x} exactly after just O(s^*) iterations with n = s^* \\log d samples.
Thus, a large condition number of A in their construction does\nnot imply dif\ufb01culty in recovery for OMP.\n\nbehavior with respect to the restricted condition number(cid:101)\u03bas+s\u2217. For this section, we consider the\n\n4.2 Noisy case\nFor the noisy case, i.e., \u03c3 (cid:54)= 0, we can study both support recovery as well as generalization error\nmatrix M(\u0001) for \u0001 = (1 \u2212 1/4s\u2217). That is, we show that with high probability, OMP starts recovering\nthe correct support only after d1\u2212\u03b1 iterations for some constant \u03b1 > 0. This further implies that\nthe generalization error cannot be better than the lower bound on generalization error we showed in\nTheorem 3.3 (upto constants).\nTheorem 4.3. For every value of d and s\u2217, and any constants \u03b1 \u2208 (0, 1), \u03b4 \u2208 (0, 1), such that\n\n8 \u2264 s\u2217 \u2264 s \u2264 d1\u2212\u03b1 and d \u2265 max(cid:8)32 log (1/\u03b4) , 41/\u03b1(cid:9), there exists a sparse linear regression\nproblem with y = A\u00afx + \u03b7, \u03b7 \u223c N(cid:0)0, \u03c32In\u00d7n\n(cid:1), with design matrix A, and a s\u2217-sparse vector \u00afx\n\u2022 (cid:101)\u03bas(A) \u2264 36 (s/s\u2217) for all s and \u03b3 \u2264 1/2,\n\nde\ufb01ned in (4.2),(4.1) such that the following holds:\n\n\u2022 With probability at least 1 \u2212 \u03b4, the output \u02c6xs of OMP after s steps satis\ufb01es:\n\n2 \u2265 \u03c32(cid:101)\u03bas+s\u2217 s\u2217\n\n18n\n\n\u00b7 log\n\nd\n\u03b4\n\n,\n\n(cid:107)A\u02c6xs \u2212 A\u00afx(cid:107)2\n\n1\nn\n\n\u2022 Support set S recovered by OMP after s iterations is disjoint from S\u2217.\n\nNote that the dependence of the generalization error bound on(cid:101)\u03bas+s\u2217 matches our generalization error\n\nbound in Theorem 3.3. 
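The failure mode underlying both lower bounds can be reproduced end to end. The self-contained sketch below (illustrative sizes; helper names are ours) runs OMP on M(ε) with ε = 1 - 3/(2s*) in the noiseless case of Theorem 4.2 and observes that the recovered support stays disjoint from S* for all d - s* iterations, so the prediction error remains large:

```python
import numpy as np

# Self-contained illustration of the lower-bound failure mode: on the
# family M(eps) of (4.2) with the x_bar of (4.1), OMP keeps selecting
# off-support columns. Noiseless case (sigma = 0), as in Theorem 4.2.

def omp(A, y, s):
    """Algorithm 1: greedy column selection plus least-squares refit."""
    d = A.shape[1]
    S, x, r = [], np.zeros(d), y.copy()
    for _ in range(s):
        corr = np.abs(A.T @ r)
        corr[S] = -np.inf
        S.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
        x = np.zeros(d)
        x[S] = coef
        r = y - A @ x
    return x, S

def make_M(n, d, s_star, eps, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    B = np.sqrt(n) * Q                 # orthogonal columns, squared norm n
    M = np.empty((n, d))
    M[:, :s_star] = B[:, :s_star]
    avg = B[:, :s_star].sum(axis=1)
    for i in range(s_star, d):
        M[:, i] = np.sqrt((1 - eps) / s_star) * avg + np.sqrt(eps) * B[:, i]
    return M

n, d, s_star = 60, 20, 5
eps = 1 - 3 / (2 * s_star)
rng = np.random.default_rng(4)
M = make_M(n, d, s_star, eps, rng)
x_bar = np.zeros(d)
x_bar[:s_star] = 1 / np.sqrt(s_star)
y = M @ x_bar                          # noiseless: sigma = 0

x_hat, S_hat = omp(M, y, d - s_star)   # run for s = d - s* iterations
true_support = set(range(s_star))
```

At every step the unpicked off-support correlations exceed the support correlations by the same factor √(3/2), so the greedy choice never touches S* and the residual signal is never fully captured.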
Interestingly, for our construction, noise ends up helping recovery: while Theorem 4.2 ensures that recovery of the true support elements does not occur until the very last step, noise can only help in recovering one of the true elements. However, the probability of picking the correct element by chance is tiny, since we restrict s ≤ d^{1−α}. We in fact believe that the result holds for general s and d; however, proving this turns out to be quite intricate, since it requires finer results about the behavior of the order statistics of independent Gaussian variables.

5 Simulations

In this section, we present simulations that verify our results. In particular, we generate a matrix M(ε) ∈ R^{1000×100} and a fixed s* = 10-sparse vector x̄ using the construction given in (4.2) and (4.1), where ε ∈ (0, 1). We then generate y = M(ε)x̄ + σN(0, I_{n×n}) and apply OMP to recover the support of x̄. ŝ(ε) denotes the support set size that is needed by OMP to fully recover the support of x̄.

Note that we can also compute the actual value of κ̃ for M(ε); in general, the restricted condition number of M(ε) increases with decreasing ε, thus increasing the difficulty of the support recovery problem.

Figure 1: Number of iterations required for recovering the full support of x̄ with respect to (a) the restricted condition number (κ̃_{s+s*}) of the design matrix and (b) the sub-Gaussian parameter of the noise term (σ²).

Figure 1(a) plots ŝ(ε) (i.e., the support size required for full recovery) against the restricted condition number κ̃(M(ε)) of M(ε), generated by varying ε ∈ (0, 1). Theorem 4.2 claims that for σ = 0, full recovery requires κ̃_s to be smaller than O(d/s*), which is observed in Figure 1(a). For larger variance σ², full recovery requires a larger number of iterations for smaller κ̃.

As mentioned in the remark below Theorem 4.3, adding noise can only help in the case of large κ̃, as our construction precludes full recovery unless s = d. We observe this behavior in both Figures 1(a) and 1(b), where a slightly larger value of σ ends up helping support recovery, but for larger values of the noise variance, OMP's performance is similar to an algorithm that simply selects each feature uniformly at random.

6 Conclusion

In this paper, we analyze OMP for the sparse regression problem under RSC/RSS assumptions. We obtain support recovery and generalization guarantees for OMP in this setting. We also provide lower bounds for OMP, showing that our results are tight up to logarithmic factors. We note that our results significantly improve upon a long list of existing results for greedy methods and match the best known results for sparse regression that use nonconvex-penalty-based methods. In contrast to nonconvex penalty methods, however, OMP guarantees the sparsity of intermediate iterates and hence can be much more efficient. We also verify our results with synthetic experiments.

References

[1] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.

[2] Thomas Blumensath and Mike E. Davies. Gradient pursuit for non-linear sparse signal modelling. In 16th European Signal Processing Conference, pages 1–5. IEEE, 2008.

[3] Stéphane Boucheron and Maud Thomas. Concentration inequalities for order statistics. arXiv preprint arXiv:1207.7209, 2012.

[4] E. J.
Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[5] Emmanuel Candes and Justin Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969, 2007.

[6] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[7] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

[8] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, 2013.

[9] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. ProtoNN: compressed and accurate kNN for resource-scarce devices. In International Conference on Machine Learning, pages 1331–1340, 2017.

[10] D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. arXiv e-prints, October 2011.

[11] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.

[12] Ali Jalali, Christopher C. Johnson, and Pradeep K. Ravikumar. On learning discrete graphical models using greedy methods. In Advances in Neural Information Processing Systems, pages 1935–1943, 2011.

[13] Po-Ling Loh and Martin J. Wainwright. Support recovery without incoherence: a case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482, 2017.

[14] Balas Kausik Natarajan. Sparse approximate solutions to linear systems.
SIAM Journal on Computing, 24(2):227–234, 1995.

[15] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.

[16] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

[17] Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE, 1993.

[18] Amela Prelić, Stefan Bleuler, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele, and Eckart Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.

[19] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

[20] Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In International Conference on Machine Learning, pages 3115–3124, 2017.

[21] Jie Shen and Ping Li. Partial hard thresholding: towards a principled analysis of support recovery. In Advances in Neural Information Processing Systems, pages 3127–3137, 2017.

[22] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit.
IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[23] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[24] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[25] Xiaotong Yuan, Ping Li, and Tong Zhang. Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems 29, pages 3558–3566. Curran Associates, Inc., 2016.

[26] Tong Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.

[27] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011.

[28] Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215–6221, 2011.

[29] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.

[30] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators.
Electronic Journal of Statistics, 11(1):752–799, 2017.