{"title": "Orthogonal Matching Pursuit with Replacement", "book": "Advances in Neural Information Processing Systems", "page_first": 1215, "page_last": 1223, "abstract": "In this paper, we consider the problem of compressed sensing where the goal is to recover almost all the sparse vectors using a small number of fixed linear measurements. For this problem, we propose a novel partial hard-thresholding operator leading to a general family of iterative algorithms. While one extreme of the family yields well known hard thresholding algorithms like ITI and HTP, the other end of the spectrum leads to a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR, like the classic greedy algorithm OMP, adds exactly one coordinate to the support at each iteration, based on the correlation with the current residual. However, unlike OMP, OMPR also removes one coordinate from the support. This simple change allows us to prove the best known guarantees for OMPR in terms of the Restricted Isometry Property (a condition on the measurement matrix). In contrast, OMP is known to have very weak performance guarantees under RIP. We also extend OMPR using locality sensitive hashing to get OMPR-Hash, the first provably sub-linear (in dimensionality) algorithm for sparse recovery. Our proof techniques are novel and flexible enough to also permit the tightest known analysis of popular iterative algorithms such as CoSaMP and Subspace Pursuit. We provide experimental results on large problems providing recovery for vectors of size up to million dimensions. 
We demonstrate that for large-scale problems our proposed methods are more robust and faster than existing methods.", "full_text": "Orthogonal Matching Pursuit with Replacement

Prateek Jain
Microsoft Research India
Bangalore, INDIA
prajain@microsoft.com

Ambuj Tewari
The University of Texas at Austin
Austin, TX
ambuj@cs.utexas.edu

Inderjit S. Dhillon
The University of Texas at Austin
Austin, TX
inderjit@cs.utexas.edu

Abstract

In this paper, we consider the problem of compressed sensing where the goal is to recover almost all sparse vectors using a small number of fixed linear measurements. For this problem, we propose a novel partial hard-thresholding operator that leads to a general family of iterative algorithms. While one extreme of the family yields well known hard thresholding algorithms like ITI and HTP [17, 10], the other end of the spectrum leads to a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR, like the classic greedy algorithm OMP, adds exactly one coordinate to the support at each iteration, based on the correlation with the current residual. However, unlike OMP, OMPR also removes one coordinate from the support. This simple change allows us to prove that OMPR has the best known guarantees for sparse recovery in terms of the Restricted Isometry Property (a condition on the measurement matrix). In contrast, OMP is known to have very weak performance guarantees under RIP. Given its simple structure, we are able to extend OMPR using locality sensitive hashing to get OMPR-Hash, the first provably sub-linear (in dimensionality) algorithm for sparse recovery. Our proof techniques are novel and flexible enough to also permit the tightest known analysis of popular iterative algorithms such as CoSaMP and Subspace Pursuit. We provide experimental results on large problems, demonstrating recovery of vectors with up to a million dimensions. 
We demonstrate that for large-scale problems our proposed methods are more robust and faster than existing methods.

1 Introduction
We nowadays routinely face high-dimensional datasets in diverse application areas such as biology, astronomy, and finance. The associated curse of dimensionality is often alleviated by prior knowledge that the object being estimated has some structure. One of the most natural and well-studied structural assumptions for vectors is sparsity. Accordingly, a huge amount of recent work in machine learning, statistics and signal processing has been devoted to finding better ways to leverage sparse structures. Compressed sensing, a new and active branch of modern signal processing, deals with the problem of designing measurement matrices and recovery algorithms, such that almost all sparse signals can be recovered from a small number of measurements. It has important applications in imaging, computer vision and machine learning (see, for example, [9, 24, 14]).

In this paper, we focus on the compressed sensing setting [3, 7] where we want to design a measurement matrix A ∈ R^{m×n} such that a sparse vector x* ∈ R^n with ||x*||_0 := |supp(x*)| ≤ k < n can be efficiently recovered from the measurements b = Ax* ∈ R^m. Initial work focused on various random ensembles of matrices A such that, if A was chosen randomly from that ensemble, one would be able to recover all or almost all sparse vectors x* from Ax*. Candes and Tao [3] isolated a key property called the restricted isometry property (RIP) and proved that, as long as the measurement matrix A satisfies RIP, the true sparse vector can be obtained by solving the ℓ1-optimization problem

min ||x||_1  s.t.  Ax = b.

The above problem can be easily formulated as a linear program and is hence efficiently solvable. We recall for the reader that a matrix A is said to satisfy RIP of order k if there is some δ_k ∈ [0, 1) such that, for all x with ||x||_0 ≤ k, we have

(1 − δ_k) ||x||^2 ≤ ||Ax||^2 ≤ (1 + δ_k) ||x||^2.
Several random matrix ensembles are known to satisfy δ_{2k} < θ with high probability provided one chooses m = O((k/θ^2) log(n/k)) measurements. It was shown in [2] that ℓ1-minimization recovers all k-sparse vectors provided A satisfies δ_{2k} < 0.414, although the condition has recently been improved to δ_{2k} < 0.473 [11]. Note that, in compressed sensing, the goal is to recover all, or most, k-sparse signals using the same measurement matrix A. Hence, weaker conditions such as restricted convexity [20] studied in the statistical literature (where the aim is to recover a single sparse vector from noisy linear measurements) typically do not suffice. In fact, if RIP is not satisfied then multiple sparse vectors x can lead to the same observation b, hence making recovery of the true sparse vector impossible.

Based on its RIP guarantees, ℓ1-minimization can guarantee recovery using just O(k log(n/k)) measurements, but it has been observed in practice that ℓ1-minimization is too expensive in large scale applications [8], for example, when the dimensionality is in the millions. This has sparked a huge interest in other iterative methods for sparse recovery. An early classic iterative method is Orthogonal Matching Pursuit (OMP) [21, 6] that greedily chooses elements to add to the support. It is a natural, easy-to-implement and fast method but unfortunately lacks strong theoretical guarantees. Indeed, it is known that, if run for k iterations, OMP cannot uniformly recover all k-sparse vectors under an RIP condition of the form δ_{2k} ≤ θ [22, 18]. However, Zhang [26] showed that OMP, if run for 30k iterations, recovers the optimal solution when δ_{31k} ≤ 1/3; a significantly more restrictive condition than the ones required by other methods like ℓ1-minimization.
Several other iterative approaches have been proposed; these include Iterative Soft Thresholding (IST) [17], Iterative Hard Thresholding (IHT) [1], Compressive Sampling Matching Pursuit (CoSaMP) [19], Subspace Pursuit (SP) [4], Iterative Thresholding with Inversion (ITI) [16], Hard Thresholding Pursuit (HTP) [10] and many others. In the family of iterative hard thresholding algorithms, we can identify two major subfamilies [17]: one- and two-stage algorithms. As their names suggest, the distinction is based on the number of stages in each iteration of the algorithm. One-stage algorithms, such as IHT, ITI and HTP, decide on the choice of the next support set and then usually solve a least squares problem on the updated support. The one-stage methods always set the support set to have size k, where k is the target sparsity level. On the other hand, two-stage algorithms, notable examples being CoSaMP and SP, first enlarge the support set, solve a least squares problem on it, and then reduce the support set back again to the desired size. A second least squares problem is then solved on the reduced support. These algorithms typically enlarge and reduce the support set by k or 2k elements. An exception is the two-stage algorithm FoBa [25] that adds and removes single elements from the support. However, it differs from our proposed methods as its analysis requires very restrictive RIP conditions (δ_{8k} < 0.1 as quoted in [14]) and the connection to locality sensitive hashing (see below) is not made. Another algorithm with replacement steps was studied by Shalev-Shwartz et al. [23]. However, the algorithm and the setting under which it is analyzed are different from ours.

In this paper, we present and provide a unified analysis for a family of one-stage iterative hard thresholding algorithms. The family is parameterized by a positive integer l ≤ k. At the extreme value l = k, we recover the algorithm ITI/HTP.
At the other extreme l = 1, we get a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR can be thought of as a simple modification of the classic greedy algorithm OMP: instead of simply adding an element to the existing support, it replaces an existing support element with a new one. Surprisingly, this change allows us to prove sparse recovery under the condition δ_{2k} < 0.499. This is the best δ_{2k}-based RIP condition under which any method, including ℓ1-minimization, is (currently) known to provably perform sparse recovery. OMPR also lends itself to a faster implementation using locality sensitive hashing (LSH). This allows us to provide recovery guarantees using an algorithm whose run-time is provably sub-linear in n, the number of dimensions. An added advantage of OMPR, unlike many iterative methods, is that no careful tuning of the step-size parameter is required, even under noisy settings or when RIP does not hold. The default step-size of 1 is always guaranteed to converge to at least a local optimum.

Finally, we show that our proof techniques used in the analysis of the OMPR family are useful in tightening the analysis of two-stage algorithms, such as CoSaMP and SP, as well. As a result, we are able to prove better recovery guarantees for these algorithms: δ_{4k} < 0.35 for CoSaMP, and δ_{3k} < 0.35 for SP. We hope that this unified analysis sheds more light on the interrelationships between the various kinds of iterative hard thresholding algorithms.

In summary, the contributions of this paper are as follows.

• We present a family of iterative hard thresholding algorithms that on one end of the spectrum includes existing methods such as ITI/HTP while on the other end gives OMPR. OMPR is an improvement over the classical OMP method as it enjoys better theoretical guarantees and is also better in practice as shown in our experiments.
• Unlike other improvements over OMP, such as CoSaMP or SP, OMPR changes only one element of the support at a time. This allows us to use Locality Sensitive Hashing (LSH) to speed it up, resulting in the first provably sub-linear (in the ambient dimensionality n) time sparse recovery algorithm.

Algorithm 1 OMPR
1: Input: matrix A, vector b, sparsity level k
2: Parameter: step size η > 0
3: Initialize x^1 s.t. |supp(x^1)| = k, I_1 = supp(x^1)
4: for t = 1 to T do
5:   z^{t+1} ← x^t + ηA^T(b − Ax^t)
6:   j_{t+1} ← argmax_{j∉I_t} |z_j^{t+1}|
7:   J_{t+1} ← I_t ∪ {j_{t+1}}
8:   y^{t+1} ← H_k(z^{t+1}_{J_{t+1}})
9:   I_{t+1} ← supp(y^{t+1})
10:  x^{t+1}_{I_{t+1}} ← A_{I_{t+1}}\\b,  x^{t+1}_{Ī_{t+1}} ← 0
11: end for

Algorithm 2 OMPR (l)
1: Input: matrix A, vector b, sparsity level k
2: Parameter: step size η > 0, replacement budget l
3: Initialize x^1 s.t. |supp(x^1)| = k, I_1 = supp(x^1)
4: for t = 1 to T do
5:   z^{t+1} ← x^t + ηA^T(b − Ax^t)
6:   top_{t+1} ← indices of top l elements of |z^{t+1}_{Ī_t}|
7:   J_{t+1} ← I_t ∪ top_{t+1}
8:   y^{t+1} ← H_k(z^{t+1}_{J_{t+1}})
9:   I_{t+1} ← supp(y^{t+1})
10:  x^{t+1}_{I_{t+1}} ← A_{I_{t+1}}\\b,  x^{t+1}_{Ī_{t+1}} ← 0
11: end for

• We provide a general proof for all the algorithms in our partial hard thresholding based family. In particular, we can guarantee recovery using OMPR, under both noiseless and noisy settings, provided δ_{2k} < 0.499. This is the least restrictive δ_{2k} condition under which any efficient sparse recovery method is known to work. Furthermore, our proof technique can be used to provide a general theorem that provides the least restrictive known guarantees for all the two-stage algorithms such as CoSaMP and SP (see Appendix D).

All proofs omitted from the main body of the paper can be found in the appendix.

2 Orthogonal Matching Pursuit with Replacement
Orthogonal matching pursuit (OMP) is a classic iterative algorithm for sparse recovery.
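As a concrete illustration (ours, not the paper's code), Algorithm 1 above can be sketched in a few lines of NumPy. The function name `ompr`, the correlation-based initialization, and the use of a dense `numpy.linalg.lstsq` solve (where the paper suggests an iterative least squares solver) are our own choices for this sketch.

```python
import numpy as np

def ompr(A, b, k, eta=1.0, T=200):
    """Sketch of Algorithm 1 (OMPR): swap at most one support element per
    iteration, then re-solve least squares on the new support."""
    n = A.shape[1]
    # Hypothetical initialization: the k columns most correlated with b.
    I = list(np.argsort(-np.abs(A.T @ b))[:k])
    x = np.zeros(n)
    x[I] = np.linalg.lstsq(A[:, I], b, rcond=None)[0]
    for _ in range(T):
        z = x + eta * (A.T @ (b - A @ x))          # gradient step
        outside = np.setdiff1d(np.arange(n), I)
        j_new = outside[np.argmax(np.abs(z[outside]))]  # best new coordinate
        J = I + [j_new]                            # enlarged support (k+1 entries)
        keep = np.argsort(-np.abs(z[J]))[:k]       # hard-threshold back to size k
        I = [J[i] for i in keep]
        x = np.zeros(n)
        x[I] = np.linalg.lstsq(A[:, I], b, rcond=None)[0]
        if np.linalg.norm(A @ x - b) < 1e-10:      # residual (near) zero: done
            break
    return x
```

With the default step size η = 1 (the setting the paper says needs no tuning), each iteration performs the gradient step, the single replacement, and the least squares re-fit of lines 5-10.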
At every stage, it selects a coordinate to include in the current support set by maximizing the inner product between columns of the measurement matrix A and the current residual b − Ax^t. Once the new coordinate has been added, it solves a least squares problem to fully minimize the error on the current support set. As a result, the residual becomes orthogonal to the columns of A that correspond to the current support set. Thus, the least squares step is also referred to as orthogonalization by some authors [5].

Let us briefly explain some of our notation. We use the MATLAB notation:

A\\b := argmin_x ||Ax − b||^2.

The hard thresholding operator H_k(·) sorts its argument vector in decreasing order (in absolute value) and retains only the top k entries. It is defined formally in the next section. Also, we use subscripts to denote sub-vectors and submatrices, e.g., if I ⊆ [n] is a set of cardinality k and x ∈ R^n, then x_I ∈ R^k denotes the sub-vector of x indexed by I. Similarly, A_I for a matrix A ∈ R^{m×n} denotes the sub-matrix of size m × k with columns indexed by I. The complement of set I is denoted by Ī, and x_Ī denotes the subvector not indexed by I. The support (indices of non-zero entries) of a vector x is denoted by supp(x).

Our new algorithm, called Orthogonal Matching Pursuit with Replacement (OMPR) and shown as Algorithm 1, differs from OMP in two respects. First, the selection of the coordinate to include is based not just on the magnitude of entries in A^T(b − Ax^t) but instead on a weighted combination x^t + ηA^T(b − Ax^t), with the step-size η controlling the relative importance of the two addends. Second, the selected coordinate replaces one of the existing elements in the support, namely the one corresponding to the minimum magnitude entry in the weighted combination mentioned above.
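To make the notation concrete, here is a small NumPy sketch (our code, not the paper's) of the hard thresholding operator H_k described above, together with the partial variant H_k(z; I, l) that Section 3 introduces; the partial version follows the sort-based recipe of Proposition 3 below.

```python
import numpy as np

def hard_threshold(z, k):
    """H_k(z): keep the k largest-magnitude entries of z, zero out the rest."""
    y = np.zeros_like(z)
    top = np.argsort(-np.abs(z))[:k]
    y[top] = z[top]
    return y

def partial_hard_threshold(z, I, k, l):
    """H_k(z; I, l): at most l entries outside I may enter the support.
    Computed as: top-l outside I, union with I, then hard-threshold to k."""
    z = np.asarray(z, dtype=float)
    outside = np.setdiff1d(np.arange(z.size), list(I))
    top = outside[np.argsort(-np.abs(z[outside]))[:l]]
    J = np.union1d(list(I), top).astype(int)   # candidate support I ∪ top
    y = np.zeros_like(z)
    y[J] = hard_threshold(z[J], k)             # threshold the restriction z_J
    return y
```

For l = k the constraint on new elements is vacuous and `partial_hard_threshold` coincides with `hard_threshold`, matching the discussion of the two operators in Section 3.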
Once the support I_{t+1} of the next iterate has been determined, the actual iterate x^{t+1} is obtained by solving the least squares problem:

x^{t+1} = argmin_{x : supp(x) = I_{t+1}} ||Ax − b||^2.

Note that if the matrix A satisfies RIP of order k or larger, the above problem will be well conditioned and can be solved quickly and reliably using an iterative least squares solver. We will show that OMPR, unlike OMP, recovers any k-sparse vector under the RIP based condition δ_{2k} ≤ 0.499. This appears to be the least restrictive recovery condition (i.e., the best known condition) under which any method, be it basis pursuit (ℓ1-minimization) or some iterative algorithm, is guaranteed to recover all k-sparse vectors.

In the literature on sparse recovery, RIP based conditions of an order other than 2k are often provided. It is seldom possible to directly compare two conditions, say, one based on δ_{2k} and the other based on δ_{3k}. Foucart [10] has given a heuristic to compare such RIP conditions based on the number of samples it takes in the Gaussian ensemble to satisfy a given RIP condition. This heuristic says that an RIP condition of the form δ_{ck} < θ is less restrictive if the ratio c/θ^2 is smaller. For the OMPR condition δ_{2k} < 0.499, this ratio is 2/0.499^2 ≈ 8, which makes it heuristically the least restrictive RIP condition for sparse recovery. The following theorems summarize our main results on OMPR.
Theorem 1 (Noiseless Case). Suppose the vector x* ∈ R^n is k-sparse and the matrix A satisfies δ_{2k} < 0.499 and δ_2 < 0.002. Then OMPR converges to an ε-approximate solution (i.e., 1/2 ||Ax − b||^2 ≤ ε) from measurements b = Ax* in O(k log(k/ε)) iterations.
Theorem 2 (Noisy Case). Suppose the vector x* ∈ R^n is k-sparse and the matrix A satisfies δ_{2k} < 0.499 and δ_2 < 0.002. Then OMPR converges to a (C, ε)-approximate solution (i.e.,
1/2 ||Ax − b||^2 ≤ (C/2) ||e||^2 + ε) from measurements b = Ax* + e in O(k log((k + ||e||^2)/ε)) iterations. Here C > 1 is a constant dependent only on δ_{2k}.
The above theorems are special cases of our convergence results for a family of algorithms that contains OMPR as a special case. We now turn our attention to this family. We note that the condition δ_2 < 0.002 is very mild and will typically hold for standard random matrix ensembles as soon as the number of rows sampled is larger than a fixed universal constant.

3 A New Family of Iterative Algorithms
In this section we show that OMPR is one particular member of a family of algorithms parameterized by a single integer l ∈ {1, ..., k}. The l-th member of this family, OMPR (l), shown in Algorithm 2, replaces at most l elements of the current support with new elements. OMPR corresponds to the choice l = 1. Hence, OMPR and OMPR (1) refer to the same algorithm.
Our first result in this section connects the OMPR family to hard thresholding. Given a set I of cardinality k, define the partial hard thresholding operator

H_k(z; I, l) := argmin_{y : ||y||_0 ≤ k, |supp(y)\\I| ≤ l} ||y − z||.  (1)

As is clear from the definition, the above operator tries to find a vector y close to a given vector z under two constraints: (i) the vector y should have bounded support (||y||_0 ≤ k), and (ii) its support should not include more than l new elements outside a given support I.
The name partial hard thresholding operator is justified by the following reasoning. When l = k, the constraint |supp(y)\\I| ≤ l is trivially implied by ||y||_0 ≤ k and hence the operator becomes independent of I. In fact, it becomes identical to the standard hard thresholding operator

H_k(z; I, k) = H_k(z) := argmin_{y : ||y||_0 ≤ k} ||y − z||.
(2)
Even though the definition of H_k(z) seems to involve searching through (n choose k) subsets, it can in fact be computed efficiently by simply sorting the vector z by decreasing absolute value and retaining the top k entries.
The following result shows that even the partial hard thresholding operator is easy to compute. In fact, lines 6-8 in Algorithm 2 precisely compute H_k(z^{t+1}; I_t, l).
Proposition 3. Let |I| = k and z be given. Then y = H_k(z; I, l) can be computed using the sequence of operations

top ← indices of top l elements of |z_Ī|,  J ← I ∪ top,  y ← H_k(z_J).

The proof of this proposition is straightforward and elementary. However, using it, we can now see that the OMPR (l) algorithm has a simple conceptual structure. In each iteration (with current iterate x^t having support I_t = supp(x^t)), we do the following:

1. (Gradient Descent) Form z^{t+1} = x^t − ηA^T(Ax^t − b). Note that A^T(Ax^t − b) is the gradient of the objective function 1/2 ||Ax − b||^2 at x^t.
2. (Partial Hard Thresholding) Form y^{t+1} by partially hard thresholding z^{t+1} using the operator H_k(·; I_t, l).
3. (Least Squares) Form the next iterate x^{t+1} by solving a least squares problem on the support I_{t+1} of y^{t+1}.

A nice property enjoyed by the entire OMPR family is guaranteed sparse recovery under RIP based conditions. Note from below that the condition under which OMPR (l) recovers sparse vectors becomes more restrictive as l increases. This could be an artifact of our analysis since, in experiments, we do not see any degradation in recovery ability as l is increased.

Theorem 4 (Noiseless Case). Suppose the vector x* ∈ R^n is k-sparse. Then OMPR (l) converges to an ε-approximate solution (i.e., 1/2 ||Ax − b||^2 ≤ ε) from measurements b = Ax* in O((k/l) log(k/ε)) iterations provided we choose a step size η that satisfies η(1 + δ_{2l}) < 1 and η(1 − δ_{2k}) > 1/2.
Theorem 5 (Noisy Case). Suppose the vector x* ∈ R^n is k-sparse.
Then OMPR (l) converges to a (C, ε)-approximate solution (i.e., 1/2 ||Ax − b||^2 ≤ (C/2) ||e||^2 + ε) from measurements b = Ax* + e in O((k/l) log((k + ||e||^2)/ε)) iterations provided we choose a step size η that satisfies η(1 + δ_{2l}) < 1 and η(1 − δ_{2k}) > 1/2. Here C > 1 is a constant dependent only on δ_{2l}, δ_{2k}.
Proof. Here we provide a rough sketch of the proof of Theorem 4; the complete proof is given in Appendix A.
Our proof uses the following crucial observation regarding the structure of the vector z^{t+1} = x^t − ηA^T(Ax^t − b). Due to the least squares step of the previous iteration, the current residual Ax^t − b is orthogonal to the columns of A_{I_t}. This means that

z^{t+1}_{I_t} = x^t_{I_t},   z^{t+1}_{Ī_t} = −ηA^T_{Ī_t}(Ax^t − b).  (3)

As the algorithm proceeds, elements come in and move out of the current set I_t. Let us give names to the sets of found and lost elements as we move from I_t to I_{t+1}:

(found): F_t = I_{t+1}\\I_t,   (lost): L_t = I_t\\I_{t+1}.

Hence, using (3) and the updates for y^{t+1}: y^{t+1}_{F_t} = z^{t+1}_{F_t} = −ηA^T_{F_t}A(x^t − x*), and z^{t+1}_{L_t} = x^t_{L_t}. Now let J(x) = 1/2 ||Ax − b||^2; then using upper RIP and the fact that |supp(y^{t+1} − x^t)| = |F_t ∪ L_t| ≤ 2l, we can show that (details are in Appendix A):

J(y^{t+1}) − J(x^t) ≤ ((1 + δ_{2l})/2 − 1/η) ||y^{t+1}_{F_t}||^2 + (1/(2η)) ||x^t_{L_t}||^2.  (4)

Furthermore, since y^{t+1} is chosen based on the k largest entries in z^{t+1}_{J_{t+1}}, we have ||y^{t+1}_{F_t}||^2 = ||z^{t+1}_{F_t}||^2 ≥ ||z^{t+1}_{L_t}||^2 = ||x^t_{L_t}||^2. Plugging this into (4), we get:

J(y^{t+1}) − J(x^t) ≤ ((1 + δ_{2l})/2 − 1/(2η)) ||y^{t+1}_{F_t}||^2.  (5)

Since J(x^{t+1}) ≤ J(y^{t+1}) ≤ J(x^t), the above expression shows that if η < 1/(1 + δ_{2l}), then our method monotonically decreases the objective function and converges to a local optimum even if RIP is not satisfied (note that the upper RIP bound is independent of the lower RIP bound, and can always be satisfied by normalizing the matrix appropriately). However, to prove convergence to the global optimum, we need to show that at least one new element is added at each step, i.e., |F_t| ≥ 1.
Furthermore, we need to show sufficient decrease, i.e., ||y^{t+1}_{F_t}||^2 ≥ (l/k) c J(x^t). We show both these conditions for global convergence in Lemma 6, whose proof is given in Appendix A.
Lemma 6. Let δ_{2k} < 1 − 1/(2η) and 1/2 < η < 1. Then, assuming J(x^t) > 0, at least one new element is found, i.e., F_t ≠ ∅. Furthermore, ||y^{t+1}_{F_t}||^2 ≥ (l/k) c J(x^t), where c = min(4η(1 − η), 2(2η − 1/(1 − δ_{2k}))) > 0 is a constant.

Assuming Lemma 6, (5) shows that at each iteration OMPR (l) reduces the objective function value by at least a constant fraction. Furthermore, if x^0 is chosen to have entries bounded by 1, then J(x^0) ≤ (1 + δ_{2k})k. Hence, after O((k/l) log(k/ε)) iterations, the optimal solution x* is obtained within ε error. □

Special Cases: We have already observed that the OMPR algorithm of the previous section is simply OMPR (1). Also note that Theorem 1 immediately follows from Theorem 4.
The algorithm at the other extreme of l = k has appeared at least three times in the recent literature: as Iterative (hard) Thresholding with Inversion (ITI) in [16], as SVP-Newton (in its matrix avatar) in [15], and as Hard Thresholding Pursuit (HTP) in [10]. Let us call it IHT-Newton, as the least squares step can be viewed as a Newton step for the quadratic objective. The above general result for the OMPR family immediately implies that it recovers sparse vectors as soon as the measurement matrix A satisfies δ_{2k} < 1/3.
Corollary 7. Suppose the vector x* ∈ R^n is k-sparse and the matrix A satisfies δ_{2k} < 1/3. Then IHT-Newton recovers x* from measurements b = Ax* in O(log(k)) iterations.

4 Tighter Analysis of Two-Stage Hard Thresholding Algorithms
Recently, Maleki and Donoho [17] proposed a novel family of algorithms, namely two-stage hard thresholding algorithms. During each iteration, these algorithms add a fixed number (say l) of elements to the current iterate's support set.
A least squares problem is solved over the larger support set, and then the l elements with smallest magnitude are dropped to form the next iterate's support set. The next iterate is then obtained by again solving the least squares problem over the next iterate's support set. See Appendix D for a more detailed description of the algorithm.

Using proof techniques developed for our proof of Theorem 4, we can obtain a simple proof for the entire spectrum of algorithms in the two-stage hard thresholding family.
Theorem 8. Suppose the vector x* ∈ {−1, 0, 1}^n is k-sparse. Then the Two-stage Hard Thresholding algorithm with replacement size l recovers x* from measurements b = Ax* in O(k) iterations provided δ_{2k+l} ≤ 0.35.
Note that CoSaMP [19] and Subspace Pursuit (SP) [4] are popular special cases of the two-stage family. Using our general analysis, we are able to provide significantly less restrictive RIP conditions for recovery.
Corollary 9. CoSaMP [19] recovers k-sparse x* ∈ {−1, 0, 1}^n from measurements b = Ax* provided δ_{4k} ≤ 0.35.
Corollary 10. Subspace Pursuit [4] recovers k-sparse x* ∈ {−1, 0, 1}^n from measurements b = Ax* provided δ_{3k} ≤ 0.35.
Note that CoSaMP's analysis given by [19] requires δ_{4k} ≤ 0.1 while Subspace Pursuit's analysis given by [4] requires δ_{3k} ≤ 0.205. See Appendix D in the supplementary material for proofs of the above theorem and corollaries.

5 Fast Implementation Using Hashing
In this section, we discuss a fast implementation of the OMPR method using locality-sensitive hashing. The main intuition behind our approach is that the OMPR method selects at most one element at each step (given by argmax_i |A_i^T(Ax^t − b)|); hence, selection of the topmost element is equivalent to finding the column A_i that is most \"similar\" (in magnitude) to r_t = Ax^t − b, i.e., this may be viewed as a similarity search task for queries of the form r_t and −r_t over the database of column vectors {A_1, ..., A_n}.
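Concretely, the selection step just described is a maximum-inner-product search over the columns of A; a brute-force version (our illustration, costing O(mn) per query) is a single matrix-vector product, and it is exactly this search that hashing is used to approximate.

```python
import numpy as np

def select_coordinate(A, x, b):
    """Brute-force version of the OMPR selection step: the index of the
    column of A most 'similar' in magnitude to the residual r = Ax - b."""
    r = A @ x - b
    return int(np.argmax(np.abs(A.T @ r)))
```

Replacing this exhaustive scan with an approximate similarity search is what yields the sub-linear query time discussed next.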
To this end, we use locality sensitive hashing (LSH) [12], a well known data structure for approximate nearest-neighbor retrieval. Note that while LSH is designed for nearest neighbor search (in terms of Euclidean distances) and in general might not have any guarantees for the similar neighbor search task, we are still able to apply it to our task because we can lower-bound the similarity of the most similar neighbor.

We first briefly describe the LSH scheme that we use. LSH generates hash bits for a vector using randomized hash functions that have the property that the probability of collision between two vectors is proportional to the similarity between them. For our problem, we use the following hash function: h_u(a) = sign(u^T a), where u ∼ N(0, I) is a random hyperplane generated from the standard multivariate Gaussian distribution. It can be shown [13] that

Pr[h_u(a_1) = h_u(a_2)] = 1 − (1/π) cos^{−1}(a_1^T a_2 / (||a_1|| ||a_2||)).

Now, an s-bit hash key is created by randomly sampling hash functions h_{u_1}, ..., h_{u_s}, i.e., g(a) = [h_{u_1}(a), h_{u_2}(a), ..., h_{u_s}(a)], where each u_i is sampled randomly from the standard multivariate Gaussian distribution. Next, q hash tables are constructed during the pre-processing stage using independently constructed hash key functions g_1, g_2, ..., g_q. During the query stage, a query is indexed into each hash table using the hash-key functions g_1, g_2, ..., g_q, and the near neighbors are then retrieved by doing an exhaustive search over the indexed elements. Below we state the following theorem from [12] that guarantees sub-linear time nearest neighbor retrieval for LSH.
Theorem 11. Let s = O(log n) and q = O(log(1/δ) n^{1/(1+ε)}); then with probability 1 − δ, LSH recovers a (1 + ε)-near neighbor, i.e., ||ā − r||^2 ≤ (1 + ε) ||a* − r||^2, where a* is the nearest neighbor to r and ā is the point retrieved by LSH.
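A minimal sketch (our code) of the sign-random-projection hash h_u(a) = sign(u^T a) described above; empirically, the fraction of agreeing bits between two columns approaches the collision probability 1 − cos^{−1}(a_1^T a_2 / (||a_1|| ||a_2||))/π quoted from [13].

```python
import numpy as np

def srp_hash_bits(X, s, rng):
    """s hash bits per column of X from s random Gaussian hyperplanes:
    bit i of column a is sign(u_i^T a), stored as 0/1."""
    U = rng.standard_normal((s, X.shape[0]))   # rows are hyperplane normals u_i
    return (U @ X >= 0).astype(np.uint8)       # shape (s, num_vectors)
```

An s-bit key g(a) is then one column of the returned matrix; q independent keys give the q hash tables.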
However, we cannot directly use the above theorem to guarantee convergence of our hashing based OMPR algorithm, as our algorithm requires finding the most similar point in terms of the magnitude of the inner product. Below, we provide appropriate settings of the LSH parameters to guarantee sub-linear time convergence of our method under a slightly weaker condition on the RIP constant. A detailed proof of the theorem below can be found in Appendix B.
Theorem 12. Let δ_{2k} < 1/4 − γ and η = 1 − γ, where γ > 0 is a small constant. Then, with probability 1 − δ, OMPR with hashing converges to the optimal solution in O(k m n^{1/(1+O(1/k))} log(k/δ)) computational steps.
The above theorem shows that the time complexity is sub-linear in n. However, currently our guarantees are not particularly strong, as for large k the exponent of n will be close to 1. We believe that the exponent can be improved by a more careful analysis, and our empirical results indicate that LSH does speed up the OMPR method significantly.

(a) OMPR  (b) OMP  (c) IHT-Newton
Figure 1: Phase transition diagrams for different methods. Red represents high probability of success while blue represents low probability of success. Clearly, OMPR recovers the correct solution for a much larger region of the plot than OMP and is comparable to IHT-Newton. (Best viewed in color)

6 Experimental Results
In this section we present empirical results to demonstrate accurate and fast recovery by our OMPR method. In the first set of experiments, we present a phase transition diagram for OMPR and compare it to the phase transition diagrams of OMP and IHT-Newton with step size 1. For the second set of experiments, we demonstrate the robustness of OMPR compared to many existing methods when measurements are noisy or smaller in number than what is required for exact recovery.
For the third set of experiments, we demonstrate the efficiency of our LSH based implementation by comparing the recovery error and time required for our method with OMP and IHT-Newton (with step-size 1 and 1/2). We do not present results for the ℓ1/basis pursuit methods, as it has already been shown in several recent papers [10, 17] that the ℓ1 relaxation based methods are relatively inefficient for very large scale recovery problems.
In all the experiments we generate the measurement matrix by sampling each entry independently from the standard normal distribution N(0, 1) and then normalize each column to have unit norm. The underlying k-sparse vectors are generated by randomly selecting a support set of size k and then sampling each entry in the support set uniformly from {+1, −1}. We use our own optimized implementations of OMP and IHT-Newton. All the methods are implemented in MATLAB and our hashing routine uses mex files.
6.1 Phase Transition Diagrams
We first compare the different methods using phase transition diagrams, which are commonly used in the compressed sensing literature to compare methods [17]. We first fix the number of measurements to be m = 400 and generate different problem sizes by varying ρ = k/m and δ = m/n. For each problem size (m, n, k), we generate random m × n Gaussian measurement matrices and k-sparse random vectors. We then estimate the probability of success of each method by applying it to 100 randomly generated instances. A method is considered successful for a particular instance if it recovers the underlying k-sparse vector with at most 1% relative error.
In Figure 1, we show the phase transition diagram of our OMPR method as well as those of OMP and IHT-Newton (with step size 1). The plots show the probability of successful recovery as a function of ρ = k/m and δ = m/n.
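The data-generation protocol just described can be sketched as follows (our code; the function name and RNG choice are ours, not from the paper):

```python
import numpy as np

def make_instance(m, n, k, rng):
    """Random compressed-sensing instance as in the experiments:
    N(0,1) measurement matrix with unit-norm columns, random ±1 k-sparse signal."""
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                  # normalize each column
    support = rng.choice(n, size=k, replace=False)  # random support of size k
    x = np.zeros(n)
    x[support] = rng.choice([-1.0, 1.0], size=k)    # uniform ±1 entries
    return A, x, A @ x                              # measurements b = A x
```

Sweeping (m, n, k) over a grid of ρ = k/m and δ = m/n values and recording the recovery success rate over repeated draws reproduces the phase-transition protocol described above.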
Figure 1(a) shows the color coding of the different success probabilities; red represents high probability of success while blue represents low probability of success. Note that for Gaussian measurement matrices, the RIP constant δ_{2k} is less than a fixed constant if and only if m = Ck log(n/k), where C is a universal constant. This implies that 1/ρ = C log(1/(δρ)), and hence a method that recovers for high δ_{2k} will have a large fraction of the phase transition diagram where the successful recovery probability is high. We observe this phenomenon for both the OMPR and IHT-Newton methods, which is consistent with their respective theoretical guarantees (see Theorem 4). On the other hand, as expected, the phase transition diagram of OMP has a negligible fraction of the plot that shows high recovery probability.
6.2 Performance for Noisy or Under-sampled Observations
Next, we empirically compare the performance of OMPR to various existing compressed sensing methods. As shown in the phase transition diagrams in Figure 1, OMPR provides recovery comparable to the IHT-Newton method in the noiseless case. Here, we show that OMPR is fairly robust under the noisy setting as well as in the case of under-sampled observations, where the number of observations is much smaller than what is required for exact recovery.
For this experiment, we generate a random Gaussian measurement matrix of size m = 200, n = 3000. We then generate a random binary vector x of sparsity k and add Gaussian noise to the measurements. Figure 2(a) shows the recovery error (||Ax − b||) incurred by various methods for increasing k and a noise level of 10%. Clearly, our method outperforms the existing methods, perhaps a consequence of guaranteed convergence to a local minimum for fixed step size η = 1. Similarly, Figure 2(b) shows the recovery error incurred by various methods for fixed k = 50 and varying noise level. Here again, our method outperforms existing methods and is more robust to noise.
Finally, in Figure 2(c) we show the difference in

[Figure 2: Error in recovery. Panel (a): recovery error vs. sparsity k at 10% noise; panel (b): recovery error vs. noise level for k = 50; panel (c): a table of results for noise levels 0.00-0.50. Methods compared include OMPR, IHT-Newton, SP and CoSaMP. The plot and table data are not recoverable from the extraction; text truncated.]