{"title": "Sparse PCA via Covariance Thresholding", "book": "Advances in Neural Information Processing Systems", "page_first": 334, "page_last": 342, "abstract": "In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension $n\\times p$ and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here that the principal components $\\bv_1,\\dots,\\bv_r$ have at most $k_1, \\cdots, k_q$ non-zero entries respectively, and study the high-dimensional regime in which $p$ is of the same order as $n$. In an influential paper, Johnstone and Lu \\cite{johnstone2004sparse} introduced a simple algorithm that estimates the support of the principal vectors $\\bv_1,\\dots,\\bv_r$ by the largest entries in the diagonal of the empirical covariance. This method can be shown to succeed with high probability if $k_q \\le C_1\\sqrt{n/\\log p}$, and to fail with high probability if $k_q\\ge C_2 \\sqrt{n/\\log p}$ for two constants $0 < C_1,C_2 < \\infty$. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler and Vilenchik \\cite{KrauthgamerSPCA}. We confirm empirical evidence presented by these authors and rigorously prove that the algorithm succeeds with high probability for $k$ of order $\\sqrt{n}$. Recent conditional lower bounds \\cite{berthet2013computational} suggest that it might be impossible to do significantly better. 
The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before.", "full_text": "Sparse PCA via Covariance Thresholding

Yash Deshpande
Electrical Engineering
Stanford University
yashd@stanford.edu

Andrea Montanari
Electrical Engineering and Statistics
Stanford University
montanari@stanford.edu

Abstract

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension n × p and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here that the principal components v1, . . . , vr have at most k1, . . . , kr non-zero entries respectively, and study the high-dimensional regime in which p is of the same order as n.
In an influential paper, Johnstone and Lu [JL04] introduced a simple algorithm that estimates the support of the principal vectors v1, . . . , vr by the largest entries in the diagonal of the empirical covariance. This method can be shown to succeed with high probability if kq ≤ C1 √(n/log p), and to fail with high probability if kq ≥ C2 √(n/log p), for two constants 0 < C1, C2 < ∞. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees.
Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler and Vilenchik [KNV13]. We confirm empirical evidence presented by these authors and rigorously prove that the algorithm succeeds with high probability for k of order √n. 
Recent conditional lower bounds [BR13] suggest that it might be impossible to do significantly better.
The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before.

1 Introduction

In the spiked covariance model proposed by [JL04], we are given data x1, x2, . . . , xn with xi ∈ R^p of the form

    xi = Σ_{q=1}^{r} √βq uq,i vq + zi .    (1)

Here v1, . . . , vr ∈ R^p is a set of orthonormal vectors that we want to estimate, while uq,i ∼ N(0, 1) and zi ∼ N(0, I_p) are independent and identically distributed. The quantity βq ∈ R_{>0} quantifies the signal-to-noise ratio. (Throughout the paper, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters.) We are interested in the high-dimensional limit n, p → ∞ with lim_{n→∞} p/n = α ∈ (0, ∞). In the rest of this introduction we will refer to the rank-one case, in order to simplify the exposition, and drop the subscript q ∈ {1, 2, . . . , r}. Our results and proofs hold for general bounded rank.
The standard method of principal component analysis involves computing the sample covariance matrix G = n⁻¹ Σ_{i=1}^{n} xi xiᵀ and estimating v = v1 by its principal eigenvector v_PC(G). It is a well-known fact that, in the high-dimensional asymptotic p/n → α > 0, this yields an inconsistent estimate [JL09]. Namely, ‖v_PC − v‖₂ does not converge to 0 in the high-dimensional asymptotic limit, unless α → 0 (i.e. p/n → 0). Even worse, Baik, Ben Arous and Péché [BBAP05] and Paul [Pau07] demonstrate a phase transition phenomenon: if β < √α the estimate is asymptotically orthogonal to the signal, ⟨v_PC, v⟩ → 0. 
On the other hand, for β > √α, ⟨v_PC, v⟩ remains strictly positive as n, p → ∞. This phase transition phenomenon has attracted considerable attention recently within random matrix theory [FP07, CDMF09, BGN11, KY13].
These inconsistency results motivated several efforts to exploit additional structural information on the signal v. In two influential papers, Johnstone and Lu [JL04, JL09] considered the case of a signal v that is sparse in a suitable basis, e.g. in the wavelet domain. Without loss of generality, we will assume here that v is sparse in the canonical basis e1, . . . , ep. In a nutshell, [JL09] proposes the following:

1. Order the diagonal entries of the Gram matrix, G_{i(1),i(1)} ≥ G_{i(2),i(2)} ≥ · · · ≥ G_{i(p),i(p)}, and let J ≡ {i(1), i(2), . . . , i(k)} be the set of indices corresponding to the k largest entries.
2. Set to zero all the entries G_{i,j} of G unless i, j ∈ J, and estimate v with the principal eigenvector of the resulting matrix.

Johnstone and Lu formalized the sparsity assumption by requiring that v belong to a weak ℓq-ball with q ∈ (0, 1). Instead, here we consider a strict sparsity constraint whereby v has exactly k non-zero entries, with magnitudes bounded below by θ/√k for some constant θ > 0. The case θ = 1 was studied by Amini and Wainwright in [AW09].
Within this model, it was proved that diagonal thresholding successfully recovers the support of v provided v is sparse enough, namely k ≤ C √(n/log p), with C = C(α, β) a constant [AW09]. (Throughout the paper we denote by C constants that can change from point to point.) This result is a striking improvement over vanilla PCA. 
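The two-step procedure of [JL09] above translates directly into a few lines of code. The following is an illustrative NumPy sketch, not the authors' implementation; the toy rank-one data generator and the values of n, p, k, β are assumptions chosen for the example.

```python
import numpy as np

def diagonal_thresholding(X, k):
    """Sketch of Johnstone-Lu diagonal thresholding: keep the k coordinates
    with the largest diagonal entries of the Gram matrix, then take the
    principal eigenvector of the restricted Gram matrix."""
    n, p = X.shape
    G = X.T @ X / n                      # Gram (empirical covariance) matrix
    J = np.argsort(np.diag(G))[-k:]      # indices of the k largest diagonal entries
    G_J = np.zeros_like(G)
    G_J[np.ix_(J, J)] = G[np.ix_(J, J)]  # zero out all entries outside J x J
    _, V = np.linalg.eigh(G_J)
    return V[:, -1]                      # principal eigenvector (eigh sorts ascending)

# Toy rank-one spiked model x_i = sqrt(beta) * u_i * v + z_i (sizes are assumptions).
rng = np.random.default_rng(0)
n, p, k, beta = 2000, 1000, 10, 3.0
v = np.zeros(p)
v[:k] = 1 / np.sqrt(k)                   # k-sparse unit-norm spike
X = np.sqrt(beta) * rng.standard_normal((n, 1)) * v + rng.standard_normal((n, p))
v_hat = diagonal_thresholding(X, k)
print(abs(v_hat @ v))                    # overlap |<v_hat, v>|; near 1 when the support is found
```

In this regime (k well below √(n/log p)) the support indices stand out on the diagonal, so the restricted eigenvector aligns closely with the spike.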
While vanilla PCA requires a number of samples scaling as the number of parameters(2), n ≳ p, sparse PCA via diagonal thresholding achieves the same objective with a number of samples scaling as the number of non-zero parameters, n ≳ k² log p.
At the same time, this result is not as optimistic as might have been expected. By searching exhaustively over all possible supports of size k (a method that has complexity of order p^k), the correct support can be identified with high probability as soon as n ≳ k log p. On the other hand, no method can succeed for much smaller n, because of information-theoretic obstructions [AW09].
Over the last ten years, a significant effort has been devoted to developing practical algorithms that outperform diagonal thresholding, see e.g. [MWA05, ZHT06, dEGJL07, dBG08, WTH09]. In particular, d'Aspremont et al. [dEGJL07] developed a promising M-estimator based on a semidefinite programming (SDP) relaxation. Amini and Wainwright [AW09] carried out an analysis of this method and proved that, if (i) k ≤ C(β) n/log p, and (ii) the SDP solution has rank one, then the SDP relaxation provides a consistent estimator of the support of v.
At first sight, this appears to be a satisfactory solution of the original problem. No procedure can estimate the support of v from fewer than k log p samples, and the SDP relaxation succeeds in doing it from –at most– a constant factor more samples. This picture was upset by a recent, remarkable result by Krauthgamer, Nadler and Vilenchik [KNV13], who showed that the rank-one condition assumed by Amini and Wainwright does not hold for √n ≲ k ≲ (n/log p). 
This result is consistent with recent work of Berthet and Rigollet [BR13] demonstrating that sparse PCA cannot be performed in polynomial time in the regime k ≳ √n, under a certain computational complexity conjecture for the so-called planted clique problem.

(2) Throughout the introduction, we write f(n) ≳ g(n) as shorthand for 'f(n) ≥ C g(n) for some constant C = C(β, α)'. Further, C denotes a constant that may change from point to point.

In summary, the problem of support recovery in sparse PCA demonstrates a fascinating interplay between computational and statistical barriers.

From a statistical perspective, and disregarding computational considerations, the support of v can be estimated consistently if and only if k ≲ n/log p. This can be done, for instance, by exhaustive search over all the (p choose k) possible supports of v. (See [VL12, CMW+13] for a minimax analysis.)

From a computational perspective, the problem appears to be much more difficult. There is rigorous evidence [BR13, MW13] that no polynomial algorithm can reconstruct the support unless k ≲ √n. On the positive side, a very simple algorithm (Johnstone and Lu's diagonal thresholding) succeeds for k ≲ √(n/log p).

Of course, several elements are still missing in this emerging picture. In the present paper we address one of them, providing an answer to the following question:

    Is there a polynomial time algorithm that is guaranteed to solve the sparse PCA problem with high probability for √(n/log p) ≲ k ≲ √n?

We answer this question positively by analyzing a covariance thresholding algorithm that proceeds, briefly, as follows. (A precise, general definition, with some technical changes, is given in the next section.)

1. 
Form the Gram matrix G and set to zero all its entries that are in modulus smaller than τ/√n, for τ a suitably chosen constant.
2. Compute the principal eigenvector v̂1 of this thresholded matrix.
3. Denote by B ⊆ {1, . . . , p} the set of indices corresponding to the k largest entries of v̂1.
4. Estimate the support of v by 'cleaning' the set B. (Briefly, v is estimated by thresholding G v̂_B, with v̂_B obtained by zeroing the entries outside B.)

Such a covariance thresholding approach was proposed in [KNV13], and is in turn related to earlier work by Bickel and Levina [BL08]. The formulation discussed in the next section presents some technical differences that have been introduced to simplify the analysis. Notice that, to simplify proofs, we assume k to be known: this issue is discussed in the next two sections.
The rest of the paper is organized as follows. In the next section we provide a detailed description of the algorithm and state our main results. Our theoretical results assume full knowledge of the problem parameters for ease of proof. In light of this, in Section 3 we discuss a practical implementation of the same idea that does not require prior knowledge of the problem parameters, and is entirely data-driven. We also illustrate the method through simulations. The complete proofs are available in the accompanying supplement, in Sections 1, 2 and 3 respectively.

2 Algorithm and main result

For notational convenience, we shall assume hereafter that 2n sample vectors are given (instead of n): {xi}_{1≤i≤2n}. These are distributed according to the model (1). The number of spikes r will be treated as a known parameter in the problem.
We will make the following assumptions:

A1 The number of spikes r and the signal strengths β1, . . . , βr are fixed as n, p → ∞. Further, β1 > β2 > · · · > βr are all distinct.

A2 Let Qq and kq denote the support of vq and its size, respectively. We let Q = ∪_q Qq and k = Σ_q kq throughout. Then the non-zero entries of the spikes satisfy |vq,i| ≥ θ/√kq for all i ∈ Qq, for some θ > 0. Further, for any q, q′ we assume |vq,i / vq′,i| ≤ γ for every i ∈ Qq ∩ Qq′, for some constant γ > 1.

As before, we are interested in the high-dimensional limit n, p → ∞ with p/n → α. A more detailed description of the covariance thresholding algorithm for the general model (1) is given in Algorithm 1. We describe the basic intuition for the simpler rank-one case (omitting the subscript q ∈ {1, 2, . . . , r}), while stating results in generality.
We start by splitting the data into two halves: (xi)_{1≤i≤n} and (xi)_{n<i≤2n}; the first half is used to compute G and Σ̂ ≡ G − I_p. As discussed above, for β > √α the principal eigenvector of G, and hence of Σ̂, is such that lim_{n→∞} ⟨v̂1(Σ̂), v⟩ > 0. However, for β < √α, the noise component in Σ̂ dominates and the two vectors become asymptotically orthogonal, i.e. lim_{n→∞} ⟨v̂1(Σ̂), v⟩ = 0. In order to reduce the noise level, we exploit the sparsity of the spike v. Denoting by X ∈ R^{n×p} the matrix with rows x1, . . . , xn, by Z ∈ R^{n×p} the matrix with rows z1, . . . , zn, and letting u = (u1, u2, . . . , un), the model (1) can be rewritten as

    X = √β u vᵀ + Z .    (2)

Hence, letting β′ ≡ β‖u‖²/n ≈ β and w ≡ √β Zᵀu/n,

    Σ̂ = β′ vvᵀ + v wᵀ + w vᵀ + (1/n) ZᵀZ − I_p .    (3)

For a moment, let us neglect the cross terms (vwᵀ + wvᵀ). 
The 'signal' component β′ vvᵀ is sparse, with k² entries of magnitude β/k, which (in the regime of interest, k = √n/C) is equivalent to Cβ/√n. The 'noise' component ZᵀZ/n − I_p is dense, with entries of order 1/√n. Assuming k/√n a small enough constant, it should be possible to remove most of the noise by thresholding the entries at a level of order 1/√n. For technical reasons, we use the soft thresholding function η : R × R_{≥0} → R,

    η(z; τ) = sgn(z) (|z| − τ)₊ .

We will omit the second argument wherever it is clear from context. Classical denoising theory [DJ94, Joh02] provides upper bounds on the estimation error of such a procedure. Note however that these results fall short of our goal. Classical theory measures the estimation error by (element-wise) ℓp norm, while here we are interested in the resulting principal eigenvector. This would require bounding, for instance, the error in operator norm.
Since the soft thresholding function η(z; τ/√n) is affine when z ≫ τ/√n, we would expect that the following decomposition holds approximately (for instance, in operator norm):

    η(Σ̂) ≈ η(β′ vvᵀ) + η( (1/n) ZᵀZ − I_p ) .    (4)

The main technical challenge now is to control the operator norm of the perturbation η(ZᵀZ/n − I_p). It is easy to see that η(ZᵀZ/n − I_p) has entries of variance δ(τ)/n, with δ(τ) → 0 as τ → ∞. If the entries were independent with mild decay, this would imply –with high probability–

    ‖ η( (1/n) ZᵀZ − I_p ) ‖₂ ≲ C δ(τ) ,    (5)

for some constant C. 
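To make the noise-suppression effect of Eq. (5) concrete, the following NumPy sketch applies the soft thresholding function entry-wise to a pure-noise matrix ZᵀZ/n − I_p and compares operator norms before and after. The sizes n, p and the choice τ = 4 are illustrative assumptions, not the constants prescribed by the analysis.

```python
import numpy as np

def eta(M, tau):
    """Entry-wise soft thresholding: eta(z; tau) = sgn(z) * (|z| - tau)_+."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

rng = np.random.default_rng(1)
n, p, tau = 2000, 1000, 4.0                    # illustrative sizes and threshold constant
Z = rng.standard_normal((n, p))
noise = Z.T @ Z / n - np.eye(p)                # the 'noise' component Z^T Z / n - I_p

norm_before = np.linalg.norm(noise, 2)         # operator (spectral) norm, no thresholding
norm_after = np.linalg.norm(eta(noise, tau / np.sqrt(n)), 2)
print(norm_before, norm_after)                 # thresholding sharply reduces the operator norm
```

The entries of the noise matrix have standard deviation of order 1/√n, so thresholding at τ/√n with a moderate τ zeroes out the vast majority of them, and the operator norm drops accordingly.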
Further, the first component in the decomposition (4) is still approximately rank one, with norm of order β′ ≈ β. Consequently, using standard linear algebra results on the perturbation of eigenspaces [DK70], we obtain an error bound ‖v̂ − v‖ ≲ δ(τ)/C′β.
Our first theorem formalizes this intuition and provides a bound on the estimation error in the principal components of η(Σ̂).

Theorem 1. Under the spiked covariance model Eq. (1) satisfying Assumption A1, let v̂q denote the qth eigenvector of η(Σ̂) using threshold τ. For every α, (βq)_{q=1}^{r} ∈ (0, ∞), integer r and every ε > 0 there exist constants τ = τ(ε, α, (βq)_{q=1}^{r}, r, θ) < ∞ and c* = c*(ε, α, (βq)_{q=1}^{r}, r, θ) > 0 such that, if Σ_q kq = Σ_q |supp(vq)| ≤ c*√n, then

    P{ min( ‖v̂q − vq‖, ‖v̂q + vq‖ ) ≤ ε for all q ∈ {1, . . . , r} } ≥ 1 − α/n⁴ .    (6)

Algorithm 1 Covariance Thresholding
1: Input: Data (xi)_{1≤i≤2n}; parameters r, (kq)_{q≤r} ∈ N, θ, τ, ρ ∈ R_{≥0};
2: Compute the Gram matrices G ≡ Σ_{i=1}^{n} xi xiᵀ/n and G′ ≡ Σ_{i=n+1}^{2n} xi xiᵀ/n;
3: Compute Σ̂ = G − I_p (resp. Σ̂′ = G′ − I_p);
4: Compute the matrix η(Σ̂) by soft-thresholding the entries of Σ̂:
       η(Σ̂)_{ij} = Σ̂_{ij} − τ/√n  if Σ̂_{ij} ≥ τ/√n;
       η(Σ̂)_{ij} = 0              if −τ/√n < Σ̂_{ij} < τ/√n;
       η(Σ̂)_{ij} = Σ̂_{ij} + τ/√n  if Σ̂_{ij} ≤ −τ/√n;
5: Let (v̂q)_{q≤r} be the first r eigenvectors of η(Σ̂);
6: Define sq ∈ R^p by sq,i = v̂q,i I(|v̂q,i| ≥ θ/(2√kq));
7: Let Q̂ = {i ∈ [p] : ∃ q s.t. |(Σ̂′ sq)_i| ≥ ρ};
8: Output: Q̂.

Random matrices of the type η(ZᵀZ/n − I_p) are called inner-product kernel random matrices and have attracted recent interest within probability theory [EK10a, EK10b, CS12]. In this literature, the asymptotic eigenvalue distribution of a matrix f(ZᵀZ/n) is the object of study. Here f : R → R is a kernel function and is applied entry-wise to the matrix ZᵀZ/n, with Z a matrix as above. Unfortunately, these results cannot be applied to our problem for the following reasons:

• The results [EK10a, EK10b] are perturbative and provide conditions under which the spectrum of f(ZᵀZ/n) is close to a rescaling of the spectrum of (ZᵀZ/n) (with rescaling factors depending on the Taylor expansion of f close to 0). We are interested instead in a non-perturbative regime in which the spectrum of f(ZᵀZ/n) is very different from the one of (ZᵀZ/n) (and the Taylor expansion is trivial).

• [CS12] consider n-dependent kernels, but their results are asymptotic and concern the weak limit of the empirical spectral distribution of f(ZᵀZ/n). 
This does not yield an upper bound on the spectral norm(3) of f(ZᵀZ/n).

(3) Note that [CS12] also provide a finite-n bound for the spectral norm of f(ZᵀZ/n) via the moment method, but this bound diverges with n and does not give a result of the type of Eq. (5).

Our approach to prove Theorem 1 follows instead the so-called ε-net method: we develop high-probability bounds on the maximum Rayleigh quotient:

    max_{y ∈ S^{p−1}} ⟨y, η(ZᵀZ/n) y⟩ = max_{y ∈ S^{p−1}} Σ_{i,j} η( ⟨z̃i, z̃j⟩/n ; τ/√n ) yi yj .

Here S^{p−1} = {y ∈ R^p : ‖y‖ = 1} is the unit sphere and z̃i denote the columns of Z. Since η(ZᵀZ/n) is not Lipschitz continuous in the underlying Gaussian variables Z, concentration does not follow immediately from Gaussian isoperimetry. We have to develop more careful (non-uniform) bounds on the gradient of η(ZᵀZ/n) and show that they imply concentration as required.
While Theorem 1 guarantees that v̂ is a reasonable estimate of the spike v in ℓ2 distance (up to a sign flip), it does not provide a consistent estimator of its support. This brings us to the second phase of our algorithm. Although v̂ is not even expected to be sparse, it is easy to see that the largest entries of v̂ should have significant overlap with supp(v). Steps 6, 7 and 8 of the algorithm exploit this property to construct a consistent estimator Q̂ of the support of the spike v. Our second theorem guarantees that this estimator is indeed consistent.

Theorem 2. Consider the spiked covariance model of Eq. (1) satisfying Assumptions A1, A2. For any α, (βq)_{q≤r} ∈ (0, ∞), θ, γ > 0 and integer r, there exist constants c*, τ, ρ dependent on α, (βq)_{q≤r}, γ, θ, r, such that, if Σ_q kq = Σ_q |supp(vq)| ≤ c*√n, the Covariance Thresholding algorithm of Algorithm 1 recovers the joint supports of vq with high probability. Explicitly, there exists a constant C > 0 such that

    P{ Q̂ = ∪_q supp(vq) } ≥ 1 − C/n⁴ .    (7)

Given the results above, it is useful to pause for a few remarks.
Remark 2.1. We focus on consistent estimation of the joint support ∪_q supp(vq) of the spikes. In the rank-one case, this obviously corresponds to standard support recovery. Once this is accomplished, estimating the individual supports poses no additional difficulty: indeed, since |∪_q supp(vq)| = O(√n), an extra step with n fresh samples xi restricted to Q̂ yields estimates for vq with ℓ2 error of order √(k/n). This implies consistent estimation of supp(vq) when the vq have entries bounded below as in Assumption A2.
Remark 2.2. Recovering the signed supports Qq,+ = {i ∈ [p] : vq,i > 0} and Qq,− = {i ∈ [p] : vq,i < 0} is possible using the same technique as recovering the supports supp(vq) above, and poses no additional difficulty.
Remark 2.3. Assumption A2 requires |vq,i| ≥ θ/√kq for all i ∈ Qq. This is a standard requirement in the support recovery literature [Wai09, MB06]. The second part of Assumption A2 guarantees that when the supports of two spikes overlap, their entries are roughly of the same order. This is necessary for our proof technique to go through. 
Avoiding such an assumption altogether remains an open question.

Our covariance thresholding algorithm assumes knowledge of the correct support sizes kq. Notice that the same assumption is made in earlier theoretical work, e.g. in the analysis of SDP-based reconstruction by Amini and Wainwright [AW09]. While this assumption is not realistic in applications, it helps to focus our exposition on the most challenging aspects of the problem. Further, a ballpark estimate of kq (indeed, of Σ_q kq) is actually sufficient, with which we use the following steps in lieu of Steps 7, 8 of Algorithm 1.

7: Define s′q ∈ R^p by

    s′q,i = v̂q,i if |v̂q,i| > θ/(2√k0), and s′q,i = 0 otherwise.    (8)

8: Return

    Q̂ = ∪_q {i ∈ [p] : |(Σ̂′ s′q)_i| ≥ ρ} .    (9)

The next theorem shows that this procedure is effective even if k0 overestimates Σ_q kq by an order of magnitude. Its proof is deferred to Section 2.

Theorem 3. Consider the spiked covariance model of Eq. (1). For any α, β ∈ (0, ∞), let the constants c*, τ, ρ be given as in Theorem 2. Further assume k = Σ_q |supp(vq)| ≤ c*√n and Σ_q kq ≤ k0 ≤ 20 Σ_q kq. Then the Covariance Thresholding algorithm of Algorithm 1, with the definitions in Eqs. (8) and (9), recovers the joint supports of vq successfully, i.e.

    P( Q̂ = ∪_q supp(vq) ) ≥ 1 − C/n⁴ .    (10)

3 Practical aspects and empirical results

Specializing to the rank-one case, Theorems 1 and 2 show that Covariance Thresholding succeeds with high probability for a number of samples n ≳ k², while Diagonal Thresholding requires n ≳ k² log p. The reader might wonder whether eliminating the log p factor has any practical relevance, or is a purely conceptual improvement. 
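As a concrete reference for the experiments that follow, the rank-one pipeline of Algorithm 1 can be sketched as below. This is an illustrative reimplementation under assumed toy parameters: the data generator, β, and the constants behind τ, ρ, θ are choices made for this sketch, not the constants prescribed by the theorems.

```python
import numpy as np

def soft_threshold(M, t):
    # Entry-wise soft thresholding eta(z; t) = sgn(z) * (|z| - t)_+
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def covariance_thresholding(X, k, tau=4.0, rho=None, theta=1.0):
    """Rank-one sketch of the covariance thresholding pipeline: split the
    samples, soft-threshold the centered Gram matrix of the first half, take
    its principal eigenvector, and 'clean' the candidate support using the
    second half of the data."""
    two_n, p = X.shape
    n = two_n // 2
    X1, X2 = X[:n], X[n:]
    Sigma1 = X1.T @ X1 / n - np.eye(p)             # hat-Sigma  = G  - I_p
    Sigma2 = X2.T @ X2 / n - np.eye(p)             # hat-Sigma' = G' - I_p
    eta_S = soft_threshold(Sigma1, tau / np.sqrt(n))
    _, V = np.linalg.eigh(eta_S)
    v1 = V[:, -1]                                   # principal eigenvector of eta(hat-Sigma)
    s = np.where(np.abs(v1) >= theta / (2 * np.sqrt(k)), v1, 0.0)
    if rho is None:
        rho = 12.0 / np.sqrt(n)                     # illustrative cleaning threshold
    return np.flatnonzero(np.abs(Sigma2 @ s) >= rho)

# Toy rank-one instance with k of order sqrt(n) (all constants here are assumptions).
rng = np.random.default_rng(2)
n, p, k, beta = 1500, 1500, 15, 4.0
v = np.zeros(p)
v[:k] = 1 / np.sqrt(k)
X = np.sqrt(beta) * rng.standard_normal((2 * n, 1)) * v + rng.standard_normal((2 * n, p))
Q_hat = covariance_thresholding(X, k)
print(sorted(Q_hat))                                # candidate support indices
```

The sample split matters: the cleaning step uses the second, independent half Σ̂′, so that the noise in Σ̂′ is independent of the candidate vector s computed from the first half.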
Figure 1 presents simulations on synthetic data under the strictly sparse model, using the Covariance Thresholding algorithm (Algorithm 1) analyzed in Theorem 2. The objective is to check whether the log p factor has an impact at moderate p. We compare this with Diagonal Thresholding.
We plot the empirical success probability as a function of k/√n for several values of p, with p = n. The empirical success probability was computed using 100 independent instances of the problem. A few observations are of interest: (i) Covariance Thresholding appears to have a significantly larger success probability in the 'difficult' regime where Diagonal Thresholding starts to fail; (ii) the curves for Diagonal Thresholding appear to decrease monotonically with p, indicating that k proportional to √n is not the right scaling for this algorithm (as is known from theory); (iii) in contrast, the curves for Covariance Thresholding become steeper for larger p and, in particular, the success probability increases with p for k ≤ 1.1√n. This indicates a sharp threshold at k = const · √n, as suggested by our theory.

Figure 1: The support recovery phase transitions for Diagonal Thresholding (left), Covariance Thresholding (center) and the data-driven version of Section 3 (right). For Covariance Thresholding, the fraction of support recovered correctly increases monotonically with p, as long as k ≤ c√n with c ≈ 1.1. Further, it appears to converge to one throughout this region. For Diagonal Thresholding, the fraction of support recovered correctly decreases monotonically with p for all k of order √n. This confirms that Covariance Thresholding (with or without knowledge of the support size k) succeeds with high probability for k ≤ c√n, while Diagonal Thresholding requires a significantly sparser principal component.

In terms of practical applicability, our algorithm (Algorithm 1) has the shortcoming of requiring knowledge of the problem parameters βq, r, kq. Furthermore, the thresholds ρ, τ suggested by theory need not be optimal. We next describe a principled approach to estimating (where possible) the parameters of interest and running the algorithm in a purely data-dependent manner. Assume the following model, for i ∈ [n]:

    xi = µ + Σ_q √βq uq,i vq + σ zi ,

where µ ∈ R^p is a fixed mean vector, the ui have mean 0 and variance 1, and the zi have mean 0 and covariance I_p. Note that our focus in this section is not on rigorous analysis, but instead to demonstrate a principled approach to applying covariance thresholding in practice. We proceed as follows:

Estimating µ, σ: We let µ̂ = Σ_{i=1}^{n} xi/n be the empirical mean estimate for µ. Further, replacing X by X − 1 µ̂ᵀ, we see that pn − (Σ_q kq) n ≈ pn entries of X have mean 0 and variance σ². We let σ̂ = MAD(X)/ν, where MAD(·) denotes the median absolute deviation of the entries of the matrix in the argument and ν is a constant scale factor. Guided by the Gaussian case, we take ν = Φ⁻¹(3/4) ≈ 0.6745.

Choosing τ: Although in the statement of the theorem our choice of τ depends on the SNR β/σ², we believe this is an artifact of our analysis. Indeed, it is reasonable to threshold 'at the noise level', as follows. The noise component of entry (i, j) of the sample covariance (ignoring lower-order terms) is given by σ²⟨zi, zj⟩/n. By the central limit theorem, ⟨zi, zj⟩/√n converges in distribution to N(0, 1). 
Consequently, σ²⟨zi, zj⟩/n ≈ N(0, σ⁴/n), and we need to choose the (rescaled) threshold proportional to √(σ⁴) = σ². Using the previous estimates, we let τ = ν′ · σ̂² for a constant ν′. In simulations, a choice 3 ≲ ν′ ≲ 4 appears to work well.

Estimating r: We define Σ̂ = XᵀX/n − σ̂² I_p and soft-threshold it to get η(Σ̂), using τ as above. Our proof of Theorem 1 relies on the fact that η(Σ̂) has r eigenvalues that are separated from the bulk of the spectrum(4). Hence, we estimate r using r̂: the number of eigenvalues separated from the bulk in η(Σ̂).

(4) The support of the bulk spectrum can be computed numerically from the results of [CS12].

Estimating vq: Let v̂q denote the qth eigenvector of η(Σ̂). Our theoretical analysis indicates that v̂q is expected to be close to vq. In order to denoise v̂q, we assume v̂q ≈ (1 − δ)vq + εq, where εq is additive random noise. We then threshold v̂q 'at the noise level' to recover a better estimate of vq. To do this, we estimate the standard deviation of the 'noise' εq by σ̂ε = MAD(v̂q)/ν. Here we set –again guided by the Gaussian heuristic– ν ≈ 0.6745. Since vq is sparse, this procedure returns a good estimate of the size of the noise deviation. We let ηH(v̂q) denote the vector obtained by hard thresholding v̂q: set (ηH(v̂q))i = v̂q,i if |v̂q,i| ≥ ν′ σ̂εq and 0 otherwise. We then let v̂*q = ηH(v̂q)/‖ηH(v̂q)‖ and return v̂*q as our estimate for vq.

Note that –while different in several respects– this empirical approach shares the same philosophy as Algorithm 1. On the other hand, the data-driven algorithm presented in this section is less straightforward to analyze, a task that we defer to future work.
Figure 1 also shows results of a support recovery experiment using the 'data-driven' version of this section. Covariance thresholding in this form also appears to work for supports of size k ≤ const · √n. Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the 'Three Peak' example of Johnstone and Lu [JL04]. This signal is sparse in the wavelet domain, and the simulations employ the data-driven version of covariance thresholding. A similar experiment with the 'box' example of Johnstone and Lu is provided in the supplement. These experiments demonstrate that, while for large values of n both Diagonal Thresholding and Covariance Thresholding perform well, the latter appears superior for smaller values of n.

Figure 2: The results of Simple PCA, Diagonal Thresholding and Covariance Thresholding (respectively) for the 'Three Peak' example of Johnstone and Lu [JL09] (see Figure 1 of the paper). 
The signal is sparse in the 'Symmlet 8' basis. We use β = 1.4, p = 4096, and the rows correspond to sample sizes n = 1024, 1625, 2580, 4096 respectively. Parameters for Covariance Thresholding are chosen as in Section 3, with ν′ = 4.5. Parameters for Diagonal Thresholding are from [JL09]. On each curve, we superpose the clean signal (dotted).

References

[AW09] Arash A. Amini and Martin J. Wainwright, High-dimensional analysis of semidefinite relaxations for sparse principal components, The Annals of Statistics 37 (2009), no. 5B, 2877–2921.

[BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Annals of Probability (2005), 1643–1697.

[BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Advances in Mathematics 227 (2011), no. 1, 494–521.

[BL08] Peter J. Bickel and Elizaveta Levina, Regularized estimation of large covariance matrices, The Annals of Statistics (2008), 199–227.

[BR13] Quentin Berthet and Philippe Rigollet, Computational lower bounds for sparse PCA, arXiv preprint arXiv:1304.0828 (2013).

[CDMF09] Mireille Capitaine, Catherine Donati-Martin, and Delphine Féral, The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations, The Annals of Probability 37 (2009), no. 1, 1–47.

[CMW+13] T. Tony Cai, Zongming Ma, and Yihong Wu, Sparse PCA: optimal rates and adaptive estimation, The Annals of Statistics 41 (2013), no. 6, 3074–3110.

[CS12] Xiuyuan Cheng and Amit Singer, The spectrum of random inner-product kernel matrices, arXiv preprint arXiv:1202.3155 (2012).

[dBG08] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui, Optimal solutions for sparse principal component analysis, The Journal of Machine Learning Research 9 (2008), 1269–1294.

[dEGJL07] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review 49 (2007), no. 3, 434–448.

[DJ94] D. L. Donoho and I. M. Johnstone, Minimax risk over ℓp balls, Probability Theory and Related Fields 99 (1994), 277–303.

[DK70] Chandler Davis and W. M. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis 7 (1970), no. 1, 1–46.

[EK10a] Noureddine El Karoui, On information plus noise kernel random matrices, The Annals of Statistics 38 (2010), no. 5, 3191–3216.

[EK10b] Noureddine El Karoui, The spectrum of kernel random matrices, The Annals of Statistics 38 (2010), no. 1, 1–50.

[FP07] Delphine Féral and Sandrine Péché, The largest eigenvalue of rank one deformation of large Wigner matrices, Communications in Mathematical Physics 272 (2007), no. 1, 185–228.

[JL04] Iain M. Johnstone and Arthur Yu Lu, Sparse principal components analysis, Unpublished manuscript (2004).

[JL09] Iain M. Johnstone and Arthur Yu Lu, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association 104 (2009), no. 486.

[Joh02] Iain M. Johnstone, Function estimation and Gaussian sequence models, Unpublished manuscript (2002).

[KNV13] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik, Do semidefinite relaxations really solve sparse PCA?, CoRR abs/1306.3690 (2013).

[KY13] Antti Knowles and Jun Yin, The isotropic semicircle law and deformation of Wigner matrices, Communications on Pure and Applied Mathematics (2013).

[MB06] Nicolai Meinshausen and Peter Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics (2006), 1436–1462.

[MW13] Zongming Ma and Yihong Wu, Computational barriers in minimax submatrix detection, arXiv preprint arXiv:1309.5914 (2013).

[MWA05] Baback Moghaddam, Yair Weiss, and Shai Avidan, Spectral bounds for sparse PCA: exact and greedy algorithms, Advances in Neural Information Processing Systems, 2005, pp. 915–922.

[Pau07] Debashis Paul, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statistica Sinica 17 (2007), no. 4, 1617.

[VL12] Vincent Q. Vu and Jing Lei, Minimax rates of estimation for sparse PCA in high dimensions, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[Wai09] Martin J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso), IEEE Transactions on Information Theory 55 (2009), no. 5, 2183–2202.

[WTH09] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics 10 (2009), no. 3, 515–534.

[ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics 15 (2006), no. 2, 265–286.
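Supplementary illustration. The data-driven procedure of Section 3 can be summarized in the following minimal NumPy sketch. This is not the authors' implementation: it takes the noise variance σ² and the rank r as inputs (the paper estimates σ̂² from the data and r̂ from the eigenvalues separated from the bulk), and all function and parameter names are ours.

```python
import numpy as np

def soft_threshold(A, t):
    """Entrywise soft thresholding: eta(A)_ij = sign(A_ij) * max(|A_ij| - t, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def covariance_thresholding(X, sigma2, r, nu_prime=4.0):
    """Sketch of data-driven covariance thresholding.

    X      : n x p data matrix (rows = noisy observations).
    sigma2 : noise variance, taken as known here (the paper estimates it).
    r      : number of components, taken as known here (the paper estimates
             it by counting eigenvalues separated from the bulk).
    """
    n, p = X.shape
    Sigma = X.T @ X / n - sigma2 * np.eye(p)
    # Entries of Sigma fluctuate at scale sigma^2 / sqrt(n); soft threshold
    # at tau / sqrt(n) with tau = nu' * sigma^2 (simulations suggest 3 <~ nu' <~ 4).
    eta = soft_threshold(Sigma, nu_prime * sigma2 / np.sqrt(n))
    # Top-r eigenvectors of the thresholded matrix (eigh sorts ascending).
    _, vecs = np.linalg.eigh(eta)
    V = vecs[:, ::-1][:, :r]
    # Denoise each eigenvector: hard threshold at nu' * sigma_eps, where
    # sigma_eps = MAD / 0.6745 is the Gaussian-calibrated noise scale.
    out = []
    for q in range(r):
        v = V[:, q]
        sigma_eps = np.median(np.abs(v - np.median(v))) / 0.6745
        vh = np.where(np.abs(v) >= nu_prime * sigma_eps, v, 0.0)
        nrm = np.linalg.norm(vh)
        out.append(vh / nrm if nrm > 0 else v)
    return np.column_stack(out)
```

On a synthetic spiked model with a k-sparse principal component and k of order √n, the support of vq can typically be read off from the nonzero entries of the returned columns.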