{"title": "Sparse PCA from Sparse Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 10942, "page_last": 10952, "abstract": "Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state of the art guarantees for testing and for support recovery for the single spiked covariance model as obtained by the current best polynomial-time algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.", "full_text": "Sparse PCA from Sparse Linear Regression\n\nGuy Bresler\nMIT\nguy@mit.edu\n\nSung Min Park\nMIT\nsp765@mit.edu\n\nMădălina Persu\nTwo Sigma*, MIT\nmpersu@mit.edu\n\nAbstract\n\nSparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. 
We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state of the art guarantees for testing and for support recovery for the single spiked covariance model as obtained by the current best polynomial-time algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a fundamental technique for dimension reduction used widely in data analysis. PCA projects data along a few directions that explain most of the variance of observed data. One can also view this as linearly transforming the original set of variables into a (smaller) set of uncorrelated variables called principal components.\n\nRecent work in high-dimensional statistics has focused on sparse principal component analysis (SPCA), as ordinary PCA estimates become inconsistent in this regime [22]. In SPCA, we restrict the principal components to be sparse, meaning they have only a few nonzero entries in the original basis. This has the advantage, among others, that the components are more interpretable [23, 49], while components may no longer be uncorrelated. We study SPCA under the Gaussian (single) spiked covariance model introduced by [21]: we observe n samples of a random variable X distributed according to a Gaussian distribution N(0, I_d + θuu⊤), where ‖u‖₂ = 1 with at most k nonzero entries,² I_d is the d × d identity matrix, and θ is the signal-to-noise parameter. 
We study two settings of the problem, hypothesis testing and support recovery.\n\nSparsity assumptions have played an important role in a variety of other problems in high-dimensional statistics, in particular linear regression. Linear regression is also ill-posed in high dimensions, so imposing sparsity on the regression vector restores tractability.\n\nThough the literatures on the two problems are largely disjoint, there is a striking similarity between them, in particular when one considers statistical and computational trade-offs.\n\n*The views expressed herein are solely the views of the author(s) and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice.\n\n²Sometimes we will write this latter condition as u ∈ B₀(k), where B₀(k) is the “ℓ0-ball” of radius k.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe natural information-theoretically optimal algorithm for SPCA [4] involves searching over all possible supports of the hidden spike. This bears resemblance to the minimax optimal algorithm for SLR [35], which optimizes over all sparse supports of the regression vector. Both problems appear to exhibit gaps between statistically optimal algorithms and known computationally efficient algorithms, and conditioned on relatively standard complexity assumptions, these gaps seem irremovable [3, 44, 47].\n\n1.1 Our contributions\n\nIn this paper we give algorithmic evidence that this similarity is likely not a coincidence. Specifically, we give a simple, general, and efficient procedure for transforming a black-box solver for sparse linear regression into an algorithm for SPCA. At a high level, our algorithm tries to predict each coordinate³ linearly from the rest of the coordinates using a black-box algorithm for SLR. 
The advantages of such a black-box framework are twofold: theoretically, it highlights a structural connection between the two problems; practically, it allows one to simply plug in any of the many solvers available for SLR and directly get an algorithm for SPCA with provable guarantees. In particular,\n\n• For hypothesis testing: we match the state of the art provable guarantee for computationally efficient algorithms; our algorithm successfully distinguishes between isotropic and spiked Gaussian distributions at signal strength θ ≳ √(k² log d / n). This matches the phase transition of diagonal thresholding [22] and Minimal Dual Perturbation [4] up to constant factors.\n• For support recovery: for general d and n, when each non-zero entry of u is at least Ω(1/√k) (a standard assumption in the literature), our algorithm succeeds with high probability at signal strength θ ≳ √(k² log d / n), which is nearly optimal.⁴\n• In experiments, we demonstrate that using popular existing SLR algorithms as our black-box results in reasonable performance.\n• We theoretically and empirically illustrate that our SPCA algorithm is also robust to rescaling of the data, for instance by using a Pearson correlation matrix instead of a covariance matrix.⁵ Many iterative methods rely on initialization via first running diagonal thresholding, which filters variables with higher variance; rescaling renders diagonal thresholding useless, so in some sense our framework is more robust.\n\n2 Preliminaries\n\n2.1 Problem formulation for SPCA\n\nHypothesis testing Here, we want to distinguish whether X is distributed according to an isotropic Gaussian or a spiked covariance model. That is, our null and alternate hypotheses are:\n\nH0 : X ∼ N(0, I_d) and H1 : X ∼ N(0, I_d + θuu⊤).\n\nOur goal is to design a test ψ : R^{n×d} → {0, 1} that discriminates H0 and H1. 
More precisely, we say that ψ discriminates between H0 and H1 with probability 1 − δ if both type I and type II errors have probability smaller than δ:\n\nP_{H0}(ψ(X) = 1) ≤ δ and P_{H1}(ψ(X) = 0) ≤ δ.\n\nWe assume the following additional condition on the spike u:\n\n(C1) c²_min/k ≤ u_i² ≤ 1 − c²_min/k for at least one i ∈ [d], where c_min > 0 is some constant.\n\n³From here on, we will use “coordinate” and “variable” interchangeably.\n⁴In the scaling limit d/n → α as d, n → ∞, the covariance thresholding algorithm [15] theoretically succeeds at a signal strength that is an order of √(log d) smaller. However, our experimental results indicate that with an appropriate choice of black-box, our Q algorithm outperforms covariance thresholding.\n⁵Solving SPCA based on the correlation matrix was suggested in a few earlier works [49, 41].\n\nThe above condition says that at least one coordinate has enough mass, yet the mass is not entirely concentrated on just that single coordinate. Trivially, we always have at least one i ∈ [d] s.t. u_i² ≥ 1/k, but this is not enough for our regression setup, since we want at least one other coordinate j to have sufficient correlation with coordinate i. We remark that the above condition is a very mild technical condition. If it were violated, almost all of the mass of u would be on a single coordinate, so a simple procedure for testing the variance (which is akin to diagonal thresholding) would suffice.\n\nSupport recovery The goal of support recovery is to identify the support of u from our samples X. More precisely, we say that a support recovery algorithm succeeds if the recovered support Ŝ is the same as S, the true support of u. 
As standard in the literature [1, 31], we need to assume a minimal bound on the size of the entries of u in the support.\n\nFor our support recovery algorithm, we will assume the following condition (note that it implies Condition (C1) and is much stronger):\n\n(C2) |u_i| ≥ c_min/√k for all i in the support of u, where 0 < c_min < 1 is some constant.\n\nThough the settings are a bit different, this minimal bound along with our results is consistent with lower bounds known for sparse recovery. These lower bounds ([18, 42]; the bound of [18] is a factor of k weaker) imply that the number of samples must grow roughly as n ≳ (1/u²_min) k log d, where u_min is the smallest entry of our signal u normalized by 1/√k, which is qualitatively the same threshold required by our theorems.\n\n2.2 Background on SLR\n\nIn linear regression, we observe a response vector y ∈ R^n and a design matrix X ∈ R^{n×d} that are linked by the linear model y = Xβ* + w, where w ∈ R^n is some form of observation noise, typically with i.i.d. N(0, σ²) entries. Our goal is to recover β* given noisy observations y. While the matrices X we consider arise from a (random) correlated design (as analyzed in [42], [43]), it will make no difference to assume the matrices are deterministic by conditioning, as long as the distributions of the design matrix and noise are independent, which we will demonstrate in our case. Most of the relevant results on sparse linear regression pertain to deterministic design.\n\nIn sparse linear regression, we additionally assume that β* has only k non-zero entries, where k ≪ d. This makes the problem well posed in the high-dimensional setting. Commonly used performance measures for SLR are tailored to prediction error ((1/n)‖Xβ* − Xβ̂‖₂², where β̂ is our guess), support recovery (recovering the support of β*), or parameter estimation (minimizing ‖β* − β̂‖ under some norm). We focus on prediction error, analyzed over random realizations of the noise. 
There is a large amount of work on SLR and we defer a more in-depth overview to Appendix A.\n\nMost efficient methods for SLR impose certain conditions on X. We focus on the restricted eigenvalue condition, which, roughly stated, makes the prediction loss strongly convex near the optimum:\n\nDefinition 2.1 (Restricted eigenvalue [47]). First define the cone C(S) = {Δ ∈ R^d : ‖Δ_{S^c}‖₁ ≤ 3‖Δ_S‖₁}, where S^c denotes the complement of S and Δ_T is Δ restricted to the subset T. The restricted eigenvalue (RE) constant of X, denoted γ(X), is defined as the largest constant γ > 0 s.t.\n\n(1/n)‖XΔ‖₂² ≥ γ‖Δ‖₂² for all Δ ∈ ∪_{|S|=k, S⊆[d]} C(S).\n\nFor more discussion on the restricted eigenvalue, see Appendix A.\n\nBlack-box condition Given the known guarantees on SLR, we define a condition that is natural to require on the guarantee of our SLR black-box, which is invoked as SLR(y, X, k).\n\nCondition 2.2 (Black-box condition). Let γ(X) denote the restricted eigenvalue of X. There are universal constants c, c′, c″ such that SLR(y, X, k) outputs β̂ that is k-sparse and satisfies\n\n(1/n)‖Xβ̂ − Xβ*‖₂² ≤ (c/γ(X)²) · (2k log d)/n for all β* ∈ B₀(k), w.p. 1 − c′ exp(−c″ k log d).\n\n3 Algorithms and main results\n\nWe first discuss how to view samples from the spiked covariance model in terms of a linear model. We then give some intuition motivating our statistic. Finally, we state our algorithms and main theorems, and give a high-level sketch of the proof.\n\n3.1 The linear model\n\nLet X^(1), X^(2), . . . , X^(n) be n i.i.d. samples from the spiked covariance model; denote as X ∈ R^{n×d} the matrix whose rows are the X^(i). Intuitively, if variable i is contained in the support of the spike, then the rest of the support should provide a nontrivial prediction for X_i, since variables in the support are correlated. 
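Concretely, each regression instance just peels one column off as the response and keeps the remaining columns as the design; a minimal NumPy sketch of this setup (our illustration, not the paper's code; the function name is our own):

```python
import numpy as np

def regression_instance(X, i):
    """Form the SLR instance for coordinate i: response y = X_i,
    design = all remaining columns of X (the matrix X_{-i})."""
    y = X[:, i]
    X_minus_i = np.delete(X, i, axis=1)  # drop column i
    return y, X_minus_i
```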
Conversely, for i not in the support (or under the isotropic null hypothesis), all of the variables are independent and the other variables are useless for predicting X_i. So we regress X_i onto the rest of the variables.\n\nLet X_{−i} denote the matrix of samples with the ith column removed. For each column i, we can view our data as coming from a linear model with design matrix X = X_{−i} and response variable y = X_i.\n\nThe “true” regression vector depends on i. Under the alternate hypothesis H1, if i ∈ S, we can write y = Xβ* + w, where β* = (θu_i / (1 + (1 − u_i²)θ)) u_{−i} and w ∼ N(0, σ²) with σ² = 1 + θu_i² / (1 + (1 − u_i²)θ).⁶ If i ∉ S, and for any i ∈ [d] under the null hypothesis, y = w where w = X_i ∼ N(0, 1) (implicitly β* = 0).\n\n3.2 Designing the test statistic\n\nBased on the linear model above, we want to compute a test statistic that will indicate when a coordinate i is on support. Intuitively, the predictive power of our linear model should be higher when i is on support. Indeed, a calculation shows that the variance in X_i is reduced by approximately θ²/k. We want to measure this reduction in noise to detect whether i is on support or not.\n\nSuppose for instance that we have access to β* rather than β̂ (note that this is not possible in practice since we do not know the support!). Since we want to measure the reduction in noise when the variable is on support, as a first step we might try the following statistic:\n\nQ_i = (1/n)‖y − Xβ*‖₂²\n\nUnfortunately, this statistic will not be able to distinguish the two hypotheses, as the reduction in the above error is too small (on the order of θ²/k compared to an overall order of 1 + θ), so deviation due to random sampling will mask any reduction in noise. 
We can \ufb01x this by adding the variance term kyk2:\n\nQi = 1/nkyk2\n\n2 1/nky X\u21e4k2\n\n2\n\n2 allows us to measure the relative gain in predictive power\nOn a more intuitive level, including kyk2\nwithout being penalized by a possibly large variance in y. Fluctuations in y due to noise will typically\nbe canceled out in the difference of terms in Qi, minimizing the variance of our statistic.\nWe have to add one \ufb01nal \ufb01x to the above estimator. We obviously do not have access to \u21e4, so we\n\nmust use the estimateb = SLR(y, X, k) (y, X are as de\ufb01ned in Section 3.1) which we get from our\n\nblack-box. As our analysis shows, this substitution does not affect much of the discriminative power\nof Qi as long as the SLR black-box satis\ufb01es prediction error guarantees stated in Condition 2.2. This\ngives our \ufb01nal statistic:7\n\nQi = 1/nkyk2\n\n2 1/nky Xbk2\n\n2.\n\n3.3 Algorithms\nBelow we give algorithms for hypothesis testing and for support recovery, based on the Q statistic:\n6By the theory of linear minimum mean-square-error (LMMSE) con\ufb01rms that this choice of \u21e4 minimizes\n\nthe error 2. See Appendix B.1, B.2 for details of this calculation.\n\n7As pionted out by a reviewer, Note that this statistic is actually equivalent to R2 up to rescaling by sample\nvariance. Note that our formula is slightly different though as we use the sample variance computed with\npopulation mean as opposed to sample mean, as the mean is known to be zero.\n\n4\n\n\fAlgorithm 1 Q-hypothesis testing\n\nAlgorithm 2 Q-support recovery\n\nInput: X 2 Rd\u21e5n, k\nOutput: {0, 1}\nfor i = 1, . . . , d do\nbi = SLR(Xi, Xi, k)\nnkXi Xibik2\nQi = 1\n2 1\nnkXik2\nif Qi > 13k log d\n\nthen\n\nn\n\n2\n\nk\n\nreturn 1\n\nend if\nend for\nReturn 0\n\nInput: X 2 Rd\u21e5n, k\nbS = ?\nfor i = 1, . . . 
, d do\nbi = SLR(Xi, Xi, k)\nnkXi Xibik2\nQi = 1\n2 1\nnkXik2\nif Qi > 13k log d\nbS bS [{ i}\n\nend if\nend for\n\nthen\n\nn\n\n2\n\nk\n\nReturn bS\n\nBelow we summarize our guarantees for the above algorithms. The proofs are simple, but we defer\nthem to Appendix C.\nTheorem 3.1 (Hypothesis test). Assume we have access to SLR that satis\ufb01es Condition 2.2 and with\nruntime T (d, n, k) per instance. Under Condition (C1), there exist universal constants c1, c2, c3, c4\ns.t. if \u27132 > c1\nc2\nmin\n\nand n > c2k log d, Algorithm 1 outputs s.t.\n\nk2 log d\n\nn\n\nPH0( (X) = 1) _ PH1( (X) = 0) \uf8ff c3 exp(c4k log d)\n\nin time O(dT + d2n).\nTheorem 3.2 (Support recovery). Under the same conditions as above plus Condition (C2), if\n\u27132 > c1\nc2\nmin\nin time O(dT + d2n).\n\n, Algorithm 2 above \ufb01nds bS = S with probability at least 1 c3 exp(c4k log d)\n\nk2 log d\n\nn\n\n3.4 Comments\nRE for sample design matrix Because population covariance \u2303= E[XX>] has minimum eigenvalue\n1, with high probability the sample design matrix X has constant restricted eigenvalue value given\nenough samples, i.e. n is large enough (see Appendix B.3 for more details), and the prediction error\nguarantee of Condition 2.2 will be good enough for our analysis.\nRunning time The runtime of both Algorithm 1 and 2 is \u02dcO(nd2). The discussion presented at the\nend of Appendix C details why this is competitive for (single spiked) SPCA, at least theoretically.\nUnknown sparsity Throughout the paper we assume that the sparsity level k is known. 
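As a concrete sketch of the Q-based hypothesis test, the snippet below plugs scikit-learn's Lasso, followed by keeping the k largest coefficients, into the SLR slot. This is only one illustrative choice of black-box, and the regularization level `alpha` is a hyperparameter we picked for illustration, not a value from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

def slr_black_box(y, X, k, alpha=0.1):
    """Stand-in SLR solver: Lasso, then keep the k largest coefficients
    (a thresholded-Lasso-style estimator; alpha is our illustrative choice)."""
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_.copy()
    keep = np.argsort(np.abs(beta))[-k:]  # indices of the top-k coefficients
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

def q_statistic(X, i, k):
    """Q_i = (1/n)||y||^2 - (1/n)||y - X_{-i} beta_hat||^2 with y = X_i."""
    y = X[:, i]
    X_rest = np.delete(X, i, axis=1)
    beta = slr_black_box(y, X_rest, k)
    n = X.shape[0]
    return (np.sum(y ** 2) - np.sum((y - X_rest @ beta) ** 2)) / n

def q_hypothesis_test(X, k):
    """Return 1 if any coordinate's Q_i clears the 13 k log(d) / n threshold."""
    n, d = X.shape
    tau = 13.0 * k * np.log(d) / n
    return int(any(q_statistic(X, i, k) > tau for i in range(d)))
```

On isotropic data every Q_i concentrates near zero, while on spiked data the on-support coordinates gain roughly θ²u_i² of explained variance, so they clear the threshold once θ is large enough.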
However, if k is unknown, standard techniques could be used to adaptively find approximate values of k ([16]). For instance, for hypothesis testing, we can start with an initial overestimate k₀, and keep halving it until we get enough coordinates i whose Q_i passes the threshold for the given k₀.\n\nRobustness of Q statistic to rescaling\n\nIntuitively, our algorithms for detecting correlated structure in data should be invariant to rescaling of the data; the precise scale or units in which one variable is measured should not have an impact on our ability to find meaningful structure underlying the data. Our algorithms based on Q are robust to rescaling, perhaps unsurprisingly, since correlations between variables in the support remain under rescaling.\n\nOn the other hand, diagonal thresholding, an often-used preprocessing step for SPCA which filters variables strictly based on variance, would trivially fail under rescaling. This illustrates a strength of our framework over other existing algorithms for SPCA.\n\nBelow we show explicitly that the Q statistic is indeed robust to rescaling. Let X̃ = DX be the rescaling of X, where D is some diagonal matrix. Let D_S be D restricted to rows and columns in S. Note that Σ̃, the covariance matrix of the rescaled data, is just DΣD by expanding the definition. Similarly, note Σ̃_{2:d,1} = D₁D_{2:d}Σ_{2:d,1}, where D_{2:d} denotes D without row and column 1. 
Now, recall the term which dominated our analysis of Q_i under H1, (β*)⊤Σ_{2:d}β*, which was equal to\n\nΣ_{1,2:d}Σ_{2:d}⁻¹Σ_{2:d,1}.\n\nWe replace the covariances by their rescaled versions to obtain:\n\n(β̃*)⊤Σ̃_{2:d}β̃* = (D₁Σ_{1,2:d}D_{2:d})D_{2:d}⁻¹Σ_{2:d}⁻¹D_{2:d}⁻¹(D_{2:d}Σ_{2:d,1}D₁) = D₁²(β*)⊤Σ_{2:d}β*.\n\nFor the spiked covariance model, rescaling variances to one amounts to rescaling with D₁ = 1/(1+θ). Thus, we see that our signal strength is affected only by a constant factor (assuming θ ≤ 1).\n\n4 Experiments\n\nWe test our algorithmic framework on randomly generated synthetic data and compare to other existing algorithms for SPCA. The code was implemented in Python using standard libraries.\n\nWe refer to our general algorithm from Section 3 that uses the Q statistic as SPCAvSLR. For our SLR “black-box,” we use thresholded Lasso [47].⁸ (We experimented with other SLR algorithms such as the forward-backward algorithm of [46] and CoSaMP⁹ [32], but the results were similar and these methods were slower.)\n\nFor more details on our experimental setup, including hyperparameter selection, see Appendix D.\n\nSupport recovery We randomly generate a spike u ∈ R^d by first choosing a random support of size k, and then using random signs for each coordinate (the uniform magnitudes make sure Condition (C2) is met). The spike is then scaled appropriately with θ to build the spiked covariance matrix of our normal distribution, from which we draw samples.\n\nWe study how the performance of six algorithms varies over values of k for fixed n and d.¹⁰ As in [15], our measure is the fraction of the true support recovered. 
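The generation procedure just described can be sketched as follows (a minimal NumPy version with names of our own choosing; the ±1/√k entries give a unit-norm spike satisfying Condition (C2)):

```python
import numpy as np

def spiked_sample(n, d, k, theta, rng):
    """Draw n samples from N(0, I_d + theta * u u^T), where u is a
    random k-sparse spike with entries +/- 1/sqrt(k), so ||u||_2 = 1."""
    support = rng.choice(d, size=k, replace=False)
    u = np.zeros(d)
    u[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    cov = np.eye(d) + theta * np.outer(u, u)  # spiked covariance matrix
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    return X, u, support
```

Each on-support coordinate then has variance 1 + θ/k, which is the signal the various statistics try to detect.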
We compare SPCAvSLR with the following algorithms: diagonal thresholding, which is a simple baseline; “SPCA” (ZHT [49]), a fast heuristic also based on the regression idea; the truncated power method of [45], which is known for both strong theoretical guarantees and empirical performance; and covariance thresholding, which has state-of-the-art theoretical guarantees.\n\nWe modified each algorithm to return the top k most likely coordinates in the support (rather than thresholding based on a cutoff); for algorithms that compute a candidate eigenvector, we choose the top k coordinates largest in absolute value.\n\nWe observe that SPCAvSLR performs better than covariance thresholding and diagonal thresholding, but its performance falls short of that of the truncated power method and the heuristic algorithm of [49]. We suspect that experimenting with different SLR algorithms may slightly improve its performance. The reason for the gap between the performance of SPCAvSLR and other state-of-the-art algorithms, despite its theoretical guarantees, is open to further investigation.\n\nHypothesis testing We generate data under two different distributions: for the spiked covariance model, we generate a spike u by sampling a uniformly random direction from the k-dimensional unit sphere, and embedding the vector at a random subset of k coordinates among the d coordinates; for the null, we draw from the standard isotropic Gaussian. In a single trial, we draw n samples from each distribution and we compute various statistics¹¹ (diagonal thresholding (DT), Minimal Dual Perturbation (MDP), and our Q statistic, again using thresholded Lasso). We repeat for 100 trials, and plot the resulting empirical distribution for each statistic.\n\n⁸Thresholded Lasso is a variant of Lasso where, after running Lasso, we keep the k largest (in magnitude) coefficients to make the estimator k-sparse. Proposition 2 in [47] shows that thresholded Lasso satisfies Condition 2.2.\n⁹CoSaMP in theory requires the stronger condition of restricted isometry on the “sensing” matrix.\n¹⁰We remark that while the size of this dataset might seem too small to be representative of the “high-dimensional” setting, it is representative of the usual size of the datasets these methods are usually tested on. One bottleneck is the computation of the covariance matrix.\n¹¹While some of the algorithms used for support recovery in the previous section could in theory be adapted for hypothesis testing, the extensions were immediate so we do not consider them here.\n\nFigure 1: Performance of diagonal thresholding, SPCA (ZHT), truncated power method, covariance thresholding, and SPCAvSLR for support recovery at n = d = 625 (left) and n = 625, d = 1250 (right), varying values of k, and θ = 3.0. On the horizontal axis we show k/√n; the vertical axis is the fraction of support correctly recovered. Each datapoint on the figure is averaged over 50 trials.\n\nWe observe similar performance of DT and Q, while MDP seems slightly more effective at distinguishing H0 and H1 at the same signal strength (that is, the distributions of the statistics under H0 vs. H1 are more well-separated).\n\nRescaling variables As discussed in Section 3.4, our algorithms are robust to rescaling the covariance matrix to the correlation matrix. As illustrated in Figure 2 (right), DT fails while Q appears to remain effective for distinguishing hypotheses in the same regime of parameters. Other methods such as MDP and CT also appear to be robust to such rescaling (not shown). 
This suggests that more modern algorithms for SPCA may be more appropriate than diagonal thresholding in practice, particularly on instances where the relative scales of the variables may not be accurate or knowable in advance, but we still want to be able to find correlation between the variables.\n\nFigure 2: Performance of diagonal thresholding (DT), MDP, and Q for hypothesis testing at n = 200, d = 500, k = 30, θ = 4 (left and center). T0 denotes the statistic T under H0, and similarly for T1. The effect of rescaling the covariance matrix to make variances indistinguishable is demonstrated (right).\n\n5 Previous work\n\nHere we discuss in more detail previous approaches to SPCA and how they relate to our work. Various approaches to SPCA have been designed in an extensive list of prior work. As we cannot cover all of them, we focus on works that aim to give computationally efficient (i.e. polynomial time) algorithms with provable guarantees in settings similar to ours.\n\nThese algorithms include fast, heuristic methods based on ℓ1 minimization [23, 49], rigorous but slow methods based on natural semidefinite program (SDP) relaxations [13, 1, 41, 44], iterative methods motivated by power methods for approximating eigenvectors [45, 24], and non-iterative methods based on random projections [20], among others. Many iterative methods rely on initialization schemes, such as ordinary PCA or diagonal thresholding [22].\n\nBelow, we discuss the known sample bounds for support recovery and hypothesis testing.\n\nSupport recovery [1] analyzed both diagonal thresholding and an SDP for support recovery under the spiked covariance model.¹² They showed that the SDP requires an order of k fewer samples when the SDP optimal solution is rank one. However, [27] showed that the rank one condition does not happen in general, particularly in the regime approaching the information theoretic limit (√n ≲ k ≲ n/log d). 
This is consistent with computational lower bounds from [3] (k ≳ √n), but a small gap remains (diagonal thresholding and SDPs succeed only up to k ≲ √n/log d). The above gap was closed by the covariance thresholding algorithm, first suggested by [27] and analyzed by [15], which succeeds in the regime √n/log d ≲ k ≲ √n, although the theoretical guarantee is limited to the regime where d/n → α due to relying on techniques from random matrix theory.\n\nHypothesis testing Some works [4, 1, 17] have focused on the problem of detection. In this case, [4] observed that it suffices to work with the much simpler dual of the standard SDP, called Minimal Dual Perturbation (MDP). Diagonal thresholding (DT) and MDP work up to the same signal threshold θ as for support recovery, but MDP seems to outperform DT on simulated data [4]. MDP works at the same signal threshold as the standard SDP relaxation for SPCA. [17] analyze a statistic based on an SDP relaxation and its approximation ratio to the optimal statistic. In the regime where k, n are proportional to d, their statistic succeeds at a signal threshold for θ that is independent of d, unlike the MDP. However, their statistic is quite slow to compute; the runtime is at least a high-order polynomial in d.\n\nRegression based approaches To the best of our knowledge, our work is the first to give a general framework for SPCA that uses SLR in a black-box fashion. [49] uses specific algorithms for SLR such as the Lasso as a subroutine, but they use a heuristic alternating minimization procedure to solve a non-convex problem, and hence lack any theoretical guarantees. [31] applies a regression based approach to a restricted class of graphical models. While our regression setup is similar, their statistic is different and their analysis depends directly on the particulars of the Lasso. 
Further, their algorithm requires extraneous conditions on the data. [9] also uses a reduction to linear regression for their problem of sparse subspace estimation. Their iterative algorithm depends crucially on a good initialization done by a diagonal-thresholding-like pre-processing step, which fails under rescaling of the data.¹³ Furthermore, their framework uses regression for the specific case of orthogonal design, whereas our design matrix can be more general as long as it satisfies a condition similar to the restricted eigenvalue condition. On the other hand, their setup allows for more general ℓq-based sparsity as well as the estimation of an entire subspace as opposed to a single component. [29] also achieves this more general setup, while still suffering from the same initialization problem.\n\nSparse priors Finally, connections between SPCA and SLR have been noted in the probabilistic setting [26, 25], albeit in an indirect manner: the same sparsity-inducing priors can be used for either problem. We view our work as entirely different, as we focus on giving a black-box reduction. Furthermore, provable guarantees for the EM algorithm and variational methods are lacking in general, and it is not immediately obvious what signal threshold their algorithms achieve for the single spiked covariance model.\n\n6 Conclusion\n\nWe gave a black-box reduction from instances of the SPCA problem under the spiked covariance model to instances of SLR. Given oracle access to an SLR black-box meeting a certain natural condition, the reduction is shown to efficiently solve hypothesis testing and support recovery. Several directions are open for future work. The work in this paper remains limited to the Gaussian setting and to the single spiked covariance model. Making the results more general would make the connection made here more appealing. 
Also, the algorithms designed here, though simple, seem a bit wasteful in that they do not aggregate information from the different statistics. Designing an estimator that makes more efficient use of the samples would be interesting. Finally, there is certainly room for improvement in tuning the choice of the SLR black-box to make the algorithm more efficient for use in practice.\n\n¹²They analyze the subcase when the spike is uniform in all k coordinates.\n¹³See Section 3.4 for more discussion on rescaling.\n\nReferences\n\n[1] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2454–2458. IEEE, 2009.\n\n[2] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.\n\n[3] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046–1066, 2013.\n\n[4] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.\n\n[5] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.\n\n[6] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.\n\n[7] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.\n\n[8] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Sparse density estimation with ℓ1 penalties. 
In Learning Theory, pages 530–543. Springer, 2007.

[9] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.

[10] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

[11] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[12] Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang. Aggregation of affine estimators. Electronic Journal of Statistics, 8(1):302–327, 2014.

[13] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[14] David Gamarnik and Ilias Zadik. High dimensional regression with binary coefficients: estimating squared error and a phase transition. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 948–953. PMLR, 2017.

[15] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334–342, 2014.

[16] Thong T Do, Lu Gan, Nam Nguyen, and Trac D Tran. Sparsity adaptive matching pursuit algorithm for practical compressed sensing. In 42nd Asilomar Conference on Signals, Systems and Computers, pages 581–587. IEEE, 2008.

[17] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89–110, 2014.

[18] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. Necessary and sufficient conditions for sparsity pattern recovery.
IEEE Transactions on Information Theory, 55(12):5758–5772, 2009.

[19] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression: algorithmic barriers and a local search algorithm. arXiv preprint arXiv:1711.04952, 2017.

[20] Milana Gataric, Tengyao Wang, and Richard J Samworth. Sparse principal component analysis via random projections. arXiv preprint arXiv:1712.05630, 2017.

[21] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.

[22] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 2009.

[23] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the Lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[24] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517–553, 2010.

[25] Rajiv Khanna, Joydeep Ghosh, Russell A Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In AISTATS, 2015.

[26] Oluwasanmi O Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676–684, 2014.

[27] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse PCA? Technical report, Weizmann Institute of Science, 2013.

[28] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

[29] Zongming Ma.
Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.

[30] Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[31] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, pages 1436–1462, 2006.

[32] Deanna Needell and Joel A Tropp. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[33] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[34] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

[35] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[36] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.

[37] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. In Conference on Learning Theory, pages 10–1, 2012.

[38] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[39] Sara van de Geer. The deterministic Lasso.
Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.

[40] Sara A van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

[41] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670–2678, 2013.

[42] Martin Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. In IEEE International Symposium on Information Theory (ISIT), pages 961–965. IEEE, 2007.

[43] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[44] Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896–1930, 2016.

[45] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.

[46] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, pages 1921–1928, 2009.

[47] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.

[48] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. arXiv preprint arXiv:1503.03188, 2015.

[49] Hui Zou, Trevor Hastie, and Robert Tibshirani.
Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.