Approximating Sparse PCA from Incomplete Data
Advances in Neural Information Processing Systems, pp. 388-396

Abhisek Kundu*    Petros Drineas†    Malik Magdon-Ismail‡

Abstract

We study how well one can recover sparse principal components of a data matrix using a sketch formed from a few of its elements. We show that for a wide class of optimization problems, if the sketch is close (in the spectral norm) to the original data matrix, then one can recover a near-optimal solution to the optimization problem by using the sketch. In particular, we use this approach to obtain sparse principal components and show that for m data points in n dimensions, O(ε^-2 k̃ max{m, n}) elements give an ε-additive approximation to the sparse PCA problem (k̃ is the stable rank of the data matrix).
We demonstrate our algorithms extensively on image, text, biological and financial data. The results show that not only are we able to recover the sparse PCAs from the incomplete data, but, by using our sparse sketch, the running time drops by a factor of five or more.

1 Introduction

Principal components analysis (PCA) constructs a low-dimensional subspace of the data such that projection of the data onto this subspace preserves as much information as possible (or, equivalently, maximizes the variance of the projected data). The earliest reference to principal components analysis is [15]. Since then, PCA has evolved into a classic tool for data analysis. A challenge for the interpretation of the principal components (or factors) is that they can be linear combinations of all the original variables. When the original variables have direct physical significance (e.g., genes in biological applications or assets in financial applications), it is desirable to have factors which have loadings on only a small number of the original variables. These interpretable factors are sparse principal components (SPCA).

The question we address is not how to better perform sparse PCA; rather, it is whether one can perform sparse PCA on incomplete data and be assured some degree of success (i.e., can we do sparse PCA when we have a small sample of data points and those data points have missing features?). Incomplete data is a situation one is confronted with all too often in machine learning. For example, with user-recommendation data, one does not have all the ratings of any given user. Or, in a privacy-preserving setting, a client may not want to give us all entries in the data matrix. In such a setting, our goal is to show that if the samples that we do get are chosen carefully, the sparse PCA features of the data can be recovered within some provable error bounds.
A significant part of this work is to demonstrate our algorithms on a variety of data sets.

More formally, the data matrix is A ∈ R^{m×n} (m data points in n dimensions). Data matrices often have low effective rank. Let A_k be the best rank-k approximation to A; in practice it is often possible to choose a small value of k for which ‖A − A_k‖_2 is small. The best rank-k approximation A_k is obtained by projecting A onto the subspace spanned by its top-k principal components V_k, which is the n × k matrix containing the top-k right singular vectors of A. These top-k principal components are the solution to the variance maximization problem:

    V_k = arg max_{V ∈ R^{n×k}, V^T V = I} trace(V^T A^T A V).

We denote the maximum variance attainable by OPT_k, which is the sum of squares of the top-k singular values of A. To get sparse principal components, we add a sparsity constraint to the optimization problem: every column of V should have at most r non-zero entries (the sparsity parameter r is an input),

    S_k = arg max_{V ∈ R^{n×k}, V^T V = I, ‖V^(i)‖_0 ≤ r} trace(V^T A^T A V).    (1)

The sparse PCA problem is itself a very hard problem: it is not only NP-hard, but also inapproximable [12]. There are many heuristics for obtaining sparse factors [2, 18, 20, 5, 4, 14, 16], including some approximation algorithms with provable guarantees [1]. The existing research typically addresses the task of getting just the top principal component (k = 1); some exceptions are [11, 3, 19, 9].

*Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, kundua2@rpi.edu.
†Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, drinep@cs.rpi.edu.
‡Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, magdon@cs.rpi.edu.
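The two objectives above can be illustrated numerically. The sketch below (assuming numpy) computes V_k and OPT_k via the SVD, and then applies the simple truncation baseline that keeps the r largest-magnitude loadings and renormalizes; this mirrors the Gmax,r heuristic used in the experiments later, and is not the optimal S_k of (1):

```python
import numpy as np

def pca_variance(A, k):
    """Top-k principal components V_k of A and the captured variance OPT_k."""
    # The right singular vectors of A are the eigenvectors of A^T A.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                  # n x k, orthonormal columns
    opt_k = np.sum(s[:k] ** 2)     # = trace(Vk^T A^T A Vk)
    return Vk, opt_k

def truncate_sparse(V, r):
    """Keep the r largest-magnitude entries of each column and renormalize
    (a truncation heuristic, not the optimal S_k of Eq. (1))."""
    S = np.zeros_like(V)
    for j in range(V.shape[1]):
        idx = np.argsort(-np.abs(V[:, j]))[:r]
        S[idx, j] = V[idx, j]
        S[:, j] /= np.linalg.norm(S[:, j])
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
Vk, opt_k = pca_variance(A, k=1)
Sk = truncate_sparse(Vk, r=5)
var_sparse = np.trace(Sk.T @ A.T @ A @ Sk)
# For k = 1, an r-sparse unit vector can never beat the unconstrained optimum.
assert 0 < var_sparse <= opt_k
```

A caveat of truncation: for k > 1 the truncated columns are unit-norm but no longer exactly orthogonal, which is one reason dedicated sparse PCA solvers are preferred.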
While the sparse PCA problem is hard and interesting, it is not the focus of this work. We address the question: what if we do not know A, but only have a sparse sampling of some of the entries in A (incomplete data)? The sparse sampling is used to construct a sketch of A, denoted Ã. There is not much else to do but solve the sparse PCA problem with the sketch à instead of the full data A to get S̃_k,

    S̃_k = arg max_{V ∈ R^{n×k}, V^T V = I, ‖V^(i)‖_0 ≤ r} trace(V^T Ã^T Ã V).    (2)

We study how S̃_k performs as an approximation to S_k with respect to the objective that we are trying to optimize, namely trace(S^T A^T A S); the quality of approximation is measured with respect to the true A. We show that the quality of approximation is controlled by how well Ã^T à approximates A^T A, as measured by the spectral norm of the deviation A^T A − Ã^T Ã. This is a general result that does not rely on how one constructs the sketch Ã.

Theorem 1 (Sparse PCA from a Sketch) Let S_k be a solution to the sparse PCA problem (1), and S̃_k a solution to the sparse PCA problem (2) for the sketch Ã. Then,

    trace(S̃_k^T A^T A S̃_k) ≥ trace(S_k^T A^T A S_k) − 2k ‖A^T A − Ã^T Ã‖_2.

Theorem 1 says that if we can closely approximate A with Ã, then we can compute, from Ã, sparse components which capture almost as much variance as the optimal sparse components computed from the full data A.

In our setting, the sketch à is computed from a sparse sampling of the data elements in A (incomplete data). To determine which elements to sample, and how to form the sketch, we leverage some recent results in element-wise matrix completion [8].
In a nutshell, if one samples larger data elements with higher probability than smaller data elements, then, for the resulting sketch Ã, the error ‖A^T A − Ã^T Ã‖_2 will be small. The details of the sampling scheme, and how the error depends on the number of samples, are given in Section 2.1. Combining the bound on ‖A − Ã‖_2 from Theorem 4 in Section 2.1 with Theorem 1, we get our main result:

Theorem 2 (Sampling Complexity for Sparse PCA) Sample s data elements from A ∈ R^{m×n} to form the sparse sketch à using Algorithm 1. Let S_k be a solution to the sparse PCA problem (1), and let S̃_k, which solves (2), be a solution to the sparse PCA problem for the sketch à formed from the s sampled data elements. Suppose the number of samples s satisfies

    s ≥ 2k²ε^-2 (ρ² + εγ/(3k)) log((m + n)/δ)

(ρ² and γ are dimensionless quantities that depend only on A). Then, with probability at least 1 − δ,

    trace(S̃_k^T A^T A S̃_k) ≥ trace(S_k^T A^T A S_k) − 2ε(2 + ε/k) ‖A‖_2².

The dependence of ρ² and γ on A is given in Section 2.1. Roughly speaking, we can ignore the term with γ since it is multiplied by ε/k, and ρ² = O(k̃ max{m, n}), where k̃ is the stable (numerical) rank of A. To paraphrase Theorem 2, when the stable rank is a small constant, with O(k² max{m, n}) samples one can recover almost as good sparse principal components as with all the data (the price being a small fraction of the optimal variance, since OPT_k ≥ ‖A‖_2²). As far as we know, the only prior work related to the problem we consider here is [10], which proposed a specific method to construct sparse PCA from incomplete data. However, we develop a general tool that can be used with any existing sparse PCA heuristic.
Moreover, we derive much simpler bounds (Theorems 1 and 2) using matrix concentration inequalities, as opposed to the ε-net arguments in [10]. We also give an application of Theorem 1 to running sparse PCA after "denoising" the data using a greedy thresholding algorithm that sets the small elements to zero (see Theorem 3). Such denoising is appropriate when the observed matrix has been element-wise perturbed by small noise, and the uncontaminated data matrix is sparse and contains large elements. We show that if an appropriate fraction of the (noisy) data is set to zero, one can still recover sparse principal components. This gives a principled approach to regularizing sparse PCA in the presence of small noise when the data is sparse.

Not only do our algorithms preserve the quality of the sparse principal components, but iterative algorithms for sparse PCA, whose running time is proportional to the number of non-zero entries in the input matrix, benefit from the sparsity of Ã. Our experiments show about five-fold speed gains while producing near-comparable sparse components using less than 10% of the data.

Discussion. In summary, we show that one can recover sparse PCA from incomplete data while gaining computationally at the same time. Our result holds for the optimal sparse components from A versus from Ã. One cannot efficiently find these optimal components (since the problem is NP-hard to even approximate), so one runs a heuristic, in which case the approximation error of the heuristic would have to be taken into account. Our experiments show that using the incomplete data with the heuristics is just as good as using those same heuristics with the complete data.

In practice, one may not be able to choose which data to sample; rather, the samples are given to you. Our result establishes that if the samples are chosen with larger values being more likely, then one can recover sparse PCA.
In practice one has no choice but to run the sparse PCA on these sampled elements and hope. Our theoretical results suggest that the outcome will be reasonable. This is because, while we do not have specific control over which samples we get, the samples are likely to represent the larger elements. For example, with user-recommendation data, users are more likely to rate items they either really like (large positive value) or really dislike (large negative value).

Notation. We use bold uppercase (e.g., X) for matrices and bold lowercase (e.g., x) for column vectors. The i-th row of X is X_(i), and the i-th column of X is X^(i). Let [n] denote the set {1, 2, ..., n}. E(X) is the expectation of a random variable X; for a matrix, E(X) denotes the element-wise expectation. For a matrix X ∈ R^{m×n}, the Frobenius norm is ‖X‖_F² = Σ_{i,j=1}^{m,n} X_ij², and the spectral (operator) norm is ‖X‖_2 = max_{‖y‖_2=1} ‖Xy‖_2. We also have the ℓ1 and ℓ0 norms: ‖X‖_ℓ1 = Σ_{i,j=1}^{m,n} |X_ij| and ‖X‖_0 (the number of non-zero entries in X). The k-th largest singular value of X is σ_k(X), and log x is the natural logarithm of x.

2 Sparse PCA from a Sketch

In this section, we prove Theorem 1 and give a simple application to zeroing small fluctuations as a way to regularize against noise. In the next section we use a more sophisticated way to select the elements of the matrix, allowing us to tolerate a sparser matrix (more incomplete data) while still recovering sparse PCA to reasonable accuracy.

Theorem 1 will be a corollary of a more general result for a class of optimization problems involving a Lipschitz-like objective function over an arbitrary (not necessarily convex) domain. Let f(V, X) be a function that is defined for a matrix variable V and a matrix parameter X.
The optimization variable V lies in some feasible set S, which is arbitrary. The parameter X is also arbitrary. We assume that f is locally Lipschitz in X, that is,

    |f(V, X) − f(V, X̃)| ≤ γ(X) ‖X − X̃‖_2    for all V ∈ S.

(Note that we allow the "Lipschitz constant" γ to depend on the fixed matrix X but not on the variables V, X̃; this is more general than a globally Lipschitz objective.) The next lemma is the key tool we need to prove Theorem 1, and it may be of independent interest in other optimization settings. We are interested in maximizing f(V, X) w.r.t. V to obtain V*. But we only have an approximation X̃ for X, and so we maximize f(V, X̃) to obtain Ṽ*, which will be a suboptimal solution with respect to X. We wish to bound f(V*, X) − f(Ṽ*, X), which quantifies how suboptimal Ṽ* is w.r.t. X.

Lemma 1 (Surrogate optimization bound) Let f(V, X) be γ-locally Lipschitz w.r.t. X over the domain V ∈ S. Define V* = arg max_{V∈S} f(V, X) and Ṽ* = arg max_{V∈S} f(V, X̃). Then,

    f(V*, X) − f(Ṽ*, X) ≤ 2γ(X) ‖X − X̃‖_2.

In the lemma, the function f and the domain S are arbitrary. In our setting, X ∈ R^{n×n}, the domain is S = {V ∈ R^{n×k} : V^T V = I_k, ‖V^(j)‖_0 ≤ r}, and f(V, X) = trace(V^T X V). We first show that f is Lipschitz w.r.t. X with γ = k (a constant independent of X). Let the representation of V by its columns be V = [v_1, ..., v_k]. Then,

    |trace(V^T X V) − trace(V^T X̃ V)| = |trace((X − X̃) V V^T)| ≤ Σ_{i=1}^{k} σ_i(X − X̃) ≤ k ‖X − X̃‖_2,

where σ_i(A) is the i-th largest singular value of A (we used von Neumann's trace inequality and the fact that V V^T is a k-dimensional projection).
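Lemma 1 can be checked by brute force for this objective in a small instance (assuming numpy). For k = 1 the feasible set is the r-sparse unit vectors, and for each size-r support the best unit vector is the top eigenvector of the corresponding principal submatrix of X, so exhaustive enumeration of supports gives the exact optimum:

```python
import itertools
import numpy as np

def best_sparse_value(X, r):
    """max v^T X v over unit vectors with at most r non-zeros (k = 1),
    by enumerating all size-r supports; returns the optimum and a maximizer."""
    n = X.shape[0]
    best_val, best_v = -np.inf, None
    for support in itertools.combinations(range(n), r):
        idx = list(support)
        w, U = np.linalg.eigh(X[np.ix_(idx, idx)])  # eigenvalues ascending
        if w[-1] > best_val:
            best_val = w[-1]
            v = np.zeros(n)
            v[idx] = U[:, -1]
            best_v = v
    return best_val, best_v

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
E = 0.05 * rng.standard_normal((30, 8))
X = A.T @ A                      # true parameter
Xt = (A + E).T @ (A + E)         # perturbed (surrogate) parameter

opt, _ = best_sparse_value(X, r=3)     # f(V*, X)
_, v_t = best_sparse_value(Xt, r=3)    # maximizer of the surrogate
subopt = v_t @ X @ v_t                 # f(V~*, X)
gap = 2 * np.linalg.norm(X - Xt, 2)    # Lemma 1 bound with gamma = k = 1
assert opt - subopt <= gap + 1e-9
```

The enumeration is exponential in r and is only meant as a sanity check of the bound, not as a practical solver.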
Now, by Lemma 1, trace(V*^T X V*) − trace(Ṽ*^T X Ṽ*) ≤ 2k ‖X − X̃‖_2. Theorem 1 follows by setting X = A^T A and X̃ = Ã^T Ã.¹

¹Theorem 1 can also be proved as follows: trace(V*^T X V*) − trace(Ṽ*^T X Ṽ*) = trace(V*^T X V*) − trace(V*^T X̃ V*) + trace(V*^T X̃ V*) − trace(Ṽ*^T X Ṽ*) ≤ k‖X − X̃‖_2 + trace(V*^T X̃ V*) − trace(Ṽ*^T X Ṽ*) ≤ k‖X − X̃‖_2 + trace(Ṽ*^T X̃ Ṽ*) − trace(Ṽ*^T X Ṽ*) ≤ 2k‖X − X̃‖_2.

Greedy thresholding. We give the simplest scenario of incomplete data in which Theorem 1 gives some reassurance that one can compute good sparse principal components. Suppose the smallest data elements have been set to zero. This can happen, for example, if only the largest elements are measured, or, in a noisy setting, if the small elements are treated as noise and set to zero. So

    Ã_ij = A_ij  if |A_ij| ≥ δ;    Ã_ij = 0  if |A_ij| < δ.

Recall k̃ = ‖A‖_F² / ‖A‖_2² (the stable rank of A), and define the truncated norm ‖A_δ‖_F² = Σ_{|A_ij|<δ} A_ij². Let A = Ã + Δ. Then,

    ‖A^T A − Ã^T Ã‖_2 = ‖A^T Δ + Δ^T A − Δ^T Δ‖_2 ≤ 2‖A‖_2 ‖Δ‖_2 + ‖Δ‖_2².    (3)

By construction, ‖Δ‖_F² = ‖A_δ‖_F². Suppose the zeroing of elements only loses a fraction of the energy in A, i.e., δ is selected so that ‖A_δ‖_F² ≤ ε² ‖A‖_F² / k̃; that is, an ε²/k̃ fraction of the total variance in A has been lost in the unmeasured (or zeroed) data. Then ‖Δ‖_2 ≤ ‖Δ‖_F ≤ ε ‖A‖_F / √k̃ = ε ‖A‖_2.

Theorem 3 Suppose that à is created from A by zeroing all elements that are less than δ in magnitude, and δ is such that the truncated norm satisfies ‖A_δ‖_F² ≤ ε² ‖A‖_F² / k̃. Then the sparse PCA solution Ṽ* satisfies

    trace(Ṽ*^T A^T A Ṽ*) ≥ trace(V*^T A^T A V*) − 2kε(2 + ε) ‖A‖_2².

Theorem 3 shows that it is possible to recover sparse PCA after setting small elements to zero. This is appropriate when most of the elements in A are small noise and a few of the elements contain large data values. For example, if the data consists of O(√(nm)) sparse large elements (of magnitude, say, 1) and nm − O(√(nm)) small elements whose magnitude is o(1/√(nm)) (a high signal-to-noise setting), then ‖A_δ‖_F² / ‖A‖_2² → 0, and with just a sparse sampling of the O(√(nm)) large elements (very incomplete data) we recover near-optimal sparse PCA.

Greedily keeping only the large elements of the matrix requires a particular structure in A to work, and it is based on a crude Frobenius-norm bound for the spectral error. In Section 2.1, we use recent results in element-wise matrix sparsification to choose the elements in a randomized way, with a bias toward large elements. With high probability, one can directly bound the spectral error and hence get better performance.

Algorithm 1: Hybrid (ℓ1, ℓ2)-Element Sampling
Input: A ∈ R^{m×n}; number of samples s; probabilities {p_ij}.
1: Set à = 0_{m×n}.
2: for t = 1, ..., s (i.i.d. trials with replacement) do
3:     Randomly sample indices (i_t, j_t) ∈ [m] × [n] with P[(i_t, j_t) = (i, j)] = p_ij.
4:     Update Ã: Ã_{i_t j_t} ← Ã_{i_t j_t} + A_{i_t j_t} / (s · p_{i_t j_t}).
5: return à (with at most s non-zero entries).

2.1 An (ℓ1, ℓ2)-Sampling Based Sketch

In the previous section, we created the sketch by deterministically setting the small data elements to zero. Instead, we could randomly select the data elements to keep. It is natural to bias this random sampling toward the larger elements. Therefore, we define sampling probabilities for each data element A_ij which are proportional to a mixture of the absolute value and the square of the element:

    p_ij = α |A_ij| / ‖A‖_ℓ1 + (1 − α) A_ij² / ‖A‖_F²,    (4)

where α ∈ (0, 1] is a mixing parameter. Such a sampling probability was used in [8] to sample data elements in independent trials to get a sketch Ã. We repeat the prototypical algorithm for element-wise matrix sampling as Algorithm 1.

Note that, unlike with the deterministic zeroing of small elements, in this sampling scheme one samples the element A_ij with probability p_ij and then rescales it by 1/p_ij. To see the intuition for this rescaling, consider the expected outcome of a single sample: E[Ã_ij] = p_ij · (A_ij/p_ij) + (1 − p_ij) · 0 = A_ij; that is, Ã is a sparse but unbiased estimate of A. This unbiasedness holds for any choice of the sampling probabilities p_ij defined over the elements of A in Algorithm 1. However, for an appropriate choice of the sampling probabilities, we get much more than unbiasedness; we can control the spectral norm of the deviation, ‖A − Ã‖_2.
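Algorithm 1 with the hybrid probabilities of (4) is only a few lines of code. The sketch below (assuming numpy) takes α as a plain input rather than computing the optimal α* of [8]; the rescaling by 1/(s·p_ij) makes the sketch unbiased, as noted above:

```python
import numpy as np

def hybrid_probabilities(A, alpha):
    """Element-sampling probabilities p_ij of Eq. (4): a mixture of the
    l1 distribution and the squared-Frobenius distribution over entries."""
    return alpha * np.abs(A) / np.abs(A).sum() + (1 - alpha) * A**2 / (A**2).sum()

def element_sample_sketch(A, s, P, rng):
    """Algorithm 1: s i.i.d. samples with replacement; each sampled entry
    is accumulated rescaled by 1/(s * p_ij), so E[A_tilde] = A."""
    m, n = A.shape
    p = P.ravel()
    picks = rng.choice(m * n, size=s, p=p)      # step 3, all trials at once
    At = np.zeros(m * n)
    np.add.at(At, picks, A.ravel()[picks] / (s * p[picks]))  # step 4
    return At.reshape(m, n)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 30))
P = hybrid_probabilities(A, alpha=0.5)
At = element_sample_sketch(A, s=400, P=P, rng=rng)
assert np.count_nonzero(At) <= 400              # at most s non-zero entries
```

Since trials are drawn with replacement, repeated picks of the same entry accumulate, which is why `np.add.at` (unbuffered accumulation) is used rather than plain indexed assignment.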
In particular, the hybrid (ℓ1, ℓ2) distribution in (4) was analyzed in [8], where an optimal choice of the mixing parameter α* is suggested which minimizes the theoretical bound on ‖A − Ã‖_2. The algorithm to choose α* is summarized in Algorithm 1 of [8].

Using the probabilities in (4) to create the sketch à via Algorithm 1, with α* selected using Algorithm 1 of [8], one can prove a bound on ‖A − Ã‖_2. We state a simplified version of the bound from [8] in Theorem 4.

Theorem 4 ([8]) Let A ∈ R^{m×n} and let ε > 0 be an accuracy parameter. Define probabilities p_ij as in (4) with α* chosen using Algorithm 1 of [8]. Let à be the sparse sketch produced by Algorithm 1 with a number of samples s ≥ 2ε^-2 (ρ² + γε/3) log((m + n)/δ), where

    ρ² = k̃ · max{m, n} · (α √k̃ ‖A‖_2 / ‖A‖_ℓ1 + (1 − α))^-1    and    γ ≤ 1 + √(mn k̃ / α).

Then, with probability at least 1 − δ,

    ‖A − Ã‖_2 ≤ ε ‖A‖_2.

3 Experiments

We show the experimental performance of sparse PCA from a sketch using several real data matrices. As we mentioned, sparse PCA is NP-hard, and so we must use heuristics. These heuristics are discussed next, followed by the data, the experimental design and, finally, the results.

Algorithms for Sparse PCA: Let G (ground truth) denote the algorithm which computes the principal components (which may not be sparse) of the full data matrix A; the optimal variance is OPT_k. We consider six heuristics for getting sparse principal components:

Gmax,r : the r largest-magnitude entries in each principal component generated by G.
Gsp,r : r-sparse components using the Spasm toolbox of [17] with A.
Hmax,r : the r largest entries of the principal components of the (ℓ1, ℓ2)-sampled sketch Ã.
Hsp,r : r-sparse components using Spasm with the (ℓ1, ℓ2)-sampled sketch Ã.
Umax,r : the r largest entries of the principal components of the uniformly sampled sketch Ã.
Usp,r : r-sparse components using Spasm with the uniformly sampled sketch Ã.

The output of an algorithm Z is a set of sparse principal components V, and our metric is f(Z) = trace(V^T A^T A V), where A is the original centered data. We consider the following statistics:

f(Gmax,r)/f(Gsp,r) : relative loss of greedy thresholding versus Spasm, illustrating the value of a good sparse PCA algorithm. Our sketch-based algorithms do not address this loss.
f(Hmax/sp,r)/f(Gmax/sp,r) : relative loss of using the (ℓ1, ℓ2)-sketch à instead of the complete data A. A ratio close to 1 is desired.
f(Umax/sp,r)/f(Gmax/sp,r) : relative loss of using the uniform sketch à instead of the complete data A. A benchmark to highlight the value of a good sketch.

We also report the computation time of the algorithms. We show results to confirm that sparse PCA algorithms using the (ℓ1, ℓ2)-sketch are nearly comparable to those same algorithms on the complete data, and that the gain in computation time from the sparse sketch is proportional to the sparsity.

Data Sets: We show results on image, text, stock, and gene expression data.

• Digit Data (m = 2313, n = 256): We use the [7] handwritten zip-code digit images (300 pixels/inch in 8-bit gray scale). Each pixel is a feature (normalized to be in [−1, 1]). Each 16 × 16 digit image forms a row of the data matrix A.
We focus on three digits: "6" (664 samples), "9" (644 samples), and "1" (1005 samples).

• TechTC Data (m = 139, n = 15170): We use the Technion Repository of Text Categorization Dataset (TechTC, see [6]) from the Open Directory Project (ODP). We removed words (features) with fewer than 5 letters. Each document (row) has unit norm.

• Stock Data (m = 7056, n = 1218): We use S&P100 stock market data with 7056 snapshots of prices for 1218 stocks. The prices of each day form a row of the data matrix, and a principal component represents an "index" of sorts; each stock is a feature.

• Gene Expression Data (m = 107, n = 22215): We use GSE10072 gene expression data for lung cancer from the NCBI Gene Expression Omnibus database. There are 107 samples (58 lung tumor cases and 49 normal lung controls) forming the rows of the data matrix, with 22,215 probes (features) from the GPL96 platform annotation table.

3.1 Results

We report results primarily for the top principal component (k = 1), which is the case most considered in the literature. When k > 1, our results do not qualitatively change. We note the optimal mixing parameter α*, obtained using Algorithm 1 of [8], for the various datasets in Table 1.

Handwritten Digits. We sample approximately 7% of the elements from the centered data using (ℓ1, ℓ2)-sampling, as well as uniform sampling. The performance for small r is shown in Table 1, including the running time τ. For this data, f(Gmax,r)/f(Gsp,r) ≈ 0.23 (r = 10), so it is important to use a good sparse PCA algorithm. We see from Table 1 that the (ℓ1, ℓ2)-sketch significantly outperforms the uniform sketch. A more extensive comparison of recovered variance is given in Figure 2(a). We also observe a speed-up of a factor of about 6 for the (ℓ1, ℓ2)-sketch.
We point out that the uniform sketch is reasonable for the digits data because most data elements are close to either +1 or −1, since the pixels are mostly black or white.

We show a visualization of the principal components in Figure 1. We observe that the sparse components from the (ℓ1, ℓ2)-sketch are almost identical to those from the complete data.

TechTC Data. We sample approximately 5% of the elements from the centered data using our (ℓ1, ℓ2)-sampling, as well as uniform sampling. For this data, f(Gmax,r)/f(Gsp,r) ≈ 0.84 (r = 10). We observe a very significant performance difference between the (ℓ1, ℓ2)-sketch and the uniform sketch. A more extensive comparison of recovered variance is given in Figure 2(b). We also observe a speed-up of a factor of about 6 for the (ℓ1, ℓ2)-sketch.

Table 1: Comparison of sparse principal components from the (ℓ1, ℓ2)-sketch and the uniform sketch.

          α*    r    f(Hmax/sp,r)/f(Gmax/sp,r)    τ(G)/τ(H)    f(Umax/sp,r)/f(Gmax/sp,r)    τ(G)/τ(U)
Digit    .42   40    0.99/0.90                    6.21         1.01/0.70                    5.33
TechTC    1    40    0.94/0.99                    5.70         0.41/0.38                    5.96
Stock    .10   40    1.00/1.00                    3.72         0.66/0.66                    4.76
Gene     .92   40    0.82/0.88                    3.61         0.65/0.15                    2.53

Figure 1: [Digits] Visualization of the top-3 sparse principal components for (a) r = 100%, (b) r = 50%, (c) r = 30%, (d) r = 10%. In each figure, the left panel shows Gsp,r and the right panel shows Hsp,r.

Figure 2: Performance of sparse PCA for the (ℓ1, ℓ2)-sketch and the uniform sketch over an extensive range of the sparsity constraint r (in percent), on (a) Digit, (b) TechTC, (c) Stock, and (d) Gene data; each panel plots f(Hsp,r)/f(Gsp,r) and f(Usp,r)/f(Gsp,r). The performance of the uniform sketch is significantly worse, highlighting the importance of a good sketch.
Unlike the digits data, which is uniformly near ±1, the text data is "spikey", and now it is important to sample with a bias toward larger elements, which is why the uniform sketch performs very poorly.

As a final comparison, we look at the actual sparse top component with sparsity parameter r = 10. The topic IDs in the TechTC data are 10567 ("US: Indiana: Evansville") and 11346 ("US: Florida"). The top-10 features (words) in the full PCA on the complete data are shown in Table 2.

Table 2: [TechTC] Top ten words in the top principal component of the complete data (the other words are discovered by some of the sparse PCA algorithms).

ID   Top 10 in Gmax,r     ID   Other words
1    evansville           11   service
2    florida              12   small
3    south                13   frame
4    miami                14   tours
5    indiana              15   faver
6    information          16   transaction
7    beach                17   needs
8    lauderdale           18   commercial
9    estate               19   bullet
10   spacer               20   inlets
                          21   producer

Table 3: [TechTC] Relative ordering of the words (w.r.t. Gmax,r) in the top sparse principal component with sparsity parameter r = 10.

Gmax,r   Hmax,r   Umax,r   Gsp,r   Hsp,r   Usp,r
1        1        6        1       1       6
2        2        14       2       2       14
3        3        15       3       3       15
4        4        16       4       4       16
5        5        17       5       5       17
6        7        7        6       7       7
7        6        18       8       7       18
8        8        19       6       8       19
9        11       20       12      9       20
10       12       21       13      11      21

In Table 3 we show which words appear in the top sparse principal component with sparsity r = 10 using various sparse PCA algorithms.
We observe that the sparse PCA from the (ℓ1, ℓ2)-sketch, with only 5% of the data sampled, matches quite closely the output of the same sparse PCA algorithm on the complete data (Gmax/sp,r matches Hmax/sp,r).

Stock Data. We sample about 2% of the non-zero elements from the centered data using (ℓ1, ℓ2)-sampling, as well as uniform sampling. For this data, f(Gmax,r)/f(Gsp,r) ≈ 0.96 (r = 10). We observe a very significant performance difference between the (ℓ1, ℓ2)-sketch and the uniform sketch. A more extensive comparison of recovered variance is given in Figure 2(c). We also observe a speed-up of a factor of about 4 for the (ℓ1, ℓ2)-sketch. Like the TechTC data, this dataset is "spikey", so biased sampling toward larger elements significantly outperforms the uniform sketch.

Gene Expression Data. We sample about 9% of the elements from the centered data using (ℓ1, ℓ2)-sampling, as well as uniform sampling. For this data, f(Gmax,r)/f(Gsp,r) ≈ 0.05 (r = 10), which means a good sparse PCA algorithm is imperative. We observe a very significant performance difference between the (ℓ1, ℓ2)-sketch and the uniform sketch. A more extensive comparison of recovered variance is given in Figure 2(d). We also observe a speed-up of a factor of about 4 for the (ℓ1, ℓ2)-sketch. Like the TechTC data, this dataset is also "spikey", and consequently biased sampling toward larger elements significantly outperforms the uniform sketch.

Performance of Other Sketches: We briefly report on other options for sketching A. We consider a suboptimal α (not the α* from Algorithm 1 of [8]) in (4) to construct a suboptimal hybrid distribution, and use it in proto-Algorithm 1 to construct a sparse sketch.
Figure 3 reveals that a good sketch using the optimal α∗ is important.

Figure 3: [Stock data] Performance of a sketch using suboptimal α, illustrating the importance of the optimal mixing parameter α∗.

Conclusion: It is possible to use a sparse sketch (incomplete data) to recover nearly as good sparse principal components as one would have obtained with the complete data. We mention that, while Gmax, which uses the largest weights in the unconstrained PCA, does not perform well with respect to the variance, it does identify good features. A simple enhancement to Gmax is to recalibrate the sparse component after identifying the features: this is an unconstrained PCA problem on just the columns of the data matrix corresponding to those features. This recalibration can be used to improve any sparse PCA algorithm.
Our algorithms are simple and efficient, and many interesting avenues for further research remain. Can the sampling complexity for the top-k sparse PCA be reduced from O(k^2) to O(k)? We suspect that this should be possible by getting a better bound on Σ_{i=1}^k σ_i(A^T A − Ã^T Ã); we used the crude bound k‖A^T A − Ã^T Ã‖_2. We also presented a general surrogate optimization bound which may be of interest in other applications. In particular, it is pointed out in [13] that though PCA optimizes variance, a more natural way to look at PCA is as the linear projection of the data that minimizes the information loss. [13] gives efficient algorithms to find a sparse linear dimension reduction that minimizes information loss; the information loss of sparse PCA can be considerably higher than optimal.
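The recalibration enhancement to Gmax mentioned in the conclusion above is straightforward to implement: refit the loadings by unconstrained PCA on the selected columns. The sketch below is our own illustration of that step, with hypothetical names:

```python
import numpy as np

def recalibrate(A, support):
    """Given a feature set `support` selected by any sparse PCA
    algorithm, solve the unconstrained PCA problem on A[:, support]:
    the top right singular vector of the restricted matrix, embedded
    back into n dimensions, maximizes v^T A^T A v over all unit
    vectors supported on `support`."""
    support = list(support)
    _, _, Vt = np.linalg.svd(A[:, support], full_matrices=False)
    v = np.zeros(A.shape[1])
    v[support] = Vt[0]  # top right singular vector of A[:, support]
    return v
```

Because the refit maximizes the variance over all unit vectors with the given support, recalibration can only (weakly) increase the objective for a fixed feature set, which is why it improves any sparse PCA algorithm.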
To minimize information loss, the objective to maximize is f(V) = trace(A^T AV(AV)^† A). It would be interesting to see whether one can recover sparse low-information-loss linear projectors from incomplete data.
Acknowledgments: AK and PD are partially supported by NSF IIS-1447283 and IIS-1319280. MM-I was partially supported by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (the ARL Network Science CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

[1] M. Asteris, D. Papailiopoulos, and A. Dimakis. Non-negative sparse PCA with provable guarantees. In Proc. ICML, 2014.

[2] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Applied Statistics, 22:203–214, 1995.

[3] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.

[4] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, June 2008.

[5] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[6] E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5.
In Proceedings of the International Conference on Machine Learning, 2004.

[7] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[8] A. Kundu, P. Drineas, and M. Magdon-Ismail. Recovering PCA from hybrid-(ℓ1, ℓ2) sparse sampling of data elements. http://arxiv.org/pdf/1503.00547v1.pdf, 2015.

[9] J. Lei and V. Q. Vu. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics, 43(1):299–322, 2015.

[10] Karim Lounici. Sparse principal component analysis with missing observations. arXiv report: http://arxiv.org/abs/1205.7060, 2012.

[11] Z. Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.

[12] M. Magdon-Ismail. NP-hardness and inapproximability of sparse PCA. arXiv report: http://arxiv.org/abs/1502.05675, 2015.

[13] M. Magdon-Ismail and C. Boutsidis. arXiv report: http://arxiv.org/abs/1502.06626, 2015.

[14] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proc. ICML, 2006.

[15] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.

[16] Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015–1034, July 2008.

[17] K. Sjöstrand, L. H. Clemmensen, R. Larsen, and B. Ersbøll. SpaSM: a Matlab toolbox for sparse statistical modeling. Journal of Statistical Software (accepted for publication), 2012.

[18] N. Trendafilov, I. T. Jolliffe, and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12:531–547, 2003.

[19] Z. Wang, H. Lu, and H. Liu.
Nonconvex statistical optimization: minimax-optimal sparse PCA in polynomial time. http://arxiv.org/abs/1408.5352?context=cs.LG, 2014.

[20] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational & Graphical Statistics, 15(2):265–286, 2006.