{"title": "Large-Scale Sparse Principal Component Analysis with Application to Text Data", "book": "Advances in Neural Information Processing Systems", "page_first": 532, "page_last": 539, "abstract": "Sparse PCA provides a linear combination of a small number of features that maximizes variance across data. Although sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models.", "full_text": "Large-Scale Sparse Principal Component Analysis\n\nwith Application to Text Data\n\nYouwei Zhang\n\nDepartment of Electrical Engineering and Computer Sciences\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nzyw@eecs.berkeley.edu\n\nLaurent El Ghaoui\n\nDepartment of Electrical Engineering and Computer Sciences\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nelghaoui@eecs.berkeley.edu\n\nAbstract\n\nSparse PCA provides a linear combination of a small number of features that maximizes variance across data. 
Although sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models.\n\n1 Introduction\n\nThe sparse Principal Component Analysis (sparse PCA) problem is a variant of the classical PCA problem, which accomplishes a trade-off between the explained variance along a normalized vector and the number of non-zero components of that vector.\n\nSparse PCA not only brings better interpretation [1], but also provides statistical regularization [2] when the number of samples is less than the number of features. Various researchers have proposed different formulations and algorithms for this problem, ranging from ad-hoc methods such as factor rotation techniques [3] and simple thresholding [4], to greedy algorithms [5, 6]. Other algorithms include SCoTLASS by [7], SPCA by [8], the regularized SVD method by [9] and the generalized power method by [10]. These algorithms are based on non-convex formulations, and may only converge to a local optimum. 
The ℓ1-norm based semidefinite relaxation DSPCA, as introduced in [1], does guarantee global convergence and, as such, is an attractive alternative to local methods. In fact, it has been shown in [1, 2, 11] that simple ad-hoc methods, and the greedy, SCoTLASS and SPCA algorithms, often underperform DSPCA. However, the first-order algorithm for solving DSPCA, as developed in [1], has a computational complexity of O(n^4 √(log n)), with n the number of features, which is too high for many large-scale data sets. At first glance, this complexity estimate indicates that solving sparse PCA is much more expensive than PCA, since we can compute one principal component with a complexity of O(n^2).\n\nIn this paper we show that solving DSPCA is in fact computationally easier than PCA, and hence can be applied to very large-scale data sets. To achieve that, we first view DSPCA as an approximation to a harder, cardinality-constrained optimization problem. Based on that formulation, we describe a safe feature elimination method for that problem, which leads to an often important reduction in problem size, prior to solving the problem. Then we develop a block coordinate ascent algorithm with a computational complexity of O(n^3) to solve DSPCA, which is much faster than the first-order algorithm proposed in [1]. Finally, we observe that real data sets typically allow for a dramatic reduction in problem size, as afforded by our safe feature elimination result. The comparison between sparse PCA and PCA then becomes O(n̂^3) vs. O(n^2) with n̂ ≪ n, which can make sparse PCA surprisingly easier than PCA.\n\nIn Section 2, we review the ℓ1-norm based DSPCA formulation, relate it to an approximation of the ℓ0-norm based formulation, and highlight the safe feature elimination mechanism as a powerful pre-processing technique. 
We use Section 3 to present our fast block coordinate ascent algorithm. Finally, in Section 4, we demonstrate the efficiency of our approach on two large data sets, each one containing more than 100,000 features.\n\nNotation. R(Y) denotes the range of matrix Y, and Y† its pseudo-inverse. The notation log refers to the extended-value function, with log x = −∞ if x ≤ 0.\n\n2 Safe Feature Elimination\n\nPrimal problem. Given an n × n positive-semidefinite matrix Σ, the “sparse PCA” problem introduced in [1] is\n\nφ = max_Z Tr ΣZ − λ‖Z‖_1 : Z ⪰ 0, Tr Z = 1, (1)\n\nwhere λ ≥ 0 is a parameter encouraging sparsity. Without loss of generality we may assume that Σ ≻ 0.\n\nProblem (1) is in fact a relaxation of a PCA problem with a penalty on the cardinality of the variable:\n\nψ = max_x x^T Σ x − λ‖x‖_0 : ‖x‖_2 = 1, (2)\n\nwhere ‖x‖_0 denotes the cardinality (number of non-zero elements) of x. This can be seen by first writing problem (2) as\n\nmax_Z Tr ΣZ − λ√(‖Z‖_0) : Z ⪰ 0, Tr Z = 1, Rank(Z) = 1,\n\nwhere ‖Z‖_0 is the cardinality (number of non-zero elements) of Z. Since ‖Z‖_1 ≤ √(‖Z‖_0) ‖Z‖_F = √(‖Z‖_0), we obtain the relaxation\n\nmax_Z Tr ΣZ − λ‖Z‖_1 : Z ⪰ 0, Tr Z = 1, Rank(Z) = 1.\n\nFurther dropping the rank constraint leads to problem (1).\n\nBy viewing problem (1) as a convex approximation to the non-convex problem (2), we can leverage the safe feature elimination theorem first presented in [6, 12] for problem (2):\n\nTheorem 2.1 Let Σ = A^T A, where A = (a_1, . . . , a_n) ∈ R^{m×n}. 
We have\n\nψ = max_{‖ξ‖_2 = 1} ∑_{i=1}^n ((a_i^T ξ)^2 − λ)_+.\n\nAn optimal non-zero pattern corresponds to indices i with λ < (a_i^T ξ)^2 at optimum.\n\nWe observe that the i-th feature is absent at optimum if (a_i^T ξ)^2 ≤ λ for every ξ with ‖ξ‖_2 = 1. Hence, we can safely remove feature i ∈ {1, . . . , n} if\n\nΣ_ii = a_i^T a_i < λ. (3)\n\nA few remarks are in order. First, if we are interested in solving problem (1) as a relaxation of problem (2), we first calculate and rank all the feature variances, which takes O(nm) and O(n log n) respectively. Then we can safely eliminate any feature with variance less than λ. Second, the elimination criterion above is conservative. However, when looking for extremely sparse solutions, applying this safe feature elimination test with a large λ can dramatically reduce problem size and lead to huge computational savings, as will be demonstrated empirically in Section 4. Third, in practice, when PCA is performed on large data sets, a similar variance-based criterion is routinely employed to bring problem sizes down to a manageable level. This purely heuristic practice has a rigorous interpretation in the context of sparse PCA, as the above theorem states explicitly which features can be safely discarded.\n\n3 Block Coordinate Ascent Algorithm\n\nThe first-order algorithm developed in [1] to solve problem (1) has a computational complexity of O(n^4 √(log n)). With a theoretical convergence rate of O(1/ε), the DSPCA algorithm does not converge fast in practice. In this section, we develop a block coordinate ascent algorithm with better dependence on problem size (O(n^3)) that in practice converges much faster.\n\nFailure of a direct method. 
We seek to apply a “row-by-row” algorithm by which we update each row/column pair, one at a time. This algorithm appeared in the specific context of sparse covariance estimation in [13], and was extended to a large class of SDPs in [14]. Precisely, it applies to problems of the form\n\nmin_X f(X) − β log det X : L ≤ X ≤ U, X ≻ 0, (4)\n\nwhere X = X^T is an n × n matrix variable, L, U impose component-wise bounds on X, f is convex, and β > 0.\n\nHowever, if we try to update the rows/columns of Z in problem (1), the trace constraint will imply that we never modify the diagonal elements of Z. Indeed, at each step we update only one diagonal element, and it is entirely fixed given all the other diagonal elements. The row-by-row algorithm does not directly work in that case, nor in general for SDPs with equality constraints. The authors in [14] propose an augmented Lagrangian method to deal with such constraints, with a complication due to the choice of appropriate penalty parameters. In our case, we can apply a technique resembling the augmented Lagrangian technique, without this added complication. This is due to the homogeneous nature of the objective function and of the conic constraint. Thanks to the feature elimination result (Thm. 2.1), we can always assume without loss of generality that λ < σ²_min := min_{1≤i≤n} Σ_ii.\n\nDirect augmented Lagrangian technique. We can express problem (1) as\n\n(1/2)φ² = max_X Tr ΣX − λ‖X‖_1 − (1/2)(Tr X)² : X ⪰ 0. (5)\n\nThis expression results from the change of variable X = γZ, with Tr Z = 1 and γ ≥ 0. Optimizing over γ ≥ 0, and exploiting φ > 0 (which comes from our assumption that λ < σ²_min), leads to the result, with the optimal scaling factor γ equal to φ. 
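The scaling step above is elementary: for fixed Z with Tr Z = 1, writing X = γZ reduces the objective of (5) to g(γ) = γv − γ²/2 with v = Tr ΣZ − λ‖Z‖_1 > 0, a concave parabola maximized at γ = v with value v²/2. A quick pure-Python check of that one-dimensional step (the value of v here is an arbitrary illustrative number, not from the paper):

```python
# Objective of (5) restricted to the ray X = gamma*Z, where
# v = Tr(Sigma Z) - lambda*||Z||_1 is fixed by the choice of Z.
def g(gamma, v):
    return gamma * v - 0.5 * gamma ** 2

v = 3.7                 # arbitrary positive value for illustration
gamma_star = v          # analytic maximizer, gamma* = v
# sanity check: the analytic optimum matches a fine grid search over gamma >= 0
grid_best = max(g(0.001 * k, v) for k in range(10001))
assert abs(g(gamma_star, v) - v ** 2 / 2) < 1e-9
assert g(gamma_star, v) >= grid_best - 1e-6
```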
An optimal solution Z* to (1) can be obtained from an optimal solution X* to the above, via Z* = X*/φ. (In fact, we have Z* = X*/Tr(X*).)\n\nTo apply the row-by-row method to the above problem, we need to consider a variant of it with a strictly convex objective. That is, we address the problem\n\nmax_X Tr ΣX − λ‖X‖_1 − (1/2)(Tr X)² + β log det X : X ≻ 0, (6)\n\nwhere β > 0 is a penalty parameter. SDP theory ensures that if β = ε/n, then a solution to the above problem is ε-suboptimal for the original problem [15].\n\nOptimizing over one row/column. Without loss of generality, we consider the problem of updating the last row/column of the matrix variable X. Partition the latter and the covariance matrix Σ as\n\nX = [Y, y; y^T, x], Σ = [S, s; s^T, σ],\n\nwhere Y, S ∈ R^{(n−1)×(n−1)}, y, s ∈ R^{n−1}, and x, σ ∈ R. We are considering the problem above, where Y is fixed and (y, x) ∈ R^n is the variable. We use the notation t := Tr Y.\n\nThe conic constraint X ≻ 0 translates as y^T Y† y ≤ x, y ∈ R(Y), where R(Y) is the range of the matrix Y. We obtain the sub-problem\n\nψ := max_{x,y} 2(y^T s − λ‖y‖_1) + (σ − λ)x − (1/2)(t + x)² + β log(x − y^T Y† y) : y ∈ R(Y). (7)\n\nSimplifying the sub-problem. 
We can simplify the above problem, and in particular avoid the step of forming the pseudo-inverse of Y, by taking the dual of problem (7).\n\nUsing the conjugate relation, valid for every η > 0:\n\nlog η + 1 = min_{z>0} zη − log z,\n\nand with f(x) := (σ − λ)x − (1/2)(t + x)², we obtain\n\nψ + β = max_{y∈R(Y), x} 2(y^T s − λ‖y‖_1) + f(x) + β min_{z>0} (z(x − y^T Y† y) − log z)\n= min_{z>0} max_{y∈R(Y)} 2(y^T s − λ‖y‖_1 − (βz/2) y^T Y† y) + max_x (f(x) + βzx) − β log z\n= min_{z>0} h(z) + 2g(z),\n\nwhere, for z > 0, we define\n\nh(z) := −β log z + max_x (f(x) + βzx)\n= −β log z + max_x ((σ − λ + βz)x − (1/2)(t + x)²)\n= −(1/2)t² − β log z + max_x ((σ − λ − t + βz)x − (1/2)x²)\n= −(1/2)t² − β log z + (1/2)(σ − λ − t + βz)²,\n\nwith the following relationship at optimum:\n\nx = σ − λ − t + βz. (8)\n\nIn addition,\n\ng(z) := max_{y∈R(Y)} y^T s − λ‖y‖_1 − (βz/2) y^T Y† y\n= max_{y∈R(Y)} (y^T s + min_{v: ‖v‖_∞ ≤ λ} y^T v − (βz/2) y^T Y† y)\n= min_{v: ‖v‖_∞ ≤ λ} max_{y∈R(Y)} (y^T (s + v) − (βz/2) y^T Y† y)\n= min_{u: ‖u−s‖_∞ ≤ λ} max_{y∈R(Y)} (y^T u − (βz/2) y^T Y† y)\n= min_{u: ‖u−s‖_∞ ≤ λ} (1/(2βz)) u^T Y u,\n\nwith the following relationship at optimum:\n\ny = (1/(βz)) Y u. (9)\n\nPutting all this together, we obtain the dual of problem (7): with ψ′ := ψ + β + (1/2)t², and c := 
σ − λ − t, we have\n\nψ′ = min_{u,z} (1/(βz)) u^T Y u − β log z + (1/2)(c + βz)² : z > 0, ‖u − s‖_∞ ≤ λ.\n\nSince β is small, we can avoid large numbers in the above with the change of variable τ = βz:\n\nψ′ − β log β = min_{u,τ} (1/τ) u^T Y u − β log τ + (1/2)(c + τ)² : τ > 0, ‖u − s‖_∞ ≤ λ. (10)\n\nSolving the sub-problem. Problem (10) can be further decomposed into two stages.\n\nFirst, we solve the box-constrained QP\n\nR² := min_u u^T Y u : ‖u − s‖_∞ ≤ λ, (11)\n\nusing a simple coordinate descent algorithm to exploit sparsity of Y. Without loss of generality, we consider the problem of updating the first coordinate of u. Partition u, Y and s as\n\nu = [η; û], Y = [y_1, ŷ^T; ŷ, Ŷ], s = [s_1; ŝ],\n\nwhere Ŷ ∈ R^{(n−2)×(n−2)}, û, ŷ, ŝ ∈ R^{n−2}, and y_1, s_1 ∈ R are all fixed, while η ∈ R is the variable. We obtain the subproblem\n\nmin_η y_1 η² + (2ŷ^T û)η : |η − s_1| ≤ λ, (12)\n\nwhich we can solve for η analytically using the formula\n\nη = −ŷ^T û/y_1, if |s_1 + ŷ^T û/y_1| ≤ λ and y_1 > 0;\nη = s_1 − λ, if −ŷ^T û/y_1 < s_1 − λ, y_1 > 0, or if ŷ^T û > 0, y_1 = 0;\nη = s_1 + λ, if −ŷ^T û/y_1 > s_1 + λ, y_1 > 0, or if ŷ^T û ≤ 0, y_1 = 0. (13)\n\nNext, we set τ by solving the one-dimensional problem:\n\nmin_{τ
>0} R²/τ − β log τ + (1/2)(c + τ)².\n\nThe above can be reduced to a bisection problem over τ, or to solving a polynomial equation of degree 3.\n\nObtaining the primal variables. Once the above problem is solved, we can obtain the primal variables y, x as follows. Using formula (9), with βz = τ, we set y = (1/τ) Y u. For the diagonal element x, we use formula (8): x = c + τ = σ − λ − t + τ.\n\nAlgorithm summary. We summarize the above derivations in Algorithm 1. Notation: for any symmetric matrix A ∈ R^{n×n}, let A_{\\i\\j} denote the matrix produced by removing row i and column j. Let A_j denote column j (or row j) with the diagonal element A_{jj} removed.\n\nConvergence and complexity. Our algorithm solves DSPCA by first casting it to problem (6), which is in the general form (4). Therefore, the convergence result from [14] readily applies, and hence every limit point that our block coordinate ascent algorithm converges to is the global optimizer. The simple coordinate descent algorithm solving problem (11) only involves a vector product and can easily exploit sparsity in Y. Updating each column/row takes O(n²), and there are n such columns/rows in total. Therefore, our algorithm has a computational complexity of O(Kn³), where K is the number of sweeps through the columns. In practice, K is fixed at a number independent of problem size (typically K = 5). Hence our algorithm has better dependence on the problem size compared to the O(n^4 √(log n)) required by the first-order algorithm developed in [1].\n\nFig. 1 shows that our algorithm converges much faster than the first-order algorithm. On the left, both algorithms are run on a covariance matrix Σ = F^T F with F Gaussian. 
On the right, the covariance matrix comes from a “spiked model” similar to that in [2], with Σ = uu^T + VV^T/m, where u ∈ R^n is the true sparse leading eigenvector, with Card(u) = 0.1n, V ∈ R^{n×m} is a noise matrix with V_ij ∼ N(0, 1), and m is the number of observations.\n\n4 Numerical Examples\n\nIn this section, we analyze two publicly available large data sets, the NYTimes news articles data and the PubMed abstracts data, available from the UCI Machine Learning Repository [16]. Both text collections record word occurrences in the form of bag-of-words. The NYTimes text collection contains 300,000 articles and has a dictionary of 102,660 unique words, resulting in a file of size 1 GB. The even larger PubMed data set has 8,200,000 abstracts with 141,043 unique words in them, giving a file of size 7.8 GB.\n\nAlgorithm 1 Block Coordinate Ascent Algorithm\nInput: The covariance matrix Σ, and a parameter λ > 0.\n1: Set X^(0) = I\n2: repeat\n3: for j = 1 to n do\n4: Let X^(j−1) denote the current iterate. Solve the box-constrained quadratic program\n\nR² := min_u u^T X^(j−1)_{\\j\\j} u : ‖u − Σ_j‖_∞ ≤ λ\n\nusing the coordinate descent algorithm.\n5: Solve the one-dimensional problem\n\nmin_{τ>0} R²/τ − β log τ + (1/2)(Σ_jj − λ − Tr X^(j−1)_{\\j\\j} + τ)²\n\nusing a bisection method, or by solving a polynomial equation of degree 3.\n6: First set X^(j)_{\\j\\j} = X^(j−1)_{\\j\\j}, and then set both X^(j)'s column j and row j using\n\nX^(j)_j = (1/τ) X^(j−1)_{\\j\\j} u, X^(j)_{jj} = Σ_jj − λ − Tr X^(j−1)_{\\j\\j} + τ\n\n7: end for\n8: Set X^(0) = X^(n)\n9: until convergence\n\nFigure 1: Speed comparisons between Block Coordinate Ascent and First-Order\n\n
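The block coordinate ascent method of Algorithm 1 can be sketched in a few dozen lines of NumPy. This is a minimal dense-matrix illustration under our own naming (the function and variable names are not from the paper's implementation), assuming a small dense Σ with λ < min_i Σ_ii and a fixed barrier parameter β; the τ step solves the stationarity cubic τ³ + cτ² − βτ − R² = 0 by bisection:

```python
import numpy as np

def solve_box_qp(Y, s, lam, sweeps=50):
    """min_u u^T Y u  s.t. ||u - s||_inf <= lam, via the analytic
    per-coordinate update (13): clamp the unconstrained minimizer to the box."""
    u = s.copy()
    for _ in range(sweeps):
        for i in range(len(u)):
            b = Y[i] @ u - Y[i, i] * u[i]          # \hat{y}^T \hat{u}
            if Y[i, i] > 0:
                ui = -b / Y[i, i]                   # unconstrained minimizer
            else:
                ui = -np.inf if b > 0 else np.inf   # linear case: pick an endpoint
            u[i] = min(max(ui, s[i] - lam), s[i] + lam)
    return u

def solve_tau(R2, c, beta, lo=1e-12, hi=1e6, iters=200):
    """min_{tau>0} R2/tau - beta*log(tau) + 0.5*(c + tau)^2.
    Its stationarity condition is tau^3 + c*tau^2 - beta*tau - R2 = 0,
    which has a unique positive root (the objective is strictly convex)."""
    g = lambda t: t ** 3 + c * t ** 2 - beta * t - R2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def block_coordinate_ascent(Sigma, lam, beta=1e-3, sweeps=5):
    """Sketch of Algorithm 1; returns Z* = X*/Tr(X*)."""
    n = Sigma.shape[0]
    X = np.eye(n)
    for _ in range(sweeps):
        for j in range(n):
            mask = np.arange(n) != j
            Y = X[np.ix_(mask, mask)]               # X^(j-1) with row/col j removed
            s = Sigma[mask, j]
            c = Sigma[j, j] - lam - np.trace(Y)
            u = solve_box_qp(Y, s, lam)
            tau = solve_tau(u @ Y @ u, c, beta)
            X[mask, j] = X[j, mask] = (Y @ u) / tau  # new column/row j
            X[j, j] = c + tau                        # new diagonal element
    return X / np.trace(X)
```

On a toy diagonal covariance, e.g. `block_coordinate_ascent(np.diag([3.0, 1.0]), lam=0.5)`, the iterate concentrates almost all of its unit trace on the high-variance feature, as the formulation predicts.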
These data matrices are so large that we cannot even load them into memory all at once, which makes even the use of classical PCA difficult. However, with the pre-processing technique presented in Section 2 and the block coordinate ascent algorithm developed in Section 3, we are able to perform sparse PCA analysis of these data, also thanks to the fact that the variances of words decrease drastically when we rank them, as shown in Fig. 2. Note that the feature elimination result only requires the computation of each feature's variance, and that this task is easy to parallelize.\n\nBy doing sparse PCA analysis of these text data, we hope to find interpretable principal components that can be used to summarize and explore the large corpora. Therefore, we set the target cardinality for each principal component to be 5. As we run our algorithm with a coarse range of λ to search for a solution with the given cardinality, we might end up accepting a solution with cardinality close, but not necessarily equal, to 5, and stop there to save computational time.\n\nFigure 2: Sorted variances of 102,660 words in NYTimes (left) and 141,043 words in PubMed (right)\n\nThe top 5 sparse principal components are shown in Table 1 for NYTimes and in Table 2 for PubMed. Clearly, the first principal component for NYTimes is about business, the second one about sports, the third about the U.S., the fourth about politics and the fifth about education. Bear in mind that the NYTimes data from the UCI Machine Learning Repository “have no class labels, and for copyright reasons no filenames or other document-level metadata” [16]. 
The sparse principal components still unambiguously identify and perfectly correspond to the topics used by The New York Times itself to classify articles on its own website.\n\nTable 1: Words associated with the top 5 sparse principal components in NYTimes\n\n1st PC (6 words): million, percent, business, company, market, companies\n2nd PC (5 words): point, play, team, season, game\n3rd PC (5 words): official, government, united states, u s, attack\n4th PC (4 words): president, campaign, bush, administration\n5th PC (4 words): school, program, children, student\n\nAfter the pre-processing steps, it takes our algorithm around 20 seconds to search for a range of λ and find one sparse principal component with the target cardinality (for the NYTimes data, in our current implementation on a MacBook laptop with a 2.4 GHz Intel Core 2 Duo processor and 2 GB of memory).\n\nTable 2: Words associated with the top 5 sparse principal components in PubMed\n\n1st PC (5 words): patient, cell, treatment, protein, disease\n2nd PC (5 words): effect, level, activity, concentration, rat\n3rd PC (5 words): human, expression, receptor, binding\n4th PC (4 words): tumor, mice, cancer, malignant, carcinoma\n5th PC (4 words): year, infection, age, children, child\n\nA surprising finding is that the safe feature elimination test, combined with the fact that word variances decrease rapidly, enables our block coordinate ascent algorithm to work on covariance matrices of order at most n = 500, instead of the full order (n = 102,660) covariance matrix for NYTimes, so as to find a solution with cardinality of around 5. 
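The reduction described here comes directly from the test (3): compute each feature's variance Σ_ii = a_i^T a_i (only the diagonal of the covariance matrix, which never needs to be formed in full) and keep the features whose variance reaches λ. A minimal pure-Python sketch, with illustrative names and toy centered columns of our own choosing:

```python
def safe_feature_elimination(columns, lam):
    """Return indices of features surviving the test Sigma_ii >= lam.

    `columns` maps feature index -> list of (centered) entries of column a_i.
    By Theorem 2.1, a feature with a_i^T a_i < lam is absent from any optimal
    solution at this lambda, so it can be dropped before solving the problem.
    """
    kept = []
    for i, col in enumerate(columns):
        variance = sum(v * v for v in col)   # Sigma_ii = a_i^T a_i
        if variance >= lam:
            kept.append(i)
    return kept

# three toy features with rapidly decreasing variances, as in the text data:
# variances are 8.0, 0.5 and 0.02, so with lam = 0.4 only the first two survive
cols = [[2.0, -2.0], [0.5, -0.5], [0.1, -0.1]]
print(safe_feature_elimination(cols, 0.4))   # -> [0, 1]
```

The larger λ is (i.e., the sparser the target solution), the more aggressive this pre-processing becomes, which is exactly the regime the experiments above operate in.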
In the case of PubMed, our algorithm only needs to work on covariance matrices of order at most n = 1000, instead of the full order (n = 141,043) covariance matrix. Thus, at the values of the penalty parameter λ that a target cardinality of 5 commands, we observe a dramatic reduction in problem sizes, about 150 to 200 times smaller than the original sizes, respectively. This motivates our conclusion that sparse PCA is, in a sense, easier than PCA itself.\n\n5 Conclusion\n\nThe safe feature elimination result, coupled with a fast block coordinate ascent algorithm, allows us to solve sparse PCA problems for very large-scale, real-life data sets. The overall method works especially well when the target cardinality of the result is small, which is often the case in applications where interpretability by a human is key. The algorithm we proposed has better computational complexity than, and in practice converges much faster than, the first-order algorithm developed in [1]. Our experiments on text data also show that sparse PCA can be a promising approach towards summarizing and organizing a large text corpus.\n\nReferences\n\n[1] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation of sparse PCA using semidefinite programming. SIAM Review, 49(3), 2007.\n\n[2] A. A. Amini and M. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5B):2877–2921, 2009.\n\n[3] I. T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22:29–35, 1995.\n\n[4] J. Cadima and I. T. Jolliffe. Loadings and correlations in the interpretation of principal components. 
Journal of Applied Statistics, 22:203–214, 1995.\n\n[5] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: exact and greedy algorithms. Advances in Neural Information Processing Systems, 18, 2006.\n\n[6] A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, 2008.\n\n[7] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547, 2003.\n\n[8] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational & Graphical Statistics, 15(2):265–286, 2006.\n\n[9] Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal., 99:1015–1034, July 2008.\n\n[10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. arXiv:0811.4724, 2008.\n\n[11] Y. Zhang, A. d'Aspremont, and L. El Ghaoui. Sparse PCA: Convex relaxations, algorithms and applications. In M. Anjos and J. B. Lasserre, editors, Handbook on Semidefinite, Cone and Polynomial Optimization: Theory, Algorithms, Software and Applications. Springer, 2011. To appear.\n\n[12] L. El Ghaoui. On the quality of a semidefinite programming bound for sparse principal component analysis. arXiv:math/060144, February 2006.\n\n[13] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, March 2008.\n\n[14] Zaiwen Wen, Donald Goldfarb, Shiqian Ma, and Katya Scheinberg. Row by row methods for semidefinite programming. 
Technical report, Dept. of IEOR, Columbia University, 2009.\n\n[15] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.\n\n[16] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n", "award": [], "sourceid": 376, "authors": [{"given_name": "Youwei", "family_name": "Zhang", "institution": null}, {"given_name": "Laurent", "family_name": "Ghaoui", "institution": null}]}