{"title": "A Direct Formulation for Sparse PCA Using Semidefinite Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 41, "page_last": 48, "abstract": null, "full_text": "A direct formulation for sparse PCA using semidefinite programming

Alexandre d'Aspremont (EECS Dept., U.C. Berkeley, Berkeley, CA 94720; alexandre.daspremont@m4x.org)
Laurent El Ghaoui (SAC Capital, 540 Madison Avenue, New York, NY 10029; laurent.elghaoui@sac.com; on leave from EECS, U.C. Berkeley)
Michael I. Jordan (EECS and Statistics Depts., U.C. Berkeley, Berkeley, CA 94720; jordan@cs.berkeley.edu)
Gert R. G. Lanckriet (EECS Dept., U.C. Berkeley, Berkeley, CA 94720; gert@eecs.berkeley.edu)

Abstract

We examine the problem of approximating, in the Frobenius-norm sense, a positive semidefinite symmetric matrix by a rank-one matrix, with an upper bound on the cardinality of its eigenvector. The problem arises in the decomposition of a covariance matrix into sparse factors, and has wide applications ranging from biology to finance. We use a modification of the classical variational representation of the largest eigenvalue of a symmetric matrix, where cardinality is constrained, and derive a semidefinite programming based relaxation for our problem.

1 Introduction

Principal component analysis (PCA) is a popular tool for data analysis and dimensionality reduction. It has applications throughout science and engineering. In essence, PCA finds linear combinations of the variables (the so-called principal components) that correspond to directions of maximal variance in the data. It can be performed via a singular value decomposition (SVD) of the data matrix A, or via an eigenvalue decomposition if A is a covariance matrix.

The importance of PCA is due to several factors. 
First, by capturing directions of maximum variance in the data, the principal components offer a way to compress the data with minimum information loss. Second, the principal components are uncorrelated, which can aid with interpretation or subsequent statistical analysis. On the other hand, PCA has a number of well-documented disadvantages as well. A particular disadvantage that is our focus here is the fact that the principal components are usually linear combinations of all variables. That is, all weights in the linear combination (known as loadings) are typically non-zero. In many applications, however, the coordinate axes have a physical interpretation; in biology, for example, each axis might correspond to a specific gene. In these cases, the interpretation of the principal components would be facilitated if these components involved very few non-zero loadings (coordinates). Moreover, in certain applications, e.g., financial asset trading strategies based on principal component techniques, the sparsity of the loadings has important consequences, since fewer non-zero loadings imply fewer fixed transaction costs.

It would thus be of interest to be able to discover \"sparse principal components\", i.e., sets of sparse vectors spanning a low-dimensional space that explain most of the variance present in the data. To achieve this, it is necessary to sacrifice some of the explained variance and the orthogonality of the principal components, albeit hopefully not too much.

Rotation techniques are often used to improve interpretation of the standard principal components [1]. [2] considered simple principal components by restricting the loadings to take values from a small set of allowable integers, such as 0, 1, and -1. [3] propose an ad hoc way to deal with the problem, where the loadings with small absolute value are thresholded to zero. 
We will call this approach \"simple thresholding.\" Later, a method called SCoTLASS was introduced by [4] to find modified principal components with possible zero loadings. In [5] a new approach, called sparse PCA (SPCA), was proposed to find modified components with zero loadings, based on the fact that PCA can be written as a regression-type optimization problem. This allows the application of LASSO [6], a penalization technique based on the L1 norm.

In this paper, we propose a direct approach (called DSPCA in what follows) that improves the sparsity of the principal components by directly incorporating a sparsity criterion in the PCA problem formulation and then relaxing the resulting optimization problem, yielding a convex optimization problem. In particular, we obtain a convex semidefinite programming (SDP) formulation.

SDP problems can be solved in polynomial time via general-purpose interior-point methods [7], and our current implementation of DSPCA makes use of these general-purpose methods. This suffices for an initial empirical study of the properties of DSPCA and for comparison to the algorithms discussed above on problems of small to medium dimensionality. For high-dimensional problems, the general-purpose methods are not viable and it is necessary to exploit special structure in the problem. It turns out that our problem can be expressed as a special type of saddle-point problem that is well suited to recent specialized algorithms, such as those described in [8, 9]. These algorithms offer a significant reduction in computational time compared to generic SDP solvers. In the current paper, however, we restrict ourselves to an investigation of the basic properties of DSPCA on problems for which the generic methods are adequate.

Our paper is structured as follows. 
In Section 2, we show how to efficiently derive a sparse rank-one approximation of a given matrix using a semidefinite relaxation of the sparse PCA problem. In Section 3, we derive an interesting robustness interpretation of our technique, and in Section 4 we describe how to use this interpretation in order to decompose a matrix into sparse factors. Section 5 outlines different algorithms that can be used to solve the problem, while Section 6 presents numerical experiments comparing our method with existing techniques.

Notation

Here, S^n is the set of symmetric matrices of size n. We denote by 1 a vector of ones, while Card(x) is the cardinality (number of non-zero elements) of a vector x. For X ∈ S^n, ‖X‖_F is the Frobenius norm of X, i.e., ‖X‖_F = √(Tr(X²)), λ_max(X) is the maximum eigenvalue of X, and |X| is the matrix whose elements are the absolute values of the elements of X.

2 Sparse eigenvectors

In this section, we derive a semidefinite programming (SDP) relaxation for the problem of approximating a symmetric matrix by a rank-one matrix with an upper bound on the cardinality of its eigenvector. We first reformulate this as a variational problem; we then obtain a lower bound on its optimal value via an SDP relaxation (we refer the reader to [10] for an overview of semidefinite programming).

Let A ∈ S^n be a given n × n positive semidefinite, symmetric matrix and let k be an integer with 1 ≤ k ≤ n. We consider the problem:

    φ_k(A) := min  ‖A − x x^T‖_F                                   (1)
              subject to  Card(x) ≤ k,

in the variable x ∈ R^n. We can solve instead the following equivalent problem:

    φ_k(A)² = min  ‖A − λ x x^T‖_F²
              subject to  ‖x‖_2 = 1,  λ ≥ 0,
                          Card(x) ≤ k,

in the variables x ∈ R^n and λ ∈ R. 
Minimizing over λ (the minimum is attained at λ = x^T A x, which is non-negative since A is positive semidefinite), we obtain:

    φ_k(A)² = ‖A‖_F² − ν_k(A)²,

where

    ν_k(A) := max  x^T A x
              subject to  ‖x‖_2 = 1,                               (2)
                          Card(x) ≤ k.

To compute a semidefinite relaxation of this program (see [10], for example), we rewrite (2) as:

    ν_k(A) := max  Tr(AX)
              subject to  Tr(X) = 1,                               (3)
                          Card(X) ≤ k²,
                          X ⪰ 0,  Rank(X) = 1,

in the symmetric matrix variable X ∈ S^n. Indeed, if X is a solution to the above problem, then X ⪰ 0 and Rank(X) = 1 mean that we have X = x x^T, and Tr(X) = 1 implies that ‖x‖_2 = 1. Finally, if X = x x^T then Card(X) ≤ k² is equivalent to Card(x) ≤ k. Naturally, problem (3) is still non-convex and very difficult to solve, due to the rank and cardinality constraints. Since for every u ∈ R^p, Card(u) = q implies ‖u‖_1 ≤ √q ‖u‖_2, we can replace the non-convex constraint Card(X) ≤ k² by a weaker but convex one: 1^T |X| 1 ≤ k, where we have exploited the property that ‖X‖_F = x^T x = 1 when X = x x^T and Tr(X) = 1. If we also drop the rank constraint, we can form a relaxation of (3) and (2) as:

    ν̄_k(A) := max  Tr(AX)
              subject to  Tr(X) = 1,                               (4)
                          1^T |X| 1 ≤ k,
                          X ⪰ 0,

which is a semidefinite program (SDP) in the variable X ∈ S^n, where k is an integer parameter controlling the sparsity of the solution. The optimal value ν̄_k(A) of this program is an upper bound on the optimal value ν_k(A) of the variational program in (2), hence it gives a lower bound on the optimal value φ_k(A) of the original problem (1). Finally, the optimal solution X will not always be of rank one, but we can truncate it and keep only its dominant eigenvector x as an approximate solution to the original problem (1). 
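To make the relaxation step concrete, the following small pure-Python check (ours, not the authors' code; variable names are hypothetical) verifies that for X = x x^T built from a unit-norm vector x with Card(x) ≤ k, the convex constraint 1^T|X|1 ≤ k of the relaxation indeed holds while Tr(X) = 1:

```python
# Sanity check (not from the paper): for X = x x^T with ||x||_2 = 1
# and Card(x) <= k, the relaxed constraint 1^T|X|1 <= k holds.

def constraint_values(x):
    """Return (Tr(X), 1^T|X|1) for the rank-one matrix X = x x^T."""
    tr = sum(xi * xi for xi in x)       # Tr(X) = ||x||_2^2
    l1 = sum(abs(xi) for xi in x) ** 2  # 1^T|X|1 = ||x||_1^2
    return tr, l1

# A unit-norm vector with cardinality k = 4:
x = [0.5, 0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0]
k = sum(1 for xi in x if xi != 0.0)
tr, l1 = constraint_values(x)
print(tr, l1, k)  # -> 1.0 4.0 4
```

This is exactly the Cauchy-Schwarz bound used above: ‖x‖_1² ≤ Card(x) ‖x‖_2², so the rank-one iterate of the original problem remains feasible for the relaxation.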
In Section 6 we show that in practice the solution X to (4) tends to have a rank very close to one, and that its dominant eigenvector is indeed sparse.

3 A robustness interpretation

In this section, we show that problem (4) can be interpreted as a robust formulation of the maximum eigenvalue problem, with additive, component-wise uncertainty in the matrix A. We again assume A to be symmetric and positive semidefinite. In the previous section, we considered in (2) a cardinality-constrained variational formulation of the maximum eigenvalue problem. Here we look at a small variation where we penalize the cardinality and solve:

    max  x^T A x − ρ Card(x)²
    subject to  ‖x‖_2 = 1,

in the variable x ∈ R^n, where the parameter ρ > 0 controls the size of the penalty. Let us remark that we can easily move from the constrained formulation in (4) to this penalized form by duality. This problem is again non-convex and very difficult to solve. As in the last section, we can form the equivalent program:

    max  Tr(AX) − ρ Card(X)
    subject to  Tr(X) = 1,
                X ⪰ 0,  Rank(X) = 1,

in the variable X ∈ S^n (note that Card(X) = Card(x)² when X = x x^T). Again, we get a relaxation of this program by forming:

    max  Tr(AX) − ρ 1^T |X| 1
    subject to  Tr(X) = 1,                                         (5)
                X ⪰ 0,

which is a semidefinite program in the variable X ∈ S^n, where ρ > 0 controls the penalty size. We can rewrite this last problem as:

    max_{X ⪰ 0, Tr(X) = 1}  min_{|U_ij| ≤ ρ}  Tr(X(A + U)),         (6)

and we get a dual to (5) as:

    min  λ_max(A + U)
    subject to  |U_ij| ≤ ρ,  i, j = 1, ..., n,                      (7)

which is a maximum eigenvalue problem in the variable U ∈ R^{n×n}. This gives a natural robustness interpretation to the relaxation in (5): it corresponds to a worst-case maximum eigenvalue computation, with component-wise bounded noise of intensity ρ on the matrix coefficients.

4 Sparse decomposition

Here, we use the results obtained in the previous two sections to describe a sparse equivalent to the PCA decomposition technique. 
Suppose that we start with a matrix A_1 ∈ S^n; our objective is to decompose it in factors with target sparsity k. We solve the relaxed problem in (4):

    max  Tr(A_1 X)
    subject to  Tr(X) = 1,
                1^T |X| 1 ≤ k,
                X ⪰ 0,

to get a solution X_1, and truncate it to keep only the dominant (sparse) eigenvector x_1. Finally, we deflate A_1 to obtain

    A_2 = A_1 − (x_1^T A_1 x_1) x_1 x_1^T,

and iterate to obtain further components.

The question is now: when do we stop the decomposition? In the PCA case, the decomposition stops naturally after Rank(A) factors have been found, since A_{Rank(A)+1} is then equal to zero. In the case of the sparse decomposition, we have no guarantee that this will happen. However, the robustness interpretation gives us a natural stopping criterion: if all the coefficients in |A_i| are smaller than the noise level ρ (computed in the last section), then we must stop, since the matrix is essentially indistinguishable from zero. So, even though we have no guarantee that the algorithm will terminate with a zero matrix, the decomposition will in practice terminate as soon as the coefficients in A_i become indistinguishable from the noise.

5 Algorithms

For problems of moderate size, our SDP can be solved efficiently using solvers such as SeDuMi [7]. For larger-scale problems, we need to resort to other types of algorithms for convex optimization. Of special interest are the recently developed algorithms due to [8, 9]. These are first-order methods specialized to problems having a specific saddle-point structure. It turns out that our problem, when expressed in the saddle-point form (6), falls precisely into this class of algorithms. 
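The decomposition loop of Section 4 can be sketched as follows. This is a minimal pure-Python illustration of ours, not the authors' implementation: a plain power iteration stands in for the SDP solve of (4), so the extracted eigenvector is the dense dominant one rather than the sparsified one the relaxation would yield, and all function names are hypothetical.

```python
import math, random

def dominant_eigvec(A, iters=500):
    """Power iteration for the dominant eigenvector of a symmetric matrix A."""
    n = len(A)
    rng = random.Random(0)
    x = [rng.random() + 0.1 for _ in range(n)]  # strictly positive start
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / nrm for v in y]
    return x

def deflate(A, x):
    """Deflation step of Section 4: A_{i+1} = A_i - (x^T A_i x) x x^T."""
    n = len(A)
    xAx = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return [[A[i][j] - xAx * x[i] * x[j] for j in range(n)] for i in range(n)]

def decompose(A, rho, max_factors=10):
    """Extract factors until every |A_ij| falls below the noise level rho."""
    factors = []
    for _ in range(max_factors):
        if max(abs(v) for row in A for v in row) < rho:
            break  # residual indistinguishable from noise: stop
        x = dominant_eigvec(A)
        factors.append(x)
        A = deflate(A, x)
    return factors
```

On A = diag(2, 1) with rho = 1e-6, for instance, the loop extracts two factors and leaves an essentially zero residual, mirroring the stopping criterion described above.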
Judging from the results presented in [9], in the closely related context of computing the Lovász capacity of a graph, the theoretical complexity, as well as the practical performance, of the method as applied to (6) should exhibit very significant improvements over the general-purpose interior-point algorithms for SDP. Of course, nothing comes without a price: for fixed problem size, the first-order methods mentioned above converge in O(1/ε), where ε is the required accuracy on the optimal value, while interior-point methods converge in O(log(1/ε)). We are currently evaluating the impact of this tradeoff both theoretically and in practice.

6 Numerical results

In this section, we illustrate the effectiveness of the proposed approach on both an artificial and a real-life data set. We compare with the other approaches mentioned in the introduction: PCA, PCA with simple thresholding, SCoTLASS and SPCA. The results show that our approach can achieve more sparsity in the principal components than SPCA does, while explaining as much variance. We begin with a simple example illustrating the link between k and the cardinality of the solution.

6.1 Controlling sparsity with k

Here, we illustrate on a simple example how the sparsity of the solution to our relaxation evolves as k varies from 1 to n. We generate a 10 × 10 matrix U with uniformly distributed coefficients in [0, 1]. We let v be a sparse vector with:

    v = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0).

We then form a test matrix A = U^T U + σ v v^T, where σ is a signal-to-noise ratio equal to 15 in our case. We sample 50 different matrices A using this technique. For each k between 1 and 10 and each A, we solve the SDP in (4). We then extract the first eigenvector of the solution X and record its cardinality. In Figure 1, we show the mean cardinality (and standard deviation) as a function of k. 
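The test-matrix construction above can be sketched as follows (our illustration only; the SDP solve of (4) and the cardinality count over 50 samples are omitted):

```python
import random

random.seed(0)
n, sigma = 10, 15.0          # size and signal-to-noise ratio from the text
v = [1.0, 0.0] * 5           # the sparse signal vector, cardinality 5

# U with i.i.d. uniform [0, 1] entries, then A = U^T U + sigma * v v^T:
# noise term plus a sparse rank-one signal.
U = [[random.random() for _ in range(n)] for _ in range(n)]
A = [[sum(U[t][i] * U[t][j] for t in range(n)) + sigma * v[i] * v[j]
      for j in range(n)] for i in range(n)]
```

By construction A is symmetric and positive semidefinite, with the signal concentrated on the support of v, which is what makes the recovered cardinality track k.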
We observe that k + 1 is actually a good predictor of the cardinality, especially when k + 1 is close to the actual cardinality (5 in this case).

Figure 1: Cardinality versus k.

6.2 Artificial data

We consider the simulation example proposed by [5]. In this example, three hidden factors are created:

    V_1 ∼ N(0, 290),  V_2 ∼ N(0, 300),
    V_3 = −0.3 V_1 + 0.925 V_2 + ε,  ε ∼ N(0, 300),                (8)

with V_1, V_2 and ε independent. Afterwards, 10 observed variables are generated as follows:

    X_i = V_j + ε_i^j,  ε_i^j ∼ N(0, 1),

with j = 1 for i = 1, 2, 3, 4, j = 2 for i = 5, 6, 7, 8 and j = 3 for i = 9, 10, and the ε_i^j independent for j = 1, 2, 3, i = 1, ..., 10. Instead of sampling data from this model and computing an empirical covariance matrix of (X_1, ..., X_10), we use the exact covariance matrix to compute principal components using the different approaches.

Since the three underlying factors have about the same variance, and the first two are associated with 4 variables each while the last one is only associated with 2 variables, V_1 and V_2 are almost equally important, and they are both significantly more important than V_3. This, together with the fact that the first 2 principal components explain more than 99% of the total variance, suggests that considering two sparse linear combinations of the original variables should be sufficient to explain most of the variance in data sampled from this model. This is also discussed by [5]. The ideal solution would thus be to use only the variables (X_1, X_2, X_3, X_4) for the first sparse principal component, to recover the factor V_1, and only (X_5, X_6, X_7, X_8) for the second sparse principal component, to recover V_2.

Using the true covariance matrix and the oracle knowledge that the ideal sparsity is 4, [5] performed SPCA (with λ = 0). We carry out our algorithm with k = 4. 
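For reference, the exact covariance matrix of model (8) can be assembled directly, since all factor variances and covariances follow from the model. This is a sketch of ours (it reproduces the matrix fed to all methods, not the methods themselves; helper names are hypothetical):

```python
# Exact covariance of model (8): V1 ~ N(0, 290), V2 ~ N(0, 300),
# V3 = -0.3 V1 + 0.925 V2 + eps with eps ~ N(0, 300); X_i = V_j + unit noise.
var3 = 0.3 ** 2 * 290.0 + 0.925 ** 2 * 300.0 + 300.0   # Var(V3) = 582.7875
cov_v = {(1, 1): 290.0, (2, 2): 300.0, (3, 3): var3,
         (1, 2): 0.0, (1, 3): -0.3 * 290.0, (2, 3): 0.925 * 300.0}

def factor(i):
    """Hidden-factor index j driving observed variable X_{i+1} (0-based i)."""
    return 1 if i < 4 else (2 if i < 8 else 3)

def cov_entry(i, l):
    j, m = sorted((factor(i), factor(l)))
    c = cov_v[(j, m)]
    return c + 1.0 if i == l else c    # unit observation-noise variance

C = [[cov_entry(i, l) for l in range(10)] for i in range(10)]
```

For example, C[0][0] = 291 (Var(V_1) plus unit noise), Cov(X_1, X_9) = Cov(V_1, V_3) = -87, and Cov(X_5, X_9) = Cov(V_2, V_3) = 277.5, which is the high V_2/V_3 correlation that misleads simple thresholding in the results below.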
The results are reported in Table 1, together with results for PCA, simple thresholding and SCoTLASS (t = 2). Notice that SPCA, DSPCA and SCoTLASS all find the correct sparse principal components, while simple thresholding yields inferior performance. The latter wrongly includes the variables X_9 and X_10 in explaining most variance (probably because it is misled by the high correlation between V_2 and V_3); moreover, it assigns higher loadings to X_9 and X_10 than to one of the variables (X_5, X_6, X_7, X_8) that are clearly more important. Simple thresholding correctly identifies the second sparse principal component, probably because V_1 has a lower correlation with V_3. Simple thresholding also explains a bit less variance than the other methods.

Table 1: Loadings and explained variance for the first two principal components, for the artificial example. 'ST' is the simple thresholding method; 'other' covers all the other methods: SPCA, DSPCA and SCoTLASS.

                X1    X2    X3    X4    X5    X6    X7    X8    X9    X10  explained variance
  PCA, PC1     .116  .116  .116  .116 -.395 -.395 -.395 -.395 -.401 -.401       60.0%
  PCA, PC2    -.478 -.478 -.478 -.478 -.145 -.145 -.145 -.145  .010  .010       39.6%
  ST, PC1        0     0     0     0     0     0  -.497 -.497 -.503 -.503       38.8%
  ST, PC2      -.5   -.5   -.5   -.5     0     0     0     0     0     0        38.6%
  other, PC1     0     0     0     0    .5    .5    .5    .5     0     0        40.9%
  other, PC2    .5    .5    .5    .5     0     0     0     0     0     0        39.5%

6.3 Pit props data

The pit props data (consisting of 180 observations and 13 measured variables) was introduced by [11] and has become a standard example of the potential difficulty in interpreting principal components. [4] applied SCoTLASS to this problem and [5] used their SPCA approach, both with the goal of obtaining sparse principal components that can be better interpreted than those of PCA. SPCA performs better than SCoTLASS: it identifies principal components with respectively 7, 4, 4, 1, 1, and 1 non-zero loadings, as shown in Table 2. 
As shown in [5], this is much sparser than the modified principal components obtained by SCoTLASS, while explaining nearly the same variance (75.8% versus 78.2% for the first 6 principal components). Also, simple thresholding of PCA, with a number of non-zero loadings that matches the result of SPCA, does worse than SPCA in terms of explained variance.

Following this previous work, we also consider the first 6 principal components. We try to identify principal components that are sparser than the best result of this previous work, i.e., SPCA, but explain the same variance. Therefore, we choose values for k of 5, 2, 2, 1, 1, 1 (two less than those of the SPCA results reported above, but no less than 1). Figure 2 shows the cumulative number of non-zero loadings and the cumulative explained variance (measuring the variance in the subspace spanned by the first i eigenvectors). The results for DSPCA are plotted with a red line and those for SPCA with a blue line. The cumulative explained variance for normal PCA is depicted with a black line. It can be seen that our approach is able to explain nearly the same variance as the SPCA method, while clearly reducing the number of non-zero loadings for the first 6 principal components. Adjusting the first k from 5 to 6 (relaxing the sparsity), we obtain the results plotted with a red dash-dot line: still better in sparsity, but with a cumulative explained variance that is fully competitive with SPCA. Moreover, as in the SPCA approach, the important variables associated with the 6 principal components do not overlap, which leads to a clearer interpretation. 
Table 2 shows the first three corresponding principal components for the different approaches (DSPCAw5 for k_1 = 5 and DSPCAw6 for k_1 = 6).

Table 2: Loadings for the first three principal components, for the real-life example.

                topdiam length moist testsg ovensg ringtop ringbud bowmax bowdist whorls clear knots diaknot
  SPCA, PC1      -.477  -.476    0     0    .177     0    -.250   -.344  -.416  -.400    0     0      0
  SPCA, PC2         0      0   .785  .620     0      0       0    -.021     0      0     0   .013     0
  SPCA, PC3         0      0     0     0    .640   .589    .492      0      0      0     0     0   -.015
  DSPCAw5, PC1   -.560  -.583    0     0      0      0    -.263   -.099  -.371  -.362    0     0      0
  DSPCAw5, PC2      0      0   .707  .707     0      0       0       0      0      0     0     0      0
  DSPCAw5, PC3      0      0     0     0      0   -.793   -.610      0      0      0     0     0    .012
  DSPCAw6, PC1   -.491  -.507    0     0      0    -.067  -.357   -.234  -.387  -.409    0     0      0
  DSPCAw6, PC2      0      0   .707  .707     0      0       0       0      0      0     0     0      0
  DSPCAw6, PC3      0      0     0     0      0   -.873   -.484      0      0      0     0     0    .057

Figure 2: Cumulative cardinality and cumulative explained variance for SPCA and DSPCA as a function of the number of principal components: black line for normal PCA, blue for SPCA and red for DSPCA (full for k_1 = 5 and dash-dot for k_1 = 6).

7 Conclusion

The semidefinite relaxation of the sparse principal component analysis problem proposed here appears to significantly improve the solution's sparsity, while explaining the same variance as previously proposed methods in the examples detailed above. The algorithms we used here handle moderate-size problems efficiently. We are currently working on large-scale extensions using first-order techniques.

Acknowledgements

Thanks to Andrew Mullhaupt and Francis Bach for useful suggestions. 
We would like to acknowledge support from ONR MURI N00014-00-1-0637, Eurocontrol-C20052E/BM/03, and NASA-NCC2-1428.

References

 [1] I. T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22:29-35, 1995.
 [2] S. Vines. Simple principal components. Applied Statistics, 49:441-451, 2000.
 [3] J. Cadima and I. T. Jolliffe. Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics, 22:203-214, 1995.
 [4] I. T. Jolliffe and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12:531-547, 2003.
 [5] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Technical report, Statistics Department, Stanford University, 2004.
 [6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
 [7] J. F. Sturm. Using SeDuMi 1.0x, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625-653, 1999.
 [8] Y. Nesterov. Smooth minimization of non-smooth functions. CORE working paper, 2003.
 [9] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle-point problems. MINERVA working paper, 2004.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[11] J. Jeffers. Two case studies in the application of principal components. 
Applied Statistics, 16:225-236, 1967.
", "award": [], "sourceid": 2628, "authors": [{"given_name": "Alexandre", "family_name": "D'aspremont", "institution": null}, {"given_name": "Laurent", "family_name": "Ghaoui", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Gert", "family_name": "Lanckriet", "institution": null}]}