{"title": "Unsupervised Feature Selection for the $k$-means Clustering Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 161, "abstract": "We present a novel feature selection algorithm for the $k$-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter $\\epsilon \\in (0,1)$, selects and appropriately rescales in an unsupervised manner $\\Theta(k \\log(k / \\epsilon) / \\epsilon^2)$ features from a dataset of arbitrary dimensions. We prove that, if we run any $\\gamma$-approximate $k$-means algorithm ($\\gamma \\geq 1$) on the features selected using our method, we can find a $(1+(1+\\epsilon)\\gamma)$-approximate partition with high probability.", "full_text": "Unsupervised Feature Selection for the\n\nk-means Clustering Problem\n\nChristos Boutsidis\n\nDepartment of Computer Science\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\n\nMichael W. Mahoney\n\nDepartment of Mathematics\n\nStanford University\nStanford, CA 94305\n\nboutsc@cs.rpi.edu\n\nmmahoney@cs.stanford.edu\n\nPetros Drineas\n\nDepartment of Computer Science\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\n\ndrinep@cs.rpi.edu\n\nAbstract\n\nWe present a novel feature selection algorithm for the k-means clustering problem.\nOur algorithm is randomized and, assuming an accuracy parameter \u03f5 \u2208 (0, 1),\nselects and appropriately rescales in an unsupervised manner \u0398(k log(k/\u03f5)/\u03f52)\nfeatures from a dataset of arbitrary dimensions. We prove that, if we run any\n\u03b3-approximate k-means algorithm (\u03b3 \u2265 1) on the features selected using our\nmethod, we can \ufb01nd a (1 + (1 + \u03f5)\u03b3)-approximate partition with high probability.\n\n1 Introduction\n\nClustering is ubiquitous in science and engineering, with numerous and diverse application domains,\nranging from bioinformatics and medicine to the social sciences and the web [15]. 
Perhaps the most well-known clustering algorithm is the so-called “k-means” algorithm or Lloyd’s method [22], an iterative expectation-maximization type approach, which attempts to address the following objective: given a set of points in a Euclidean space and a positive integer k (the number of clusters), split the points into k clusters so that the total sum of the (squared Euclidean) distances of each point to its nearest cluster center is minimized. This optimization objective is often called the k-means clustering objective. (See Definition 1 for a formal discussion of the k-means objective.) The simplicity of the objective, as well as the good behavior of the associated algorithm (Lloyd’s method [22, 28]), have made k-means enormously popular in applications [32].
In recent years, the high dimensionality of modern massive datasets has provided a considerable challenge to k-means clustering approaches. First, the curse of dimensionality can make algorithms for k-means clustering very slow, and, second, the existence of many irrelevant features may not allow the identification of the relevant underlying structure in the data [14]. Practitioners addressed such obstacles by introducing feature selection and feature extraction techniques. It is worth noting that feature selection selects a small subset of actual features from the data and then runs the clustering algorithm only on the selected features, whereas feature extraction constructs a small set of artificial features and then runs the clustering algorithm on the constructed features. 
Despite the significance of the problem, as well as the wealth of heuristic methods addressing it (see Section 3), there exist no provably accurate feature selection methods and extremely few provably accurate feature extraction methods for the k-means clustering objective (see Section 3.1 for the latter case).

Our work here addresses this shortcoming by presenting the first provably accurate feature selection algorithm for k-means clustering. Our algorithm constructs a probability distribution for the feature space, and then selects a small number of features (roughly k log(k), where k is the number of clusters) with respect to the computed probabilities. (See Section 2 for a detailed description of our algorithm.) Then, we argue that running k-means clustering algorithms on the selected features returns a constant-factor approximation to the optimal partition. (See Theorem 1 in Section 2.)
We now formally define the k-means clustering problem using the so-called cluster indicator matrix. Also, recall that the Frobenius norm of a matrix (denoted by ‖·‖_F) is equal to the square root of the sum of the squares of its elements. 
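Since the Frobenius norm appears in every bound below, here is a one-line NumPy check of this definition (our own illustration; the matrix is arbitrary):

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
# The Frobenius norm is the square root of the sum of the squared entries.
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt((A ** 2).sum()))
```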
(See also Section 4.1 for useful notation.)

Definition 1 [THE K-MEANS CLUSTERING PROBLEM]
Given a matrix A ∈ R^{n×d} (representing n points – rows – described with respect to d features – columns) and a positive integer k denoting the number of clusters, find the n × k indicator matrix X_opt such that

    X_opt = arg min_{X ∈ X} ‖A − XX^T A‖_F².    (1)

The optimal value of the k-means clustering objective is

    F_opt = min_{X ∈ X} ‖A − XX^T A‖_F² = ‖A − X_opt X_opt^T A‖_F².    (2)

In the above, X denotes the set of all n × k indicator matrices X.
We briefly expand on the notion of an n × k indicator matrix X. Such matrices have exactly one non-zero element per row, which denotes cluster membership. Equivalently, for all i = 1, . . . , n and j = 1, . . . , k, the i-th row (point) of A belongs to the j-th cluster if and only if X_ij is non-zero; in particular X_ij = 1/√s_j, where s_j is the number of points in the corresponding cluster (i.e., the number of non-zero elements in the j-th column of X). Note that the columns of X are normalized and pairwise orthogonal, so that their Euclidean norm is equal to one and X^T X = I_k, where I_k is the k × k identity matrix. 
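The matrix form of the objective can be verified numerically. The following NumPy sketch (our own illustration, with an arbitrary assignment of six points to two clusters) checks that X^T X = I_k and that ‖A − XX^T A‖_F² equals the usual sum of squared distances of the points to their cluster centroids:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
A = rng.normal(size=(n, d))
labels = np.array([0, 0, 0, 1, 1, 1])  # an arbitrary assignment of the n points

# Build the n x k indicator matrix X: X[i, j] = 1/sqrt(s_j) iff point i is in cluster j.
sizes = np.bincount(labels, minlength=k)
X = np.zeros((n, k))
X[np.arange(n), labels] = 1.0 / np.sqrt(sizes[labels])

# Columns of X are orthonormal: X^T X = I_k.
assert np.allclose(X.T @ X, np.eye(k))

# ||A - X X^T A||_F^2 equals the usual k-means cost (squared distances to centroids),
# because row i of X X^T A is exactly the centroid of the cluster point i belongs to.
proj_cost = np.linalg.norm(A - X @ (X.T @ A), "fro") ** 2
centroids = np.array([A[labels == j].mean(axis=0) for j in range(k)])
kmeans_cost = ((A - centroids[labels]) ** 2).sum()
assert np.isclose(proj_cost, kmeans_cost)
```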
An example of such an indicator matrix X representing three points (rows in X) belonging to two different clusters (columns in X) is given below; note that the points corresponding to the first two rows of X belong to the first cluster (s_1 = 2) and the other point to the second cluster (s_2 = 1):

    X = [ 1/√2   0
          1/√2   0
          0      1 ].

The above definition of the k-means objective is exactly equivalent with the standard definition of k-means clustering [28]. To see this, notice that ‖A − XX^T A‖_F² = Σ_{i=1}^n ‖A_(i) − X_(i)X^T A‖_2², while for i = 1, . . . , n, X_(i)X^T A denotes the centroid of the cluster the i-th point belongs to. In the above, A_(i) and X_(i) denote the i-th rows of A and X, respectively.

2 The feature selection algorithm and the quality-of-clustering results

Algorithm 1 takes as inputs the matrix A ∈ R^{n×d}, the number of clusters k, and an accuracy parameter ϵ ∈ (0, 1). It first computes the top-k right singular vectors of A (columns of V_k ∈ R^{d×k}). Using these vectors, it computes the so-called (normalized) leverage scores [4, 24]; for i = 1, . . . , d the i-th leverage score equals one k-th of the square of the Euclidean norm of the i-th row of V_k (denoted by (V_k)_(i)). The i-th leverage score characterizes the importance of the i-th feature with respect to the k-means objective. Notice that these scores (see the definition of the p_i's in step 2 of Algorithm 1) form a probability distribution over the columns of A, since Σ_{i=1}^d p_i = 1. Then, the algorithm chooses a sampling parameter r that is equal to the number of (rescaled) features that we want to select. In order to prove our theoretical bounds, r should be fixed to r = Θ(k log(k/ϵ)/ϵ²) at this step (see Section 4.4). 
In practice though, a small value of r, for example r = 10k, seems sufficient (see Section 5). Having r fixed, Algorithm 1 performs r i.i.d. random trials, where in each trial one column of A is selected by the following random process: we throw a biased die with d faces, with each face corresponding to a column of A, where for i = 1, . . . , d the i-th face occurs with probability p_i. We select the column of A that corresponds to the face we threw in the current trial. Finally, note that the running time of Algorithm 1 is dominated by the time required to compute the top-k right singular vectors of the matrix A, which is at most O(min{nd², n²d}).

Input: n × d matrix A (n points, d features), number of clusters k, parameter ϵ ∈ (0, 1).
  1. Compute the top-k right singular vectors of A, denoted by V_k ∈ R^{d×k}.
  2. Compute the (normalized) leverage scores p_i, for i = 1, . . . , d: p_i = ‖(V_k)_(i)‖_2² / k.
  3. Fix a sampling parameter r = Θ(k log(k/ϵ)/ϵ²).
  4. For t = 1, . . . , r i.i.d. random trials:
     • select one feature, choosing the i-th feature with probability p_i, and multiply it by the factor (r p_i)^{−1/2}.
  5. Return the n × r matrix Ã containing the selected (rescaled) features.
Output: n × r matrix Ã, with r = Θ(k log(k/ϵ)/ϵ²).

Algorithm 1: A randomized feature selection algorithm for the k-means clustering problem.

In order to theoretically evaluate the accuracy of our feature selection algorithm, and provide some a priori guarantees regarding the quality of the clustering after feature selection is performed, we chose to report results on the optimal value of the k-means clustering objective (the F_opt of Definition 1). 
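Algorithm 1 translates almost line-for-line into NumPy. The following is a sketch (function and variable names are ours, and the choices of dimensions and r = 30 are arbitrary; the theory asks for r = Θ(k log(k/ϵ)/ϵ²), while r = 10k is the practical choice mentioned above):

```python
import numpy as np

def leverage_score_feature_selection(A, k, r, seed=None):
    """Sketch of Algorithm 1: sample r (rescaled) columns of A by leverage scores."""
    rng = np.random.default_rng(seed)
    # Step 1: top-k right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                      # d x k, orthonormal columns
    # Step 2: normalized leverage scores, a probability distribution over columns.
    p = (Vk ** 2).sum(axis=1) / k      # p_i = ||(V_k)_(i)||_2^2 / k, sums to 1
    # Steps 3-4: r i.i.d. trials; in each, pick column i w.p. p_i, rescale by (r p_i)^{-1/2}.
    idx = rng.choice(A.shape[1], size=r, p=p)
    # Step 5: return the n x r matrix of selected, rescaled features.
    return A[:, idx] / np.sqrt(r * p[idx]), idx

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 200))
A_tilde, idx = leverage_score_feature_selection(A, k=3, r=30, seed=2)
assert A_tilde.shape == (50, 30)
```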
This metric of accuracy has been extensively used in the Theoretical Computer Science community in order to analyze approximation algorithms for the k-means clustering problem. In particular, existing constant-factor or relative-error approximation algorithms for k-means (see, for example, [21, 1] and references therein) invariably approximate F_opt.
Obviously, Algorithm 1 does not return a partition of the rows of A. In a practical setting, it would be employed as a preprocessing step. Then, an approximation algorithm for the k-means clustering problem would be applied on Ã in order to determine the partition of the rows of A. In order to formalize our discussion, we borrow a definition from the approximation algorithms literature.

Definition 2 [K-MEANS APPROXIMATION ALGORITHM]
An algorithm is a “γ-approximation” for the k-means clustering problem (γ ≥ 1) if it takes inputs A and k, and returns an indicator matrix X_γ that satisfies with probability at least 1 − δ_γ,

    ‖A − X_γ X_γ^T A‖_F² ≤ γ min_{X ∈ X} ‖A − XX^T A‖_F².    (3)

In the above, δ_γ ∈ [0, 1) is the failure probability of the algorithm.
Clearly, when γ = 1, X_γ is the optimal partition; computing it is a well-known NP-hard problem. If we allow γ > 1, then many approximation algorithms exist in the literature. For example, the work of [21] achieves γ = 1 + ϵ, for some ϵ ∈ (0, 1], in time linear in the size of the input. Similarly, the k-means++ method of [1] achieves γ = O(log(k)) using the popular Lloyd’s algorithm and a sophisticated randomized seeding. 
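To make Definition 2 concrete, here is a minimal stand-in for an approximation subroutine (plain Lloyd iterations with random initialization; all names, parameters, and the planted data are our own, and no approximation factor γ is claimed for this heuristic sketch), together with the objective of eqn. (3) that it is measured against:

```python
import numpy as np

def lloyd(B, k, iters=20, seed=0):
    """Plain Lloyd iterations: a heuristic stand-in for a gamma-approximation algorithm."""
    rng = np.random.default_rng(seed)
    centers = B[rng.choice(B.shape[0], size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        d2 = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = B[labels == j].mean(axis=0)
    return labels

def kmeans_cost(A, labels, k):
    """||A - X X^T A||_F^2 for the partition encoded by labels."""
    return sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if (labels == j).any())

rng = np.random.default_rng(1)
# Two well-separated planted clusters; cluster on a feature subset, evaluate the cost on all of A.
A = np.vstack([rng.normal(0, 1, (20, 40)), rng.normal(10, 1, (20, 40))])
labels = lloyd(A[:, :8], k=2)          # stand-in for clustering a reduced matrix
cost_full = kmeans_cost(A, labels, 2)  # left-hand side of eqn. (3) for this partition
assert np.isfinite(cost_full) and cost_full >= 0.0
```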
Theorem 1 (see Section 4 for its proof) is our main quality-of-approximation result for our feature selection algorithm.

Theorem 1 Let the n × d matrix A and the positive integer k be the inputs of the k-means clustering problem. Let ϵ ∈ (0, 1), and run Algorithm 1 with inputs A, k, and ϵ in order to construct the n × r matrix Ã containing the selected features, where r = Θ(k log(k/ϵ)/ϵ²). If we run any γ-approximation algorithm (γ ≥ 1) for the k-means clustering problem, whose failure probability is δ_γ, on inputs Ã and k, the resulting cluster indicator matrix X_γ̃ satisfies with probability at least 0.5 − δ_γ,

    ‖A − X_γ̃ X_γ̃^T A‖_F² ≤ (1 + (1 + ϵ)γ) min_{X ∈ X} ‖A − XX^T A‖_F².    (4)

The failure probability of the above theorem can be easily reduced using standard boosting methods.

3 Related work

Feature selection has received considerable attention in the machine learning and data mining communities. A large number of different techniques appeared in prior work, addressing feature selection within the context of both clustering and classification. Surveys include [13], as well as [14], which reports the results of the NIPS 2003 challenge in feature selection. Popular feature selection techniques include the Laplacian scores [16], the Fisher scores [9], and the constraint scores [33]. In this section, we opt to discuss only a family of feature selection methods that are closely related to the leverage scores of our algorithm. 
To the best of our knowledge, all previous feature selection methods come with no theoretical guarantees of the form that we describe here.
Given as input an n × d object-feature matrix A and a positive integer k, feature selection for Principal Components Analysis (PCA) corresponds to the task of identifying a subset of k columns from A that capture essentially the same information as do the top k principal components of A. Jolliffe [18] surveys various methods for the above task. Four of them (called B1, B2, B3, and B4 in [18]) employ the Singular Value Decomposition of A in order to identify columns that are somehow correlated with its top k left singular vectors. In particular, B3 employs exactly the leverage scores in order to greedily select the k columns corresponding to the highest scores; no theoretical results are reported. An experimental evaluation of the methods of [18] on real datasets appeared in [19]. Another approach employing the matrix of the top k right singular vectors of A and a Procrustes-type criterion appeared in [20]. From an applications perspective, [30] employed the methods of [18] and [20] for gene selection in microarray data analysis. From a complementary viewpoint, feature selection for clustering seeks to identify those features that have the most discriminative power among the set of all features. Continuing the aforementioned line of research, many recent papers present methods that somehow employ the SVD of the input matrix in order to select discriminative features; see, for example, [23, 5, 25, 26]. Finally, note that employing the leverage scores in a randomized manner similar to Algorithm 1 has already been proven to be accurate for least-squares regression [8] and PCA [7, 2].

3.1 Connections with the SVD

A well-known property connects the SVD of a matrix and k-means clustering. 
Recall Definition 1, and notice that X_opt X_opt^T A is a matrix of rank at most k. From the SVD optimality [11], we immediately get that (see Section 4.1 for useful notation)

    ‖A_{ρ−k}‖_F² = ‖A − A_k‖_F² ≤ ‖A − X_opt X_opt^T A‖_F² = F_opt.    (5)

A more interesting connection between the SVD and k-means appeared in [6]. If the n × d matrix A is projected on the subspace spanned by its top k left singular vectors, then the resulting n × k matrix Â = U_k Σ_k corresponds to a mapping of the original d-dimensional space to the optimal k-dimensional space. This process is equivalent to feature extraction: the top k left singular vectors (the columns of U_k) correspond to the constructed features (Σ_k is a simple rescaling operator). Prior to the work of [6], it was empirically known that running k-means clustering algorithms on the low-dimensional matrix Â was a viable alternative to clustering the high-dimensional matrix A. The work of [6] formally argued that if we let the cluster indicator matrix X̂_opt denote the optimal k-means partition on Â, i.e.,

    X̂_opt = arg min_{X ∈ X} ‖Â − XX^T Â‖_F²,    (6)

then using this partition on the rows of the original matrix A is a 2-approximation to the optimal partition, i.e.,

    ‖A − X̂_opt X̂_opt^T A‖_F² ≤ 2 min_{X ∈ X} ‖A − XX^T A‖_F².    (7)

The above result is the starting point of our work here. Indeed, we seek to replace the k artificial features that are extracted via the SVD with a small number (albeit slightly larger than k) of actual features. On the positive side, an obvious advantage of feature selection vs. feature extraction is the immediate interpretability of the former. On the negative side, our approximation accuracy is slightly worse (2 + ϵ; see Theorem 1 with γ = 1) and we need slightly more than k features.

4 The proof of Theorem 1

This section gives the proof of Theorem 1. We start by introducing useful notation; then, we present a preliminary lemma and the proof itself.

4.1 Notation

Given an n × d matrix A, let U_k ∈ R^{n×k} (resp. V_k ∈ R^{d×k}) be the matrix of the top k left (resp. right) singular vectors of A, and let Σ_k ∈ R^{k×k} be a diagonal matrix containing the top k singular values of A. If we let ρ be the rank of A, then A_{ρ−k} is equal to A − A_k, with A_k = U_k Σ_k V_k^T. ‖A‖_F and ‖A‖_2 denote the Frobenius and the spectral norm of a matrix A, respectively. A^+ denotes the pseudo-inverse of A, and ‖A^+‖_2 = σ_max(A^+) = 1/σ_min(A), where σ_max(X) and σ_min(X) denote the largest and the smallest non-zero singular values of a matrix X, respectively. A useful property of matrix norms is that for any two matrices X and Y, ‖XY‖_F ≤ ‖X‖_F ‖Y‖_2 and ‖XY‖_F ≤ ‖X‖_2 ‖Y‖_F; this is a stronger version of the standard submultiplicativity property for matrix norms. We call P a projector matrix if it is square and P² = P. We use E[y] to take the expectation of a random variable y and Pr[e] to take the probability of a random event e. Finally, we
abbreviate “independent identically distributed” to “i.i.d.” and “with probability” to “w.p.”.

4.2 Sampling and rescaling matrices

We introduce a simple matrix formalism in order to conveniently represent the sampling and rescaling processes of Algorithm 1. Let S be a d × r sampling matrix that is constructed as follows: S is initially empty. For all t = 1, . . . , r, in turn, if the i-th feature of A is selected by the random sampling process described in Algorithm 1, then e_i (a column vector of all-zeros, except for its i-th entry, which is set to one) is appended to S. Also, let D be an r × r diagonal rescaling matrix constructed as follows: D is initially an all-zeros matrix. For all t = 1, . . . , r, in turn, if the i-th feature of A is selected, then the next diagonal entry of D is set to 1/√(r p_i). Thus, using the notation of this paragraph, Algorithm 1 outputs the matrix Ã = ASD ∈ R^{n×r}.

4.3 A preliminary lemma and sufficient conditions

Lemma 1 presented below gives upper and lower bounds for the largest and the smallest singular values of the matrix V_k^T SD, respectively. This also implies that V_k^T SD has full rank. Finally, it argues that the matrix ASD can be used to provide a very accurate approximation to the matrix A_k. Lemma 1 provides four sufficient conditions for designing provably accurate feature selection algorithms for k-means clustering. To see this, notice that, in the proof of eqn. (4) given below, the results of Lemma 1 are sufficient to prove our main theorem; the rest of the arguments apply to all sampling and rescaling matrices S and D. Any feature selection algorithm, i.e.
any sampling matrix S and rescaling matrix D that satisfy bounds similar to those of Lemma 1, can be employed to design a provably accurate feature selection algorithm for k-means clustering. The quality of such an approximation will be proportional to the tightness of the bounds on the three terms of Lemma 1 (‖V_k^T SD‖_2, ‖(V_k^T SD)^+‖_2, and ‖E‖_F). When no rescaling is allowed in the selected features, the bottleneck in the approximation accuracy of a feature selection algorithm would be to find a sampling matrix S such that only ‖(V_k^T S)^+‖_2 is bounded from above. To see this, notice that, in Lemma 1, for any S, ‖V_k^T S‖_2 ≤ 1, and (after applying the submultiplicativity property of Section 4.1 in eqn. (13)) ‖E‖_F ≤ ‖(V_k^T S)^+‖_2 ‖A − A_k‖_F. It is worth emphasizing that the same factor ‖(V_k^T S)^+‖_2 appeared to be the bottleneck in the design of provably accurate column-based low-rank approximations (see, for example, Theorem 1.5 in [17] and eqn. (3.19) in [12]). It is evident from the above observations that other column sampling methods (see, for example, [17, 3, 2] and references therein), satisfying similar bounds to those of Lemma 1, immediately suggest themselves for the design of provably accurate feature selection algorithms for k-means clustering. Finally, equations (101) and (102) of Lemma 4.4 in [31] suggest that a sub-sampled randomized Fourier transform can be used for the design of a provably accurate feature extraction algorithm for k-means clustering, since they provide bounds similar to those of Lemma 1 by replacing the matrices S and D of our algorithm with a sub-sampled randomized Fourier transform matrix (see the matrix R of eqn. (6) in [31]).

Lemma 1 Assume that the sampling matrix S and the rescaling matrix D are constructed using Algorithm 1 (see also Section 4.2) with inputs A, k, and ϵ ∈ (0, 1). Let c_o and c_1 be absolute constants that will be specified later. 
If the sampling parameter r of Algorithm 1 satisfies

    r ≥ 2 c_1 c_o² k log(c_1 c_o² k / ϵ²) / ϵ²,

then all four statements below hold together with probability at least 0.5:

  1. ‖V_k^T SD‖_2 = σ_max(V_k^T SD) ≤ √(1 + λ).
  2. ‖(V_k^T SD)^+‖_2 = 1/σ_min(V_k^T SD) ≤ √(1/(1 − λ)).
  3. V_k^T SD is a full rank matrix, i.e., rank(V_k^T SD) = k.
  4. A_k = (ASD)(V_k^T SD)^+ V_k^T + E, with ‖E‖_F ≤ μ ‖A − A_k‖_F.

To simplify notation, we set λ = ϵ √(36/c_1) and μ = √6 λ²/(1 − λ).

Proof: First, we will apply Theorem 3.1 of [29] for an appropriate random vector y. Toward that end, for i = 1, . . . , d, the i-th column of the matrix V_k^T is denoted by (V_k^T)^(i). We define the random vector y ∈ R^k as follows: for i = 1, . . . , d, Pr[y = y_i] = p_i, where y_i = (1/√p_i)(V_k^T)^(i) is a realization of y. This definition of y and the definition of the sampling and rescaling matrices S and D imply that V_k^T SDDS^T V_k is the average of r i.i.d. copies of yy^T. Note also that E[yy^T] = Σ_{i=1}^d p_i (1/√p_i)(V_k^T)^(i) ((1/√p_i)(V_k^T)^(i))^T = V_k^T V_k = I_k. Obviously, ‖E[yy^T]‖_2 = 1. Our choice of p_i = ‖(V_k)_(i)‖_2²/k implies that ‖y‖_2 ≤ √k. Our choice of r allows us to apply Theorem 3.1 of [29], which, combined with Markov’s inequality on the random variable z = ‖V_k^T SDDS^T V_k − I_k‖_2, implies that w.p. at least 5/6,

    ‖V_k^T SDDS^T V_k − I_k‖_2 ≤ 6 c_o √(k log(r)/r),

for a sufficiently large (unspecified in [29]) constant c_o. Standard matrix perturbation theory results [11] imply that, for i = 1, . . . , k,

    |σ_i²(V_k^T SD) − 1| ≤ 6 c_o √(k log(r)/r).

Our choice of r and simple algebra suffice to show that log(r)/r ≤ ϵ²/(c_1 c_o² k), so that 6 c_o √(k log(r)/r) ≤ λ, which implies that the first two statements of the Lemma hold w.p. at least 5/6. To prove the third statement, we only need to show that the k-th singular value of V_k^T SD is positive. Our choice of ϵ ∈ (0, 1) and the second statement of the Lemma imply that σ_k(V_k^T SD) > 0. To prove the fourth statement, set E = A_k − ASD(V_k^T SD)^+ V_k^T and write

    ‖E‖_F ≤ ‖A_k − A_k SD(V_k^T SD)^+ V_k^T‖_F (=: θ_1) + ‖A_{ρ−k} SD(V_k^T SD)^+ V_k^T‖_F (=: θ_2).    (8)–(9)

In the above, in eqn. (8) we replaced A by A_k + A_{ρ−k}, and in eqn. (9) we used the triangle inequality. The first term of eqn. (9) is zero:

    θ_1 = ‖A_k − U_k Σ_k V_k^T SD(V_k^T SD)^+ V_k^T‖_F = ‖A_k − U_k Σ_k I_k V_k^T‖_F = 0,    (10)–(11)

where in eqn. (10) we replaced A_k by U_k Σ_k V_k^T, and in eqn. (11) we set (V_k^T SD)(V_k^T SD)^+ = I_k, since V_k^T SD is a rank-k matrix w.p. at least 5/6. The second term of eqn. (9) is bounded by

    θ_2 = ‖U_{ρ−k} Σ_{ρ−k} V_{ρ−k}^T SD(V_k^T SD)^+ V_k^T‖_F ≤ ‖Σ_{ρ−k} V_{ρ−k}^T SD(V_k^T SD)^+‖_F.    (12)–(13)

In the above, in eqn. (12) we replaced A_{ρ−k} by U_{ρ−k} Σ_{ρ−k} V_{ρ−k}^T, and in eqn. (13) U_{ρ−k} and V_k^T can be dropped without increasing a unitarily invariant norm such as the Frobenius matrix norm. If the first three statements of the lemma hold w.p. at least 5/6, then w.p. at least 1 − 1/3,

    ‖Σ_{ρ−k} V_{ρ−k}^T SD(V_k^T SD)^+‖_F ≤ μ ‖A − A_k‖_F.

(The proof of this last argument is omitted from this extended abstract.) Finally, notice that the first three statements have the same failure probability 1/6 and the fourth statement fails w.p. 1/3; the union bound implies that all four statements hold together with probability at least 0.5. ⋄

4.4 The proof of eqn. (4) of Theorem 1

We assume that Algorithm 1 fixes r to the value specified in Lemma 1; note that this does not violate the asymptotic notation used in Algorithm 1. We start by manipulating the term ‖A − X_γ̃ X_γ̃^T A‖_F² in eqn. (4). Replacing A by A_k + A_{ρ−k}, and using the Pythagorean theorem (the subspaces spanned by the components A_k − X_γ̃ X_γ̃^T A_k and A_{ρ−k} − X_γ̃ X_γ̃^T A_{ρ−k} are perpendicular), we get

    ‖A − X_γ̃ X_γ̃^T A‖_F² = ‖(I − X_γ̃ X_γ̃^T) A_k‖_F² (=: θ_3²) + ‖(I − X_γ̃ X_γ̃^T) A_{ρ−k}‖_F² (=: θ_4²).    (14)

We first bound the second term of eqn. (14). Since I − X_γ̃ X_γ̃^T is a projector matrix, it can be dropped without increasing a unitarily invariant norm. Now eqn. (5) implies that

    θ_4² ≤ ‖A_{ρ−k}‖_F² ≤ F_opt.    (15)

We now bound the first term of eqn. (14):

    θ_3 ≤ ‖(I − X_γ̃ X_γ̃^T) ASD(V_k^T SD)^+ V_k^T‖_F + ‖E‖_F    (16)
        ≤ ‖(I − X_γ̃ X_γ̃^T) ASD‖_F ‖(V_k^T SD)^+‖_2 + ‖E‖_F    (17)
        ≤ √γ ‖(I − X_opt X_opt^T) ASD‖_F ‖(V_k^T SD)^+‖_2 + ‖E‖_F    (18)
        ≤ √γ θ_5 ‖V_k^T SD‖_2 ‖(V_k^T SD)^+‖_2 + ‖E‖_F,    (19)–(20)

where θ_5 = ‖(I − X_opt X_opt^T) ASD(V_k^T SD)^+ V_k^T‖_F. In eqn. (16) we used Lemma 1, the triangle inequality, and the fact that I − X_γ̃ X_γ̃^T is a projector matrix and can be dropped without increasing a unitarily invariant norm. In eqn. (17) we used submultiplicativity (see Section 4.1) and the fact that V_k^T can be dropped without changing the spectral norm. In eqn. (18) we replaced X_γ̃ by X_opt and the factor √γ appeared in the first term. To better understand this step, notice that X_γ̃ gives a γ-approximation to the optimal k-means clustering of the matrix ASD, and any other n × k indicator matrix (for example, the matrix X_opt) satisfies

    ‖(I − X_γ̃ X_γ̃^T) ASD‖_F² ≤ γ min_{X ∈ X} ‖(I − XX^T) ASD‖_F² ≤ γ ‖(I − X_opt X_opt^T) ASD‖_F².

In eqns. (19)–(20) we first introduced the k × k identity matrix I_k = (V_k^T SD)(V_k^T SD)^+ (rank(V_k^T SD) = k), then used submultiplicativity (see Section 4.1), and introduced V_k^T without changing the Frobenius norm. We further manipulate the term θ_5 of eqn. (20):

    θ_5 ≤ ‖(I − X_opt X_opt^T) A_k‖_F + ‖(I − X_opt X_opt^T) E‖_F    (21)
        ≤ ‖(I − X_opt X_opt^T) A V_k V_k^T‖_F + ‖E‖_F    (22)
        ≤ (1 + μ) √F_opt.    (23)

In eqn. (21) we used Lemma 1 and the triangle inequality. In eqn. (22) we replaced A_k by A V_k V_k^T and dropped I − X_opt X_opt^T from the second term (I − X_opt X_opt^T is a projector matrix and does not increase the Frobenius norm). In eqn. (23) we dropped the projector matrix V_k V_k^T and used eqn. (5) and Definition 1. Combining equations (20), (23), (5), Lemma 1, and the fact that γ ≥ 1, we get

    θ_3 ≤ √γ ( √((1 + λ)/(1 − λ)) (1 + μ) + μ ) √F_opt  =:  √γ θ_6 √F_opt.

Simple algebra suffices to show that for any ϵ ∈ (0, 1), for any positive integer k ≥ 1, and for some sufficiently large constant c_1, it is θ_6 ≤ √(1 + ϵ), thus

    θ_3² ≤ γ (1 + ϵ) F_opt.    (24)

Combining eqn. (24) with eqns. (14) and (15) concludes the proof of eqn. (4). 
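The spectral statements of Lemma 1, which drive the proof above, can be probed numerically: sample columns of V_k^T by the leverage scores with the (r p_i)^{−1/2} rescaling and check that the singular values of V_k^T SD concentrate around 1 (a sketch with dimensions and thresholds of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, r = 60, 500, 4, 400
A = rng.normal(size=(n, d))
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                                  # d x k, orthonormal columns
p = (Vk ** 2).sum(axis=1) / k                  # leverage-score distribution

# Form V_k^T S D implicitly: pick r columns of V_k^T w.p. p_i, rescale by (r p_i)^{-1/2}.
idx = rng.choice(d, size=r, p=p)
VkT_SD = Vk.T[:, idx] / np.sqrt(r * p[idx])    # k x r

s = np.linalg.svd(VkT_SD, compute_uv=False)
# E[V_k^T S D D S^T V_k] = I_k, so the singular values should be close to 1
# (statements 1 and 2 of Lemma 1); full rank is statement 3.
assert s.min() > 0.5 and s.max() < 1.5
assert np.linalg.matrix_rank(VkT_SD) == k
```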
Using asymptotic notation, our choice of r satisfies r = Ω(k log(k/ϵ)/ϵ²). Note that Theorem 1 fails only if Lemma 1 or the γ-approximation k-means clustering algorithm fail, which happens w.p. at most 0.5 + δ_γ.

              |  r = 5k     |  r = 10k    |  r = 20k    |  All features
              |  F     P    |  F     P    |  F     P    |  F     P
NIPS (k = 3)  |  .758  .847 |  .751  .847 |  .749  .859 |  .747  .881
Bio (k = 3)   |  .742  .764 |  .726  .935 |  .709  1    |  .709  1

Table 1: Numerics from our experiments (Leverage scores).

Figure 1: Leverage scores for the NIPS dataset.

5 Empirical study

We present an empirical evaluation of Algorithm 1 on two real datasets. We show that it selects the most relevant features (Figure 1) and that the clustering obtained after feature selection is performed is very accurate (Table 1). It is important to note that the choice of r in the description of Algorithm 1 is a sufficient, not necessary, condition to prove our theoretical bounds. Indeed, a much smaller choice of r, for example r = 10k, is often sufficient for good empirical results.
We first experimented with a NIPS documents dataset (see http://robotics.stanford.edu/~gal/ and [10]). The data consist of a 184 × 6314 document-term matrix A, with A_ij denoting the number of occurrences of the j-th term in the i-th document. Each document is a paper that appeared in the proceedings of NIPS 2001, 2002, or 2003, and belongs to one of the following three topic categories: (i) Neuroscience, (ii) Learning Theory, and (iii) Control and Reinforcement Learning. Each term appeared at least once in one of the 184 documents. 
We evaluated the accuracy of Algorithm 1 by running Lloyd's heuristic^1 on the rescaled features returned by our method. In order to drive down the failure probability of Algorithm 1, we repeated it 30 times (followed by Lloyd's heuristic each time) and kept the partition that minimized the objective value. We report the percentage of correctly classified objects (denoted by P, 0 ≤ P ≤ 1), as well as the value of the k-means objective (i.e., the value F = ||A − X_γ̃ X_γ̃^T A||_F^2 / ||A||_F^2 of Theorem 1; the division by ||A||_F^2 is for normalization). Results are depicted in Table 1. Notice that only a small subset of features suffices to approximately reproduce the partition obtained when all features were kept. In Figure 1 we plot the distribution of the leverage scores for the 6314 terms (columns) of A; we also highlight the features returned by Algorithm 1 when the sampling parameter r is set to 10k. We observed that terms corresponding to the largest leverage scores had significant discriminative power. In particular, ruyter appeared almost exclusively in documents of the first and third categories, hand appeared in documents of the third category, information appeared in documents of the first category, and code appeared in documents of the second and third categories only. We also experimented with microarray data showing the expression levels of 5520 genes (features) for 31 patients (objects) having three different cancer types [27]: 10 patients with gastrointestinal stromal tumor, 12 with leiomyosarcoma, and 9 with synovial sarcoma. Table 1 depicts the results from our experiments by choosing k = 3. Note that Lloyd's heuristic worked almost perfectly when r was set to 10k and perfectly when r was set to 20k.
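The two reported quantities, the classification accuracy P and the normalized k-means objective F, can be computed as in the following sketch. The helper names are ours, and matching predicted to true cluster labels by trying all k! permutations is an assumption that is only practical for small k (here k = 3).

```python
import itertools
import numpy as np

def clustering_accuracy(true_labels, pred_labels, k):
    """P: fraction of correctly classified objects, maximized over
    all permutations of the k cluster labels (feasible for small k)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0.0
    for perm in itertools.permutations(range(k)):
        relabeled = np.array([perm[c] for c in pred_labels])
        best = max(best, float(np.mean(relabeled == true_labels)))
    return best

def normalized_kmeans_objective(A, labels, k):
    """F = ||A - X X^T A||_F^2 / ||A||_F^2, where X is the normalized
    cluster indicator matrix: X X^T A replaces each row of A by the
    centroid of its cluster."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    centroids = np.vstack([A[labels == c].mean(axis=0) for c in range(k)])
    residual = A - centroids[labels]
    return (residual ** 2).sum() / (A ** 2).sum()
```

In our experiments, P and F were computed from the partition returned by the best of the 30 repetitions.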
Experimental parameters were set to the same values as in the first experiment.

^1 We ran 30 iterations of the E-M step with 30 different random initializations and returned the partition that minimized the k-means objective function, i.e., we ran kmeans(A, k, 'Replicates', 30, 'Maxiter', 30) in MATLAB.

[Figure 1 plot: x-axis "features" (terms 1-6314), y-axis "Leverage Scores" (0-0.08); all leverage scores are shown, with the best set (r = 30) highlighted, including the terms ruyter, hand, information, code, universality, sources, tishby, naftali, hebrew, neural, and center.]

References

[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
[2] C. Boutsidis, M. W. Mahoney, and P. Drineas. Unsupervised feature selection for Principal Components Analysis. In Proceedings of the 14th Annual ACM SIGKDD Conference (KDD), pages 61-69, 2008.
[3] S. Chandrasekaran and I. Ipsen. On rank-revealing factorizations. SIAM Journal on Matrix Analysis and Applications, 15:592-622, 1994.
[4] S. Chatterjee and A. S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1:379-393, 1986.
[5] Y. Cui and J. G. Dy. Orthogonal principal feature selection. Manuscript.
[6] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 291-299, 1999.
[7] P. Drineas, M. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844-881, 2008.
[8] P. Drineas, M. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications.
In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127-1136, 2006.
[9] D. Foley and J. W. Sammon. An optimal set of discriminant vectors. IEEE Transactions on Computers, C-24(3):281-289, March 1975.
[10] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. The Journal of Machine Learning Research, 8:2265-2295, 2007.
[11] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[12] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudoskeleton approximations. Linear Algebra and Its Applications, 261:1-21, 1997.
[13] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[14] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems (NIPS) 17, pages 545-552, 2005.
[15] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
[16] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems (NIPS) 18, pages 507-514, 2006.
[17] Y. P. Hong and C. T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Mathematics of Computation, 58:213-232, 1992.
[18] I. Jolliffe. Discarding variables in a principal component analysis. I: Artificial data. Applied Statistics, 21(2):160-173, 1972.
[19] I. Jolliffe. Discarding variables in a principal component analysis. II: Real data. Applied Statistics, 22(1):21-31, 1973.
[20] W. Krzanowski. Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics, 36(1):22-33, 1987.
[21] A. Kumar, Y. Sabharwal, and S. Sen.
A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 454-462, 2004.
[22] S. P. Lloyd. Least squares quantization in PCM. Unpublished Bell Lab. Tech. Note, portions presented at the Institute of Mathematical Statistics Meeting, Atlantic City, NJ, September 1957. Also, IEEE Trans. Inform. Theory (Special Issue on Quantization), vol. IT-28, pages 129-137, March 1982.
[23] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian. Feature selection using principal feature analysis. In Proceedings of the 15th International Conference on Multimedia, pages 301-304, 2007.
[24] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, USA (PNAS), 106, pages 697-702, 2009.
[25] A. Malhi and R. Gao. PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 53(6):1517-1525, Dec. 2004.
[26] K. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, 35(2):339-344, April 2005.
[27] T. Nielsen et al. Molecular characterisation of soft tissue tumors: A gene expression study. Lancet, 359:1301-1307, 2002.
[28] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 165-176, 2006.
[29] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 54(4), July 2007.
[30] A. Wang and E. A. Gehan. Gene selection for microarray data analysis using principal component analysis.
Stat Med, 24(13):2069-2087, July 2005.
[31] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335-366, 2008.
[32] X. Wu et al. Top 10 algorithms in data mining analysis. Knowl. Inf. Syst., 14(1):1-37, 2007.
[33] D. Zhang, S. Chen, and Z.-H. Zhou. Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition, 41(5):1440-1451, 2008.