{"title": "Laplacian Score for Feature Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 514, "abstract": null, "full_text": "Laplacian Score for Feature Selection\n\nXiaofei He (1), Deng Cai (2), Partha Niyogi (1)\n(1) Department of Computer Science, University of Chicago, {xiaofei, niyogi}@cs.uchicago.edu\n(2) Department of Computer Science, University of Illinois at Urbana-Champaign, dengcai2@uiuc.edu\n\nAbstract\n\nIn supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are \"wrapper\" techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a \"filter\" method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality preserving power, or Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm.\n\n1 Introduction\n\nFeature selection methods can be classified into \"wrapper\" methods and \"filter\" methods [4]. The wrapper model techniques evaluate the features using the learning algorithm that will ultimately be employed; thus, they \"wrap\" the selection process around the learning algorithm. Most feature selection methods are wrapper methods. 
Algorithms based on the filter model examine intrinsic properties of the data to evaluate the features prior to the learning tasks. The filter-based approaches almost always rely on the class labels, most commonly assessing correlations between features and the class label. In this paper, we are particularly interested in filter methods. Some typical filter methods include data variance, Pearson correlation coefficients, Fisher score, and the Kolmogorov-Smirnov test. Most of the existing filter methods are supervised.\n\nData variance might be the simplest unsupervised evaluation of the features. The variance along a dimension reflects its representative power, and can thus serve as a criterion for feature selection and extraction. For example, Principal Component Analysis (PCA) is a classical feature extraction method which finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data. Although the variance criterion finds features that are useful for representing the data, there is no reason to assume that these features are also useful for discriminating between data in different classes. Fisher score seeks features that are efficient for discrimination: it assigns the highest score to the feature on which the data points of different classes are far from each other while data points of the same class are close to each other. The Fisher criterion can also be used for feature extraction, as in Linear Discriminant Analysis (LDA).\n\nIn this paper, we introduce a novel feature selection algorithm called Laplacian Score (LS). For each feature, its Laplacian Score is computed to reflect its locality preserving power. LS is based on the observation that two data points are probably related to the same topic if they are close to each other. In fact, in many learning problems, such as classification, the local structure of the data space is more important than the global structure. 
In order to model the local geometric structure, we construct a nearest neighbor graph. LS seeks those features that respect this graph structure.\n\n2 Laplacian Score\n\nLaplacian Score (LS) is fundamentally based on Laplacian Eigenmaps [1] and Locality Preserving Projection [3]. The basic idea of LS is to evaluate the features according to their locality preserving power.\n\n2.1 The Algorithm\n\nLet L_r denote the Laplacian Score of the r-th feature, and let f_ri denote the i-th sample of the r-th feature, i = 1, ..., m. Our algorithm can be stated as follows:\n\n1. Construct a nearest neighbor graph G with m nodes, where the i-th node corresponds to x_i. We put an edge between nodes i and j if x_i and x_j are \"close\", i.e. x_i is among the k nearest neighbors of x_j or x_j is among the k nearest neighbors of x_i. When label information is available, one can instead put an edge between two nodes sharing the same label.\n\n2. If nodes i and j are connected, put S_ij = exp(-||x_i - x_j||^2 / t), where t is a suitable constant; otherwise, put S_ij = 0. The weight matrix S of the graph models the local structure of the data space.\n\n3. For the r-th feature, define f_r = [f_r1, f_r2, ..., f_rm]^T, D = diag(S 1) with 1 = [1, ..., 1]^T, and L = D - S, where the matrix L is often called the graph Laplacian [2]. Let\n\nf~_r = f_r - (f_r^T D 1 / 1^T D 1) 1\n\n4. Compute the Laplacian Score of the r-th feature as\n\nL_r = (f~_r^T L f~_r) / (f~_r^T D f~_r)   (1)\n\n3 Justification\n\n3.1 Objective Function\n\nRecall that given a data set we construct a weighted graph G with edges connecting nearby points to each other, where S_ij evaluates the similarity between the i-th and j-th nodes. Thus, the importance of a feature can be thought of as the degree to which it respects the graph structure. To be specific, a \"good\" feature should be one on which two data points are close to each other if and only if there is an edge between these two points. 
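As a concrete illustration, the four steps of Section 2.1 can be sketched in plain Python. This is a minimal sketch, not the authors' code; the function name `laplacian_score`, the defaults k = 3 and t = 1.0, and the toy data in the usage note below are our own illustrative assumptions.

```python
import math


def laplacian_score(X, k=3, t=1.0):
    """Laplacian Score L_r for each feature of X (rows = samples).

    Smaller scores indicate features that better respect the
    k-nearest-neighbor graph, per Section 2.1.
    """
    m, d = len(X), len(X[0])

    def dist2(a, b):
        # Squared Euclidean distance ||x_i - x_j||^2.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Step 1: symmetric k-nearest-neighbor graph.
    nbrs = []
    for i in range(m):
        order = sorted(range(m), key=lambda j: dist2(X[i], X[j]))
        nbrs.append(set(order[1:k + 1]))  # skip the point itself

    # Step 2: heat-kernel weights S_ij = exp(-||x_i - x_j||^2 / t).
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and (j in nbrs[i] or i in nbrs[j]):
                S[i][j] = math.exp(-dist2(X[i], X[j]) / t)
    D = [sum(row) for row in S]  # diagonal entries of D = diag(S 1)

    scores = []
    for r in range(d):
        f = [row[r] for row in X]
        # Step 3: remove the D-weighted mean, f~ = f - (f^T D 1 / 1^T D 1) 1.
        mu = sum(fi * di for fi, di in zip(f, D)) / sum(D)
        ft = [fi - mu for fi in f]
        # Step 4: L_r = f~^T L f~ / f~^T D f~, using the identity
        # f~^T L f~ = (1/2) sum_ij S_ij (f~_i - f~_j)^2.
        num = sum(S[i][j] * (ft[i] - ft[j]) ** 2
                  for i in range(m) for j in range(m)) / 2.0
        den = sum(di * fi ** 2 for di, fi in zip(D, ft))
        scores.append(num / den if den > 0 else 0.0)
    return scores
```

On a toy two-cluster set such as `[[0.0, 0.0], [0.1, 1.0], [0.2, 0.3], [5.0, 0.9], [5.1, 0.1], [5.2, 0.6]]` with k = 2, the first feature (which separates the clusters) receives a smaller score than the second, noise-like feature.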
A reasonable criterion for choosing a good feature is then to minimize the following objective function:\n\nL_r = (sum_ij (f_ri - f_rj)^2 S_ij) / Var(f_r)   (2)\n\nwhere Var(f_r) is the estimated variance of the r-th feature. By minimizing sum_ij (f_ri - f_rj)^2 S_ij, we prefer features that respect the pre-defined graph structure: for a good feature, the bigger S_ij, the smaller (f_ri - f_rj), and thus the Laplacian Score tends to be small. Following some simple algebraic steps, we see that\n\nsum_ij (f_ri - f_rj)^2 S_ij = sum_ij (f_ri^2 + f_rj^2 - 2 f_ri f_rj) S_ij = 2 sum_i f_ri^2 D_ii - 2 sum_ij f_ri S_ij f_rj = 2 f_r^T D f_r - 2 f_r^T S f_r = 2 f_r^T L f_r\n\nBy maximizing Var(f_r), we prefer features with large variance, which have more representative power. Recall that the variance of a random variable a can be written as\n\nVar(a) = int_M (a - mu)^2 dP(a),  mu = int_M a dP(a)\n\nwhere M is the data manifold, mu is the expected value of a, and dP is the probability measure. By spectral graph theory [2], dP can be estimated by the diagonal matrix D on the sample points. Thus, the weighted data variance can be estimated as\n\nVar(f_r) = sum_i (f_ri - mu_r)^2 D_ii,  mu_r = (sum_i f_ri D_ii) / (sum_i D_ii) = f_r^T D 1 / 1^T D 1\n\nTo remove the mean from the samples, we define\n\nf~_r = f_r - (f_r^T D 1 / 1^T D 1) 1\n\nThus,\n\nVar(f_r) = sum_i f~_ri^2 D_ii = f~_r^T D f~_r\n\nAlso, it is easy to show that f~_r^T L f~_r = f_r^T L f_r (see the remark following Observation 1 in Section 3.2). We finally arrive at equation (1).\n\nIt is important to note that, if we did not remove the mean, the vector f_r could be a nonzero constant vector such as 1. It is easy to check that 1^T L 1 = 0 and 1^T D 1 > 0, so such a feature would receive L_r = 0 even though it is clearly of no use, since it contains no information. With the mean removed, the new vector f~_r is orthogonal to 1 with respect to D, i.e. f~_r^T D 1 = 0; therefore, f~_r cannot be any constant vector other than 0. If f~_r = 0, then f~_r^T L f~_r = f~_r^T D f~_r = 0. 
Thus, the Laplacian Score L_r becomes trivial and the r-th feature is excluded from selection.\n\nWhile computing the weighted variance, the matrix D models the importance (or local density) of the data points. We can also simply replace it by the identity matrix I, in which case the weighted variance becomes the standard variance. To be specific,\n\nf~_r = f_r - (f_r^T I 1 / 1^T I 1) 1 = f_r - mu 1,  mu = (1/n) sum_i f_ri\n\nwhere mu is the mean of f_ri, i = 1, ..., n. Thus,\n\nVar(f_r) = f~_r^T I f~_r = (f_r - mu 1)^T (f_r - mu 1)   (3)\n\nwhich is just the standard variance, up to the constant factor 1/n.\n\nIn fact, the Laplacian Scores can be thought of as Rayleigh quotients for the features with respect to the graph G; please see [2] for details.\n\n3.2 Connection to Fisher Score\n\nIn this section, we provide a theoretical analysis of the connection between our algorithm and the canonical Fisher score.\n\nGiven a set of labeled data points {x_i, y_i}_{i=1}^n with y_i in {1, ..., c}, let n_i denote the number of data points in class i, and let mu_i and sigma_i^2 be the mean and variance of class i (i = 1, ..., c) on the r-th feature. Let mu and sigma^2 denote the mean and variance of the whole data set on that feature. The Fisher score is defined as\n\nF_r = (sum_{i=1}^c n_i (mu_i - mu)^2) / (sum_{i=1}^c n_i sigma_i^2)   (4)\n\nIn the following, we show that the Fisher score is equivalent to the Laplacian Score with a special graph structure. We define the weight matrix as follows:\n\nS_ij = 1/n_l if y_i = y_j = l, and S_ij = 0 otherwise.   (5)\n\nWithout loss of generality, we assume that the data points are ordered by class, so that {x_1, ..., x_{n_1}} are in the first class, {x_{n_1+1}, ..., x_{n_1+n_2}} are in the second class, etc. Thus, S is block diagonal, S = diag(S_1, ..., S_c), where S_i = (1/n_i) 1 1^T is an n_i x n_i matrix. For each S_i, the row (and column) sums equal 1, so D_i = diag(S_i 1) is just the identity matrix. Define f_r^1 = [f_r1, ..., f_rn_1]^T, f_r^2 = [f_r,n_1+1, ..., f_r,n_1+n_2]^T, etc. We now make the following observations. 
Observation 1. With the weight matrix S defined in (5), we have f~_r^T L f~_r = f_r^T L f_r = sum_i n_i sigma_i^2, where L = D - S.\n\nTo see this, define L_i = D_i - S_i = I_i - S_i, where I_i is the n_i x n_i identity matrix. We have\n\nf_r^T L f_r = sum_{i=1}^c (f_r^i)^T L_i f_r^i = sum_{i=1}^c (f_r^i)^T (I_i - (1/n_i) 1 1^T) f_r^i = sum_{i=1}^c n_i sigma_i^2\n\nNote that, since u^T L 1 = 1^T L u = 0 for all u in R^n, the value of f_r^T L f_r remains unchanged by subtracting a constant vector (= alpha 1) from f_r. This shows that f~_r^T L f~_r = f_r^T L f_r = sum_i n_i sigma_i^2.\n\nObservation 2. With the weight matrix S defined in (5), we have f~_r^T D f~_r = n sigma^2.\n\nTo see this, note that by the definition of S we have D = I; the claim is then an immediate consequence of equation (3).\n\nObservation 3. With the weight matrix S defined in (5), we have sum_{i=1}^c n_i (mu_i - mu)^2 = f~_r^T D f~_r - f~_r^T L f~_r.\n\nTo see this, notice that\n\nsum_{i=1}^c n_i (mu_i - mu)^2 = sum_{i=1}^c (n_i mu_i^2 - 2 n_i mu_i mu + n_i mu^2) = sum_{i=1}^c n_i mu_i^2 - n mu^2\n\nSince n_i mu_i^2 = (f_r^i)^T ((1/n_i) 1 1^T) f_r^i = (f_r^i)^T S_i f_r^i and n mu^2 = f_r^T ((1/n) 1 1^T) f_r, we get\n\nsum_{i=1}^c n_i (mu_i - mu)^2 = f_r^T S f_r - f_r^T ((1/n) 1 1^T) f_r = f_r^T (I - (1/n) 1 1^T) f_r - f_r^T (I - S) f_r = n sigma^2 - f_r^T L f_r = f~_r^T D f~_r - f~_r^T L f~_r\n\nThis completes the proof.\n\nWe therefore get the following relationship between the Laplacian Score and the Fisher score:\n\nTheorem 1. Let F_r denote the Fisher score of the r-th feature. With the weight matrix S defined in (5), we have L_r = 1 / (1 + F_r).\n\nProof. From Observations 1, 2 and 3, we see that\n\nF_r = (sum_{i=1}^c n_i (mu_i - mu)^2) / (sum_{i=1}^c n_i sigma_i^2) = (f~_r^T D f~_r - f~_r^T L f~_r) / (f~_r^T L f~_r) = 1/L_r - 1\n\nThus, L_r = 1 / (1 + F_r).\n\n4 Experimental Results\n\nSeveral experiments were carried out to demonstrate the efficiency and effectiveness of our algorithm. Our algorithm is an unsupervised filter method, while almost all the existing filter methods are supervised. Therefore, we compared our algorithm with data variance, which can be computed in an unsupervised fashion. 
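As a small sanity check on the analysis of Section 3.2, Theorem 1 lends itself to a quick numerical verification. The sketch below is our own illustrative code (the name `fisher_vs_laplacian` and the toy data are ours, not the paper's): it builds the label-based weight matrix of equation (5), for which D = I, computes both scores for a single feature, and confirms L_r = 1/(1 + F_r).

```python
def fisher_vs_laplacian(f, y):
    """Return (L_r, F_r) for one feature vector f with class labels y,
    using the label-based graph S_ij = 1/n_l of eq. (5), so that D = I."""
    n = len(f)
    classes = sorted(set(y))
    counts = {c: y.count(c) for c in classes}
    mu = sum(f) / n

    # Fisher score, eq. (4): between-class over within-class scatter.
    between = sum(
        counts[c] * (sum(fi for fi, yi in zip(f, y) if yi == c) / counts[c] - mu) ** 2
        for c in classes)
    within = 0.0
    for c in classes:
        fc = [fi for fi, yi in zip(f, y) if yi == c]
        mc = sum(fc) / len(fc)
        within += sum((fi - mc) ** 2 for fi in fc)  # n_i * sigma_i^2
    F = between / within

    # Laplacian Score with S_ij = 1/n_l for same-label pairs and D = I.
    S = [[(1.0 / counts[y[i]]) if y[i] == y[j] else 0.0
          for j in range(n)] for i in range(n)]
    ft = [fi - mu for fi in f]  # D = I, so the weighted mean is the plain mean
    fLf = sum(S[i][j] * (ft[i] - ft[j]) ** 2
              for i in range(n) for j in range(n)) / 2.0
    fDf = sum(v * v for v in ft)
    return fLf / fDf, F
```

For f = [1, 2, 10, 11] with labels [0, 0, 1, 1], the within-class scatter is 1 and the between-class scatter is 81, so F_r = 81 and L_r = 1/82 = 1/(1 + F_r), as Theorem 1 predicts.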
4.1 UCI Iris Data\n\nThe Iris dataset, popularly used for testing clustering and classification algorithms, is taken from the UCI ML repository. It contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. Each instance is characterized by four features: sepal length, sepal width, petal length, and petal width. One class is linearly separable from the other two, but the other two are not linearly separable from each other. Of the four features, it is known that F3 (petal length) and F4 (petal width) are the more important for the underlying clusters. The class correlations of the four features are 0.7826, -0.4194, 0.9490 and 0.9565, respectively. We also used a leave-one-out strategy to classify the data using each single feature, with the nearest neighbor classifier; the classification error rates for the four features are 0.41, 0.52, 0.12 and 0.12, respectively. This analysis indicates that F3 and F4 are better than F1 and F2 in the sense of discrimination. In Figure 1, we present a 2-D visualization of the Iris data.\n\nWe compared three methods, i.e. variance, Fisher score, and Laplacian Score, for feature selection. All of them are filter methods which are independent of any learning task; however, Fisher score is supervised, while the other two are unsupervised.\n\n[Figure 1: 2-D visualization of the Iris data.]\n\nBy using variance, the four features are sorted as F3, F1, F4, F2. The Laplacian Score with k >= 15 sorts the four features as F3, F4, F1, F2, while with 3 <= k < 15 it sorts them as F4, F3, F1, F2. With a larger k, we see more of the global structure of the data set; therefore, the feature F3 is ranked above F4, since the variance of F3 is greater than that of F4. 
By using Fisher score, the four features are sorted as F3, F4, F1, F2. This indicates that the Laplacian Score (unsupervised, with k >= 15) achieves the same ranking as Fisher score (supervised).\n\n4.2 Face Clustering on PIE\n\nIn this section, we apply our feature selection algorithm to face clustering. Using the Laplacian Score, we select a subset of features which are the most useful for discrimination; clustering is then performed in this subspace.\n\n4.2.1 Data Preparation\n\nThe CMU PIE face database is used in this experiment. It contains 68 subjects with 41,368 face images in total. Preprocessing to locate the faces was applied: the original images were normalized (in scale and orientation) such that the two eyes were aligned at the same position, and the facial areas were then cropped for matching. The size of each cropped image is 32 x 32 pixels, with 256 grey levels per pixel; thus, each image is represented by a 1024-dimensional vector. No further preprocessing is done. In this experiment, we fixed the pose and expression, so for each subject we have 24 images under different lighting conditions.\n\nFor each given number k, k classes were randomly selected from the face database. This process was repeated 20 times (except for k = 68) and the average performance was computed. For each test (given k classes), two algorithms, i.e. feature selection using variance and using the Laplacian Score, were used to select the features. K-means was then performed in the selected feature subspace. K-means was repeated 10 times with different initializations and the best result in terms of the K-means objective function was recorded.\n\n4.2.2 Evaluation Metrics\n\nThe clustering result is evaluated by comparing the obtained label of each data point with that provided by the data corpus. Two metrics, the accuracy (AC) and the normalized mutual information (MI), are used to measure the clustering performance [6]. 
Given a data point x_i, let r_i and s_i be the obtained cluster label and the label provided by the data corpus, respectively. The AC is defined as follows:\n\nAC = (sum_{i=1}^n delta(s_i, map(r_i))) / n   (6)\n\nwhere n is the total number of data points, delta(x, y) is the delta function that equals one if x = y and zero otherwise, and map(r_i) is the permutation mapping function that maps each cluster label r_i to the equivalent label from the data corpus. The best mapping can be found by using the Kuhn-Munkres algorithm [5].\n\n[Figure 2: Clustering performance versus the number of features, for (a) 5 classes, (b) 10 classes, (c) 30 classes, and (d) 68 classes (accuracy and mutual information curves for Laplacian Score and variance).]\n\nLet C denote the set of clusters obtained from the ground truth and C' the set obtained from our algorithm. 
Their mutual information metric MI(C, C') is defined as follows:\n\nMI(C, C') = sum_{c_i in C, c'_j in C'} p(c_i, c'_j) log2( p(c_i, c'_j) / (p(c_i) p(c'_j)) )   (7)\n\nwhere p(c_i) and p(c'_j) are the probabilities that a data point arbitrarily selected from the corpus belongs to cluster c_i and to cluster c'_j, respectively, and p(c_i, c'_j) is the joint probability that the arbitrarily selected data point belongs to both clusters at the same time. In our experiments, we use the normalized mutual information MI~:\n\nMI~(C, C') = MI(C, C') / max(H(C), H(C'))   (8)\n\nwhere H(C) and H(C') are the entropies of C and C', respectively. It is easy to check that MI~(C, C') ranges from 0 to 1: MI~ = 1 if the two sets of clusters are identical, and MI~ = 0 if the two sets are independent.\n\n4.2.3 Results\n\nWe compared the Laplacian Score with data variance for clustering. Note that we did not compare with Fisher score, because it is supervised and label information is not available in the clustering experiments. Several tests were performed with different numbers of clusters (k = 5, 10, 30, 68). In all the tests, the number of nearest neighbors in our algorithm was taken to be 5. The experimental results are shown in Figure 2 and Table 1. As can be seen, in all these cases our algorithm performs much better than using variance for feature selection. The clustering performance varies with the number of features, and the best performance is obtained at very low dimensionality (fewer than 200 features). This indicates that feature selection is capable of enhancing clustering performance. In Figure 3, we show the selected features in the image domain for each test (k = 5, 10, 30, 68), using our algorithm, data variance, and Fisher score. The brightness of a pixel indicates its importance: the brighter the pixel, the more important it is. As can be seen, the Laplacian Score provides a better approximation to Fisher score than data variance does. 
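The two evaluation metrics of equations (6)-(8) can be sketched in plain Python. This is an illustrative sketch, not the authors' evaluation code; in particular, a brute-force search over label permutations (adequate for a handful of clusters) stands in for the Kuhn-Munkres algorithm, which would be used at scale.

```python
import math
from itertools import permutations


def accuracy(true, pred):
    """AC of eq. (6): best one-to-one relabeling of predicted clusters.

    Assumes pred uses at most as many distinct labels as true; brute-force
    over permutations stands in for Kuhn-Munkres here.
    """
    t_labels, p_labels = sorted(set(true)), sorted(set(pred))
    best = 0
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))  # candidate map(r_i)
        best = max(best, sum(1 for ti, pi in zip(true, pred)
                             if mapping[pi] == ti))
    return best / len(true)


def nmi(true, pred):
    """Normalized mutual information of eqs. (7)-(8), base-2 logs."""
    n = len(true)

    def entropy(labels):
        return -sum((c / n) * math.log2(c / n)
                    for c in (labels.count(v) for v in set(labels)))

    mi = 0.0
    for ci in set(true):
        for cj in set(pred):
            p_ij = sum(1 for t, p in zip(true, pred)
                       if t == ci and p == cj) / n
            if p_ij > 0:
                mi += p_ij * math.log2(
                    p_ij / ((true.count(ci) / n) * (pred.count(cj) / n)))
    h = max(entropy(true), entropy(pred))
    return mi / h if h > 0 else 1.0
```

For two clusterings that agree up to a relabeling, e.g. `[0, 0, 1, 1]` versus `[1, 1, 0, 0]`, both metrics equal 1; for independent clusterings such as `[0, 0, 1, 1]` versus `[0, 1, 0, 1]`, the normalized mutual information is 0.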
[Figure 3: Selected features in the image domain for k = 5, 10, 30, 68, under (a) variance, (b) Laplacian Score, and (c) Fisher score. The brightness of the pixels indicates their importance.]\n\nTable 1: Clustering performance comparisons (k is the number of clusters; columns give the number of selected features)\n\nAccuracy\nk   Method            20     50     100    200    300    500    1024\n5   Laplacian Score   0.727  0.806  0.831  0.849  0.837  0.644  0.479\n5   Variance          0.683  0.698  0.602  0.503  0.482  0.464  0.479\n10  Laplacian Score   0.685  0.743  0.787  0.772  0.711  0.585  0.403\n10  Variance          0.494  0.500  0.456  0.418  0.392  0.392  0.403\n30  Laplacian Score   0.591  0.623  0.671  0.650  0.588  0.485  0.358\n30  Variance          0.399  0.393  0.390  0.365  0.346  0.340  0.358\n68  Laplacian Score   0.479  0.554  0.587  0.608  0.553  0.465  0.332\n68  Variance          0.328  0.362  0.334  0.316  0.311  0.312  0.332\n\nMutual Information\nk   Method            20     50     100    200    300    500    1024\n5   Laplacian Score   0.807  0.866  0.861  0.862  0.850  0.652  0.484\n5   Variance          0.662  0.697  0.609  0.526  0.495  0.482  0.484\n10  Laplacian Score   0.811  0.849  0.865  0.842  0.796  0.705  0.538\n10  Variance          0.609  0.632  0.600  0.563  0.538  0.529  0.538\n30  Laplacian Score   0.807  0.826  0.849  0.831  0.803  0.735  0.624\n30  Variance          0.646  0.649  0.649  0.624  0.611  0.608  0.624\n68  Laplacian Score   0.778  0.830  0.833  0.843  0.814  0.760  0.662\n68  Variance          0.639  0.686  0.661  0.651  0.642  0.643  0.662\n\nBoth the Laplacian Score and Fisher score have their brightest pixels in the areas of the two eyes, nose, mouth, and face contour. This indicates that even though our algorithm is unsupervised, it can discover the most discriminative features to some extent.\n\n5 Conclusions\n\nIn this paper, we propose a new filter method for feature selection which is independent of any learning task. It can be performed in either supervised or unsupervised fashion. The new algorithm is based on the observation that local geometric structure is crucial for discrimination. Experiments on the Iris data set and the PIE face data set demonstrate the effectiveness of our algorithm.\n\nReferences\n\n[1] M. Belkin and P. 
Niyogi, \"Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,\" Advances in Neural Information Processing Systems, Vol. 14, 2001.\n[2] Fan R. K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics, No. 92, 1997.\n[3] X. He and P. Niyogi, \"Locality Preserving Projections,\" Advances in Neural Information Processing Systems, Vol. 16, 2003.\n[4] R. Kohavi and G. John, \"Wrappers for Feature Subset Selection,\" Artificial Intelligence, 97(1-2):273-324, 1997.\n[5] L. Lovasz and M. Plummer, Matching Theory, Akademiai Kiado, North Holland, 1986.\n[6] W. Xu, X. Liu and Y. Gong, \"Document Clustering Based on Non-negative Matrix Factorization,\" ACM SIGIR Conference on Information Retrieval, 2003.\n", "award": [], "sourceid": 2909, "authors": [{"given_name": "Xiaofei", "family_name": "He", "institution": null}, {"given_name": "Deng", "family_name": "Cai", "institution": null}, {"given_name": "Partha", "family_name": "Niyogi", "institution": null}]}