{"title": "Grouping and dimensionality reduction by locally linear embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 1255, "page_last": 1262, "abstract": null, "full_text": "Grouping and dimensionality reduction by \n\nlocally linear embedding \n\nDivision of Physics, Mathematics and Astronomy \n\nMarzia Polito \n\nCalifornia Institute of Technology \n\nPasadena, CA, 91125 \n\npolito@caltech.edu \n\nPietro Perona \n\nDivision of Engeneering and Applied Mathematics \n\nCalifornia Institute of Technology \n\nPasadena, CA, 91125 \nperona@caltech.edu \n\nAbstract \n\n(LLE) \n\nLocally Linear Embedding \nis an elegant nonlinear \ndimensionality-reduction technique recently introduced by Roweis \nand Saul [2]. It fails when the data is divided into separate groups. \nWe study a variant of LLE that can simultaneously group the data \nand calculate local embedding of each group. An estimate for the \nupper bound on the intrinsic dimension of the data set is obtained \nautomatically. \n\n1 \n\nIntroduction \n\nConsider a collection of N data points Xi E ]RD. Suppose that, while the dimension \nD is large, we have independent information suggesting that the data are distributed \non a manifold of dimension d < < D. In many circumstances it is beneficial to \ncalculate the coordinates Yi E ]Rd of the data on the lower-dimensional manifold, \nboth because the shape of the manifold may yield some insight in the process that \nproduced the data, and because it is cheaper to store and manipulate the data when \nit is embedded in fewer dimensions. How can we compute such coordinates? \n\nPrincipal component analysis (PCA) is a classical technique which works well when \nthe data lie close to a flat manifold [1]. Elegant methods for dealing with data that \nis distributed on curved manifolds have been recently proposed [3, 2]. We study \none of them, Locally Linear Embedding (LLE) [2], by Roweis and Saul. 
While LLE is not designed to handle data that are disconnected, i.e. separated into groups, we show that a simple variation of the method will handle this situation correctly. Furthermore, both the number of groups and the upper bound on the intrinsic dimension of the data may be estimated automatically, rather than being given a priori.

2 Locally linear embedding

The key insight inspiring LLE is that, while the data may not lie close to a globally linear manifold, they may be approximately locally linear, and in this case each point may be approximated as a linear combination of its nearest neighbors. The coefficients of this linear combination carry the vital information for constructing a lower-dimensional linear embedding.

More explicitly: consider a data set $\{X_i\}_{i=1,\dots,N} \subset \mathbb{R}^D$. The local linear structure can be easily encoded in a sparse N by N matrix W, proceeding as follows.

The first step is to choose a criterion to determine the neighbors of each point. Roweis and Saul choose an integer number K and pick, for every point, the K points nearest to it. For each point $X_i$ they then determine the linear combination of its neighbors which best approximates the point itself. The coefficients of such linear combinations are computed by minimizing the quadratic cost function:

$$\varepsilon(W) = \sum_{i=1}^{N} \Big| X_i - \sum_{j=1}^{N} W_{ij} X_j \Big|^2 \qquad (1)$$

while enforcing the constraints $W_{ij} = 0$ if $X_j$ is not a neighbor of $X_i$, and $\sum_{j=1}^{N} W_{ij} = 1$ for every i; these constraints ensure that the approximation $X_i \approx \hat{X}_i = \sum_{j=1}^{N} W_{ij} X_j$ lies in the affine subspace generated by the K nearest neighbors of $X_i$, and that the solution W is translation-invariant. This least-squares problem may be solved in closed form [2].

The next step consists of calculating a set $\{Y_i\}_{i=1,\dots,N}$ of points in $\mathbb{R}^d$, reproducing as faithfully as possible the local linear structure encoded in W.
This is done by minimizing the cost function

$$\Phi(Y) = \sum_{i=1}^{N} \Big| Y_i - \sum_{j=1}^{N} W_{ij} Y_j \Big|^2 \qquad (2)$$

To ensure the uniqueness of the solution, two constraints are imposed: translation invariance, obtained by placing the center of gravity of the data at the origin, i.e. $\sum_i Y_i = 0$, and normalized unit covariance of the $Y_i$'s, i.e. $\frac{1}{N} \sum_{i=1}^{N} Y_i \otimes Y_i = I$.

Roweis and Saul prove that $\Phi(Y) = \mathrm{tr}(Y^T M Y)$, where M is defined as

$$M = (I - W)^T (I - W).$$

The minimum of the function $\Phi(Y)$ for the d-dimensional representation is then obtained with the following recipe. Given d, consider the d + 1 eigenvectors associated to the d + 1 smallest eigenvalues of the matrix M. Then discard the very first one. The rows of the matrix Y whose columns are given by the remaining d eigenvectors give the desired solution. The first eigenvector is discarded because it is a vector composed of all ones, with 0 as eigenvalue. As we shall see, this is true when the data set is 'connected'.

2.1 Disjoint components

In LLE every data point has a set of K neighbors. This allows us to partition the whole data set X into K-connected components, corresponding to the intuitive visual notion of different 'groups' in the data set.

We say that a partition $X = \cup_i U_i$ is finer than a partition $X = \cup_j V_j$ if every $U_i$ is contained in some $V_j$. The partition into K-connected components is the finest

Figure 1: (Top-left) 2D data $X_i$ distributed along a curve (the index i increases from left to right for convenience). (Top-right) Coordinates $Y_i$ of the same points calculated by LLE with K = 10 and d = 1. The x axis represents the index i and the y axis represents $Y_i$. This is a good parametrization which recognizes the intrinsically 1-dimensional structure of the data.
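The two steps above, solving the local least-squares problems for the weights W and then taking the bottom eigenvectors of M, can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the authors' code: the function name, the brute-force neighbor search, and the small ridge added to the local Gram matrix (a common safeguard when the Gram matrix is singular, e.g. for K > D) are our own assumptions.

```python
import numpy as np

def lle(X, K, d):
    """Minimal LLE sketch (illustrative, not the paper's implementation).

    X : (N, D) array of data points, K : number of neighbors,
    d : target dimension.  Returns an (N, d) array of coordinates Y.
    """
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        # K nearest neighbors of X[i], excluding X[i] itself.
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:K + 1]
        # Local Gram matrix of the centered neighbors; the tiny ridge
        # (our assumption) keeps the solve stable when C is singular.
        Z = X[nbrs] - X[i]
        C = Z @ Z.T
        C = C + np.eye(K) * 1e-9 * (np.trace(C) + 1.0)
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs] = w / w.sum()        # enforce sum_j W_ij = 1
    # Embedding: bottom d + 1 eigenvectors of M = (I - W)^T (I - W),
    # discarding the first (constant) eigenvector with eigenvalue 0.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return vecs[:, 1:d + 1]
```

For a curve such as the one in Figure 1 (top-left), `lle(X, 10, 1)` returns a single coordinate per point; note that `eigh` normalizes each eigenvector to unit norm, which differs from the paper's covariance constraint only by a factor of $\sqrt{N}$.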
(Bottom-left) As above, but the data are now disconnected, i.e. points in different groups do not share neighbors. (Bottom-right) One-dimensional LLE calculated on the data (different symbols are used for points belonging to the different groups). Notice that the $Y_i$'s are no longer a good representation of the data, since they are constant within each group.

partition of the data set such that if two points have at least one neighbor in common, or one is a neighbor of the other, then they belong to the same component.

Note that for any two points in the same component, we can find an ordered sequence of points having them as endpoints, such that two consecutive points have at least one neighbor in common. A set is K-connected if it contains only one K-connected component.

If the data are not K-connected, then LLE does not compute a good parametrization, as illustrated in Figure 1.

2.2 Choice of d

How is d chosen? The LLE method [2] is based on the assumption that d is known. What if we do not know it in advance? If we overestimate d, then LLE behaves pathologically.

Let us consider a straight line drawn in $\mathbb{R}^3$. Figure 2 shows what happens if d is chosen equal to 1 and to 2. When the choice is 2 (right), then LLE 'makes up' information and generates a somewhat arbitrary 2D curve.

As an effect of the covariance constraint, the representation curves the line; the

Figure 2: Coordinates $Y_i$ calculated for data $X_i$ distributed along a straight line in $\mathbb{R}^D = \mathbb{R}^3$ when the dimension d is chosen as d = 1 (Left), and d = 2 (Right). The index i is indicated along the x axis (Left) and along the 2D curve (Right).

curvature can be very high, and even locally we may completely lose the linear structure. The problem is that we chose the wrong target dimension. The one-dimensional LLE in fact works perfectly (see Figure 2, left).
PCA provides a principled way of estimating the intrinsic dimensionality of the data: it corresponds to the number of large singular values of the covariance matrix of the data. Is such an estimate possible with LLE as well?

3 Dimensionality detection: the size of the eigenvalues

In the example of Figure 2 the two-dimensional representation of the data (d = 2) is clearly the 'wrong' one, since the data lie in a one-dimensional linear subspace. In this case the unit covariance constraint in minimizing the function $\Phi(Y)$ is not compatible with the linear structure. How could one have obtained the correct estimate of d? The answer is that d + 1 should be less than or equal to the number of eigenvalues of M that are close to zero.

Proposition 1. Assume that the data $X_i \in \mathbb{R}^D$ are K-connected and locally flat, i.e. there exists a corresponding set $Y_i \in \mathbb{R}^d$ for some d > 0 such that $Y_i = \sum_j W_{ij} Y_j$ (zero-error approximation), the set $\{Y_i\}$ has rank d, and has the origin as center of gravity: $\sum_{i=1}^{N} Y_i = 0$. Call z the number of zero eigenvalues of the matrix M. Then d < z.

Proof. By construction the N-vector composed of all 1's is a zero-eigenvector of M. Moreover, since the $Y_i$ are such that the addends of $\Phi$ have zero error, the matrix Y, which by hypothesis has rank d, is in the kernel of I - W and hence in the kernel of M. Due to the center of gravity constraint, all the columns of Y are orthogonal to the all-1's vector. Hence M has at least d + 1 zero eigenvalues. □

Therefore, in order to estimate d, one may count the number z of zero eigenvalues of M and choose any d < z. Within this range, smaller values of d will yield more compact representations, while larger values of d will yield more expressive ones, i.e. ones that are more faithful to the original data.

What happens in non-ideal conditions, i.e. when the data are not exactly locally flat, and when one has to contend with numerical noise?
The appendix provides an argument showing that the statement in the proposition is robust with respect to

Figure 3: (Left) Eigenvalues for the straight-line data $X_i$ used for Figure 2. (Right) Eigenvalues for the curve data shown in the top-left panel of Figure 1. In both cases the last two eigenvalues are orders of magnitude smaller than the other eigenvalues, indicating a maximal dimension d = 1 for the data.

noise, i.e. numerical errors and small deviations from the ideal locally flat data will result in small deviations from the ideal zero value of the first d + 1 eigenvalues, where d is used here for the 'intrinsic' dimension of the data. This is illustrated in Figure 3.

In Figure 4 we describe the successful application of the dimensionality detection method on a data set of synthetically generated grayscale images.

4 LLE and grouping

In the first example (2.1) we pointed out the limits of LLE when applied to data with multiple components. It appears then that a grouping procedure should always precede LLE. The data would first be split into its component groups, each one of which would then be analyzed with LLE. A deeper analysis of the algorithm, though, suggests that grouping and LLE can actually be performed at the same time.

Proposition 2. Suppose the data set $\{X_i\}_{i=1,\dots,N} \subset \mathbb{R}^D$ is partitioned into m K-connected components. Then there exists an m-dimensional eigenspace of M with zero eigenvalue which admits a basis $\{v_i\}_{i=1,\dots,m}$ where the $v_i$ have entries that are either '1' or '0'.
More precisely: each $v_i$ corresponds to one of the groups of the data and takes value $v_{i,j} = 1$ for j in the group, $v_{i,j} = 0$ for j not in the group.

Proof. Without loss of generality, assume that the indexing of the data $X_i$ is such that the weight matrix W, and consequently the matrix M, are block-diagonal with m blocks, each block corresponding to one of the groups of data. This is achieved by a permutation of indices, which will not affect any further step of our algorithm. As a direct consequence of the row normalization of W, each block of M has exactly one eigenvector composed of all ones, with eigenvalue 0. Therefore, there is an m-dimensional eigenspace with eigenvalue 0, and there exists a basis of it, each vector of which has value 1 on a certain component, and 0 otherwise. □

Therefore one may count the number of connected components by computing the eigenvectors of M corresponding to eigenvalue 0, and counting the number m of those vectors $v_i$ whose components take few discrete values (see Figure 6). Each index i may be assigned to a group by clustering based on the values of $v_1, \dots, v_m$.

Figure 4: (Left) A sample from a data set of N = 1000, 40 by 40 grayscale images, each one thought of as a point in a 1600-dimensional vector space. In each image, a slightly blurred line separates a dark from a bright portion. The orientation of the line and its distance from the center of the image are variable. (Middle) The non-zero eigenvalues of M. LLE is performed with K = 20. The 2nd and 3rd smallest eigenvalues are of smaller size than the others, giving an upper bound of 2 on the intrinsic dimension of the data set. (Right) The 2-dimensional LLE representation. The polar coordinates, after rescaling, are the distance of the dividing line from the center and its orientation.

Figure 5: The data set is analogous to the one used above (N = 1000, 40 by 40 grayscale images, LLE performed with K = 20).
The orientation of the line dividing the dark from the bright portion is now only allowed to vary in two disjoint intervals. (Middle) The non-zero eigenvalues of M. (Left and Right) The 3rd and 5th (resp. 4th and 6th) eigenvectors of M are used for the LLE representation of the first (resp. the second) K-component.

Figure 6: (Left) The last six eigenvectors of M for the broken parabola of Figure 1, shown, top to bottom, in reverse order of magnitude of the corresponding eigenvalue. The x axis is associated to the index i. (Right) The eigenvalues of the same matrix (log scale). Notice that the last six are practically zero. The eigenvectors corresponding to the three last eigenvalues have discrete values, indicating that the data are split in three groups. There are z = 6 zero eigenvalues, indicating that the dimension of the data is $d \le z/m - 1 = 1$.

In the Appendix (A) we show that such a process is robust with respect to numerical noise. It is also robust to small perturbations of the block-diagonal structure of M (see Figure 7). This makes the use of LLE for grouping purposes convenient. Should the K-connected components be completely separated, the partition could also be obtained via a more efficient graph-search algorithm.

The proof is carried out for ordered indices as in Fig. 3, but it is invariant under index permutation.

The analysis of Proposition 1 may be extended to the dimension of each of the m groups according to Proposition 2. Therefore, in the ideal case, we will find z zero eigenvalues of M which, together with the number m obtained by counting the discrete-valued eigenvectors, may be used to estimate the maximal d using $z \ge m(d + 1)$. This behavior may be observed experimentally, see Figures 6 and 5.
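Propositions 1 and 2 together suggest a joint procedure: build M, count its near-zero eigenvalues z, read the groups off the corresponding eigenspace, and bound the dimension by z/m - 1. The numpy sketch below is our own illustrative realization, not the authors' implementation: the tolerance values, the ridge regularization of the local Gram matrices, and the use of the kernel projector $U U^T$ (whose entries vanish between different blocks of M, so its significantly nonzero entries connect only points in the same group) are all assumptions we introduce.

```python
import numpy as np

def lle_groups_and_dim(X, K, tol=1e-8):
    """Sketch of joint grouping / dimension bounding via the kernel of M.

    Returns (labels, z, d_max): a group label per point, the number z of
    near-zero eigenvalues of M, and the bound d_max = z / m - 1.
    """
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:K + 1]
        Z = X[nbrs] - X[i]
        C = Z @ Z.T
        C = C + np.eye(K) * 1e-9 * (np.trace(C) + 1.0)  # stabilizing ridge
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs] = w / w.sum()
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    z = int(np.sum(vals < tol))   # number of practically-zero eigenvalues
    U = vecs[:, :z]
    P = U @ U.T                   # projector onto the kernel of M
    # Group = connected component of the graph "|P_ij| significantly > 0";
    # for block-diagonal M the projector has no cross-block entries.
    labels = -np.ones(N, dtype=int)
    current = 0
    for seed in range(N):
        if labels[seed] >= 0:
            continue
        stack = [seed]
        while stack:
            i = stack.pop()
            if labels[i] >= 0:
                continue
            labels[i] = current
            stack.extend(np.flatnonzero(np.abs(P[i]) > 1e-6).tolist())
        current += 1
    m = current
    return labels, z, z // m - 1  # d <= z/m - 1
```

On two well-separated straight segments (locally flat, so each block contributes its indicator vector plus its coordinate vector to the kernel), the sketch recovers the two groups and a dimension bound of 1; using the basis-independent projector rather than the raw eigenvectors sidesteps the arbitrary mixing that `eigh` may perform inside a degenerate zero eigenspace.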
5 Conclusions

We have examined two difficulties of the Locally Linear Embedding method [2] and shown that, in a neighborhood of ideal conditions, they may be solved by a careful examination of the eigenvectors of the matrix M that are associated to very small eigenvalues.

More specifically: the number of groups into which the data are partitioned corresponds to the number of discrete-valued eigenvectors, while the maximal dimension d of the low-dimensional embedding may be obtained by dividing the number of small eigenvalues by m and subtracting 1.

Both the groups and the low-dimensional embedding coordinates may be computed from the components of such eigenvectors.

Our algorithms have mainly been tested on synthetically generated data. Further investigation on real data sets is necessary in order to validate our theoretical results.

Figure 7: (Left) 2D data $X_i$ distributed along a broken parabola. Nevertheless, for K = 14, the components are not completely K-disconnected (a different symbol is used for the neighbors of the leftmost point on the rightmost component). (Right) The set of eigenvalues of M. A set of two almost-zero eigenvalues and a set of two of small size are visible.

References

[1] C. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, (1995).
[2] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290, pp. 2323-2326, (2000).
[3] J. Tenenbaum, V. de Silva, and J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, 290, pp. 2319-2323, (2000).

A Appendix

In Proposition 2 of Section 4 we proved that during the LLE procedure we can automatically detect the number of K-connected components, in case there is no noise.
Similarly, in Proposition 1 of Section 3 we proved that under ideal conditions (no noise, locally flat data) we can determine an estimate for the intrinsic dimension of the data. Our next goal is to establish a certain robustness of these results when there is numerical noise, when the components are not completely separated, or when the data are not exactly locally flat.

In general, suppose we have a non-degenerate matrix A, and an orthonormal basis of eigenvectors $v_1, \dots, v_m$, with eigenvalues $\lambda_1, \dots, \lambda_m$. As a consequence of a small perturbation of the matrix into $A + dA$, we will have eigenvectors $v_i + dv_i$ with eigenvalues $\lambda_i + d\lambda_i$. The unit-norm constraint ensures that $dv_i$ is orthogonal to $v_i$ and can therefore be written as $dv_i = \sum_{k \ne i} \alpha_{ik} v_k$. Using again the orthonormality, one can derive expressions for the perturbations of $\lambda_i$ and $v_i$:

$$d\lambda_i = \langle v_i, dA\, v_i \rangle$$
$$\alpha_{ij} (\lambda_i - \lambda_j) = \langle v_j, dA\, v_i \rangle.$$

This shows that if the perturbation dA has order $\epsilon$, then the perturbations $d\lambda_i$ and $\alpha_{ij}$ are also of order $\epsilon$. Notice that we are not interested in perturbations $\alpha_{ij}$ within the eigenspace of eigenvalue 0, but rather in those orthogonal to it, and therefore $\lambda_i \ne \lambda_j$.
", "award": [], "sourceid": 2033, "authors": [{"given_name": "Marzia", "family_name": "Polito", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}]}