{"title": "Sparse Manifold Clustering and Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 55, "page_last": 63, "abstract": "We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.", "full_text": "Sparse Manifold Clustering and Embedding\n\nEhsan Elhamifar\n\nCenter for Imaging Science\nJohns Hopkins University\nehsan@cis.jhu.edu\n\nRen\u00b4e Vidal\n\nCenter for Imaging Science\nJohns Hopkins University\nrvidal@cis.jhu.edu\n\nAbstract\n\nWe propose an algorithm called Sparse Manifold Clustering and Embedding\n(SMCE) for simultaneous clustering and dimensionality reduction of data lying\nin multiple nonlinear manifolds. Similar to most dimensionality reduction meth-\nods, SMCE \ufb01nds a small neighborhood around each data point and connects each\npoint to its neighbors with appropriate weights. 
The key difference is that SMCE\n\ufb01nds both the neighbors and the weights automatically. This is done by solving\na sparse optimization problem, which encourages selecting nearby points that lie\nin the same manifold and approximately span a low-dimensional af\ufb01ne subspace.\nThe optimal solution encodes information that can be used for clustering and di-\nmensionality reduction using spectral clustering and embedding. Moreover, the\nsize of the optimal neighborhood of a data point, which can be different for dif-\nferent points, provides an estimate of the dimension of the manifold to which the\npoint belongs. Experiments demonstrate that our method can effectively handle\nmultiple manifolds that are very close to each other, manifolds with non-uniform\nsampling and holes, as well as estimate the intrinsic dimensions of the manifolds.\n\n1\n\nIntroduction\n\n1.1 Manifold Embedding\n\nIn many areas of machine learning, pattern recognition, information retrieval and computer vision,\nwe are confronted with high-dimensional data that lie in or close to a manifold of intrinsically low-\ndimension. In this case, it is important to perform dimensionality reduction, i.e., to \ufb01nd a compact\nrepresentation of the data that unravels their few degrees of freedom.\nThe \ufb01rst step of most dimensionality reduction methods is to build a neighborhood graph by con-\nnecting each data point to a \ufb01xed number of nearest neighbors or to all points within a certain radius\nof the given point. Local methods, such as LLE [1], Hessian LLE [2] and Laplacian eigenmaps\n(LEM) [3], try to preserve local relationships among points by learning a set of weights between\neach point and its neighbors. Global methods, such as Isomap [4], Semide\ufb01nite embedding [5],\nMinimum volume embedding [6] and Structure preserving embedding [7], try to preserve local and\nglobal relationships among all data points. 
Both categories of methods find the low-dimensional representation\nof the data from a few eigenvectors of a matrix related to the learned weights between\npairs of points.\nFor both local and global methods, a proper choice of the neighborhood size used to build the\nneighborhood graph is critical. Specifically, a small neighborhood size may not capture sufficient\ninformation about the manifold geometry, especially when it is smaller than the intrinsic dimension\nof the manifold. On the other hand, a large neighborhood size could violate the principles used to\ncapture information about the manifold. Moreover, the curvature of the manifold and the density of\nthe data points may be different in different regions of the manifold, hence using a fixed neighborhood\nsize may be inappropriate.\n\n1.2 Manifold Clustering\n\nIn many real-world problems, the data lie in multiple manifolds of possibly different dimensions.\nThus, to find a low-dimensional embedding of the data, one needs to first cluster the data according\nto the underlying manifolds and then find a low-dimensional representation for the data in each\ncluster. Since the manifolds can be very close to each other and they can have arbitrary dimensions,\ncurvature and sampling, the manifold clustering and embedding problem is very challenging.\nThe particular case of clustering data lying in multiple flat manifolds (subspaces) is well studied and\nnumerous algorithms have been proposed (see e.g., the tutorial [8]). However, such algorithms take\nadvantage of the global linear relations among data points in the same subspace, hence they cannot\nhandle nonlinear manifolds. Other methods assume that the manifolds have different intrinsic\ndimensions and cluster the data according to the dimensions rather than the manifolds themselves\n[9, 10, 11, 12, 13]. However, in many real-world problems this assumption is violated. 
Moreover,\nestimating the dimension of a manifold from a point cloud is a very dif\ufb01cult problem on its own.\nWhen manifolds are densely sampled and suf\ufb01ciently separated, existing dimensionality reduction\nalgorithms such as LLE can be extended to perform clustering before the dimensionality reduction\nstep [14, 15, 16]. More precisely, if the size of the neighborhood used to build the similarity graph\nis chosen to be small enough not to include points from other manifolds and large enough to capture\nthe local geometry of the manifold, then the similarity graph will have multiple connected compo-\nnents, one per manifold. Therefore, spectral clustering methods can be employed to separate the\ndata according to the connected components. However, as we will see later, \ufb01nding the right neigh-\nborhood size is in general dif\ufb01cult, especially when manifolds are close to each other. Moreover, in\nsome cases one cannot \ufb01nd a neighborhood that contains only points from the same manifold.\n\n1.3 Paper Contributions\n\nIn this paper, we propose an algorithm, called SMCE, for simultaneous clustering and embedding\nof data lying in multiple manifolds. To do so, we use the geometrically motivated assumption that\nfor each data point there exists a small neighborhood in which only the points that come from the\nsame manifold lie approximately in a low-dimensional af\ufb01ne subspace. We propose an optimization\nprogram based on sparse representation to select a few neighbors of each data point that span a\nlow-dimensional af\ufb01ne subspace passing near that point. As a result, a few nonzero elements of the\nsolution indicate the points that are on the same manifold, hence they can be used for clustering. In\naddition, the weights associated to the chosen neighbors indicate their distances to the given data\npoint, which can be used for dimensionality reduction. 
Thus, unlike conventional methods that\nfirst build a neighborhood graph and then extract information from it, our method simultaneously\nbuilds the neighborhood graph and obtains its weights. This leads to successful results even in\nchallenging situations where the nearest neighbors of a point come from other manifolds. Clustering\nand embedding of the data into lower dimensions follows by taking the eigenvectors of the matrix\nof weights and its submatrices, which are sparse and hence can be stored and operated on efficiently.\nThanks to the sparse representations obtained by SMCE, the number of neighbors of the data points\nin each manifold reflects the intrinsic dimensionality of the underlying manifold. Finally, SMCE\nhas only one free parameter that, for a large range of variation, results in a stable clustering and\nembedding, as the experiments will show. To the best of our knowledge, SMCE is the only algorithm\nproposed to date that allows robust automatic selection of neighbors and simultaneous clustering and\ndimensionality reduction in a unified manner.\n\n2 Proposed Method\nAssume we are given a collection of N data points $\\{x_i \\in \\mathbb{R}^D\\}_{i=1}^N$ lying in n different manifolds\n$\\{\\mathcal{M}_l\\}_{l=1}^n$ of intrinsic dimensions $\\{d_l\\}_{l=1}^n$. In this section, we consider the problem of simultaneously\nclustering the data according to the underlying manifolds and obtaining a low-dimensional\nrepresentation of the data points within each cluster.\nWe approach this problem using a spectral clustering and embedding algorithm. Specifically, we\nbuild a similarity graph whose nodes represent the data points and whose edges represent the similarity\nbetween data points. The fundamental challenge is to decide which nodes should be connected\nand how. To do clustering, we wish to connect each point to other points from the same manifold. 
To do dimensionality reduction, we wish to connect each point to neighboring points with appropriate\nweights that reflect the neighborhood information. To simultaneously pursue both goals, we wish to\nselect neighboring points from the same manifold.\n\nFigure 1: For x1 \u2208 M1, the smallest neighborhood containing points from M1 also contains points from\nM2. However, only the neighbors in M1 span a 1-dimensional subspace around x1.\n\nWe address this problem by formulating an optimization algorithm based on sparse representation.\nThe underlying assumption behind the proposed method is that each data point has a small neighborhood\nin which the minimum number of points that span a low-dimensional affine subspace passing\nnear that point is given by the points from the same manifold. More precisely:\nAssumption 1 For each data point xi \u2208 Ml consider the smallest ball Bi \u2282 RD that contains the\ndl + 1 nearest neighbors of xi from Ml. Let the neighborhood Ni be the set of all data points in\nBi excluding xi. In general, this neighborhood contains points from Ml as well as other manifolds.\nWe assume that for all i there exists \u03f5 \u2265 0 such that the nonzero entries of the sparsest solution of\n\n$$\\Big\\| \\sum_{j \\in \\mathcal{N}_i} c_{ij} (x_j - x_i) \\Big\\|_2 \\le \\epsilon \\quad \\text{and} \\quad \\sum_{j \\in \\mathcal{N}_i} c_{ij} = 1 \\qquad (1)$$\n\ncorrespond to the dl + 1 neighbors of xi from Ml. In other words, among all affine subspaces\nspanned by subsets of the points $\\{x_j\\}_{j \\in \\mathcal{N}_i}$ and passing near xi up to \u03f5 error, the one of lowest\ndimension has dimension dl and it is spanned by the dl + 1 neighbors of xi from Ml.\nIn the limiting case of densely sampled data, this affine subspace coincides with the dl-dimensional\ntangent space of Ml at xi. To illustrate this, consider the two manifolds shown in Figure 1 and\nassume that points x4, x5 and x6 are closer to x1 than x2 or x3. 
Then any small ball centered at\nx1 \u2208 M1 that contains x2 and x3 will also contain points x4, x5 and x6. In this case, among affine\nspans of all possible choices of 2 points in this neighborhood, the one corresponding to x2 and x3\nis the closest one to x1, and is also close to the tangent space of M1 at x1. On the other hand, the\naffine span of any choice of 3 or more data points in the neighborhood always passes through x1.\nHowever, this requires a linear combination of more than 2 data points.\n\n2.1 Optimization Algorithm\n\nOur goal is to propose a method that selects, for each data point xi, a few neighbors that lie in the\nsame manifold. If the neighborhood Ni is known and of relatively small size, one can search for the\nminimum number of points that satisfy (1). However, Ni is not known a priori and searching for\na few data points in Ni that satisfy (1) becomes more computationally complex as the size of the\nneighborhood increases. To tackle this problem, we let the size of the neighborhood be arbitrary.\nHowever, by using a sparse optimization program, we bias the method to select a few data points\nthat are close to xi and span a low-dimensional affine subspace passing near xi.\nConsider a point xi in the dl-dimensional manifold Ml and consider the set of points $\\{x_j\\}_{j \\neq i}$. It\nfollows from Assumption 1 that, among these points, the ones that are neighbors of xi in Ml span\na dl-dimensional affine subspace of RD that passes near xi. In other words,\n\n$$\\big\\| [\\, x_1 - x_i \\;\\; \\cdots \\;\\; x_N - x_i \\,] \\, c_i \\big\\|_2 \\le \\epsilon \\quad \\text{and} \\quad \\mathbf{1}^\\top c_i = 1 \\qquad (2)$$\n\nhas a solution ci whose dl + 1 nonzero entries correspond to dl + 1 neighbors of xi in Ml.\nNotice that after relaxing the size of the neighborhood, the solution ci that uses the minimum number\nof data points, i.e., the solution ci with the smallest number of nonzero entries, may no longer be\nunique. 
In the example of Figure 1, for instance, a solution of (2) with two nonzero entries can\ncorrespond to an affine combination of x2 and x3 or an affine combination of x2 and xp. To bias\nthe solutions of (2) to the one that corresponds to the closest neighbors of xi in Ml, we set up an\noptimization program whose objective function favors selecting a few neighbors of xi subject to the\nconstraint in (2), which enforces selecting points that approximately lie in an affine subspace at xi.\nBefore that, it is important to decouple the goal of selecting a few neighbors from that of spanning\nan affine subspace. To do so, we normalize the vectors $\\{x_j - x_i\\}_{j \\neq i}$ and let\n\n$$X_i \\triangleq \\left[ \\frac{x_1 - x_i}{\\|x_1 - x_i\\|_2} \\;\\; \\cdots \\;\\; \\frac{x_N - x_i}{\\|x_N - x_i\\|_2} \\right] \\in \\mathbb{R}^{D \\times (N-1)}. \\qquad (3)$$\n\nIn this way, for a small \u03b5, the locations of the nonzero entries of any solution ci of $\\|X_i c_i\\|_2 \\le \\varepsilon$ do\nnot depend on whether the selected points are close to or far from xi. Now, among all the solutions\nof $\\|X_i c_i\\|_2 \\le \\varepsilon$ that satisfy $\\mathbf{1}^\\top c_i = 1$, we look for the one that uses a few closest neighbors of\nxi. To that end, we consider an objective function that penalizes points based on their proximity to\nxi. That is, points that are closer to xi get lower penalty than points that are farther away. We thus\nconsider the following weighted $\\ell_1$-optimization program\n\n$$\\min \\|Q_i c_i\\|_1 \\quad \\text{subject to} \\quad \\|X_i c_i\\|_2 \\le \\varepsilon, \\quad \\mathbf{1}^\\top c_i = 1, \\qquad (4)$$\n\nwhere the $\\ell_1$-norm promotes sparsity of the solution [17] and the proximity inducing matrix Qi,\nwhich is a positive-definite diagonal matrix, favors selecting points that are close to xi. Note that\nthe elements of Qi should be chosen such that the points that are closer to xi have smaller weights,\nallowing the assignment of nonzero coefficients to them. 
Conversely, the points that are farther from\nxi should have larger weights, favoring the assignment of zero coefficients to them. A simple choice\nof the proximity inducing matrix is to select the diagonal elements of Qi to be\n$\\frac{\\|x_j - x_i\\|_2}{\\sum_{t \\neq i} \\|x_t - x_i\\|_2} \\in (0, 1]$. Also, one can use other types of weights, such as exponential weights\n$\\frac{\\exp(\\|x_j - x_i\\|_2 / \\sigma)}{\\sum_{t \\neq i} \\exp(\\|x_t - x_i\\|_2 / \\sigma)}$, where \u03c3 > 0. However, the former choice of the weights, which is also tuning parameter free, works\nvery well in practice, as we will show later.\nAnother optimization program, which is related to (4) by the method of Lagrange multipliers, is\n\n$$\\min \\; \\lambda \\|Q_i c_i\\|_1 + \\frac{1}{2} \\|X_i c_i\\|_2^2 \\quad \\text{subject to} \\quad \\mathbf{1}^\\top c_i = 1, \\qquad (5)$$\n\nwhere the parameter \u03bb sets the trade-off between the sparsity of the solution and the affine reconstruction\nerror. Notice that this new optimization program, which also prefers sparse solutions, is\nsimilar to the Lasso optimization problem [18, 17]. The only modification is the introduction of the\naffine constraint $\\mathbf{1}^\\top c_i = 1$. As we will show in the next section, there is a wide range of values of\n\u03bb for which the optimization program in (5) successfully finds a sparse solution for each point from\nneighbors in the same manifold.\nNotice that, in sharp contrast to the nearest neighbors-based methods, which first fix the number\nof neighbors or the neighborhood radius and then compute the weights between points in each\nneighborhood, we do the two steps at the same time. In other words, the optimization programs (4)\nand (5) automatically choose a few neighbors of the given data point, which approximately span\na low-dimensional affine subspace at that point. 
In addition, by the definition of Qi and Xi, the\nsolutions of the optimization programs (4) and (5) are invariant with respect to a global rotation,\ntranslation, and scaling of the data points.\n\n2.2 Clustering and Dimensionality Reduction\n\nBy solving the proposed optimization programs for each data point, we obtain the necessary\ninformation for clustering and dimensionality reduction. This is because the solution $c_i^\\top \\triangleq [c_{i1} \\; \\cdots \\; c_{iN}]$\nof the proposed optimization programs satisfies\n\n$$\\sum_{j \\neq i} \\frac{c_{ij}}{\\|x_j - x_i\\|_2} (x_j - x_i) \\approx 0. \\qquad (6)$$\n\nHence, we can rewrite $x_i \\approx [x_1 \\; x_2 \\; \\cdots \\; x_N] \\, w_i$, where the weight vector $w_i^\\top \\triangleq [w_{i1} \\; \\cdots \\; w_{iN}] \\in \\mathbb{R}^N$\nassociated to the i-th data point is defined as\n\n$$w_{ii} \\triangleq 0, \\qquad w_{ij} \\triangleq \\frac{c_{ij} / \\|x_j - x_i\\|_2}{\\sum_{t \\neq i} c_{it} / \\|x_t - x_i\\|_2}, \\quad j \\neq i. \\qquad (7)$$\n\nThe indices of the few nonzero elements of wi, ideally, correspond to neighbors of xi in the same\nmanifold and their values indicate their (inverse) distances to xi.\nNext, we use the weights wi to perform clustering and dimensionality reduction. We do so by\nbuilding a similarity graph G = (V, E) whose nodes represent the data points. We connect each\nnode i, corresponding to xi, to the node j, corresponding to xj, with an edge whose weight is equal\nto |wij|. While, potentially, every node can get connected to all other nodes, because of the sparsity\nof wi, each node i connects itself to only a few other nodes that correspond to the neighbors of xi\nin the same manifold. We call such neighbors sparse neighbors. 
In addition, the distances of the\nsparse neighbors to xi are reflected in the weights |wij|.\nThe similarity graph built in this way has ideally several connected components, where points in\nthe same manifold are connected to each other and there is no connection between two points in\ndifferent manifolds. In other words, the similarity matrix of the graph has ideally the following form\n\n$$W \\triangleq [\\, |w_1| \\;\\; \\cdots \\;\\; |w_N| \\,] = \\begin{bmatrix} W_{[1]} & 0 & \\cdots & 0 \\\\\\\\ 0 & W_{[2]} & \\cdots & 0 \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\\\\\ 0 & 0 & \\cdots & W_{[n]} \\end{bmatrix} \\Gamma, \\qquad (8)$$\n\nwhere $W_{[l]}$ is the similarity matrix of the data points in Ml and $\\Gamma \\in \\mathbb{R}^{N \\times N}$ is an unknown\npermutation matrix. Clustering of the data follows by applying spectral clustering [19] to W.\u00b9 One\ncan also determine the number of connected components by analyzing the eigenspectrum of the\nLaplacian matrix [20].\nAny of the existing dimensionality reduction techniques can be applied to the data in each cluster to\nobtain a low-dimensional representation of the data in the corresponding manifold. However, this\nwould require new computation of neighborhoods and weights. On the other hand, the similarity\ngraph built by our method has a locality preserving property by the definition of the weights. Thus,\nwe can use the adjacency matrix, $W_{[i]}$, of the i-th cluster as a similarity between points in the\ncorresponding manifold and obtain a low-dimensional embedding of the data by taking the last few\neigenvectors of the normalized Laplacian matrix associated to $W_{[i]}$ [3]. 
Note that there are other\nways of inferring the low-dimensional embedding of the data in each cluster along the lines of [21]\nand [1], which are beyond the scope of the current paper.\n\n2.3 Intrinsic Dimension Information\n\nAn advantage of the proposed sparse optimization algorithm is that it provides information about the\nintrinsic dimension of the manifolds. This comes from the fact that a data point xi \u2208 Ml and its\nneighbors in Ml lie approximately in the dl-dimensional tangent space of Ml at xi. Since dl + 1\nvectors in this tangent space are linearly dependent, the solution ci of the proposed optimization\nprograms is expected to have dl + 1 nonzero elements. As a result, we can obtain information about\nthe intrinsic dimension of the manifolds in the following way. Let \u2126l denote the set of indices of\npoints that belong to the l-th cluster. For each point in \u2126l, we sort the elements of |ci| from the\nlargest to the smallest and denote the new vector as $c_{s,i}$. We define the median sparse coefficient\nvector of the l-th cluster as\n\n$$\\mathrm{msc}(l) = \\mathrm{median}\\{c_{s,i}\\}_{i \\in \\Omega_l}, \\qquad (9)$$\n\nwhose j-th element is computed as the median of the j-th elements of the vectors $\\{c_{s,i}\\}_{i \\in \\Omega_l}$. Thus,\nthe number of nonzero elements of msc(l) or, more practically, the number of elements with relatively\nhigh magnitude, gives an estimate of the intrinsic dimension of the l-th manifold plus one.\u00b2\nAn advantage of our method is that it allows us to have a different neighborhood size for each data\npoint, depending on the local dimension of its underlying manifold at that point. For example, in the\ncase of two manifolds of dimensions d1 = 2 and d2 = 30, for data points in the l-th manifold we\nautomatically obtain solutions with dl + 1 nonzero elements. 
On the other hand, methods that fix\nthe number of neighbors run into trouble because the number of neighbors would be too small for\none manifold or too large for the other manifold.\n\n\u00b9 Note that a symmetric adjacency matrix can be obtained by taking $W = \\max(W, W^\\top)$.\n\u00b2 One can also use the mean of the sorted coefficients in each cluster to compute the dimension of each\nmanifold. However, we prefer to use the median for robustness reasons.\n\nFigure 2: Top: embedding of a punctured sphere and the msc vectors obtained by SMCE for different values\nof \u03bb. Bottom: embedding obtained by LLE and LEM for different values of K.\n\nFigure 3: Clustering and embedding for two trefoil-knots. Left: original manifolds. Middle: embedding and\nmsc vectors obtained by SMCE. Right: clustering and embedding obtained by LLE.\n\n3 Experiments\n\nIn this section, we evaluate the performance of SMCE on a number of synthetic and real experiments.\nFor all the experiments, we use the optimization program (5), where we typically set \u03bb = 10.\nHowever, the clustering and embedding results obtained by SMCE are stable for \u03bb \u2208 [1, 200]. Since\nthe weighted $\\ell_1$-optimization does not select the points that are very far from the given point, we\nconsider only L < N \u2212 1 neighbors of each data point in the optimization program, where we\ntypically set L = N/10. As in the case of nearest neighbors-based methods, there is no guarantee\nthat the points in the same manifold form a single connected component of the similarity graph built\nby SMCE. However, this has always been the case in our experiments, as we will show next.\n\n3.1 Experiments with Synthetic Data\n\nManifold Embedding. We first evaluate SMCE for the dimensionality reduction task only. 
We\nsample N = 1, 000 data points from a 2-sphere, where a neighborhood of its north pole is excluded.\nWe then embed the data in R100, add small Gaussian white noise to it and apply SMCE for \u03bb \u2208\n{0.1, 1, 10, 100}. Figure 2 shows the embedding results of SMCE in a 2 dimensional Euclidean\nspace. The three large elements of the msc vector for different values of \u03bb correctly re\ufb02ect the fact\nthat the sphere has dimension two. However, note that for very large values of \u03bb the performance\nof the embedding degrades since we put more emphasis on the sparsity of the solution. The results\nin the bottom of Figure 2 show the embeddings obtained by LLE and LEM for K = 5 and K =\n20 nearest neighbors. Notice that, for K = 20, nearest neighbor-based methods obtain similar\nembedding results to those of SMCE, while for K = 5 they obtain poor embedding results. This\nsuggests that the principle used by SMCE to select the neighbors is very effective: it chooses very\nfew neighbors that are very informative for dimensionality reduction.\nManifold Clustering and Embedding. Next, we consider the challenging case where the mani-\nfolds are close to each other. We consider two trefoil-knots, shown in Figure 3, which are embedded\nin R100 and are corrupted with small Gaussian white noise. The data points are sampled such that\namong the 2 nearest neighbors of 1% of the data points there are points from the other manifold.\nAlso, among the 3 and 5 nearest neighbors of 9% and 18% of the data points, respectively, there\nare points from the other manifold. For such points, the nearest neighbors-based methods will con-\nnect them to nearby points in the other manifold and assign large weights to the connection. As a\nresult, these methods cannot obtain a proper clustering or a successful embedding. 
Table 1 shows\nthe misclassification rates of LLE and LEM for different numbers of nearest neighbors K as well as\nthe misclassification rates of SMCE for different values of \u03bb. While there is no K for which we can\nsuccessfully cluster the data using LLE and LEM, for a wide range of \u03bb, SMCE obtains a perfect\nclustering. Figure 3 shows the results of SMCE for \u03bb = 10 and LLE for K = 3. As the results\nshow, enforcing that the neighbors of a point from the same manifold span a low-dimensional affine\nsubspace helps to select neighbors from the correct manifold and not from the other manifolds. This\nresults in successful clustering and embedding of the data as well as unraveling the dimensions of\nthe manifolds. On the other hand, the fact that LLE and LEM choose wrong neighbors results in a\nlow quality embedding.\n\nTable 1: Misclassification rates for LLE and LEM as a function of K and for SMCE as a function of \u03bb.\n\nK     2      3      4      5      6      8      10\nLLE   15.5%  9.5%   16.5%  13.5%  16.5%  37.5%  38.5%\nLEM   15.5%  13.5%  17.5%  14.5%  28.5%  28.5%  13.5%\n\n\u03bb     0.1    1      10     50     70     100    200\nSMCE  15.5%  6.0%   0.0%   0.0%   0.0%   0.0%   0.0%\n\nTable 2: Percentage of data points whose K nearest neighbors contain points from the other manifold.\n\nK     1      2      3      4      7      10\n      3.9%   10.2%  23.4%  35.2%  57.0%  64.8%\n\n3.2 Experiments with Real Data\n\nIn this section, we examine the performance of SMCE on real datasets. We show that challenges\nsuch as manifold proximity and non-uniform sampling are also common in real data sets, and that\nour algorithm is able to handle these issues effectively.\nFirst, we consider the problem of clustering and embedding of face images of two different subjects\nfrom the Extended Yale B database [22]. 
Each subject has 64 images of 192 \u00d7 168 pixels captured\nunder a \ufb01xed pose and expression and with varying illuminations. By applying SMCE with \u03bb =\n10 on almost 33, 000-dimensional vectorized faces, we obtain a misclassi\ufb01cation rate of 2.34%,\nwhich corresponds to wrongly clustering 3 out of the 128 data points. Figure 4, top row, shows the\nembeddings obtained by SMCE, LLE and LEM for the whole data prior to clustering. Only SMCE\nreasonably separates the low-dimensional representation of face images according to the subjects.\nNote that in this experiment, the space of face images under varying illumination is not densely\nsampled and in addition the two manifolds are very close to each other. Table 2 shows the percentage\nof points in the dataset whose K nearest neighbors contain points from the other manifold. As the\ntable shows, there are several points whose closest neighbor comes from the other manifold. Beside\nthe embedding of each method in Figure 4 (top row), we have shown the coef\ufb01cient vector of a\ndata point in M1 whose closest neighbor comes from M2. While nearest-neighbor-based methods\npick the wrong neighbors with strong weights, SMCE successfully selects sparse neighbors from the\ncorrect manifold. The plots in the bottom of Figure 4 show the embedding obtained by SMCE for\neach cluster. As we move along the horizontal axis, the direction of the light source changes from\nleft to right, while as we move along the vertical axis, the overall darkness of the images changes\nfrom light to dark. 
Also, the msc vectors suggest a 2-dimensionality of the face manifolds, correctly\nreflecting the number of degrees of freedom of the light source on the illumination rig, which is a\nsphere in R3.\nNext, we consider the dimensionality reduction of the images in the Frey face dataset, which consists of 1965\nface images captured under varying pose and expression. Each image is vectorized as a 560 element vector of\npixel intensities. Figure 5 shows the two-dimensional embedding obtained by SMCE. Note that the low-dimensional\nrepresentation captures well the left to right pose variations in the horizontal axis and the expression changes in the\nvertical axis.\n\nFigure 5: 2-D embedding of Frey face data using SMCE.\n\nFigure 4: Clustering and embedding of two faces. Top: 2-D embedding obtained by SMCE, LLE and LEM.\nThe weights associated to a data point from the first subject are shown beside the embedding. Bottom: SMCE\nembedding and msc vectors.\n\nFigure 6: Clustering and embedding of five digits from the MNIST dataset. Left: 2-D embedding obtained by\nSMCE for five digits {0, 3, 4, 6, 7}. Middle: 2-D embedding of the data in the first cluster that corresponds to\ndigit 3. Right: 2-D embedding of the data in the second cluster that corresponds to digit 6.\n\nFinally, we consider the clustering and dimensionality reduction of the digits from the MNIST test\ndatabase [23]. We use the images from five digits {0, 3, 4, 6, 7} in the dataset where we randomly\nselect 200 data points from each digit. The left plot in Figure 6 shows the joint embedding of the\nwhole data using SMCE. One can see that the data are reasonably well separated according to their\nclasses. 
The middle and the right plots in Figure 6 show the two-dimensional embedding obtained\nby SMCE for two data clusters, which correspond to the digits 3 and 6.\n\n4 Discussion\n\nWe proposed a new algorithm based on sparse representation for simultaneous clustering and dimensionality\nreduction of data lying in multiple manifolds. We used the solution of a sparse optimization\nprogram to build a similarity graph from which we obtained clustering and low-dimensional embedding\nof the data. The sparse representation of each data point ideally encodes information that can\nbe used for inferring the dimensionality of the underlying manifold around that point. Finding robust\nmethods for estimating the intrinsic dimension of the manifolds from the sparse coefficients and\ninvestigating theoretical guarantees under which SMCE works is the subject of our future research.\n\nAcknowledgment\nThis work was partially supported by grants NSF CNS-0931805, NSF ECCS-0941463 and NSF\nOIA-0941362.\n\nReferences\n[1] S. Roweis and L. Saul, \u201cNonlinear dimensionality reduction by locally linear embedding,\u201d Science, vol.\n290, no. 5500, pp. 2323\u20132326, 2000.\n\n[2] D. Donoho and C. Grimes, \u201cHessian eigenmaps: Locally linear embedding techniques for high-dimensional\ndata,\u201d National Academy of Sciences, vol. 100, no. 10, pp. 5591\u20135596, 2003.\n\n[3] M. Belkin and P. Niyogi, \u201cLaplacian eigenmaps and spectral techniques for embedding and clustering,\u201d\nin Neural Information Processing Systems, 2002, pp. 585\u2013591.\n\n[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford, \u201cA global geometric framework for nonlinear dimensionality\nreduction,\u201d Science, vol. 290, no. 5500, pp. 2319\u20132323, 2000.\n\n[5] K. Q. Weinberger and L. 
Saul, \u201cUnsupervised learning of image manifolds by semide\ufb01nite programming,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 988\u2013995.\n[6] B. Shaw and T. Jebara, \u201cMinimum volume embedding,\u201d in Arti\ufb01cial Intelligence and Statistics, 2007.\n[7] \u2014\u2014, \u201cStructure preserving embedding,\u201d in International Conference on Machine Learning, 2009.\n[8] R. Vidal, \u201cSubspace clustering,\u201d IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52\u201368, 2011.\n[9] D. Barbar\u00e1 and P. Chen, \u201cUsing the fractal dimension to cluster datasets,\u201d in KDD \u201900: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000, pp. 260\u2013264.\n[10] P. Mordohai and G. G. Medioni, \u201cUnsupervised dimensionality estimation and manifold learning in high-dimensional spaces by tensor voting,\u201d in International Joint Conference on Arti\ufb01cial Intelligence, 2005, pp. 798\u2013803.\n[11] A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas, \u201cDimension induced clustering,\u201d in KDD \u201905: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005, pp. 51\u201360.\n[12] E. Levina and P. J. Bickel, \u201cMaximum likelihood estimation of intrinsic dimension,\u201d in Neural Information Processing Systems, 2004.\n[13] G. Haro, G. Randall, and G. Sapiro, \u201cTranslated Poisson mixture model for strati\ufb01cation learning,\u201d International Journal of Computer Vision, 2008.\n[14] M. Polito and P. Perona, \u201cGrouping and dimensionality reduction by locally linear embedding,\u201d in Neural Information Processing Systems, 2002.\n[15] A. Goh and R. 
Vidal, \u201cSegmenting motions of different types by unsupervised manifold clustering,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n[16] \u2014\u2014, \u201cClustering and dimensionality reduction on Riemannian manifolds,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, 2008.\n[17] D. Donoho and X. Huo, \u201cUncertainty principles and ideal atomic decomposition,\u201d IEEE Trans. Information Theory, vol. 47, no. 7, pp. 2845\u20132862, Nov. 2001.\n[18] R. Tibshirani, \u201cRegression shrinkage and selection via the lasso,\u201d Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267\u2013288, 1996.\n[19] A. Ng, Y. Weiss, and M. Jordan, \u201cOn spectral clustering: analysis and an algorithm,\u201d in Neural Information Processing Systems, 2001, pp. 849\u2013856.\n[20] U. von Luxburg, \u201cA tutorial on spectral clustering,\u201d Statistics and Computing, vol. 17, 2007.\n[21] Z. Zhang and H. Zha, \u201cPrincipal manifolds and nonlinear dimensionality reduction via tangent space alignment,\u201d SIAM J. Sci. Comput., vol. 26, no. 1, pp. 313\u2013338, 2005.\n[22] K.-C. Lee, J. Ho, and D. Kriegman, \u201cAcquiring linear subspaces for face recognition under variable lighting,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684\u2013698, 2005.\n[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document recognition,\u201d in Proceedings of the IEEE, 1998, pp. 2278\u20132324.\n", "award": [], "sourceid": 65, "authors": [{"given_name": "Ehsan", "family_name": "Elhamifar", "institution": null}, {"given_name": "Ren\u00e9", "family_name": "Vidal", "institution": null}]}