{"title": "Clustering with the Connectivity Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": "", "full_text": "Clustering with the Connectivity Kernel\n\nBernd Fischer, Volker Roth and Joachim M. Buhmann\n\nInstitute of Computational Science\n\nSwiss Federal Institute of Technology Zurich\n\nCH-8092 Zurich, Switzerland\n\nfbernd.fischer, volker.roth,jbuhmanng@inf.ethz.ch\n\nAbstract\n\nClustering aims at extracting hidden structure in dataset. While the prob-\nlem of \ufb01nding compact clusters has been widely studied in the litera-\nture, extracting arbitrarily formed elongated structures is considered a\nmuch harder problem. In this paper we present a novel clustering algo-\nrithm which tackles the problem by a two step procedure: \ufb01rst the data\nare transformed in such a way that elongated structures become compact\nones. In a second step, these new objects are clustered by optimizing a\ncompactness-based criterion. The advantages of the method over related\napproaches are threefold: (i) robustness properties of compactness-based\ncriteria naturally transfer to the problem of extracting elongated struc-\ntures, leading to a model which is highly robust against outlier objects;\n(ii) the transformed distances induce a Mercer kernel which allows us\nto formulate a polynomial approximation scheme to the generally NP-\nhard clustering problem; (iii) the new method does not contain free kernel\nparameters in contrast to methods like spectral clustering or mean-shift\nclustering.\n\nIntroduction\n\n1\nClustering or grouping data is an important topic in machine learning and pattern recog-\nnition research. Among various possible grouping principles, those methods which try to\n\ufb01nd compact clusters have gained particular importance. Presumingly the most prominent\nmethod of this kind is the K-means clustering for vectorial data [6]. 
Despite the powerful modeling capabilities of compactness-based clustering methods, they mostly fail in finding elongated structures. The fast single linkage algorithm [9] is the most frequently used algorithm for finding elongated structures, but it is known to be very sensitive to outliers in the dataset. Mean shift clustering [3], another method of this class, is capable of extracting elongated clusters only if all modes of the underlying probability distribution have one single maximum. Furthermore, a suitable kernel bandwidth parameter has to be preselected [2]. Spectral clustering [10] shows good performance in many cases, but the algorithm has only been analyzed for special input instances, while a complete analysis of the algorithm is still missing. Concerning the preselection of a suitable kernel width, spectral clustering suffers from similar problems as mean shift clustering.

In this paper we present an alternative method for clustering elongated structures. Apart from the number of clusters, it is a completely parameter-free grouping principle. We build on the work on path-based clustering [7]. For a slight modification of the original problem we show that the defined path distance induces a kernel matrix fulfilling Mercer's condition. After the computation of the path-based distance, the compactness-based pairwise clustering principle is used to partition the data. While no approximation algorithms are known for the general NP-hard pairwise clustering problem, we present a polynomial time approximation scheme (PTAS) for our special case with path-based distances. The Mercer property of these distances allows us to embed the data in an (n − 1)-dimensional vector space even for non-metric input graphs. In this vector space, pairwise clustering reduces to minimizing the K-means cost function in (n − 1) dimensions [13].
For the latter problem, however, there exists a PTAS [11].

In addition to this theoretical result, we also present an efficient practical algorithm resorting to a 2-approximation algorithm which is based on kernel PCA. Our experiments suggest that kernel PCA effectively reduces the noise in the data while preserving the coarse cluster structure. Our method is compared to spectral clustering and mean shift clustering on selected artificial datasets. In addition, the performance is demonstrated on the USPS handwritten digits dataset.

2 Clustering by Connectivity

The main idea of our clustering criterion is to transform elongated structures into compact ones in a preprocessing step. Given the transformed data, we then infer a clustering solution by optimizing a compactness-based criterion. The advantage of circumventing the problem of directly finding connected (elongated) regions in the data, as e.g. in the spanning tree approach, is the following: while spanning tree algorithms are extremely sensitive to outliers, the two-step procedure may benefit from the statistical robustness of certain compactness-based methods.
Concerning the general case of datasets which are not given in a vector space, but only characterized by pairwise dissimilarities, the pairwise clustering model has been shown to be robust against outliers in the dataset [12]. It may, thus, be a natural choice to formulate the second step as searching for the partition vector c ∈ {1, …, K}ⁿ that minimizes the pairwise clustering cost function

H^PC(c, D) = Σ_{ν=1}^{K} (1/n_ν) Σ_{i: c_i = ν} Σ_{j: c_j = ν} d_ij ,   (1)

where K denotes the number of clusters, n_ν = |{i : c_i = ν}| denotes the number of objects in cluster ν, and d_ij is the pairwise "effective" dissimilarity between objects i and j as computed by a preprocessing step.

The idea of this preprocessing step is to define distances between objects by considering certain paths through the total object set. The natural formalization of such path problems is to represent the objects as a graph: consider a connected graph G = (V, E, d⁰) with n vertices (the objects) and symmetric nonnegative edge weights d⁰_ij on the edge (i, j) (the original dissimilarities). Let us denote by P_ij all paths from vertex i to vertex j. In order to make those objects more similar which are connected by "bridges" of other objects, we define for each path p ∈ P_ij the effective dissimilarity d^p_ij between i and j connected by p as the maximum weight on this path, i.e. the "weakest link" on this path. The total dissimilarity between vertices i and j is then defined as the minimum of all path-specific effective dissimilarities d^p_ij:

d_ij := min_{p ∈ P_ij} max_{1 ≤ h ≤ |p|−1} d⁰_{p[h] p[h+1]} .   (2)

Figure 1 illustrates the definition of the effective dissimilarity. If the objects are in the same cluster, their pairwise effective dissimilarities will be small (fig. 1(a)).
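The definition in eq. (2) can be sketched in a few lines. This is a minimal cubic-time illustration, not the algorithm used later in section 4; the function name and the Floyd-Warshall-style (min, max) relaxation are our own choices:

```python
import numpy as np

def effective_dissimilarities(d0):
    """Minimax ("weakest link") path distances of eq. (2).

    d0 is a symmetric (n, n) matrix of nonnegative dissimilarities with
    zero diagonal, read as a complete weighted graph.  Returns D with
    d_ij = minimum over all paths p from i to j of the largest edge on p.
    """
    d = np.array(d0, dtype=float)
    n = d.shape[0]
    for k in range(n):
        # Allowing vertex k as an intermediate hop can only lower the
        # bottleneck weight of the best i -> j path, exactly as in eq. (2).
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    return d
```

On four points on a line at 0, 1, 2 and 10, for example, the effective dissimilarity between the first and the last point is the single large gap of 8 rather than the direct distance 10, and the resulting matrix satisfies the restricted triangle inequality established in section 3. The paper's own construction in section 4 computes the same matrix in O(n² log n) via a minimum spanning tree.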
If the two objects belong to two different clusters, however, all paths contain at least one large dissimilarity and the resulting effective dissimilarity will be large (fig. 1(b)). Note that single outliers as in fig. 1(a,b) do not affect the basic structure in the path-based distances.

Figure 1: Effective dissimilarities. (a) If objects belong to the same high-density region, d_ij is small. (b) If they are in different regions, d_ij is larger. (c) Two regions connected by a "bridge".

A problem can only occur if the point density along a "bridge" between the two clusters is as high as the density on the backbone of the clusters, see fig. 1(c). In such a case, however, the points belonging to the "bridge" can hardly be considered as "outliers". The reader should notice that the single linkage algorithm does not possess these robustness properties: it will separate the three most distant outlier objects in example 1(a) from the remaining data, but it will not detect the dominant structure.

Summarizing the above model, we formalize the path-based clustering problem as:
INPUT: A symmetric (n × n) matrix D⁰ = (d⁰_ij)_{1 ≤ i,j ≤ n} of nonnegative pairwise dissimilarities between n objects, with zero diagonal elements.
QUESTION: Find clusters by minimizing H^PC(c, D), where the matrix D represents the effective dissimilarities derived from D⁰ by eq. (2).

3 The Connectivity Kernel

In this section we show that the effective dissimilarities induce a Mercer kernel on the weighted graph G. The Mercer property will then allow us to derive several approximation results for the NP-hard pairwise clustering problem in section 4.

Definition 1. A metric D is called an ultra-metric if it satisfies the condition d_ij ≤ max(d_ik, d_kj) for all distinct i, j, k.

Theorem 1.
The dissimilarities defined by (2) induce an ultra-metric on G.

Proof. We have to check the axioms of a metric distance measure plus the restricted triangle inequality d_ij ≤ max(d_ik, d_kj): (i) d_ij ≥ 0, since the weights are nonnegative; (ii) d_ij = d_ji, since we consider symmetric weights; (iii) d_ii = 0 follows immediately from definition (2); (iv) the restricted triangle inequality follows by contradiction: suppose there exists a triple i, j, k for which d_ij > max(d_ik, d_kj). This situation, however, contradicts definition (2) of d_ij: in this case there exists a path from i to j over k whose weakest link is smaller than d_ij. Equation (2) then implies that d_ij must be smaller than or equal to max(d_ik, d_kj).

Definition 2. A metric D is ℓ₂-embeddable if there exists a set of vectors {x_i}_{i=1}^{n}, x_i ∈ R^p, p ≤ n − 1, such that ‖x_i − x_j‖₂ = d_ij for all pairs i, j.

A proof for the following lemma has been given in [4]:

Lemma 1. For every ultra-metric D, √D is ℓ₂-embeddable.

Now we are concerned with a realization of such an embedding. We introduce the notion of a centralized matrix. Let P be an (n × n) matrix and let Q = I_n − (1/n) e_n e_nᵀ, where e_n = (1, 1, …, 1)ᵀ is the n-vector of ones and I_n the (n × n) identity matrix. We define the centralized P as P^c = QPQ.

The following lemma (for a proof see e.g. [15]) characterizes ℓ₂-embeddings:

Lemma 2. Given a metric D, √D is ℓ₂-embeddable iff D^c = QDQ is negative (semi)definite.

The combination of both lemmata yields the following theorem.

Theorem 2. For the distance matrix D defined in the setting of theorem 1, the matrix S^c = −(1/2) D^c with D^c = QDQ is a Gram matrix or Mercer kernel. It contains dot products between a set of vectors {x_i}_{i=1}^{n} with squared Euclidean distances ‖x_i − x_j‖² = d_ij.

Proof.
(i) Since D is ultra-metric, √D is ℓ₂-embeddable by lemma 1, and D^c is negative (semi)definite by lemma 2. Thus, S^c = −(1/2) D^c is positive (semi)definite. As any positive (semi)definite matrix, S^c defines a Gram matrix or Mercer kernel. (ii) Since s^c_ij is a dot product between two vectors x_i and x_j, the squared Euclidean distance between x_i and x_j is given by ‖x_i − x_j‖² = s^c_ii + s^c_jj − 2 s^c_ij = −(1/2) [d^c_ii + d^c_jj − 2 d^c_ij]. With the definition of the centralized distances, it can be seen easily that all but one term, namely the original distance, cancel out: −(1/2) [d^c_ii + d^c_jj − 2 d^c_ij] = d_ij.

4 Approximation Results

Pairwise clustering is known to be NP-hard [1]. To our knowledge, no polynomial time approximation algorithm is known for the general case of pairwise clustering. For our special case in which the data are transformed into effective dissimilarities, however, we now present a polynomial time approximation scheme.

A Polynomial Time Approximation Scheme. Let us first consider the computation of the effective dissimilarities D. Despite the fact that the path-based distance is a minimum over all paths from i to j, the whole distance matrix can be computed in polynomial time.

Lemma 3. The path-based dissimilarity matrix D defined by equation (2) can be computed in running time O(n² log n).

Proof. The computation of the connectivity kernel matrix is an extension of Kruskal's minimum spanning tree algorithm. We start with n clusters, each containing one single object. In each iteration step the two clusters C_i and C_j with minimal merging costs d_ij = min_{p ∈ C_i, q ∈ C_j} d⁰_pq are merged, where d⁰_pq is the edge weight on the input graph. The link d_ij gives the effective dissimilarity of all objects in C_i to all objects in C_j. To prove this, one
can consider the case where d_ij is not the effective dissimilarity between C_i and C_j. Then there exists a path over some other cluster C_k on which all edges have a smaller weight, implying the existence of another pair of clusters with smaller merging costs. The running time is O(n² log n) for the spanning tree algorithm on the complete input graph and an additional O(n²) for filling all elements of the matrix D.

Let us now discuss the clustering step. Recall first the problem of K-means clustering: given n vectors X = {x_1, …, x_n} ⊂ R^p, the task is to partition the vectors in such a way that the squared Euclidean distance to the cluster centroids is minimized. The objective function for K-means is given by

H^KM(c, X) = Σ_{ν=1}^{K} Σ_{i: c_i = ν} (x_i − y_ν)² ,  where  y_ν = (1/n_ν) Σ_{j: c_j = ν} x_j .   (3)

Minimizing the K-means objective function for squared Euclidean distances is NP-hard if the dimension of the vectors grows with n.

Lemma 4. There exists a polynomial time approximation scheme (PTAS) for H^KM in arbitrary dimensions and for fixed K.

Proof. In [11] Ostrovsky and Rabani presented a PTAS for K-means.

Using this approximation lemma we are able to prove the existence of a PTAS for pairwise data clustering using the distance defined by (2).

Theorem 3. For distances defined by (2), there exists a PTAS for H^PC.

Proof. By lemma 3 the dissimilarity matrix D can be computed in polynomial time. By theorem 2 we can find vectors x_1, …, x_n ∈ R^p (p ≤ n − 1) with d_ij = ‖x_i − x_j‖². For squared Euclidean distances, however, there is an algebraic identity between H^PC(c, D) and H^KM(c, X) [13]. By lemma 4 there exists a PTAS for H^KM and thus for H^PC.

A 2-approximation by Kernel PCA.
While the existence of a PTAS is an interesting theoretical approximation result, it does not automatically follow that a PTAS can be used in a constructive way to derive practical algorithms. Taking such a practical viewpoint, we now consider another (weaker) approximation result from which, however, an efficient algorithm can be designed easily. Since we can define a connectivity kernel matrix, we can use kernel PCA [14] to reduce the data dimension. The vectors are projected on the first principal components. Diagonalization of the centered kernel matrix S^c leads to S^c = V Λ Vᵀ, with an orthogonal matrix V = (v_1, …, v_n) containing the eigenvectors of S^c, and a diagonal matrix Λ = diag(λ_1, …, λ_n) containing the corresponding eigenvalues on its diagonal. Assuming now that the eigenvalues are in descending order (λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n), the data are projected on the first p eigenvectors: x′_i = Σ_{j=1}^{p} √λ_j v_ji.

Theorem 4. Embedding the path-based distances into R^K by kernel PCA and enumerating all possible Voronoi partitions yields an O(n^{K²+1}) algorithm which approximates path-based clustering within a constant factor of 2.

Proof. The solution of the K-means cost function induces a Voronoi partition of the dataset. If the dimension p of the data is kept fixed, the number of different Voronoi partitions is at most O(n^{Kp}), and they can be enumerated in O(n^{Kp+1}) time [8]. Further, if the embedding dimension is chosen as p = K, K-means in R^K is a 2-approximation algorithm for K-means in R^{n−1} [5]. Combining both results, we arrive at a 2-approximation algorithm with running time O(n^{K²+1}).

Heuristics without approximation guarantees.
The running time of the 2-approximation algorithm may still be too large for many applications; we therefore resort to two heuristic optimization methods without approximation guarantees. Instead of enumerating all possible Voronoi partitions, one can simply partition the data with the fast classical K-means algorithm. In one sweep it assigns each object to the nearest centroid, while keeping all other object assignments fixed. Then the centroids are relocated according to the new assignments. Since the running time grows linearly with the data dimension, it is useful to first embed the data in K dimensions, which leads us to a functional whose optimal solution is, even in the worst case, within a factor of two of the desired solution, as we know from the above approximation results. In this reduced space, the K-means heuristic is applied in the hope that only few local minima exist in the low-dimensional subspace.

As a second heuristic one can apply Ward's method, which is an agglomerative optimization of the K-means objective function.¹ It starts with n clusters, each containing one object, and in each step the two clusters that minimize the K-means objective function are merged. Ward's method produces a cluster hierarchy. For applications of this method see figure 3.

¹It has been shown in [12] that Ward's method is an optimization heuristic for H^PC. Due to the equivalence of H^PC and H^KM in our special case, this property carries over to K-means.

5 Experiments

We first compare our method with the classical single linkage algorithm on artificial data consisting of three noisy spirals, see figure 2. Our main concern in these experiments is the robustness against noise in the data. Figure 3(a) shows the dendrogram produced by single linkage. The leaves of the tree are the objects of figure 2. For better visualization of the tree structure, the bar diagrams below the tree show the labels of the three-cluster solution as drawn in fig. 2(c). The height of the inner nodes depicts the merging costs of two subtrees. Each level of the hierarchy is one cluster solution. It is obvious that the main parts of the spiral arms are found, but the objects drawn on the right side are separated from the rest of the cluster. The respective objects are the outliers that are separated at the highest hierarchical levels of the algorithm. We conclude that for small K, single linkage has the tendency to separate single outlier objects from the data.

Figure 2: Comparison to other clustering methods. (a) Mean shift clustering, (b) spectral clustering, (c) connectivity kernel clustering. (Color images at http://www.inf.ethz.ch/~befische/nips03)

Figure 3: Hierarchical clustering solutions for example 2(c). (a) Single linkage, (b) Ward's method with the connectivity kernel, applied to the embedded objects in n − 1 dimensions. (c) Ward's method after kernel PCA embedding in 3 dimensions.

By way of the connectivity kernel we can transform the original dyadic data into (n − 1)-dimensional vectorial data. To show comparable results for the connectivity kernel, we apply Ward's method to the embedded vectors. Figure 3(b) shows the cluster hierarchy for Ward's method in the full space of n − 1 dimensions. As opposed to the single linkage results, the main structure of the spiral arms has been successfully found in the hierarchy corresponding to the three-cluster solution. Below the three-cluster level, the tree appears to be very noisy.
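The embedding used here, centering the connectivity kernel as in Theorem 2 and projecting on the leading eigenvectors as in section 4, can be sketched as follows (the function name is ours; a K-means or Ward step on the returned coordinates completes the pipeline):

```python
import numpy as np

def kernel_pca_embedding(D, p):
    """Embed effective dissimilarities D via the centered kernel S^c.

    D is the (n, n) ultra-metric matrix of eq. (2).  Following Theorem 2,
    S^c = -1/2 Q D Q is a Gram matrix; its top-p eigenvectors, scaled by
    the square roots of the eigenvalues, give the kernel PCA coordinates.
    """
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Sc = -0.5 * Q @ D @ Q                      # centered Gram matrix
    evals, evecs = np.linalg.eigh(Sc)          # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1] # reorder to descending
    evals = np.clip(evals[:p], 0.0, None)      # guard tiny numerical negatives
    return evecs[:, :p] * np.sqrt(evals)
```

For p = n − 1 the squared Euclidean distances of the returned vectors reproduce D exactly, as Theorem 2 guarantees; a smaller p yields the de-noised low-dimensional embedding used in figure 3(c).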
It should also be noticed that the costs of the three-cluster solution are not much larger than the costs of the four-cluster solution, indicating that the three-cluster solution does not form a distinctly separated hierarchical level.

Figure 3(c) demonstrates that more distinctly separated levels can be found after applying kernel PCA and embedding the objects into a low-dimensional space (here 3 dimensions). Ward's method is then applied to the embedded objects. One can see that the coarse structure of the tree has been preserved, while the costs of cluster solutions for K > 3 have shrunk towards zero. We conclude that kernel PCA has the effect of de-noising the hierarchical tree, leading to a more robust agglomerative algorithm.

Now we compare our results to other recently published clustering techniques that have been designed to extract elongated structures. Mean shift clustering [3] computes a trajectory of vectors along the gradient of the underlying probability density. The probability distribution is estimated with a density estimation kernel, e.g. a Gaussian kernel. The trajectories starting at each point in the feature space converge at the local maxima of the probability distribution. Mean shift clustering is only applicable to finite-dimensional vector spaces, because it implicitly involves density estimation. A potential shortcoming of mean shift clustering is the following: if the modes of the distribution have multiple local maxima (as e.g. in the spiral arm example), no kernel bandwidth exists that successfully separates the data according to the underlying structure. In figure 2(a) the best result for mean shift clustering is drawn.
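A minimal sketch of the mean shift procedure just described, assuming a Gaussian kernel and a fixed number of update steps (the helper name and the fixed iteration count are our own simplifications):

```python
import numpy as np

def mean_shift(X, bandwidth, steps=100):
    """Move each point uphill on a Gaussian kernel density estimate.

    X is an (n, d) data matrix; bandwidth is the free parameter sigma
    that, as argued above, has to be preselected.  Points belonging to
    the same mode of the density end up at (nearly) the same position.
    """
    Y = X.astype(float).copy()
    for _ in range(steps):
        sq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2.0 * bandwidth ** 2))
        # The mean shift update: replace every point by the kernel-
        # weighted mean of the data around its current position.
        Y = (W @ X) / W.sum(axis=1, keepdims=True)
    return Y
```

On two well-separated one-dimensional groups the points collapse onto two modes; on spiral data like figure 2, however, the failure mode described above appears: no single bandwidth recovers exactly the three arms.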
For smaller values of σ the spiral arms are further subdivided into additional clusters, and for larger bandwidth values the result becomes more and more similar to that of compactness-based criteria like K-means.

Spectral methods [10] have become quite popular in recent years. Usually the Laplacian matrix based on a Gaussian kernel is computed. By way of PCA, the data are embedded in a low-dimensional space. The K-means algorithm on the embedded data then gives the resulting partition. It has also been proposed to project the data on the unit sphere before applying K-means. Spectral clustering with a Gaussian kernel is known to be able to separate nested circles, but we observed that it has severe problems extracting the noisy spiral arms, see figure 2(b). In spectral clustering, the kernel width σ is a free parameter which has to be selected "correctly". If σ is too large, spectral clustering becomes similar to standard K-means and fails to extract elongated structures. If, on the other hand, σ is too small, the algorithm becomes increasingly sensitive to outliers, in the sense that it has the tendency to separate single outlier objects.

Our approach to clustering with the connectivity kernel, however, could successfully extract the three spiral arms, as can be seen in figure 2(c). The reader should notice that this method does not require the user to preselect any kernel parameter.

Figure 4: Example from the USPS dataset. Training examples of digits 2 and 9 embedded in two dimensions. (a) Ground truth labels, (b) K-means labels, and (c) clustering with the connectivity kernel.

In a last experiment, we show the advantages of our method compared to a parameter-free compactness criterion (K-means) on the problem of clustering digits '2' and '9' from the USPS digits dataset. Figure 4 shows the clustering result of our method using the connectivity kernel.
The 16×16 gray-value digit images of the USPS dataset are interpreted as vectors and projected on the two leading principal components. In figure 4(a) the ground truth solution is drawn. Figure 4(b) shows the partition obtained by directly applying K-means clustering, and figure 4(c) shows the result produced by our method. Compared to the ground truth solution, path-based clustering succeeded in extracting the elongated structures, resulting in a very small error of only 1.5% mislabeled digits. The compactness-based K-means method, on the other hand, produces clearly suboptimal clusters with an error rate of 30.6%.

6 Conclusion

In this paper we presented a clustering approach that is based on path-based distances in the input graph. In a first step, elongated structures are transformed into compact ones, which in the second step are partitioned by the compactness-based pairwise clustering method. We showed that the transformed distances induce a Mercer kernel, which in turn allowed us to derive a polynomial time approximation scheme for the generally NP-hard pairwise clustering problem. Moreover, Mercer's property renders it possible to embed the data into low-dimensional subspaces by kernel PCA. These embeddings form the basis for an efficient 2-approximation algorithm, and also for de-noising the data to "robustify" fast agglomerative optimization heuristics. Compared to related methods like single linkage, mean shift clustering and spectral clustering, our method has been shown to successfully overcome the problem of sensitivity to outlier objects, while being capable of extracting nested elongated structures. Our method does not involve any free kernel parameters, which we consider to be a particular advantage over both mean shift and spectral clustering.

References

[1] P. Brucker. On the complexity of clustering problems. Optimization and Operations Research, pages 45–54, 1977.

[2] D.
Comaniciu. An algorithm for data-driven bandwidth selection. IEEE T-PAMI, 25(2):281–288, 2003.

[3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE T-PAMI, 24(5):603–619, 2002.

[4] M. Deza and M. Laurent. Applications of cut polyhedra. J. Comp. Appl. Math., 55:191–247, 1994.

[5] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proc. of the ACM-SIAM Symp. on Discrete Algorithms, pages 291–299, 1999.

[6] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley & Sons, 2001.

[7] B. Fischer and J.M. Buhmann. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE T-PAMI, 25(4):513–518, 2003.

[8] M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In 10th ACM Sympos. Computat. Geom., pages 332–339, 1994.

[9] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[10] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pages 849–856, 2002.

[11] R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric min-sum median clustering. Journal of the ACM, 49(2):139–156, 2002.

[12] J. Puzicha, T. Hofmann, and J.M. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 2000.

[13] V. Roth, J. Laub, J.M. Buhmann, and K.-R. Müller. Going metric: Denoising pairwise data. In NIPS, volume 15, 2003. To appear.

[14] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[15] G. Young and A. S. Householder.
Discussion of a set of points in terms of their mutual distances. Psychometrika, 3:19–22, 1938.
", "award": [], "sourceid": 2428, "authors": [{"given_name": "Bernd", "family_name": "Fischer", "institution": null}, {"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}