{"title": "Cyclizing Clusters via Zeta Function of a Graph", "book": "Advances in Neural Information Processing Systems", "page_first": 1953, "page_last": 1960, "abstract": "Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we attempt to tackle clustering problems for complex data of multiple distributions and large multi-scales. To this end, we develop an algorithm named Zeta $l$-links, or Zell which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local $l$-links of the graph. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of the cluster in determinantal form. The popularity character of the cluster is conceptualized as the global fusion of variations of the structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data, real imagery data, and real sensory data show the promising performance of Zell. The $98.1\\%$ accuracy, in the sense of the normalized mutual information, is obtained on the FRGC face data of 16028 samples and 466 facial clusters. The MATLAB codes of Zell will be made publicly available for peer evaluation.", "full_text": "Cyclizing Clusters via Zeta Function of a Graph\n\nDepartment of Information Engineering, Chinese University of Hong Kong\n\nDeli Zhao and Xiaoou Tang\n\nHong Kong, China\n\n{dlzhao,xtang}@ie.cuhk.edu.hk\n\nAbstract\n\nDetecting underlying clusters from large-scale data plays a central role in machine\nlearning research. In this paper, we tackle the problem of clustering complex data\nof multiple distributions and multiple scales. 
To this end, we develop an algorithm named Zeta l-links (Zell) which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local l-links of samples. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A new mathematical tool, the Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of a cluster in determinantal form. The popularity character of a cluster is conceptualized as the global fusion of variations of such a structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the hierarchical agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data clustering, imagery pattern clustering, and image segmentation show the competitive performance of Zell. The 98.1% accuracy, in the sense of the normalized mutual information (NMI), is obtained on the FRGC face data of 16028 samples and 466 facial clusters.\n\n1 Introduction\n\nPattern clustering is a classic topic in pattern recognition and machine learning. In general, algorithms for clustering fall into two categories: partitional clustering and hierarchical clustering. Hierarchical clustering proceeds by merging small clusters (agglomerative) or dividing large clusters into small ones (divisive). The key point of agglomerative merging is the measurement of structural affinity between clusters. This paper is devoted to handling the problem of data clustering via hierarchical agglomerative merging.\n\n1.1 Related work\nThe representative algorithms for partitional clustering are the traditional K-means and the more recent Affinity Propagation (AP) [1]. It is known that K-means is sensitive to the selection of the initial K centroids. 
The AP algorithm addresses this issue by initially viewing each sample as an exemplar; exemplar-to-member and member-to-exemplar messages then transmit competitively among all samples until a group of good exemplars and their corresponding clusters emerges. Besides its superiority in finding good clusters, AP exhibits a surprising ability to handle large-scale data. However, AP becomes computationally expensive when the number of clusters is set in advance. Both K-means and AP encounter difficulty on data mixed from multiple manifolds.\n\nThe classic algorithms for agglomerative clustering include three kinds of linkage algorithms: the single, complete, and average Linkages. Linkages are free from restrictions on data distributions, but are quite sensitive to local noisy links. A novel agglomerative clustering algorithm was recently developed by Ma et al. [2] with the lossy coding theory of multivariate mixed data. The core of their algorithm is to characterize the structures of clusters by means of the variational coding length of coding two arbitrary merged clusters against coding them individually. The coding-length-based algorithm exhibits exceptional performance for clustering multivariate Gaussian data or subspace data. However, it is not suitable for manifold-valued data.\n\nFigure 1: A small graph with four vertices and five edges can be decomposed into three cycles. The complexity of the graph can be characterized by the collective dynamics of these basic cycles.\n\nSpectral clustering algorithms are another group of popular algorithms developed in recent years. The Normalized Cuts (Ncuts) algorithm [3] was developed for image segmentation and data clustering. Ng et al.'s algorithm [4] is mainly for data clustering, and Newman's work [5] is applied to community detection in complex networks. Spectral clustering can handle complex data of multiple distributions. 
However, it is sensitive to noise and to the variation of local data scales.\n\nIn general, the following four factors pertaining to data are still problematic for most clustering algorithms: 1) mixed distributions such as multivariate Gaussians of different deviations, subspaces of different dimensions, or globally curved manifolds of different dimensions; 2) multiple scales; 3) global sampling densities; and 4) noise. To attack these problems, it is worthwhile to develop new approaches that are conceptually different from existing ones.\n\n1.2 Our work\nTo address these issues in complex data clustering, we develop a new clustering approach called Zeta l-links, or Zell. The core of the algorithm is based on a new cluster descriptor that is essentially the integration of all cycles in the cluster by means of the Zeta function of the corresponding graph. The Zeta function leads to a rational form of cyclic interactions of members in the cluster, where cycles are employed as primitive structures of clusters. With the cluster descriptor, the popularity of a cluster is quantified as the global fusion of variations of the structural descriptor by the leave-one-out strategy in the cluster. This definition of the popularity is expressible by the diagonals of a matrix inverse. The structural inference between clusters may be performed with this popularity character. Based on this novel popularity character, we propose a clustering method, named Zeta merging, that works in the hierarchical agglomerative fashion. This method has no additional assumptions on data distributions and data scales. As a subsidiary procedure for Zeta merging, we present a simple method, called l-links, to find the initial set of clusters as the input of Zeta merging. The Zell algorithm is the combination of Zeta merging and l-links. 
Directed graph construction is derived from l-links.\n\n2 Cyclizing a cluster with Zeta function\n\nOur ideas are mainly inspired by recent progress on the study of collective dynamics of complex networks. Experiments have validated that the stochastic states of a neuronal network are partially modulated by the information that cyclically transmits [6], and that the proportion of cycles in a network is strongly relevant to the level of its complexity [7]. Recent studies [8], [9] unveil that short cycles and Hamilton cycles in graphs play a critical role in the structural connectivity and community of a network. This progress inspires us to formalize the structural complexity of a cluster by means of cyclic interactions of its members. As illustrated in Figure 1, the relationships between samples can be characterized by the combination of all cycles in the graph. Thus the structural complexity of the graph can be conveyed by the collective dynamics of these basic cycles. Therefore, we may characterize a cluster by the global combination of structural cycles in the associated graph. To do so, we need to model cycles of different lengths and combine them together as a structural descriptor.\n\n2.1 Modeling cycles of equal length\nWe here model cycles using sum-product codes to structurize a cluster. Formally, let C = {x1, . . . , xn} denote the set of sample vectors in a cluster C. Suppose that W is the weighted adjacency matrix of the graph associated with C. A vertex of the graph represents a member in C. For generality, the graph is assumed to be directed, meaning that W may be asymmetric. Let \u03b3` = {p1 \u2192 p2 \u2192 \u00b7 \u00b7 \u00b7 \u2192 p`\u22121 \u2192 p`, p` \u2192 p1} denote any cycle \u03b3` of length ` defined on W. We apply the factorial codes to retrieve the structural information of cycle \u03b3`, thus defining \u03bd\u03b3` = Wp`\u2192p1 \u220fk=1..`\u22121 Wpk\u2192pk+1, where Wpk\u2192pk+1 is the (pk, pk+1) entry of W. The value \u03bd\u03b3` provides a kind of degree measure of interactions among \u03b3`-associated vertices. For the set K` of all cycles of length `, the sum-product code \u03bd` is written as:\n\n\u03bd` = \u2211\u03b3`\u2208K` \u03bd\u03b3` = \u2211\u03b3`\u2208K` Wp`\u2192p1 \u220fk=1..`\u22121 Wpk\u2192pk+1.    (1)\n\nThe value \u03bd` may be viewed as the quantified indication of global interactions among C at the `-cycle scale. The structural complexity of the graph is measured by these quantities for cycles of all different lengths, i.e., {\u03bd1, . . . , \u03bd`, . . . , \u03bd\u221e}. Further, we need to perform the functional integration of these individual measures. The Zeta function of a graph may play a role for such a task.\n\n2.2 Integrating cycles using Zeta function\nZeta functions are widely applied in pure mathematics as tools for performing statistics in number theory, computing algebraic invariants in algebraic geometry, and measuring complexities in dynamic systems. The forms of Zeta functions are diverse. The Zeta function we use here is defined as:\n\n\u03b6z = exp(\u2211`=1..\u221e \u03bd` z^`/`),    (2)\n\nwhere z is a real-valued variable. Here \u03b6z may be viewed as a kind of functional organization of all cycles in {K1, . . . , K`, . . . , K\u221e} in a global sense. What is interesting is that \u03b6z admits a rational form [10], which makes the intractable manipulations arising in (1) tractable.\nTheorem 1. \u03b6z = 1/ det(I \u2212 zW), where z < 1/\u03c1(W) and \u03c1(W) denotes the spectral radius of the matrix W.\nFrom Theorem 1, we see that the global interaction of elements in C is quantified by a quite simple expression of determinantal form.\n\n2.3 Modeling popularity\nThe popularity of a group of samples means how much the samples in the group are perceived to be a whole cluster. 
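Theorem 1 can be checked numerically on a small graph. Since the sum-product code \u03bd` over all length-` cycles equals tr(W^`), the exponential series in (2) can be truncated and compared against the determinantal form. A minimal sketch (assuming NumPy; the weight matrix W below is an arbitrary illustrative example, not from the paper):

```python
import numpy as np

# Small directed graph: weighted adjacency matrix W (may be asymmetric).
W = np.array([[0.0, 0.5, 0.2],
              [0.3, 0.0, 0.4],
              [0.6, 0.1, 0.0]])

rho = max(abs(np.linalg.eigvals(W)))   # spectral radius rho(W)
z = 0.5 / rho                          # any z with |z| < 1/rho(W)

# nu_l: sum-product code over all cycles of length l, i.e. tr(W^l).
def nu(l):
    return np.trace(np.linalg.matrix_power(W, l))

# Truncated series for zeta_z = exp(sum_l nu_l z^l / l) ...
series = sum(nu(l) * z**l / l for l in range(1, 200))
zeta_series = np.exp(series)

# ... versus the rational form of Theorem 1: zeta_z = 1 / det(I - zW).
zeta_det = 1.0 / np.linalg.det(np.eye(3) - z * W)

print(abs(zeta_series - zeta_det) < 1e-10)  # the two forms agree
```

Because z\u00b7rho(W) = 0.5 here, the series terms decay geometrically and 200 terms reach machine precision.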
To model the popularity, we need to formalize the complexity descriptor of the cluster C. With the cyclic integration \u03b6z in the preceding section, the complexity of the cluster can be measured by the polynomial entropy \u03b5C of logarithmic form:\n\n\u03b5C = ln \u03b6z = \u2211`=1..\u221e \u03bd` z^`/` = \u2212 ln det(I \u2212 zW).    (3)\n\nThe entropy \u03b5C will be employed to model the popularity of C. As we analyze at the beginning of Section 2, cycles are strongly associated with the structural communities of a network. To model the popularity, therefore, we may investigate the variational information of cycles by successively leaving one member of C out. More clearly, let \u03c7C denote the popularity character of C. Then \u03c7C is defined as the averaged sum of the reductive entropies:\n\n\u03c7C = (1/n) \u2211p=1..n (\u03b5C \u2212 \u03b5C\\xp) = \u03b5C \u2212 (1/n) \u2211p=1..n \u03b5C\\xp.    (4)\n\nLet T denote the transpose operator of a matrix and ep the p-th standard basis vector, whose p-th element is 1 and 0 elsewhere. We have the following theorem.\nTheorem 2. \u03c7C = (1/n) ln \u220fp=1..n ep^T (I \u2212 zW)^\u22121 ep.\nBy analysis of inequalities, we may obtain that \u03c7C is bounded as 0 < \u03c7C \u2264 \u03b5C/n. The popularity measure \u03c7C is a structural character of C, which can be exploited to handle problems in learning such as clustering, ranking, and classification.\nThe computation of \u03c7C involves the inverse of (I \u2212 zW). In general, the complexity of computing (I \u2212 zW)^\u22121 is O(n^3). However, \u03c7C is only related to the diagonals of (I \u2212 zW)^\u22121 instead of the full dense matrix. This unique property reduces the computation of \u03c7C to a complexity of O(n^1.5) by a specialized algorithm for computing the diagonals of the inverse of a sparse matrix [11].\n\n2.4 Structural affinity measurement\nGiven a set of initial clusters Cc = {C1, . . . , Cm} and the adjacency matrix P of the corresponding samples, the affinities between clusters or data groups can be measured via the corresponding popularity character \u03c7C. Under our framework, an intuitive inference is that the two clusters that share the largest reciprocal popularity have the most consistent structures, meaning the two clusters are the most relevant from the structural point of view. Formally, for two given data groups Ci and Cj from Cc, the criterion of reciprocal popularity may be written as\n\n\u03b4\u03c7Ci\u222aCj = \u03b4\u03c7Ci + \u03b4\u03c7Cj = (\u03c7Ci|Ci\u222aCj \u2212 \u03c7Ci) + (\u03c7Cj|Ci\u222aCj \u2212 \u03c7Cj),    (5)\n\nwhere the conditional popularity \u03c7Ci|Ci\u222aCj is defined as \u03c7Ci|Ci\u222aCj = (1/|Ci|) ln \u220fxp\u2208Ci ep^T (I \u2212 zPCi\u222aCj)^\u22121 ep and PCi\u222aCj is the submatrix of P corresponding to the samples in Ci and Cj. The incremental popularity \u03b4\u03c7Ci embodies the information gain of Ci after being merged with Cj. The larger the value of \u03b4\u03c7Ci\u222aCj is, the more likely the two data groups Ci and Cj are perceived to be one cluster. Therefore, \u03b4\u03c7Ci\u222aCj may be exploited to measure the structural affinity between two groups of samples from a whole set of samples.\n\n3 Zeta merging\n\nWe will develop the clustering algorithm using the structural character \u03c7C. The automatic detection of the number of clusters is also taken into consideration.\n\n3.1 Algorithm of Zeta merging\n\nWith the criterion of structural affinity in Section 2.4, it is straightforward to write the procedure of clustering in the hierarchical agglomerative way. The algorithm proceeds from the pair {Ci, Cj} that has the largest incremental popularity \u03b4\u03c7Ci\u222aCj, i.e., {Ci, Cj} = arg max_{i,j} \u03b4\u03c7Ci\u222aCj. We name the method Zeta merging; its procedure is provided in Algorithm 1. 
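The merging loop of Algorithm 1, together with Theorem 2 and the criterion of Eq. (5), can be sketched as follows. This is a hedged illustration, not the authors' MATLAB implementation: the helpers `chi`, `delta_chi`, and `zeta_merging` are our own names, and the popularity character is computed with a dense inverse (O(n^3)) rather than the specialized O(n^1.5) diagonal algorithm of [11].

```python
import numpy as np

def chi(P, idx, z=0.01):
    """Popularity character of the cluster given by index list `idx`
    (Theorem 2): the mean log-diagonal of (I - z * P_sub)^{-1}."""
    sub = P[np.ix_(idx, idx)]
    M = np.linalg.inv(np.eye(len(idx)) - z * sub)
    return np.mean(np.log(np.diag(M)))

def delta_chi(P, ci, cj, z=0.01):
    """Incremental popularity of merging clusters ci and cj (Eq. 5)."""
    union = ci + cj
    sub = P[np.ix_(union, union)]
    M = np.linalg.inv(np.eye(len(union)) - z * sub)
    d = np.log(np.diag(M))
    chi_i_cond = np.mean(d[:len(ci)])   # conditional popularity of ci in the union
    chi_j_cond = np.mean(d[len(ci):])   # conditional popularity of cj in the union
    return (chi_i_cond - chi(P, ci, z)) + (chi_j_cond - chi(P, cj, z))

def zeta_merging(P, clusters, mc, z=0.01):
    """Algorithm 1: greedily merge the pair with maximum incremental
    popularity until mc clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > mc:
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: delta_chi(P, clusters[ij[0]], clusters[ij[1]], z))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On a row-normalized similarity matrix with two clearly separated blocks, running `zeta_merging` from singleton clusters down to two clusters recovers the blocks, since within-block pairs carry far larger \u03b4\u03c7 than cross-block pairs.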
In general, Zeta merging proceeds smoothly if the damping factor z is bounded as 0 < z < 1/(2\u2016P\u2016).\u00b9\n\nAlgorithm 1 Zeta merging\n\ninputs: the weighted adjacency matrix P, the m initial clusters Cc = {C1, . . . , Cm}, and the number mc (mc \u2264 m) of resulting clusters. Set t = m.\nwhile 1 do\n  if t = mc then break; end if\n  Search two clusters Ci and Cj such that {Ci, Cj} = arg max_{{Ci,Cj}\u2208Cc} \u03b4\u03c7Ci\u222aCj;\n  Cc \u2190 {Cc \\ {Ci, Cj}} \u222a {Ci \u222a Cj}; t \u2190 t \u2212 1.\nend while\n\nThe merits of Zeta merging are that it is free from restrictions on data distributions and is less affected by multiple scales in the data. Affinity propagation in Zeta merging proceeds on the graph according to cyclic associations, requiring no specification of data distributions. Moreover, the popularity character \u03c7C of each cluster is obtained from the averaged amount of variational information conveyed by \u03b5C. Thus the size of a cluster has little influence on the value \u03b4\u03c7Ci\u222aCj. Most importantly, cycles rooted at each point in C globally interact with all other points. Thus, the global descriptor \u03b5C and the popularity character \u03c7C are not sensitive to the local data scale at each point, leading to the robustness of Zeta merging against variations of data scales.\n\n3.2 Number of clusters in Zeta merging\n\nIn some circumstances, it is necessary to automatically detect the number of underlying clusters from given data. This functionality can be reasonably realized in Zeta merging if each cluster corresponds to a diagonal block structure in P, up to some permutations. The principle is that the minimum \u03b4\u03c7Ci\u222aCj will be zero when a set of separable clusters emerges, behind which is the mathematical principle that inverting a block-diagonal matrix is equivalent to inverting the matrices on the diagonal blocks. 
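This principle can be verified directly: det(I \u2212 zP) factorizes over the diagonal blocks, so the entropies of Eq. (3) add, and the incremental popularity of merging two interlink-free clusters vanishes. A small numerical check (assuming NumPy; the two random blocks are illustrative):

```python
import numpy as np

z = 0.01
rng = np.random.default_rng(0)

# Two clusters with no interlinks: P is block-diagonal (up to permutation).
B1 = rng.random((3, 3)); np.fill_diagonal(B1, 0.0)
B2 = rng.random((4, 4)); np.fill_diagonal(B2, 0.0)
B1 /= B1.sum(axis=1, keepdims=True)   # row-normalized blocks
B2 /= B2.sum(axis=1, keepdims=True)
P = np.block([[B1, np.zeros((3, 4))],
              [np.zeros((4, 3)), B2]])

def entropy(W):
    """epsilon_C = -ln det(I - zW), Eq. (3)."""
    return -np.log(np.linalg.det(np.eye(len(W)) - z * W))

# det(I - zP) factorizes over the diagonal blocks, so the entropies add:
print(np.isclose(entropy(P), entropy(B1) + entropy(B2)))

# Hence each block's conditional popularity equals its unconditional one,
# and the incremental popularity delta-chi of merging the two blocks is 0:
d = np.log(np.diag(np.linalg.inv(np.eye(7) - z * P)))
chi1_cond, chi2_cond = d[:3].mean(), d[3:].mean()
chi1 = np.log(np.diag(np.linalg.inv(np.eye(3) - z * B1))).mean()
chi2 = np.log(np.diag(np.linalg.inv(np.eye(4) - z * B2))).mean()
print(np.isclose((chi1_cond - chi1) + (chi2_cond - chi2), 0.0))
```

Both checks hold because the inverse of a block-diagonal matrix is the block-diagonal matrix of the inverses.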
In practice, however, the minimum \u03b4\u03c7Ci\u222aCj shows a jumping variation on the stable part of its curve instead of exactly arriving at zero, due to the perturbation of the interlinks between clusters. The number of clusters then corresponds to the step at the jumping point.\n\n4 The Zell algorithm\n\nAn issue arising in Zeta merging is the determination of the initial set of clusters. Here, we give a method that performs local single Linkages (message passing by minimum distances). The method of graph construction is also discussed here.\n\nFigure 2: Schematic illustration of l-links. From left to right: data with two seed points (red markers), 2-links grown from two seed points, and 2-links from four seed points. The same cluster is denoted by the markers with the same color of edges.\n\n4.1 Detecting l-links\nGiven the sample set Cy = {y1, . . . , ymo}, we first get the set S_i^2K of the 2K nearest neighbors of the point yi. Then from yi, messages are passed among S_i^2K in the sense of minimum distances (or general dissimilarities), thus locally forming an acyclic directed subgraph at each point. We call such an acyclic directed subgraph l-links, where l is the number of steps of message passing among S_i^2K. In general, l is a small integer, e.g., l \u2208 {2, 3, 4, . . . }. A further manipulation is to merge l-links that share common vertices. A simple schematic example is shown in Figure 2. The specific procedure is provided in Algorithm 2.\n\nAlgorithm 2 Detecting l-links\n\ninputs: the sample set Cy = {y1, . . . , ymo}, the number l of l-links, and the number K of nearest neighbors for each point, where l < K.\nInitialization: Cc = {Ci | Ci = {yi}, i = 1, . . . , mo} and q = 1.\nfor i from 1 to mo do\n  Search the 2K nearest neighbors of yi and form S_i^2K.\n  Iteratively perform Ci \u2190 Ci \u222a {yj} if yj = arg min_{yj\u2208S_i^2K} min_{y\u2208Ci} distance(y, yj), until |Ci| \u2265 l.\n  Perform Cj \u2190 Ci \u222a Cj, Cc \u2190 Cc \\ Ci, and q \u2190 q + 1, if |Ci \u2229 Cj| > 0, where j = 1, . . . , q.\nend for\n\n4.2 Graph construction\nThe directional connectivity of l-links leads us to build a directed graph whose vertex yi directionally points to its K nearest neighbors. The method of graph construction is presented in Algorithm 3. The free parameter \u03c3 in (6) is estimated according to the criterion that the geometric mean of all similarities between each point and its three nearest neighbors is set to be a, where a is a given parameter in (0, 1]. It is easy to verify that \u03c1(P) < 1 here.\n\nAlgorithm 3 Directed graph construction\n\ninputs: the sample set Cy, the number K of nearest neighbors, and a free parameter a \u2208 (0, 1].\nEstimate the parameter \u03c3 by \u03c3^2 = \u2212 (1/(mo ln a)) \u2211yi\u2208Cy \u2211yj\u2208S_i^3 [distance(yi, yj)]^2.\nDefine the entry of the i-th row and j-th column of the weighted adjacency matrix P as\n\nPi\u2192j = exp(\u2212[distance(yi, yj)]^2 / \u03c3^2) if yj \u2208 S_i^K, and 0 otherwise.    (6)\n\nPerform the sum-to-one operation for each row, i.e., Pi\u2192j \u2190 Pi\u2192j / \u2211j=1..mo Pi\u2192j.\n\n4.3 Zeta l-links (Zell)\nOur algorithm for data clustering is in effect to perform Zeta merging on the initial set of small clusters derived from l-links. So, we name our algorithm Zeta l-links, or Zell. The complete implementation of the Zell algorithm is to consecutively perform Algorithm 3, Algorithm 2, and Algorithm 1. 
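The graph-construction step (Algorithm 3) can be sketched as below. This is a rough illustration under stated assumptions: NumPy only, a brute-force nearest-neighbor search standing in for whatever indexing the authors used, and `build_graph` is our own helper name.

```python
import numpy as np

def build_graph(Y, K=20, a=0.95):
    """Sketch of Algorithm 3: directed K-NN graph with row-normalized
    Gaussian similarities. Y is an (m, d) array of samples."""
    m = len(Y)
    # Pairwise Euclidean distances (brute force).
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    order = np.argsort(D, axis=1)[:, 1:]      # neighbors, nearest first (skip self)
    # sigma^2 = -(1 / (m ln a)) * sum of squared distances to the 3 NNs,
    # so that the geometric mean of the 3-NN similarities is tied to a.
    d3 = np.take_along_axis(D, order[:, :3], axis=1)
    sigma2 = -(d3 ** 2).sum() / (m * np.log(a))
    # Directed Gaussian-weighted edges to the K nearest neighbors (Eq. 6).
    P = np.zeros((m, m))
    for i in range(m):
        for j in order[i, :K]:
            P[i, j] = np.exp(-D[i, j] ** 2 / sigma2)
    P /= P.sum(axis=1, keepdims=True)         # sum-to-one per row
    return P
```

Each row of the returned P sums to one and has exactly K nonzero entries; the damping factor z used in Zeta merging should then respect the bound given in Section 3.1.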
In practice, the steps in Algorithm 3 and Algorithm 2 are operated together to enhance the efficiency of Zell. Zeta merging may also be combined with K-means and Affinity Propagation for clustering. These two algorithms work well for producing small clusters. So, they can be employed to generate initial clusters as the input of Zeta merging.\n\n1Interested readers may refer to the full version of this paper for proofs.\n\nFigure 3: Clustering on toy data. (a) Generated data of 12 clusters. The size of each cluster is shown in the figure. The data are of different distributions, consisting of multiple manifolds (two circles and a hyperbola), subspaces (two pieces of lines and a piece of the rectangular strip), and six Gaussians. The densities of clusters are diverse. The differences between the sizes of different clusters are large. The scales of the data vary. For each cluster in the manifold and subspace data, the points are randomly generated with different deviations. (b) Clusters yielded by Zell (given number of clusters). The different colors denote different clusters. (c) Clusters automatically detected by Zell on the data composed of six Gaussians and the short line. (d) Curve of minimum Delta popularity (\u03b4\u03c7). (e) Enlarged part of (d) and the curve of its first-order differences. The point marked by the square is the detected jumping point. (f) The block structures of P corresponding to the data in (c).\n\n5 Experiment\n\nExperiments are conducted on clustering toy data, hand-written digits and cropped faces from captured images, and segmenting images to test the performance of Zell. The quantitative performance of the algorithms is measured by the normalized mutual information (NMI) [12], which is widely used in learning communities. The NMI quantifies the normalized statistical information shared between two distributions. The larger the NMI is, the better the clustering performance of the algorithm is.\n\nFour representative algorithms are taken into comparison, i.e., K-centers, (average) Linkage, Affinity Propagation (AP), and Normalized Cuts (Ncuts). Here we use K-centers instead of K-means because it can handle the case where distances between points are not measured by Euclidean norms. For fair comparison, we run Ncuts on a graph whose parameters are set the same as the graph used by Zell. The parameters for Zell are set as z = 0.01, a = 0.95, K = 20, and l = 2.\n\n5.1 On toy data\n\nWe first perform an experiment on a group of toy data of diverse distributions with multiple densities, multiple scales, and significantly different sizes of clusters. As shown in Figures 3 (b) and (c), the Zell algorithm accurately detects the underlying clusters. Particularly, Zell is capable of simultaneously differentiating the cluster with five members and the cluster with 1500 members. 
This functionality is critically important for finding genes from microarray expressions in bioinformatics. Figures 3 (d) and (e) show the curves of the minimum variational \u03b4\u03c7 (for the data in Figure 3 (c)), where the number of clusters is determined at the largest gap of the curve in the stable part. However, the method presented in Section 3.2 fails to automatically detect the number of clusters for the data in Figure 3 (a), because the corresponding P matrix has no clear diagonal block structures.\n\nTable 1: Imagery data. MNIST and USPS: digit databases. ORL and FRGC: face databases. The last row shows the numbers of clusters automatically detected by Zell on the five data sets.\n\nData set | MNIST | USPS | ORL | sFRGC | FRGC\nNumber of samples | 5139 | 11000 | 400 | 11092 | 16028\nNumber of clusters | 5 | 10 | 40 | 186 | 466\nAverage number per cluster | 1027 \u00b1 64 | 1100 \u00b1 0 | 10 \u00b1 0 | 60 \u00b1 14 | 34 \u00b1 24\nDimension of each sample | 784 | 256 | 2891 | 2891 | 2891\nDetected number of clusters | 11 | 8 | 85 (K = 5) | 229 | 511\n\nTable 2: Quantitative clustering results on imagery data. NMI: normalized mutual information. The 'pref' means the preference value used in Affinity Propagation for clustering of given numbers. K = 5 for the ORL data set.\n\nData set | K-centers | Linkage | Ncuts | Affinity propagation (pref) | Zell\nMNIST | 0.228 | 0.496 | 0.737 | 0.451 (-871906470) | 0.865\nUSPS | 0.183 | 0.095 | 0.443 | 0.313 (-417749850) | 0.772\nORL | 0.393 | 0.878 | 0.939 | 0.877 (-6268) | 0.940\nsFRGC | 0.106 | 0.934 | 0.953 | 0.899 (-16050) | 0.988\nFRGC | 0.187 | 0.950 | 0.924 | 0.906 (-7877) | 0.981\n\n5.2 On imagery data\nThe imagery patterns we adopt are the hand-written digits in the MNIST and USPS databases and the facial images in the ORL and FRGC (Face Recognition Grand Challenge, http://www.frvt.org/FRGC/) databases. 
The MNIST and USPS data sets are downloaded from Sam Roweis's homepage (http://www.cs.toronto.edu/~roweis). For MNIST, we select all the images of digits from 0 to 4 in the testing set for the experiment. For FRGC, we use the facial images in the target set of experiment 4 in FRGC version 2. Besides the whole target set, we also select a subset from it: persons are selected as another group of clusters if the number of faces for each person is no less than forty. The information on the data sets is provided in Table 1. For digit patterns, the Frobenius norm is employed to measure the distances of digit pairs without feature extraction. For face patterns, however, we extract visual features of each face by means of the local binary pattern algorithm. The Chi-square metric is exploited to compute distances, defined as distance(\u02c6y, \u02c7y) = \u2211i (\u02c6yi \u2212 \u02c7yi)^2 / (\u02c6yi + \u02c7yi).\n\nThe quantitative results are given in Table 2. We see that Zell consistently outperforms the other algorithms across the five data sets. In particular, the performance of Zell is encouraging on the FRGC data set, which has the largest numbers of clusters and samples. As reported in [1], AP does significantly outperform K-centers. However, AP shows unsatisfactory performance on the digit data, where manifold structures may occur because the styles of digits vary significantly. The average Linkage also exhibits this phenomenon. The results achieved by Ncuts are also competitive. However, Ncuts is overall unstable, for example, yielding low accuracy on the USPS data. The results in Table 3 confirm the stability of Zell over variations of the free parameters. Actually, l affects the performance of Zell when it is larger, because it may incur incorrect initial clusters.\n\nTable 3: Results yielded by Zell over variations of free parameters on the sFRGC data. The initial set is {z = 0.01, a = 0.95, K = 20, l = 3}. When one of them varies, the others keep invariant.\n\nParameter | z | a | K | l\nRange | 10^\u2212{1,2,3,4} | 0.2 \u00d7 {1, 2, 3, 4, 4.75} | 10 \u00d7 {2, 3, 4, 5} | {2, 3, 4}\nNMI | 0.988 \u00b1 0 | 0.988 \u00b1 0.00019 | 0.987 \u00b1 0.0015 | 0.988 \u00b1 0.0002\n\n5.3 Image segmentation\nWe show several examples of the application of Zell to image segmentation on the Berkeley segmentation database. The weighted adjacency matrix P is defined as Pi\u2192j = exp(\u2212(Ii \u2212 Ij)^2 / \u03c3^2) if Ij \u2208 N_i^8 and 0 otherwise, where Ii is the intensity value of pixel i and N_i^8 denotes the set of pixels in the 8-neighborhood of Ii. Figure 4 displays the segmentation results with different numbers of segments for each image. Overall, attentional regions are merged by Zell. Note that small attentional regions take priority over large ones in being merged. Therefore, Zell yields many small attentional regions as final clusters.\n\nFigure 4: Image segmentation by Zell from the Berkeley segmentation database.\n\n6 Conclusion\n\nAn algorithm, named Zell, has been developed for data clustering. The cyclization of a cluster is the fundamental principle of Zell. The key point of the algorithm is the integration of structural cycles by the Zeta function of a graph. A popularity character measuring the compactness of a cluster is defined via the Zeta function, on which the core of Zell for agglomerative clustering is based. An approach for finding initial small clusters is presented, which is based on the merging of local links among samples. The directed graph used in this paper is derived from the directionality of l-links. Experimental results on toy data, hand-written digits, facial images, and image segmentation show the competitive performance of Zell. 
We hope that Zell brings a new perspective on complex data clustering.\n\nAcknowledgement\nWe thank Yaokun Wu and Sergey Savchenko for their continuing help on algebraic graph theory. We are also grateful for the interesting discussions with Yi Ma and John Wright on clustering and classification. Feng Li and Xiaodi Hou are acknowledged for their kind help. The reviewers' insightful comments and suggestions are also greatly appreciated.\n\nReferences\n[1] Frey, B.J. & Dueck, D. (2007) Clustering by passing messages between data points. Science 315:972-976.\n[2] Ma, Y., Derksen, H., Hong, W. & Wright, J. (2007) Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans. on Pattern Analysis and Machine Intelligence 29:1546-1562.\n[3] Shi, J.B. & Malik, J. (2000) Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(8):888-905.\n[4] Ng, A.Y., Jordan, M.I. & Weiss, Y. (2001) On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.\n[5] Newman, M.E.J. (2006) Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74(3).\n[6] Destexhe, A. & Contreras, D. (2006) Neuronal computations with stochastic network states. Science 314(6):85-90.\n[7] Sporns, O., Tononi, G. & Edelman, G.M. (2000) Theoretical neuroanatomy: relating anatomical and functional connectivity in graphs and cortical connection matrices. Cerebral Cortex 10:127-141.\n[8] Bagrow, J., Bollt, E. & Costa, L.F. (2007) On short cycles and their role in network structure. http://arxiv.org/abs/cond-mat/0612502.\n[9] Bianconi, G. & Marsili, M. (2005) Loops of any size and Hamilton cycles in random scale-free networks. Journal of Statistical Mechanics P06005.\n[10] Savchenko, S.V. (1993) The zeta-function and Gibbs measures. Russ. Math. Surv. 48(1):189-190.\n[11] Li, S., Ahmed, S., Klimeck, G. & Darve, E. (2008) Computing entries of the inverse of a sparse matrix using the FIND algorithm. Journal of Computational Physics 227:9408-9427.\n[12] Strehl, A. & Ghosh, J. (2002) Cluster ensembles \u2014 a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3:583-617.\n", "award": [], "sourceid": 41, "authors": [{"given_name": "Deli", "family_name": "Zhao", "institution": null}, {"given_name": "Xiaoou", "family_name": "Tang", "institution": null}]}