{"title": "Robust Clustering as Ensembles of Affinity Relations", "book": "Advances in Neural Information Processing Systems", "page_first": 1414, "page_last": 1422, "abstract": "In this paper, we regard clustering as ensembles of k-ary affinity relations and clusters correspond to subsets of objects with maximal average affinity relations. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogenous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k \u2265 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.", "full_text": "Robust Clustering as Ensembles of Af\ufb01nity Relations\n\nHairong Liu1, Longin Jan Latecki2, Shuicheng Yan1\n\n1Department of Electrical and Computer Engineering, National University of Singapore, Singapore\n\n2Department of Computer and Information Sciences, Temple University, Philadelphia, USA\n\nlhrbss@gmail.com,latecki@temple.edu,eleyans@nus.edu.sg\n\nAbstract\n\nIn this paper, we regard clustering as ensembles of k-ary af\ufb01nity relations and\nclusters correspond to subsets of objects with maximal average af\ufb01nity relations.\nThe average af\ufb01nity relation of a cluster is relaxed and well approximated by a\nconstrained homogenous function. 
We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k \u2265 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.\n\n1 Introduction\nData clustering is a fundamental problem in many fields, such as machine learning, data mining and computer vision [1]. Unfortunately, there is no universally accepted definition of a cluster, probably because of the diverse forms of clusters in real applications. But it is generally agreed that the objects belonging to a cluster satisfy a certain internal coherence condition, while the objects not belonging to a cluster usually do not satisfy this condition.\nMost existing clustering methods are partition-based, such as k-means [2], spectral clustering [3, 4, 5] and affinity propagation [6]. These methods implicitly share an assumption: every data point must belong to a cluster. This assumption greatly simplifies the problem, since we do not need to judge whether a data point is an outlier or not, which is very challenging. 
However, this assumption also results in poor performance of these methods when there exists a large number of outliers, as is frequently the case in real-world applications.\nThe criteria to judge whether several objects belong to the same cluster or not are typically expressed by pairwise relations, which are encoded as the weights of an affinity graph. However, in many applications, high-order relations are more appropriate, and may even be the only choice, which naturally results in hyperedges in hypergraphs. For example, when clustering a given set of points into lines, pairwise relations are not meaningful, since every pair of data points trivially defines a line. However, for every three data points, whether they are nearly collinear or not conveys very important information.\nAs the graph-based clustering problem has been well studied, many researchers have tried to deal with hypergraph-based clustering by using existing graph-based clustering methods. One direction is to transform a hypergraph into a graph, whose edge weights are mapped from the weights of the original hypergraph. Zien et al. [7] proposed two approaches called \u201cclique expansion\u201d and \u201cstar expansion\u201d, respectively, for such a purpose. Rodriguez [8] showed the relationship between the spectral properties of the Laplacian matrix of the resulting graph and the minimum cut of the original hypergraph. Agarwal et al. [9] proposed the \u201cclique averaging\u201d method and reported better results than the \u201cclique expansion\u201d method. Another direction is to generalize graph-based clustering methods to hypergraphs. Zhou et al. [10] generalized the well-known \u201cnormalized cut\u201d method [5] and defined a hypergraph normalized cut criterion for a k-partition of the vertices. 
Shashua et al. [11] cast the clustering problem with high-order relations into a nonnegative factorization problem of the closest hyper-stochastic version of the input affinity tensor.\nBased on game theory, Bulo and Pelillo [12] proposed to consider the hypergraph-based clustering problem as a multi-player non-cooperative \u201cclustering game\u201d and solve it by the replicator equation, which is in fact a generalization of their previous work [13]. This new formulation has a solid theoretical foundation, possesses several appealing properties, and achieved state-of-the-art results. This method is in fact a special case of our proposed method, and we will discuss this point in Section 2.\nIn this paper, we propose a unified method for clustering from k-ary affinity relations, which is applicable to both graph-based and hypergraph-based clustering problems. Our method is motivated by an intuitive observation: for a cluster with m objects, there may exist $\\binom{m}{k}$ possible k-ary affinity relations, and most (sometimes even all) of these k-ary affinity relations should agree with each other on the same criterion. For example, in the line clustering problem, for m points on the same line, there are $\\binom{m}{3}$ possible triplets, and all these triplets should satisfy the criterion that they lie on a line. An ensemble of such a large number of affinity relations is hardly produced by outliers and is also very robust to noise, thus yielding a robust mechanism for clustering.\n\n2 Formulation\nClustering from k-ary affinity relations can be intuitively described as clustering on a special kind of edge-weighted hypergraph, the k-graph. 
Formally, a k-graph is a triplet $G = (V, E, w)$, where $V = \\{1, \\dots, n\\}$ is a finite set of vertices, with each vertex representing an object, $E \\subseteq V^k$ is the set of hyperedges, with each hyperedge representing a k-ary affinity relation, and $w : E \\to \\mathbb{R}$ is a weighting function which associates a real value (possibly negative) with each hyperedge, with larger weights representing stronger affinity relations. We only consider the k-ary affinity relations with no duplicate objects, that is, the hyperedges among k different vertices. For hyperedges with duplicated vertices, we simply set their weights to zero.\nEach hyperedge $e \\in E$ involves k vertices, and thus can be represented as a k-tuple $\\{v_1, \\dots, v_k\\}$. The weighted adjacency array of graph $G$ is an $\\underbrace{n \\times n \\times \\cdots \\times n}_{k}$ super-symmetric array, denoted by $M$, and defined as\n\n$$M(v_1, \\dots, v_k) = \\begin{cases} w(\\{v_1, \\dots, v_k\\}) & \\text{if } \\{v_1, \\dots, v_k\\} \\in E, \\\\ 0 & \\text{else.} \\end{cases} \\tag{1}$$\n\nNote that each edge $\\{v_1, \\dots, v_k\\} \\in E$ has $k!$ duplicate entries in the array $M$.\nFor a subset $U \\subseteq V$ with $m$ vertices, its edge set is denoted as $E_U$. If $U$ is really a cluster, then most of the hyperedges in $E_U$ should have large weights. The simplest measure to reflect such an ensemble phenomenon is the sum of all entries in $M$ whose corresponding hyperedges contain only vertices in $U$, which can be expressed as:\n\n$$S(U) = \\sum_{v_1, \\dots, v_k \\in U} M(v_1, \\dots, v_k). \\tag{2}$$\n\nSuppose $y$ is an $n \\times 1$ indicator vector of the subset $U$, such that $y_{v_i} = 1$ if $v_i \\in U$ and zero otherwise, then $S(U)$ can be expressed as:\n\n$$S(U) = S(y) = \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k) \\underbrace{y_{v_1} \\cdots y_{v_k}}_{k}. \\tag{3}$$\n\nObviously, $S(U)$ usually increases as the number of vertices in $U$ increases. Since $\\sum_i y_i = m$ and there are $m^k$ summands in $S(U)$, the average of these entries can be expressed as:\n\n$$\\begin{aligned} S_{av}(U) &= \\frac{1}{m^k} S(y) = \\frac{1}{m^k} \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k)\\, y_{v_1} \\cdots y_{v_k} \\\\ &= \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k)\\, \\frac{y_{v_1}}{m} \\cdots \\frac{y_{v_k}}{m} \\\\ &= \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k)\\, x_{v_1} \\cdots x_{v_k} \\\\ &= \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k) \\prod_{i=1}^{k} x_{v_i}, \\end{aligned} \\tag{4}$$\n\nwhere $x = y/m$. As $\\sum_i y_i = m$, $\\sum_i x_i = 1$ is a natural constraint over $x$.\nIntuitively, when $U$ is a true cluster, $S_{av}(U)$ should be relatively large. Thus, the clustering problem corresponds to the problem of maximizing $S_{av}(U)$. In essence, this is a combinatorial optimization problem, since we know neither $m$ nor which $m$ objects to select. As this problem is NP-hard, to reduce its complexity, we relax each component of $x$ to lie within a continuous range $[0, \\varepsilon]$, where $\\varepsilon \\le 1$ is a constant, while keeping the constraint $\\sum_i x_i = 1$. Then the problem becomes:\n\n$$\\max f(x) = \\sum_{v_1, \\dots, v_k \\in V} M(v_1, \\dots, v_k) \\prod_{i=1}^{k} x_{v_i} \\quad \\text{subject to } x \\in \\Delta^n \\text{ and } x_i \\in [0, \\varepsilon], \\tag{5}$$\n\nwhere $\\Delta^n = \\{x \\in \\mathbb{R}^n : x \\ge 0 \\text{ and } \\sum_i x_i = 1\\}$ is the standard simplex in $\\mathbb{R}^n$. Note that $S_{av}(x)$ is abbreviated by $f(x)$ to simplify the formula.\nThe adoption of the $\\ell_1$-norm in (5) not only lets $x_i$ have an intuitive probabilistic meaning, that is, $x_i$ represents the probability that the cluster contains the $i$-th object, but also makes the solution sparse, which means that some objects are automatically selected to form a cluster, while the other objects are ignored.\nRelation to Clustering Game. In [12], Bulo and Pelillo proposed to cast the hypergraph-based clustering problem into a clustering game, which leads to a formulation similar to (5). In fact, their formulation is a special case of (5) when $\\varepsilon = 1$. Setting $\\varepsilon < 1$ means that the probability of choosing each strategy (from the game theory perspective) or choosing each object (from our perspective) has a known upper bound, which is in fact a prior, while $\\varepsilon = 1$ represents a noninformative prior. This point is essential in many applications, since it avoids the phenomenon where some components of $x$ dominate. For example, if the weight of a hyperedge is extremely large, then the cluster may only select the vertices associated with this hyperedge, which is usually not desirable. In fact, $\\varepsilon$ offers us a tool to control the least number of objects in a cluster. Since each component does not exceed $\\varepsilon$, the cluster contains at least $\\lceil 1/\\varepsilon \\rceil$ objects, where $\\lceil z \\rceil$ represents the smallest integer larger than or equal to $z$. Because of the constraint $x_i \\in [0, \\varepsilon]$, the solution is also totally different from [12].\n3 Algorithm\nFormulation (5) usually has many local maxima. Large maxima correspond to true clusters and small maxima usually form meaningless subsets. In this section, we first analyze the properties of the maximizer $x^*$, which are critical in algorithm design, and then introduce our algorithm to calculate $x^*$.\nSince the formulation (5) is a constrained optimization problem, by adding Lagrangian multipliers $\\lambda$, $\\mu_1, \\dots, \\mu_n$ and $\\beta_1, \\dots, \\beta_n$, with $\\mu_i \\ge 0$ and $\\beta_i \\ge 0$ for all $i = 1, \\dots, n$, we can obtain its Lagrangian function:\n\n$$L(x, \\lambda, \\mu, \\beta) = f(x) - \\lambda \\Big( \\sum_{i=1}^{n} x_i - 1 \\Big) + \\sum_{i=1}^{n} \\mu_i x_i + \\sum_{i=1}^{n} \\beta_i (\\varepsilon - x_i). \\tag{6}$$\n\nThe reward at vertex $i$, denoted by $r_i(x)$, is defined as follows:\n\n$$r_i(x) = \\sum_{v_1, \\dots, v_{k-1} \\in V} M(v_1, \\dots, v_{k-1}, i) \\prod_{t=1}^{k-1} x_{v_t}. \\tag{7}$$\n\nSince $M$ is a super-symmetric array, $\\frac{\\partial f(x)}{\\partial x_i} = k r_i(x)$, i.e., $r_i(x)$ is proportional to the gradient of $f(x)$ at $x$.\nAny local maximizer $x^*$ must satisfy the Karush-Kuhn-Tucker (KKT) condition [14], i.e., the first-order necessary conditions for local optimality. That is,\n\n$$\\begin{cases} k r_i(x^*) - \\lambda + \\mu_i - \\beta_i = 0, \\quad i = 1, \\dots, n, \\\\ \\sum_{i=1}^{n} x_i^* \\mu_i = 0, \\\\ \\sum_{i=1}^{n} (\\varepsilon - x_i^*) \\beta_i = 0. \\end{cases} \\tag{8}$$\n\nSince $x_i^*$, $\\mu_i$ and $\\beta_i$ are all nonnegative for all $i$, $\\sum_{i=1}^{n} x_i^* \\mu_i = 0$ is equivalent to saying that if $x_i^* > 0$, then $\\mu_i = 0$, and $\\sum_{i=1}^{n} (\\varepsilon - x_i^*) \\beta_i = 0$ is equivalent to saying that if $x_i^* < \\varepsilon$, then $\\beta_i = 0$. Hence, the KKT conditions can be rewritten as:\n\n$$r_i(x^*) \\begin{cases} \\le \\lambda/k, & x_i^* = 0, \\\\ = \\lambda/k, & x_i^* > 0 \\text{ and } x_i^* < \\varepsilon, \\\\ \\ge \\lambda/k, & x_i^* = \\varepsilon. \\end{cases} \\tag{9}$$\n\nAccording to $x$, the vertex set $V$ can be divided into three disjoint subsets, $V_1(x) = \\{i \\mid x_i = 0\\}$, $V_2(x) = \\{i \\mid x_i \\in (0, \\varepsilon)\\}$ and $V_3(x) = \\{i \\mid x_i = \\varepsilon\\}$. Equation (9) characterizes the properties of the solution of (5), which are further summarized in the following theorem.\nTheorem 1. If $x^*$ is the solution of (5), then there exists a constant $\\eta$ ($= \\lambda/k$) such that 1) the rewards at all vertices belonging to $V_1(x^*)$ are not larger than $\\eta$; 2) the rewards at all vertices belonging to $V_2(x^*)$ are equal to $\\eta$; and 3) the rewards at all vertices belonging to $V_3(x^*)$ are not smaller than $\\eta$.\nProof: Since the KKT condition is a necessary condition, according to (9), the solution $x^*$ must satisfy 1), 2) and 3).\nThe set of non-zero components is $V_d(x) = V_2(x) \\cup V_3(x)$ and the set of the components which are smaller than $\\varepsilon$ is $V_u(x) = V_1(x) \\cup V_2(x)$. For any $x$, if we want to update it to increase $f(x)$, then the values of some components belonging to $V_d(x)$ must decrease and the values of some components belonging to $V_u(x)$ must increase. According to Theorem 1, if $x$ is the solution of (5), then $r_i(x) \\le r_j(x)$, $\\forall i \\in V_u(x)$, $\\forall j \\in V_d(x)$. On the contrary, if $\\exists i \\in V_u(x)$, $\\exists j \\in V_d(x)$ such that $r_i(x) > r_j(x)$, then $x$ is not the solution of (5). In fact, in such a case, we can increase $x_i$ and decrease $x_j$ to increase $f(x)$. That is, let\n\n$$x'_l = \\begin{cases} x_l, & l \\ne i,\\ l \\ne j; \\\\ x_l + \\alpha, & l = i; \\\\ x_l - \\alpha, & l = j, \\end{cases} \\tag{10}$$\n\nand define\n\n$$r_{ij}(x) = \\sum_{v_1, \\dots, v_{k-2}} M(v_1, \\dots, v_{k-2}, i, j) \\prod_{t=1}^{k-2} x_{v_t}. \\tag{11}$$\n\nThen\n\n$$f(x') - f(x) = -k(k-1) r_{ij}(x) \\alpha^2 + k (r_i(x) - r_j(x)) \\alpha. \\tag{12}$$\n\nSince $r_i(x) > r_j(x)$, we can always select a proper $\\alpha > 0$ to increase $f(x)$. According to formula (10) and the constraint over $x_i$, $\\alpha \\le \\min(x_j, \\varepsilon - x_i)$. Since $r_i(x) > r_j(x)$, if $r_{ij}(x) \\le 0$, then the increase of $f(x)$ reaches its maximum when $\\alpha = \\min(x_j, \\varepsilon - x_i)$; if $r_{ij}(x) > 0$, then the increase of $f(x)$ reaches its maximum when $\\alpha = \\min\\big(x_j, \\varepsilon - x_i, \\frac{r_i(x) - r_j(x)}{2(k-1) r_{ij}(x)}\\big)$.\nAccording to the above analysis, if $\\exists i \\in V_u(x)$, $\\exists j \\in V_d(x)$ such that $r_i(x) > r_j(x)$, then we can update $x$ to increase $f(x)$. Such a procedure iterates until $r_i(x) \\le r_j(x)$, $\\forall i \\in V_u(x)$, $\\forall j \\in V_d(x)$. From a prior (initialization) $x(0)$, the algorithm to compute the local maximizer of (5) is summarized in Algorithm 1, which successively chooses the \u201cbest\u201d vertex and the \u201cworst\u201d vertex and then updates their corresponding components of $x$.\nSince significant maxima of formulation (5) usually correspond to true clusters, we need multiple initializations (priors) to obtain them, with at least one initialization at the basin of attraction of every significant maximum. 
Such informative priors can in fact be easily and efficiently constructed from the neighborhood of every vertex (the vertices with hyperedges connecting to this vertex), because the neighbors of a vertex generally have much higher probabilities of belonging to the same cluster.\n\nAlgorithm 1 Compute a local maximizer $x^*$ from a prior $x(0)$\n1: Input: Weighted adjacency array $M$, prior $x(0)$;\n2: repeat\n3:   Compute the reward $r_i(x)$ for each vertex $i$;\n4:   Compute $V_1(x(t))$, $V_2(x(t))$, $V_3(x(t))$, $V_d(x(t))$, and $V_u(x(t))$;\n5:   Find the vertex $i$ in $V_u(x(t))$ with the largest reward and the vertex $j$ in $V_d(x(t))$ with the smallest reward;\n6:   Compute $\\alpha$ and update $x(t)$ by formula (10) to obtain $x(t+1)$;\n7: until $x$ is a local maximizer\n8: Output: The local maximizer $x^*$.\n\nAlgorithm 2 Construct a prior $x(0)$ containing vertex $v$\n1: Input: Hyperedge set $E(v)$ and $\\varepsilon$;\n2: Sort the hyperedges in $E(v)$ in descending order according to their weights;\n3: for $i = 1, \\dots, |E(v)|$ do\n4:   Add all vertices associated with the $i$-th hyperedge to $L$. If $|L| \\ge \\lceil 1/\\varepsilon \\rceil$, then break;\n5: end for\n6: For each vertex $v_j \\in L$, set the corresponding component $x_{v_j}(0) = 1/|L|$;\n7: Output: a prior $x(0)$.\n\nFor a vertex $v$, the set of hyperedges connected to $v$ is denoted by $E(v)$. We can construct a prior containing $v$ from $E(v)$, which is described in Algorithm 2.\nBecause of the constraint $x_i \\le \\varepsilon$, the initializations need to contain at least $\\lceil 1/\\varepsilon \\rceil$ nonzero components. To cover the basins of attraction of more maxima, we expect these initializations to be located more uniformly in the space $\\{x \\mid x \\in \\Delta^n, x_i \\le \\varepsilon\\}$.\nSince we can construct such a prior from every vertex, we can construct $n$ priors in total. From these $n$ priors, according to Algorithm 1, we can obtain $n$ maxima. The significant maxima of (5) are usually among these $n$ maxima, and a significant maximum may appear multiple times. 
In this way, we can robustly obtain multiple clusters simultaneously, and these clusters may overlap, both of which are desirable properties in many applications. Note that the clustering game approach [12] utilizes a noninformative prior, that is, all vertices have equal probability. Thus, it cannot obtain multiple clusters simultaneously. In the clustering game approach [12], if $x_i(t) = 0$, then $x_i(t+1) = 0$, which means that it can only drop points: if a point is initially not included, then it can never be selected. However, our method can automatically add or drop points, which is another key difference from the clustering game approach.\nIn each iteration of Algorithm 1, we only need to consider two components of $x$, which makes both the update of rewards and the update of $x(t)$ very efficient. As $f(x(t))$ increases, the sizes of $V_u(x(t))$ and $V_d(x(t))$ both decrease quickly, thus $f(x)$ converges to a local maximum quickly. Suppose the maximal number of hyperedges containing a certain vertex is $h$, then the time complexity of Algorithm 1 is $O(thk)$, where $t$ is the number of iterations. The total time complexity of our method is then $O(nthk)$, since we need to run Algorithm 1 from $n$ initializations.\n4 Experiments\nWe evaluate our method on three types of experiments. The first one addresses the problem of line clustering, the second addresses the problem of illumination-invariant face clustering, and the third addresses the problem of affine-invariant point set matching. We compare our method with the clique averaging algorithm [9] and the clustering game approach [12]. In all experiments, the clique averaging approach needs to know the number of clusters in advance; however, both the clustering game approach and our method can automatically reveal the number of clusters, which is an advantage of the latter two in many applications.\n4.1 Line Clustering\nIn this experiment, we consider the problem of clustering lines in 2D point sets. 
Pairwise similarity measures are useless in this case, and at least three points are needed for characterizing such a property. The dissimilarity measure on triplets of points is given by their mean distance to the best fitting line. If $d(i, j, k)$ is the dissimilarity measure of points $\\{i, j, k\\}$, then the similarity function is given by $s(\\{i, j, k\\}) = \\exp(-d(i, j, k)^2 / \\sigma_d^2)$, where $\\sigma_d$ is a scaling parameter, which controls the sensitivity of the similarity measure to deformation.\nWe randomly generate three lines within the region $[-0.5, 0.5]^2$, each line contains 30 points, and all these points have been perturbed by Gaussian noise $N(0, \\sigma)$. We also randomly add outliers into the point set. Fig. 1(a) illustrates such a point set with three lines shown in red, blue and green colors, respectively, and the outliers are shown in magenta color. To evaluate the performance, we ran all algorithms on the same data set over 30 trials with varying parameter values, and the performance is measured by F-measure.\nWe first fix the number of outliers to be 60 and vary the scaling parameter $\\sigma_d$ from 0.01 to 0.14; the result is shown in Fig. 1(b). For our method, we set $\\varepsilon = 1/30$. Our method is hardly affected by the scaling parameter $\\sigma_d$, while the clustering game approach is very sensitive to $\\sigma_d$. Note that $\\sigma_d$ in fact controls the weights of the hypergraph, and many graph-based algorithms are notoriously sensitive to the weights of the graph. Instead, by setting a proper $\\varepsilon$, our method overcomes this problem. From Fig. 1(b), we observe that when $\\sigma_d = 4\\sigma$, the clustering game approach achieves the best performance. Thus, we fix $\\sigma_d = 4\\sigma$, and change the noise parameter $\\sigma$ from 0.01 to 0.1; the results of the clustering game approach, the clique averaging algorithm and our method are shown in blue, green and red colors in Fig. 1(c), respectively. 
As the figure shows, when the noise is small, the clustering game approach outperforms the clique averaging algorithm, and when the noise becomes large, the clique averaging algorithm outperforms the clustering game approach. This is because the clustering game approach is more robust to outliers, while the clique averaging algorithm seems more robust to noise. Our method always gets the best result, since it can not only select coherent clusters as the clustering game approach does, but also control the size of clusters, thus avoiding the problem of too few points being selected into clusters.\nIn Fig. 1(d) and Fig. 1(e), we vary the number of outliers from 10 to 100. The results clearly demonstrate that our method and the clustering game approach are robust to outliers, while the clique averaging algorithm is very sensitive to outliers, since it is a partition-based method and every point must be assigned to a cluster. To illustrate the influence of $\\varepsilon$, we fix $\\sigma_d = \\sigma = 0.02$, and test the performance of our method under different $\\varepsilon$; the result is shown in Fig. 1(f), where the x axis is $1/\\varepsilon$. As we stressed in Section 2, the clustering game approach is in fact a special case of our method when $\\varepsilon = 1$, thus, the result at $\\varepsilon = 1$ is nearly the same as the result of the clustering game approach in Fig. 1(b) under the same conditions. As $1/\\varepsilon$ approaches the real number of points in the cluster, the result becomes much better. Note that the best result appears when $1/\\varepsilon > 30$, which is due to the fact that some outliers fall into the line clusters, as can be seen in Fig. 1(a).\n\n4.2 Illumination-invariant face clustering\nIt has been shown that the variability of images of a Lambertian surface in fixed pose, but under variable lighting conditions where no surface point is shadowed, constitutes a three dimensional linear subspace [15]. This leads to a natural measure of dissimilarity over four images, which can be used for clustering. In fact, this is a generalization of the k-lines problem into the k-subspaces problem. If we assume that the four images under consideration form the columns of a matrix, and normalize each column by its $\\ell_2$ norm, then $d = \\frac{s_4^2}{s_1^2 + \\cdots + s_4^2}$ serves as a natural measure of dissimilarity, where $s_i$ is the $i$-th singular value of this matrix.\nIn our experiments we use the Yale Face Database B and its extended version [16], which contains 38 individuals, each under 64 different illumination conditions. Since in some lighting conditions the images are severely shadowed, we delete these images and do the experiments on a subset (about 35 images for each individual). We considered cases where we have faces from 4 and 5 random individuals (randomly choosing 10 faces for each individual), with and without outliers. The case with outliers consists of 10 additional faces, each from a different individual. For each of those combinations, we ran 10 trials to obtain the average F-measures (mean and standard deviation), and the result is reported in Table 1. Note that for each algorithm, we individually tune the parameters to obtain the best results. The results clearly show that the partition-based clustering method (clique averaging) is very sensitive to outliers, but performs better when there are no outliers. The clustering game approach and our method both perform well, especially when there are outliers, and our method performs a little better.\n\nFigure 1: Results on clustering three lines with noise and outliers. The performance of the clique averaging algorithm [9], the clustering game approach [12] and our method is shown as green dashed, blue dotted and red solid curves, respectively. 
This figure is best viewed in color.\n\nTable 1: Experiments on illumination-invariant face clustering (F-measure, mean \u00b1 standard deviation)\n\nClasses          | 4           | 4           | 5           | 5\nOutliers         | 0           | 10          | 0           | 10\nClique Averaging | 0.95 \u00b1 0.05 | 0.84 \u00b1 0.08 | 0.93 \u00b1 0.05 | 0.83 \u00b1 0.07\nClustering Game  | 0.92 \u00b1 0.04 | 0.90 \u00b1 0.04 | 0.91 \u00b1 0.06 | 0.90 \u00b1 0.07\nOur Method       | 0.93 \u00b1 0.04 | 0.92 \u00b1 0.05 | 0.92 \u00b1 0.07 | 0.91 \u00b1 0.04\n\n4.3 Affine-invariant Point Set Matching\nAn important problem in object recognition is the fact that an object can be seen from different viewpoints, resulting in differently deformed images. Consequently, invariance to viewpoint is a desirable property for many vision tasks. It is well known that a near-planar object seen from different viewpoints can be modeled by affine transformations. In this subsection, we will show that matching planar point sets under different viewpoints can be formulated as a hypergraph clustering problem and that our algorithm is very suitable for such tasks.\nSuppose the two point sets are $P$ and $Q$, with $n_P$ and $n_Q$ points, respectively. Each point in $P$ may match any point in $Q$, thus there are $n_P n_Q$ candidate matches. Under the affine transformation $A$, for three correct matches, $m_{ii'}$, $m_{jj'}$ and $m_{kk'}$, $\\frac{S_{ijk}}{S_{i'j'k'}} = |\\det(A)|$, where $S_{ijk}$ is the area of the triangle formed by points $i$, $j$ and $k$ in $P$, $S_{i'j'k'}$ is the area of the triangle formed by points $i'$, $j'$ and $k'$ in $Q$, and $\\det(A)$ is the determinant of $A$. If we regard each candidate match as a point, then $s = \\exp\\big(-\\frac{(S_{ijk} - S_{i'j'k'} |\\det(A)|)^2}{\\sigma_d^2}\\big)$ serves as a natural similarity measure for three points (candidate matches) $m_{ii'}$, $m_{jj'}$ and $m_{kk'}$, where $\\sigma_d$ is a scaling parameter, and the correct matching configuration then naturally forms a cluster. Note that in this problem, most of the candidate matches are incorrect matches, and can be considered to be outliers.\nWe did the experiments on 8 shapes from the MPEG-7 shape database [17]. For each shape, we uniformly sample its contour into 20 points. Both the shapes and sampled point sets are shown in Fig. 2. We regard the original contour point sets as the sets $P$, then randomly add Gaussian noise $N(0, \\sigma)$, and transform them by randomly generated affine matrices $A$ to form the corresponding sets $Q$. Fig. 3(a) shows such a pair of $P$ and $Q$ in red and blue, respectively. Since most of the points (candidate matches) should not belong to any cluster, partition-based clustering methods, such as the clique averaging method, cannot be used. Thus, we only compare our method with the clustering game approach and measure the performance of these two methods by counting how many matches agree with the ground truth. Since $|\\det(A)|$ is unknown, we estimate its range and sample several possible values in this range, and conduct the experiment for each possible $|\\det(A)|$. In Fig. 3(b), we fix the noise parameter $\\sigma = 0.05$, and test the robustness of both methods under a varying scaling parameter $\\sigma_d$. Our method is very robust to $\\sigma_d$, while the clustering game approach is very sensitive to it. In Fig. 3(c), we increase $\\sigma$ from 0.04 to 0.16, and for each $\\sigma$, we adjust $\\sigma_d$ to reach the best performance for both methods. As expected, our method is more robust to noise by benefiting from the parameter $\\varepsilon$, which is set to 0.05 in both Fig. 3(b) and Fig. 3(c). In Fig. 3(d), we fix $\\sigma = 0.05$ and $\\sigma_d = 0.15$, and test the performance of our method under different $\\varepsilon$. 
The result again veri\ufb01es\nthe importance of the parameter \u03b5.\n\nFigure 2: The shapes and corresponding contour point sets used in our experiment.\n\nFigure 3: Performance curves on af\ufb01ne-invariant point set matching problem. The red solid curves\ndemonstrate the performance of our method, while the blue dotted curve illustrates the performance\nof matching game approach.\n\n5 Discussion\nIn this paper, we characterized clustering as an ensemble of all associated af\ufb01nity relations and relax\nthe clustering problem into optimizing a constrained homogenous function. We showed that the\nclustering game approach turns out to be a special case of our method. We also proposed an ef\ufb01cient\nalgorithm to automatically reveal the clusters in a data set, even under severe noises and a large num-\nber of outliers. The experimental results demonstrated the superiority of our approach with respect\nto the state-of-the-art counterparts. Especially, our method is not sensitive to the scaling parameter\nwhich affects the weights of the graph, and this is a very desirable property in many applications. A\nkey issue with hypergraph-based clustering is the high computational cost of the construction of a\nhypergraph, and we are currently studying how to ef\ufb01ciently construct an approximate hypergraph\nand then perform clustering on the incomplete hypergraph.\n6 Acknowledgement\nThis research is done for CSIDM Project No. CSIDM-200803 partially funded by a grant from the\nNational Research Foundation (NRF) administered by the Media Development Authority (MDA) of\nSingapore, and this work has also been partially supported by the NSF Grants IIS-0812118, BCS-\n0924164 and the AFOSR Grant FA9550-09-1-0207.\n\n8\n\n\fReferences\n[1] A. Jain, M. Murty, and P. Flynn, \u201cData clustering: a review,\u201d ACM Computing Surveys, vol. 31,\n\nno. 3, pp. 264\u2013323, 1999.\n\n[2] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. 
Wu, \u201cAn efficient k-means clustering algorithm: Analysis and implementation,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881\u2013892, 2002.\n\n[3] A. Ng, M. Jordan, and Y. Weiss, \u201cOn spectral clustering: Analysis and an algorithm,\u201d in Advances in Neural Information Processing Systems, vol. 2, 2002, pp. 849\u2013856.\n\n[4] I. Dhillon, Y. Guan, and B. Kulis, \u201cKernel k-means: spectral clustering and normalized cuts,\u201d in Proceedings of the tenth ACM International Conference on Knowledge Discovery and Data Mining, 2004, pp. 551\u2013556.\n\n[5] J. Shi and J. Malik, \u201cNormalized cuts and image segmentation,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888\u2013905, 2000.\n\n[6] B. Frey and D. Dueck, \u201cClustering by passing messages between data points,\u201d Science, vol. 315, no. 5814, pp. 972\u2013976, 2007.\n\n[7] J. Zien, M. Schlag, and P. Chan, \u201cMultilevel spectral hypergraph partitioning with arbitrary vertex sizes,\u201d IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1389\u20131399, 1999.\n\n[8] J. Rodriguez, \u201cOn the Laplacian spectrum and walk-regular hypergraphs,\u201d Linear and Multilinear Algebra, vol. 51, no. 3, pp. 285\u2013297, 2003.\n\n[9] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie, \u201cBeyond pairwise clustering,\u201d in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 838\u2013845.\n\n[10] D. Zhou, J. Huang, and B. Scholkopf, \u201cLearning with hypergraphs: Clustering, classification, and embedding,\u201d in Advances in Neural Information Processing Systems, vol. 19, 2007, pp. 1601\u20131608.\n\n[11] A. Shashua, R. Zass, and T. 
Hazan, \u201cMulti-way clustering using super-symmetric non-negative tensor factorization,\u201d in European Conference on Computer Vision, 2006, pp. 595\u2013608.\n\n[12] S. Bulo and M. Pelillo, \u201cA game-theoretic approach to hypergraph clustering,\u201d in Advances in Neural Information Processing Systems, 2009.\n\n[13] M. Pavan and M. Pelillo, \u201cDominant sets and pairwise clustering,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 167\u2013172, 2007.\n\n[14] H. Kuhn and A. Tucker, \u201cNonlinear programming,\u201d ACM SIGMAP Bulletin, pp. 6\u201318, 1982.\n\n[15] P. Belhumeur and D. Kriegman, \u201cWhat is the set of images of an object under all possible illumination conditions?\u201d International Journal of Computer Vision, vol. 28, no. 3, pp. 245\u2013260, 1998.\n\n[16] K. Lee, J. Ho, and D. Kriegman, \u201cAcquiring linear subspaces for face recognition under variable lighting,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684\u2013698, 2005.\n\n[17] L. Latecki, R. Lakamper, and T. Eckhardt, \u201cShape descriptors for non-rigid shapes with a single closed contour,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2000, pp. 65\u201372.\n", "award": [], "sourceid": 306, "authors": [{"given_name": "Hairong", "family_name": "Liu", "institution": null}, {"given_name": "Longin", "family_name": "Latecki", "institution": null}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": null}]}