{"title": "Learning Taxonomies by Dependence Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 160, "abstract": "We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-of-the-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data.", "full_text": "Learning Taxonomies by Dependence Maximization

Matthew B. Blaschko
Arthur Gretton
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tübingen, Germany
{blaschko,arthur}@tuebingen.mpg.de

Abstract

We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-of-the-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach).
We demonstrate our algorithm on image and text data.

1 Introduction

We address the problem of finding taxonomies in data: that is, to cluster the data, and to specify in a systematic way how the clusters relate. This problem is widely encountered in biology, when grouping different species; and in computer science, when summarizing and searching over documents and images. One of the simpler methods that has been used extensively is agglomerative clustering [18]. One specifies a distance metric and a linkage function that encodes the cost of merging two clusters, and the algorithm greedily agglomerates clusters, forming a hierarchy until at last the final two clusters are merged into the tree root. A related alternative approach is divisive clustering, in which clusters are split at each level, beginning with a partition of all the data, e.g. [19]. Unfortunately, this is also a greedy technique and we generally have no approximation guarantees. More recently, hierarchical topic models [7, 23] have been proposed to model the hierarchical cluster structure of data. These models often rely on the data being representable by multinomial distributions over bags of words, making them suitable for many problems, but their application to arbitrarily structured data is in no way straightforward. Inference in these models often relies on sampling techniques that can affect their practical computational efficiency.

On the other hand, many kinds of data can be easily compared using a kernel function, which encodes the measure of similarity between objects based on their features. Spectral clustering algorithms represent one important subset of clustering techniques based on kernels [24, 21]: the spectrum of an appropriately normalized similarity matrix is used as a relaxed solution to a partition problem.
Spectral techniques have the advantage of capturing global cluster structure of the data, but generally do not give a global solution to the problem of discovering taxonomic structure.

In the present work, we propose a novel unsupervised clustering algorithm, numerical taxonomy clustering, which both clusters the data and learns a taxonomy relating the clusters. Our method works by maximizing a kernel measure of dependence between the observed data, and a product of the partition matrix that defines the clusters with a structure matrix that defines the relationship between individual clusters. This leads to a constrained maximization problem that is in general NP hard, but that can be approximated very efficiently using results in spectral clustering and numerical taxonomy (the latter field addresses the problem of fitting taxonomies to pairwise distance data [1, 2, 4, 8, 11, 15, 25], and contains techniques that allow us to efficiently fit a tree structure to our data with tight approximation guarantees). Aside from its simplicity and computational efficiency, our method has two important advantages over previous clustering approaches. First, it represents a more informative visualization of the data than simple clustering, since the relationship between the clusters is also represented. Second, we find the clustering performance is improved over methods that do not take cluster structure into account, and over methods that impose a cluster distance structure rather than learning it.

Several objectives that have been used for clustering are related to the objective employed here. Bach and Jordan [3] proposed a modified spectral clustering objective that they then maximize either with respect to the kernel parameters or the data partition. Cristianini et al.
[10] proposed a normalized inner product between a kernel matrix and a matrix constructed from the labels, which can be used to learn kernel parameters. The objective we use here is also a normalized inner product between a similarity matrix and a matrix constructed from the partition, but importantly, we include a structure matrix that represents the relationship between clusters. Our work is most closely related to that of Song et al. [22], who used an objective that includes a fixed structure matrix and an objective based on the Hilbert-Schmidt Independence Criterion. Their objective is not normalized, however, and they do not maximize with respect to the structure matrix.

The paper is organized as follows. In Section 2, we introduce a family of dependence measures with which one can interpret the objective function of the clustering approach. The dependence maximization objective is presented in Section 3, and its relation to classical spectral clustering algorithms is explained in Section 3.1. Important results for the optimization of the objective are presented in Sections 3.2 and 3.3. The problem of numerical taxonomy and its relation to the proposed objective function is presented in Section 4, as well as the numerical taxonomy clustering algorithm. Experimental results are given in Section 5.

2 Hilbert-Schmidt Independence Criterion

In this section, we give a brief introduction to the Hilbert-Schmidt Independence Criterion (HSIC), which is a measure of the strength of dependence between two variables (in our case, following [22], these are the data before and after clustering). We begin with some basic terminology in kernel methods. Let $\mathcal{F}$ be a reproducing kernel Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}$, where $\mathcal{X}$ is a separable metric space (our input domain).
To each point $x \in \mathcal{X}$, there corresponds an element $\phi(x) \in \mathcal{F}$ (we call $\phi$ the feature map) such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a unique positive definite kernel. We also define a second RKHS $\mathcal{G}$ with respect to the separable metric space $\mathcal{Y}$, with feature map $\psi$ and kernel $\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = l(y, y')$.

Let $(X, Y)$ be random variables on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $\Pr_{X,Y}$, and associated marginals $\Pr_X$ and $\Pr_Y$. Then following [5, 12], the covariance operator $C_{xy} : \mathcal{G} \to \mathcal{F}$ is defined such that for all $f \in \mathcal{F}$ and $g \in \mathcal{G}$,

$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathbf{E}_{x,y}\left( [f(x) - \mathbf{E}_x(f(x))]\, [g(y) - \mathbf{E}_y(g(y))] \right).$

A measure of dependence is then the Hilbert-Schmidt norm of this operator (the sum of the squared singular values), $\|C_{xy}\|_{HS}^2$. For characteristic kernels [13], this is zero if and only if $X$ and $Y$ are independent. It is shown in [13] that the Gaussian and Laplace kernels are characteristic on $\mathbb{R}^d$. Given a sample of size $n$ from $\Pr_{X,Y}$, the Hilbert-Schmidt Independence Criterion (HSIC) is defined by [14] to be a (slightly biased) empirical estimate of $\|C_{xy}\|_{HS}^2$,

$\mathrm{HSIC} := \operatorname{Tr}[H_n K H_n L], \quad \text{where } H_n = I - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T,$

$\mathbf{1}_n$ is the $n \times 1$ vector of ones, $K$ is the Gram matrix for samples from $\Pr_X$ with $(i,j)$th entry $k(x_i, x_j)$, and $L$ is the Gram matrix with kernel $l(y_i, y_j)$.

3 Dependence Maximization

We now specify how the dependence criteria introduced in the previous section can be used in clustering. We represent our data via an $n \times n$ Gram matrix $M \succeq 0$: in the simplest case, this is the centered kernel matrix ($M = H_n K H_n$), but we also consider a Gram matrix corresponding to normalized cuts clustering (see Section 3.1).
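As an illustration, the biased empirical estimate $\operatorname{Tr}[H_n K H_n L]$ can be computed in a few lines (a minimal sketch assuming NumPy arrays, not the authors' implementation):

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC, Tr[H_n K H_n L], for two n x n Gram matrices
    K and L computed over the same sample with kernels k and l."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H_n = I - (1/n) 1 1^T
    return np.trace(H @ K @ H @ L)
```

Note that HSIC is often reported with an additional $1/(n-1)^2$ scaling; the unscaled trace above matches the form used in this paper.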
Following [22], we define our output Gram matrix to be $L = \Pi Y \Pi^T$, where $\Pi$ is an $n \times k$ partition matrix, $k$ is the number of clusters, and $Y$ is a positive definite matrix that encodes the relationship between clusters (e.g. a taxonomic structure). Our clustering quality is measured according to

$\frac{\operatorname{Tr}[M H_n \Pi Y \Pi^T H_n]}{\sqrt{\operatorname{Tr}[\Pi Y \Pi^T H_n \Pi Y \Pi^T H_n]}}. \quad (1)$

In terms of the covariance operators introduced earlier, we are optimizing HSIC, this being an empirical estimate of $\|C_{xy}\|_{HS}^2$, while normalizing by the empirical estimate of $\|C_{yy}\|_{HS}^2$ (we need not normalize by $\|C_{xx}\|_{HS}^2$, since it is constant). This criterion is very similar to the criterion introduced for use in kernel target alignment [10], the difference being the addition of centering matrices, $H_n$, as required by the definition of the covariance. We remark that the normalizing term $\|H_n \Pi Y \Pi^T H_n\|_{HS}$ was not needed in the structured clustering objective of [22]. This is because Song et al. were interested only in solving for the partition matrix, $\Pi$, whereas we also wish to solve for $Y$: without normalization, the objective can always be improved by scaling $Y$ arbitrarily. In the remainder of this section, we address the maximization of Equation (1) under various simplifying assumptions: these results will then be used in our main algorithm in Section 4.

3.1 Relation to Spectral Clustering

Maximizing Equation (1) is quite difficult given that the entries of $\Pi$ can only take on values in $\{0, 1\}$, and that the row sums have to be equal to 1. In order to more efficiently solve this difficult combinatorial problem, we make use of a spectral relaxation. Consider the case that $\Pi$ is a column vector and $Y$ is the identity matrix.
Equation (1) becomes

$\max_\Pi \frac{\operatorname{Tr}[M H_n \Pi \Pi^T H_n]}{\sqrt{\operatorname{Tr}[\Pi \Pi^T H_n \Pi \Pi^T H_n]}} = \max_\Pi \frac{\Pi^T H_n M H_n \Pi}{\Pi^T H_n \Pi}. \quad (2)$

Setting the derivative with respect to $\Pi$ to zero and rearranging, we obtain

$H_n M H_n \Pi = \frac{\Pi^T H_n M H_n \Pi}{\Pi^T H_n \Pi} H_n \Pi. \quad (3)$

Using the normalization $\Pi^T H_n \Pi = 1$, we obtain the generalized eigenvalue problem

$H_n M H_n \Pi_i = \rho_i H_n \Pi_i, \quad \text{or equivalently} \quad H_n M H_n \Pi_i = \rho_i \Pi_i. \quad (4)$

For $\Pi \in \{0,1\}^{n \times k}$ where $k > 1$, we can recover $\Pi$ by extracting the $k$ eigenvectors associated with the largest eigenvalues. As discussed in [24, 21], the relaxed solution will contain an arbitrary rotation which can be recovered using a reclustering step.

If we choose $M = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ where $A$ is a similarity matrix, and $D$ is the diagonal matrix such that $D_{ii} = \sum_j A_{ij}$, we can recover a centered version of the spectral clustering of [21]. In fact, we wish to ignore the eigenvector with constant entries [24], so the centering matrix $H_n$ does not alter the clustering solution.

3.2 Solving for Optimal $Y \succeq 0$ Given $\Pi$

We now address the subproblem of solving for the optimal structure matrix, $Y$, subject only to positive semi-definiteness, for any $\Pi$. We note that the maximization of Equation (1) is equivalent to the constrained optimization problem

$\max_Y \operatorname{Tr}\left[ M H_n \Pi Y \Pi^T H_n \right] \quad \text{s.t.} \quad \operatorname{Tr}\left[ \Pi Y \Pi^T H_n \Pi Y \Pi^T H_n \right] = 1. \quad (5)$

We write the Lagrangian

$L(Y, \nu) = \operatorname{Tr}\left[ M H_n \Pi Y \Pi^T H_n \right] + \nu \left( 1 - \operatorname{Tr}\left[ \Pi Y \Pi^T H_n \Pi Y \Pi^T H_n \right] \right), \quad (6)$

take the derivative with respect to $Y$, and set to zero, to obtain

$\frac{\partial L}{\partial Y} = \Pi^T H_n M H_n \Pi - 2\nu \left( \Pi^T H_n \Pi Y \Pi^T H_n \Pi \right) = 0, \quad (7)$

which together with the constraint in Equation (5) yields

$Y^* = \frac{\left( \Pi^T H_n \Pi \right)^\dagger \Pi^T H_n M H_n \Pi \left( \Pi^T H_n \Pi \right)^\dagger}{\sqrt{\operatorname{Tr}\left[ \Pi^T H_n M H_n \Pi \left( \Pi^T H_n \Pi \right)^\dagger \Pi^T H_n M H_n \Pi \left( \Pi^T H_n \Pi \right)^\dagger \right]}}, \quad (8)$

where $\dagger$ indicates the Moore-Penrose generalized inverse [17, p. 421]. Because $\left( \Pi^T H_n \Pi \right)^\dagger \Pi^T H_n = H_k \left( \Pi^T \Pi \right)^{-1} \Pi^T H_n$ (see [6, 20]), we note that Equation (8) computes a normalized set kernel between the elements in each cluster. Up to a constant normalization factor, $Y^*$ is equivalent to $H_k \tilde{Y}^* H_k$ where

$\tilde{Y}^*_{ij} = \frac{1}{N_i N_j} \sum_{\iota \in C_i} \sum_{\kappa \in C_j} \tilde{M}_{\iota\kappa}, \quad (9)$

$N_i$ is the number of elements in cluster $i$, $C_i$ is the set of indices of samples assigned to cluster $i$, and $\tilde{M} = H_n M H_n$. This is a standard set kernel as defined in [16].

3.3 Solving for $\Pi$ with the Optimal $Y \succeq 0$

As we have solved for $Y^*$ in closed form in Equation (8), we can plug this result into Equation (1) to obtain a formulation of the problem of optimizing $\Pi^*$ that does not require a simultaneous optimization over $Y$.
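To make the set-kernel form of Equation (9) concrete, the block-averaged structure matrix can be computed directly from a hard cluster assignment (a sketch under the paper's definitions; the centering by $H_k$ and the scalar normalization of Equation (8) are omitted):

```python
import numpy as np

def set_kernel_structure(M, labels, k):
    """Unnormalized optimal structure matrix of Equation (9).

    Entry (i, j) averages the centered Gram matrix Mtilde = H_n M H_n over
    all pairs of points drawn from clusters i and j, i.e. a set kernel
    between clusters [16]."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Mt = H @ M @ H                       # centered Gram matrix
    Pi = np.zeros((n, k))
    Pi[np.arange(n), labels] = 1.0       # partition matrix Pi
    counts = Pi.sum(axis=0)              # cluster sizes N_i
    return (Pi.T @ Mt @ Pi) / np.outer(counts, counts)
```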
Under these conditions, Equation (1) is equivalent to

$\max_\Pi \operatorname{Tr}\left[ \Pi^T H_n M H_n \Pi \left( \Pi^T \Pi \right)^{-1} \Pi^T H_n M H_n \Pi \left( \Pi^T \Pi \right)^{-1} \right]. \quad (10)$

By evaluating the first order conditions on Equation (10), we can see that the relaxed solution, $\Pi^*$, to Equation (10) must lie in the principal subspace of $H_n M H_n$.¹ Therefore, for the problem of simultaneously optimizing the structure matrix, $Y \succeq 0$, and the partition matrix, one can use the same spectral relaxation as in Equation (4), and use the resulting partition matrix to solve for the optimal assignment for $Y$ using Equation (8). This indicates that the optimal partition of the data is the same for $Y$ given by Equation (8) and for $Y = I$. We show in the next section how we can add additional constraints on $Y$ to not only aid in interpretation, but to actually improve the optimal clustering.

4 Numerical Taxonomy

In this section, we consolidate the results developed in Section 3 and introduce the numerical taxonomy clustering algorithm. The algorithm allows us to simultaneously cluster data and learn a tree structure that relates the clusters. The tree structure imposes constraints on the solution, which in turn affect the data partition selected by the clustering algorithm. The data are only assumed to be well represented by some taxonomy, but not any particular topology or structure.

In Section 3 we introduced techniques for solving for $Y$ and $\Pi$ that depend only on $Y$ being constrained to be positive semi-definite. In the interests of interpretability, as well as the ability to influence clustering solutions by prior knowledge, we wish to explore the problem where additional constraints are imposed on the structure of $Y$. In particular, we consider the case that $Y$ is constrained to be generated by a tree metric.
By this, we mean that the distance between any two clusters is consistent with the path length along some fixed tree whose leaves are identified with the clusters. For any positive semi-definite matrix $Y$, we can compute the distance matrix, $D$, given by the norm implied by the inner product that computes $Y$, by assigning $D_{ij} = \sqrt{Y_{ii} + Y_{jj} - 2 Y_{ij}}$. It is sufficient, then, to reformulate the optimization problem given in Equation (1) to add the following constraints that characterize distances generated by a tree metric

$D_{ab} + D_{cd} \le \max\left( D_{ac} + D_{bd},\; D_{ad} + D_{bc} \right) \quad \forall a, b, c, d, \quad (11)$

where $D$ is the distance matrix generated from $Y$. The constraints in Equation (11) are known as the 4-point condition, and were proven in [8] to be necessary and sufficient for $D$ to be a tree metric.

¹For a detailed derivation, see the extended technical report [6].

Optimization problems incorporating these constraints are combinatorial and generally difficult to solve. The problem of numerical taxonomy, or fitting additive trees, is as follows: given a fixed distance matrix, $D$, that fulfills metric constraints, find the solution to

$\min_{D_T} \|D - D_T\| \quad (12)$

with respect to some norm (e.g. $L_1$, $L_2$, or $L_\infty$), where $D_T$ is subject to the 4-point condition. While numerical taxonomy is in general NP hard, a great variety of approximation algorithms with feasible computational complexity have been developed [1, 2, 11, 15].
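For small distance matrices, the 4-point condition of Equation (11) can be verified by brute force (a sketch; it uses the equivalent formulation that, among the three pairings of any four points, the two largest sums must coincide):

```python
from itertools import combinations

def satisfies_four_point(D, tol=1e-9):
    """Brute-force check of the 4-point condition (Equation (11)) on a
    symmetric distance matrix D (nested list or array).

    Equivalent form used here: for every quadruple, the two largest of the
    three pairwise sums D_ab + D_cd, D_ac + D_bd, D_ad + D_bc are equal."""
    n = len(D)
    for a, b, c, d in combinations(range(n), 4):
        s = sorted((D[a][b] + D[c][d], D[a][c] + D[b][d], D[a][d] + D[b][c]))
        if s[2] - s[1] > tol:  # the two largest sums differ: condition violated
            return False
    return True
```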
Given a distance matrix that satisfies the 4-point condition, the associated unrooted tree that generated the matrix can be found in $O(k^2)$ time, where $k$ is equal to the number of clusters [25].

We propose the following iterative algorithm to incorporate the 4-point condition into the optimization of Equation (1):

Require: $M \succeq 0$
Ensure: $(\Pi, Y) \approx (\Pi^*, Y^*)$ that solve Equation (1) with the constraints given in Equation (11)
  Initialize $Y = I$
  Initialize $\Pi$ using the relaxation in Section 3.1
  while convergence has not been reached do
    Solve for $Y$ given $\Pi$ using Equation (8)
    Construct $D$ such that $D_{ij} = \sqrt{Y_{ii} + Y_{jj} - 2 Y_{ij}}$
    Solve for $\min_{D_T} \|D - D_T\|$
    Assign $Y = -\frac{1}{2} H_k (D_T \odot D_T) H_k$, where $\odot$ represents the Hadamard product
    Update $\Pi$ using a normalized version of the algorithm described in [22]
  end while

One can view this optimization as solving the relaxed version of the problem such that $Y$ is only constrained to be positive definite, and then projecting the solution onto the feasible set by requiring $Y$ to be constructed from a tree metric. By iterating the procedure, we can allow $\Pi$ to reflect the fact that it should best fit the current estimate of the tree metric.

5 Experimental Results

To illustrate the effectiveness of the proposed algorithm, we have performed clustering on two benchmark datasets. The face dataset presented in [22] consists of 185 images of three different people, each with three different facial expressions. The authors posited that this would be best represented by a ternary tree structure, where the first level would decide which subject was represented, and the second level would be based on facial expression. In fact, their clustering algorithm roughly partitioned the data in this way when the appropriate structure matrix was imposed.
We will show that our algorithm is able to find a similar structure without supervision, which better represents the empirical structure of the data.

We have also included results for the NIPS 1-12 dataset,² which consists of binarized histograms of the first 12 years of NIPS papers, with a vocabulary size of 13649 and a corpus size of 1740. A Gaussian kernel was used with the normalization parameter set to the median squared distance between points in input space.

5.1 Performance Evaluation on the Face Dataset

We first describe a numerical comparison on the face dataset [22] of the approach presented in Section 4 (where $M = H_n K H_n$ is assigned as in a HSIC objective). We considered two alternative approaches: a classic spectral clustering algorithm [21], and the dependence maximization approach of Song et al. [22]. Because the approach in [22] is not able to learn the structure of $Y$ from the data, we have optimized the partition matrix for 8 different plausible hierarchical structures (Figure 1). These have been constructed by truncating n-ary trees to the appropriate number of leaf nodes. For the evaluation, we have made use of the fact that the desired partition of the data is known for the face dataset, which allows us to compare the predicted clusters to the ground truth labels. For each partition matrix, we compute the conditional entropy of the true labels, $l$, given the cluster ids, $c$, $H(l|c)$, which is related to mutual information by $I(l; c) = H(l) - H(l|c)$. As $H(l)$ is fixed for a given dataset, $\operatorname{argmax}_c I(l; c) = \operatorname{argmin}_c H(l|c)$, and $H(l|c) \ge 0$ with equality only in the case that the clusters are pure [9].

²The NIPS 1-12 dataset is available at http://www.cs.toronto.edu/~roweis/data.html
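The evaluation score can be reproduced with a short routine (a sketch; the base of the logarithm is a free choice, since only relative scores matter):

```python
import math
from collections import Counter

def conditional_entropy(labels, clusters):
    """H(l | c) in bits; zero exactly when every cluster is pure."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))   # counts n_{c,l}
    per_cluster = Counter(clusters)          # counts n_c
    h = 0.0
    for (c, l), n_cl in joint.items():
        h -= (n_cl / n) * math.log2(n_cl / per_cluster[c])
    return h
```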
Table 1 shows that the learned structure and proper normalization of our algorithm result in a partition of the images that much more closely matches the true identities and expressions of the faces, as evidenced by a much lower conditional entropy score than either the spectral clustering approach of [21] or the dependence maximization approach of [22].

Figure 2 shows the discovered taxonomy for the face dataset, where the length of the edges is proportional to the distance in the tree metric (thus, in interpreting the graph, it is important to take into account both the nodes at which particular clusters are connected, and the distance between these nodes; this is by contrast with Figure 1, which only gives the hierarchical cluster structure and does not represent distance). Our results show we have indeed recovered an appropriate tree structure without having to pre-specify the cluster similarity relations.

Figure 1: Structures (a)-(h) used in the optimization of [22]. The clusters are identified with leaf nodes, and distances between the clusters are given by the minimum path length from one leaf to another. Each edge in the graph has equal cost.

Method:  spectral  a       b       c       d       e       f       g       h       taxonomy
H(l|c):  0.5443    0.7936  0.4970  0.6336  0.8652  1.2246  1.1396  1.1325  0.5180  0.2807

Table 1: Conditional entropy scores for spectral clustering [21], the clustering algorithm of [22], and the method presented here (last column). The structures for columns a-h are shown in Figure 1, while the learned structure is shown in Figure 2. The structure for spectral clustering is implicitly equivalent to that in Figure 1(h), as is apparent from the analysis in Section 3.1. Our method exceeds the performance of [21] and [22] for all the structures.

5.2 NIPS Paper Dataset

For the NIPS dataset, we partitioned the documents into k = 8 clusters using the numerical taxonomy clustering algorithm.
Results are given in Figure 3. To allow us to verify the clustering performance, we labeled each cluster using twenty informative words, as listed in Table 2. The most representative words were selected for a given cluster according to a heuristic score $\gamma/\nu - \eta/\tau$, where $\gamma$ is the number of times the word occurs in the cluster, $\eta$ is the number of times the word occurs outside the cluster, $\nu$ is the number of documents in the cluster, and $\tau$ is the number of documents outside the cluster. We observe that not only are the clusters themselves well defined (e.g. cluster a contains neuroscience papers, cluster g covers discriminative learning, and cluster h Bayesian learning), but the similarity structure is also reasonable: clusters d and e, which respectively cover training and applications of neural networks, are considered close, but distant from g and h; these are themselves distant from the neuroscience cluster at a and the hardware papers in b; reinforcement learning gets a cluster at f distant from the remaining topics. Only cluster c appears to be indistinct, and shows no clear theme. Given its placement, we anticipate that it would merge with the remaining clusters for smaller k.

Figure 2: Face dataset and the resulting taxonomy that was discovered by the algorithm.

Figure 3: The taxonomy discovered for the NIPS dataset. Words that represent the clusters are given in Table 2.

a: neurons, cells, model, cell, visual, neuron, activity, synaptic, response, firing, cortex, stimulus, spike, cortical, frequency, orientation, motion, direction, spatial, excitatory
b: chip, circuit, analog, voltage, current, figure, vlsi, neuron, output, circuits, synapse, motion, pulse, neural, input, digital, gate, cmos, silicon, implementation
c: memory, dynamics, image, neural, hopfield, control, system, inverse, energy, capacity, object, field, motor, computational, network, images, subjects, model, associative, attractor
d: network, units, learning, hidden, networks, input, training, output, unit, weights, error, weight, neural, layer, recurrent, net, time, back, propagation, number
e: training, recognition, network, speech, set, word, performance, neural, networks, trained, classification, layer, input, system, features, test, classifier, classifiers, feature, image
f: state, learning, policy, action, reinforcement, optimal, control, function, time, states, actions, agent, algorithm, reward, sutton, goal, dynamic, step, programming, rl
g: function, error, algorithm, functions, learning, theorem, class, linear, examples, case, training, vector, bound, generalization, set, approximation, bounds, loss, algorithms, dimension
h: data, model, models, distribution, gaussian, likelihood, parameters, algorithm, mixture, em, bayesian, posterior, probability, density, variables, prior, log, approach, matrix, estimation

Table 2: Representative words for the NIPS dataset clusters.

6 Conclusions and Future Work

We have introduced a new algorithm, numerical taxonomy clustering, for simultaneously clustering data and discovering a taxonomy that relates the clusters. The algorithm is based on a dependence maximization approach, with the Hilbert-Schmidt Independence Criterion as our measure of dependence. We have shown several interesting theoretical results regarding dependence maximization clustering. First, we established the relationship between dependence maximization and spectral clustering.
Second, we showed that the optimal positive definite structure matrix takes the form of a set kernel, where sets are defined by cluster membership. This result applied to the original dependence maximization objective indicates that the inclusion of an unconstrained structure matrix does not affect the optimal partition matrix. In order to remedy this, we proposed to include constraints that guarantee $Y$ to be generated from an additive metric. Numerical taxonomy clustering allows us to optimize the constrained problem efficiently.

In our experiments on grouping facial expressions, numerical taxonomy clustering is more accurate than the existing approaches of spectral clustering and clustering with a fixed predefined structure. We were also able to fit a taxonomy to NIPS papers that resulted in a reasonable and interpretable clustering by subject matter. In both the facial expression and NIPS datasets, similar clusters are close together on the resulting tree. We conclude that numerical taxonomy clustering is a useful tool both for improving the accuracy of clusterings and for the visualization of complex data.

Our approach presently relies on the combinatorial optimization introduced in [22] in order to optimize $\Pi$ given a fixed estimate of $Y$. We believe that this step may be improved by relaxing the problem similarly to Section 3.1. Likewise, automatic selection of the number of clusters is an interesting area of future work. We cannot expect to use the criterion in Equation (1) to select the number of clusters because increasing the size of $\Pi$ and $Y$ can never decrease the objective. However, the elbow heuristic can be applied to the optimal value of Equation (1), which is closely related to the eigengap approach.
Another interesting line of work is to consider optimizing a clustering objective derived from the Hilbert-Schmidt Normalized Independence Criterion (HSNIC) [13].

Acknowledgments

This work is funded by the EC projects CLASS, IST 027978, PerAct, EST 504321, and by the Pascal Network, IST 2002-506778. We would also like to thank Christoph Lampert for simplifying the Moore-Penrose generalized inverse.

References

[1] R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup. On the approximability of numerical taxonomy (fitting distances by tree metrics). In SODA, pages 365–372, 1996.
[2] N. Ailon and M. Charikar. Fitting tree metrics: Hierarchical clustering and phylogeny. In Foundations of Computer Science, pages 73–82, 2005.
[3] F. R. Bach and M. I. Jordan. Learning spectral clustering, with application to speech separation. JMLR, 7:1963–2001, 2006.
[4] R. Baire. Leçons sur les Fonctions Discontinues. Gauthier-Villars, 1905.
[5] C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
[6] M. B. Blaschko and A. Gretton. Taxonomy inference using kernel dependence measures. Technical report, Max Planck Institute for Biological Cybernetics, 2008.
[7] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS 16, 2004.
[8] P. Buneman. The Recovery of Trees from Measures of Dissimilarity. In D. Kendall and P. Tautu, editors, Mathematics in the Archeological and Historical Sciences, pages 387–395. Edinburgh U.P., 1971.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[10] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In NIPS 14, 2002.
[11] M. Farach, S. Kannan, and T. Warnow. A robust model for finding optimal evolutionary trees. In STOC, pages 137–145, 1993.
[12] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73–99, 2004.
[13] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, 2008.
[14] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–78, 2005.
[15] B. Harb, S. Kannan, and A. McGregor. Approximating the best-fit tree under lp norms. In APPROX-RANDOM, pages 123–133, 2005.
[16] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.
[17] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
[18] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[19] P. Macnaughton-Smith, W. Williams, M. Dale, and L. Mockett. Dissimilarity analysis: a new technique of hierarchical subdivision. Nature, 202:1034–1035, 1965.
[20] C. D. Meyer, Jr. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.
[21] A. Y. Ng, M. I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an Algorithm. In NIPS, pages 849–856, 2001.
[22] L. Song, A. Smola, A. Gretton, and K. M. Borgwardt. A Dependence Maximization View of Clustering. In ICML, pages 815–822, 2007.
[23] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, 2006.
[24] U. von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4):395–416, 2007.
[25] M. S. Waterman, T. F. Smith, M. Singh, and W. A. Beyer. Additive Evolutionary Trees. Journal of Theoretical Biology, 64:199–213, 1977.
", "award": [], "sourceid": 831, "authors": [{"given_name": "Matthew", "family_name": "Blaschko", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}]}