{"title": "An Iterative Improvement Procedure for Hierarchical Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 481, "page_last": 488, "abstract": "", "full_text": "An iterative improvement procedure for hierarchical clustering\n\nDavid Kauchak\nDepartment of Computer Science\nUniversity of California, San Diego\ndkauchak@cs.ucsd.edu\n\nSanjoy Dasgupta\nDepartment of Computer Science\nUniversity of California, San Diego\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nWe describe a procedure which finds a hierarchical clustering by hill-climbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms.\n\n1 Introduction\n\nA hierarchical clustering of n data points is a recursive partitioning of the data into 2, 3, 4, ... and finally n clusters. Each intermediate clustering is made more fine-grained by splitting one of its clusters. It is natural to depict this process as a tree whose leaves are the data points and whose interior nodes represent intermediate clusters. Such hierarchical representations are very popular \u2013 they depict a data set at multiple levels of granularity, simultaneously; they require no prior specification of the number of clusters; and there are several simple heuristics for constructing them [2, 3].\n\nSome of these heuristics \u2013 such as average-linkage \u2013 implicitly try to create clusters of small \u201cradius\u201d throughout the hierarchy. However, to the best of our knowledge, there is so far no procedure which specifically hillclimbs the space of hierarchical clusterings according to a precise objective function.
Given the heuristic nature of existing algorithms, it would be most helpful to be able to call an iterative improvement procedure on their output. In particular, we seek an analogue of k-means for hierarchical clustering. Taken literally this is possible only to a certain extent \u2013 the basic object we are dealing with is a tree rather than a partition \u2013 but k-means has closely informed many aspects of our procedure, and has determined our choice of objective function.\n\nWe use a canonical tree representation of a hierarchical clustering, in which the leaves are data points, and the interior nodes are ordered; such a clustering is specified completely by a tree structure and by an ordering of nodes. Our cost function is a hierarchical extension of the k-means cost function, and is the same cost function which motivates average-linkage schemes. Our iterative procedure alternates between two simple moves:\n\n1. The ordering of nodes is kept fixed, and one subtree is relocated. This is the natural generalization of a standard heuristic clustering move in which a data point is transferred from one cluster to another.\n\n2. The tree structure is kept fixed, and its interior nodes are reordered optimally.\n\nWe show that by exploiting properties of Euclidean distance (which underlies the k-means cost function and therefore ours as well), these tasks can be performed efficiently. For instance, the second one can be transformed into a problem in VLSI design and job scheduling called minimum linear arrangement. In general this problem is NP-hard, but for our particular case it is known [4] to be efficiently solvable, in O(n log n) time. After motivating and describing our model and our algorithm, we end with some experimental results.\n\n2 The model\n\n2.1 The space of trees\n\nA hierarchical clustering of n points contains n different clusterings, nested within each other.
It is often depicted using a dendrogram, such as the one below on the left (for a data set of five points). We will use the term k-clustering, and the notation $C_k$, to denote the grouping into k clusters. One of these clusters is divided in two to yield the (k + 1)-clustering $C_{k+1}$, and so on. Instead of a dendrogram, it is convenient to use a rooted binary tree (shown below on the right) in which the leaves are data points and internal nodes have exactly two children, so there are $2n - 1$ nodes overall. Each internal node is annotated with a unique \u201csplit number\u201d between 1 and $n - 1$. These satisfy the property that the split number of a parent is less than that of its children; so the root is numbered 1. The k-clustering is produced by removing the internal nodes numbered $1, 2, 3, \\ldots, k - 1$; each cluster consists of (the leaves in) one of the resulting connected components.\n\n[Figure: a dendrogram (left) and the corresponding rooted binary tree with split numbers 1 to 4 (right), for five points a, b, c, d, e.]\n\n\u2022 2-clustering: $\\{a, b, e\\}, \\{c, d\\}$\n\u2022 3-clustering: $\\{a, b\\}, \\{e\\}, \\{c, d\\}$\n\u2022 4-clustering: $\\{a\\}, \\{b\\}, \\{e\\}, \\{c, d\\}$\n\nHenceforth we will use \u201cnode i\u201d to mean \u201cthe internal node with split number i\u201d. The maximal subtree rooted at this node is $T_i$; the mean of its data points (leaves) is called $\\mu_i$. To summarize, a hierarchical clustering is specified by: a binary tree with the data points at the leaves; and an ordering of the internal nodes.\n\n2.2 Cost function\n\nIf the clusters of $C_k$ are $S_1, S_2, \\ldots, S_k$, then the k-means cost function is\n\n$$\\mathrm{cost}(C_k) = \\sum_{j=1}^{k} \\sum_{x \\in S_j} \\|x - \\mu(S_j)\\|^2,$$\n\nwhere $\\mu(S)$ is the mean of set S. To evaluate a hierarchical clustering, we need to combine the costs of all n intermediate clusterings, and we do so in the most obvious way, by a linear combination.
We take the overall cost of the hierarchical clustering to be\n\n$$\\sum_{k=1}^{n} w_k \\cdot \\mathrm{cost}(C_k),$$\n\nwhere the $w_k$ are non-negative weights which add up to one. The default choice is to make all $w_k = 1/n$, but in general the specific application will dictate the choice of weights. A decreasing schedule $w_1 > w_2 > w_3 > \\cdots > w_n$ places more emphasis upon coarser clusterings (i.e. small k); a setting $w_k = 1$ singles out a particular intermediate clustering.\n\nAlthough many features of our cost function are familiar from the simpler k-means setting, there is one which is worth pointing out. Consider the set of six points shown here:\n\n[Figure: a set of six points.]\n\nUnder the k-means cost function, it is clear what the best 2-clustering is (three points in each cluster). It is similarly clear what the best 3-clustering is, but this cannot be nested within the best 2-clustering. In other words, the imposition of a hierarchical structure forces certain tradeoffs between the intermediate clusterings. This particular feature is fundamental to hierarchical clustering, and in our cost function it is laid bare. By adjusting the weights $w_k$, the user can bias this tradeoff according to his or her particular needs.\n\nIt is worth pointing out that $\\mathrm{cost}(C_k)$ decreases as k increases; as more clusters are allowed, the data can be modeled with less error. This means that even when all the weights $w_k$ are identical, the smaller values of k contribute more to the cost function, and therefore, a procedure for minimizing this function must implicitly focus a little more on smaller k than on larger k. This is the sort of bias we usually seek. If we wanted to further emphasize small values of k, we could for instance use an exponentially decreasing schedule of weights, i.e. $w_k = c \\cdot \\alpha^k$, where $\\alpha < 1$ and where c is a normalization constant.\n\nNotice that any given subtree $T_j$ can appear as an individual cluster in many of the clusterings $C_k$.
If $\\pi(j)$ denotes the parent of j, then $T_j$ first appears as its own cluster in $C_{\\pi(j)+1}$, and is part of all the successive clusterings up to and including $C_j$. At that point, it gets split in two.\n\n2.3 Relation to previous work\n\nThe most commonly used heuristics for hierarchical clustering are agglomerative. They work bottom-up, starting with each data point in its own cluster, and then repeatedly merging the two \u201cclosest\u201d clusters until finally all the points are grouped together in one cluster. The different schemes are distinguished by their measure of closeness between clusters.\n\n1. Single linkage \u2013 the distance between two clusters S and T is taken to be the distance between their closest pair of points, i.e. $\\min_{x \\in S, y \\in T} \\|x - y\\|$.\n\n2. Complete linkage uses the distance between the farthest pair of points, i.e. $\\max_{x \\in S, y \\in T} \\|x - y\\|$.\n\n3. Average linkage seems to have now become a generic term encompassing at least three different measures of distance between clusters.\n\n(a) (Sokal-Michener) $\\|\\mu(S) - \\mu(T)\\|^2$\n(b) $\\frac{1}{|S| \\cdot |T|} \\sum_{x \\in S, y \\in T} \\|x - y\\|^2$\n(c) (Ward\u2019s method) $\\frac{|S| \\cdot |T|}{|S| + |T|} \\|\\mu(S) - \\mu(T)\\|^2$\n\nAverage linkage appears to be the most widely used of these; for instance, it is a standard tool for analyzing gene expression data [1]. The three average linkage distance functions are all trying to minimize something very much like our cost function. In particular, Ward\u2019s measure of the distance between two clusters is exactly the increase in k-means cost occasioned by merging those clusters. For our experimental comparisons, we have therefore chosen Ward\u2019s method.\n\n3 Local moves\n\nEach element of the search space is a tree structure in which the data points are leaves and in which the interior nodes are ordered.
A quick calculation shows that this space has size $n((n - 1)!)^2 / 2^{n-1}$ (consider the sequence of $n - 1$ merge operations which create the tree from the data set). We consider two moves for navigating the space, along the lines of the standard \u201calternating optimization\u201d paradigm of k-means and EM:\n\n1. keep the structure fixed and reorder the internal nodes optimally;\n2. keep the ordering of the internal nodes fixed and alter the structure by relocating some subtree.\n\nA key concern in the design of these local moves is efficiency. A k-means update takes O(kn) time; in our situation the analogue would be $O(n^2)$ time since we are dealing with all values of k. Ideally, however, we\u2019d like a faster update. For our first move \u2013 reordering internal nodes \u2013 we show that a previously-known scheduling algorithm [4] can be adapted to solve this task (in the case of uniform weights) in just O(n log n) time. For the second move, we show that any given subtree can be relocated optimally in O(n) time, using just a single pass through the tree. These efficiency results are nontrivial; a crucial step in obtaining them is to exploit special properties of squared Euclidean distance. In particular, we write our cost function in three different, but completely equivalent, ways; and we switch back and forth between these:\n\n1. In the form given above (the definition).\n2. We define the cost of a subtree $T_i$ to be $\\mathrm{cost}(T_i) = \\sum_{x \\in T_i} \\|x - \\mu_i\\|^2$ (where the sum is over leaf nodes), that is, the cost of the single cluster rooted at node i. Then the overall cost is a linear combination of subtree costs. Specifically, it is\n\n$$\\sum_{j=1}^{n-1} W_{\\pi(j),j} \\cdot \\mathrm{cost}(T_j), \\qquad (1)$$\n\nwhere $\\pi(j)$ is the parent of node j and $W_{ij} = w_{i+1} + w_{i+2} + \\cdots + w_j$.\n\n3. 
We annotate each tree edge (i, j) (i is the parent of j > i) by $\\|\\mu_i - \\mu_j\\|^2$; the overall cost is also a linear combination of these edge weights, specifically,\n\n$$\\sum_{(k,l) \\in T} W_k \\cdot n_l \\cdot \\|\\mu_k - \\mu_l\\|^2, \\qquad (2)$$\n\nwhere $W_k = w_1 + w_2 + \\cdots + w_k$ and $n_l$ is the number of leaves in subtree $T_l$.\n\nAll proofs are in a technical report [5] which can be obtained from the authors. To give a hint for why these alternative formulations of the cost function are valid, we briefly mention a simple \u201cbias-variance\u201d decomposition of squared Euclidean distance:\n\nSuppose S is a set of points with mean $\\mu_S$. Then for any $\\mu$,\n\n$$\\sum_{x \\in S} \\|x - \\mu\\|^2 = \\sum_{x \\in S} \\|x - \\mu_S\\|^2 + |S| \\cdot \\|\\mu - \\mu_S\\|^2.$$\n\n3.1 The graft\n\nIn a graft move, an entire subtree is moved to a different location, as shown below. The letters a, b, i, ... denote split numbers of interior nodes; here the subtree $T_j$ is moved. The only prerequisite (to ensure a consistent ordering) is a < i < b.\n\n[Figure: a tree before (left) and after (right) a graft move, with interior nodes labeled 1, a, b, h, i, j, k.]\n\nFirst of all, a basic sanity check: this move enables us to traverse the entire search space.\n\nClaim. Any two hierarchical clusterings are connected by a sequence of graft operations.\n\nIt is important to find good grafts efficiently. Suppose we want to move a subtree $T_j$; what is the best place for it? Evaluating the cost of a hierarchical clustering takes O(n) time using equation (1) and doing a single, bottom-up pass. Since there are O(n) possible locations for $T_j$, naively it seems like evaluating all of them would take $O(n^2)$ time. In fact, the best relocation of $T_j$ can be computed in just O(n) time, in a single pass over the tree.\n\nTo see why this is possible, notice that in the diagram above, the movement of $T_j$ affects only the subtrees on the path between a and h.
Some of these subtrees get bigger ($T_j$ is added to them); others shrink ($T_j$ is removed). The precise change in cost of any given subtree $T_l$ on this path is easy to compute:\n\nClaim. If subtree $T_j$ is merged into $T_l$, then the cost of $T_l$ goes up by\n\n$$\\Delta^+_l = \\mathrm{cost}(T_l \\cup T_j) - \\mathrm{cost}(T_l) = \\mathrm{cost}(T_j) + \\frac{n_l n_j}{n_l + n_j} \\cdot \\|\\mu_l - \\mu_j\\|^2.$$\n\nClaim. If subtree $T_j \\subset T_l$ is removed from $T_l$, then the cost of $T_l$ changes by\n\n$$\\Delta^-_l = \\mathrm{cost}(T_l - T_j) - \\mathrm{cost}(T_l) = -\\mathrm{cost}(T_j) - \\frac{n_j n_l}{n_l - n_j} \\cdot \\|\\mu_l - \\mu_j\\|^2.$$\n\nUsing (1), the total change in cost from grafting $T_j$ between a, b (as depicted above) can be found by adding terms of the form $W_{\\pi(l),l} \\Delta^{\\pm}_l$, for nodes l on the path between j and a. This suggests a two-pass algorithm for optimally relocating $T_j$: in the first pass over the tree, for each $T_l$, the potential cost change from adding/removing $T_j$ is computed. The second pass finds the best location. In fact, these can be combined into a single pass [5].\n\n3.2 Reordering internal nodes\n\nLet $V_{\\mathrm{int}}$ be the interior nodes of the tree; if there are n data points (leaves), then $|V_{\\mathrm{int}}| = n - 1$. For any $x \\in V_{\\mathrm{int}}$, let $T_x$ be the maximal subtree rooted at x, which contains all the descendants of x. Let $n_x$ be the number of leaves in this subtree. If x has children y and z, then the goodness of split at x is the reduction in cost obtained by splitting cluster $T_x$,\n\n$$\\mathrm{cost}(T_x) - (\\mathrm{cost}(T_y) + \\mathrm{cost}(T_z)),$$\n\nwhich we henceforth denote g(x) (for leaves g(x) = 0).
Again using properties of Euclidean distance, we can rewrite it thus:\n\n$$g(x) = n_y \\|\\mu_x - \\mu_y\\|^2 + n_z \\|\\mu_x - \\mu_z\\|^2.$$\n\nPriority queue operations: makequeue, max, deletemax, union, insert.\nLinked list operations: $\\circ$ (concatenation).\n\nprocedure reorder(T)\n  u <- root of T\n  Q <- makequeue(u)\n  while Q is not empty\n    L <- deletemax(Q)\n    Output elements of list L, in order\n\nfunction makequeue(x)\n  if x is a leaf return { }\n  let y, z be the children of x\n  Q <- union(makequeue(y), makequeue(z))\n  r <- $n_y \\|\\mu_x - \\mu_y\\|^2 + n_z \\|\\mu_x - \\mu_z\\|^2$\n  L <- [x]\n  while Q is not empty and r < r(max(Q))\n    L' <- deletemax(Q)\n    r <- (r |L| + r(L') |L'|) / (|L| + |L'|)\n    L <- L $\\circ$ L'\n  r(L) <- r\n  insert(Q, L)\n  return Q\n\nFigure 1: The reordering move. Here Q is a priority queue of linked lists. Each list L has a value r(L); and Q is ordered according to these.\n\nWe wish to find a numbering $\\sigma : V_{\\mathrm{int}} \\to \\{1, 2, \\ldots, n - 1\\}$ which\n\u2013 respects the precedence constraints of the tree: if x is the parent of y then $\\sigma(x) < \\sigma(y)$.\n\u2013 minimizes the overall cost of the hierarchical clustering. Assuming uniform weights $w_k = 1/n$, this cost can be seen (by manipulating equation (2)) to be\n\n$$\\frac{1}{n} \\sum_{x \\in V_{\\mathrm{int}}} \\sigma(x) g(x).$$\n\nNotice that this is essentially a scheduling problem. There is a \u201ctask\u201d (a split) corresponding to each $x \\in V_{\\mathrm{int}}$. We would like to schedule the good tasks (with high g(x)) early on; in the language of clustering, if there are particularly useful splits (which lead to well separated clusters), we would like to perform them early in the hierarchy. And there are precedence constraints which must be respected: certain splits must precede others.\n\nThe naive greedy solution \u2013 always pick the node with highest g(x), subject to precedence constraints \u2013 doesn\u2019t work.
The reason: it is quite possible that a particular split has low g(x)-value, but that it leads to other splits of very high value. A greedy algorithm would schedule this split very late; an algorithm with some \u201clookahead\u201d capability would realize the value of this split and schedule it early.\n\nHorn [4] has a scheduling algorithm which obtains the optimal ordering, in the case where all the weights $w_k$ are equal, and can be implemented in O(n log n) time. We believe it can be extended to exponentially decaying, \u201cmemoryless\u201d weights, i.e. $w_k = c \\cdot \\alpha^k$, where $\\alpha < 1$ and c is some normalization constant.\n\nWe now present an overview of Horn\u2019s algorithm. For each node $x \\in V$, define r(x) to be the maximum, over all subtrees T (not necessarily maximal) rooted at x, of $\\frac{1}{|T|} \\sum_{z \\in T} g(z)$ (in words, the average of $g(\\cdot)$ over nodes of T). This value r(x) is a more reliable indicator of the utility of split x than the immediate return g(x). Once these r(x) are known, the optimal numbering is easy to find: pick nodes in decreasing order of $r(\\cdot)$ while respecting the precedence constraints. So the main goal is to compute the r(x) for all x in the tree. This can be done by a short divide-and-conquer procedure in O(n log n) time (Figure 1).\n\nFigure 2: (a) Five data points. (b)\u2013(f) Iteratively improving the hierarchical clustering.\n\nFigure 3: (a) Four points on a line. (b) Average linkage. (c) Optimal tree.\n\n4 Experiments\n\nIn the experiments, we used uniform weights $w_k = 1/n$.
In each iteration of our procedure, we did a reordering of the nodes, and performed one graft \u2013 by trying each possible subtree (all O(n) of them), determining the optimal move for that subtree, and greedily picking the best move. We would prefer a more efficient, randomized way to pick which subtree to graft \u2013 either completely randomly, or biased by a simple criterion like \u201camount it deviates from the center of its parent cluster\u201d; this is future work.\n\nSimple examples. To give some concrete intuition, Figure 2 shows the sequence of moves taken on a toy example involving five data points in the plane. The initial tree (b) is random and has a cost of 62.25. A single graft (c) reduces the cost to 27. A reordering (d), swapping 2 and 3, reduces the cost to 25.5, and a further graft (e) and reordering (f) result in the final tree, which is optimal and has cost 21.\n\nFigure 3 demonstrates a typical failing of average linkage. The initial greedy merger of points b, c gives a small early benefit but later turns out to be a bad idea; yet the resulting tree is only one graft away from being optimal. Really bad cases for average linkage can be constructed by recursively compounding this simple instance.\n\nA larger data set. Average linkage is often used in the analysis of gene expression data.\n\n[Figure 4 panels: (a) % improvement over average linkage, plotted against k; (b) cost, plotted against iterations.]\n\nFigure 4: (a) On the left, a comparison with average linkage. (b) On the right, the behavior of the cost function over the 80 iterations required for convergence.\n\nWe tried our method on the yeast data of [1].
We randomly chose clean subsets (no missing entries) of varying sizes from this data set, and tried the following on each: average linkage, our method initialized randomly, and our method initialized with average linkage.\n\nThere were two clear trends. First of all: our method, whether initialized randomly or with average linkage, systematically did better than average linkage, not only for the particular aggregate cost function we are using, but across the whole spectrum of values of k. Figure 4(a), obtained on a 500-point data set, shows for each k, the percent by which the (induced) k-clustering found by our method (initialized with average linkage) improved upon that found by average linkage; the metric here is the k-means cost function. This is a fair comparison because both methods are explicitly trying to minimize this cost. Notice that an improvement in the aggregate (weighted average) is to be expected, since we are hillclimbing based on this measure. What was reassuring to us was that this improvement came across at almost all values of k (especially the smaller ones), rather than by negotiating some unexpected tradeoff between different values of k. This experiment also indicates that, in general, the output of average linkage has real scope for improvement.\n\nSecond, our method often took an order of magnitude (ten or more times) longer to converge if initialized randomly than if initialized with average linkage, even though better solutions were often found with random initialization. We therefore prefer starting with average linkage. On the few examples we tried, there was a period of rapid improvement involving grafts of large subtrees, followed by a long series of minor \u201cfixes\u201d; see Figure 4(b), which refers again to the 500-point data set mentioned earlier.\n\nReferences\n\n[1] T.L. Ferea et al. Systematic changes in gene expression patterns following adaptive evolution in yeast.
Proceedings of the National Academy of Sciences, 97, 1999.\n[2] J.A. Hartigan. Clustering algorithms. Wiley, 1975.\n[3] J.A. Hartigan. Statistical theory in clustering. Journal of Classification, 1985.\n[4] W.A. Horn. Single-machine job sequencing with treelike precedence ordering and linear delay penalties. SIAM Journal on Applied Mathematics, 23:189\u2013202, 1972.\n[5] D. Kauchak and S. Dasgupta. Manuscript, 2003.", "award": [], "sourceid": 2500, "authors": [{"given_name": "David", "family_name": "Kauchak", "institution": null}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}]}