{"title": "Agglomerative Multivariate Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": null, "full_text": "Agglomerative Multivariate Information \n\nBottleneck \n\nSchool of Computer Science & Engineering, Hebrew University, Jerusalem 91904, Israel \n\nNoam Sionim Nir Friedman Naftali Tishby \n\n{noamm, nir, tishby } @cs.huji.ac.il \n\nAbstract \n\nThe information bottleneck method is an unsupervised model independent data \norganization technique. Given a joint distribution peA, B), this method con(cid:173)\nstructs a new variable T that extracts partitions, or clusters, over the values of A \nthat are informative about B. In a recent paper, we introduced a general princi(cid:173)\npled framework for multivariate extensions of the information bottleneck method \nthat allows us to consider multiple systems of data partitions that are inter-related. \nIn this paper, we present a new family of simple agglomerative algorithms to \nconstruct such systems of inter-related clusters. We analyze the behavior of these \nalgorithms and apply them to several real-life datasets. \n\n1 Introduction \n\nThe information bottleneck (IB) method of Tishby et al [14] is an unsupervised non(cid:173)\nparametric data organization technique. Given a joint distribution P(A, B), this method \nconstructs a new variable T that represents partitions of A which are (locally) maximizing \nthe mutual information about B. In other words, the variable T induces a sufficient par(cid:173)\ntition, or informative features of the variable A with respect to B. The construction of T \nfinds a tradeoff between the information about A that we try to minimize, J(T; A), and \nthe information about B which we try to maximize, J(T ; B). 
This approach is particularly useful for co-occurrence data, such as words and documents [12], where we want to capture what information one variable (e.g., use of a word) contains about the other (e.g., the document).\n\nIn a recent paper, Friedman et al. [4] introduced a multivariate extension of the IB principle. This extension allows us to consider cases where the data partition is relevant with respect to several variables, or where we construct several systems of clusters simultaneously. In this framework, we specify the desired interactions by a pair of Bayesian networks. One network, G_in, represents which variables are compressed versions of the observed variables: each new variable compresses its parents in the network. The second network, G_out, defines the statistical relationships between these new variables and the observed variables that should be maintained.\n\nSimilar to the original IB, in Friedman et al. we formulated the general principle as a tradeoff between the (multi) information each network carries. On the one hand, we want to minimize the information maintained by G_in, and on the other, to maximize the information maintained by G_out. We also provide a characterization of stationary points in this tradeoff as a set of self-consistent equations. Moreover, we prove that iterating these equations converges to a (local) optimum. Then, we describe a deterministic annealing procedure that constructs a solution by tracking the bifurcation of clusters as it traverses the tradeoff curve, similar to the original IB method.\n\nIn this paper, we consider an alternative approach to solving multivariate IB problems, which is motivated by the success of the agglomerative IB of Slonim and Tishby [11]. As shown there, bottom-up greedy agglomeration is a simple heuristic procedure that can yield good solutions to the original IB problem. 
Here we extend this idea in the context of multivariate IB problems. We start by analyzing the cost of agglomeration steps within this framework. This both elucidates the criteria that guide greedy agglomeration and provides efficient local evaluation rules for agglomeration steps. This construction results in a novel family of information-theoretic agglomerative clustering algorithms that can be specified using the graphs G_in and G_out. We demonstrate the performance of some of these algorithms for document and word clustering and gene expression analysis.\n\n2 Multivariate Information Bottleneck\n\nA Bayesian network structure G is a DAG that specifies interactions among variables [8]. A distribution P is consistent with G (denoted P |= G) if P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}), where Pa_{X_i} are the parents of X_i in G. [...] p(t_j, t_e) > 0 and T_j, T_e co-appear in some information term in I^{G_out}.\n\nThis proposition is particularly useful when we consider \"hard\" clustering, where T_j is a (deterministic) function of U_j. In this case, p(t_j, t_e) is often zero (especially when T_j and T_e compress similar variables, i.e., U_j ∩ U_e ≠ ∅). In particular, after the merger {t_j^l, t_j^r} ⇒ t̄_j, we do not have to reevaluate merger costs of other values of T_j, except for mergers of t̄_j with each of these values.\n\nIn the case of hard clustering we also find that I(T_j; U_j) = H(T_j) (where H(P) is Shannon's entropy). Roughly speaking, we may say that H(P) decreases for less balanced probability distributions P. Therefore, increasing β^{-1} will result in a tendency to look for less balanced \"hard\" partitions, and vice versa. This is reflected by the fact that the last term in ΔL(t^l, t^r) is then simplified through JS_Π(p(U_j | t^l), p(U_j | t^r)) = H(Π).\n\n5 Examples\n\nWe now briefly consider three examples of the general methodology. For brevity we focus on the simpler case of hard clustering. We first consider the example shown in Figure 1(a). 
This choice of graphs results in the original IB problem. The merger cost in this case is given by\n\nΔL(t^l, t^r) = p(t̄) · (JS_Π(p(B | t^l), p(B | t^r)) - β^{-1} H(Π)).   (5)\n\nNote that for β^{-1} → 0 we get exactly the algorithm presented in [11].\n\nOne simple extension of the original IB is the parallel bottleneck [4]. In this case we introduce two variables T_1 and T_2 as in Figure 1(b), both of which are functions of A. Similarly to the original IB, G_out specifies that T_1 and T_2 should predict B. We can think of this requirement as an attempt to decompose the information A contains about B into two \"orthogonal\" components. In this case, the merger cost for T_1 is given by\n\nΔL(t_1^l, t_1^r) = p(t̄_1) · (E_{p(·|t̄_1)}[JS_Π(p(B | t_1^l, T_2), p(B | t_1^r, T_2))] - β^{-1} H(Π)).   (6)\n\nFinally, we consider the symmetric bottleneck [4, 12]. In this case, we want to compress A into T_A and B into T_B, so that T_A extracts the information A contains about B, and at the same time T_B extracts the information B contains about A. The DAG G_in of Figure 1(c) captures the form of the compression. The choice of G_out is less obvious, and several alternatives are described in [4]. Here, we concentrate on only one option, shown in Figure 1(c). In this case we attempt to make each of T_A and T_B sufficient to separate A from B. Thus, on one hand we attempt to compress, and on the other hand we attempt to make T_A and T_B as informative about each other as possible. The merger cost in T_A is given by\n\nΔL(t_A^l, t_A^r) = p(t̄_A) · (JS_Π(p(T_B | t_A^l), p(T_B | t_A^r)) - (β^{-1} - 1) H(Π)),   (7)\n\nwhile for merging in T_B we get an analogous expression.\n\n6 Applications\n\nWe examine a few applications of the examples presented above. As one data set we used a subset of the 20 newsgroups corpus [6], where we randomly chose 2000 documents evenly distributed among the 4 science discussion groups (sci.crypt, sci.
electronics, sci.med and sci.space).² Our pre-processing included ignoring file headers (and the subject lines), lowering upper case, and ignoring words that contained non-'a..z' characters. Given this document set we can evaluate the joint probability p(W, D), which is the probability that a random word position is equal to w ∈ W and, at the same time, the document is d ∈ D. We sorted all words by their contribution to I(W; D) and used only the 2000 'most informative' ones, ending up with a joint probability with |W| = |D| = 2000.\n\nWe first used the original IB to cluster W, while trying to preserve the information about D. This was already done in [12] with β^{-1} = 0, but in this new experiment we took β^{-1} = 0.15. Recall that increasing β^{-1} results in a tendency to find less balanced clusters. Indeed, while for β^{-1} = 0 we got relatively balanced word clusters (high H(T_W)), for β^{-1} = 0.15 the probability p(T_W) is much less smooth. For 50 word clusters, one cluster contained almost half of the words, while the other clusters were typically much smaller. Since the algorithm also tries to maximize I(T_W; D), the words merged into the big cluster are usually the less informative words about D. Thus, a word must be highly informative to stay out of this cluster. In this sense, increasing β^{-1} is equivalent to inducing a \"noise filter\" that leaves only the most informative features in specific clusters. In Figure 2 we present p(D | t_w) for several clusters t_w ∈ T_W. Clearly, words that passed the \"filter\" form much more informative clusters about the real structure of D. A more formal demonstration of this effect is given in the right panel of Figure 2. For a given compression level (i.e., a given I(T_W; W)), we see that taking β^{-1} = 0.15 preserves much more information about D. 
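As an illustration of the quantity being traded off here, the hard-clustering merger cost of Eq. (5) can be sketched in a few lines. This is our own minimal sketch, not the authors' code; the function and variable names are ours, and distributions are represented as plain NumPy probability vectors.

```python
import numpy as np

def js_divergence(p, q, pi_l, pi_r):
    """Jensen-Shannon divergence JS_Pi(p, q) with merge weights Pi = (pi_l, pi_r), in bits."""
    m = pi_l * p + pi_r * q
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return pi_l * kl(p, m) + pi_r * kl(q, m)

def merger_cost(p_tl, p_tr, p_b_given_tl, p_b_given_tr, beta_inv=0.0):
    """Eq. (5): cost of merging clusters t^l, t^r into t-bar,
    p(t-bar) * (JS_Pi(p(B|t^l), p(B|t^r)) - beta^{-1} * H(Pi))."""
    p_merged = p_tl + p_tr                         # p(t-bar)
    pi_l, pi_r = p_tl / p_merged, p_tr / p_merged  # merge weights Pi
    js = js_divergence(p_b_given_tl, p_b_given_tr, pi_l, pi_r)
    h_pi = -(pi_l * np.log2(pi_l) + pi_r * np.log2(pi_r))  # H(Pi)
    return p_merged * (js - beta_inv * h_pi)
```

A greedy agglomerative pass repeatedly merges the pair with the smallest cost. For beta_inv > 0 the entropy term makes unbalanced merges cheaper, which is exactly the "noise filter" effect described above.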
\n\nWhile an exact implementation of the symmetric IB requires alternating mergers in T_W and T_D, an approximate approach requires only two steps. First, we find T_W. Second, we project each d ∈ D into the low dimensional space defined by T_W, and use this more robust representation to extract document clusters T_D. Approximately, we are trying to find T_W and T_D that will maximize I(T_W; T_D). This two-phase IB algorithm was shown in [12] to be significantly superior to six other document clustering methods, when performance is measured by the correlation of the obtained document clusters with the real newsgroup categories. Here we use the same procedure, but for finding T_W we take β^{-1} = 0.15 (instead of zero). Using the above intuition, we predict this will induce a cleaner representation for the document set. Indeed, the averaged correlation of T_D (for |T_D| = 4) with the original categories was 0.65, while for β^{-1} = 0 it was 0.58 (the average is taken over different numbers of word clusters, |T_W| = 10, 11, ..., 50). Similar results were obtained for all the 9 other subsets of the 20 newsgroups corpus described in [12].\n\nAs a second data set we used the gene expression measurements of ~6800 genes in 72 samples of Leukemia [5]. The sample annotations included type of leukemia (ALL vs. AML), type of cells, source of sample, gender and donating hospital. We removed genes that were not expressed in the data and normalized the measurements of each sample to get a joint probability P(G, A) over genes and samples (with uniform prior on samples). We sorted all genes by their contribution to I(G; A) and chose the 500 most informative ones, which capture 47% of the original information, ending up with a joint probability with |A| = 72 and |G| = 500.\n\n² We used the same subset already used in [12].\n\nFigure 2: p(D | t_w) for 5 word clusters, t_w ∈ T_W. Documents 1-500 belong to the sci.crypt category, 501-1000 to sci.electronics, 1001-1500 to sci.med, and 1501-2000 to sci.space. In the title of each panel we see the 5 most frequent words in the cluster. The 'big' cluster (upper left panel) is clearly less informative about the structure of D. In the lower right panel we see the two information curves: given some compression level, for β^{-1} = 0.15 we preserve much more information about D than for β^{-1} = 0.\n\nWe first used an exact implementation of the symmetric IB with alternating mergers between both clustering hierarchies (and β^{-1} = 1). For |T_A| = 2 we found an almost perfect correlation with the ALL vs. AML annotations (with only 4 exceptions). For |T_A| = 8 and |T_G| = 10 we again found high correlation between our sample clusters and the different sample annotations. For example, one cluster contained 10 samples that were all annotated as ALL type, taken from male patients in the same hospital. Almost all of these 10 were also annotated as T-cells, taken from bone marrow. Looking at p(T_A | T_G), we see that given the third gene cluster (which contained 17 genes), the probability of the above specific sample cluster is especially high. Further such analysis might yield additional insights about the structure of this data and will be presented elsewhere. 
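Both experiments above rank features by their contribution to the relevant mutual information (words by I(W; D), genes by I(G; A)). A minimal sketch of that preprocessing step follows; this is our own illustration, not the authors' code, and the function names are ours.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X; Y) in bits from a joint distribution matrix p_xy[x, y] (entries sum to 1)."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def feature_contributions(p_xy):
    """Per-row contribution of each value x to I(X; Y); sorting by this and keeping
    the top rows gives the 'most informative' features used in the experiments."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    ratio = np.divide(p_xy, p_x @ p_y, out=np.ones_like(p_xy), where=p_xy > 0)
    return (p_xy * np.log2(ratio)).sum(axis=1)  # contributions sum to I(X; Y)
```

For example, keeping the rows of p(W, D) with the largest contributions corresponds to the kind of pruning to the 2000 most informative words applied above.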
\n\nFinally, to demonstrate the performance of the parallel IB, we apply it to the same data. Using the parallel IB algorithm (with β^{-1} = 0) we clustered the arrays A into two clustering hierarchies, T_1 and T_2, that together try to capture the information about G. For |T_j| = 4 we find that each I(T_j; G) preserves about 15% of the original information. However, taking |T_j| = 2 (i.e., again, just 4 clusters) we see that the combination of the hierarchies, I(T_1, T_2; G), preserves 21% of the original information. We then compared the two partitions we found against the sample annotations. We found that the first hierarchy, with |T_1| = 2, almost perfectly matches the split between B-cells and T-cells (among the 47 samples for which we had this annotation). The second hierarchy, with |T_2| = 2, separates a cluster of 18 samples, almost all of which are ALL samples taken from the bone marrow of patients from the same hospital. These results demonstrate the ability of the algorithm to extract in parallel different meaningful independent partitions of the data.\n\n7 Discussion\n\nThe analysis presented in this work enables implementing a family of novel agglomerative clustering algorithms. All of these algorithms are motivated by one variational framework given by the multivariate IB method. Unlike most other clustering techniques, this is a principled, model independent approach, which aims directly at the extraction of informative structures about given observed variables. It is thus very different from maximum-likelihood estimation of some mixture model, and relies on fundamental information theoretic notions, similar to rate distortion theory and channel coding. In fact, the multivariate IB can be considered as a multivariate coding result. 
The fundamental tradeoff between the compressed multi-information I^{G_in} and the preserved multi-information I^{G_out} provides a generalized coding limiting function, similar to the information curve in the original IB and to the rate distortion function in lossy compression. Despite the only local optimality of the resulting solutions, this information theoretic quantity, the fraction of the multi-information that is extracted by the clusters, provides an objective figure of merit for the obtained clustering schemes.\n\nThe approach suggested in this paper has several practical advantages over the 'deterministic annealing' algorithms suggested in [4], as it is simpler, fully deterministic and non-parametric. There is no need to identify cluster splits, which is usually rather tricky. Though agglomeration procedures do not scale linearly with the sample size as top-down methods do, there exist several heuristics to improve the complexity of these algorithms (e.g., [1]).\n\nWhile a typical initialization of an agglomerative procedure induces \"hard\" clustering solutions, all of the above analysis holds for \"soft\" clustering as well. Moreover, as already noted in [11], the obtained \"hard\" partitions can be used as a platform for finding \"soft\" solutions as well, through a process of \"reverse annealing\". This raises the possibility of using an agglomerative procedure over \"soft\" clustering solutions, which we leave for future work. We could describe here only a few relatively simple examples. These examples show promising results on non-trivial real-life data. Moreover, other choices of G_in and G_out can yield additional novel algorithms with applications over a variety of data types.\n\nAcknowledgements\n\nThis work was supported in part by the Israel Science Foundation (ISF), the Israeli Ministry of Science, and by the US-Israel Bi-national Science Foundation (BSF). N. 
Slonim was also supported by an Eshkol fellowship. N. Friedman was also supported by an Alon fellowship and the Harry & Abe Sherman Senior Lectureship in Computer Science.\n\nReferences\n\n[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In ACM SIGIR 1998.\n[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. 1991.\n[3] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In NIPS 1997.\n[4] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In UAI 2001.\n[5] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.\n[6] K. Lang. Learning to filter netnews. In ICML 1995.\n[7] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Info. Theory, 37(1):145-151, 1991.\n[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems. 1988.\n[9] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86:2210-2239, 1998.\n[10] N. Slonim, R. Somerville, N. Tishby, and O. Lahav. Objective spectral classification of galaxies using the information bottleneck method. MNRAS, 323:270, 2001.\n[11] N. Slonim and N. Tishby. Agglomerative information bottleneck. In NIPS 1999.\n[12] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000.\n[13] N. Slonim and N. Tishby. The power of word clusters for text classification. In ECIR 2001.\n[14] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation, 1999. 
\n[15] N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. In NIPS 2000.\n", "award": [], "sourceid": 1952, "authors": [{"given_name": "Noam", "family_name": "Slonim", "institution": null}, {"given_name": "Nir", "family_name": "Friedman", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}