{"title": "Agglomerative Multivariate Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": null, "full_text": "Agglomerative Multivariate Information \n\nBottleneck \n\nSchool of Computer Science & Engineering, Hebrew University, Jerusalem 91904, Israel \n\nNoam Sionim  Nir Friedman  Naftali Tishby \n\n{noamm, nir, tishby } @cs.huji.ac.il \n\nAbstract \n\nThe information bottleneck method  is  an  unsupervised  model  independent data \norganization  technique.  Given  a joint distribution  peA, B),  this  method  con(cid:173)\nstructs a new variable T  that extracts partitions, or clusters, over the values of A \nthat are informative about B.  In  a recent paper,  we introduced a general princi(cid:173)\npled framework for multivariate extensions of the information bottleneck method \nthat allows us to consider multiple systems of data partitions that are inter-related. \nIn this  paper,  we  present  a  new  family  of simple  agglomerative  algorithms  to \nconstruct such systems of inter-related clusters.  We analyze the behavior of these \nalgorithms and apply them to several real-life datasets. \n\n1  Introduction \n\nThe  information  bottleneck  (IB)  method  of Tishby  et  al  [14]  is  an  unsupervised  non(cid:173)\nparametric data organization technique.  Given  a joint distribution P(A, B), this  method \nconstructs a new variable T  that represents partitions of A  which are (locally) maximizing \nthe mutual  information about B.  In other words,  the variable T  induces a  sufficient par(cid:173)\ntition,  or informative features of the variable A  with respect to  B.  The construction of T \nfinds  a  tradeoff between  the  information about A  that we try  to  minimize,  J(T; A), and \nthe information about B  which we try to maximize, J(T ; B). This approach is particularly \nuseful for co-occurrence data,  such as  words and documents [12],  where we want to cap(cid:173)\nture what information one variable (e.g., use of a word) contains about the other (e.g., the \ndocument). \n\nIn a recent paper, Friedman et al.  [4]  introduce multivariate extension of the IB principle. \nThis extension allows us to consider cases where the data partition is relevant with respect \nto  several variables,  or where we construct several  systems of clusters simultaneously.  In \nthis framework,  we specify the desired interactions by a pair of Bayesian networks.  One \nnetwork, Gin, represents which variables are compressed versions of the observed variables \n- each  new  variable  compresses its  parents  in  the  network.  The  second network,  Gout> \ndefines the statistical relationship between these new variables and the observed variables \nthat should be maintained. \n\nSimilar to  the  original  IB,  in  Friedman et  al.  we  formulated  the  general  principle as  a \ntradeoff between the (multi) information each network carries.  On the one hand, we want to \nminimize the information maintained by G in  and on the other to maximize the information \nmaintained by Gout.  We also provide a characterization of stationary points in this tradeoff \nas a set of self-consistent equations.  Moreover, we prove that iterations of these equations \nconverges to  a  (local)  optimum.  Then,  we  describe  a deterministic  annealing procedure \n\n\fthat constructs a solution by tracking the bifurcation of clusters as  it traverses the tradeoff \ncurve, similar to the original IB method. 
\n\nIn  this  paper,  we  consider an  alternative  approach  to  solving  multivariate  IB  problems \nwhich is  motivated by the success of the agglomerative IB of Slonim and Tishby [11].  As \nshown  there, a bottom-up greedy  agglomeration is  a  simple heuristic procedure that can \nyield good solutions to the original IB problem.  Here we extend this idea in  the context of \nmultivariate IB problems. We start by analyzing the cost of agglomeration steps within this \nframework.  This both elucidates the criteria that guides greedy agglomeration and provides \nfor efficient local evaluation rules for agglomeration steps.  This construction results with \na novel  family  of information theoretic  agglomerative clustering algorithms, that can  be \nspecified using  the  graphs  G in  and  G out.  We  demonstrate  the  performance of some  of \nthese algorithms for document and word clustering and gene expression analysis. \n\n2  Multivariate Information Bottleneck \nA Bayesian network structure G  is a DAG that specifies interactions among variables [8]. \nA distribution P  is  consistent with G  (denoted P  F G), if P(Xl , ... , X n )  =  I1 P(Xi  I \nPa<fJ, where  Pa<fi  are  the parents of X i  in  G.  Our main interest is  in  the  information \nthat the variables Xl \" '\"  X n contain about each other.  A quantity that captures this is the \nmulti-information given by \n\nwhere V(Pllq)  is the familiar Kullback-Liebler divergence [2]. \nProposition 2.1  [4] Let G be a DAG over {Xl , ... , X n },  and let P  F G be a distribution. \nThen, I G = I(Xl' ... , X n ) =  L i I(Xi ; Pa<fi ). \n\nThat is,  the multi-information is  the sum of local mutual information terms  between each \nvariable and its parents (denoted I G). \n\nFriedman et al.  define the multivariate IE problem as follows.  Suppose we are given a set \nof observed variables, X  =  {Xl , ... , X n} and their joint distribution P (X l ,  ... , X n ).  We \nwant to  \"construct\" new  variables T , where the relations  between the  observed variables \nand these new  compression variables are  specified using  a DAG  Gin  over X  U T  where \nthe  variables  in  T  are  leafs.  Thus,  each Tj  is  a  stochastic  function  of a  set  of variables \nU j  =  Pa~;in  ~ X.  Once these are set,  we have a joint distribution over the combined set \nof variables:  P(X, T)  =  P(X) ITj  P(Tj  I U j ). \nThe \"relevant\" information that we want to preserve is  specified by another DAG, Gout . \n\nThis  graph  specifies,  for  each Tj  which  variables  it predicts.  These are  simply  its  chil(cid:173)\ndren  in  G out .  More  precisely,  we  want  to  predict each  Xi  (or T j )  by  V X i  =  Pa~;\"t \n(resp. V T;  =  Pa~;out ),  its parents in  G out.  Thus, we  think ofIGout  as  a measure of how \nmuch information the variables in T  maintain about their target variables. \n\nThe Lagrangian can then be defined as \n\n(1) \n\nwith  a  tradeoff parameter (Lagrange  multiplier)  (3.  1  The  variation  is  done  subject  to \nthe normalization constraints on the partition distributions.  Thus, we balance between the \ninformation T  loses about X  in G in and the information it preserves in G out. \n\nFriedman et al.  [4]  show  that  stationary points of this  Lagrangian satisfy  a  set  of self(cid:173)\n\nconsistent equations.  Moreover, they  show  that  iterating  these  equations converges  to  a \n\nINotice that under this formul ation  we would like to  maximize \u00a3.  
3 The Agglomerative Procedure

For the original IB problem, Slonim and Tishby [11] introduced a simpler procedure that performs greedy bottom-up merging of values. Several successful applications of this algorithm have already been presented for a variety of real-world problems [10, 12, 13, 15]. The main focus of the current work is to extend this approach to the multivariate IB problem. As we will show, this leads to further insights about the method, and also provides rather simple and intuitive clustering procedures.

We consider procedures that start with a set of clusters for each T_j (usually the most fine-grained solution we can consider, where T_j = U_j) and then iteratively reduce the cardinality of one of the T_j's by merging two values $t^l_j$ and $t^r_j$ of T_j into a single value $\bar{t}_j$. To formalize this notion we must define the membership probability of the new cluster $\bar{t}_j$ resulting from the merger $\{t^l_j, t^r_j\} \Rightarrow \bar{t}_j$ in T_j. This is done rather naturally by

$p(\bar{t}_j \mid u_j) = p(t^l_j \mid u_j) + p(t^r_j \mid u_j)$  for every value $u_j$ of $U_j$.    (2)

In other words, we view the event $\bar{t}_j$ as the union of the events $t^l_j$ and $t^r_j$.

Given the membership probabilities, at each step we can also draw the connection between T_j and the other variables. This is done using the following proposition, which is based on the conditional independence assumptions given in G_in.

Proposition 3.1 Let $Y, Z \subseteq X \cup T \setminus \{T_j\}$. Then,

$p(Y \mid \bar{t}_j, Z) = \pi_{l,Z}\, p(Y \mid t^l_j, Z) + \pi_{r,Z}\, p(Y \mid t^r_j, Z)$,    (3)

where $\Pi_Z = \{\pi_{l,Z}, \pi_{r,Z}\} = \{ p(t^l_j \mid Z)/p(\bar{t}_j \mid Z),\; p(t^r_j \mid Z)/p(\bar{t}_j \mid Z) \}$ is the merger distribution conditioned on Z = z. In particular, this proposition allows us to evaluate all the predictions defined in G_out and all the information terms in $\mathcal{L}$ that involve T_j.

The crucial question in an agglomerative process is of course which pair to merge at each step. We know that the merger "cost" in our terms is exactly the difference in the values of $\mathcal{L}$ before and after the merger. Let $T_j^{bef}$ and $T_j^{aft}$ denote the random variables that correspond to T_j before and after the merger, respectively. Thus, the values of $\mathcal{L}$ before and after the merger are calculated based on $T_j^{bef}$ and $T_j^{aft}$. The merger cost is then simply given by

$\Delta\mathcal{L}(t^l_j, t^r_j) = \mathcal{L}^{bef} - \mathcal{L}^{aft}$.    (4)

The greedy procedure evaluates all the potential mergers (for all T_j) and then applies the best one (i.e., the one that minimizes $\Delta\mathcal{L}(t^l_j, t^r_j)$). This is repeated until all the variables in T degenerate into trivial clusters. The resulting set of trees describes a range of solutions at different resolutions.
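The following minimal sketch shows how a single merger is carried out in the hard, unconditioned case (Z empty): the merged prior follows Eq. (2) as the union of the two events, and the merged predictive distribution is the Π-weighted mixture of Proposition 3.1. The array layout and function name are illustrative assumptions, not part of the original formulation.

    # Minimal sketch: one merger step, Eq. (2) and Proposition 3.1 with Z = {}.
    import numpy as np

    def merge_clusters(p_t, p_y_given_t, l, r):
        """p_t: prior over clusters, shape (k,); p_y_given_t: rows are p(Y | t), shape (k, |Y|).
        Returns the prior and conditionals after merging clusters l and r into one cluster."""
        p_merged = p_t[l] + p_t[r]                                # Eq. (2): union of the two events
        pi_l, pi_r = p_t[l] / p_merged, p_t[r] / p_merged         # merger distribution Pi
        y_merged = pi_l * p_y_given_t[l] + pi_r * p_y_given_t[r]  # Prop. 3.1: Pi-weighted mixture
        keep = [i for i in range(len(p_t)) if i not in (l, r)]
        new_p_t = np.append(p_t[keep], p_merged)
        new_p_y = np.vstack([p_y_given_t[keep], y_merged])
        return new_p_t, new_p_y

    # Example: three clusters predicting a binary Y; merge clusters 0 and 1.
    p_t = np.array([0.2, 0.3, 0.5])
    p_y_given_t = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])
    print(merge_clusters(p_t, p_y_given_t, 0, 1))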
This agglomerative approach differs in several important aspects from the deterministic annealing approach described above. In that approach, by "cooling" (i.e., increasing) β, we move along a tradeoff curve from the trivial single-cluster solution toward solutions with higher resolution that preserve more information in G_out. In contrast, in the agglomerative approach we progress in the opposite direction. We start with a high resolution clustering, and as the merging process continues we move toward more and more compact solutions. During this process β is kept constant and the driving force is the reduction in the cardinality of the T_j's. Therefore, we are able to look for good solutions at different resolutions for a fixed tradeoff parameter β. Since the merging does not attempt directly to maintain the (stationary) self-consistent "soft" membership probabilities, we do not expect the self-consistent equations to hold at solutions found by the agglomerative procedure. On the other hand, the agglomerative process is much simpler to implement and fully deterministic. As we will show, it provides sufficiently good solutions for the IB problem in many situations.

4 Local Merging Criteria

In the procedure outlined above, at every step there are $O(|T_j|^2)$ possible mergers of values of T_j (for every j). A direct calculation of the costs of all these potential mergers is typically infeasible. However, it turns out that one may calculate $\Delta\mathcal{L}(t^l_j, t^r_j)$ while examining only the probability distributions that involve $t^l_j$ and $t^r_j$ directly. Generalizing the results of [11] for the original IB, we now develop a closed-form formula for $\Delta\mathcal{L}(t^l_j, t^r_j)$. To describe this result we need the following definition. The Jensen-Shannon (JS) divergence [7, 3] between two probability distributions $p_1, p_2$ is given by

$JS_{\Pi}[p_1, p_2] = \pi_1 D_{KL}[p_1 \| \bar{p}] + \pi_2 D_{KL}[p_2 \| \bar{p}]$,

where $\Pi = \{\pi_1, \pi_2\}$ is a normalized probability and $\bar{p} = \pi_1 p_1 + \pi_2 p_2$. The JS divergence equals zero if and only if both its arguments are identical. It is upper bounded and symmetric, though it is not a metric. One interpretation of the JS divergence relates it to the (logarithmic) measure of the likelihood that the two sample distributions originate from the most likely common source, denoted by $\bar{p}$. In addition, we need the notation $V^{-j}_{X_i} = V_{X_i} \setminus \{T_j\}$ (and similarly $V^{-j}_{T_\ell} = V_{T_\ell} \setminus \{T_j\}$).

Theorem 4.1 Let $t^l_j, t^r_j \in T_j$ be two clusters. Then, $\Delta\mathcal{L}(t^l_j, t^r_j) = p(\bar{t}_j) \cdot d(t^l_j, t^r_j)$, where

$d(t^l_j, t^r_j) \;=\; \sum_{i:\, T_j \in V_{X_i}} E_{p(\cdot \mid \bar{t}_j)}\big[ JS_{\Pi_{V^{-j}_{X_i}}}\big( p(X_i \mid t^l_j, V^{-j}_{X_i}),\, p(X_i \mid t^r_j, V^{-j}_{X_i}) \big) \big]$
$\quad +\; \sum_{\ell:\, T_j \in V_{T_\ell}} E_{p(\cdot \mid \bar{t}_j)}\big[ JS_{\Pi_{V^{-j}_{T_\ell}}}\big( p(T_\ell \mid t^l_j, V^{-j}_{T_\ell}),\, p(T_\ell \mid t^r_j, V^{-j}_{T_\ell}) \big) \big]$
$\quad +\; JS_{\Pi}\big( p(V_{T_j} \mid t^l_j),\, p(V_{T_j} \mid t^r_j) \big) \;-\; \beta^{-1}\, JS_{\Pi}\big( p(U_j \mid t^l_j),\, p(U_j \mid t^r_j) \big)$.

A detailed proof of this theorem will be given elsewhere. Thus, the merger cost is the product of the weight of the merged components, $p(\bar{t}_j)$, with their "distance" $d(t^l_j, t^r_j)$. Notice that due to the properties of the JS divergence, this distance is symmetric. In addition, the last term in this distance has the opposite sign to the first three terms. Thus, the distance between two clusters reflects a tradeoff between these two factors. Roughly speaking, we may say that the distance is minimized for pairs that give similar predictions about the variables connected with T_j in G_out and have different predictions (minimal overlap) about the variables connected with T_j in G_in. We also note the analogy between this result and the main theorem in [4]. In [4] the optimization is governed by the KL divergences between data and cluster centroids, or by the likelihood that the data was generated by the centroid distribution. Here the optimization is controlled through JS divergences, i.e., the likelihood that the two clusters have a common source.
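Since this Π-weighted JS divergence is the building block of all the merger costs below, here is a minimal sketch of it; the example distributions are arbitrary and the code is purely illustrative.

    # Minimal sketch: JS_Pi[p1, p2] = pi1*KL(p1||pbar) + pi2*KL(p2||pbar), pbar = pi1*p1 + pi2*p2.
    import numpy as np

    def kl(p, q):
        """Kullback-Leibler divergence in bits, ignoring zero-probability entries of p."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    def js(p1, p2, pi1, pi2):
        """Pi-weighted Jensen-Shannon divergence between two discrete distributions."""
        pbar = pi1 * p1 + pi2 * p2
        return pi1 * kl(p1, pbar) + pi2 * kl(p2, pbar)

    # Example: two word clusters with different conditional distributions over documents.
    p1 = np.array([0.7, 0.2, 0.1])
    p2 = np.array([0.1, 0.3, 0.6])
    print(js(p1, p2, 0.5, 0.5))   # zero iff p1 == p2; symmetric under (p1, pi1) <-> (p2, pi2)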
Next, we notice that after applying a merger, only a small portion of the other merger costs change. The following proposition characterizes these costs.

Proposition 4.2 The merger $\{t^l_j, t^r_j\} \Rightarrow \bar{t}_j$ in T_j can change the cost $\Delta\mathcal{L}(t^l_\ell, t^r_\ell)$ only if $p(\bar{t}_j, \bar{t}_\ell) > 0$ and $T_j, T_\ell$ co-appear in some information term in $\mathcal{I}^{G_{out}}$.

This proposition is particularly useful when we consider "hard" clustering, where T_j is a (deterministic) function of U_j. In this case, $p(\bar{t}_j, \bar{t}_\ell)$ is often zero (especially when T_j and T_\ell compress similar variables, i.e., $U_j \cap U_\ell \neq \emptyset$). In particular, after the merger $\{t^l_j, t^r_j\} \Rightarrow \bar{t}_j$, we do not have to re-evaluate the merger costs of other values of T_j, except for mergers of $\bar{t}_j$ with each of these values.

In the case of hard clustering we also find that $I(T_j; U_j) = H(T_j)$ (where H(p) is Shannon's entropy). Roughly speaking, we may say that H(p) decreases for less balanced probability distributions p. Therefore, increasing β^{-1} results in a tendency to look for less balanced "hard" partitions, and vice versa. This is reflected by the fact that the last term in $d(t^l_j, t^r_j)$ is then simplified through $JS_{\Pi}(p(U_j \mid t^l_j), p(U_j \mid t^r_j)) = H(\Pi)$.

[Figure 1: The source and target networks (G_in, G_out) and the corresponding Lagrangian for the three examples we consider. (a) Original bottleneck: $I(T; B) - \beta^{-1} I(T; A)$. (b) Parallel bottleneck: $I(T_1, T_2; B) - \beta^{-1}(I(T_1; A) + I(T_2; A))$. (c) Symmetric bottleneck: $I(T_A; T_B) - (\beta^{-1} - 1)(I(T_A; A) + I(T_B; B))$.]

5 Examples

We now briefly consider three examples of the general methodology. For brevity we focus on the simpler case of hard clustering. We first consider the example shown in Figure 1(a). This choice of graphs results in the original IB problem. The merger cost in this case is given by

$\Delta\mathcal{L}(t^l, t^r) = p(\bar{t}) \cdot \big( JS_{\Pi}(p(B \mid t^l), p(B \mid t^r)) - \beta^{-1} H(\Pi) \big)$.    (5)

Note that for β^{-1} → 0 we get exactly the algorithm presented in [11].
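As a concrete illustration of how the local cost of Eq. (5) drives the greedy procedure of Section 3, the following sketch performs agglomerative hard clustering for the original bottleneck of Figure 1(a). The toy joint distribution, the function names, and the stopping rule (a target number of clusters) are illustrative assumptions rather than the implementation used for the experiments.

    # Minimal sketch: greedy agglomerative IB for the original bottleneck, using Eq. (5).
    import numpy as np

    def entropy(pi):
        pi = pi[pi > 0]
        return float(-np.sum(pi * np.log2(pi)))

    def js(p1, p2, pi1, pi2):
        pbar = pi1 * p1 + pi2 * p2
        def kl(p, q):
            m = p > 0
            return float(np.sum(p[m] * np.log2(p[m] / q[m])))
        return pi1 * kl(p1, pbar) + pi2 * kl(p2, pbar)

    def merge_cost(p_t, p_b_given_t, i, j, beta_inv):
        """Eq. (5): Delta L = p(tbar) * ( JS_Pi(p(B|t_i), p(B|t_j)) - beta_inv * H(Pi) )."""
        w = p_t[i] + p_t[j]
        pi = np.array([p_t[i] / w, p_t[j] / w])
        return w * (js(p_b_given_t[i], p_b_given_t[j], pi[0], pi[1]) - beta_inv * entropy(pi))

    def agglomerative_ib(p_ab, n_clusters, beta_inv=0.0):
        """p_ab: joint over (A, B) as a 2-D array. Greedily merges values of A down to
        n_clusters clusters; returns the cluster priors p(t) and conditionals p(B|t)."""
        p_t = p_ab.sum(axis=1)                      # start from singleton clusters of A
        p_b_given_t = p_ab / p_t[:, None]
        p_t, p_b_given_t = list(p_t), list(p_b_given_t)
        while len(p_t) > n_clusters:
            pairs = [(merge_cost(p_t, p_b_given_t, i, j, beta_inv), i, j)
                     for i in range(len(p_t)) for j in range(i + 1, len(p_t))]
            _, i, j = min(pairs)                    # apply the cheapest merger
            w = p_t[i] + p_t[j]
            merged = (p_t[i] * p_b_given_t[i] + p_t[j] * p_b_given_t[j]) / w
            for k in sorted((i, j), reverse=True):  # remove the two merged clusters
                del p_t[k]; del p_b_given_t[k]
            p_t.append(w); p_b_given_t.append(merged)
        return np.array(p_t), np.array(p_b_given_t)

    # Toy joint p(A, B) with |A| = 6, |B| = 3.
    rng = np.random.default_rng(0)
    p_ab = rng.random((6, 3)); p_ab /= p_ab.sum()
    print(agglomerative_ib(p_ab, n_clusters=2, beta_inv=0.15))

Setting beta_inv to zero recovers the purely JS-driven merging of [11], while larger values favor less balanced partitions, as discussed at the end of Section 4.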
One simple extension of the original IB is the parallel bottleneck [4]. In this case we introduce two variables T_1 and T_2, as in Figure 1(b), both of which are functions of A. Similarly to the original IB, G_out specifies that T_1 and T_2 should predict B. We can think of this requirement as an attempt to decompose the information A contains about B into two "orthogonal" components. In this case, the merger cost for T_1 is given by

$\Delta\mathcal{L}(t^l_1, t^r_1) = p(\bar{t}_1) \cdot \big( E_{p(\cdot \mid \bar{t}_1)}\big[ JS_{\Pi_{T_2}}\big(p(B \mid t^l_1, T_2),\, p(B \mid t^r_1, T_2)\big) \big] - \beta^{-1} H(\Pi) \big)$.    (6)

Finally, we consider the symmetric bottleneck [4, 12]. In this case, we want to compress A into T_A and B into T_B so that T_A extracts the information A contains about B and, at the same time, T_B extracts the information B contains about A. The DAG G_in of Figure 1(c) captures the form of the compression. The choice of G_out is less obvious, and several alternatives are described in [4]. Here, we concentrate on only one option, shown in Figure 1(c). In this case we attempt to make each of T_A and T_B sufficient to separate A from B. Thus, on one hand we attempt to compress, and on the other hand we attempt to make T_A and T_B as informative about each other as possible. The merger cost in T_A is given by

$\Delta\mathcal{L}(t^l_A, t^r_A) = p(\bar{t}_A) \cdot \big( JS_{\Pi}\big(p(T_B \mid t^l_A),\, p(T_B \mid t^r_A)\big) - (\beta^{-1} - 1) H(\Pi) \big)$,    (7)

while for merging in T_B we get an analogous expression.

6 Applications

We examine a few applications of the examples presented above. As one data set we used a subset of the 20 newsgroups corpus [6], from which we randomly chose 2000 documents evenly distributed among the 4 science discussion groups (sci.crypt, sci.electronics, sci.med and sci.space) (see footnote 2). Our pre-processing included ignoring file headers (and the subject lines), lowering upper case, and ignoring words that contained non-'a..z' characters. Given this document set we can evaluate the joint probability p(W, D), which is the probability that a random word position is equal to w ∈ W and that at the same time the document is d ∈ D. We sorted all words by their contribution to I(W; D) and used only the 2000 'most informative' ones, ending up with a joint probability with |W| = |D| = 2000.

Footnote 2: We used the same subset already used in [12].

We first used the original IB to cluster W, while trying to preserve the information about D. This was already done in [12] with β^{-1} = 0, but in this new experiment we took β^{-1} = 0.15. Recall that increasing β^{-1} results in a tendency to find less balanced clusters. Indeed, while for β^{-1} = 0 we got relatively balanced word clusters (high H(T_W)), for β^{-1} = 0.15 the probability p(T_W) is much less smooth. For 50 word clusters, one cluster contained almost half of the words, while the other clusters were typically much smaller. Since the algorithm also tries to maximize I(T_W; D), the words merged into the big cluster are usually those that are less informative about D. Thus, a word must be highly informative to stay out of this cluster. In this sense, increasing β^{-1} is equivalent to inducing a "noise filter" that leaves only the most informative features in specific clusters. In Figure 2 we present p(D | t_w) for several clusters t_w ∈ T_W. Clearly, words that passed the "filter" form much more informative clusters about the real structure of D. A more formal demonstration of this effect is given in the right panel of Figure 2. For a given compression level (i.e., a given I(T_W; W)), we see that taking β^{-1} = 0.15 preserves much more information about D.

[Figure 2: p(D | t_w) for 5 word clusters t_w ∈ T_W. Documents 1-500 belong to the sci.crypt category, 501-1000 to sci.electronics, 1001-1500 to sci.med and 1501-2000 to sci.space. The title of each panel lists the 5 most frequent words in the cluster; recoverable panel titles: c1 (905 words): algorithm, secure, security, encryption, classified, ...; c2 (20 words); c3 (19 words): analog, mode, signal, input, output, ...; c4 (35 words): acid, vitamin, calcium, intake, kidney, ...; c5 (35 words): ames, planetary, nasa, space, .... The 'big' cluster (upper left panel) is clearly less informative about the structure of D. The lower right panel shows the two information curves: given some compression level I(T_W; W), for β^{-1} = 0.15 we preserve much more information about D than for β^{-1} = 0.]
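The information curves in Figure 2 plot the preserved information I(T_W; D) against the compression level I(T_W; W). A minimal sketch of how one point on such a curve can be evaluated for a hard word partition follows; the toy joint and the random partition are purely illustrative, and the use of I(T_W; W) = H(T_W) relies on the hard-clustering identity noted in Section 4.

    # Minimal sketch: one point ( I(T_W; W), I(T_W; D) ) of the information curve
    # for a hard partition of the words.
    import numpy as np

    def mutual_information(p_xy):
        """I(X; Y) in bits for a joint distribution given as a 2-D array."""
        px = p_xy.sum(axis=1, keepdims=True)
        py = p_xy.sum(axis=0, keepdims=True)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px * py)[mask])))

    def information_curve_point(p_wd, labels, n_clusters):
        """p_wd: joint over (word, document); labels[w] = hard cluster of word w."""
        p_td = np.zeros((n_clusters, p_wd.shape[1]))
        for w, t in enumerate(labels):
            p_td[t] += p_wd[w]                      # aggregate word rows into cluster rows
        p_t = p_td.sum(axis=1)
        h_t = -np.sum(p_t[p_t > 0] * np.log2(p_t[p_t > 0]))  # I(T_W; W) = H(T_W) for hard T_W
        return float(h_t), mutual_information(p_td)

    rng = np.random.default_rng(2)
    p_wd = rng.random((40, 10)); p_wd /= p_wd.sum()
    labels = rng.integers(0, 5, size=40)            # a hypothetical hard partition into 5 clusters
    print(information_curve_point(p_wd, labels, 5))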
While an exact implementation of the symmetric IB would require alternating mergers in T_W and T_D, an approximate approach requires only two steps. First, we find T_W. Second, we project each d ∈ D onto the low dimensional space defined by T_W, and use this more robust representation to extract document clusters T_D. Approximately, we are trying to find T_W and T_D that maximize I(T_W; T_D). This two-phase IB algorithm was shown in [12] to be significantly superior to six other document clustering methods, when performance is measured by the correlation of the obtained document clusters with the real newsgroup categories. Here we use the same procedure, but for finding T_W we take β^{-1} = 0.15 (instead of zero). Using the above intuition, we predict that this will induce a cleaner representation of the document set. Indeed, the averaged correlation of T_D (for |T_D| = 4) with the original categories was 0.65, while for β^{-1} = 0 it was 0.58 (the average is taken over different numbers of word clusters, |T_W| = 10, 11, ..., 50). Similar results were obtained for all 9 other subsets of the 20 newsgroups corpus described in [12].

As a second data set we used the gene expression measurements of ~6800 genes in 72 samples of leukemia [5]. The sample annotations included type of leukemia (ALL vs. AML), type of cells, source of sample, gender and donating hospital. We removed genes that were not expressed in the data and normalized the measurements of each sample to get a joint probability P(G, A) over genes and samples (with a uniform prior on samples). We sorted all genes by their contribution to I(G; A) and chose the 500 most informative ones, which capture 47% of the original information, ending up with a joint probability with |A| = 72 and |G| = 500.

We first used an exact implementation of the symmetric IB with alternating mergers between both clustering hierarchies (and β^{-1} = 1). For |T_A| = 2 we found an almost perfect correlation with the ALL vs. AML annotations (with only 4 exceptions). For |T_A| = 8 and |T_G| = 10 we again found high correlation between our sample clusters and the different sample annotations. For example, one cluster contained 10 samples that were all annotated as ALL type, taken from male patients in the same hospital. Almost all of these 10 were also annotated as T-cells, taken from bone marrow. Looking at p(T_A | T_G), we see that given the third gene cluster (which contained 17 genes), the probability of this specific sample cluster is especially high. Further such analysis might yield additional insights about the structure of this data and will be presented elsewhere.
Finally, to demonstrate the performance of the parallel IB, we apply it to the same data. Using the parallel IB algorithm (with β^{-1} = 0) we clustered the arrays A into two clustering hierarchies, T_1 and T_2, that together try to capture the information about G. For |T_j| = 4 we find that each I(T_j; G) preserves about 15% of the original information. However, taking |T_j| = 2 (i.e., again just 4 clusters) we see that the combination of the hierarchies, I(T_1, T_2; G), preserves 21% of the original information. We then compared the two partitions we found against the sample annotations. We found that the first hierarchy, with |T_1| = 2, almost perfectly matches the split between B-cells and T-cells (among the 47 samples for which we had this annotation). The second hierarchy, with |T_2| = 2, separates a cluster of 18 samples, almost all of which are ALL samples taken from the bone marrow of patients from the same hospital. These results demonstrate the ability of the algorithm to extract in parallel different meaningful independent partitions of the data.

7 Discussion

The analysis presented in this work enables the implementation of a family of novel agglomerative clustering algorithms. All of these algorithms are motivated by one variational framework, given by the multivariate IB method. Unlike most other clustering techniques, this is a principled, model-independent approach, which aims directly at the extraction of informative structures about given observed variables. It is thus very different from maximum-likelihood estimation of some mixture model, and relies on fundamental information theoretic notions, similar to rate distortion theory and channel coding. In fact, the multivariate IB can be considered a multivariate coding result. The fundamental tradeoff between the compressed multi-information $\mathcal{I}^{G_{in}}$ and the preserved multi-information $\mathcal{I}^{G_{out}}$ provides a generalized coding limit function, similar to the information curve in the original IB and to the rate distortion function in lossy compression. Despite the fact that the resulting solutions are only locally optimal, this information theoretic quantity - the fraction of the multi-information that is extracted by the clusters - provides an objective figure of merit for the obtained clustering schemes.

The approach suggested in this paper has several practical advantages over the 'deterministic annealing' algorithms suggested in [4], as it is simpler, fully deterministic and non-parametric. There is no need to identify cluster splits, which is usually rather tricky. Though agglomeration procedures do not scale linearly with the sample size as top-down methods do, there exist several heuristics for improving the complexity of these algorithms (e.g., [1]).

While a typical initialization of an agglomerative procedure induces "hard" clustering solutions, all of the above analysis holds for "soft" clustering as well.
Moreover, as already noted in [11], the obtained "hard" partitions can be used as a platform for finding "soft" solutions as well, through a process of "reverse annealing". This raises the possibility of using an agglomerative procedure over "soft" clustering solutions, which we leave for future work. We could describe here only a few relatively simple examples. These examples show promising results on non-trivial real-life data. Moreover, other choices of G_in and G_out can yield additional novel algorithms with applications to a variety of data types.

Acknowledgements

This work was supported in part by the Israel Science Foundation (ISF), the Israeli Ministry of Science, and by the US-Israel Bi-national Science Foundation (BSF). N. Slonim was also supported by an Eshkol fellowship. N. Friedman was also supported by an Alon fellowship and the Harry & Abe Sherman Senior Lectureship in Computer Science.

References

[1] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In ACM SIGIR 1998.
[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. 1991.
[3] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In NIPS 1997.
[4] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate Information Bottleneck. In UAI 2001.
[5] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537, 1999.
[6] K. Lang. Learning to filter netnews. In ICML 1995.
[7] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Info. Theory, 37(1):145-151, 1991.
[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems. 1988.
[9] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86:2210-2239, 1998.
[10] N. Slonim, R. Somerville, N. Tishby, and O. Lahav. Objective spectral classification of galaxies using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, 323, 270, 2001.
[11] N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS 1999.
[12] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000.
[13] N. Slonim and N. Tishby. The power of word clusters for text classification. In ECIR 2001.
[14] N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck method. In Proc. 37th Allerton Conference on Communication, Control, and Computing, 1999.
[15] N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. In NIPS 2000.
", "award": [], "sourceid": 1952, "authors": [{"given_name": "Noam", "family_name": "Slonim", "institution": null}, {"given_name": "Nir", "family_name": "Friedman", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}