{"title": "Visualizing Group Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 452, "page_last": 458, "abstract": null, "full_text": "Visualizing Group  Structure* \n\nMarcus Held,  Jan Puzicha, and Joachim M.  Buhmann \n\nemail: {heldjanjb}.cs.uni-bonn.de. \n\nVVVVVV:  http://vvv-dbv.cs.uni-bonn.de \n\nInstitut fur  Informatik III, \n\nRomerstraBe  164,  D-53117 Bonn,  Germany \n\nAbstract \n\nCluster  analysis  is  a  fundamental  principle  in  exploratory  data \nanalysis,  providing the user  with a  description  of the  group struc(cid:173)\nture of given data.  A key  problem in  this context is the interpreta(cid:173)\ntion  and  visualization  of clustering  solutions  in  high- dimensional \nor  abstract  data  spaces.  In  particular,  probabilistic  descriptions \nof the  group structure,  essential  to  capture  inter-cluster  relation(cid:173)\nships, are hardly assessable by simple inspection ofthe probabilistic \nassignment  variables.  VVe  present  a  novel  approach  to  the  visual(cid:173)\nization of group structure.  It is  based on  a statistical model of the \nobject  assignments  which  have  been  observed  or  estimated  by  a \nprobabilistic clustering  procedure.  The objects or  data points are \nembedded  in  a  low  dimensional Euclidean space  by  approximating \nthe  observed  data statistics  with  a  Gaussian  mixture model.  The \nalgorithm provides a new approach to the visualization of the inher(cid:173)\nent structure for a broad variety of data types,  e.g. histogram data, \nproximity data and co-occurrence data.  To demonstrate the power \nof the approach,  histograms of textured images are visualized as an \nexample of a  large-scale data mining application. \n\n1 \n\nIntroduction \n\nClustering  and  visualization  are  key  issues  in  exploratory  data  analysis  and  are \nfundamental  principles  of many unsupervised  learning  schemes.  For  a  given  data \nset,  the  aim of any  clustering  approach  is  to extract  a  description  of the  inherent \ngroup structure.  The object  space  is  partitioned  into groups where  each  partition \n\n-This work has been supported by the German Research Foundation (DFG) under grant \n#BU 914/3-1,  by  the  German  Israel  Foundation  for  Science  and  Research  Development \n(GlF) under grant #1-0403-001.06/95  and by the Federal Ministry for  Education,  Science \nand Technology  (BMBF  #01 M  3021  A/4) . \n\n\fVisualizing Group Structure \n\n453 \n\nis as homogeneous as possible and two partitions are maximally heterogeneous.  For \nseveral  reasons  it  is  useful  to deal  with  probabilistic partitioning approaches: \n\n1.  The  data generation  process  itself might  be  stochastic,  resulting  in  over(cid:173)\n\nlapping partitions.  Thus,  a  probabilistic group description  is  adequate and \nprovides  additional information about the inter-cluster relations. \n\n2.  The  number of clusters  might be chosen  too  large.  Forcing  the  algorithm \nto  a  hard  clustering  solution  creates  artificial  structure  not  supported  by \nthe  data.  On  the  other  hand,  superfluous  clusters  can  be  identified  by  a \nprobabilistic group description  . \n\n3.  There  exists  theoretical  and  empirical  evidence  that  probabilistic  assign-\n\nments avoid over-fitting  phenomena [7]. \n\nSeveral  well-known  clustering schemes  result  in  fuzzy  cluster  assignments:  For  the \nmost common type of vector- valued  data,  heuristic fuzzy  clustering  methods were \nsuggested  [4,  5] .  In a  more principled way,  deterministic  annealing algorithms pro(cid:173)\nvide  fuzzy  clustering  solutions for  a  given  cost  function  with  a  rigorous statistical \nfoundation  and  have  been  developed  for  vectorial  [9],  proximity  [6]  and  histogram \ndata  [8].  In  mixture  model  approaches  the  assignments  of objects  to  groups  are \ninterpreted  as  missing  data.  Its  conditional  expectations  given  the  data  and  the \nestimated cluster  parameters are  computed during the  E- step in  the corresponding \nEM-algorithm and  can  be  understood  as  assignment probabilities. \n\nThe  aim  of this  contribution  is  to  develop  a  generic  framework  to  visualize  such \nprobabilities as distances  in  a  low  dimensional Euclidean space .  Especially  in  high \ndimensional or abstract object spaces,  the interpretation of fuzzy group structure is \nrather difficult,  as humans do not perform very  well  in interpreting probabilities.  It \nis,  therefore, a key  issue to make an interpretation of the cluster structure more fea(cid:173)\nsible.  In  contrast  to multidimensional scaling  (MDS),  where  objects are  embedded \nin low dimensional Euclidean spaces by preserving the original inter object distances \n[3],  our approach yields a mixture model in low dimensions, where  the probabilities \nfor  assigning objects  to clusters  are  maximally preserved.  The  proposed  approach \nis  similar in  spirit  to data visualization  methods like  projection  pursuit  clustering, \nGTM  [1],  simultaneous clustering and embedding  [6]'  and hierarchical  latent  vari(cid:173)\nable  models  [2] .  It  also  aims  on  visualizing high dimensional  data.  But  while  the \nother  methods try  to  model  the  data itself by  a  low  dimensional generator  model, \nwe  seek  to  model  the  inferred  probabilistic grouping structure.  As  a  consequence, \nthe  framework  is  generic  in  the  sense  that  it  is  applicable  to  any  probabilistic or \nfuzzy  group description. \n\nThe key  idea is  to  interpret  a  given  probabilistic group description  as  an  observa(cid:173)\ntion  of an  underlying  random  process.  We  estimate  a  low- dimensional statistical \nmodel  by  maximum likelihood  inference  which  provides  the  visualization .  To  our \nknowledge  the  proposed  algorithm  provides  the  first  solution  to  the  visualization \nof distributional data, where  the observations  of an  object  consists  of a  histogram \nof measured features .  Such data is  common in data mining applications like image \nretrieval  where  image similarity  is  often  based  on  histograms  of color  or  texture \nfeatures.  Moreover,  our method is  applicable to proximity and co- occurrence  data. \n\n2  Visualizing Probabilistic Group  Structure \n\nLet  a  set  of N  (abstract)  objects CJ  =  {01 , ... , ON}  be  given  which  have  been  par(cid:173)\ntitioned into K  groups or clusters.  Let  the fuzzy  assignment of object  OJ  to cluster \nCv  be given  by  qjv  E [0,1], where  we  assume 2:~=1 qjv =  1 to enable a  probabilistic \ninterpretation .  We  assume  that  there  exists  an  underlying  \"true\"  assignment  of \n\n\f454 \n\nM  Held,  J  Puzicha and J  M  Buhmann \n\nobjects  to  clusters  which  we  encode  by  Boolean  variables  Miv  denoting  whether \nobject  OJ  belongs  to  (has  been  generated  by)  cluster  C v .  We  thus  interpret  qiv  as \nan  empirical  estimate of the  probability  P(Miv = 1).  For notational simplicity, we \nsummarize the assignment variables in  matrices Q  =  (qiv)  and M  = (Miv). \nThe  key  idea for  visualizing group structure  is  to exploit a  low-dimensional statis(cid:173)\ntical  model  which  \"explains\"  the  observed  qiv.  The  parameters  are  estimated  by \nmaximum likelihood  inference  and  provide  a  natural data visualization.  Gaussian \nmixture models in  low  dimensions  (typically d =  2 or d =  3)  are  often  appropriate \nbut  the  scheme  could  be  easily  extended  to  other  classes,  e.g.  hierarchical  mod(cid:173)\nels.  To define  the  Gaussian  mixture  model,  we  first  introduce  a  set  of prototypes \nY  =  {Y1, ... ,YK} C JRd  representing the K  clusters, and a set vector-valued object \nparameters  X  = {Xl, ... ,XN}  C  JRd.  To  model  the  assignment  probabilities,  the \nprototypes Y and the data points X  are  chosen  such  that the  resulting assignment \nprobabilities  are  maximally similar to  the  given  frequencies  Q.  For  the  Gaussian \nmixture model we  have \n\nNote  that  the  probability  distribution  is  invariant  under  translation  and  rotation \nof the complete parameter sets  X, y.  In  addition,  the  scale  parameter  f3  could  be \ndropped since a  change of f3  only results  in  a  rescaling of the prototypes Y and  the \ndata points X.  For the observation  Q the  log-likelihood is  given  by1 \n\nLQ (X,Y) = LLqivlogmiv  . \n\nN  K \n\ni=l v=l \n\n(2) \n\nIt  is  worth  to  note  that  when  the  qiv  =  (Miv}ptrue  are  estimates  obtained  by \na  factorial  distribution,  i.e.  ptrue(M)  = I1 Lv Mivqiv,  then  maximizing  (2)  is \nidentical  to  minimizing  the  Kullback- Leibler  (KL-)divergence  DKdptruellP)  = \nLM p true log (ptrue IP).  In  that case  the similarity to  the recent  approach of Hof(cid:173)\nmann et  al.  [6]  proposed  as  the  minimization of DKdPllptrue)  becomes  apparent. \nCompared  to  [6]  the  role  of P  and  ptrue  is  interchanged.  From  an  information(cid:173)\ntheoretic  viewpoint  DKdptruellP)  is  a  better  choice  as  it quantifies  the coding in(cid:173)\nefficiency  of assuming the distribution P  when  the  true distribution  is  p true .  Note \nthat  the  choice  of the  KL-divergence  as  a  distortion  measure  for  distributions fol(cid:173)\nlows  intrinsically from  the  likelihood  principle.  Maximum likelihood estimates  are \nderived  by  differentiation: \n\n~ qiv  8m iv \n) \nL..J-. -8.  =-2f3L..Jqiv  L..J m i/1Y/1-Yv \nm~v  x, \nv=l \nN  K \n\n~  (~ \n/1=1 \nv=l \n\nN  K \n\nLL qj~  88miV  =-2f3LLqiv(miO'- JO'v)(Xi-YO') \n\ni=l v=l  mw \n\nyO' \n\ni=l v=l \n\n, \n\n(3) \n\nN \n\n-2f3L (miO'  - qiO')  (Xi  - yO') \n\ni=l \n\n(4) \n\nThe  gradients  can  be  used  for  any  gradient  descent  scheme.  In  the  experiments, \nwe  used  (3)-(4)  in conjunction with a simple gradient descent  technique,  which has \n\n1 Here,  it  is  implicitly  assumed  that  all  qiv  have  been  estimated  based  on  the  same \n\namount  of information. \n\n\fVisualizing Group Structure \n\n455 \n\n0.8 \n\n0.6 \n\n+ \n\nFigure  1:  Visualization of two-dimensional artificial data.  Original data generated \nby  the  mixture  model  with  f3  =  1.0  and  5  prototypes.  Crosses  denote  the  data \npoints  Xi,  circles  the  prototypes  Ya.  The  embedding  prototypes  are  plotted  as \nsquares,  while the embedding data points are diamonds.  The contours are given by \n!(x)  =  maXa  (exp (-f3llx - Ya 112)/L~=1 exp (-f3llx - Y/JW)),  For  visualization \npurposes  the embedding is  translated and  rotated in  the correct  position. \nbeen  observed  to be efficient  and reliable up to a few  hundred objects.  From (4)  an \nexplicit formula for  the prototypes may be  recovered \n\nYa \n\n(5) \n\nwhich can be interpreted as an alternative centroid  rule.  The position of the proto(cid:173)\ntypes is dominated by objects with a large deviation between modeled and measured \nassignment probabilities.  Note that  (5)  should not be  used  as  an iterative equation \nas  the corresponding fixed  point is  not contractive. \n\n3  Results \n\nAs a first  experiment we discuss the approximately recoverable case, where we  sam(cid:173)\nple  from  (1)  to  generate  artificial  two-dimensional data and  infer  the  positions  of \nthe sample points and of the prototypes by the visualizing group structure approach \n(see  Fig.  1).  Due  to  iso- contour  lines  in  the  generator  density  and  in  the  visual(cid:173)\nization  density  not  all  data positions  are  recovered  exactly.  We  like  to emphasize \nthat  the  complete  information  available  on  the  grouping  structure  of the  data is \npreserved,  since  the mean KL-divergence  is quite small (Ri  2.10.10- 5 ).  It is worth \nmentioning  that  the  rank-order  of the  assignments  of objects  i  to  clusters  (}'  is \ncompletely preserved. \n\nFor many image retrieval systems image similarity has been  defined  as similarity of \noccurring feature  coefficients,  e.g.  colors or  texture  features.  In  [7],  a  novel  statis(cid:173)\ntical  mixture model for  distributional data,  the  probabilistic  histogram clustering \n(ACM),  has been  proposed  which  we  applied  to extract  the group structure  inher(cid:173)\nent  in  image databases based  on  histograms of textured  image features.  The ACM \nexplains  the observed  data by the generative model: \n\n\f456 \n\nM. Held. J  puzicha and J  M.  Buhmann \n\nFigure 2:  Embedding of the VisTex database with  MDS. \n\n1.  select  an object  OJ  E 0  with  probability Pi, \n2.  choose  a  cluster  Ca  according  to the cluster  membership  Mia  of Oi, \n3.  sample a feature  Vj  E V  from  the cluster conditional distribution  qjla. \n\nThis generative model is  formalized  by \n\nP  (OJ, vjIM,p, q)  =  Pi L Miaqjla \n\nK \n\na=1 \n\n(6) \n\nThe  parameters are  estimated by  maximum likelihood inference.  The assignments \nMia  are treated as  unobserved data in an (annealed) EM  procedure,  which  provides \na  probabilistic group description.  For  the details we  refer  to [7]. \n\nIn  the  experiments,  texture  features  are  extracted  by  a  bank  of  12  Gabor  filters \nwith  3 scales  and 4 orientations.  Different  Gabor channels are  assumed  to be inde(cid:173)\npendently distributed,  which results in a concatenated histogram of the empirically \nmeasured channel distributions.  Each channel was discretized  into 40 bins resulting \nin  a  480  dimensional  histogram  representing  one  image.  For  the  experiments  two \ndifferent  databases were  used. \nIn Fig. 3 a probabilistic K  =  10 cluster solution with 160 images containing different \ntextures  taken  from  the  Brodatz  album  is  visualized.  The  clustering  algorithm \nproduces  8 well  separated clusters,  while  the two clusters in the mid region exhibit \nsubstantial  overlap.  A  close  inspection  of  these  two  clusters  indicates  that  the \nfuzziness  of  the  assignments  in  this  area  is  plausible  as  the  textures  in  this  area \nhave  similar frequency  components in  common. \n\nThe result for  a more complex database of 220 textured images taken from the MIT \nVisTex image database with a large range of uniformly and non-uniformly textured \nimages is depicted in Fig. 4.  This plot indicates that the proposed approach provides \na structured  view on image databases.  Especially the upper left cluster yields some \n\n\fVisualizing Group Structure \n\n457 \n\ninsight  in  the  clustering  solution,  as  this  cluster  consists  of a  large  range  of non(cid:173)\nuniformly  textured  images,  enabling  the  user  to  decide  that  a  higher  number  of \nclusters  might yield  a  better solution.  The visualization  approach fits  naturally in \nan interactive scenario,  where the user can choose  interactively data points to focus \nhis examination to certain areas of interest  in  the clustering solution. \n\nFor comparison, we present in Fig. 2 a multidimensional scaling (Sammon's mapping \n[3])  solution for  the  VisTex database.  A  detailed  inspection  of this  plot  indicates, \nthat  the  embedding  is  locally  quiet  satisfactory,  while  no  global  structure  of the \ndatabase  is  visible.  This  is  explained  by  the  fact,  that  Sammon's mapping  only \ntries to preserve  the object distances,  while our novel approach first  extracts group \nstructure in a high dimensional feature space and than embeds this group structure \nin a low dimensional Euclidean space.  While MDS completely neglects the grouping \nstructure  we  do not care for  the exact  inter object distances. \n\n4  Conclusion \n\nIn  this  contribution,  a  generic  framework for  the  low-dimensional visualization  of \nprobabilistic group structure  was  presented.  The effectiveness  of this approach was \ndemonstrated by experiments on  artificial data as  well  as on  databases of textured \nimages.  While we  have focussed  on  histogram data the generality of the approach \nmakes  it  feasible  to  visualize  a  broad  range  of different  data types,  e.g.  vectorial, \nproximity or co-occurrence data.  Thus, it is useful in a broad variety of applications, \nranging from  image or document  retrieval  tasks,  the  analysis of marketing data to \nthe  inspection  of protein  data.  We  believe  that  this  technique  provides  the  user \nsubstantial insight in  the validity of clustering solutions making the inspection  and \ninterpretation of large databases more  practicable. \n\nA  natural extension  of the  proposed  approach  leads  to  the visualization of hierar(cid:173)\nchical cluster structures  by  a  hierarchy of visualization plots. \n\nReferences \n[1]  C.M.  Bishop,  M.  Svensen,  and  C .K.1.  Williams.  GTM:  the  generative  topographic \n\nmapping.  Neural  Computation, 10(1):215-234,  1998. \n\n[2]  C.M.  Bishop  and  M.  E.  Tipping.  A  hierarchical  latent  variable  model  for  data  visu(cid:173)\n\nalization.  Technical  Report  NCRG/96/028,  Neural Computing Research  Group Dept. \nOf Computer Science  & Applied  Mathematics,  Aston University,  1998. \n\n[3]  T.F.  Cox  and  M.A.A.  Cox.  Multidimensional  Scaling,  volume  59  of  Mongraphs  On \n\nstatistics and applied probability.  Chapman & Hall,  London,  New York,  1994. \n\n[4]  J.C.  Dunn.  A  fuzzy  relative  of the  ISODATA  process  and  its  use  in  detecting  well(cid:173)\n\nseparated clusters.  Journal of Cybernetics, 3:32-57,  1975. \n\n[5]  1.  Gath  and  A.  Geva.  Unsupervised  optimal  fuzzy  clustering.  IEEE  Transactions  on \n\nPattern Analysis and Machine  Intelligence, 11:773-781,  1989. \n\n[6]  T. Hofmann and J.  M.  Buhmann.  Pairwise data clustering by deterministic  annealing. \n\nPAMI,  19(1):1- 25,  1997. \n\n[7]  T.  Hofmann,  J.  Puzicha,  and  M.  I. Jordan.  Learning  from  dyadic  data.  In  Advances \n\nin  Neural Information Processing Systens  11.  MIT Press,  1999. \n\n[8]  F.C.N.  Pereira,  N.Z.  Tishby,  and  L.  Lee.  Distributional  clustering  of English  words. \nIn  30th  Annual Meeting  of the  Association for  Computational Linguistics,  Columbus, \nOhio,  pages  183-190,  1993. \n\n[9]  K.  Rose, E.  Gurewitz, and G.C . Fox.  A deterministic annealing approach to clustering. \n\nPattern  Recognition Letters, 11(9):589-594,  September  1990. \n\n\f458 \n\nM. Held. J.  Puzicha and J.  M.  Buhmann \n\nFigure 3:  Visualization of a probabilistic grouping structure inferred for  a database \nof 160  Brodatz  textures.  A mean KL-divergence  of 0.031  is  obtained. \n\nFigure 4:  Visualization of a  probabilistic grouping structure inferred for  220  images \nof the VisTex database.  A mean KL-divergence  of 0.0018 is  obtained. \n\n\f", "award": [], "sourceid": 1552, "authors": [{"given_name": "Marcus", "family_name": "Held", "institution": null}, {"given_name": "Jan", "family_name": "Puzicha", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}