{"title": "Generalized Model Selection for Unsupervised Learning in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 970, "page_last": 976, "abstract": null, "full_text": "Generalized Model Selection For Unsupervised \n\nLearning In High Dimensions \n\nShivakumar Vaithyanathan \nIBM Almaden Research Center \n650 Harry Road \nSan Jose, CA 95136 \nShiv@almaden.ibm.com \n\nByron Dom \nIBM Almaden Research Center \n650 Harry Road \nSan Jose, CA 95136 \ndom@almaden.ibm.com \n\nAbstract \n\nWe describe a Bayesian approach to  model  selection in unsupervised \nlearning  that  determines  both  the  feature  set  and  the  number  of \nclusters. We then evaluate this scheme (based on marginal likelihood) \nand  one  based  on  cross-validated  likelihood.  For  the  Bayesian \nscheme  we  derive  a  closed-form  solution  of the  marginal  likelihood \nby  assuming  appropriate  forms  of the  likelihood  function  and  prior. \nExtensive  experiments  compare these  approaches  and  all  results  are \nverified  by comparison against ground truth.  In  these experiments the \nBayesian scheme using our objective function gave better results than \ncross-validation. \n\n1 Introduction \n\nRecent efforts  define  the  model  selection  problem  as  one of estimating  the  number  of \nclusters[ 10,  17].  It  is  easy  to  see,  particularly  in  applications  with  large  number  of \nfeatures,  that  various  choices  of  feature  subsets  will  reveal  different  structures \nunderlying  the  data.  It  is  our  contention  that  this  interplay  between  the  feature  subset \nand the number of clusters is essential to provide appropriate views of the data.We thus \ndefine  the  problem  of model  selection  in  clustering  as  selecting  both  the  number  of \nclusters  and  the  feature  subset.  Towards  this  end  we  propose  a  unified  objective \nfunction  whose  arguments  include  the  both  the  feature  space  and  number  of clusters. \nWe then describe two  approaches to  model  selection  using this  objective function.  The \nfirst  approach  is  based  on  a Bayesian  scheme  using  the  marginal  likelihood  for  model \nselection. The second  approach  is  based  on  a scheme using  cross-validated  likelihood. \nIn  section  3  we  apply  these approaches  to  document clustering by  making assumptions \nabout the  document generation  model.  Further,  for  the  Bayesian  approach  we  derive  a \nclosed-form solution for  the marginal likelihood using this document generation model. \nWe  also  describe  a  heuristic  for  initial  feature  selection  based  on  the  distributional \nclustering  of terms.  Section  5  describes  the  experiments  and  our approach  to  validate \nthe  proposed  models  and  algorithms.  Section 6 reports  and  discusses  the results  of our \nexperiments and finally section 7 provides directions for future  work. \n\n\fModel Selection for Unsupervised Learning in High Dimensions \n\n971 \n\n2 Model selection in clustering \n\nModel  selection  approaches  in  clustering  have  primarily  concentrated  on  determining \nthe  number  of  components/clusters.  These  attempts  include  Bayesian  approaches \n[7,10], MDL approaches  [15]  and  cross-validation  techniques  [17] . As  noticed  in  [17] \nhowever,  the optimal  number of clusters is dependent on the feature  space in  which  the \nclustering is  performed. Related work has been described in  [7]. 
2.1 A generalized model for clustering

Let D be a data-set consisting of "patterns" {d_1, ..., d_V}, which we assume to be represented in some feature space T with dimension M. The particular problem we address is that of clustering D into groups such that its likelihood, described by a probability model p(D_T | \Omega), is maximized, where D_T indicates the representation of D in feature space T and \Omega is the structure of the model, which consists of the number of clusters, the partitioning of the feature set (explained below) and the assignment of patterns to clusters. This model is a weighted sum of models {p(D_T | \Omega, \Theta) : \Theta \in R^m}, where \Theta is the set of all parameters associated with \Omega. To define our model we begin by assuming that the feature space T consists of two sets: U - useful features, and N - noise features. Our feature-selection problem will thus consist of partitioning T (into U and N) for a given number of clusters.

Assumption 1  The feature sets represented by U and N are conditionally independent:

p(D_T | \Omega, \Theta) = p(D^N | \Omega, \Theta) \, p(D^U | \Omega, \Theta)    (1)

where D^N indicates the data represented in the noise feature space and D^U indicates the data represented in the useful feature space.

Using Assumption 1 and assuming that the data are independently drawn, we can rewrite equation (1) as

p(D_T | \Omega, \Theta) = \Big[ \prod_{i=1}^{V} p(d_i^N | \Theta^N) \Big] \cdot \Big[ \prod_{k=1}^{K} \prod_{j \in D_k} p(d_j^U | \Theta_k^U) \Big]    (2)

where V is the number of patterns in D, p(d_j^U | \Theta_k^U) is the probability of d_j^U given the parameter vector \Theta_k^U, and p(d_i^N | \Theta^N) is the probability of d_i^N given the parameter vector \Theta^N. Note that while the explicit dependence on \Omega has been removed in this notation, it is implicit in the number of clusters K and the partition of T into N and U.

2.2 Bayesian approach to model selection

The objective function represented in equation (2) is not regularized, and attempts to optimize it directly may result in the set N becoming empty, resulting in overfitting. To overcome this problem we use the marginal likelihood [2].

Assumption 2  All parameter vectors are independent: \pi(\Theta) = \pi(\Theta^N) \cdot \prod_{k=1}^{K} \pi(\Theta_k^U), where \pi(\cdot) denotes a Bayesian prior distribution.

The marginal likelihood, using Assumption 2, can be written as

P(D_T | \Omega) = \int_{S^N} \Big[ \prod_{i=1}^{V} p(d_i^N | \Theta^N) \Big] \pi(\Theta^N) \, d\Theta^N \cdot \prod_{k=1}^{K} \int_{S^U} \Big[ \prod_{j \in D_k} p(d_j^U | \Theta_k^U) \Big] \pi(\Theta_k^U) \, d\Theta_k^U    (3)

where S^N and S^U are integration limits appropriate to the particular parameter spaces. These will be omitted below to simplify the notation.

3.0 Document clustering

Document clustering algorithms typically start by representing the document as a "bag-of-words" in which the features can number ~10^4 to 10^5. Ad-hoc dimensionality reduction techniques such as stop-word removal, frequency-based truncations [16] and techniques such as LSI [5] are available. Once the dimensionality has been reduced, the documents are usually clustered into an arbitrary number of clusters.

3.1 Multinomial models

Several models of text generation have been studied [3]. Our choice is multinomial models using term counts as the features. This choice introduces another parameter indicating the probability of the N and U split. This is equivalent to assuming a generation model in which, for each document, the numbers of noise and useful terms are determined by a probability \theta^S, and the terms in a document are then "drawn" with probabilities given by \theta^N (for noise terms) or \theta_k^U (for useful terms).
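To make the factored likelihood of equation (2) concrete under this multinomial choice, here is a minimal sketch in log space. The function and variable names are ours; the multinomial coefficients and the split probability \theta^S (which enters only in Section 3.2) are omitted here.

```python
import math

def doc_log_prob(term_counts, theta):
    """log p(d | theta) for one document under a multinomial model,
    ignoring the multinomial coefficient."""
    return sum(c * math.log(theta[t]) for t, c in term_counts.items())

def log_likelihood(docs, assign, useful, theta_noise, theta_useful):
    """Equation (2): noise features share one model, useful features use
    the model of the document's cluster.

    docs         : list of {term: count} dictionaries
    assign       : assign[i] = cluster index k of document i
    useful       : set of terms treated as U (the remainder is the noise set N)
    theta_noise  : {term: probability} over noise terms
    theta_useful : theta_useful[k] = {term: probability} over useful terms
    """
    total = 0.0
    for i, doc in enumerate(docs):
        noise_part = {t: c for t, c in doc.items() if t not in useful}
        useful_part = {t: c for t, c in doc.items() if t in useful}
        total += doc_log_prob(noise_part, theta_noise)
        total += doc_log_prob(useful_part, theta_useful[assign[i]])
    return total
```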
3.2 Marginal likelihood / stochastic complexity

To apply our Bayesian objective function we begin by substituting multinomial models into (3) and simplifying to obtain

P(D | \Omega) = \binom{t^N + t^U}{t^N, \, t^U} \int \big[ (\theta^S)^{t^N} (1 - \theta^S)^{t^U} \big] \pi(\theta^S) \, d\theta^S
  \cdot \prod_{k=1}^{K} \Big[ \prod_{i \in D_k} \binom{t_i^U}{\{t_{i,u} : u \in U\}} \Big] \int \Big[ \prod_{u \in U} (\theta_{k,u})^{t_u^{(k)}} \Big] \pi(\theta_k^U) \, d\theta_k^U
  \cdot \Big[ \prod_{i=1}^{V} \binom{t_i^N}{\{t_{i,n} : n \in N\}} \Big] \int \Big[ \prod_{n \in N} (\theta_n)^{t_n} \Big] \pi(\theta^N) \, d\theta^N    (4)

where \binom{\cdot}{\cdots} is the multinomial coefficient, t_{i,u} is the number of occurrences of the feature (term) u in document i, t_i^U = \sum_{u \in U} t_{i,u} is the total number of useful features (terms) in document i, and t_{i,n} and t_i^N are interpreted similarly but for the noise features; t_u^{(k)} = \sum_{i \in D_k} t_{i,u} and t_n = \sum_{i} t_{i,n} are the corresponding counts aggregated over a cluster and over the whole collection; t^N is the total number of all noise features in all patterns and t^U is the total number of all useful features in all patterns.

To solve (4) we still need a form for the priors \pi(\cdot). The Beta family is conjugate to the Binomial family [2], so we choose the Dirichlet distribution (multiple Beta) as the form for both \pi(\theta_k^U) and \pi(\theta^N), and the Beta distribution for \pi(\theta^S). Substituting these into equation (4) and simplifying yields

P(D | \Omega) = \Big[ \frac{\Gamma(\gamma_a + \gamma_b)}{\Gamma(\gamma_a)\Gamma(\gamma_b)} \cdot \frac{\Gamma(t^N + \gamma_a)\,\Gamma(t^U + \gamma_b)}{\Gamma(t^U + t^N + \gamma_a + \gamma_b)} \Big]
  \cdot \Big[ \frac{\Gamma(\beta)}{\Gamma(\beta + t^N)} \prod_{n \in N} \frac{\Gamma(\beta_n + t_n)}{\Gamma(\beta_n)} \Big]
  \cdot \Big[ \frac{\Gamma(\alpha')}{\Gamma(\alpha' + V)} \prod_{k=1}^{K} \frac{\Gamma(\alpha'_k + |D_k|)}{\Gamma(\alpha'_k)} \Big]
  \cdot \prod_{k=1}^{K} \Big[ \frac{\Gamma(\alpha)}{\Gamma(\alpha + t^{U(k)})} \prod_{u \in U} \frac{\Gamma(\alpha_u + t_u^{(k)})}{\Gamma(\alpha_u)} \Big]    (5)

where \beta_n and \alpha_u are the hyper-parameters of the Dirichlet priors for the noise and useful features respectively, \beta = \sum_n \beta_n, \alpha = \sum_u \alpha_u, \alpha' = \sum_k \alpha'_k, and \Gamma(\cdot) is the gamma function. Further, \gamma_a and \gamma_b are the hyper-parameters of the Beta prior for the split probability, |D_k| is the number of documents in cluster k, and t^{U(k)} is computed as \sum_{i \in D_k} t_i^U. The results reported for our evaluation will be the negative of the log of equation (5), which (following Rissanen [14]) we refer to as Stochastic Complexity (SC). In our experiments all values of the hyper-parameters \beta_n, \alpha_u, \alpha'_k, \gamma_a and \gamma_b are set equal to 1, yielding uniform priors.

3.3 Cross-validated likelihood

To compute the cross-validated likelihood using multinomial models we first substitute the multinomial functional forms, using the MLEs found on the training set. This results in the following equation

P(D_T^{cv} | \Omega, \hat{\Theta}) = \big[ (\hat{\theta}^S)^{t^N} (1 - \hat{\theta}^S)^{t^U} \big] \cdot \prod_{i=1}^{V} p(d_i^{cv,N} | \hat{\Theta}^N) \cdot \prod_{k=1}^{K} \prod_{j \in D_k} p(d_j^{cv,U} | \hat{\Theta}_k^U)    (6)

where \hat{\theta}^S, \hat{\Theta}^N and \hat{\Theta}_k^U are the MLEs of the appropriate parameter vectors, and the counts and documents are those of the held-out (cross-validation) set. For our implementation of MCCV, following the suggestion in [17], we have used a 50% split into training and test sets. For the vCV criterion, although a value of v = 10 was suggested therein, for computational reasons we have used v = 5.
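Because equation (5) is a product of gamma-function ratios, the SC objective can be evaluated directly in log space. The sketch below is our illustration under the uniform (all-ones) hyper-parameters used in the experiments; the multinomial coefficients of equation (4) are dropped since they are constant across models sharing the same N/U split, and the data layout and function names are assumptions, not the authors' code.

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alphas):
    """log [ Gamma(sum a)/Gamma(sum a + sum n) * prod_i Gamma(a_i + n_i)/Gamma(a_i) ]."""
    a, n = sum(alphas), sum(counts)
    out = lgamma(a) - lgamma(a + n)
    for c, al in zip(counts, alphas):
        out += lgamma(al + c) - lgamma(al)
    return out

def stochastic_complexity(t_noise, t_useful_by_cluster, cluster_sizes,
                          gamma_a=1.0, gamma_b=1.0):
    """Negative log of equation (5) with uniform Dirichlet/Beta hyper-parameters.

    t_noise             : list with one entry per noise term n, the collection count t_n
    t_useful_by_cluster : t_useful_by_cluster[k] is a list with one entry per useful
                          term u, the within-cluster count t_u^(k)
    cluster_sizes       : list of |D_k|, one entry per cluster
    """
    tN = sum(t_noise)
    tU = sum(sum(tk) for tk in t_useful_by_cluster)
    # Beta-binomial term for the noise/useful split probability theta^S.
    log_p = (lgamma(gamma_a + gamma_b) - lgamma(gamma_a) - lgamma(gamma_b)
             + lgamma(tN + gamma_a) + lgamma(tU + gamma_b)
             - lgamma(tN + tU + gamma_a + gamma_b))
    # Dirichlet-multinomial term for the noise features (one model shared by all documents).
    log_p += log_dirichlet_multinomial(t_noise, [1.0] * len(t_noise))
    # Dirichlet-multinomial term for the cluster sizes |D_k|.
    log_p += log_dirichlet_multinomial(cluster_sizes, [1.0] * len(cluster_sizes))
    # One Dirichlet-multinomial term per cluster for the useful features.
    for tk in t_useful_by_cluster:
        log_p += log_dirichlet_multinomial(tk, [1.0] * len(tk))
    return -log_p  # stochastic complexity = -log P(D | Omega)
```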
3.4 Feature subset selection algorithm for document clustering

As noted in Section 2.1, for a feature set of size M there are a total of 2^M partitions, and for large M it would be computationally intractable to search through all possible partitions to find the optimal subset. In this section we propose a heuristic method to obtain a subset of tokens that are topical (indicative of underlying topics) and can be used as features in the bag-of-words model to cluster documents.

3.4.1 Distributional clustering for feature subset selection

Identifying content-bearing and topical terms is an active research area [9]. We are less concerned with modeling the exact distributions of individual terms than with simply identifying groups of terms that are topical. Distributional clustering (DC), apparently first proposed by Pereira et al. [13], has been used for feature selection in supervised text classification [1] and for clustering images in video sequences [8]. We hypothesize that function, content-bearing and topical terms have different distributions over the documents. DC helps reduce the size of the search space for feature selection from 2^M to 2^C, where C is the number of clusters produced by the DC algorithm. Following the suggestions in [9], we compute the following histogram for each token: the first bin is the number of documents with zero occurrences of the token, the second bin is the number of documents with a single occurrence of the token, and the third bin is the number of documents that contain two or more occurrences of the token. The histograms are clustered using relative entropy \Delta(\cdot \| \cdot) as a distance measure. For two terms with probability distributions p_1(\cdot) and p_2(\cdot), this is given by [4]:

\Delta(p_1(t) \| p_2(t)) = \sum_{t} p_1(t) \log \frac{p_1(t)}{p_2(t)}    (7)

We use a k-means-style algorithm in which the histograms are normalized to sum to one and the sum in equation (7) is taken over the three bins corresponding to counts of 0, 1, and >= 2. During the assignment-to-clusters step of k-means we compute \Delta(p_w \| p_{C_k}) (where p_w is the normalized histogram for term w and p_{C_k} is the centroid of cluster k) and the term w is assigned to the cluster for which this is minimum [13, 8].

4.0 Experimental setup

Our evaluation experiments compared the clustering results against human-labeled ground truth. The corpus used was the AP Reuters Newswire articles from the TREC-6 collection. A total of 8235 documents from the routing track, belonging to 25 classes, were analyzed in our experiments. To simplify matters we disregarded multiple assignments and retained each document as a member of a single class.

4.1 Mutual information as an evaluation measure of clustering

We verify our models by comparing our clustering results against pre-classified text. We force all clustering algorithms to produce exactly as many clusters as there are classes in the pre-classified text and we report the mutual information (MI) [4] between the cluster labels and the pre-classified class labels.
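As a concrete reading of this evaluation measure, the sketch below computes the MI between class labels and cluster labels from their empirical joint distribution. The paper states only that the reported numbers are scaled so the theoretical maximum is 1.0; normalizing by the smaller of the two label entropies is our assumption, as are the function and variable names.

```python
import numpy as np

def normalized_mutual_information(class_labels, cluster_labels):
    """Mutual information between ground-truth classes and cluster labels,
    normalized to lie in [0, 1] (normalization by min entropy is an assumption)."""
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_labels))
    # Empirical joint distribution over (class, cluster) pairs.
    joint = np.zeros((len(classes), len(clusters)))
    for c, k in zip(class_labels, cluster_labels):
        joint[classes.index(c), clusters.index(k)] += 1
    joint /= joint.sum()
    p_class = joint.sum(axis=1)
    p_cluster = joint.sum(axis=0)
    mi = 0.0
    for i in range(len(classes)):
        for j in range(len(clusters)):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (p_class[i] * p_cluster[j]))
    h_class = -np.sum(p_class[p_class > 0] * np.log(p_class[p_class > 0]))
    h_cluster = -np.sum(p_cluster[p_cluster > 0] * np.log(p_cluster[p_cluster > 0]))
    return mi / min(h_class, h_cluster)
```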
5.0 Results and discussions

After tokenizing the documents and discarding terms that appeared in fewer than 3 documents we were left with 32,450 unique terms. We experimented with several numbers of clusters for DC but, for lack of space, report only the best (lowest SC). For each of these numbers of clusters we chose the best of 20 runs corresponding to different random starting clusters. Each of these sets includes one cluster consisting of high-frequency words which, upon examination, was found to contain primarily function words; we eliminated it from further consideration. The remaining non-function-word clusters were used as feature sets for the clustering algorithm. Only combinations of feature sets that produced good results were used for further document clustering runs.

We initialized the EM algorithm using the k-means algorithm; other initialization schemes are discussed in [11]. The feature vectors used in this k-means initialization were generated using the pivoted normalization weighting suggested in [16]. All parameter vectors \Theta_k^U and \Theta^N were estimated using Laplace's Rule of Succession [2]. Table 1 shows the best results of the SC criterion, the vCV and the MCCV using the feature subsets selected by the different combinations of distributional clusters. The feature subsets are coded as FSXP, where X indicates the number of clusters in the distributional clustering and P indicates the cluster number(s) used as U. For SC and MI all results reported are averages over 3 runs of the k-means+EM combination with different initializations of k-means. For clarity, the MI numbers reported are normalized such that the theoretical maximum is 1.0. We also show comparisons against no feature selection (NF) and LSI. For LSI, the principal 165 eigenvectors were retained and k-means clustering was performed in the reduced-dimensional space. While determining the number of clusters, for computational reasons we have limited our evaluation to only the feature subset that provided us with the highest MI, i.e., FS41-3.

Feature Set | Useful Features | SC x 10^7 | vCV x 10^7 | MCCV x 10^7 | MI
FS41-3      | 6,157           | 2.66      | 0.61       | 1.32        | 0.61
FS52        | 386             | 2.8       | 0.3        | 0.69        | 0.51
NF          | 32,450          | 2.96      | 1.25       | 2.8         | 0.58
LSI         | 32,450/165      | NA        | NA         | NA          | 0.57

Table 1: Comparison of results.

[Figure 1: MI versus SC. Figure 2: MI versus the MCCV average log-likelihood.]

5.3 Discussion

The consistency between the MI and SC (Figure 1) is striking. The monotonic trend is more apparent at higher SC, indicating that bad clusterings are more easily detected by SC, while as the solution improves the differences are more subtle. Note that the best values of SC and MI coincide.
Given the assumptions made in deriving equation (5), this consistency is encouraging. The interested reader is referred to [18] for more details. Figures 2 and 3 indicate that there is also a reasonable consistency between the cross-validated likelihood and the MI, although not as striking as for SC. Note that the MI for the feature sets picked by MCCV and vCV is significantly lower than that of the best feature set. Figures 4, 5 and 6 show the plots of SC, MCCV and vCV as the number of clusters is increased. Using SC we see that FS41-3 reveals an optimal structure around 40 clusters. As with feature selection, both MCCV and vCV obtain models of lower complexity than SC; both show an optimum of about 30 clusters. More experiments are required before we draw final conclusions; however, the full Bayesian approach seems a practical and useful approach for model selection in document clustering. Our choice of likelihood function and priors provides a closed-form solution that is computationally tractable and provides meaningful results.

6.0 Conclusions

In this paper we tackled the problem of model structure determination in clustering. The main contribution of the paper is a Bayesian objective function that treats optimal model selection as choosing both the number of clusters and the feature subset. An important aspect of our work is a formal notion that forms a basis for doing feature selection in unsupervised learning. We then evaluated two approaches for model selection: one using this objective function and the other based on cross-validation. Both approaches performed reasonably well, with the Bayesian scheme outperforming the cross-validation approaches in feature selection. More experiments using different parameter settings for the cross-validation schemes and different priors for the Bayesian scheme should result in better understanding and therefore more powerful applications of these approaches.

[Figures 4-6: SC, MCCV and vCV plotted against the number of clusters.]

References

[1] Baker, D., et al., Distributional Clustering of Words for Text Classification, SIGIR, 1998.
[2] Bernardo, J.M. and Smith, A.F.M., Bayesian Theory, Wiley, 1994.
[3] Church, K.W., et al., Poisson Mixtures, Natural Language Engineering, 1(2), 1995.
[4] Cover, T.M. and Thomas, J.A., Elements of Information Theory, Wiley-Interscience, 1991.
[5] Deerwester, S., et al., Indexing by Latent Semantic Analysis, JASIS, 1990.
[6] Dempster, A., et al., Maximum Likelihood from Incomplete Data via the EM Algorithm, JRSS, 39, 1977.
[7] Hanson, R., et al., Bayesian Classification with Correlation and Inheritance, IJCAI, 1991.
[8] Iyengar, G., Clustering images using relative entropy for efficient retrieval, VLBV, 1998.
[9] Katz, S.M., Distribution of content words and phrases in text and language modeling, Natural Language Engineering, 2, 1996.
[10] Kontkanen, P.T., et al., Comparing Bayesian Model Class Selection Criteria by Discrete Finite Mixtures, ISIS'96 Conference, 1996.
[11] Meila, M. and Heckerman, D., An Experimental Comparison of Several Clustering and Initialization Methods, Microsoft Research Technical Report MSR-TR-98-06.
[12] Nigam, K., et al., Learning to Classify Text from Labeled and Unlabeled Documents, AAAI, 1998.
[13] Pereira, F.C.N., et al., Distributional clustering of English words, ACL, 1993.
[14] Rissanen, J., Stochastic Complexity in Statistical Inquiry, World Scientific, 1989.
[15] Rissanen, J. and Ristad, E., Unsupervised classification with stochastic complexity, The US/Japan Conference on the Frontiers of Statistical Modeling, 1992.
[16] Singhal, A., et al., Pivoted Document Length Normalization, SIGIR, 1996.
[17] Smyth, P., Clustering using Monte Carlo cross-validation, KDD, 1996.
[18] Vaithyanathan, S. and Dom, B., Model Selection in Unsupervised Learning with Applications to Document Clustering, IBM Research Report RJ-10137 (95012), Dec. 14, 1998.
", "award": [], "sourceid": 1644, "authors": [{"given_name": "Shivakumar", "family_name": "Vaithyanathan", "institution": null}, {"given_name": "Byron", "family_name": "Dom", "institution": null}]}