{"title": "Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 543, "page_last": 549, "abstract": null, "full_text": "Very Fast EM-based Mixture Model \n\nClustering using Multiresolution kd-trees \n\nAndrew W. Moore \n\nRobotics Inst.i t. ut.e, Carnegie tl1plloll University \n\nPittsburgh , PA 15:21:3. a\\\\'l11'9.'cs.cll1u.eciu \n\nAbstract \n\nClust ering is impor ta nt in m any fields including m anufac tlll'ing , \nbiol og~', fin ance , a nd astronomy. l\\Iixturp models arp a popula r ap(cid:173)\nproach due to their st.atist.ical found a t.ions, and EM is a very pop(cid:173)\nular l1wthocl for fillding mixture models. EM, however, requires \nlllany accesses of the dat a , a nd thus h as been dismissed as imprac(cid:173)\nt ical (e.g. [9]) for d ata mining of enormous dataset.s. We present a \nnt' \\\\\u00b7 algorit.hm, baspd on thp l1lultiresolution ~.'Cl- trees of [5] , which \ndramatically reelucps the cost of EtlI-baspd clusteriug , wit.h savings \nrising linearl:; wit.h the number of datapoints. Although prespnt.pd \nlwre for maximum likplihoocl estimation of Gaussian mixt.ure mod(cid:173)\nf'ls , it. is also applicable to non-(~aussian models (provided class \ndensit.ies are monotonic in Mahalanobis dist.ance), mixed categori(cid:173)\ncal/ nUllwric clusters. anel Bayesian nwthocls such as Antoclass [1]. \n\n1 Learning Mixture Models \nIn a Gaussian mixture lllod f'l (e.g. [3]) , we aSSUI1W t.hat d ata points {Xl .. . XR} ha\\'p \nbef'n gelw r \n\nbe extremized. \n\nNO .HYPERHECT, using the Mahalanobis cli::otancp MHD(x, x') = (x-x/)T~.j I (x-x'). \nCall tllf':';P short.pst and furtllf'st squarpcl distancps illHDI11I11 and JIHDl11 ax . Then \n\n(D) \n\nis a lowpr bound for minx, END (lij , with a similar dpfinition of aTflX . 
Then write \n\nmin_{x_i in ND} w_ij = min_{x_i in ND} a_ij p_j / (sum_k a_ik p_k) = min_{x_i in ND} a_ij p_j / (a_ij p_j + sum_{k != j} a_ik p_k) >= a_j^min p_j / (a_j^min p_j + sum_{k != j} a_k^max p_k) = w_j^min \n\nwhere w_j^min is our lower bound. There is a similar definition for w_j^max. The inequality is proved by elementary algebra, and requires that all quantities are positive (which they are). We can often tighten the bounds further using a procedure that exploits the fact that sum_j w_ij = 1, but space does not permit further discussion. We will prune if w_j^min and w_j^max are close for all j. What should be the criterion for closeness? The first idea that springs to mind is: Prune if for all j, (w_j^max - w_j^min < t). But such a simple criterion is not suitable: some classes may be accumulating very large sums of weights, whilst others may be accumulating very small sums. The large-sum-weight classes can tolerate far looser bounds than the small-sum-weight classes. Here, then, is a more satisfactory pruning criterion: Prune if for all j, (w_j^max - w_j^min < tau w_j^total), where w_j^total is the total weight awarded to class j over the entire dataset, and tau is some small constant. Sadly, w_j^total is not known in advance, but happily we can find a lower bound on w_j^total of w_j^sofar + ND.NUMPOINTS x w_j^min, where w_j^sofar is the total weight awarded to class j so far during the search over the kd-tree. \n\nThe algorithm as described so far performs divide-and-conquer-with-cutoffs on the set of datapoints. In addition, it is possible to achieve an extra acceleration by means of divide and conquer on the class centers. Suppose there were N = 100 classes. Instead of considering all 100 classes at all nodes, it is frequently possible to determine at some node that the maximum possible weight
w_j^max for some class j is less than a minuscule fraction of the minimum possible weight w_k^min for some other class k. Thus if we ever find that in some node w_j^max < lambda w_k^min, where lambda = 10^-4, then class c_j is removed from consideration from all descendants of the current node. Frequently this means that near the tree's leaves, only a tiny fraction of the classes compete for ownership of the datapoints, and this leads to large time savings. \n\n2 Results \n\nWe have subjected this approach to numerous Monte-Carlo empirical tests. Here we report on one set of such tests, created with the following methodology. \n\n\u2022 We randomly generate a mixture of Gaussians in M-dimensional space (by default M = 2). The number of Gaussians, N, is, by default, 20. Each Gaussian has a mean lying within the unit hypercube, and a covariance matrix randomly generated with diagonal elements between 0 and 4 sigma^2 (by default, sigma = 0.05) and random non-diagonal elements that ensure symmetric positive definiteness. Thus the distance from a Gaussian center to its 1-standard-deviation contour is of the order of magnitude of sigma. \n\n\u2022 We randomly generate a dataset from the mixture model. The number of points, R, is (by default) 160,000. Figure 2 shows a typical generated set of Gaussians and datapoints. \n\n\u2022 We then build an mrkd-tree for the dataset, and record the memory requirements and real time to build (on a Pentium 200MHz, in seconds). \n\n\u2022 We then run EM on the data. EM begins with an entirely different set of Gaussians, randomly generated using the same procedure.
\n\n\u2022 We run 5 iterations of the conventional EM algorithm and the new mrkd-tree-based algorithm. The new algorithm uses a default value of 0.1 for tau. We record the real time (in seconds) for each iteration of each algorithm, and we also record the mean log-likelihood score (1/R) sum_{i=1}^{R} log P(x_i | theta_t) for the t-th model for both algorithms. \n\nFigure 3 shows the nodes that are visited during Iteration 2 of the Fast EM with N = 6 classes. Table 1 shows the detailed results as the experimental parameters are varied. Speedups vary from 8-fold to 1000-fold. There are 100-fold speedups even with very wide (non-local) Gaussians. In other experiments, similar results were also obtained on real datasets that disobey the Gaussian assumption. There too, we find one- and two-order-of-magnitude computational advantages with indistinguishable statistical behavior (no better and no worse) compared with conventional EM. \n\nReal Data: Preliminary experiments in applying this to large datasets have been encouraging. For three-dimensional galaxy clustering with 800,000 galaxies and 1000 clusters, traditional EM needed 35 minutes per iteration, while the mrkd-trees required only 8 seconds. With 1.6 million galaxies, traditional EM needed 70 minutes and mrkd-trees required 14 seconds. \n\n3 Conclusion \n\nThe use of variable resolution structures for clustering has been suggested in many places (e.g. [7, 8, 4, 9]). The BIRCH system, in particular, is popular in the database community. BIRCH is, however, unable to identify second-moment features of clusters (such as non-axis-aligned spread).
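To make the pruning machinery of Section 1 concrete, here is a hedged sketch (invented names, not the paper's code) of the weight bounds w_j^min and w_j^max, the tau-based pruning test using the lower bound on w_j^total, and the lambda test that drops classes from a node's descendants. It assumes the per-node density bounds a_min[j], a_max[j] and the mixing weights p[j] have already been computed.

```python
import numpy as np

def weight_bounds(a_min, a_max, p):
    """Per-class bounds on the ownership weights w_ij over a node:
    w_j^min = a_j^min p_j / (a_j^min p_j + sum_{k != j} a_k^max p_k),
    and symmetrically for w_j^max."""
    num_min = a_min * p
    num_max = a_max * p
    tot_min = np.sum(num_min)
    tot_max = np.sum(num_max)
    w_min = num_min / (num_min + (tot_max - num_max))
    w_max = num_max / (num_max + (tot_min - num_min))
    return w_min, w_max

def can_prune(w_min, w_max, w_sofar, n_points, tau=0.1):
    """Prune if, for all j, w_j^max - w_j^min < tau * (lower bound on
    w_j^total), the bound being w_j^sofar + n_points * w_j^min."""
    w_total_lb = w_sofar + n_points * w_min
    return np.all(w_max - w_min < tau * w_total_lb)

def active_classes(w_min, w_max, lam=1e-4):
    """Divide and conquer on class centers: class j stays active in a
    node's descendants only if w_j^max is at least lam times the largest
    minimum weight of any class."""
    return w_max >= lam * np.max(w_min)
```

In an EM iteration, `can_prune` decides whether a node's entire set of datapoints can be summarized at once, and `active_classes` shrinks the list of classes passed to the node's children.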
Our contributions have been the use of a multi-resolution approach, with associated computational benefits, and the introduction of an efficient algorithm that leaves the statistical aspects of mixture model estimation unchanged. The growth of recent data mining algorithms that are not based on statistical foundations has frequently been justified by the following statement: using state-of-the-art statistical techniques is too expensive because such techniques were not designed to handle large datasets and become intractable with millions of data points. In earlier work we provided evidence that this statement may \n\nEffect of Number of Datapoints, R: As R increases so does the computational advantage, essentially linearly. The tree-build time (11 seconds at worst) is a tiny cost compared with even just one iteration of Regular EM (2385 seconds, on the big dataset). FinalSlowSecs: 2385. FinalFastSecs: 3. \n\n[graph: speedup factor vs. number of points (in thousands)] \n\nEffect of Number of Dimensions, M: As with many kd-tree algorithms, the benefits decline as dimensionality increases, yet even in 6 dimensions, there is an 8-fold advantage. FinalSlowSecs: 2742. FinalFastSecs: 310.25. \n\nEffect of Number of Classes, N: Conventional EM slows down linearly with the number of classes. Fast EM is clearly sublinear, with a 70-fold speedup even with 320 classes. Note how the tree size grows. This is because more classes mean a more uniform data distribution and fewer datapoints \"sharing\" tree leaves. FinalSlowSecs: 9278. FinalFastSecs: 143.3.
\nEffect of Tau, tau: The larger tau, the more willing we are to prune during the tree search, and thus the faster we search, but the less accurately we mirror EM's statistical behavior. Indeed when tau is large, the discrepancy in the log likelihood is relatively large. FinalSlowSecs: 584.5. FinalFastSecs: .) \n\nEffect of Standard Deviation, sigma: Even with very wide Gaussians, with wide support, we still get large savings. The nodes that are pruned in these cases are rarely nodes with one class owning all the probability, but instead are nodes where all classes have non-zero, but little varying, probability. FinalSlowSecs: 58.1. FinalFastSecs: 4.75. \n\n[graph axis labels: Number of Inputs (1 to 6); Number of centers (5 to 320); sigma (0.01 to 0.4)] \n\nTable 1: In all the above results all parameters were held at their default values except for one, which varied as shown in the graphs. Each graph shows the factor by which the new EM is faster than the conventional EM. Below each graph is the time to build the mrkd-tree in seconds and the number of nodes in the tree. Note that although the tree building cost is not included in the speedup calculation, it is negligible in all cases, especially considering that only one tree build is needed for all EM iterations. Does the approximate nature of this process result in inferior clusters? The answer is no: the quality of clusters is indistinguishable between the slow and fast methods when measured by log-likelihood and when viewed visually. \n\nFigure 2: A typical set of Gaussians generated by our random procedure. They in turn generate the datasets upon which we compare the performance of the old and new implementations of EM.
\n\nFigure 3: The ellipses show the model theta_t at the start of an EM iteration. The rectangles depict the mrkd-tree nodes that were pruned. Observe larger rectangles (and larger savings) in areas with less variation in class probabilities. Note that the algorithm is not merely pruning where the data density is low. \n\nnot apply for locally weighted regression [5] or Bayesian network learning [6], and we hope this paper provides some evidence that it also needn't apply to clustering. \n\nReferences \n\n[1] P. Cheeseman and R. Oldford. Selecting Models from Data: Artificial Intelligence and Statistics IV. Lecture Notes in Statistics, vol. 89. Springer Verlag, 1994. \n\n[2] K. Deng and A. W. Moore. Multiresolution Instance-based Learning. In Proceedings of IJCAI-95. Morgan Kaufmann, 1995. \n\n[3] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.", "award": [], "sourceid": 1490, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}]}