{"title": "Active Data Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 528, "page_last": 534, "abstract": "", "full_text": "Active Data Clustering \n\nThomas Hofmann \n\nCenter for Biological and Computational Learning, MIT \n\nCambridge, MA 02139, USA, hofmann@ai.mit.edu \n\nJoachim M. Buhmann \n\nInstitut fur Informatik III, Universitat Bonn \n\nRomerstraBe 164, D-53117 Bonn, Germany, jb@cs.uni-bonn.de \n\nAbstract \n\nActive data clustering is a novel technique for clustering of proxim(cid:173)\nity data which utilizes principles from sequential experiment design \nin order to interleave data generation and data analysis. The pro(cid:173)\nposed active data sampling strategy is based on the expected value \nof information, a concept rooting in statistical decision theory. This \nis considered to be an important step towards the analysis of large(cid:173)\nscale data sets, because it offers a way to overcome the inherent \ndata sparseness of proximity data. '''Ie present applications to unsu(cid:173)\npervised texture segmentation in computer vision and information \nretrieval in document databases. \n\n1 \n\nIntroduction \n\nData clustering is one of the core methods for numerous tasks in pattern recognition, \nexploratory data analysis, computer vision, machine learning, data mining, and in \nmany other related fields. Concerning the data representation it is important to \ndistinguish between vectorial data and proximity data, cf. [Jain, Dubes, 1988]. In \nvectorial data each measurement corresponds to a certain 'feature' evaluated at an \nexternal scale. The elementary measurements of proximity data are, in contrast, \n(dis-)similarity values obtained by comparing pairs of entities from a given data set. \nGenerating proximity data can be advantageous in cases where 'natural' similarity \nfunctions exist, while extracting features and supplying a meaningful vector-space \nmetric may be difficult. 
We will illustrate the data generation process for two exemplary applications: unsupervised segmentation of textured images and data mining in a document database.

Textured image segmentation deals with the problem of partitioning an image into regions of homogeneous texture. In the unsupervised case, this has to be achieved on the basis of texture similarities without prior knowledge about the occurring textures. Our approach follows the ideas of [Geman et al., 1990] to apply a statistical test to empirical distributions of image features at different sites. Suppose we decided to work with the gray-scale representation directly. At every image location p = (x, y) we consider a local sample of gray-values, e.g., in a squared neighborhood around p. Then, the dissimilarity between two sites p_i and p_j is measured by the significance of rejecting the hypothesis that both samples were generated from the same probability distribution. Given a suitable binning (t_k)_{1≤k≤R} and histograms f_i, f_j, respectively, we propose to apply a χ²-test, i.e.,

(1)

In fact, our experiments are based on a multi-scale Gabor filter representation instead of the raw data, cf. [Hofmann et al., 1997] for more details. The main advantage of the similarity-based approach is that it does not reduce the distributional information, e.g., to some simple first- and second-order statistics, before comparing textures. This preserves more information and also avoids the ad hoc specification of a suitable metric like a weighted Euclidean distance on vectors of extracted moment statistics.

As a second application we consider structuring a database of documents for improved information retrieval. Typical measures of association are based on the number of shared index terms [Van Rijsbergen, 1979].
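The χ²-statistic of Eq. (1) above did not survive the scan. As a minimal sketch, assuming the standard two-sample form with pooled histogram f̄ = (f_i + f_j)/2 — the function name and this exact variant are our assumptions, not necessarily the authors' — it can be computed as:

```python
import numpy as np

def chi2_dissimilarity(f_i, f_j):
    """Two-sample chi-square statistic between gray-value histograms
    f_i, f_j (count vectors over a common binning t_1, ..., t_R).
    Assumed form: sum_k (f_i - fbar)^2/fbar + (f_j - fbar)^2/fbar,
    with pooled bin frequencies fbar = (f_i + f_j)/2."""
    f_i = np.asarray(f_i, dtype=float)
    f_j = np.asarray(f_j, dtype=float)
    f_bar = 0.5 * (f_i + f_j)
    mask = f_bar > 0                      # skip bins empty in both samples
    return float(np.sum((f_i[mask] - f_bar[mask]) ** 2 / f_bar[mask])
                 + np.sum((f_j[mask] - f_bar[mask]) ** 2 / f_bar[mask]))
```

Identical histograms yield 0; the statistic grows with the significance of rejecting the hypothesis that both samples stem from a common distribution.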
For example, a document is represented by a (sparse) binary vector B, where each entry corresponds to the occurrence of a certain index term. The dissimilarity can then be defined by the cosine measure

(2)

Notice that this measure (like many others) may violate the triangle inequality.

2 Clustering Sparse Proximity Data

In spite of potential advantages of similarity-based methods, their major drawback seems to be the scaling behavior with the number of data: given a dataset with N entities, the number of potential pairwise comparisons scales with O(N²). Clearly, it is prohibitive to exhaustively perform or store all dissimilarities for large datasets, and the crucial problem is how to deal with this unavoidable data sparseness. More fundamentally, it is already the data generation process which has to solve the problem of experimental design, by selecting a subset of pairs (i, j) for evaluation. Obviously, a meaningful selection strategy could greatly profit from any knowledge about the grouping structure of the data. This observation leads to the concept of performing a sequential experimental design which interleaves the data clustering with the data acquisition process. We call this technique active data clustering, because it actively selects new data, and uses tentative knowledge to estimate the relevance of missing data. It amounts to inferring from the available data not only a grouping structure, but also learning which future data is most relevant for the clustering problem. This fundamental concept may also be applied to other unsupervised learning problems suffering from data sparseness.

The first step in deriving a clustering algorithm is the specification of a suitable objective function.
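The cosine measure of Eq. (2) above is likewise lost to the scan; a common instantiation on binary index-term vectors, D_ij = 1 − B_i·B_j/(‖B_i‖ ‖B_j‖) (our naming and normalization), is sketched below, together with a concrete instance of the triangle-inequality violation just mentioned:

```python
import numpy as np

def cosine_dissimilarity(b_i, b_j):
    """1 - cos(B_i, B_j) for (sparse) binary index-term vectors."""
    b_i = np.asarray(b_i, dtype=float)
    b_j = np.asarray(b_j, dtype=float)
    denom = np.linalg.norm(b_i) * np.linalg.norm(b_j)
    if denom == 0.0:
        return 1.0                        # convention for empty documents
    return 1.0 - float(b_i @ b_j) / denom

# The triangle inequality can fail: here D(a, c) > D(a, b) + D(b, c).
a, b, c = [1, 0], [1, 1], [0, 1]
violation = cosine_dissimilarity(a, c) > \
    cosine_dissimilarity(a, b) + cosine_dissimilarity(b, c)
```

With a = (1, 0), b = (1, 1), c = (0, 1): D(a, c) = 1 while D(a, b) + D(b, c) ≈ 0.59, so the measure is not a metric.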
In the case of similarity-based clustering this is not at all a trivial problem and we have systematically developed an axiomatic approach based on invariance and robustness principles [Hofmann et al., 1997]. Here, we can only give some informal justifications for our choice. Let us introduce indicator functions to represent data partitionings, M_iv being the indicator function for entity o_i belonging to cluster C_v. For a given number K of clusters, all Boolean functions are summarized in terms of an assignment matrix M ∈ {0, 1}^{N×K}. Each row of M is required to sum to one in order to guarantee a unique cluster membership. To distinguish between known and unknown dissimilarities, index sets or neighborhoods N = (N_1, ..., N_N) are introduced. If j ∈ N_i this means the value of D_ij is available, otherwise it is not known. For simplicity we assume the dissimilarity measure (and in turn the neighborhood relation) to be symmetric, although this is not a necessary requirement. With the help of these definitions the proposed criterion to assess the quality of a clustering configuration is given by

H(M; D, N) = Σ_{i=1}^{N} Σ_{v=1}^{K} M_iv d_iv.   (3)

H additively combines contributions d_iv for each entity, where d_iv corresponds to the average dissimilarity to entities belonging to cluster C_v. In the sparse data case, averages are restricted to the fraction of entities with known dissimilarities, i.e., the subset of entities belonging to C_v ∩ N_i.

3 Expected Value of Information

To motivate our active data selection criterion, consider the simplified sequential problem of inserting a new entity (or object) o_N into a database of N − 1 entities with a given fixed clustering structure. Thus we consider the decision problem of optimally assigning the new object to one of the K clusters.
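Criterion (3) above can be sketched directly, with d_iv implemented as the average of the known dissimilarities D_ij over the neighbors j ∈ N_i currently assigned to C_v; all names below are ours:

```python
def clustering_cost(D, assign, neighbors):
    """H(M; D, N) = sum_i sum_v M_iv * d_iv for a hard assignment.
    D: dissimilarity lookup D[i][j]; assign[i]: cluster index of entity i;
    neighbors[i]: set of j for which D[i][j] has been measured."""
    H = 0.0
    for i, v in enumerate(assign):
        # restrict the average to the known dissimilarities within C_v
        members = [j for j in neighbors[i] if assign[j] == v]
        if members:
            H += sum(D[i][j] for j in members) / len(members)
    return H
```

For a fully observed dissimilarity matrix, neighbors[i] is simply all j ≠ i and H reduces to the dense pairwise clustering cost.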
If all dissimilarities between objects o_i and object o_N are known, the optimal assignment only depends on the average dissimilarities to objects in the different clusters, and hence is given by

(4)

For incomplete data, the total population averages d_Nv are replaced by point estimators d̂_Nv obtained by restricting the sums in (4) to N_N, the neighborhood of o_N. Let us furthermore assume we want to compute a fixed number L of dissimilarities before making the terminal decision. If the entities in each cluster are not further distinguished, we can pick a member at random, once we have decided to sample from a cluster C_v. The selection problem hence becomes equivalent to the problem of optimally distributing L measurements among K populations, such that the risk of making the wrong decision based on the resulting estimates d̂_Nv is minimal. More formally, this risk is given by R = d_{Nα} − d_{Nα*}, where α is the decision based on the subpopulation estimates {d̂_Nv} and α* is the true optimum.

To model the problem of selecting an optimal experiment we follow the Bayesian approach developed by Raiffa & Schlaifer [Raiffa, Schlaifer, 1961] and compute the so-called Expected Value of Sampling Information (EVSI). As a fundamental step this involves the calculation of distributions for the quantities d̂_Nv. For reasons of computational efficiency we are assuming that dissimilarities resulting from a comparison with an object in cluster C_v are normally distributed¹ with mean d_Nv and variance σ²_Nv. Since the variances are nuisance parameters the risk function R does not depend on, it suffices to calculate the marginal distribution of d_Nv.

¹Other computationally more expensive choices to model within-cluster dissimilarities are skewed distributions like the Gamma distribution.

Figure 1: (a) Gray-scale visualization of the generated proximity matrix (N = 800). Dark/light gray values correspond to low/high dissimilarities respectively, D_ij being encoded by pixel (i, j). (b) Sampling snapshot for active data clustering after 60000 samples; queried values are depicted in white. (c) Costs evaluated on the complete data for sequential active and random sampling.

For the class of statistical models we will consider in the sequel, the empirical mean d̂_Nv, the unbiased variance estimator σ̂²_Nv and the sample size m_Nv are a sufficient statistic. Depending on these empirical quantities the marginal posterior distribution of d_Nv for uninformative priors is a Student t distribution with t = √m_Nv (d_Nv − d̂_Nv)/σ̂_Nv and m_Nv − 1 degrees of freedom. The corresponding density will be denoted by f_v(d_Nv | d̂_Nv, σ̂²_Nv, m_Nv). With the help of the posterior densities f_v we define the Expected Value of Perfect Information (EVPI) after having observed (d̂_Nv, σ̂²_Nv, m_Nv) by

EVPI = ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} max_v {d_{Nα} − d_{Nv}} Π_{v=1}^{K} f_v(d_Nv | d̂_Nv, σ̂²_Nv, m_Nv) d d_Nv,   (5)

where α = arg min_v d̂_Nv. The EVPI is the loss one expects to incur by making the decision α based on the incomplete information {d̂_Nv} instead of the optimal decision α*, or, put the other way round, the expected gain we would obtain if α* were revealed to us.

In the case of experimental design, the main quantity of interest is not the EVPI but the Expected Value of Sampling Information (EVSI). The EVSI quantifies how much gain we are expecting from additional data.
The outcome of additional experiments can only be anticipated by making use of the information which is already available. This is known as preposterior analysis. The linearity of the utility measure implies that it suffices to calculate averages with respect to the preposterior distribution [Raiffa, Schlaifer, 1961, Chapter 5.3]. Drawing m⁺_v additional samples from the v-th population, and averaging possible outcomes with the (prior) distribution f_v(d_Nv | d̂_Nv, σ̂²_Nv, m_Nv), will not affect the unbiased estimates d̂_Nv, σ̂²_Nv, but only increase the number of samples m_Nv → m_Nv + m⁺_v. Thus, we can compute the EVSI from (5) by replacing the prior densities with their preposterior counterparts.

Figure 2: (a) Solution quality for active and random sampling on data generated from a mixture image of 16 Brodatz textures (N = 1024). (b) Cost trajectories and segmentation results for an active and random sampling example run (N = 4096).

To evaluate the K-dimensional integral in (5) or its EVSI variant we apply Monte-Carlo techniques, sampling from the Student t densities using Kinderman's rejection sampling scheme, to get an empirical estimate of the random variable ψ_α(d_N1, ..., d_NK) = max_v {d_{Nα} − d_{Nv}}. Though this enables us in principle to approximate the EVSI of any possible experiment, we cannot efficiently compute it for all possible ways of distributing the L samples among K populations.
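The preposterior step above admits a compact sketch: the value of m⁺ further measurements from population v is obtained from the same Monte-Carlo integral with m_v → m_v + m⁺ while keeping d̂_Nv and σ̂²_Nv fixed, and a greedy rule samples wherever the anticipated gain is largest. This is our illustrative reading under those assumptions, not the authors' exact procedure:

```python
import numpy as np

def expected_loss(d_hat, s2_hat, m, n_samples=50_000, seed=0):
    """Monte-Carlo value of integral (5) for given per-cluster statistics."""
    rng = np.random.default_rng(seed)
    d_hat = np.asarray(d_hat, dtype=float)
    s2_hat = np.asarray(s2_hat, dtype=float)
    m = np.asarray(m, dtype=float)
    alpha = int(np.argmin(d_hat))
    t = rng.standard_t(df=m - 1, size=(n_samples, len(d_hat)))
    d_true = d_hat + np.sqrt(s2_hat / m) * t
    return float((d_true[:, alpha] - d_true.min(axis=1)).mean())

def next_sample_cluster(d_hat, s2_hat, m, m_plus=1):
    """Greedy EVSI step: evaluate each preposterior counterpart
    (m_v -> m_v + m_plus, estimates unchanged) and return the population
    whose additional samples promise the largest expected gain."""
    base = expected_loss(d_hat, s2_hat, m)
    m = np.asarray(m, dtype=float)
    gains = []
    for v in range(len(m)):
        m_new = m.copy()
        m_new[v] += m_plus                # only the sample count changes
        gains.append(base - expected_loss(d_hat, s2_hat, m_new))
    return int(np.argmax(gains))
```

Because the point estimates are held fixed, extra samples only narrow the Student t posteriors; the gain is therefore largest for populations that still compete for the minimum, and essentially zero for clearly separated ones.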
In the large sample limit, however, the EVSI becomes a concave function of the sampling sizes. This motivates a greedy design procedure of drawing new samples incrementally one by one.

4 Active Data Clustering

So far we have assumed the assignments of all but one entity o_N to be given in advance. This might be realistic in certain on-line applications, but more often we want to simultaneously find assignments for all entities in a dataset. The active data selection procedure hence has to be combined with a recalculation of clustering solutions, because additional data may help us not only to improve our terminal decision, but also with respect to our sampling strategy. A local optimization of H for assignments of a single object o_i can rely on the quantities

g_iv = Σ_{j∈N_i} [1/n_iv + 1/n⁺_jv] M_jv D_ij − Σ_{j∈N_i} M_jv/(n⁻_jv n⁺_jv) Σ_{k∈N_j−{i}} M_kv D_jk,   (6)

where n_jv = Σ_{k∈N_j} M_kv, n⁻_jv = n_jv − M_iv, and n⁺_jv = n⁻_jv + 1, by setting M_iα = 1 ⟺ α = arg min_v g_iv = arg min_v H(M | M_iv = 1), a claim which can be proved by straightforward algebraic manipulations (cf. [Hofmann et al., 1997]). This effectively amounts to a cluster readjustment by reclassification of objects. For additional evidence arising from new dissimilarities, one thus performs local reassignments, e.g., by cycling through all objects in random order, until no assignment is changing.

To avoid unfavorable local minima one may also introduce a computational temperature T and utilize {g_iv} for simulated annealing based on the Gibbs sampler [Geman, Geman, 1984], P{M_iα = 1} = exp(−g_iα/T) / Σ_{v=1}^{K} exp(−g_iv/T). Alternatively, Eq. (6) may also serve as the starting point to derive mean-field equations in a deterministic annealing framework, cf. [Hofmann, Buhmann, 1997].
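The Gibbs-sampler update above can be sketched as follows. As a simplification we use the restricted average dissimilarity of object i to cluster v as the local cost, in place of the exact correction terms of Eq. (6), so this is illustrative rather than the authors' implementation:

```python
import numpy as np

def gibbs_sweep(D, assign, neighbors, K, T, seed=0):
    """One sweep of P{M_iv = 1} = exp(-g_iv/T) / sum_u exp(-g_iu/T),
    visiting objects in random order.  Here g_iv is approximated by the
    average known dissimilarity of i to cluster v (0 if none is known)."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(assign)):
        g = np.zeros(K)
        for v in range(K):
            known = [D[i][j] for j in neighbors[i] if assign[j] == v]
            g[v] = sum(known) / len(known) if known else 0.0
        p = np.exp(-(g - g.min()) / T)    # shift for numerical stability
        assign[i] = int(rng.choice(K, p=p / p.sum()))
    return assign
```

At low temperature T this reduces to the greedy reassignment M_iα = 1 for α = arg min_v g_iv; repeating sweeps until no assignment changes implements the local reassignment cycle described in the text.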
These local optimization algorithms are well-suited for an incremental update after new data has been sampled, as they do not require a complete recalculation from scratch. The probabilistic reformulation in an annealing framework has the further advantage to provide assignment probabilities which can be utilized to improve the randomized 'partner' selection procedure. For any of these algorithms we sequentially update data assignments until a convergence criterion is fulfilled.

Figure 3: Clustering solution with 20 clusters for 1584 documents on 'clustering'. Clusters are characterized by their 5 most topical and 5 most typical index terms.

5 Results

To illustrate the behavior of the active data selection criterion we have run a series of repeated experiments on artificial data. For N = 800 the data has been divided into 8 groups of 100 entities. Intra-group dissimilarities have been set to zero, while inter-group dissimilarities were defined hierarchically. All values have been corrupted by Gaussian noise. The proximity matrix, the sampling performance, and a sampling snapshot are depicted in Fig. 1. The sampling performs exactly as expected: after a short initial phase the active clustering algorithm spends more samples to disambiguate clusters which possess a higher mean similarity, while fewer dissimilarities are queried for pairs of entities belonging to well-separated clusters. For this type of structured data the gain of active sampling increases with the depth of the hierarchy.
The final solution variance is due to local minima. Remarkably, the active sampling strategy not only shows a faster improvement, it also finds on average significantly better solutions. Notice that the sampling has been decomposed into stages, refining clustering solutions after sampling of 1000 additional dissimilarities.

The results of an experiment for unsupervised texture segmentation are shown in Fig. 2. To obtain a close-to-optimal solution the active sampling strategy roughly needs less than 50% of the sample size required by random sampling for both resolutions, N = 1024 and N = 4096. At a 64 × 64 resolution, for L = 100K, 150K, 200K actively selected samples the random strategy needs on average L = 120K, 300K, 440K samples, respectively, to obtain a comparable solution quality. Obviously, active sampling can only be successful in an intermediate regime: if too little is known, we cannot infer additional information to improve our sampling; if the sample is large enough to reliably detect clusters, there is no need to sample any more. Yet, this intermediate regime significantly increases with K (and N).

Finally, we have clustered 1584 documents containing abstracts of papers with clustering as a title word. For K = 20 clusters² active clustering needed 120000 samples (< 10% of the data) to achieve a solution quality within 1% of the asymptotic solution. A random strategy on average required 230000 samples. Fig. 3 shows the achieved clustering solution, summarizing clusters by topical (most frequent) and typical (most characteristic) index terms. The found solution gives a good overview of areas dealing with clusters and clustering³.
\n\n6 Conclusion \n\nAs we have demonstrated, the concept of expected value of information fits nicely \ninto an optimization approach to clustering of proximity data, and establishes a \nsound foundation of active data clustering in statistical decision theory. On the \nmedium size data sets used for validation, active clustering achieved a consistently \nbetter performance as compared to random selection. This makes it a promising \ntechnique for automated structure detection and data mining applications in large \ndata bases. Further work has to address stopping rules and speed-up techniques \nto accelerate the evaluation of the selection criterion, as well as a unification with \nannealing methods and hierarchical clustering. \n\nAcknowledgments \n\nThis work was supported by the Federal Ministry of Education and Science BMBF \nunder grant # 01 M 3021 Aj4 and by a M.l.T. Faculty Sponser's Discretionary \nFund. \n\nReferences \n\n[Geman et al., 1990] Geman, D., Geman, S., Graffigne, C., Dong, P. (1990). Bound(cid:173)\nary Detection by Constrained Optimization. IEEE Transactions on Pattern A nal(cid:173)\nysis and Machine Intelligence, 12(7), 609-628. \n\n[Geman, Geman, 1984] Geman, S., Geman, D. (1984). Stochastic Relaxation, \n\nGibbs Distributions, and the Bayesian Restoration of Images. IEEE Transac(cid:173)\ntions on Pattern Analysis and Machine Intelligen\u00b7ce, 6(6), 721-741. \n\n[Hofmann, Buhmann, 1997] Hofmann, Th., Buhmann, J. M. (1997). Pairwise Data \nClustering by Deterministic Annealing. IEEE Transactions on Pattern Analysis \nand Machine Intelligence, 19(1), 1-14. \n\n[Hofmann et al., 1997] Hofmann, Th., Puzicha, J., Buhmann, J.M. 1997. Deter(cid:173)\n\nministic Annealing for Unsupervised Texture Segmentation. Pages 213-228 of: \nProceedings of the International Workshop on Energy Minimization Methods in \nComputer Vision and Pattern Recognition. Lecture Notes in Computer Science, \nvol. 1223. \n\n[Jain, Dubes, 1988] Jain, A. K., Dubes, R. C. 
(1988). Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall.

[Raiffa, Schlaifer, 1961] Raiffa, H., Schlaifer, R. (1961). Applied Statistical Decision Theory. Cambridge, MA: MIT Press.

[Van Rijsbergen, 1979] Van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths.

²The number of clusters was determined by a criterion based on complexity costs.
³Is it by chance that 'fuzzy' techniques are 'softly' distributed over two clusters?
", "award": [], "sourceid": 1363, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}