{"title": "Multidimensional Scaling and Data Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 466, "abstract": null, "full_text": "Multidimensional Scaling and Data Clustering \n\nThomas Hofmann & Joachim Buhmann \nRheinische Friedrich-Wilhelms-U niversitat \nInstitut fur Informatik ill, Romerstra6e 164 \n\nD-53117 Bonn, Germany \n\nemail:{th.jb}@cs.uni-bonn.de \n\nAbstract \n\nVisualizing and structuring pairwise dissimilarity data are difficult combinatorial op(cid:173)\ntimization problems known as multidimensional scaling or pairwise data clustering. \nAlgorithms for embedding dissimilarity data set in a Euclidian space, for clustering \nthese data and for actively selecting data to support the clustering process are discussed \nin the maximum entropy framework. Active data selection provides a strategy to discover \nstructure in a data set efficiently with partially unknown data. \n\n1 Introduction \nGrouping experimental data into compact clusters arises as a data analysis problem in psy(cid:173)\nchology, linguistics, genetics and other experimental sciences. The data which are supposed \nto be clustered are either given by an explicit coordinate representation (central clustering) \nor, in the non-metric case, they are characterized by dissimilarity values for pairs of data \npoints (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric \ndata in a D-dimensional Euclidian space, (ii) for simultaneous clustering and embedding of \nnon-metric data, and (iii) for active data selection to determine a particular cluster structure \nwith minimal number of data queries. All algorithms are derived from the maximum entropy \nprinciple (Hertz et al., 1991) which guarantees robust statistics (Tikochinsky et al., 1984). \nThe data are given by a real-valued, symmetric proximity matrix D E R NXN , 'Dkl being \nthe pairwise dissimilarity between the data points k, l. 
Apart from the symmetry constraint we make no further assumptions about the dissimilarities, i.e., we do not require D to be a metric. The numbers D_{kl} quite often violate the triangle inequality, and the dissimilarity of a datum to itself could be finite.

2 Statistical Mechanics of Multidimensional Scaling

Embedding dissimilarity data in a D-dimensional Euclidean space is a non-convex optimization problem which typically exhibits a large number of local minima. Stochastic search methods like simulated annealing or its deterministic variants have been very successfully applied to such problems. The question in multidimensional scaling is to find coordinates {x_i}_{i=1}^N in a D-dimensional Euclidean space with minimal embedding costs

H^{MDS} = \frac{1}{2N} \sum_{i,k=1}^{N} \left[ |x_i - x_k|^2 - D_{ik} \right]^2 .   (1)

Without loss of generality we shift the center of mass into the origin (\sum_{k=1}^{N} x_k = 0). In the maximum entropy framework the coordinates {x_i} are regarded as random variables which are distributed according to the Gibbs distribution P({x_i}) = exp(-β(H^{MDS} - F)). The inverse temperature β = 1/T controls the expected embedding costs ⟨H^{MDS}⟩ (expectation values are denoted by ⟨·⟩). To calculate the free energy F for H^{MDS} we approximate the coupling term \frac{2}{N} \sum_{i,k=1}^{N} D_{ik} x_i^T x_k ≈ \sum_{i=1}^{N} x_i^T h_i with the mean fields h_i = \frac{4}{N} \sum_{k=1}^{N} D_{ik} ⟨x_k⟩. Standard techniques to evaluate the free energy F yield the equations

Z(H^{MDS}) ∼ \int dy \int \prod_{d,d'=1}^{D} dR_{dd'} \exp(-β N F),   (2)

F(H^{MDS}) = 2 \sum_{d,d'=1}^{D} R_{dd'}^2 - \frac{1}{β N} \sum_{i=1}^{N} \ln \int dx_i \exp(-β f(x_i)),   (3)

f(x_i) = |x_i|^4 - \frac{2}{N} |x_i|^2 \sum_{k=1}^{N} D_{ik} + 4 x_i^T R x_i + x_i^T (h_i - 4y).   (4)

The integral in Eq. (2) is dominated by the absolute minimum of F in the limit N → ∞.
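The cost function (1) is straightforward to evaluate numerically; a minimal sketch (NumPy; the function name and the test data are illustrative, not from the paper):

```python
import numpy as np

def mds_cost(X, D):
    """Embedding cost H^MDS of Eq. (1): mean squared mismatch between the
    squared Euclidean distances of the coordinates X (rows x_i) and the
    given symmetric dissimilarity matrix D."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # |x_i - x_k|^2
    return ((sq - D) ** 2).sum() / (2 * len(X))

# Coordinates that realize D exactly give zero cost; rescaling them does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
X -= X.mean(axis=0)                 # shift the center of mass into the origin
D = ((X[:, None] - X[None, :]) ** 2).sum(axis=-1)
print(mds_cost(X, D))               # 0.0
print(mds_cost(1.1 * X, D) > 0)     # True
```

Note that only the squared distances enter Eq. (1), so the cost is invariant under translations and rotations of the coordinates; this is why the center of mass can be fixed at the origin without loss of generality.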
Therefore, we calculate the saddle point equations

R = \frac{1}{N} \sum_{i=1}^{N} \left( ⟨x_i x_i^T⟩ + \tfrac{1}{2} ⟨|x_i|^2⟩ I \right)  and  0 = \sum_{i=1}^{N} ⟨x_i⟩,   (5)

⟨x_i⟩ = \frac{\int x_i \exp(-β f(x_i)) dx_i}{\int \exp(-β f(x_i)) dx_i}.   (6)

Equation (6) has been derived by differentiating F with respect to h_i; I denotes the D × D unit matrix. In the low temperature limit β → ∞ the integral in (3) is dominated by the minimum of f(x_i). Therefore, a new estimate of ⟨x_i⟩ is calculated by minimizing f with respect to x_i. Since all explicit dependencies between the x_i have been eliminated, this minimization can be performed independently for all i, 1 ≤ i ≤ N.

In the spirit of the EM algorithm for Gaussian mixture models we suggest the following algorithm to calculate a mean-field approximation for the multidimensional scaling problem:

    initialize ⟨x_i⟩^(0) randomly; t = 0
    while \sum_{i=1}^{N} |⟨x_i⟩^(t) - ⟨x_i⟩^(t-1)| > ε:
        E-step: estimate ⟨x_i⟩^(t+1) as a function of ⟨x_i⟩^(t), R^(t), y^(t), h_i^(t)
        M-step: calculate R^(t), h_i^(t) and determine y^(t) such that
                the centroid condition is satisfied.

This algorithm was used to determine the embedding of protein dissimilarity data as shown in Fig. 1c. The phenomenon that the data clusters are arranged in a circular fashion is explained by the lack of small dissimilarity values. The solution in Fig. 1c is about a factor of two better than the embedding found by a classical MDS program (Gower, 1966). This program determines an (N-1)-dimensional space where the ranking of the dissimilarities is preserved and uses principal component analysis to project this tentative embedding down to two dimensions. Extensions to other MDS cost functions are currently under investigation.

3 Multidimensional Scaling and Pairwise Clustering

Embedding data in a Euclidean space quite often precedes a visual inspection by the data analyst to discover structure and to group data into clusters.
The question arises how both problems, the embedding problem and the clustering problem, can be solved simultaneously. The second algorithm addresses the problem of embedding a data set in a Euclidean space such that the clustering structure is approximated as faithfully as possible, in the maximum entropy sense, by the clustering solution in this embedding space. The coordinates in the embedding space are the free parameters for this optimization problem.

Clustering of non-metric dissimilarity data, also called pairwise clustering (Buhmann, Hofmann, 1994a), is a combinatorial optimization problem which depends on Boolean assignments M_{iν} ∈ {0, 1} of datum i to cluster ν. The cost function for pairwise clustering with K clusters is

E^{pc}_K(M) = \sum_{ν=1}^{K} \frac{1}{2 N p_ν} \sum_{k=1}^{N} \sum_{l=1}^{N} M_{kν} M_{lν} D_{kl}  with  p_ν = \frac{1}{N} \sum_{k=1}^{N} M_{kν}.   (7)

In the mean-field approach we approximate the Gibbs distribution P(E^{pc}_K) corresponding to the original cost function by a family of approximating distributions. The distribution which represents most accurately the statistics of the original problem is determined by the minimum of the Kullback-Leibler divergence to the original Gibbs distribution. In the pairwise clustering case we introduce potentials {E_{kν}} for the effective interactions, which define a set of cost functions with non-interacting assignments,

E^{0}_K(M, {E_{kν}}) = \sum_{ν=1}^{K} \sum_{k=1}^{N} M_{kν} E_{kν}.   (8)

The optimal potentials derived from this minimization procedure are

{E_{kν}} = \arg\min_{{E_{kν}}} D^{KL}\left( P^{0}(E^{0}_K) \,\|\, P(E^{pc}_K) \right),   (9)

where P^{0}(E^{0}_K) is the Gibbs distribution corresponding to E^{0}_K, and D^{KL}(·‖·) is the KL-divergence. This method is equivalent to minimizing an upper bound on the free energy (Buhmann, Hofmann, 1994b),

F(E^{pc}_K) ≤ F_0(E^{0}_K) + ⟨V_K⟩_0,  with  V_K = E^{pc}_K - E^{0}_K,   (10)

⟨·⟩_0 denoting the average over all configurations of the cost function without interactions.
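The cost (7) can be written down directly; a small sketch (NumPy; the two-block example data are invented for illustration):

```python
import numpy as np

def pairwise_cost(M, D):
    """Pairwise clustering cost E^pc_K of Eq. (7): M holds the Boolean
    assignments M_kv (N x K), D the dissimilarities, p_v the cluster
    fractions p_v = (1/N) sum_k M_kv."""
    N = len(D)
    p = M.sum(axis=0) / N
    cost = 0.0
    for nu in range(M.shape[1]):
        if p[nu] > 0:
            m = M[:, nu].astype(float)
            cost += m @ D @ m / (2 * N * p[nu])
    return cost

# Two well-separated groups (within-group dissimilarity 1, between 10):
# the correct partition is much cheaper than an alternating one.
groups = np.arange(6) < 3
D = np.where(np.equal.outer(groups, groups), 1.0, 10.0)
np.fill_diagonal(D, 0.0)
good = np.stack([groups, ~groups], axis=1).astype(int)
mixed = np.array([[1, 0], [0, 1]] * 3)
print(pairwise_cost(good, D), pairwise_cost(mixed, D))   # 2.0 14.0
```

The normalization by p_ν makes the cost an average within-cluster dissimilarity rather than a raw sum, so large and small clusters are penalized on the same scale.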
Assignment variables are statistically independent with respect to P^{0}(E^{0}_K), i.e., ⟨M_{kν} M_{lν}⟩_0 = ⟨M_{kν}⟩_0 ⟨M_{lν}⟩_0. The averaged potential V_K therefore amounts to

⟨V_K⟩ = \sum_{ν=1}^{K} \sum_{k,l=1}^{N} ⟨M_{kν}⟩ ⟨M_{lν}⟩ \frac{D_{kl}}{2 p_ν N} - \sum_{ν=1}^{K} \sum_{k=1}^{N} ⟨M_{kν}⟩ E_{kν},   (11)

the subscript of the averages being omitted for conciseness. The expected assignment variables are

⟨M_{iν}⟩ = \frac{\exp(-β E_{iν})}{\sum_{μ=1}^{K} \exp(-β E_{iμ})}.   (12)

Minimizing the upper bound with respect to the potentials yields the stationarity conditions

\frac{∂}{∂E_{iν}} \left( F_0 + ⟨V_K⟩_0 \right) = 0,  1 ≤ i ≤ N,  1 ≤ ν ≤ K.   (13)

The \"optimal\" potentials

E^{*}_{iν} = \frac{1}{N p_ν} \sum_{k=1}^{N} ⟨M_{kν}⟩ \left( D_{ik} - \frac{1}{2 N p_ν} \sum_{l=1}^{N} ⟨M_{lν}⟩ D_{kl} \right)   (14)

depend on the given distance matrix, the averaged assignment variables and the cluster probabilities. They are optimal in the sense that if we set

E_{iν} = E^{*}_{iν},   (15)

the N · K stationarity conditions (13) are fulfilled for every i ∈ {1, ..., N}, ν ∈ {1, ..., K}. A simultaneous solution of Eq. (15) with (12) constitutes a necessary condition for a minimum of the upper bound on the free energy F.

The connection between the clustering and the multidimensional scaling problem is established if we restrict the potentials E_{iν} to be of the form |x_i - y_ν|^2 with the centroids y_ν = \sum_{k=1}^{N} M_{kν} x_k / \sum_{k=1}^{N} M_{kν}. We consider the coordinates x_i as the variational parameters. The additional constraints restrict the family of approximating distributions, defined by E^{0}_K, to a subset. Using the chain rule we can calculate the derivatives of the upper bound (10), resulting in the exact stationarity conditions for x_i,

\sum_{α,ν=1}^{K} \frac{⟨M_{iα}⟩ ⟨M_{iν}⟩}{p_α} (ΔE_{iα} - ΔE_{iν}) y_α = \sum_{j=1}^{N} \sum_{α,ν=1}^{K} \frac{⟨M_{jα}⟩ ⟨M_{jν}⟩}{N p_α} (ΔE_{jα} - ΔE_{jν}) \left[ ⟨M_{iα}⟩ I + \sum_{k=1}^{N} (x_k - y_α) \left( \frac{∂⟨M_{kα}⟩}{∂x_i} \right)^T \right] (x_j - y_α),   (16)

where ΔE_{iα} = E_{iα} - E^{*}_{iα}. The derivatives ∂⟨M_{kα}⟩/∂x_i can be calculated exactly, since they are given as the solutions of a linear equation system with N × K unknowns for every x_i. To reduce the computational complexity an approximation can be derived under the assumption ∂y_α/∂x_j ≈ 0. In this case the right hand side of (16) can be set to zero in a first-order approximation, yielding an explicit formula for x_i,

K_i x_i ≈ \frac{1}{2} \sum_{ν=1}^{K} ⟨M_{iν}⟩ \left( \|y_ν\|^2 - E^{*}_{iν} \right) \left( y_ν - \sum_{α=1}^{K} ⟨M_{iα}⟩ y_α \right),   (17)

with the covariance matrix K_i = ⟨y y^T⟩_i - ⟨y⟩_i ⟨y⟩_i^T and ⟨y⟩_i = \sum_{ν=1}^{K} ⟨M_{iν}⟩ y_ν.

Figure 1: Similarity matrix of 145 protein sequences of the globin family. (a): dark gray levels correspond to high similarity values; (b): clustering with embedding in two dimensions; (c): multidimensional scaling solution for 2-dimensional embedding; (d): quality of the clustering solution with random and active data selection of D_{ik} values; E^{pc}_K has been calculated on the basis of the complete set of D_{ik} values. The capital letters (HB, HA, HBX, HF, HE, GP, GGI, GGG, MY) label the globin subfamilies in panels (a)-(c).

The derived system of transcendental equations given by (12), (17) and the centroid condition explicitly reflects the dependencies between the clustering procedure and the Euclidean representation. Solving these equations simultaneously leads to an efficient algorithm which
\n\ninterleaves the multidimensional scaling process and the clustering process and which avoids \nan artificial separation into two uncorrelated processes. The described algorithm for simul(cid:173)\ntaneous Euclidian embedding and data clustering can be used for dimensionality reduction, \ne.g., high dimensional data can be projected to a low dimensional subspace in a nonlinear \nfashion which resembles local principle component analysis (Buhmann, Hofmann, 1994b). \nFigure (l) shows the clustering result for a real-world data set of 145 protein sequences. The \nsimilarity values between pairs of sequences are determined by a sequence alignment program \nwhich takes biochemical and structural information into account. The sequences belong to \ndifferent protein families like hemoglobin, myoglobin and other globins; they are abbreviated \nwith the displayed capital letters. The gray level visualization of the dissimilarity matrix with \ndark values for similar protein sequences shows the formation of distinct \"squares\" along the \nmain diagonal. These squares correspond to the discovered partition after clustering. The \nembedding in two dimensions shows inter-cluster distances which are in consistent agreement \nwith the similarity values of the data. In three and four dimensions the error between the \n\n\f464 \n\nThomas Hofmann. Joachim Buhmann \n\ngiven dissimilarities and the constructed distances is further reduced. The results are in good \nagreement with the biological classification. \n\n4 Active Data Selection for Data Clustering \nActive data selection is an important issue for the analysis of data which are characterized \nby pairwise dissimilarity values. The size of the distance matrix grows like the square of \nthe number of data 'points'. Such a O(N2) scaling renders the data acquisition process \nexpensive. 
It is, therefore, desirable to couple the data analysis process to the data acquisition process, i.e., to actively query the supposedly most relevant dissimilarity values. Before addressing active data selection questions for data clustering we have to discuss how to modify the algorithm in the case of incomplete data.

If we want to avoid any assumptions about statistical dependencies, it is impossible to infer unknown values and we have to work directly with the partial dissimilarity matrix. Since the data enter only in the (re-)calculation of the potentials in (14), it is straightforward to appropriately modify these equations: all sums are restricted to terms with known dissimilarities and the normalization factors are adjusted accordingly.

Alternatively we can try to explicitly estimate the unknown dissimilarity values based on a statistical model. For this purpose we propose two models, relying on a known group structure of the data. The first model (I) assumes that all dissimilarities between a point i and points j belonging to a group G_μ are i.i.d. random variables with probability density p_{iμ} parameterized by θ_{iμ}. In this scheme a subset of the known dissimilarities of i and j to other points k is used as samples for the estimation of D_{ij}; the selection of the specific subset is determined by the clustering structure. In the second model (II) we assume that the dissimilarities between groups G_ν, G_μ are i.i.d. random variables with density p_{νμ} parameterized by θ_{νμ}. The parameters θ_{νμ} are estimated on the basis of all known dissimilarities {D_{ij} ∈ D} between points from G_ν and G_μ.

The assignments of points to clusters are not known a priori and have to be determined in the light of the (given and estimated) data.
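Under Gaussian densities, model (II) boils down to posterior-weighted averaging of the known dissimilarities and filling each unknown entry with the corresponding group mean. A sketch of this estimation step (NumPy; function and variable names are ours):

```python
import numpy as np

def fill_missing(D, known, M):
    """Model (II) sketch: estimate the mean dissimilarity m_vu between each
    pair of groups from the known entries, weighting pairs (k, l) by the
    soft assignments, then set unknown D_ij to sum_{v,u} <M_iv><M_ju> m_vu."""
    W = M.T @ known @ M                          # assignment weight per group pair
    m = (M.T @ (D * known) @ M) / np.maximum(W, 1e-12)  # group means m_vu
    return np.where(known, D, M @ m @ M.T)

# Block-constant toy data: a hidden between-group entry is recovered exactly.
# (Self-dissimilarities are finite here, which the text explicitly allows.)
groups = np.arange(6) < 3
D = np.where(np.equal.outer(groups, groups), 1.0, 10.0)
known = np.ones_like(D, dtype=bool)
known[0, 3] = known[3, 0] = False                # pretend D_03 was never measured
M = np.repeat(np.eye(2), 3, axis=0)              # hard assignments {0,1,2}, {3,4,5}
print(fill_missing(D, known, M)[0, 3])           # 10.0
```

With soft assignments the same code interpolates between group means, which matches the EM-like scheme described next: the clustering supplies the posteriors, and the M-step reduces to these weighted averages.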
The data selection strategy becomes self-consistent if we interpret the mean fields ⟨M_{iν}⟩ of the clustering solution as posterior probabilities for the binary assignment variables. Combined with a maximum likelihood estimation of the unknown parameters given the posteriors, we arrive at an EM-like iteration scheme with the E-step replaced by the clustering algorithm.

Figure 2: Similarity matrix of 54 word fragments generated by a dynamic programming algorithm. The experiment with active data selection requires only half as much data as a random selection strategy to reach the same clustering costs.

The precise form of the M-step depends on the parametric form of the densities p_{iμ} or p_{νμ}, respectively. In the case of Gaussian distributions the M-step is described by the following estimation equations for the location parameters:

m̂_{iμ}^{(I)} = \frac{\sum_{k: D_{ik} ∈ D} ⟨M_{kμ}⟩ D_{ik}}{\sum_{k: D_{ik} ∈ D} ⟨M_{kμ}⟩}  (I),    m̂_{νμ}^{(II)} = \frac{\sum_{D_{kl} ∈ D} π_{kl}^{νμ} D_{kl}}{\sum_{D_{kl} ∈ D} π_{kl}^{νμ}}  (II),   (18)

with π_{ij}^{νμ} = \frac{1}{1 + δ_{νμ}} \left( ⟨M_{iν}⟩ ⟨M_{jμ}⟩ + ⟨M_{iμ}⟩ ⟨M_{jν}⟩ \right). Corresponding expressions are derived for the standard deviations σ̂_{iμ} or σ̂_{νμ}, respectively. In the case of non-normal distributions the empirical mean might still be a good estimator of the location parameter, though not necessarily a maximum likelihood estimator. The missing dissimilarities are estimated by the following statistics, derived from the empirical means:

D̂_{ij}^{(I)} = \sum_{ν,μ=1}^{K} ⟨M_{iν}⟩ ⟨M_{jμ}⟩ \frac{N_{iμ} m̂_{iμ}^{(I)} + N_{jν} m̂_{jν}^{(I)}}{N_{iμ} + N_{jν}}  (I),    D̂_{ij}^{(II)} = \sum_{ν ≤ μ} π_{ij}^{νμ} m̂_{νμ}^{(II)}  (II),   (19)
with N_{iμ} = \sum_{D_{ik} ∈ D} ⟨M_{kμ}⟩. For model (I) we have used a pooled estimator to exploit the data symmetry. The iteration scheme finally leads to estimates θ̂_{iμ} or θ̂_{νμ}, respectively, for the parameters and D̂_{ij} for all unknown dissimilarities.

Criterion for Active Data Selection: We use the expected reduction in the variance of the free energy F_0 as a score, which should be maximized by the selection criterion. F_0 is given by F_0(D) = -\frac{1}{β} \sum_{i=1}^{N} \log \sum_{ν=1}^{K} \exp(-β E_{iν}(D)). If we query a new dissimilarity D_{ij}, the expected reduction of the variance of the free energy is approximated by

Δ_{ij} = 2 \left[ \frac{∂F_0}{∂D_{ij}} \right]^2 V[D_{ij} - D̂_{ij}].   (20)

The partial derivatives can be calculated exactly by solving a system of linear equations with N × K unknowns. Alternatively, a first-order approximation in ε_ν = O(1/(N p_ν)) yields

\frac{∂F_0}{∂D_{ij}} ≈ \sum_{ν=1}^{K} \frac{⟨M_{iν}⟩ ⟨M_{jν}⟩}{N p_ν}.   (21)

This expression defines a relevance measure of D_{ij} for the clustering problem, since a D_{ij} value contributes to the clustering costs only if the data i and j belong to the same cluster. Equation (21) summarizes the mean-field contributions ∂F_0/∂D_{ij} ≈ ∂⟨H⟩_0/∂D_{ij}.

To derive the final form of our scoring function we have to calculate an approximation of the variance in Eq. (20), which measures the expected squared error for replacing the true value D_{ij} with our estimate D̂_{ij}. Since we assumed statistical independence, the variances are additive: V[D_{ij} - D̂_{ij}] = V[D_{ij}] + V[D̂_{ij}]. The total population variance is a sum of inner- and inter-cluster variances, which can be approximated using the empirical means and the empirical variances instead of the unknown parameters of p_{iμ} or p_{νμ}. The sampling variance of the statistic D̂_{ij} is estimated under the assumption that the empirical means m̂_{iν} or m̂_{νμ}, respectively, are uncorrelated. This holds in the hard clustering limit.
We arrive at the following final expression for the variances of model (II):

V[D_{ij} - D̂_{ij}] ≈ \sum_{ν ≤ μ} π_{ij}^{νμ} \left[ \left( D̂_{ij} - m̂_{νμ} \right)^2 + \left( 1 + \frac{1}{\sum_{D_{kl} ∈ D} π_{kl}^{νμ}} \right) σ̂_{νμ}^2 \right].   (22)

For model (I) a slightly more complicated formula can be derived. Inserting the estimated variances into Eq. (20) leads to the final expression for our scoring function.

To demonstrate the efficiency of the proposed selection strategy, we have compared the clustering costs achieved by active data selection with the clustering costs resulting from randomly queried data. Assignments in the case of active selection are calculated with statistical model (I). Figure 1d demonstrates that the clustering costs decrease significantly faster when the selection criterion (20) is implemented. The structure of the clustering solution has been completely inferred with about 3300 selected D_{ik} values, whereas the random strategy requires about 6500 queries for the same quality. Analogous comparison results for linguistic data are summarized in Fig. 2. Note the inconsistencies in this data set, reflected by small D_{ik} values outside the cluster blocks (dark pixels) or by large D_{ik} values (white pixels) inside a block.

Conclusion: Data analysis of dissimilarity data is a challenging problem in molecular biology, linguistics, psychology and, in general, in pattern recognition. We have presented three strategies to visualize data structures and to inquire into the data structure by an efficient data selection procedure. The respective algorithms are derived in the maximum entropy framework for maximal robustness of cluster estimation and data embedding. Active data selection has been shown to require only half as much data for estimating a clustering solution of fixed quality compared to a random selection strategy.
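Putting the criterion together: the score (20) combines the squared mean-field gradient (21) with a variance estimate such as (22), so only pairs likely to share a cluster receive a non-vanishing score. A compact sketch (NumPy; the variance is passed in as a precomputed matrix, and all names are illustrative):

```python
import numpy as np

def selection_scores(M, var):
    """Score Delta_ij of Eq. (20), with dF0/dD_ij taken in the first-order
    mean-field form of Eq. (21): sum_v <M_iv><M_jv> / (N p_v)."""
    N = len(M)
    p = np.maximum(M.mean(axis=0), 1e-12)
    g = (M / (N * p)) @ M.T                  # g_ij approximates dF0/dD_ij
    return 2.0 * g ** 2 * var

# With hard assignments, cross-cluster pairs score zero: their D_ij does not
# enter the clustering costs, so querying them is uninformative.
M = np.repeat(np.eye(2), 3, axis=0)
scores = selection_scores(M, np.ones((6, 6)))
print(scores[0, 1] > 0, scores[0, 3] == 0.0)   # True True
```

In an active selection loop one would repeatedly query the unknown entry with the largest score, re-cluster, and re-estimate the variances.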
We expect the proposed selection strategy to facilitate maintenance of genome and protein databases and to yield more robust data prototypes for efficient search and database mining.

Acknowledgement: It is a pleasure to thank M. Vingron and D. Bavelier for providing the protein data and the linguistic data, respectively. We are also grateful to A. Polzer and H.-J. Warneboldt for implementing the MDS algorithm. This work was partially supported by the Ministry of Science and Research of the state Nordrhein-Westfalen.

References

Buhmann, J., Hofmann, T. (1994a). Central and Pairwise Data Clustering by Competitive Neural Networks. Pages 104-111 of: Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers.

Buhmann, J., Hofmann, T. (1994b). A Maximum Entropy Approach to Pairwise Data Clustering. Pages 207-212 of: Proceedings of the International Conference on Pattern Recognition, Hebrew University, Jerusalem, vol. II. IEEE Computer Society Press.

Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325-328.

Hertz, J., Krogh, A., Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. New York: Addison Wesley.

Tikochinsky, Y., Tishby, N. Z., Levine, R. D. (1984). Alternative Approach to Maximum-Entropy Inference. Physical Review A, 30, 2638-2644.
", "award": [], "sourceid": 1008, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}