{"title": "Central and Pairwise Data Clustering by Competitive Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 104, "page_last": 111, "abstract": null, "full_text": "Central and Pairwise Data Clustering by Competitive Neural Networks \n\nJoachim Buhmann & Thomas Hofmann \n\nRheinische Friedrich-Wilhelms-Universität \nInstitut für Informatik II, Römerstraße 164 \nD-53117 Bonn, Fed. Rep. Germany \n\nAbstract \n\nData clustering amounts to a combinatorial optimization problem to reduce the complexity of a data representation and to increase its precision. Central and pairwise data clustering are studied in the maximum entropy framework. For central clustering we derive a set of reestimation equations and a minimization procedure which yields an optimal number of clusters, their centers and their cluster probabilities. A meanfield approximation for pairwise clustering is used to estimate assignment probabilities. A self-consistent solution to multidimensional scaling and pairwise clustering is derived which yields an optimal embedding and clustering of data points in a d-dimensional Euclidean space. \n\n1 Introduction \n\nA central problem in information processing is the reduction of data complexity with minimal loss in precision, i.e., to discard noise and to reveal the basic structure of a data set. Data clustering addresses this tradeoff by optimizing a cost function which preserves the original data as completely as possible and which simultaneously favors prototypes with minimal complexity (Linde et al., 1980; Gray, 1984; Chou et al., 1989; Rose et al., 1990). We discuss an objective function for the joint optimization of distortion errors and the complexity of a reduced data representation. 
A maximum entropy estimation of the cluster assignments yields a unifying framework for clustering algorithms with a number of different distortion and complexity measures. The close analogy of complexity optimized clustering with winner-take-all neural networks suggests a neural-like implementation resembling topological feature maps (see Fig. 1). \n\nFigure 1: Architecture of a three-layer competitive neural network for central data clustering with d neurons in the input layer, K neurons in the clustering layer with activity ⟨M_ia⟩ and G neurons in the classification layer. The output neurons estimate the conditional probability P_{γ|i} of data point i being in class γ. \n\nGiven is a set of data points which are characterized either by coordinates {X_i | X_i ∈ R^d; i = 1, ..., N} or by pairwise distances {D_ik | i, k = 1, ..., N}. The goal of data clustering is to determine a partitioning of a data set which either minimizes the average distance of data points to their cluster centers or the average distance between data points of the same cluster. The two cases are referred to as central or pairwise clustering. Solutions to central clustering are represented by a set of data prototypes {Y_a | Y_a ∈ R^d; a = 1, ..., K}, and the size K of that set. The assignments {M_ia | a = 1, ..., K; i = 1, ..., N}, M_ia ∈ {0, 1}, denote that data point i is uniquely assigned to cluster a (Σ_ν M_iν = 1). Rate distortion theory specifies the optimal choice of Y_a as the cluster centroids, i.e., Σ_i M_ia ∂/∂Y_a D_ia(X_i, Y_a) = 0. Given only a set of distances or dissimilarities, the solution to pairwise clustering is characterized by the expected assignment variables ⟨M_ia⟩. The complexity {C_a | a = 1, ... 
, K} of a clustering solution depends on the specific information processing application at hand; in particular, we assume that C_a is only a function of the cluster probability p_a = Σ_{i=1}^N M_ia / N. We propose the central clustering cost function \n\nE_K^c({M_iν}) = Σ_{i=1}^N Σ_{ν=1}^K M_iν ( D_iν(X_i, Y_ν) + λ C_ν(p_ν) )    (1) \n\nand the pairwise clustering cost function \n\nE_K^pc({M_iν}) = Σ_{i=1}^N Σ_{ν=1}^K M_iν ( 1/(2 N p_ν) Σ_{k=1}^N M_kν D_ik + λ C_ν(p_ν) ).    (2) \n\nThe distortion and complexity costs are adjusted in size by the weighting parameter λ. The cost functions (1,2) have to be optimized in an iterative fashion: (i) vary the assignment variables M_ia for a fixed number K of clusters such that the costs E_K^{c,pc}({M_ia}) decrease; (ii) increment the number of clusters K → K + 1 and optimize M_ia again. \n\nComplexity costs which penalize small, sparsely populated clusters, i.e., C_a = 1/p_a^s, s = 1, 2, ..., favor equal cluster probabilities, thereby emphasizing the hardware aspect of a clustering solution. The special case s = 1 with constant costs per cluster corresponds to K-means clustering. An alternative complexity measure, which estimates encoding costs for data compression and data transmission, is the Shannon entropy of a cluster set, ⟨C⟩ = Σ_ν p_ν C_ν = −Σ_ν p_ν log p_ν. \n\nThe most common choice for the distortion measure are squared distances D_ia = ||X_i − Y_a||^2 which preserve the permutation symmetry of (1) with respect to the cluster index ν. A data partitioning scheme without permutation invariance of cluster indices is described by the cost function \n\nE_K^t = Σ_i Σ_ν M_iν ( ⟨⟨D_iν⟩⟩ + λ C_ν(p_ν) ).    (3) \n\nThe generalized distortion error ⟨⟨D_ia⟩⟩ = Σ_γ T_aγ D_iγ(X_i, Y_γ) between data point X_i and cluster center Y_a quantifies the intrinsic quantization errors D_iγ(X_i, Y_γ) and the additional errors due to transitions T_aγ from index γ to a. 
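As a concrete illustration, both objective functions can be evaluated directly from an assignment matrix. The sketch below (NumPy; all function and variable names are ours, not from the paper) uses squared Euclidean distortions, the hardware complexity C_a = 1/p_a^s for the central cost (1), and the entropic complexity C_a = −log p_a for the pairwise cost (2). For hard assignments and λ = 0, the pairwise cost with D_ik = ||X_i − X_k||² reduces to the central cost evaluated at the cluster centroids, which makes a useful consistency check.

```python
import numpy as np

def central_cost(X, Y, M, lam, s=1):
    """Central clustering cost (1): distortion plus complexity C_a = 1/p_a^s."""
    N, K = M.shape                                    # M: hard assignments, rows one-hot
    p = M.sum(axis=0) / N                             # cluster probabilities p_a
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # squared distances D_ia
    C = 1.0 / np.maximum(p, 1e-12) ** s               # complexity of each cluster
    return (M * (D + lam * C[None, :])).sum()

def pairwise_cost(Dmat, M, lam):
    """Pairwise clustering cost (2) with entropic complexity C_a = -log p_a."""
    N, K = M.shape
    p = M.sum(axis=0) / N
    C = -np.log(np.maximum(p, 1e-12))
    # per-cluster average intra-cluster distance, weighted as in Eq. (2)
    intra = np.einsum('ia,ka,ik->a', M, M, Dmat) / (2 * N * np.maximum(p, 1e-12))
    return intra.sum() + lam * (M.sum(axis=0) * C).sum()
```

With λ = 0 and Y chosen as the centroids, the two functions agree exactly for quadratic distances, reflecting the standard identity between pairwise and central distortions.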
Such transitions might be caused by noise in communication channels. These index transitions impose a topological order on the set of indices {a | a = 1, ..., K} which establishes a connection to self-organizing feature maps (Kohonen, 1984; Ritter et al., 1992) in the case of nearest neighbor transitions in a d-dimensional index space. We refer to such a partitioning of the data space as topology preserving clustering. \n\n2 Maximum Entropy Estimation of Central Clustering \n\nDifferent combinations of complexity terms, distortion measures and topology constraints define a variety of central clustering algorithms which are relevant in very different information processing contexts. To derive robust, preferably parallel algorithms for these data clustering cases, we study the clustering optimization problem in the probabilistic framework of maximum entropy estimation. The resulting Gibbs distribution proved to be the most stable distribution with respect to changes in expected clustering costs (Tikochinsky et al., 1984) and, therefore, has to be considered optimal in the sense of robust statistics. Statistical physics (see e.g. (Amit, 1989; Rose et al., 1990)) states that maximizing the entropy at a fixed temperature T = 1/β is equivalent to minimizing the free energy \n\nF_K = −T ln Z = −T ln ( Σ_{M_iν} exp(−β E_K^c) ) = −λ N Σ_ν p_ν^2 ∂C_ν/∂p_ν − (1/β) Σ_i log ( Σ_ν exp[ −β( ⟨⟨D_iν⟩⟩ + λ C̃_ν ) ] )    (4) \n\nwith respect to the variables p_ν, Y_ν. The effective complexity costs are C̃_ν = ∂(p_ν C_ν)/∂p_ν. For a derivation of (4) see (Buhmann, Kühnel, 1993b). \n\nThe resulting re-estimation equations for the expected cluster probabilities and the expected centroid positions are necessary conditions for F_K being minimal, i.e., \n\np_a = (1/N) Σ_i ⟨M_ia⟩,    (5) \n\n0 = (1/N) Σ_i Σ_γ T_γa ⟨M_iγ⟩ ∂/∂Y_a D_ia(X_i, Y_a),    (6) \n
⟨M_ia⟩ = exp[ −β( ⟨⟨D_ia⟩⟩ + λ C̃_a ) ] / Σ_{ν=1}^K exp[ −β( ⟨⟨D_iν⟩⟩ + λ C̃_ν ) ].    (7) \n\nThe expectation value ⟨M_ia⟩ of the assignment variable M_ia can be interpreted as a fuzzy membership of data point X_i in cluster a. The case of supervised clustering can be treated in an analogous fashion (Buhmann, Kühnel, 1993a), which gives rise to the third layer in the neural network implementation (see Fig. 1). The global minimum of the free energy (4) with respect to p_a, Y_a determines the maximum entropy solution of the cost function (1). Note that the optimization problem (1) over a state space of size K^N has been reduced to a K(d + 1)-dimensional minimization of the free energy F_K (4). To find the optimal parameters p_a, Y_a and the number of clusters K which minimize the free energy, we start with one cluster located at the centroid of the data distribution, split that cluster and reestimate p_a, Y_a using equations (5,6). The new configuration is accepted as an improved solution if the free energy (4) has been decreased. This splitting and reestimation loop is continued until we fail to find a new configuration with lower free energy. The temperature determines the fuzziness of a clustering solution, whereas the complexity term penalizes excessively many clusters. \n\n3 Meanfield Approximation for Pairwise Clustering \n\nThe maximum entropy estimation for pairwise clustering constitutes a much harder problem than the calculation of the free energy for central clustering. Analytical expressions for the Gibbs distributions are not known except for the quadratic distance measure D_ik = (X_i − X_k)^2. Therefore, we approximate the free energy by a variational principle commonly referred to as the meanfield approximation. Given the cost function (2), we derive a lower bound to the free energy by a system of noninteracting assignment variables. 
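Before turning to the pairwise case, the central clustering reestimation loop of Eqs. (5-7) can be sketched as a fixed-temperature iteration. The sketch below (NumPy; names are ours) assumes the non-topological case T_γa = δ_γa, quadratic distortions, and entropic complexity C_ν = −log p_ν, so that C̃_ν = −log p_ν − 1; the splitting schedule over K is omitted.

```python
import numpy as np

def soft_central_clustering(X, K, lam=0.1, beta=5.0, iters=100):
    """Reestimation of Eqs. (5-7) at fixed temperature 1/beta, with T = identity
    and entropic complexity C = -log p (effective costs C~ = -log p - 1)."""
    N, d = X.shape
    Y = X[np.linspace(0, N - 1, K).astype(int)].copy()  # spread-out initialization
    p = np.full(K, 1.0 / K)
    for _ in range(iters):
        D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # squared distortions
        Ceff = -np.log(np.maximum(p, 1e-12)) - 1.0              # effective complexity
        logits = -beta * (D + lam * Ceff[None, :])
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        M = np.exp(logits)
        M /= M.sum(axis=1, keepdims=True)                       # Eq. (7): <M_ia>
        p = M.sum(axis=0) / N                                   # Eq. (5)
        Y = (M.T @ X) / np.maximum(N * p, 1e-12)[:, None]       # Eq. (6), quadratic D
    return M, Y, p
```

On well-separated data the iteration converges to nearly hard assignments, with the centers at the cluster centroids as required by (6).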
The approximative cost function with the variational parameters ε_iν is \n\nE_K^0 = Σ_{ν=1}^K Σ_{i=1}^N M_iν ε_iν.    (8) \n\nThe original cost function for pairwise clustering can be written as E_K^pc = E_K^0 + V with a (small) perturbation term V = E_K^pc − E_K^0 due to cluster interactions. The partition function \n\nZ = Σ_{M_iν} exp(−β E_K^pc) = Σ_{M_iν} exp(−β E_K^0) exp(−β V) = Z_0 ⟨exp(−β V)⟩_0 ≥ Z_0 exp(−β ⟨V⟩_0)    (9) \n\nis bounded from below; the approximation is accurate if terms of the order O(⟨(V − ⟨V⟩_0)^3⟩_0) and higher are negligible compared to the quadratic term. The angular brackets ⟨·⟩_0 denote averages over all configurations of the cost function without interactions. The averaged perturbation term ⟨V⟩_0 amounts to \n\n⟨V⟩_0 = Σ_ν Σ_{i,k} ⟨M_iν⟩ ⟨M_kν⟩ D_ik / (2 N p_ν) + λ Σ_ν Σ_i ⟨M_iν⟩ C_ν − Σ_ν Σ_i ⟨M_iν⟩ ε_iν,    (10) \n\n⟨M_ia⟩ being the averaged assignment variables \n\n⟨M_ia⟩ = exp(−β ε_ia) / Σ_ν exp(−β ε_iν).    (11) \n\nThe meanfield approximation with the cost function (8) yields a lower bound to the partition function Z of the original pairwise clustering problem. Therefore, we vary the parameters ε_ia to maximize the quantity ln Z_0 − β⟨V⟩_0, which produces the best lower bound of Z based on an interaction-free cost function. Variation of ε_ia leads to the conditions \n\nΣ_{ν=1}^K ∂⟨M_iν⟩/∂ε_ia ( ε_iν − ε*_iν ) = 0,  ∀ i ∈ {1, ..., N}, a ∈ {1, ..., K},    (12) \n\nε*_iν being defined as \n\nε*_iν = (1/(N p_ν)) Σ_k ⟨M_kν⟩ D_ik − (1/(2 (N p_ν)^2)) Σ_{k,l} ⟨M_kν⟩ ⟨M_lν⟩ D_kl + λ C̃_ν.    (13) \n\nFor a given distance matrix D_ik the transcendental equations (11,12) have to be solved simultaneously. \n\nSo far the ε_ia have been treated as independent variational parameters. An important problem, which is usually discussed in the context of multidimensional scaling, is to find an embedding for the data set in a Euclidean space and to cluster the embedded data. 
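The self-consistent solution of Eqs. (11,12) can be sketched as a damped fixed-point iteration: compute the potentials ε*_ia from the current soft assignments, then recompute the Gibbs assignments from the potentials. The sketch below (NumPy; names are ours) drops the complexity term λC̃_ν from the potentials for simplicity, so it is a minimal variant, not the full scheme.

```python
import numpy as np

def pairwise_meanfield(Dmat, K, beta=1.0, iters=300, seed=0):
    """Mean-field fixed point for pairwise clustering: alternate the potentials
    eps*_ia (Eq. 13 without the complexity term) and the assignments <M_ia>
    of Eq. (11) until self-consistency."""
    N = Dmat.shape[0]
    rng = np.random.default_rng(seed)
    M = rng.dirichlet(np.ones(K), size=N)       # random soft init, rows sum to 1
    for _ in range(iters):
        n = np.maximum(M.sum(axis=0), 1e-12)    # n_a = N * p_a
        eps = Dmat @ M / n[None, :]             # (1/(N p_a)) sum_k <M_ka> D_ik
        intra = np.einsum('ka,la,kl->a', M, M, Dmat) / (2 * n ** 2)
        eps -= intra[None, :]                   # subtract average intra-cluster cost
        logits = -beta * eps
        logits -= logits.max(axis=1, keepdims=True)
        Mnew = np.exp(logits)
        Mnew /= Mnew.sum(axis=1, keepdims=True)
        M = 0.5 * M + 0.5 * Mnew                # damped update for stability
    return M
```

On dissimilarity data with a clear block structure, the fixed point recovers the blocks without ever embedding the points.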
The variational framework can be applied to this problem if we consider the parameters ε_ia as functions of data coordinates and prototype coordinates, ε_ia = D_ia(X_i, Y_a), e.g. with a quadratic distortion measure D_ia(X_i, Y_a) = ||X_i − Y_a||^2. The variables X_i, Y_a ∈ R^d are the variational parameters which have to be determined by maximizing ln Z_0 − β⟨V⟩_0. Without imposing the restriction for the prototypes to be the cluster centroids, this leads to the following conditions for the data coordinates \n\nΣ_ν ⟨M_iν⟩ ( ε_iν − ε*_iν − Σ_μ ⟨M_iμ⟩ ( ε_iμ − ε*_iμ ) ) ( X_i − Y_ν ) = 0,  ∀ i ∈ {1, ..., N}.    (14) \n\nAfter further algebraic manipulations we obtain the explicit expression for the data points \n\nK_i X_i = (1/2) Σ_ν ⟨M_iν⟩ ( ||Y_ν||^2 − ε*_iν ) ( Y_ν − Σ_μ ⟨M_iμ⟩ Y_μ ),    (15) \n\nwith the covariance matrix K_i = ⟨Y Y^T⟩_i − ⟨Y⟩_i ⟨Y⟩_i^T, ⟨Y⟩_i = Σ_ν ⟨M_iν⟩ Y_ν. Let us assume that the matrix K_i is non-singular, which imposes the condition K > d and the cluster centers {Y_a | a = 1, ..., K} being in general position. For K < d the equations ε_ia = ε*_ia + c_i are exactly solvable, and embedding in dimensions larger than K produces non-unique solutions without improving the lower bound in (9). \n\nVarying ln Z_0 − β⟨V⟩_0 with respect to Y_a yields a second set of stationarity conditions \n\nΣ_j ⟨M_ja⟩ ( 1 − ⟨M_ja⟩ ) ( ε_ja − ε*_ja ) ( X_j − Y_a ) = 0,  ∀ a ∈ {1, ..., K}.    (16) \n\nThe weighting factors in (16), however, decay exponentially fast with the inverse temperature, i.e., ⟨M_ja⟩(1 − ⟨M_ja⟩) ∼ O(β exp[−βc]), c > 0. This implies that the optimal solution for the data coordinates displays only a very weak dependence on the special choice of the prototypes in the low temperature regime. Fixing the parameters Y_a and solving the transcendental equations (14,15) for X_i, the solution will be very close to the optimal approximation. It is thus possible to choose the prototypes as the cluster centroids Y_a = 1/(p_a N) Σ_i ⟨M_ia⟩ X_i and, thereby, to solve Eq. (15) in a self-consistent fashion. 
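Given soft assignments, prototypes, and potentials, Eq. (15) determines each embedded point by a small d×d linear solve. The sketch below (NumPy; names are ours) implements that solve; as a sanity check, if the potentials are set to ε*_ia = ||X_i − Y_a||² for some true coordinates, the solve recovers those coordinates, which follows from expanding the quadratic form on the right-hand side of (15).

```python
import numpy as np

def embed_points(M, Y, eps_star, ridge=1e-9):
    """Solve Eq. (15) for the data coordinates X_i, given soft assignments
    M = <M_ia> (N x K), prototypes Y (K x d) and potentials eps_star (N x K)."""
    N, K = M.shape
    d = Y.shape[1]
    X = np.zeros((N, d))
    for i in range(N):
        ybar = M[i] @ Y                                        # <Y>_i
        Ki = (Y * M[i][:, None]).T @ Y - np.outer(ybar, ybar)  # covariance K_i
        phi = (Y ** 2).sum(axis=1) - eps_star[i]               # ||Y_a||^2 - eps*_ia
        rhs = 0.5 * ((M[i] * phi)[:, None] * (Y - ybar)).sum(axis=0)
        # small ridge term guards against a nearly singular K_i (needs K > d)
        X[i] = np.linalg.solve(Ki + ridge * np.eye(d), rhs)
    return X
```

In the full algorithm this solve would be interleaved with the mean-field update of the assignments, so that embedding and clustering proceed simultaneously as described in the text.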
\n\nFigure 2: A data distribution (4000 data points) (a), generated by four normally distributed sources, is clustered with the complexity measure C_a = −log p_a and λ = 0.4 (b). The plus signs (+) denote the centers of the Gaussians and stars (*) denote cluster centers. Figure (c) shows a topology preserving clustering solution with complexity C_a = 1/p_a and external noise (η = 0.05). \n\nIf the prototype variables depend on the data coordinates, the derivatives ∂Y_a/∂X_i will not vanish in general and the condition (14) becomes more complicated. Regardless of this complication, the resulting algorithm to estimate data coordinates X_i interleaves the clustering process and the optimization of the embedding in a Euclidean space. The artificial separation of multidimensional scaling from data clustering has been avoided. Data points are embedded and clustered simultaneously. Furthermore, we have derived a maximum entropy approximation which is most robust with respect to changes in the average costs ⟨E_K⟩. \n\n4 Clustering Results \n\nNon-topological (T_aγ = δ_aγ) clustering results at zero temperature for the logarithmic complexity measure (C_a = −log p_a) are shown in Fig. 2b. In the limit of very small complexity costs the best clustering solution densely covers the data distribution. The specific choice of logarithmic complexity costs causes an almost homogeneous density of cluster centers, a phenomenon which is known from studies of asymptotic codebook densities and which is explained by the vanishing average complexity costs ⟨C_a⟩ = −p_a log p_a of very sparsely occupied clusters (for references see (Buhmann, Kühnel, 1993b)). 
\n\nFigure 2c shows a clustering configuration assuming a one-dimensional topology in index space with nearest neighbor transitions. The short links between neighboring nodes of the neural chain indicate that the distortions due to cluster index transitions have also been optimized. Note that complexity optimized clustering determines the length of the chain or, for a more general noise distribution, an optimal size of the cluster set. This stopping criterion for adding new cluster nodes generalizes self-organizing feature maps (Kohonen, 1984) and removes arbitrariness in the design of topological mappings. Furthermore, our algorithm is derived from an energy minimization principle, in contrast to self-organizing feature maps, which \"cannot be derived as a stochastic gradient on any energy function\" (Erwin et al., 1992). \n\nThe complexity optimized clustering scheme has been tested on the real-world task of image compression (Buhmann, Kühnel, 1993b). \n\nFigure 3: Quantization of a 128 x 128, 8-bit, gray-level image. (a) Original picture. (b) Image reconstruction from wavelet coefficients quantized with entropic complexity. (c) Reconstruction from wavelet coefficients quantized by K-means clustering. (d,e) Absolute values of reconstruction errors in the images (b,c). Black is normalized in (d,e) to a deviation of 92 gray values. \n\n
Entropy optimized clustering of wavelet-decomposed images has reduced the reconstruction error of the compressed images by up to 30 percent. Images of a compression and reconstruction experiment are shown in Fig. 3. The compression ratio is 24.5 for a 128 x 128 image. According to our efficiency criterion, entropy optimized compression is 36.8% more efficient than K-means clustering for that compression factor. The peak SNR values for (b,c) are 30.1 and 27.1, respectively. The considerably higher error near edges in the reconstruction based on K-means clustering (e) demonstrates that entropy optimized clustering of wavelet coefficients not only results in higher compression ratios but, even more importantly, preserves psychophysically important image features like edges more faithfully than conventional compression schemes. \n\n5 Conclusion \n\nComplexity optimized clustering is a maximum entropy approach to central and pairwise data clustering which determines the optimal number of clusters as a compromise between distortion errors and the complexity of a cluster set. The complexity term turns out to be as important for the design of a cluster set as the distortion measure. Complexity optimized clustering maps onto a winner-take-all network which suggests hardware implementations in analog VLSI (Andreou et al., 1991). Topology preserving clustering provides us with a cost function based approach to limit the size of self-organizing maps. \n\nThe maximum entropy estimation for pairwise clustering cannot be solved analytically but has to be approximated by a meanfield approach. This meanfield approximation of the pairwise clustering costs with quadratic Euclidean distances establishes a connection between multidimensional scaling and clustering. 
Contrary to the usual strategy, which embeds data according to their dissimilarities in a Euclidean space and, in a separate second step, clusters the embedded data, our approach finds the Euclidean embedding and the data clusters simultaneously and in a self-consistent fashion. \n\nThe proposed framework for data clustering unifies traditional clustering techniques like K-means clustering, entropy-constrained clustering or fuzzy clustering with neural network approaches such as topological vector quantizers. The network size and the cluster parameters are determined by a problem-adapted complexity function which removes considerable arbitrariness present in other non-parametric clustering methods. \n\nAcknowledgement: JB thanks H. Kühnel for insightful discussions. This work was supported by the Ministry of Science and Research of the state Nordrhein-Westfalen. \n\nReferences \n\nAmit, D. (1989). Modelling Brain Function. Cambridge: Cambridge University Press. \nAndreou, A. G., Boahen, K. A., Pouliquen, P. O., Pavasovic, A., Jenkins, R. E., Strohbehn, K. (1991). Current Mode Subthreshold MOS Circuits for Analog VLSI Neural Systems. IEEE Transactions on Neural Networks, 2, 205-213. \nBuhmann, J., Kühnel, H. (1993a). Complexity Optimized Data Clustering by Competitive Neural Networks. Neural Computation, 5, 75-88. \nBuhmann, J., Kühnel, H. (1993b). Vector Quantization with Complexity Costs. IEEE Transactions on Information Theory, 39(4), 1133-1145. \nChou, P. A., Lookabaugh, T., Gray, R. M. (1989). Entropy-Constrained Vector Quantization. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 31-42. \nErwin, W., Obermayer, K., Schulten, K. (1992). Self-organizing Maps: Ordering, Convergence Properties, and Energy Functions. Biological Cybernetics, 67, 47-55. \nGray, R. M. (1984). Vector Quantization. IEEE Acoustics, Speech and Signal Processing Magazine, April, 4-29. \nKohonen, T. 
(1984). Self-organization and Associative Memory. Berlin: Springer. \nLinde, Y., Buzo, A., Gray, R. M. (1980). An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications, COM-28, 84-95. \nRitter, H., Martinetz, T., Schulten, K. (1992). Neural Computation and Self-organizing Maps. New York: Addison-Wesley. \nRose, K., Gurewitz, E., Fox, G. (1990). Statistical Mechanics and Phase Transitions in Clustering. Physical Review Letters, 65(8), 945-948. \nTikochinsky, Y., Tishby, N. Z., Levine, R. D. (1984). Alternative Approach to Maximum-Entropy Inference. Physical Review A, 30, 2638-2644. \n", "award": [], "sourceid": 719, "authors": [{"given_name": "Joachim", "family_name": "Buhmann", "institution": null}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": null}]}