{"title": "Differential Entropic Clustering of Multivariate Gaussians", "book": "Advances in Neural Information Processing Systems", "page_first": 337, "page_last": 344, "abstract": null, "full_text": "Differential Entropic Clustering of Multivariate Gaussians

Jason V. Davis, Inderjit Dhillon
Dept. of Computer Science, University of Texas at Austin, Austin, TX 78712
{jdavis,inderjit}@cs.utexas.edu

Abstract

Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and a covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application.

1 Introduction

Gaussian data is pervasive in all walks of life, and many learning algorithms--e.g. k-means, principal components analysis, linear discriminant analysis, etc.--model each input object as a single sample drawn from a multivariate Gaussian. For example, the k-means algorithm assumes that each input is a single sample drawn from one of k (unknown) isotropic Gaussians. The goal of k-means can be viewed as the discovery of the mean of each Gaussian and recovery of the generating distribution of each input object. 
However, in many real-life settings, each input object is naturally represented by multiple samples drawn from an underlying distribution. For example, a student's scores in reading, writing, and arithmetic can be measured at each of four quarters throughout the school year. Alternately, consider a website where movies are rated on the basis of originality, plot, and acting. Here, several different users may rate the same movie. Multiple samples are also ubiquitous in time-series data such as sensor networks, where each sensor device continually monitors its environmental conditions (e.g. humidity, temperature, or light). Clustering is an important data analysis task used in many of these applications. For example, clustering sensor network devices has been used for optimizing routing of the network and also for discovering trends between sensor nodes. If the k-means algorithm is employed, then only the means of the distributions will be clustered, ignoring all second order covariance information. Clearly, a better solution is needed. In this paper, we consider the problem of clustering input objects, each of which can be represented by a multivariate Gaussian distribution. The \"distance\" between two Gaussians can be quantified in an information theoretic manner, in particular by their differential relative entropy. Interestingly, the differential relative entropy between two multivariate Gaussians can be expressed as the convex combination of two Bregman divergences--a Mahalanobis distance between mean vectors and a Burg matrix divergence between the covariance matrices. We develop an EM-style clustering algorithm and show that the optimal cluster parameters can be cheaply determined via a simple, closed-form solution. Our algorithm is a Bregman-like clustering method that clusters both means and covariances of the distributions in a unified framework. We evaluate our method across several domains. 
First, we present results from synthetic data experiments, and show that incorporating second order information can dramatically increase clustering accuracy. Next, we apply our algorithm to a real-world sensor network dataset comprised of 52 sensor devices that measure temperature, humidity, light, and voltage. Finally, we use our algorithm as a statistical debugging tool by clustering the behavior of functions in a program across a set of known software bugs.

2 Preliminaries

We first present some essential background material. The multivariate Gaussian distribution is the multivariate generalization of the standard univariate case. The probability density function (pdf) of a d-dimensional multivariate Gaussian is parameterized by a mean vector \mu and a positive definite covariance matrix \Sigma:

p(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),

where |\Sigma| is the determinant of \Sigma. The Bregman divergence [2] with respect to \varphi is defined as

D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y),

where \varphi is a real-valued, strictly convex function defined over a convex set Q = dom(\varphi) \subseteq R^d such that \varphi is differentiable on the relative interior of Q. For example, if \varphi(x) = x^T x, then the resulting Bregman divergence is the standard squared Euclidean distance. Similarly, if \varphi(x) = x^T A^T A x, for some arbitrary non-singular matrix A, then the resulting divergence is the Mahalanobis distance M_{S^{-1}}(x, y) = (x - y)^T S^{-1} (x - y), parameterized by the covariance matrix S, where S^{-1} = A^T A. Alternately, if \varphi(x) = \sum_i (x_i \log x_i - x_i), then the resulting divergence is the (unnormalized) relative entropy. Bregman divergences generalize many properties of squared loss and relative entropy. Bregman divergences can be naturally extended to matrices, as follows:

D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - tr((\nabla\varphi(Y))^T (X - Y)),

where X and Y are matrices, \varphi is a real-valued, strictly convex function defined over matrices, and tr(A) denotes the trace of A. Consider the function \varphi(X) = \|X\|_F^2. 
Then the corresponding Bregman matrix divergence is the squared Frobenius norm, \|X - Y\|_F^2. The Burg matrix divergence is generated from a function of the eigenvalues \lambda_1, ..., \lambda_d of the positive definite matrix X: \varphi(X) = -\sum_i \log \lambda_i = -\log |X|, the Burg entropy of the eigenvalues. The resulting Burg matrix divergence is:

B(X, Y) = tr(X Y^{-1}) - \log |X Y^{-1}| - d.    (1)

As we shall see later, the Burg matrix divergence will arise naturally in our application. Let \lambda_1, ..., \lambda_d be the eigenvalues of X with corresponding eigenvectors v_1, ..., v_d, and let \theta_1, ..., \theta_d be the eigenvalues of Y with eigenvectors w_1, ..., w_d. The Burg matrix divergence can also be written as

B(X, Y) = \sum_i \sum_j \frac{\lambda_i}{\theta_j} (v_i^T w_j)^2 - \sum_i \log \frac{\lambda_i}{\theta_i} - d.

From the first term above, we see that the Burg matrix divergence is a function of the eigenvalues as well as of the eigenvectors of X and Y. The differential entropy of a continuous random variable x with probability density function f is defined as

h(f) = -\int f(x) \log f(x) dx.

It can be shown [3] that an n-bit quantization of a continuous random variable with pdf f has Shannon entropy approximately equal to h(f) + n. The continuous analog of the discrete relative entropy is the differential relative entropy. Given two distributions with pdfs f and g, the differential relative entropy is defined as

D(f || g) = \int f(x) \log \frac{f(x)}{g(x)} dx.

3 Clustering Multivariate Gaussians via Differential Relative Entropy

Given a set of n multivariate Gaussians parameterized by mean vectors m_1, ..., m_n and covariances S_1, ..., S_n, we seek a disjoint and exhaustive partitioning of these Gaussians into k different clusters. Each cluster j can be represented by a multivariate Gaussian parameterized by mean \mu_j and covariance \Sigma_j. 
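The equivalence of the two forms of the Burg matrix divergence above can be checked numerically. The following is a minimal NumPy sketch (the function names are ours, not from the paper), computing B(X, Y) both from equation (1) and from the eigendecompositions of X and Y:

```python
import numpy as np

def burg_div(X, Y):
    # B(X, Y) = tr(X Y^-1) - log|X Y^-1| - d, as in equation (1)
    d = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.log(np.linalg.det(XYinv)) - d

def burg_div_eig(X, Y):
    # Same divergence via eigendecompositions:
    # sum_ij (lambda_i / theta_j) (v_i^T w_j)^2 - sum_i log(lambda_i / theta_i) - d
    d = X.shape[0]
    lam, V = np.linalg.eigh(X)
    theta, W = np.linalg.eigh(Y)
    term1 = sum((lam[i] / theta[j]) * (V[:, i] @ W[:, j]) ** 2
                for i in range(d) for j in range(d))
    term2 = np.sum(np.log(lam / theta))
    return term1 - term2 - d

# Two random positive definite matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
X = A @ A.T + 3 * np.eye(3)
Y = B @ B.T + 3 * np.eye(3)

print(burg_div(X, Y), burg_div_eig(X, Y))  # the two forms agree
```

Note that the second form makes the dependence on the eigenvectors explicit: the trace term couples every eigenpair of X with every eigenpair of Y.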
Using differential relative entropy as the distance measure between Gaussians, the problem of clustering may be posed as the minimization (over all clusterings) of

\sum_{j=1}^{k} \sum_{i : \pi_i = j} D(p(x | m_i, S_i) || p(x | \mu_j, \Sigma_j)).    (2)

3.1 Differential Relative Entropy and Multivariate Gaussians

We first show that the differential relative entropy between two multivariate Gaussians can be expressed as a convex combination of a Mahalanobis distance between means and the Burg matrix divergence between covariance matrices. Consider two multivariate Gaussians, parameterized by mean vectors m and \mu, and covariances S and \Sigma, respectively. We first note that the differential relative entropy can be expressed as

D(f || g) = \int f \log f - \int f \log g = -h(f) - \int f \log g.

The first term is just the negative differential entropy of p(x | m, S), which can be shown [3] to be:

h(p(x | m, S)) = \frac{d}{2} + \frac{1}{2} \log (2\pi)^d |S|.    (3)

We now consider the second term:

\int p(x | m, S) \log p(x | \mu, \Sigma) dx
= -\frac{1}{2} \int p(x | m, S) \, tr(\Sigma^{-1} (x - \mu)(x - \mu)^T) dx - \frac{1}{2} \log (2\pi)^d |\Sigma|
= -\frac{1}{2} tr(\Sigma^{-1} E[(x - \mu)(x - \mu)^T]) - \frac{1}{2} \log (2\pi)^d |\Sigma|
= -\frac{1}{2} tr(\Sigma^{-1} E[((x - m) + (m - \mu))((x - m) + (m - \mu))^T]) - \frac{1}{2} \log (2\pi)^d |\Sigma|
= -\frac{1}{2} tr(\Sigma^{-1} S + \Sigma^{-1} (m - \mu)(m - \mu)^T) - \frac{1}{2} \log (2\pi)^d |\Sigma|
= -\frac{1}{2} tr(\Sigma^{-1} S) - \frac{1}{2} (m - \mu)^T \Sigma^{-1} (m - \mu) - \frac{1}{2} \log (2\pi)^d |\Sigma|.

The expectation above is taken over the distribution p(x | m, S). The second to last line follows from the definition S = E[(x - m)(x - m)^T] and also from the fact that E[(x - m)(m - \mu)^T] = E[x - m](m - \mu)^T = 0. Thus, we have

D(p(x | m, S) || p(x | \mu, \Sigma))
= -\frac{d}{2} - \frac{1}{2} \log (2\pi)^d |S| + \frac{1}{2} tr(\Sigma^{-1} S) + \frac{1}{2} \log (2\pi)^d |\Sigma| + \frac{1}{2} (m - \mu)^T \Sigma^{-1} (m - \mu)    (4)
= \frac{1}{2} \left( tr(S \Sigma^{-1}) - \log |S \Sigma^{-1}| - d \right) + \frac{1}{2} (m - \mu)^T \Sigma^{-1} (m - \mu)
= \frac{1}{2} B(S, \Sigma) + \frac{1}{2} M_{\Sigma^{-1}}(m, \mu),    (5)

where B(S, \Sigma) is the Burg matrix divergence and M_{\Sigma^{-1}}(m, \mu) is the Mahalanobis distance, parameterized by the covariance matrix \Sigma. 
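The closed form (5) can be sanity-checked against a Monte Carlo estimate of the defining integral \int f \log(f/g). A minimal NumPy sketch (helper names are illustrative, not from the paper):

```python
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    # Log-density of N(mu, Sigma) evaluated at each row of x
    d = len(mu)
    diff = x - mu
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
    return -0.5 * (quad + d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

def kl_gaussians(m, S, mu, Sigma):
    # Closed form (5): 1/2 B(S, Sigma) + 1/2 (m - mu)^T Sigma^-1 (m - mu)
    d = len(m)
    Sinv = np.linalg.inv(Sigma)
    burg = np.trace(S @ Sinv) - np.log(np.linalg.det(S @ Sinv)) - d
    maha = (m - mu) @ Sinv @ (m - mu)
    return 0.5 * burg + 0.5 * maha

m = np.array([0.0, 0.0])
mu = np.array([1.0, -0.5])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma = np.array([[1.0, -0.2], [-0.2, 1.5]])

# Monte Carlo estimate of E_f[log f - log g] under f = N(m, S)
rng = np.random.default_rng(1)
x = rng.multivariate_normal(m, S, size=200000)
mc = np.mean(gauss_logpdf(x, m, S) - gauss_logpdf(x, mu, Sigma))

print(kl_gaussians(m, S, mu, Sigma), mc)  # closed form vs. sampled estimate
```

With this many samples, the two values should agree to roughly two decimal places; the closed form, of course, requires no sampling at all.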
We now consider the problem of finding the optimal representative Gaussian for a set of c Gaussians with means m_1, ..., m_c and covariances S_1, ..., S_c. For non-negative weights \alpha_1, ..., \alpha_c such that \sum_i \alpha_i = 1, the optimal representative minimizes the cumulative differential relative entropy:

p(x | \mu^*, \Sigma^*) = \arg\min_{p(x | \mu, \Sigma)} \sum_i \alpha_i D(p(x | m_i, S_i) || p(x | \mu, \Sigma))    (6)
= \arg\min_{p(x | \mu, \Sigma)} \sum_i \alpha_i \left( \frac{1}{2} B(S_i, \Sigma) + \frac{1}{2} M_{\Sigma^{-1}}(m_i, \mu) \right).    (7)

The second term can be viewed as minimizing the Bregman information with respect to some fixed (albeit unknown) Bregman divergence (i.e. the Mahalanobis distance parameterized by some covariance matrix \Sigma). Consequently, it has a unique minimizer [1] of the form

\mu^* = \sum_i \alpha_i m_i.    (8)

Next, we note that equation (7) is strictly convex in \Sigma^{-1}. Thus, we can derive the optimal \Sigma^* by setting the gradient of (7) with respect to \Sigma^{-1} to 0:

\frac{\partial}{\partial \Sigma^{-1}} \sum_{i=1}^{c} \alpha_i D(p(x | m_i, S_i) || p(x | \mu^*, \Sigma)) = \frac{1}{2} \sum_{i=1}^{c} \alpha_i \left( S_i - \Sigma + (m_i - \mu^*)(m_i - \mu^*)^T \right).

Setting this to zero yields

\Sigma^* = \sum_i \alpha_i \left( S_i + (m_i - \mu^*)(m_i - \mu^*)^T \right).    (9)

Figure 1 illustrates optimal representatives of two 2-dimensional Gaussians with means marked by points A and B, and covariances outlined with solid lines. The optimal Gaussian representatives are denoted with dotted covariances; the representative on the left uses weights (\alpha_A = 2/3, \alpha_B = 1/3), while the representative on the right uses weights (\alpha_A = 1/3, \alpha_B = 2/3). As we can see from equation (8), the optimal representative mean is the weighted average of the means of the constituent Gaussians. Interestingly, the optimal covariance turns out to be the average of the constituent covariances plus rank-one updates. These rank-one changes account for the deviations from the individual means to the representative mean.

3.2 Algorithm

Algorithm 1 presents our clustering algorithm for the case where each Gaussian has equal weight \alpha_i = 1/n. The method works in an EM-style framework. 
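The closed-form representative of equations (8) and (9) can be sketched in a few lines of NumPy (function names are ours, not from the paper); as a quick check, the resulting parameters should score no worse than nearby perturbations under the weighted objective:

```python
import numpy as np

def representative(means, covs, weights):
    # Equations (8) and (9): weighted mean, then weighted covariances
    # plus rank-one corrections toward the representative mean.
    mu = sum(w * m for w, m in zip(weights, means))
    Sigma = sum(w * (S + np.outer(m - mu, m - mu))
                for w, m, S in zip(weights, means, covs))
    return mu, Sigma

def kl(m, S, mu, Sigma):
    # Differential relative entropy between N(m, S) and N(mu, Sigma), equation (5)
    d = len(m)
    Sinv = np.linalg.inv(Sigma)
    return 0.5 * (np.trace(S @ Sinv) - np.log(np.linalg.det(S @ Sinv)) - d
                  + (m - mu) @ Sinv @ (m - mu))

means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
weights = [2 / 3, 1 / 3]

mu, Sigma = representative(means, covs, weights)

def objective(mu_, Sigma_):
    return sum(w * kl(m, S, mu_, Sigma_)
               for w, m, S in zip(weights, means, covs))

# The closed-form representative should beat nearby perturbations.
best = objective(mu, Sigma)
print(best <= objective(mu + 0.1, Sigma))              # True: perturbed mean is worse
print(best <= objective(mu, Sigma + 0.1 * np.eye(2)))  # True: perturbed covariance is worse
```

Note how (9) depends on the optimal mean \mu^* computed in (8): the rank-one terms measure how far each constituent mean sits from the representative mean.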
Initially, cluster assignments are chosen (these can be assigned randomly). The algorithm then proceeds iteratively until convergence. First, the mean and covariance parameters of the cluster representative distributions are optimally computed given the cluster assignments; these parameters are updated as shown in (8) and (9). Next, the cluster assignments are updated for each input Gaussian, by assigning the ith Gaussian to the cluster j whose representative Gaussian is closest in differential relative entropy.

Figure 1: Optimal Gaussian representatives (shown with dotted lines) of two Gaussians centered at A and B (for two different sets of weights). While the optimal mean of each representative is the average of the individual means, the optimal covariance is the average of the individual covariances plus rank-one corrections.

Since both of these steps are locally optimal, convergence of the algorithm to a local optimum can be shown. Note that the problem is NP-hard, so convergence to a global optimum cannot be guaranteed. We next consider the running time of Algorithm 1 when the input Gaussians are d-dimensional. Lines 6 and 9 compute the optimal means and covariances for each cluster, which requires O(nd) and O(nd^2) total work, respectively. Line 12 computes the differential relative entropy between each input Gaussian and each cluster representative Gaussian. As only the argmin over all j is needed, we can reduce the Burg matrix divergence computation (equation (1)) to tr(S_i \Sigma_j^{-1}) - \log |\Sigma_j^{-1}|. Once the inverse of each cluster covariance is computed (for a cost of O(kd^3)), the first term can be computed in O(d^2) time. The second term can similarly be computed once for each cluster for a total cost of O(kd^3). Computing the Mahalanobis distance is an O(d^2) operation. 
Thus, the total cost of line 12 is O(kd^3 + nkd^2), and the total running time of the algorithm, given \tau iterations, is O(\tau k d^2 (n + d)).

Algorithm 1 Differential Entropic Clustering of Multivariate Gaussians
1: {m_1, ..., m_n}: means of input Gaussians
2: {S_1, ..., S_n}: covariance matrices of input Gaussians
3: {\pi_1, ..., \pi_n}: initial cluster assignments
4: while not converged do
5:   for j = 1 to k do {update cluster means}
6:     \mu_j \leftarrow \frac{1}{|\{i : \pi_i = j\}|} \sum_{i : \pi_i = j} m_i
7:   end for
8:   for j = 1 to k do {update cluster covariances}
9:     \Sigma_j \leftarrow \frac{1}{|\{i : \pi_i = j\}|} \sum_{i : \pi_i = j} \left( S_i + (m_i - \mu_j)(m_i - \mu_j)^T \right)
10:  end for
11:  for i = 1 to n do {assign each Gaussian to the closest cluster representative Gaussian}
12:    \pi_i \leftarrow argmin_{1 \le j \le k} B(S_i, \Sigma_j) + M_{\Sigma_j^{-1}}(m_i, \mu_j) {B is the Burg matrix divergence and M_{\Sigma_j^{-1}} is the Mahalanobis distance parameterized by \Sigma_j}
13:  end for
14: end while

4 Experiments

We now present experimental results for our algorithm across three different domains: a synthetic dataset, sensor network data, and a statistical debugging application.

4.1 Synthetic Data

Our synthetic datasets consist of a set of 200 objects, each of which consists of 30 samples drawn from one of k randomly generated d-dimensional multivariate Gaussians. The k Gaussians are

Figure 2: Clustering quality of synthetic data. Traditional k-means clustering uses only first-order information (i.e. the mean), whereas our Gaussian clustering algorithm also incorporates second-order covariance information. 
Here, we see that our algorithm achieves higher clustering quality for datasets composed of four-dimensional Gaussians with a varied number of clusters (left), as well as for varied dimensionality of the input Gaussians with k = 5 (right).

generated by choosing a mean vector uniformly at random from the unit simplex and randomly selecting a covariance matrix from the set of matrices with eigenvalues 1, 2, ..., d. In Figure 2, we compare our algorithm to the k-means algorithm, which clusters each object solely on the mean of the samples. Accuracy is quantified in terms of normalized mutual information (NMI) between discovered clusters and the true clusters, a standard technique for determining the quality of clusters. Figure 2 (left) shows the clustering quality as a function of the number of clusters when the dimensionality of the input Gaussians is fixed (d = 4). Figure 2 (right) gives clustering quality for five clusters across a varying number of dimensions. All results represent averaged NMI values across 50 experiments. As can be seen in Figure 2, our multivariate Gaussian clustering algorithm yields significantly higher NMI values than k-means for all experiments.

4.2 Sensor Networks

Sensor networks are wireless networks composed of small, low-cost sensors that monitor their surrounding environment. An open question in sensor networks research is how to minimize communication costs between the sensors and the base station: wireless communication requires a relatively large amount of power, a limited resource on current sensor devices (which are usually battery powered). A recently proposed sensor network system, BBQ [4], reduces communication costs by modelling sensor network data at each sensor device using a time-varying multivariate Gaussian and transmitting only model parameters to the base station. We apply our multivariate Gaussian clustering algorithm to cluster sensor devices from the Intel Lab at Berkeley [8]. 
Clustering has been used in sensor network applications to determine efficient routing schemes, as well as for discovering trends between groups of sensor devices. The Intel sensor network consists of 52 working sensors, each of which monitors ambient temperature, humidity, light levels, and voltage every thirty seconds. Conditioned on time, the sensor readings can be fit quite well by a multivariate Gaussian. Figure 3 shows the results of our multivariate Gaussian clustering algorithm applied to this sensor network data. For each device, we compute the sample mean and covariance from sensor readings between noon and 2pm each day, for 36 total days. To account for varying scales of measurement, we normalize all variables to have unit variance. The second cluster (denoted by `2' in Figure 3) has the largest variance among all clusters: many of the sensors in this cluster are located in high traffic areas, including the large conference room at the top of the lab and the smaller tables at the bottom of the lab. Since the measurements were taken during lunchtime, we expect higher traffic in these areas. Interestingly, this cluster shows very high co-variation between humidity and voltage. Cluster one is characterized by high temperatures, which is not surprising, as there are several windows on the left side of the lab; these windows face west and have an unobstructed view of the ocean. Finally, cluster three has a moderate level of total variation, with relatively low light levels. This cluster is primarily located in the center and the right of the lab, away from outside windows.

Figure 3: To reduce communication costs in sensor networks, each sensor device may be modelled by a multivariate Gaussian. 
The above plot shows the results of applying our algorithm to cluster sensors into three groups, denoted by labels `1', `2', and `3'.

4.3 Statistical Debugging

Leveraging program runtime statistics for the purpose of software debugging has received recent research attention [12]. Here we apply our algorithm to cluster functional behavior patterns over software bugs in the LaTeX document preparation program. The data is taken from the Navel system [7], a system that uses machine learning to provide better error messaging. The dataset contains four software bugs, each of which causes an unsuccessful LaTeX compilation (e.g. specifying an incorrect number of columns in an array environment) with ambiguous or unclear error messages. LaTeX has notoriously cryptic error messages for document compilation failures--for example, the message \"LaTeX Error: There's no line here to end\" can be caused by numerous problems in the source document. Each function in the program's source is measured by the frequency with which it is called across each of the four software bugs. We model this distribution as a 4-dimensional multivariate Gaussian, one dimension for each bug. The distributions are estimated from a set of samples; each sample corresponds to a single LaTeX file drawn from a set of grant proposals and submitted computer science research papers. For each file and for each of the four bugs, the LaTeX compiler is executed over a slightly modified version of the file that has been changed to exhibit the bug. During program execution, function counts are measured and recorded. More details can be found in [7]. Clustering these function counts can yield important debugging insight to assist a software engineer in understanding error dependent program behavior. Figure 4 shows three covariance matrices from a sample clustering of eight clusters. 
To capture the dependencies between bugs, we normalize each input Gaussian to have zero mean and unit variance. Cluster (a) represents functions that are highly error independent--i.e. the matrix shows high levels of covariation among all pairs of error classes. Conversely, clusters (b) and (c) show that some functions are highly error dependent. Cluster (b) shows a high dependency between bugs 1 and 4, while cluster (c) exhibits high covariation between bugs 1 and 3, and between bugs 2 and 4.

Figure 4: Covariance matrices for three clusters, (a), (b), and (c), discovered by clustering functional behavior of the LaTeX document preparation program. Cluster (a) corresponds to functions which are error-independent, while clusters (b) and (c) represent two groups of functions that exhibit different types of error dependent behavior.

5 Related Work

In this work, we showed that the differential relative entropy between two multivariate Gaussian distributions can be expressed as a convex combination of the Mahalanobis distance between their mean vectors and the Burg matrix divergence between their covariances. This is in contrast to information theoretic clustering [5], where each input is taken to be a probability distribution over some finite set. In [5], no parametric form is assumed, and the Kullback-Leibler divergence (i.e. discrete relative entropy) can be computed directly from the distributions. The differential entropy between two multivariate Gaussians was considered in [10] in the context of solving Gaussian mixture models. Although an algebraic expression for this differential entropy was given in [10], no connection to the Burg matrix divergence was made there. 
Our algorithm is based on the standard expectation-maximization style clustering algorithm [6]. Although the closed-form updates used by our algorithm are similar to those employed by a Bregman clustering algorithm [1], we note that the computation of the optimal covariance matrix (equation (9)) involves the optimal mean vector. In [9], the problem of clustering Gaussians with respect to the symmetric differential relative entropy, D(f || g) + D(g || f), is considered in the context of learning HMM parameters for speech recognition. The resulting algorithm, however, is much more computationally expensive than ours: whereas in our method the optimal mean and covariance parameters can be computed via a simple closed-form solution, no such solution is presented in [9], and an iterative method must instead be employed. The problem of finding the optimal Gaussian with respect to the first argument (note that equation (6) is minimized with respect to the second argument) is considered in [11] for the problem of speaker interpolation. There, only one source is assumed, and thus clustering is not needed.

6 Conclusions

We have presented a new algorithm for the problem of clustering multivariate Gaussian distributions. Our algorithm is derived in an information theoretic context, which leads to interesting connections with the differential entropy between multivariate Gaussians, and Bregman divergences. Unlike existing clustering algorithms, our algorithm optimizes both first and second order information in the data. We have demonstrated the use of our method on sensor network data and a statistical debugging application.

References
[1] A. Banerjee, S. Merugu, I. Dhillon, and S. Ghosh. Clustering with Bregman divergences. In SIAM International Conference on Data Mining, pages 234-245, 2004.
[2] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, volume 7, pages 200-217, 1967.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, 1991.
[4] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-based approximate querying in sensor networks. International Journal of Very Large Data Bases, 2005.
[5] I. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, volume 3, pages 1265-1287, 2003.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2001.
[7] J. Ha, H. Ramadan, J. Davis, C. Rossbach, I. Roy, and E. Witchel. Navel: Automating software support by classifying program behavior. Technical Report TR-06-11, University of Texas at Austin, 2006.
[8] S. Madden. Intel lab data. http://berkeley.intel-research.net/labdata, 2004.
[9] T. Myrvoll and F. Soong. On divergence based clustering of normal distributions and its application to HMM adaptation. In Eurospeech, pages 1517-1520, 2003.
[10] Y. Singer and M. Warmuth. Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy. In Neural Information Processing Systems, 1998.
[11] T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis. In European Conference on Speech Communication and Technology, 1997.
[12] A. Zheng, M. Jordan, B. Liblit, and A. Aiken. Statistical debugging of sampled programs. In Neural Information Processing Systems, 2004.
", "award": [], "sourceid": 3137, "authors": [{"given_name": "Jason", "family_name": "Davis", "institution": null}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": null}]}