{"title": "Convex Clustering with Exemplar-Based Models", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 832, "abstract": null, "full_text": "Convex Clustering with Exemplar-Based Models\n\nDanial Lashkari\n\nPolina Golland\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\n{danial, polina}@csail.mit.edu\n\nAbstract\n\nClustering is often formulated as the maximum likelihood estimation of a mixture\nmodel that explains the data. The EM algorithm widely used to solve the resulting\noptimization problem is inherently a gradient-descent method and is sensitive to\ninitialization. The resulting solution is a local optimum in the neighborhood of\nthe initial guess. This sensitivity to initialization presents a signi\ufb01cant challenge\nin clustering large data sets into many clusters. In this paper, we present a dif-\nferent approach to approximate mixture \ufb01tting for clustering. We introduce an\nexemplar-based likelihood function that approximates the exact likelihood. This\nformulation leads to a convex minimization problem and an ef\ufb01cient algorithm\nwith guaranteed convergence to the globally optimal solution. The resulting clus-\ntering can be thought of as a probabilistic mapping of the data points to the set of\nexemplars that minimizes the average distance and the information-theoretic cost\nof mapping. We present experimental results illustrating the performance of our\nalgorithm and its comparison with the conventional approach to mixture model\nclustering.\n\n1 Introduction\nClustering is one of the most basic problems of unsupervised learning with applications in a wide\nvariety of \ufb01elds. The input is either vectorial data, that is, vectors of data points in the feature\nspace, or proximity data, the pairwise similarity or dissimilarity values between the data points. 
The choice of the clustering cost function and the optimization algorithm employed to solve the problem determines the resulting clustering [1]. Intuitively, most methods seek compact clusters of data points, namely, clusters with relatively small intra-cluster and high inter-cluster distances. Other approaches, such as Spectral Clustering [2], look for clusters of more complex shapes lying on some low-dimensional manifolds in the feature space. These methods typically transform the data such that the manifold structures get mapped to compact point clouds in a different space. Hence, they do not remove the need for efficient compact-cluster-finding techniques such as k-means.\nThe widely used soft k-means method is an instance of maximum likelihood fitting of a mixture model through the EM algorithm. Although this approach yields satisfactory results for problems with a small number of clusters and is relatively fast, its use of a gradient-descent algorithm for minimization of a cost function with many local optima makes it sensitive to initialization. As the search space grows, that is, as the number of data points or clusters increases, it becomes harder to find a good initialization. This problem often arises in emerging applications of clustering to large biological data sets such as gene expression. Typically, one runs the algorithm many times with different random initializations and selects the best solution. More sophisticated initialization methods have been proposed to improve the results, but the challenge of finding a good initialization for the EM algorithm remains [4].\nWe aim to circumvent the initialization procedure by designing a convex problem whose global optimum can be found with a simple algorithm. It has been shown that mixture modeling can be formulated as an instance of iterative distance minimization between two sets of probability distributions [3]. 
This formulation shows that the non-convexity of the mixture-modeling cost function comes from the parametrization of the model components. More precisely, any mixture model is, by definition, a convex combination of some set of distributions. However, for a fixed number of mixture components, the set of all such mixture models is usually not convex when the distributions have, say, free mean parameters, as in the case of normal distributions. Inspired by combinatorial, non-parametric methods such as k-medoids [5] and affinity propagation [6], our main idea is to employ the notion of exemplar finding, namely, finding the data points that best describe the data set. We assume that the clusters are dense enough that there is always a data point very close to the true cluster centroid and, thus, restrict the set of possible cluster means to the set of data points. Further, by taking all data points as exemplar candidates, the modeling cost function becomes convex. A variant of the EM algorithm finds the globally optimal solution.\nConvexity of the cost function means that the algorithm will unconditionally converge to the global minimum. Moreover, since the number of clusters is not specified a priori, the algorithm automatically finds the number of clusters depending only on one temperature-like parameter. This parameter, which is equivalent to a common fixed variance in the case of Gaussian models, defines the width scale of the desired clusters in the feature space. Our method works exactly the same way with both proximity and vectorial data, unifying their treatment and providing insights into the modeling assumptions underlying the conversion of feature vectors into pairwise proximity data.\nIn the next section, we introduce our maximum likelihood function and the algorithm that maximizes it. 
In Section 3, we make a connection to rate-distortion theory as a way to build intuition about our objective function. Section 4 presents implementation details of our algorithm. Experimental results comparing our method with a similar mixture-model fitting method are presented in Section 5, followed by a discussion of the algorithm and the related work in Section 6.\n\n2 Convex Cost Function\nGiven a set of data points X = {x_1, \u00b7\u00b7\u00b7, x_n} \u2282 IR^d, mixture model clustering seeks to maximize the scaled log-likelihood function\n l({q_j}_{j=1}^k, {m_j}_{j=1}^k; X) = (1/n) \u2211_{i=1}^n log[ \u2211_{j=1}^k q_j f(x_i; m_j) ] ,  (1)\nwhere f(x; m) is an exponential family distribution on the random variable X. It has been shown that there is a bijection between regular exponential families and a broad family of divergences called Bregman divergences [7]. Most of the well-known distance measures, such as the Euclidean distance or the Kullback-Leibler divergence (KL-divergence), are included in this family. We employ this relationship and let our model be an exponential family distribution on X of the form f(x; m) = C(x) exp(-d_\u03c6(x, m)), where d_\u03c6 is some Bregman divergence and C(x) is independent of m. Note that with this representation, m is the expected value of X under the distribution f(x; m). For instance, taking the Euclidean distance as the divergence, we obtain the normal distribution as our model f.\nIn this work, we take models of the above form whose parameters m lie in the same space as the data vectors. Thus, we can restrict the set of mixture components to the distributions centered at the data points, i.e., m_j \u2208 X. Yet, for a specified number of clusters k, the problem still has the combinatorial nature of choosing the right k cluster centers among n data points. To avoid this problem, we increase the number of possible components to n and represent all data points as cluster-center candidates. 
The new log-likelihood function is\n l({q_j}_{j=1}^n; X) = (1/n) \u2211_{i=1}^n log \u2211_{j=1}^n q_j f_j(x_i) = (1/n) \u2211_{i=1}^n log[ \u2211_{j=1}^n q_j e^{-\u03b2 d_\u03c6(x_i, x_j)} ] + const. ,  (2)\nwhere f_j(x) is an exponential family member with its expectation parameter equal to the jth data vector and the constant denotes a term that does not depend on the unknown variables {q_j}_{j=1}^n. The constant scaling factor \u03b2 in the exponent controls the sharpness of mixture components. We maximize l(\u00b7; X) over the set of all mixture distributions Q = { Q | Q(\u00b7) = \u2211_{j=1}^n q_j f_j(\u00b7) }.\nThe log-likelihood function (2) can be expressed in terms of the KL-divergence by defining \u02c6P(x) = 1/n, x \u2208 X, to be the empirical distribution of the data on IR^d and by noting that\n D(\u02c6P \u2016 Q) = -\u2211_{x\u2208X} \u02c6P(x) log Q(x) - H(\u02c6P) = -l({q_j}_{j=1}^n; X) + const. ,  (3)\nwhere H(\u02c6P) is the entropy of the empirical distribution and does not depend on the unknown mixture coefficients {q_j}_{j=1}^n. Consequently, the maximum likelihood problem can be equivalently stated as the minimization of the KL-divergence between \u02c6P and the set of mixture distributions Q.\nIt is easy to see that unlike the unconstrained set of mixture densities considered by the likelihood function (1), the set Q is convex. 
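The likelihood (2) is straightforward to evaluate numerically. As a concrete illustration, the following sketch computes it for vectorial data using the squared Euclidean distance as the Bregman divergence; the function name and that choice of divergence are ours, not fixed by the formulation:

```python
import numpy as np

def exemplar_log_likelihood(X, q, beta):
    """Scaled log-likelihood (2), up to its additive constant.

    X : (n, d) array of data points; q : (n,) mixture weights on exemplars.
    Uses d_phi(x, y) = ||x - y||^2 as an illustrative Bregman divergence.
    """
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # d_phi(x_i, x_j)
    S = np.exp(-beta * D)                 # s_ij = exp(-beta d_phi(x_i, x_j))
    return float(np.mean(np.log(S @ q)))  # (1/n) sum_i log sum_j q_j s_ij
```

Because the weights q_j enter linearly inside the logarithm, this function is concave in q over the simplex, which is the source of the convexity of the equivalent minimization problem.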
Our formulation therefore leads to a convex minimization problem. Furthermore, it is proved in [3] that for such a problem, the sequence of distributions Q^{(t)} with corresponding weights {q_j^{(t)}}_{j=1}^n defined iteratively via\n q_j^{(t+1)} = q_j^{(t)} \u2211_{x\u2208X} \u02c6P(x) f_j(x) / \u2211_{j'=1}^n q_{j'}^{(t)} f_{j'}(x)  (4)\nis guaranteed to converge to the globally optimal solution Q* if the support of the initial distribution is the entire index set, i.e., q_j^{(0)} > 0 for all j.\n\n3 Connection to Rate-Distortion Problems\nNow, we present an equivalent statement of our problem on the product set of exemplars and data points. This alternative formulation views our method as an instance of lossy data compression and directly implies the optimality of the algorithm (4).\nThe following proposition is introduced and proved in [3]:\nProposition 1. Let Q' be the set of distributions of the complete data random variable (J, X) \u2208 {1, \u00b7\u00b7\u00b7, n} \u00d7 IR^d with elements Q'(j, x) = q_j f_j(x). Let P' be the set of all distributions on the same random variable (J, X) which have \u02c6P as their marginal on X. Then,\n min_{Q\u2208Q} D(\u02c6P \u2016 Q) = min_{P'\u2208P', Q'\u2208Q'} D(P' \u2016 Q') ,  (5)\nwhere Q is the set of all marginal distributions of elements of Q' on X. Furthermore, if Q* and (P'*, Q'*) are the corresponding optimal arguments, Q* is the marginal of Q'*.\nThis proposition implies that we can express our problem of minimizing (3) as minimization of D(P' \u2016 Q') where P' and Q' are distributions of the random variable (J, X). Specifically, we define\n Q'(j, x) = q_j C(x) e^{-\u03b2 d_\u03c6(x, x_j)} ,  P'(j, x) = \u02c6P(x) P'(j|x) = { (1/n) r_{ij}, if x = x_i \u2208 X; 0, otherwise } ,  (6)\nwhere q_j and r_{ij} = P'(j | x = x_i) are probability distributions over the set {1, \u00b7\u00b7\u00b7, n}. 
This formulation ensures that P' \u2208 P', Q' \u2208 Q' and the objective function is expressed only in terms of the variables q_j and P'(j|x) for x \u2208 X. Our goal is then to solve the minimization problem in the space of distributions of the random variable (J, I) \u2208 {1, \u00b7\u00b7\u00b7, n} \u00d7 {1, \u00b7\u00b7\u00b7, n}, namely, in the product space of exemplar \u00d7 data point indices. Substituting expressions (6) into the KL-divergence D(P' \u2016 Q'), we obtain the equivalent cost function\n D(P' \u2016 Q') = (1/n) \u2211_{i,j=1}^n r_{ij} [ log(r_{ij}/q_j) + \u03b2 d_\u03c6(x_i, x_j) ] + const.  (7)\nIt is straightforward to show that for any set of values r_{ij}, setting q_j = (1/n) \u2211_i r_{ij} minimizes (7). Substituting this expression into the cost function, we obtain the final expression\n D(P' \u2016 Q'*(P')) = (1/n) \u2211_{i,j=1}^n r_{ij} [ log( r_{ij} / ((1/n) \u2211_{i'} r_{i'j}) ) + \u03b2 d_\u03c6(x_i, x_j) ] + const. = I(I; J) + \u03b2 E_{I,J} d_\u03c6(x_I, x_J) + const. ,  (8)\nwhere the first term is the mutual information between the random variables I (data points) and J (exemplars) under the distribution P' and the second term is the expected value of the pairwise distances under the same distribution on indices. The n^2 unknown values of r_{ij} lie on n separate n-dimensional simplices. These parameters have the same role as cluster responsibilities in soft k-means: they stand for the probability of data point x_i choosing data point x_j as its cluster-center. The algorithm described in (4) is in fact the same as the standard Arimoto-Blahut algorithm [10] commonly used for solving problems of the form (8).\nWe established that the problem of maximizing the log-likelihood function (2) is equivalent to the minimization of the objective function (8). This helps us to interpret this problem in the framework of rate-distortion theory. 
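The two terms of (8) can be computed directly from a given set of soft assignments; a minimal sketch (the names are ours, and the rate comes out in nats since natural logarithms are used throughout):

```python
import numpy as np

def rate_and_distortion(r, D):
    """Evaluate the two terms of (8): the rate I(I; J) and the average
    distortion E[d_phi(x_I, x_J)], for soft assignments r (rows sum to 1)
    and pairwise divergences D, under the uniform empirical P(i) = 1/n.
    """
    n = r.shape[0]
    q = r.mean(axis=0)          # q_j = (1/n) sum_i r_ij, the exemplar weights
    P = r / n                   # joint distribution P'(j, x_i) on index pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        rate = float(np.where(r > 0, P * np.log(r / q), 0.0).sum())
    distortion = float((P * D).sum())
    return rate, distortion
```

For a deterministic one-to-one assignment (r the identity matrix) the rate is log n and the distortion is zero, which is the lossless extreme of the trade-off; a uniform r gives zero rate and the largest average distortion.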
The data set can be thought of as an information source with a uniform\ndistribution on the alphabet X . Such a source has entropy log n, which means that any scheme for\nencoding an in\ufb01nitely long i.i.d. sequence generated by this source requires on average this number\nof bits per symbol, i.e., has a rate of at least log n. We cannot compress the information source\nbeyond this rate without tolerating some distortion, when the original data points are encoded into\nother points with nonzero distances between them. We can then consider rij\u2019s as a probabilistic\nencoding of our data set onto itself with the corresponding average distortion D = EI,J d\u03c6(xi, xj)\nand the rate I(I; J). A solution r\u2217\nij that minimizes (8) for some \u03b2 yields the least rate that can be\nachieved having no more than the corresponding average distortion D. This rate is usually denoted\nby R(D), a function of average distortion, and is called the rate-distortion function [8]. Note that\nwe have \u2202R/\u2202D = \u2212\u03b2, 0 < \u03b2 < \u221e at any point on the rate-distortion function graph. The weight\nqj for the data point xj is a measure of how likely this point is to appear in the compressed repre-\nsentation of the data set, i.e., to be an exemplar. Here, we can rigorously quantify our intuitive idea\nthat higher number of clusters (corresponding to higher rates) is the inherent cost of attaining lower\naverage distortion. We will see an instance of this rate-distortion trade-off in Section 5.\n\n4 Implementation\nThe implementation of our algorithm costs two matrix-vector multiplications per iteration, that\nis, has a complexity of order n2 per iteration, if solved with no approximations. 
Letting s_{ij} = exp(-\u03b2 d_\u03c6(x_i, x_j)) and using two auxiliary vectors z and \u03b7, we obtain the simple update rules\n z_i^{(t)} = \u2211_{j=1}^n s_{ij} q_j^{(t)} ,  \u03b7_j^{(t)} = (1/n) \u2211_{i=1}^n s_{ij} / z_i^{(t)} ,  q_j^{(t+1)} = \u03b7_j^{(t)} q_j^{(t)} ,  (9)\nwhere the initialization q_j^{(0)} is nonzero for all the data points we want to consider as possible exemplars. At the fixed point, the values of \u03b7_j are equal to 1 for all data points in the support of q_j and are less than 1 otherwise [10]. In practice, we compute the gap between max_j (log \u03b7_j) and \u2211_j q_j log \u03b7_j in each iteration and stop the algorithm when this gap becomes less than a small threshold. Note that the soft assignments r_{ij}^{(t)} = q_j^{(t)} s_{ij} / z_i^{(t)} need to be computed only once, after the algorithm has converged.\nAny value of \u03b2 \u2208 [0, \u221e) yields a different solution to (8) with a different number of nonzero q_j values. Smaller values of \u03b2 correspond to wider clusters and greater values correspond to narrower clusters. Neither extreme, one assigning all data points to the central exemplar and the other taking all data points as exemplars, is interesting. For reasonable ranges of \u03b2, the solution is sparse and the resulting number of nonzero components of q_j determines the final number of clusters.\nSimilar to other interior-point methods, the convergence of our algorithm becomes slow as we move close to the vertices of the probability simplex, where some q_j's are very small. In order to improve the convergence rate, after each iteration, we identify all q_j's that are below a certain threshold (10^{-3}/n in our experiments), set them to zero, and re-normalize the entire distribution over the remaining indices. 
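Putting the pieces together, the updates (9), the stopping criterion, and the pruning step can be sketched in a few lines of NumPy (the function name, tolerance, and default thresholds are our own illustrative choices):

```python
import numpy as np

def convex_cluster(D, beta, tol=1e-8, prune=1e-3, max_iter=100000):
    """Run the updates (9) on a matrix D of pairwise divergences.

    Returns the exemplar weights q and, computed once at the end,
    the soft assignments r_ij = q_j s_ij / z_i.
    """
    n = D.shape[0]
    S = np.exp(-beta * D)                    # s_ij = exp(-beta d_phi(x_i, x_j))
    q = np.full(n, 1.0 / n)                  # q^(0) positive on all candidates
    for _ in range(max_iter):
        z = S @ q                            # z_i = sum_j s_ij q_j
        eta = (S / z[:, None]).mean(axis=0)  # eta_j = (1/n) sum_i s_ij / z_i
        q = eta * q
        sup = q > 0
        # stop when max_j log eta_j and sum_j q_j log eta_j agree
        gap = np.log(eta[sup]).max() - q[sup] @ np.log(eta[sup])
        q[q < prune / n] = 0.0               # prune near-zero weights and
        q /= q.sum()                         # re-normalize over the rest
        if gap < tol:
            break
    z = S @ q
    r = (q[None, :] * S) / z[:, None]        # responsibilities, computed once
    return q, r
```

Each pass costs the two matrix-vector-sized operations per iteration noted above, i.e., order n^2 work per iteration when run without the sparsification discussed next.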
This effectively excludes the corresponding points as possible exemplars and reduces the cost of the following iterations.\nIn order to further speed up the algorithm for very large data sets, we can search over the values of s_{ij} for each i and keep only the largest n_o values in any row, turning the proximity matrix into a sparse one. The reasoning is simply that we expect any point to be represented in the final solution by exemplars relatively close to it. We observed that as long as n_o is a few times greater than the expected number of data points in each cluster, the final results remain almost the same with or without this preprocessing. However, this approximation decreases the running time of the algorithm by a factor of n/n_o.\nFigure 1: Left: rate-distortion function for the example described in the text. The line with slope -\u03b2_o is also illustrated for comparison (dotted line), as well as the point corresponding to \u03b2 = \u03b2_o (cross) and the line tangent to the graph at that point. Right: the exponential of the rate (dotted line) and the number of hard clusters for different values of \u03b2 (solid line). The rate is bounded above by the logarithm of the number of clusters.\n\n5 Experimental Results\nTo illustrate some general properties of our method, we apply it to the set of 400 random data points in IR^2 shown in Figure 2. We use the Euclidean distance and run the algorithm for different values of \u03b2. Figure 1 (left) shows the resulting rate-distortion function for this example. As we expect, the estimated rate-distortion function is smooth, monotonically decreasing, and convex. To visualize the clustering results, we turn the soft responsibilities into hard assignments. Here, we first choose the set of exemplars to be the set of all indices j that are MAP estimate exemplars for some data point i under P'(j|x_i). Then, any point is assigned to its closest exemplar. Figure 2 illustrates the shapes of the resulting hard clusters for different values of \u03b2. Since \u03b2 has dimensions of inverse variance in the case of Gaussian models, we chose an empirical value \u03b2_o = n^2 log n / \u2211_{i,j} \u2016x_i - x_j\u2016^2 so that values of \u03b2 around \u03b2_o give reasonable results. We can see how clusters split when we increase \u03b2. Such cluster-splitting behavior also occurs in the case of a Gaussian mixture model with unconstrained cluster centers and has been studied as the phase transitions of a corresponding statistical system [9]. The nature of this connection remains to be further investigated.\nThe resulting number of hard clusters for different values of \u03b2 is shown in Figure 1 (right). The figure indicates two regions of \u03b2 with a relatively stable number of clusters, namely 4 and 10, while other cluster numbers have a more transitory nature with varying \u03b2. The distribution of data points in Figure 2 shows that these are reasonable choices of the number of clusters for this data set. However, we also observe some fluctuations in the number of clusters even in the more stable regimes of \u03b2. Comparing this behavior with the monotonicity of our rate shows how, by turning the soft assignments into hard ones, we lose the strong optimality guarantees we have for the original soft solution. Nevertheless, since our global optimum is the minimum of a well-justified cost function, we expect to obtain relatively good hard assignments. We further discuss this aspect of the formulation in Section 6.\nThe main motivation for developing a convex formulation of clustering is to avoid the well-known problem of local optima and sensitivity to initialization. We compare our method with a regular mixture model of the form (1) where f(x; m) is a Gaussian distribution and the problem is solved using the EM algorithm. We will refer to this regular mixture model as the soft k-means. 
The k-means algorithm is a limiting case of this mixture-model problem when \u03b2 \u2192 \u221e, hence the name soft k-means. The comparison will illustrate how employing convexity helps us better explore the search space as the problem grows in complexity. We use synthetic data sets made by drawing points from unit-variance Gaussian distributions centered around a set of vectors.\nThere is an important distinction between the soft k-means and our algorithm: although the results of both algorithms depend on the choice of \u03b2, only the soft k-means needs the number of clusters k as an input. We run the two algorithms for five different values of \u03b2 which were empirically found to yield reasonable results for the problems presented here. As a measure of clustering quality, we use micro-averaged precision. We form the contingency tables for the cluster assignments found by the algorithm and the true cluster labels. The percentage of the total number of data points assigned to the right cluster is taken as the precision value of the clustering result. Out of the five runs with different values of \u03b2, we take the result with the best precision value for each of the two algorithms.\nFigure 2: The clusters found for different values of \u03b2: (a) 0.1\u03b2_o (b) 0.5\u03b2_o (c) \u03b2_o (d) 1.2\u03b2_o (e) 1.6\u03b2_o (f) 1.7\u03b2_o. The exemplar data point of each cluster is denoted by a cross. The range of normal distributions for any mixture model is illustrated here by circles around these exemplar points with radius equal to the square root of the variance corresponding to the value of \u03b2 used by the algorithm (\u03c3 = (2\u03b2)^{-1/2}). Shapes and colors denote cluster labels.\nIn the first experiment, we look at the performance of the two algorithms as the number of clusters increases. 
Different data sets are generated by drawing 3000 data points around some number of cluster centers in IR^{20}, with all clusters having the same number of data points. Each component of any data-point vector comes from an independent Gaussian distribution with unit variance around the value of the corresponding component of its cluster center. Further, we randomly generate the components of the cluster-center vectors from a Gaussian distribution with variance 25 around zero.\nIn this experiment, for each value of \u03b2, we repeat soft k-means 1000 times with random initialization and pick the solution with the highest likelihood value. Figure 3 (left) presents the precision values as a function of the number of clusters in the mixture distribution that generates the 3000 data points. The error bars summarize the standard deviation of precision over 200 independently generated data sets. We can see that the performance of soft k-means drops as the number of clusters increases while our performance remains relatively stable. Consequently, as illustrated in Figure 3 (right), the average precision difference of the two algorithms increases with the increasing number of clusters.\nFigure 3: Left: average precision values of Convex Clustering and Soft k-means for different numbers of clusters in 200 data sets of 3000 data points. Right: precision gain of using Convex Clustering in the same experiment.\nSince the total number of data points remains the same, increasing the number of clusters results in increasing complexity of the problem, with presumably more local minima of the cost function. This trend agrees with our expectation that the results of the convex algorithm improve relative to the original one with a larger search space.\nAs another way of exploring the complexity of the problem, in our second experiment, we generate data sets with different dimensionality. We draw 100 random vectors, with a unit-variance Gaussian distribution in each component, around each of the 40 cluster centers to make data sets of 4000 total data points. The cluster centers are chosen to be of the form (0, \u00b7\u00b7\u00b7, 0, \u221a50, 0, \u00b7\u00b7\u00b7, 0), where we change the position of the nonzero component to make different cluster centers. In this way, the pairwise distance between all cluster centers is the same (\u221a(50+50) = 10) by construction.\nFigure 4 (left) presents the precision values found for the two algorithms when the 4000 points lie in spaces of different dimensionality. Soft k-means was repeated 100 times with random initialization for each value of \u03b2. Again, the relative performance of Convex Clustering when compared to soft k-means improves with increasing problem complexity. This is further evidence that for larger data sets the less precise nature of our constrained search, as compared to the full mixture models, is well compensated by its ability to always find its global optimum. In general, the value of \u03b2 should be tuned to find the desired solution. 
We plan to develop a more systematic way for choosing \u03b2.\n\n6 Discussion and Related Work\nSince only the distances take part in our formulation and the values of the data point vectors are not required, we can extend this method to any proximity data. Given a matrix D_{n\u00d7n} = [d_{ij}] that describes the pairwise symmetric or asymmetric dissimilarities between data points, we can replace the d_\u03c6(x_i, x_j)'s in (8) with the d_{ij}'s and solve the same minimization problem, whose convexity can be directly verified. The algorithm works in exactly the same way and all the aforementioned properties carry over to this case as well.\nA previous application of rate-distortion theoretic ideas in clustering led to deterministic annealing (DA). In order to avoid local optima, DA gradually decreases an annealing parameter, tightening the bound on the average distortion [9]. However, at each temperature the same standard EM updates are used. Consequently, the method does not provide strong guarantees on the global optimality of the resulting solution.\nAffinity propagation is another recent exemplar-based clustering algorithm. It finds the exemplars by forming a factor graph and running a message-passing algorithm on the graph as a way to minimize the clustering cost function [6]. If the data point i is represented by the data point c_i, assuming a common preference parameter value \u03bb for all data points, the objective function of affinity propagation can be stated as \u2211_i d_{i c_i} + \u03bbk, where k is the number of found clusters. The second term is needed to put some cost on picking any point as an exemplar, to prevent the trivial case of sending every point to itself. Outstanding results have been reported for affinity propagation [6], but theoretical guarantees on its convergence or optimality are yet to be established.\nFigure 4: Left: average precision values of Convex Clustering and Soft k-means for different data dimensionality in 100 data sets of 4000 data points with 40 clusters. Right: precision gain of using Convex Clustering in the same experiment.\nWe can interpret our algorithm as a relaxation of this combinatorial problem to the soft-assignment case by introducing probabilities P(c_i = j) = r_{ij} of associating point i with an exemplar j. The marginal distribution q_j = (1/n) \u2211_i r_{ij} is the probability that point j is an exemplar. In order to use analytical tools for solving this problem, we have to turn the regularization term k into a continuous function of the assignments. A possible choice might be H(q), the entropy of the distribution q_j, which is bounded above by log k. However, the entropy function is concave, and any local or global minimum of a concave minimization problem over a simplex occurs at an extreme point of the feasible domain, which in our case corresponds to the original combinatorial hard assignments [11]. In contrast, using the mutual information I(I; J) induced by r_{ij} as the regularizing term turns the problem into a convex one. Mutual information is convex and serves as a lower bound on H(q) since it is always less than the entropy of either of its random variables. Now, by letting \u03bb = 1/\u03b2 we arrive at our cost function in (8). 
We can therefore see that our formulation is a convex relaxation of the original combinatorial problem.\nIn conclusion, we proposed a framework for constraining the search space of general mixture models to achieve global optimality of the solution. In particular, our method promises to be useful in problems with large data sets where regular mixture models fail to yield consistent results due to their sensitivity to initialization. We also plan to further investigate generalizations of this idea to models with more elaborate parameterizations.\nAcknowledgements. This research was supported in part by the NIH NIBIB NAMIC U54-EB005149 and NCRR NAC P41-RR13218 grants and by the NSF CAREER grant 0642971.\nReferences\n[1] J. Puzicha, T. Hofmann, and J. M. Buhmann, \u201cTheory of proximity based clustering: Structure detection by optimization,\u201d Pattern Recognition, Vol. 33, No. 4, pp. 617\u2013634, 2000.\n[2] A. Y. Ng, M. I. Jordan, and Y. Weiss, \u201cOn Spectral Clustering: Analysis and an Algorithm,\u201d Advances in Neural Information Processing Systems, Vol. 14, pp. 849\u2013856, 2001.\n[3] I. Csisz\u00e1r and P. Shields, \u201cInformation Theory and Statistics: A Tutorial,\u201d Foundations and Trends in Communications and Information Theory, Vol. 1, No. 4, pp. 417\u2013528, 2004.\n[4] M. Meil\u0103 and D. Heckerman, \u201cAn Experimental Comparison of Model-Based Clustering Methods,\u201d Machine Learning, Vol. 42, No. 1-2, pp. 9\u201329, 2001.\n[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.\n[6] B. J. Frey and D. Dueck, \u201cClustering by Passing Messages Between Data Points,\u201d Science, Vol. 315, No. 5814, pp. 972\u2013976, 2007.\n[7] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, \u201cClustering with Bregman Divergences,\u201d Journal of Machine Learning Research, Vol. 6, No. 6, pp. 1705\u20131749, 2005.\n[8] T. M. Cover and J. A. 
Thomas, Elements of Information Theory, New York, Wiley, 1991.\n[9] K. Rose, \u201cDeterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems,\u201d Proceedings of the IEEE, Vol. 86, No. 11, pp. 2210\u20132239, 1998.\n[10] R. E. Blahut, \u201cComputation of Channel Capacity and Rate-Distortion Functions,\u201d IEEE Transactions on Information Theory, Vol. IT-18, No. 4, pp. 460\u2013473, 1972.\n[11] M. Pardalos and J. B. Rosen, \u201cMethods for Global Concave Minimization: A Bibliographic Survey,\u201d SIAM Review, Vol. 28, No. 3, pp. 367\u2013379, 1986.\n", "award": [], "sourceid": 316, "authors": [{"given_name": "Danial", "family_name": "Lashkari", "institution": null}, {"given_name": "Polina", "family_name": "Golland", "institution": null}]}