{"title": "Robust Bayesian Max-Margin Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 532, "page_last": 540, "abstract": "We present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time cluster documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.", "full_text": "Robust Bayesian Max-Margin Clustering\n\nChangyou Chen\u2020, Jun Zhu\u2021, Xinhua Zhang\u266f\n\n\u2020Dept. of Electrical and Computer Engineering, Duke University, Durham, NC, USA\n\u2021State Key Lab of Intelligent Technology & Systems; Tsinghua National TNList Lab; Dept. of Computer Science & Tech., Tsinghua University, Beijing 100084, China\n\u266fAustralian National University (ANU) and National ICT Australia (NICTA), Canberra, Australia\ncchangyou@gmail.com; dcszj@tsinghua.edu.cn; xinhua.zhang@anu.edu.au\n\nAbstract\n\nWe present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. 
The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present the max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time clustering documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.\n\n1 Introduction\n\nExisting clustering methods fall roughly into two categories. Deterministic clustering directly optimises some loss functions, while Bayesian clustering models the data generating process and infers the clustering structure via Bayes rule. Typical deterministic methods include the well known kmeans [1], nCut [2], support vector clustering [3], Bregman divergence clustering [4, 5], and the methods built on the very effective max-margin principle [6\u20139]. Although these methods can flexibly incorporate constraints for better performance, it is challenging for them to finely capture hidden regularities in the data, e.g., automated inference of the number of clusters and the hierarchies underlying the clusters. In contrast, Bayesian clustering models are convenient for modelling latent structures, and their posterior distributions can be inferred in a principled fashion. For example, by defining a Dirichlet process (DP) prior on the mixing probability of Gaussian mixtures, Dirichlet process Gaussian mixture models [10] (DPGMM) can infer the number of clusters in the dataset. Other priors on latent structures include the hierarchical cluster structure [11\u201313], co-clustering structure [14], etc. 
However, it is typically difficult for Bayesian clustering to accommodate external constraints such as max-margin. This is because, under standard Bayesian inference, designing informative priors (if any exist) that satisfy these constraints is highly challenging.\n\nTo address this issue, we propose Bayesian max-margin clustering (BMC), which allows max-margin constraints to be flexibly incorporated into a Bayesian clustering model. Distinct from the traditional max-margin clustering, BMC is fully Bayesian and enables probabilistic inference of the number of clusters or the latent feature representations of data. Technically, BMC leverages the regularized Bayesian inference (RegBayes) principle [15], which has shown promise on supervised learning tasks, such as classification [16, 17], link prediction [18], and matrix factorisation [19], where max-margin constraints are introduced to improve the discriminative power of a Bayesian model. However, little exploration has been devoted to the unsupervised setting, due in part to the absence of true labels that makes it technically challenging to enforce max-margin constraints. BMC constitutes a first extension of RegBayes to the unsupervised clustering task. Note that distinct from the clustering models using the maximum entropy principle [20, 21] or posterior regularisation [22], BMC is more general due to the intrinsic generality of RegBayes [15].\n\nWe demonstrate the flexibility and effectiveness of BMC by two concrete instantiations. The first is the Dirichlet process max-margin Gaussian mixture (DPMMGM), a nonparametric Bayesian clustering model that relaxes the Gaussian assumption underlying DPGMM by incorporating max-margin constraints, and is able to infer the number of clusters in the raw input space. To further discover latent feature representations, we propose the max-margin clustering topic model (MMCTM). 
As a topic model, it performs max-margin clustering of documents, while at the same time learning the latent topic representation for each document. For both DPMMGM and MMCTM, we develop efficient MCMC algorithms by exploiting data augmentation techniques. This avoids imposing restrictive assumptions such as in variational Bayes, thereby facilitating the inference of the true posterior. Extensive experiments demonstrate superior clustering performance of BMC over various competitors.\n\n2 Regularized Bayesian Inference\n\nWe first briefly overview the principle of regularised Bayesian inference (RegBayes) [15]. The motivation of RegBayes is to enrich the posterior of a probabilistic model by incorporating additional constraints, under an information-theoretical optimisation formulation. Formally, suppose a probabilistic model has latent variables \u0398, endowed with a prior p(\u0398) (examples of \u0398 will become clear shortly). We also have observations X := {x_1, \u00b7\u00b7\u00b7, x_n}, with x_i \u2208 R^p. Let p(X|\u0398) be the likelihood. Then, posterior inference via Bayes\u2019 theorem is equivalent to solving the following optimisation problem [15]:\n\ninf_{q(\u0398) \u2208 P} KL(q(\u0398) || p(\u0398)) \u2212 E_{\u0398\u223cq(\u0398)}[log p(X|\u0398)]   (1)\n\nwhere P is the space of probability distributions^1, q(\u0398) is the required posterior (here and afterwards we will drop the dependency on X for notation simplicity). In other words, the Bayesian posterior p(\u0398|X) is identical to the optimal solution to (1). The power of RegBayes stems in part from the flexibility of engineering P, which typically encodes constraints imposed on q(\u0398), e.g., via expectations of some feature functions of \u0398 (and possibly the data X). Furthermore, the constraints can be parameterised by some auxiliary variable \u03be. For example, \u03be may quantify the extent to which the constraints are violated, then it is penalised in the objective through a function U. To summarise, RegBayes can be generally formulated as\n\ninf_{\u03be, q(\u0398)} KL(q(\u0398) || p(\u0398)) \u2212 E_{\u0398\u223cq(\u0398)}[log p(X|\u0398)] + U(\u03be)\ns.t. q(\u0398) \u2208 P(\u03be).   (2)\n\nTo distinguish from the standard Bayesian posterior, the optimal q(\u0398) is called the post-data posterior. Under mild regularity conditions, RegBayes admits a generic representation theorem to characterise the solution q(\u0398) [15]. It is also shown to be more general than the conventional Bayesian methods, including those methods that introduce constraints on a prior. Such generality is essential for us to develop a Bayesian framework of max-margin clustering. Note that, as with many sophisticated Bayesian models, posterior inference remains a key challenge in developing novel RegBayes models. Therefore, one of our key technical contributions is on developing efficient and accurate algorithms for BMC, as detailed below.\n\n3 Robust Bayesian Max-margin Clustering\n\nFor clustering, one key assumption of our model is that X forms a latent cluster structure. In particular, let each cluster be associated with a latent projector \u03b7_k \u2208 R^p, which is included in \u0398 and has prior distribution subsumed in p(\u0398). Given any distribution q on \u0398, we then define the compatibility score of x_i with respect to cluster k by using the marginal distribution on \u03b7_k (as \u03b7_k \u2208 \u0398):\n\nF_k(x_i) = E_{q(\u03b7_k)}[\u03b7_k^T x_i] = E_{q(\u0398)}[\u03b7_k^T x_i].   (3)\n\n^1 In theory, we also require that q is absolutely continuous with respect to p to make the KL-divergence well defined. 
The present paper treats this constraint as an implicit assumption for clarity.\n\nFor each example x_i, we introduce a random variable y_i valued in Z_+, which denotes its cluster assignment and is also included in \u0398. Inspired by conventional multiclass SVM [7, 23], we utilize P(\u03be) in RegBayes (2) to encode the max-margin constraints based on F_k(x_i), with the slack variables \u03be_i penalised via their sum in U(\u03be). This amounts to our Bayesian max-margin clustering (BMC):\n\ninf_{\u03be_i \u2265 0, q(\u0398)} L(q(\u0398)) + 2c \u2211_i \u03be_i\ns.t. F_{y_i}(x_i) \u2212 F_k(x_i) \u2265 \u2113 I(y_i \u2260 k) \u2212 \u03be_i, \u2200 i, k   (4)\n\nwhere L(q(\u0398)) = KL(q(\u0398) || p(\u0398)) \u2212 E_{\u0398\u223cq(\u0398)}[log p(X|\u0398)] measures the KL divergence between q and the original Bayesian posterior p(\u0398|X) (up to a constant); I(\u00b7) = 1 if \u00b7 holds true, and 0 otherwise; \u2113 > 0 is a constant scalar of margin. Note we found that the commonly adopted balance constraints in max-margin clustering models [6] either are unnecessary or do not help in our framework. We will address this issue in specific models.\n\nClearly by absorbing the slack variables \u03be, the optimisation problem (4) is equivalent to\n\ninf_{q(\u0398)} L(q(\u0398)) + 2c \u2211_i max(0, max_{k: k\u2260y_i} E_{\u0398\u223cq(\u0398)}[\u03b6_ik])   (5)\n\nwhere \u03b6_ik := \u2113 I(y_i \u2260 k) \u2212 (\u03b7_{y_i} \u2212 \u03b7_k)^T x_i. An exact solution to (5) is hard to compute. An alternative approach is to approximate the posterior by assuming independence between random variables, e.g. variational inference. However, this is usually slow and susceptible to local optima. 
In order to obtain an analytic optimal distribution q that facilitates efficient Bayesian inference, we resort to the technique of the Gibbs classifier [17], which approximates (in fact, upper bounds due to the convexity of the max function) the second term in (5) by an expected hinge loss, i.e., moving the expectation out of the max. This leads to our final formulation of BMC:\n\ninf_{q(\u0398)} L(q(\u0398)) + 2c \u2211_i E_{\u0398\u223cq(\u0398)}[max(0, max_{k: k\u2260y_i} \u03b6_ik)].   (6)\n\nProblem (6) is still much more challenging than existing RegBayes models [17], which are restricted to supervised learning with two classes only. Specifically, BMC allows multiple clusters/classes in an unsupervised setting, and the latent cluster membership y_i needs to be inferred. This complicates the model and brings challenges for posterior inference, as addressed below.\n\nIn a nutshell, our inference algorithms rely on two key steps by exploring data augmentation techniques. First, in order to tackle the multi-class case, we introduce auxiliary variables s_i := arg max_{k: k\u2260y_i} \u03b6_ik. Applying standard derivations in calculus of variations [24] and augmenting the model with {s_i}, we obtain an analytic form of the optimal solution to (6) by augmenting \u0398 (refer to Appendix A for details):\n\nq(\u0398, {s_i}) \u221d p(\u0398|X) \u220f_i exp(\u22122c max(0, \u03b6_{i s_i})).   (7)\n\nSecond, since the max term in (7) obfuscates efficient sampling, we apply the augmentation technique introduced by [17], which showed that q(\u0398, {s_i}) is identical to the marginal distribution of the augmented post-data posterior\n\nq(\u0398, {s_i}, {\u03bb_i}) \u221d p(\u0398|X) \u220f_i \u03c6\u0303_i(\u03bb_i|\u0398),   (8)\n\nwhere \u03c6\u0303_i(\u03bb_i|\u0398) := \u03bb_i^{\u22121/2} exp(\u2212(1/(2\u03bb_i)) (\u03bb_i + c \u03b6_{i s_i})^2). Here \u03bb_i is an augmented variable for x_i that has a generalised inverse Gaussian distribution [25] given \u0398 and x_i.\n\nNote that our two steps of data augmentation are exact and incur no approximation. With the augmented variables ({s_i}, {\u03bb_i}), we can develop efficient sampling algorithms for the augmented posterior q(\u0398, {s_i}, {\u03bb_i}) without restrictive assumptions, thereby allowing us to approach the true target posterior q(\u0398) by dropping the augmented variables. The details will become clear in our subsequent clustering models.\n\n4 Dirichlet Process Max-margin Gaussian Mixture Models\n\nIn (4), we have left unspecified the prior p(\u0398) and the likelihood p(X|\u0398). This section presents an instantiation of Bayesian nonparametric clustering for non-Gaussian data. We will present another instantiation of max-margin document clustering based on topic models in the next section.\n\nFigure 1: Left: Graphical model of DPMMGM. The part excluding \u03b7_k and v corresponds to DPGMM. Right: Graphical model of MMCTM. The one excluding {\u03b7_k} and the arrow between y_i and w_il corresponds to CTM.\n\nHere a convenient model of p(X, \u0398) is a mixture of Gaussians. Let the mean and variance of the k-th cluster component be \u00b5_k and \u039b_k. In a nonparametric setting, the number of clusters is allowed to be infinite, and the cluster y_i that each data point belongs to is drawn from a Dirichlet process [10]. To summarize, the latent variables are \u0398 = {\u00b5_k, \u039b_k, \u03b7_k}_{k=1}^{\u221e} \u222a {y_i}_{i=1}^{n}. 
The prior p(\u0398) is specified as: \u00b5_k and \u039b_k employ a standard Normal-inverse Wishart prior [26]:\n\n\u00b5_k \u223c N(\u00b5_k; m, (r\u039b_k)^{\u22121}), and \u039b_k \u223c IW(\u039b_k; S, \u03bd).   (9)\n\ny_i \u2208 Z_+ has a Dirichlet process prior with parameter \u03b1. \u03b7_k follows a normal prior with mean 0 and variance vI, where I is the identity matrix. The likelihood p(x_i|\u0398) is N(x_i; \u00b5_{y_i}, (r\u039b_{y_i})^{\u22121}), i.e. independent of \u03b7_k. The max-margin constraints take effect in the model via the \u03c6\u0303_i\u2019s in (8). Note this model of p(\u0398, X), apart from \u03b7_k, is effectively the Dirichlet process Gaussian mixture model [10] (DPGMM). Therefore, we call our post-data posterior q(\u0398, {s_i}, {\u03bb_i}) in (8) the Dirichlet process max-margin Gaussian mixture model (DPMMGM). The hyperparameters include m, r, S, \u03bd, \u03b1, v.\n\nInterpretation as a generalised DP mixture. The formula of the augmented post-data posterior in (8) reveals that, compared with DPGMM, each data point is associated with an additional factor \u03c6\u0303_i(\u03bb_i|\u0398). Thus we can interpret DPMMGM as a generalised DP mixture with Normal-inverse Wishart-Normal as the base distribution, and a generalised pseudo likelihood that is proportional to\n\nf(x_i, \u03bb_i | y_i, \u00b5_{y_i}, \u039b_{y_i}, {\u03b7_k}) := N(x_i; \u00b5_{y_i}, (r\u039b_{y_i})^{\u22121}) \u03c6\u0303_i(\u03bb_i|\u0398).   (10)\n\nTo summarise, DPMMGM employs the following generative process with the graphical model shown in Fig. 1 (left):\n\n(\u00b5_k, \u039b_k, \u03b7_k) \u223c N(\u00b5_k; m, (r\u039b_k)^{\u22121}) \u00d7 IW(\u039b_k; S, \u03bd) \u00d7 N(\u03b7_k; 0, vI), k = 1, 2, \u00b7\u00b7\u00b7\nw \u223c Stick-Breaking(\u03b1),\ny_i | w \u223c Discrete(w), i \u2208 [n]\n(x_i, \u03bb_i) | y_i, {\u00b5_k, \u039b_k, \u03b7_k} \u2243 f(x_i, \u03bb_i | y_i, \u00b5_{y_i}, \u039b_{y_i}, {\u03b7_k}), i \u2208 [n]\n\nHere [n] := {1, \u00b7\u00b7\u00b7, n} is the set of integers up to n and \u2243 means that (x_i, \u03bb_i) is generated from a distribution that is proportional to f(\u00b7). Since this normalisation constant is shared by all samples x_i, there is no need to deal with it in posterior inference. Another benefit of this interpretation is that it allows us to use existing techniques for non-conjugate DP mixtures to sample the cluster indicators y_i efficiently, and to infer the number of clusters in the data. This approach is different from previous work on RegBayes nonparametric models where truncated approximation is used to deal with the infinite dimensional model space [15, 18]. In contrast, our method does not rely on any approximation. Note that DPMMGM does not need the complicated class balance constraints [6] because the Gaussians in the pseudo likelihood would balance the clusters to some extent.\n\nPosterior inference. Posterior inference for DPMMGM can be done by efficient Gibbs sampling. We integrate out the infinite dimensional vector w, so the variables that need to be sampled are {\u00b5_k, \u039b_k, \u03b7_k}_k \u222a {y_i, s_i, \u03bb_i}_i. Conditional distributions are derived in Appendix B. Note that we use an extension of the Reused Algorithm [27] to jointly sample (y_i, s_i), which allows it to allocate to empty clusters in the Bayesian nonparametric setting. 
The time complexity is almost the same as DPGMM except for the additional step to sample \u03b7_k, with cost O(p^3). So it would be necessary to put the constraints on a subspace (e.g., by projection) of the original feature space when p is high.\n\n5 Max-margin Clustering Topic Model\n\nAlthough many applications exhibit clustering structures in the raw observed data which can be effectively captured by DPMMGM, it is common that such regularities are more salient in terms of some high-level but latent features. For example, topic distributions are often more useful than word frequency in the task of document clustering. Therefore, we develop a max-margin clustering topic model (MMCTM) in the framework of BMC, which allows topic discovery to co-occur with document clustering in a Bayesian and max-margin fashion. To this end, latent Dirichlet allocation (LDA) [28] needs to be extended by introducing a cluster label into the model, and defining each cluster as a mixture of topic distributions. This cluster-based topic model [29] (CTM) can then be used in concert with BMC to enforce a large margin between clusters in the posterior q(\u0398).\n\nLet V be the size of the word vocabulary, T be the number of topics, K be the number of clusters, and 1_N be an N-dimensional vector of ones. Then the generative process of CTM for the documents goes as:\n1. For each topic t, generate its word distribution \u03c6_t: \u03c6_t | \u03b2 \u223c Dir(\u03b2 1_V).\n2. Draw a base topic distribution \u00b5_0: \u00b5_0 | \u03b1_0 \u223c Dir(\u03b1_0 1_T). Then for each cluster k, generate its topic distribution mixture \u00b5_k: \u00b5_k | \u03b1_1, \u00b5_0 \u223c Dir(\u03b1_1 \u00b5_0).\n3. Draw a base cluster distribution \u03b3: \u03b3 | \u03c9 \u223c Dir(\u03c9 1_K). Then for each document i \u2208 [D]:\n\u2022 Generate a cluster label y_i and a topic distribution \u00b5_i: y_i | \u03b3 \u223c Discrete(\u03b3), \u00b5_i | \u03b1, \u00b5_{y_i} \u223c Dir(\u03b1 \u00b5_{y_i}).\n\u2022 Generate the observed words w_il: z_il \u223c Discrete(\u00b5_i), w_il \u223c Discrete(\u03c6_{z_il}), \u2200 l \u2208 [N_i].\n\nFig. 1 (right) shows the structure. We then augment CTM with max-margin constraints, and get the same posterior as in Eq. (7), with the variables \u0398 corresponding to {\u03c6_t}_{t=1}^{T} \u222a {\u03b7_k, \u00b5_k}_{k=1}^{K} \u222a {\u00b5_0, \u03b3} \u222a {\u00b5_i, y_i}_{i=1}^{D} \u222a {z_il}_{i=1,l=1}^{D,N_i}.\n\nCompared with the raw word space which is normally extremely high-dimensional and sparse, it is more reasonable to characterise the clustering structure in the latent feature space, namely the empirical latent topic distributions as in MedLDA [16]. Specifically, we summarise the topic distribution of document i by x_i \u2208 R^T, whose t-th element is (1/N_i) \u2211_{l=1}^{N_i} I(z_il = t). Then the compatibility score for document i with respect to cluster k is defined similarly to (3) as F_k(x_i) = E_{q(\u0398)}[\u03b7_k^T x_i]. Note, however, the expectation is also taken over x_i since it is not observed.\n\nPosterior inference. To achieve fast mixing, we integrate out {\u03c6_t}_{t=1}^{T} \u222a {\u00b5_0, \u03b3} \u222a {\u00b5_k}_{k=1}^{K} \u222a {\u00b5_i}_{i=1}^{D} in the posterior, thus \u0398 = {y_i}_{i=1}^{D} \u222a {\u03b7_k}_{k=1}^{K} \u222a {z_il}_{i=1,l=1}^{D,N_i}. The integration is straightforward by the Dirichlet-Multinomial conjugacy. The detailed form of the posterior and the conditional distributions are derived in Appendix C. By extending CTM with max-margin, we note that many of the sampling formulas are extensions of those in CTM [29], with additional sampling for \u03b7_k, thus the sampling can be done fairly efficiently.\n\nDealing with vacuous solutions. Different from DPMMGM, the max-margin constraints in MMCTM do not interact with the observed words w_il, but with the latent topic representations x_i (or z_il) that are also inferred from the model. This easily makes the latent representation z_i\u2019s collapse into a single cluster, a vacuous solution plaguing many other unsupervised learning methods as well. One remedy is to incorporate the cluster balance constraints into the model [7]. However, this does not help in our Bayesian setting because apart from a significant increase in computational cost, MCMC often fails to converge in practice^2. Another solution is to morph the problem into a weakly semi-supervised setting, where we assign to each cluster a few documents according to their true label (we will refer to these documents as landmarks), and sample the rest as in the above unsupervised setting. These \u201clabeled examples\u201d can be considered as introducing constraints that are alternative to the balance constraints. Usually only a very small number of labeled documents are needed, thus barely increasing the cost in training and labelling. We will focus on this setting in the experiments.\n\n^2 We observed the cluster sizes kept bouncing with sampling iterations. This is probably due to the highly nonlinear mapping from observed word space to the feature space (topic distribution), making the problem multi-modal, i.e., there are multiple optimal topic assignments in the post-data posterior (8). Also the balance constraints might weaken the max-margin constraints too much.\n\n6 Experiments\n\n6.1 Dirichlet Process Max-margin Gaussian Mixture\n\nWe first show the distinction between our DPMMGM and DPGMM by running both models on the non-Gaussian half-rings data set [30]. There are a number of hyperparameters to be determined, e.g., (\u03b1, r, S, \u03bd, v, c, \u2113); see Section 4. 
It turns out the cluster structure is insensitive to (\u03b1, r, S, \u03bd), and so we use a standard sampling method to update \u03b1 [31], while r, \u03bd, S are sampled by employing Gamma, truncated Poisson, inverse Wishart priors respectively, as is done in [32]. We set v = 0.01, c = 0.1, \u2113 = 5 in this experiment. Note that the clustering structure is sensitive to the values of c and \u2113, which will be studied below. Empirically we find that DPMMGM converges much faster than DPGMM, both converging well within 200 iterations (see Appendix D.4 for examples). In Fig. 2, the clustering structures demonstrate clearly that DPMMGM relaxes the Gaussian assumption of the data distribution, and correctly finds the number of clusters based on the margin boundary, whereas DPGMM produces an overly fragmented partition of the data for the clustering task.\n\nParameter sensitivity. We next study the sensitivity of hyperparameters c and \u2113, with other hyperparameters sampled during inference as above. Intuitively the impact of these parameters is as follows. c controls the weight that the max-margin constraint places on the posterior. If there were no other constraint, the max-margin constraint would drive the data points to collapse into a single cluster. As a result, we expect that a larger value of the weight c will result in fewer clusters. Similarly, increasing the value of \u2113 will lead to a higher loss for any violation of the constraints, thus driving the data points to collapse as well. To test these implications, we run DPMMGM on a 2-dimensional synthetic dataset with 15 clusters [33]. We vary c and \u2113 to study how the cluster structures change with respect to these parameter settings. As can be observed from Fig. 
3, the results indeed follow our intuition, providing a means to control the cluster structure in applications.\n\nFigure 2: An illustration of DPGMM (up) and DPMMGM (bottom).\n\nFigure 3: Clustering structures with varied \u2113 and c: (first row) fixed \u2113 and increasing c; (second row) fixed c and increasing \u2113. Lines are \u03b7\u2019s. Clearly the number of clusters decreases with growing c and \u2113. Panels: (a) c: 5e-6, \u2113: 5e-1; (b) c: 5e-4, \u2113: 5e-1; (c) c: 5e-3, \u2113: 5e-1; (d) c: 5e-2, \u2113: 5e-1; (e) c: 5e-1, \u2113: 5e-1; (f) c: 5e-3, \u2113: 5e-4; (g) c: 5e-3, \u2113: 5e-2; (h) c: 5e-3, \u2113: 5e-1; (i) c: 5e-3, \u2113: 2; (j) c: 5e-3, \u2113: 5.\n\nReal Datasets. As with other clustering models, we test DPMMGM on ten real datasets (small to moderate sizes) from the UCI repository [34]. Scaling up to large datasets is interesting future work. The first three columns of Table 1 list some of the statistics of these datasets (we used random subsets of the three large datasets: Letter, MNIST, and Segmentation).\n\nTable 1: Comparison for different methods on NMI scores. K: true number of clusters. Columns n, p, K are data properties; the remaining columns are NMI.\nDataset | n | p | K | kmeans | nCut | DPGMM | DPMMGM | DPMMGM\u2217\nGlass | 214 | 10 | 7 | 0.37\u00b10.04 | 0.22\u00b10.00 | 0.37\u00b10.05 | 0.46\u00b10.01 | 0.45\u00b10.01\nHalf circle | 300 | 2 | 2 | 0.43\u00b10.00 | 1.00\u00b10.00 | 0.49\u00b10.02 | 0.67\u00b10.02 | 0.51\u00b10.07\nIris | 150 | 4 | 3 | 0.72\u00b10.08 | 0.61\u00b10.00 | 0.73\u00b10.00 | 0.73\u00b10.00 | 0.73\u00b10.00\nLetter | 1000 | 16 | 10 | 0.33\u00b10.01 | 0.04\u00b10.00 | 0.19\u00b10.09 | 0.38\u00b10.04 | 0.23\u00b10.04\nMNIST | 1000 | 784 | 10 | 0.50\u00b10.01 | 0.38\u00b10.00 | 0.55\u00b10.03 | 0.56\u00b10.01 | 0.55\u00b10.02\nSatimage | 4435 | 36 | 6 | 0.57\u00b10.06 | 0.55\u00b10.00 | 0.21\u00b10.05 | 0.51\u00b10.01 | 0.30\u00b10.00\nSegment\u2019n | 1000 | 19 | 7 | 0.52\u00b10.03 | 0.34\u00b10.00 | 0.23\u00b10.09 | 0.61\u00b10.05 | 0.52\u00b10.10\nVehicle | 846 | 18 | 4 | 0.10\u00b10.00 | 0.14\u00b10.00 | 0.02\u00b10.02 | 0.14\u00b10.00 | 0.05\u00b10.00\nVowel | 990 | 10 | 11 | 0.42\u00b10.01 | 0.44\u00b10.00 | 0.28\u00b10.03 | 0.39\u00b10.02 | 0.41\u00b10.02\nWine | 178 | 13 | 3 | 0.84\u00b10.01 | 0.46\u00b10.00 | 0.56\u00b10.02 | 0.90\u00b10.02 | 0.59\u00b10.01\n\nA heuristic approach for model selection. Model selection is generally hard for unsupervised clustering. Most existing algorithms simply fix the hyperparameters without examining their impacts on model performance [10, 35]. In DPMMGM, the hyperparameters c and \u2113 are critical to clustering quality since they control the number of clusters. Without training data in our setting they cannot be set using cross validation. Moreover, it is not feasible to estimate them using Bayesian sampling either, because they are not parameters from a proper Bayesian model. We thus introduce a time-efficient heuristic approach to selecting appropriate values. Suppose the dataset is known to have K clusters. Our heuristic goes as follows. First initialise c and \u2113 to 0.1. Then at each iteration, we compare the inferred number of clusters with K. If it is larger than K (otherwise we do the converse), we choose c or \u2113 randomly, and increase its value by u/n, where u is a uniform random variable in [0, 1] and n is the number of iterations so far. According to the parameter sensitivity studied above, increasing c or \u2113 tends to decrease the number of clusters, and the model eventually stabilises due to the stochastic decrement by u/n. We denote the model learned from this heuristic as DPMMGM. In the case where the true number of clusters is unknown, we can still apply this strategy, except that the number of clusters K needs to be first inferred from DPGMM. 
This method is denoted as DPMMGM\u2217.\n\nComparison. We measure the quality of clustering results by using the standard normalised mutual information (NMI) criterion [36]. We compare our DPMMGM with the well established KMeans, nCut and DPGMM clustering methods^3. All experiments are repeated five times with random initialisation. The results are shown in Table 1. DPMMGM achieves the best NMI scores on most datasets. DPMMGM\u2217, which is not informed of the true number of clusters, still obtains reasonably high NMI scores, and outperforms the DPGMM model.\n\n6.2 Max-margin Clustering Topic Model\n\nDatasets. We test the MMCTM model on two document datasets: 20NEWS and Reuters-R8. For the 20NEWS dataset, we combine the training and test datasets used in [16], which ends up with 20 categories/clusters with roughly balanced cluster sizes. It contains 18,772 documents in total with a vocabulary size of 61,188. The Reuters-R8 dataset is a subset of the Reuters-21578 dataset^4, with 8 categories and 7,674 documents in total. The size of different categories is biased, with the lowest number of documents in a category being 51 while the highest being 2,292.\n\nComparison. We choose L \u2208 {5, 10, 15, 20, 25} documents randomly from each category as the landmarks, use 80% of the documents for training and the rest for testing. We set the number of topics (i.e., T) to 50, and set the Dirichlet priors in Section 5 to \u03c9 = 0.1, \u03b2 = 0.01, \u03b1 = \u03b1_0 = \u03b1_1 = 10, as clustering quality is not sensitive to them. For the other hyperparameters related to the max-margin constraints, e.g., v in the Gaussian prior for \u03b7, the balance parameter c, and the cost parameter \u2113, instead of doing cross validation which is computationally expensive and not helpful for our scenario with few labeled data, we simply set v = 0.1, c = 9, \u2113 = 0.1. This is found to be a good setting and denoted as MMCTM. 
To test the robustness of this setting, we vary c over {0.1, 0.2, 0.5, 0.7, 1, 3, 5, 7, 9, 15, 30, 50} and keep v = \u2113 = 0.1 (\u2113 and c play similar roles and so varying one is enough). We choose the best performance out of these parameter settings, denoted as MMCTM\u2217, which can be roughly deemed as the setting for the optimal performance. We compared MMCTM with state-of-the-art SVM and semi-supervised SVM (S3VM) models. They are efficiently implemented in [37], and the related parameters are chosen by 5-fold cross validation. As in [16], raw word frequencies are used as input features. We also compare MMCTM with a Bayesian baseline, the cluster based topic model (CTM) [29], which is the building block of MMCTM without the max-margin constraints. Note we did not compare with the standard MedLDA [16] because it is supervised. We measure the performance by cluster accuracy, which is the proportion of correctly clustered documents. To accelerate MMCTM, we simply initialise it with CTM, and find it converges surprisingly fast in terms of accuracy, e.g., usually within 30 iterations (refer to Appendix D.5). The accuracies are shown in Table 2, and we can see that MMCTM outperforms other models (also see Appendix D.4), except for SVM when L = 20 on the Reuters-R8 dataset. In addition, MMCTM performs almost as well as using the optimal parameter setting (MMCTM\u2217).\n\n^3 We additionally show some comparison with some existing max-margin clustering models in Appendix D.2 on two-cluster data because their code only deals with the case of two clusters. Our method performs best.\n\n^4 Downloaded from csmining.org/index.php/r52-and-r8-of-reuters-21578.html.\n\nTable 2: Clustering acc. (in %). Bold means significantly different.\nDataset | L | CTM | SVM | S3VM | MMCTM | MMCTM\u2217\n20NEWS | 5 | 17.22\u00b14.2 | 37.13\u00b12.9 | 39.36\u00b13.2 | 56.70\u00b11.9 | 57.86\u00b10.9\n20NEWS | 10 | 24.50\u00b14.5 | 46.99\u00b12.4 | 47.91\u00b12.8 | 54.92\u00b11.6 | 56.56\u00b11.3\n20NEWS | 15 | 22.76\u00b14.2 | 52.80\u00b11.2 | 52.49\u00b11.4 | 55.06\u00b12.7 | 57.80\u00b12.2\n20NEWS | 20 | 26.07\u00b17.2 | 56.10\u00b11.5 | 54.44\u00b12.1 | 56.62\u00b12.2 | 59.70\u00b11.4\n20NEWS | 25 | 27.20\u00b11.5 | 59.15\u00b11.4 | 57.45\u00b11.7 | 55.70\u00b12.4 | 61.92\u00b13.0\nReuters-R8 | 5 | 41.27\u00b116.7 | 78.12\u00b11.1 | 78.51\u00b12.3 | 79.18\u00b14.1 | 80.86\u00b12.9\nReuters-R8 | 10 | 42.63\u00b17.4 | 80.69\u00b11.2 | 79.15\u00b11.2 | 80.04\u00b15.3 | 83.48\u00b11.0\nReuters-R8 | 15 | 39.67\u00b19.9 | 83.25\u00b11.7 | 81.87\u00b10.8 | 85.48\u00b12.1 | 86.86\u00b12.5\nReuters-R8 | 20 | 58.24\u00b18.3 | 85.66\u00b11.0 | 73.95\u00b12.0 | 82.92\u00b11.7 | 83.82\u00b11.6\nReuters-R8 | 25 | 51.93\u00b15.9 | 84.95\u00b10.1 | 82.39\u00b11.8 | 86.56\u00b12.5 | 88.12\u00b10.5\n\nFigure 4: Accuracy vs. #topic. (a) 20NEWS dataset; (b) Reuters-R8 dataset.\n\nFigure 5: 2-D tSNE embedding on 20NEWS for MMCTM (left) and CTM (right). Best viewed in color. See Appendix D.3 for the results on Reuters-R8 datasets.\n\nSensitivity to the number of topics (i.e., T). Note the above experiments simply set T = 50. To validate the effect of T, we varied T from 10 to 100, and the corresponding accuracies are plotted in Fig. 4 for the two datasets. 
In both cases, T = 50 seems to be a good parameter value.\nCluster embedding. We finally plot the clustering results by embedding them into the 2-dimensional plane using tSNE [38]. In Fig. 5, it can be observed that, compared to CTM, MMCTM generates well-separated clusters with a much larger margin between clusters.\n7 Conclusions\nWe propose a robust Bayesian max-margin clustering framework to bridge the gap between max-margin learning and Bayesian clustering, allowing many Bayesian clustering algorithms to be directly equipped with the max-margin criterion. Posterior inference is done via two data augmentation techniques. Two models from the framework are proposed for Bayesian nonparametric max-margin clustering and topic model based document clustering. Experimental results show that our models significantly outperform existing methods with competitive clustering accuracy.\nAcknowledgments\nThis work was supported by an Australia China Science and Research Fund grant (ACSRF-06283) from the Department of Industry, Innovation, Climate Change, Science, Research and Tertiary Education of the Australian Government, the National Key Project for Basic Research of China (No. 2013CB329403), and NSF of China (Nos. 61322308, 61332007). NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.\n\nReferences\n[1] J. MacQueen. Some methods of classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob., page 281, 1967.\n[2] J. Shi and J. Malik. 
Normalized cuts and image segmentation. TPAMI, 22(8):705\u2013767, 2000.\n[3] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. JMLR, 2:125\u2013137, 2001.\n[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705\u20131749, 2005.\n[5] H. Cheng, X. Zhang, and D. Schuurmans. Convex relaxations of Bregman divergence clustering. In UAI, 2013.\n[6] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Max-margin clustering. In NIPS, 2005.\n[7] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In ICML, 2008.\n[8] Y. F. Li, I. W. Tsang, J. T. Kwok, and Z. H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.\n[9] G. T. Zhou, T. Lan, A. Vahdat, and G. Mori. Latent maximum margin clustering. In NIPS, 2013.\n[10] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.\n[11] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.\n[12] Y. W. Teh, H. Daum\u00e9 III, and D. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.\n[13] Y. Hu, J. Boyd-Graber, H. Daum\u00e9, and Z. I. Ying. Binary to bushy: Bayesian hierarchical clustering with the Beta coalescent. In NIPS, 2013.\n[14] P. Wang, K. B. Laskey, C. Domeniconi, and M. I. Jordan. Nonparametric Bayesian co-clustering ensembles. In SDM, 2011.\n[15] J. Zhu, N. Chen, and E. P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 2014.\n[16] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. JMLR, 13(8):2237\u20132278, 2012.\n[17] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with fast sampling algorithms. In ICML, 2013.\n[18] J. Zhu. Max-margin nonparametric latent feature models for link prediction. In ICML, 2012.\n[19] M. Xu, J. Zhu, and B. Zhang. Fast max-margin matrix factorization with data augmentation. In ICML, 2013.\n[20] L. Wang, X. Li, Z. Tu, and J. Jia. Discriminative clustering via generative feature mapping. In AAAI, 2012.\n[21] R. Gomes, A. Krause, and P. Perona. Discriminative clustering by regularized information maximization. In NIPS, 2010.\n[22] K. Ganchev, J. Gra\u00e7a, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. JMLR, 11:2001\u20132049, 2010.\n[23] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265\u2013292, 2001.\n[24] C. Fox. An introduction to the calculus of variations. Courier Dover Publications, 1987.\n[25] B. Jorgensen. Statistical properties of the generalized inverse Gaussian distribution. Lecture Notes in Statistics, 1982.\n[26] K. P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, UCB, 2007.\n[27] S. Favaro and Y. W. Teh. MCMC for normalized random measure mixture models. Stat. Sci., 2013.\n[28] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[29] H. M. Wallach. Structured topic models for language. PhD thesis, University of Cambridge, 2008.\n[30] A. Jain and M. Law. Data clustering: A user\u2019s dilemma. Lecture Notes in Comp. Sci., 3776:1\u201310, 2005.\n[31] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101(476):1566\u20131581, 2006.\n[32] M. Davy and J. Y. Tourneret. Generative supervised classification using Dirichlet process priors. TPAMI, 32(10):1781\u20131794, 2010.\n[33] P. Franti and O. Virmajoki. Iterative shrinking method for clustering problems. PR, 39(5):761\u2013765, 2006.\n[34] K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.\n[35] A. Shah and Z. Ghahramani. Determinantal clustering process \u2013 a nonparametric Bayesian approach to kernel based semi-supervised clustering. In UAI, 2013.\n[36] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. JMLR, 11:2837\u20132854, 2010.\n[37] V. Sindhwani, P. Niyogi, and M. Belkin. SVMlin: Fast linear SVM solvers for supervised and semi-supervised learning. In NIPS Workshop on Machine Learning Open Source Software, 2006. http://vikas.sindhwani.org/svmlin.html.\n[38] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9(11):2579\u20132605, 2008.\n", "award": [], "sourceid": 354, "authors": [{"given_name": "Changyou", "family_name": "Chen", "institution": "Duke University"}, {"given_name": "Jun", "family_name": "Zhu", "institution": "Tsinghua University"}, {"given_name": "Xinhua", "family_name": "Zhang", "institution": "NICTA and Australian National University"}]}