{"title": "Accelerated Variational Dirichlet Process Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": "", "full_text": "Accelerated Variational Dirichlet Process Mixtures\n\nKenichi Kurihara\n\nDept. of Computer Science\n\nTokyo Institute of Technology\n\nTokyo, Japan\n\nBren School of Information and Computer Science\n\nMax Welling\n\nUC Irvine\n\nIrvine, CA 92697-3425\n\nkurihara@mi.cs.titech.ac.jp\n\nwelling@ics.uci.edu\n\nNikos Vlassis\n\nInformatics Institute\n\nUniversity of Amsterdam\n\nThe Netherlands\n\nvlassis@science.uva.nl\n\nAbstract\n\nDirichlet Process (DP) mixture models are promising candidates for clustering\napplications where the number of clusters is unknown a priori. Due to compu-\ntational considerations these models are unfortunately unsuitable for large scale\ndata-mining applications. We propose a class of deterministic accelerated DP\nmixture models that can routinely handle millions of data-cases. The speedup is\nachieved by incorporating kd-trees into a variational Bayesian algorithm for DP\nmixtures in the stick-breaking representation, similar to that of Blei and Jordan\n(2005). Our algorithm differs in the use of kd-trees and in the way we handle\ntruncation: we only assume that the variational distributions are \ufb01xed at their pri-\nors after a certain level. Experiments show that speedups relative to the standard\nvariational algorithm can be signi\ufb01cant.\n\n1 Introduction\n\nEvidenced by three recent workshops1, nonparametric Bayesian methods are gaining popularity\nin the machine learning community.\nIn each of these workshops computational ef\ufb01ciency was\nmentioned as an important direction for future research. 
In this paper we propose computational\nspeedups for Dirichlet Process (DP) mixture models [1, 2, 3, 4, 5, 6, 7], with the purpose of im-\nproving their applicability in modern day data-mining problems where millions of data-cases are no\nexception.\n\nOur approach is related to, and complements, the variational mean-\ufb01eld algorithm for DP mixture\nmodels of Blei and Jordan [7].\nIn this approach, the intractable posterior of the DP mixture is\napproximated with a factorized variational \ufb01nite (truncated) mixture model with T components, that\nis optimized to minimize the KL distance to the posterior. However, a downside of their model is\nthat the variational families are not nested over T , and locating an optimal truncation level T may\nbe dif\ufb01cult (see Section 3).\n\nIn this paper we propose an alternative variational mean-\ufb01eld algorithm, called VDP (Variational\nDP), in which the variational families are nested over T . In our model we allow for an unbounded\nnumber of components for the variational mixture, but we tie the variational distributions after level\n\n1http://aluminum.cse.buffalo.edu:8079/npbayes/nipsws05/topics\nhttp://www.cs.toronto.edu/\u223cbeal/npbayes/\nhttp://www2.informatik.hu-berlin.de/\u223cbickel/npb-workshop.html\n\n\fT to their priors. Our algorithm proceeds in a greedy manner by starting with T = 1 and releasing\ncomponents when this improves (signi\ufb01cantly) the KL bound. Releasing is most effectively done\nby splitting a component in two children and updating them to convergence. Our approach essen-\ntially resolves the issue in [7] of searching for an optimal truncation level of the variational mixture\n(see Section 4).\n\nAdditionally, a signi\ufb01cant contribution is that we incorporate kd-trees into the VDP algorithm as a\nway to speed up convergence [8, 9]. A kd-tree structure recursively partitions the data space into a\nnumber of nodes, where each node contains a subset of the data-cases. 
Following [9], for a given tree expansion we tie together the responsibility over mixture components of all data-cases contained in each outer node of the tree. By caching certain sufficient statistics in each node of the kd-tree we then achieve computational gains, while the variational approximation becomes a function of the depth of the tree at which one operates (see Section 6).

The resulting Fast-VDP algorithm provides an elegant way to trade off computational resources against accuracy. We can always release new components from the pool and split kd-tree nodes as long as we have computational resources left. Our setup guarantees that this will always (at least in theory) improve the KL bound (in practice local optima may force us to reject certain splits, see Section 7). As we empirically demonstrate in Section 8, a kd-tree can offer significant speedups, allowing our algorithm to handle millions of data-cases. As a result, Fast-VDP is the first algorithm entertaining an unbounded number of clusters that is practical for modern day data-mining applications.

2 The Dirichlet Process Mixture in the Stick-Breaking Representation

A DP mixture model in the stick-breaking representation can be viewed as possessing an infinite number of components with random mixing weights [4]. In particular, the generative model of a DP mixture assumes:

• An infinite collection of components H = {η_i}_{i=1}^∞ that are independently drawn from a prior p_η(η_i|λ) with hyperparameters λ.

• An infinite collection of 'stick lengths' V = {v_i}_{i=1}^∞, v_i ∈ [0, 1] for all i, that are independently drawn from a prior p_v(v_i|α) with hyperparameters α. They define the mixing weights {π_i}_{i=1}^∞ of the mixture as π_i(V) = v_i Π_{j=1}^{i−1} (1 − v_j), for i = 1, . . . , ∞.

• An observation model p_x(x|η) that generates a datum x from component η.

Given a dataset X = {x_n}_{n=1}^N, each data-case x_n is assumed to be generated by first drawing a component label z_n = k ∈ {1, . . . , ∞} from the infinite mixture with probability p_z(z_n = k|V) ≡ π_k(V), and then drawing x_n from the corresponding observation model p_x(x_n|η_k).

We will denote Z = {z_n}_{n=1}^N the set of all labels, W = {H, V, Z} the set of all latent variables of the DP mixture, and θ = {λ, α} the hyperparameters. In clustering problems we are mainly interested in computing the posterior over data labels p(z_n|X, θ), as well as the predictive density p(x|X, θ) = ∫_{H,V} p(x|H, V) Σ_Z p(W|X, θ), which are both intractable since p(W|X, θ) cannot be computed analytically.

3 Variational Inference in Dirichlet Process Mixtures

For variational inference, the intractable posterior p(W|X, θ) of the DP mixture can be approximated with a parametric family of factorized variational distributions q(W; φ) of the form

q(W; φ) = Π_{i=1}^L [ q_{v_i}(v_i; φ^v_i) q_{η_i}(η_i; φ^η_i) ] Π_{n=1}^N q_{z_n}(z_n)    (1)

where q_{v_i}(v_i; φ^v_i) and q_{η_i}(η_i; φ^η_i) are parametric models with parameters φ^v_i and φ^η_i (one parameter per i), and q_{z_n}(z_n) are discrete distributions over the component labels (one distribution per n). Blei and Jordan [7] define an explicit truncation level L ≡ T for the variational mixture in (1) by setting q_{v_T}(v_T = 1) = 1 and assuming that data-cases assign zero responsibility to components with index higher than the truncation level T, i.e., q_{z_n}(z_n > T) = 0. Consequently, in their model only components of the mixture up to level T need to be considered.
Variational inference then consists in estimating a set of T parameters {φ^v_i, φ^η_i}_{i=1}^T and a set of N distributions {q_{z_n}(z_n)}_{n=1}^N, collectively denoted by φ, that minimize the Kullback-Leibler divergence D[q(W; φ)||p(W|X, θ)] between the true posterior and the variational approximation, or equivalently that minimize the free energy F(φ) = E_q[log q(W; φ)] − E_q[log p(W, X|θ)]. Since each distribution q_{z_n}(z_n) has nonzero support only for z_n ≤ T, minimizing F(φ) results in a set of update equations for φ that involve only finite sums [7].

However, explicitly truncating the variational mixture as above has the undesirable property that the variational family with truncation level T is not contained within the variational family with truncation level T + 1, i.e., the families are not nested.^2 The result is that there may be an optimal finite truncation level T for q, which contradicts the intuition that the more components we allow in q the better the approximation should be (reaching its best when T → ∞). Moreover, locating a near-optimal truncation level may be difficult since F as a function of T may exhibit local minima (see Fig. 4 in [7]).

^2 We thank David Blei for pointing this out.

4 Variational Inference with an Infinite Variational Model

Here we propose a slightly different variational model for q that allows families over T to be nested. In our setup, q is given by (1) where we let L go to infinity but we tie the parameters of all models after a specific level T (with T ≪ L). In particular, we impose the condition that for all components with index i > T the variational distributions for the stick-lengths q_{v_i}(v_i) and the variational distributions for the components q_{η_i}(η_i) are equal to their corresponding priors, i.e., q_{v_i}(v_i; φ^v_i) = p_v(v_i|α) and q_{η_i}(η_i; φ^η_i) = p_η(η_i|λ). In our model we define the free energy F as the limit F = lim_{L→∞} F_L, where F_L is the free energy defined by q in (1) and a truncated DP mixture at level L (justified by the almost sure convergence of an L-truncated Dirichlet process to an infinite Dirichlet process when L → ∞ [6]). Using the parameter tying assumption for i > T, the free energy reads

F = Σ_{i=1}^T { E_{q_{v_i}}[log (q_{v_i}(v_i; φ^v_i) / p_v(v_i|α))] + E_{q_{η_i}}[log (q_{η_i}(η_i; φ^η_i) / p_η(η_i|λ))] } + Σ_{n=1}^N E_q[log (q_{z_n}(z_n) / (p_z(z_n|V) p_x(x_n|η_{z_n})))].    (2)

In our scheme T defines an implicit truncation level of the variational mixture, since there are no free parameters to optimize beyond level T. As in [7], the free energy F is a function of T parameters {φ^v_i, φ^η_i}_{i=1}^T and N distributions {q_{z_n}(z_n)}_{n=1}^N. However, contrary to [7], data-cases may now assign nonzero responsibility to components beyond level T, and therefore each q_{z_n}(z_n) must now have infinite support (which requires computing infinite sums in the various quantities of interest). An important implication of our setup is that the variational families are now nested with respect to T (since for i > T, q_{v_i}(v_i) and q_{η_i}(η_i) can always revert to their priors), and as a result it is guaranteed that as we increase T there exist solutions that decrease F. This is an important result because it allows for optimization with adaptive T starting from T = 1 (see Section 7).

From the last term of (2) we directly see that the q_{z_n}(z_n) that minimizes F is given by

q_{z_n}(z_n = i) = exp(S_{n,i}) / Σ_{j=1}^∞ exp(S_{n,j})    (3)

where

S_{n,i} = E_{q_V}[log p_z(z_n = i|V)] + E_{q_{η_i}}[log p_x(x_n|η_i)].    (4)

Minimization of F over φ^v_i and φ^η_i can be carried out by direct differentiation of (2) for particular choices of models for q_{v_i} and q_{η_i} (see Section 5). Using q_{z_n} from (3), the free energy (2) reads

F = Σ_{i=1}^T { E_{q_{v_i}}[log (q_{v_i}(v_i; φ^v_i) / p_v(v_i|α))] + E_{q_{η_i}}[log (q_{η_i}(η_i; φ^η_i) / p_η(η_i|λ))] } − Σ_{n=1}^N log Σ_{i=1}^∞ exp(S_{n,i}).    (5)

Evaluation of F requires computing the infinite sum Σ_{i=1}^∞ exp(S_{n,i}) in (5). The difficult part is Σ_{i=T+1}^∞ exp(S_{n,i}). Under the parameter tying assumption for i > T, most terms of S_{n,i} in (4) factor out of the infinite sum as constants (since they do not depend on i) except for the term Σ_{j=T+1}^{i−1} E_{p_v}[log(1 − v)] = (i − 1 − T) E_{p_v}[log(1 − v)]. From the above, the infinite sum can be shown to be

Σ_{i=T+1}^∞ exp(S_{n,i}) = exp(S_{n,T+1}) / (1 − exp(E_{p_v}[log(1 − v)])).    (6)

Using the variational q(W) as an approximation to the true posterior p(W|X, θ), the required posterior over data labels can be approximated by p(z_n|X, θ) ≈ q_{z_n}(z_n). Although q_{z_n}(z_n) has infinite support, in practice it suffices to use the individual q_{z_n}(z_n = i) for the finite part i ≤ T, and the cumulative q_{z_n}(z_n > T) for the infinite part.
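The geometric-series closed form (6) is what makes the infinite normalization in (3) computable. The sketch below is our own illustration (with made-up score values, not the authors' code) of how the tail mass and the normalized responsibilities could be obtained: beyond the truncation level the scores decrease by a constant amount, so their exponentials form a geometric series.

```python
import math

def responsibilities(S_finite, S_tail_first, E_log_1mv):
    """Normalize q(z = i) proportional to exp(S_i), where S_i for i <= T
    are listed in S_finite, and for i > T the scores decrease arithmetically
    by E_log_1mv = E_pv[log(1 - v)] < 0, so the tail sums to
        sum_{i > T} exp(S_i) = exp(S_{T+1}) / (1 - exp(E_log_1mv))."""
    tail = math.exp(S_tail_first) / (1.0 - math.exp(E_log_1mv))
    Z = sum(math.exp(s) for s in S_finite) + tail
    q_finite = [math.exp(s) / Z for s in S_finite]
    q_tail = tail / Z  # cumulative responsibility q(z > T)
    return q_finite, q_tail
```

The returned q_tail plays the role of the cumulative q_{z_n}(z_n > T) used for the infinite part.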
Finally, using the parameter tying assumption for i > T, and the identity Σ_{i=1}^∞ π_i(V) = 1, the predictive density p(x|X, θ) can be approximated by

p(x|X, θ) ≈ Σ_{i=1}^T E_{q_V}[π_i(V)] E_{q_{η_i}}[p_x(x|η_i)] + [1 − Σ_{i=1}^T E_{p_v}[π_i(V)]] E_{p_η}[p_x(x|η)].    (7)

Note that all quantities of interest, such as the free energy (5) and the predictive distribution (7), can be computed analytically even though they involve infinite sums.

5 Solutions for the exponential family

The results in the previous section apply independently of the choice of models for the DP mixture. In this section we provide analytical solutions for models in the exponential family. In particular we assume that p_v(v_i|α) = Beta(α_1, α_2) and q_{v_i}(v_i; φ^v_i) = Beta(φ^v_{i,1}, φ^v_{i,2}), and that p_x(x|η), p_η(η|λ), and q_{η_i}(η_i; φ^η_i) are given by

p_x(x|η) = h(x) exp{η^T x − a(η)}    (8)
p_η(η|λ) = h(η) exp{λ_1 η + λ_2 (−a(η)) − a(λ)}    (9)
q_{η_i}(η_i; φ^η_i) = h(η_i) exp{φ^η_{i,1} η_i + φ^η_{i,2} (−a(η_i)) − a(φ^η_i)}.    (10)

In this case, the probabilities q_{z_n}(z_n = i) are given by (3) with S_{n,i} computed from (4) using

E_{q_{v_i}}[log v_i] = Ψ(φ^v_{i,1}) − Ψ(φ^v_{i,1} + φ^v_{i,2})    (11)
E_{q_{v_j}}[log(1 − v_j)] = Ψ(φ^v_{j,2}) − Ψ(φ^v_{j,1} + φ^v_{j,2})    (12)
E_{q_{η_i}}[log p_x(x_n|η_i)] = E_{q_{η_i}}[η_i]^T x_n − E_{q_{η_i}}[a(η_i)]    (13)

where Ψ(·) is the digamma function.
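The Beta expectations (11)-(12) are the only place the digamma function enters. A minimal dependency-free sketch (our illustration, not the authors' code; the central-difference digamma is our own approximation, adequate for demonstration):

```python
import math

def digamma(x, h=1e-5):
    # Psi(x) = d/dx log Gamma(x); approximated here by a central
    # difference of math.lgamma to avoid external dependencies.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def beta_log_expectations(a, b):
    """For v ~ Beta(a, b):
         E[log v]       = Psi(a) - Psi(a + b)
         E[log(1 - v)]  = Psi(b) - Psi(a + b)."""
    return digamma(a) - digamma(a + b), digamma(b) - digamma(a + b)
```

For Beta(1, 1) (the uniform distribution) both expectations equal −1, which matches the direct integral ∫_0^1 log v dv = −1.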
The optimal parameters φ^v, φ^η can be found to be

φ^v_{i,1} = α_1 + Σ_{n=1}^N q_{z_n}(z_n = i),    φ^v_{i,2} = α_2 + Σ_{n=1}^N Σ_{j=i+1}^∞ q_{z_n}(z_n = j)    (14)

φ^η_{i,1} = λ_1 + Σ_{n=1}^N q_{z_n}(z_n = i) x_n,    φ^η_{i,2} = λ_2 + Σ_{n=1}^N q_{z_n}(z_n = i).    (15)

The update equations are similar to those in [7] except that we have used Beta(α_1, α_2) instead of Beta(1, α), and φ^v_{i,2} involves an infinite sum Σ_{j=i+1}^∞ q_{z_n}(z_n = j) which can be computed using (3) and (6). In [7] the corresponding sum is finite since q_{z_n}(z_n) is truncated at T.

Note that the VDP algorithm operates in a space where component labels are distinguishable, i.e., if we permute the labels the total probability of the data changes. Since the average a priori mixture weights of the components are ordered by their size, the optimal labelling of the a posteriori variational components is also ordered according to cluster size. Hence, we have incorporated a reordering step of components according to approximate size after each optimization step in our final algorithm (a feature that was not present in [7]).

6 Accelerating inference using a kd-tree

In this section we show that we can achieve accelerated inference for large datasets when we store the data in a kd-tree [10] and cache data sufficient statistics in each node of the kd-tree [8]. A kd-tree is a binary tree in which the root node contains all points, and each child node contains a subset of the data points contained in its father node, where points are separated by a (typically axis-aligned) hyperplane. Each point in the set is contained in exactly one node, and the set of outer nodes of a given expansion of the kd-tree form a partition of the data set.

Suppose the kd-tree containing our data X is expanded to some level.
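The caching idea can be sketched as follows. This is our own minimal illustration (not the authors' implementation; node structure, leaf size, and median split are our choices): each node stores the count |n_A| and the data sum, so the node average ⟨x⟩_A needed by the tied updates is available in O(1).

```python
class KDNode:
    """A tiny kd-tree node caching sufficient statistics: the number of
    points |n_A| and their coordinate-wise sum, so that <x>_A is O(1)."""

    def __init__(self, points, depth=0, leaf_size=2):
        self.n = len(points)                                       # |n_A|
        d = len(points[0])
        self.sum = [sum(p[k] for p in points) for k in range(d)]   # cached sum
        self.left = self.right = None
        if self.n > leaf_size:
            axis = depth % d                       # axis-aligned split plane
            pts = sorted(points, key=lambda p: p[axis])
            mid = self.n // 2                      # median split
            self.left = KDNode(pts[:mid], depth + 1, leaf_size)
            self.right = KDNode(pts[mid:], depth + 1, leaf_size)

    def mean(self):
        # <x>_A recovered from the cached statistics, without touching data
        return [s / self.n for s in self.sum]
```

Because the children of a node partition its points, their cached statistics always add up to the parent's, which is what makes refining the expansion cheap.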
Following [9], to achieve accelerated update equations we constrain all x_n in outer node A to share the same q_{z_n}(z_n) ≡ q_{z_A}(z_A). We can then show that, under this constraint, the q_{z_A}(z_A) that minimizes F is given by

q_{z_A}(z_A = i) = exp(S_{A,i}) / Σ_{j=1}^∞ exp(S_{A,j})    (16)

where S_{A,i} is computed as in (4) using (11)-(13), with (13) replaced by E_{q_{η_i}}[η_i]^T ⟨x⟩_A − E_{q_{η_i}}[a(η_i)], and ⟨x⟩_A denotes the average over all data x_n contained in node A. Similarly, if |n_A| is the number of data in node A, the optimal parameters can be shown to be

φ^v_{i,1} = α_1 + Σ_A |n_A| q_{z_A}(z_A = i),    φ^v_{i,2} = α_2 + Σ_A |n_A| Σ_{j=i+1}^∞ q_{z_A}(z_A = j)    (17)

φ^η_{i,1} = λ_1 + Σ_A |n_A| q_{z_A}(z_A = i) ⟨x⟩_A,    φ^η_{i,2} = λ_2 + Σ_A |n_A| q_{z_A}(z_A = i).    (18)

Finally, using q_{z_A}(z_A) from (16) the free energy (5) reads

F = Σ_{i=1}^T { E_{q_{v_i}}[log (q_{v_i}(v_i; φ^v_i) / p_v(v_i|α))] + E_{q_{η_i}}[log (q_{η_i}(η_i; φ^η_i) / p_η(η_i|λ))] } − Σ_A |n_A| log Σ_{i=1}^∞ exp(S_{A,i}).    (19)

The infinite sums in (17) and (19) can be computed from (6) with S_{n,T+1} replaced by S_{A,T+1}. Note that the cost of each update cycle is O(T|A|), which can be a significant improvement over the O(TN) cost when not using a kd-tree. (The cost of building the kd-tree is O(N log N), but this is amortized by multiple optimization steps.) Note that by refining the tree (expanding outer nodes) the free energy F cannot increase.
This allows us to control the trade-off between computational resources and approximation: we can always choose to descend the tree until our computational resources run out, and the level of approximation will be directly tied to F (deeper levels will mean lower F).

7 The algorithm

The proposed framework is quite general and allows flexibility in the design of an algorithm. Below we show in pseudocode the algorithm that we used in our experiments (for DP Gaussian mixtures). Input is a dataset X = {x_n}_{n=1}^N that is already stored in a kd-tree structure. Output is a set of parameters {φ^v_i, φ^η_i}_{i=1}^T and a value for T. From that we can compute responsibilities q_{z_n} using (3).

1. Set T = 1. Expand the kd-tree to some initial level (e.g. four).

2. Sample a number of 'candidate' components c according to size Σ_A |n_A| q_{z_A}(z_A = c), and split the component that leads to the maximal reduction of F_T. For each candidate c do:
   (a) Expand one level deeper the outer nodes of the kd-tree that assign to c the highest responsibility q_{z_A}(z_A = c) among all components.
   (b) Split c in two components, i and j, through the bisector of its principal component. Initialize the responsibilities q_{z_A}(z_A = i) and q_{z_A}(z_A = j).
   (c) Update only S_{A,i}, φ^v_i, φ^η_i and S_{A,j}, φ^v_j, φ^η_j for the new components i and j, keeping all other parameters as well as the kd-tree expansion fixed.

3. Update S_{A,t}, φ^v_t, φ^η_t for all t ≤ T + 1, while expanding the kd-tree and reordering components.

4. If F_{T+1} > F_T − ε then halt, else set T := T + 1 and go to step 2.

In the above algorithm, the number of sampled candidate components in step 2 can be tuned according to the desired cost/accuracy tradeoff.
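Step 2's size-proportional candidate sampling can be sketched as follows. This is our own illustration with hypothetical sizes; the function name, the `seed` parameter, and sampling without replacement are our choices, not details fixed by the paper.

```python
import random

def sample_candidates(sizes, num_candidates, seed=0):
    """Sample distinct candidate component indices with probability
    proportional to their mass (e.g. sum_A |n_A| q_{z_A}(z_A = c),
    precomputed in `sizes`), without replacement."""
    rng = random.Random(seed)
    remaining = dict(enumerate(sizes))
    chosen = []
    while remaining and len(chosen) < num_candidates:
        r = rng.random() * sum(remaining.values())
        for c, s in remaining.items():
            r -= s
            if r <= 0:       # this component's slice of the cumulative mass
                break
        chosen.append(c)
        del remaining[c]
    return chosen
```

Sampling proportionally to size focuses the split attempts on large components, which are the ones most likely to be under-fit by a single Gaussian.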
In our experiments we used 10 candidate components. In step 2b we initialized the responsibilities by q_{z_A}(z_A = i) = 1 = 1 − q_{z_A}(z_A = j) if ⟨x⟩_A is closer to i than to j (according to distance to the expected first moment). In order to speed up the partial updates in step 2c, we additionally set q_{z_A}(z_A = k) = 0 for all k ≠ i, j (so all responsibility is shared between the two new components). In step 3 we reordered components every one cycle and expanded the kd-tree every three update cycles, controlling the expansion by the relative change of q_{z_A}(z_A) between a node and its children (alternatively one can measure the change of F_{T+1}). Finally, in step 2c we monitored convergence of the partial updates through F_{T+1}, which can be efficiently computed by adding/subtracting terms involving the new/old components.

[Figure 1: Relative runtimes and free energies of Fast-VDP, VDP, and BJ (speedup factor and free energy ratio vs. #data).]

[Figure 2: Speedup factors and free energy ratios between Fast-VDP and VDP, as a function of #data (in thousands), dimensionality, and c-separation. Top and bottom figures show speedups and free energy ratios, respectively.]

8 Experimental results

In this section we demonstrate VDP, and its kd-tree extension Fast-VDP, on synthetic and real datasets. In all experiments we assumed a Gaussian observation model p_x(x|η) and a Gaussian-inverse Wishart for p_η(η|λ) and q_{η_i}(η_i; φ^η_i).

Synthetic datasets.
As argued in Section 4, an important advantage of VDP over the 'BJ' algorithm of [7] is that in VDP the variational families are nested over T, which ensures that the free energy is a monotone decreasing function of T and therefore allows for an adaptive T (starting with the trivial initialization T = 1). On the contrary, BJ optimizes the parameters for fixed T (and potentially minimizes the resulting free energy over different values of T), which requires a nontrivial initialization step for each T. Clearly, both the total runtime as well as the quality of the final solution of BJ depend largely on its initialization step, which makes the direct comparison of VDP with BJ difficult. Still, to get a feeling of the relative performance of VDP, Fast-VDP, and BJ, we applied all three algorithms on a synthetic dataset containing 1000 to 5000 data-cases sampled from 10 Gaussians in 16 dimensions, in which the free parameters of BJ were set exactly as described in [7] (20 initialization trials and T = 20). VDP and Fast-VDP were also executed until T = 20. In Fig. 1 we show the speedup factors and free energy ratios^3 among the three algorithms. Fast-VDP was approximately 23 times faster than BJ, and three times faster than VDP on 5000 data-cases. Moreover, Fast-VDP and VDP were always better than BJ in terms of free energy.

^3 Free energy ratio is defined as 1 + (F_A − F_B)/|F_B|, where A and B are either Fast-VDP, VDP or BJ.

In a second synthetic set of experiments we compared the speedup of Fast-VDP over VDP. We sampled data from 10 Gaussians in dimension D with component separation^4 c. Using default number of data-cases N = 10,000, dimensionality D = 16, and separation c = 2, we varied each of them, one at a time. In Fig. 2 we show the speedup factor (top) and the free energy ratio (bottom) between the two algorithms. Note that the latter is always worse for Fast-VDP since it is an approximation to VDP (ratio closer to one means better approximation). Fig. 2-left illustrates that the speedup of Fast-VDP over VDP is at least linear in N, as expected from the update equations in Section 6. The speedup factor was approximately 154 for one million data-cases, while the free energy ratio was almost constant over N. Fig. 2-center shows an interesting dependence of speed on dimensionality, with D = 64 giving the largest speedup. The three plots in Fig. 2 are in agreement with similar plots in [8, 9].

Real datasets. In this experiment we applied VDP and Fast-VDP for clustering image data. We used the MNIST dataset (http://yann.lecun.com/exdb/mnist/) which consists of 60,000 images of the digits 0-9 in 784 dimensions (28 by 28 pixels). We first applied PCA to reduce the dimensionality of the data to 50. Fast-VDP found 96 clusters in 3,379 seconds with free energy F = 1.759 × 10^7, while VDP found 88 clusters in 72,037 seconds with free energy 1.684 × 10^7. The speedup was 21 and the free energy ratio was 1.044. The mean images of the discovered components are illustrated in Fig. 3. The results of the two algorithms seem qualitatively similar, while Fast-VDP computed its results much faster than VDP.

[Figure 3: Clustering results of Fast-VDP and VDP, with a speedup of 21. The clusters are ordered according to size (from top left to bottom right).]

In a second real data experiment we clustered documents from citeseer (http://citeseer.ist.psu.edu). The dataset has 30,696 documents, with a vocabulary size of 32,473 words. Each document is represented by the counts of words in its abstract. We preprocessed the dataset by Latent Dirichlet Allocation [12] with 200 topics^5. We subsequently transformed these topic-counts y_{j,k} (count value of the k'th topic in the j'th document) into x_{j,k} = log(1 + y_{j,k}) to fit a normal distribution better.
In this problem the elapsed times of Fast-VDP and VDP were 335 seconds and 2,256 seconds, respectively, hence a speedup of 6.7. The free energy ratio was 1.040. Fast-VDP found five clusters, while VDP found six clusters. Table 1 shows the three most frequent topics in each cluster. Although the two algorithms found a different number of clusters, we can see that the clusters B and F found by VDP are similar clusters, whereas Fast-VDP did not distinguish between these two. Table 2 shows words included in these topics, showing that the documents are well-clustered.

                      Fast-VDP                       VDP
cluster         a     b     c     d     e      A     B     C     D     E     F
1st topic      81    73    35    49    76     81    73    35    76    49    73
2nd topic     102   174    50    92     4    102    40    50     4    92   174
3rd topic      59    40   110    94   129     59   174   110   129    94    40

Table 1: The three most frequent topics (in descending order) in each cluster. Fast-VDP found five clusters, a-e, while VDP found six clusters, A-F.

cluster    most frequent topic    words
a, A       81                     economic, policy, countries, bank, growth, firm, public, trade, market, ...
b, B, F    73                     traffic, packet, tcp, network, delay, rate, bandwidth, buffer, end, loss, ...
c, C       35                     algebra, algebras, ring, algebraic, ideal, field, lie, group, theory, ...
d, E       49                     motion, tracking, camera, image, images, scene, stereo, object, ...
e, D       76                     grammar, semantic, parsing, syntactic, discourse, parser, linguistic, ...

Table 2: Words in the most frequent topic of each cluster.

9 Conclusions

We described VDP, a variational mean-field algorithm for Dirichlet Process mixtures, and its fast extension Fast-VDP that utilizes kd-trees to achieve speedups. Our contribution is twofold: First, we extended the framework of [7] to allow for nested variational families and an adaptive truncation level for the variational mixture. Second, we showed how kd-trees can be employed in the framework, offering significant speedups and thus extending related results for finite mixture models [8, 9]. To our knowledge, the VDP algorithm is the first nonparametric Bayesian approach to large-scale data mining. Future work includes extending our approach to other models in the stick-breaking representation (e.g., priors of the form p_{v_i}(v_i|a_i, b_i) = Beta(a_i, b_i)), as well as alternative DP mixture representations such as the Chinese restaurant process [3].

^4 A Gaussian mixture is c-separated if for each pair (i, j) of components we have ||m_i − m_j||^2 ≥ c^2 D max(λ^max_i, λ^max_j), where λ^max denotes the maximum eigenvalue of their covariance [11].

^5 We thank David Newman for this preprocessing.

Acknowledgments

We thank Dave Newman for sharing code and David Blei for helpful comments. This material is based upon work supported by ONR under Grant No. N00014-06-1-0734 and the National Science Foundation under Grant No. 0535278.

References

[1] T. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1:209-230, 1973.
[2] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 2(6):1152-1174, 1974.
[3] D. Aldous. Exchangeability and related topics. In École d'été de Probabilité de Saint-Flour XIII, 1983.
[4] J. Sethuraman.
A constructive de\ufb01nition of Dirichlet priors. Statist. Sinica, 4:639\u2013650, 1994.\n[5] C.E. Rasmussen. The in\ufb01nite Gaussian mixture model. In NIPS 12. MIT Press, 2000.\n[6] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Can.\n\nJ. Statist., 30:269\u2013283, 2002.\n\n[7] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian\n\nAnalysis, 1(1):121\u2013144, 2005.\n\n[8] A.W. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In NIPS 11.\n\nMIT Press, 1999.\n\n[9] J.J. Verbeek, J.R.J. Nunnink, and N. Vlassis. Accelerated EM-based clustering of large data sets. Data\n\nMining and Knowledge Discovery, 13(3):291\u2013307, 2006.\n\n[10] J.L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM,\n\n18(9):509\u2013517, 1975.\n\n[11] S. Dasgupta. Learning mixtures of Gaussians. In IEEE Symp. on Foundations of Computer Science, 1999.\n[12] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research,\n\n3:993\u20131022, 2003.\n\n\f", "award": [], "sourceid": 3025, "authors": [{"given_name": "Kenichi", "family_name": "Kurihara", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Nikos", "family_name": "Vlassis", "institution": "Adobe Research"}]}