{"title": "Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 403, "abstract": "Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm -- random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency -- orders of magnitude speed-up compared to the state-of-the-art.", "full_text": "Online Learning of Nonparametric Mixture Models\n\nvia Sequential Variational Approximation\n\nDahua Lin\n\nToyota Technological Institute at Chicago\n\ndhlin@ttic.edu\n\nAbstract\n\nReliance on computationally expensive algorithms for inference has been limiting\nthe use of Bayesian nonparametric models in large scale applications. To tackle this\nproblem, we propose a Bayesian learning algorithm for DP mixture models. In-\nstead of following the conventional paradigm \u2013 random initialization plus iterative\nupdate, we take an progressive approach. Starting with a given prior, our method\nrecursively transforms it into an approximate posterior through sequential varia-\ntional approximation. In this process, new components will be incorporated on the\n\ufb02y when needed. 
The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate a remarkable improvement in efficiency – orders of magnitude speed-up compared to the state-of-the-art.\n\n1 Introduction\n\nBayesian nonparametric mixture models [7] provide an important framework to describe complex data. In this family of models, Dirichlet process mixture models (DPMM) [1, 15, 18] are among the most popular in practice. As opposed to traditional parametric models, DPMM allows the number of components to vary during inference, thus providing great flexibility for exploratory analysis. Nonetheless, the use of DPMM in practical applications, especially those with massive data, has been limited due to high computational cost. MCMC sampling [12, 14] is the conventional approach to Bayesian nonparametric estimation. With their heavy reliance on local updates to explore the solution space, these methods often show slow mixing, especially on large datasets. Whereas the use of split-merge moves and data-driven proposals [9, 17, 20] has substantially improved the mixing performance, MCMC methods still require many passes over a dataset to reach the equilibrium distribution.\nVariational inference [4, 11, 19, 22], an alternative approach based on mean field approximation, has become increasingly popular recently due to better run-time performance. Typical variational methods for nonparametric mixture models rely on a truncated approximation of the stick-breaking construction [16], which requires a fixed number of components to be maintained and iteratively updated during inference. The truncation level is usually set conservatively to ensure approximation accuracy, incurring a considerable amount of unnecessary computation.\nThe era of Big Data presents new challenges for machine learning research. 
Many real-world applications involve massive amounts of data that cannot even be accommodated entirely in memory. Both MCMC sampling and variational inference maintain the entire configuration and perform iterative updates over multiple passes, which are often too expensive for large-scale applications. This challenge motivated us to develop a new learning method for Bayesian nonparametric models that can handle massive data efficiently. In this paper, we propose an online Bayesian learning algorithm for generic DP mixture models. This algorithm does not require random initialization of components. Instead, it begins with the prior DP(αµ) and progressively transforms it into an approximate posterior of the mixtures, with new components introduced on the fly as needed. Based on a new way of variational approximation, the algorithm proceeds sequentially, taking in one sample at a time to make the update. We also devise specific steps to prune redundant components and merge similar ones, thus further improving the performance. We tested the proposed method on synthetic data as well as two real applications: modeling image patches and clustering documents. Results show empirically that the proposed algorithm can reliably estimate a DP mixture model in a single pass over large datasets.\n\n2 Related Work\n\nRecent years have witnessed many efforts devoted to developing efficient learning algorithms for Bayesian nonparametric models. An important line of research is to accelerate the mixing in MCMC through better proposals. Jain and Neal [17] proposed to use split-merge moves to avoid being trapped in local modes. Dahl [6] developed the sequentially allocated sampler, where splits are proposed by sequentially allocating observations to one of two split components through sequential importance sampling. 
This method was recently extended for HDP [20] and BP-HMM [9].\nThere has also been substantial advancement in variational inference. A significant development along this line is Stochastic Variational Inference, a framework that incorporates stochastic optimization with variational inference [8]. Wang et al. [23] extended this framework to the nonparametric realm, and developed an online learning algorithm for HDP [18]. Wang and Blei [21] also proposed a truncation-free variational inference method for generic BNP models, where a sampling step is used for updating atom assignments that allows new atoms to be created on the fly.\nBryant and Sudderth [5] recently developed an online variational inference algorithm for HDP, using mini-batches to handle streaming data and split-merge moves to adapt truncation levels. They tried to tackle the problem of online BNP learning as we do, but via a different approach. First, we propose a generic method, while theirs focuses on topic models. The designs are also different – our method starts from scratch and progressively adds new components. Its overall complexity is O(nK), where n and K are the number of samples and the expected number of components. Bryant's method begins with random initialization and relies on splits over mini-batches to create new topics, resulting in a complexity of O(nKT), where T is the number of iterations for each mini-batch. The differences stem from the theoretical basis – our method uses sequential approximation based on the predictive law, while theirs is an extension of the standard truncation-based model.\nNott et al. [13] recently proposed a method, called VSUGS, for fast estimation of DP mixture models. Similar to our algorithm, VSUGS takes a sequential updating approach, but relies on a different approximation. 
Particularly, what we approximate is a joint posterior over both data allocation and model parameters, while VSUGS is based on approximating the posterior of data allocation alone. Also, VSUGS requires fixing a truncation level T in advance, which may lead to difficulties in practice (especially for large data). Our algorithm provides a way to tackle this, and no longer requires a fixed truncation level.\n\n3 Nonparametric Mixture Models\n\nThis section provides a brief review of the Dirichlet Process Mixture Model – one of the most widely used nonparametric mixture models. A Dirichlet Process (DP), typically denoted by DP(αµ), is characterized by a concentration parameter α and a base distribution µ. It has been shown that sample paths of a DP are almost surely discrete [16], and can be expressed as\n\nD = Σ_{k=1}^∞ π_k δ_{φ_k}, with π_k = v_k Π_{l=1}^{k-1} (1 − v_l), v_k ∼ Beta(1, α), ∀k = 1, 2, . . . .    (1)\n\nThis is often referred to as the stick-breaking representation, and φ_k is called an atom. Since an atom can be repeatedly generated from D with positive probability, the number of distinct atoms is usually less than the number of samples. The Dirichlet Process Mixture Model (DPMM) exploits this property, and uses a DP sample as the prior of component parameters. Below is a formal definition:\n\nD ∼ DP(αµ), θ_i ∼ D, x_i ∼ F(·|θ_i), ∀i = 1, . . . , n.    (2)\n\nConsider a partition {C_1, . . . , C_K} of {1, . . . , n} such that θ_i are identical for all i ∈ C_k, which we denote by φ_k. Instead of maintaining θ_i explicitly, we introduce an indicator z_i for each i with θ_i = φ_{z_i}. 
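The stick-breaking representation in Eq. (1) is straightforward to simulate under a finite truncation. The sketch below is an illustration, not part of the original paper; the truncation level and the standard-normal base measure are assumptions made only for this example:

```python
import random

def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw pi_1..pi_T from the stick-breaking construction of DP(alpha * mu)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # v_k ~ Beta(1, alpha)
        weights.append(v * remaining)    # pi_k = v_k * prod_{l<k} (1 - v_l)
        remaining *= 1.0 - v             # stick length left after k breaks
    return weights

pis = stick_breaking_weights(alpha=1.0, truncation=100)
atoms = [random.gauss(0.0, 1.0) for _ in pis]  # phi_k ~ mu = N(0, 1)
```

The truncated weights sum to slightly less than one; the missing mass is the unbroken remainder of the stick, which shrinks geometrically in expectation, so a moderate truncation already captures essentially all of the mass.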
Using this clustering notation, this formulation can be rewritten equivalently as follows:\n\nz_{1:n} ∼ CRP(α), φ_k ∼ µ, ∀k = 1, 2, . . . , K, x_i ∼ F(·|φ_{z_i}), ∀i = 1, 2, . . . , n.    (3)\n\nHere, CRP(α) denotes a Chinese Restaurant Prior, which is a distribution over exchangeable partitions. Its probability mass function is given by\n\np_CRP(z_{1:n}|α) = (Γ(α) α^K / Γ(α + n)) Π_{k=1}^K Γ(|C_k|).    (4)\n\n4 Variational Approximation of Posterior\n\nGenerally, there are two approaches to learning a mixture model from observed data, namely maximum likelihood estimation (MLE) and Bayesian learning. Specifically, maximum likelihood estimation seeks an optimal point estimate of ν, while Bayesian learning aims to derive the posterior distribution over the mixtures. Bayesian learning takes into account the uncertainty about ν, often resulting in better generalization performance than MLE.\nIn this paper, we focus on Bayesian learning. In particular, for DPMM, the predictive distribution of component parameters, conditioned on a set of observed samples x_{1:n}, is given by\n\np(θ′|x_{1:n}) = E_{D|x_{1:n}}[p(θ′|D)].    (5)\n\nHere, E_{D|x_{1:n}} takes the expectation w.r.t. p(D|x_{1:n}). In this section, we derive a tractable approximation of this predictive distribution based on a detailed analysis of the posterior.\n\n4.1 Posterior Analysis\n\nLet D ∼ DP(αµ) and θ_1, . . . , θ_n be iid samples from D, {C_1, . . . , C_K} be a partition of {1, . . . , n} such that θ_i for all i ∈ C_k are identical, and φ_k = θ_i ∀i ∈ C_k. 
Then the posterior distribution of D remains a DP, as D|θ_{1:n} ∼ DP(α̃µ̃), where α̃ = α + n, and\n\nµ̃ = (α / (α + n)) µ + Σ_{k=1}^K (|C_k| / (α + n)) δ_{φ_k}.    (6)\n\nThe atoms are generally unobservable, and therefore it is more interesting in practice to consider the posterior distribution of D given the observed samples. For this purpose, we derive the lemma below that provides a constructive characterization of the posterior distribution given both the observed samples x_{1:n} and the partition z.\nLemma 1. Consider the DPMM in Eq.(3). Drawing a sample from the posterior distribution p(D|z_{1:n}, x_{1:n}) is equivalent to constructing a random probability measure as follows:\n\nβ_0 D′ + Σ_{k=1}^K β_k δ_{φ_k}, with D′ ∼ DP(αµ), (β_0, β_1, . . . , β_K) ∼ Dir(α, m_1, . . . , m_K), φ_k ∼ µ|C_k.    (7)\n\nHere, m_k = |C_k|, and µ|C_k is a posterior distribution given by µ|C_k(dθ) ∝ µ(dθ) Π_{i∈C_k} F(x_i|θ).\nThis lemma follows immediately from Theorem 2 in [10], as DP is a special case of the so-called Normalized Random Measures with Independent Increments (NRMI). It is worth emphasizing that p(D|x, z) is no longer a Dirichlet process, as the locations of the atoms φ_1, . . . , φ_K are non-deterministic; instead, they follow the posterior distributions µ|C_k.\nBy marginalizing out the partition z_{1:n}, we obtain the posterior distribution p(D|x_{1:n}):\n\np(D|x_{1:n}) = Σ_{z_{1:n}} p(z_{1:n}|x_{1:n}) p(D|x_{1:n}, z_{1:n}).    (8)\n\nLet {C^{(z)}_1, . . .
, C^{(z)}_K} be the partition corresponding to z_{1:n}; then we have\n\np(z_{1:n}|x_{1:n}) ∝ p_CRP(z_{1:n}|α) Π_{k=1}^{K(z)} ∫ µ(dφ_k) Π_{i∈C^{(z)}_k} F(x_i|φ_k).    (9)\n\n4.2 Variational Approximation\n\nComputing the predictive distribution based on Eq.(8) requires enumerating all possible partitions, whose number grows exponentially as n increases. To tackle this difficulty, we resort to variational approximation, that is, we choose a tractable distribution to approximate p(D|x_{1:n}, z_{1:n}).\nIn particular, we consider a family of random probability measures that can be expressed as follows:\n\nq(D|ρ, ν) = Σ_{z_{1:n}} Π_{i=1}^n ρ_i(z_i) q^{(z)}_ν(D|z_{1:n}).    (10)\n\nHere, q^{(z)}_ν(D|z_{1:n}) is a stochastic process conditioned on z_{1:n}, defined as\n\nq^{(z)}_ν(D|z_{1:n}) =_d β_0 D′ + Σ_{k=1}^K β_k δ_{φ_k}, with D′ ∼ DP(αµ), (β_0, β_1, . . . , β_K) ∼ Dir(α, m^{(z)}_1, . . . , m^{(z)}_K), φ_k ∼ ν_k.    (11)\n\nHere, we use =_d to indicate that drawing a sample from q^{(z)}_ν is equivalent to constructing one according to the right-hand side. In addition, m^{(z)}_k = |C^{(z)}_k| is the cardinality of the k-th cluster w.r.t. z_{1:n}, and ν_k is a distribution over component parameters that is independent of z.\nThe variational construction in Eq.(10) and (11) is similar to Eq.(7) and (8), except for two significant differences: (1) p(z_{1:n}|x_{1:n}) is replaced by a product distribution Π_i ρ_i(z_i), and (2) µ|C_k, which depends on z_{1:n}, is replaced by an independent distribution ν_k. With this design, z_i for different i and φ_k for different k are independent w.r.t. 
q, thus resulting in the tractable predictive law below: let q be a random probability measure given by Eq.(10) and (11); then\n\nE_{q(D|ρ,ν)}[p(θ′|D)] = (α / (α + n)) µ(θ′) + Σ_{k=1}^K ((Σ_{i=1}^n ρ_i(k)) / (α + n)) ν_k(θ′).    (12)\n\nThe approximate posterior has two sets of parameters: ρ ≜ (ρ_1, . . . , ρ_n) and ν ≜ (ν_1, . . . , ν_K). With this approximation, the task of Bayesian learning reduces to the problem of finding the optimal setting of these parameters such that q(D|ρ, ν) best approximates the true posterior distribution.\n\n4.3 Sequential Approximation\n\nThe first problem here is to determine the value of K. A straightforward approach is to fix K to a large number as in the truncated methods. This, however, would incur substantial computational costs on unnecessary components. We take a different approach here. Rather than randomly initializing a fixed number of components, we begin with an empty model and progressively refine the model as samples come in, adding new components on the fly when needed.\nSpecifically, when the first sample x_1 is observed, we introduce the first component and denote the posterior for this component by ν_1. As there is only one component at this point, we have z_1 = 1, i.e. ρ_1(z_1 = 1) = 1, and the posterior distribution over the component parameter is ν^{(1)}_1(dθ) ∝ µ(dθ) F(x_1|θ). Samples are brought in sequentially. In particular, we compute ρ_i, and update ν^{(i−1)} to ν^{(i)} upon the arrival of the i-th sample x_i.\nSuppose we have ρ = (ρ_1, . . . , ρ_i) and ν^{(i)} = (ν^{(i)}_1, . . . , ν^{(i)}_K) after processing i samples. To explain x_{i+1}, we can use either of the K existing components or introduce a new component φ_{K+1}. 
Then the posterior distribution of z_{i+1}, φ_1, . . . , φ_{K+1} given x_1, . . . , x_i, x_{i+1} is\n\np(z_{i+1}, φ_{1:K+1}|x_{1:i+1}) ∝ p(z_{i+1}, φ_{1:K+1}|x_{1:i}) p(x_{i+1}|z_{i+1}, φ_{1:K+1}).    (13)\n\nUsing the tractable distribution q(·|ρ_{1:i}, ν^{(i)}) in Eq.(10) to approximate the posterior p(·|x_{1:i}), we get\n\np(z_{i+1}, φ_{1:K+1}|x_{1:i+1}) ∝ q(z_{i+1}, φ_{1:K+1}|ρ_{1:i}, ν^{(i)}) p(x_{i+1}|z_{i+1}, φ_{1:K+1}).    (14)\n\nThen, the optimal settings of ρ_{i+1} and ν^{(i+1)} that minimize the Kullback-Leibler divergence between q(z_{i+1}, φ_{1:K+1}|ρ_{1:i+1}, ν^{(i+1)}) and the approximate posterior in Eq.(14) are given as follows:\n\nρ_{i+1}(k) ∝ w^{(i)}_k ∫ F(x_{i+1}|θ) ν^{(i)}_k(dθ)    (k ≤ K),\nρ_{i+1}(k) ∝ α ∫ F(x_{i+1}|θ) µ(dθ)    (k = K + 1),    (15)\n\nwith w^{(i)}_k = Σ_{j=1}^i ρ_j(k), and\n\nν^{(i+1)}_k(dθ) ∝ µ(dθ) Π_{j=1}^{i+1} F(x_j|θ)^{ρ_j(k)}    (k ≤ K),\nν^{(i+1)}_k(dθ) ∝ µ(dθ) F(x_{i+1}|θ)^{ρ_{i+1}(k)}    (k = K + 1).    (16)\n\nAlgorithm 1 Sequential Bayesian Learning of DPMM (for conjugate cases).\nRequire: base measure params: λ, λ′, observed samples: x_1, . . . , x_n, and threshold ε\n  Let K = 1, ρ_1(1) = 1, w_1 = 1, ζ_1 = λ + T(x_1), and ζ′_1 = λ′ + τ.\n  for i = 2 : n do\n    T_i ← T(x_i), and b_i ← b(x_i)\n    marginal log-likelihood:\n      h_i(k) ← B(ζ_k + T_i, ζ′_k + τ) − B(ζ_k, ζ′_k) − b_i    (k = 1, . . . , K)\n      h_i(K + 1) ← B(λ + T_i, λ′ + τ) − B(λ, λ′) − b_i\n    ρ_i(k) ← w_k e^{h_i(k)} / Σ_l w_l e^{h_i(l)} for k = 1, . . . , K + 1, with w_{K+1} = α\n    if ρ_i(K + 1) > ε then\n      w_k ← w_k + ρ_i(k), ζ_k ← ζ_k + ρ_i(k) T_i, and ζ′_k ← ζ′_k + ρ_i(k) τ, for k = 1, . . . , K\n      w_{K+1} ← ρ_i(K + 1), ζ_{K+1} ← λ + ρ_i(K + 1) T_i, and ζ′_{K+1} ← λ′ + ρ_i(K + 1) τ\n      K ← K + 1\n    else\n      re-normalize ρ_i such that Σ_{k=1}^K ρ_i(k) = 1\n      w_k ← w_k + ρ_i(k), ζ_k ← ζ_k + ρ_i(k) T_i, and ζ′_k ← ζ′_k + ρ_i(k) τ, for k = 1, . . . , K\n    end if\n  end for\n\nDiscussion. There is a key distinction between this approximation scheme and conventional approaches: instead of seeking an approximation of p(D|x_{1:n}), which is very difficult (D is infinite) and unnecessary (only a finite number of components are useful), we try to approximate the posterior of a finite subset of latent variables that are truly relevant for prediction, namely z and φ_{1:K+1}.\nThis sequential approximation scheme introduces a new component for each sample, resulting in n components over the entire dataset. This, however, is unnecessary. We find empirically that for most samples, ρ_i(K + 1) is negligible, indicating that the sample is adequately explained by existing components, and there is no need for new ones. In practice, we set a small value ε and increase K only when ρ_i(K + 1) > ε. 
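To make the recursion concrete, the following sketch runs one pass of this sequential update for a one-dimensional Gaussian likelihood with known variance and a conjugate normal prior (a simplified special case of the conjugate algorithm; the function name, toy data, and all hyperparameter values are illustrative assumptions, not taken from the paper):

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def sva_gaussian_1d(xs, alpha=1.0, eps=0.01, m0=0.0, v0=100.0, tau2=1.0):
    """One-pass sequential update for a normal-normal DPMM (prior N(m0, v0),
    likelihood N(theta, tau2)). Each component keeps an accumulated weight w,
    an effective count n = sum_j rho_j(k), and a weighted sum s = sum_j rho_j(k)*x_j."""
    comps = []
    for x in xs:
        scores = []
        for c in comps:
            prec = 1.0 / v0 + c["n"] / tau2          # posterior precision of theta_k
            mean = (m0 / v0 + c["s"] / tau2) / prec  # posterior mean of theta_k
            # predictive density of x under component k, weighted by w_k
            scores.append(c["w"] * normal_pdf(x, mean, 1.0 / prec + tau2))
        scores.append(alpha * normal_pdf(x, m0, v0 + tau2))  # new-component term
        rho = [s / sum(scores) for s in scores]
        if rho[-1] > eps:                 # enough evidence: create a new component
            comps.append({"w": 0.0, "n": 0.0, "s": 0.0})
        else:                             # otherwise drop that mass and re-normalize
            rho = [r / sum(rho[:-1]) for r in rho[:-1]]
        for c, r in zip(comps, rho):      # weighted conjugate updates
            c["w"] += r
            c["n"] += r
            c["s"] += r * x
    return comps

comps = sva_gaussian_1d([0.1, -0.2, 5.0, 4.9, 0.05, 5.1])
```

With the broad prior assumed here, the new-component score is rarely negligible, so this sketch creates components liberally and would lean on the pruning and merging steps of Section 5 to tidy up afterwards; note that every ρ_i sums to one, so the accumulated weights always total the number of processed samples.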
This simple strategy is very effective in controlling the model size.\n\n5 Algorithm and Implementation\n\nThis section discusses the implementation of the sequential Bayesian learning algorithm under two different circumstances: (1) µ and F are exponential family distributions that form a conjugate pair, and (2) µ is not a conjugate prior w.r.t. F.\n\nConjugate Case. In general, when µ is conjugate to F, they can be written as follows:\n\nµ(dθ|λ, λ′) = exp(λ^T η(θ) − λ′ A(θ) − B(λ, λ′)) h(dθ),    (17)\nF(x|θ) = exp(η(θ)^T T(x) − τ A(θ) − b(x)).    (18)\n\nHere, the prior measure µ has a pair of natural parameters: (λ, λ′). Conditioned on a set of observations x_1, . . . , x_n, the posterior distribution remains in the same family as µ, with parameters (λ + Σ_{i=1}^n T(x_i), λ′ + nτ). In addition, the marginal likelihood is given by\n\n∫ F(x|θ) µ(dθ|λ, λ′) = exp(B(λ + T(x), λ′ + τ) − B(λ, λ′) − b(x)).    (19)\n\nIn such cases, both the base measure µ and the component-specific posterior measures ν_k can be represented using natural parameter pairs, which we denote by (λ, λ′) and (ζ_k, ζ′_k). With this notation, we derive a sequential learning algorithm for conjugate cases, as shown in Alg 1.\n\nNon-conjugate Case. In practical models, it is not uncommon that µ and F are not a conjugate pair. Unlike in the conjugate cases discussed above, there exist no general formulas to update posterior parameters or to compute the marginal likelihood. Here, we propose to address this issue using stochastic optimization. Consider a posterior distribution given by p(θ|x_{1:n}) ∝ µ(θ) Π_{i=1}^n F(x_i|θ). A stochastic optimization method finds the MAP estimate of θ through update steps as below:\n\nθ ← θ + σ_i (∇_θ log µ(θ) + n ∇_θ log F(x_i|θ)).    (20)\n\nThe basic idea here is to use the gradient computed at a particular sample x_i to approximate the true gradient. This procedure converges to a (local) maximum, as long as the step sizes σ_i satisfy Σ_{i=1}^∞ σ_i = ∞ and Σ_{i=1}^∞ σ_i² < ∞.\nIncorporating the stochastic optimization method into our algorithm, we obtain a variant of Alg 1. The general procedure is similar, except for the following changes: (1) it maintains point estimates of the component parameters instead of the posteriors, which we denote by φ̂_1, . . . , φ̂_K; (2) it computes the log-likelihood as h_i(k) = log F(x_i|φ̂_k); (3) the estimates of the component parameters are updated using the formula below:\n\nφ̂^{(i)}_k ← φ̂^{(i−1)}_k + σ_i (∇_θ log µ(θ) + n ρ_i(k) ∇_θ log F(x_i|θ)).    (21)\n\nFollowing the common practice of stochastic optimization, we set σ_i = i^{−κ}/n with κ ∈ (0.5, 1].\n\nPrune and Merge. As opposed to random initialization, components created during this sequential construction are often truly needed, as the decisions to create new components are based on knowledge accumulated from previous samples. However, it is still possible that some components introduced at early iterations become less useful and that multiple components may be similar. We thus introduce a mechanism to remove undesirable components and merge similar ones.\nWe identify opportunities to make such adjustments by looking at the weights. Let w̃^{(i)}_k = w^{(i)}_k / Σ_l w^{(i)}_l (with w^{(i)}_k = Σ_{j=1}^i ρ_j(k)) be the relative weight of a component at the i-th iteration. Once the relative weight of a component drops below a small threshold ε_r, we remove it to save unnecessary computation on this component in the future.\nThe similarity between two components φ_k and φ_{k′} can be measured in terms of the distance between ρ_i(k) and ρ_i(k′) over all processed samples, as d_ρ(k, k′) = i^{−1} Σ_{j=1}^i |ρ_j(k) − ρ_j(k′)|. When φ_k and φ_{k′} are merged (i.e. when d_ρ(k, k′) < ε_d), we increment ρ_i(k) to ρ_i(k) + ρ_i(k′). We also merge the associated sufficient statistics (for the conjugate case) or take a weighted average of the parameters (for the non-conjugate case). Generally, there is no need to perform such checks at every iteration. Since computing this distance for a pair of components takes O(n), we propose to examine similarities at an O(i · K) interval so that the amortized complexity is maintained at O(nK).\n\nDiscussion. Compared to existing methods, the proposed method has several important advantages. First, it builds up the model on the fly, thus avoiding the need to randomly initialize a set of components as required by truncation-based methods. The model learned in this way can be readily extended (e.g. by adding more components or adapting existing ones) when new data is available. More importantly, the algorithm can learn the model in one pass, without the need for iterative updates over the dataset. 
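The two checks described above can be sketched as follows, assuming the per-sample responsibilities ρ_j(k) seen so far are kept as rows of a list. The function name and thresholds are illustrative, and merging here only pools the accumulated weights, whereas the full method also merges sufficient statistics or parameters:

```python
def prune_and_merge(weights, rho_hist, eps_r=1e-3, eps_d=1e-2):
    """weights[k]: accumulated weight w_k; rho_hist[j][k]: responsibility rho_j(k).
    Returns indices of surviving components; the weight of a merged pair is
    pooled into its first member."""
    i, total = len(rho_hist), sum(weights)
    keep = [k for k, w in enumerate(weights) if w / total >= eps_r]  # prune small
    absorbed, out = set(), []
    for pos, k in enumerate(keep):
        if k in absorbed:
            continue
        out.append(k)
        for kp in keep[pos + 1:]:
            if kp in absorbed:
                continue
            # d_rho(k, k') = (1/i) * sum_j |rho_j(k) - rho_j(k')|
            d = sum(abs(r[k] - r[kp]) for r in rho_hist) / i
            if d < eps_d:                # near-duplicate: merge k' into k
                weights[k] += weights[kp]
                absorbed.add(kp)
    return out

w = [3.0, 3.0, 0.001]
rho = [[0.5, 0.5, 0.0]] * 6
survivors = prune_and_merge(w, rho)
```

In this toy run the third component's relative weight falls below ε_r and is pruned, and the first two components have identical responsibility histories, so they merge into one; running such checks only at widening intervals keeps the amortized cost at O(nK), as noted above.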
This one-pass operation distinguishes it from MCMC methods and conventional variational learning algorithms, making it a great fit for large-scale problems.\n\n6 Experiments\n\nTo test the proposed algorithm, we conducted experiments on both synthetic data and real-world applications – modeling image patches and document clustering. All algorithms are implemented using Julia [2], a new language for high-performance technical computing.\n\n6.1 Synthetic Data\n\nFirst, we study the behavior of the proposed algorithm on synthetic data. Specifically, we constructed a dataset comprising 10000 samples in 9 Gaussian clusters of unit variance. The distances between these clusters were chosen such that there exists moderate overlap between neighboring clusters. The estimation of these Gaussian components is based on the DPMM below:\n\nD ∼ DP(α · N(0, σ_p² I)), θ_i ∼ D, x_i ∼ N(θ_i, σ_x² I).    (22)\n\nFigure 1: Gaussian clusters on synthetic data obtained using different methods. Both MC-SM and SVA-PM identified the 9 clusters correctly. The result of MC-SM is omitted here, as it looks the same as SVA-PM.\n\nFigure 2: Joint log-likelihood on synthetic data as a function of run-time. The likelihood values were evaluated on a held-out testing set. (Best viewed in color)\n\nHere, we set α = 1, σ_p = 100 and σ_x = 1.\nWe tested the following inference algorithms: Collapsed Gibbs Sampling (CGS) [12], MCMC with Split-Merge (MC-SM) [6], Truncation-Free Variational inference (TFV) [21], Sequential Variational Approximation (SVA), and its variant Sequential Variational Approximation with Prune and Merge (SVA-PM). For CGS, MC-SM, and TFV, we ran the updating procedures iteratively for one hour, while for SVA and SVA-PM, we ran only one pass.\nFigure 1 shows the resulting components. 
CGS and TFV yield obviously redundant components. This corroborates observations in previous work [9]. Such nuisances are significantly reduced in SVA, which only occasionally brings in redundant components. The key difference that leads to this improvement is that CGS and TFV rely on random initialization to bootstrap the algorithm, which inevitably introduces similar components, while SVA leverages information gained from previous samples to decide whether new components are needed. Both MC-SM and SVA-PM produce the desired mixtures, demonstrating the importance of an explicit mechanism to remove redundancy.\nFigure 2 plots the traces of joint log-likelihoods evaluated on a held-out set of samples. We can see that SVA-PM quickly reaches the optimal solution in a matter of seconds. SVA also gets to a reasonable solution within seconds, and then the progress slows down. Without the prune-and-merge steps, it takes much longer for redundant components to fade out. MC-SM eventually reaches the optimal solution after many iterations. Methods relying on local updates, including CGS and TFV, did not even come close to the optimal solution within one hour. These results clearly demonstrate that our progressive strategy, which gradually constructs the model through a series of informed decisions, is much more efficient than random initialization followed by iterative updating.\n\n6.2 Modeling Image Patches\n\nImage patches, which capture local characteristics of images, play a fundamental role in various computer vision tasks, such as image recovery and scene understanding. Many vision algorithms rely on a patch dictionary to work. It has been a common practice in computer vision to use parametric methods (e.g. K-means) to learn a dictionary of fixed size. This approach is inefficient when large datasets are used. 
It is also difficult to extend such a dictionary to new data when K is fixed.\nTo tackle this problem, we applied our method to learn a nonparametric dictionary from the SUN database [24], a large dataset comprising over 130K images that capture a broad variety of scenes. We divided all images into two disjoint sets: a training set with 120K images and a testing set with 10K. We extracted 2000 patches of size 32 × 32 from each image, and characterized each patch by a 128-dimensional SIFT feature. In total, the training set contains 240M feature vectors. We ran TFV, SVA, and SVA-PM, respectively, to learn a DPMM from the training set, based on the formulation given in Eq.(22), and evaluated the average predictive log-likelihood over the testing set as the measure of performance.\n\nFigure 3: Examples of image patch clusters learned using SVA-PM. Each row corresponds to a cluster. We can see that similar patches are in the same cluster.\n\nFigure 4: Average log-likelihood on image modeling as a function of run-time.\n\nFigure 5: Average log-likelihood of document clusters as a function of run-time.\n\nFigure 3 shows a small subset of patch clusters obtained using SVA-PM. Figure 4 compares the trajectories of the average log-likelihoods obtained using different algorithms. TFV takes multiple iterations to move from a random configuration to a sub-optimal one and gets trapped in local optima. 
SVA steadily improves the predictive performance as it sees more samples. We notice in our experiments that even without an explicit redundancy-removal mechanism, some unnecessary components still get removed as their relative weights decrease and become negligible. SVA-PM accelerates this process by explicitly merging similar components.\n\n6.3 Document Clustering\n\nNext, we apply the proposed method to explore categories of documents. Unlike a standard topic modeling task, this is a higher-level application that builds on top of the topic representation. Specifically, we first obtain a collection of m topics from a subset of documents, and characterize all documents by topic proportions. We assume that the topic proportion vector is generated from a category-specific Dirichlet distribution, as follows:\n\nD ∼ DP(α · Dir_sym(γ_p)), θ_i ∼ D, x_i ∼ Dir(γ_x θ_i).    (23)\n\nHere, the base measure is a symmetric Dirichlet distribution. To generate a document, we draw a mean probability vector θ_i from D, and generate the topic proportion vector x_i from Dir(γ_x θ_i). The parameter γ_x is a design parameter that controls how far x_i may deviate from the category-specific center θ_i. Note that this is not a conjugate model, and we use stochastic optimization instead of Bayesian updates in SVA (see Section 5).\nWe performed the experiments on the New York Times database, which contains about 1.8M articles from 1987 to 2007. We pruned the vocabulary to 5000 words by removing stop words and those with low TF-IDF scores, and obtained 150 topics by running LDA [3] on a subset of 20K documents. Then, each document is represented by a 150-dimensional vector of topic proportions. We held out 10K documents for testing and used the remaining ones to train the DPMM. We compared SVA, SVA-PM, and TFV. The traces of log-likelihood values are shown in Figure 5. 
We observe similar trends as above: SVA and SVA-PM attain better solutions more quickly, while TVF is less efficient and is prone to being trapped in local maxima. Also, TVF tends to generate more components than necessary, while SVA-PM maintains better performance using far fewer components.

7 Conclusion

We presented an online Bayesian learning algorithm to estimate DP mixture models. The proposed method does not require random initialization. Instead, it can reliably and efficiently learn a DPMM from scratch through sequential approximation in a single pass. The algorithm takes in data in a streaming fashion, and thus can be easily adapted to new data. Experiments on both synthetic data and real applications have demonstrated that our algorithm achieves a remarkable speedup – it can attain a nearly optimal configuration within seconds or minutes, while mainstream methods may take hours or even longer. It is worth noting that the approximation is derived based on the predictive law of DPMM. It is an interesting future direction to investigate how it can be generalized to a broader family of BNP models, such as HDP, Pitman-Yor processes, and NRMIs [10].

References
[1] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[2] Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. Julia: A fast dynamic language for technical computing. CoRR, abs/1209.5145, 2012.
[3] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] David M. Blei and Michael I. Jordan. Variational methods for the Dirichlet process. In Proc.
of ICML'04, 2004.
[5] Michael Bryant and Erik Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Proc. of NIPS'12, 2012.
[6] David B. Dahl. Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models, 2005.
[7] Nils Lid Hjort, Chris Holmes, Peter Müller, and Stephen G. Walker. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.
[8] Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. arXiv:1206.7501, 2012.
[9] Michael C. Hughes, Emily B. Fox, and Erik B. Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In Proc. of NIPS'12, 2012.
[10] Lancelot F. James, Antonio Lijoi, and Igor Prünster. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36:76–97, 2009.
[11] Kenichi Kurihara, Max Welling, and Yee Whye Teh. Collapsed variational Dirichlet process mixture models. In Proc. of IJCAI'07, 2007.
[12] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[13] David J. Nott, Xiaole Zhang, Christopher Yau, and Ajay Jasra. A sequential algorithm for fast fitting of Dirichlet process mixture models. arXiv:1301.2897, 2013.
[14] Ian Porteous, Alex Ihler, Padhraic Smyth, and Max Welling. Gibbs sampling for (coupled) infinite mixture models in the stick-breaking representation. In Proc. of UAI'06, 2006.
[15] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Proc. of NIPS'00, 2000.
[16] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[17] S. Jain and R. M. Neal.
A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.
[18] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[19] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Proc. of NIPS'07, volume 20, 2007.
[20] Chong Wang and David Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. arXiv:1201.1657, 2012.
[21] Chong Wang and David Blei. Truncation-free stochastic variational inference for Bayesian nonparametric models. In Proc. of NIPS'12, 2012.
[22] Chong Wang and David M. Blei. Variational inference for the nested Chinese restaurant process. In Proc. of NIPS'09, 2009.
[23] Chong Wang, John Paisley, and David Blei. Online variational inference for the hierarchical Dirichlet process. In Proc. of AISTATS'11, 2011.
[24] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. of CVPR'10, 2010.