{"title": "Streaming, Distributed Variational Inference for Bayesian Nonparametrics", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 288, "abstract": "This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric (BNP) models. In the proposed framework, processing nodes receive a sequence of data minibatches, compute a variational posterior for each, and make asynchronous streaming updates to a central model. In contrast to previous algorithms, the proposed framework is truly streaming, distributed, asynchronous, learning-rate-free, and truncation-free. The key challenge in developing the framework, arising from fact that BNP models do not impose an inherent ordering on their components, is finding the correspondence between minibatch and central BNP posterior components before performing each update. To address this, the paper develops a combinatorial optimization problem over component correspondences, and provides an efficient solution technique. The paper concludes with an application of the methodology to the DP mixture model, with experimental results demonstrating its practical scalability and performance.", "full_text": "Streaming, Distributed Variational Inference for\n\nBayesian Nonparametrics\n\nTrevor Campbell1\nJonathan P. How1\n{tdjc@ , jstraub@csail. , fisher@csail. , jhow@}mit.edu\n\nJohn W. Fisher III2\n\nJulian Straub2\n\n1LIDS, 2CSAIL, MIT\n\nAbstract\n\nThis paper presents a methodology for creating streaming, distributed inference al-\ngorithms for Bayesian nonparametric (BNP) models. In the proposed framework,\nprocessing nodes receive a sequence of data minibatches, compute a variational\nposterior for each, and make asynchronous streaming updates to a central model.\nIn contrast to previous algorithms, the proposed framework is truly streaming, dis-\ntributed, asynchronous, learning-rate-free, and truncation-free. 
The key challenge in developing the framework, arising from the fact that BNP models do not impose an inherent ordering on their components, is finding the correspondence between minibatch and central BNP posterior components before performing each update. To address this, the paper develops a combinatorial optimization problem over component correspondences, and provides an efficient solution technique. The paper concludes with an application of the methodology to the DP mixture model, with experimental results demonstrating its practical scalability and performance.

1 Introduction

Bayesian nonparametric (BNP) stochastic processes are streaming priors – their unique feature is that they specify, in a probabilistic sense, that the complexity of a latent model should grow as the amount of observed data increases. This property captures common sense in many data analysis problems – for example, one would expect to encounter far more topics in a document corpus after reading 10^6 documents than after reading 10 – and becomes crucial in settings with unbounded, persistent streams of data. While their fixed, parametric cousins can be used to infer model complexity for datasets with known magnitude a priori [1, 2], such priors are silent with respect to notions of model complexity growth in streaming data settings.
Bayesian nonparametrics are also naturally suited to parallelization of data processing, due to the exchangeability, and thus conditional independence, they often exhibit via de Finetti's theorem. For example, labels from the Chinese Restaurant process [3] are rendered i.i.d. by conditioning on the underlying Dirichlet process (DP) random measure, and feature assignments from the Indian Buffet process [4] are rendered i.i.d.
by conditioning on the underlying beta process (BP) random measure.
Given these properties, one might expect there to be a wealth of inference algorithms for BNPs that address the challenges associated with parallelization and streaming. However, previous work has only addressed these two settings in concert for parametric models [5, 6], and only recently has each been addressed individually for BNPs. In the streaming setting, [7] and [8] developed streaming inference for DP mixture models using sequential variational approximation. Stochastic variational inference [9] and related methods [10–13] are often considered streaming algorithms, but their performance depends on the choice of a learning rate and on the dataset having known, fixed size a priori [5]. Outside of variational approaches, which are the focus of the present paper, there exist exact parallelized MCMC methods for BNPs [14, 15]; the tradeoff in using such methods is that they provide samples from the posterior rather than the distribution itself, and results regarding assessing convergence remain limited. Sequential particle filters for inference have also been developed [16], but these suffer issues with particle degeneracy and exponential forgetting.

[Figure 1: The four main steps of the algorithm that is run asynchronously on each processing node. (a) Retrieve the data/prior. (b) Perform inference. (c) Perform component ID. (d) Update the model.]

The main challenge posed by the streaming, distributed setting for BNPs is the combinatorial problem of component identification. Most BNP models contain some notion of a countably infinite set of latent "components" (e.g. clusters in a DP mixture model), and do not impose an inherent ordering on the components. Thus, in order to combine information about the components from multiple processors, the correspondence between components must first be found. Brute force search is intractable even for moderately sized models – there are (K_1+K_2 choose K_1) possible correspondences for two sets of components of sizes K_1 and K_2. Furthermore, there does not yet exist a method to evaluate the quality of a component correspondence for BNP models. This issue has been studied before in the MCMC literature, where it is known as the "label switching problem", but past solution techniques are generally model-specific and restricted to use on very simple mixture models [17, 18].
This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric models. In the proposed framework (shown for a single node A in Figure 1), processing nodes receive a sequence of data minibatches, compute a variational posterior for each, and make asynchronous streaming updates to a central model using a mapping obtained from a component identification optimization.
The key contributions of this work are as follows. First, we develop a minibatch posterior decomposition that motivates a learning-rate-free streaming, distributed framework suitable for Bayesian nonparametrics. Then, we derive the component identification optimization problem by maximizing the probability of a component matching. We show that the BNP prior regularizes model complexity in the optimization; an interesting side effect of this is that regardless of whether the minibatch variational inference scheme is truncated, the proposed algorithm is truncation-free. Finally, we provide an efficiently computable regularization bound for the Dirichlet process prior based on Jensen's inequality¹. The paper concludes with applications of the methodology to the DP mixture model, with experimental results demonstrating the scalability and performance of the method in practice.

2 Streaming, distributed Bayesian nonparametric inference

The proposed framework, motivated by a posterior decomposition that will be discussed in Section 2.1, involves a collection of processing nodes with asynchronous access to a central variational posterior approximation (shown for a single node in Figure 1). Data is provided to each processing node as a sequence of minibatches. When a processing node receives a minibatch of data, it obtains the central posterior (Figure 1a), and using it as a prior, computes a minibatch variational posterior approximation (Figure 1b). When minibatch inference is complete, the node then performs component identification between the minibatch posterior and the current central posterior, accounting for possible modifications made by other processing nodes (Figure 1c). Finally, it merges the minibatch posterior into the central variational posterior (Figure 1d).
In the following sections, we use the DP mixture [3] as a guiding example for the technical development of the inference framework.
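The per-node loop just described can be summarized in a few lines. The following is a hypothetical, illustrative Python sketch (the handle `central`, its methods, and both callables are placeholders for this illustration, not the paper's implementation):

```python
def process_stream(central, minibatches, run_variational_inference,
                   component_identification):
    """One processing node's loop (Figure 1a-1d).

    `central` is a handle to the shared central posterior; the two callables
    are assumed supplied by the user (black-box minibatch inference and the
    component matching step). All names here are illustrative.
    """
    for y_m in minibatches:
        prior = central.read()                       # (a) retrieve original posterior
        q_m = run_variational_inference(prior, y_m)  # (b) minibatch inference
        with central.lock():                         # (c) match against the possibly
            q_i = central.read()                     #     updated intermediate posterior
            sigma = component_identification(q_m, q_i, prior)
            central.merge(q_m, sigma, prior)         # (d) merge into the central model
        # other nodes may read and run inference concurrently;
        # only steps (c)-(d) hold the lock
```

Note that only the match-and-merge critical section is serialized, which is what keeps the scheme asynchronous.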
However, it is emphasized that the material in this paper generalizes to many other BNP models, such as the hierarchical DP (HDP) topic model [19], BP latent feature model [20], and Pitman-Yor (PY) mixture [21] (see the supplement for further details).

¹Regularization bounds for other popular BNP priors may be found in the supplement.

2.1 Posterior decomposition

Consider a DP mixture model [3], with cluster parameters θ, assignments z, and observed data y. For each asynchronous update made by each processing node, the dataset is split into three subsets y = y_o ∪ y_i ∪ y_m for analysis. When the processing node receives a minibatch of data y_m, it queries the central processing node for the original posterior p(θ, z_o | y_o), which will be used as the prior for minibatch inference. Once inference is complete, it again queries the central processing node for the intermediate posterior p(θ, z_o, z_i | y_o, y_i), which accounts for asynchronous updates from other processing nodes since minibatch inference began. Each subset y_r, r ∈ {o, i, m}, has N_r observations {y_rj}_{j=1}^{N_r}, and each variable z_rj ∈ ℕ assigns y_rj to cluster parameter θ_{z_rj}. Given the independence of θ and z in the prior, and the conditional independence of the data given the latent parameters, Bayes' rule yields the following decomposition of the posterior of θ and z given y:

  p(θ, z | y) ∝ [ p(z_i, z_m | z_o) / ( p(z_i | z_o) p(z_m | z_o) ) ] · p(θ, z_o | y_o)^{−1} · p(θ, z_m, z_o | y_m, y_o) · p(θ, z_i, z_o | y_i, y_o),   (1)

where the left-hand side is the updated central posterior, and the three rightmost factors are the original, minibatch, and intermediate posteriors, respectively.

This decomposition suggests a simple streaming, distributed, asynchronous update rule for a processing node: first, obtain the current central posterior density p(θ, z_o | y_o), and using it as a prior, compute the minibatch posterior p(θ, z_m, z_o | y_o, y_m); and then update the central posterior density by using (1) with the current central posterior density p(θ, z_i, z_o | y_i, y_o). However, there are two issues preventing the direct application of the decomposition rule (1):
Unknown component correspondence: Since it is generally intractable to find the minibatch posteriors p(θ, z_m, z_o | y_o, y_m) exactly, approximate methods are required. Further, as (1) requires the multiplication of densities, sampling-based methods are difficult to use, suggesting a variational approach. Typical mean-field variational techniques introduce an artificial ordering of the parameters in the posterior, thereby breaking symmetry that is crucial to combining posteriors correctly using density multiplication [6].
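For exponential-family factors, the density multiplications and division in (1) reduce to adding and subtracting natural parameters componentwise, which is what makes the streaming update cheap once components are correctly aligned. A minimal hypothetical sketch, using 1-D Gaussians in natural-parameter form (η₁, η₂) = (μ/σ², −1/(2σ²)):

```python
import numpy as np

def merge_component(eta_i, eta_m, eta_o):
    """Combine intermediate, minibatch, and original natural parameters.

    In an exponential family, multiplying densities adds natural parameters
    and dividing subtracts them, so the density product in (1) reduces to
    eta = eta_i + eta_m - eta_o for each matched component.
    """
    return np.asarray(eta_i) + np.asarray(eta_m) - np.asarray(eta_o)

# Toy check with 1-D Gaussians in natural form:
eta_o = np.array([0.0, -0.5])   # original/prior posterior: N(0, 1)
eta_m = np.array([2.0, -1.0])   # minibatch posterior
eta_i = np.array([1.0, -1.5])   # intermediate posterior
eta = merge_component(eta_i, eta_m, eta_o)  # updated component: [3.0, -2.0]
```

The catch, as the text explains, is that this componentwise arithmetic is only valid after the component correspondence has been resolved.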
The use of (1) with mean-field variational approximations thus requires first solving a component identification problem.
Unknown model size: While previous posterior merging procedures required a 1-to-1 matching between the components of the minibatch posterior and central posterior [5, 6], Bayesian nonparametric posteriors break this assumption. Indeed, the datasets y_o, y_i, and y_m from the same nonparametric mixture model can be generated by the same, disjoint, or an overlapping set of cluster parameters. In other words, the global number of unique posterior components cannot be determined until the component identification problem is solved and the minibatch posterior is merged.

2.2 Variational component identification

Suppose we have the following mean-field exponential family prior and approximate variational posterior densities in the minibatch decomposition (1):

  p(θ_k) = h(θ_k) exp( η_0^T T(θ_k) − A(η_0) )   ∀k ∈ ℕ
  p(θ, z_o | y_o) ≈ q_o(θ, z_o) = ζ_o(z_o) ∏_{k=1}^{K_o} h(θ_k) exp( η_ok^T T(θ_k) − A(η_ok) )
  p(θ, z_m, z_o | y_m, y_o) ≈ q_m(θ, z_m, z_o) = ζ_m(z_m) ζ_o(z_o) ∏_{k=1}^{K_m} h(θ_k) exp( η_mk^T T(θ_k) − A(η_mk) )   (2)
  p(θ, z_i, z_o | y_i, y_o) ≈ q_i(θ, z_i, z_o) = ζ_i(z_i) ζ_o(z_o) ∏_{k=1}^{K_i} h(θ_k) exp( η_ik^T T(θ_k) − A(η_ik) ),

where ζ_r(·), r ∈ {o, i, m}, are products of categorical distributions for the cluster labels z_r, and the goal is to use the posterior decomposition (1) to find the updated posterior approximation

  p(θ, z | y) ≈ q(θ, z) = ζ(z) ∏_{k=1}^{K} h(θ_k) exp( η_k^T T(θ_k) − A(η_k) ).   (3)

As mentioned in the previous section, the artificial ordering of components causes the naïve application of (1) with variational approximations to fail,
as disparate components from the approximate posteriors may be merged erroneously. This is demonstrated in Figure 3a, which shows results from a synthetic experiment (described in Section 4) ignoring component identification. As the number of parallel threads increases, more matching mistakes are made, leading to decreasing model quality.
To address this, first note that there is no issue with the first K_o components of q_m and q_i; these can be merged directly since they each correspond to the K_o components of q_o. Thus, the component identification problem reduces to finding the correspondence between the last K′_m = K_m − K_o components of the minibatch posterior and the last K′_i = K_i − K_o components of the intermediate posterior. For notational simplicity (and without loss of generality), fix the component ordering of the intermediate posterior q_i, and define σ : [K_m] → [K_i + K′_m] to be the 1-to-1 mapping from minibatch posterior component k to updated central posterior component σ(k), where [K] := {1, . . . , K}. The fact that the first K_o components have no ordering ambiguity can be expressed as σ(k) = k ∀k ∈ [K_o]. Note that the maximum number of components after merging is K_i + K′_m, since each of the last K′_m components in the minibatch posterior may correspond to new components in the intermediate posterior. After substituting the three variational approximations (2) into (1), the goal of the component identification optimization is to find the 1-to-1 mapping σ* that yields the largest updated posterior normalizing constant, i.e.
matches components with similar densities:

  σ* ← argmax_σ  Σ_z ∫_θ [ p(z_i, z_m | z_o) / ( p(z_i | z_o) p(z_m | z_o) ) ] q_o(θ, z_o)^{−1} q^σ_m(θ, z_m, z_o) q_i(θ, z_i, z_o)
  s.t.  q^σ_m(θ, z_m) = ζ^σ_m(z_m) ∏_{k=1}^{K_m} h(θ_{σ(k)}) exp( η_mk^T T(θ_{σ(k)}) − A(η_mk) )   (4)
        σ(k) = k ∀k ∈ [K_o],  σ 1-to-1,

where ζ^σ_m(z_m) is the distribution such that P_{ζ^σ_m}(z_mj = σ(k)) = P_{ζ_m}(z_mj = k). Taking the logarithm of the objective and exploiting the mean-field decoupling allows the separation of the objective into a sum of two terms: one expressing the quality of the matching between components (the integral over θ), and one that regularizes the final model size (the sum over z). While the first term is available in closed form, the second is in general not. Therefore, using the concavity of the logarithm function, Jensen's inequality yields a lower bound that can be used in place of the intractable original objective, resulting in the final component identification optimization:

  σ* ← argmax_σ  Σ_{k=1}^{K_i+K′_m} A(η̃^σ_k) + E^σ_ζ[ log p(z_i, z_m, z_o) ]
  s.t.  η̃^σ_k = η̃_ik + η̃^σ_mk − η̃_ok   (5)
        σ(k) = k ∀k ∈ [K_o],  σ 1-to-1.

A more detailed derivation of the optimization may be found in the supplement.
E\u03c3\ntation under the distribution \u03b6o(zo)\u03b6i(zi)\u03b6 \u03c3\n\nm(zm), and\n\n\u03b6 denotes expec-\n\nk \u2264 Kr\nk > Kr\n\n\u2200r \u2208 {o, i, m},\n\n\u02dc\u03b7\u03c3\nmk =\n\nk \u2208 \u03c3 ([Km])\nk /\u2208 \u03c3 ([Km])\n\n,\n\n(6)\n\n(cid:26) \u03b7rk\n\n\u03b70\n\n\u02dc\u03b7rk =\n\n(cid:26) \u03b7m\u03c3\u22121(k)\n\n\u03b70\n\nwhere \u03c3 ([Km]) denotes the range of the mapping \u03c3. The de\ufb01nitions in (6) ensure that the prior \u03b70\nis used whenever a posterior r \u2208 {i, m, o} does not contain a particular component k. The intuition\nfor the optimization (5) is that it combines \ufb01nding component correspondences with high similarity\n(via the log-partition function) with a regularization term2 on the \ufb01nal updated posterior model size.\nDespite its motivation from the Dirichlet process mixture, the component identi\ufb01cation optimization\n(5) is not speci\ufb01c to this model. Indeed, the derivation did not rely on any properties speci\ufb01c to the\nDirichlet process mixture; the optimization applies to any Bayesian nonparametric model with a set\nof \u201ccomponents\u201d \u03b8, and a set of combinatorial \u201cindicators\u201d z. For example, the optimization applies\nto the hierarchical Dirichlet process topic model [10] with topic word distributions \u03b8 and local-to-\nglobal topic correspondences z, and to the beta process latent feature model [4] with features \u03b8 and\n\n(cid:104)\n\n\u03b6o(zo)\u03b6i(zi)\u03b6 \u03c3\n\nm(zm)\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p(zi, zm, zo)\n\n(cid:105)\n\n.\n\n2This is equivalent to the KL-divergence regularization \u2212KL\n\n4\n\n\fbinary assignment vectors z. The form of the objective in the component identi\ufb01cation optimization\n(5) re\ufb02ects this generality. 
In order to apply the proposed streaming, distributed method to a particular model, one simply needs a black-box variational inference algorithm that computes posteriors of the form (2), and a way to compute or bound the expectation in the objective of (5).

2.3 Updating the central posterior

To update the central posterior, the node first locks it and solves for σ* via (5). Locking prevents other nodes from solving (5) or modifying the central posterior, but does not prevent other nodes from reading the central posterior, obtaining minibatches, or performing inference; the synthetic experiment in Section 4 shows that this does not incur a significant time penalty in practice. Then the processing node transmits σ* and its minibatch variational posterior to the central processing node, where the product decomposition (1) is used to find the updated central variational posterior q in (3), with parameters

  K = max{ K_i, max_{k∈[K_m]} σ*(k) },   ζ(z) = ζ_i(z_i) ζ_o(z_o) ζ^{σ*}_m(z_m),   η_k = η̃_ik + η̃^{σ*}_mk − η̃_ok.   (7)

Finally, the node unlocks the central posterior, and the next processing node to receive a new minibatch will use the above K, ζ(z), and η_k from the central node as their K_o, ζ_o(z_o), and η_ok.

3 Application to the Dirichlet process mixture model

The expectation in the objective of (5) is typically intractable to compute in closed form; therefore, a suitable lower bound may be used in its place. This section presents such a bound for the Dirichlet process, and discusses the application of the proposed inference framework to the Dirichlet process mixture model using the developed bound. Crucially, the lower bound decomposes such that the optimization (5) becomes a maximum-weight bipartite matching problem.
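Before turning to the matching itself, note that the bookkeeping in (6) and (7) is mechanical once σ* is known. A hypothetical sketch (0-indexed components; `sigma` is a dict from minibatch to central component indices; all names are illustrative, not from the paper's implementation):

```python
import numpy as np

def central_update(sigma, eta_i, eta_m, eta_o, eta0):
    """Sketch of the central update (7): K = max(K_i, max_k sigma(k)), and
    eta_k = eta~_ik + eta~^sigma_mk - eta~_ok, substituting the prior eta0
    whenever a posterior lacks component k (the definitions in (6)).

    eta_i, eta_m, eta_o: lists of natural parameter vectors per component.
    """
    K = max([len(eta_i)] + [j + 1 for j in sigma.values()])
    inv = {j: k for k, j in sigma.items()}  # sigma^{-1}
    eta = []
    for k in range(K):
        e_i = eta_i[k] if k < len(eta_i) else eta0
        e_o = eta_o[k] if k < len(eta_o) else eta0
        e_m = eta_m[inv[k]] if k in inv else eta0
        eta.append(np.asarray(e_i) + np.asarray(e_m) - np.asarray(e_o))
    return K, eta
```

Components outside the range of σ fall back to the prior on two of the three terms, so a brand-new minibatch component enters the central posterior essentially unchanged.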
Such problems are solvable in polynomial time [22] by the Hungarian algorithm, leading to a tractable component identification step in the proposed streaming, distributed framework.

3.1 Regularization lower bound

For the Dirichlet process with concentration parameter α > 0, p(z_i, z_m, z_o) is the Exchangeable Partition Probability Function (EPPF) [23]

  p(z_i, z_m, z_o) ∝ α^{|K|−1} ∏_{k∈K} (n_k − 1)!,   (8)

where n_k is the amount of data assigned to cluster k, and K is the set of labels of nonempty clusters. Given that the variational distribution ζ_r(z_r), r ∈ {i, m, o}, is a product of independent categorical distributions ζ_r(z_r) = ∏_{j=1}^{N_r} ∏_{k=1}^{K_r} π_{rjk}^{1[z_rj = k]}, Jensen's inequality may be used to bound the regularization in (5) below (see the supplement for further details) by

  E^σ_ζ[ log p(z_i, z_m, z_o) ] ≥ Σ_{k=1}^{K_i+K′_m} [ ( 1 − e^{s̃^σ_k} ) log α + log Γ( max{ 2, t̃^σ_k } ) ] + C,   (9)

where C is a constant with respect to the component mapping σ, and

  s̃^σ_k = s̃_ik + s̃^σ_mk + s̃_ok,   t̃^σ_k = t̃_ik + t̃^σ_mk + t̃_ok,   (10)

  s̃_rk = { Σ_{j=1}^{N_r} log(1 − π_rjk),  k ≤ K_r ;  0,  k > K_r }   ∀r ∈ {o, i, m}
  s̃^σ_mk = { Σ_{j=1}^{N_m} log(1 − π_{mj σ^{−1}(k)}),  k ∈ σ([K_m]) ;  0,  k ∉ σ([K_m]) }
  t̃_rk = { Σ_{j=1}^{N_r} π_rjk,  k ≤ K_r ;  0,  k > K_r }   ∀r ∈ {o, i, m}
  t̃^σ_mk = { Σ_{j=1}^{N_m} π_{mj σ^{−1}(k)},  k ∈ σ([K_m]) ;  0,  k ∉ σ([K_m]) }.

Note that the bound (9) allows incremental updates: after finding the optimal mapping σ*, the central update (7) can be augmented by updating
the values of s_k and t_k on the central node to

  s_k ← s̃_ik + s̃^{σ*}_mk + s̃_ok,   t_k ← t̃_ik + t̃^{σ*}_mk + t̃_ok.   (11)

[Figure 2: The Dirichlet process regularization and lower bound, with (2a) fully uncertain labelling and varying number of clusters, and (2b) the number of clusters fixed with varying labelling uncertainty.]

As with K, η_k, and ζ from (7), after performing the regularization statistics update (11), a processing node that receives a new minibatch will use the above s_k and t_k as their s_ok and t_ok, respectively.
Figure 2 demonstrates the behavior of the lower bound in a synthetic experiment with N = 100 datapoints for various DP concentration parameter values α ∈ [10^{−3}, 10^3]. The true regularization E_ζ[log p(z)] was computed by sample approximation with 10^4 samples. In Figure 2a, the number of clusters K was varied, with symmetric categorical label weights set to 1/K. This figure demonstrates two important phenomena. First, the bound increases as K → 0; in other words, it gives preference to fewer, larger clusters, which is the typical BNP "rich get richer" property. Second, the behavior of the bound as K → N depends on the concentration parameter α – as α increases, more clusters are preferred. In Figure 2b, the number of clusters K was fixed to 10, and the categorical label weights were sampled from a symmetric Dirichlet distribution with parameter γ ∈ [10^{−3}, 10^3]. This figure demonstrates that the bound does not degrade significantly with high labelling uncertainty, and is nearly exact for low labelling uncertainty.
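The statistics behind (9)–(11) are cheap to evaluate. The following is a rough, hypothetical illustration (function and variable names are ours, not the paper's): `pi` is the concatenated label-probability matrix for a candidate merged configuration, and the σ-independent constant C is dropped.

```python
import math

def dp_regularization_bound(pi, alpha):
    """Lower bound (9) on the DP regularization E[log p(z)], up to the
    sigma-independent constant C. pi[j][k] is the variational probability
    that datapoint j takes label k (all minibatches concatenated after
    applying the candidate matching)."""
    K = len(pi[0])
    bound = 0.0
    for k in range(K):
        s_k = sum(math.log(1.0 - row[k]) for row in pi)  # log P(cluster k empty)
        t_k = sum(row[k] for row in pi)                   # expected size of cluster k
        bound += (1.0 - math.exp(s_k)) * math.log(alpha) \
                 + math.lgamma(max(2.0, t_k))
    return bound
```

For α < 1 each probably-nonempty cluster pays a log α penalty, while large expected cluster sizes earn a log Γ reward, reproducing the "rich get richer" preference seen in Figure 2a.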
Overall, Figure 2a demonstrates that the proposed lower bound exhibits similar behaviors to the true regularization, supporting its use in the optimization (5).

3.2 Solving the component identification optimization

Given that both the regularization (9) and component matching score in the objective (5) decompose as a sum of terms for each k ∈ [K_i + K′_m], the objective can be rewritten using a matrix of matching scores R ∈ ℝ^{(K_i+K′_m)×(K_i+K′_m)} and selector variables X ∈ {0, 1}^{(K_i+K′_m)×(K_i+K′_m)}. Setting X_kj = 1 indicates that component k in the minibatch posterior is matched to component j in the intermediate posterior (i.e. σ(k) = j), providing a score R_kj defined using (6) and (10) as

  R_kj = A( η̃_ij + η̃_mk − η̃_oj ) + ( 1 − e^{s̃_ij + s̃_mk + s̃_oj} ) log α + log Γ( max{ 2, t̃_ij + t̃_mk + t̃_oj } ).   (12)

The optimization (5) can be rewritten in terms of X and R as

  X* ← argmax_X  tr[ X^T R ]   (13)
  s.t.  X1 = 1,  X^T 1 = 1,  X_kk = 1 ∀k ∈ [K_o],
        X ∈ {0, 1}^{(K_i+K′_m)×(K_i+K′_m)},   1 = [1, . . . , 1]^T.   (14)

The first two constraints express the 1-to-1 property of σ(·). The constraint X_kk = 1 ∀k ∈ [K_o] fixes the upper K_o × K_o block of X to I (due to the fact that the first K_o components are matched directly), and the off-diagonal blocks to 0. Denoting X′, R′ to be the lower right (K′_i + K′_m) × (K′_i + K′_m) blocks of X, R, the remaining optimization problem is a linear assignment problem on X′ with cost matrix −R′, which can be solved using the Hungarian algorithm³.
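Because the score decomposes per assignment, any off-the-shelf assignment solver applies. For illustration only, here is a brute-force stand-in for the Hungarian algorithm (a hypothetical sketch, feasible only for a handful of unmatched components; `R` is the score matrix of (12) with 0-indexed rows and columns, and the first `K_o` assignments are fixed to the identity):

```python
from itertools import permutations

def best_matching(R, K_o):
    """Exhaustively solve (13): maximize tr(X^T R) over permutation
    matrices X with X_kk = 1 for the first K_o components. Returns the
    mapping sigma (as a list: row k -> column sigma[k]) and its score."""
    K = len(R)
    best_sigma, best_score = None, float("-inf")
    for perm in permutations(range(K_o, K)):
        sigma = list(range(K_o)) + list(perm)  # first K_o rows fixed
        score = sum(R[k][sigma[k]] for k in range(K))
        if score > best_score:
            best_sigma, best_score = sigma, score
    return best_sigma, best_score
```

In practice one would replace the permutation loop with a polynomial-time assignment solver operating on the free lower-right block, exactly as the text describes.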
Note that if K_m = K_o or K_i = K_o, this implies that no matching problem needs to be solved – the first K_o components of the minibatch posterior are matched directly, and the last K′_m are set as new components. In practical implementation of the framework, new clusters are typically discovered at a diminishing rate as more data are observed, so the number of matching problems that are solved likewise tapers off. The final optimal component mapping σ* is found by finding the nonzero elements of X*:

  σ*(k) ← argmax_j X*_kj   ∀k ∈ [K_m].

³For the experiments in this work, we used the implementation at github.com/hrldcpr/hungarian.

[Figure 3: Synthetic results over 30 trials. (3a-3b) Computation time and test log likelihood for SDA-DP with varying numbers of parallel threads, with component identification disabled (3a) and enabled (3b). (3c) Test log likelihood traces for SDA-DP (40 threads) and the comparison algorithms. (3d) Histogram of computation time (in microseconds) to solve the component identification optimization. (3e) Number of clusters and number of component identification problems solved as a function of the number of minibatch updates (40 threads). (3f) Final number of clusters and matchings solved with varying numbers of parallel threads.]

4 Experiments

In this section, the proposed inference framework is evaluated on the DP Gaussian mixture with a normal-inverse-Wishart (NIW) prior.
We compare the streaming, distributed procedure coupled with standard variational inference [24] (SDA-DP) to five state-of-the-art inference algorithms: memoized online variational inference (moVB) [13], stochastic online variational inference (SVI) [9] with learning rate (t + 10)^{−1/2}, sequential variational approximation (SVA) [7] with cluster creation threshold 10^{−1} and prune/merge threshold 10^{−3}, subcluster splits MCMC (SC) [14], and batch variational inference (Batch) [24]. Priors were set by hand and all methods were initialized randomly. Methods that use multiple passes through the data (e.g. moVB, SVI) were allowed to do so. moVB was allowed to make birth/death moves, while SVI/Batch had fixed truncations. All experiments were performed on a computer with 24 CPU cores and 12GiB of RAM.
Synthetic: This dataset consisted of 100,000 2-dimensional vectors generated from a Gaussian mixture model with 100 clusters and a NIW(μ_0, κ_0, Ψ_0, ν_0) prior with μ_0 = 0, κ_0 = 10^{−3}, Ψ_0 = I, and ν_0 = 4. The algorithms were given the true NIW prior, DP concentration α = 5, and minibatches of size 50. SDA-DP minibatch inference was truncated to K = 50 components, and all other algorithms were truncated to K = 200 components. Figure 3 shows the results from the experiment over 30 trials, which illustrate a number of important properties of SDA-DP. First and foremost, ignoring the component identification problem leads to decreasing model quality with increasing number of parallel threads, since more matching mistakes are made (Figure 3a).
Second, if compo-\nnent identi\ufb01cation is properly accounted for using the proposed optimization, increasing the number\nof parallel threads reduces execution time, but does not affect the \ufb01nal model quality (Figure 3b).\nThird, SDA-DP (with 40 threads) converges to the same \ufb01nal test log likelihood as the comparison\nalgorithms in signi\ufb01cantly reduced time (Figure 3c). Fourth, each component identi\ufb01cation opti-\nmization typically takes \u223c 10\u22125 seconds, and thus matching accounts for less than a millisecond of\ntotal computation and does not affect the overall computation time signi\ufb01cantly (Figure 3d). Fifth,\nthe majority of the component matching problems are solved within the \ufb01rst 80 minibatch updates\n(out of a total of 2,000) \u2013 afterwards, the true clusters have all been discovered and the processing\nnodes contribute to those clusters rather than creating new ones, as per the discussion at the end of\nSection 3.2 (Figure 3e). Finally, increased parallelization can be advantageous in discovering the\ncorrect number of clusters; with only one thread, mistakes made early on are built upon and persist,\nwhereas with more threads there are more component identi\ufb01cation problems solved, and thus more\nchances to discover the correct clusters (Figure 3f).\n\n7\n\n12481624324048#Threads0.02.04.06.08.010.012.0CPUTime(s)-11.0-10.0-9.0-8.0-7.0-6.0-5.0TestLogLikelihoodCPUTimeTestLogLikelihood12481624324048#Threads0.02.04.06.08.010.012.0CPUTime(s)-11.0-10.0-9.0-8.0-7.0-6.0-5.0TestLogLikelihoodCPUTimeTestLogLikelihood-5-4-3-2-101234Time(Log10s)-11-10-9-8-7-6TestLogLikelihoodSDA-DPBatchSVASVImoVBSC010203040506070MergeTime(microseconds)051015202530354045Count020406080100#MinibatchesMerged020406080100120140Count#Clusters#MatchingsTrue#Clusters12481624324048#Threads020406080100120140Count#Clusters#MatchingsTrue#Clusters\f(a) Airplane trajectory clusters\n\n(b) Airplane cluster weights\n\n(c) MNIST clusters\n\n(d) Numerical 
results on Airplane, MNIST, and SUN

              Airplane             MNIST                SUN
Algorithm     Time (s)   TestLL    Time (s)   TestLL    Time (s)   TestLL
SDA-DP        0.66       -0.55     3.0        -145.3    9.4        -150.3
SVI           1.50       -0.59     117.4      -147.1    568.9      -149.9
SVA           3.00       -4.71     57.0       -145.0    10.4       -152.8
moVB          0.69       -0.72     645.9      -149.2    1258.1     -149.7
SC            393.6      -1.06     1639.1     -146.8    1618.4     -150.6
Batch         1.07       -0.72     829.6      -149.5    1881.5     -149.7

Figure 4: (4a-4b) Highest-probability instances and counts for 10 trajectory clusters generated by SDA-DP. (4c) Highest-probability instances for 20 clusters discovered by SDA-DP on MNIST. (4d) Numerical results.

Airplane Trajectories: This dataset consisted of ~3,000,000 automatic dependent surveillance broadcast (ADS-B) messages collected from planes across the United States during the period 2013-03-22 01:30:00UTC to 2013-03-28 12:00:00UTC. The messages were connected based on plane call sign and time stamp, and erroneous trajectories were filtered based on reasonable spatial/temporal bounds, yielding 15,022 trajectories with 1,000 held out for testing. The latitude/longitude points in each trajectory were fit via linear regression, and the 3-dimensional parameter vectors were clustered. Data was split into minibatches of size 100, and SDA-DP used 16 parallel threads.

MNIST Digits [25]: This dataset consisted of 70,000 28×28 images of hand-written digits, with 10,000 held out for testing. The images were reduced to 20 dimensions with PCA prior to clustering. Data was split into minibatches of size 500, and SDA-DP used 48 parallel threads.

SUN Images [26]: This dataset consisted of 108,755 images from 397 scene categories, with 8,755 held out for testing. The images were reduced to 20 dimensions with PCA prior to clustering. Data was split into minibatches of size 500, and SDA-DP used 48 parallel threads.

Figure 4 shows the results from the experiments on the three real datasets.
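The component identification step exercised throughout these experiments can be viewed, in simplified form, as a rectangular assignment problem between minibatch and central posterior components. The sketch below is a minimal illustration of that flavor only: it matches components by squared distance between their means and charges a fixed penalty new_cost (an assumed hyperparameter, not from the paper) for declaring a minibatch component new, whereas the paper's actual objective is a model-specific bound derived from the minibatch posterior decomposition.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(central_means, minibatch_means, new_cost=10.0):
    """Match each minibatch component to a central component or a 'new' slot.

    Simplified stand-in for the paper's combinatorial optimization: cost of a
    match is squared distance between component means; `new_cost` (an assumed
    penalty) is paid for opening a brand-new central component.
    """
    Kc, Km = len(central_means), len(minibatch_means)
    # One column per existing central component, plus Km "new component" slots.
    cost = np.full((Km, Kc + Km), new_cost)
    cost[:, :Kc] = ((minibatch_means[:, None, :]
                     - central_means[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # A column index >= Kc means "create a new central component" (None).
    return {int(r): (int(c) if c < Kc else None) for r, c in zip(rows, cols)}

# Two central components; three minibatch components, one of them novel.
central = np.array([[0.0, 0.0], [5.0, 5.0]])
mini = np.array([[5.1, 4.9], [100.0, 100.0], [0.2, -0.1]])
print(match_components(central, mini))  # {0: 1, 1: None, 2: 0}
```

The rectangular cost matrix is the point of interest here: giving every minibatch component its own "new" column means a match always exists, which is what allows this style of matching to remain truncation-free.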
From a qualitative standpoint, SDA-DP discovers sensible clusters in the data, as demonstrated in Figures 4a–4c. However, an important quantitative result is highlighted by Table 4d: the larger a dataset is, the more the benefits of parallelism provided by SDA-DP become apparent. SDA-DP consistently provides a model quality that is competitive with the other algorithms, but requires orders of magnitude less computation time, corroborating similar findings on the synthetic dataset.

5 Conclusions

This paper presented a streaming, distributed, asynchronous inference algorithm for Bayesian nonparametric models, with a focus on the combinatorial problem of matching minibatch posterior components to central posterior components during asynchronous updates. The main contributions are a component identification optimization based on a minibatch posterior decomposition, a tractable bound on the objective for the Dirichlet process mixture, and experiments demonstrating the performance of the methodology on large-scale datasets. While the present work focused on the DP mixture as a guiding example, it is not limited to this model; exploring the application of the proposed methodology to other BNP models is a potential area for future research.

Acknowledgments

This work was supported by the Office of Naval Research under ONR MURI grant N000141110688.

References

[1] Agostino Nobile. Bayesian Analysis of Finite Mixture Distributions. PhD thesis, Carnegie Mellon University, 1994.

[2] Jeffrey W. Miller and Matthew T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems 26, 2013.

[3] Yee Whye Teh. Dirichlet processes. In Encyclopedia of Machine Learning. Springer, New York, 2010.

[4] Thomas L. Griffiths and Zoubin Ghahramani.
Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 18, 2005.

[5] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems 26, 2013.

[6] Trevor Campbell and Jonathan P. How. Approximate decentralized Bayesian inference. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.

[7] Dahua Lin. Online learning of nonparametric mixture models via sequential variational approximation. In Advances in Neural Information Processing Systems 26, 2013.

[8] Xiaole Zhang, David J. Nott, Christopher Yau, and Ajay Jasra. A sequential algorithm for fast fitting of Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 23(4):1143–1162, 2014.

[9] Matt Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

[10] Chong Wang, John Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

[11] Michael Bryant and Erik Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 25, 2012.

[12] Chong Wang and David Blei. Truncation-free stochastic variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25, 2012.

[13] Michael Hughes and Erik Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Advances in Neural Information Processing Systems 26, 2013.

[14] Jason Chang and John Fisher III. Parallel sampling of DP mixture models using sub-cluster splits.
In Advances in Neural Information Processing Systems 26, 2013.

[15] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.

[16] Carlos M. Carvalho, Hedibert F. Lopes, Nicholas G. Polson, and Matt A. Taddy. Particle learning for general mixtures. Bayesian Analysis, 5(4):709–740, 2010.

[17] Matthew Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4):795–809, 2000.

[18] Ajay Jasra, Chris Holmes, and David Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67, 2005.

[19] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[20] Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Proceedings of the International Conference on Machine Learning, 2009.

[21] Avinava Dubey, Sinead Williamson, and Eric Xing. Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.

[22] Jack Edmonds and Richard Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the Association for Computing Machinery, 19:248–264, 1972.

[23] Jim Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145–158, 1995.

[24] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.

[25] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST database of handwritten digits.
Online: yann.lecun.com/exdb/mnist.

[26] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN 397 image database. Online: vision.cs.princeton.edu/projects/2010/SUN.