{"title": "The Population Posterior and Bayesian Modeling on Streams", "book": "Advances in Neural Information Processing Systems", "page_first": 1153, "page_last": 1161, "abstract": "Many modern data analysis problems involve inferences from streaming data. However, streaming data is not easily amenable to the standard probabilistic modeling approaches, which assume that we condition on finite data. We develop population variational Bayes, a new approach for using Bayesian modeling to analyze streams of data. It approximates a new type of distribution, the population posterior, which combines the notion of a population distribution of the data with Bayesian inference in a probabilistic model. We study our method with latent Dirichlet allocation and Dirichlet process mixtures on several large-scale data sets.", "full_text": "The Population Posterior\n\nand Bayesian Modeling on Streams\n\nJames McInerney\nColumbia University\n\njames@cs.columbia.edu\n\nRajesh Ranganath\nPrinceton University\n\nrajeshr@cs.princeton.edu\n\nDavid Blei\n\nColumbia University\n\ndavid.blei@columbia.edu\n\nAbstract\n\nMany modern data analysis problems involve inferences from streaming data. How-\never, streaming data is not easily amenable to the standard probabilistic modeling\napproaches, which require conditioning on \ufb01nite data. We develop population\nvariational Bayes, a new approach for using Bayesian modeling to analyze streams\nof data. It approximates a new type of distribution, the population posterior, which\ncombines the notion of a population distribution of the data with Bayesian in-\nference in a probabilistic model. We develop the population posterior for latent\nDirichlet allocation and Dirichlet process mixtures. We study our method with\nseveral large-scale data sets.\n\n1\n\nIntroduction\n\nProbabilistic modeling has emerged as a powerful tool for data analysis. 
It is an intuitive language\nfor describing assumptions about data and provides ef\ufb01cient algorithms for analyzing real data under\nthose assumptions. The main idea comes from Bayesian statistics. We encode our assumptions about\nthe data in a structured probability model of hidden and observed variables; we condition on a data\nset to reveal the posterior distribution of the hidden variables; and we use the resulting posterior as\nneeded, for example to form predictions through the posterior predictive distribution or to explore the\ndata through the posterior expectations of the hidden variables.\nMany modern data analysis problems involve inferences from streaming data. Examples include\nexploring the content of massive social media streams (e.g., Twitter, Facebook), analyzing live video\nstreams, estimating the preferences of users on an online platform for recommending new items, and\npredicting human mobility patterns for anticipatory computing. Such problems, however, cannot\neasily take advantage of the standard approach to probabilistic modeling, which requires that we\ncondition on a \ufb01nite data set.\nThis might be surprising to some readers; after all, one of the tenets of the Bayesian paradigm is that\nwe can update our posterior when given new information. (\u201cYesterday\u2019s posterior is today\u2019s prior.\u201d)\nBut there are two problems with using Bayesian updating on data streams. The \ufb01rst problem is that\nBayesian inference computes posterior uncertainty under the assumption that the model is correct.\nIn theory this is sensible, but only in the impossible scenario where the data truly came from the\nproposed model. In practice, all models provide approximations to the data-generating distribution,\nand when the model is incorrect, the uncertainty that maximizes predictive likelihood may be larger or\nsmaller than the Bayesian posterior variance. 
This problem is exacerbated in potentially never-ending streams; after seeing only a few data points, uncertainty is high, but eventually the model becomes overconfident.

The second problem is that the data stream might change over time. This is an issue because, frequently, our goal in applying probabilistic models to streams is not to characterize how they change, but rather to accommodate it. That is, we would like for our current estimate of the latent variables to be accurate to the current state of the stream and to adapt to how the stream might slowly change. (This is in contrast, for example, to time series modeling.) Traditional Bayesian updating cannot handle this. Either we explicitly model the time series, and pay a heavy inferential cost, or we tacitly assume that the data are exchangeable, i.e., that the underlying distribution does not change.

In this paper we develop new ideas for analyzing data streams with probabilistic models. Our approach combines the frequentist notion of the population distribution with probabilistic models and Bayesian inference.

Main idea: The population posterior. Consider a latent variable model of α data points. (This is unconventional notation; we will describe why we use it below.) Following [14], we define the model to have two kinds of hidden variables: global hidden variables β contain latent structure that potentially governs any data point; local hidden variables z_i contain latent structure that only governs the ith data point. Such models are defined by the joint,

p(β, z, x) = p(β) ∏_{i=1}^{α} p(x_i, z_i | β),   (1)

where x = x_{1:α} and z = z_{1:α}. Traditional Bayesian statistics conditions on a fixed data set x to obtain the posterior distribution of the hidden variables p(β, z | x). As we discussed, this framework cannot accommodate data streams.
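To make the global/local structure in Eq. 1 concrete, the generative process can be simulated for a toy model. The sketch below uses a small Gaussian mixture, where β is the set of component means (global) and each z_i is a mixture assignment (local); the component count, priors, and value of α here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, alpha = 3, 1000  # number of components; alpha data points, as in Eq. 1

# Global hidden variable beta: drawn once from its prior p(beta),
# it potentially governs every data point.
beta = rng.normal(0.0, 5.0, size=K)

# Local hidden variables z_i and observations x_i: each pair is drawn
# independently from p(x_i, z_i | beta) = p(z_i) p(x_i | z_i, beta).
z = rng.integers(0, K, size=alpha)   # mixture assignment for data point i
x = rng.normal(beta[z], 1.0)         # observation given beta and z_i
```

Conditioning on x to recover β and z is the usual posterior inference problem; the population posterior replaces the fixed x with draws from the stream.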
We need a different way to use the model.

We define a new distribution, the population posterior, which enables us to consider Bayesian modeling of streams. Suppose we observe α data points independently from the underlying population distribution, X ∼ F_α. This induces a posterior p(β, z | X), which is a function of the random data. The population posterior is the expected value of this distribution,

E_{F_α}[p(z, β | X)] = E_{F_α}[ p(β, z, X) / p(X) ].   (2)

Notice that this distribution is not a function of observed data; it is a function of the population distribution F and the data size α. The data size is a hyperparameter that can be set; it effectively controls the variance of the population posterior. How to best set it depends on how close the model is to the true data distribution.

We have defined a new problem. Given an endless stream of data points coming from F and a value for α, our goal is to approximate the corresponding population posterior. In this paper, we will approximate it through an algorithm based on variational inference and stochastic optimization. As we will show, our algorithm justifies applying a variant of stochastic variational inference [14] to a data stream. We used our method to analyze several data streams with two modern probabilistic models, latent Dirichlet allocation [5] and Dirichlet process mixtures [11]. With held-out likelihood as a measure of model fitness, we found our method to give better models of the data than approaches based on full Bayesian inference [14] or Bayesian updating [8].

Related work. Researchers have proposed several methods for inference on streams of data. Refs. [1, 9, 27] propose extending Markov chain Monte Carlo methods for streaming data. However, sampling-based approaches do not scale to massive datasets; the variational approximation enables more scalable inference.
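The population posterior in Eq. 2 can be computed by brute force for a conjugate toy model. In the Beta-Bernoulli sketch below (the prior, the population frequency f, and α are all assumed for illustration), each simulated data set of size α yields a Beta posterior, and averaging those posterior densities over data sets approximates E_{F_α}[p(θ | X)].

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)

a0, b0 = 1.0, 1.0   # Beta prior on the Bernoulli parameter
alpha = 50          # data-set size hyperparameter from Eq. 2
f = 0.3             # population frequency of the (assumed) stream F
grid = np.linspace(0.01, 0.99, 99)

def beta_pdf(x, a, b):
    """Density of Beta(a, b), via log-gamma for numerical stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x))

# Monte Carlo estimate of E_F[p(theta | X)]: draw data sets of size alpha
# from F, form each conjugate posterior Beta(a0 + s, b0 + alpha - s),
# and average the densities.
density = np.zeros_like(grid)
n_sims = 2000
for _ in range(n_sims):
    s = rng.binomial(alpha, f)   # sufficient statistic of one data set
    density += beta_pdf(grid, a0 + s, b0 + alpha - s)
density /= n_sims
```

Because the averaging is over data sets rather than conditioning on a single one, the result is a function only of F and α, as the text notes.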
In variational inference, Ref. [15] proposes online variational inference by exponentially forgetting the variational parameters associated with old data. Stochastic variational inference (SVI) [14] also decays parameters derived from old data, but interprets this in the context of stochastic optimization. Neither of these methods applies to streaming data; both implicitly rely on the data being of known size (even when subsampling data to obtain noisy gradients).

To apply the variational approximation to streaming data, Ref. [8] and Ref. [12] both propose Bayesian updating of the approximating family; Ref. [22] adapts this framework to nonparametric mixture models. Here we take a different approach, changing the variational objective to incorporate a population distribution and then following stochastic gradients of this new objective. In Section 3 we show that this generally performs better than Bayesian updating.

Independently, Ref. [23] applied SVI to streaming data by accumulating new data points into a growing window and then uniformly sampling from this window to update the variational parameters. Our method justifies that approach. Further, they propose updating parameters along a trust region, instead of following (natural) gradients, as a way of mitigating local optima. This innovation can be incorporated into our method.

2 Variational Inference for the Population Posterior

We develop population variational Bayes, a method for approximating the population posterior in Eq. 2. Our method is based on variational inference and stochastic optimization.

The F-ELBO. The idea behind variational inference is to approximate difficult-to-compute distributions through optimization [16, 25].
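As a minimal instance of this idea (a toy sketch with every number assumed): fit a univariate Gaussian q = N(m, s²) to a Gaussian target p by gradient descent on the closed-form KL(q || p).

```python
import numpy as np

# Target p = N(mu, sigma^2); variational family q = N(m, s^2).
mu, sigma = 2.0, 1.5

def kl_gauss(m, log_s):
    """Closed-form KL(q || p) between two univariate Gaussians."""
    s = np.exp(log_s)
    return np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5

# Minimize the KL by gradient descent on (m, log s).
m, log_s = 0.0, 0.0
lr = 0.1
for _ in range(500):
    s = np.exp(log_s)
    grad_m = (m - mu) / sigma**2          # d KL / d m
    grad_log_s = s**2 / sigma**2 - 1.0    # d KL / d (log s)
    m -= lr * grad_m
    log_s -= lr * grad_log_s
```

Here the variational family contains the target, so q recovers it exactly; for the richer posteriors considered below, the same KL-minimization principle applies but the optimum is typically only an approximation.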
We introduce an approximating family of distributions over the latent variables, q(β, z), and try to find the member of that family that minimizes the Kullback-Leibler (KL) divergence to the target distribution.

Population variational Bayes (VB) uses variational inference to approximate the population posterior in Eq. 2. It aims to minimize the KL divergence from an approximating family,

q*(β, z) = argmin_q KL(q(β, z) || E_{F_α}[p(β, z | X)]).   (3)

As for the population posterior, this objective is a function of the population distribution of α data points, F_α. Notice the difference from classical VB. In classical VB, we optimize the KL divergence between q(·) and a posterior, KL(q(β, z) || p(β, z | x)); its objective is a function of a fixed data set x. In contrast, the objective in Eq. 3 is a function of the population distribution F_α.

We will use the mean-field variational family, where each latent variable is independent and governed by a free parameter,

q(β, z) = q(β | λ) ∏_{i=1}^{α} q(z_i | φ_i).   (4)

The free variational parameters are the global parameters λ and the local parameters φ_i. Though we focus on the mean-field family, extensions could consider structured families [13, 20], where there is dependence between variables.

In classical VB, where we approximate the usual posterior, we cannot compute the KL. Thus, we optimize a proxy objective called the ELBO (evidence lower bound) that is equal to the negative KL up to an additive constant. Maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior.

In population VB we also optimize a proxy objective, the F-ELBO.
The F-ELBO is an expectation of the ELBO under the population distribution of the data,

L(λ, φ; F_α) = E_{F_α}[ E_q[ log p(β) − log q(β | λ) + Σ_{i=1}^{α} ( log p(X_i, Z_i | β) − log q(Z_i) ) ] ].   (5)

The F-ELBO is a lower bound on the population evidence log E_{F_α}[p(X)] and a lower bound on the negative KL to the population posterior. (See Appendix A.) The inner expectation is over the latent variables β and Z, and is a function of the variational distribution q(·). The outer expectation is over the α random data points X, and is a function of the population distribution F_α(·). The F-ELBO is thus a function of both the variational distribution and the population distribution.

As we mentioned, classical VB maximizes the (classical) ELBO, which is equivalent to minimizing the KL. The F-ELBO, in contrast, is only a bound on the negative KL to the population posterior. Thus maximizing the F-ELBO is suggestive but is not guaranteed to minimize the KL. That said, our studies show that this is a good quantity to optimize, and in Appendix A we show that the F-ELBO does minimize E_{F_α}[KL(q(z, β) || p(z, β | X))], the population KL.

Conditionally conjugate models. In the next section we will develop a stochastic optimization algorithm to maximize Eq. 5. First, we describe the class of models that we will work with. Following [14] we focus on conditionally conjugate models. A conditionally conjugate model is one where each complete conditional—the conditional distribution of a latent variable given all the other latent variables and the observations—is in the exponential family. This class includes many models in modern machine learning, such as mixture models, topic models, many Bayesian nonparametric models, and some hierarchical regression models. Using conditionally conjugate models simplifies many calculations in variational inference.

Under the joint in Eq. 1, we can write a conditionally conjugate model with two exponential families:

p(z_i, x_i | β) = h(z_i, x_i) exp{β⊤ t(z_i, x_i) − a(β)}   (6)
p(β | ζ) = h(β) exp{ζ⊤ t(β) − a(ζ)}.   (7)

We overload notation for base measures h(·), sufficient statistics t(·), and log normalizers a(·). Note that ζ is the hyperparameter and that t(β) = [β, −a(β)] [3].

In conditionally conjugate models each complete conditional is in an exponential family, and we use these families as the factors in the variational distribution in Eq. 4. Thus λ indexes the same family as p(β | z, x) and φ_i indexes the same family as p(z_i | x_i, β). For example, in latent Dirichlet allocation [5], the complete conditional of the topics is a Dirichlet; the complete conditional of the per-document topic mixture is a Dirichlet; and the complete conditional of the per-word topic assignment is a categorical. (See [14] for details.)

Population variational Bayes. We have described the ingredients of our problem. We are given a conditionally conjugate model, described in Eqs. 6 and 7, a parameterized variational family in Eq. 4, and a stream of data from an unknown population distribution F. Our goal is to optimize the F-ELBO in Eq. 5 with respect to the variational parameters.

The F-ELBO is a function of the population distribution, which is an unknown quantity.
To overcome\nthis hurdle, we will use the stream of data from F to form noisy gradients of the F-ELBO; we then\nupdate the variational parameters with stochastic optimization (a technique to \ufb01nd a local optimum\nby following noisy unbiased gradients [7]).\nBefore describing the algorithm, however, we acknowledge one technical detail. Mirroring [14], we\noptimize an F-ELBO that is only a function of the global variational parameters. The one-parameter\npopulation VI objective is LF\u03b1 (\u03bb ) = max\u03c6 LF\u03b1 (\u03bb ,\u03c6 ). This implicitly optimizes the local parameter\nas a function of the global parameter and allows us to convert the potentially in\ufb01nite-dimensional\noptimization problem in Eq. 5 to a \ufb01nite one. The resulting objective is identical to Eq. 5, but with \u03c6\nreplaced by \u03c6 (\u03bb ). (Details are in Appendix B).\nThe next step is to form a noisy gradient of the F-ELBO so that we can use stochastic optimization\nto maximize it. Stochastic optimization maximizes an objective by following noisy and unbiased\ngradients [7, 19]. We will write the gradient of the F-ELBO as an expectation with respect to F\u03b1, and\nthen use Monte Carlo estimates to form noisy gradients.\nWe compute the gradient of the F-ELBO by bringing the gradient operator inside the expectations of\nEq. 
5.¹ This results in a population expectation of the classical VB gradient with α data points. We take the natural gradient [2], which has a simple form in conditionally conjugate models [14]. Specifically, the natural gradient of the F-ELBO is

∇̂_λ L(λ; F_α) = ζ − λ + E_{F_α}[ Σ_{i=1}^{α} E_{φ_i(λ)}[t(x_i, Z_i)] ].   (8)

We approximate this expression using Monte Carlo to compute noisy, unbiased natural gradients at λ. To form the Monte Carlo estimate, we collect α data points from F; for each we compute the optimal local parameters φ_i(λ), which are a function of the sampled data point and the variational parameters; we then compute the quantity inside the brackets in Eq. 8. Averaging these results gives the Monte Carlo estimate of the natural gradient. We follow the noisy natural gradient and repeat.

The algorithm is summarized in Algorithm 1. Because Eq. 8 is a Monte Carlo estimate, we are free to draw B data points from F_α (where B ≪ α) and rescale the sufficient statistics by α/B. This makes the natural gradient estimate noisier, but faster to calculate. As highlighted in [14], this strategy is more computationally efficient because early iterations of the algorithm have inaccurate values of λ. It is wasteful to pass through a lot of data before making updates to λ.

Discussion. Thus far, we have defined the population posterior and shown how to approximate it with population variational inference. Our derivation justifies using an algorithm like stochastic variational inference (SVI) [14] on a stream of data.
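For intuition, the noisy natural-gradient update of Eq. 8 can be run end-to-end on a toy Beta-Bernoulli model with no local hidden variables, so the local step is trivial and the sufficient statistics are t(x) = (x, 1 − x). The stream's frequency, α, the minibatch size B, and the learning-rate schedule are all illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

f = 0.7                       # frequency of the (assumed) Bernoulli stream F
zeta = np.array([1.0, 1.0])   # Beta(1, 1) prior in count form
alpha = 1000                  # data-size hyperparameter of the population posterior
B = 20                        # minibatch size, with B << alpha
lam = zeta.copy()             # global variational parameter (Beta counts)

for t in range(1, 2001):
    x = rng.binomial(1, f, size=B)   # draw a minibatch from the stream
    # Monte Carlo estimate of Eq. 8: sufficient statistics (x, 1 - x),
    # rescaled by alpha / B as in the minibatch strategy above.
    stats = np.array([x.sum(), B - x.sum()]) * (alpha / B)
    nat_grad = zeta - lam + stats
    rho = (t + 10) ** -0.7           # decaying learning rate
    lam = lam + rho * nat_grad

# lam approximates the counts of Beta(1 + alpha*f, 1 + alpha*(1 - f))
posterior_mean = lam[0] / lam.sum()
```

Setting α to the size of a finite data set and replacing the stream with resampling from the data recovers SVI, as discussed in the text.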
It is nearly identical to SVI, but includes an additional parameter: the number of data points in the population posterior, α.

¹ For most models of interest, this is justified by the dominated convergence theorem.

Algorithm 1 Population Variational Bayes

  Randomly initialize global variational parameter λ^(0)
  Set iteration t ← 0
  repeat
    Draw data minibatch x_{1:B} ∼ F_α
    Optimize local variational parameters φ_1(λ^(t)), ..., φ_B(λ^(t))
    Calculate natural gradient ∇̂_λ L(λ^(t); F_α) [see Eq. 8]
    Update global variational parameter with learning rate ρ^(t):
      λ^(t+1) = λ^(t) + ρ^(t) (α/B) ∇̂_λ L(λ^(t); F_α)
    Update iteration count t ← t + 1
  until forever

Note we can recover the original SVI algorithm as an instance of population VI, thus reinterpreting it as minimizing the KL divergence to the population posterior. We recover SVI by setting α equal to the number of data points in the data set and replacing the stream of data F with F̂_x, the empirical distribution of the observations. The “stream” in this case comes from sampling with replacement from F̂_x, which results in precisely the original SVI algorithm.²

We focused on the conditionally conjugate family for convenience, i.e., the simple gradient in Eq. 8. We emphasize, however, that by using recent tools for nonconjugate inference [17, 18, 24], we can adapt the new ideas described above—the population posterior and the F-ELBO—outside of conditionally conjugate models.

Finally, we analyze the population posterior distribution under the assumption that the only way the stream affects the model is through the data.
Formally, this means the unobserved variables in the model and the stream F_α are independent given the data X. The population posterior without the local latent variables z (which can be marginalized out) is E_{F_α}[p(β | X)]. Expanding the expectation gives ∫ p(β | X) p(X | F_α) dX, showing that the population posterior distribution can be written as p(β | F_α). This can be depicted as a graphical model in which F_α generates the data X, which in turn governs β.

This means, first, that the population posterior is well defined even when the model does not specify the marginal distribution of the data and, second, that rather than the classical Bayesian setting where the posterior is conditioned on a finite fixed dataset, the population posterior is a distributional posterior conditioned on the stream F_α.

3 Empirical Evaluation

We study the performance of population variational Bayes (population VB) against SVI and streaming variational Bayes (SVB) [8]. With large real-world data we study two models, latent Dirichlet allocation [5] and Bayesian nonparametric mixture models, comparing the held-out predictive performance of the algorithms. All three methods share the same local variational update, which is the dominating computational cost. We study the data coming in a true ordered stream, and in a permuted stream (to better match the assumptions of SVI). Across data and models, population VB usually outperforms the existing approaches.

Models. We study two models. The first is latent Dirichlet allocation (LDA) [5]. LDA is a mixed-membership model of text collections and is frequently used to find its latent topics. LDA assumes that there are K topics β_k ∼ Dir(η), each of which is a multinomial distribution over a fixed vocabulary.
Documents are drawn by first choosing a distribution over topics θ_d ∼ Dir(γ) and then drawing each word by choosing a topic assignment z_dn ∼ Mult(θ_d) and finally choosing a word from the corresponding topic, w_dn ∼ β_{z_dn}. The joint distribution is

p(β, θ, z, w | η, γ) = p(β | η) ∏_{d=1}^{α} p(θ_d | γ) ∏_{i=1}^{N} p(z_{di} | θ_d) p(w_{di} | β, z_{di}).   (9)

Fixing hyperparameters, the inference problem is to estimate the conditional distribution of the topics given a large collection of documents.

² This derivation of SVI is an application of Efron's plug-in principle [10] to inference of the population posterior. The plug-in principle says that we can replace the population F with the empirical distribution of the data F̂ to make population inferences. In our empirical study, however, we found that population VI often outperforms stochastic VI. Treating the data as a true stream, and setting the data size α to a value different from the true number of data points, can improve predictive accuracy.

Figure 1: Held out predictive log likelihood for LDA on large-scale streamed text corpora. Population-VB outperforms existing methods for two out of the three settings. We use the best settings of α.

The second model is a Dirichlet process (DP) mixture [11]. Loosely, DP mixtures are mixture models with a potentially infinite number of components; thus choosing the number of components is part of the posterior inference problem. When using variational inference for DP mixtures [4], we take advantage of the stick-breaking representation to construct a truncated variational approximation [21]. The variables are mixture proportions π ∼ Stick(η), mixture components β_k ∼ H(γ) (for infinite k), mixture assignments z_i ∼ Mult(π), and observations x_i ∼ G(β_{z_i}).
The joint is

p(β, π, z, x | η, γ) = p(π | η) p(β | γ) ∏_{i=1}^{α} p(z_i | π) p(x_i | β, z_i).   (10)

The likelihood and prior on the components are general to the observations at hand. In our study of real-valued data we use normal priors and normal likelihoods; in our study of text data we use Dirichlet priors and multinomial likelihoods.

For both models we vary α, which in traditional analysis is usually fixed to the number of data points.

Datasets. With LDA we analyze three large-scale streamed corpora: 1.7M articles from the New York Times spanning 10 years, 130K Science articles written over 100 years, and 7.4M tweets collected from Twitter on Feb 2nd, 2014. We processed them all in a similar way, choosing a vocabulary based on the most frequent words in the corpus (with stop words removed): 8,000 for the New York Times, 5,855 for Science, and 13,996 for Twitter. On Twitter, each tweet is a document, and we removed duplicate tweets and tweets that did not contain at least 2 words in the vocabulary. For each data stream, all algorithms took a few hours to process all the examples we collected.

With DP mixtures, we analyze human location behavior data.
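The truncated stick-breaking construction of the mixture proportions π ∼ Stick(η) used in the variational approximation [21] can be sketched as follows; the concentration and truncation level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

eta, T = 2.0, 50                   # concentration and truncation level (illustrative)
v = rng.beta(1.0, eta, size=T)     # stick-breaking proportions v_k ~ Beta(1, eta)
v[-1] = 1.0                        # truncation: the last stick takes the remainder

# pi_k = v_k * prod_{j < k} (1 - v_j): break off a fraction v_k of what remains.
remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
pi = v * remaining
```

With the final proportion clamped to one, the truncated weights sum exactly to one, which is what lets a finite variational family stand in for the infinite mixture.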
These data allow us to build periodic models of human population mobility, with applications to disaster response and urban planning. Such models account for periodicity by including the hour of the week as one of the dimensions of the data to be modeled. The Ivory Coast location data contains 18M discrete cell tower locations for 500K users recorded over 6 months [6]. The Microsoft Geolife dataset contains 35K latitude-longitude GPS locations for 182 users over 5 years.

Figure 2: Held out predictive log likelihood for Dirichlet process mixture models on large-scale streamed location and text data sets. Note that we apply Gaussian likelihoods in the Geolife dataset, so the reported predictive performance is measured by probability density. We chose the best α for each population-VB curve.

Figure 3: We show the sensitivity of population-VB to hyperparameter α (based on final log likelihoods in the time-ordered stream) and find that the best setting of α often differs from the true number of data points (which may not be known in any case in practice).
For both data sets, our observations reflect down-sampling the data to ensure that each individual is seen no more than once every 15 minutes.

Results. We compare population VB with SVI [14] and SVB [8] for LDA [8] and DP mixtures [22]. SVB updates the variational approximation of the global parameter using density filtering with exponential families.
The complexity of the approximation remains \ufb01xed as the expected suf\ufb01cient\nstatistics from minibatches observed in a stream are combined with those of the current approximation.\n(Here we give the \ufb01nal results. We include details of how we set and \ufb01t hyperparameters below.)\nWe measure model \ufb01tness by evaluating the average predictive log likelihood on held-out data. This\ninvolves splitting held-out observations (that were not involved in the posterior approximation of \u03b2 )\ninto two equal halves, inferring the local component distribution based on the \ufb01rst half, and testing\nwith the second half [14, 26]. For DP-mixtures, we condition on the observed hour of the week and\npredict the geographic location of the held-out data point.\nIn standard of\ufb02ine studies, the held-out set is randomly selected from the data. With streams, however,\nwe test on the next 10K documents (for New York Times, Science), 500K tweets (for Twitter), or 25K\nlocations (on Geo data). This is a valid held-out set because the data ahead of the current position in\nthe stream have not yet been seen by the inference algorithms.\nFigure 1 shows the performance for LDA. We looked at two types of streams: one in which the data\nappear in order and the other in which they have been permuted (i.e., an exchangeable stream). The\ntime permuted stream reveals performance when each data minibatch is safely assumed to be an\ni.i.d. sample from F; this results in smoother improvements to predictive likelihood. On our data, we\nfound that population VB outperformed SVI and SVB on two of the data sets and outperformed SVI\non all of the data. SVB performed better than population VB on Twitter.\nFigure 2 shows a similar study for DP mixtures. We analyzed the human mobility data and the\nNew York Times. (Ref. [22] also analyzed the New York Times.) 
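The split-half evaluation described above can be sketched for a toy two-component Gaussian mixture; the point-estimate weights, means, and variance stand in for a fitted variational posterior, and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a fitted mixture: point estimates of weights and components.
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
sd = 1.0

def log_norm_pdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)

def split_half_pred_ll(x):
    """Infer local responsibilities on the first half of a held-out unit,
    then report the average predictive log likelihood of the second half."""
    half = len(x) // 2
    first, second = x[:half], x[half:]
    log_resp = np.log(weights) + np.array(
        [log_norm_pdf(first, m, sd).sum() for m in means])
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    comp = np.exp(np.array([log_norm_pdf(second, m, sd) for m in means]))
    return float(np.log(resp @ comp).mean())

held_out = rng.normal(2.0, 1.0, size=20)  # a held-out unit from one component
score = split_half_pred_ll(held_out)
```

Only the first half of the held-out unit touches the local inference, so the second-half score is an honest predictive measure.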
On these data population VB outperformed SVB and SVI in all settings.³

Hyperparameters. Unlike traditional Bayesian methods, the data set size α is a hyperparameter to population VB; it controls the variance of the population posterior. Figure 3 reports sensitivity to α for all studies (for the time-ordered stream). These plots indicate that the optimal setting of α is often different from the true number of data points; the best-performing population posterior variance is not necessarily the one implied by the data. The other hyperparameters to our experiments are reported in Appendix C.

4 Conclusions and Future Work

We introduced the population posterior, a distribution over latent variables that combines traditional Bayesian inference with the frequentist idea of the population distribution. With this idea, we derived population variational Bayes, an efficient algorithm for probabilistic inference on streams. On two complex Bayesian models and several large data sets, we found that population variational Bayes usually performs better than existing approaches to streaming inference.

In this paper, we made no assumptions about the structure of the population distribution. Making assumptions, such as the ability to obtain streams conditional on queries, can lead to variants of our algorithm that learn which data points to see next during inference. Finally, understanding the theoretical properties of the population posterior is also an avenue of interest.

Acknowledgments. We thank Allison Chaney, John Cunningham, Alp Kucukelbir, Stephan Mandt, Peter Orbanz, Theo Weber, Frank Wood, and the anonymous reviewers for their comments.
This work is supported by NSF IIS-0745520, IIS-1247664, IIS-1009542, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, NDSEG, Facebook, Adobe, Amazon, and the Siebel Scholar and John Templeton Foundations.

³Though our purpose is to compare algorithms, we make one note about a specific data set. The predictive accuracy for the Ivory Coast data set plummets after 14M data points. This is because of the data collection policy: for privacy reasons, the data set provides the cell tower locations of a randomly selected cohort of 50K users every 2 weeks [6]. The new cohort at 14M data points behaves differently from previous cohorts in a way that affects predictive performance. However, both algorithms steadily improve after this shock.

References
[1] A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, E. P. Xing, and A. J. Smola. Online inference for the infinite topic-cluster model: Storylines from streaming text. In International Conference on Artificial Intelligence and Statistics, pages 101–109, 2011.
[2] S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] J. M. Bernardo and A. F. Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.
[4] D. M. Blei, M. I. Jordan, et al. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[6] V. D. Blondel, M. Esch, C. Chan, F. Clérot, P. Deville, E. Huens, F. Morlot, Z. Smoreda, and C. Ziemlicki. Data for development: the D4D challenge on mobile phone data. arXiv preprint arXiv:1210.0137, 2012.
[7] L. Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17:9, 1998.
[8] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. Jordan. Streaming variational Bayes.
In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.
[9] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[10] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.
[11] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[12] Z. Ghahramani and H. Attias. Online variational Bayesian learning. In Slides from talk presented at NIPS 2000 Workshop on Online Learning, 2000.
[13] M. D. Hoffman and D. M. Blei. Structured stochastic variational inference. In International Conference on Artificial Intelligence and Statistics, 2015.
[14] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[15] A. Honkela and H. Valpola. On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808, 2003.
[16] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 805–813, 2014.
[19] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[20] L. K. Saul and M. I. Jordan. Exploiting tractable substructures in intractable networks.
In Advances in Neural Information Processing Systems, pages 486–492, 1996.
[21] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[22] A. Tank, N. Foti, and E. Fox. Streaming variational inference for Bayesian nonparametric mixture models. In International Conference on Artificial Intelligence and Statistics, 2015.
[23] L. Theis and M. D. Hoffman. A trust-region method for stochastic variational inference with applications to streaming data. In International Conference on Machine Learning, 2015.
[24] M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, pages 1971–1979, 2014.
[25] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, Jan. 2008.
[26] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In International Conference on Machine Learning, 2009.
[27] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2009.