{"title": "Distributed Bayesian Posterior Sampling via Moment Sharing", "book": "Advances in Neural Information Processing Systems", "page_first": 3356, "page_last": 3364, "abstract": "We propose a distributed Markov chain Monte Carlo (MCMC) inference algorithm for large scale Bayesian posterior simulation. We assume that the dataset is partitioned and stored across nodes of a cluster. Our procedure involves an independent MCMC posterior sampler at each node based on its local partition of the data. Moment statistics of the local posteriors are collected from each sampler and propagated across the cluster using expectation propagation message passing with low communication costs. The moment sharing scheme improves posterior estimation quality by enforcing agreement among the samplers. We demonstrate the speed and inference quality of our method with empirical studies on Bayesian logistic regression and sparse linear regression with a spike-and-slab prior.", "full_text": "Distributed Bayesian Posterior Sampling via Moment Sharing\n\nMinjie Xu1∗, Balaji Lakshminarayanan2, Yee Whye Teh3, Jun Zhu1, and Bo Zhang1\n\n1State Key Lab of Intelligent Technology and Systems; Tsinghua National TNList Lab\n\n1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China\n\n2Gatsby Unit, University College London, 17 Queen Square, London WC1N 3AR, UK\n\n3Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK\n\nAbstract\n\nWe propose a distributed Markov chain Monte Carlo (MCMC) inference algorithm for large scale Bayesian posterior simulation. We assume that the dataset is partitioned and stored across nodes of a cluster. Our procedure involves an independent MCMC posterior sampler at each node based on its local partition of the data. 
Moment statistics of the local posteriors are collected from each sampler and propagated across the cluster using expectation propagation message passing with low communication costs. The moment sharing scheme improves posterior estimation quality by enforcing agreement among the samplers. We demonstrate the speed and inference quality of our method with empirical studies on Bayesian logistic regression and sparse linear regression with a spike-and-slab prior.\n\n1 Introduction\n\nAs we enter the age of “big data”, datasets are growing to ever increasing sizes and there is an urgent need for scalable machine learning algorithms. In Bayesian learning, the central object of interest is the posterior distribution, and a variety of variational and Markov chain Monte Carlo (MCMC) methods have been developed for “big data” settings. The main difficulty with both approaches is that each iteration of these algorithms requires an impractical O(N) computation for a dataset of size N ≫ 1. There are two general solutions: either to use stochastic approximation techniques based on small mini-batches of data [15, 4, 5, 20, 1, 14], or to distribute data as well as computation across a parallel computing architecture, e.g. using MapReduce [3, 13, 16].\n\nIn this paper we consider methods for distributing MCMC sampling across a computer cluster where a dataset has been partitioned and locally stored on the nodes. Recent years have seen a flurry of research on this topic, with many papers based around “embarrassingly parallel” architectures [16, 12, 19, 9]. The basic thesis is that because communication costs are so high, it is better for each node to run a separate MCMC sampler based on its data stored locally, completely independently from others, and then for a final combination stage to transform the local samples into samples for the desired global posterior distribution given the whole dataset. 
[16] directly combines the samples by weighted averages under an implicit Gaussian assumption; [12] approximates each local posterior with either a Gaussian or a Gaussian kernel density estimate (KDE) so that the combination follows an explicit product of densities; [19] takes the KDE idea one step further by representing it as a Weierstrass transform; [9] uses the “median posterior” in an RKHS embedding space as a combination technique that is robust in the presence of outliers. The main drawback of embarrassingly parallel MCMC sampling is that if the local posteriors differ significantly, perhaps due to noise or non-random partitioning of the dataset across the cluster, or if they do not satisfy the Gaussian assumptions in a number of methods, the final combination stage can result in highly inaccurate global posterior representations.\n\n∗This work was started and completed when the author was visiting University of Oxford.\n\n\fTo encourage local MCMC samplers to roughly be aware of and hence agree with one another, so as to improve inference quality, we develop a method to enforce sharing of a small number of moment statistics of the local posteriors, e.g. mean and covariance, across the samplers. We frame our method as expectation propagation (EP) [8], where the exponential family is defined by the shared moments and each node represents a factor to be approximated, with moment statistics to be estimated by the corresponding sampler. Messages passed among the nodes encode differences between the estimated moments, so that at convergence all nodes agree on these moments. As EP tends to converge rapidly, these messages will be passed around only infrequently (relative to the number of MCMC iterations). It can also be performed in an asynchronous fashion, hence incurring low communication costs. As opposed to previous embarrassingly parallel schemes which require a final combination stage, upon convergence each sample drawn at any single node with our method can be directly treated as a sample from an approximate global posterior distribution. Our method differs from standard EP in that each factor to be approximated consists of a product of many likelihood terms (rather than just one as in standard EP), and therefore suffers less approximation bias.\n\n2 A Distributed Bayesian Posterior Sampling Algorithm\n\nIn this section we develop our method for distributed Bayesian posterior sampling. We assume that we have a dataset D = {xn}_{n=1}^N with N ≫ 1 which has already been partitioned onto m compute nodes. Let Di denote the data on node i for i = 1, . . . , m such that D = ∪_{i=1}^m Di. Let D−i = D\\Di. We assume that the data are i.i.d. given a parameter vector θ ∈ Θ with prior distribution p0(θ). The object of interest is the posterior distribution, p(θ|D) ∝ p0(θ) ∏_{i=1}^m p(Di|θ), where p(Di|θ) is a product of likelihood terms, one for each data item in Di.\n\nRecall that our general approach is to have an independent sampler running on each node targeting a “local posterior”, and our aim is for the samplers to agree on the overall shape of the posteriors, by enforcing that they share the same moment statistics; e.g. using the first two moments they will share the same mean and covariance. Let S(θ) be the sufficient statistics function such that f(S) := E_f[S(θ)] are the moments of interest for some density f(θ). Consider an exponential family of distributions with sufficient statistics S(·) and let q(θ; η) be a density in the family with natural parameter η. We will assume for simplicity that the prior belongs to the exponential family, p0(θ) = q(θ; η0) for some natural parameter η0. 
Let p̃i(θ|Di) denote the local posterior at node i. Rather than using the same prior, e.g. p0(θ), at all nodes, we use a local prior which enforces the moments to be similar between local posteriors. More precisely, we consider the following target density,\n\np̃i(θ|Di) ∝ q(θ; η−i) p(Di|θ),\n\nwhere the effective local prior q(θ; η−i) is determined by the (natural) parameter η−i. We set η−i such that E_{p̃i(θ|Di)}[S(θ)] = μ for all i, for some shared moment vector μ.\n\nAs an aside, note that the overall posterior distribution can be recovered via\n\np(θ|D) ∝ p(D|θ) p0(θ) = p0(θ) ∏_{i=1}^m p(Di|θ) ∝ q(θ; η0) ∏_{i=1}^m [ p̃i(θ|Di) / q(θ; η−i) ],  (1)\n\nfor any choice of the parameters η−i, with a number of previous works corresponding to different choices. [16, 12, 19] use η−i = η0/m, so that the local prior is p0(θ)^{1/m} and (1) reduces to p(θ|D) ∝ ∏_{i=1}^m p̃i(θ|Di). [2] set η−i = η0 for their distributed asynchronous streaming variational algorithm, but reported that setting η−i such that q(θ; η−i) approximates the posterior distribution given previously processed data achieves better performance. We say that such a choice of η−i is context aware as it contains contextual information from other local posteriors. Finally, in the ideal situation with exact equality, q(θ; η−i) = p(θ|D−i), each local posterior is precisely the true posterior p(θ|D). In the following subsections, we will describe how EP can be used to iteratively approximate η−i so that q(θ; η−i) matches p(θ|D−i) as closely as possible in the sense of minimising the KL divergence. 
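Equation (1) rests on the fact that multiplying exponential-family densities adds their natural parameters, up to normalisation. A small numerical check of this in the univariate Gaussian case (our own sketch, not from the paper): the product of two Gaussian densities is proportional to a Gaussian whose precision, and precision-times-mean, are the sums of the factors'.

```python
import math

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

# Two Gaussian factors, e.g. an effective local prior and a likelihood surrogate.
m1, v1 = 1.0, 2.0
m2, v2 = -0.5, 0.5

# Sum the natural parameters (precision, precision * mean) of the two factors.
prec = 1 / v1 + 1 / v2
mean = (m1 / v1 + m2 / v2) / prec

# The pointwise ratio product / N(x; mean, 1/prec) is constant in x,
# i.e. the product is proportional to that Gaussian.
xs = [-1.0, 0.0, 0.7, 2.3]
ratios = [gauss_pdf(x, m1, v1) * gauss_pdf(x, m2, v2) / gauss_pdf(x, mean, 1 / prec)
          for x in xs]
assert all(abs(r - ratios[0]) < 1e-12 for r in ratios)
```

The choices of η−i discussed above (η0/m, η0, or an approximation to p(θ|D−i)) all amount to different ways of splitting this additive budget of natural parameters across nodes.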
Since our algorithm performs distributed sampling by sharing messages containing moment information, we refer to it as SMS (short for sampling via moment sharing).\n\n\f2.1 Expectation Propagation\n\nIn many typical scenarios the posterior is intractable to compute because the product of likelihoods and the prior is not analytically tractable, and approximation schemes, e.g. variational methods or MCMC, are required to compute the posterior. EP is a variational message-passing scheme [8], where each likelihood term is approximated by an exponential family density chosen iteratively to minimise the KL divergence to a “local posterior”.\n\nSuppose we wish to approximate (up to normalisation) the likelihood p(Di|θ) (as a function of θ) using the exponential family density q(θ; ηi) for some suitably chosen natural parameter ηi, and that other parameters {ηj}_{j≠i} are known such that each q(θ; ηj) approximates the corresponding p(Dj|θ) well. Then the posterior distribution is well approximated by a local posterior where all but one likelihood factor is approximated,\n\np(θ|D) ≈ p̃i(θ|D) ∝ p0(θ) p(Di|θ) ∏_{j≠i} q(θ; ηj) = p(Di|θ) p̃i(θ|D−i),\n\nwhere p̃i(θ|D−i) = q(θ; η−i), with η−i = η0 + ∑_{j≠i} ηj, is a context-aware prior which incorporates information from the other data subsets and is an approximation to the conditional distribution p(θ|D−i). Replacing p(Di|θ) by q(θ; ηi), the corresponding local posterior p̃i(θ|D) would be approximated by q(θ; η−i + ηi). A natural choice for the parameter ηi is the one that minimises KL(p̃i(θ|D) ‖ q(θ; η−i + ηi)). This optimisation can be solved by calculating the moment parameter μi = E_{p̃i(θ|D)}[S(θ)], transforming the moment parameter μi into its natural parameter, say νi, and then updating ηi ← νi − η−i.\n\nEP proceeds iteratively, by updating each parameter given the current values of the others using the above procedure until convergence. At convergence (which is not guaranteed), we have that, for all i,\n\nνi = ν := η0 + ∑_{j=1}^m ηj,\n\nwhere the ηj are the converged parameter values. Hence the natural parameters, as well as the moments of the local posteriors, at all nodes agree. When the prior p0(θ) does not belong to the exponential family, we may simply treat it as p(D0|θ) where D0 = ∅ and approximate it with q(θ; η0) just as we approximate the likelihoods.\n\n2.2 Distributed Sampling via Moment Sharing\n\nIn typical EP applications, the moment parameter μi = E_{p̃i(θ|D)}[S(θ)] can be computed either analytically or using numerical quadrature. In our setting, this is not possible as each likelihood factor p(Di|θ) is now a product of many likelihoods with generally no tractable analytic form. Instead we can use MCMC sampling to estimate these moments.\n\nThe simplest algorithm involves synchronous EP updates: at each EP iteration, each node i receives from a master node η−i (initialised to η0 at the first iteration) calculated from the previous iteration, runs MCMC to obtain T samples from which the moments μi are estimated, converts this into natural parameters νi, and returns ηi = νi − η−i to the master node. (Note that the MCMC samplers are run in parallel; hence the moments are computed in parallel, unlike standard EP.) 
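The update ηi ← νi − η−i can be sketched end-to-end. The toy below (our illustration, not the paper's Matlab implementation) runs the synchronous scheme on univariate Gaussian likelihood factors, replacing the per-node MCMC moment estimation with exact moments so the tilted posterior is available in closed form; with Gaussian factors EP then recovers the exact posterior, whose natural parameter is η0 + ∑i ηi.

```python
# Synchronous EP sketch with exact moments standing in for MCMC estimates.
# Natural parameters are pairs (precision * mean, precision).

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

eta0 = (0.0, 1.0)                                 # N(0, 1) prior
factors = [(2.0, 1.0), (-1.0, 0.5), (0.5, 2.0)]   # exact per-node likelihood naturals
m = len(factors)
eta = [(0.0, 0.0)] * m                            # site approximations, initialised flat

for _ in range(5):                                # EP iterations
    for i in range(m):
        # Effective local prior: eta_-i = eta0 + sum_{j != i} eta_j.
        eta_minus_i = eta0
        for j in range(m):
            if j != i:
                eta_minus_i = add(eta_minus_i, eta[j])
        # "Run MCMC" at node i: here the tilted posterior is Gaussian with
        # natural parameter nu_i = eta_-i + (exact factor naturals).
        nu_i = add(eta_minus_i, factors[i])
        eta[i] = sub(nu_i, eta_minus_i)           # the EP update eta_i <- nu_i - eta_-i

# Global approximation eta0 + sum_i eta_i equals the true posterior natural parameter.
global_eta = eta0
true_eta = eta0
for i in range(m):
    global_eta = add(global_eta, eta[i])
    true_eta = add(true_eta, factors[i])
assert global_eta == true_eta
```

In the actual algorithm the line computing `nu_i` is replaced by estimating moments from T MCMC samples and converting them to natural parameters, which is what introduces the stochasticity discussed in Section 2.4.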
An asynchronous version can be implemented as well: at each node i, after the MCMC samples are obtained and the new ηi parameter computed, the node communicates asynchronously with the master to send ηi and receive the new value of η−i based on the current ηj, j ≠ i, from the other nodes. Finally, a decentralised scheme is also possible: each node i stores a local copy of all the parameters ηj for j = 1, . . . , m; after the MCMC phase, the new value of ηi is computed and broadcast to all nodes, the local copy is updated based on the messages the node received in the meantime, and a new η−i is computed.\n\n2.3 Multivariate Gaussian Exponential Family\n\nFor concreteness, we will describe the required computations of the moments and natural parameters in the special case of a multivariate Gaussian exponential family. In addition to being analytically tractable and popular, the usage of the multivariate Gaussian distribution can also be motivated using \fBayesian asymptotics for large datasets. In particular, for parameters in R^d and under regularity conditions, if the size of the subset Di is large, the Bernstein-von Mises Theorem shows that the local posterior distribution is well approximated by a multivariate Gaussian; hence the EP approximation by an exponential family density will be very good. Given T samples {θit}_{t=1}^T collected at node i, unbiased estimates of the moments (mean μi and covariance Σi) are given by\n\nμi ← (1/T) ∑_{t=1}^T θit,  Σi ← (1/(T − 1)) ∑_{t=1}^T (θit − μi)(θit − μi)⊤,  (2)\n\nwhile the natural parameters can be computed as ηi = (Ωi μi, Ωi), where\n\nΩi = ((T − d − 2)/(T − 1)) Σi^{−1}  (3)\n\nis an unbiased estimate of the precision matrix [11]. 
Note that simply using Σi^{−1} leads to a biased estimate, which impacts upon the convergence of EP. Alternative estimators exist [18] but we use the above unbiased estimate for simplicity. We stress that our approach is not limited to the multivariate Gaussian, but applicable to any exponential family distribution. In Section 3.2, we consider the case where the local posterior is approximated using the spike and slab distribution.\n\n2.4 Additional Comments\n\nThe collected samples can be used to form estimates for the global posterior p(θ|D) in two ways. Firstly, these samples can be combined using a combination technique [16, 12, 19, 9]. According to (1), each sample θ needs to be assigned a weight of q(θ; η−i)^{−1} before being combined. Alternatively, once EP has converged, the MCMC samples target the local posterior p̃i(θ|D), which is already a good approximation to the global posterior, so the samples can be used directly as approximate samples of the global posterior without need for a combination stage. This has the advantage of producing mT samples if each of the m nodes produces T samples, while other combination techniques only produce T samples. We have found the second approach to perform well in practice.\n\nIn our experiments we have found damping to be essential for the convergence of the algorithm. This is because, in addition to the typical convergence issues with EP, our mean parameters are also estimated using MCMC, which introduces additional stochasticity that can affect the convergence. There is little theory in the literature on convergence of EP [17], and even less can be shown with the additional stochasticity introduced by the MCMC sampling. 
Nevertheless, we have found that damping the natural parameters ηi works well in practice.\n\nIn the case of multivariate Gaussians, additional consideration has to be given due to the possibility that the oscillatory behaviour in EP can lead to covariance matrices that are not positive definite. If the precision of a local prior Ω−i is not positive definite, the resulting local posterior will become unnormalisable and the MCMC sampling will diverge. We adopt a number of mitigating strategies that we have found to be effective: whenever a new value of the precision matrix Ω−i^new is not positive definite, we damp it towards its previous value as αΩ−i^old + (1 − α)Ω−i^new, with an α large enough such that the linear combination is positive definite; we collect a large enough number of samples at each MCMC phase to reduce variability of the estimators; and we use the pseudo-inverse instead of the actual matrix inverse in (3).\n\n3 Experiments\n\n3.1 Bayesian Logistic Regression\n\nWe tested our sampling via moment sharing method (SMS) on Bayesian logistic regression with simulated data. Given a dataset D = {(xn, yn)}_{n=1}^N where xn ∈ R^d and yn = ±1, the conditional model of each yn given xn is\n\np(yn|xn, w) = σ(yn w⊤xn),  (4)\n\nwhere σ(x) = 1/(1 + e^−x) is the standard logistic (sigmoid) function and the weight vector w ∈ R^d is our parameter of interest. For simplicity we did not include the intercept in the model. We used a standard Gaussian prior p0(w) = N(w; 0d, Id) on w and the aim is to draw samples from the posterior p(w|D).\n\n\fFigure 1: Plot of covariate dimensions 1 and 20 of the simulated dataset for Bayesian logistic regression.\n\nOur simulated dataset consists of N = 4000 data points, each with d = 20 dimensional covariates, generated using i.i.d. draws xn ∼ N(μx, Σx), where Σx = PP⊤, P ∈ [0, 1]^{d×d} and each entry of μx and P is in turn generated i.i.d. from U(0, 1). We generate the “true” parameter vector w∗ from the prior N(0d, Id), with which the labels are sampled i.i.d. according to the model, i.e. p(yn|xn) = σ(yn w∗⊤xn). The dataset is visualized in Fig. 1.\n\nAs the base MCMC sampler used across all methods, we used the No-U-Turn sampler (NUTS) [6]. NUTS was also used to generate 100000 samples from the full posterior p(θ|D) for ground truth. Across all the methods, the sampler was initialised at 0d and used the first 20d samples for burn-in, then thinned every other sample.\n\nWe compared our method SMS against consensus Monte Carlo (SCOT) [16], the embarrassingly parallel MCMC sampler (NEIS) of [12] and the Weierstrass sampler (WANG) [19].\n\nSMS: We tested both the synchronous (SMS(s)) and asynchronous (SMS(a)) versions of our method, using a multivariate Gaussian exponential family. The damping factor used was 0.2. At each EP iteration, SMS produced both the EP approximated Gaussian posterior q(θ; η0 + ∑_{i=1}^m ηi), as well as a collection of mT local posterior samples Θ. We use K to denote the total number of EP iterations. For SMS(a), every m worker-master updates are counted as one EP iteration.\n\nSCOT: Since each node in our algorithm effectively draws KT samples in total, we allowed each node in SCOT to draw KT samples as well, using a single NUTS run. 
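For reference, the unnormalised log posterior that each node's NUTS sampler targets under model (4) with the N(0d, Id) prior can be written in a few lines. This is a generic sketch of ours, not the paper's code; `data` is a toy stand-in for a local partition Di.

```python
import math

def log_posterior(w, data):
    """Unnormalised log p(w | D) for Bayesian logistic regression:
    standard Gaussian prior plus sum_n log sigma(y_n * w^T x_n)."""
    lp = -0.5 * sum(wl * wl for wl in w)      # N(0, I) prior, up to an additive constant
    for x, y in data:
        z = y * sum(wl * xl for wl, xl in zip(w, x))
        lp += -math.log1p(math.exp(-z))       # log sigma(z); stable when z is not very negative
    return lp

# Toy check: a weight vector aligned with the labels scores higher than w = 0.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
assert log_posterior([1.0, 0.0], data) > log_posterior([0.0, 0.0], data)
```

In SMS the only change per node is that the N(0d, Id) prior term is replaced by the context-aware Gaussian prior q(w; η−i).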
To compare against our algorithm at iteration k ≤ K, we used the first kT samples for combination to form the approximate posterior samples.\n\nNEIS: As in SCOT, we drew KT samples at each node, and compared against ours at iteration k using the first kT samples. We tested both the parametric (NEIS(p)) and non-parametric (NEIS(n)) combination methods. To combine the kernel density estimates in NEIS(n), we adopted the recursive pairwise combination strategy as suggested in [12, 19]. We retained 10mT samples during intermediate stages of pair reduction and finally drew mT samples from the final reduction.\n\nWANG: We tested the sequential sampler in the first arXiv version, which can handle moderately high dimensional data and does not require a good initial approximation. The bandwidths hl (l = 1, . . . , d) were initialized to 0.01 and updated with √m σl (if smaller) as suggested by the authors, where σl is the estimated posterior standard deviation of dimension l. As a Gibbs sampling algorithm, WANG requires a larger number of iterations for convergence but does not need as many samples within each iteration. Hence we ran it for K′ = 700 ≫ K iterations, each time generating KT/K′ samples on every node. We then collected every T combined samples generated from each subsequent K′/K iterations for comparative purposes, leaving all previous samples as burn-in.\n\nAll methods were implemented and tested in Matlab. Experiments were conducted on a cluster with as many as 24 nodes (Matlab workers), arranged in 4 servers, each being a multi-core server with 2 Intel(R) Xeon(R) E5645 CPUs (6 cores, 12 threads). We used the parfor command (synchronous) and the parallel.FevalFuture object (asynchronous) in Matlab for parallel computations. The underlying message passing is managed by the Matlab Distributed Computing Server.\n\nConvergence of Shared Moments. 
Figure 2 demonstrates the convergence of the local posterior means as the EP iteration progresses, on a smaller dataset generated likewise with N = 1000, d = 5, and 25000 samples as ground truth. It clearly illustrates that our algorithm achieves very good approximation accuracy by quickly enforcing agreement across nodes on local posterior moments (the mean in this case). When m = 50, we used a larger number of samples for stable convergence.\n\nApproximation Accuracies. We compare the approximation accuracy of the different methods on our main simulated data (N = 4000, d = 20). We use a moderately large number of nodes m = 32, and T = 10000. In this case, each subset consists of 125 data points. We considered three different error measures for the approximation accuracies. Denote the ground truth posterior samples, mean and covariance by Θ∗, μ∗ and Σ∗, and correspondingly Θ̂, μ̂ and Σ̂ for the approximate samples collected using a distributed MCMC method. The first error measure is mean squared error (MSE)\n\n\f(a) m = 4, T = 1000\n\n(b) m = 10, T = 1000\n\n(c) m = 50, T = 10000\n\nFigure 2: Convergence of local posterior means on a smaller Bayesian logistic regression dataset (N = 1000, d = 5). The x-axis indicates the number of likelihood evaluations, with vertical lines denoting EP iteration numbers. The y-axis indicates the estimated posterior means (dimensions indicated by different colours). We show ground truth with solid horizontal lines, the EP estimated mean with asterisks, and local sample estimated means as dots connected with dashed lines.\n\n(a) MSE of posterior mean\n\n(b) Approximate KL-divergence\n\n(c) MSE of conditional prob. (5)\n\nFigure 3: Errors (log-scale) against the cumulative number of samples drawn on all nodes (kTm). We tested two random splits of the dataset (hence 2 curves for each algorithm). 
Each complete EP iteration is highlighted by a vertical grid line. Note that for SCOT, NEIS(p) and NEIS(n), apart from the usual combinations that occur after every Tm/2 local samples are drawn on all nodes, we also deliberately looked into combinations at a much earlier stage, at (0.01, 0.02, 0.1, 0.5)Tm.\n\n(a) Approximate KL-divergence\n\n(b) Approximate KL-divergence\n\n(c) Approximate KL-divergence\n\nFigure 4: Cross comparison with different numbers of nodes. Note that the x-axes have different meanings. In figure (a), it is the cumulative number of samples drawn locally on each node (kT). For the asynchronous SMS(a), we only plot every m iterations so as to mimic the behaviour of SMS(s) for a more direct comparison. In figure (b) however, it is the cumulative number of likelihood evaluations on each node (kTN/m), which more accurately reflects computation time.\n\n\fbetween μ̂ and μ∗: ∑_{l=1}^d (μ̂l − μ∗l)²/d; the second is KL-divergence between N(μ∗, Σ∗) and N(μ̂, Σ̂); and finally the MSE of the conditional probabilities:\n\n(1/N) ∑_{x∈D} [ (1/|Θ̂|) ∑_{w∈Θ̂} σ(w⊤x) − (1/|Θ∗|) ∑_{w∈Θ∗} σ(w⊤x) ]².  (5)\n\nFigure 3 shows the results for two separate runs of each method. We observe that both versions of SMS converge rapidly, requiring few rounds of EP iterations. Further, they produce approximation errors significantly below other methods. The synchronous SMS(s) does appear more stable and converges faster than its asynchronous counterpart, but ultimately both versions achieve the same level of accuracy. SCOT and NEIS(p) are very closely related, with their MSE for posterior mean overlapping. Both methods achieve reasonable accuracy early on, but fail to further improve with the increasing number of samples available for combination due to their assumptions of Gaussianity. NEIS(p) directly estimates μ̂ and Σ̂ without drawing samples Θ̂ and is thus missing from Figures 3b and 3c. Note that NEIS(n) is missing from Figure 3b because the posterior covariance estimated from the combined samples is singular due to an insufficient number of distinct samples. Unsurprisingly, WANG requires a large number of iterations for convergence and does not achieve very good approximation accuracy. It is also possible that the poor performances of NEIS(n) and WANG are due to the kernel density estimation used, as its quality deteriorates very quickly with dimensionality.\n\nInfluence of the Number of Nodes. We also investigated how the methods behave with varying numbers of partitions, m = 8, 16, 32, 48, 64. We tested the methods on three runs with three different random partitions of the dataset. 
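The first two error measures are easy to state in code. A univariate sketch of ours, for illustration (the paper uses the d = 20 multivariate versions): the MSE of the posterior mean, and the closed-form KL-divergence KL(N(μ∗, σ∗²) ‖ N(μ̂, σ̂²)).

```python
import math

def mse(mu_hat, mu_star):
    """Mean squared error between estimated and ground-truth posterior means."""
    return sum((a - b) ** 2 for a, b in zip(mu_hat, mu_star)) / len(mu_star)

def kl_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ), univariate closed form."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

# Sanity checks: both measures vanish when the approximation is exact,
# and the KL is positive for a mismatched approximation.
assert mse([1.0, 2.0], [1.0, 2.0]) == 0.0
assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0
assert kl_gauss(0.0, 1.0, 1.0, 2.0) > 0.0
```

The third measure (5) additionally requires averaging the predicted conditional probabilities over the sample sets Θ̂ and Θ∗, which is a straightforward double loop over samples and data points.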
We only tested m = 64 on our SMS methods. In Figure 4a, we see the rapid convergence in terms of the number of EP iterations, and the insensitivity to the number of nodes. Also, the final accuracies of the SMS methods are better for smaller values of m. This is not surprising since the approximation error of EP tends to increase when the posterior is factorised into more factors. In the extreme case of m = 1, the methods will be exact. Note however that with larger m, each node contains a smaller subset of data, and computation time is hence reduced. In Figure 4b we plotted the same curves against the number kTN/m of likelihood evaluations on each node, which better reflects the computation times. We thus see an accuracy-computation time trade-off, where with larger m computation time is reduced but accuracies get worse. In Figure 4c, we looked into the accuracy of the obtained approximate posterior in terms of KL-divergence. Note that apart from a direct read-off of the mean and covariance from the parametric EP estimate (SMS(s,e) & SMS(a,e)), we might also compute the estimators from the posterior samples (SMS(s,s) & SMS(a,s)), and we compared both of these in the figure. As noted above, the accuracies are better when we have fewer nodes. However, the errors of our methods still increase much more slowly than SCOT and NEIS(p), for both of which the KL-divergence increases to around 20 and 85 when m = 32 and 48, and is thus cropped from the figure.\n\n3.2 Bayesian sparse linear regression with spike and slab prior\n\nIn this experiment, we apply SMS to a Bayesian sparse linear regression model with a spike and slab prior over the weights. 
Our goal is to illustrate that our framework is applicable in scenarios\nwhere the local posterior distribution is approximated by other exponential family distributions and\nnot just the multivariate Gaussian.\nGiven a feature vector xn \u2208 Rd, we model the label as yn \u223c N (w(cid:62)xn, \u03c32\ny), where w is the\nparameter of interest. We use a spike and slab prior [10] over w, which is equivalent to setting\n\n0 inactive) whose elements are drawn independently from a Bernoulli distribution whose natural\nw) i.i.d. for each l = 1, . . . , d. [7] proposed the\n\nw = (cid:101)w (cid:12) s, where s is a d-dimensional binary vector (where 1 corresponds to an active feature and\n(log odds) parameter is \u03b20 and (cid:101)wl|sl \u223c N (0, \u03c32\nfollowing variational approximation of the posterior: q((cid:101)w, s) =(cid:81)d\nl=1 q((cid:101)wl, sl) where each factor\nq((cid:101)wl, sl) = q(sl)q((cid:101)wl|sl) is a spike and slab distribution. (We refer the reader to [7] for details.)\nThe spike and slab distribution over \u03b8 = ((cid:101)w, s) is an exponential family distribution with suf\ufb01cient\nstatistics {sl, sl(cid:101)wl, sl(cid:101)w2\nsist of the probability of sl = 1, and the mean and variance of (cid:101)wl conditioned on sl = 1, for each\nl }d\nl=1, which we use for the EP approximation. The moments required con-\nl = 1, . . . , d. The conditional distribution of (cid:101)wl given sl = 0 is simply the prior N (0, \u03c32\nnatural parameters consist of the log odds of sl = 1, as well as those for (cid:101)wl conditioned on sl = 1\n\nw). The\n\n7\n\n\f(a) m = 2\n\n(b) m = 4\n\nFigure 5: Results on Boston housing dataset for Bayesian sparse linear regression model with spike\nand slab prior. The x-axis plots the number of data points per node (equals the number of likeli-\nhood evaluations per sample) times the cumulative number of samples drawn per node, which is a\nsurrogate for the computation times of the methods. 
The y-axis plots the ground truth (solid), local sample estimated means (dashed) and EP estimated mean (asterisks) at every iteration.

We used the paired Gibbs sampler described in [7] as the underlying MCMC sampler, and a damping factor of 0.5.
We experimented using the Boston housing dataset, which consists of N = 455 training data points in d = 13 dimensions. We fixed the hyperparameters to the values described in [7], generated ground truth samples by running a long chain of the paired Gibbs sampler, and computed the posterior mean of w using these ground truth samples. Figure 5 illustrates the output of SMS(s) for m = 2 and m = 4 (the number of nodes was kept small to ensure that each node contains at least 100 observations). Each color denotes a different dimension; to avoid clutter, we report results only for dimensions 2, 5, 6, 7, 9, 10, and 13. The dashed lines denote the local sample estimated means at each of the nodes; the solid lines denote the ground truth; and the asterisks denote the EP estimated mean at each iteration. Initially, the local estimated means are quite different since each node has a different random data subset. As EP progresses, these local estimated means as well as the EP estimated mean converge rapidly to the ground truth values.

4 Conclusion

We proposed an approach to performing distributed Bayesian posterior sampling where each compute node contains a different subset of data. We showed that through very low-cost and rapidly converging EP messages passed among the nodes, the local MCMC samplers can be made to share a number of moment statistics such as the mean and covariance.
This in turn allows the local MCMC samplers to converge to the same part of the parameter space, and allows each local sample produced to be interpreted as an approximate global sample without the need for a combination stage. Through empirical studies, we showed that our methods are more accurate than previous methods and also exhibit better scalability in the number of nodes. Interesting avenues of research include using our SMS methods to adjust hyperparameters using either empirical or fully Bayesian learning, implementation and evaluation of the decentralised version of SMS, and theoretical analysis of the behaviour of EP under the stochastic perturbations caused by the MCMC estimation of moments.

Acknowledgements

We thank Willie Neiswanger for sharing his implementation of NEIS(n), and Michalis K Titsias for sharing the code used in [7]. MX, JZ and BZ gratefully acknowledge funding from the National Basic Research Program of China (No. 2013CB329403) and National NSF of China (Nos. 61322308, 61332007). BL gratefully acknowledges generous funding from the Gatsby charitable foundation. YWT gratefully acknowledges EPSRC for research funding through grant EP/K009362/1.

References

[1] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.

[2] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

[3] Jeffrey Dean and Sanjay Ghemawat.
MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[4] Matthew D Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

[5] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[6] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.

[7] Miguel Lázaro-Gredilla and Michalis K Titsias. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems, pages 2339–2347, 2011.

[8] Thomas P Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[9] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1656–1664, 2014.

[10] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

[11] Robb J Muirhead. Aspects of multivariate statistical theory, volume 197. John Wiley & Sons, 2009.

[12] Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 623–632, 2014.

[13] David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models.
The Journal of Machine Learning Research, 10:1801–1828, 2009.

[14] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.

[15] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

[16] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. EFaBBayes 250 conference, 16, 2013.

[17] Matthias W Seeger. Bayesian inference and optimal design for the sparse linear model. The Journal of Machine Learning Research, 9:759–813, 2008.

[18] Hisayuki Tsukuma and Yoshihiko Konno. On improved estimation of normal precision matrix and discriminant coefficients. Journal of Multivariate Analysis, 97(7):1477–1500, 2006.

[19] Xiangyu Wang and David B. Dunson. Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

[20] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.