{"title": "Entangled Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 2726, "page_last": 2734, "abstract": "We propose a novel method for scalable parallelization of SMC algorithms, Entangled Monte Carlo simulation (EMC).  EMC avoids the transmission of particles between  nodes, and instead reconstructs them from the particle genealogy. In particular, we show that we can reduce the communication to the particle weights for each machine while efficiently maintaining implicit global coherence of the parallel simulation. We explain methods to efficiently maintain a genealogy of particles from which any particle can be reconstructed. We demonstrate using examples from Bayesian phylogenetic that the computational gain from parallelization using EMC significantly outweighs the cost of particle reconstruction. The timing experiments show that reconstruction of particles is indeed much more efficient as compared to transmission of particles.", "full_text": "Entangled Monte Carlo\n\nSeong-Hwan Jun\n\nUniversity of British Columbia\n\n{seong.jun, l.wang, bouchard}@stat.ubc.ca\n\nLiangliang Wang\nDepartment of Statistics\n\nAlexandre Bouchard-C\u02c6ot\u00b4e\n\nAbstract\n\nWe propose a novel method for scalable parallelization of SMC algorithms, En-\ntangled Monte Carlo simulation (EMC). EMC avoids the transmission of particles\nbetween nodes, and instead reconstructs them from the particle genealogy. In par-\nticular, we show that we can reduce the communication to the particle weights for\neach machine while ef\ufb01ciently maintaining implicit global coherence of the paral-\nlel simulation. We explain methods to ef\ufb01ciently maintain a genealogy of particles\nfrom which any particle can be reconstructed. We demonstrate using examples\nfrom Bayesian phylogenetic that the computational gain from parallelization us-\ning EMC signi\ufb01cantly outweighs the cost of particle reconstruction. 
The timing\nexperiments show that reconstruction of particles is indeed much more ef\ufb01cient as\ncompared to transmission of particles.\n\n1\n\nIntroduction\n\nIn this paper, we focus on scalable parallelization of Monte Carlo simulation, a problem motivated by\nthe increasingly large inference problems occurring in a variety of \ufb01elds in science and engineering.\nSpeci\ufb01cally, we assume that we are given a large scale inference problem involving an intractable\nposterior expectation, for example a Bayes estimator, and that Monte Carlo simulation is to be used\nto approximate the targeted expectation.\nWe are speci\ufb01cally interested in parallel Monte Carlo algorithms that scale not only in scienti\ufb01c-\ncomputing clusters, where node communication is fast and cheap, but also in situations where com-\nmunication between nodes is limited by a combination of latency, throughput, and cost. For exam-\nple, severe communication constraints arise in peer-to-peer distributed computing projects such as\nBOINC [1], and more generally in clusters assembled from commodity hardware.\nSequential Monte Carlo (SMC) is generally viewed as the leading candidate for massively parallel\nsimulation, but because of particle resampling, existing implementations require the network transfer\nof a large number of particles and a central server with a global view on the weights carried by the\nparticles. As a consequence, the naive communication cost grows with the size of the inference\nproblem.\nOur main contribution is a method, Entangled Monte Carlo simulation (EMC), for carrying out SMC\nsimulation in a cluster with a communication cost per particle independent of the problem size. Our\napproach is fully generic and does not assume any known structure on the target distribution or the\nproposal used in the simulation. These desirable characteristics are achieved by limiting the contents\nof inter-node transmission to summary statistics on the particle weights. 
These summary statistics are compact and of size independent of the size of the state space of the target integral. We show that our summary statistics are sufficient, in the sense that they can be used in combination with the particle genealogy to quickly reconstruct any particle in any node of the cluster.\nWe will illustrate the advantage of particle reconstruction versus network transmission in the context of phylogenetic inference, a well-known example where Monte Carlo simulation is both important and challenging. In the case of the SMC sampler from [2], the cost of transmitting one particle is proportional to the product of the number of species under study, the number of sites in the sequences, and the number of characters possible at each site.\nWe also introduce the algorithms needed to do these reconstructions efficiently while maintaining a distributed representation of the particle genealogies. The main algorithm is based on an alternative representation of simulation borrowed from the field of perfect simulation [3]. We demonstrate that using our algorithms, the computational cost involved in these reconstructions is negligible compared to the corresponding gains obtained from parallelization. While we describe EMC in the context of SMC simulation, it can accommodate any MCMC proposal. This is done by using the construction of artificial backward kernels [4, 5].\nThere is a large literature on parallelization of both MCMC and SMC algorithms. For SMC, most of the work has been on parallelization of the proposal steps [6], which is sufficient in setups such as GPU parallelization where communication between computing units is fast and cheap.
However\nin generic clusters or peer-to-peer architectures, we argue that our more ef\ufb01cient parallelization of\nthe resampling step is advantageous.\nFor MCMC, there is a large amount of literature on parallelization involving kernels that take the\nform of local Gibbs update in a graphical model. These methods allow for several blocks of variables\nto be updated in parallel. However, the communication cost can be high in a dense graphical model\nas state information needs to be synchronized. Moreover, the method is restricted to certain kinds of\nGibbs kernel [7, 8, 9].\nAnother popular MCMC parallelization method is parallel tempering [10], where auxiliary chains\nare added to enable faster exploration of the space by swapping states in different chains. While\nparallel tempering has a low communication cost independent of the inference problem size, the\nadditional gain of parallelism can quickly decrease as more chains are added because many swaps\nare needed to get from the most heated chain to the main chain.\n\n2 Background\n\nWe will denote the target distribution by \u03c0, which in a Bayesian problem would correspond to a\nposterior distribution. The main goal is to compute the integral under \u03c0 of one or more test functions\nh, which we denote by \u03c0(h) for short. In a Bayesian problem, this arises as the posterior expectation\nneeded when computing a Bayes estimator. We will denote the state space by S, i.e. h : S \u2192 R,\n\u03c0 : FS \u2192 [0, 1], where (S,FS ) is a probability space.\n2.1 Stochastic maps\n\nAn important concept used in the construction of our algorithms is the idea of a stochastic map. 
We start by reviewing stochastic maps in the context of a Markov chain, where they were first introduced to design perfect simulation algorithms.\nLet T : S × FS → [0, 1] denote the transition kernel of a Markov chain (generally constructed by first proposing and then deciding whether to move or not using a Metropolis-Hastings (MH) ratio). A stochastic map is an equivalent view of this chain, pushing the randomness into a list of random transition functions. Formally, it is a (S → S)-valued random variable F such that T(s, A) = P(F(s) ∈ A) for all states s ∈ S and events A ∈ FS. Concretely, these maps are constructed by using the observation that T is typically defined via a transformation t(u, s) with u ∈ [0, 1]. The most fundamental example is the case where t is based on the inverse cumulative distribution method. We can then write F(s) = t(U, s) for a uniform random variable U on [0, 1].\nWith this notation, we get a non-standard, but useful, way of formulating MCMC algorithms. First, sample N stochastic maps F1, F2, . . . , FN independently and identically. Second, to compute the state of the chain after n transitions, simply return F1(F2(. . . (Fn(x0)) . . . )) = F1 ∘ ··· ∘ Fn(x0), for an arbitrary start state x0 ∈ S, n ∈ {1, 2, . . . , N}. This representation decouples the dependencies induced by random number generation from the dependencies induced by operations on the state space. In MCMC, the latter are still not readily amenable to parallelization, and this is the motivation for using SMC as the foundation of our method. We will show in Section 3 that SMC algorithms can also be rewritten using stochastic maps.\n\nFigure 1: (a) a graphical illustration of the SMC algorithm.
(b) Particle genealogy.\n\n2.2 SMC algorithms\n\nBefore going over our parallel version of SMC and to keep the exposition self-contained, we review here the notation and description of standard, serial SMC algorithms from [11], which in turn is based on the SMC framework of [12, 4, 5]. The samplers used in this paper are defined using a proposal ν : S × FS → [0, 1]. Here, S can be an enlarged version of the target space, with intermediate states added to ease sampling. We assume that π has been correspondingly enlarged. The technical conditions on ν and π are explained in [11], but for the purpose of understanding our method, only the algorithmic description of SMC given below is necessary.\nSMC proceeds in a sequence of generations indexed by r. At each generation, the algorithm keeps in memory a weighted list of K particles, sr,1, . . . , sr,K ∈ S, with corresponding weights wr,1, . . . , wr,K (see Figure 1 (a)). The weighted particles induce a distribution on S defined by:\n\nπr,K(A) ∝ Σ_{k=1}^{K} wr,k δ_{sr,k}(A),    (1)\n\nwhere A ∈ FS is an event, and δx(A) = 1 if x ∈ A and 0 otherwise. We define the algorithm recursively on the generation r. In the base case, we set w0,k = 1/K for all k ∈ {1, . . . , K}, and the s0,k are initialized to a designated start state ⊥. Given the list of particles and weights from the previous generation r − 1, a new list is created in three steps. The first step can be understood as a method for pruning unpromising particles. This is done by sampling independently K times from the weighted particles distribution πr,K defined above. The result of this step is that some of the particles (mostly those of low weight) will be pruned. We denote the sampled particles by ˜sr−1,1, . . . , ˜sr−1,K. The second step is to create new particles, sr,1, . . .
, sr,K, by extending the partial states of each of the sampled particles from the previous iteration. This is done by sampling K times from the proposal distribution, sr,k ∼ ν˜sr−1,k. The third step is to compute weights for the new particles: wr,k = α(˜sr−1,k, sr,k), where the weight update function α is an easy-to-evaluate deterministic function α : S² → [0, ∞). We give examples in Section 4.1.\nFinally, the target integral π(h) is approximated using the weighted distribution of the last generation R, πR,K(h). Note that using recent work on SMC, it is possible to convert any MCMC proposal targeting a state space X into a valid SMC algorithm [4, 5]. This can be done for example by using an expanded state space S = X^R and by constructing an auxiliary distribution on this new space. See [4, 5] for details.\n\n3 Entangled Monte Carlo Simulation\n\nTo parallelize SMC, we will view the applications of SMC proposals as a collection of stochastic maps to be shared across machines. Note that there are K · R proposal applications in total, which we will index by I ∋ i = i(r, k) = (r(i), k(i)) for convenience. Applying these stochastic maps, denoted by F = {Fi : i ∈ I}, is often computationally intensive (for example because of Rao-Blackwellization), and it is common to view this step as the computational bottleneck. At iteration r, each machine, with index m ∈ {1, . . .
, M}, will therefore be responsible for computing proposals\n\nAlgorithm 1: EMC(α, ν, h, I0)\n1: (F, G, H) ← entangle(ν) {Section 3.3}\n2: s ← empty-hashtable\n3: ρ ← empty-genealogy\n4: init(s, w)\n5: for r ∈ {1, . . .
, R} do\n6: exchange(wr−1)\n7: resample(wr−1, ρ, Ir−1, G) {Supplementary Material}\n8: Ir ← allocate(ρ, Ir−1, H) {Section 3.1}\n9: for i ∈ Ir do\n10: s(i) ← reconstruct(s, ρ, i, F) {Algorithm 2}\n11: wr,k(i) ← α(s(ρ(i)), s(i))\n12: end for\n13: end for\n14: process(s, w, h)\n\nAlgorithm 2: reconstruct(s, ρ, i, F = {Fi : i ∈ I})\n1: F ← I\n2: while (s(i) = nil) do\n3: F ← F ∘ Fi\n4: i ← ρ(i)\n5: end while\n6: return F(s(i))\n\nFigure 2: Illustration of compact particles (blue), concrete particles (black), and discarded particles (grey).\n\nfor only a subset Ir of the particle indices {1, . . . , K}. We refer to machine m as the reference machine. For brevity of notation, we omit m when it is clear that we refer to the reference machine.\nParallelizing SMC is complicated by the resampling step. If roughly all particles were resampled exactly once, we would be able to assign to each machine the same indices as in the previous iteration, avoiding communication. However, this rarely happens in practice. Instead, a small number of particles is often resampled a large number of times while many others have no offspring. This means that Ir can radically change across iterations. This raises an important question: how can a machine compute a proposal if the particle from which to propose was itself computed by a different machine?\nThe naive approach would consist in transmitting the 'missing' particles over the network. However, even if basic optimizations are used (for example, sending particles with multiplicities only once), we show in Section 4 that this transfer can be slow in practice. Instead, our approach relies on a combination of the stochastic maps with the particle genealogy to reconstruct the particle.
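To make the replay idea concrete, here is a minimal sketch of stochastic maps in code: each map Fi is a deterministic function whose underlying uniform variate is derived from the integer index i alone, so any machine can re-apply the same maps without receiving any state over the network. The toy real-valued state space and all names here are illustrative stand-ins, not the paper's implementation:

```python
import random

def stochastic_map(i):
    # The uniform variate u_i is a deterministic function of the integer
    # index i = i(r, k), so every machine derives the identical F_i.
    u = random.Random(i).random()
    def F(s):
        # Toy transition t(u, s); a real sampler would place its SMC
        # proposal (or MH proposal plus accept/reject) here.
        return s + (u - 0.5)
    return F

def replay(indices, s0):
    # Compose F_{i_1} o ... o F_{i_n} applied to s0, innermost (last index) first.
    s = s0
    for i in reversed(indices):
        s = stochastic_map(i)(s)
    return s
```

Because the replay is deterministic in the index sequence, a particle's state is fully determined by its genealogy, which is what EMC exploits in place of particle transmission.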
Let us see what this means in more detail, by going over the key steps of EMC, shown in Algorithm 1.\nFirst, note that the resampling step in SMC algorithms induces a one-to-many relationship between the particles in generation r and those in generation r − 1. This relationship is called the particle genealogy, illustrated in Figure 1 (b). Formally, a genealogy is a directed graph where the nodes are particles sr,k, r ∈ {1, . . . , R}, k ∈ {1, . . . , K}, and where node sr−1,k is deemed the parent of node sr,k′ if the latter was obtained by resampling ˜sr−1,k′ = sr−1,k followed by proposing sr,k′ from ˜sr−1,k′.\nSuppose for now that each machine kept track of the full genealogy, in the form of a hashtable ρ : I → I of parent pointers. Each machine also maintains a hashtable s : I → S ∪ {nil} holding the particles held in memory in the reference machine (the value nil represents a particle not currently represented explicitly in the reference machine). Algorithm 2 shows that this information, s, ρ, F, is sufficient to instantiate any query particle (indexed by i in the pseudo-code). Note that the procedure reconstruct is guaranteed to terminate: in the procedure init, we set s(i(0, k)) to ⊥, and the weights uniformly, hence ⊥ is an ancestor of all particles.\nThis high-level description raises several questions. How can we efficiently store and retrieve the stochastic maps? Can we maintain a sparse view of the genealogical information ρ, s to keep space requirements low? Finally, how can we do resampling and particle allocation in this distributed framework?
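Algorithm 2 translates almost line-for-line into code: climb parent pointers in ρ until a concretely stored ancestor is found, then re-apply the collected maps in order. A sketch with plain dictionaries standing in for the paper's hashtables (the tables and map functions here are toy stand-ins):

```python
def reconstruct(s, rho, i, F):
    # s   : index -> state, with None (nil) for a 'compact' particle
    # rho : index -> parent index (the genealogy)
    # F   : index -> deterministic stochastic map F_i
    maps = []
    while s.get(i) is None:          # Algorithm 2, lines 2-5
        maps.append(F[i])
        i = rho[i]
    state = s[i]                     # nearest concrete ancestor
    for f in reversed(maps):         # re-apply oldest collected map first
        state = f(state)
    return state
```

For instance, with s = {0: 3}, rho = {1: 0, 2: 1} and F = {1: lambda x: x + 1, 2: lambda x: 2 * x}, reconstructing particle 2 yields 2 * (3 + 1) = 8.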
We will cover these issues in the remainder of this section, describing at the same time how the procedures allocate, resample and exchange are implemented.\n\n3.1 Allocation and resampling\n\nIn SMC algorithms, the weights are periodically used for resampling the particles, a step also known as the bootstrapping stage and denoted by resample in Algorithm 1. This is the only stage where EMC requires communication over the network.
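Concretely, once the weight vector has been exchanged, every machine can run the same resampling step from a shared seed and obtain identical offspring counts, so no further coordination is needed. A minimal multinomial-resampling sketch using only the standard library (the paper's actual resampling scheme may differ):

```python
import random

def resample_counts(weights, K, seed):
    # Draw K offspring according to the normalized weights. All machines
    # call this with the same weights and seed, hence agree on the result.
    rng = random.Random(seed)
    total = sum(weights)
    counts = [0] * len(weights)
    for _ in range(K):
        u, acc = rng.random() * total, 0.0
        for k, w in enumerate(weights):
            acc += w
            if u <= acc:
                counts[k] += 1
                break
    return counts
```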
With each machine having full information of the weights in the current iteration, each can perform a standard, global resampling step without further communication.\nIn most cases of interest, each machine can transmit the individual weights of all its particles to every other machine (either via a central server, or a decentralized scheme such as [13]) without becoming the bottleneck. Extreme cases, where even the list of weights alone is too large to transmit, can also be handled by transmitting only the sum of the weights of each machine, and using a distributed hashtable [13] to represent the genealogy. The modifications needed to implement this are discussed in the Supplementary Material. We focus on the simpler case here.\nOnce the resampling step determines which particles survive to the next generation, the next step is to determine the allocation of particles to machines. Particle allocation is an optimization problem where the objective is to minimize the reconstruction time over the set of partitions of the particles. Let {A^1_r, . . . , A^M_r} be a partition of the particle indices {1, . . . , K} at generation r, and let c_m denote the maximum number of particles that can be processed by machine m. For i ∈ A^m_r, let Φ(i) be the number of times the stochastic map needs to be applied. The objective function is defined as\n\nmin_{ {A^1_r, . . . , A^M_r} : |A^m_r| ≤ c_m ∀m } Σ_{m=1}^{M} Σ_{i ∈ A^m_r} Φ(i).\n\nObtaining an exact solution to this optimization problem is infeasible in practice as it requires enumerating the set of all possible partitions. We propose greedy methods where each machine retains as many particles from I^m_{r−1} as possible. Let Ĩ^m_r be the set of particles resampled from machine m. If |Ĩ^m_r| − c_m > 0, this machine is in surplus of particles. We propose a variety of
We propose variety of\ngreedy schemes to allocate the surplus of particles over to machines m(cid:48)\nr | > 0.\nFirstOpen: a deterministic scheme where a known list of preferred machines are known by all\nmachines. The surplus particles are allocated according to this list.\nMostAvailable: attempts to allocate the surplus particles to machines with the most capacity as\nde\ufb01ned by cm(cid:48) \u2212 \u02dcI m(cid:48)\nr .\nRandom: samples a machine m(cid:48) at random with equal probability 1/M. The intention is that the\nparticles are mixed well over different machines so that the reconstruction algorithm rarely traces\nback the genealogy to the root ancestor.\n\n(cid:54)= m, where cm(cid:48)\u2212|\u02dcI m(cid:48)\n\n3.2 Genealogy\n\nIn this section, we argue that for the purpose of reconstruction, only a sparse subset of the genealogy\nneeds to be represented at any given iteration and machine. The key idea is that if a particle has no\ndescendant in the current generation, storing its parent is not necessary. In practice, we observed that\nthe vast majority of the ancestral particles have this property. We discuss at the end of this section\nsome intuition as of why this holds, using a coalescent model.\nLet us \ufb01rst look at how we can ef\ufb01ciently exploit this property. First, it is useful to draw a distinction\nbetween concrete particles, with s(i) (cid:54)= nil, and compact particles, which are particles implicitly\nrepresented via an integer (the parent of the particle), and are therefore considerably more space-\nef\ufb01cient. For example, in the smallest phylogenetic example considered in Section 4, a compact\nparticle occupies about 50, 000 times less memory than a concrete particle. 
Whereas a concrete particle can grow in size as the problem size increases, the size of a compact particle is fixed.\nParticles, concrete or compact, can become obsolete, meaning that the algorithm can guarantee that they will not be needed in subsequent iterations. This can happen for at least two reasons, each of which is efficiently detected at a different stage of the algorithm.\nUpdate after resampling: Any lineage (path in ρ) that did not survive the resampling stage no longer needs to be maintained. This is illustrated in Figure 2. The greyed-out particles will never be reconstructed in future generations, so they are no longer maintained. Note that it is easy to harness a garbage collector to perform this update in practice.\nUpdate after reconstruction: Once a particle is reconstructed, the lineage of the reconstructed particle can be updated. Let j be the particle that is reconstructed at generation r. At any future generation r′ > r, the reconstruction algorithm will only trace up to j (as s(j) ≠ nil), and hence all of its ancestors can be discarded. Note that similar updates can be performed on s to keep s sparse as well.\nThe coalescent [14] can provide a potential theoretical model for understanding why these strategies are so effective in practice. If we assume the weight function α to be constant, the genealogy induced by resampling can be viewed as a Wright-Fisher model [14, 15], which is well approximated by the coalescent when the number of particles is large. For example, this means that (1 − 1/k)/(1 − 1/K) is the expected time spent waiting for the last k copies to coalesce [15].\nNote that the coalescent also gives an intuition for why Algorithm 2 terminates well before reaching the initial symbol ⊥.
Again, this reflects what we observed in our experiments.\n\n3.3 Compact representation of the stochastic maps\n\nThe cardinality of the set of stochastic maps F = {Fi : i ∈ I} grows proportionally to the number of particles K times the number of generations R. To store these maps naively would require the storage of O(KR) uniform random variables Ui. However, since in practice pseudo-random numbers rather than truly independent numbers are typically used, the sequence can be stored implicitly by maintaining only a random seed shared between machines. A drawback to this approach is that it is not efficient to perform random access of the random numbers. Random access of random numbers is an unusual requirement imposed by the genealogy reconstruction algorithm. Fortunately, as we discuss in this section, it is not hard to modify pseudo-random generators to support random access.\nThe simplest strategy to obtain faster random access is to cache intermediate internal states of the pseudo-random generator. For example, by doing so for every particle generation, we get a faster access time of O(K) and a larger space requirement of O(R). More generally, this method can provide a tradeoff of O(n) space and O(m) time with mn = RK.\nIn the Supplementary Material, we describe the details of an alternative that requires O(1) storage with O(log(KR)) time for random access of any given map with index i ∈ I. This method could potentially change the quality of the pseudo-random sequences obtained, but as described in Section 4.2, we have empirical evidence suggesting that the new pseudo-random scheme does not affect the quality of the estimated posterior expectations.\n\n4 Experiments\n\nIn this section, we demonstrate the empirical performance of our method on synthetic and real datasets.
As a first validation, we start by demonstrating that the behavior of our sampler equipped with our stochastic map datastructure is indistinguishable from that of a sampler based on a standard pseudo-random generator. Then we show results on the task of Bayesian phylogenetic inference, a challenging domain where massively parallel simulation is likely to have an impact for practitioners—running phylogenetic MCMC chains for months is not uncommon. To keep the exposition self-contained, we include a review of the phylogenetic SMC techniques we used.\n\n4.1 Experimental setup\n\nGiven a collection of biological sequences for different species (taxa), Bayesian phylogenetics aims to compute expectations under a posterior distribution over phylogenetic trees, which represent the relationships among the species under study [16]. For intermediate to large numbers of species, Bayesian phylogenetic inference via SMC requires a large number of particles to achieve an accurate estimate. This is due to the fact that the total number of distinct tree topologies increases at a super-exponential rate as the number of species increases [16].\nIn the following section, we use the phylogenetic SMC algorithm described in [2], where particles are proposed using a proposal with density ν(s → s′). Starting from a fully disconnected forest over the species, ν picks one pair of trees in the forest at random, and forms a new tree by connecting their roots. Under weak conditions described in [11], the following weight update yields a consistent estimator for the posterior over phylogenies:\n\nα(˜sr−1 → sr) ← (γ(sr)/γ(˜sr−1)) · (1/ν(˜sr−1 → sr)),\n\nwhere γ is an unnormalized density over forests.
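In code, one SMC generation under this weight update needs only pointwise evaluations of γ and the proposal density ν. The following toy sketch uses stand-in γ and ν on a numeric state space, not the phylogenetic forest model:

```python
def smc_step(particles, gamma, propose, nu_density, rng):
    # Propose s' ~ nu(s -> .) from each (already resampled) particle and
    # weight it by alpha(s -> s') = gamma(s') / gamma(s) * 1 / nu(s -> s').
    new_particles, weights = [], []
    for s in particles:
        s_new = propose(s, rng)
        alpha = gamma(s_new) / gamma(s) / nu_density(s, s_new)
        new_particles.append(s_new)
        weights.append(alpha)
    return new_particles, weights
```

The returned weights are exactly the α values fed to the resampling step of the next generation.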
In the experiments in Section 4.2 and the Supplementary Material, where we wanted to run our SMC for more iterations, we use an alternation of kernels: in a first phase, the kernel described above, and in a second phase, the MCMC kernel of [17], transformed into an SMC kernel using the technique of [4, 5].\nTo generate synthetic datasets, we used a standard process [11]: we sampled trees from the coalescent, simulated data along the tree using a K2P likelihood model, discarded the values at internal nodes to keep only the observations at the leaves, and held out the tree.\nFor real datasets, we used the manually aligned ribosomal RNA (rRNA) dataset of [18]. We used a subset of 28 sequences in the directory containing 5S rRNA sequences of Betaproteobacteria and a larger subset of 4,510 sequences of 16S rRNA sequences from Actinobacteria. We did experiments on two different numbers of subsampled species: 20 and 100.\n\n4.2 Validation of the stochastic random maps datastructure\n\nTo check if the scheme described in Section 3.3 affects the quality of the SMC approximation of the target distribution, we carried out experiments to compare the quality of the SMC approximation based on pseudo-random numbers generated from the algorithm outlined in the Supplementary Material against that based on the standard pseudo-random number generator. The dataset is a synthetic phylogenetic dataset with 20 taxa and 1000 sites. We measured a tree metric, the Robinson-Foulds metric, on the consensus tree at every iteration, to detect potential biases in the estimator. We show random examples of pairs of runs in the Supplementary Material.\n\n4.3 Speed-up results\n\nIn this section, we show experimental results where we measure the speed-up of an EMC algorithm on two sets of phylogenetic data by counting the number of times the maps Fi are applied.
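The speed-up measured below is a simple function of these application counts: with N1 map applications in the serial run and NM in the M-machine run, the speedup is SM = M · N1/NM. A one-line helper (variable names are ours):

```python
def speedup(M, N_serial, N_parallel):
    # R_M = N_parallel / N_serial measures the reconstruction overhead;
    # S_M = M / R_M discounts the ideal M-fold speedup by that overhead.
    return M / (N_parallel / N_serial)
```

With no reconstruction overhead (NM = N1) this recovers the ideal speedup M.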
The question we explore here is how deep the reconstruction algorithm has to trace back, or more precisely, how many times a parallelized version of our algorithm applies the maps Fi compared to the number of times the equivalent operation is performed in the serial version of SMC.\nWe denote by N1 the number of times the proposal function is applied in serial SMC, and by NM the number of stochastic maps applied in our algorithm run on M machines. We measure the speedup as SM = M/RM, where RM = NM/N1.\nWe ran these experiments on the 16S and 5S subsets of the rRNA data described earlier. In both subsets, we found a substantial speedup, suggesting that deep reconstruction was rarely needed in practice. We also obtained the following empirical ranking of the performance of the allocation methods: FirstOpen > MostAvailable > Random. We show the results on 100 taxa (species) for 5S and 16S in Figure 3.\nWe also performed an experiment on synthetic data generated with 20 taxa and 1000 sites, showing that parallelization using EMC is beneficial even in the corner case when the weights are all equal. For the purposes of illustration, we included an extra allocation method, chaos: an allocation method where particles are allocated at random, disregarding the greedy methods suggested in Section 3.1. We show the results in Figure 3, where it can be seen that the speedup is still substantial in this context for all of the allocation methods.\n\n4.4 Timing results\n\nAn SMC algorithm can easily be distributed over multiple machines by relying naively on particle transmission between machines over the network.
In this section, we compare the particle transmission time to the reconstruction time of EMC on Amazon EC2 micro instances.

Figure 3: The speedup factor for (a) the 16S actinobacteria dataset with 100 taxa, (b) the 5S proteobacteria dataset with 100 taxa, and (c) the uniform weight synthetic experiment (see text).

Figure 4: (a) Total time for particle transfer (red) and total time for EMC (blue). (b) Sample generation time including reconstruction time (black), reconstruction time (blue), and particle transfer time (red) by generation.

The timing results in this section build on the results from Section 4.3, where we showed that the ratio of NM to N1 is small. Here, we ran the SMC algorithm for 100 generations and measured the total run time of the EMC algorithm and of an SMC algorithm parallelized via explicit particle transfer; see Figure 4 (a). We fixed the number of particles per machine at 100 and produced a sequence of experiments by doubling the number of machines, and hence the number of particles, at each step. In Figure 4 (b), we show the reconstruction time, the sample generation time (which includes the reconstruction time), and the particle transmission time by generation. As expected, particle transmission is the bottleneck of the SMC algorithm, whereas the reconstruction time is stable, which verifies that the reconstruction algorithm rarely traced deep.
The total timing result in Figure 4 (a) shows that the overhead arising from increasing the number of particles (and hence the number of machines) is much smaller than with the particle transmission method. The breakdown of time by generation in Figure 4 (b) shows that the particle transmission time is volatile, as it depends on network latency and throughput.
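This transmission-versus-reconstruction tradeoff can be caricatured with a toy cost model. All constants below (particle size, bandwidth, latency, map time, trace depth) are illustrative assumptions, not measurements from our experiments; the point is only the shape of the comparison: transfer cost scales with the on-the-wire size of the state, while reconstruction cost scales with the (typically shallow) replay depth.

```python
def transfer_cost_s(n_particles, particle_bytes, bandwidth_bps, latency_s):
    """Toy model: time to ship particles over the network; grows with
    the serialized size of the particle state."""
    return latency_s + n_particles * particle_bytes * 8 / bandwidth_bps

def reconstruction_cost_s(n_particles, avg_trace_depth, map_time_s):
    """Toy model: time to rebuild particles by replaying maps; pure CPU
    work, independent of how large the state would be on the wire."""
    return n_particles * avg_trace_depth * map_time_s

# Illustrative scenario: 100 particles of 1 MB each over a 100 Mbps
# link, versus replaying an average of 1.5 maps at 5 ms per map.
t_net = transfer_cost_s(100, 1_000_000, 100e6, 0.05)   # about 8 s
t_cpu = reconstruction_cost_s(100, 1.5, 0.005)         # 0.75 s
```

Under these assumptions reconstruction is an order of magnitude cheaper, and unlike the network figure it is deterministic rather than subject to latency and throughput variation.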
The reconstruction time is stable, as it relies only on CPU cycles.

5 Conclusion

We have introduced EMC, a method to parallelize an SMC algorithm over multiple nodes. The new method requires only a small amount of data communication over the network, of a size per particle independent of the scale of the inference problem. We have shown that the algorithm performs very well in practice on a Bayesian phylogenetics example, and our software can be downloaded at stat.ubc.ca/~seong.jun/.

Acknowledgements

We thank Arnaud Doucet, Fabian Wauthier, and the anonymous reviewers for their helpful comments.

References

[1] D. P. Anderson. BOINC: A system for public-resource computing and storage. In GRID '04: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, pages 4-10, Washington, DC, USA, 2004. IEEE Computer Society.

[2] Y. W. Teh, H. Daumé III, and D. M. Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NIPS), 2008.

[3] J. G. Propp and D. B. Wilson. Coupling from the past: a user's guide. Microsurveys in Discrete Probability, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 41:181-192, 1998.

[4] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(3):411-436, 2006.

[5] P. Del Moral, A. Doucet, and A. Jasra.
Sequential Monte Carlo for Bayesian computation. Bayesian Statistics, 8:1-34, 2007.

[6] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769-789, 2010.

[7] S. Singh and A. McCallum. Towards asynchronous distributed MCMC inference for large graphical models. In Neural Information Processing Systems (NIPS), Big Learning Workshop on Algorithms, Systems, and Tools for Learning at Scale, 2011.

[8] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, May 2011.

[9] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross-document coreference using distributed inference and hierarchical models. In Association for Computational Linguistics: Human Language Technologies (ACL HLT), 2011.

[10] R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57:2607-2609, November 1986.

[11] A. Bouchard-Côté, S. Sankararaman, and M. I. Jordan. Phylogenetic inference via Sequential Monte Carlo. Systematic Biology, 2011.

[12] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

[13] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In ACM SIGCOMM 2001, pages 149-160, 2001.

[14] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27-43, 1982.

[15] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.

[16] C. Semple and M. Steel. Phylogenetics. Oxford University Press, 2003.

[17] J. P. Huelsenbeck and F. Ronquist.
MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754-755, August 2001.

[18] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D'Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Muller, N. Pande, Z. Shang, N. Yu, and R. R. Gutell. The comparative RNA web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 2002.