{"title": "Graphical model inference: Sequential Monte Carlo meets deterministic approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 8190, "page_last": 8200, "abstract": "Approximate inference in probabilistic graphical models (PGMs) can be grouped into deterministic methods and Monte-Carlo-based methods. The former can often provide accurate and rapid inferences, but are typically associated with biases that are hard to quantify. The latter enjoy asymptotic consistency, but can suffer from high computational costs. In this paper we present a way of bridging the gap between deterministic and stochastic inference. Specifically, we suggest an efficient sequential Monte Carlo (SMC) algorithm for PGMs which can leverage the output from deterministic inference methods. While generally applicable, we show explicitly how this can be done with loopy belief propagation, expectation propagation, and Laplace approximations. The resulting algorithm can be viewed as a post-correction of the biases associated with these methods and, indeed, numerical results show clear improvements over the baseline deterministic methods as well as over \"plain\" SMC.", "full_text": "Graphical model inference: Sequential Monte Carlo\n\nmeets deterministic approximations\n\nDepartment of Information Technology\n\nDepartment of Science and Technology\n\nFredrik Lindsten\n\nUppsala University\nUppsala, Sweden\n\nJouni Helske\n\nLink\u00f6ping University\nNorrk\u00f6ping, Sweden\n\nfredrik.lindsten@it.uu.se\n\njouni.helske@liu.se\n\nMatti Vihola\n\nDepartment of Mathematics and Statistics\n\nUniversity of Jyv\u00e4skyl\u00e4\n\nJyv\u00e4skyl\u00e4, Finland\n\nmatti.s.vihola@jyu.fi\n\nAbstract\n\nApproximate inference in probabilistic graphical models (PGMs) can be grouped\ninto deterministic methods and Monte-Carlo-based methods. The former can often\nprovide accurate and rapid inferences, but are typically associated with biases\nthat are hard to quantify. 
The latter enjoy asymptotic consistency, but can suffer from high computational costs. In this paper we present a way of bridging the gap between deterministic and stochastic inference. Specifically, we suggest an efficient sequential Monte Carlo (SMC) algorithm for PGMs which can leverage the output from deterministic inference methods. While generally applicable, we show explicitly how this can be done with loopy belief propagation, expectation propagation, and Laplace approximations. The resulting algorithm can be viewed as a post-correction of the biases associated with these methods and, indeed, numerical results show clear improvements over the baseline deterministic methods as well as over \u201cplain\u201d SMC.\n\n1 Introduction\n\nProbabilistic graphical models (PGMs) are ubiquitous in machine learning for encoding dependencies in complex and high-dimensional statistical models [18]. Exact inference over these models is intractable in most cases, due to non-Gaussianity and non-linear dependencies between variables. Even for discrete random variables, exact inference is not possible unless the graph has a tree-topology, due to an exponential (in the size of the graph) explosion of the computational cost. This has resulted in the development of many approximate inference methods tailored to PGMs. These methods can, roughly speaking, be grouped into two categories: (i) methods based on deterministic (and often heuristic) approximations, and (ii) methods based on Monte Carlo simulations.\n\nThe first group includes methods such as Laplace approximations [30], expectation propagation [23], loopy belief propagation [26], and variational inference [36]. These methods are often promoted as being fast and can reach higher accuracy than Monte-Carlo-based methods for a fixed computational cost. 
The downside, however, is that the approximation errors can be hard to quantify, and even if the computational budget allows for it, simply spending more computation to improve the accuracy can be difficult. The second group of methods, including Gibbs sampling [28] and sequential Monte Carlo (SMC) [11, 24], has the benefit of being asymptotically consistent. That is, under mild assumptions they can often be shown to converge to the correct solution if simply given enough compute time. The problem, of course, is that \u201cenough time\u201d can be prohibitively long in many situations, in particular if the sampling algorithms are not carefully tuned.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper we propose a way of combining deterministic inference methods with SMC for inference in general PGMs expressed as factor graphs. The method is based on a sequence of artificial target distributions for the SMC sampler, constructed via a sequential graph decomposition. This approach has previously been used by [24] for enabling SMC-based inference in PGMs. The proposed method has one important difference, however: we introduce a so-called twisting function in the targets obtained via the graph decomposition, which allows for taking dependencies on \u201cfuture\u201d variables of the sequence into account. Using twisted target distributions for SMC has recently received significant attention in the statistics community, but to our knowledge, it has mainly been developed for inference in state space models [14, 16, 34, 31]. We extend this idea to SMC-based inference in general PGMs, and we also propose a novel way of constructing the twisting functions, as described below. 
We show in numerical illustrations that twisting the targets can significantly improve the performance of SMC for graphical models.\n\nA key question when using this approach is how to construct efficient twisting functions. Computing the optimal twisting functions boils down to performing exact inference in the model, which is assumed to be intractable. However, this is where the use of deterministic inference algorithms comes into play. We show how it is possible to compute sub-optimal, but nevertheless efficient, twisting functions using some popular methods: Laplace approximations, expectation propagation, and loopy belief propagation. Furthermore, the framework can easily be used with other methods as well, to take advantage of new and more efficient methods for approximate inference in PGMs.\n\nThe resulting algorithm can be viewed as a post-correction of the biases associated with the deterministic inference method used, by taking advantage of the rigorous convergence theory for SMC (see, e.g., [9]). Indeed, the approximation of the twisting functions only affects the efficiency of the SMC sampler, not its asymptotic consistency, nor the unbiasedness of the normalizing constant estimate (which is a key merit of SMC samplers). An implication of the latter point is that the resulting algorithm can be used together with pseudo-marginal [1] or particle Markov chain Monte Carlo (MCMC) [3] samplers, or as a post-correction of approximate MCMC [34]. This opens up the possibility of using well-established approximate inference methods for PGMs in this context.\n\nAdditional related work: An alternative approach to SMC-based inference in PGMs is to make use of tempering [10]. For discrete models, [15] propose to start with a spanning tree to which edges are gradually added within an SMC sampler to recover the original model. 
This idea is extended by [6], who define the intermediate targets based on conditional mean field approximations. Contrary to these methods, our approach can handle both continuous and non-Gaussian interactions, and does not rely on intermediate MCMC steps within each SMC iteration. When it comes to combining deterministic approximations and Monte-Carlo-based inference, previous work has largely focused on using the approximation as a proposal distribution for importance sampling [13] or MCMC [8]. Our method has the important difference that we use the deterministic approximation not only to design the proposal, but also to select the intermediate SMC targets via the design of efficient twisting functions.\n\n2 Setting the stage\n\n2.1 Problem formulation\n\nLet \u03c0(x_{1:T}) denote a distribution of interest over a collection of random variables x_{1:T} = {x_1, . . . , x_T}. The model may also depend on some \u201ctop-level\u201d hyperparameters, but for brevity we do not make this dependence explicit. In Bayesian statistics, \u03c0 would typically correspond to a posterior distribution over some latent variables given observed data. We assume that there is some structure in the model which is encoded in a factor graph representation [20],\n\n\u03c0(x_{1:T}) = (1/Z) \u220f_{j\u2208F} f_j(x_{I_j}),    (1)\n\nwhere F denotes the set of factors, I := {1, . . . , T} is the set of variables, I_j denotes the index set of variables on which factor f_j depends, and x_{I_j} := {x_t : t \u2208 I_j}. Note that I_j = Ne(j) is simply the set of neighbors of factor f_j in the graph (recall that in a factor graph all edges are between factor 
nodes and variable nodes). Lastly, Z is the normalization constant, also referred to as the partition function of the model, which is assumed to be intractable. The factor graph is a general representation of a probabilistic graphical model, and both directed and undirected PGMs can be written as factor graphs. The task at hand is to approximate the distribution \u03c0(x_{1:T}), as well as the normalizing constant Z. The latter plays a key role, e.g., in model comparison and in learning of top-level model parameters.\n\n2.2 Sequential Monte Carlo\n\nSequential Monte Carlo (SMC, see, e.g., [11]) is a class of importance-sampling-based algorithms that can be used to approximate some, quite arbitrary, sequence of probability distributions of interest. Let\n\n\u03c0_t(x_{1:t}) = \u03b3_t(x_{1:t}) / Z_t,    t = 1, . . . , T,\n\nbe a sequence of probability density functions defined on spaces of increasing dimension, where \u03b3_t can be evaluated point-wise and Z_t is a normalizing constant. SMC approximates each \u03c0_t by a collection of N weighted particles {(x^i_{1:t}, w^i_t)}_{i=1}^N, generated according to Algorithm 1.\n\nAlgorithm 1 Sequential Monte Carlo (all steps are for i = 1, . . . , N)\n\n1. Sample x^i_1 ~ q_1(x_1), set w\u0303^i_1 = \u03b3_1(x^i_1)/q_1(x^i_1) and w^i_1 = w\u0303^i_1 / \u03a3_{j=1}^N w\u0303^j_1.\n2. for t = 2, . . . , T:\n(a) Resampling: Simulate ancestor indices {a^i_t}_{i=1}^N with probabilities {\u03bd^i_{t\u22121}}_{i=1}^N.\n(b) Propagation: Simulate x^i_t ~ q_t(x_t | x^{a^i_t}_{1:t\u22121}) and set x^i_{1:t} = {x^{a^i_t}_{1:t\u22121}, x^i_t}.\n(c) Weighting: Compute w\u0303^i_t = \u03c9_t(x^i_{1:t}) w^{a^i_t}_{t\u22121} / \u03bd^{a^i_t}_{t\u22121} and w^i_t = w\u0303^i_t / \u03a3_{j=1}^N w\u0303^j_t.\n\nIn step 2(a) we use arbitrary resampling weights {\u03bd^i_{t\u22121}}_{i=1}^N, which may depend on all variables generated up to iteration t \u2212 1. 
This allows for the use of look-ahead strategies akin to the auxiliary particle filter [27], as well as adaptive resampling based on the effective sample size (ESS) [19]: if the ESS is below a given threshold, say N/2, set \u03bd^i_{t\u22121} = w^i_{t\u22121} to resample according to the importance weights. Otherwise, set \u03bd^i_{t\u22121} \u2261 1/N which, together with the use of a low-variance (e.g., stratified) resampling method, effectively turns the resampling off at iteration t.\n\nAt step 2(b) the particles are propagated forward by simulating from a user-chosen proposal distribution q_t(x_t | x_{1:t\u22121}), which may depend on the complete history of the particle path. The locally optimal proposal, which minimizes the conditional weight variance at iteration t, is given by\n\nq_t(x_t | x_{1:t\u22121}) \u221d \u03b3_t(x_{1:t}) / \u03b3_{t\u22121}(x_{1:t\u22121})    (2)\n\nfor t \u2265 2, and q_1(x_1) \u221d \u03b3_1(x_1). If, in addition to using the locally optimal proposal, the resampling weights are computed as \u03bd^i_{t\u22121} \u221d \u222b \u03b3_t({x^i_{1:t\u22121}, x_t}) dx_t / \u03b3_{t\u22121}(x^i_{1:t\u22121}), then the SMC sampler is said to be fully adapted. At step 2(c) new importance weights are computed using the weight function \u03c9_t(x_{1:t}) = \u03b3_t(x_{1:t}) / (\u03b3_{t\u22121}(x_{1:t\u22121}) q_t(x_t | x_{1:t\u22121})).\n\nThe weighted particles generated by Algorithm 1 can be used to approximate each \u03c0_t by the empirical distribution \u03a3_{i=1}^N w^i_t \u03b4_{x^i_{1:t}}(dx_{1:t}). Furthermore, the algorithm provides unbiased estimates of the normalizing constants Z_t, computed as \u1e90_t = \u220f_{s=1}^t { (1/N) \u03a3_{i=1}^N w\u0303^i_s }; see [9] and the supplementary material.\n\n3 Graph decompositions and twisted targets\n\nWe now turn our attention to the factor graph (1). 
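Before decomposing the graph, it may help to see Algorithm 1 in code. The function below is our own minimal NumPy sketch (multinomial resampling at every step with \u03bd_{t\u22121} = w_{t\u22121}), not the authors' implementation; all names are illustrative assumptions:\n\n```python\nimport numpy as np\n\ndef smc(log_gamma, propose, log_q, T, N, rng):\n    """Minimal sketch of Algorithm 1 with multinomial resampling at every\n    step (nu_{t-1} = w_{t-1}).  Returns the log of the estimate Z-hat_T.\n\n    log_gamma(t, paths): log gamma_t evaluated on N paths of length t+1\n    propose(t, paths, N, rng) / log_q(t, paths, x_t): proposal and its log-density\n    """\n    paths = propose(0, None, N, rng)[:, None]                    # N x 1 array\n    logw = log_gamma(0, paths) - log_q(0, None, paths[:, 0])\n    log_Z = np.logaddexp.reduce(logw) - np.log(N)                # log (1/N) sum of w~_1\n    for t in range(1, T):\n        w = np.exp(logw - logw.max())\n        anc = rng.choice(N, size=N, p=w / w.sum())               # step 2(a): resampling\n        paths = paths[anc]\n        x_t = propose(t, paths, N, rng)                          # step 2(b): propagation\n        paths = np.hstack([paths, x_t[:, None]])\n        logw = (log_gamma(t, paths) - log_gamma(t - 1, paths[:, :-1])\n                - log_q(t, paths[:, :-1], x_t))                  # step 2(c): weighting\n        log_Z += np.logaddexp.reduce(logw) - np.log(N)           # accumulate Z-hat\n    return log_Z\n```\n\nWith a bootstrap proposal equal to the model itself the incremental weights are exactly constant; in general \u1e90 is unbiased on the natural scale only, so log \u1e90 is negatively biased (cf. Section 6).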
To construct a sequence of target distributions for an SMC sampler, [24] proposed to decompose the graphical model into a sequence of sub-graphs, each defining an intermediate target for the SMC sampler. This is done by first ordering the variables, or the factors, of the model in some way; here we assume a fixed order of the variables x_{1:T}, as indicated by the notation (see Section 5 for a discussion about the ordering). We then define a sequence of unnormalized densities {\u03b3_t(x_{1:t})}_{t=1}^T by gradually including the model variables and the corresponding factors. This is done in such a way that the final density of the sequence includes all factors and coincides with the original target distribution of interest,\n\n\u03b3_T(x_{1:T}) = \u220f_{j\u2208F} f_j(x_{I_j}) \u221d \u03c0(x_{1:T}).    (3)\n\nWe can then target {\u03b3_t(x_{1:t})}_{t=1}^T with an SMC sampler. At iteration T the resulting particle trajectories can be taken as (weighted) samples from \u03c0, and \u1e90 := \u1e90_T will be an unbiased estimate of Z.\n\nTo define the intermediate densities, let F_1, . . . , F_T be a partitioning of the factor set F defined by:\n\nF_t = {j \u2208 F : t \u2208 I_j, t + 1 \u2209 I_j, . . . , T \u2209 I_j}.\n\nIn words, F_t is the set of factors depending on x_t, and possibly on x_{1:t\u22121}, but not on x_{t+1:T}. Furthermore, let F_{1:t} = F_1 \u2294 \u00b7 \u00b7 \u00b7 \u2294 F_t. Naesseth et al. [24] defined a sequence of intermediate target densities as^1\n\n\u03b3_t(x_{1:t}) = \u220f_{j\u2208F_{1:t}} f_j(x_{I_j}),    t = 1, . . . , T.    (4)\n\nSince F_{1:T} = F, it follows that the condition (3) is satisfied. However, even though this is a valid choice of target distributions, leading to a consistent SMC algorithm, the resulting sampler can have poor performance. 
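The partition {F_t} underlying (4) is mechanical to compute from the factor index sets; a minimal sketch (function and variable names are our own):\n\n```python\ndef factor_partition(factor_scopes, T):\n    """Compute the sets F_t of eq. (4): factor j belongs to F_t for\n    t = max(I_j), i.e. the step at which the last variable it touches\n    (in the chosen processing order) is added to the sequence."""\n    F = {t: [] for t in range(1, T + 1)}\n    for j, scope in enumerate(factor_scopes):\n        F[max(scope)].append(j)\n    return F\n```\n\nFor example, with factor scopes {1}, {1, 2}, {2, 3}, {1, 3} on T = 3 variables, the first factor lands in F_1, the second in F_2, and the last two in F_3.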
The reason is that the construction (4) neglects the dependence on \u201cfuture\u201d variables x_{t+1:T}, which may have a strong influence on x_{1:t}. Neglecting this dependence can result in samples at iteration t which provide an accurate approximation of the intermediate target \u03b3_t, but which are nevertheless very unlikely under the actual target distribution \u03c0.\n\nTo mitigate this issue we propose to use a sequence of twisted intermediate target densities,\n\n\u03b3^\u03c8_t(x_{1:t}) := \u03c8_t(x_{1:t}) \u03b3_t(x_{1:t}) = \u03c8_t(x_{1:t}) \u220f_{j\u2208F_{1:t}} f_j(x_{I_j}),    t = 1, . . . , T \u2212 1,    (5)\n\nwhere \u03c8_t(x_{1:t}) is an arbitrary positive \u201ctwisting function\u201d such that \u222b \u03b3^\u03c8_t(x_{1:t}) dx_{1:t} < \u221e. (Note that there is no need to explicitly compute this integral as long as it can be shown to be finite.) Twisting functions have previously been used by [14, 16] to \u201ctwist\u201d the Markov transition kernel of a state space (or Feynman-Kac) model; we take a slightly different viewpoint and simply consider the twisting function as a multiplicative adjustment of the SMC target distribution.\n\nThe definition of the twisted targets in (5) is of course very general and not very useful unless additional guidance is provided. To this end we state the following simple optimality condition (the proof is in the supplementary material; see also [14, Proposition 2]).\n\nProposition 1. Assume that the twisting functions in (5) are given by\n\n\u03c8*_t(x_{1:t}) := \u222b \u220f_{j\u2208F\u2216F_{1:t}} f_j(x_{I_j}) dx_{t+1:T},    t = 1, . . . , T \u2212 1,    (6)\n\nwhere F_{1:t} = F_1 \u2294 \u00b7 \u00b7 \u00b7 \u2294 F_t, that the locally optimal proposals (2) are used in the SMC sampler, and that \u03bd^i_t = w^i_t. Then Algorithm 1 results in particle trajectories exactly distributed according to \u03c0(x_{1:T}), and the estimate of the normalizing constant is exact: \u1e90 = Z w.p. 1.\n\nClearly, the optimal twisting functions are intractable in all situations of interest. 
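For intuition, in a small discrete chain the optimal twist (6) reduces to a backward recursion, and the \u03c8*-twisted target (5) then coincides with the exact marginal of the unnormalized joint over x_{1:t}, so no sample is ever \u201csurprised\u201d by future factors. The toy model below is our own illustration:\n\n```python\nimport numpy as np\n\n# Toy pairwise chain: gamma_T(x_{1:T}) = prod_{t=2}^T f_t(x_{t-1}, x_t), x_t in {0, 1}.\nT = 5\nrng = np.random.default_rng(1)\npair = [rng.uniform(0.5, 2.0, size=(2, 2)) for _ in range(T - 1)]  # pairwise factors\n\ndef gamma(t, x):\n    """Untwisted intermediate target (4): the factors fully contained in x_{1:t}."""\n    out = 1.0\n    for s in range(1, t):\n        out *= pair[s - 1][x[s - 1], x[s]]\n    return out\n\n# Optimal twist (6): psi*_t(x_t) = sum over x_{t+1:T} of the remaining factors.\n# For a chain it depends on x_t only and satisfies a backward recursion.\npsi = [np.ones(2) for _ in range(T)]\nfor t in range(T - 2, -1, -1):\n    psi[t] = pair[t] @ psi[t + 1]\n```\n\nSumming out the future variables by brute force confirms that \u03b3_t(x_{1:t}) \u03c8*_t(x_t) equals \u03a3_{x_{t+1:T}} \u03b3_T(x_{1:T}) for every t and every configuration.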
Indeed, computing (6) essentially boils down to solving the original inference problem. However, guided by this, we will strive to select \u03c8_t(x_{1:t}) \u2248 \u03c8*_t(x_{1:t}). As pointed out above, the approximation error here only affects the efficiency of the SMC sampler, not its asymptotic consistency or the unbiasedness of \u1e90. Various ways of approximating \u03c8*_t are discussed in the next section.\n\n^1 More precisely, [24] use a fixed ordering of the factors (and not the variables) of the model. They then include one or more additional factors, together with the variables on which these factors depend, in each step of the SMC algorithm. This approach is more or less equivalent to the one adopted here.\n\n4 Twisting functions via deterministic approximations\n\nIn this section we show how a few popular deterministic inference methods can be used to approximate the optimal twisting functions in (6), namely loopy belief propagation (Section 4.1), expectation propagation (Section 4.2), and Laplace approximations (Section 4.3). These methods are likely to be useful for computing the twisting functions in many situations; however, we emphasize that they are mainly used to illustrate the general methodology, which can be used with other inference procedures as well.\n\n4.1 Loopy belief propagation\n\nBelief propagation [26] is an exact inference procedure for tree-structured graphical models, although its \u201cloopy\u201d version has been used extensively as a heuristic approximation for general graph topologies. Belief propagation consists of passing messages:\n\nFactor \u2192 variable: \u03bc_{j\u2192s}(x_s) = \u222b f_j(x_{I_j}) \u220f_{u\u2208Ne(j)\u2216{s}} \u03bb_{u\u2192j}(x_u) dx_{I_j\u2216{s}},\n\nVariable \u2192 factor: \u03bb_{s\u2192j}(x_s) = \u220f_{i\u2208Ne(s)\u2216{j}} \u03bc_{i\u2192s}(x_s).\n\nIn graphs with loops, the messages are passed until convergence.\n\nTo see how loopy belief propagation can be used to approximate the twisting functions for SMC, we start with the following result for tree-structured models (the proof is in the supplementary material).\n\nProposition 2. Assume that the factor graph with variable nodes {1, . . . , t} and factor nodes {f_j : j \u2208 F_{1:t}} (recall that F_{1:t} = F_1 \u2294 \u00b7 \u00b7 \u00b7 \u2294 F_t) forms a (connected) tree for all t = 1, . . . , T. Then the optimal twisting function (6) is given by\n\n\u03c8*_t(x_{1:t}) = \u220f_{j\u2208F\u2216F_{1:t}} \u03bc_{j\u2192(1:t)}(x_{1:t}),   where   \u03bc_{j\u2192(1:t)}(x_{1:t}) = \u220f_{s\u2208{1, ..., t}\u2229I_j} \u03bc_{j\u2192s}(x_s).    (7)\n\nRemark 1. The sub-tree condition of Proposition 2 implies that the complete model is a tree, since this is obtained for t = T. The connectedness assumption can easily be enforced by gradually growing the tree, lumping model variables together if needed.\n\nWhile the optimality of (7) only holds for tree-structured models, we can still make use of this expression for models with cycles, analogously to loopy belief propagation. Note that the message \u03bc_{j\u2192(1:t)}(x_{1:t}) is the product of factor-to-variable messages going from the non-included factor j \u2208 F \u2216 F_{1:t} to included variables s \u2208 {1, . . . , t}. For a tree-structured model there is at most one such message (under the connectedness assumption of Proposition 2), whereas for a cyclic model \u03bc_{j\u2192(1:t)}(x_{1:t}) might be the product of several \u201cincoming\u201d messages.\n\nIt should be noted that the numerous modifications of the loopy belief propagation algorithm that are available can be used within the proposed framework as well. In fact, methods based on tempering of the messages, such as tree-reweighting [35], could prove to be particularly useful. The reason is that these methods counteract the double-counting of information in classical loopy belief propagation, which could be problematic for the subsequent SMC sampler due to an over-concentration of probability mass. That being said, we have found that even the standard loopy belief propagation algorithm can result in efficient twisting, as illustrated numerically in Section 6.1, and we do not pursue message-tempering further in this paper.\n\n4.2 Expectation propagation\n\nExpectation propagation (EP, [23]) is based on introducing approximate factors f\u0303_j(x_{I_j}) \u2248 f_j(x_{I_j}) such that\n\n\u03c0\u0303(x_{1:T}) = \u220f_{j\u2208F} f\u0303_j(x_{I_j}) / \u222b \u220f_{j\u2208F} f\u0303_j(x_{I_j}) dx_{1:T}    (8)\n\napproximates \u03c0(x_{1:T}), and where the f\u0303_j's are assumed to be simple enough so that the integral in the expression above is tractable. The approximate factors are updated iteratively until some convergence criterion is met. To update factor f\u0303_j, we first remove it from the approximation to obtain the so-called cavity distribution \u03c0\u0303_{\u2212j}(x_{1:T}) \u221d \u03c0\u0303(x_{1:T})/f\u0303_j(x_{I_j}). We then compute a new approximate factor f\u0303_j, such that f\u0303_j(x_{I_j}) \u03c0\u0303_{\u2212j}(x_{1:T}) approximates f_j(x_{I_j}) \u03c0\u0303_{\u2212j}(x_{1:T}). Typically, this is done by minimizing the Kullback\u2013Leibler divergence between the two distributions. We refer to [23] for additional details on the EP algorithm.\n\nOnce the EP approximation has been computed, it can naturally be used to approximate the optimal twisting functions in (6). By simply plugging in f\u0303_j in place of f_j we get\n\n\u03c8_t(x_{1:t}) = \u222b \u220f_{j\u2208F\u2216F_{1:t}} f\u0303_j(x_{I_j}) dx_{t+1:T}.    (9)\n\nFurthermore, the EP approximation can be used to approximate the optimal SMC proposal. Specifically, at iteration t we can select the proposal distribution as\n\nq_t(x_t | x_{1:t\u22121}) = \u03c0\u0303(x_t | x_{1:t\u22121}) = ( \u220f_{j\u2208F_t} f\u0303_j(x_{I_j}) ) \u222b \u220f_{j\u2208F\u2216F_{1:t}} f\u0303_j(x_{I_j}) dx_{t+1:T} / \u222b \u220f_{j\u2208F\u2216F_{1:t\u22121}} f\u0303_j(x_{I_j}) dx_{t:T}.    (10)\n\nThis choice has the advantage that the weight function gets a particularly simple form:\n\n\u03c9_t(x_{1:t}) = \u03b3^\u03c8_t(x_{1:t}) / ( \u03b3^\u03c8_{t\u22121}(x_{1:t\u22121}) q_t(x_t | x_{1:t\u22121}) ) = \u220f_{j\u2208F_t} f_j(x_{I_j}) / f\u0303_j(x_{I_j}).    (11)\n\n4.3 Laplace approximations for Gaussian Markov random fields\n\nA specific class of PGMs with a large number of applications in spatial statistics are latent Gaussian Markov random fields (GMRFs, see, e.g., [29, 30]). These models are defined via a Gaussian prior p(x_{1:T}) = N(x_{1:T} | \u03bc, Q^{\u22121}), where the precision matrix Q has Q_{ij} \u2260 0 if and only if variables x_i and x_j share a factor in the graph. When this latent field is combined with some non-Gaussian or non-linear observational densities p(y_t|x_t), t = 1, . . . , T, the posterior \u03c0(x_{1:T}) is typically intractable. However, when p(y_t|x_t) is twice differentiable, it is straightforward to find an approximating Gaussian model based on a Laplace approximation by simple numerical optimization [12, 33, 30], and use the obtained model as a basis of twisted SMC. Specifically, we use\n\n\u03c8_t(x_{1:t}) = \u222b \u220f_{s=t+1}^{T} { p\u0303(y_s|x_s) } p(x_{t+1:T} | x_{1:t}) dx_{t+1:T},    (12)\n\nwhere p\u0303(y_t|x_t) \u2248 p(y_t|x_t), t = 1, . . . , T, are the Gaussian approximations obtained using Laplace's method. 
For proposal distributions, we simply use the obtained Gaussian densities p\u0303(x_t | x_{1:t\u22121}, y_{1:T}). The weight functions have a form similar to (11): \u03c9_t(x_{1:t}) = p(y_t|x_t)/p\u0303(y_t|x_t). For state space models, this approach was recently used in [34].\n\n5 Practical considerations\n\nA natural question is how to order the variables of the model. In a time series context a trivial processing order exists, but it is more difficult to find an appropriate order for a general PGM. However, in Section 6.3 we show numerically that while the processing order has a big impact on the performance of non-twisted SMC, the effect of the ordering is less severe for twisted SMC. Intuitively this can be explained by the look-ahead effect of the twisting functions: even if the variables are processed in a non-favorable order they will not \u201ccome as a surprise\u201d.\n\nStill, intuitively a good candidate for the ordering is to make the model as \u201cchain-like\u201d as possible by minimizing the bandwidth (see, e.g., [7]) of the adjacency matrix of the graphical model. A related strategy is to instead minimize the fill-in of the Cholesky decomposition of the full posterior precision matrix. Specifically, this is recommended in the GMRF setting for faster matrix algebra [29], and this is the approach we use in Section 6.3. Alternatively, [25] propose a heuristic method for adaptive order selection that can be used in the context of twisted SMC as well.\n\nApplication of twisting often leads to nearly constant SMC weights and good performance. However, the boundedness of the SMC weights is typically not guaranteed. Indeed, the approximations may have lighter tails than the target, which may occasionally lead to very large weights. This is particularly problematic when the method is applied within a pseudo-marginal MCMC scheme, because unbounded likelihood estimators lead to poorly mixing MCMC [1, 2]. 
Fortunately, it is relatively easy to add a \u2018regularization\u2019 to the twisting, which leads to bounded weights. We discuss the regularization in more detail in the supplement.\n\nFinally, we comment on the computational cost of the proposed method. Once a sequence of twisting functions has been found, the cost of running twisted SMC is comparable to that of running non-twisted SMC. Thus, the main computational overhead comes from executing the deterministic inference procedure used for computing the twisting functions. Since the cost of this is independent of the number of particles N used for the subsequent SMC step, the relative computational overhead will diminish as N increases. As for the scaling with problem size T, this will very much depend on the choice of deterministic inference procedure, as well as on the connectivity of the graph, as is typical for graphical model inference. It is worth noting, however, that even for a sparse graph the SMC sampler needs to be efficiently implemented to obtain a favorable scaling with T. Due to the (in general) non-Markovian dependencies of the random variables x_{1:T}, it is necessary to keep track of the complete particle trajectories {x^i_{1:t}}_{i=1}^N for each t = 1, . . . , T. Resampling of these trajectories can, however, result in the copying of large chunks of memory (of order Nt at iteration t) if implemented in a \u2018straightforward manner\u2019. Fortunately, it is possible to circumvent this issue by an efficient storage of the particle paths, exploiting the fact that the paths tend to coalesce in log N steps; see [17] for details. We emphasize that this issue is inherent to the SMC framework itself, when applied to non-Markovian models, and does not depend on the proposed twisting method.\n\n6 Numerical illustration\n\nWe illustrate the proposed twisted SMC method on three PGMs, using the three deterministic approximation methods discussed in Section 4. 
In all examples we compare with the baseline SMC algorithm by [24]; the proposed and baseline samplers are denoted SMC-Twist and SMC-Base, respectively. While the methods can be used to estimate both the normalizing constant Z and expectations with respect to \u03c0, we focus the empirical evaluation on the former. The reasons for this are: (i) estimating Z is of significant importance on its own, e.g., for model comparison and for pseudo-marginal MCMC; (ii) in our experience, the accuracy of the normalizing constant estimate is a good indicator for the accuracy of other estimates as well; and (iii) the fact that SMC produces unbiased estimates of Z means that we can more easily assess the quality of the estimates. Specifically, log \u1e90 (which is what we actually compute) is negatively biased, and it therefore typically holds that higher estimates are better.\n\n6.1 Ising model\n\nAs a first proof of concept we consider a 16 \u00d7 16 square-lattice Ising model with periodic boundary conditions,\n\n\u03c0(x_{1:T}) = (1/Z) exp( \u03a3_{(i,j)\u2208E} J_{ij} x_i x_j + \u03a3_{i\u2208I} H_i x_i ),\n\nwhere T = 256 and x_i \u2208 {\u22121, +1}. We let the interactions be J_{ij} \u2261 0.44, and the external magnetic field is simulated according to H_i ~ i.i.d. Uniform(\u22121, 1). We use the Left-to-Right sequential decomposition considered by [24]. For SMC-Twist we use loopy belief propagation to compute the twisting potentials, as described in Section 4.1. Both SMC-Base and SMC-Twist use fully adapted proposals, which is possible due to the discrete nature of the problem. 
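The sum-product messages used here for the twisting potentials can be sketched generically for discrete factor graphs; the function below is our own toy implementation (flooding schedule, no damping), which is exact on trees and approximate on loopy graphs such as this Ising lattice:\n\n```python\nimport numpy as np\n\ndef loopy_bp(factors, n_vars, card=2, iters=50):\n    """Sum-product message passing on a discrete factor graph.\n    factors: list of (scope, table), scope a tuple of variable ids,\n    table an array of shape (card,) * len(scope).\n    Returns per-variable beliefs (approximate marginals)."""\n    mu = {(j, s): np.ones(card)                      # factor -> variable messages\n          for j, (scope, _) in enumerate(factors) for s in scope}\n    lam = {(s, j): np.ones(card) for (j, s) in mu}   # variable -> factor messages\n    for _ in range(iters):\n        for j, (scope, table) in enumerate(factors):\n            for s in scope:\n                t = table.copy()\n                for axis, u in enumerate(scope):     # multiply incoming lambdas\n                    if u != s:\n                        shape = [1] * t.ndim\n                        shape[axis] = card\n                        t = t * lam[(u, j)].reshape(shape)\n                axes = tuple(a for a, u in enumerate(scope) if u != s)\n                m = t.sum(axis=axes)                 # sum out the other variables\n                mu[(j, s)] = m / m.sum()\n        for (s, j) in lam:                           # product of other factors' messages\n            prod = np.ones(card)\n            for (j2, s2) in mu:\n                if s2 == s and j2 != j:\n                    prod = prod * mu[(j2, s)]\n            lam[(s, j)] = prod / prod.sum()\n    beliefs = np.ones((n_vars, card))\n    for (j, s), m in mu.items():\n        beliefs[s] = beliefs[s] * m\n    return beliefs / beliefs.sum(axis=1, keepdims=True)\n```\n\nThe converged factor-to-variable messages \u03bc_{j\u2192s} are exactly the quantities entering the twisting approximation (7).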
Apart from the computational overhead of running the belief propagation algorithm (which is quite small, and independent of the number of particles used in the subsequent SMC algorithm), the computational costs of the two SMC samplers are more or less the same.\n\nFigure 1: Results for the Ising model. See text for details.\n\nFigure 2: Results for LDA likelihood evaluation for the toy model (left), PubMed data (mid), and 20 newsgroups data (right). Dotted lines correspond to the plain EP estimates. See text for details.\n\nEach algorithm is run 50 times for varying numbers of particles. Box-plots over the obtained normalizing constant estimates are shown in Figure 1, together with a \u201cground truth\u201d estimate (dashed line) obtained with an annealed SMC sampler [10] with a large number of particles and temperatures. As is evident from the figure, the twisted SMC sampler outperforms the baseline SMC. Indeed, with twisting we get similar accuracy using N = 64 particles as the baseline SMC with N = 1024 particles.\n\n6.2 Topic model evaluation\n\nTopic models, such as latent Dirichlet allocation (LDA) [4], are widely used for information retrieval from large document collections. To assess the quality of a learned model it is common to evaluate the likelihood of a set of held-out documents. However, this turns out to be a challenging inference problem on its own, which has attracted significant attention [37, 5, 32, 22]. Naesseth et al. [24] obtained good performance for this problem with a (non-twisted) SMC method, outperforming the special-purpose Left-Right-Sequential sampler by [5]. Here we repeat this experiment and compare this baseline SMC with a twisted SMC. For computing the twisting functions we use the EP algorithm by Minka and Lafferty [22], specifically developed for inference in the LDA model. 
See [37, 22] and the supplementary material for additional details on the model and implementation. First we consider a synthetic toy model with 4 topics and 10 words, for which the exact likelihood can be computed. Figure 2 (left) shows the mean-squared errors of the log-likelihood estimates for the two SMC samplers as we increase the number of particles. As can be seen, twisting reduces the error by about half an order of magnitude compared to the baseline SMC. In the middle and right panels of Figure 2 we show results for two real datasets, PubMed Central abstracts and 20 newsgroups, respectively (see [37]). For each dataset we compute the log-likelihood of 10 held-out documents. The box-plots are for 50 independent runs of each algorithm, for different numbers of particles. As pointed out above, due to the unbiasedness of the SMC likelihood estimates it is typically the case that \u201chigher is better\u201d. This is also supported by the fact that the estimates increase on average as we increase the number of particles. With this in mind, we see that EP-based twisting significantly improves the performance of the SMC algorithm. Furthermore, even with as few as 50 particles, SMC-Twist clearly improves on the results of the EP algorithm itself, showing that twisted SMC can successfully correct for the bias of the EP method.\n\n6.3 Conditional autoregressive model with Binomial observations\n\nConsider a latent GMRF x_{1:T} ~ N(0, \u03c4 Q^{\u22121}), where Q_{tt} = n_t + d, Q_{tt\u2032} = \u22121 if t ~ t\u2032, and Q_{tt\u2032} = 0 otherwise. Here n_t is the number of neighbors of x_t, \u03c4 = 0.1 is a scaling parameter, and d = 1 is a regularization parameter ensuring a positive definite precision matrix. Given the latent field, we assume binomial observations y_t ~ Binomial(10, logit^{\u22121}(x_t)). 
The spatial structure of the GMRF corresponds to the map of Germany obtained from the R package INLA [21], containing T = 544 regions. We simulated one realization of x1:T and y1:T from this configuration and then estimated the log-likelihood of the model 10 000 times with a baseline SMC using a bootstrap proposal, as well as with twisted SMC where the twisting functions were computed using a Laplace approximation (see details in the supplementary material). To test the sensitivity of the algorithms to the ordering of the latent variables, we randomly permuted the variables for each replication. We compare this random order with approximate minimum degree (AMD) reordering of the variables, applied before running the SMC. We also varied the number of particles N from 64 up to 1024. For both SMC approaches, we used adaptive resampling based on the effective sample size with a threshold of N/2. In addition, we ran a twisted sequential importance sampler (SIS), i.e., we set the resampling threshold to zero.

Figure 3: Results for GMRF likelihood evaluation. See text for details.

Figure 3 shows the log-likelihood estimates for SMC-Base, SIS and SMC-Twist with N = 64 and N = 1024 particles, with dashed lines corresponding to the estimates obtained from a single SMC-Twist run with 100 000 particles, and dotted lines to the estimates from the Laplace approximation. SMC-Base is highly affected by the ordering of the variables, while the effect is minimal for SIS and SMC-Twist.
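The adaptive resampling rule used in this experiment is the standard one: monitor the effective sample size (ESS) of the current weights and resample only when it drops below the threshold. A minimal sketch (with illustrative helper names, not the paper's code):

```python
import numpy as np

def ess(logw):
    """Effective sample size 1 / sum(w_i^2) of the normalized weights."""
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def maybe_resample(x, logw, threshold, rng):
    """Multinomial resampling, triggered only when ESS < threshold.

    threshold = N/2 gives the adaptive scheme used for both SMC
    approaches; threshold = 0 never triggers, i.e., plain sequential
    importance sampling (SIS)."""
    N = len(logw)
    if ess(logw) < threshold:
        w = np.exp(logw - logw.max())
        idx = rng.choice(N, size=N, p=w / w.sum())
        return x[idx], np.full(N, -np.log(N))  # reset to uniform weights
    return x, logw
```

Setting the threshold to zero turns the same code path into the SIS variant compared in Figure 3, which is why the two can share an implementation.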
Twisted SMC is relatively accurate already with 64 particles, whereas sequential importance sampling and SMC-Base exhibit large variance and bias even with 1024 particles.

7 Conclusions

The twisted SMC method for PGMs presented in this paper is a promising way to combine deterministic approximations with efficient Monte Carlo inference. We have demonstrated how three well-established methods can be used to approximate the optimal twisting functions, but we stress that the general methodology is also applicable with other methods.

An important feature of our approach is that it may be used as a 'plug-in' module with pseudo-marginal [1] or particle MCMC [3] methods, allowing for consistent hyperparameter inference. It may also be used as (parallelizable) post-processing of approximate hyperparameter MCMC, which is based purely on deterministic PGM inferences [cf. 34].

An interesting direction for future work is to investigate which properties of the approximations are most favorable to the SMC sampler. Indeed, it is not necessarily the case that the twisting functions obtained directly from the most accurate deterministic method result in the most efficient SMC sampler. It is also interesting to consider iterative refinements of the twisting functions, akin to the method proposed by [14], in combination with the approach taken here.

Acknowledgments

FL has received support from the Swedish Foundation for Strategic Research (SSF) via the project Probabilistic Modeling and Inference for Machine Learning (contract number: ICA16-0015) and from the Swedish Research Council (VR) via the projects Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278) and NewLEADS – New Directions in Learning Dynamical Systems (contract number: 621-2016-06079).
JH and MV have received support from the Academy of Finland (grants 274740, 284513 and 312605).

References

[1] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[2] C. Andrieu and M. Vihola. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077, 2015.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269–342, 2010.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] W. Buntine. Estimating likelihoods for topic models. In Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning, 2009.
[6] P. Carbonetto and N. de Freitas. Conditional mean field. In Advances in Neural Information Processing Systems (NIPS) 19, pages 201–208, 2006.
[7] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, 1969.
[8] N. de Freitas, P. Højen-Sørensen, M. I. Jordan, and S. Russell. Variational MCMC. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pages 120–127, 2001.
[9] P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004.
[10] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers.
Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.
[11] A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovskii, editors, The Oxford Handbook of Nonlinear Filtering, pages 656–704. Oxford University Press, Oxford, UK, 2011.
[12] J. Durbin and S. J. Koopman. Monte Carlo maximum likelihood estimation for non-Gaussian state space models. Biometrika, 84(3):669–684, 1997.
[13] Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems (NIPS) 12, pages 449–455, 1999.
[14] P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
[15] F. Hamze and N. de Freitas. Hot coupling: A particle approach to inference and normalization on pairwise undirected graphs. In Advances in Neural Information Processing Systems (NIPS) 18, pages 491–498, 2005.
[16] J. Heng, A. N. Bishop, G. Deligiannidis, and A. Doucet. Controlled sequential Monte Carlo. arXiv:1708.08396, 2018.
[17] P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, 25(2):487–496, 2015.
[18] M. I. Jordan. Graphical models. Statistical Science, 19(1):140–155, 2004.
[19] A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994.
[20] F. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum–product algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.
[21] F. Lindgren and H. Rue. Bayesian spatial modelling with R-INLA.
Journal of Statistical Software, 63(19):1–25, 2015.
[22] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), 2002.
[23] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), 2001.
[24] C. A. Naesseth, F. Lindsten, and T. B. Schön. Sequential Monte Carlo methods for graphical models. In Advances in Neural Information Processing Systems (NIPS) 27, pages 1862–1870, 2014.
[25] C. A. Naesseth, F. Lindsten, and T. B. Schön. Towards automated sequential Monte Carlo for probabilistic graphical models. NIPS Workshop on Black Box Inference and Learning, 2015.
[26] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, USA, 2nd edition, 1988.
[27] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[28] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.
[29] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 2005.
[30] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
[31] H. C. Ruiz and H. J. Kappen. Particle smoothing for hidden diffusion processes: adaptive path integral smoother. IEEE Transactions on Signal Processing, 65(12):3191–3203, 2017.
[32] G. S. Scott and J. Baldridge. A recursive estimate for the predictive likelihood in a topic model.
In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, 2009.
[33] N. Shephard and M. K. Pitt. Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84(3):653–667, 1997.
[34] M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. arXiv:1609.02541, 2018.
[35] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
[36] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[37] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[38] N. Whiteley, A. Lee, and K. Heine. On the role of interaction in sequential Monte Carlo algorithms. Bernoulli, 22(1):494–529, 2016.