{"title": "Ancestor Sampling for Particle Gibbs", "book": "Advances in Neural Information Processing Systems", "page_first": 2591, "page_last": 2599, "abstract": "We present a novel method in the family of particle MCMC methods that we refer to as particle Gibbs with ancestor sampling (PG-AS). Similarly to the existing PG with backward simulation (PG-BS) procedure, we use backward sampling to (considerably) improve the mixing of the PG kernel. Instead of using separate forward and backward sweeps as in PG-BS, however, we achieve the same effect in a single forward sweep. We apply the PG-AS framework to the challenging class of non-Markovian state-space models. We develop a truncation strategy of these models that is applicable in principle to any backward-simulation-based method, but which is particularly well suited to the PG-AS framework. In particular, as we show in a simulation study, PG-AS can yield an order-of-magnitude improved accuracy relative to PG-BS due to its robustness to the truncation error. Several application examples are discussed, including Rao-Blackwellized particle smoothing and inference in degenerate state-space models.", "full_text": "Ancestor Sampling for Particle Gibbs\n\nFredrik Lindsten\n\nDiv. of Automatic Control\n\nLink\u00a8oping University\n\nlindsten@isy.liu.se\n\nMichael I. Jordan\n\nDept. of EECS and Statistics\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nThomas B. Sch\u00a8on\n\nDiv. of Automatic Control\n\nLink\u00a8oping University\nschon@isy.liu.se\n\nAbstract\n\nWe present a novel method in the family of particle MCMC methods that we refer\nto as particle Gibbs with ancestor sampling (PG-AS). Similarly to the existing\nPG with backward simulation (PG-BS) procedure, we use backward sampling\nto (considerably) improve the mixing of the PG kernel. Instead of using sepa-\nrate forward and backward sweeps as in PG-BS, however, we achieve the same\neffect in a single forward sweep. 
We apply the PG-AS framework to the challeng-\ning class of non-Markovian state-space models. We develop a truncation strategy\nof these models that is applicable in principle to any backward-simulation-based\nmethod, but which is particularly well suited to the PG-AS framework. In partic-\nular, as we show in a simulation study, PG-AS can yield an order-of-magnitude\nimproved accuracy relative to PG-BS due to its robustness to the truncation error.\nSeveral application examples are discussed, including Rao-Blackwellized particle\nsmoothing and inference in degenerate state-space models.\n\n1\n\nIntroduction\n\nState-space models (SSMs) are widely used to model time series and dynamical systems. The\nstrong assumptions of linearity and Gaussianity that were originally invoked in state-space inference\nhave been weakened by two decades of research on sequential Monte Carlo (SMC) and Markov\nchain Monte Carlo (MCMC). These Monte Carlo methods have not, however, led to substantial\nweakening of a further strong assumption, that of Markovianity. It remains a major challenge to\ndevelop inference algorithms for non-Markovian SSMs:\n\nyt \u223c g(yt | \u03b8, x1:t),\n\nxt+1 \u223c f (xt+1 | \u03b8, x1:t),\n\n(1)\nwhere \u03b8 \u2208 \u0398 is a static parameter with prior density p(\u03b8), xt is the latent state and yt is the ob-\nservation at time t, respectively. Models of this form arise in many different application scenarios,\neither from direct modeling or via a transformation or marginalization of a larger model. We provide\nseveral examples in Section 5.\nTo tackle the challenging problem of inference for non-Markovian SSMs, we work within the frame-\nwork of particle MCMC (PMCMC), a family of inferential methods introduced in [1]. The basic idea\nin PMCMC is to use SMC to construct a proposal kernel for an MCMC sampler. Assume that we\nobserve a sequence of measurements y1:T . 
We are interested in \ufb01nding the density p(x1:T , \u03b8 | y1:T ),\ni.e., the joint posterior density of the state sequence and the parameter. In an idealized Gibbs sam-\npler we would target this density by sampling as follows: (i) Draw \u03b8(cid:63) | x1:T \u223c p(\u03b8 | x1:T , y1:T );\n1:T | \u03b8(cid:63) \u223c p(x1:T | \u03b8(cid:63), y1:T ). The \ufb01rst step of this procedure can be carried out\n(ii) Draw x(cid:63)\nexactly if conjugate priors are used. For non-conjugate models, one option is to replace Step (i)\nwith a Metropolis-Hastings step. However, Step (ii)\u2014sampling from the joint smoothing density\np(x1:T | \u03b8, y1:T )\u2014is in most cases very dif\ufb01cult. In PMCMC, this is addressed by instead sampling\na particle trajectory x(cid:63)\n1:T based on an SMC approximation of the joint smoothing density. More\nprecisely, we run an SMC sampler targeting p(x1:T | \u03b8(cid:63), y1:T ). We then sample one of the particles\n\n1\n\n\f1:T . This overall procedure is referred to as particle Gibbs (PG).\n\nat the \ufb01nal time T , according to their importance weights, and trace the ancestral lineage of this\nparticle to obtain the trajectory x(cid:63)\nThe \ufb02exibility provided by the use of SMC as a proposal mechanism for MCMC seems promising for\ntackling inference in non-Markovian models. To exploit this \ufb02exibility we must address a drawback\nof PG in the high-dimensional setting, which is that the mixing of the PG kernel can be very poor\nwhen there is path degeneracy in the SMC sampler [2, 3]. This problem has been addressed in the\ngeneric setting of SSMs by adding a backward simulation step to the PG sampler, yielding a method\ndenoted PG with backward simulation (PG-BS). 
It has been found that this considerably improves\nmixing, making the method much more robust to a small number of particles as well as larger data\nrecords [2, 3].\nUnfortunately, however, the application of backward simulation is problematic for non-Markovian\nmodels. The reason is that we need to consider full state trajectories during the backward simulation\npass, leading to O(T 2) computational complexity (see Section 4 for details). To address this issue,\nwe develop a novel PMCMC method which we refer to as particle Gibbs with ancestor sampling\n(PG-AS) that achieves the effect of backward sampling without an explicit backward pass. As part\nof our development, we also develop a truncation method geared to non-Markovian models. This\nmethod is a generic method that is also applicable to PG-BS, but, as we show in a simulation study in\nSection 6, the effect of the truncation error is much less severe for PG-AS than for PG-BS. Indeed,\nwe obtain up to an order of magnitude increase in accuracy in using PG-AS when compared to\nPG-BS in this study.\nSince we assume that it is straightforward to sample the parameter \u03b8 of the idealized Gibbs sampler,\nwe will not explicitly include sampling of \u03b8 in the subsequent sections to simplify our presentation.\n\n2 Sequential Monte Carlo\n\nWe \ufb01rst review the standard auxiliary SMC sampler, see e.g. [4,5]. Let \u03b3t(x1:t) for t = 1, . . . , T be\na sequence of unnormalized densities on Xt, which we assume can be evaluated pointwise in linear\ntime. Let \u00af\u03b3t(x1:t) be the corresponding normalized probability densities. For an SSM we would typ-\nically have \u00af\u03b3t(x1:t) = p(x1:t | y1:t) and \u03b3t(x1:t) = p(x1:t, y1:t). Assume that {xm\nm=1\nis a weighted particle system targeting \u00af\u03b3t\u22121(x1:t\u22121). 
This particle system is propagated to time t by sampling independently from a proposal kernel,\n\nM_t(a_t, x_t) = (w_{t-1}^{a_t} ν_{t-1}^{a_t} / ∑_l w_{t-1}^l ν_{t-1}^l) R_t(x_t | x_{1:t-1}^{a_t}).   (2)\n\nIn this formulation, the resampling step is implicit and corresponds to sampling the ancestor indices a_t. Note that a_t^m is the index of the ancestor particle of x_t^m. When we write x_{1:t}^m we refer to the ancestral path of x_t^m. The factors ν_t^m = ν_t(x_{1:t}^m), known as adjustment multiplier weights, are used in the auxiliary SMC sampler to increase the probability of sampling ancestors that can better describe the current observation [5]. The particles are then weighted according to w_t^m = W_t(x_{1:t}^m), where the weight function is given by\n\nW_t(x_{1:t}) = γ_t(x_{1:t}) / (γ_{t-1}(x_{1:t-1}) ν_{t-1}(x_{1:t-1}) R_t(x_t | x_{1:t-1})),   (3)\n\nfor t ≥ 2. The procedure is initiated by sampling from a proposal density x_1^m ∼ R_1(x_1) and assigning importance weights w_1^m = W_1(x_1^m) with W_1(x_1) = γ_1(x_1)/R_1(x_1). In PMCMC it is instructive to view this sampling procedure as a way of generating a single sample from the density\n\nψ(x_{1:T}, a_{2:T}) ≜ ∏_{m=1}^N R_1(x_1^m) ∏_{t=2}^T ∏_{m=1}^N M_t(a_t^m, x_t^m),   (4)\n\non the space X^{NT} × {1, ..., N}^{N(T-1)}. Here we have introduced the boldface notation x_t = {x_t^1, ..., x_t^N} and similarly for the ancestor indices.\n\n3 Particle Gibbs with ancestor sampling\n\nPMCMC methods are a class of MCMC samplers in which SMC is used to construct proposal kernels [1]. The validity of these methods can be assessed by viewing them as MCMC samplers on an\n\n2\n\n\fextended state space in which all the random variables generated by the SMC sampler are seen as auxiliary variables. The target density on this extended space is given by\n\nφ(x_{1:T}, a_{2:T}, k) ≜ (γ̄_T(x_{1:T}^k) / N^T) · ψ(x_{1:T}, a_{2:T}) / (R_1(x_1^{b_1}) ∏_{t=2}^T M_t(a_t^{b_t}, x_t^{b_t})).   (5)\n\nBy construction, this density admits γ̄_T(x_{1:T}^k) as a marginal, and can thus be used as a surrogate for the original target density γ̄_T [1]. Here k is a variable indexing one of the particles at the final time point and b_{1:T} corresponds to the ancestral path of this particle: x_{1:T}^k = x_{1:T}^{b_{1:T}} = {x_1^{b_1}, ..., x_T^{b_T}}. These indices are given recursively from the ancestor indices by b_T = k and b_t = a_{t+1}^{b_{t+1}}. The PG sampler [1] is a Gibbs sampler targeting φ using the following sweep (note that b_{1:T} = {a_{2:T}^{b_{2:T}}, b_T}):\n\n1. Draw {x_{1:T}^{⋆,-b_{1:T}}, a_{2:T}^{⋆,-b_{2:T}}} ∼ φ(x_{1:T}^{-b_{1:T}}, a_{2:T}^{-b_{2:T}} | x_{1:T}^{b_{1:T}}, a_{2:T}^{b_{2:T}}).\n2. Draw k⋆ ∼ φ(k | x_{1:T}^{⋆,-b_{1:T}}, a_{2:T}^{⋆,-b_{2:T}}, x_{1:T}^{b_{1:T}}, a_{2:T}^{b_{2:T}}).\n\nHere we have introduced the notation x_t^{-m} = {x_t^1, ..., x_t^{m-1}, x_t^{m+1}, ..., x_t^N} and similarly for the ancestor indices. In [1], a sequential procedure for sampling from the conditional density appearing in Step 1 is given. This method is known as conditional SMC (CSMC). It takes the form of an SMC sampler in which we condition on the event that a prespecified path x_{1:T}^{b_{1:T}} = x′_{1:T}, with indices b_{1:T}, is maintained throughout the sampler (see Algorithm 1 for a related procedure). Furthermore, the conditional distribution appearing in Step 2 of the PG sampler is shown to be proportional to w_T^k, and it can thus straightforwardly be sampled from. Note that we never sample new values for the variables {x_{1:T}^{b_{1:T}}, b_{1:T-1}} in this sweep. 
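For concreteness, Step 2 of the PG sweep (drawing k with probability proportional to w_T^k and tracing its ancestral lineage b_{1:T} to extract the trajectory x_{1:T}^{b_{1:T}}) can be sketched as follows. This is a minimal sketch; the array names `particles`, `ancestors`, and `weights` are illustrative and assumed to hold the output of one SMC sweep:

```python
import numpy as np

def sample_trajectory(particles, ancestors, weights, rng):
    """Draw k with probability proportional to the final weights w_T^m and
    trace the ancestral lineage b_T = k, b_t = a_{t+1}^{b_{t+1}} backwards,
    recovering the trajectory x_{1:T}^{b_{1:T}} and its indices b_{1:T}.

    particles : (T, N) array, particles[t, m] = x_t^m
    ancestors : (T, N) int array, ancestors[t, m] = a_t^m (row 0 unused)
    weights   : (N,) final importance weights w_T^m (unnormalized)
    """
    T, N = particles.shape
    b = np.empty(T, dtype=int)
    # Step 2: sample the index k of one particle at the final time T.
    b[-1] = rng.choice(N, p=weights / weights.sum())
    # Trace the ancestral lineage backwards through the stored indices.
    for t in range(T - 2, -1, -1):
        b[t] = ancestors[t + 1, b[t + 1]]
    path = particles[np.arange(T), b]
    return path, b
```

In the PG-AS sweep developed below, the same bookkeeping is reused: the returned indices b_{1:T} identify the reference path that is conditioned on at the next iteration.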
Hence, the PG sampler is an “incomplete” Gibbs sampler, since it does not loop over all the variables of the model. It still holds that the PG sampler is ergodic, which intuitively can be explained by the fact that the collection of variables that is left out is chosen randomly at each iteration. However, it has been observed that the PG sampler can mix very poorly, especially when N is small and/or T is large [2, 3]. The reason for this poor mixing is that SMC path degeneracy causes the collections of variables that are left out at any two consecutive iterations to be strongly dependent.\n\nWe now turn to our new procedure, PG-AS, which aims to address this fundamental issue. Our idea is to sample new values for the ancestor indices b_{1:T-1} as part of the CSMC procedure¹. By adding these variables to the Gibbs sweep, we can considerably improve the mixing of the PG kernel. The CSMC method is a sequential procedure to sample from φ(x_{1:T}^{-b_{1:T}}, a_{2:T}^{-b_{2:T}} | x_{1:T}^{b_{1:T}}, b_{1:T}), by sampling according to {x_t^{⋆,-b_t}, a_t^{⋆,-b_t}} ∼ φ(x_t^{-b_t}, a_t^{-b_t} | x_{1:t-1}^{⋆,-b_{1:t-1}}, a_{2:t-1}^{⋆,-b_{2:t-1}}, x_{1:T}^{b_{1:T}}, b_{1:T}) for t = 1, ..., T. After having sampled these variables at time t, we add a step in which we generate a new value for b_{t-1} (= a_t^{b_t}), resulting in the following sweep:\n\n1′. (CSMC with ancestor sampling) For t = 1, ..., T, draw\n{x_t^{⋆,-b_t}, a_t^{⋆,-b_t}} ∼ φ(x_t^{-b_t}, a_t^{-b_t} | x_{1:t-1}^{⋆,-b_{1:t-1}}, a_{2:t-1}^{⋆,-b_{2:t-1}}, x_{1:T}^{b_{1:T}}, b_{t-1:T}),\n(a_t^{⋆,b_t} =) b_{t-1}^⋆ ∼ φ(b_{t-1} | x_{1:t-1}^{⋆,-b_{1:t-1}}, a_{2:t-1}^{⋆,-b_{2:t-1}}, x_{1:T}^{b_{1:T}}, b_{t:T}).\n2′. Draw (k⋆ =) b_T^⋆ ∼ φ(b_T | x_{1:T}^{⋆,-b_{1:T}}, a_{2:T}^{⋆,-b_{2:T}}, x_{1:T}^{b_{1:T}}).\n\n¹Ideally, we would like to include the variables x_{1:T}^{b_{1:T}} as well, but this is in general not possible since it would be similar to sampling from the original target density (which we assume is infeasible).\n\n3\n\n\fIt can be verified that this corresponds to a partially collapsed Gibbs sampler [6] and will thus leave φ invariant. To determine the conditional densities from which the ancestor indices are drawn, consider the following factorization, which follows directly from (3):\n\nγ_t(x_{1:t}) = W_t(x_{1:t}) ν_{t-1}(x_{1:t-1}) R_t(x_t | x_{1:t-1}) γ_{t-1}(x_{1:t-1})\n⇒ γ_t(x_{1:t}^{b_t}) = w_t^{b_t} (∏_{s=1}^{t-1} ∑_l w_s^l ν_s^l) R_1(x_1^{b_1}) ∏_{s=2}^t M_s(a_s^{b_s}, x_s^{b_s}).   (6)\n\nFurthermore, we have\n\nφ(b_t | x_{1:t}, a_{2:t}, x_{t+1:T}^{b_{t+1:T}}, b_{t+1:T}) ∝ φ(x_{1:t}, a_{2:t}, x_{t+1:T}^{b_{t+1:T}}, b_{t:T}) ∝ (γ_T(x_{1:T}^k) / γ_t(x_{1:t}^{b_t})) · γ_t(x_{1:t}^{b_t}) ψ(x_{1:t}, a_{2:t}) / (R_1(x_1^{b_1}) ∏_{s=2}^t M_s(a_s^{b_s}, x_s^{b_s})).   (7)\n\nBy plugging (6) into the numerator we get\n\nφ(b_t | x_{1:t}, a_{2:t}, x_{t+1:T}^{b_{t+1:T}}, b_{t+1:T}) ∝ w_t^{b_t} γ_T(x_{1:T}^k) / γ_t(x_{1:t}^{b_t}).   (8)\n\nHence, to sample a new ancestor index for the conditioned path at time t + 1, we proceed as follows. Given x′_{t+1:T} (= x_{t+1:T}^{b_{t+1:T}}), we compute the backward sampling weights\n\nw_{t|T}^m = w_t^m γ_T({x_{1:t}^m, x′_{t+1:T}}) / γ_t(x_{1:t}^m),   (9)\n\nfor m = 1, ..., N. We then set b_t = m with probability proportional to w_{t|T}^m.\nIt follows that the proposed CSMC with ancestor sampling (Step 1′), conditioned on {x′_{1:T}, b_{1:T}}, can be realized as in Algorithm 1. The difference between this algorithm and the CSMC sampler derived in [1] lies in the ancestor sampling step 2(b) (there, one instead sets a_t^{b_t} = b_{t-1}). By introducing the ancestor sampling, we break the strong dependence between the generated particle trajectories and the path on which we condition. We call the resulting method, defined by Steps 1′ and 2′ above, PG with ancestor sampling (PG-AS).\n\nAlgorithm 1 CSMC with ancestor sampling, conditioned on {x′_{1:T}, b_{1:T}}\n\n1. Initialize (t = 1):\n(a) Draw x_1^m ∼ R_1(x_1) for m ≠ b_1 and set x_1^{b_1} = x′_1.\n(b) Set w_1^m = W_1(x_1^m) for m = 1, ..., N.\n2. For t = 2, ..., T:\n(a) Draw {a_t^m, x_t^m} ∼ M_t(a_t, x_t) for m ≠ b_t and set x_t^{b_t} = x′_t.\n(b) Draw a_t^{b_t} with P(a_t^{b_t} = m) ∝ w_{t-1|T}^m.\n(c) Set x_{1:t}^m = {x_{1:t-1}^{a_t^m}, x_t^m} and w_t^m = W_t(x_{1:t}^m) for m = 1, ..., N.\n\nThe idea of including the variables b_{1:T-1} in the PG sampler has previously been suggested by Whiteley [7] and further explored in [2, 3]. This previous work, however, accomplishes this with an explicit backward simulation pass, which, as we discuss in the following section, is problematic for our applications to non-Markovian SSMs. In the PG-AS sampler, instead of requiring distinct forward and backward sequences of Gibbs steps as in PG with backward simulation (PG-BS), we obtain a similar effect via a single forward sweep.\n\n4 Truncation for non-Markovian state-space models\n\nWe return to the problem of inference in non-Markovian SSMs of the form shown in (1). To employ backward sampling, we need to evaluate the ratio\n\nγ_T(x_{1:T}) / γ_t(x_{1:t}) = p(x_{1:T}, y_{1:T}) / p(x_{1:t}, y_{1:t}) = ∏_{s=t+1}^T g(y_s | x_{1:s}) f(x_s | x_{1:s-1}).   (10)\n\nIn general, the computational cost of computing the backward sampling weights will thus be O(T). This implies that the cost of generating a full backward trajectory is O(T²). It is therefore computationally prohibitive to employ backward-simulation-type particle smoothers, as well as the PG samplers discussed above, for general non-Markovian models.\n\n4\n\n\fFigure 1: Probability under P̃_p as a function of the truncation level p for two different systems; one 5-dimensional (left) and one 20-dimensional (right). The N = 5 dotted lines correspond to P̃_p(m) for m ∈ {1, ..., N}, respectively (N.B. two of the lines overlap in the left figure). The dashed vertical lines show the value of the truncation level p_adpt. resulting from the adaption scheme with γ = 0.1 and τ = 10⁻². 
See Section 6.2 for details on the experiments.\n\nTo make progress, we consider non-Markovian models in which there is a decay in the influence of the past on the present, akin to that in Markovian models but without the strong Markovian assumption. Hence, it is possible to obtain a useful approximation when the product in (10) is truncated to a smaller number of factors, say p. We then replace (9) with the approximation\n\nw̃_{t|T}^{p,m} = w_t^m γ_{t+p}({x_{1:t}^m, x′_{t+1:t+p}}) / γ_t(x_{1:t}^m).   (11)\n\nThe following proposition formalizes our assumption.\n\nProposition 1. Let P and P̃_p be the probability distributions on {1, ..., N} defined by the backward sampling weights (9) and the truncated backward sampling weights (11), respectively. Let h_s(k) = g(y_{t+s} | x_{1:t}^k, x′_{t+1:t+s}) f(x′_{t+s} | x_{1:t}^k, x′_{t+1:t+s-1}) and assume that max_{k,l} (h_s(k)/h_s(l) - 1) ≤ A exp(-cs), for some constants A and c > 0. Then D_KLD(P∥P̃_p) ≤ C exp(-cp) for some constant C, where D_KLD is the Kullback-Leibler divergence (KLD).\n\nProof. Provided in the supplemental material.\n\nFrom (11), we see that we can compute the backward weights in constant time under the truncation within the PG-AS framework. The resulting approximation can be quite useful; indeed, in our experiments we have seen that even p = 1 can lead to very accurate inferential results. In general, however, it will not be known a priori how to set the truncation level p for any given problem. To address this, we propose to adapt the truncation level. Since the approximate weights (11) can be evaluated sequentially, the idea is to start with p = 1 and then increase p until the weights have, in some sense, converged. 
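Since the factors of (10) can be accumulated one at a time, the truncated weights (11), combined with an adaptive stopping rule of the kind used in our experiments, can be sketched as follows. This is a sketch under the assumption that the user supplies a function `log_factor(p)` returning, for each particle m, the log of the p-th factor g(y_{t+p} | ...)f(x_{t+p} | ...); the names and defaults are illustrative:

```python
import numpy as np

def adaptive_truncated_weights(logw, log_factor, p_max, gamma=0.1, tau=1e-2):
    """Truncated backward sampling weights with an adaptive truncation level.

    logw       : (N,) log filter weights log w_t^m.
    log_factor : log_factor(p) -> (N,) array with the log of the p-th factor
                 of the product (10), evaluated for each particle m.
    Increases p from 1 and stops once the exponentially weighted moving
    average of the total variation distance between consecutive normalized
    weight vectors falls below tau (forgetting factor gamma)."""
    def normalize(lw):
        w = np.exp(lw - lw.max())  # stabilized exponentiation
        return w / w.sum()

    lw = logw + log_factor(1)
    prob = normalize(lw)
    ema = None
    p = 1
    while p < p_max:
        p += 1
        lw = lw + log_factor(p)          # accumulate one more factor of (10)
        new_prob = normalize(lw)
        eps = 0.5 * np.abs(new_prob - prob).sum()  # TV distance
        ema = eps if ema is None else gamma * ema + (1 - gamma) * eps
        prob = new_prob
        if ema < tau:
            break
    return prob, p
```

Note that a small `gamma` makes the moving average track the most recent distance closely (a rapid response), while a larger value yields a more conservative stopping rule, matching the trade-off discussed in the text.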
In particular, in our experimental work, we have used the following simple approach.\n\nLet P̃_p be the discrete probability measure defined by (11). Let ε_p = D_TV(P̃_p, P̃_{p-1}) be the total variation (TV) distance between the distributions for two consecutive truncation levels. We then compute the exponentially decaying moving average of the sequence ε_p, with forgetting factor γ ∈ [0, 1], and stop when this falls below some threshold τ ∈ [0, 1]. This adaption scheme removes the requirement to specify p directly, but instead introduces the design parameters γ and τ. However, these parameters are much easier to reason about: a small value for γ gives a rapid response to changes in ε_p, whereas a large value gives a more conservative stopping rule, improving the accuracy of the approximation at the cost of higher computational complexity. A similar trade-off holds for the threshold τ as well. Most importantly, we have found that the same values for γ and τ can be used for a wide range of models with very different mixing properties.\n\nTo illustrate the effect of the adaption rule, and how the distribution P̃_p typically evolves as we increase p, we provide two examples in Figure 1. These examples are taken from the simulation study provided in Section 6.2. Note that the untruncated distribution P is given for the maximal value of p, i.e., furthest to the right in the figures. By using the adaptive truncation, we can stop the evaluation of the weights at a much earlier stage and still obtain an accurate approximation of P.\n\n5\n\n\f5 Application areas\n\nIn this section we present examples of problem classes involving non-Markovian SSMs for which the proposed PG-AS sampler can be applied. 
Numerical illustrations are provided in Section 6.\n\n5.1 Rao-Blackwellized particle smoothing\n\nOne popular approach to increase the ef\ufb01ciency of SMC samplers for SSMs is to marginalize over\none component of the state, and apply an SMC sampler in the lower-dimensional marginal space.\nThis leads to what is known as the Rao-Blackwellized particle \ufb01lter (RBPF) [8\u201310]. The same\napproach has also been applied to state smoothing [11,12], but it turns out that Rao-Blackwellization\nis less straightforward in this case, since the marginal state-process will be non-Markovian. As an\nexample, a mixed linear/nonlinear Gaussian SSM (see, e.g., [10]) with \u201cnonlinear state\u201d xt and\n\u201clinear state\u201d zt, can be reduced to xt \u223c p(xt | x1:t\u22121, y1:t\u22121) and yt \u223c p(yt | x1:t, y1:t\u22121). These\nconditional densities are Gaussian and can be evaluated for any \ufb01xed marginal state trajectory x1:t\u22121\nby running a conditional Kalman \ufb01lter to marginalize the zt-process.\nIn order to apply a backward-simulation-based method (e.g., a particle smoother) for this model, we\nneed to evaluate the backward sampling weights (9). In a straightforward implementation2, we thus\nneed to run N Kalman \ufb01lters for T \u2212 t time steps, for each t = 1, . . . , T \u2212 1. The computational\ncomplexity of this calculation can be reduced by employing the truncation proposed in Section 4.\n\n5.2 Particle smoothing for degenerate state-space models\n\nMany dynamical systems are most naturally modelled as degenerate in the sense that the transi-\ntion kernel of the state process does not admit any dominating measure. For instance, consider a\nnonlinear system with additive noise of the form,\n\nyt = g(\u03bet) + et,\n\n\u03bet = f (\u03bet\u22121) + G\u03c9t\u22121,\n\n(12)\nwhere G is a tall matrix, and consequently rank(G) < dim(\u03bet). That is, the process noise covariance\nmatrix is singular. 
SMC samplers can straightforwardly be applied to this type of models, but it is\nmore problematic to address the smoothing problem using particle methods. The reason is that the\nbackward kernel also will be degenerate and it cannot be approximated in a natural way by the\nforward \ufb01lter particles, as is normally done in backward-simulation-based particle smoothers.\nA possible remedy for this issue is to recast the degenerate SSM as a non-Markovian model in\nT\na lower-dimensional space. Let G = U [\u03a3 0]\nV T with unitary U and V be a singular value\n\n(cid:44) U T\u03bet = U Tf (U U T\u03bet\u22121) +\n\n\u03a3V T\u03c9t\u22121\n\n0\n\n.\n\n(13)\n\nFor simplicity we assume that z1 is known. If this is not the case, it can be included in the system\nstate or seen as a static parameter of the model. Hence, the sequence z1:t is \u03c3(x1:t\u22121)-measurable\nand we can write zt = zt(x1:t\u22121). With vt (cid:44) \u03a3V T\u03c9t and by appropriate de\ufb01nitions of the functions\nfx and h, the model (12) can thus be rewritten as, xt = fx(x1:t\u22121) + vt\u22121 and yt = h(x1:t) + et,\nwhich is a non-degenerate, non-Markovian SSM. By exploiting the truncation proposed in Section 4\nwe can thus apply PG-AS to do inference in this model.\n\n5.3 Additional problem classes\n\nThere are many more problem classes in which non-Markovian models arise and in which backward-\nsimulation-based methods can be of interest. For instance, the Dirichlet process mixture model\n(DPMM, see, e.g., [13]) is a popular nonparametric Bayesian model for mixtures with an unknown\nnumber of components. Using a Polya urn representation, the mixture labels are given by a non-\nMarkovian stochastic process, and the DPMM can thus be seen as a non-Markovian SSM. 
SMC has\n\n2For the speci\ufb01c problem of Rao-Blackwellized smoothing in conditionally Gaussian models, a backward\nsimulator which can be implemented in O(T ) computational complexity has recently been proposed in [11].\nThis is based on the idea of propagating information backward in time as the backward samples are generated.\n\n6\n\ndecomposition of G and let,(cid:20)xt\n(cid:21)\n\nzt\n\n(cid:20)\n\n(cid:21)\n\n\fFigure 2: Rao-Blackwellized state smoothing using PG. Running RMSEs for \ufb01ve independent runs\nof PG-AS (\u2022) and PG-BS (\u25e6), respectively. The truncation level is set to p = 1. The solid line\ncorresponds to a run of an untruncated FF-BS.\n\npreviously been used for inference in DPMMs [14, 15]. An interesting venue for future work is to\nuse the PG-AS sampler for these models. A second example in Bayesian nonparametrics is Gaussian\nprocess (GP) regression and classi\ufb01cation (see, e.g., [16]). The sample path of the GP can be seen as\nthe state-process in a non-Markovian SSM. We can thus employ PMCMC, and in particular PG-AS,\nto address these inference problems.\nAn application in genetics, for which SMC has been been successfully applied, is reconstruction\nof phylogenetic trees [17]. A phylogenetic tree is a binary tree with observation at the leaf nodes.\nSMC is used to construct the tree in a bottom up fashion. A similar approach has also been used\nfor Bayesian agglomerative clustering, in which SMC is used to construct a binary clustering tree\nbased on Kingman\u2019s coalescent [18]. The generative models for the trees used in [17, 18] are in fact\nMarkovian, but the observations give rise to a conditional dependence which destroys the Markov\nproperty. To employ backward simulation to these models, we are thus faced with problems of a\nsimilar nature as those discussed in Section 4.\n\n6 Numerical evaluation\n\nThis section contains a numerical evaluation of the proposed method. 
We consider linear Gaussian\nsystems, which is instructive since the exact smoothing density then is available, e.g., by running\na modi\ufb01ed Bryson-Frazier (MBF) smoother [19]. For more details on the experiments, and for\nadditional (nonlinear) examples, see [20].\n\n6.1 RBPS: Linear Gaussian state-space model\n\nAs a \ufb01rst example, we consider Rao-Blackwellized particle smoothing (RBPS) in a single-output\n4th-order linear Gaussian SSM. We generate T = 100 samples from the system and run PG-AS and\nPG-BS, marginalizing three out of the four states using an RBPF, i.e., dim(xt) = 1. Both methods\nare run for R = 10000 iterations using N = 5 particles. The truncation level is set to p = 1, leading\nto a coarse approximation. We discard the \ufb01rst 1000 iterations and then compute running means\nof the state trajectory x1:T . From these, we then compute the running root mean squared errors\n(RMSEs) \u0001r relative to the true posterior means (computed with an MBF smoother). Hence, if no\napproximation would have been made, we would expect \u0001r \u2192 0, so any static error can be seen as\nthe effect of the truncation. The results for \ufb01ve independent runs from both PG samplers are shown\nin Figure 2. First, we note that both methods give accurate results. Still, the error for PG-AS is close\nto an order of magnitude less than for PG-BS. Furthermore, it appears as if the error for PG-AS\nwould decrease further, given more iterations, suggesting that the bias caused by the truncation is\ndominated by the Monte Carlo variance, even after R = 10000 iterations.\nFor further comparison, we also run an untruncated forward \ufb01lter/backward simulator (FF-BS) par-\nticle smoother [21], using N = 5000 forward \ufb01lter particles and M = 500 backward trajectories\n(with a computational complexity of O(N M T 2)). The resulting RMSE value is shown as a solid\nline in Figure 2. 
These results suggest that PMCMC samplers, such as the PG-AS, indeed can be\nserious competitors to more \u201cstandard\u201d particle smoothers. Even with p = 1, PG-AS outperforms\n\n7\n\n1000200030004000500060007000800090001000010\u2212210\u22121 PG w. ancestral samplingPG w. backward simulation\fFigure 3: Box plots of the RMSE errors for PG-AS (black) and PG-BS (gray), for 150 random\nsystems of different dimensions d (left, d = 2; middle, d = 5; right, d = 20). Different values for\nthe truncation level p are considered. The rightmost boxes correspond to an adaptive threshold and\nthe values in parentheses are the average over all systems and MCMC iterations (the same for both\nmethods). The dots within the boxes show the median errors.\n\nFF-BS in terms of accuracy and, due to the fact that the ancestor sampling allows us to use as few\nas N = 5 particles at each iteration, at a lower computational cost.\n\n6.2 Random linear Gaussian systems with rank de\ufb01cient process noise covariances\n\nTo see how the PG samplers are affected by the choice of truncation level p and by the mixing\nproperties of the system, we evaluate them on random linear Gaussian SSMs of different orders.\nWe generate 150 random systems, using the MATLAB function drss from the Control Systems\nToolbox, with model orders 2, 5 and 20 (50 systems for each model order). The number of outputs\nare taken as 1, 2 and 4 for the different model orders, respectively. The systems are then simulated\nfor T = 200 time steps, driven by Gaussian process noise entering only on the \ufb01rst state component.\nHence, the rank of the process noise covariance is 1 for all systems.\nWe run the PG-AS and PG-BS samplers for 10000 iterations using N = 5 particles. We consider\ndifferent \ufb01xed truncation levels, as well as an adaptive level with \u03b3 = 0.1 and \u03c4 = 10\u22122. Again,\nwe compute running posterior means (discarding 1000 samples) and RMSE values relative the true\nposterior mean. 
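The running error measure used here, the RMSE of the running posterior-mean estimate relative to the exact posterior mean, can be computed as, e.g. (a sketch; `samples` is assumed to hold one stored state trajectory per MCMC iteration):

```python
import numpy as np

def running_rmse(samples, truth, burn_in=0):
    """Running RMSE of the posterior-mean estimate of x_{1:T}.

    samples : (R, T) array, one sampled trajectory per MCMC iteration.
    truth   : (T,) exact posterior means (e.g. from an exact smoother).
    Returns one RMSE value per retained iteration, based on the running
    mean of the trajectories sampled so far."""
    kept = samples[burn_in:]
    counts = np.arange(1, kept.shape[0] + 1)[:, None]
    running_mean = np.cumsum(kept, axis=0) / counts  # mean of first r draws
    return np.sqrt(np.mean((running_mean - truth) ** 2, axis=1))
```

With no truncation error, this sequence would tend to zero as the number of iterations grows; a static floor reflects the bias introduced by the truncation.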
Box plots are shown in Figure 3. Since the process noise only enters on one of the\nstate components, the mixing tends to deteriorate as we increase the model order. Figure 1 shows\nhow the probability distributions on {1, . . . , N} change as we increase the truncation level, in two\nrepresentative cases for a 5th and a 20th order system, respectively. By using an adapted level,\nwe can obtain accurate results for systems of different dimensions, without having to change any\nsettings between the runs.\n\n7 Discussion\n\nPG-AS is a novel approach to PMCMC that makes use of backward simulation ideas without need-\ning an explicit backward pass. Compared to PG-BS, a conceptually similar method that does require\nan explicit backward pass, PG-AS has advantages, most notably for inference in the non-Markovian\nSSMs that have been our focus here. When using the proposed truncation of the backward weights,\nwe have found PG-AS to be more robust to the approximation error than PG-BS. Furthermore, for\nnon-Markovian models, PG-AS is easier to implement than PG-BS, since it requires less bookkeep-\ning. It can also be more memory ef\ufb01cient, since it does not require us to store intermediate quantities\nthat are needed for a separate backward simulation pass, as is done in PG-BS. Finally, we note that\nPG-AS can be used as an alternative to PG-BS for other inference problems to which PMCMC can\nbe applied, and we believe that it will prove attractive in problems beyond the non-Markovian SSMs\nthat we have discussed here.\n\nAcknowledgments\n\nThis work was supported by: the project Calibrating Nonlinear Dynamical Models (Contract num-\nber: 621-2010-5876) funded by the Swedish Research Council and CADICS, a Linneaus Center\nalso funded by the Swedish Research Council.\n\n8\n\n10\u2212310\u2212210\u22121p=1p=2p=3Adapt.(3.8)d=210\u2212310\u2212210\u22121p=1p=5p=10Adapt.(5.9)d=510\u2212310\u2212210\u22121p=1p=5p=10Adapt.(10.6)d=20 PG w. ancestral samplingPG w. 