{"title": "MCMC for continuous-time discrete-state systems", "book": "Advances in Neural Information Processing Systems", "page_first": 701, "page_last": 709, "abstract": "We propose a simple and novel framework for MCMC inference in continuous-time discrete-state systems with pure jump trajectories. We construct an exact MCMC sampler for such systems by alternately sampling a random discretization of time given a trajectory of the system, and then a new trajectory given the discretization.  The first step can be performed efficiently using properties of the Poisson process, while the second step can avail of discrete-time MCMC techniques based on the forward-backward algorithm. We compare our approach to particle MCMC and a uniformization-based sampler, and show its advantages.", "full_text": "MCMC for continuous-time discrete-state systems\n\nVinayak Rao\n\nYee Whye Teh\n\nGatsby Computational Neuroscience Unit\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\n\nvrao@gatsby.ucl.ac.uk\n\nUniversity College London\n\nywteh@gatsby.ucl.ac.uk\n\nAbstract\n\nWe propose a simple and novel framework for MCMC inference in continuous-\ntime discrete-state systems with pure jump trajectories. We construct an exact\nMCMC sampler for such systems by alternately sampling a random discretiza-\ntion of time given a trajectory of the system, and then a new trajectory given the\ndiscretization. The \ufb01rst step can be performed ef\ufb01ciently using properties of the\nPoisson process, while the second step can avail of discrete-time MCMC tech-\nniques based on the forward-backward algorithm. We show the advantage of our\napproach compared to particle MCMC and a uniformization-based sampler.\n\n1\n\nIntroduction\n\nThere has been growing interest in the machine learning community to model dynamical systems in\ncontinuous time. 
Examples include point processes [1], Markov processes [2], structured Markov\nprocesses [3], in\ufb01nite state Markov processes [4], semi-Markov processes [5] etc. However, a major\nimpediment towards the more widespread use of these models is the problem of inference. A simple\napproach is to discretize time, and then run inference on the resulting approximation. This however\nhas a number of drawbacks, not least of which is that we lose the advantages that motivated the use\nof continuous time in the \ufb01rst place. Time-discretization introduces a bias into our inferences, and\nto control this, one has to work at a time resolution that results in a very large number of discrete\ntime steps. This can be computationally expensive.\nOur focus in this paper is on posterior sampling via Markov chain Monte Carlo (MCMC), and there\nis a huge literature on such techniques for discrete-time models [6]. Here, we construct an exact\nMCMC sampler for pure jump processes in continuous time, using a workhorse of the discrete-time\ndomain, the forward-\ufb01ltering backward-sampling algorithm [7, 8], to make ef\ufb01cient updates.\nThe core of our approach is an auxiliary variable Gibbs sampler that repeats two steps. The \ufb01rst\nstep runs the forward-backward algorithm on a random discretization of time to sample a new tra-\njectory. The second step then resamples a new time-discretization given this trajectory. A random\ndiscretization allows a relatively coarse grid, while still keeping inferences unbiased. Such a coarse\ndiscretization allows us to apply the forward-backward algorithm to a Markov chain with relatively\nfew time steps, resulting in computational savings. 
Even though the marginal distribution of the random time-discretization can be quite complicated, we show that conditioned on the system trajectory, it is simply distributed as a Poisson process.

While the forward-backward algorithm was developed originally for finite state hidden Markov models and linear Gaussian systems, it also forms the core of samplers for more complicated systems like nonlinear/non-Gaussian [9], infinite state [10], and non-Markovian [11] time series. Our ideas thus apply to essentially any pure jump process, so long as it makes only finitely many transitions over finite intervals. For concreteness, we focus on semi-Markov processes. We compare our sampler with two other continuous-time MCMC samplers, a particle MCMC sampler [12] and a uniformization-based sampler [13]. The latter turns out to be a special case of ours, corresponding to a random time-discretization that is marginally distributed as a homogeneous Poisson process.

2 Semi-Markov processes

A semi-Markov (jump) process (sMJP) is a right-continuous, piecewise-constant stochastic process on the nonnegative real line taking values in some state space S [14, 15]. For simplicity, we assume S is finite, labelling its elements from 1 to N. We also assume the process is stationary. Then, the sMJP is parametrized by π_0, an (arbitrary) initial distribution over states, as well as an N × N matrix of hazard functions, A_{ss'}(·) ∀s, s' ∈ S. For any τ, A_{ss'}(τ) gives the rate of transitioning to state s', τ time units after entering state s (we allow self-transitions, so s' can equal s). Let this transition occur after a waiting time τ_{s'}. Then τ_{s'} is distributed according to the density r_{ss'}(·), related to A_{ss'}(·) as shown below (see e.g. [16]):

r_{ss'}(τ_{s'}) = A_{ss'}(τ_{s'}) exp(−∫_0^{τ_{s'}} A_{ss'}(u) du),   P(τ_{s'} > τ) = exp(−∫_0^τ A_{ss'}(u) du),
A_{ss'}(τ_{s'}) = r_{ss'}(τ_{s'}) / (1 − ∫_0^{τ_{s'}} r_{ss'}(u) du)   (1)

Sampling an sMJP trajectory proceeds as follows: on entering state s, sample waiting times τ_{s'} with hazard A_{ss'}(·) (i.e. with density r_{ss'}(·)) ∀s' ∈ S. The sMJP enters a new state, s_new, corresponding to the smallest of these waiting times. Let this waiting time be τ_hold (so that τ_hold = τ_{s_new} = min_{s'} τ_{s'}). Then, advance the current time by τ_hold, and set the sMJP state to s_new. Repeat this procedure, now with the rate functions A_{s_new s'}(·) ∀s' ∈ S.

Define A_s(·) = Σ_{s'∈S} A_{ss'}(·). From the independence of the waiting times τ_{s'}, equation 1 tells us that

P(τ_hold > τ) = Π_{s'∈S} exp(−∫_0^τ A_{ss'}(u) du) = exp(−∫_0^τ A_s(u) du),   τ_hold ∼ r_s(τ) ≡ A_s(τ) exp(−∫_0^τ A_s(u) du)   (2)

Comparing with equation 1, we see that A_s(·) gives the rate of any transition out of state s. An equivalent characterization of many continuous-time processes is to first sample the waiting time τ_hold, and then draw a new state s'. For the sMJP, the latter probability is proportional to A_{ss'}(τ_hold). A special sMJP is the Markov jump process (MJP), where the hazard functions are constant (giving exponential waiting times). 
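The competing-clocks procedure just described is easy to simulate directly. Below is a minimal Python sketch (ours, not the authors' Matlab code), assuming Weibull hazards so that the waiting time for the transition s → s' is a Weibull draw with shape `alphas[s, s']` and scale `lams[s, s']`:

```python
import numpy as np

def sample_smjp(alphas, lams, pi0, t_end, rng):
    """Sample an sMJP trajectory by competing clocks: on entering state s,
    draw waiting times tau_s' with hazard A_ss' and jump at their minimum.
    alphas, lams: N x N Weibull shape/scale parameters per transition."""
    N = len(pi0)
    s = rng.choice(N, p=pi0)
    t, S, T = 0.0, [s], [0.0]
    while True:
        # Weibull hazard (a/l)(tau/l)^(a-1) corresponds to a Weibull(a, l) density;
        # numpy's weibull is the scale-1 variant, so multiply by the scale.
        taus = lams[s] * rng.weibull(alphas[s])
        s_new = int(taus.argmin())          # smallest clock wins (self-jumps allowed)
        t = t + taus[s_new]
        if t > t_end:
            break
        s = s_new
        S.append(s)
        T.append(t)
    return S, T
```

With general (non-exponential) hazards the minimum clock is not memoryless, which is exactly the semi-Markov memory effect discussed next.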
For an MJP, future behaviour is independent of the current waiting time. By allowing general waiting-time distributions, an sMJP can model memory effects like burstiness or refractoriness in the system dynamics.

We represent an sMJP trajectory on an interval [t_start, t_end] as (S, T), where T = (t_0, ···, t_{|T|}) is the sequence of jump times (including the endpoints) and S = (s_0, ···, s_{|S|}) is the corresponding sequence of state values. Here |S| = |T|, and s_{i+1} = s_i implies a self-transition at time t_{i+1} (except at the end time t_{|T|} = t_end, which does not correspond to a jump). The filled circles in figure 1(c) represent (S, T); since the process is right-continuous, s_i gives the state after the jump at t_i.

2.1 Sampling by dependent thinning

We now describe an alternate thinning-based approach to sampling an sMJP trajectory. Our approach produces candidate event times at a rate higher than the actual event rates in the system. To correct for this, we probabilistically reject (or thin) these events. Define W as the sequence of actual event times T together with the thinned event times (which we call U; these are the empty circles in figure 1(c)). W = (w_0, ···, w_{|W|}) forms a random discretization of time (with |W| = |T| + |U|); define V = (v_0, ···, v_{|W|}) as a sequence of state assignments to the times W. At any w_i, let l_i represent the time since the last sMJP transition (so that l_i = w_i − max_{t∈T, t≤w_i} t), and let L = (l_1, ···, l_{|W|}). Figures 1(b) and (c) show these quantities, as well as continuous-time processes S(t) and L(t) such that l_i = L(w_i) and s_i = S(w_i). (V, L, W) forms an equivalent representation of (S, T) that includes a redundant set of thinned events U. Note that if the ith event is thinned, v_i = v_{i−1}; however, this is not a self-transition. L helps distinguish self-transitions (having associated l's equal to 0) from thinned events. We explain the generative process of (V, L, W) below; a proof of its correctness is included in the supplementary material.

For each hazard function A_s(τ), define another dominating hazard function B_s(τ), so that B_s(τ) ≥ A_s(τ) ∀s, τ. Suppose we have instantiated the system trajectory until time w_i, with the sMJP having just entered state v_i ∈ S (so that l_i = 0). We sample the next candidate event time w_{i+1}, with Δw_i = (w_{i+1} − w_i) drawn from the hazard function B_{v_i}(·). A larger rate implies faster events, so that Δw_i will on average be smaller than a waiting time τ_hold drawn from A_{v_i}(·). We correct for this by treating w_{i+1} as an actual event with probability A_{v_i}(Δw_i + l_i)/B_{v_i}(Δw_i + l_i). If this is the case, we sample a new state v_{i+1} with probability proportional to A_{v_i v_{i+1}}(Δw_i + l_i), and set l_{i+1} = 0. On the other hand, if the event is rejected, we set v_{i+1} to v_i, and l_{i+1} = (Δw_i + l_i). We now sample Δw_{i+1} (and thus w_{i+2}), such that (Δw_{i+1} + l_{i+1}) ∼ B_{v_{i+1}}(·). More simply, we sample a new waiting time from B_{v_{i+1}}(·), conditioned on it being greater than l_{i+1}. Again, we accept this point with probability A_{v_{i+1}}(Δw_{i+1} + l_{i+1})/B_{v_{i+1}}(Δw_{i+1} + l_{i+1}), and repeat this process. Proposition 1 confirms that this generative process (summarized by the graphical model in figure 1(d), and algorithm 1) yields a trajectory from the sMJP.

Figure 1: a) Instantaneous hazard rates given a trajectory b) State holding times, L(t) c) sMJP state values S(t) d) Graphical model for the randomized time-discretization e) Resampling the sMJP trajectory. In b) and c), the filled and empty circles represent actual and thinned events respectively.
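The thinning construction above can likewise be sketched in code. The following is our own illustration (not the paper's implementation), assuming Weibull hazards and the multiplicative bound B_{ss'} = Ω A_{ss'} that the paper adopts in section 2.4; with that bound, the acceptance probability A(τ)/B(τ) is exactly 1/Ω:

```python
import numpy as np

def weibull_hazard(tau, alpha, lam):
    # A(tau) = (alpha/lam) * (tau/lam)**(alpha - 1)
    return (alpha / lam) * (tau / lam) ** (alpha - 1)

def truncated_weibull(alpha, lam, lower, rng):
    # tau ~ Weibull(alpha, lam) conditioned on tau > lower, by inverting the
    # conditional survival exp(-((t/lam)^alpha - (lower/lam)^alpha))
    u = rng.uniform()
    return lam * ((lower / lam) ** alpha - np.log(u)) ** (1.0 / alpha)

def thinning_sampler(alphas, lams, pi0, t_start, t_end, Omega, rng):
    """Sketch of state-dependent thinning with B_ss' = Omega * A_ss', which is
    a Weibull hazard with scale lam / Omega^(1/alpha). The candidate holding
    time is the minimum of competing truncated Weibulls (total hazard B_v)."""
    N = len(pi0)
    lam_dom = lams / Omega ** (1.0 / alphas)   # dominating scales
    v, w, l = rng.choice(N, p=pi0), t_start, 0.0
    V, L, W = [v], [0.0], [t_start]
    while True:
        taus = [truncated_weibull(alphas[v, s], lam_dom[v, s], l, rng)
                for s in range(N)]
        tau_hold = min(taus)                   # candidate total holding time > l
        w = w + (tau_hold - l)                 # w_{i+1} = w_i + dw_i
        if w >= t_end:
            break                              # (final point at t_end omitted here)
        if rng.uniform() < 1.0 / Omega:        # accept: A(tau)/B(tau) = 1/Omega
            rates = np.array([weibull_hazard(tau_hold, alphas[v, s], lams[v, s])
                              for s in range(N)])
            v = rng.choice(N, p=rates / rates.sum())
            l = 0.0
        else:                                  # thinned: keep state, grow l
            l = tau_hold
        V.append(v)
        L.append(l)
        W.append(w)
    return V, L, W
```

Note that a thinned event leaves the state unchanged with l > 0, while an accepted event resets l to 0, exactly as in the (V, L, W) representation above.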
Figure 1(d) also depicts observations X of the sMJP trajectory; we elaborate on this later.

Proposition 1. The path (V, L, W) returned by the thinning procedure described above is equivalent to a sample (S, T) from the sMJP (π_0, A).

Algorithm 1 State-dependent thinning for sMJPs
Input:  Hazard functions A_{ss'}(·) ∀s, s' ∈ S, and an initial distribution over states π_0.
        Dominating hazard functions B_s(τ) ≥ A_s(τ) ∀τ, s, where A_s(τ) = Σ_{s'} A_{ss'}(τ).
Output: A piecewise constant path (V, L, W) ≡ ((v_i, l_i, w_i)) on the interval [t_start, t_end].
 1: Draw v_0 ∼ π_0 and set w_0 = t_start. Set l_0 = 0 and i = 0.
 2: while w_i < t_end do
 3:   Sample τ_hold ∼ B_{v_i}(·), with τ_hold > l_i. Let Δw_i = τ_hold − l_i, and w_{i+1} = w_i + Δw_i.
 4:   with probability A_{v_i}(τ_hold)/B_{v_i}(τ_hold)
 5:     Set l_{i+1} = 0, and sample v_{i+1}, with P(v_{i+1} = s'|v_i) ∝ A_{v_i s'}(τ_hold), s' ∈ S.
 6:   else
 7:     Set l_{i+1} = l_i + Δw_i, and v_{i+1} = v_i.
 8:   end
 9:   Increment i.
10: end while
11: Set w_{|W|} = t_end, v_{|W|} = v_{|W|−1}, l_{|W|} = l_{|W|−1} + w_{|W|} − w_{|W|−1}.

2.2 Posterior inference via MCMC

We now define an auxiliary variable Gibbs sampler, setting up a Markov chain that converges to the posterior distribution over the thinned representation (V, L, W) given observations X of the sMJP trajectory. The observations can lie in any space 𝒳, and for any time-discretization W, let x_i represent all observations in the interval (w_i, w_{i+1}). By construction, the sMJP stays in a single state v_i over this interval; let P(x_i|v_i) be the corresponding likelihood vector. 
Given a time discretization W ≡ (U ∪ T) and the observations X, we discard the old state labels (V, L), and sample a new path (Ṽ, L̃, W) ≡ (S̃, T̃) using the forward-backward algorithm. We then discard the thinned events Ũ, and given the path (S̃, T̃), resample new thinned events U_new, resulting in a new time discretization W_new ≡ (T̃ ∪ U_new). We describe both operations below.

Resampling the sMJP trajectory given the set of times W:
Given W (and thus all Δw_i), this involves assigning each element w_i ∈ W a label (v_i, l_i) (see figure 1(d)). Note that the system is Markov in the pair (v_i, l_i), so that this step is a straightforward application of the forward-backward algorithm to the graphical model shown in figure 1(d). Observe from this figure that the joint distribution factorizes as:

P(V, L, W, X) = P(v_0, l_0) Π_{i=0}^{|W|−1} P(x_i|v_i) P(Δw_i|v_i, l_i) P(v_{i+1}, l_{i+1}|v_i, l_i, Δw_i)   (3)

From equation 2 (with B instead of A), P(Δw_i|v_i, l_i) = B_{v_i}(l_i + Δw_i) exp(−∫_{l_i}^{l_i+Δw_i} B_{v_i}(t) dt). The term P(v_{i+1}, l_{i+1}|v_i, l_i, Δw_i) is the thinning/state-transition probability from steps 4 and 5 of algorithm 1. The forward-filtering stage then moves sequentially through the times in W, successively calculating the probabilities P(v_i, l_i, w_{1:i+1}, x_{1:i}) using the recursion:

P(v_i, l_i, w_{1:i+1}, x_{1:i}) = P(x_i|v_i) P(w_{i+1}|v_i, l_i) Σ_{v_{i−1}, l_{i−1}} P(v_i, l_i|v_{i−1}, l_{i−1}, Δw_{i−1}) P(v_{i−1}, l_{i−1}, w_{1:i}, x_{1:i−1})

The backward sampling stage then returns a new trajectory (Ṽ, L̃, W) ≡ (S̃, T̃). 
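The sMJP sampler runs this recursion over the enlarged state (v_i, l_i); as a self-contained illustration of the underlying forward-filtering backward-sampling idea, here is a generic log-space FFBS for a discrete-state chain (our sketch with hypothetical argument conventions, not the paper's code):

```python
import numpy as np

def ffbs(log_init, log_trans, log_like, rng):
    """Forward-filtering backward-sampling for a K-state chain of length T.
    log_trans[t] is the K x K log transition matrix from step t to t+1;
    log_like[t] is the length-K log likelihood vector at step t."""
    T, K = log_like.shape
    log_alpha = np.zeros((T, K))
    log_alpha[0] = log_init + log_like[0]
    for t in range(1, T):
        # forward message: logsumexp over the previous state
        m = log_alpha[t - 1].max()
        log_alpha[t] = (np.log(np.exp(log_alpha[t - 1] - m) @ np.exp(log_trans[t - 1]))
                        + m + log_like[t])
    # backward sampling: draw the last state, then walk backwards
    states = np.empty(T, dtype=int)
    p = np.exp(log_alpha[-1] - log_alpha[-1].max())
    states[-1] = rng.choice(K, p=p / p.sum())
    for t in range(T - 2, -1, -1):
        lp = log_alpha[t] + log_trans[t][:, states[t + 1]]
        p = np.exp(lp - lp.max())
        states[t] = rng.choice(K, p=p / p.sum())
    return states
```

In the sMJP case the "state" index ranges over pairs (v_i, l_i), which is what produces the quadratic cost in |W| discussed next.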
See figure 1(e). Observe that l_i can take (i + 1) values (in the set {0, w_i − w_{i−1}, ···, w_i − w_0}), with the value of l_i affecting P(v_{i+1}, l_{i+1}|v_i, l_i, Δw_i). Thus, the forward-backward algorithm for a general sMJP scales quadratically with |W|. We can however use ideas from discrete-time MCMC to reduce this cost (e.g. [11] use a slice sampler to limit the maximum holding time of a state, and thus limit l_i).

Resampling the thinned events given the sMJP trajectory:
Having obtained a new sMJP trajectory (V, L, W), we discard all thinned events U, so that the current state of the sampler is now (S, T). We then resample the thinned events Ũ, recovering a new thinned representation (Ṽ, L̃, W̃), and with it, a new discretization of time. To simplify notation, we define the instantaneous hazard functions A(t) and B(t) (see figure 1(a)):

A(t) = A_{S(t)}(L(t)),   and   B(t) = B_{S(t)}(L(t))   (4)

These were the event rates relevant at any time t during the generative process. Note that the sMJP trajectory completely determines these quantities. The events W (whether thinned or not) were generated from a rate B(·) process, while the probability that an event w_i was thinned is 1 − A(w_i)/B(w_i). The Poisson thinning theorem [17] then suggests that the thinned events U are distributed as a Poisson process with intensity (B(t) − A(t)). The following proposition (see the supplementary material for a proof) shows that this is indeed the case.

Proposition 2. Conditioned on a trajectory (S, T) of the sMJP, the thinned events U are distributed as a Poisson process with intensity (B(t) − A(t)).

Observe that this is independent of the observations X. 
We show in section 2.4 how sampling from such a Poisson process is straightforward for appropriately chosen bounding rates B_s.

2.3 Related work

An increasingly popular approach to inference in continuous-time systems is particle MCMC (pMCMC) [12]. At a high level, this uses particle filtering to generate a continuous-time trajectory, which then serves as a proposal for a Metropolis-Hastings (MH) algorithm. Particle filtering however cannot propagate back information from future observations, and pMCMC methods can have difficulty in situations where strong observations cause the posterior to deviate from the prior.

Recently, [13] proposed a sampler for MJPs that is a special case of ours. It was derived via a classical idea called uniformization, and constructs the time discretization W from a homogeneous Poisson process. Our sampler reduces to this when a constant dominating rate B > max_{s,τ} A_s(τ) is used to bound all event rates. However, such a 'uniformizing' rate does not always exist (we will discuss two such systems with unbounded rates). Moreover, with a single rate B, the average number of candidate events |W| (and thus the computational cost of the algorithm) scales with the leaving rate of the most unstable state. Since this state is often the one the system spends the least amount of time in, such a strategy can be wasteful. Under our sampler, the distribution of W is not a Poisson process. Instead, event rates are coupled via the sMJP state. This allows our sampler to adapt the granularity of the time-discretization to that required by the posterior trajectories; moreover, this granularity can vary over the time interval.

There exists other work on continuous-time models based on the idea of a random discretization of time [18, 1]. 
Like uniformization, these are all limited to specific continuous-time models with specific thinning constructions, and are not formulated in as general a manner as ours. Moreover, none of these exploit the ability to efficiently resample the time-discretization from a Poisson process, or a new trajectory using the forward-backward algorithm.

2.4 Experiments

In this section, we evaluate our sampler on a 3-state sMJP with Weibull hazard rates. Here

r_{ss'}(τ|α_{ss'}, λ_{ss'}) = (α_{ss'}/λ_{ss'}) (τ/λ_{ss'})^{α_{ss'}−1} exp(−(τ/λ_{ss'})^{α_{ss'}}),   A_{ss'}(τ|α_{ss'}, λ_{ss'}) = (α_{ss'}/λ_{ss'}) (τ/λ_{ss'})^{α_{ss'}−1}

where λ_{ss'} is the scale parameter, and the shape parameter α_{ss'} controls the stability of state s. When α_{ss'} < 1, on entering state s, the system is likely to quickly jump to state s'. By contrast, α_{ss'} > 1 gives a 'recovery' period before transitions to s'. Note that for α_{ss'} < 1, the hazard function tends to infinity as τ → 0. Now, choose an Ω > 1. We use the following simple upper bound B_{ss'}(τ):

B_{ss'}(τ) = Ω A_{ss'}(τ|α_{ss'}, λ_{ss'}) = Ω (α_{ss'}/λ_{ss'}) (τ/λ_{ss'})^{α_{ss'}−1} = (α_{ss'}/λ̃_{ss'}) (τ/λ̃_{ss'})^{α_{ss'}−1}   (5)

Here, λ̃ = λ/Ω^{1/α} for any λ and α. Thus, sampling from the dominating hazard function B_{ss'}(·) reduces to straightforward sampling from a Weibull with a smaller scale parameter λ̃_{ss'}. Note from algorithm 1 that with this construction of the dominating rates, each candidate event is rejected with probability (1 − 1/Ω); this can be a guide to choosing Ω. In our experiments, we set Ω equal to 2.

Sampling thinned events on an interval (t_i, t_{i+1}) (where the sMJP is in state s_i) involves sampling from a Poisson process with intensity (B(t) − A(t)) = (Ω − 1)A(t) = (Ω − 1) Σ_{s'} A_{s_i s'}(t − t_i). This is just the superposition of N independent and shifted Poisson processes on (0, t_{i+1} − t_i), the nth having intensity (Ω − 1)A_{s_i n}(·) ≡ Â_{s_i n}(·). As before, Â(·) is a Weibull hazard function obtained by dividing the scale parameter λ of A(·) by (Ω − 1)^{1/α}. A simple way to sample such a Poisson process is by first drawing the number of events from a Poisson distribution with mean ∫_0^{(t_{i+1}−t_i)} Â_{s_i n}(u) du, and then drawing that many events i.i.d. from Â_{s_i n} truncated at (t_{i+1} − t_i). Solving the integral for the Poisson mean is straightforward for the Weibull. Call the resulting Poisson sequence T̃_n, and define T̃ = ∪_{n∈S} T̃_n. Then T̃ + t_i is the set of resampled thinned events on the interval (t_i, t_{i+1}). We repeat this over each segment (t_i, t_{i+1}) of the sMJP path.

In the following experiments, the shape parameters for each Weibull hazard (α_{ss'}) were randomly drawn from the interval [0.6, 3], while the scale parameters were always set to 1. π_0 was set to the discrete uniform distribution. 
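The per-segment resampling of thinned events described above reduces to closed-form computations for the Weibull. Below is a sketch (ours, not the paper's code) for a single outgoing hazard; the full step superposes one such draw per target state n and shifts the result by t_i:

```python
import numpy as np

def thinned_events_segment(alpha, lam, Omega, seg_len, rng):
    """Sample thinned events on (0, seg_len) for one outgoing Weibull hazard.
    The intensity is (Omega-1)*A(u), a Weibull hazard with scale
    lam_hat = lam / (Omega-1)^(1/alpha); its integral over (0, L) is (L/lam_hat)^alpha."""
    lam_hat = lam / (Omega - 1.0) ** (1.0 / alpha)
    mean = (seg_len / lam_hat) ** alpha          # Poisson mean for the event count
    n = rng.poisson(mean)
    # given the count, events are i.i.d. with density prop. to u^(alpha-1) on
    # (0, seg_len); the CDF is (u/seg_len)^alpha, so invert it
    return np.sort(seg_len * rng.uniform(size=n) ** (1.0 / alpha))
```

For α = 2, λ = 1, Ω = 2 and a unit-length segment, the expected number of thinned events is (1/1)^2 = 1, which gives a quick sanity check on the construction.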
The unbounded hazards associated with α_{ss'} < 1 meant that uniformization is not applicable to this problem, and we only compared our sampler with pMCMC. We implemented both samplers in Matlab. Our MCMC sampler was set up with Ω = 2, so that the dominating hazard rate at any instant equalled twice the true hazard rate (i.e. B_{ss'}(τ) = 2A_{ss'}(τ)), giving a probability of thinning equal to 0.5. For pMCMC, we implemented the particle independent Metropolis-Hastings sampler from [12]. We tried different values for the number of particles; for our problems, we found 10 gave the best results.

All MCMC runs consisted of 5000 iterations following a burn-in period of 1000. After any MCMC run, given a sequence of piecewise constant trajectories, we calculated the empirical distribution of the time spent in each state as well as the number of state transitions. We then used R-coda [19] to estimate effective sample sizes (ESS) for these quantities. The ESS of the simulation was set to the median ESS of all these statistics.

Figure 2: ESS per unit time vs the inverse-temperature of the likelihood, when trajectories are over an interval of length 20 (left) and 2 (right).

Figure 3: ESS per second for increasing interval lengths. Temperature decreases from the left to right subplots.

Effect of the observations For our first experiment, we distributed 10 observations over an interval of length 20. Each observation favoured a particular, random state over the other two states by a factor of 100, giving random likelihood vectors like (1, 100, 1)^⊤. We then raised the likelihood vector P(x_i|·) to an 'inverse-temperature' ν, so that the effective likelihood at the ith observation was (P(x_i|s_i))^ν. 
As this parameter varied from 0 to 1, the problem moved from sampling from the prior to a situation where the trajectory was observed (almost) perfectly at 10 random times.

The left plot in figure 2 shows the ESS produced per unit time by both samplers as the inverse-temperature increased, averaging results from 10 random parametrizations of the sMJP. We see, as one might expect, that when the effect of the observations is weak, particle MCMC (which uses the prior distribution to make local proposals) outperforms our thinning-based sampler. pMCMC also has the benefit of being simpler to implement, and is about 2-3 times faster (in terms of raw computation time) than our sampler for a Weibull sMJP. As the effect of the likelihood increases, pMCMC has more and more difficulty tracking the observations. By contrast, our sampler is fairly insensitive to the effect of the likelihood, eventually outperforming the particle MCMC sampler. While there exist techniques to generate more data-driven proposals for particle MCMC [12, 20], these compromise the appealing simplicity of the original particle MCMC sampler. Moreover, none of these really have the ability to propagate information back from the future (like the forward-backward algorithm); rather, they make more and more local moves (for instance, by updating the sMJP trajectory on smaller and smaller subsets of the observation interval).

The right plot in figure 2 shows the ESS per unit time for both samplers, now with the observation interval set to a smaller length of 2. Here, our sampler comprehensively outperforms pMCMC. There are two reasons for this. First, more observations per unit time require rapid switching between states, a deviation from the prior that particle filtering is unlikely to propose. 
Additionally, over short intervals, the quadratic cost of the forward-backward step of our algorithm is less pronounced.

Effect of the observation interval length In the next experiment, we more carefully compare the two samplers as the interval length varies. For three settings of the inverse temperature parameter (0.1, 0.5 and 0.9), we calculated the number of effective samples produced per unit time as the length of the observation interval increased from 2 to 50. Once again, we averaged results from 10 random settings of the sMJP parameters. Figure 3 shows the results for the low, medium and high settings of the inverse temperature. Again, we clearly see the benefit of the forward-backward algorithm, especially in the low temperature and short interval regimes where the posterior deviates from the prior. Of course, the performance of our sampler can be improved further using ideas from the discrete-time domain; these can help ameliorate the effect of the quadratic cost for long intervals.

Figure 4: Effect of increasing the leaving rate of a state. Temperature decreases from the left to right plots.

3 Markov jump processes

In this section, we look at the Markov jump process (MJP), which we saw has constant hazard functions A_{ss'}. MJPs are also defined to disallow self-transitions, so that A_{ss} = 0 ∀s ∈ S. If we use constant dominating hazard rates B_s, we see from algorithm 1 that all probabilities at time w_i depend only on the current state, and are independent of the holding time l_i. 
Thus, we no longer need to represent the holding times L. The forward message at time w_i needs only to represent the probability of v_i taking different values in S; this completely specifies the state of the MJP. As a result, the cost of a forward-backward iteration is now linear in |W|.

In the next experiment, we compare Matlab implementations of our thinning-based sampler and the particle MCMC sampler with the uniformization-based sampler described in section 2.3. Recall that the latter samples candidate event times W from a homogeneous Poisson process with a state-independent rate B > max_s A_s. Following [13], we set B = 2 max_s A_s. As in section 2.4, we set Ω = 2 for our sampler, so that B_s = 2A_s ∀s. pMCMC was run with 20 particles.

Observe that for uniformization, the rate B is determined by the leaving rate of the most unstable state; often this is the state the system spends the least time in. To study this, we applied all three samplers to a 3-state MJP, two of whose states had leaving rates equal to 1. The leaving rate of the third state was varied from 1 to 20 (call this rate γ). On leaving any state, the probability of transitioning to either of the other two was uniformly distributed between 0 and 1. This way, we constructed 10 random MJPs for each γ. We distributed 5 observation times (again, favouring a random state by a factor of 100) over the interval [0, 10]. As in section 2.4, we looked at the ESS per unit time for 3 settings of the inverse temperature parameter ν, now as we varied γ.

Figure 4 shows the results. The pMCMC sampler clearly performs worse than the other two. The Markov structure of the MJP makes the forward-backward algorithm very natural and efficient; by contrast, running a particle filter with 20 particles took about twice as long as our sampler. Further, we see that while uniformization and our sampler perform comparably for low values of γ, our sampler starts to outperform uniformization for γ greater than 2. In fact, for weak observations and large γ, even particle MCMC outperforms uniformization. As we mentioned earlier, this is because for uniformization, the granularity of the time-discretization is determined by the least stable state, resulting in very long Markov chains for large values of γ.

3.1 The M/M/∞ queue

We finally apply our ideas to an infinite state MJP from queuing theory, the M/M/∞ queue (also called an immigration-death process [21]). Here, individuals (customers, messages, jobs etc.) enter a population according to a homogeneous Poisson process with rate α, independent of the population size. The lifespan of each individual (or the job 'service time') is exponentially distributed with rate β, so that the rate at which a 'death' occurs in the population is proportional to the population size. Let S(t) represent the population size (or the number of 'busy servers') at time t. Then, under the M/M/∞ queue, the stochastic process S(t) evolves according to a simple birth-death Markov jump process on the space S = {1, ···, ∞}, with rates A_{s,s+1} = α and A_{s,s−1} = sβ. All other rates are 0. Observe that since the population size of the M/M/∞ queue is unbounded, we cannot upper bound the event rates in the system. Thus, uniformization is not directly applicable to this system. Instead, we have to truncate the maximum value of S(t) at some constant, say c. This is the so-called M/M/c/c queue; now, when all c servers are busy, any incoming jobs are rejected.

In the following, we considered an M/M/∞ queue with α and β set to 10 and 1 respectively. 
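For intuition about the dynamics being inferred, the M/M/∞ queue itself is easy to simulate forward with a standard Gillespie scheme (a sketch; this is the generative process, not the inference algorithm):

```python
import numpy as np

def simulate_mm_inf(alpha, beta, s0, t_end, rng):
    """Gillespie simulation of the M/M/inf queue: births at rate alpha,
    deaths at rate s*beta when the population size is s."""
    t, s = 0.0, s0
    S, T = [s0], [0.0]
    while True:
        rate = alpha + s * beta               # total event rate in state s
        t += rng.exponential(1.0 / rate)      # exponential holding time
        if t > t_end:
            break
        if rng.uniform() < alpha / rate:      # birth with prob alpha/rate
            s += 1
        else:                                 # otherwise a death
            s -= 1
        S.append(s)
        T.append(t)
    return S, T
```

The stationary distribution of this queue is Poisson with mean α/β, so with α = 10 and β = 1 trajectories hover around a population of 10, matching the unbounded but state-coupled rates the sampler must adapt to.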
For some t_end, the state of the system was observed perfectly at three times 0, t_end/10 and t_end, with values 10, 2 and 15 respectively. Conditioned on these, we sought the posterior distribution over the system trajectory on the interval [0, t_end]. Since the state of the system at time 0 is perfectly observed to be 10, given any time-discretization, the maximum value of s_i at step i of the Markov chain is (10 + i). Thus, message dimensions are always finite, and we can directly apply the forward-backward algorithm. For noisy observations, we can use a slice sampler [22]. We compared our sampler with uniformization; for this, we approximated the M/M/∞ system with an M/M/50/50 system. We also applied our sampler to this truncated approximation, labelling it 'Thinning (trunc)'. For both these samplers, the message dimensions were 50. The large state spaces involved make pMCMC very inefficient, and we did not include it in our results.

Figure 5: The M/M/∞ queue: a) ESS per unit time b) ESS per unit time scaled by interval length.

Figure 5(a) shows the ESS per unit time for all three samplers as we varied the interval length t_end from 1 to 20. Sampling a trajectory over a long interval will take more time than over a short one, and to more clearly distinguish performance for large values of t_end, we scale each ESS from the left plot by t_end, the length of the interval, in the right subplot of figure 5.

We see our sampler always outperforms uniformization, with the difference particularly significant for short intervals. 
Interestingly, running our thinning-based sampler on the truncated system offers no significant computational benefit over running it on the full model. As the observation interval becomes longer, the MJP trajectory can make larger and larger excursions (especially over the interval [tend/10, tend]). Thus, as tend increases, the event rates witnessed in posterior trajectories start to increase. As our sampler adapts to this, the number of thinned events in all three samplers starts to become comparable, causing the uniformization-based sampler to approach the performance of the other two samplers. At the same time, we see that the difference between our truncated and our untruncated sampler starts to widen. Of course, we should remember that over long intervals, truncating the system size to 50 becomes more likely to introduce biases into our inferences.
4 Discussion
We described a general framework for MCMC inference in continuous-time discrete-state systems. Each MCMC iteration first samples a random discretization of time given the trajectory of the system. Given this, we then resample the sMJP trajectory using the forward-backward algorithm. While we looked only at semi-Markov and Markov jump processes, it is easy to extend our approach to piecewise-constant stochastic processes with more complicated dependency structures.
For our sampler, a bottleneck in the rate of mixing is that the new and old trajectories share an intermediate discretization W (see figure 1(e)). Recall that an sMJP trajectory defines an instantaneous hazard function B(t); our scheme requires that the discretization sampled under the old hazard function be compatible with the new hazard function. Thus, the forward-backward algorithm is unlikely to return a trajectory associated with a hazard function that differs significantly from the old one.
By contrast, for uniformization, the hazard function is a constant B, independent of the system state. However, this comes at the cost of a conservatively high discretization of time. An interesting direction for future work is to see how different choices of the dominating hazard function can help trade off these factors. For instance, we proposed using a single Ω, with Bs(·) = ΩAs(·). It is possible to use a different Ωs for each state s, or even an Ωs(·) that varies with time. Similarly, one can consider additive (rather than multiplicative) constructions of Bs(·).
For general sMJPs, the forward-backward algorithm scales quadratically with |W|, the number of candidate jump times. Such scaling is characteristic of sMJPs, though we can avail of discrete-time MCMC techniques to ameliorate this. For sMJPs whose hazard functions are constant beyond a 'window of memory', inference scales quadratically with the memory length, and only linearly with |W|. One can use such approximations to devise efficient MH proposals for sMJP trajectories.
References
[1] Ryan P. Adams, Iain Murray, and David J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
[2] Y. W. Teh, C. Blundell, and L. T. Elliott. Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems, 2011.
[3] U. Nodelman, C. R. Shelton, and D. Koller. Continuous time Bayesian networks.
In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 378–387, 2002.
[4] Ardavan Saeedi and Alexandre Bouchard-Côté. Priors over Recurrent Continuous Time Processes. In Advances in Neural Information Processing Systems 24 (NIPS), volume 24, 2011.
[5] Matthias Hoffman, Hendrik Kueck, Nando de Freitas, and Arnaud Doucet. New inference strategies for solving Markov decision processes using reversible jump MCMC. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), pages 223–231, Corvallis, Oregon, 2009. AUAI Press.
[6] A. Doucet, N. de Freitas, and N. J. Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer-Verlag, May 2001.
[7] S. Frühwirth-Schnatter. Data augmentation and dynamic linear models. J. Time Ser. Anal., 15:183–202, 1994.
[8] C. K. Carter and R. Kohn. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83:589–601, 1996.
[9] Radford M. Neal, Matthew J. Beal, and Sam T. Roweis. Inferring state sequences for non-linear systems with embedded hidden Markov models. In Advances in Neural Information Processing Systems 16 (NIPS), volume 16, pages 401–408. MIT Press, 2004.
[10] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In Proceedings of the International Conference on Machine Learning, volume 25, 2008.
[11] M. Dewar, C. Wiggins, and F. Wood. Inference in hidden Markov models with explicit state duration distributions. IEEE Signal Processing Letters, to appear, 2012.
[12] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B, 72(3):269–342, 2010.
[13] V. Rao and Y. W. Teh.
Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
[14] William Feller. On semi-Markov processes. Proceedings of the National Academy of Sciences of the United States of America, 51(4):653–659, 1964.
[15] D. Sonderman. Comparing semi-Markov processes. Mathematics of Operations Research, 5(1):110–119, 1980.
[16] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, 2008.
[17] J. F. C. Kingman. Poisson processes, volume 3 of Oxford Studies in Probability. The Clarendon Press, Oxford University Press, New York, 1993. Oxford Science Publications.
[18] A. Beskos and G. O. Roberts. Exact simulation of diffusions. Annals of Applied Probability, 15(4):2422–2444, November 2005.
[19] Martyn Plummer, Nicky Best, Kate Cowles, and Karen Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, March 2006.
[20] Andrew Golightly and Darren J. Wilkinson. Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus, 1(6):807–820, December 2011.
[21] S. Asmussen. Applied Probability and Queues. Applications of Mathematics. Springer, 2003.
[22] Stephen G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36:45, 2007.