{"title": "Particle Gibbs for Infinite Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2395, "page_last": 2403, "abstract": "Infinite Hidden Markov Models (iHMM's) are an attractive, nonparametric generalization of the classical Hidden Markov Model which can automatically infer the number of hidden states in the system. However, due to the infinite-dimensional nature of the transition dynamics, performing inference in the iHMM is difficult. In this paper, we present an infinite-state Particle Gibbs (PG) algorithm to resample state trajectories for the iHMM. The proposed algorithm uses an efficient proposal optimized for iHMMs, and leverages ancestor sampling to improve the mixing of the standard PG algorithm. Our algorithm demonstrates significant convergence improvements on synthetic and real world data sets.", "full_text": "Particle Gibbs for In\ufb01nite Hidden Markov Models\n\nNilesh Tripuraneni*\nUniversity of Cambridge\nnt357@cam.ac.uk\n\nShixiang Gu*\n\nUniversity of Cambridge\nMPI for Intelligent Systems\n\nsg717@cam.ac.uk\n\nHong Ge\n\nUniversity of Cambridge\nhg344@cam.ac.uk\n\nZoubin Ghahramani\nUniversity of Cambridge\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nIn\ufb01nite Hidden Markov Models (iHMM\u2019s) are an attractive, nonparametric gener-\nalization of the classical Hidden Markov Model which can automatically infer the\nnumber of hidden states in the system. However, due to the in\ufb01nite-dimensional\nnature of the transition dynamics, performing inference in the iHMM is dif\ufb01cult.\nIn this paper, we present an in\ufb01nite-state Particle Gibbs (PG) algorithm to re-\nsample state trajectories for the iHMM. The proposed algorithm uses an ef\ufb01cient\nproposal optimized for iHMMs and leverages ancestor sampling to improve the\nmixing of the standard PG algorithm. 
Our algorithm demonstrates significant convergence improvements on synthetic and real world data sets.\n1 Introduction\nHidden Markov Models (HMM\u2019s) are among the most widely adopted latent-variable models used to model time-series datasets in the statistics and machine learning communities. They have also been successfully applied in a variety of domains including genomics, language, and finance where sequential data naturally arises [Rabiner, 1989; Bishop, 2006].\nOne possible disadvantage of the finite-state space HMM framework is that one must a priori specify the number of latent states K. Standard model selection techniques can be applied to the finite state-space HMM but bear a high computational overhead, since they require the repetitive training/exploration of many HMM\u2019s of different sizes.\nBayesian nonparametric methods offer an attractive alternative to this problem by adapting their effective model complexity to fit the data. In particular, Beal et al. [2001] constructed an HMM over a countably infinite state-space using a Hierarchical Dirichlet Process (HDP) prior over the rows of the transition matrix. Various approaches have been taken to perform full posterior inference over the latent states, transition/emission distributions and hyperparameters, since it is impossible to directly apply the forward-backward algorithm due to the infinite-dimensional size of the state space. The original Gibbs sampling approach proposed in Teh et al. [2006] suffered from slow mixing due to the strong correlations between nearby time steps often present in time-series data [Scott, 2002]. However, Van Gael et al. 
[2008] introduced a set of auxiliary slice variables to dynamically \u201ctruncate\u201d the state space to be finite (referred to as beam sampling), allowing them to use dynamic programming to jointly resample the latent states, thus circumventing the problem. Despite the power of the beam-sampling scheme, Fox et al. [2008] found that application of the beam sampler to the (sticky) iHMM resulted in slow mixing relative to an inexact, blocked sampler, due to the introduction of auxiliary slice variables in the sampler.\n\n*equal contribution.\n\n1\n\n\fThe main contributions of this paper are to derive an infinite-state PG algorithm for the iHMM using the stick-breaking construction for the HDP, and to construct an optimal importance proposal to efficiently resample its latent state trajectories. The proposed algorithm is compared to existing state-of-the-art inference algorithms for iHMMs, and empirical evidence suggests that the infinite-state PG algorithm consistently outperforms its alternatives. Furthermore, by construction the time complexity of the proposed algorithm is O(T N K). Here T denotes the length of the sequence, N denotes the number of particles in the PG sampler, and K denotes the number of \u201cactive\u201d states in the model. Despite the simplicity of the sampler, we find in a variety of synthetic and real-world experiments that these particle methods dramatically improve convergence of the sampler, while being more scalable.\nWe will first define the iHMM/sticky iHMM in Section 2, and review the Dirichlet Process (DP) and Hierarchical Dirichlet Process (HDP) in our appendix. Then we move on to the description of our MCMC sampling scheme in Section 3. 
In Section 4 we present our results on a variety of synthetic and real-world datasets.\n2 Model and Notation\n2.1 Infinite Hidden Markov Models\nWe can formally define the iHMM (we review the theory of the HDP in our appendix) as follows:\n\n\u03b2 \u223c GEM(\u03b3), \u03c0_j | \u03b2 iid\u223c DP(\u03b1, \u03b2), \u03c6_j iid\u223c H, j = 1, . . . , \u221e, (1)\ns_t | s_{t\u22121} \u223c Cat(\u00b7 | \u03c0_{s_{t\u22121}}), y_t | s_t \u223c f (\u00b7 | \u03c6_{s_t}), t = 1, . . . , T. (2)\n\nHere \u03b2 is the shared DP measure defined on the integers Z. Here s_{1:T} = (s_1, ..., s_T) are the latent states of the iHMM, y_{1:T} = (y_1, ..., y_T) are the observed data, and \u03c6_j parametrizes the emission distribution f. Usually H and f are chosen to be conjugate to simplify the inference. \u03b2_{k'} can be interpreted as the prior mean for transition probabilities into state k', with \u03b1 governing the variability of the prior mean across the rows of the transition matrix. The hyper-parameter \u03b3 controls how concentrated or diffuse the probability mass of \u03b2 will be over the states of the transition matrix. To connect the HDP with the iHMM, note that given a draw from the HDP G_k = \u03a3_{k'=1}^\u221e \u03c0_{kk'} \u03b4_{\u03c6_{k'}}, we identify \u03c0_{kk'} with the transition probability from state k to state k', where the \u03c6_{k'} parametrize the emission distributions.\nNote that fixing \u03b2 = (1/K, ..., 1/K, 0, 0, ...) implies only transitions between the first K states of the transition matrix are ever possible, leaving us with the finite Bayesian HMM. If we define a finite, hierarchical Bayesian HMM by drawing\n\n\u03b2 \u223c Dir(\u03b3/K, ..., \u03b3/K), \u03c0_k \u223c Dir(\u03b1\u03b2),\n\nwith joint density over the latent/hidden states\n\np_\u03c6(s_{1:T}, y_{1:T}) = \u03a0_{t=1}^T \u03c0(s_t | s_{t\u22121}) f_\u03c6(y_t | s_t),\n\nthen after taking K \u2192 \u221e, the hierarchical prior in Equation (2) approaches the HDP.\n\nFigure 1: Graphical Model for the sticky HDP-HMM (setting \u03ba = 0 recovers the HDP-HMM)\n\n2.2 Prior and Emission Distribution Specification\nThe hyperparameter \u03b1 governs the variability of the prior mean across the rows of the transition matrix, and \u03b3 controls how concentrated or diffuse the probability mass of \u03b2 will be over the states of the transition matrix. However, in the HDP-HMM each row of the transition matrix is drawn as \u03c0_j \u223c DP(\u03b1, \u03b2). Thus the HDP prior doesn\u2019t differentiate self-transitions from jumps between different states. This can be especially problematic in the nonparametric setting, since non-Markovian state persistence in data can lead to the creation of unnecessary extra states and unrealistically rapid switching dynamics in our model. In Fox et al. [2008], this problem is addressed by including a self-transition bias parameter in the distribution of the transition probability vector \u03c0_j:\n\n\u03c0_j \u223c DP(\u03b1 + \u03ba, (\u03b1\u03b2 + \u03ba\u03b4_j) / (\u03b1 + \u03ba)) (3)\n\nto incorporate the prior belief that smooth, state-persistent dynamics are more probable. Such a construction only involves the introduction of one further hyperparameter \u03ba, which controls the \u201cstickiness\u201d of the transition matrix (note a similar self-transition bias was explored in Beal et al. 
[2001]).\nFor the standard iHMM, most approaches to inference have placed vague gamma hyper-priors on the hyper-parameters \u03b1 and \u03b3, which can be resampled efficiently as in Teh et al. [2006]. Similarly, in the sticky iHMM, in order to maintain tractable resampling of hyper-parameters, Fox et al. [2008] chose to place vague gamma priors on \u03b3 and \u03b1 + \u03ba, and a beta prior on \u03ba/(\u03b1 + \u03ba). In this work we follow Teh et al. [2006]; Fox et al. [2008] and place priors \u03b3 \u223c Gamma(a_\u03b3, b_\u03b3), \u03b1 + \u03ba \u223c Gamma(a_s, b_s), and \u03ba/(\u03b1 + \u03ba) \u223c Beta(a_\u03ba, b_\u03ba) on the hyper-parameters.\nWe consider two conjugate emission models for the output states of the iHMM \u2013 a multinomial emission distribution for discrete data, and a normal emission distribution for continuous data. For discrete data we choose \u03c6_k \u223c Dir(\u03b1_\u03c6) with f (\u00b7 | \u03c6_k) = Cat(\u00b7 | \u03c6_k). For continuous data we choose \u03c6_k = (\u00b5, \u03c3^2) \u223c NIG(\u00b5, \u03bb, \u03b1_\u03c6, \u03b2_\u03c6) with f (\u00b7 | \u03c6_k) = N (\u00b7 | \u03c6_k = (\u00b5, \u03c3^2)).\n3 Posterior Inference for the iHMM\nLet us first recall the collection of variables we need to sample: \u03b2 is the shared DP base measure, (\u03c0_k) is the transition matrix acting on the latent states, and \u03c6_k parametrizes the emission distribution f, k = 1, . . . , K. 
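The stick-breaking draws underlying the definitions in Section 2 can be sketched in a few lines. The following is a minimal numpy illustration (not the authors\u2019 code; the truncation level K = 20 and all numerical values are arbitrary):

```python
import numpy as np

def gem_sticks(gamma, K, rng):
    # beta'_k ~ Beta(1, gamma); beta_k = beta'_k * prod_{l<k} (1 - beta'_l)
    b = rng.beta(1.0, gamma, size=K)
    return b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))

rng = np.random.default_rng(0)
beta = gem_sticks(5.0, 20, rng)  # first 20 sticks of beta ~ GEM(5.0)

# A row of the transition matrix, pi_j ~ DP(alpha, beta), restricted to the
# instantiated states plus one 'remainder' column, is a finite Dirichlet draw:
alpha = 2.0
pi_j = rng.dirichlet(alpha * np.concatenate((beta, [1.0 - beta.sum()])))
```

The leftover mass 1 - beta.sum() stands in for the infinitely many uninstantiated states; it is this remainder that the sampler later splits lazily whenever a particle opens a new state.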
We can then resample the variables of the iHMM in a series of Gibbs steps:\nStep 1: Sample s_{1:T} | y_{1:T}, \u03c6_{1:K}, \u03b2, \u03c0_{1:K}.\nStep 2: Sample \u03b2 | s_{1:T}, \u03b3.\nStep 3: Sample \u03c0_{1:K} | \u03b2, \u03b1, \u03ba, s_{1:T}.\nStep 4: Sample \u03c6_{1:K} | y_{1:T}, s_{1:T}, H.\nStep 5: Sample (\u03b1, \u03b3, \u03ba) | s_{1:T}, \u03b2, \u03c0_{1:K}.\nDue to the strongly correlated nature of time-series data, resampling the latent hidden states in Step 1 is often the most difficult, since the other variables can be sampled via the Gibbs sampler once a sample of s_{1:T} has been obtained. In the following section we describe a novel, efficient sampler for the latent states s_{1:T} of the iHMM, and refer the reader to our appendix and Teh et al. [2006]; Fox et al. [2008] for a detailed discussion of the steps for sampling the variables \u03b1, \u03b3, \u03ba, \u03b2, \u03c0_{1:K}, \u03c6_{1:K}.\n3.1 Infinite State Particle Gibbs Sampler\nWithin the Particle MCMC framework of Andrieu et al. [2010], Sequential Monte Carlo (SMC, or particle filtering) is used as a complex, high-dimensional proposal for the Metropolis-Hastings algorithm. The Particle Gibbs sampler is a conditional SMC algorithm resulting from clamping one particle to an a priori fixed trajectory. In particular, it is a transition kernel that has p(s_{1:T} | y_{1:T}) as its stationary distribution.\nThe key to constructing a generic, truncation-free sampler to resample the latent states s_{1:T} of the iHMM is to note that the finite number of particles in the sampler are \u201clocalized\u201d in the latent space to a finite subset of the infinite set of possible states. Moreover, they can only transition to finitely many new states as they are propagated through the forward pass. Thus the \u201cinfinite\u201d measure \u03b2 and the \u201cinfinite\u201d transition matrix \u03c0 only need to be instantiated to support the number of \u201cactive\u201d states (defined as being {1, ..., K}) in the state space. 
In the particle Gibbs algorithm, if a particle transitions to a state outside the \u201cactive\u201d set, the objects \u03b2 and \u03c0 can be lazily expanded via the stick-breaking constructions derived for both objects in Teh et al. [2006] and stated in equations (2), (4) and (5). Thus, due to the properties of both the stick-breaking construction and the PGAS kernel, this resampling procedure will leave the target distribution p(s_{1:T} | y_{1:T}) invariant. Below we first describe our infinite-state particle Gibbs algorithm for the iHMM and then detail our notation (we provide further background on SMC in our supplement):\nStep 1: For iteration t = 1, initialize as:\n(a) sample s_1^i \u223c q_1(\u00b7) for i \u2208 1, ..., N.\n(b) initialize weights w_1^i = p(s_1^i) f (y_1 | s_1^i) / q_1(s_1^i) for i \u2208 1, ..., N.\nStep 2: For iteration t > 1, use the reference trajectory s'_{1:T} from iteration t \u2212 1, together with \u03b2, \u03c0, \u03c6, and K:\n(a) sample the index a_{t\u22121}^i \u223c Cat(\u00b7 | W_{t\u22121}^{1:N}) of the ancestor of particle i, for i \u2208 1, ..., N \u2212 1.\n(b) sample s_t^i \u223c q_t(\u00b7 | s_{t\u22121}^{a_{t\u22121}^i}) for i \u2208 1, ..., N \u2212 1. If s_t^i = K + 1, create a new state using the stick-breaking construction for the HDP:\n(i) Sample a new transition probability vector \u03c0_{K+1} \u223c Dir(\u03b1\u03b2).\n(ii) Use the stick-breaking construction to iteratively expand \u03b2 \u2190 [\u03b2, \u03b2_{K+1}] as:\n\n\u03b2'_{K+1} iid\u223c Beta(1, \u03b3), \u03b2_{K+1} = \u03b2'_{K+1} \u03a0_{\u2113=1}^K (1 \u2212 \u03b2'_\u2113).\n\n(iii) Expand the transition probability vectors (\u03c0_k), k = 1, . . . , K + 1, to include transitions to the (K + 1)st state via the HDP stick-breaking construction as:\n\n\u03c0_j \u2190 [\u03c0_{j1}, \u03c0_{j2}, . . . , \u03c0_{j,K+1}], \u2200j = 1, . . . , K + 1,\n\nwhere\n\n\u03c0'_{j,K+1} \u223c Beta(\u03b1_0 \u03b2_{K+1}, \u03b1_0 (1 \u2212 \u03a3_{\u2113=1}^{K+1} \u03b2_\u2113)), \u03c0_{j,K+1} = \u03c0'_{j,K+1} \u03a0_{\u2113=1}^K (1 \u2212 \u03c0'_{j\u2113}).\n\n(iv) Sample a new emission parameter \u03c6_{K+1} \u223c H.\n(c) compute the ancestor weights \u02dcw_{t\u22121|T}^i = w_{t\u22121}^i \u03c0(s'_t | s_{t\u22121}^i) and resample a_t^N as P(a_t^N = i) \u221d \u02dcw_{t\u22121|T}^i.\n(d) recompute and normalize the particle weights using:\n\nw_t(s_t^i) = \u03c0(s_t^i | s_{t\u22121}^{a_{t\u22121}^i}) f (y_t | s_t^i) / q_t(s_t^i | s_{t\u22121}^{a_{t\u22121}^i}), W_t(s_t^i) = w_t(s_t^i) / (\u03a3_{i=1}^N w_t(s_t^i)).\n\nStep 3: Sample k with P(k = i) \u221d w_T^i and return s\u2217_{1:T} = s_{1:T}^k.\nIn the particle Gibbs sampler, at each step t a weighted particle system {s_t^i, w_t^i}_{i=1}^N serves as an empirical point-mass approximation to the distribution p(s_{1:T}), with the variables a_t^i denoting the \u2018ancestor\u2019 particles of s_t^i. Here we have used \u03c0(s_t | s_{t\u22121}) to denote the latent transition distribution, f (y_t | s_t) the emission distribution, and p(s_1) the prior over the initial state s_1.\n3.2 More Efficient Importance Proposal q_t(\u00b7)\nIn the PG algorithm described above, we have a choice of the importance sampling density q_t(\u00b7) to use at every time step. The simplest choice is to sample from the \u201cprior\u201d \u2013 q_t(\u00b7 | s_{t\u22121}^{a_{t\u22121}^i}) = \u03c0(\u00b7 | s_{t\u22121}^{a_{t\u22121}^i}) \u2013 which can lead to satisfactory performance when the observations are not too informative and the dimension of the latent variables is not too large. However, using the prior as the importance proposal in particle MCMC is known to be suboptimal. 
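The conditional SMC sweep of Steps 1-3 can be summarized in code. The sketch below is schematic rather than the authors\u2019 implementation (hypothetical names; a fixed, finite set of K active states, uniform initial and prior transition proposals, and none of the lazy state-creation or hyperparameter steps), but it illustrates the clamping and ancestor-resampling structure:

```python
import numpy as np

def particle_gibbs_sweep(pi, emis_lik, s_ref, N, rng):
    # pi: (K, K) transition matrix; emis_lik: (T, K) with f(y_t | s_t = k)
    # s_ref: length-T reference trajectory, clamped as particle N - 1
    T, K = emis_lik.shape
    s = np.zeros((N, T), dtype=int)
    w = np.zeros((N, T))
    # Step 1: t = 1 (uniform initial proposal for simplicity)
    s[:, 0] = rng.integers(K, size=N)
    s[N - 1, 0] = s_ref[0]
    w[:, 0] = emis_lik[0, s[:, 0]]
    for t in range(1, T):
        W = w[:, t - 1] / w[:, t - 1].sum()
        anc = rng.choice(N, size=N, p=W)             # Step 2(a): ancestors
        aw = w[:, t - 1] * pi[s[:, t - 1], s_ref[t]]
        anc[N - 1] = rng.choice(N, p=aw / aw.sum())  # Step 2(c): ancestor sampling
        for i in range(N - 1):                       # Step 2(b): propagate from prior
            s[i, t] = rng.choice(K, p=pi[s[anc[i], t - 1]])
        s[N - 1, t] = s_ref[t]                       # clamp the reference particle
        s[:, :t] = s[anc, :t]                        # inherit ancestor histories
        w[:, t] = emis_lik[t, s[:, t]]               # Step 2(d): prior proposal cancels
    k = rng.choice(N, p=w[:, -1] / w[:, -1].sum())
    return s[k]                                      # Step 3: output trajectory
```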
In order to improve the mixing rate of the sampler, it is desirable to sample from the partial \u201cposterior\u201d \u2013 q_t(\u00b7 | s_{t\u22121}^{a_{t\u22121}^n}) \u221d \u03c0(s_t^n | s_{t\u22121}^{a_{t\u22121}^n}) f (y_t | s_t^n) \u2013 whenever possible.\nIn general, sampling from the \u201cposterior\u201d q_t(\u00b7 | s_{t\u22121}^{a_{t\u22121}^n}) may be impossible, but in the iHMM we can show that it is analytically tractable. To see this, note that we have lazily represented \u03c0(\u00b7 | s_{t\u22121}^n) as a finite vector \u2013 [\u03c0_{s_{t\u22121}^n, 1:K}, \u03c0_{s_{t\u22121}^n, K+1}]. Moreover, we can easily evaluate the likelihood f (y_t^n | s_t^n, \u03c6_{1:K}) for all s_t^n \u2208 1, ..., K. However, if s_t^n = K + 1, we need to compute the likelihood f (y_t^n | s_t^n = K + 1) = \u222b f (y_t^n | s_t^n = K + 1, \u03c6) H(\u03c6) d\u03c6. If f and H are conjugate, we can analytically compute the marginal likelihood of the (K + 1)st state, but this can also be approximated by Monte Carlo sampling for non-conjugate likelihoods \u2013 see Neal [2000] for a more detailed discussion of this argument. Thus, we can compute p(y_t | s_{t\u22121}^n) = \u03a3_{k=1}^{K+1} \u03c0(k | s_{t\u22121}^n) f (y_t | \u03c6_k) for each particle s_t^n, where n \u2208 1, ..., N \u2212 1.\nWe investigate the impact of \u201cposterior\u201d vs. \u201cprior\u201d proposals in Figure 5. Based on the convergence of the number of states and the joint log-likelihood, we can see that sampling from the \u201cposterior\u201d improves the mixing of the sampler. Indeed, we see from the \u201cprior\u201d sampling experiments that increasing the number of particles from N = 10 to N = 50 does seem to marginally improve the mixing of the sampler, but we have found N = 10 particles sufficient to obtain good results. 
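With Gaussian emissions and a conjugate normal prior on the means, the tractable \u201cposterior\u201d proposal just described reduces to a categorical distribution over the K instantiated states plus the (K + 1)st. The sketch below makes that concrete under those assumptions (all names are illustrative, not from the paper; mu0 and tau0 parametrize H = N(mu0, tau0^2), and the emission standard deviation sigma is treated as known):

```python
import numpy as np

def posterior_proposal(pi_row, y_t, phi_mu, sigma, mu0, tau0, rng):
    # pi_row: transition probabilities out of the previous state, length K + 1
    # (last entry is the mass on the not-yet-instantiated K+1st state)
    def norm_pdf(y, m, s):
        return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    lik = norm_pdf(y_t, phi_mu, sigma)  # f(y_t | phi_k), k = 1..K
    # marginal likelihood of a fresh state: y_t ~ N(mu0, sigma^2 + tau0^2)
    lik_new = norm_pdf(y_t, mu0, np.sqrt(sigma ** 2 + tau0 ** 2))
    q = pi_row * np.append(lik, lik_new)
    q /= q.sum()
    return rng.choice(len(q), p=q)      # index K means 'open a new state'
```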
However, we found no appreciable gain when increasing the number of particles from N = 10 to N = 50 when sampling from the \u201cposterior\u201d, and omitted those curves for clarity. It is worth noting that the PG sampler (with ancestor resampling) still performs reasonably even when sampling from the \u201cprior\u201d.\n3.3 Improving Mixing via Ancestor Resampling\nIt has been recognized that the mixing properties of the PG kernel can be poor due to path degeneracy [Lindsten et al., 2014]. A variant of PG presented in Lindsten et al. [2014] attempts to address this problem for any non-Markovian state-space model with a modification \u2013 resampling a new value for the variable a_t^N in an \u201cancestor sampling\u201d step at every time step \u2013 which can significantly improve the mixing of the PG kernel with little extra computation in the case of Markovian systems.\nTo understand ancestor sampling, for t \u2265 2 consider the reference trajectory s'_{t:T} ranging from the current time step t to the final time T. Now, artificially assign a candidate history to this partial path by connecting s'_{t:T} to one of the other particles\u2019 histories up until that point, {s_{1:t\u22121}^i}_{i=1}^N, which can be achieved by simply assigning a new value to the variable a_t^N \u2208 1, ..., N. To do this, we first compute the weights:\n\n\u02dcw_{t\u22121|T}^i \u2261 w_{t\u22121}^i p_T(s_{1:t\u22121}^i, s'_{t:T} | y_{1:T}) / p_{t\u22121}(s_{1:t\u22121}^i | y_{1:T}), i = 1, ..., N. (4)\n\nThen a_t^N is sampled according to P(a_t^N = i) \u221d \u02dcw_{t\u22121|T}^i. Remarkably, this ancestor sampling step leaves the density p(s_{1:T} | y_{1:T}) invariant, as shown in Lindsten et al. [2014] for arbitrary, non-Markovian state-space models. 
However, since the infinite HMM is Markovian, we can show that the computation of the ancestor sampling weights simplifies to\n\n\u02dcw_{t\u22121|T}^i = w_{t\u22121}^i \u03c0(s'_t | s_{t\u22121}^i). (5)\n\nNote that the ancestor sampling step does not change the O(T N K) time complexity of the infinite-state PG sampler.\n3.4 Resampling \u03c0, \u03c6, \u03b2, \u03b1, \u03b3, and \u03ba\nOur resampling scheme for \u03c0, \u03b2, \u03c6, \u03b1, \u03b3, and \u03ba follows straightforwardly from the schemes in Fox et al. [2008]; Teh et al. [2006]. We present a review of their methods and related work in our appendix for completeness.\n4 Empirical Study\nIn the following experiments we explore the performance of the PG sampler on both the iHMM and the sticky iHMM. Note that throughout this section we have only taken N = 10 and N = 50 particles for the PG sampler, which has time complexity O(T N K) when sampling from the \u201cposterior\u201d, compared to the time complexity O(T K^2) of the beam sampler. For completeness, we also compare to the Gibbs sampler, which has been shown to perform worse than the beam sampler [Van Gael et al., 2008] due to strong correlations in the latent states.\n4.1 Convergence on Synthetic Data\nTo study the mixing properties of the PG sampler on the iHMM and sticky iHMM, we consider two synthetic examples with strongly positively correlated latent states.\n\nFigure 2: Comparing the performance of the PG sampler, the PG sampler on the sticky iHMM (PG-S), the beam sampler, and the Gibbs sampler on inferring data from a 4 state strongly correlated HMM. Left: Number of \u201cActive\u201d States K vs. Iterations. Right: Joint Log-Likelihood vs. Iterations. (Best viewed in color)\n\nFigure 3: Learned Latent Transition Matrices for the PG sampler and Beam Sampler vs. Ground Truth (Transition Matrix for the Gibbs Sampler omitted for clarity). PG correctly recovers the strongly correlated self-transition matrix, while the Beam Sampler supports extra \u201cspurious\u201d states in the latent space.\n\nFirst, as in Van Gael et al. [2008], we generate sequences of length 4000 from a 4 state HMM with self-transition probability of 0.75 and residual probability mass distributed uniformly over the remaining states, where the emission distributions are taken to be normal with fixed standard deviation 0.5 and emission means of \u22122.0, \u22120.5, 1.0, 4.0 for the 4 states. The base distribution H for the iHMM is taken to be normal with mean 0 and standard deviation 2, and we initialized the sampler with K = 10 \u201cactive\u201d states.\nIn the 4-state case, we see in Figure 2 that the PG sampler applied to both the iHMM and the sticky iHMM converges to the \u201ctrue\u201d value of K = 4 much more quickly than both the beam sampler and the Gibbs sampler \u2013 uncovering the model dimensionality and the structure of the transition matrix by more rapidly eliminating spurious \u201cactive\u201d states from the space, as evidenced in the learned transition matrix plots in Figure 3. Moreover, as evidenced by the joint log-likelihood in Figure 2, we see that the PG sampler applied to both the iHMM and the sticky iHMM converges quickly to a good mode, while the beam sampler has not fully converged within 1000 iterations, and the Gibbs sampler performs poorly.\nTo further explore the mixing of the PG sampler vs. the beam sampler, we consider a similar inference problem on synthetic data over a larger state space. We generate sequences of length 4000 from a 10 state HMM with self-transition probability of 0.75 and residual probability mass distributed uniformly over the remaining states, and take the emission distributions to be normal with fixed standard deviation 0.5 and means equally spaced 2.0 apart between \u221210 and 10. 
The base distribution, H, for the iHMM is also taken to be normal with mean 0 and standard deviation 2. The samplers were initialized with K = 3 and K = 30 states to explore the convergence and robustness of the infinite-state PG sampler vs. the beam sampler.\n\nFigure 4: Comparing the performance of the PG sampler vs. the beam sampler on inferring data from a 10 state strongly correlated HMM with different initializations. Left: Number of \u201cActive\u201d States K from different initial K vs. Iterations. Right: Joint Log-Likelihood from different initial K vs. Iterations.\n\nFigure 5: Influence of \u201cPosterior\u201d vs. \u201cPrior\u201d proposal and the number of particles in the PG sampler on the iHMM. Left: Number of \u201cActive\u201d States K from different initial K, numbers of particles, and \u201cPrior\u201d/\u201cPosterior\u201d proposal vs. Iterations. Right: Joint Log-Likelihood from different initial K, numbers of particles, and \u201cPrior\u201d/\u201cPosterior\u201d proposal vs. Iterations.\n\nAs observed in Figure 4, we see that the PG sampler applied to the iHMM and sticky iHMM converges far more quickly from both the \u201csmall\u201d and \u201clarge\u201d initializations of K = 3 and K = 30 \u201cactive\u201d states to the true value of K = 10 hidden states, as well as converging more quickly in joint log-likelihood. Indeed, as noted in Fox et al. [2008], the introduction of the extra slice variables in the beam sampler can inhibit the mixing of the sampler, since for the beam sampler to consider transitions with low prior probability one must also have sampled an unlikely corresponding slice variable so as not to have truncated that state out of the space. This can become particularly problematic if one needs to consider several of these transitions in succession. We believe this provides evidence that the infinite-state Particle Gibbs sampler presented here, which does not introduce extra slice variables, mixes better than beam sampling in the iHMM.\n4.2 Ion Channel Recordings\nFor our first real dataset, we investigate the behavior of the PG sampler and the beam sampler on an ion channel recording. In particular, we consider a 1MHz recording from Rosenstein et al. 
[2013] of a single alamethicin channel previously investigated in Palla et al. [2014]. We subsample the time series by a factor of 100, truncate it to be of length 2000, and further log transform and normalize it.\nWe ran both the beam and PG samplers on the iHMM for 1000 iterations (until we observed convergence in the joint log-likelihood). Due to the large fluctuations in the observed time series, the beam sampler infers the number of \u201cactive\u201d hidden states to be K = 5, while the PG sampler infers the number of \u201cactive\u201d hidden states to be K = 4. However, in Figure 6 we see that the beam sampler infers a solution for the latent states which rapidly oscillates between a subset of likely states during temporal regions which intuitively seem to be better explained by a single state. The PG sampler, by contrast, has converged to a mode which seems to better represent the latent transition dynamics, and only seems to infer \u201cextra\u201d states in the regions of large fluctuation. Indeed, this suggests that the beam sampler is mixing worse than the PG sampler.\n\nFigure 6: Left: Observations colored by an inferred latent state trajectory using beam sampling inference. Right: Observations colored by an inferred latent state trajectory using PG inference.\n\n4.3 Alice in Wonderland Data\nFor our next example we consider the task of predicting sequences of letters taken from Alice\u2019s Adventures in Wonderland. We trained an iHMM on the 1000 characters from the first chapter of the book, and tested on 4000 subsequent characters from the same chapter, using a multinomial emission model for the iHMM.\nOnce again, we see that the PG sampler applied to the iHMM/sticky iHMM converges quickly in joint log-likelihood to a mode where it stably learns a value of K \u2248 10, as evidenced in Figure 7. Though the performance of the PG and beam samplers appears to be roughly comparable here, we would like to highlight two observations. 
Firstly, the inferred value of K obtained by the PG sampler quickly converges independent of the initialization K, as seen in the rightmost plot of Figure 7. However, the beam sampler\u2019s prediction for the number of active states K still appears to be decreasing, and fluctuates more rapidly than for both the iHMM and sticky iHMM, as evidenced by the error bars in the middle plot, in addition to being quite sensitive to the initialization K, as shown in the rightmost plot. Based on the previous synthetic experiment (Section 4.1) and this result, we suspect that although both the beam sampler and the PG sampler quickly converge to good solutions as evidenced by the training joint log-likelihood, the beam sampler is learning a transition matrix with unnecessary/spurious \u201cactive\u201d states.\n\nFigure 7: Left: Comparing the Joint Log-Likelihood vs. Iterations for the PG sampler and Beam sampler. Middle: Comparing the convergence of the \u201cactive\u201d number of states for the iHMM and sticky iHMM for the PG sampler and Beam sampler. Right: Trace plots of the number of states for different initializations of K.\n\nNext we calculate the predictive log-likelihood of the Alice in Wonderland test data averaged over 2500 different realizations, and find that the infinite-state PG sampler with N = 10 particles achieves a predictive log-likelihood of \u22125918.4 \u00b1 123.8, while the beam sampler achieves a predictive log-likelihood of \u22126099.0 \u00b1 106.0, showing that the PG sampler applied to the iHMM and Sticky iHMM learns hyperparameter and latent variable values that obtain better predictive performance on the held-out dataset. 
We note that in this experiment as well, we have found it necessary to take only N = 10 particles in the PG sampler to achieve good mixing and empirical performance, although increasing the number of particles to N = 50 does improve the convergence of the sampler in this instance. Given that the PG sampler has a time complexity of O(T N K) for a single pass, while the beam sampler (and truncated methods) have a time complexity of O(T K^2) for a single pass, we believe that the PG sampler is a competitive alternative to the beam sampler for the iHMM.\n5 Discussions and Conclusions\nIn this work we derive a new inference algorithm for the iHMM within the particle MCMC framework, based on the stick-breaking construction for the HDP. We also develop an efficient proposal inside PG, optimized for iHMMs, to efficiently resample their latent state trajectories. The proposed algorithm is empirically compared to existing state-of-the-art inference algorithms for iHMMs, and shown to be promising because it converges more quickly and robustly to the true number of states, in addition to obtaining better predictive performance on several synthetic and real-world datasets. Moreover, we argued that the PG sampler proposed here is a competitive alternative to the beam sampler, since the time complexity of the particle samplers presented is O(T N K) versus the O(T K^2) of the beam sampler.\nAnother advantage of the proposed method is its simplicity: the PG algorithm doesn\u2019t require truncation or the introduction of auxiliary variables, also making the algorithm easily adaptable to challenging inference tasks. In particular, the PG sampler can be directly applied to the sticky HDP-HMM with DP emission model considered in Fox et al. [2008], for which no truncation-free sampler exists. We leave this development and application as an avenue for future work.\nReferences\nAndrieu, C., Doucet, A., and Holenstein, R. (2010). 
Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269\u2013342.\nBeal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2001). The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577\u2013584.\nBishop, C. M. (2006). Pattern Recognition and Machine Learning, volume 4. Springer New York.\nFox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2008). An HDP\u2013HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pages 312\u2013319. ACM.\nLindsten, F., Jordan, M. I., and Sch\u00f6n, T. B. (2014). Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145\u20132184.\nNeal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249\u2013265.\nPalla, K., Knowles, D. A., and Ghahramani, Z. (2014). A reversible infinite HMM using normalised random measures. arXiv preprint arXiv:1403.4206.\nRabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257\u2013286.\nRosenstein, J. K., Ramakrishnan, S., Roseman, J., and Shepard, K. L. (2013). Single ion channel recordings with CMOS-anchored lipid membranes. Nano Letters, 13(6):2682\u20132686.\nScott, S. L. (2002). Bayesian methods for hidden Markov models. Journal of the American Statistical Association, 97(457).\nTeh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581.\nVan Gael, J., Saatci, Y., Teh, Y. W., and Ghahramani, Z. (2008). Beam sampling for the infinite hidden Markov model. 
In Proceedings of the International Conference on Machine Learning, volume 25.\n", "award": [], "sourceid": 1403, "authors": [{"given_name": "Nilesh", "family_name": "Tripuraneni", "institution": "Cambridge University"}, {"given_name": "Shixiang (Shane)", "family_name": "Gu", "institution": "University of Cambridge and Max Planck Institute for Intelligent Systems"}, {"given_name": "Hong", "family_name": "Ge", "institution": "University of Cambridge"}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": "University of Cambridge"}]}