{"title": "Nonparametric Bayesian Learning of Switching Linear Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 457, "page_last": 464, "abstract": "Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. In this paper, we present a nonparametric approach to the learning of an unknown number of persistent, smooth dynamical modes by utilizing a hierarchical Dirichlet process prior. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with an efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.", "full_text": "Nonparametric Bayesian Learning of Switching\n\nLinear Dynamical Systems\n\nElectrical Engineering & Computer Science, Massachusetts Institute of Technology\n\nEmily B. Fox\n\nebfox@mit.edu\n\n\u2020Electrical Engineering & Computer Science and \u2021Statistics, University of California, Berkeley\n\n{sudderth, jordan}@eecs.berkeley.edu\n\nErik B. Sudderth\u2020, Michael I. Jordan\u2020\u2021\n\nElectrical Engineering & Computer Science, Massachusetts Institute of Technology\n\nAlan S. Willsky\n\nwillsky@mit.edu\n\nAbstract\n\nMany nonlinear dynamical phenomena can be effectively modeled by a system\nthat switches among a set of conditionally linear dynamical modes. We con-\nsider two such models: the switching linear dynamical system (SLDS) and the\nswitching vector autoregressive (VAR) process. Our nonparametric Bayesian ap-\nproach utilizes a hierarchical Dirichlet process prior to learn an unknown number\nof persistent, smooth dynamical modes. 
We develop a sampling algorithm that\ncombines a truncated approximation to the Dirichlet process with ef\ufb01cient joint\nsampling of the mode and state sequences. The utility and \ufb02exibility of our model\nare demonstrated on synthetic data, sequences of dancing honey bees, and the\nIBOVESPA stock index.\n\n1 Introduction\n\nLinear dynamical systems (LDSs) are useful in describing dynamical phenomena as diverse as hu-\nman motion [9], \ufb01nancial time-series [4], maneuvering targets [6, 10], and the dance of honey bees\n[8]. However, such phenomena often exhibit structural changes over time and the LDS models\nwhich describe them must also change. For example, a coasting ballistic missile makes an evasive\nmaneuver; a country experiences a recession, a central bank intervention, or some national or global\nevent; a honey bee changes from a waggle to a turn right dance. Some of these changes will ap-\npear frequently, while others are only rarely observed. In addition, there is always the possibility\nof a new, previously unseen dynamical behavior. These considerations motivate us to develop a\nnonparametric Bayesian approach for learning switching LDS (SLDS) models. We also consider\na special case of the SLDS\u2014the switching vector autoregressive (VAR) process\u2014in which direct\nobservations of the underlying dynamical process are assumed available. Although a special case of\nthe general linear systems framework, autoregressive models have simplifying properties that often\nmake them a practical choice in applications.\n\nOne can view switching dynamical processes as an extension of hidden Markov models (HMMs)\nin which each HMM state, or mode, is associated with a dynamical process. 
Existing methods for\nlearning SLDSs and switching VAR processes rely on either \ufb01xing the number of HMM modes,\nsuch as in [8], or considering a change-point detection formulation where each inferred change is\nto a new, previously unseen dynamical mode, such as in [14]. In this paper we show how one can\nremain agnostic about the number of dynamical modes while still allowing for returns to previously\nexhibited dynamical behaviors.\n\n\fHierarchical Dirichlet processes (HDP) can be used as a prior on the parameters of HMMs with\nunknown mode space cardinality [2, 12].\nIn this paper we make use of a variant of the HDP-\nHMM\u2014the sticky HDP-HMM of [5]\u2014that provides improved control over the number of modes\ninferred by the HDP-HMM; such control is crucial for the problems we examine. Although the\nHDP-HMM and its sticky extension are very \ufb02exible time series models, they do make a strong\nMarkovian assumption that observations are conditionally independent given the HMM mode. This\nassumption is often insuf\ufb01cient for capturing the temporal dependencies of the observations in real\ndata. Our nonparametric Bayesian approach for learning switching dynamical processes extends the\nsticky HDP-HMM formulation to learn an unknown number of persistent, smooth dynamical modes\nand thereby capture a wider range of temporal dependencies.\n2 Background: Switching Linear Dynamic Systems\nA state space (SS) model provides a general framework for analyzing many dynamical phenomena.\nThe model consists of an underlying state, xt \u2208 Rn, with linear dynamics observed via yt \u2208 Rd. 
A\nlinear time-invariant SS model, in which the dynamics do not depend on time, is given by\n\n$$x_t = A x_{t-1} + e_t, \\qquad y_t = C x_t + w_t, \\tag{1}$$\n\nwhere et and wt are independent Gaussian noise processes with covariances \u03a3 and R, respectively.\nAn order r VAR process, denoted by VAR(r), with observations yt \u2208 Rd, can be defined as\n\n$$y_t = \\sum_{i=1}^{r} A_i y_{t-i} + e_t, \\qquad e_t \\sim \\mathcal{N}(0, \\Sigma). \\tag{2}$$\n\nHere, the observations depend linearly on the previous r observation vectors. Every VAR(r) process\ncan be described in SS form by, for example, the following transformation:\n\n$$x_t = \\begin{bmatrix} A_1 & A_2 & \\cdots & A_r \\\\ I & 0 & \\cdots & 0 \\\\ \\vdots & \\ddots & \\ddots & \\vdots \\\\ 0 & \\cdots & I & 0 \\end{bmatrix} x_{t-1} + \\begin{bmatrix} I \\\\ 0 \\\\ \\vdots \\\\ 0 \\end{bmatrix} e_t, \\qquad y_t = \\begin{bmatrix} I & 0 & \\cdots & 0 \\end{bmatrix} x_t. \\tag{3}$$\n\nNote that there are many such equivalent minimal SS representations that result in the same input-\noutput relationship, where minimality implies that there does not exist a realization with lower state\ndimension. On the other hand, not every SS model may be expressed as a VAR(r) process for finite\nr [1]. We can thus conclude that considering a class of SS models with state dimension r \u00b7 d and\narbitrary dynamic matrix A subsumes the class of VAR(r) processes.\nThe dynamical phenomena we examine in this paper exhibit behaviors better modeled as switches\nbetween a set of linear dynamical models. Due to uncertainty in the mode of the process, the overall\nmodel is nonlinear. We define a switching linear dynamical system (SLDS) by\n\n$$x_t = A^{(z_t)} x_{t-1} + e_t(z_t), \\qquad y_t = C x_t + w_t. \\tag{4}$$\n\nThe first-order Markov process zt indexes the mode-specific LDS at time t, which is driven by\nGaussian noise et(zt) \u223c N (0, \u03a3(zt)). We similarly define a switching VAR(r) process by\n\n$$y_t = \\sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t), \\qquad e_t(z_t) \\sim \\mathcal{N}(0, \\Sigma^{(z_t)}). \\tag{5}$$\n\nNote that the underlying state dynamics of the SLDS are equivalent to a switching VAR(1) process.\n3 Background: Dirichlet Processes and the Sticky HDP-HMM\nA Dirichlet process (DP), denoted by DP(\u03b3, H), is a distribution on discrete measures\n\n$$G_0 = \\sum_{k=1}^{\\infty} \\beta_k \\delta_{\\theta_k}, \\qquad \\theta_k \\sim H \\tag{6}$$\n\non a parameter space \u0398. The weights are generated via a stick-breaking construction [11]:\n\n$$\\beta_k = \\beta'_k \\prod_{\\ell=1}^{k-1} (1 - \\beta'_\\ell), \\qquad \\beta'_k \\sim \\mathrm{Beta}(1, \\gamma). \\tag{7}$$\n\nWe denote this distribution by \u03b2 \u223c GEM(\u03b3). The DP is commonly used as a prior on the parameters\nof a mixture model, resulting in a DP mixture model (see Fig. 1(a)). To generate observations, we\nchoose \u00af\u03b8i \u223c G0 and yi \u223c F (\u00af\u03b8i). This sampling process is often described via a discrete variable\nzi \u223c \u03b2 indicating which component generates yi \u223c F (\u03b8zi).\n\nFigure 1: For all graphs, \u03b2 \u223c GEM(\u03b3) and \u03b8k \u223c H(\u03bb). (a) DP mixture model in which zi \u223c \u03b2 and\nyi \u223c f (y | \u03b8zi ). (b) HDP mixture model with \u03c0j \u223c DP(\u03b1, \u03b2), zji \u223c \u03c0j, and yji \u223c f (y | \u03b8zji ). (c)-(d)\nSticky HDP-HMM prior on switching VAR(2) and SLDS processes with the mode evolving as zt+1 \u223c \u03c0zt for\n\u03c0k \u223c DP(\u03b1 + \u03ba, (\u03b1\u03b2 + \u03ba\u03b4k)/(\u03b1 + \u03ba)). The dynamical processes are as in Eq. (13).\n\nThe hierarchical Dirichlet process (HDP) [12] extends the DP to cases in which groups of data are\nproduced by related, yet distinct, generative processes. 
Taking a hierarchical Bayesian approach, the\nHDP draws G0 from a Dirichlet process prior DP(\u03b3, H), and then draws group-specific distributions\nGj \u223c DP(\u03b1, G0). Here, the base measure G0 acts as an \u201caverage\u201d distribution (E[Gj | G0] = G0)\nencoding the frequency of each shared, global parameter:\n\n$$G_j = \\sum_{t=1}^{\\infty} \\tilde{\\pi}_{jt} \\delta_{\\tilde{\\theta}_{jt}}, \\qquad \\tilde{\\pi}_j \\sim \\mathrm{GEM}(\\alpha) \\tag{8}$$\n\n$$\\phantom{G_j} = \\sum_{k=1}^{\\infty} \\pi_{jk} \\delta_{\\theta_k}, \\qquad \\pi_j \\sim \\mathrm{DP}(\\alpha, \\beta). \\tag{9}$$\n\nBecause G0 is discrete, multiple \u02dc\u03b8jt \u223c G0 may take identical values \u03b8k. Eq. (9) aggregates these\nprobabilities, allowing an observation yji to be directly associated with the unique global parameters\nvia an indicator random variable zji \u223c \u03c0j. See Fig. 1(b).\nAn alternative, non-constructive characterization of samples G0 \u223c DP(\u03b3, H) from a Dirichlet\nprocess states that for every finite partition {A1, . . . , AK} of \u0398,\n\n$$(G_0(A_1), \\ldots, G_0(A_K)) \\sim \\mathrm{Dir}(\\gamma H(A_1), \\ldots, \\gamma H(A_K)). \\tag{10}$$\n\nUsing this expression, it can be shown that the following finite, hierarchical mixture model converges\nin distribution to the HDP as L \u2192 \u221e [7, 12]:\n\n$$\\beta \\sim \\mathrm{Dir}(\\gamma/L, \\ldots, \\gamma/L), \\qquad \\pi_j \\sim \\mathrm{Dir}(\\alpha\\beta_1, \\ldots, \\alpha\\beta_L). \\tag{11}$$\n\nThis weak limit approximation is used by the sampler of Sec. 4.2.\n\nThe HDP can be used to develop an HMM with a potentially infinite mode space [2, 12]. For\nthis HDP-HMM, each HDP group-specific distribution, \u03c0j, is a mode-specific transition distribution\nand, due to the infinite mode space, there are infinitely many groups. Let zt denote the mode of the\nMarkov chain at time t. For discrete Markov processes zt \u223c \u03c0zt\u22121, so that zt\u22121 indexes the group\nto which yt is assigned. The current HMM mode zt then indexes the parameter \u03b8zt used to generate\nobservation yt. 
See Fig. 1(c), ignoring the direct correlation in the observations.\nBy sampling \u03c0j \u223c DP(\u03b1, \u03b2), the HDP prior encourages modes to have similar transition distributions\n(E[\u03c0jk | \u03b2] = \u03b2k). However, it does not differentiate self-transitions from moves between\nmodes. When modeling dynamical processes with mode persistence, the flexible nature of\nthe HDP-HMM prior allows for mode sequences with unrealistically fast dynamics to have large\nposterior probability. Recently, it has been shown [5] that one may mitigate this problem by instead\nconsidering a sticky HDP-HMM where \u03c0j is distributed as follows:\n\n$$\\pi_j \\sim \\mathrm{DP}\\left(\\alpha + \\kappa, \\frac{\\alpha\\beta + \\kappa\\delta_j}{\\alpha + \\kappa}\\right). \\tag{12}$$\n\nHere, (\u03b1\u03b2 + \u03ba\u03b4j) indicates that an amount \u03ba > 0 is added to the jth component of \u03b1\u03b2. The measure\nof \u03c0j over a finite partition (Z1, . . . , ZK) of the positive integers Z+, as described by Eq. (10), adds\nan amount \u03ba only to the arbitrarily small partition containing j, corresponding to a self-transition.\nWhen \u03ba = 0 the original HDP-HMM is recovered. We place a vague prior on \u03ba and learn the\nself-transition bias from the data.\n4 The HDP-SLDS and HDP-AR-HMM Models\nFor greater modeling flexibility, we take a nonparametric approach in defining the mode space of\nour switching dynamical processes. Specifically, we develop extensions of the sticky HDP-HMM\nfor both the SLDS and switching VAR models. For the SLDS, we consider conditionally-dependent\nemissions of which only noisy observations are available (see Fig. 1(d)). For this model, which we\nrefer to as the HDP-SLDS, we place a prior on the parameters of the SLDS and infer their posterior\nfrom the data. We do, however, fix the measurement matrix, C, for reasons of identifiability. 
Let\n\u02dcC \u2208 Rd\u00d7n, n \u2265 d, be the measurement matrix associated with a dynamical system defined by \u02dcA,\nand assume \u02dcC has full row rank. Then, without loss of generality, we may consider C = [I 0] since\nthere exists an invertible transformation T such that the pair C = \u02dcCT = [I 0] and A = T \u22121 \u02dcAT\ndefines an equivalent input-output system. The dimensionality of I is determined by that of the data.\nOur choice of the number of columns of zeros is, in essence, a choice of model order.\n\nThe previous work of Fox et al. [6] considered a related, yet simpler formulation for modeling a\nmaneuvering target as a fixed LDS driven by a switching exogenous input. Since the number of\nmaneuver modes was assumed unknown, the exogenous input was taken to be the emissions of an\nHDP-HMM. This work can be viewed as an extension of the work by Caron et al. [3] in which\nthe exogenous input was an independent noise process generated from a DP mixture model. The\nHDP-SLDS is a major departure from these works since the dynamic parameters themselves change\nwith the mode and are learned from the data, providing a much more expressive model.\n\nThe switching VAR(r) process can similarly be posed as an HDP-HMM in which the observations\nare modeled as conditionally VAR(r). This model is referred to as the HDP-AR-HMM and is depicted\nin Fig. 1(c). The generative processes for these two models are summarized as follows:\n\n$$\\begin{array}{lll} & \\text{HDP-AR-HMM} & \\text{HDP-SLDS} \\\\ \\text{Mode dynamics} & z_t \\sim \\pi_{z_{t-1}} & z_t \\sim \\pi_{z_{t-1}} \\\\ \\text{Observation dynamics} & y_t = \\sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t) & x_t = A^{(z_t)} x_{t-1} + e_t(z_t), \\quad y_t = C x_t + w_t \\end{array} \\tag{13}$$\n\nHere, \u03c0j is as defined in Sec. 3 and the additive noise processes as in Sec. 2.\n4.1 Posterior Inference of Dynamic Parameters\nIn this section we focus on developing a prior to regularize the learning of different dynamical modes\nconditioned on a fixed mode assignment z1:T . 
For the SLDS, we analyze the posterior distribution of\nthe dynamic parameters given a fixed, known state sequence x1:T . Methods for learning the number\nof modes and resampling the sequences x1:T and z1:T are discussed in Sec. 4.2.\nConditioned on the mode sequence, one may partition the observations into K different linear regression\nproblems, where K = |{z1, . . . , zT }|. That is, for each mode k, we may form a matrix\nY(k) with Nk columns consisting of the observations yt with zt = k. Then,\n\n$$Y^{(k)} = A^{(k)} \\bar{Y}^{(k)} + E^{(k)}, \\tag{14}$$\n\nwhere $A^{(k)} = [A_1^{(k)} \\; \\cdots \\; A_r^{(k)}]$, \u00afY(k) is a matrix of lagged observations, and E(k) the associated\nnoise vectors. Let D(k) = {Y(k), \u00afY(k)}. The posterior distribution over the VAR(r) parameters\nassociated with the kth mode decomposes as follows:\n\n$$p(A^{(k)}, \\Sigma^{(k)} \\mid D^{(k)}) = p(A^{(k)} \\mid \\Sigma^{(k)}, D^{(k)}) \\, p(\\Sigma^{(k)} \\mid D^{(k)}). \\tag{15}$$\n\nWe place a conjugate matrix-normal inverse-Wishart prior on the parameters {A(k), \u03a3(k)} [13],\nproviding a reasonable combination of flexibility and analytical convenience. A matrix A \u2208 Rd\u00d7m\nhas a matrix-normal distribution MN (A; M , V , K) if\n\n$$p(A) = \\frac{|K|^{d/2}}{|2\\pi V|^{m/2}} \\exp\\left(-\\tfrac{1}{2} \\mathrm{tr}\\left((A - M)^T V^{-1} (A - M) K\\right)\\right), \\tag{16}$$\n\nwhere M is the mean matrix and V and K \u22121 are the covariances along the rows and columns,\nrespectively. A vectorization of the matrix A results in\n\n$$p(\\mathrm{vec}(A)) = \\mathcal{N}(\\mathrm{vec}(M), K^{-1} \\otimes V), \\tag{17}$$\n\nwhere \u2297 denotes the Kronecker product. The resulting posterior is derived as\n\n$$p(A^{(k)} \\mid \\Sigma^{(k)}, D^{(k)}) = \\mathcal{MN}\\left(A^{(k)}; S_{y\\bar{y}}^{(k)} S_{\\bar{y}\\bar{y}}^{-(k)}, \\Sigma^{-(k)}, S_{\\bar{y}\\bar{y}}^{(k)}\\right), \\tag{18}$$\n\nwith B\u2212(k) denoting (B(k))\u22121 for a given matrix B, and\n\n$$S_{\\bar{y}\\bar{y}}^{(k)} = \\bar{Y}^{(k)} \\bar{Y}^{(k)T} + K, \\qquad S_{y\\bar{y}}^{(k)} = Y^{(k)} \\bar{Y}^{(k)T} + MK, \\qquad S_{yy}^{(k)} = Y^{(k)} Y^{(k)T} + MKM^T.$$\n\nWe place an inverse-Wishart prior IW(S0, n0) on \u03a3(k). Then,\n\n$$p(\\Sigma^{(k)} \\mid D^{(k)}) = \\mathrm{IW}\\left(S_{y|\\bar{y}}^{(k)} + S_0, N_k + n_0\\right), \\tag{19}$$\n\nwhere $S_{y|\\bar{y}}^{(k)} = S_{yy}^{(k)} - S_{y\\bar{y}}^{(k)} S_{\\bar{y}\\bar{y}}^{-(k)} S_{y\\bar{y}}^{(k)T}$. When A is simply a vector, the matrix-normal inverse-Wishart\nprior reduces to the normal inverse-Wishart prior with scale parameter K.\nFor the HDP-SLDS, we additionally place an IW(R0, r0) prior on the measurement noise covariance\nR, which is shared between modes. The posterior distribution is given by\n\n$$p(R \\mid y_{1:T}, x_{1:T}) = \\mathrm{IW}(S_R + R_0, T + r_0), \\tag{20}$$\n\nwith $S_R = \\sum_{t=1}^{T} (y_t - C x_t)(y_t - C x_t)^T$. Further details are provided in supplemental Appendix I.\n\n4.2 Gibbs Sampler\nFor the switching VAR(r) process, our sampler iterates between sampling the mode sequence, z1:T ,\nand both the dynamic and sticky HDP-HMM parameters. The sampler for the SLDS is identical to\nthat of a switching VAR(1) process with the additional step of sampling the state sequence, x1:T ,\nand conditioning on the state sequence when resampling dynamic parameters. The resulting Gibbs\nsampler is described below and further elaborated upon in supplemental Appendix II.\nSampling Dynamic Parameters Conditioned on a sample of the mode sequence, z1:T , and the observations,\ny1:T , or state sequence, x1:T , we can sample the dynamic parameters \u03b8 = {A(k), \u03a3(k)}\nfrom the posterior density described in Sec. 4.1. 
For the HDP-SLDS, we additionally sample R.\nSampling z1:T As shown in [5], the mixing rate of the Gibbs sampler for the HDP-HMM can\nbe dramatically improved by using a truncated approximation to the HDP, such as the weak limit\napproximation, and jointly sampling the mode sequence using a variant of the forward-backward\nalgorithm. Specifically, we compute backward messages mt+1,t(zt) \u221d p(yt+1:T |zt, yt\u2212r+1:t, \u03c0, \u03b8)\nand then recursively sample each zt conditioned on zt\u22121 from\n\n$$p(z_t \\mid z_{t-1}, y_{1:T}, \\pi, \\theta) \\propto p(z_t \\mid \\pi_{z_{t-1}}) \\, p(y_t \\mid y_{t-r:t-1}, A^{(z_t)}, \\Sigma^{(z_t)}) \\, m_{t+1,t}(z_t), \\tag{21}$$\n\nwhere $p(y_t \\mid y_{t-r:t-1}, A^{(z_t)}, \\Sigma^{(z_t)}) = \\mathcal{N}(\\sum_{i=1}^{r} A_i^{(z_t)} y_{t-i}, \\Sigma^{(z_t)})$. Joint sampling of the mode sequence\nis especially important when the observations are directly correlated via a dynamical process\nsince this correlation further slows the mixing rate of the direct assignment sampler of [12]. Note\nthat the approximation of Eq. (11) retains the HDP\u2019s nonparametric nature by encouraging the use\nof fewer than L components while allowing the generation of new components, upper bounded by\nL, as new data are observed.\nSampling x1:T (HDP-SLDS only) Conditioned on the mode sequence z1:T and the set of dynamic\nparameters \u03b8, our dynamical process simplifies to a time-varying linear dynamical system. 
We can then block sample x1:T by first running a backward filter to compute mt+1,t(xt) \u221d\np(yt+1:T |xt, zt+1:T , \u03b8) and then recursively sampling each xt conditioned on xt\u22121 from\n\n$$p(x_t \\mid x_{t-1}, y_{1:T}, z_{1:T}, \\theta) \\propto p(x_t \\mid x_{t-1}, A^{(z_t)}, \\Sigma^{(z_t)}) \\, p(y_t \\mid x_t, R) \\, m_{t+1,t}(x_t). \\tag{22}$$\n\nThe messages are given in information form by mt,t\u22121(xt\u22121) \u221d N \u22121(xt\u22121; \u03b8t,t\u22121, \u039bt,t\u22121), where\nthe information parameters are recursively defined as\n\n$$\\begin{aligned} \\theta_{t,t-1} &= A^{(z_t)T} \\Sigma^{-(z_t)} (\\Sigma^{-(z_t)} + C^T R^{-1} C + \\Lambda_{t+1,t})^{-1} (C^T R^{-1} y_t + \\theta_{t+1,t}) \\\\ \\Lambda_{t,t-1} &= A^{(z_t)T} \\Sigma^{-(z_t)} A^{(z_t)} - A^{(z_t)T} \\Sigma^{-(z_t)} (\\Sigma^{-(z_t)} + C^T R^{-1} C + \\Lambda_{t+1,t})^{-1} \\Sigma^{-(z_t)} A^{(z_t)}. \\end{aligned} \\tag{23}$$\n\nSee supplemental Appendix II for a more numerically stable version of this recursion.\n\n[Figure 2 plots omitted: panel (a) shows observations against time; panels (b)-(e) show normalized Hamming distance against Gibbs iteration.]\n\nFigure 2: (a) Observation sequence (blue, green, red) and associated mode sequence (magenta) for a 5-mode\nswitching VAR(1) process (top), 3-mode switching AR(2) process (middle), and 3-mode SLDS (bottom). 
The\nassociated 10th, 50th, and 90th Hamming distance quantiles over 100 trials are shown for the (b) HDP-VAR(1)-\nHMM, (c) HDP-VAR(2)-HMM, (d) HDP-SLDS with C = I (top and bottom) and C = [1 0] (middle), and\n(e) sticky HDP-HMM using \ufb01rst difference observations.\n\n5 Results\nSynthetic Data In Fig. 2, we compare the performance of the HDP-VAR(1)-HMM, HDP-VAR(2)-\nHMM, HDP-SLDS, and a baseline sticky HDP-HMM on three sets of test data (see Fig. 2(a)). The\nHamming distance error is calculated by \ufb01rst choosing the optimal mapping of indices maximiz-\ning overlap between the true and estimated mode sequences. For the \ufb01rst scenario, the data were\ngenerated from a 5-mode switching VAR(1) process. The three switching linear dynamical models\nprovide comparable performance since both the HDP-VAR(2)-HMM and HDP-SLDS with C = I\ncontain the class of HDP-VAR(1)-HMMs. Note that the HDP-SLDS sampler is slower to mix since\nthe hidden, three-dimensional continuous state is also sampled. In the second scenario, the data were\ngenerated from a 3-mode switching AR(2) process. The HDP-AR(2)-HMM has signi\ufb01cantly better\nperformance than the HDP-AR(1)-HMM while the performance of the HDP-SLDS with C = [1 0]\nis comparable after burn-in. As shown in Sec. 2, this HDP-SLDS model encompasses the class of\nHDP-AR(2)-HMMs. The data in the third scenario were generated from a 3-mode SLDS model\nwith C = I. Here, we clearly see that neither the HDP-VAR(1)-HMM nor HDP-VAR(2)-HMM is\nequivalent to the HDP-SLDS. Together, these results demonstrate both the differences between our\nmodels as well as the models\u2019 ability to learn switching processes with varying numbers of modes.\nFinally, note that all of the switching models yielded signi\ufb01cant improvements relative to the base-\nline sticky HDP-HMM, even when the latter was given \ufb01rst differences of the observations. 
This\ninput representation, which is equivalent to an HDP-VAR(1)-HMM with random walk dynamics\n(A(k) = I for all k), is more effective than using raw observations for HDP-HMM learning, but still\nmuch less effective than richer models which switch among learned LDS.\n\nIBOVESPA Stock Index We test the HDP-SLDS model on the IBOVESPA stock index (Sao\nPaulo Stock Exchange) over the period of 01/03/1997 to 01/16/2001. There are ten key world\nevents shown in Fig. 3 and cited in [4] as affecting the emerging Brazilian market during this time\nperiod. In [4], a 2-mode Markov switching stochastic volatility (MSSV) model is used to identify\nperiods of higher volatility in the daily returns. The MSSV assumes that the log-volatilities follow an\nAR(1) process with a Markov switching mean. This underlying process is observed via conditionally\nindependent and normally distributed daily returns. The HDP-SLDS is able to infer very similar\nchange points to those presented in [4]. Interestingly, the HDP-SLDS consistently identifies three\nregimes of volatility versus the assumed 2-mode model. In Fig. 3, the overall performance of the\nHDP-SLDS is compared to that of the HDP-AR(1)-HMM, HDP-AR(2)-HMM, and HDP-SLDS\nwith no bias for self-transitions (i.e., \u03ba = 0). The ROC curves shown in Fig. 3(d) are calculated\nby windowing the time axis and taking the maximum probability of a change point in each window.\nThese probabilities are then used as the confidence of a change point in that window. We clearly\nsee the advantage of using a SLDS model combined with the sticky HDP-HMM prior on the mode\nsequence. Without the sticky extension, the HDP-SLDS over-segments the data and rapidly switches\nbetween redundant states which leads to a dramatically larger number of inferred change points.\n\n[Figure 3 plots omitted: panels (b)-(c) show probability of a change point against date; panel (d) shows detection rate against false alarm rate.]\n\nFigure 3: (a) IBOVESPA stock index daily returns from 01/03/1997 to 01/16/2001. (b) Plot of the estimated\nprobability of a change point on each day using 3000 Gibbs samples for the HDP-SLDS. The 10 key events are\nindicated with red lines. (c) Similar plot for the non-sticky HDP-SLDS with no bias towards self-transitions.\n(d) ROC curves for the HDP-SLDS, non-sticky HDP-SLDS, HDP-AR(1)-HMM, and HDP-AR(2)-HMM.\n\nDancing Honey Bees We test the HDP-VAR(1)-HMM on a set of six dancing honey bee sequences,\naiming to segment the sequences into the three dances displayed in Fig. 4. (Note that we\ndid not see performance gains by considering the HDP-SLDS, so we omit showing results for that\narchitecture.) The data consist of measurements yt = [cos(\u03b8t) sin(\u03b8t) xt yt]T , where (xt, yt)\ndenotes the 2D coordinates of the bee\u2019s body and \u03b8t its head angle. We compare our results to\nthose of Xuan and Murphy [14], who used a change-point detection technique for inference on this\ndataset. As shown in Fig. 4(d)-(e), our model achieves a superior segmentation compared to the\nchange-point formulation in almost all cases, while also identifying modes which reoccur over time.\n\nOh et al. [8] also presented an analysis of the honey bee data, using an SLDS with a fixed number of\nmodes. 
Unfortunately, that analysis is not directly comparable to ours, because [8] used their SLDS\nin a supervised formulation in which the ground truth labels for all but one of the sequences are\nemployed in the inference of the labels for the remaining held-out sequence, and in which the kernels\nused in the MCMC procedure depend on the ground truth labels. (The authors also considered a\n\u201cparameterized segmental SLDS (PS-SLDS),\u201d which makes use of domain knowledge speci\ufb01c to\nhoney bee dancing and requires additional supervision during the learning process.) Nonetheless,\nin Table 1 we report the performance of these methods as well as the median performance (over\n100 trials) of the unsupervised HDP-VAR(1)-HMM to provide a sense of the level of performance\nachievable without detailed, manual supervision. As seen in Table 1, the HDP-VAR(1)-HMM yields\nvery good performance on sequences 4 to 6 in terms of the learned segmentation and number of\nmodes (see Fig. 4(a)-(c)); the performance approaches that of the supervised method. For sequences\n1 to 3\u2014which are much less regular than sequences 4 to 6\u2014the performance of the unsupervised\nprocedure is substantially worse. This motivated us to also consider a partially supervised variant\nof the HDP-VAR(1)-HMM in which we \ufb01x the ground truth mode sequences for \ufb01ve out of six of\nthe sequences, and jointly infer both a combined set of dynamic parameters and the left-out mode\nsequence. As we see in Table 1, this considerably improved performance for these three sequences.\n\nNot depicted in the plots in Fig. 4 is the extreme variation in head angle during the waggle dances\nof sequences 1 to 3. This dramatically affects our performance since we do not use domain-speci\ufb01c\ninformation. Indeed, our learned segmentations consistently identify turn-right and turn-left modes,\nbut often create a new, sequence-speci\ufb01c waggle dance mode. 
Many of our errors can be attributed to\ncreating multiple waggle dance modes within a sequence. Overall, however, we are able to achieve\nreasonably good segmentations without having to manually input domain-specific knowledge.\n\n[Figure 4 plots omitted: top panels (1)-(6) show bee trajectories; panels (a)-(c) show estimated mode against time; panels (d)-(e) show detection rate against false alarm rate.]\n\nFigure 4: (top) Trajectories of the dancing honey bees for sequences 1 to 6, colored by waggle (red), turn\nright (blue), and turn left (green) dances. (a)-(c) Estimated mode sequences representing the median error for\nsequences 4, 5, and 6 at the 200th Gibbs iteration, with errors indicated in red. (d)-(e) ROC curves for the\nunsupervised HDP-VAR-HMM, partially supervised HDP-VAR-HMM, and change-point formulation of [14]\nusing the Viterbi sequence for segmenting datasets 1-3 and 4-6, respectively.\n\nSequence | 1 | 2 | 3 | 4 | 5 | 6\nHDP-VAR(1)-HMM unsupervised | 46.5 | 44.1 | 45.6 | 83.2 | 93.2 | 88.7\nHDP-VAR(1)-HMM partially supervised | 65.9 | 88.5 | 79.2 | 86.9 | 92.3 | 89.1\nSLDS DD-MCMC | 74.0 | 86.1 | 81.3 | 93.4 | 90.2 | 90.4\nPS-SLDS DD-MCMC | 75.9 | 92.4 | 83.1 | 93.4 | 90.4 | 91.0\n\nTable 1: Median label accuracy of the HDP-VAR(1)-HMM using unsupervised and partially supervised Gibbs\nsampling, compared to accuracy of the supervised PS-SLDS and SLDS procedures, where the latter algorithms\nwere based on a supervised MCMC procedure (DD-MCMC) [8].\n\n6 Discussion\n\nIn this paper, we have addressed the problem of learning switching linear dynamical models with\nan unknown number of modes for describing complex dynamical phenomena. We presented a nonparametric\nBayesian approach and demonstrated both the utility and versatility of the developed\nHDP-SLDS and HDP-AR-HMM on real applications. Using the same parameter settings, in one\ncase we are able to learn changes in the volatility of the IBOVESPA stock exchange while in another\ncase we learn segmentations of data into waggle, turn-right, and turn-left honey bee dances.\nAn interesting direction for future research is learning models of varying order for each mode.\n\nReferences\n[1] M. Aoki and A. Havenner. State space modeling of multiple time series. Econ. Rev., 10(1):1\u201359, 1991.\n[2] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.\n[3] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe. Bayesian inference for dynamic models with Dirichlet process mixtures. In Int. Conf. Inf. Fusion, July 2006.\n[4] C. Carvalho and H. Lopes. Simulation-based sequential analysis of Markov switching stochastic volatility models. Comp. Stat. & Data Anal., 2006.\n[5] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state persistence. In ICML, 2008.\n[6] E. B. Fox, E. B. Sudderth, and A. S. Willsky. Hierarchical Dirichlet processes for tracking maneuvering targets. In Int. Conf. Inf. Fusion, July 2007.\n[7] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Can. J. Stat., 30:269\u2013283, 2002.\n[8] S. Oh, J. Rehg, T. Balch, and F. Dellaert. Learning and inferring motion patterns using parametric segmental switching linear dynamic systems. IJCV, 77(1\u20133):103\u2013124, 2008.\n[9] V. Pavlovi\u0107, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In NIPS, 2000.\n[10] X. Rong Li and V. Jilkov. Survey of maneuvering target tracking. Part V: Multiple-model methods. IEEE Trans. Aerosp. Electron. Syst., 41(4):1255\u20131321, 2005.\n[11] J. Sethuraman. A constructive definition of Dirichlet priors. Stat. Sinica, 4:639\u2013650, 1994.\n[12] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Stat. Assoc., 101(476):1566\u20131581, 2006.\n[13] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1997.\n[14] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In ICML, 2007.\n", "award": [], "sourceid": 312, "authors": [{"given_name": "Emily", "family_name": "Fox", "institution": ""}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": ""}]}