{"title": "Small-Variance Asymptotics for Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2103, "page_last": 2111, "abstract": "Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a \u201chard\u201d inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that \u2014 particularly in the nonparametric setting \u2014 standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework.", "full_text": "Small-Variance Asymptotics for Hidden Markov\n\nModels\n\nAnirban Roychowdhury, Ke Jiang, Brian Kulis\nDepartment of Computer Science and Engineering\n\nroychowdhury.7@osu.edu, {jiangk,kulis}@cse.ohio-state.edu\n\nThe Ohio State University\n\nAbstract\n\nSmall-variance asymptotics provide an emerging technique for obtaining scalable\ncombinatorial algorithms from rich probabilistic models. We present a small-\nvariance asymptotic analysis of the Hidden Markov Model and its in\ufb01nite-state\nBayesian nonparametric extension. 
Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that—particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework.

1 Introduction
Inference in large-scale probabilistic models remains a challenge, particularly for modern “big data” problems. While graphical models are undisputedly important as a way to build rich probability distributions, existing sampling-based and variational inference techniques still leave some applications out of reach.
A recent thread of research has considered small-variance asymptotics of latent-variable models as a way to capture the benefits of rich graphical models while also providing a framework for designing more scalable combinatorial optimization algorithms. Such models are often motivated by the well-known connection between mixtures of Gaussians and k-means: as the variances of the Gaussians tend to zero, the mixture of Gaussians model approaches k-means, both in terms of objectives and algorithms. 
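To make this limit concrete, here is a minimal numerical sketch (the one-dimensional data point and means are made up, not from the paper) showing Gaussian-mixture E-step responsibilities collapsing to a hard, k-means-style assignment as the shared variance shrinks:

```python
import math

def responsibilities(x, means, sigma2):
    """E-step responsibilities for an equal-weight Gaussian mixture with shared variance."""
    logs = [-((x - m) ** 2) / (2.0 * sigma2) for m in means]
    mx = max(logs)  # subtract the max for numerical stability
    w = [math.exp(l - mx) for l in logs]
    s = sum(w)
    return [v / s for v in w]

means = [0.0, 5.0, 10.0]  # hypothetical component means
x = 3.9                   # closest to the mean at 5.0

soft = responsibilities(x, means, sigma2=10.0)   # large variance: mass spread out
hard = responsibilities(x, means, sigma2=0.01)   # small variance: nearly one-hot
# soft spreads mass across components; hard concentrates on the nearest mean,
# exactly the k-means assignment rule.
```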
This approach has recently been extended beyond the standard Gaussian mixture in various ways—to Dirichlet process mixtures and hierarchical Dirichlet processes [8], to non-Gaussian observations in the nonparametric setting [7], and to feature learning via the Beta process [5]. The small-variance analysis for each of these models yields simple algorithms that feature many of the benefits of the probabilistic models but with increased scalability. In essence, small-variance asymptotics provides a link connecting some probabilistic graphical models with non-probabilistic combinatorial counterparts.
Thus far, small-variance asymptotics has been applied only to fairly simple latent-variable models. In particular, to our knowledge there has been no such analysis for sequential data models such as the Hidden Markov Model (HMM), nor for its nonparametric counterpart, the infinite-state HMM (iHMM). HMMs are among the most widely used probabilistic models for discrete sequence data, with diverse applications including DNA sequence analysis, natural language processing, and speech recognition [4]. An HMM consists of a discrete hidden state sequence that evolves according to Markov assumptions, along with independent observations at each time step depending on the hidden state. The learning problem is to estimate the model given only the observation data.
To develop scalable algorithms for sequential data, we begin by applying small-variance analysis to the standard parametric HMM. In the small-variance limit, we obtain a penalized k-means problem where the penalties capture the state transitions. Further, a special case of the resulting model yields segmental k-means [9]. For the nonparametric model we obtain an objective that effectively combines the asymptotics from the parametric HMM with the asymptotics for the Hierarchical Dirichlet Process [8]. 
We obtain a k-means-like objective with three penalties: one for state transitions, one for the number of reachable states out of each state, and one for the number of total states. The key aspect of our resulting formulation is that, unlike the standard sampler for the infinite-state HMM, dynamic programming can be used. In particular, we describe a simple algorithm that monotonically decreases the underlying objective function. Finally, we present results comparing our non-probabilistic algorithms to their probabilistic counterparts, on a number of real and synthetic data sets.
Related Work. In the parametric setting (i.e., the standard HMM), several algorithms exist for maximum likelihood (ML) estimation, such as the Baum-Welch algorithm (a special instance of the EM algorithm) and the segmental k-means algorithm [9]. Infinite-state HMMs [3, 12] are nonparametric Bayesian extensions of the finite HMMs where hierarchical Dirichlet process (HDP) priors are used to allow for an unspecified number of states. Exact inference in this model is intractable, so one typically resorts to sampling methods. The standard Gibbs sampling methods [12] are notoriously slow to converge and cannot exploit the forward-backward structure of the HMMs. [6] presents a Beam sampling method which bypasses this obstacle via slice sampling, where only a finite number of hidden states are considered in each iteration. However, this approach is still computationally intensive since it works in the non-collapsed space. Thus infinite-state HMMs, while desirable from a modeling perspective, have been limited by their inability to scale to large data sets—this is precisely the situation in which small-variance asymptotics has the potential to be beneficial.
Connections between the mixture of Gaussians model and k-means are widely known. 
Beyond the references discussed earlier, we note that a similar connection relating probabilistic PCA and standard PCA was discussed in [13, 10], as well as connections between support vector machines and a restricted Bayes optimal classifier in [14].

2 Asymptotics of the finite-state HMM
We begin, as a warm-up, with the simpler parametric (finite-state) HMM model, and show that small-variance asymptotics on the joint log likelihood yields a penalized k-means objective, and on the EM algorithm yields a generalized segmental k-means algorithm. The tools developed in this section will then be used for the more involved nonparametric model.

2.1 The Model
The Hidden Markov Model assumes a hidden state sequence Z = {z_1, ..., z_N} drawn from a finite discrete state space {1, ..., K}, coupled with the observation sequence X = {x_1, ..., x_N}. The resulting generative model defines a probability distribution over the hidden state sequence Z and the observation sequence X. Let T ∈ R^{K×K} be the stationary transition probability matrix of the hidden state sequence, with row T_i· = π_i ∈ R^K being a distribution over the latent states. For clarity of presentation, we will use a binary 1-of-K representation for the latent state assignments. That is, we write the event that the latent state at time step t is k as z_tk = 1 and z_tl = 0 for all l = 1, ..., K, l ≠ k. Then the transition probabilities can be written as T_ij = Pr(z_tj = 1 | z_{t−1,i} = 1). The initial state distribution is π_0 ∈ R^K. The Markov structure dictates that z_t ∼ Mult(π_{z_{t−1}}), and the observations follow x_t ∼ Φ(θ_{z_t}). 
The observation density Φ is assumed invariant, and the Markov structure induces conditional independence of the observations given the latent states.
In the following, we present the asymptotic treatment for the finite HMM with Gaussian emission densities Pr(x_t | z_tk = 1) = N(x_t | µ_k, σ²I_d). Here θ_{z_t} = µ_k, since the parameter space Θ contains only the emission means. Generalization to exponential family emission densities is straightforward [7]. At a high level, the connection we seek to establish can be proven in two ways. The first approach is to examine small-variance asymptotics directly on the joint probability distribution of the HMM, as done in [5] for clustering and feature learning problems. We will primarily focus on this approach, since our ideas can be more clearly expressed by this technique, and it is independent of any inference algorithm. The other approach is to analyze the behavior of the EM algorithm as the variance goes to zero. We will briefly discuss this approach as well, but for further details the interested reader can consult the supplementary material.

2.1.1 Exponential Family Transitions

Our main analysis relies on appropriate manipulation of the transition probabilities, where we will use the bijection between exponential families and Bregman divergences established in [2]. Since the conditional distribution of the latent state at any time step is multinomial in the transition probabilities from the previous state, we use the aforementioned bijection to refactor the transition probabilities in the joint distribution of the HMM into a form that utilizes Bregman divergences. 
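The identity underlying this refactoring is easy to check numerically: for a one-hot z_t, the multinomial log-likelihood equals −KL(z_t, m_j), where m_j is the j-th row of the transition matrix. A quick sketch (the transition row values below are illustrative only):

```python
import math

def kl(z, m):
    """KL divergence sum_k z_k log(z_k / m_k), with the 0 log 0 = 0 convention."""
    return sum(zk * math.log(zk / mk) for zk, mk in zip(z, m) if zk > 0)

m_j = [0.7, 0.2, 0.1]  # hypothetical transition row T_j.
z_t = [0, 1, 0]        # one-hot: latent state 2 at time t

log_lik = math.log(m_j[1])  # log of prod_k T_jk^{z_tk} for this one-hot z_t
assert abs(kl(z_t, m_j) + log_lik) < 1e-12  # KL(z_t, m_j) = -log T_{j,2}
```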
This, with an appropriate scaling to enable small-variance asymptotics as mentioned in [7], allows us to combine the emission and transition distributions into a simple objective function.
We denote T_jk = Pr(z_tk = 1 | z_{t−1,j} = 1) as before, and write the multinomial distribution for the latent state at time step t as

Pr(z_t \mid z_{t-1,j} = 1) = \prod_{k=1}^{K} T_{jk}^{z_{tk}}.   (1)

In order to apply small-variance asymptotics, we must allow the variance in the transition probabilities to go to zero in a reasonable way. Following the treatment in [2], we can rewrite this distribution in a suitable exponential family notation, which we then express in the following equivalent form:

Pr(z_t \mid z_{t-1,j} = 1) = \exp(-d_\phi(z_t, m_j)) \, b_\phi(z_t),   (2)

where the Bregman divergence d_\phi(z_t, m_j) = \sum_{k=1}^{K} z_{tk} \log(z_{tk}/T_{jk}) = KL(z_t, m_j), m_j = \{T_{jk}\}_{k=1}^{K}, and b_\phi(z_t) = 1. See the supplementary notes for derivation details. The prime motivation for using this form is that we can appropriately scale the variance of the exponential family distribution following Lemma 3.1 of [7]. In particular, if we introduce a new parameter \hat{\beta} and generalize the transition probabilities to be

Pr(z_t \mid z_{t-1,j} = 1) = \exp(-\hat{\beta} \, d_\phi(z_t, m_j)) \, b_{\tilde{\phi}}(z_t),

where \tilde{\phi} = \hat{\beta}\phi, then the mean of the distribution is the same in the scaled distribution (namely, m_j) while the variance is scaled. As \hat{\beta} \to \infty, the variance goes to zero.
The next step is to link the emission and transition probabilities so that the variance is scaled appropriately in both. In particular, we will define \beta = 1/(2\sigma^2) and then let \hat{\beta} = \lambda\beta for some \lambda. 
The Gaussian emission densities can now be written as Pr(x_t | z_tk = 1) = \exp(-\beta \|x_t - \mu_k\|_2^2) f(\beta) and the transition probabilities as Pr(z_t | z_{t-1,j} = 1) = \exp(-\lambda\beta \, d_\phi(z_t, m_j)) \, b_{\tilde{\phi}}(z_t). See [7] for further details about the scaling operation.

2.1.2 Joint Probability Asymptotics

We now have all the background development required to perform small-variance asymptotics on the HMM joint probability, and derive the segmental k-means algorithm. Our parameters of interest are the Z = [z_1, ..., z_N] vectors, the µ = [µ_1, ..., µ_K] means, and the transition parameter matrix T. We can write down the joint likelihood by taking a product of all the probabilities in the model:

p(X, Z) = p(z_1) \prod_{t=2}^{N} p(z_t \mid z_{t-1}) \prod_{t=1}^{N} N(x_t \mid \mu_{z_t}, \sigma^2 I_d).

With some abuse of notation, let m_{z_{t−1}} denote the mean transition vector given by the assignment z_{t−1} (that is, if z_{t−1,j} = 1 then m_{z_{t−1}} = m_j). The exponential family probabilities above allow us to rewrite this joint likelihood as

p(X, Z) \propto \exp\Big[ -\beta \Big( \sum_{t=1}^{N} \|x_t - \mu_{z_t}\|_2^2 + \lambda \sum_{t=2}^{N} KL(z_t, m_{z_{t-1}}) \Big) + \log p(z_1) \Big].   (3)

To obtain the corresponding non-probabilistic objective from small-variance asymptotics, we consider the MAP estimate obtained by maximizing the joint likelihood with respect to the parameters asymptotically as σ² goes to zero (β goes to ∞). In our case, it is particularly simple given the joint likelihood above. 
The log-likelihood easily yields the following asymptotically:

\max_{Z, \mu, T} \; -\Big( \sum_{t=1}^{N} \|x_t - \mu_{z_t}\|_2^2 + \lambda \sum_{t=2}^{N} KL(z_t, m_{z_{t-1}}) \Big)   (4)

or equivalently,

\min_{Z, \mu, T} \; \sum_{t=1}^{N} \|x_t - \mu_{z_t}\|_2^2 + \lambda \sum_{t=2}^{N} KL(z_t, m_{z_{t-1}}).   (5)

Note that, as mentioned above, m_j = \{T_{jk}\}_{k=1}^{K}. We can view the above objective function as a “penalized” k-means problem, where the penalties are given by the transitions from state to state. One possible strategy to minimize (5) is to iteratively minimize with respect to each of the individual parameters (Z, µ, T) keeping the other two fixed. When fixing µ and T, and taking λ = 1, the solution for Z in (4) is identical to the MAP update on the latent variables Z for this model, as in a standard HMM. When λ ≠ 1, a simple generalization of the standard forward-backward routine can be used to find the optimal assignments. Keeping Z and T fixed, the update on µ_k is easily seen to be the equiweighted average of the data points assigned to latent state k in the MAP estimate (it is the same minimization as in k-means for updating cluster means). Finally, since KL(z_t, m_j) \propto -\sum_{k=1}^{K} z_{tk} \log T_{jk}, minimization with respect to T simply yields the empirical transition probabilities, that is, T_{jk,new} = (# of transitions from state j to k) / (# of transitions from state j), both counts taken from the MAP path computed during the maximization with respect to Z. 
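The alternating scheme just described can be sketched compactly. The following is an illustrative 1-D implementation; the small smoothing constant on the transition counts (to avoid log 0 for unused transitions) is our addition, not the paper's exact treatment:

```python
import math

def viterbi_path(xs, mus, T, lam):
    """Min-cost path for (5): squared error plus lam * (-log T[z_{t-1}][z_t])."""
    K = len(mus)
    cost = [(xs[0] - mus[k]) ** 2 for k in range(K)]
    back = []
    for x in xs[1:]:
        new_cost, ptr = [], []
        for k in range(K):
            best_j = min(range(K), key=lambda j: cost[j] - lam * math.log(T[j][k]))
            new_cost.append(cost[best_j] - lam * math.log(T[best_j][k]) + (x - mus[k]) ** 2)
            ptr.append(best_j)
        cost, back = new_cost, back + [ptr]
    # backtrack the minimum-cost state sequence
    z = [min(range(K), key=lambda k: cost[k])]
    for ptr in reversed(back):
        z.append(ptr[z[-1]])
    return list(reversed(z))

def segmental_kmeans(xs, mus, T, lam=1.0, iters=20):
    for _ in range(iters):
        z = viterbi_path(xs, mus, T, lam)
        # mean update: equiweighted average of points assigned to each state
        for k in range(len(mus)):
            pts = [x for x, zk in zip(xs, z) if zk == k]
            if pts:
                mus[k] = sum(pts) / len(pts)
        # transition update: empirical frequencies (tiny smoothing avoids log 0)
        K = len(mus)
        counts = [[1e-6] * K for _ in range(K)]
        for a, b in zip(z, z[1:]):
            counts[a][b] += 1.0
        T = [[c / sum(row) for c in row] for row in counts]
    return z, mus, T

xs = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2, 0.05, -0.1]
z, mus, T = segmental_kmeans(xs, mus=[0.0, 4.0], T=[[0.5, 0.5], [0.5, 0.5]])
# low-valued points land in state 0, high-valued points in state 1
```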
We observe that, when λ = 1, the iterative minimization algorithm to solve (5) is exactly the segmental k-means algorithm, also known as Viterbi re-estimation.

2.1.3 EM algorithm asymptotics

We can reach the same algorithm alternatively by writing down the steps of the EM algorithm and exploring the small-variance limit of these steps, analogous to the approach of [8] for a Dirichlet process mixture. Given space limitations (and the fact that the resulting algorithm is identical, as expected), a more detailed discussion can be found in the supplementary material.

3 Asymptotics of the Infinite Hidden Markov Model

We now tackle the more complex nonparametric model. We will derive the objective function directly as in the parametric case but, unlike the parametric version, we will not apply asymptotics to the existing sampler algorithms. Instead, we will present a new algorithm to optimize our derived objective function. By deriving an algorithm directly, we ensure that our method takes advantage of dynamic programming, unlike the standard sampler.

3.1 The Model
The iHMM, also known as the HDP-HMM [3, 12], is a nonparametric Bayesian extension of the HMM, where an HDP prior is used to allow for an unspecified number of states. The HDP is a set of Dirichlet processes (DPs) with a shared base distribution, which is itself drawn from a Dirichlet process [12]. Formally, we can write G_k ∼ DP(α, G_0) with a shared base distribution G_0 ∼ DP(γ, H), where H is the global base distribution that permits sharing of probability mass across the G_k. 
α and γ are the concentration parameters for the G_k and G_0 measures, respectively. To apply HDPs to sequential data, the iHMM can be formulated as follows:

\beta \sim GEM(\gamma), \quad \pi_k \mid \beta \sim DP(\alpha, \beta), \quad \theta_k \sim H,
z_t \mid z_{t-1} \sim Mult(\pi_{z_{t-1}}), \quad x_t \sim \Phi(\theta_{z_t}).

For a full Bayesian treatment, Gamma priors are placed on the concentration parameters (though we will not employ such priors in our asymptotic analysis).
Following the discussion in the parametric case, our goal is to write down the full joint likelihood in the above model. As discussed in [12], the Hierarchical Dirichlet Process yields assignments that follow the Chinese Restaurant Franchise (CRF), and thus the iHMM model additionally incorporates a term in the joint likelihood involving the prior probability of a set of state assignments arising from the CRF. Suppose an assignment of observations to states has K different states (i.e., the number of restaurants in the franchise), s_i is the number of states that can be reached from state i in one step (i.e., the number of tables in restaurant i), and n_i is the number of observations in each state i (i.e., the number of customers in each restaurant). Then the probability of an assignment in the HDP can be written as (after integrating out mixture weights [1, 11], and keeping only the terms that would not be constants after the asymptotic analysis [5]):

p(Z \mid \alpha, \gamma, \lambda) \propto \gamma^{K-1} \frac{\Gamma(\gamma + 1)}{\Gamma(\gamma + \sum_{k=1}^{K} s_k)} \prod_{k=1}^{K} \alpha^{s_k - 1} \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + n_k)}.

For the likelihood, we follow the same assumption as in the parametric case: the observation densities are Gaussians with a shared covariance matrix σ²I_d. 
Further, the means are drawn independently from the prior N(0, ρ²I_d), where ρ² > 0 (this is needed, as the model is fully Bayesian now). Therefore, p(\mu_{1:K}) = \prod_{k=1}^{K} N(\mu_k \mid 0, \rho^2 I_d), and

p(X, Z) \propto p(Z \mid \alpha, \gamma, \lambda) \cdot p(z_1) \prod_{t=2}^{N} p(z_t \mid z_{t-1}) \cdot \prod_{t=1}^{N} N(x_t \mid \mu_{z_t}, \sigma^2 I_d) \cdot p(\mu_{1:K}).

Now, we can perform the small-variance analysis on the generative model. In order to retain the impact of the hyperparameters α and γ in the asymptotics, we can choose some constants λ_1, λ_2 > 0 such that

\alpha = \exp(-\lambda_1 \beta), \quad \gamma = \exp(-\lambda_2 \beta),

where β = 1/(2σ²) as before. Note that, in this way, we have α → 0 and γ → 0 as β → ∞. We can now consider the objective function for maximizing the generative probability as we let β → ∞. This gives

p(X, Z) \propto \exp\Big[ -\beta \Big( \sum_{t=1}^{N} \|x_t - \mu_{z_t}\|^2 + \lambda \sum_{t=2}^{N} KL(z_t, m_{z_{t-1}}) + \lambda_1 \sum_{k=1}^{K} (s_k - 1) + \lambda_2 (K - 1) \Big) + \log p(z_1) \Big].   (6)

Therefore, maximizing the generative probability is asymptotically equivalent to the following optimization problem:

\min_{K, Z, \mu, T} \; \sum_{t=1}^{N} \|x_t - \mu_{z_t}\|^2 + \lambda \sum_{t=2}^{N} KL(z_t, m_{z_{t-1}}) + \lambda_1 \sum_{k=1}^{K} (s_k - 1) + \lambda_2 (K - 1).   (7)

In words, this objective seeks to minimize a penalized k-means objective with three penalties. The first is the same as in the parametric case—a penalty based on the transitions from state to state. The second penalizes the number of transitions out of each state, and the third penalizes the total number of states. Note this is similar to the objective function derived in [8] for the HDP, but here there is no dependence on any particular samplers. 
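Objective (7) is cheap to evaluate for a candidate solution (K, Z, µ, T), which is handy for checking that each step of an optimizer decreases it. A sketch on toy values (penalty weights arbitrary; for simplicity it assumes every state has at least one outgoing transition along the path):

```python
import math

def ihmm_objective(xs, z, mus, T, lam, lam1, lam2):
    """Objective (7): k-means fit + transition KL + state-structure penalties."""
    fit = sum((x - mus[k]) ** 2 for x, k in zip(xs, z))
    # KL(z_t, m_{z_{t-1}}) for one-hot z_t reduces to -log of the used entry
    trans = sum(-math.log(T[a][b]) for a, b in zip(z, z[1:]))
    K = len(mus)
    # s_k = number of distinct states reachable from k along the path
    s = [len({b for a, b in zip(z, z[1:]) if a == k}) for k in range(K)]
    return fit + lam * trans + lam1 * sum(sk - 1 for sk in s) + lam2 * (K - 1)

xs = [0.0, 0.1, 3.0, 3.1]            # toy observations
z = [0, 0, 1, 1]                     # candidate state path
mus = [0.05, 3.05]                   # candidate state means
T = [[0.5, 0.5], [0.5, 0.5]]         # candidate transition matrix
val = ihmm_objective(xs, z, mus, T, lam=1.0, lam1=0.5, lam2=2.0)
```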
Minimizing (7) can also be viewed as MAP estimation of the parameters, since p(Z, µ | X) ∝ p(X | Z) p(Z) p(µ).

3.2 Algorithm
The algorithm presented in [8] could be almost directly applied to (7), but it neglects the sequential characteristics of the model. Instead, we present a new algorithm to directly optimize (7). We follow the alternating minimization framework as in the parametric case, with some slight tweaks. Specifically, given observations {x_1, ..., x_N} and λ, λ_1, λ_2, our high-level algorithm proceeds as follows:

(1) Initialization: initialize with one hidden state. The parameters are therefore K = 1, \mu_1 = \frac{1}{N}\sum_{i=1}^{N} x_i, T = 1.
(2) Perform a forward-backward step (via approximate dynamic programming) to update Z.
(3) Update K, µ, T.
(4) For each state i (i = 1, ..., K), check if the set of observations to any state j that are reached by transitioning out of i can form a new dedicated hidden state and lower the objective function in the process. If there are several such moves, choose the one with the maximum improvement in the objective function.
(5) Update K, µ, T.
(6) Iterate steps (2)-(5) until convergence.

There are two key changes to the algorithm beyond the standard parametric case. In the forward-backward routine (step 2), we compute the usual K × N matrix α, where α(c, t) represents the minimum cost over paths of length t from the beginning of the sequence that reach state c at time step t. We use the term “cost” to refer to the sum of the distances of points to state means, as well as the additive penalties incurred. 
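Setting aside the nonparametric complications discussed below, the recursion for α is a standard min-sum (Viterbi-style) pass over these costs. The sketch below also shows one simplified way to charge λ_1 for a previously unseen transition (a smoothed stand-in; the bound-based treatment is in the supplementary material, and the new-state λ_2 case is omitted since it also requires a candidate mean):

```python
import math

def forward_costs(xs, mus, counts, lam, lam1):
    """alpha[t][c]: minimum cost of any state path reaching state c at time t.

    counts[i][c] holds current transition counts out of state i. A zero count
    would contribute -log(0), so as a simplification we charge lam1 plus a
    smoothed log term, as if one such transition were added.
    """
    K = len(mus)

    def trans_cost(i, c):
        tot = sum(counts[i])
        if counts[i][c] > 0:
            return -lam * math.log(counts[i][c] / tot)
        return lam1 - lam * math.log(1.0 / (tot + 1))  # previously unseen i -> c

    alpha = [[(xs[0] - mus[c]) ** 2 for c in range(K)]]
    for x in xs[1:]:
        prev = alpha[-1]
        alpha.append([min(prev[i] + trans_cost(i, c) for i in range(K))
                      + (x - mus[c]) ** 2 for c in range(K)])
    return alpha

# Toy run: two states, with one currently unseen transition (0 -> 1).
alpha = forward_costs([0.0, 5.0], mus=[0.0, 5.0],
                      counts=[[2, 0], [1, 2]], lam=1.0, lam1=0.7)
```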
However, to see why it is difficult to compute the exact value of α in the nonparametric case, suppose we have computed the minimum cost of paths up to step t − 1 and we would like to compute the values of α for step t. The cost of a path that ends in state c is obtained by examining, for all states i, the cost of a path that ends at i at step t − 1 and then transitions to state c at step t. Thus, we must consider the transition from i to c. If there are existing transitions from state i to state c, then we proceed as in a standard forward-backward algorithm. However, we are also interested in two other cases—one where there are no existing transitions from i to c but we consider this transition along with a penalty λ_1, and another where an entirely new state is formed and we pay a penalty λ_2. In the first case, the standard forward-backward routine faces an immediate problem: when we try to compute the cost of the path given by α(c, t), the cost will be infinite, as there is a −log(0) term from the transition probability. We must therefore alter the forward-backward routine, or no new states will ever be created, nor transitions made to an existing state which previously had no transitions. The main idea is to derive and use bounds on how much the transition matrix can change under the above scenarios. As long as we can show that the values we obtain for α are upper bounds, we can show that the objective function will decrease after the forward-backward routine, since the existing sequence of states is also a valid path (with no new incurred penalties).
The second change (step 4) is that we adopt a “local move” analogous to that described for the hard HDP in [8]. 
The local move determines whether the objective will decrease if we create a new global state in a certain fashion; in particular, for each existing state j, we compute the change in objective that occurs when data points that transition from j to some state k are given their own new global state. By construction this step decreases the objective.
Due to space constraints, full details of the algorithm, along with a local convergence proof, are provided in the supplementary material (section B).

4 Experiments
We conclude with a brief set of experiments designed to highlight the benefits of our approach. Namely, we will show that our methods have benefits over the existing parametric and nonparametric HMM algorithms in terms of speed and accuracy.
Synthetic Data. First we compare our nonparametric algorithm with the Beam Sampler for the iHMM1. A sequence of length 3000 was generated over a varying number of hidden states with an all-zeros transition matrix except that T_{i,i+1} = 0.8 and T_{i,i+2} = 0.2 (when i + 1 > K, the total number of states, we choose j = i + 1 mod K and let T_{i,j} = 0.8, and similarly for i + 2). Observations were sampled from symmetric Gaussian distributions with means of {3, 6, ..., 3K} and a variance of 0.9.
The data described above were trained using our nonparametric algorithm (asymp-iHMM) and the Beam sampler. For our nonparametric algorithm, we performed a grid search over all three parameters and selected the parameters using a heuristic (see the supplementary material for a discussion of this heuristic). For the Beam sampling algorithm, we used the following hyperparameter settings: gamma hyperpriors (4, 1) for α, (3, 6) for γ, and a zero-mean normal distribution for the base H with the variance equal to 10% of the empirical variance of the dataset. We also normalized the sequence to have zero mean. 
The number of selected samples was varied among 10, 100, and 1000 for different numbers of states, with 5 iterations between two samples. (Note: there are no burn-in iterations and all samplers are initialized with a randomly initialized 20-state labeling.)

1 http://mloss.org/software/view/205/

Figure 1: Our algorithm (asymp-iHMM) vs. the Beam Sampler on the synthetic Gaussian hidden Markov model data. (Left) The training accuracy; (Right) The training time on a log-scale.

In Figure 1 (best viewed in color), the training accuracy and running time for the two algorithms are shown. The accuracy of the Beam sampler is given by the highest among all the samples selected. Accuracy is reported as the normalized mutual information (NMI) score (in the range [0, 1]), since the sampler may output a different number of states than the ground truth, a situation NMI can handle. We can see that, on all datasets, our algorithm performs better than the sampling method in terms of accuracy, with running time similar to the sampler with only 10 samples. For these datasets, we also observe that the EM algorithm for the standard HMM (not reported in the figure) can easily output a smaller number of states than the ground truth, which yields a smaller NMI score. We also observed that the Beam sampler is highly sensitive to the initialization of hyperparameters. Putting flat priors over the hyperparameters can ameliorate the situation, but also substantially increases the number of samples required.
Next we demonstrate the effect of the compensation parameter λ in the parametric asymptotic model, along with comparisons to the standard HMM. We will call the generalized segmental k-means of Section 2 the “asymp-HMM” algorithm, shortened to “AHMM” as appropriate. For this experiment, we used univariate Gaussians with means at 3, 6, and 10, and standard deviation of 2.9. 
In our ground-truth transition kernel, state i had an 80% probability of transitioning to state i + 1, and a 10% probability of transitioning to each of the other states. 5000 datapoints were generated from this model. The first 40% of the data was used for training, and the remaining 60% for prediction. The means in both the standard HMM and the asymp-HMM were initialized by the centroids learned by k-means from the training data. The transition kernels were initialized randomly. Each algorithm was run 50 times; the averaged results are shown in Figure 2.
Figure 2 shows the effect of λ on accuracy as measured by NMI and scaled prediction error. We see the expected tradeoff: for small λ, the problem essentially reduces to standard k-means, whereas for large λ the observations are essentially ignored. For λ = 1, corresponding to standard segmental k-means, we obtain results similar to the standard HMM, which obtains an NMI of .57 and error of 1.16. Thus, the parametric method offers some added flexibility via the new λ parameter.

Figure 2: NMI and prediction error as a function of the compensation parameter λ

Figure 3: Predicted values of the S&P 500 index from 12/29/1999 to 07/30/2012 returned by the asymp-HMM, asymp-iHMM and the standard HMM algorithms, with the true index values for that period (better in color); see text for details.

Financial time-series prediction. Our next experiment illustrates the advantages of our algorithms in a financial prediction problem. The sequence consists of 3668 values of the Standard & Poor's 500 index on consecutive trading days from Jan 02, 1998 to July 30, 2012². The index exhibited appreciable variability in this period, with both bull and bear runs. 
The goal here was to predict the index value on a test sequence of trading days, and compare the accuracies and runtimes of the algorithms.
To prevent overfitting, we used a training window of length 500. This window size was empirically chosen to provide a balance between prediction accuracy and runtime. The algorithms were trained on the sequence from index i to i + 499, and then the (i + 500)-th value was predicted and compared with the actual recorded value at that point in the sequence; i ranged from 1 to 3168. As before, the mixture means were initialized with k-means and the transition kernels were given random initial values. For the asymp-HMM and the standard HMM, the number of latent states was empirically chosen to be 5. For the asymp-iHMM, we tuned the parameters to also obtain 5 states on average. For predicting observation T + 1 given observations up to step T, we used the weighted average of the learned state means, weighted by the transition probabilities given by the state of the observation at time T.
We ran the standard HMM along with both the parametric and non-parametric asymptotic algorithms on this data (the Beam sampler was too slow to run over this data, as each individual prediction took on the order of minutes). The values predicted from time step 501 to 3668 are plotted with the true index values in that time range in Figure 3. Both the parametric and non-parametric asymptotic algorithms perform noticeably better than the standard HMM; they are able to better approximate the actual curve across all kinds of temporal fluctuations. Indeed, the difference is most stark in the areas of high-frequency oscillations. While the standard HMM returns an averaged-out prediction, our algorithms latch onto the underlying behavior almost immediately and return noticeably more accurate predictions. 
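The one-step-ahead rule just described (a transition-probability-weighted average of the learned state means) is a one-liner; a sketch with hypothetical learned parameters:

```python
def predict_next(state_t, mus, T):
    """Predict x_{T+1} as the average of state means weighted by T[state_t]."""
    return sum(p * m for p, m in zip(T[state_t], mus))

mus = [10.0, 20.0, 30.0]   # hypothetical learned state means
T = [[0.8, 0.1, 0.1],      # hypothetical learned transition kernel
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]
pred = predict_next(0, mus, T)  # the chain mostly stays in state 0
```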
Prediction accuracy was measured using the mean absolute percentage (MAP) error: the mean of the absolute differences between the predicted and true values, expressed as percentages of the true values. The MAP error for the HMM was 6.44%, that for the asymp-HMM was 3.16%, and that for the asymp-iHMM was 2.78%. This confirms the visual impression from Figure 3 that the asymp-iHMM algorithm returns the best-fitted prediction.

Additional Real-World Results. We also compared our methods on a well-log data set that was used for testing the Beam sampler. Due to space constraints, further discussion of these results is included in the supplementary material.

5 Conclusion
This paper considered an asymptotic treatment of the HMM and iHMM. The goal was to obtain non-probabilistic formulations inspired by the HMM, in order to expand small-variance asymptotics to sequential models. We view our main contribution as a novel dynamic-programming-based algorithm for sequential data with a non-fixed number of states that is derived from the iHMM model.

Acknowledgements
This work was supported by NSF award IIS-1217433.

²http://research.stlouisfed.org/fred2/series/SP500/downloaddata

References
[1] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[3] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, 2002.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] T. Broderick, B. Kulis, and M. I. Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[6] J. V.
Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[7] K. Jiang, B. Kulis, and M. I. Jordan. Small-variance asymptotics for exponential family Dirichlet process mixture models. In Advances in Neural Information Processing Systems, 2012.
[8] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[9] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[10] S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, 1998.
[11] E. Sudderth. Toward reliable Bayesian nonparametric learning. In NIPS Workshop on Bayesian Nonparametric Models for Reliable Planning and Decision-Making Under Uncertainty, 2012.
[12] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[13] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611–622, 1999.
[14] S. Tong and D. Koller. Restricted Bayes optimal classifiers. In Proceedings of the 17th AAAI Conference, 2000.