{"title": "Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 333, "page_last": 341, "abstract": "Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer's, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method \\cite{anandkumar2012tensor} to \\textit{provably} recover first-order Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. 
Preliminary simulation results confirm our theoretical findings.", "full_text": "Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

Tzu-Kuo Huang
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
tzukuoh@cs.cmu.edu

Jeff Schneider
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213
schneide@cs.cmu.edu

Abstract

Learning dynamic models from observed data has been a central issue in many scientific studies and engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such as Alzheimer's, or certain biological processes.
Existing methods for learning dynamic models from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover first-order Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings.

1 Introduction

Learning dynamic models from observed data has been a central issue in many scientific and engineering fields of study.
The usual setting is that data are collected sequentially from trajectories of some dynamical system operation, and the goal is to recover parameters of the underlying dynamic model. Although many research and engineering efforts have been devoted to that setting, it turns out that in quite a few modern scientific modeling problems another situation is more frequently encountered: the observed data are out-of-order (or partially ordered) snapshots rather than full sequential samples of the system operation. As pointed out in [7, 8], this situation may appear in the modeling of celestial objects such as galaxies, or of chronic diseases such as Alzheimer's, because observations are usually taken from different trajectories (galaxies or patients) at unknown, arbitrary times. It may also appear in the study of biological processes, such as cell metabolism under external stimuli, where most measurement techniques are destructive, making it very difficult to repeatedly collect observations from the same individual living organisms as they change over time. However, it is much easier to take single snapshots of multiple organisms undergoing the same biological process in a fully asynchronous fashion, hence the lack of timing information. Rabbat et al. [9] noted that in certain network inference problems, the only available data are sets of nodes co-occurring in random walks on the network, without the order in which they were visited, and the goal is to reconstruct the network structure from such co-occurrence data. This problem is essentially about learning a first-order Markov chain from data lacking sequence information.

As one can imagine, dynamic model learning in a non-sequential setting is much more difficult than in the sequential setting, and it has not been thoroughly studied.
One issue is that the notion of non-sequence data is vague, because there can be many different generative processes resulting in non-sequence data. Without any restrictions, one can easily find a case where no meaningful dynamic model can be learnt. It is therefore important to figure out what assumptions on the data and the model would lead to successful learning. However, existing methods for non-sequential settings, e.g., [9, 11, 6, 8], do not shed much light on this issue because they are mostly based on Expectation-Maximization (EM), which requires non-convex optimization. Regardless of the assumptions we make, as long as the resulting optimization problem remains non-convex, formal analysis of learning guarantees remains formidable.

We thus propose to take a different approach, based on another long-standing estimation principle: the method of moments (MoM). The basic idea of MoM is to find model parameters such that the resulting moments match or resemble the empirical moments. For some estimation problems, this approach is able to give unique and consistent estimates where the maximum-likelihood method gets entangled in multiple and potentially undesirable local maxima. Taking advantage of this property, an emerging area of research in machine learning has recently developed MoM-based learning algorithms with formal guarantees for some widely used latent variable models, such as Gaussian mixture models [5], hidden Markov models [3], Latent Dirichlet Allocation [1, 4], etc. Although many learning algorithms for these models exist, some having been very successful in practice, barely any formal learning guarantee was given until the MoM-based methods were proposed.
Such breakthroughs may seem surprising, but they are mostly based on one crucial property: for quite a few latent variable models, the model parameters can be uniquely determined from spectral decompositions of certain low-order moments of observable quantities.

In this work we demonstrate that, under the MoM and spectral learning framework, there are reasonable assumptions on the generative process of non-sequence data under which the tensor decomposition method [2], a recent advance in spectral learning, can provably recover the parameters of first-order Markov models and hidden Markov models. To the best of our knowledge, ours is the first work that provides formal guarantees for learning from non-sequence data. Interestingly, these assumptions bear much similarity to the usual idea behind topic modeling: with the bag-of-words representation, which is invariant to word orderings, the task of inferring topics is almost impossible given one single document (no matter how long it is!), but becomes easier as more documents touching upon various topics become available. For learning dynamic models, what we need in the non-sequence data are multiple sets of observations, where each set contains independent samples generated from its own initial distribution, and the many different initial distributions together cover the entire (hidden) state space. In some of the aforementioned scientific applications, such as biological studies, this type of assumption might be realized by running multiple experiments with different initial configurations or amounts of stimuli.

The main body of the paper consists of four sections.
Section 2 briefly reviews the essentials of the tensor decomposition framework [2]; Section 3 details our assumptions on non-sequence data, tensor-decomposition based learning algorithms, and theoretical guarantees; Section 4 reports some simulation results confirming our theoretical findings, followed by conclusions in Section 5. Proofs of theoretical results are given in the appendices in the supplementary material.

2 Tensor Decomposition

We mainly follow the exposition in [2], starting with some preliminaries and notations. A real p-th order tensor A is a member of the tensor product space ⊗_{i=1}^p R^{m_i} of p Euclidean spaces. For a vector x ∈ R^m, we denote by x^⊗p := x ⊗ x ⊗ ··· ⊗ x ∈ ⊗_{i=1}^p R^m its p-th tensor power. A convenient way to represent A ∈ ⊗_{i=1}^p R^m is through a p-way array of real numbers [A_{i1 i2 ··· ip}]_{1 ≤ i1, i2, ..., ip ≤ m}, where A_{i1 i2 ··· ip} denotes the (i1, i2, ..., ip)-th coordinate of A with respect to a canonical basis. With this representation, we can view A as a multi-linear map that, given a set of p matrices {X_i ∈ R^{m×m_i}}_{i=1}^p, produces another p-th order tensor A(X1, X2, ..., Xp) ∈ ⊗_{i=1}^p R^{m_i} with the following p-way array representation:

A(X1, X2, ..., Xp)_{i1 i2 ··· ip} := Σ_{1 ≤ j1, j2, ..., jp ≤ m} A_{j1 j2 ··· jp} (X1)_{j1 i1} (X2)_{j2 i2} ··· (Xp)_{jp ip}.    (1)

Figure 1: Running example of Markov chain with three states

In this work we consider tensors that are up to the third order (p ≤ 3) and, for most of the time, also symmetric, meaning that their p-way array representations are invariant under permutations of array indices.
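As a concrete illustration (ours, not part of the paper), the multi-linear map of Eq. (1) for p = 3 is a single `einsum` contraction in numpy:

```python
import numpy as np

def multilinear_map(A, X1, X2, X3):
    """Apply the multi-linear map of Eq. (1) for a third-order tensor A:
    out[a, b, c] = sum_{j,k,l} A[j, k, l] * X1[j, a] * X2[k, b] * X3[l, c]."""
    return np.einsum('jkl,ja,kb,lc->abc', A, X1, X2, X3)

# Sanity check: mapping with identity matrices leaves the tensor unchanged.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4, 4))
I = np.eye(4)
assert np.allclose(multilinear_map(A, I, I, I), A)
```

Note that the output dimensions are set by the column counts of the X_i, matching the statement that the map produces a tensor in ⊗_{i=1}^p R^{m_i}.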
More specifically, we focus on second and third-order symmetric tensors in, or slightly perturbed from, the following form:

M2 := Σ_{i=1}^k ω_i μ_i ⊗ μ_i,  M3 := Σ_{i=1}^k ω_i μ_i ⊗ μ_i ⊗ μ_i,    (2)

satisfying the following non-degeneracy conditions:

Condition 1. ω_i ≥ 0 ∀ 1 ≤ i ≤ k, the vectors {μ_i ∈ R^m}_{i=1}^k are linearly independent, and k ≤ m.

As described in later sections, the core of our learning task involves estimating {ω_i}_{i=1}^k and {μ_i}_{i=1}^k from perturbed or noisy versions of M2 and M3. We solve this estimation problem with the tensor decomposition method recently proposed by Anandkumar et al. [2]. The algorithm and its theoretical guarantee are summarized in Appendix A. The key component of this method is a novel tensor power iteration procedure for factorizing a symmetric orthogonal tensor, which is robust against input perturbation.

3 Learning from Non-sequence Data

We first describe a generative process of non-sequence data for first-order Markov models and demonstrate how to apply tensor decomposition methods to perform consistent learning. Then we extend these ideas to hidden Markov models and provide theoretical guarantees on the sample complexity of the proposed learning algorithm. For notational convenience we define the following vector-matrix cross products ⊗_d, d ∈ {1, 2, 3}: (v ⊗_1 M)_{ijk} := v_i M_{jk}, (v ⊗_2 M)_{ijk} := v_j M_{ik}, (v ⊗_3 M)_{ijk} := v_k M_{ij}. For a matrix M we denote by M_i its i-th column.

3.1 First-order Markov Models

Let P ∈ [0, 1]^{m×m} be the transition probability matrix of a discrete, first-order, ergodic Markov chain with m states and a unique stationary distribution π.
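As a small sketch of ours (not from the paper), the structured tensors of Eq. (2) and the cross products ⊗_d can be written directly with `einsum`; `moment_tensors` and `cross_d` are hypothetical helper names:

```python
import numpy as np

def moment_tensors(omega, mu):
    """M2 = sum_i omega_i mu_i ⊗ mu_i and M3 = sum_i omega_i mu_i ⊗ mu_i ⊗ mu_i,
    as in Eq. (2); omega has shape (k,) and mu holds the mu_i as columns (m, k)."""
    M2 = np.einsum('i,ai,bi->ab', omega, mu, mu)
    M3 = np.einsum('i,ai,bi,ci->abc', omega, mu, mu, mu)
    return M2, M3

def cross_d(v, M, d):
    """Vector-matrix cross product ⊗_d from the text, e.g.
    (v ⊗_1 M)[i, j, k] = v[i] * M[j, k]."""
    subs = {1: 'i,jk->ijk', 2: 'j,ik->ijk', 3: 'k,ij->ijk'}
    return np.einsum(subs[d], v, M)
```

By construction M2 is symmetric and M3 is invariant under permutations of its three indices, which is exactly the symmetry the decomposition method exploits.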
Let P be of full rank and 1⊤P = 1⊤. To give a high-level idea of what makes it possible to learn P from non-sequence data, we use the simple Markov chain with three states shown in Figure 1 as our running example, demonstrating step by step how to extend from a very restrictive generative setting of the data to a reasonably general setting, along with the assumptions made to allow consistent parameter estimation. In the usual setting where we have sequences of observations, say {x^(1), x^(2), ...} with parenthesized superscripts denoting time, it is straightforward to consistently estimate P. We simply calculate the empirical frequency of consecutive pairs of states:

P̂_ij := Σ_t 1[x^(t+1) = i, x^(t) = j] / Σ_t 1[x^(t) = j].

Alternatively, suppose for each state j we have an i.i.d. sample of its immediate next state, D_j := {x_1^(1), x_2^(1), ... | x^(0) = j}, where subscripts are data indices. Consistent estimation in this case is also easy: the empirical distribution of D_j consistently estimates P_j, the j-th column of P. For example, the Markov chain in Figure 1 may produce the following three samples, whose empirical distributions estimate the three columns of P respectively:

D_1 = {2, 1, 2, 2, 2, 2, 2, 2, 2, 2} ⇒ P̂_1 = [0.1 0.9 0.0]⊤,
D_2 = {3, 3, 2, 3, 2, 3, 3, 2, 3, 3} ⇒ P̂_2 = [0.0 0.3 0.7]⊤,
D_3 = {1, 1, 3, 1, 3, 3, 1, 3, 3, 1} ⇒ P̂_3 = [0.5 0.0 0.5]⊤.

A nice property of these estimates is that, unlike in the sequential setting, they do not depend on any particular ordering of the observations in each set. Nevertheless, such data are not quite non-sequenced, because all observations are made at exactly the next time step. We thus consider the following generalization: for each state j, we have D_j := {x_1^(t1), x_2^(t2), ... | x^(0) = j}, i.e., independent samples of states drawn at unknown future times {t_1, t_2, ...}. For example, our data in this setting might be
For example, our data in\nthis setting might be\n\n(t2)\n2\n\n(t1)\n1\n\n, x\n\nD1 = {2, 1, 2, 3, 2, 3, 3, 2, 2, 3},\nD2 = {3, 3, 2, 3, 2, 1, 3, 2, 3, 1},\nD3 = {1, 1, 3, 1, 2, 3, 2, 3, 3, 2}.\n\n(3)\n\nObviously it is hard to extract information about P from such data. However, if we assume that\nthe unknown times {ti} are i.i.d. random variables following some distribution independent of the\ninitial state j, it can then be easily shown that Dj\u2019s empirical distribution consistently estimates Tj,\nthe j-th column of the the expected transition probability matrix T := Et[P t]:\n\nD1 = {2, 1, 2, 3, 2, 3, 3, 2, 2, 3} \u21d2 cT1 = [0.1 0.5 0.4]\u22a4,\nD2 = {3, 3, 2, 3, 2, 1, 3, 2, 3, 1} \u21d2 cT2 = [0.2 0.3 0.5]\u22a4,\nD3 = {1, 1, 3, 1, 2, 3, 2, 3, 3, 2} \u21d2 cT3 = [0.3 0.3 0.4]\u22a4.\n\nIn general there exist many P \u2019s that result in the same T . Therefore, as detailed later, we make\na speci\ufb01c distributional assumption on {ti} to enable unique recovery of the transition matrix P\nfrom T (Assumption A.1). Next we consider a further generalization, where the unknowns are not\nonly the time stamps of the observations, but also the initial state j. In other words, we only know\neach set was generated from the same initial state, but do not know the actual initial state. In this\ncase, the empirical distributions of the sets consistently estimate the columns of T in some unknown\npermutation \u03a0:\n\nD\u03a0(3) = {1, 1, 3, 1, 2, 3, 2, 3, 3, 2} \u21d2 [T\u03a0(3) = [0.3 0.3 0.4]\u22a4.\nD\u03a0(2) = {3, 3, 2, 3, 2, 1, 3, 2, 3, 1} \u21d2 [T\u03a0(2) = [0.2 0.3 0.5]\u22a4,\nD\u03a0(1) = {2, 1, 2, 3, 2, 3, 3, 2, 2, 3} \u21d2 [T\u03a0(1) = [0.1 0.5 0.4]\u22a4.\n\nIn order to be able to identify \u03a0, we will again resort to randomness and assume the unknown initial\nstates are random variables following a certain distribution (Assumption A.2) so that the data carry\ninformation about \u03a0. 
Finally, we generalize from a single unknown initial state to an unknown initial state distribution, where each set of observations D := {x_1^(t1), x_2^(t2), ... | π^(0)} consists of independent samples of states drawn at random times from some unknown initial state distribution π^(0). For example, the data may look like:

D_{π_1^(0)} = {1, 3, 3, 1, 2, 3, 2, 3, 3, 2},
D_{π_2^(0)} = {3, 1, 2, 3, 2, 1, 3, 2, 3, 1},
D_{π_3^(0)} = {2, 1, 2, 3, 3, 3, 3, 1, 2, 3},
...

With this final generalization, most would agree that the generated data are non-sequenced and that the generative process is flexible enough to model the real-world situations described in Section 1. However, simple estimation with empirical distributions no longer works, because each set may now contain observations from multiple initial states. This is where we take advantage of the tensor decomposition framework outlined in Section 2, which requires proper assumptions on the initial state distribution π^(0) (Assumption A.3).

Now we are ready to give the definition of our entire generative process. Assume we have N sets of non-sequence data, each containing n observations, and each set of observations {x_i}_{i=1}^n was independently generated by the following:

• Draw an initial distribution π^(0) ∼ Dirichlet(α), where E[π^(0)] = α/(Σ_{i=1}^m α_i) = π and π_i ≠ π_j ∀ i ≠ j. (Assumptions A.3, A.2)
• For i = 1, ..., n:
  – Draw a discrete time t_i ∼ Geometric(r), t_i ∈ {1, 2, 3, ...}. (Assumption A.1)
  – Draw an initial state s_i ∼ Multinomial(π^(0)), s_i ∈ {0, 1}^m.
  – Draw an observation x_i ∼ Multinomial(P^{t_i} s_i), x_i ∈ {0, 1}^m.

The above generative process has several properties.
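Before discussing these properties, a minimal sampler for one such set might look as follows; this is our illustration (the name `sample_set` is ours), returning state labels 0..m−1 instead of indicator vectors:

```python
import numpy as np

def sample_set(P, alpha, r, n, rng):
    """Draw one set of n non-sequence observations from the generative process
    above; P is column-stochastic (1^T P = 1^T)."""
    m = P.shape[0]
    pi0 = rng.dirichlet(alpha)                    # Assumption A.3
    xs = np.empty(n, dtype=int)
    for i in range(n):
        t = rng.geometric(r)                      # Assumption A.1: t in {1, 2, ...}
        s = rng.choice(m, p=pi0)                  # initial state from pi0
        col = np.linalg.matrix_power(P, t)[:, s]  # distribution P^t e_s
        col = col / col.sum()                     # guard against float drift
        xs[i] = rng.choice(m, p=col)              # observed state
    return xs
```

All n points in a set share one draw of π^(0), while each point gets its own random time and initial state, mirroring the two levels of randomness in the process.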
First, all the data points in the same set share the same initial state distribution but can have different initial states; the initial state distribution varies across different sets and yet centers at the stationary distribution of the Markov chain. As mentioned in Section 1, this may be achieved in biological studies by running multiple experiments with different input stimuli, so the data collected in the same experiment can be assumed to have the same initial state distribution. Second, each data point is drawn from an independent trajectory of the Markov chain, a situation similar to the modeling of galaxies or Alzheimer's, and the random time steps can be used to compensate for individual variations in speed: a small/large t_i corresponds to a slowly/fast evolving individual object. Finally, the geometric distribution can be interpreted as an overall measure of the magnitude of speed variation: a large success probability r results in many small t_i's, meaning that most objects evolve at similar speeds, while a small r leads to t_i's taking a wide range of values, indicating a large speed variation.

To use the tensor decomposition method in Appendix A, we need the tensor structure (2) in certain low-order moments of observed quantities. The following theorem identifies such quantities:

Theorem 1. Define the expected transition probability matrix T := E_t[P^t] = rP(I − (1 − r)P)^{-1}, and let α0 := Σ_i α_i, C2 := E[x1 x2⊤] and C3 := E[x1 ⊗ x2 ⊗ x3].
Then the following holds:

E[x1] = π,  C2 = 1/(α0+1) T diag(π) T⊤ + α0/(α0+1) ππ⊤,    (4)
C3 = 2/((α0+2)(α0+1)) Σ_i π_i T_i^⊗3 + α0/(α0+2) Σ_{d=1}^3 π ⊗_d C2 − 2α0²/((α0+2)(α0+1)) π^⊗3,    (5)
M2 := (α0+1) C2 − α0 ππ⊤ = T diag(π) T⊤,    (6)
M3 := (α0+2)(α0+1)/2 C3 − (α0+1)α0/2 Σ_{d=1}^3 π ⊗_d C2 + α0² π^⊗3 = Σ_i π_i T_i^⊗3.    (7)

The proof is in Appendix B.1, which relies on the special structure in the moments of the Dirichlet distribution (Assumption A.3). It is clear that M2 and M3 have the desired tensor structure. Assuming α0 is known, we can form estimates M̂2 and M̂3 by computing empirical moments from the data. Note that the x_i's are exchangeable, so we can use all pairs and triples of data points to compute the estimates. Interestingly, these low-order moments have a very similar structure to those in Latent Dirichlet Allocation [1]. Indeed, according to our generative process, we can view a set of non-sequence data points as a document generated by an LDA model with the expected transition matrix T as the topic matrix, the stationary distribution π as the topic proportions, and, most importantly, the states as both the words and the topics. The last property is what distinguishes our generative process from a general LDA model: because both the words and the topics correspond to the states, the topic matrix is no longer invariant to column permutations. Since the tensor decomposition method may return T̂ under any column permutation, we need to recover the correct matching between its rows and columns. Note that the π̂ returned by the tensor decomposition method undergoes the same permutation as T̂'s columns.
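The empirical-moment computation mentioned above (all ordered pairs and triples of distinct points per set, valid by exchangeability) followed by the corrections of Eqs. (6)-(7) can be sketched as follows; this is our illustration and `plugin_M2_M3` is a hypothetical name:

```python
import numpy as np

def plugin_M2_M3(sets, m, alpha0):
    """Empirical pi_hat, C2_hat, C3_hat over N sets of state labels in
    {0,...,m-1}, then the corrections of Eqs. (6)-(7) for M2_hat, M3_hat."""
    pi_h = np.zeros(m)
    C2 = np.zeros((m, m))
    C3 = np.zeros((m, m, m))
    for xs in sets:
        X = np.eye(m)[np.asarray(xs)]          # n x m one-hot rows
        n = X.shape[0]
        s = X.sum(axis=0)                      # sum of one-hot vectors
        D2 = X.T @ X                           # sum_i x_i x_i^T
        D3 = np.einsum('ia,ib,ic->abc', X, X, X)
        pi_h += s / n
        C2 += (np.outer(s, s) - D2) / (n * (n - 1))
        # inclusion-exclusion: sum over ordered triples of distinct indices
        C3 += (np.einsum('a,b,c->abc', s, s, s)
               - np.einsum('ab,c->abc', D2, s)
               - np.einsum('ac,b->abc', D2, s)
               - np.einsum('a,bc->abc', s, D2)
               + 2 * D3) / (n * (n - 1) * (n - 2))
    N = len(sets)
    pi_h, C2, C3 = pi_h / N, C2 / N, C3 / N
    M2 = (alpha0 + 1) * C2 - alpha0 * np.outer(pi_h, pi_h)
    cross = (np.einsum('a,bc->abc', pi_h, C2)
             + np.einsum('b,ac->abc', pi_h, C2)
             + np.einsum('c,ab->abc', pi_h, C2))
    M3 = ((alpha0 + 2) * (alpha0 + 1) / 2 * C3
          - (alpha0 + 1) * alpha0 / 2 * cross
          + alpha0**2 * np.einsum('a,b,c->abc', pi_h, pi_h, pi_h))
    return pi_h, M2, M3
```

The inclusion-exclusion trick avoids the O(n³) loop over triples while producing exactly the average over distinct ordered triples.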
Because all π_i's have different values by Assumption A.2, we may recover the correct matching by sorting both the returned π̂ and the mean π̄ of all data.

A final issue is estimating P and r from T̂. This is in general difficult even when the exact T is available, because multiple choices of P and r may result in the same T. However, if the true transition matrix P has at least one zero entry, then unique recovery is possible:

Theorem 2. Let P*, r*, T* and π* denote the true values of the transition probability matrix, the success probability, the expected transition matrix, and the stationary distribution, respectively. Assume that P* is ergodic and of full rank, and P*_ij = 0 for some i and j. Let S := {λ/(λ − 1) | λ is a real negative eigenvalue of T*} ∪ {0}. Then the following holds:

• 0 ≤ max(S) < r* ≤ 1.
• For all r ∈ (0, 1] \ S, P(r) := (rI + (1 − r)T*)^{-1} T* is well-defined, and
  1⊤P(r) = 1⊤,  P(r)π* = π*,  P* = P(r*),
  P(r)_ij ≥ 0 ∀ i, j ⟺ r ≥ r*.

That is, P(r) is a stochastic matrix if and only if r ≥ r*.

The proof is in Appendix C. This theorem indicates that we can determine r* from T* by doing bisection on (0, 1]. But this approach fails when we replace T* by an estimate T̂, because even P̂(r*) might contain negative values. A more practical estimation procedure is the following: for each value of r in a decreasing sequence starting from 1, project P̂(r) := (rI + (1 − r)T̂)^{-1} T̂ onto the space of stochastic matrices and record the projection distance.
Then search the sequence of projection distances for the first sudden increase¹ starting from 1, and take the corresponding value of r and the projected P̂(r) as our estimates.

Assuming the true r and α0 are known, with the empirical moments being consistent estimators of the true moments and the tensor decomposition method guaranteed to return accurate estimates under small input perturbation, we can conclude that the estimates described above converge (with high probability) to the true quantities as the sample size N increases. We give a sample complexity bound on the estimation error in the next section for hidden Markov models.

3.2 Hidden Markov Models

Let P and π now be defined over the hidden discrete state space of size k, with the same properties as in the first-order Markov model. The generative process here is almost identical to (and therefore shares the same interpretation with) the one in Section 3.1, except for an extra mapping from the discrete hidden state to a continuous observation space:

• Draw a state indicator vector h_i ∼ Multinomial(P^{t_i} s_i), h_i ∈ {0, 1}^k.
• Draw an observation x_i = U h_i + ε_i, where U ∈ R^{m×k} denotes a rank-k matrix of mean observation vectors for the k hidden states, and the random noise vectors ε_i are i.i.d., satisfying E[ε_i] = 0 and Var[ε_i] = σ²I.

Note that a spherical covariance² is required for the tensor decomposition method to be applicable. The low-order moments that lead to the desired tensor structure are given in the following:

Theorem 3. Define the expected hidden state transition matrix T := E_t[P^t] = rP(I − (1 − r)P)^{-1}, and let α0 := Σ_i α_i, V1 := E[x1], V2 := E[x1 x1⊤], V3 := E[x1^⊗3], C2 := E[x1 x2⊤] and C3 := E[x1 ⊗ x2 ⊗ x3]. Then the following holds:

V1 = U π,  V2 = U diag(π) U⊤ + σ²I,  V3 = Σ_i π_i U_i^⊗3 + Σ_{d=1}^3 V1 ⊗_d (σ²I),
C2 = 1/(α0+1) UT diag(π) (UT)⊤ + α0/(α0+1) V1 V1⊤,
C3 = 2/((α0+2)(α0+1)) Σ_i π_i (UT)_i^⊗3 + α0/(α0+2) Σ_{d=1}^3 V1 ⊗_d C2 − 2α0²/((α0+2)(α0+1)) V1^⊗3,
M2 := V2 − σ²I = U diag(π) U⊤,
M3 := V3 − Σ_{d=1}^3 V1 ⊗_d (σ²I) = Σ_i π_i U_i^⊗3,
M2′ := (α0+1) C2 − α0 V1 V1⊤ = UT diag(π) (UT)⊤,
M3′ := (α0+2)(α0+1)/2 C3 − (α0+1)α0/2 Σ_{d=1}^3 V1 ⊗_d C2 + α0² V1^⊗3 = Σ_i π_i (UT)_i^⊗3.

¹Intuitively the jump should be easier to locate as P gets sparser, but we do not have a formal result.
²We may allow different covariances σ²_j I for different hidden states. See Section 3.2 of [2] for details.

Algorithm 1 Tensor decomposition method for learning HMM from non-sequence data
input N sets of non-sequence data points, the success probability r, the Dirichlet parameter α0, the number of hidden states k, and numbers of iterations L and N.
1: Compute empirical averages V̂1, V̂2, V̂3, Ĉ2, Ĉ3, and σ̂² := λ_min(V̂2 − V̂1 V̂1⊤).
2: Compute M̂2, M̂3, M̂2′, M̂3′.
3: Run Algorithm A.1 (Appendix A) on M̂2 and M̂3 with the number of hidden states k to obtain a symmetric tensor T̂ ∈ R^{k×k×k} and a whitening transformation Ŵ ∈ R^{m×k}.
4: Run Algorithm A.2 (Appendix A) k times, each with numbers of iterations L and N, the input tensor in the first run set to T̂ and in each subsequent run set to the deflated tensor returned by the previous run, resulting in k pairs of eigenvalue/eigenvector {(λ̂_i, v̂_i)}_{i=1}^k.
5: Repeat Steps 3 and 4 on M̂2′ and M̂3′ to obtain T̂′, Ŵ′ and {(λ̂′_i, v̂′_i)}_{i=1}^k.
6: Match {(λ̂_i, v̂_i)}_{i=1}^k with {(λ̂′_i, v̂′_i)}_{i=1}^k by sorting {λ̂_i}_{i=1}^k and {λ̂′_i}_{i=1}^k.
7: Obtain estimates of HMM parameters:
   π̂ := [λ̂′_1^{-2} ··· λ̂′_k^{-2}]⊤,  Û := (Ŵ⊤)† V̂ Λ̂,  ÛT := (Ŵ′⊤)† V̂′ Λ̂′,  P̂ := (r Û + (1 − r) ÛT)† ÛT,
   where V̂ := [v̂_1 ··· v̂_k], Λ̂ := diag([λ̂_1 ··· λ̂_k]⊤); V̂′ and Λ̂′ are defined in the same way.
8: (Optional) Project π̂ onto the simplex and P̂ onto the space of stochastic matrices.
output Estimates π̂, P̃ and Ũ, possibly under permutation of state labels.

The proof is in Appendix B.2.
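As a side illustration (our sketch, not part of Algorithm 1), the r-selection heuristic of Section 3.1 can be implemented as follows; `project_columns_to_simplex` is our helper using the standard sorting-based simplex projection, and locating the first sharp rise in the recorded distances is left to inspection, as in the experiments:

```python
import numpy as np

def project_columns_to_simplex(M):
    """Euclidean projection of each column of M onto the probability simplex."""
    P = np.empty_like(M)
    for j in range(M.shape[1]):
        v = np.sort(M[:, j])[::-1]                 # sort descending
        css = np.cumsum(v) - 1.0
        cond = v - css / np.arange(1, len(v) + 1) > 0
        rho = np.nonzero(cond)[0][-1]
        theta = css[rho] / (rho + 1)
        P[:, j] = np.maximum(M[:, j] - theta, 0.0)
    return P

def scan_r(T_hat, grid):
    """For each r in a (decreasing) grid, form P(r) = (rI + (1-r)T)^{-1} T and
    record its distance to the nearest column-stochastic matrix; by Theorem 2,
    the distance stays near zero for r >= r* and jumps once r drops below r*."""
    m = T_hat.shape[0]
    dists = []
    for r in grid:
        Pr = np.linalg.solve(r * np.eye(m) + (1 - r) * T_hat, T_hat)
        dists.append(np.linalg.norm(Pr - project_columns_to_simplex(Pr)))
    return np.array(dists)
```

On a T built from a column-stochastic P with a zero entry and a known r*, the distances are numerically zero for grid values at or above r* and strictly positive below it, which is the jump the heuristic looks for.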
This theorem suggests that, unlike first-order Markov models, HMMs require two applications of the tensor decomposition method: one on M2 and M3 for extracting the mean observation vectors U, and the other on M2′ and M3′ for extracting the matrix product UT. Another issue is that the estimates for M2 and M3 require an estimate of the noise variance σ², which is not directly observable. Nevertheless, since M2 and M3 are in the form of low-order moments of spherical Gaussian mixtures, we may use the existing result (Theorem 3.2, [2]) to obtain an estimate σ̂² = λ_min(V̂2 − V̂1 V̂1⊤). The situation regarding permutations of the estimates is also different here. First note that P = (rU + (1 − r)UT)† UT, which implies that permuting the columns of U and the columns of UT in the same manner has the effect of permuting both the rows and the columns of P, essentially re-labeling the hidden states. Hence we can only expect to recover P up to some simultaneous row and column permutation. By the assumption that the π_i's are all different, we can sort the two estimates π̂ and π̂′ to match the columns of Û and ÛT, and obtain P̂ if r is known. When r is unknown, a heuristic similar to the one for first-order Markov models can be used to estimate r, based on the fact that P = (rU + (1 − r)UT)† UT = (rI + (1 − r)T)^{-1} T, suggesting that Theorem 2 remains true when expressing P in terms of U and UT.

Algorithm 1 gives the complete procedure for learning an HMM from non-sequence data. Combining the perturbation bounds of the tensor decomposition method (Appendix A), the whitening procedure (Appendix D.1) and the matrix pseudoinverse [10], and concentration bounds on empirical moments (Appendix D.3), we provide a sample complexity analysis:

Theorem 4.
Suppose the numbers of iterations N and L for Algorithm A.2 satisfy the conditions in Theorem A.1 (Appendix A), and the number of hidden states k, the success probability r, and the Dirichlet parameter α0 are all given. For any η ∈ (0, 1) and ϵ > 0, if the number of sets N satisfies

N ≥ (12 max(k², m) m³ ν³ (α0+2)² (α0+1)² / η) · (42000 c² σ1(UT)² max(σ1(UT), σ1(U), 1)² / min(σ_k(M2′), σ_k(M2))²) · max( 4600 / δ_min² , 225000 / (ϵ² σ_k(rU + (1 − r)UT)⁴ min(σ_k(UT), σ_k(U), 1)⁴) ),

where c is some constant, ν := max(σ² + max_{i,k} |U_ik|², 1), δ_min := min_{i≠j} |1/√π_i − 1/√π_j|, and σ_i(·) denotes the i-th largest singular value, then the P̂ and Û returned by Algorithm 1 satisfy

Prob(‖P − P̂‖ ≤ ϵ) ≥ 1 − η  and  Prob(‖U − Û‖ ≤ ϵ σ_k(rU + (1 − r)UT)² / (6 σ1(UT))) ≥ 1 − η,

where P and U may undergo label permutation.

Figure 2: Simulation results. (a) Matrix estimation error. (b) Projection distance.

The proof is in Appendix E. In this result, the sample size N exhibits a fairly high-order polynomial dependency on m, k, and ϵ^{-1}, and scales with 1/η linearly instead of logarithmically, as is common in sample complexity results on spectral learning. This is because we do not impose any constraints on the observation model and simply use the Markov inequality for bounding the deviation in the empirical moments. If we make stronger assumptions such as boundedness or sub-Gaussianity, it is possible to use stronger, exponential tail bounds to obtain a better sample complexity. Also worth noting is that δ_min^{-2} acts as a threshold.
As shown in our proof, as long as the operator norm of the tensor perturbation is sufficiently smaller than $\delta_{\min}$, which measures the gaps between different $\pi_i$'s, we can correctly match the two sets of estimated tensor eigenvalues. Lastly, the lower bound on N depends, as one would expect, on the conditioning of the matrices being estimated, as reflected in the various ratios of singular values. An interesting quantity missing from the sample complexity analysis is the size of each set, n. To simplify the analysis we essentially assume n = 3, but understanding how n affects the sample complexity may have a critical impact in practice: when collecting more data, should we collect more sets or larger sets? What is the trade-off between them? This is an interesting direction for future work.

4 Simulation

Our HMM has m = 40 and k = 5 with Gaussian noise $\sigma^2 = 2$. The mean vectors U were sampled from independent univariate standard normals and then normalized to lie on the unit sphere. The transition matrix P contains one zero entry. For the generative process, we set $\alpha_0 = 1$, r = 0.3, n = 1000, and $N \in 1000 \cdot \{2^0, 2^1, \ldots, 2^{10}\}$. The numbers of iterations for Algorithm A.2 were N = 200 and L = 1000. Figure 2(a) plots the relative matrix estimation error (in spectral norm) against the sample size N for P, U, and UT obtained by Algorithm 1 given the true r. It is clear that U is the easiest to learn, followed by UT, with P the most difficult, and that all three errors converge to a very small value for sufficiently large N. Note that in Theorem 4 the bounds for P and U are different. With the model used here, the extra multiplicative factor in the bound for U is less than 0.007, suggesting that U is indeed easier to estimate than P. Figure 2(b) demonstrates the heuristic for determining r, showing projection distances (in logarithm) versus r. As N increases, the take-off point gets closer to the true r = 0.3.
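The ingredients of this setup that the text states explicitly (mean vectors with standard normal entries normalized to the unit sphere, Dirichlet mixing weights, spherical Gaussian noise) can be sketched as follows. This is only an illustration: the Dirichlet parameterization is an assumption on our part, and the sketch omits the within-set Markov-transition pairing that the full generative process of Section 2 includes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 40, 5          # observation dimension, number of hidden states
sigma2 = 2.0          # spherical Gaussian noise variance
alpha0, n = 1.0, 1000 # Dirichlet parameter, points per set

# Mean observation vectors: i.i.d. standard normal entries,
# columns normalized to the unit sphere.
U = rng.standard_normal((m, k))
U /= np.linalg.norm(U, axis=0)

# One set of n points: draw Dirichlet mixing weights over hidden states,
# draw a hidden state per point, emit a noisy observation of its mean.
w = rng.dirichlet(np.full(k, alpha0 / k))
h = rng.choice(k, size=n, p=w)
X = U[:, h].T + np.sqrt(sigma2) * rng.standard_normal((n, m))
```

The learning algorithm then works from empirical moments of many such sets, not from any ordering within a set.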
The large peak indicates a pole (the set S in Theorem 2).

5 Conclusions

We have demonstrated that, under reasonable assumptions, tensor decomposition methods can provably learn first-order Markov models and hidden Markov models from non-sequence data. We believe this is the first formal guarantee on learning dynamic models in a non-sequential setting. There are several ways to extend our results. No matter what distribution generates the random time steps, tensor decomposition methods can always learn the expected transition probability matrix T. Depending on the application, it might be better to use some other distribution for the missing time. The proposed algorithm can be modified to learn discrete HMMs under a similar generative process. Finally, applying the proposed methods to real data should be the most interesting future direction.

References

[1] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. arXiv preprint arXiv:1204.6703v4, 2013.
[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559v2, 2012.
[3] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012.
[4] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. arXiv preprint arXiv:1212.4777, 2012.
[5] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11-20. ACM, 2013.
[6] T.-K. Huang and J. Schneider.
Learning linear dynamical systems without sequence information. In Proceedings of the 26th International Conference on Machine Learning, pages 425-432, 2009.
[7] T.-K. Huang and J. Schneider. Learning auto-regressive models from sequence and non-sequence data. In Advances in Neural Information Processing Systems 24, pages 1548-1556, 2011.
[8] T.-K. Huang, L. Song, and J. Schneider. Learning nonlinear dynamic models from non-sequenced data. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[9] M. G. Rabbat, M. A. Figueiredo, and R. D. Nowak. Network inference from co-occurrences. IEEE Transactions on Information Theory, 54(9):4053-4068, 2008.
[10] G. Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634-662, 1977.
[11] X. Zhu, A. B. Goldberg, M. Rabbat, and R. Nowak. Learning bigrams from unigrams. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technology, Columbus, OH, 2008.