{"title": "Spectral Learning of Mixture of Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2249, "page_last": 2257, "abstract": "In this paper, we propose a learning approach for the Mixture of Hidden Markov Models (MHMM) based on the Method of Moments (MoM). Computational advantages of MoM make MHMM learning amenable for large data sets. It is not possible to directly learn an MHMM with existing learning approaches, mainly due to a permutation ambiguity in the estimation process. We show that it is possible to resolve this ambiguity using the spectral properties of a global transition matrix even in the presence of estimation noise. We demonstrate the validity of our approach on synthetic and real data.", "full_text": "Spectral Learning of Mixture of Hidden Markov\n\nModels\n\nY. Cem S\u00a8ubakan(cid:91), Johannes Traa(cid:93), Paris Smaragdis(cid:91),(cid:93),(cid:92)\n\n(cid:91)Department of Computer Science, University of Illinois at Urbana-Champaign\n\n(cid:93)Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign\n\n{subakan2, traa2, paris}@illinois.edu\n\n(cid:92)Adobe Systems, Inc.\n\nAbstract\n\nIn this paper, we propose a learning approach for the Mixture of Hidden Markov\nModels (MHMM) based on the Method of Moments (MoM). Computational ad-\nvantages of MoM make MHMM learning amenable for large data sets. It is not\npossible to directly learn an MHMM with existing learning approaches, mainly\ndue to a permutation ambiguity in the estimation process. We show that it is pos-\nsible to resolve this ambiguity using the spectral properties of a global transition\nmatrix even in the presence of estimation noise. We demonstrate the validity of\nour approach on synthetic and real data.\n\n1\n\nIntroduction\n\nMethod of Moments (MoM) based algorithms [1, 2, 3] for learning latent variable models have\nrecently become popular in the machine learning community. 
They provide uniqueness guarantees in parameter estimation and are a computationally lighter alternative to more traditional maximum likelihood approaches. The main reason behind the computational advantage is that, once the moment expressions are acquired, the rest of the learning work amounts to factorizing a moment matrix whose size is independent of the number of data items. However, it is unclear how to use these algorithms for more complicated models such as the Mixture of Hidden Markov Models (MHMM).\n\nThe MHMM [4] is a useful model for clustering sequences, and has various applications [5, 6, 7]. The E-step of the Expectation Maximization (EM) algorithm for an MHMM requires running forward-backward message passing along the latent state chain for each sequence in the dataset in every EM iteration. For this reason, if the number of sequences in the dataset is large, EM can be computationally prohibitive.\n\nIn this paper, we propose a learning algorithm based on the method of moments for the MHMM. We use the fact that an MHMM can be expressed as an HMM with a block-diagonal transition matrix. Having made that observation, we use an existing MoM algorithm to learn the parameters up to a permutation ambiguity. However, this does not by itself recover the parameters of the individual HMMs. We exploit the spectral properties of the global transition matrix to estimate a de-permutation mapping that enables us to recover the parameters of the individual HMMs. We also specify a method that can recover the number of HMMs under several spectral conditions.\n\n2 Model Definitions\n\n2.1 Hidden Markov Model\n\nIn a Hidden Markov Model (HMM), an observed sequence x = x_{1:T} = {x_1, . . . , x_t, . . . , x_T} with x_t ∈ R^L is generated conditioned on a latent Markov chain r = r_{1:T} = {r_1, . . . , r_t, . . . , r_T}, with r_t ∈ {1, . . . , M}. 
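For concreteness, the generative process just described (a latent Markov chain r_t emitting an observation x_t at each step) can be sketched as follows. This is a minimal numpy sketch for the Gaussian observation model defined below; the helper name `sample_hmm` and all of the toy parameter values are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(A, nu, means, sigma, T):
    """Sample (x_{1:T}, r_{1:T}) from a Gaussian-emission HMM.

    A[u, v] = p(r_t = u | r_{t-1} = v)  (columns sum to 1),
    nu[u] = p(r_1 = u), means[:, u] = E[x_t | r_t = u], isotropic noise sigma.
    """
    M = len(nu)
    states, obs = [], []
    r = rng.choice(M, p=nu)                         # r_1 ~ nu
    for _ in range(T):
        states.append(r)
        obs.append(rng.normal(means[:, r], sigma))  # x_t | r_t ~ N(mu_r, sigma^2 I)
        r = rng.choice(M, p=A[:, r])                # r_{t+1} | r_t ~ A(:, r_t)
    return np.array(obs), np.array(states)

# toy example: M = 2 states, L = 3 observation dimensions
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
nu = np.array([0.5, 0.5])
means = np.array([[0.0, 5.0],
                  [0.0, 5.0],
                  [0.0, 5.0]])
x, r = sample_hmm(A, nu, means, sigma=0.1, T=50)
```

Note that the transition matrix is column-stochastic here, matching the convention A(u, v) = p(r_t = u | r_{t−1} = v) used throughout.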
The HMM is parameterized by an emission matrix O ∈ R^{L×M}, a transition matrix A ∈ R^{M×M} and an initial state distribution ν ∈ R^M. Given the model parameters θ = (O, A, ν), the likelihood of an observation sequence x_{1:T} is defined as follows:\n\np(x_{1:T} | θ) = Σ_{r_{1:T}} p(x_{1:T}, r_{1:T} | θ) = Σ_{r_{1:T}} Π_{t=1}^{T} p(x_t | r_t, O) p(r_t | r_{t−1}, A)\n = 1_M^⊤ A diag(p(x_T | :, O)) ··· A diag(p(x_1 | :, O)) ν = 1_M^⊤ ( Π_{t=1}^{T} A diag(O(x_t)) ) ν,  (1)\n\nwhere 1_M ∈ R^M is a column vector of ones, we have switched from index notation to matrix notation in the second line so that the summations are embedded in the matrix multiplications, and we use the MATLAB colon notation to pick a row/column of a matrix. Note that O(x_t) := p(x_t | :, O). The model parameters are defined as follows:\n\n• ν(u) = p(r_1 = u | r_0) = p(r_1 = u), the initial latent state distribution,\n• A(u, v) = p(r_t = u | r_{t−1} = v), t ≥ 2, the latent state transition matrix,\n• O(:, u) = E[x_t | r_t = u], the emission matrix.\n\nThe choice of the observation model p(x_t | r_t) determines what the columns of O correspond to:\n\n• Gaussian: p(x_t | r_t = u) = N(x_t; μ_u, σ²I) ⇒ O(:, u) = E[x_t | r_t = u] = μ_u,\n• Poisson: p(x_t | r_t = u) = PO(x_t; λ_u) ⇒ O(:, u) = E[x_t | r_t = u] = λ_u,\n• Multinomial: p(x_t | r_t = u) = Mult(x_t; p_u, S) ⇒ O(:, u) = E[x_t | r_t = u] = p_u.\n\nThe first model is a multivariate, isotropic Gaussian with mean μ_u ∈ R^L and covariance σ²I ∈ R^{L×L}. The second distribution is Poisson with intensity parameter λ_u ∈ R^L. This choice is particularly useful for count data. 
The last density is a multinomial distribution with parameter p_u ∈ R^L and number of draws S.\n\n2.2 Mixture of HMMs\n\nThe Mixture of HMMs (MHMM) is a useful model for clustering sequences where each sequence is modeled by one of K HMMs. It is parameterized by K emission matrices O_k ∈ R^{L×M}, K transition matrices¹ A_k ∈ R^{M×M}, and K initial state distributions ν_k ∈ R^M, as well as a cluster prior probability distribution π ∈ R^K. Given the model parameters θ_{1:K} = (O_{1:K}, A_{1:K}, ν_{1:K}, π), the likelihood of an observation sequence x_n = {x_{1,n}, x_{2,n}, . . . , x_{T_n,n}} is computed as a convex combination of the likelihoods of the K HMMs:\n\np(x_n | θ_{1:K}) = Σ_{k=1}^{K} p(h_n = k) p(x_n | h_n = k, θ_k) = Σ_{k=1}^{K} π_k Σ_{r_{1:T_n,n}} p(x_n, r_n | h_n = k, θ_k)\n = Σ_{k=1}^{K} π_k Σ_{r_{1:T_n,n}} Π_{t=1}^{T_n} p(x_{t,n} | r_{t,n}, h_n = k, O_k) p(r_{t,n} | r_{t−1,n}, h_n = k, A_k)\n = Σ_{k=1}^{K} π_k 1_M^⊤ ( Π_{t=1}^{T_n} A_k diag(O_k(x_{t,n})) ) ν_k,  (2)\n\nwhere h_n ∈ {1, 2, . . . , K} is the latent cluster indicator, r_n = {r_{1,n}, r_{2,n}, . . . , r_{T_n,n}} is the latent state sequence for the observed sequence x_n, and O_k(x_{t,n}) is a shorthand for p(x_{t,n} | :, h_n = k, O_k). Note that if a sequence is assigned to the kth cluster (h_n = k), the corresponding HMM parameters θ_k = (A_k, O_k, ν_k) are used to generate it.\n\n¹Without loss of generality, the number of hidden states for each HMM is taken to be M to keep the notation uncluttered.\n\n3 Spectral Learning for MHMMs\n\nTraditionally, the parameters of an MHMM are learned with the Expectation-Maximization (EM) algorithm. One drawback of EM is that it requires a good initialization. Another issue is its computational requirements. 
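Equation (2) says that the MHMM likelihood is a π-weighted sum of K ordinary HMM likelihoods, each of which can be evaluated with the matrix products of Eq. (1). A minimal numpy sketch with hypothetical toy parameters (in a real implementation the products would be computed in log-space to avoid underflow):

```python
import numpy as np

def hmm_likelihood(obs_probs, A, nu):
    """1_M^T ( prod_t A diag(O(x_t)) ) nu, as in Eq. (1).

    obs_probs[t, u] = p(x_t | r_t = u); A[u, v] = p(r_t = u | r_{t-1} = v)
    (column-stochastic); nu is the initial state distribution.
    """
    alpha = nu.copy()
    for t in range(obs_probs.shape[0]):
        alpha = A @ (obs_probs[t] * alpha)   # right-to-left: diag(O(x_t)), then A
    return alpha.sum()

def mhmm_likelihood(obs_probs_per_k, A_list, nu_list, pi):
    """Convex combination of the K per-cluster HMM likelihoods, as in Eq. (2)."""
    return sum(pi[k] * hmm_likelihood(obs_probs_per_k[k], A_list[k], nu_list[k])
               for k in range(len(pi)))

# toy check with K = 2, M = 2, T = 3; all parameter values are made up
A1 = np.array([[0.9, 0.2], [0.1, 0.8]])
A2 = np.array([[0.5, 0.5], [0.5, 0.5]])
nu = np.array([0.5, 0.5])
obs = np.ones((3, 2))    # p(x_t | r_t) identically 1, so each HMM likelihood is 1
p = mhmm_likelihood([obs, obs], [A1, A2], [nu, nu], pi=np.array([0.3, 0.7]))
# p == 0.3 * 1 + 0.7 * 1 == 1
```

Because the transition matrices are column-stochastic, the all-ones observation probabilities leave the total probability mass at 1, which gives a quick sanity check of the recursion.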
In every iteration, one has to perform forward-backward message passing for every sequence, resulting in a computationally expensive process, especially when dealing with large datasets.\n\nThe proposed MoM approach avoids the issues associated with EM by leveraging the information in various moments computed from the data. Given these moments, which can be computed efficiently, the computation time of the learning algorithm is independent of the amount of data (the number of sequences and their lengths).\n\nOur approach is mainly based on the observation that an MHMM can be seen as a single HMM with a block-diagonal transition matrix. We will first establish this proposition and discuss its implications. Then, we will describe the proposed learning algorithm.\n\n3.1 MHMM as an HMM with a special structure\n\nLemma 1: An MHMM with local parameters θ_{1:K} = (O_{1:K}, A_{1:K}, ν_{1:K}, π) is an HMM with global parameters θ̄ = (Ō, Ā, ν̄), where\n\nŌ = [O_1 O_2 . . . O_K],  Ā = [A_1 0 . . . 0; 0 A_2 . . . 0; . . . ; 0 0 . . . A_K],  ν̄ = [π_1 ν_1; π_2 ν_2; . . . ; π_K ν_K],  (3)\n\nwith the block rows of Ā and the stacking of ν̄ written in MATLAB-style semicolon notation.\n\nProof: Consider the MHMM likelihood for a sequence x_n:\n\np(x_n | θ_{1:K}) = Σ_{k=1}^{K} π_k 1_M^⊤ ( Π_{t=1}^{T_n} A_k diag(O_k(x_t)) ) ν_k\n = 1_{MK}^⊤ ( Π_{t=1}^{T_n} Ā diag(Ō(x_t)) ) ν̄,  (4)\n\nwhere [O_1 O_2 . 
. . O_K](x_t) := Ō(x_t). We conclude that the MHMM and an HMM with parameters θ̄ describe equivalent probabilistic models. □\n\nWe see that the state space of an MHMM consists of K disconnected regimes. For each sequence sampled from the MHMM, the first latent state r_1 determines which regime the entire latent state sequence lies in.\n\n3.2 Learning an MHMM by learning an HMM\n\nIn the previous section, we showed the equivalence between the MHMM and an HMM with a block-diagonal transition matrix. Therefore, it should be possible to use an HMM learning algorithm such as spectral learning for HMMs [1, 2] to find the parameters of an MHMM. However, the true global parameters θ̄ are recovered only inexactly, due to estimation noise ε (θ̄ → θ̄_ε) and a state indexing ambiguity via a permutation mapping P (θ̄_ε → θ̄_ε^P). Consequently, the parameters θ̄_ε^P = (Ō_ε^P, Ā_ε^P, ν̄_ε^P) obtained from the learning algorithm are of the following form:\n\nŌ_ε^P = Ō_ε P^⊤,  Ā_ε^P = P Ā_ε P^⊤,  ν̄_ε^P = P ν̄_ε,  (5)\n\n
Our ultimate goal in this section is to undo the effect of P by estimating a (cid:101)P that makes\n\n\u00afAP\n\u0001 block diagonal despite the presence of the estimation noise \u0001.\n\n3.2.1 Spectral properties of the global transition matrix\nLemma 2:\nAssuming that each of the local transition matrices A1:K has only one eigenvalue which is 1, the\nglobal transition matrix \u00afA has K eigenvalues which are 1.\nProof:\n\n\uf8ee\uf8ef\uf8f0V1\u039b1V \u22121\n\n1\n\n0\n0\n\n\u00afA =\n\n0\n\n. . .\n...\n0 VK\u039bKV \u22121\n\n0\n\nK\n\n\uf8f9\uf8fa\uf8fb =\n\n\uf8ee\uf8ef\uf8f0V1 . . .\n(cid:124)\n\n0\n0\n\n0\n\n...\n0\n0 VK\n\n\uf8f9\uf8fa\uf8fb\n\n\uf8ee\uf8ef\uf8f0\u039b1 . . .\n(cid:123)(cid:122)\n\n0\n0\n\n0\n\n...\n0\n0 \u039bK\n\n\u00afV \u00af\u039b \u00afV \u22121\n\n\uf8f9\uf8fa\uf8fb\n\n\uf8ee\uf8ef\uf8f0V1 . . .\n\n0\n0\n\n0\n\n...\n0\n0 VK\n\n\uf8f9\uf8fa\uf8fb\u22121\n(cid:125)\n\n,\n\nk\n\nwhere Vk\u039bkV \u22121\nis the eigenvalue decomposition of Ak with Vk as eigenvectors, and \u039bk as a di-\nagonal matrix with eigenvalues on the diagonal. The eigenvalues of A1:K appear unaltered in the\n(cid:3)\neigenvalue decomposition of \u00afA, and consequently \u00afA has K eigenvalues which are 1.\nCorollary 1:\n\n\u00afAe =(cid:2)\u00afv11(cid:62)\n\nlim\ne\u2192\u221e\n\nM . . . \u00afvk1(cid:62)\n\n(6)\n\nwhere \u00afvk = [0(cid:62) . . . v(cid:62)\n\nk . . . 0(cid:62)](cid:62) and vk is the stationary distribution of Ak, \u2200k \u2208 {1, . . . , K}.\n\nM\n\nM . . . \u00afvK1(cid:62)\n\n(cid:3) ,\n\uf8f9\uf8fa\uf8fa\uf8fb V \u22121\n\uf8ee\uf8ef\uf8ef\uf8f01 0 . . . 0\n\n0 0 . . . 0\n\n...\n\nProof:\n\ne\u2192\u221e(Vk\u039bkV \u22121\n\nlim\n\nk\n\n)e = lim\n\ne\u2192\u221e Vk\u039be\n\nkV \u22121\n\nk = Vk\n\nk = vk1(cid:62)\nM .\n\n0 0 . . . 0\n\nThe third step follows because there is only one eigenvalue with magnitude 1. 
Since multiplying Ā by itself amounts to multiplying the corresponding diagonal blocks, we have the structure in (6). □\n\nNote that equation (6) points out that the matrix lim_{e→∞} Ā^e consists of K diagonal blocks of size M × M, where the kth block is v_k 1_M^⊤. A straightforward algorithm can now be developed for making Ā^P block diagonal. Since the eigenvalue decomposition is invariant under permutation, Ā and Ā^P have the same eigenvalues and eigenvectors. As e → ∞, K clusters of columns appear in (Ā^P)^e. Thus, Ā^P can be made block-diagonal by clustering the columns of (Ā^P)^∞. This idea is illustrated in the middle row of Figure 1. Note that, in an actual implementation, one would use a low-rank reconstruction, zeroing out the eigenvalues that are not equal to 1 in Λ̄ to form (Ā^P)^r := V̄^P (Λ̄^P)^r (V̄^P)^{−1} = (Ā^P)^∞, where (Λ̄^P)^r ∈ R^{MK×MK} is a diagonal matrix with only K non-zero entries, corresponding to the eigenvalues which are 1.\n\nThis algorithm corresponds to the noiseless case Ā^P. In practice, the output of the learning algorithm is Ā_ε^P, and the clear structure in Equation (6) no longer holds in (Ā_ε^P)^e as e → ∞, as illustrated in the bottom row of Figure 1. We can see that the three-cluster structure no longer holds for large e. Instead, the columns of the transition matrix converge to a global stationary distribution.\n\nFigure 1: (Top left) Block-diagonal transition matrix after e-fold exponentiation. Each block converges to its own stationary distribution. (Top right) Same as above, with permutation. (Bottom) Corrupted and permuted transition matrix after exponentiation. The true number K = 3 of HMMs is clear for intermediate values of e, but as e → ∞, the columns of the matrix converge to a global stationary distribution.\n\n3.2.2 Estimating the permutation in the presence of noise\n\nIn the general case with noise ε, we lose the spectral property that the global transition matrix has K eigenvalues which are 1. Consequently, the algorithm described in Section 3.2.1 cannot be applied directly to make Ā_ε^P block diagonal. In practice, the estimated transition matrix has only one eigenvalue with unit magnitude, and lim_{e→∞} (Ā_ε^P)^e converges to a global stationary distribution. However, if the noise ε is sufficiently small, a depermutation mapping P̃ and the number of HMM clusters K can still be successfully estimated. We now specify the spectral conditions for this.\n\nDefinition 1: We denote λ_k^G := α_k λ_{1,k} for k ∈ {1, . . . , K} as the global, noisy eigenvalues, with |λ_k^G| ≥ |λ_{k+1}^G|, ∀k ∈ {1, . . . , K − 1}, where λ_{1,k} is the original eigenvalue of the kth cluster with magnitude 1 and α_k is the noise that acts on that eigenvalue (note that α_1 = 1). We denote λ_{j,k}^L := β_{j,k} λ_{j,k} for j ∈ {2, . . . , M} and k ∈ {1, . . . , K} as the local, noisy eigenvalues, with |λ_{j,k}^L| ≥ |λ_{j+1,k}^L|, ∀k ∈ {1, . . . , K} and ∀j ∈ {1, . . . , M − 1}, where λ_{j,k} is the original eigenvalue with the jth largest magnitude in the kth cluster, and β_{j,k} is the noise that acts on that eigenvalue.\n\nDefinition 2: The low-rank eigendecomposition of the estimated transition matrix Ā_ε^P is defined as A_ε^r := V Λ^r V^{−1}, where V is a matrix with the eigenvectors in its columns and Λ^r is a diagonal matrix with the eigenvalues λ_{1:K}^G in its first K entries.\n\nConjecture 1: If |λ_K^G| > max_{k∈{1,...,K}} |λ_{2,k}^L|, then A^r can be formed using the eigendecomposition of Ā_ε^P. Then, with high probability, ‖A_ε^r − A^r‖_F ≤ O(1/√(TN)), where TN is the total number of observed vectors.\n\nJustification:\n\n‖A_ε^r − A^r‖_F = ‖A_ε^r − A + A − A^r‖_F ≤ ‖A_ε^r − A‖_F + ‖A − A^r‖_F\n = ‖A − A^r‖_F + ‖A − A_ε + A_ε^r̄‖_F ≤ ‖A − A^r‖_F + ‖A_ε^r̄‖_F + ‖A − A_ε‖_F\n ≤ 2KM + O(1/√(TN)) = O(1/√(TN)), w.h.p.,\n\nwhere A is used for Ā^P to reduce the notation clutter (and similarly A^r for (Ā^P)^r, and so on), we used the triangle inequality for the first and second inequalities, and A_ε^r̄ = V Λ^r̄ V^{−1}, where Λ^r̄ is a diagonal matrix of eigenvalues with the first K diagonal entries equal to zero (the complement of Λ^r). For the last inequality, we used the fact that A ∈ R^{MK×MK} has entries in the interval [0, 1], and we used the sample complexity result from [1]. 
The bound specified in [1] is for a mixture model, but since the two models are similar and the estimation procedure is almost identical, we reuse it. We believe that a further analysis of the spectral learning algorithm is out of the scope of this paper, so we leave this proposition as a conjecture. □\n\nConjecture 1 asserts that, if we have enough data, we should obtain an estimate A_ε^r that is close to A^r in the squared error sense. Furthermore, if the following mixing rate condition is satisfied, we will be able to identify the number of clusters K from the data.\n\nDefinition 3: Let λ̃_k denote the kth largest eigenvalue (in decreasing order) of the estimated transition matrix Ā_ε^P. We define the quantity\n\nL_{λ̃_{K′}} := Σ_{e=1}^{∞} ( [ Σ_{l=1}^{K′} |λ̃_l|^e / Σ_{l′=1}^{MK} |λ̃_{l′}|^e > 1 − γ ] − [ Σ_{l=1}^{K′−1} |λ̃_l|^e / Σ_{l′=1}^{MK} |λ̃_{l′}|^e > 1 − γ ] )  (7)\n\nas the spectral longevity of λ̃_{K′}. The square brackets [·] denote an indicator function which outputs 1 if the argument is true and 0 otherwise, and γ is a small number such as machine epsilon.\n\nLemma 3: If |λ_K^G| > max_{k∈{1,...,K}} |λ_{2,k}^L| and arg max_{K′} |λ̃_{K′}|² / (|λ̃_{K′+1}| |λ̃_{K′−1}|) = K for K′ ∈ {2, 3, . . . , MK − 1}, then arg max_{K′} L_{λ̃_{K′}} = K.\n\nProof: The first condition ensures that the top K eigenvalues are global eigenvalues. The second condition is about the convergence rates of the two ratios in equation (7). The first indicator function has the following summation inside:\n\nΣ_{l=1}^{K′} |λ̃_l|^e / Σ_{l′=1}^{MK} |λ̃_{l′}|^e = ( Σ_{l=1}^{K′−1} |λ̃_l|^e + |λ̃_{K′}|^e ) / ( Σ_{l′=1}^{K′−1} |λ̃_{l′}|^e + |λ̃_{K′}|^e + |λ̃_{K′+1}|^e + Σ_{l′=K′+2}^{MK} |λ̃_{l′}|^e ).\n\nThe rate at which this term goes to 1 is determined by the spectral gap |λ̃_{K′}| / |λ̃_{K′+1}|: the larger this ratio is, the faster the term (which is non-decreasing w.r.t. e) converges to 1. For the second indicator function inside L_{λ̃_{K′}}, the same analysis shows that its convergence rate is determined by the gap |λ̃_{K′−1}| / |λ̃_{K′}|. The ratio of the two spectral gaps determines the spectral longevity. Hence, for the K′ with the largest ratio |λ̃_{K′}|² / (|λ̃_{K′+1}| |λ̃_{K′−1}|), we have arg max_{K′} L_{λ̃_{K′}} = K. □\n\nLemma 3 tells us the following: if the estimated transition matrix Ā_ε^P is not too noisy, we can determine the number of clusters by choosing the value of K′ that maximizes L_{λ̃_{K′}}. This corresponds to exponentiating the sorted eigenvalues over a finite range and recording the number of non-negligible eigenvalues. This is depicted in Figure 2.\n\nFigure 2: (Left) Number of significant eigenvalues across exponentiations. (Right) Spectral longevity L_{λ̃_{K′}} with respect to the eigenvalue index K′.\n\n3.3 Proposed Algorithm\n\nIn the previous sections, we have shown that the permutation caused by the MoM estimation procedure can be undone, and we have proposed a way to estimate the number of clusters K. 
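To make the depermutation machinery concrete, the following minimal numpy sketch builds a permuted block-diagonal transition matrix, estimates K via the spectral longevity of Eq. (7), and recovers the block structure by grouping the columns of the rank-K reconstruction. All block matrices, the permutation, and the helper names are hypothetical toy choices, not our actual implementation; this is the noiseless case, where exact column matching stands in for the k-means step needed under estimation noise:

```python
import numpy as np

def block(p, q):
    """2 x 2 column-stochastic transition matrix with eigenvalues 1 and 1 - p - q."""
    return np.array([[1 - p, q], [p, 1 - q]])

def spectral_longevity_K(A, gamma=1e-10, e_max=200):
    """Estimate K as the argmax of the spectral longevity, truncating e at e_max."""
    lam = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    L = []
    for Kp in range(1, len(lam)):               # K' = 1, ..., MK - 1
        on = on_prev = 0
        for e in range(1, e_max):
            tot = np.sum(lam ** e)
            on += int(np.sum(lam[:Kp] ** e) / tot > 1 - gamma)
            on_prev += int(np.sum(lam[:Kp - 1] ** e) / tot > 1 - gamma)
        L.append(on - on_prev)
    return 1 + int(np.argmax(L))

def depermute(A_perm, K):
    """Rank-K reconstruction, then cluster identical columns to undo the permutation."""
    w, V = np.linalg.eig(A_perm)
    keep = np.argsort(-np.abs(w))[:K]           # the K (near-)unit eigenvalues
    w_r = np.zeros_like(w)
    w_r[keep] = w[keep]
    A_r = np.real(V @ np.diag(w_r) @ np.linalg.inv(V))   # (A^P)^r = lim_e (A^P)^e
    labels = -np.ones(A_perm.shape[0], dtype=int)
    c = 0
    for j in range(A_perm.shape[0]):            # columns of the same block coincide
        if labels[j] < 0:
            labels[np.linalg.norm(A_r - A_r[:, [j]], axis=0) < 1e-6] = c
            c += 1
    perm = np.argsort(labels, kind="stable")
    return A_perm[np.ix_(perm, perm)], labels

# noiseless toy example: K = 3 HMMs with M = 2 states each
blocks = [block(0.3, 0.4), block(0.6, 0.2), block(0.15, 0.45)]
Abar = np.zeros((6, 6))
for k, Ak in enumerate(blocks):
    Abar[2 * k:2 * k + 2, 2 * k:2 * k + 2] = Ak
P = np.eye(6)[[3, 0, 5, 2, 4, 1]]               # an arbitrary permutation matrix
A_perm = P @ Abar @ P.T

K_hat = spectral_longevity_K(A_perm)            # recovers K = 3
A_dep, labels = depermute(A_perm, K_hat)        # block diagonal up to block/state order
```

The recovered matrix matches the original up to a reordering of the blocks and of the states within each block, which is exactly the residual ambiguity discussed above.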
We summarize the whole procedure in Algorithm 1.\n\nAlgorithm 1 Spectral Learning for Mixture of Hidden Markov Models\nInputs: x_{1:N}: sequences, MK: total number of states of the global HMM.\nOutput: θ̂ = (Ô_{1:K̂}, Â_{1:K̂}): MHMM parameters.\n\nMethod of Moments Parameter Estimation:\n (Ō_ε^P, Ā_ε^P) = HMM_MethodofMoments(x_{1:N}, MK)\n\nDepermutation:\n Find the eigenvalues of Ā_ε^P via eigendecomposition.\n Exponentiate the eigenvalues for each discrete value e in a sufficiently large range.\n Identify K̂ as the eigenvalue index with the largest longevity.\n Compute the rank-K̂ reconstruction A_ε^r.\n Cluster the columns of A_ε^r into K̂ clusters to find a depermutation mapping P̃ via the cluster labels.\n Depermute Ō_ε^P and Ā_ε^P according to P̃.\n Form θ̂ by choosing the corresponding blocks from the depermuted Ō_ε^P and Ā_ε^P.\n Return θ̂.\n\n4 Experiments\n\n4.1 Effect of noise on the depermutation algorithm\n\nWe have tested the algorithm's performance with respect to the amount of data. We used the parameters K = 3, M = 4, L = 20, and we have 2 sequences of length T for each cluster. We used a Gaussian observation model with unit observation variance, and the columns of the emission matrices O_{1:K} were drawn from a zero-mean spherical Gaussian with variance 2. Results for 10 uniformly spaced sequence lengths from 10 to 1000 are shown in Figure 3.\n\nFigure 3: Top row: Euclidean distance vs. T. Second row: noisy input matrix. Third row: noisy reconstruction A_ε^r. Bottom row: depermuted matrix; the numbers at the bottom indicate the estimated number of clusters.\n\n
On the top row, we plot the total error (from centroid to point) obtained after fitting k-means with the true number of HMM clusters. We can see that the correct number of clusters, K = 3, as well as the block-diagonal structure of the transition matrix, is correctly recovered even in the case where T = 20.\n\n4.2 Amount of data vs. accuracy and speed\n\nWe have compared the clustering accuracies of EM and our approach on data sampled from a Gaussian emission MHMM. The mean of each state of each cluster is drawn from a zero-mean, unit-variance Gaussian, and the observation covariance is spherical with variance 2. We set L = 20, K = 5, M = 3. We used uniform mixing proportions and a uniform initial state distribution. We evaluated the clustering accuracies for 10 uniformly spaced sequence lengths (every sequence has the same length) between 20 and 200, and 10 uniformly spaced numbers of sequences between 1 and 100 for each cluster. The results are shown in Figure 4. Although EM seems to provide higher accuracy in regions where we have less data, the spectral algorithm is much faster. Note that for the spectral algorithm, we include the time spent on moment computation. We used four restarts for EM, took the result with the highest likelihood, and used an automatic stopping criterion.\n\nFigure 4: Clustering accuracy and run time results for synthetic data experiments.\n\nTable 1: Clustering accuracies for handwritten digit dataset.\n\nAlgorithm | 1v2 | 1v3 | 1v4 | 2v3 | 2v4 | 2v5\nSpectral | 100 | 70 | 54 | 83 | 99 | 99\nEM init. w/ Spectral | 100 | 99 | 100 | 96 | 100 | 100\nEM init. at Random | 96 | 99 | 98 | 83 | 100 | 100\n\n4.3 Real data experiment\n\nWe ran an experiment on the handwritten character trajectory dataset from the UCI machine learning repository [8]. We formed pairs of characters and compared the clustering results for three algorithms: the proposed spectral learning approach, EM initialized at random, and EM initialized with the MoM algorithm, as explored in [9]. We take the maximum accuracy of EM over 5 random initializations in the third row. We set the algorithm parameters to K = 2 and M = 4. There are 140 sequences of average length 100 per class. In the original data, L = 3, but to apply MoM learning, we require that MK < L. To achieve this, we transformed the data vectors with a cubic polynomial feature transformation such that L = 10 (this is the same transformation that corresponds to a polynomial kernel). The results from these trials are shown in Table 1. We can see that although spectral learning doesn't always surpass randomly initialized EM on its own, it does serve as a very good initialization scheme.\n\n5 Conclusions and future work\n\nWe have developed a method of moments based algorithm for learning mixtures of HMMs. Our experimental results show that our approach is computationally much cheaper than EM, while being comparable in accuracy. Our real data experiment also shows that our approach can be used as a good initialization scheme for EM. As future work, it would be interesting to apply the proposed approach to other hierarchical latent variable models.\n\nAcknowledgements: We would like to thank Taylan Cemgil, David Forsyth and John Hershey for valuable discussions. This material is based upon work supported by the National Science Foundation under Grant No. 1319708.\n\nReferences\n\n[1] A. Anandkumar, D. Hsu, and S.M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.\n\n[2] A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. 
arXiv:1210.7559v2, 2012.\n\n[3] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, pages 1460-1480, 2009.\n\n[4] P. Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, 1997.\n\n[5] Yuting Qi, J.W. Paisley, and L. Carin. Music analysis using hidden Markov mixture models. IEEE Transactions on Signal Processing, 55(11):5209-5224, November 2007.\n\n[6] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic. Discovering clusters in motion time-series data. In CVPR, 2003.\n\n[7] Tim Oates, Laura Firoiu, and Paul R. Cohen. 
Clustering time series with hidden Markov models and dynamic time warping. In Proceedings of the IJCAI-99 Workshop on Neural, Symbolic and Reinforcement Learning Methods for Sequence Learning, pages 17-21, 1999.\n\n[8] K. Bache and M. Lichman. UCI machine learning repository, 2013.\n\n[9] Arun Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.\n", "award": [], "sourceid": 1181, "authors": [{"given_name": "Cem", "family_name": "Subakan", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Johannes", "family_name": "Traa", "institution": "UIUC"}, {"given_name": "Paris", "family_name": "Smaragdis", "institution": "University of Illinois Urbana-Champaign"}]}