{"title": "Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1674, "page_last": 1682, "abstract": "We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam's razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters.", "full_text": "Comparative Analysis of Viterbi Training and\nMaximum Likelihood Estimation for HMMs\n\nArmen Allahverdyan\u2217\nYerevan Physics Institute\n\nYerevan, Armenia\n\naarmen@yerphi.am\n\nAram Galstyan\n\nUSC Information Sciences Institute\n\nMarina del Rey, CA, USA\ngalstyan@isi.edu\n\nAbstract\n\nWe present an asymptotic analysis of Viterbi Training (VT) and contrast it with a\nmore conventional Maximum Likelihood (ML) approach to parameter estimation\nin Hidden Markov Models. While ML estimator works by (locally) maximizing\nthe likelihood of the observed data, VT seeks to maximize the probability of the\nmost likely hidden state sequence. We develop an analytical framework based on\na generating function formalism and illustrate it on an exactly solvable model of\nHMM with one unambiguous symbol. 
For this particular model the ML objective\nfunction is continuously degenerate. The VT objective, in contrast, is shown to have\nonly finite degeneracy. Furthermore, VT converges faster and results in sparser\n(simpler) models, thus realizing an automatic Occam\u2019s razor for HMM learning.\nFor a more general scenario, VT can perform worse than ML but is still capable of\ncorrectly recovering most of the parameters.\n\n1\n\nIntroduction\n\nHidden Markov Models (HMMs) provide one of the simplest examples of structured data observed\nthrough a noisy channel. The inference problems of HMMs naturally divide into two classes [20, 9]:\ni) recovering the hidden sequence of states given the observed sequence, and ii) estimating the model\nparameters (transition probabilities of the hidden Markov chain and/or conditional probabilities of\nobservations) from the observed sequence. The first class of problems is usually solved via the\nmaximum a posteriori (MAP) method and its computational implementation known as the Viterbi\nalgorithm [20, 9]. For the parameter estimation problem, the prevailing method is maximum likelihood (ML)\nestimation, which finds the parameters by maximizing the likelihood of the observed data. Since\nglobal optimization is generally intractable, in practice it is implemented through an expectation\u2013\nmaximization (EM) procedure known as the Baum\u2013Welch algorithm [20, 9].\nAn alternative approach to parameter learning is Viterbi Training (VT), also known in the literature\nas segmental K-means, the Baum\u2013Viterbi algorithm, classification EM, hard EM, etc. Instead of\nmaximizing the likelihood of the observed data, VT seeks to maximize the probability of the most\nlikely hidden state sequence. Maximizing the VT objective function is hard [8], so in practice it is\nimplemented via EM-style iterations that alternate between calculating the MAP sequence and\nadjusting the model parameters based on the sequence statistics. 
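As a purely illustrative sketch (ours, not from the paper), the hard-EM iteration just described can be written in a few lines: decode the MAP path with the current trial parameters, then re-estimate transitions and emissions from counts along that path. The toy two-state model, the smoothing floor, and all names are our own assumptions.

```python
import numpy as np

def viterbi_path(A, B, p0, x):
    """MAP state sequence via log-domain Viterbi.
    A[i, j] = p(j|i), B[i, m] = pi(m|i), p0 = initial distribution."""
    N, L = len(x), len(p0)
    logA = np.log(A + 1e-300)            # guard against log(0)
    logB = np.log(B + 1e-300)
    delta = np.log(p0 + 1e-300) + logB[:, x[0]]
    back = np.zeros((N, L), dtype=int)
    for t in range(1, N):
        cand = delta[:, None] + logA     # cand[i, j]: best score ending in j from i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[:, x[t]]
    path = [int(delta.argmax())]
    for t in range(N - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def viterbi_training_step(A, B, p0, x):
    """One hard-EM (VT) step: decode, then re-estimate from path counts."""
    s = viterbi_path(A, B, p0, x)
    A_new = np.full(A.shape, 1e-12)      # tiny floor avoids empty rows
    B_new = np.full(B.shape, 1e-12)
    for t in range(len(x) - 1):
        A_new[s[t], s[t + 1]] += 1.0
    for t in range(len(x)):
        B_new[s[t], x[t]] += 1.0
    return (A_new / A_new.sum(1, keepdims=True),
            B_new / B_new.sum(1, keepdims=True))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200)         # toy observation sequence
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
p0 = np.array([0.5, 0.5])
for _ in range(10):                      # iterate decode / re-estimate
    A, B = viterbi_training_step(A, B, p0, x)
```

The estimates remain valid stochastic matrices at every iteration; convergence is declared when the decoded path stops changing.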
It is known that VT lacks some of the desired features\nof ML estimation, such as consistency [17], and, in fact, it can produce biased estimates [9]. However,\nit has been shown to perform well in practice, which explains its widespread use in applications\nsuch as speech recognition [16], unsupervised dependency parsing [24], grammar induction [6], and ion\nchannel modeling [19]. It is generally assumed that VT is more robust and faster but usually less\naccurate, although for certain tasks it outperforms conventional EM [24].\n\n\u2217Currently at: Laboratoire de Physique Statistique et Systemes Complexes, ISMANS, Le Mans, France.\n\n1\n\n\fThe current understanding of when and under what circumstances one method should be preferred\nover the other is not well established. For HMMs with continuous observations, Ref. [18] established\nan upper bound on the difference between the ML and VT objective functions, and showed that both\napproaches produce asymptotically similar estimates when the dimensionality of the observation\nspace is very large. Note, however, that this asymptotic limit is not very interesting, as it makes\nthe structure imposed by the Markovian process irrelevant. A similar attempt to compare both\napproaches on discrete models (for stochastic context-free grammars) was presented in [23]. However,\nthe established bound was very loose.\nOur goal here is to understand, both qualitatively and quantitatively, the difference between the two\nestimation methods. We develop an analytical approach based on generating functions for\nexamining the asymptotic properties of both approaches. Previously, a similar approach was used for\ncalculating the entropy rate of a hidden Markov process [1]. Here we provide a non-trivial extension of\nthe methods that allows one to perform a comparative asymptotic analysis of ML and VT estimation. It is\nshown that both estimation methods correspond to a certain free-energy minimization problem at\ndifferent temperatures. 
Furthermore, we demonstrate the approach on a particular class of HMM with\none unambiguous symbol and obtain a closed\u2013form solution to the estimation problem. This class\nof HMMs is suf\ufb01ciently rich so as to include models where not all parameters can be determined\nfrom the observations, i.e., the model is not identi\ufb01able [7, 14, 9].\nWe \ufb01nd that for the considered model VT is a better option if the ML objective is degenerate (i.e., not\nall parameters can be obtained from observations). Namely, not only VT recovers the identi\ufb01able\nparameters but it also provides a simple (in the sense that non-identi\ufb01able parameters are set to\nzero) and optimal (in the sense of the MAP performance) solution. Hence, VT realizes an automatic\nOccam\u2019s razor for the HMM learning. In addition, we show that the VT algorithm for this model\nconverges faster than the conventional EM approach. Whenever the ML objective is not degenerate,\nVT leads generally to inferior results that, nevertheless, may be partially correct in the sense of\nrecovering certain (not all) parameters.\n\nPr[Sk+l = sk|Sk\u22121+l = sk\u22121] = p(sk|sk\u22121),\n\nthat S is mixing: it has a unique stationary distribution pst(s),(cid:80)L\n\n2 Hidden Markov Process\nLet S = {S0,S1,S2, ...} be a discrete-time, stationary, Markov process with conditional probability\n(1)\nwhere l is an integer. Each realization sk of the random variable Sk takes values 1, ..., L. We assume\nr=1p(s|r)pst(r) = pst(s), that is\nestablished from any initial probability in the long time limit.\nLet random variables Xi, with realizations xi = 1, .., M, be noisy observations of Si: the (time-\ninvariant) conditional probability of observing Xi = xi given the realization Si = si of the Markov\nprocess is \u03c0(xk|sk). 
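To make the setup concrete, here is a minimal sketch (ours, not from the paper) that computes a stationary distribution by power iteration and samples a hidden path together with its noisy observations, following the conventions just defined; the toy matrices are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationary(A, iters=500):
    """Stationary distribution of a row-stochastic A[i, j] = p(j|i),
    by power iteration (the chain is assumed mixing)."""
    p = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        p = p @ A
    return p

def sample_hmm(A, B, N):
    """Sample a hidden path s_0..s_N and observations x_1..x_N,
    with x_t drawn from pi(.|s_t); B[i, m] = pi(m|i)."""
    p_st = stationary(A)
    s = [rng.choice(A.shape[0], p=p_st)]
    x = []
    for _ in range(N):
        s.append(rng.choice(A.shape[0], p=A[s[-1]]))
        x.append(rng.choice(B.shape[1], p=B[s[-1]]))
    return np.array(s), np.array(x)

A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
B = np.array([[0.9, 0.1],
              [0.0, 1.0],
              [0.0, 1.0]])   # only state 0 ever emits symbol 0
s, x = sample_hmm(A, B, 1000)
```

The emission matrix here already mimics the one-unambiguous-symbol structure studied later: observing symbol 0 reveals the hidden state with certainty.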
De\ufb01ning x \u2261 (xN , ..., x1), s \u2261 (sN , ..., s0), the joint probability of S,X reads\n(2)\n\nP (s, x) = TsN sN\u22121(xN )...Ts1 s0(x1) pst(s0),\n\nwhere the L \u00d7 L transfer-matrix T (x) with matrix elements Tsi si\u22121(x) is de\ufb01ned as\n\n(3)\nX = {X1,X2, ...} is called a hidden Markov process. Generally, it is not Markov, but it inherits\nstationarity and mixing from S [9]. The probabilities for X can be represented as follows:\n\nTsi si\u22121(x) = \u03c0(x|si) p(si|si\u22121).\n\nP (x) =(cid:88)\n\nss(cid:48) [T(x)]ss(cid:48) pst(s(cid:48)), T(x) \u2261 T (xN )T (xN\u22121) . . . T (x1),\n\n(4)\n\nwhere T(x) is a product of transfer matrices.\n\n3 Parameter Estimation\n\n3.1 Maximum Likelihood Estimation\nThe unknown parameters of an HMM are the transition probabilities p(s|s(cid:48)) of the Markov process\nand the observation probabilities \u03c0(x|s); see (2). They have to be estimated from the observed\n\n2\n\n\fsequence x. This is standardly done via the maximum-likelihood approach: one starts with some\ntrial values \u02c6p(s|s(cid:48)), \u02c6\u03c0(x|s) of the parameters and calculates the (log)-likelihood ln \u02c6P (x), where \u02c6P\nmeans the probability (4) calculated at the trial values of the parameters. Next, one maximizes\nln \u02c6P (x) over \u02c6p(s|s(cid:48)) and \u02c6\u03c0(x|s) for the given observed sequence x (in practice this is done via the\nBaum-Welch algorithm [20, 9]). The rationale of this approach is as follows. Provided that the\nlength N of the observed sequence is long, and recaling that X is mixing (due to the analogous\nfeature of S) we get probability-one convergence (law of large numbers) [9]:\n\nln \u02c6P (x) \u2192(cid:88)\nxP (x) ln[P (x)/ \u02c6P (x)] \u2265 0, the global maximum of(cid:80)\n\nP (y) ln \u02c6P (y),\n\ny\n\nentropy is non-negative,(cid:80)\n\nwhere the average is taken over the true probability P (...) that generated x. 
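The transfer-matrix representation (4) is easy to verify numerically: the ordered matrix product must reproduce the brute-force sum of P(s, x) over all hidden paths. Below is a small self-contained check (our own sketch; the toy parameters are assumptions).

```python
import numpy as np
from itertools import product

def transfer(A, B, m):
    """T(m)_{s,s'} = pi(m|s) p(s|s'), with row-stochastic A[i, j] = p(j|i)."""
    return B[:, m][:, None] * A.T

def seq_prob(A, B, p_st, xs):
    """P(x) = sum_{s,s'} [T(x_N)...T(x_1)]_{s,s'} p_st(s'), cf. Eq. (4)."""
    v = p_st.copy()
    for m in xs:                       # apply T(x_1) first, T(x_N) last
        v = transfer(A, B, m) @ v
    return float(v.sum())

def brute_force_prob(A, B, p_st, xs):
    """Direct sum of P(s, x) over all hidden paths s_0..s_N, cf. Eq. (2)."""
    L, N = A.shape[0], len(xs)
    total = 0.0
    for s in product(range(L), repeat=N + 1):
        p = p_st[s[0]]
        for t in range(1, N + 1):
            p *= A[s[t - 1], s[t]] * B[s[t], xs[t - 1]]
        total += p
    return total

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
p_st = np.array([4 / 7, 3 / 7])        # satisfies p_st = p_st @ A
xs = [0, 1, 1, 0, 1]
```

Summing `seq_prob` over all observation sequences of a fixed length returns 1, confirming that the transfer matrices sum to a stochastic matrix.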
Since the relative\nxP (x) ln \u02c6P (x)\nas a function of \u02c6p(s|s(cid:48)) and \u02c6\u03c0(x|s) is reached for \u02c6p(s|s(cid:48)) = p(s|s(cid:48)) and \u02c6\u03c0(x|s) = \u03c0(x|s). This\nargument is silent on how unique this global maximum is and how dif\ufb01cult to reach it.\n\n(5)\n\n3.2 Viterbi Training\n\nAn alternative approach to the parameter learning employs the maximal a posteriori (MAP) estima-\ntion and proceeds as follows: Instead of maximizing the likelihood of observed data (5) one tries to\nmaximize the probability of the most likely sequence [20, 9]. Given the joint probability \u02c6P (s, x) at\ntrial values of parameters, and given the observed sequence x, one estimates the generating state-\nsequence s via maximizing the a posteriori probability\n\n\u02c6P (s|x) = \u02c6P (s, x)/ \u02c6P (x)\n\n(6)\nover s. Since \u02c6P (x) does not depend on s, one can maximize ln \u02c6P (s, x). If the number of obser-\nvations is suf\ufb01ciently large N \u2192 \u221e, one can substitute maxs ln \u02c6P (s, x) by its average over P (...)\n[see (5)] and instead maximize (over model parameters)\n\nP (x) maxs ln \u02c6P (s, x).\n\n(7)\n\n(cid:104)(cid:88)\n\n(cid:105)\n\nTo relate (7) to the free energy concept (see e.g. [2, 4]), we de\ufb01ne an auxiliary (Gibbsian) probability\n\n\u02c6\u03c1\u03b2(s|x) = \u02c6P \u03b2(s, x)/\n\n\u02c6P \u03b2(s(cid:48), x)\n\n(8)\nwhere \u03b2 > 0 is a parameter. As a function of s (and for a \ufb01xed x), \u02c6\u03c1\u03b2\u2192\u221e(s|x) concentrates on\nthose s that maximize ln \u02c6P (s, x):\n\ns(cid:48)\n\n,\n\nwhere \u03b4(s, s(cid:48)) is the Kronecker delta,(cid:101)s[j](x) are equivalent outcomes of the maximization, and N\n\nj\n\nis the number of such outcomes. 
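The Gibbsian probability (8) and its concentration as the inverse temperature grows can be illustrated by brute-force enumeration on a tiny model (a sketch under our own toy parameters; all entries are kept strictly positive so every path has finite log-probability).

```python
import numpy as np
from itertools import product

def log_joint(A, B, p_st, s, xs):
    """ln P(s, x) for a path s_0..s_N and observations x_1..x_N."""
    lp = np.log(p_st[s[0]])
    for t in range(1, len(s)):
        lp += np.log(A[s[t - 1], s[t]]) + np.log(B[s[t], xs[t - 1]])
    return lp

def gibbs(A, B, p_st, xs, beta):
    """rho_beta(s|x) proportional to P(s, x)^beta, by full enumeration."""
    paths = list(product(range(A.shape[0]), repeat=len(xs) + 1))
    w = np.array([beta * log_joint(A, B, p_st, s, xs) for s in paths])
    w = np.exp(w - w.max())            # stabilize before normalizing
    return paths, w / w.sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
p_st = np.array([4 / 7, 3 / 7])
xs = [0, 1, 0]

paths, rho1 = gibbs(A, B, p_st, xs, beta=1.0)     # posterior P(s|x)
_, rho_inf = gibbs(A, B, p_st, xs, beta=200.0)    # nearly all mass on the MAP path
```

At beta = 1 the weights reproduce the posterior used by ML/Baum-Welch; at large beta the distribution collapses onto the Viterbi path, which is exactly the beta-interpolation between the two objectives described above.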
Further, de\ufb01ne\n\n(9)\n\n\u02c6\u03c1\u03b2\u2192\u221e(s|x) \u2192 1\nN\n\n\u03b4[s,(cid:101)s[j](x)],\n\n(cid:88)\nP (x) ln(cid:88)\n\nx\n\ns\n\n(cid:88)\n\nF\u03b2 \u2261 \u2212 1\n\u03b2\n\n(cid:88)\n\nx\n\n\u02c6P \u03b2(s, x).\n\n(10)\n\nWithin statistical mechanics Eqs. 8 and 10 refer to, respectively, the Gibbs distribution and free\nenergy of a physical system with Hamiltonian H = \u2212 ln P (s, x) coupled to a thermal bath at\ninverse temperature \u03b2 = 1/T [2, 4].\nIt is then clear that ML and Viterbi Training correspond\nto minimizing the free energy Eq. 10 at \u03b2 = 1 and \u03b2 = \u221e, respectively. Note that \u03b22\u2202\u03b2F =\n\ns\u03c1\u03b2(s|x) ln \u03c1\u03b2(s|x) \u2265 0, which yields F1 \u2264 F\u221e.\n\n\u2212(cid:80)\nxP (x)(cid:80)\n\n3.3 Local Optimization\n\nAs we mentioned, global maximization of neither objective is feasible in the general case. Instead,\nin practice this maximization is locally implemented via an EM-type algorithm [20, 9]: for a given\nobserved sequence x, and for some initial values of the parameters, one calculates the expected value\nof the objective function under the trial parameters (E-step), and adjusts the parameters to maximize\nthis expectation (M-step). The resulting estimates of the parameters are now employed as new trial\nparameters and the previous step is repeated. This recursion continues till convergence.\n\n3\n\n\fe\u03b2\u03b3PN\n\ni=1\u03b4(si+1,a)\u03b4(si,b) and de\ufb01ne\n\nFor our purposes, this procedure can be understood as calculating certain statistics of the hid-\nIndeed, let us introduce f\u03b3(s) \u2261\nden sequence averaged over the Gibbs distribution Eqs. 
8.\n\nP (x) ln(cid:88)\n\n\u03b2F\u03b2(\u03b3) \u2261 \u2212(cid:88)\n(cid:101)P (Sk+1 = a, Sk = b) = \u2212\u2202\u03b3[F\u221e(\u03b3)]|\u03b3\u21920.\n\n(11)\nThen, for instance, the (iterative) Viterbi estimate of the transition probabilities are given as follows:\n(12)\nConditional probabilities for observations are calculated similarly via a different indicator function.\n\n\u02c6P \u03b2(s, x)f\u03b3(s).\n\nx\n\ns\n\n4 Generating Function\n\n1\nN\n\nP (y) ln \u03bb[T(y)],\n\nln||T(x)|| \u2192 1\nN\n\nNote from (4) that both P (x) and \u02c6P (x) are obtained as matrix-products. For a large number of\nmultipliers the behavior of such products is governed by the multiplicative law of large numbers.\nWe now recall its formulation from [10]: for N \u2192 \u221e and x generated by the mixing process X\nthere is a probability-one convergence:\n\n(cid:88)\n(13)\nwhere ||...|| is a matrix norm in the linear space of L \u00d7 L matrices, and \u03bb[T(x)] is the maximal\npears for N \u2192 \u221e [10]. Eqs. (4, 13) also imply(cid:80)\neigenvalue of T(x). Note that (13) does not depend on the speci\ufb01c norm chosen, because all norms\nin the \ufb01nite-dimensional linear space are equivalent; they differ by a multiplicative factor that disap-\nx\u03bb[T(x)] \u2192 1. Altogether, we calculate (5) via\n(cid:88)\n\n(14)\nNote that the multiplicative law of large numbers is normally formulated for the maximal singular\nvalue. 
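The probability-one convergence (13) is a self-averaging statement: independent long realizations of X give the same per-symbol growth rate of the transfer-matrix product. A quick numerical sketch (ours; toy parameters assumed, with periodic renormalization to avoid underflow):

```python
import numpy as np

rng = np.random.default_rng(42)

def log_growth_rate(A, B, N):
    """(1/N) ln || T(x_N)...T(x_1) p_st || along one sampled realization x."""
    L = A.shape[0]
    p = np.full(L, 1.0 / L)
    for _ in range(500):               # stationary distribution by power iteration
        p = p @ A
    s = rng.choice(L, p=p)
    v, acc = p.copy(), 0.0
    for _ in range(N):
        s = rng.choice(L, p=A[s])                  # hidden step
        m = rng.choice(B.shape[1], p=B[s])         # observation
        v = (B[:, m][:, None] * A.T) @ v           # apply T(x) = pi(x|s) p(s|s')
        z = v.sum()
        acc += np.log(z)                           # accumulate log-norm
        v /= z                                     # renormalize
    return acc / N

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
r1 = log_growth_rate(A, B, 20000)
r2 = log_growth_rate(A, B, 20000)      # independent realization, same limit
```

The two estimates agree to within sampling fluctuations of order 1/sqrt(N), illustrating the multiplicative law of large numbers used throughout this section.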
Its reformulation in terms of the maximal eigenvalue needs additional arguments [1].\nIntroducing the generating function\n\nits probability-one limit\n1\nN\n\nP (x) ln \u02c6P (x) \u2192 1\nN\n\n\u03bb[T(x)] ln \u03bb[\u02c6T(x)].\n\n(cid:88)\n\nx\n\ny\n\nx\n\n(15)\nwhere n is a non-negative number, and where \u039bN (n, N) means \u039b(n, N) in degree of N, one repre-\nsents (14) as\n\nx\n\n,\n\n\u03bb[T(x)]\u03bbn(cid:104)\u02c6T(x)\n(cid:105)\n\n\u039bN (n, N) =(cid:88)\n(cid:88)\n\n1\nN\n\nx\n\n\u03bb[T(x)] ln \u03bb[\u02c6T(x)] = \u2202n\u039b(n, N)|n=0,\n\n(16)\n\n(cid:20)\n\u2212(cid:88)\u221e\n\nzm\nm\n\n(cid:21)\n\nwhere we took into account \u039b(0, N) = 1, as follows from (15).\nThe behavior of \u039bN (n, N) is better understood after expressing it via the zeta-function \u03be(z, n) [1]\n\n(17)\nwhere \u039bm(n, m) \u2265 0 is given by (15). Since for a large N, \u039bN (n, N) \u2192 \u039bN (n) [this is the content\nof (13)], the zeta-function \u03be(z, n) has a zero at z = 1\n\n\u03be(z, n) = exp\n\n\u039bm(n, m)\n\nm=1\n\n,\n\n(18)\nIndeed for z close (but smaller than)\nalmost\ndiverges and one has \u03be(z, n) \u2192 1 \u2212 z\u039b(n). 
Recalling that \u039b(0) = 1 and taking n \u2192 0 in 0 =\ndn \u03be( 1\n\n\u039b(n) , n), we get from (16)\n\n[z\u039b(n)]m\n\nm=1\n\nm=1\n\nzm\n\nm\n\nd\n\n\u039b(n):\n\u03be(1/\u039b(n), n) = 0.\n1\n\n\u039b(n), the series(cid:80)\u221e\n\nm \u039bm(n, m) \u2192(cid:80)\u221e\n\n(cid:88)\n\n(19)\n\n(20)\n\nFor calculating \u2212\u03b2F\u03b2 in (10) we have instead of (19)\n\n1\nN\n\n\u03bb[T(x)] ln \u03bb[\u02c6T(x)] = \u2202n\u03be(1, 0)\n\u2202z\u03be(1, 0) .\n\nx\n\n\u2212 \u03b2F\u03b2\nN\n\n= \u2202n\u03be[\u03b2](1, 0)\n\u2202z\u03be[\u03b2](1, 0) ,\n\nwhere \u03be[\u03b2](z, n) employs \u02c6T \u03b2\nThough in this paper we restricted ourselves to the limit N \u2192 \u221e, we stress that the knowledge of\nthe generating function \u039bN (n, N) allows to analyze the learning algorithms for any \ufb01nite N.\n\nsi si\u22121(x) = \u02c6\u03c0\u03b2(x|si) \u02c6p\u03b2(si|si\u22121) instead of \u02c6Tsi si\u22121(x) in (19).\n\n4\n\n\fFigure 1: The hidden Markov process (21\u201322) for \u0001 = 0. Gray circles and arrows indicate on the\nrealization and transitions of the internal Markov process; see (21). The light circles and black\narrows indicate on the realizations of the observed process.\n\n5 Hidden Markov Model with One Unambiguous Symbol\n\n5.1 De\ufb01nition\nGiven a L-state Markov process S, the observed process X has two states 1 and 2; see Fig. 1. All\ninternal states besides one are observed as 2, while the internal state 1 produces, respectively, 1 and\n2 with probabilities 1 \u2212 \u0001 and \u0001. For L = 3 we obtain from (1) \u03c0(1|1) = 1 \u2212 \u03c0(2|1) = 1 \u2212 \u0001,\n\u03c0(1|2) = \u03c0(1|3) = \u03c0(2|1) = 0, \u03c0(2|2) = \u03c0(2|3) = 1. Hence 1 is unambiguous: if it is observed,\nthe unobserved process S was certainly in 1; see Fig. 1. The simplest example of such HMM exists\nalready for L = 2; see [12] for analytical features of entropy for this case. 
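The transfer matrices of this model are easy to build explicitly: in the column convention T(x)_{s,s'} = pi(x|s) p(s|s'), only the first row of T(1) is non-zero. A minimal sketch (ours; the particular column-stochastic chain is an arbitrary assumption):

```python
import numpy as np

def transfer_matrices(P_col, eps):
    """T(1) and T(2) for the one-unambiguous-symbol HMM.
    P_col[s, s'] = p(s|s') is column-stochastic; only internal state 1
    (index 0) emits the unambiguous symbol 1, with probability 1 - eps."""
    T1 = np.zeros_like(P_col)
    T1[0, :] = (1.0 - eps) * P_col[0, :]   # first row only, cf. Eq. (22)
    T2 = P_col - T1                        # everything else is seen as symbol 2
    return T1, T2

# an arbitrary column-stochastic L = 3 chain (columns sum to one)
P_col = np.array([[0.5, 0.4, 0.3],
                  [0.2, 0.1, 0.5],
                  [0.3, 0.5, 0.2]])
T1, T2 = transfer_matrices(P_col, eps=0.2)
```

Observing symbol 1 forces the hidden chain into state 1: all rows of T(1) below the first vanish, which is the algebraic reason the model is exactly solvable.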
We, however, describe\nin detail the L = 3 situation, since this case will be seen to be generic (in contrast to L = 2) and\nit allows straightforward generalizations to L > 3. The transition matrix (1) of a general L = 3\nMarkov process reads\n\n(cid:32) p0\n\n(cid:33)\n\n=\n\nq0\nr0\n\n(cid:33)\n\n(cid:32) 1 \u2212 p1 \u2212 p2\n\n1 \u2212 r1 \u2212 r2\n1 \u2212 r1 \u2212 r2\n\n(21)\n\nP \u2261 { p(s|s(cid:48))}3\n\ns,s(cid:48)=1 =\n\n(cid:33)\n\n,\n\n(cid:33)\n\n(cid:32) p0\n\np1\np2\n\n(cid:32) p0\n\nq1\nq0\nq2\n\nr1\nr2\nr0\n\nq1\n0\n0\n\nr1\n0\n0\n\n0\n0\n\nwhere, e.g., q1 = p(1|2) is the transition probability 2 \u2192 1; see Fig. 1. The corresponding transfer\nmatrices read from (3)\n\nT (1) = (1 \u2212 \u0001)\n\n, T (2) = P \u2212 T (1).\n\n(22)\n\nEq. (22) makes straightforward the reconstruction of the transfer-matrices for L \u2265 4. It should also\nbe obvious that for all L only the \ufb01rst row of T (1) consists of non-zero elements.\nFor \u0001 = 0 we get from (22) the simplest example of an aggregated HMM, where several Markov\nstates are mapped into one observed state. This model plays a special role for the HMM theory,\nsince it was employed in the pioneering study of the non-identi\ufb01ability problem [7].\n\n5.2 Solution of the Model\n\nFor this model \u03be(z, n) can be calculated exactly, because T (1) has only one non-zero row. Using\nthe method outlined in the supplementary material (see also [1, 3]) we get\nk\u22122tk\u22122 \u2212 \u02c6tn\nwhere \u03c4 and \u02c6\u03c4 are the largest eigenvalues of T (2) and \u02c6T (2), respectively\n\n0 ) +(cid:88)\u221e\n\n\u03be(z, n) = 1 \u2212 z(t0\u02c6tn\n\nk\u22121tk\u22121]zk\n\n0 + \u03c40\u02c6\u03c4 n\n\n[\u03c4 \u02c6\u03c4 n\u02c6tn\n\n(23)\n\nk=2\n\ntk \u2261 (cid:104)1|T (1)T (2)k|1(cid:105) =(cid:88)L\n\n(24)\n(25)\nHere |R\u03b1(cid:105) and (cid:104)L\u03b1| are, respectively right and left eigenvalues of T (2), while \u03c41, . . . 
, \u03c4L (\u03c4L \u2261 \u03c4)\nare the eigenvalues of T (2). Eqs. (24, 25) obviously extend to hatted quantities.\n\n\u03c8\u03b1 \u2261 (cid:104)1|T (1)|R\u03b1(cid:105)(cid:104)L\u03b1|1(cid:105),\n\n(cid:104)1| \u2261 (1, 0, . . . , 0).\n\n\u03c4 k\n\u03b1\u03c8\u03b1,\n\n\u03b1=1\n\n5\n\n12312q2r2p2q1r1p1\fWe get from (23, 19):\n\n(cid:17)\n\n,\n\n\u02c6tn\nk tk\n\n(cid:16)\n1 \u2212(cid:88)\u221e\n(cid:80)\u221e\n(cid:80)\u221e\n\nk=0tk ln[\u02c6tk]\nk=0(k + 1)tk\n\nk=0\n\n.\n\n\u03be(1, n) = (1 \u2212 \u02c6\u03c4 n\u03c4)\n\n\u2202n\u03be(1, 0)\n\u2202z\u03be(1, 0)\n\n=\n\n(26)\n\n(27)\n\n(33)\n\n(34)\n\n(35)\n\n\u0001 > 0 this interpretation does not hold, but tk still has a meaning of probability as(cid:80)\u221e\n\nNote that for \u0001 = 0, tk are return probabilities to the state 1 of the L-state Markov process. For\n\nk=0tk = 1.\n\nTurning to equations (19, 27) for the free energy, we note that as a function of trial values it depends\non the following 2L parameters:\n\n(\u02c6\u03c41, . . . , \u02c6\u03c4L, \u02c6\u03c81, . . . , \u02c6\u03c8L).\n\n(28)\nAs a function of the true values, the free energy depends on the same 2L parameters (28) [without\nhats], though concrete dependencies are different. For the studied class of HMM there are at most\nL(L \u2212 1) + 1 unknown parameters: L(L \u2212 1) transition probabilities of the unobserved Markov\nchain, and one parameter \u0001 coming from observations. We checked numerically that the Jacobian\nof the transformation from the unknown parameters to the parameters (28) has rank 2L \u2212 1. Any\n2L \u2212 1 parameters among (28) can be taken as independent ones.\nFor L > 2 the number of effective independent parameters that affect the free energy is smaller than\nthe number of parameters. So if the number of unknown parameters is larger than 2L \u2212 1, neither\nof them can be found explicitly. 
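The normalization sum_k t_k = 1 claimed above can be checked numerically from the definition t_k = <1| T(1) T(2)^k |1>. A self-contained sketch (ours; the chain below is an arbitrary column-stochastic assumption):

```python
import numpy as np

# column-stochastic p(s|s') for an L = 3 chain and the transfer matrices
# of the one-unambiguous-symbol model (only the first row of T(1) is non-zero)
P_col = np.array([[0.5, 0.4, 0.3],
                  [0.2, 0.1, 0.5],
                  [0.3, 0.5, 0.2]])
eps = 0.2
T1 = np.zeros_like(P_col)
T1[0, :] = (1.0 - eps) * P_col[0, :]
T2 = P_col - T1

def t_series(T1, T2, K):
    """t_k = <1| T(1) T(2)^k |1> for k = 0..K-1, cf. Eq. (24)."""
    e1 = np.array([1.0, 0.0, 0.0])
    out, w = [], e1.copy()
    for _ in range(K):
        out.append(float(e1 @ (T1 @ w)))   # current term t_k
        w = T2 @ w                          # advance T(2)^k |1>
    return np.array(out)

t = t_series(T1, T2, 400)
```

Since the column sums of T(2) are strictly below one, the series converges geometrically, and the partial sums approach 1 as claimed.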
One can only determine the values of the effective parameters.\n\n6 The Simplest Non-Trivial Scenario\n\nThe following example allows the full analytical treatment, but is generic in the sense that it contains\nall the key features of the more general situation given above (21). Assume that L = 3 and\n\nq0 = \u02c6q0 = r0 = \u02c6r0 = 0,\n\n\u0001 = \u02c6\u0001 = 0.\n\n(29)\n\nNote the following explicit expressions\n\n(30)\n(31)\n(32)\nEqs. (30\u201332) with obvious changes si \u2192 \u02c6si for every symbol si hold for \u02c6tk, \u02c6\u03c4k and \u02c6\u03c8k. Note a\n\nt0 = p0, t1 = p1q1 + p2r1, t2 = p1r1q2 + p2q1r2,\n\u03c41 = 0,\n\u03c83 \u2212 \u03c82 = t1/\u03c4, \u03c83 + \u03c82 = t2/\u03c4 2,\n\n\u03c42 = \u2212\u03c4,\n\n\u03c4 = \u03c43 =\n\nq2r2,\n\n\u221a\n\nconsequence of(cid:80)2\n\nk=0pk =(cid:80)2\n\nk=0qk =(cid:80)2\n\nk=0rk = 1:\n\n\u03c4 2(1 \u2212 t0) = 1 \u2212 t0 \u2212 t1 \u2212 t2.\n\n6.1 Optimization of F1\n\nEqs. (27) and (30\u201332) imply(cid:80)\u221e\n\nk=0(k + 1)tk = \u00b5\n\n1\u2212\u03c4 2 ,\n\n\u00b5 \u2261 1 \u2212 \u03c4 2 + t2 + (1 \u2212 t0)(1 + \u03c4 2) > 0,\n\u2212 \u00b5F1\nN\n\n= t1 ln \u02c6t1 + t2 ln \u02c6t2 + (1 \u2212 \u03c4 2)t0 ln \u02c6t0 + (1 \u2212 t0)\u03c4 2 ln \u02c6\u03c4 2 .\n\nThe free energy F1 depends on three independent parameters \u02c6t0, \u02c6t1, \u02c6t2 [recall (33)]. 
Hence, minimiz-\ning F1 we get \u02c6ti = ti (i = 0, 1, 2), but we do not obtain a de\ufb01nite solution for the unknown parame-\nters: any four numbers \u02c6p1, \u02c6p2, \u02c6q1, \u02c6r1 satisfying three equations t0 = 1 \u2212 \u02c6p1 \u2212 \u02c6p2, t1 = \u02c6p1 \u02c6q1 + \u02c6p2\u02c6r1,\nt2 = \u02c6p1\u02c6r1(1 \u2212 \u02c6q1) + \u02c6p2 \u02c6q1(1 \u2212 \u02c6r1), minimize F1.\n\n6.2 Optimization of F\u221e\nIn deriving (35) we used no particular feature of {\u02c6pk}2\n(20), the free energy at \u03b2 > 0 is recovered from (35) by equating its LHS to \u2212 \u03b2F\u03b2\n\nk=1, {\u02c6rk}2\n\nk=0, {\u02c6qk}2\n\nk=1. Hence, as seen from\nN and by taking in\n\n6\n\n\f0 , \u02c6\u03c4 2 \u2192 \u02c6q\u03b2\nits RHS: \u02c6t0 \u2192 \u02c6p\u03b2\nfree energy reads from (35)\n\n2 , \u02c6t1 \u2192 \u02c6p\u03b2\n\n2 \u02c6r\u03b2\n\n1 \u02c6q\u03b2\n\n1 + \u02c6p\u03b2\n\n2 \u02c6r\u03b2\n\n1 , \u02c6t2 \u2192 \u02c6p\u03b2\n\n1 \u02c6r\u03b2\n\n1 \u02c6q\u03b2\n\n2 + \u02c6p\u03b2\n\n2 \u02c6q\u03b2\n\n1 \u02c6r\u03b2\n\n2 . The zero-temperature\n\n\u2212 \u00b5F\u221e\nN\n\n= (1 \u2212 \u03c4 2)t0 ln \u02c6t0 + (1 \u2212 t0)\u03c4 2 ln \u02c6\u03c4 2 + t1 ln max[\u02c6p1 \u02c6q1, \u02c6p2\u02c6r1]\n\n{\u02c6\u03c3i}4\n\n(36)\nWe now minimize F\u221e over the trial parameters \u02c6p1, \u02c6p2, \u02c6q1, \u02c6r1. This is not what is done by the VT\nalgorithm; see the discussion after (12). But at any rate both procedures aim to minimize the same\ntarget. VT recursion for this models will be studied in section 6.3 \u2014 it leads to the same result.\nMinimizing F\u221e over the trial parameters produces four distinct solutions:\n\n+ t2 ln max[\u02c6p2 \u02c6q1\u02c6r2, \u02c6p1\u02c6r1 \u02c6q2].\n\ni=1 = {\u02c6p1 = 0, \u02c6p2 = 0, \u02c6q1 = 0, \u02c6r1 = 0}.\n\n(37)\nFor each of these four solutions: \u02c6ti = ti (i = 0, 1, 2) and F1 = F\u221e. 
The easiest way to get these\nresults is to minimize F\u221e under the conditions \u02c6ti = ti (for i = 0, 1, 2), obtain F1 = F\u221e, and then\nconclude that, due to the inequality F1 \u2264 F\u221e, the conditional minimization led to the global\nminimization. The logic of (37) is that the unambiguous state tends to get detached from the ambiguous\nones, since the probabilities that vanish in (37) refer to transitions from or to the unambiguous state.\nNote that although minimizing either F\u221e or F1 produces correct values of the independent\nvariables t0, t1, t2, in the present situation minimizing F\u221e is preferable, because it leads to the four-fold\ndegenerate set of solutions (37) instead of the continuously degenerate set. For instance, if the\nsolution with \u02c6p1 = 0 is chosen, we get for the other parameters\n\n\u02c6p2 = 1 \u2212 t0, \u02c6q1 = t2/(1 \u2212 t0 \u2212 t1), \u02c6r1 = t1/(1 \u2212 t0).\n\n(38)\n\nFurthermore, a more elaborate analysis reveals that for each fixed set of correct parameters only one\namong the four solutions (37) provides the best value for the quality of the MAP reconstruction,\ni.e., for the overlap between the original and MAP-decoded sequences.\nFinally, we note that minimizing F\u221e allows one to get the correct values t0, t1, t2 of the independent\nvariables \u02c6t0, \u02c6t1 and \u02c6t2 only if their number is less than the number of unknown parameters. This\nis not a drawback, since once the number of unknown parameters is sufficiently small [less than\nfour for the present case (29)], their exact values are obtained by minimizing F1. Even then, the\nminimization of F\u221e can provide partially correct answers. Assume in (36) that the parameter \u02c6r1 is\nknown, \u02c6r1 = r1. Now F\u221e has three local minima, given by \u02c6p1 = 0, \u02c6p2 = 0 and \u02c6q1 = 0; cf. with\n(37). 
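The sparse branch (38) can be sanity-checked in a few lines: compute the effective parameters t0, t1, t2 from some true parameters under the constraint (29), build the solution with the first trial transition probability set to zero, and verify that it reproduces the same effective parameters. The code below is our own check; the numeric "true" parameters are arbitrary assumptions.

```python
import numpy as np

def effective_params(p1, p2, q1, r1):
    """t0, t1, t2 of Eq. (30) for the constrained case (29):
    q0 = r0 = 0, so q2 = 1 - q1 and r2 = 1 - r1."""
    q2, r2 = 1.0 - q1, 1.0 - r1
    t0 = 1.0 - p1 - p2
    t1 = p1 * q1 + p2 * r1
    t2 = p1 * r1 * q2 + p2 * q1 * r2
    return np.array([t0, t1, t2])

def sparse_solution_p1_zero(t0, t1, t2):
    """The branch of the four VT solutions with the first trial transition
    probability set to zero, cf. Eq. (38)."""
    return 0.0, 1.0 - t0, t2 / (1.0 - t0 - t1), t1 / (1.0 - t0)

t = effective_params(0.2, 0.3, 0.4, 0.5)   # a "true" model (our choice)
sol = sparse_solution_p1_zero(*t)           # sparse VT estimate
```

The sparse estimate sets one parameter exactly to zero yet matches all three identifiable effective parameters, which is the Occam's-razor behavior discussed in the text.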
The minimum with \u02c6p2 = 0 is the global one, and it allows one to obtain the exact values of the\ntwo effective parameters: \u02c6t0 = 1 \u2212 \u02c6p1 = t0 and \u02c6t1 = \u02c6p1 \u02c6q1 = t1. These effective parameters are\nrecovered because they do not depend on the known parameter \u02c6r1 = r1. Two other minima have\ngreater values of F\u221e, and they allow one to recover only one effective parameter: \u02c6t0 = 1 \u2212 \u02c6p1 = t0.\nIf in addition to \u02c6r1 also \u02c6q1 is known, the two local minima of F\u221e (\u02c6p1 = 0 and \u02c6p2 = 0) allow one to\nrecover \u02c6t0 = t0 only. In contrast, if \u02c6p1 = p1 (or \u02c6p2 = p2) is known exactly, there are three local\nminima again, \u02c6p2 = 0, \u02c6q1 = 0, \u02c6r1 = 0, but now none of the effective parameters is equal to its true\nvalue: \u02c6ti \u2260 ti (i = 0, 1, 2).\n\n6.3 Viterbi EM\n\nRecall the description of the VT algorithm given after (12). For calculating \u02dcP (Sk+1 = a, Sk = b)\nvia (11, 12), we modify the transfer matrix element in (15, 17) as \u02c6Tab(k) \u2192 \u02c6Tab(k)e\u03b3, which\nproduces from (11, 12) the MAP estimates of the transition probabilities\n\n\u02dcp1 = [t1 \u02c6\u03c71 + t2 \u02c6\u03c72] / [t1 + t2 + t0(1 \u2212 \u03c4 2)], \u02dcp2 = 1 \u2212 t0 \u2212 \u02dcp1,\n\n(39)\n\n\u02dcq1 = [t1 \u02c6\u03c71 + t2(1 \u2212 \u02c6\u03c72)] / [t1 \u02c6\u03c71 + t2 + (1 \u2212 t0)\u03c4 2], \u02dcq2 = 1 \u2212 \u02dcq1,\n\n(40)\n\n\u02dcr1 = [t1(1 \u2212 \u02c6\u03c71) + t2 \u02c6\u03c72] / [t2 + t1(1 \u2212 \u02c6\u03c71) + (1 \u2212 t0)\u03c4 2], \u02dcr2 = 1 \u2212 \u02dcr1,\n\n(41)\n\nwhere \u02c6\u03c71 \u2261 \u02c6p1^\u03b2 \u02c6q1^\u03b2 / [\u02c6p1^\u03b2 \u02c6q1^\u03b2 + \u02c6p2^\u03b2 \u02c6r1^\u03b2] and \u02c6\u03c72 \u2261 \u02c6p1^\u03b2 \u02c6r1^\u03b2 \u02c6q2^\u03b2 / [\u02c6p1^\u03b2 \u02c6r1^\u03b2 \u02c6q2^\u03b2 + \u02c6p2^\u03b2 \u02c6r2^\u03b2 \u02c6q1^\u03b2]. The \u03b2 \u2192 \u221e limit of \u02c6\u03c71 and \u02c6\u03c72 is obvious: each\nof them is equal to 0 or 1, depending on the ratios \u02c6p1 \u02c6q1/(\u02c6p2 \u02c6r1) and \u02c6p1 \u02c6r1 \u02c6q2/(\u02c6p2 \u02c6r2 \u02c6q1).\n\n7\n\n\fThe EM approach amounts to starting with some trial values \u02c6p1, \u02c6p2, \u02c6q1, \u02c6r1 and using \u02dcp1, \u02dcp2, \u02dcq1, \u02dcr1 as new trial parameters (and\nso on). We see from (39\u201341) that the algorithm converges in just one step: (39\u201341) are equal to\nthe parameters given by one of the four solutions (37); which one of the solutions (37) is selected\ndepends on the initial trial parameters in (39\u201341). In either case the correct effective parameters\n(30\u201332) are recovered; e.g., cf. (38) with (39, 41) under \u02c6\u03c71 = \u02c6\u03c72 = 0. Hence, VT converges in one step, in\ncontrast to the Baum-Welch algorithm (which uses EM to locally minimize F1), which, for the present\nmodel, obviously does not converge in one step. There is possibly a deeper point in the one-step\nconvergence that can explain why in practice VT converges faster than the Baum-Welch algorithm\n[9, 21]: recall that, e.g., the Newton method for local optimization works in exactly one step for\nquadratic functions, but more generally there is a class of functions on which it performs faster than (say)\nthe steepest descent method. Further research should show whether our situation is similar: VT\nworks in just one step for this exactly solvable HMM model, which belongs to a class of models where\nVT generally performs faster than ML.\nWe conclude this section by noting that the solvable case (29) is generic: its key results extend to\nthe general situation defined above (21). 
We checked this fact numerically for several values of L.\nIn particular, the minimization of F\u221e nullifies as many trial parameters as necessary to express the\nremaining parameters via the independent effective parameters t0, t1, . . . Hence for L = 3 and \u03b5 = 0\ntwo such trial parameters are nullified; cf. the discussion around (28). If the true error probability\n\u03b5 \u2260 0, the trial value \u02c6\u03b5 is among the nullified parameters. Again, there is a discrete degeneracy in the\nsolutions provided by minimizing F\u221e.\n\n7 Summary\n\nWe presented a method for analyzing two basic techniques for parameter estimation in HMMs, and\nillustrated it on a specific class of HMMs with one unambiguous symbol. The virtue of this class\nof models is that it is exactly solvable; hence the sought quantities can be obtained in closed\nform via generating functions. This is a rare occasion, because characteristics of HMMs such as\nthe likelihood or the entropy are notoriously difficult to calculate explicitly [1]. An important feature of the\nexample considered here is that the set of unknown parameters is not completely identifiable in the\nmaximum likelihood sense [7, 14]. This corresponds to a zero eigenvalue of the Hessian of the\nML (maximum-likelihood) objective function. In practice, one can have a weaker degeneracy of the\nobjective function, resulting in very small values of the Hessian eigenvalues. This scenario occurs\noften in various models of physics and computational biology [11]. Hence, it is a drawback that the\ntheory of HMM learning was developed assuming complete identifiability [5].\nOne of our main results is that, in contrast to the ML approach, which produces continuously\ndegenerate solutions, VT results in a finitely degenerate solution that is sparse, i.e., some [non-identifiable]\nparameters are set to zero, and, furthermore, converges faster. 
Note that sparsity might be a desired feature in many practical applications. For instance, imposing sparsity on conventional EM-type learning has been shown to produce better results in part-of-speech tagging applications [25]. Whereas [25] had to impose sparsity via an additional penalty term in the objective function, in our case sparsity is a natural outcome of maximizing the likelihood of the best sequence. While our results were obtained on a class of exactly solvable models, it is plausible that they hold more generally.
The fact that VT provides simpler and more definite solutions, among all choices of the parameters compatible with the observed data, can be viewed as a type of Occam's razor for parameter learning. Note finally that the statistical-mechanics intuition behind these results is that the a posteriori likelihood is the (negative) zero-temperature free energy of a certain physical system. Minimizing this free energy makes physical sense: this is the premise of the second law of thermodynamics, which ensures relaxation towards equilibrium. In that zero-temperature equilibrium state certain types of motion are frozen, which means nullifying the corresponding transition probabilities. In that way the second law relates to Occam's razor. Other connections of this type are discussed in [15].

Acknowledgments

This research was supported in part by the US ARO MURI grant No. W911NF0610094 and US DTRA grant HDTRA1-10-1-0086.

References
[1] A. E. Allahverdyan, Entropy of hidden Markov processes via cycle expansion, J. Stat. Phys. 133, 535 (2008).
[2] A. E. Allahverdyan and A. Galstyan, On maximum a posteriori estimation of hidden Markov processes, Proc. of UAI (2009).
[3] R. Artuso, E. Aurell, and P. Cvitanovic, Recycling of strange sets, Nonlinearity 3, 325 (1990).
[4] P. Baldi and S. Brunak, Bioinformatics, MIT Press, Cambridge, USA (2001).
[5] L. E. Baum and T.
Petrie, Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Stat. 37, 1554 (1966).
[6] J. M. Benedi and J. A. Sanchez, Estimation of stochastic context-free grammars and their use as language models, Comp. Speech and Lang. 19, pp. 249-274 (2005).
[7] D. Blackwell and L. Koopmans, On the identifiability problem for functions of finite Markov chains, Ann. Math. Statist. 28, 1011 (1957).
[8] S. B. Cohen and N. A. Smith, Viterbi training for PCFGs: Hardness results and competitiveness of uniform initialization, Procs. of ACL (2010).
[9] Y. Ephraim and N. Merhav, Hidden Markov processes, IEEE Trans. Inf. Th., 48, 1518-1569 (2002).
[10] L. Y. Goldsheid and G. A. Margulis, Lyapunov indices of a product of random matrices, Russ. Math. Surveys 44, 11 (1989).
[11] R. N. Gutenkunst et al., Universally sloppy parameter sensitivities in systems biology models, PLoS Computational Biology, 3, 1871 (2007).
[12] G. Han and B. Marcus, Analyticity of entropy rate of hidden Markov chains, IEEE Trans. Inf. Th., 52, 5251 (2006).
[13] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press (1985).
[14] H. Ito, S. Amari, and K. Kobayashi, Identifiability of hidden Markov information sources, IEEE Trans. Inf. Th., 38, 324 (1992).
[15] D. Janzing, On causally asymmetric versions of Occam's razor and their relation to thermodynamics, arXiv:0708.3411 (2007).
[16] B. H. Juang and L. R. Rabiner, The segmental k-means algorithm for estimating parameters of hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-38, no. 9, pp. 1639-1641 (1990).
[17] B. G. Leroux, Maximum-likelihood estimation for hidden Markov models, Stochastic Processes and Their Applications, 40, 127 (1992).
[18] N. Merhav and Y.
Ephraim, Maximum likelihood hidden Markov modeling using a dominant sequence of states, IEEE Transactions on Signal Processing, vol. 39, no. 9, pp. 2111-2115 (1991).
[19] F. Qin, Restoration of single-channel currents using the segmental k-means method based on hidden Markov modeling, Biophys. J. 86, 1488-1501 (2004).
[20] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77, 257 (1989).
[21] L. J. Rodriguez and I. Torres, Comparative study of the Baum-Welch and Viterbi training algorithms, Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, 2652/2003, 847 (2003).
[22] D. Ruelle, Statistical Mechanics, Thermodynamic Formalism, Addison-Wesley, Reading, MA (1978).
[23] J. Sanchez, J. Benedi, and F. Casacuberta, Comparison between the inside-outside algorithm and the Viterbi algorithm for stochastic context-free grammars, in Adv. in Struct. and Synt. Pattern Recognition (1996).
[24] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning, Viterbi training improves unsupervised dependency parsing, in Proc. of the 14th Conference on Computational Natural Language Learning (2010).
[25] A. Vaswani, A. Pauls, and D. Chiang, Efficient optimization of an MDL-inspired objective function for unsupervised part-of-speech tagging, in Proc. ACL (2010).