{"title": "Training Algorithms for Hidden Markov Models using Entropy Based Distance Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 647, "abstract": null, "full_text": "Training Algorithms for Hidden Markov Models \n\nUsing Entropy Based Distance Functions \n\nYoram Singer \n\nAT&T Laboratories \n600 Mountain Avenue \nMurray Hill, NJ 07974 \nsinger@research.att.com \n\nManfred K. Warmuth \n\nComputer Science Department \n\nUniversity of California \nSanta Cruz, CA 95064 \nmanfred@cse.ucsc.edu \n\nAbstract \n\nWe present new algorithms for parameter estimation of HMMs. By \nadapting a framework used for supervised learning, we construct iterative \nalgorithms that maximize the likelihood of the observations while also \nattempting to stay \"close\" to the current estimated parameters. We use a \nbound on the relative entropy between the two HMMs as a distance mea(cid:173)\nsure between them. The result is new iterative training algorithms which \nare similar to the EM (Baum-Welch) algorithm for training HMMs. The \nproposed algorithms are composed of a step similar to the expectation \nstep of Baum-Welch and a new update of the parameters which replaces \nthe maximization (re-estimation) step. The algorithm takes only negligi(cid:173)\nbly more time per iteration and an approximated version uses the same \nexpectation step as Baum-Welch. We evaluate experimentally the new \nalgorithms on synthetic and natural speech pronunciation data. For sparse \nmodels, i.e. models with relatively small number of non-zero parameters, \nthe proposed algorithms require significantly fewer iterations. \n\n1 Preliminaries \nWe use the numbers from 0 to N to name the states of an HMM. State 0 is a special initial \nstate and state N is a special final state. Any state sequence, denoted by s, starts with the \ninitial state but never returns to it and ends in the final state. Observations symbols are also \nnumbers in {I, ... 
, M} and observation sequences are denoted by x. A discrete output hidden Markov model (HMM) is parameterized by two matrices A and B. The first matrix is of dimension [N, N] and a_{i,j} (0 ≤ i ≤ N-1, 1 ≤ j ≤ N) denotes the probability of moving from state i to state j. The second matrix is of dimension [N+1, M] and b_{i,k} is the probability of outputting symbol k at state i. The set of parameters of an HMM is denoted by θ = (A, B). (The initial state distribution vector is represented by the first row of A.) \nAn HMM is a probabilistic generator of sequences. It starts in the initial state 0. It then iteratively does the following until the final state is reached. If i is the current state then a next state j is chosen according to the transition probabilities out of the current state (row i of matrix A). After arriving at state j a symbol is output according to the output probabilities of that state (row j of matrix B). Let P(x, s|θ) denote the probability (likelihood) that an HMM θ generates the observation sequence x on the path s starting at state 0 and ending at state N: P(x, s | |s| = |x|+1, s_0 = 0, s_{|s|} = N, θ) ≝ ∏_{t=1}^{|s|} a_{s_{t-1},s_t} b_{s_t,x_t}. For the sake of brevity we omit the conditions on s and x. Throughout the paper we assume that the HMMs are absorbing, that is, from every state there is a path to the final state with a non-zero probability. \n\n\f642 \n\nY. Singer and M. K. Warmuth \n\nSimilar parameter estimation algorithms can be derived for ergodic HMMs. Absorbing HMMs induce a probability over all state-observation sequences, i.e. Σ_{x,s} P(x, s|θ) = 1. The likelihood of an observation sequence x is obtained by summing over all possible hidden paths (state sequences), P(x|θ) = Σ_s P(x, s|θ). To obtain the likelihood for a set X of observations we simply multiply the likelihood values for the individual sequences. 
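As a concrete illustration of the definitions above, the following sketch computes P(x|θ) = Σ_s P(x, s|θ) by brute-force enumeration of hidden paths for a tiny absorbing HMM. The three-state model, its numbers, and the convention that the final state emits no symbol are our illustrative assumptions, not taken from the paper; a real implementation would use the forward algorithm instead of enumeration:

```python
import itertools

# Toy absorbing HMM in the paper's convention: state 0 is the initial
# state, state N the final state; A[i][j] is the transition probability
# from i to j and B[j][k] the probability of emitting symbol k at j.
# All numbers here are illustrative, not from the paper.
N = 2                      # states 0 (initial), 1, 2 (final)
A = [[0.0, 1.0, 0.0],      # state 0 always moves to state 1
     [0.0, 0.5, 0.5],      # state 1 loops or moves to the final state
     [0.0, 0.0, 0.0]]      # final state: no outgoing transitions
B = [[0.5, 0.5],           # row j: output probabilities at state j
     [0.6, 0.4],
     [0.0, 0.0]]

def path_prob(x, s):
    """P(x, s | theta): product of transition and output probabilities
    along the path s = (s_0=0, ..., N) for the observations x."""
    p = 1.0
    for t in range(1, len(s)):
        p *= A[s[t - 1]][s[t]]
        if s[t] != N:              # assumed: the final state is silent
            p *= B[s[t]][x[t - 1]]
    return p

def likelihood(x):
    """P(x|theta): sum over all hidden paths with |x|+1 transitions."""
    total = 0.0
    for mid in itertools.product(range(1, N), repeat=len(x)):
        s = (0,) + mid + (N,)
        total += path_prob(x, s)
    return total

print(likelihood([0, 1]))   # path 0->1->1->2 emits symbols 0 then 1
```

With N = 2 there is a single hidden path per observation length, so the sum has one term; larger N makes the enumeration grow exponentially, which is exactly why the forward-backward recursions are used in practice.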
We seek an HMM θ that maximizes the likelihood for a given set of observations X, or equivalently, maximizes the log-likelihood, LL(X|θ) = (1/|X|) Σ_{x∈X} ln P(x|θ). \nTo simplify our notation we denote the generic parameter in θ by θ_i, where i ranges from 1 to the total number of parameters in A and B (there might be fewer if some are clamped to zero). We denote the total number of parameters of θ by I and leave the (fixed) correspondence between the θ_i and the entries of A and B unspecified. The indices are naturally partitioned into classes corresponding to the rows of the matrices. We denote by [i] the class of parameters to which θ_i belongs and by θ_{[i]} the vector of all θ_j s.t. j ∈ [i]. If j ∈ [i] then both θ_i and θ_j are parameters from the same row of one of the two matrices. Whenever it is clear from the context, we will use [i] to denote both a class of parameters and the row number (i.e. state) associated with the class. We now can rewrite P(x, s|θ) as ∏_{i=1}^{I} θ_i^{n_i(x,s)}, where n_i(x, s) is the number of times parameter i is used along the path s with observation sequence x. (Note that this value does not depend on the actual parameters θ.) We next compute partial derivatives of the likelihood and the log-likelihood using this notation: \n\n∂P(x, s|θ)/∂θ_i = n_i(x, s) θ_i^{n_i(x,s)-1} ∏_{j≠i} θ_j^{n_j(x,s)} = n_i(x, s) P(x, s|θ)/θ_i , \n\n∂LL(X|θ)/∂θ_i = (1/|X|) Σ_{x∈X} (1/P(x|θ)) Σ_s n_i(x, s) P(x, s|θ)/θ_i = (1/|X|) Σ_{x∈X} n_i(x|θ)/θ_i . \n\nHere n_i(x|θ) ≝ Σ_s n_i(x, s) P(s|x, θ) is the expected number of occurrences of the transition/output that corresponds to θ_i over all paths that produce x in θ. These values are calculated in the expectation step of the Expectation-Maximization (EM) training algorithm for HMMs [7], also known as the Baum-Welch [2] or the Forward-Backward algorithm. In the next sections we use the following additional expectations, n_i(θ) ≝ Σ_{x,s} n_i(x, s) P(x, s|θ) and n_{[i]}(θ) ≝ Σ_{j∈[i]} n_j(θ). 
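The identity P(x, s|θ) = ∏_i θ_i^{n_i(x,s)} can be checked numerically. The sketch below counts the parameter usages n_i(x, s) along one path of a toy model; the model, the dictionary encoding of θ, and the silent-final-state convention are our assumptions, not the paper's:

```python
from collections import Counter

# Verify P(x, s | theta) = prod_i theta_i ** n_i(x, s), where n_i(x, s)
# counts how often parameter i is used along path s with observations x.
# The toy model below is an illustrative assumption.
N = 2
A = {(0, 1): 1.0, (1, 1): 0.5, (1, 2): 0.5}   # transition parameters
B = {(1, 0): 0.6, (1, 1): 0.4}                # output parameters

def counts(x, s):
    """n_i(x, s) for every parameter i used on the path s."""
    n = Counter()
    for t in range(1, len(s)):
        n[('a', s[t - 1], s[t])] += 1
        if s[t] != N:                          # assumed silent final state
            n[('b', s[t], x[t - 1])] += 1
    return n

theta = {('a',) + k: v for k, v in A.items()}
theta.update({('b',) + k: v for k, v in B.items()})

x, s = [0, 0, 1], (0, 1, 1, 1, 2)
p = 1.0
for i, ni in counts(x, s).items():
    p *= theta[i] ** ni                        # prod_i theta_i ** n_i(x,s)
print(p)   # equals the direct product of probabilities along the path
```

Note also that the counts depend only on (x, s), not on the parameter values, which is what makes the exponent form convenient when differentiating.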
Note that the summation in the definition of n_i(θ) is over all legal x and s of arbitrary length, and n_{[i]}(θ) is the expected number of times the state [i] was visited. \n2 Entropic distance functions for HMMs \nOur training algorithms are based on the following framework of Kivinen and Warmuth for motivating iterative updates [6]. Assume we have already done a number of iterations and our current parameters are θ. Assume further that X is the set of observations to be processed in the current iteration. In the batch case this set never changes and in the on-line case X is typically a single observation. The new parameters θ̃ should stay close to θ, which incorporates all the knowledge obtained in past iterations, but they should also maximize the log-likelihood on the current data set X. Thus, instead of maximizing the log-likelihood we maximize U(θ̃) = η LL(X|θ̃) − d(θ̃, θ) (see [6, 5] for further motivation). Here d measures the distance between the old and new parameters and η > 0 is a trade-off factor. Maximizing U(θ̃) is usually difficult since both the distance function and the log-likelihood depend on θ̃. As in [6, 5], we approximate the log-likelihood by a first order Taylor expansion around θ̃ = θ and add Lagrange multipliers for the constraints that the parameters of each class must sum to one: \n\nU(θ̃) ≈ η (LL(X|θ) + (θ̃ − θ)·∇_θ LL(X|θ)) − d(θ̃, θ) + Σ_{[i]} λ_{[i]} Σ_{j∈[i]} θ̃_j . (3) \n\nA commonly used distance function is the relative entropy. To calculate the relative entropy between two HMMs we need to sum over all possible hidden state sequences, which leads to the following definition, \n\nd_RE(θ̃, θ) ≝ Σ_x P(x|θ̃) ln (P(x|θ̃)/P(x|θ)) = Σ_x (Σ_s P(x, s|θ̃)) ln (Σ_s P(x, s|θ̃) / Σ_s P(x, s|θ)) . \n\nHowever, the above divergence is very difficult to calculate and is not a convex function in θ̃. 
To avoid the computational difficulties and the non-convexity of d_RE we upper bound the relative entropy using the log sum inequality [3]: \n\nd_RE(θ̃, θ) ≤ d̄_RE(θ̃, θ) ≝ Σ_{x,s} P(x, s|θ̃) ln (P(x, s|θ̃)/P(x, s|θ)) = Σ_{x,s} P(x, s|θ̃) ln (∏_{i=1}^{I} θ̃_i^{n_i(x,s)} / ∏_{i=1}^{I} θ_i^{n_i(x,s)}) = Σ_{x,s} P(x, s|θ̃) Σ_{i=1}^{I} n_i(x, s) ln (θ̃_i/θ_i) = Σ_{i=1}^{I} ln (θ̃_i/θ_i) Σ_{x,s} P(x, s|θ̃) n_i(x, s) = Σ_{i=1}^{I} n_i(θ̃) ln (θ̃_i/θ_i) . \n\nNote that for the distance function d̄_RE(θ̃, θ) an HMM is viewed as a joint distribution between observation sequences and hidden state sequences. We can further simplify the bound on the relative entropy using the following lemma (proof omitted). \n\nLemma 1 For any absorbing HMM θ and any parameter θ_i ∈ θ, n_i(θ) = θ_i n_{[i]}(θ). \n\nThis gives the following new formula, d̄_RE(θ̃, θ) = Σ_{i=1}^{I} n_{[i]}(θ̃) θ̃_i ln (θ̃_i/θ_i), which can be rewritten as, d̄_RE(θ̃, θ) = Σ_{[i]} n_{[i]}(θ̃) d_RE(θ̃_{[i]}, θ_{[i]}) = Σ_{[i]} n_{[i]}(θ̃) Σ_{j∈[i]} θ̃_j ln (θ̃_j/θ_j) . Equation (3) is still difficult to solve since the variables n_{[i]}(θ̃) depend on the new set of parameters (which are not known). We therefore further approximate d̄_RE(θ̃, θ) by the distance function d̂_RE(θ̃, θ) = Σ_{[i]} n_{[i]}(θ) Σ_{j∈[i]} θ̃_j ln (θ̃_j/θ_j) . \n3 New Parameter Updates \nWe now would like to use the distance functions discussed in the previous section in U(θ̃). We first derive our main update using this distance function. This is done by replacing d(θ̃, θ) in U(θ̃) with d̂_RE(θ̃, θ) and setting the derivatives of the resulting U(θ̃) w.r.t. θ̃_i to 0. This gives the following set of equations (i ∈ {1, ..., I}), \n\nη Σ_{x∈X} n_i(x|θ) / (|X| θ_i) − n_{[i]}(θ) (ln (θ̃_i/θ_i) + 1) + λ_{[i]} = 0 , \n\nwhich are equivalent to \n\nθ̃_i = θ_i exp( λ_{[i]}/n_{[i]}(θ) − 1 ) exp( (η/n_{[i]}(θ)) Σ_{x∈X} n_i(x|θ)/(|X| θ_i) ) . \n\nWe now can solve for θ̃_i and replace λ_{[i]} by a normalization factor which ensures that the sum of the parameters in [i] is 1: \n\nθ̃_i = θ_i exp( (η/n_{[i]}(θ)) Σ_{x∈X} n_i(x|θ)/(|X| θ_i) ) / Σ_{j∈[i]} θ_j exp( (η/n_{[i]}(θ)) Σ_{x∈X} n_j(x|θ)/(|X| θ_j) ) . (4) \n\nThe above re-estimation rule is the entropic update for HMMs.¹ \n\nWe now derive an alternative to the update of (4). The mixture weights n_{[i]}(θ) (which approximate the original mixture weights n_{[i]}(θ̃) in d̄_RE(θ̃, θ)) lead to a state dependent learning rate of η/n_{[i]}(θ) for the parameters of class [i]. If computation time is limited (see discussion below) then the expectations n_{[i]}(θ) can be approximated by values that are readily available. One possible choice is to use the sample based expectations Σ_{j∈[i]} Σ_{x∈X} n_j(x|θ)/|X| as an approximation for n_{[i]}(θ). These weights are needed for calculating the gradient and are evaluated in the expectation step of Baum-Welch. Let n_{[i]}(x|θ) ≝ Σ_{j∈[i]} n_j(x|θ); then this approximation leads to the following distance function \n\nΣ_{[i]} (Σ_{x∈X} n_{[i]}(x|θ)/|X|) d_RE(θ̃_{[i]}, θ_{[i]}) = Σ_{[i]} (Σ_{x∈X} n_{[i]}(x|θ)/|X|) Σ_{j∈[i]} θ̃_j ln (θ̃_j/θ_j) , (5) \n\nwhich results in an update which we call the approximated entropic update for HMMs: \n\nθ̃_i = θ_i exp( (η/Σ_{x∈X} n_{[i]}(x|θ)) Σ_{x∈X} n_i(x|θ)/θ_i ) / Σ_{j∈[i]} θ_j exp( (η/Σ_{x∈X} n_{[i]}(x|θ)) Σ_{x∈X} n_j(x|θ)/θ_j ) . (6) \n\nGiven a current set of parameters θ and a learning rate η we obtain a new set of parameters θ̃ by iteratively evaluating the right-hand-side of the entropic update or the approximated entropic update. We calculate the expectations n_i(x|θ) as done in the expectation step of Baum-Welch. The weights n_{[i]}(x|θ) are obtained by summing n_j(x|θ) for j ∈ [i]. This lets us evaluate the right-hand-side of the approximated entropic update. 
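For one class (row) of parameters, the approximated entropic update (6) amounts to an exponentiated-gradient step followed by renormalization. The sketch below applies a single such step to one row; the function name, the toy numbers, and the choice of η are ours, not values from the paper:

```python
import math

def approx_entropic_step(theta_row, n_row, eta):
    """One approximated-entropic-update step for a single class [i].
    theta_row: current parameters of the class (positive, summing to 1);
    n_row[j]:  sum over x in X of n_j(x|theta) from the expectation step;
    eta:       learning rate."""
    n_class = sum(n_row)                      # sum over x of n_[i](x|theta)
    w = [t * math.exp(eta * n / (n_class * t))
         for t, n in zip(theta_row, n_row)]
    z = sum(w)                                # normalizer replacing lambda_[i]
    return [wi / z for wi in w]

# Hypothetical expected counts concentrated on the first parameter:
print(approx_entropic_step([0.5, 0.3, 0.2], [8.0, 1.5, 0.5], eta=1.0))
```

A parameter whose expected usage is large relative to its current value gains probability mass, while the row remains a distribution; as η → 0 the step leaves θ unchanged, mirroring the trade-off between the two terms of U(θ̃).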
The entropic update is slightly more involved and requires an additional calculation of n_{[i]}(θ). (Recall that n_{[i]}(θ) is the expected number of times state [i] is visited, unconditioned on the data.) To compute these expectations we need to sum over all possible sequences of state-observation pairs. Since the probabilities of outputting the possible symbols at a given state sum to one, calculating n_{[i]}(θ) reduces to evaluating the probability of reaching a state for each possible time and sequence length. For absorbing HMMs n_{[i]}(θ) can be approximated efficiently using dynamic programming; we compute n_{[i]}(θ) by summing the probabilities of all legal state sequences s of up to length cN (typically c = 3 proved to be sufficient to obtain very accurate approximations of n_{[i]}(θ)). Therefore, the time complexity of calculating n_{[i]}(θ) depends only on the number of states, regardless of the dimension of the output vector M and the training data X. \n\n¹ A subtle improvement is possible over the update (4) by treating the transition probabilities and output probabilities differently. First the transition probabilities are updated based on (4). Then the state probabilities n_{[i]}(θ̃) = n_{[i]}(Ã) are recomputed based on the new parameters Ã. This is possible since the state probabilities depend only on the transition probabilities and not on the output probabilities. Finally the output probabilities are updated with (4) where the n_{[i]}(θ̃) are used in place of the n_{[i]}(θ). \n\n4 The relation to EM and convergence properties \nWe first show that the EM algorithm for HMMs can be derived using our framework. 
To do so, we approximate the relative entropy by the χ² distance (see [3]), d_RE(p̃, p) ≈ d_χ²(p̃, p) ≝ ½ Σ_i (p̃_i − p_i)²/p_i, and use this distance to approximate d̄_RE(θ̃, θ): \n\nd̄_RE(θ̃, θ) ≈ d̄_χ²(θ̃, θ) ≝ Σ_{[i]} n_{[i]}(θ̃) d_χ²(θ̃_{[i]}, θ_{[i]}) ≈ Σ_{[i]} n_{[i]}(θ) d_χ²(θ̃_{[i]}, θ_{[i]}) ≈ Σ_{[i]} (Σ_{x∈X} n_{[i]}(x|θ)/|X|) d_χ²(θ̃_{[i]}, θ_{[i]}) . \n\nHere d_χ²(θ̃_{[i]}, θ_{[i]}) = ½ Σ_{j∈[i]} (θ̃_j − θ_j)²/θ_j . By maximizing U(θ̃) with the last version of the χ² distance function and following the same derivation steps as for the approximated entropic update we arrive at what we call the approximated χ² update for HMMs: \n\nθ̃_i = (1 − η) θ_i + η Σ_{x∈X} n_i(x|θ) / Σ_{x∈X} n_{[i]}(x|θ) . (7) \n\nSetting η = 1 results in the update θ̃_i = Σ_{x∈X} n_i(x|θ) / Σ_{x∈X} n_{[i]}(x|θ), which is the maximization (re-estimation) step of the EM algorithm. \n\nAlthough omitted from this paper due to the lack of space, it can be shown that for η ∈ (0, 1] the entropic updates and the χ² update improve the likelihood on each iteration. Therefore, these updates belong to the family of Generalized EM (GEM) algorithms which are guaranteed to converge to a local maximum given some additional conditions [4]. Furthermore, using infinitesimal analysis and a second order approximation of the likelihood function at the (local) maximum similar to [10], it can be shown that the approximated χ² update is a contraction mapping and close to the local maximum there exists a learning rate η > 1 which results in a faster rate of convergence than when using η = 1. \n5 Experiments with Artificial and Natural Data \nIn order to test the actual convergence rate of the algorithms and to compare them to Baum-Welch we created synthetic data using HMMs. In our experiments we mainly used sparse models, that is, models with many parameters clamped to zero. Previous work (e.g., [5, 6]) might suggest that the entropic updates will perform better on sparse models. 
(Indeed, when we used dense models to generate the data, the algorithms showed almost the same performance.) The training algorithms, however, were started from a randomly chosen dense model. When comparing the algorithms we used the same initial model. Due to different trajectories in parameter space, each algorithm may converge to a different (local) maximum. For the clarity of presentation we show here results for cases where all updates converged to the same maximum, which often occurs when the HMM generating the data is sparse and there are enough examples (typically tens of observations per non-zero parameter). We tested both the entropic updates and the χ² updates. Learning rates greater than one speed up convergence. The two entropic updates converge almost equally fast on synthetic data generated by an HMM. For natural data the entropic update converges slightly faster than the approximated version. The χ² update also benefits from learning rates larger than one. However, the χ² update needs to be used carefully since it does not necessarily ensure non-negativity of the new parameters for η > 1. This problem is exaggerated when the data is not generated by an HMM. We therefore used the entropic updates in our experiments with natural data. In order to have a fair comparison, we did not tune the learning rate η and set it to 1.5. In Figure 1 we give a comparison of the entropic update, the approximated entropic update, and Baum-Welch (left figure), using an HMM to generate the random observation sequences, where N = M = 40 but only 25% (10 parameters on the average for each transition/observation vector) of the parameters of the HMM are non-zero. The performance of the entropic update and the approximated entropic update are practically the same and both updates clearly outperform Baum-Welch. 
One reason the performance of the two entropic updates is the same is that the observations were indeed generated by an HMM. In this case, approximating the expectations n_{[i]}(θ) by the sample based expectations seems reasonable. These results suggest a valuable alternative to using Baum-Welch with a predetermined sparse, potentially biased, HMM where a large number of parameters is clamped to zero. Instead, we suggest starting with a full model and letting one of the entropic updates find the relevant parameters. This approach is demonstrated on the right part of Figure 1. In this example the data was generated by a sparse HMM with 100 states and 100 possible output symbols. Only 10% of the HMM's parameters were non-zero. Three log-likelihood curves are given in the figure. One is the log-likelihood achieved by Baum-Welch when only those parameters that are non-zero in the HMM generating the data are initialized to random non-zero values. The other two are the log-likelihood of the entropic update and Baum-Welch when all the parameters are initialized randomly. The curves show that the entropic update compensates for its inferior initialization in less than 10 iterations (see horizontal line in Figure 1) and from this point on it requires only 23 more iterations to converge compared to Baum-Welch which is given prior knowledge of the non-zero parameters. In contrast, when Baum-Welch is started with a full model then its convergence is much slower than the entropic update. \n\n[Two plots of log-likelihood versus iteration number. Left: entropic update, approximated entropic update, and EM (Baum-Welch). Right: EM (Baum-Welch) with random initialization, entropic update with random initialization, and EM (Baum-Welch) with sparse initialization.] \n\nFigure 1: Comparison of the entropic updates and Baum-Welch. \n\nWe next tested the updates on speech pronunciation data. In natural speech, a word might be pronounced differently by different speakers. A common practice is to construct a set of stochastic models in order to capture the variability of the alternative pronunciations of a given word. This problem was studied previously in [9] using a state merging algorithm for HMMs and in [8] using a subclass of probabilistic finite automata. The purpose of the experiments discussed here is not to compare the above algorithms to the entropic updates but rather to compare the entropic updates to Baum-Welch. Nevertheless, the resulting HMM pronunciation models are usually sparse. Typically, only two or three phonemes have a non-zero output probability at a given state, and the average number of states that in practice can follow a state is about 2. Therefore, the entropic updates may provide a good alternative to the algorithms presented in [8, 9]. \n\nWe used the TIMIT (Texas Instruments-MIT) database as in [8, 9]. This database contains the acoustic waveforms of continuous speech with phone labels from an alphabet of 62 phones which constitute a temporally aligned phonetic transcription of the uttered words. For the purpose of building pronunciation models, the acoustic data was ignored and we partitioned the phonetic labels according to the words that appeared in the data. The data was filtered and partitioned so that words occurring between 20 and 100 times in the dataset were used for training and evaluation according to the following partition. 
75% of the occurrences of each word were used as training data for the learning algorithm and the remaining 25% were used for evaluation. We then built for each word three pronunciation models by training a fully connected HMM whose number of states was set to 1, 1.5 and 1.75 times the longest sample (denoted by Nm). The models were evaluated by calculating the log-likelihood (averaged over 10 different random parameter initializations) of each HMM on the phonetic transcription of each word in the test set. In Table 1 we give the negative log-likelihood achieved on the test data together with the average number of iterations needed for training. Overall the differences in the log-likelihood are small, which means that the results should be interpreted with some caution. Nevertheless, the entropic update obtained the highest likelihood on the test data while needing the least number of iterations. The approximated entropic update and Baum-Welch achieve similar results on the test data but the latter requires more iterations. Checking the resulting models reveals one reason why the entropic update achieves higher likelihood values, namely, it does a better job in setting the irrelevant parameters to zero (and it does it faster). \n\n                  Negative Log-Likelihood     # Iterations \n# States          1.0Nm   1.5Nm   1.75Nm      1.0Nm   1.5Nm   1.75Nm \nBaum-Welch        2448    2388    2425        27.4    36.1    41.1 \nApprox. EU        2440    2389    2426        25.5    35.0    37.0 \nEntropic Update   2418    2352    2405        23.1    30.9    32.6 \n\nTable 1: Comparison of the entropic updates and Baum-Welch on speech pronunciation data. \n\n6 Conclusions and future research \nIn this paper we have shown how the framework of Kivinen and Warmuth [6] can be used to derive parameter update algorithms for HMMs. 
We view an HMM as a joint distribution between the observation sequences and hidden state sequences and use a bound on the relative entropy as a distance between the new and old parameter settings. If we approximate the relative entropy by the χ² distance, replace the exact state expectations by a sample based approximation, and fix the learning rate to one, then the framework yields an alternative derivation of the EM algorithm for HMMs. Since the EM update uses sample based estimates of the state expectations it is hard to use it in an on-line setting. In contrast, the on-line versions of our updates can be easily derived using only one observation sequence at a time. Also, there are alternative gradient descent based methods for estimating the parameters of HMMs. Such methods usually employ an exponential parameterization (such as soft-max) of the parameters (see [1]). For the case of learning one set of mixture coefficients an exponential parameterization led to an algorithm with a slower convergence rate compared to algorithms derived using entropic distances [5]. However, it is not clear whether this is still the case for HMMs. Our future goal is to perform a comparative study of the different updates with emphasis on the on-line versions. \nAcknowledgments \nWe thank Anders Krogh for showing us the simple derivative calculations used in this paper and thank Fernando Pereira and Yasubumi Sakakibara for interesting discussions. \nReferences \n[1] P. Baldi and Y. Chauvin. Smooth on-line learning algorithms for Hidden Markov Models. Neural Computation, 6(2), 1994. \n[2] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, 1966. \n[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991. \n[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society, B39:1-38, 1977. \n[5] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. A comparison of new and old algorithms for a mixture estimation problem. In Proceedings of the Eighth Annual Workshop on Computational Learning Theory, pages 69-78, 1995. \n[6] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997. To appear. \n[7] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986. \n[8] D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proc. of the Eighth Annual Workshop on Computational Learning Theory, 1995. \n[9] A. Stolcke and S. Omohundro. Hidden Markov model induction by Bayesian model merging. In Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann, 1993. \n[10] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129-151, 1996. \n", "award": [], "sourceid": 1263, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}