{"title": "Hidden Markov Model Induction by Bayesian Model Merging", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 18, "abstract": null, "full_text": "Hidden Markov Model Induction by Bayesian \n\nModel Merging \n\nAndreas Stolcke*'** \n\n*Computer Science Division \n\nUniversity of California \n\nBerkeley, CA 94720 \n\nstolcke@icsi.berkeley.edu \n\nStephen Omohundro\" \n\n**International Computer Science Institute \n\n1947 Center Street, Suite 600 \n\nBerkeley, CA 94704 \n\nom@icsi.berkeley.edu \n\nAbstract \n\nThis paper describes a technique for learning both the number of states and the \ntopology of Hidden Markov Models from examples. The induction process starts \nwith the most specific model consistent with the training data and generalizes \nby successively merging states. Both the choice of states to merge and the \nstopping criterion are guided by the Bayesian posterior probability. We compare \nour algorithm with the Baum-Welch method of estimating fixed-size models, and \nfind that it can induce minimal HMMs from data in cases where fixed estimation \ndoes not converge or requires redundant parameters to converge. \n\n1 INTRODUCTION AND OVERVIEW \n\nHidden Markov Models (HMMs) are a well-studied approach to the modelling of sequence \ndata. HMMs can be viewed as a stochastic generalization of finite-state automata, where \nboth the transitions between states and the generation of output symbols are governed by \nprobability distributions. HMMs have been important in speech recognition (Rabiner & \nJuang, 1986), cryptography, and more recently in other areas such as protein classification \nand alignment (Haussler, Krogh, Mian & SjOlander, 1992; Baldi, Chauvin, Hunkapiller & \nMcClure, 1993). \n\nPractitioners have typically chosen the HMM topology by hand, so that learning the HMM \nfrom sample data means estimating only a fixed number of model parameters. 
The standard approach is to find a maximum likelihood (ML) or maximum a posteriori probability (MAP) estimate of the HMM parameters. The Baum-Welch algorithm uses dynamic programming to approximate these estimates (Baum, Petrie, Soules & Weiss, 1970).

A more general problem is to additionally find the best HMM topology. This includes both the number of states and the connectivity (the non-zero transitions and emissions). One could exhaustively search the model space using the Baum-Welch algorithm on fully connected models of varying sizes, picking the model size and topology with the highest posterior probability. (Maximum likelihood estimation is not useful for this comparison, since larger models usually fit the data better.) This approach is very costly, and Baum-Welch may get stuck at sub-optimal local maxima. Our comparative results later in the paper show that this often occurs in practice. The problem can be somewhat alleviated by sampling from several initial conditions, but at a further increase in computational cost.

The HMM induction method proposed in this paper tackles the structure learning problem in an incremental way. Rather than estimating a fixed-size model from scratch for various sizes, the model size is adjusted as new evidence arrives. There are two opposing tendencies in adjusting the model size and structure. Initially new data adds to the model size, because the HMM has to be augmented to accommodate the new samples. If enough data of a similar structure is available, however, the algorithm collapses the shared structure, decreasing the model size. The merging of structure is also what drives generalization, i.e., creates HMMs that generate data not seen during training.

Beyond being incremental, our algorithm is data-driven, in that the samples themselves completely determine the initial model shape.
Baum-Welch estimation, by comparison, uses an initially random set of parameters for a given-sized HMM and iteratively updates them until a point is found at which the sample likelihood is locally maximal. What seems intuitively troublesome with this approach is that the initial model is completely uninformed by the data. The sample data directs the model formation process only in an indirect manner as the model approaches a meaningful shape.

2 HIDDEN MARKOV MODELS

For lack of space we cannot give a full introduction to HMMs here; see Rabiner & Juang (1986) for details. Briefly, an HMM consists of states and transitions like a Markov chain. In the discrete version considered here, it generates strings by performing random walks between an initial and a final state, outputting symbols at every state in between. The probability P(x|M) that a model M generates a string x is determined by the conditional probabilities of making a transition from one state to another and the probability of emitting each symbol from each state. Once these are given, the probability of a particular path through the model generating the string can be computed as the product of all transition and emission probabilities along the path. The probability of a string x is the sum of the probabilities of all paths generating x.

For example, the model M3 in Figure 1 generates the strings ab, abab, ababab, ... with probabilities 1/3, 2/3², 2²/3³, ..., respectively.

3 HMM INDUCTION BY STATE MERGING

3.1 MODEL MERGING

Omohundro (1992) has proposed an approach to statistical model inference in which initial models simply replicate the data and generalize by similarity. As more data is received, component models are fit from more complex model spaces.
This allows the formation of arbitrarily complex models without overfitting along the way. The elementary step used in modifying the overall model is a merging of sub-models, collapsing the sample sets for the corresponding sample regions. The search for sub-models to merge is guided by an attempt to sacrifice as little of the sample likelihood as possible as a result of the merging process. This search can be done very efficiently if (a) a greedy search strategy can be used, and (b) likelihood computations can be done locally for each sub-model and don't require global recomputation on each model update.

3.2 STATE MERGING IN HMMS

We have applied this general approach to the HMM learning task. We describe the algorithm here mostly by presenting an example. The details are available in Stolcke & Omohundro (1993).

To obtain an initial model from the data, we first construct an HMM which produces exactly the input strings. The start state has as many outgoing transitions as there are strings, and each string is represented by a unique path with one state per sample symbol. The probability of entering these paths from the start state is uniformly distributed. Within each path there is a unique transition arc whose probability is 1. The emission probabilities are 1 for each state to produce the corresponding symbol.

As an example, consider the regular language (ab)+ and two samples drawn from it, the strings ab and abab. The algorithm constructs the initial model M0 depicted in Figure 1. This is the most specific model accounting for the observed data. It assigns each sample a probability equal to its relative frequency, and is therefore a maximum likelihood model for the data.

Learning from the sample data means generalizing from it. This implies trading off model likelihood against some sort of bias towards 'simpler' models, expressed by a prior probability distribution over HMMs.
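The initial-model construction just described can be sketched in code. The dictionary representation, state numbering, and function name below are illustrative assumptions, not the authors' implementation; only the construction rule itself is from the text:

```python
from fractions import Fraction

def initial_hmm(samples):
    """Build the most specific HMM for a list of sample strings:
    one dedicated path of states per sample, one state per symbol,
    entered from the start state with uniform probability."""
    start, final = "S", "F"
    transitions = {}  # (from_state, to_state) -> transition probability
    emissions = {}    # state -> its single output symbol (probability 1)
    state = 0
    for sample in samples:
        prev = start
        for symbol in sample:
            state += 1
            emissions[state] = symbol
            # Uniform split at the start state; probability 1 within a path.
            transitions[(prev, state)] = (
                Fraction(1, len(samples)) if prev == start else Fraction(1))
            prev = state
        transitions[(prev, final)] = Fraction(1)
    return transitions, emissions

# Running example: two samples drawn from (ab)+
trans, emit = initial_hmm(["ab", "abab"])
```

Each path is entered with probability 1/2, so each sample is assigned its relative frequency; the resulting log likelihood, log(1/2 · 1/2) ≈ -1.39, matches the value given for M0 in Figure 1.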
Bayesian analysis provides a formal basis for this tradeoff. Bayes' rule tells us that the posterior model probability P(M|x) is proportional to the product of the model prior P(M) and the likelihood of the data P(x|M). Smaller or simpler models will have a higher prior, and this can outweigh the drop in likelihood as long as the generalization is conservative and keeps the model close to the data. The choice of model priors is discussed in the next section.

The fundamental idea exploited here is that the initial model M0 can be gradually transformed into the generating model by repeatedly merging states. The intuition for this heuristic comes from the fact that if we take the paths that generate the samples in an actual generating HMM M and 'unroll' them to make them completely disjoint, we obtain M0. The iterative merging process, then, is an attempt to undo the unrolling, tracing a search through the model space back to the generating model.

Merging two states q1 and q2 in this context means replacing q1 and q2 by a new state r with a transition distribution that is a weighted mixture of the transition probabilities of q1, q2, and with a similar mixture distribution for the emissions. Transition probabilities into q1 or q2 are added up and redirected to r. The weights used in forming the mixture distributions are the relative frequencies with which q1 and q2 are visited in the current model.

[Figure 1: Sequence of models obtained by merging samples {ab, abab}. All transitions without special annotations have probability 1; output symbols appear above their respective states and also carry an implicit probability of 1. For each model the log likelihood is given; in particular, log L(x|M0) = -1.39 and log L(x|M1) = log L(x|M0).]

Repeatedly performing such merging operations yields a sequence of models M0, M1, M2, ...
, along which we can search for the MAP model. To make the search efficient, we use a greedy strategy: given Mi, choose the pair of states to merge that maximizes P(Mi+1|x).

Continuing with the previous example, we find that states 1 and 3 in M0 can be merged without penalizing the likelihood. This is because they have identical outputs, and the loss due to merging the outgoing transitions is compensated by the merging of the incoming transitions. The .5/.5 split is simply transferred to the outgoing transitions of the merged state. The same situation obtains for states 2 and 4 once 1 and 3 are merged. From these two first merges we get model M1 in Figure 1. By convention we reuse the smaller of the two state indices to denote the merged state.

At this point the best merge turns out to be between states 2 and 6, giving model M2. However, there is a penalty in likelihood, which decreases to about .59 of its previous value. Under all the reasonable priors we considered (see below), the posterior model probability still increases due to an increase in the prior. Note that the transition probability ratio at state 2 is now 2/1, since two samples make use of the first transition, whereas only one takes the second.

Finally, states 1 and 5 can be merged without penalty to give M3, the minimal model that generates (ab)+. Further merging at this point would reduce the likelihood by three orders of magnitude. The resulting decrease in the posterior probability tells the algorithm to stop at this point.

3.3 MODEL PRIORS

As noted previously, the likelihoods P(x|Mi) along the sequence of models considered by the algorithm are monotonically non-increasing. The prior P(M) must account for an overall increase in posterior probability, and is therefore the driving force behind generalization.
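The elementary merging operation used in the example above can be sketched as follows: outgoing transition and emission distributions are mixed with weights given by relative visit frequencies, and probabilities into either merged state are summed and redirected. The dictionary representation, function name, and explicit visit counts are assumptions made for illustration only:

```python
def merge_states(trans, emit, visits, q1, q2, r):
    """Merge states q1 and q2 of an HMM into a new state r.

    trans:  dict of dicts; trans[p][q] is the transition probability p -> q
    emit:   dict of dicts; emit[q][sym] is the probability of emitting sym at q
    visits: dict; relative frequency with which each state is visited
    """
    w1 = visits[q1] / (visits[q1] + visits[q2])
    w2 = 1.0 - w1

    def mix(d1, d2):
        # Weighted mixture of two discrete distributions.
        return {k: w1 * d1.get(k, 0.0) + w2 * d2.get(k, 0.0)
                for k in set(d1) | set(d2)}

    def redirect(dist):
        # Sum probability mass going into q1 or q2 and send it to r instead.
        out = {}
        for dest, p in dist.items():
            key = r if dest in (q1, q2) else dest
            out[key] = out.get(key, 0.0) + p
        return out

    new_trans = {q: redirect(d) for q, d in trans.items() if q not in (q1, q2)}
    new_trans[r] = redirect(mix(trans.get(q1, {}), trans.get(q2, {})))
    new_emit = {q: d for q, d in emit.items() if q not in (q1, q2)}
    new_emit[r] = mix(emit.get(q1, {}), emit.get(q2, {}))
    new_visits = {q: v for q, v in visits.items() if q not in (q1, q2)}
    new_visits[r] = visits[q1] + visits[q2]
    return new_trans, new_emit, new_visits

# First merge of the running example: states 1 and 3 of M0
# (reusing the smaller index, 1, for the merged state).
t = {"S": {1: 0.5, 3: 0.5}, 1: {2: 1.0}, 2: {"F": 1.0},
     3: {4: 1.0}, 4: {5: 1.0}, 5: {6: 1.0}, 6: {"F": 1.0}}
e = {1: {"a": 1.0}, 2: {"b": 1.0}, 3: {"a": 1.0},
     4: {"b": 1.0}, 5: {"a": 1.0}, 6: {"b": 1.0}}
v = {q: 1.0 for q in range(1, 7)}
t2, e2, v2 = merge_states(t, e, v, 1, 3, r=1)
```

As in the text, states 1 and 3 merge without a likelihood penalty: the merged state still outputs a with probability 1, the start state now reaches it with probability 1, and the former .5/.5 split moves to its outgoing transitions.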
\n\nAs in the work on Bayesian learning of classification trees by Buntine (1992), we can split \nthe prior P(M) into a term accounting for the model structure, P(Ms), and a term for the \nadjustable parameters in a fixed structure P(MpIMs). \n\nWe initially relied on the structural prior only, incorporating an explicit bias towards smaller \nmodels. Size here is some function of the number of states and/or transitions, IMI. Such a \nprior can be obtained by making P(Ms )