{"title": "Boltzmann Chains and Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 442, "abstract": null, "full_text": "Boltzmann Chains and Hidden Markov Models \n\nLawrence K. Saul and Michael I. Jordan \nlksaul@psyche.mit.edu, jordan@psyche.mit.edu \nCenter for Biological and Computational Learning \nMassachusetts Institute of Technology \n79 Amherst Street, E10-243 \nCambridge, MA 02139 \n\nAbstract \n\nWe propose a statistical mechanical framework for the modeling of discrete time series. Maximum likelihood estimation is done via Boltzmann learning in one-dimensional networks with tied weights. We call these networks Boltzmann chains and show that they contain hidden Markov models (HMMs) as a special case. Our framework also motivates new architectures that address particular shortcomings of HMMs. We look at two such architectures: parallel chains that model feature sets with disparate time scales, and looped networks that model long-term dependencies between hidden states. For these networks, we show how to implement the Boltzmann learning rule exactly, in polynomial time, without resort to simulated or mean-field annealing. The necessary computations are done by exact decimation procedures from statistical mechanics. \n\n1 INTRODUCTION AND SUMMARY \n\nStatistical models of discrete time series have a wide range of applications, most notably to problems in speech recognition (Juang & Rabiner, 1991) and molecular biology (Baldi, Chauvin, Hunkapiller, & McClure, 1992). A common problem in these fields is to find a probabilistic model, and a set of model parameters, that account for sequences of observed data. Hidden Markov models (HMMs) have been particularly successful at modeling discrete time series. 
One reason for this is the powerful learning rule (Baum, 1972), a special case of the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster, Laird, & Rubin, 1977). \n\nIn this work, we develop a statistical mechanical framework for the modeling of discrete time series. The framework enables us to relate HMMs to a large family of exactly solvable models in statistical mechanics. The connection to statistical mechanics was first noticed by Sourlas (1989), who studied spin glass models of error-correcting codes. We view the estimation procedure for HMMs as a special (and particularly tractable) case of the Boltzmann learning rule (Ackley, Hinton, & Sejnowski, 1985; Byrne, 1992). \n\nThe rest of this paper is organized as follows. In Section 2, we review the modeling problem for discrete time series and establish the connection between HMMs and Boltzmann machines. In Section 3, we show how to quickly determine whether or not a particular Boltzmann machine is tractable, and if so, how to efficiently compute the correlations in the Boltzmann learning rule. Finally, in Section 4, we look at two architectures that address particular weaknesses of HMMs: the modeling of disparate time scales and long-term dependencies. \n\n2 MODELING DISCRETE TIME SERIES \n\nA discrete time series is a sequence of symbols {j_l}_{l=1}^L in which each symbol belongs to a finite alphabet, i.e. j_l ∈ {1, 2, ..., m}. Given one long sequence, or perhaps many shorter ones, the modeling task is to characterize the probability distribution from which the time series are generated. \n\n2.1 HIDDEN MARKOV MODELS \n\nA first-order hidden Markov model (HMM) is characterized by a set of n hidden states, an alphabet of m symbols, a transition matrix a_{ii'}, an emission matrix b_{ij}, and a prior distribution π_i over the initial hidden state. 
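As a concrete illustration, here is a minimal sketch of how these ingredients assign probability to a joint sequence of hidden states and symbols, as written out in eq. (1) below. The toy matrices a, b, pi and the helper joint_probability are hypothetical values chosen for illustration, not parameters from the paper:

```python
import numpy as np
import itertools

# Hypothetical toy HMM: n = 2 hidden states, m = 3 symbols.
a = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition matrix a[i, i']
b = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])  # emission matrix b[i, j]
pi = np.array([0.6, 0.4])                         # prior over the initial hidden state

def joint_probability(states, symbols):
    """P({i_l, j_l}) = pi_{i_1} * prod_l a_{i_l i_{l+1}} * prod_l b_{i_l j_l}."""
    p = pi[states[0]]
    for l in range(len(states) - 1):
        p *= a[states[l], states[l + 1]]
    for i, j in zip(states, symbols):
        p *= b[i, j]
    return p

print(joint_probability([0, 1, 1], [2, 0, 1]))

# Sanity check: the joint probabilities over all length-3 sequences sum to 1.
total = sum(joint_probability(list(s), list(o))
            for s in itertools.product(range(2), repeat=3)
            for o in itertools.product(range(3), repeat=3))
print(total)
```

Because the rows of a and b and the vector pi are normalized, summing the joint probability over every hidden-state and symbol sequence of a fixed length recovers 1, which is the sanity check at the end.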
The sequence of states {i_l}_{l=1}^L and symbols {j_l}_{l=1}^L is modeled to occur with probability \n\nP({i_l, j_l}) = π_{i_1} ∏_{l=1}^{L-1} a_{i_l i_{l+1}} ∏_{l=1}^{L} b_{i_l j_l}. (1) \n\nThe modeling problem is to find the parameter values (a_{ii'}, b_{ij}, π_i) that maximize the likelihood of observed sequences of training data. We will elaborate on the learning rule in Section 2.3, but first let us make the connection to a well-known family of stochastic neural networks, namely Boltzmann machines. \n\n2.2 BOLTZMANN MACHINES \n\nConsider a Boltzmann machine with m-state visible units, n-state hidden units, tied weights, and the linear architecture shown in Figure 1. This example represents the simplest possible Boltzmann \"chain\", one that is essentially equivalent to a first-order HMM unfolded in time (MacKay, 1994). The transition weights A_{ii'} connect adjacent hidden units, while the emission weights B_{ij} connect each hidden unit to its visible counterpart. In addition, boundary weights Π_i model an extra bias on the first hidden unit. \n\nFigure 1: Boltzmann chain with n-state hidden units, m-state visible units, transition weights A_{ii'}, emission weights B_{ij}, and boundary weights Π_i. \n\nEach configuration of units represents a state of energy \n\nH[{i_l, j_l}] = -Π_{i_1} - Σ_{l=1}^{L-1} A_{i_l i_{l+1}} - Σ_{l=1}^{L} B_{i_l j_l}, (2) \n\nwhere {i_l}_{l=1}^L ({j_l}_{l=1}^L) is the sequence of states over the hidden (visible) units. The probability to find the network in a particular configuration is given by \n\nP({i_l, j_l}) = (1/Z) e^{-βH}, (3) \n\nwhere β = 1/T is the inverse temperature, and the partition function \n\nZ = Σ_{{i_l, j_l}} e^{-βH} (4) \n\nis the sum over states that normalizes the Boltzmann distribution, eq. (3). \n\nComparing this to the HMM distribution, eq. 
(1), it is clear that any first-order HMM can be represented by the Boltzmann chain of Figure 1, provided we take[1] \n\nA_{ii'} = T ln a_{ii'}, B_{ij} = T ln b_{ij}, Π_i = T ln π_i. (5) \n\nLater, in Section 4, we will consider more complicated chains whose architectures address particular shortcomings of HMMs. For now, however, let us continue to develop the example of Figure 1, making explicit the connection to HMMs. \n\n2.3 LEARNING RULES \n\nIn the framework of Boltzmann learning (Williams & Hinton, 1990), the data for our problem consist of sequences of states over the visible units; the goal is to find the weights (A_{ii'}, B_{ij}, Π_i) that maximize the likelihood of the observed data. The likelihood of a sequence {j_l} is given by the ratio \n\nP({j_l}) = P({i_l, j_l}) / P({i_l} | {j_l}) = (e^{-βH}/Z) / (e^{-βH}/Z_c) = Z_c / Z, (6) \n\nwhere Z_c is the clamped partition function \n\nZ_c = Σ_{{i_l}} e^{-βH}. (7) \n\nNote that the sum in Z_c is only over the hidden states in the network, while the visible states are clamped to the observed values {j_l}. \n\n[1] Note, however, that the reverse statement (that for any set of parameters, this Boltzmann chain can be represented as an HMM) is not true. The weights in the Boltzmann chain represent arbitrary energies between ±∞, whereas the HMM parameters represent probabilities that are constrained to obey sum rules, such as Σ_{i'} a_{ii'} = 1. The Boltzmann chain of Figure 1 therefore has slightly more degrees of freedom than a first-order HMM. An interpretation of these extra degrees of freedom is given by MacKay (1994). \n\nThe Boltzmann learning rule adjusts the weights of the network by gradient ascent on the log-likelihood. 
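The mapping of eq. (5) and the likelihood ratio of eq. (6) can be checked by brute-force enumeration on a tiny chain. This is a minimal sketch with hypothetical toy parameters (the matrices and the observed sequence are illustrative, not from the paper); it verifies that the sum rules on (a, b, pi) force Z = 1, so the likelihood Z_c/Z reduces to an ordinary HMM sequence probability:

```python
import numpy as np
import itertools

# Hypothetical toy HMM mapped onto a Boltzmann chain via eq. (5), with T = 1.
T, beta = 1.0, 1.0
a = np.array([[0.8, 0.2], [0.3, 0.7]])   # transitions a_{ii'}
b = np.array([[0.6, 0.4], [0.1, 0.9]])   # emissions b_{ij}
pi = np.array([0.5, 0.5])                # prior pi_i
A, B, Pi = T * np.log(a), T * np.log(b), T * np.log(pi)   # eq. (5)

L = 3
observed = (0, 1, 1)   # visible units clamped to these symbols

def energy(states, symbols):
    """Eq. (2): H = -Pi_{i_1} - sum_l A_{i_l i_{l+1}} - sum_l B_{i_l j_l}."""
    return -(Pi[states[0]]
             + sum(A[states[l], states[l + 1]] for l in range(L - 1))
             + sum(B[i, j] for i, j in zip(states, symbols)))

hidden = list(itertools.product(range(2), repeat=L))
visible = list(itertools.product(range(2), repeat=L))
Z = sum(np.exp(-beta * energy(s, o)) for s in hidden for o in visible)   # eq. (4)
Zc = sum(np.exp(-beta * energy(s, observed)) for s in hidden)            # eq. (7)

print(round(Z, 6))   # the sum rules on (a, b, pi) force Z = 1
print(Zc / Z)        # eq. (6): likelihood of the clamped sequence
```

Exponentiating the energy turns the weight sums back into the products of eq. (1), which is why the free partition function telescopes to 1 under the HMM constraints.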
For the example of Figure 1, this leads to weight updates \n\nΔA_{ii'} = ηβ Σ_{l=1}^{L-1} [⟨δ_{i i_l} δ_{i' i_{l+1}}⟩_c - ⟨δ_{i i_l} δ_{i' i_{l+1}}⟩], (8) \nΔB_{ij} = ηβ Σ_{l=1}^{L} [⟨δ_{i i_l} δ_{j j_l}⟩_c - ⟨δ_{i i_l} δ_{j j_l}⟩], (9) \nΔΠ_i = ηβ [⟨δ_{i i_1}⟩_c - ⟨δ_{i i_1}⟩], (10) \n\nwhere δ_{ij} stands for the Kronecker delta function, η is a learning rate, and ⟨·⟩ and ⟨·⟩_c denote expectations over the free and clamped Boltzmann distributions. \n\nThe Boltzmann learning rule may also be derived as an Expectation-Maximization (EM) algorithm. The EM procedure is an alternating two-step method for maximum likelihood estimation in probability models with hidden and observed variables. For Boltzmann machines in general, neither the E-step nor the M-step can be done exactly; one must estimate the necessary statistics by Monte Carlo simulation (Ackley et al., 1985) or mean-field theory (Peterson & Anderson, 1987). In certain special cases (e.g. trees and chains), however, the necessary statistics can be computed to perform an exact E-step (as shown below). While the M-step in these Boltzmann machines cannot be done exactly, the weight updates can be approximated by gradient ascent. This leads to learning rules in the form of eqs. (8-10). \n\nHMMs may be viewed as a special case of Boltzmann chains for which both the E-step and the M-step are analytically tractable. In this case, the maximization in the M-step is performed subject to the constraints Σ_i e^{βΠ_i} = 1, Σ_{i'} e^{βA_{ii'}} = 1, and Σ_j e^{βB_{ij}} = 1. These constraints imply Z = 1 and lead to closed-form equations for the weight updates in HMMs. \n\n3 EXACT METHODS FOR BOLTZMANN LEARNING \n\nThe key technique to compute partition functions and correlations in Boltzmann chains is known as decimation. The idea behind decimation[2] is the following. Consider three units connected in series, as shown in Figure 2a. 
Though not directly connected, the end units have an effective interaction that is mediated by the middle one. In fact, the two weights in series exert the same influence as a single effective weight, given by \n\ne^{βÃ_{ii''}} = Σ_{i'} e^{β(A^{(1)}_{ii'} + A^{(2)}_{i'i''})}. (11) \n\n[2] A related method, the transfer matrix, is described by Stolorz (1994). \n\nFigure 2: Decimation, pruning, and joining in Boltzmann machines. \n\nReplacing the weights in this way amounts to integrating out, or decimating, the degree of freedom represented by the middle unit. An analogous rule may be derived for the situation shown in Figure 2b. Summing over the degrees of freedom of the dangling unit generates an effective bias on its parent, given by \n\ne^{βB̃_i} = Σ_j e^{βB_{ij}}. (12) \n\nWe call this the pruning rule. Another type of equivalence is shown in Figure 2c. The two weights in parallel have the same effect as the sum total weight \n\nA_{ii'} = A^{(1)}_{ii'} + A^{(2)}_{ii'}. (13) \n\nWe call this the joining rule. It holds trivially for biases as well as weights. \n\nThe rules for decimating, pruning, and joining have simple analogs in other types of networks (e.g. the law for combining resistors in electric circuits), and the strategy for exploiting them is a familiar one. Starting with a complicated network, we iterate the rules until we have a simple network whose properties are easily computed. A network is tractable for Boltzmann learning if it can be reduced to any pair of connected units. In this case, we may use the rules to compute all the correlations required for Boltzmann learning. Clearly, the rules do not make all networks tractable; certain networks (e.g. trees and chains), however, lend themselves naturally to these types of operations. 
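The decimation rule of eq. (11) is easy to check numerically: summing out the middle unit of a three-unit chain is a matrix product of the exponentiated weights, and it leaves the partition function unchanged. A minimal sketch, with randomly chosen (hypothetical) weight matrices A1 and A2:

```python
import numpy as np
import itertools

beta = 1.0
n = 3  # states per unit

rng = np.random.default_rng(0)
A1 = rng.normal(size=(n, n))  # weight between units 1 and 2 (hypothetical values)
A2 = rng.normal(size=(n, n))  # weight between units 2 and 3

# Decimation, eq. (11): summing out the middle unit leaves an effective weight
#   exp(beta * A_eff[i, k]) = sum_j exp(beta * (A1[i, j] + A2[j, k])),
# i.e. a matrix product of the elementwise-exponentiated weights.
A_eff = np.log(np.exp(beta * A1) @ np.exp(beta * A2)) / beta

# Check: the partition function of the 3-unit chain equals that of the
# decimated 2-unit chain.
Z3 = sum(np.exp(beta * (A1[i, j] + A2[j, k]))
         for i, j, k in itertools.product(range(n), repeat=3))
Z2 = sum(np.exp(beta * A_eff[i, k])
         for i, k in itertools.product(range(n), repeat=2))
print(abs(Z3 - Z2))  # agrees up to floating-point error
```

The same exp-then-log pattern implements the pruning rule of eq. (12) (a row sum instead of a matrix product), which is what makes chains reducible in polynomial time.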
\n\n4 DESIGNER NETS \n\nThe rules in Section 3 can be used to quickly assess whether or not a network is tractable for Boltzmann learning. Conversely, they can be used to design networks that are computationally tractable. This section looks at two networks designed to address particular shortcomings of HMMs. \n\n4.1 PARALLEL CHAINS AND DISPARATE TIME SCALES \n\nAn important problem in speech recognition (Juang & Rabiner, 1991) is how to \"combine feature sets with fundamentally different time scales.\" Spectral parameters, such as the cepstrum and delta-cepstrum, vary on a time scale of 10 msec; on the other hand, prosodic parameters, such as the signal energy and pitch, vary on a time scale of 100 msec. A model that takes into account this disparity should avoid two things. The first is redundancy, in particular the rather lame solution of oversampling the nonspectral features. The second is overfitting. How might this arise? Suppose we have trained two separate HMMs on sequences of spectral and prosodic features, knowing that the different features \"may not warrant a single, unified Markov chain\" (Juang & Rabiner, 1991). To exploit the correlation between feature sets, we must now couple the two HMMs. A naive solution is to form the Cartesian product of their hidden state spaces and resume training. Unfortunately, this results in an explosion in the number of parameters that must be fit from the training data. The likely consequences are overfitting and poor generalization. \n\nFigure 3: Coupled parallel chains for features with different time scales. \n\nFigure 3 shows a network for modeling feature sets with disparate time scales, in this case a 2:1 disparity. Two parallel Boltzmann chains are coupled by weights that connect their hidden units. 
Like the transition and emission weights within each chain, the coupling weights are tied across the length of the network. Note that coupling the time scales in this way introduces far fewer parameters than forming the Cartesian product of the hidden state spaces. Moreover, the network is tractable by the rules of Section 3. Suppose, for example, that we wish to compute the correlation between two neighboring hidden units in the middle of the network. This is done by first pruning all the visible units, then repeatedly decimating hidden units from both ends of the network. \n\nFigure 4 shows typical results on a simple benchmark problem, with data generated by an artificially constructed HMM. We tested the parallel chains model on 10 training sets, with varying levels of built-in correlation between features. A two-step method was used to train the parallel chains. First, we set the coupling weights to zero and trained each chain by a separate Baum-Welch procedure. Then, after learning in this phase was complete, we lifted the zero constraints and resumed training with the full Boltzmann learning rule. The percent gain in this second phase was directly related to the degree of correlation built into the training data, suggesting that the coupling weights were indeed capturing the correlation between feature sets. We also compared the performance of this Boltzmann machine versus that of a simple Cartesian-product HMM trained by an additional Baum-Welch procedure. While in both cases the second phase of learning led to reduced training error, the Cartesian product HMMs were decidedly more prone to overfitting. \n\nFigure 4: (a) Log-likelihood versus epoch for parallel chains with 4-state hidden units, 6-state visible units, and 100 hidden-visible unit pairs (per chain). The second jump in log-likelihood occurred at the onset of Boltzmann learning (see text). (b) Percent gain in log-likelihood versus built-in correlation between feature sets. \n\n4.2 LOOPS AND LONG-TERM DEPENDENCIES \n\nAnother shortcoming of first-order HMMs is that they cannot exhibit long-term dependencies between the hidden states (Juang & Rabiner, 1991). Higher-order and duration-based HMMs have been used in this regard with varying degrees of success. The rules of Section 3 suggest another approach, namely designing tractable networks with limited long-range connectivity. As an example, Figure 5a shows a Boltzmann chain with an internal loop and a long-range connection between the first and last hidden units. These extra features could be used to enforce known periodicities in the time series. Though tractable for Boltzmann learning, the loops in this network do not fit naturally into the framework of HMMs. Figure 5b shows learning curves for a toy problem, with data generated by another looped network. Carefully chosen loops and long-range connections provide additional flexibility in the design of probabilistic models for time series. Can networks with these extra features capture the long-term dependencies exhibited by real data? This remains an important issue for future research. \n\nAcknowledgements \n\nWe thank G. Hinton, D. MacKay, P. Stolorz, and C. Williams for useful discussions. This work was funded by ATR Human Information Processing Laboratories, Siemens Corporate Research, and NSF grant CDA-9404932. \n\nReferences \n\nD. H. Ackley, G. E. Hinton, and T. J. Sejnowski. (1985) A Learning Algorithm for Boltzmann Machines. Cog. Sci. 9:147-160. \n\nP. Baldi, Y. Chauvin, T. 
Hunkapiller, and M. A. McClure. (1992) Proc. Nat. Acad. Sci. (USA) 91:1059-1063. \n\nFigure 5: (a) Looped network. (b) Log-likelihood versus epoch for a looped network with 4-state hidden units, 6-state visible units, and 100 hidden-visible unit pairs. \n\nL. Baum. (1972) An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities 3:1-8. \n\nW. Byrne. (1992) Alternating Minimization and Boltzmann Machine Learning. IEEE Trans. Neural Networks 3:612-620. \n\nA. P. Dempster, N. M. Laird, and D. B. Rubin. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Statist. Soc. B 39:1-38. \n\nC. Itzykson and J. Drouffe. (1991) Statistical Field Theory. Cambridge: Cambridge University Press. \n\nB. H. Juang and L. R. Rabiner. (1991) Hidden Markov Models for Speech Recognition. Technometrics 33:251-272. \n\nD. J. MacKay. (1994) Equivalence of Boltzmann Chains and Hidden Markov Models. Submitted to Neural Comp. \n\nC. Peterson and J. R. Anderson. (1987) A Mean Field Theory Learning Algorithm for Neural Networks. Complex Systems 1:995-1019. \n\nL. Saul and M. Jordan. (1994) Learning in Boltzmann Trees. Neural Comp. 6:1174-1184. \n\nN. Sourlas. (1989) Spin Glass Models as Error Correcting Codes. Nature 339:693-695. \n\nP. Stolorz. (1994) Links Between Dynamic Programming and Statistical Physics for Heterogeneous Systems. JPL/Caltech preprint. \n\nC. Williams and G. E. Hinton. (1990) Mean Field Networks That Learn To Discriminate Temporally Distorted Strings. Proc. Connectionist Models Summer School: 18-22. 
\n\n\f", "award": [], "sourceid": 966, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}