{"title": "Relative Density Nets: A New Way to Combine Backpropagation with HMM's", "book": "Advances in Neural Information Processing Systems", "page_first": 1149, "page_last": 1156, "abstract": null, "full_text": "Relative Density Nets: A New Way to \nCombine Backpropagation with HMM's \n\nAndrew D. Brown \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Canada M5S 3G4 \n\nandy@cs.utoronto.ca \n\nGeoffrey E. Hinton \n\nGatsby Unit, UCL \n\nLondon, UK WC1N 3AR \nhinton@gatsby.ucl.ac.uk \n\nAbstract \n\nLogistic units in the first hidden layer of a feedforward neural network \ncompute the relative probability of a data point under two \nGaussians. This leads us to consider substituting other density \nmodels. We present an architecture for performing discriminative \nlearning of Hidden Markov Models using a network of many small \nHMM's. Experiments on speech data show it to be superior to the \nstandard method of discriminatively training HMM's. \n\n1 Introduction \n\nA standard way of performing classification using a generative model is to divide the \ntraining cases into their respective classes and then train a set of class conditional \nmodels. This unsupervised approach to classification is appealing for two reasons. It \nis possible to reduce overfitting, because the model learns the class-conditional input \ndensities P(x|c) rather than the input-conditional class probabilities P(c|x). Also, \nprovided that the model density is a good match to the underlying data density \nthen the decision provided by a probabilistic model is Bayes optimal. The problem \nwith this unsupervised approach to using probabilistic models for classification is \nthat, for reasons of computational efficiency and analytical convenience, very simple \ngenerative models are typically used and the optimality of the procedure no longer \nholds. For this reason it is usually advantageous to train a classifier discriminatively. 
\n\nIn this paper we will look specifically at the problem of learning HMM's for classifying \nspeech sequences. It is an application area where the assumption that the HMM \nis the correct generative model for the data is inaccurate and discriminative methods \nof training have been successful. The first section will give an overview of current \nmethods of discriminatively training HMM classifiers. We will then introduce a new \ntype of multi-layer backpropagation network which takes better advantage of the \nHMM's for discrimination. Finally, we present some simulations comparing the two \nmethods. \n\nFigure 1: An Alphanet with one HMM per class. Each computes a score for the \nsequence and this feeds into a softmax output layer. \n\n2 Alphanets and Discriminative Learning \n\nThe unsupervised way of using an HMM for classifying a collection of sequences is to \nuse the Baum-Welch algorithm [1] to fit one HMM per class. Then new sequences \nare classified by computing the probability of a sequence under each model and \nassigning it to the one with the highest probability. Speech recognition is one of the \ncommonest applications of HMM's, but unfortunately an HMM is a poor model of \nthe speech production process. For this reason speech researchers have looked at the \npossibility of improving the performance of an HMM classifier by using information \nfrom negative examples, that is, examples drawn from classes other than the one which \nthe HMM was meant to model. One way of doing this is to compute the mutual \ninformation between the class label and the data under the HMM density, and \nmaximize that objective function [2]. \n\nIt was later shown that this procedure could be viewed as a type of neural network \n(see Figure 1) in which the inputs to the network are the log-probability scores \n$\\mathcal{L}(x_{1:T}|\\mathcal{H})$ of the sequence under hidden Markov model $\\mathcal{H}$ [3]. 
In such a model \nthere is one HMM per class, and the output is a softmax non-linearity: \n\n$$p_k = \\frac{\\exp\\{\\mathcal{L}(x_{1:T}|\\mathcal{H}_k)\\}}{\\sum_{j} \\exp\\{\\mathcal{L}(x_{1:T}|\\mathcal{H}_j)\\}} \\tag{1}$$ \n\nTraining this model by maximizing the log probability of correct classification leads \nto a classifier which will perform better than an equivalent HMM model trained \nsolely in an unsupervised manner. Such an architecture has been termed an \"Alphanet\" \nbecause it may be implemented as a recurrent neural network which mimics \nthe forward pass of the forward-backward algorithm.^1 \n\n^1 The results of the forward pass are the probabilities of the hidden states conditioned \non the past observations, or \"alphas\" in standard HMM terminology. \n\n3 Backpropagation Networks as Density Comparators \n\nA multi-layer feedforward network is usually thought of as a flexible non-linear \nregression model, but if it uses the logistic function non-linearity in the hidden \nlayer, there is an interesting interpretation of the operation performed by each \nhidden unit. Given a mixture of two Gaussians where we know the component \npriors $P(\\mathcal{G})$ and the component densities $P(x|\\mathcal{G})$, the posterior probability that \nGaussian $\\mathcal{G}_0$ generated an observation $x$ is a logistic function whose argument is \nthe log-odds of the two classes [4]. This can clearly be seen by rearranging \nthe expression for the posterior: \n\n$$P(\\mathcal{G}_0|x) = \\frac{P(x|\\mathcal{G}_0)P(\\mathcal{G}_0)}{P(x|\\mathcal{G}_0)P(\\mathcal{G}_0) + P(x|\\mathcal{G}_1)P(\\mathcal{G}_1)} = \\frac{1}{1 + \\exp\\left\\{-\\log\\frac{P(x|\\mathcal{G}_0)}{P(x|\\mathcal{G}_1)} - \\log\\frac{P(\\mathcal{G}_0)}{P(\\mathcal{G}_1)}\\right\\}} \\tag{2}$$ \n\nIf the class conditional densities in question are multivariate Gaussians \n\n$$P(x|\\mathcal{G}_k) = |2\\pi\\Sigma|^{-1/2} \\exp\\left\\{-\\tfrac{1}{2}(x - \\mu_k)^T \\Sigma^{-1} (x - \\mu_k)\\right\\} \\tag{3}$$ \n\nwith equal covariance matrices, $\\Sigma$, then the posterior class probability may be \nwritten in this familiar form: \n\n$$P(\\mathcal{G}_0|x) = \\frac{1}{1 + \\exp\\{-(x^T w + b)\\}} \\tag{4}$$ \n\nwhere, \n\n$$w = \\Sigma^{-1}(\\mu_0 - \\mu_1) \\tag{5}$$ \n\n$$b = \\tfrac{1}{2}(\\mu_1 + \\mu_0)^T \\Sigma^{-1} (\\mu_1 - \\mu_0) + \\log\\frac{P(\\mathcal{G}_0)}{P(\\mathcal{G}_1)} \\tag{6}$$ \n\nThus, the multi-layer perceptron can be viewed as computing pairwise posteriors \nbetween Gaussians in the input space, and then combining these in the output layer \nto compute a decision. \n\n4 A New Kind of Discriminative Net \n\nThis view of a feedforward network suggests variations in which other kinds of \ndensity models are used in place of Gaussians in the input space. In particular, \ninstead of performing pairwise comparisons between Gaussians, the units in the \nfirst hidden layer can perform pairwise comparisons between the densities of an \ninput sequence under $M$ different HMM's. For a given sequence the log-probability \nof the sequence under each HMM is computed and the difference in log-probability \nis used as input to the logistic hidden unit.^2 This is equivalent to computing the \nposterior responsibilities of a mixture of two HMM's with equal prior probabilities. \nIn order to maximally leverage the information captured by the HMM's we use $\\binom{M}{2}$ \nhidden units so that all possible pairs are included. The output of a hidden unit $h$ \nis given by \n\n$$h_{(mn)} = \\sigma\\left(\\mathcal{L}(x_{1:T}|\\mathcal{H}_m) - \\mathcal{L}(x_{1:T}|\\mathcal{H}_n)\\right) \\tag{7}$$ \n\nwhere we have used $(mn)$ as an index over the set, $\\binom{M}{2}$, of all unordered pairs of \nthe HMM's. The results of this hidden layer computation are then combined using \na fully connected layer of free weights, $W$, and finally passed through a softmax \nfunction to make the final decision. 
\n\n$$a_k = \\sum_{(mn) \\in \\binom{M}{2}} W_{(mn),k}\\, h_{(mn)} \\tag{8}$$ \n\n$$p_k = \\frac{\\exp(a_k)}{\\sum_{k'} \\exp(a_{k'})} \\tag{9}$$ \n\nwhere we have used $\\sigma(\\cdot)$ as shorthand for the logistic function, and $p_k$ is the value \nof the $k$th output unit. The resulting architecture is shown in figure 2. Because \neach unit in the hidden layer takes as input the difference in log-probability of two \nHMM's, this can be thought of as a fixed layer of weights connecting each hidden \nunit to a pair of HMM's with weights of \u00b11. \n\n^2 We take the time averaged log-probability so that the scale of the inputs is independent \nof the length of the sequence. \n\nFigure 2: A multi-layer density net with HMM's in the input layer. The hidden \nlayer units perform all pairwise comparisons between the HMM's. \n\nIn contrast to the Alphanet, which allocates one HMM to model each class, this network \ndoes not require a one-to-one alignment between models and classes and it gets \nmaximum discriminative benefit from the HMM's by comparing all pairs. Another \nbenefit of this architecture is that it allows us to use more HMM's than there are \nclasses. The unsupervised approach to training HMM classifiers is problematic because \nit depends on the assumption that a single HMM is a good model of the data \nand, in the case of speech, this is a poor assumption. Training the classifier discriminatively \nalleviates this drawback and the multi-layer classifier goes even further in \nthis direction by allowing many HMM's to be used to learn the decision boundaries \nbetween the classes. The intuition here is that many small HMM's can be a far \nmore efficient way to characterize sequences than one big HMM. When many small \nHMM's cooperate to generate sequences, the mutual information between different \nparts of generated sequences scales linearly with the number of HMM's and only \nlogarithmically with the number of hidden nodes in each HMM [5]. 
\n\n5 Derivative Updates for a Relative Density Network \n\nThe learning algorithm for an RDN is just the backpropagation algorithm applied \nto the network architecture as defined in equations 7, 8 and 9. The output layer is \na distribution over class memberships of data point $x_{1:T}$, and this is parameterized \nas a softmax function. We minimize the cross-entropy loss, which is equivalent to \nmaximizing the log probability of the correct class: \n\n$$\\ell = \\sum_{k=1}^{K} t_k \\log p_k \\tag{10}$$ \n\nwhere $p_k$ is the value of the $k$th output unit and $t_k$ is an indicator variable which is \nequal to 1 if $k$ is the true class. Taking derivatives of this expression with respect \nto the inputs of the output units yields \n\n$$\\frac{\\partial \\ell}{\\partial a_k} = t_k - p_k \\tag{11}$$ \n\n$$\\frac{\\partial \\ell}{\\partial W_{(mn),k}} = \\frac{\\partial \\ell}{\\partial a_k} \\frac{\\partial a_k}{\\partial W_{(mn),k}} = (t_k - p_k)\\, h_{(mn)} \\tag{12}$$ \n\nThe derivative of the output of the $(mn)$th hidden unit with respect to the output \nof the $i$th HMM, $\\mathcal{L}_i$, is \n\n$$\\frac{\\partial h_{(mn)}}{\\partial \\mathcal{L}_i} = \\sigma(\\mathcal{L}_m - \\mathcal{L}_n)\\left(1 - \\sigma(\\mathcal{L}_m - \\mathcal{L}_n)\\right)(\\delta_{im} - \\delta_{in}) \\tag{13}$$ \n\nwhere $(\\delta_{im} - \\delta_{in})$ is an indicator which equals +1 if $i = m$, -1 if $i = n$ and zero \notherwise. This derivative can be chained with the derivatives backpropagated \nfrom the output to the hidden layer. \n\nFor the final step of the backpropagation procedure we need the derivative of the \nlog-likelihood of each HMM with respect to its parameters. In the experiments we \nuse HMM's with a single, axis-aligned, Gaussian output density per state. We use \nthe following notation for the parameters: \n\n\u2022 $A$: $a_{ij}$ is the transition probability from state $i$ to state $j$ \n\u2022 $\\Pi$: $\\pi_i$ is the initial state prior \n\u2022 $\\mu_i$: mean vector for state $i$ \n\u2022 $v_i$: vector of variances for state $i$ \n\u2022 $\\mathcal{H}$: set of HMM parameters $\\{A, \\Pi, \\mu, v\\}$ \n\nWe also use the variable $s_t$ to represent the state of the HMM at time $t$. 
We make \nuse of the property of all latent variable density models that the derivative of the \nlog-likelihood is equal to the expected derivative of the joint log-likelihood under \nthe posterior distribution. For an HMM this means that: \n\n$$\\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\mathcal{H}_i} = \\sum_{s_{1:T}} P(s_{1:T}|x_{1:T}, \\mathcal{H}) \\frac{\\partial}{\\partial \\mathcal{H}_i} \\log P(x_{1:T}, s_{1:T}|\\mathcal{H}) \\tag{14}$$ \n\nThe expected joint log-likelihood of an HMM is: \n\n$$\\left\\langle \\log P(x_{1:T}, s_{1:T}|\\mathcal{H}) \\right\\rangle = \\sum_{i} \\langle\\delta_{s_1,i}\\rangle \\log \\pi_i + \\sum_{t=2}^{T} \\sum_{i,j} \\langle\\delta_{s_t,j}\\,\\delta_{s_{t-1},i}\\rangle \\log a_{ij} + \\sum_{t=1}^{T} \\sum_{i} \\langle\\delta_{s_t,i}\\rangle \\left[ -\\tfrac{1}{2} \\sum_{d} \\log v_{i,d} - \\tfrac{1}{2} \\sum_{d} (x_{t,d} - \\mu_{i,d})^2 / v_{i,d} \\right] + \\text{const} \\tag{15}$$ \n\nwhere $\\langle\\cdot\\rangle$ denotes expectations under the posterior distribution and $\\langle\\delta_{s_t,i}\\rangle$ and \n$\\langle\\delta_{s_t,j}\\,\\delta_{s_{t-1},i}\\rangle$ are the expected state occupancies and transitions under this distribution. \nAll the necessary expectations are computed by the forward-backward \nalgorithm. We could take derivatives with respect to this functional directly, \nbut that would require doing constrained gradient descent on the probabilities \nand the variances. Instead, we reparameterize the model using a \nsoftmax basis for probability vectors and an exponential basis for the variance \nparameters. This choice of basis allows us to do unconstrained optimization \nin the new basis. The new parameters are defined as follows: \n\n$$a_{ij} = \\frac{\\exp(\\theta^{(a)}_{ij})}{\\sum_{j'} \\exp(\\theta^{(a)}_{ij'})}, \\qquad \\pi_i = \\frac{\\exp(\\theta^{(\\pi)}_{i})}{\\sum_{i'} \\exp(\\theta^{(\\pi)}_{i'})}, \\qquad v_{i,d} = \\exp(\\theta^{(v)}_{i,d})$$ \n\nThis results in the following derivatives: \n\n$$\\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta^{(a)}_{ij}} = \\sum_{t=2}^{T} \\left[ \\langle\\delta_{s_t,j}\\,\\delta_{s_{t-1},i}\\rangle - \\langle\\delta_{s_{t-1},i}\\rangle a_{ij} \\right] \\tag{16}$$ \n\n$$\\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta^{(\\pi)}_{i}} = \\langle\\delta_{s_1,i}\\rangle - \\pi_i \\tag{17}$$ \n\n$$\\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\mu_{i,d}} = \\sum_{t=1}^{T} \\langle\\delta_{s_t,i}\\rangle (x_{t,d} - \\mu_{i,d}) / v_{i,d} \\tag{18}$$ \n\n$$\\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta^{(v)}_{i,d}} = \\frac{1}{2} \\sum_{t=1}^{T} \\langle\\delta_{s_t,i}\\rangle \\left[ (x_{t,d} - \\mu_{i,d})^2 / v_{i,d} - 1 \\right] \\tag{19}$$ \n\nWhen chained with the error signal backpropagated from the output, these derivatives \ngive us the direction in which to move the parameters of each HMM in order \nto increase the log probability of the correct classification of the sequence. \n\n6 Experiments \n\nTo evaluate the relative merits of the RDN, we compared it against an Alphanet \non a speaker identification task. The data was taken from the CSLU 'Speaker \nRecognition' corpus. It consisted of 12 speakers uttering phrases consisting of 6 \ndifferent sequences of connected digits recorded multiple times (48) over the course \nof 12 recording sessions. The data was pre-emphasized and Fourier transformed \nin 32 ms frames at a frame rate of 10 ms. It was then filtered using 24 bandpass, \nmel-frequency scaled filters. The log magnitude filter response was then used as the \nfeature vector for the HMM's. This pre-processing reduced the data dimensionality \nwhile retaining its spectral structure. \n\nWhile mel-cepstral coefficients are typically recommended for use with axis-aligned \nGaussians, they destroy the spectral structure of the data, and we would like to \nallow for the possibility that of the many HMM's some of them will specialize on \nparticular sub-bands of the frequency domain. 
They can do this by treating the \nvariance as a measure of the importance of a particular frequency band, using \nlarge variances for unimportant bands, and small ones for bands to which they pay \nparticular attention. \n\nWe compared the RDN with an Alphanet and three other models which were implemented \nas controls. The first of these was a network with a similar architecture \nto the RDN (as shown in figure 2), except that instead of fixed connections of \u00b11, \nthe hidden units have a set of adaptable weights to all $M$ of the HMM's. We refer \nto this network as a comparative density net (CDN). A second control experiment \nused an architecture similar to a CDN without the hidden layer, i.e. there is a single \nlayer of adaptable weights directly connecting the HMM's with the softmax output \nunits. We label this architecture a CDN-1. The CDN-1 differs from the Alphanet \nin that each softmax output unit has adaptable connections to the HMM's and we \ncan vary the number of HMM's, whereas the Alphanet has just one HMM per class \ndirectly connected to each softmax output unit. Finally, we implemented a version \nof a network similar to an Alphanet, but using a mixture of Gaussians as the input \ndensity model. We refer to this network as a MoGnet. The point of this comparison \nwas to see if the HMM actually achieves a benefit from modelling the temporal \naspects of the speaker recognition task. \n\nIn each experiment an RDN constructed out of a set of $M$ 4-state HMM's was \ncompared to the four other networks, all matched to have the same number of free \nparameters, except for the MoGnet. In the case of the MoGnet, we used the same \nnumber of Gaussian mixture models as HMM's in the Alphanet, each with the \nsame number of hidden states. Thus, it has fewer parameters, because it is lacking \nthe transition probabilities of the HMM. 
We ran the experiment four times with \nvalues of $M$ of 12, 16, 20 and 24. For the Alphanet and MoGnet we varied the \nnumber of states in the HMM's and the Gaussian mixtures, respectively. For the \nCDN model we used the same number of 4-state HMM's as the RDN and varied \nthe number of units in the hidden layer of the network. Since the CDN-1 network \nhas no hidden units, we used the same number of HMM's as the RDN and varied \nthe number of states in the HMM. The experiments were repeated 10 times with \ndifferent training-test set splits. All the models were trained using 90 iterations of \na conjugate gradient optimization procedure [6]. \n\nFigure 3: Results of the experiments for an RDN with (a) 12, (b) 16, (c) 20 and \n(d) 24 HMM's. (Boxplots; x-axis: Architecture, with categories RDN, Alphanet, MoGnet, CDN and CDN-1.) \n\n7 Results \n\nThe boxplot in figure 3 shows the results of the classification performance on the \n10 runs in each of the 4 experiments. Comparing the Alphanet and the RDN we \nsee that the RDN consistently outperforms the Alphanet. In all four experiments \nthe difference in their performance under a paired t-test was significant at the level \np < 0.01. 
This indicates that given a classification network with a fixed number of \nparameters, there is an advantage to using many small HMM's and using all the \npairwise information about an observed sequence, as opposed to using a network \nwith a single large HMM per class. \n\nIn the third experiment involving the MoGnet we see that its performance is comparable \nto that of the Alphanet. This suggests that the HMM's ability to model the \ntemporal structure of the data is not really necessary for the speaker classification \ntask as we have set it up.^3 Nevertheless, the performance of both the Alphanet and \nthe MoGnet is worse than that of the RDN. \n\n^3 If we had done text-dependent speaker identification, instead of multiple digit phrases, \nthen this might have made a difference. \n\nUnfortunately the CDN and CDN-1 networks perform much worse than we expected. \nWhile we expected these models to perform similarly to the RDN, it seems \nthat the optimization procedure takes much longer with these models. This is probably \nbecause the small initial weights from the HMM's to the next layer severely \nattenuate the backpropagated error derivatives that are used to train the HMM's. \nAs a result the CDN networks do not converge properly in the time allowed. \n\n8 Conclusions \n\nWe have introduced relative density networks, and shown that this method of discriminatively \nlearning many small density models in place of a single density model \nper class has benefits in classification performance. In addition, there may be a \nsmall speed benefit to using many smaller HMM's compared to a few big ones. \nComputing the probability of a sequence under an HMM is order $O(TK^2)$, where $T$ \nis the length of the sequence and $K$ is the number of hidden states in the network. \nThus, smaller HMM's can be evaluated faster. However, this is somewhat counterbalanced \nby the quadratic growth in the size of the hidden layer as $M$ increases. \n\nAcknowledgments \n\nWe would like to thank John Bridle, Chris Williams, Radford Neal, Sam Roweis, \nZoubin Ghahramani, and the anonymous reviewers for helpful comments. \n\nReferences \n\n[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, \"A maximization technique \noccurring in the statistical analysis of probabilistic functions of Markov chains,\" \nThe Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970. \n\n[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, \"Maximum mutual \ninformation of hidden Markov model parameters for speech recognition,\" \nin Proceedings of the IEEE International Conference on Acoustics, Speech and \nSignal Processing, pp. 49-53, 1986. \n\n[3] J. Bridle, \"Training stochastic model recognition algorithms as networks can \nlead to maximum mutual information estimation of parameters,\" in Advances in \nNeural Information Processing Systems (D. Touretzky, ed.), vol. 2, (San Mateo, \nCA), pp. 211-217, Morgan Kaufmann, 1990. \n\n[4] M. I. Jordan, \"Why the logistic function? A tutorial discussion on probabilities \nand neural networks,\" Tech. Rep. Computational Cognitive Science, Technical \nReport 9503, Massachusetts Institute of Technology, August 1995. \n\n[5] A. D. Brown and G. E. Hinton, \"Products of hidden Markov models,\" in Proceedings \nof Artificial Intelligence and Statistics 2001 (T. Jaakkola and T. Richardson, \neds.), pp. 3-11, Morgan Kaufmann, 2001. \n\n[6] C. E. Rasmussen, Evaluation of Gaussian Processes and other Methods for Non-Linear \nRegression. PhD thesis, University of Toronto, 1996. Matlab conjugate \ngradient code available from http://www.gatsby.ucl.ac.uk/~edward/code/. \n", "award": [], "sourceid": 2137, "authors": [{"given_name": "Andrew", "family_name": "Brown", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}