{"title": "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 211, "page_last": 217, "abstract": null, "full_text": "Training Stochastic Model Recognition Algorithms \n\n211 \n\nTraining Stochastic Model Recognition \n\nAlgorithms as Networks can lead to Maximum \nMutual Information Estimation of Parameters \n\nJohn s. Bridle \n\nRoyal Signals and Radar Establishment \n\nGreat Malvern \n\nWorcs. \n\nUK \n\nWR143PS \n\nABSTRACT \n\nOne of the attractions of neural network approaches to pattern \nrecognition is the use of a discrimination-based training method. \nWe show that once we have modified the output layer of a multi(cid:173)\nlayer perceptron to provide mathematically correct probability dis(cid:173)\ntributions, and replaced the usual squared error criterion with a \nprobability-based score, the result is equivalent to Maximum Mu(cid:173)\ntual Information training, which has been used successfully to im(cid:173)\nprove the performance of hidden Markov models for speech recog(cid:173)\nnition. If the network is specially constructed to perform the recog(cid:173)\nnition computations of a given kind of stochastic model based clas(cid:173)\nsifier then we obtain a method for discrimination-based training of \nthe parameters of the models. Examples include an HMM-based \nword discriminator, which we call an 'Alphanet'. \n\nINTRODUCTION \n\n1 \nIt has often been suggested that one of the attractions of an adaptive neural network \n(NN) approach to pattern recognition is the availability of discrimination-based \ntraining (e.g. in Multilayer Perceptrons (MLPs) using Back-Propagation). 
Among the disadvantages of NN approaches are the lack of theory about what can be computed with any particular structure, what can be learned, how to choose a network architecture for a given task, and how to deal with data (such as speech) in which an underlying sequential structure is of the essence. There have been attempts to build internal dynamics into neural networks, using recurrent connections, so that they might deal with sequences and temporal patterns [1, 2], but there is a lack of relevant theory to inform the choice of network type. \n\nHidden Markov models (HMMs) are the basis of virtually all modern automatic speech recognition systems. They can be seen as an extension of the parametric statistical approach to pattern recognition, to deal (in a simple but principled way) with temporal patterning. Like most parametric models, HMMs are usually trained using within-class maximum-likelihood (ML) methods, and an EM algorithm due to Baum and Welch is particularly attractive (see for instance [3]). However, recently some success has been demonstrated using discrimination-based training methods, such as the so-called Maximum Mutual Information criterion [4] and Corrective Training [5]. \n\nThis paper addresses two important questions: \n\n\u2022 How can we design Neural Network architectures with at least the desirable properties of methods based on stochastic models (such as hidden Markov models)? \n\n\u2022 What is the relationship between the inherently discriminative neural network training and the analogous MMI training of stochastic models? \n\nWe address the first question in two steps. 
Firstly, to make sure that the outputs of our network have the simple mathematical properties of conditional probability distributions over class labels we recommend a generalisation of the logistic nonlinearity; this enables us (but does not require us) to replace the usual squared error criterion with a more appropriate one, based on relative entropy. Secondly, we also have the option of designing networks which exactly implement the recognition computations of a given stochastic model method. (The resulting 'network' may be rather odd, and not very 'neural', but this is engineering, not biology.) As a contribution to the investigation of the second question, we point out that optimising the relative entropy criterion is exactly equivalent to performing Maximum Mutual Information Estimation. \n\nBy way of illustration we describe three 'networks' which implement stochastic model classifiers, and show how discrimination training can help. \n\n2 TRAINABLE NETWORKS AS PARAMETERISED CONDITIONAL DISTRIBUTION FUNCTIONS \n\nWe consider a trainable network, when used for pattern classification, as a vector function Q(x, θ) from an input vector x to a set of indicators of class membership, {Q_j}, j = 1, ..., N. The parameters θ modify the transfer function. In a multilayer perceptron, for instance, the parameters would be values of weights. Typically, we have a training set of pairs (x_t, c_t), t = 1, ..., T, of inputs and associated true class labels, and we have to find a value for θ which specialises the function so that it is consistent with the training set. A common procedure is to minimise E(θ), the sum of the squares of the differences between the network outputs and true class indicators, or targets: \n\nE(θ) = Σ_{t=1}^{T} Σ_{j=1}^{N} (Q_j(x_t, θ) − δ_{j,c_t})², \n\nwhere δ_{j,c} = 1 if j = c, otherwise 0. 
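By way of illustration (not from the paper), the squared-error criterion just defined can be computed directly; the function and the toy values below are hypothetical:

```python
# Sketch of the squared-error criterion E for a 1-from-N classifier.
# outputs: one length-N output vector per training example; labels: true class indices.
def squared_error(outputs, labels):
    total = 0.0
    for q, c in zip(outputs, labels):
        for j, qj in enumerate(q):
            target = 1.0 if j == c else 0.0  # the 1-from-N indicator
            total += (qj - target) ** 2
    return total

outputs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
labels = [0, 1]
print(squared_error(outputs, labels))
```

Minimising this score drives each output towards its 1-from-N target indicator.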
E and Q will be written without the θ argument where the meaning is clear, and we may drop the t subscript. \n\nIt is well known that the value of F(x) which minimises the expected value of (F(x) − y)² is the expected value of y given x. The expected value of δ_{j,c_t} is P(C = j | X = x_t), the probability that the class associated with x_t is the jth class. \n\nFrom now on we shall assume that the desired output of a classifier network is this conditional probability distribution over classes, given the input. \n\nThe outputs must satisfy certain simple constraints if they are to be interpretable as a probability distribution. For any input, the outputs must all be positive and they must sum to unity. The use of logistic nonlinearities at the outputs of the network ensures positivity, and also ensures that each output is less than unity. These constraints are appropriate for outputs that are to be interpreted as probabilities of Boolean events, but are not sufficient for 1-from-N classifiers. \n\nGiven a set of unconstrained values, V_j(x), we can ensure both conditions by using a Normalised Exponential transformation: \n\nQ_j(x) = exp(V_j(x)) / Σ_k exp(V_k(x)). \n\nThis transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax. Like the logistic, it has a simple implementation in transistor circuits [6]. \n\nIf the network is such that we can be sure the values we have are all positive, it may be more appropriate just to normalise them. 
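The normalised exponential (softmax) transformation above is easy to sketch; this toy example (hypothetical values) simply checks the properties claimed for it:

```python
import math

# Sketch of the normalised exponential (softmax) transformation.
def softmax(v):
    # Subtracting the maximum is a standard numerical-stability step
    # (an implementation detail, not part of the text's definition).
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

q = softmax([2.0, 1.0, 0.1])
print(q)  # positive, sums to one, preserves the rank order of the inputs
```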
In particular, if we can treat them as likelihoods of the data given the possible classes, L_j(x) = P(X = x | c = j), then normalisation produces the required conditional distribution (assuming equal prior probabilities for the classes). \n\n3 RELATIVE ENTROPY SCORING FOR CLASSIFIERS \n\nIn this section we introduce an information-theoretic criterion for training 1-from-N classifier networks, to replace the squared error criterion, both for its intrinsic interest and because of the link to discriminative training of stochastic models. \n\nIn probability-density based classification we choose the class with the highest likelihood. This is justified by \n\nP(c | x) = P(x | c) P(c) / P(x), \n\nif we assume equal priors P(c) (this can be generalised) and see that the denominator P(x) = Σ_c P(x | c) P(c) is the same for all classes. \n\nIt is also usual to train such classifiers by maximising the data likelihood given the correct classes. Maximum Likelihood (ML) training is appropriate if we are choosing from a family of pdfs which includes the correct one. In most real-life applications of pattern classification we do not have knowledge of the form of the data distributions, although we may have some useful ideas. In that case ML may be a rather bad approach to pdf estimation for the purpose of pattern classification, because what matters is the relative densities. \n\nAn alternative is to optimise a measure of success in pattern classification, and this can make a big difference to performance, particularly when the assumptions about the form of the class pdfs are badly wrong. \n\nTo make the likelihoods produced by a SM classifier look like NN outputs we can simply normalise them: \n\nQ_k(x) = L_k(x) / Σ_r L_r(x). \n\nThen we can use Neural Network optimisation methods to adjust the parameters. \n\nThe Mutual Information between two random variables X and Y is a sum, weighted by the joint probability, of the MI of the joint events: \n\nI(X, Y) = Σ_{(x,y)} P(X=x, Y=y) log [ P(X=x, Y=y) / (P(X=x) P(Y=y)) ]. \n\nFor discrimination training of sets of stochastic models, Bahl et al. 
suggest maximising the Mutual Information, I, between the training observations and the choice of the corresponding correct class: \n\nI(X, C) = Σ_t log [ P(C=c_t, X=x_t) / (P(C=c_t) P(X=x_t)) ] = Σ_t log [ P(C=c_t | X=x_t) P(X=x_t) / (P(C=c_t) P(X=x_t)) ]. \n\nP(C=c_t | X=x_t) should be read as the probability that we choose the correct class for the t-th training example. If we are choosing classes according to the conditional distribution computed using parameters θ then P(C=c_t | X=x_t) = Q_{c_t}(x_t, θ), and \n\nI(X, C) = Σ_t log Q_{c_t}(x_t, θ) − Σ_t log P(C=c_t). \n\nIf the second term involving the priors is fixed, we are left with maximising \n\nΣ_t log Q_{c_t}(x_t, θ) = −J. \n\nThe RE-based score we use is J = −Σ_{t=1}^{T} Σ_{j=1}^{N} P_{jt} log Q_j(x_t), where P_{jt} is the probability of class j associated with input x_t in the training set. If as usual the training set specifies only one true class, c_t, for each x_t then P_{jt} = δ_{j,c_t} and \n\nJ = −Σ_{t=1}^{T} log Q_{c_t}(x_t), \n\nthe sum of the logs of the outputs for the correct classes. \n\nJ can be derived from the Relative Entropy of distribution Q with respect to the true conditional distribution P, averaged over the input distribution: \n\n∫ dx P(X=x) G(Q | P), where G(Q | P) = −Σ_c P(c | x) log [ Q_c(x) / P(c | x) ]. \n\nRelative entropy is also known as discrimination information, cross entropy, asymmetric divergence, directed divergence, I-divergence, and Kullback-Leibler number. RE scoring is the basis for the Boltzmann Machine learning algorithm [7] and has also been proposed and used for adaptive networks with continuous-valued outputs [8, 9, 10, 11], but usually in the form appropriate to separate logistics and independent Boolean targets. An exception is [12]. \n\nThere is another way of thinking about this 'log-of-correct-output' score. 
Assume that the way we would use the outputs of the network is that, rather than choosing the class with the largest output, we choose randomly, picking from the distribution specified by the outputs. (Pick class j with probability Q_j.) The probability of choosing the class c_t for training sample x_t is simply Q_{c_t}(x_t). The probability of choosing the correct class labels for all the training set is ∏_{t=1}^{T} Q_{c_t}(x_t). We simply seek to maximise this probability, or what is equivalent, to minimise minus its log: \n\nJ = −Σ_{t=1}^{T} log Q_{c_t}(x_t). \n\nIn order to compute the partial derivatives of J with respect to parameters of the network, we first need ∂J_t/∂Q_j = −P_{jt}/Q_j. The details of the back-propagation depend on the form of the network, but if the final non-linearity is a normalised exponential (softmax), \n\nQ_j(x) = exp(V_j(x)) / Σ_k exp(V_k(x)), \n\nthen [6] \n\n∂J_t/∂V_j = Q_j(x_t) − δ_{j,c_t}. \n\nWe see that the derivative before the output nonlinearity is the difference between the corresponding output and a one-from-N target. We conclude that softmax output stages and 1-from-N RE scoring are natural partners. \n\n4 DISCRIMINATIVE TRAINING \n\nIn stochastic model (probability-density) based pattern classification we usually compute likelihoods of the data given models for each class, P(x | c), and choose the class with the highest likelihood. So minimising our J criterion is also maximising Bahl's mutual information. (Also see [13].) \n\n5 STOCHASTIC MODEL CLASSIFIERS AS NETWORKS \n\n5.1 EXAMPLE ONE: A PAIR OF MULTIVARIATE GAUSSIANS \n\nThe conditional distribution for a pair of multivariate Gaussian densities with the same arbitrary covariance matrix is a logistic function of a weighted sum of the input coordinates (plus a constant). 
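This claim can be checked numerically. A minimal sketch, assuming unit spherical covariances and equal priors; the toy means and input are hypothetical, and the weight vector m1 − m2 and bias fall out of expanding the squared distances:

```python
import math

# Posterior for class 1 from two unit-covariance Gaussians with equal priors.
def gauss_loglik(x, m):
    return -0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def posterior(x, m1, m2):
    e1 = math.exp(gauss_loglik(x, m1))
    e2 = math.exp(gauss_loglik(x, m2))
    return e1 / (e1 + e2)

# The same posterior as a logistic of a weighted sum plus a constant.
def logistic_form(x, m1, m2):
    w = [a - b for a, b in zip(m1, m2)]
    bias = 0.5 * (sum(b * b for b in m2) - sum(a * a for a in m1))
    z = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

m1, m2, x = [1.0, 0.0], [-1.0, 2.0], [0.3, 0.7]
print(posterior(x, m1, m2), logistic_form(x, m1, m2))  # agree
```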
Therefore, even if we make such incorrect assumptions as equal priors and spherical unit covariances, it is still possible to find values for the parameters of the model (the positions of the means of the assumed distributions) for which the form of the conditional distribution is correct. (The means may be far from the means of the true distributions and from the data means.) Of course in this case we have the alternative of using a weighted-sum logistic unit to compute the conditional probability: the parameters are then the weights. \n\n5.2 EXAMPLE TWO: A MULTI-CLASS GAUSSIAN CLASSIFIER \n\nConsider a model in which the distributions for each class are multivariate Gaussian, with equal isotropic unit variances, and different means, {m_j}. The probability distribution over class labels, given an observation x, is P(c = j | x) = exp(V_j) / Σ_k exp(V_k), where V_j = −‖x − m_j‖². This can be interpreted as a one-layer feed-forward non-linear network. The usual weighted sums are replaced by squared Euclidean distances, and the usual logistic output non-linearities are replaced by a normalised exponential. \n\nFor a particular two-dimensional 10-class problem, derived from Peterson and Barney's formant data, we have demonstrated [6] that training such a network can cause the m_j to move from their \"natural\" positions at the data means (the in-class maximum likelihood estimates), and this can improve classification performance on unseen data (from 68% correct to 78%). \n\n5.3 EXAMPLE THREE: ALPHANETS \n\nConsider a set of hidden Markov models (HMMs), one for each word, each parameterised by a set of state transition probabilities, {a^k_{ij}}, and observation likelihood functions {b^k_j(x)}, where a^k_{ij} is the probability that in model k state i will be followed by state j, and b^k_j(x) is the likelihood of model k emitting observation x from state j. 
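Given parameters of this form, the likelihood of a model generating an observation sequence can be computed with the standard forward recursion of [3]; a toy sketch with hypothetical numbers (a two-state model, observation likelihoods precomputed per time step):

```python
# Toy HMM word likelihood via the standard forward (alpha) recursion.
# a[i][j]: transition probability from state i to state j.
# b[j][t]: likelihood of the t-th observation being emitted from state j.
# init[j]: initial state probabilities. The word ends in the last state.
def forward_likelihood(a, b, init):
    n = len(init)
    T = len(b[0])
    alpha = [init[j] * b[j][0] for j in range(n)]
    for t in range(1, T):
        alpha = [b[j][t] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]
    return alpha[n - 1]  # alpha of the final state at the last time step

a = [[0.6, 0.4], [0.0, 1.0]]
b = [[0.9, 0.5, 0.1], [0.2, 0.5, 0.8]]
init = [1.0, 0.0]
print(forward_likelihood(a, b, init))
```

Normalising such likelihoods across the set of word models, Q_k = L_k / Σ_r L_r, gives the posterior outputs of the Alphanet described below.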
For simplicity we insist that the end of the word pattern corresponds to state N of a model. \n\nThe likelihood, L_k(x_1^M), of model k generating a given sequence x_1^M = x_1, ..., x_M is a sum, over all sequences of states, of the joint likelihood of that state sequence and the data: \n\nL_k(x_1^M) = Σ_{s_1...s_M} ∏_{t=2}^{M} a^k_{s_{t-1} s_t} b^k_{s_t}(x_t), with s_M = N. \n\nThis can be computed efficiently via the forward recursion [3] \n\nα^k_j(t) = b^k_j(x_t) Σ_i α^k_i(t−1) a^k_{ij}, \n\ngiving \n\nL_k(x_1^M) = α^k_N(M), \n\nwhich we can think of as a recurrent network. (Note that t is used as a time index here.) \n\nIf the observation sequence x_1^M could only have come from one of a set of known, equally likely models, then the posterior probability that it was from model k is \n\nP(class = k | x_1^M) = Q_k(x_1^M) = L_k(x_1^M) / Σ_r L_r(x_1^M). \n\nThese numbers are the output of our special \"recurrent neural network\" for isolated word discrimination, which we call an \"Alphanet\" [14]. Backpropagation of partial derivatives of the J score has the form of the backward recurrence used in the Baum-Welch algorithm, but they include discriminative terms, and we obtain the gradient of the relative entropy/mutual information. \n\n6 CONCLUSIONS \n\nDiscrimination-based training is different from within-class parameter estimation, and it may be useful. (Also see [15].) Discrimination-based training for stochastic models and for networks are not distinct, and in some cases can be mathematically identical. \n\nThe notion of specially constructed 'network' architectures which implement stochastic model recognition algorithms provides a way to construct fertile hybrids. For instance, a Gaussian classifier (or an HMM classifier) can be preceded by a nonlinear transformation (perhaps based on semilinear logistics) and all the parameters 
This seems a useful approach to automating the \ndiscovery of 'feature detectors'. \n\n\u00a9 British Crown Copyright 1990 \n\nReferences \n[1] R P Lippmann. Review of neural networks for speech recognition. Neural \n\nComputation, 1(1), 1989. \n\n[2] It L Watrous . Connectionist speech recognition using the temporal flow model. \n\nIn .Pl'Oc. IEEE W ol'kshop on Speech Recognition, June 1988. \n\n[3] A B Poritz. Hidden Markov models: a guided tour. In Proc. IEEE Int. Conf. \n\nAcouslics Speech and Signal P1'Ocessillg, pages 7-13, 1988. \n\n[4] L R Bahl, P F Brown, P V de Souza, and R L Mercer. Maximum mutual \ninformation estimation of hidden Markov model parameters. In Proc. IEEE \nTnt. Conf. Acoustics Speech and Signal P,'ocessing, pages 49-52, 1986. \n\n[5] L R Bahl, P F Brown, P V de Souza, and R L r.fercer. A new algorithm for the \nestimation of HMM parameters. In P,'Vf. IEEE Int. Con!. Acoustics Speech \nand Signal Processmg, pages 493-496, 1988. \n\n[6] J S Bridle. Probabilistic interpretation of feedforward classification network \noutput.s, with relationships to statistical pattern recognition. In F Fougelman(cid:173)\nSoulie and J Herault, editors, Neuro-computing: algorithms, architectures and \nappfications, Springer-Verlag, 1989. \n\n[7] D HAckley, G E Hinton, and T J Sejnowski. A learning algorithm for Boltz(cid:173)\n\nmann machines. Cognitive Science, 9:147-168,1985. \n\n[8] L Gillick. Probability scores for backpropagation networks. July 1987. Per(cid:173)\n\nsonal communication. \n\n[9] G E Hinton. Connectionist LeaJ'ning Procedures. Technical Report CMU-CS-\n87-115, Carnegie Mellon University Computer Science Department, June 1987. \n[10] E B Baum and F Wilczek. Supervised learning of probability distributions \nIn D Anderson, editor, Neura,Z Infol'mation Processing \n\nby neural networks. \nSystems, pages 52\"-6], Am. lnst. of Physics, 1988. \n\n[11] S SoHa, E Levin, and M Fleisher. Accelerated learning in layered neural net(cid:173)\n\nworks. 
Complex Systems, January 1989. \n\n[12] E Yair and A Gersho. The Boltzmann Perceptron Network: a soft classifier. In D Touretzky, editor, Advances in Neural Information Processing Systems 1, San Mateo, CA: Morgan Kaufmann, 1989. \n\n[13] P S Gopalakrishnan, D Kanevsky, A Nadas, D Nahamoo, and M A Picheny. Decoder selection based on cross-entropies. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 20-23, 1988. \n\n[14] J S Bridle. Alphanets: a recurrent 'neural' network architecture with a hidden Markov model interpretation. Speech Communication, Special Neurospeech issue, February 1990. \n\n[15] L Niles, H Silverman, G Tajchman, and M Bush. How limited training data can allow a neural network to out-perform an 'optimal' classifier. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, 1989. \n", "award": [], "sourceid": 195, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}]}