{"title": "RecNorm: Simultaneous Normalisation and Classification applied to Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 234, "page_last": 240, "abstract": null, "full_text": "RecNorm: Simultaneous Normalisation and \nClassification applied to Speech Recognition \n\nJohn S. Bridle \nRoyal Signals and Radar Est. \nGreat Malvern \nUK WR143PS \n\nStephen J. Cox \n\nBritish Telecom Research Labs. \n\nIpswich \n\nUK \n\nIP57RE \n\nAbstract \n\nA particular form of neural network is described, which has terminals \nfor acoustic patterns, class labels and speaker parameters. A method of \ntraining this network to \"tune in\" the speaker parameters to a particular \nspeaker is outlined, based on a trick for converting a supervised network \nto an unsupervised mode. We describe experiments using this approach \nin isolated word recognition based on whole-word hidden Markov models. \nThe results indicate an improvement over speaker-independent perfor(cid:173)\nmance and, for unlabelled data, a performance close to that achieved on \nlabelled data. \n\n1 \n\nINTRODUCTION \n\nWe are concerned to emulate some aspects of perception. In particular, the way that \na stimulus which is ambiguous, perhaps because of unknown lighting conditions, can \nbecome unambiguous in the context of other such stimuli: the fact that they are \nsubject to tbe same unknown conditions gives our perceptual apparatus enough \nconstraints to solve tbe problem. \nIndividual words are often ambiguous even to \nhuman listeners. For instance a Cockney might say the word \"ace\" to sound the \nsame as a Standard English speaker's \"ice\". Similarly with \"room\" and \"rum\", or \n\"work\" and \"walk\" ill other pairs of British English accents. If we heard one of these \nambiguous pronunciations, knowing nothing else about the speaker we could not tell \nwhich word had been said. 
For current automatic speech recognition (ASR) systems such effects are much more frequent, because we do not know how to concentrate on the important aspects of the signal locally, nor how to exploit the fact that some unknown properties apply to whole words, nor how to bring to bear on the task of acoustic disambiguation all the information that is normally latent in the context of the utterance. \n\nMost attempts to construct ASR systems which can be used by many persons have used so-called speaker-independent models. When decoding a short sequence of words there is no way of imposing our knowledge that all the speech is uttered by one person. \n\nTo enable adaptation using small amounts of speech from a new speaker we propose to factor the speech knowledge into speaker-independent models, continuous speaker-specific parameters and a transformation which modifies the models according to the speaker parameters. (In this paper we shall only use transformations which can just as easily be applied to the input patterns.) We are especially interested in the possibility of estimating such parameters from quite small amounts of unlabelled speech, such as a few short words or one longer word. Although the types of models and transformations we have used are very simple, we hope the general approach will be applicable to quite sophisticated models and transformations which will be necessary for future high-performance speech recognition systems. \n\n2 AN ADAPTIVE NETWORK APPROACH \n\n2.1 GENERAL IDEA \n\nSuppose we had a feed-forward network with three (vector-valued) terminals, which encapsulates our knowledge of the relationship between acoustic patterns, X, class labels (e.g. word identities), C, and speaker parameters, Q. Training such a network seems difficult, because although we can supply (X, C) pairs, we do not know the appropriate values of Q. 
(We only know the names of the speakers, or perhaps some phonetician's descriptive labels.) \n\nIn training the network we start with default values of Q, feed forward from X and Q to C, back-propagate derivatives to internal parameters of the network (weights, transition probabilities, etc.) and also to the Qs, enforcing the constraint that the Qs for any one speaker stay equal. We can imagine one copy of the network for each utterance, with the Q terminals of networks dealing with the same speaker strapped together. One convenient implementation (for a small number of training speakers) is to adapt one Q vector per speaker in a set of weights from one-from-N coded speaker identity inputs to linear units, as we shall see later. \n\nOnce the network is trained we have two modes of use. If we have available one or more known utterances by a new speaker, then we can \"tune in\" to the speaker (as during training) except that only the Q inputs are adjusted. The case of most interest in this paper, however, is when we have a few unknown words from an unknown speaker. We set up a Q-strapped set of networks, one for each word, initialise the Q values to their defaults, propagate forwards to produce a set of distributions across word labels, and then we use a technique which tends to sharpen these distributions. In the simplest case, the sharpening process could be a matter of: for each utterance pick the word label with the largest output, and assuming it to be correct back-propagate derivatives to the common Q. In practice, we can use a gentler method in which large outputs get most 'encouragement'. For some networks it is possible to show that such a \"phantom target\" procedure can lead to hillclimbing on the likelihood of the data given an assumption about the form of the generator of the data (see Appendix). 
\n\n2.2 SIMPLE NETWORK ILLUSTRATION \n\nWe have explored these ideas using a very simple network based on that in figure 1. It can be viewed either as a feedforward network with radial (minus Euclidean distance squared) units and a generalised-logistic (Softmax) output non-linearity, or as a Gaussian classifier in which the covariance matrices are unit diagonal (see [Bri90b]). Training is done by gradient-based optimisation, using back-propagation of partial derivatives. During training the criterion is based on relative entropy (likelihood of the targets given the network outputs) [Bri90c]. (Such discriminative training can lead to different results from the usual model-based methods [Bri90b], which in this case would set the reference points at the data means for each class.) \n\nThis simple classifier network is preceded by a full linear transformation (6 parameters), so the equivalent model-based classifier has Gaussian distributions with the same arbitrary covariance matrix for each class. We use the biases of the linear units as speaker parameters, so the weights from speaker identity inputs go straight into the hidden units, as in figure 2. \n\nDuring adaptation to a new speaker from unlabelled tokens, the speaker parameters of the transformation are allowed to adapt, but the (\"phantom\") targets are derived from the outputs themselves (the targets are just double the outputs) so that the largest outputs are encouraged. \n\nIn figure 3 we see the adaptation of the positions of the reference points of the radial units in figure 2 when the input points are essentially the 6 reference points displaced to one side (to represent one example of each word spoken by a new speaker). Adaptation based on tentative classifications pulls the reference points towards a position where the inputs can be given confident, consistent labels. 
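The network of figures 1 and 2 and the phantom-target adjustment of a shared speaker parameter can be sketched as below. This is a minimal numpy illustration, not the paper's implementation: the class name, the learning rate, the number of steps and the 2-D geometry are all our own assumptions. The gradient used follows the Appendix result that the phantom targets make the error signal at the Softmax input equal to minus the outputs, so the shared bias q climbs the log-likelihood of the unlabelled tokens.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

class RadialSoftmaxClassifier:
    """Radial units (minus squared Euclidean distance to a reference
    point per class) followed by a Softmax, as in figure 1; a bias q
    added to the input plays the role of the speaker parameters
    (figure 2)."""
    def __init__(self, refs):
        self.refs = np.asarray(refs, dtype=float)  # one reference point per class

    def outputs(self, x, q):
        v = -np.sum((self.refs - (x + q)) ** 2, axis=1)  # radial units V_j
        return softmax(v)                                # class posteriors Q_j

def adapt_speaker_bias(net, tokens, q0, lr=0.05, steps=100):
    """Phantom-target adaptation of the shared speaker bias q.
    With targets T_j = 2*Q_j the backprop delta is -Q_j, i.e. gradient
    ascent on L = log sum_k exp(V_k); here
    dL/dq = sum_j Q_j * dV_j/dq with dV_j/dq = 2*(r_j - (x + q))."""
    q = np.array(q0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(q)
        for x in tokens:
            p = net.outputs(x, q)
            grad += 2 * p @ (net.refs - (x + q))
        q += lr * grad / len(tokens)
    return q
```

As in figure 3, if the unlabelled tokens are the reference points displaced to one side, the adaptation recovers (minus) the displacement and the tokens can then be labelled consistently.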
\n\n3 SPEECH RECOGNITION EXPERIMENTS \n\nWe have applied these ideas to the problem of recognising a few short, confusable words from a known set, all spoken by the same unknown speaker. If our method works we should be able to recognise each word better (on average) if we also look at a few other unknown words from the same speaker. \n\nThe dataset [Sal89], which had been recorded previously for other purposes, comprised the British English isolated pronunciations of the names of the letters of the alphabet, each spoken 3 times by each speaker. The 104 speakers were divided into two groups of 52 (Train and Test), balanced for age and sex. Initial acoustic analysis produced 28-component spectrum vectors, 100 per second. In place of the 2-D input patterns discussed above, each speech pattern was a variable-duration sequence (typically about 50) of 28-vectors. \n\nIn place of the simple Gaussian density class-models we used a set of Gaussian densities and a matrix of probabilities of transitions between them. Each class-model is thus a hidden Markov model (HMM) of a word. We used 26 HMMs, each with 15 states, each with a 3-component Gaussian mixture output distribution. For further details see [CB90]. \n\n[Fig.1: Feedforward network implementing simple Gaussian classifier. Fig.2: Gaussian classifier network with input transformation and speaker inputs. Fig.3: Adaptation to 6 displaced points. Fig.4: Average error rates for alphabet word recognition.] \n\nThe equivalent to the evaluation of a Gaussian density in the simple network is the Forward (or Alpha) computation of the likelihood of the data given a (hidden Markov) model. 
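The Forward (alpha) computation can be sketched as below. This is a minimal numpy illustration with an API of our own devising (per-frame state output densities supplied as an array), not the paper's implementation; the Bayes inversion under equal priors is just a normalisation of the per-word likelihoods.

```python
import numpy as np

def forward_likelihood(obs_probs, trans, init):
    """Forward (alpha) recursion: likelihood of an observation sequence
    given one HMM.  obs_probs[t, s] is the state output density
    b_s(x_t), trans[s, s'] the transition probability, init[s] the
    initial state distribution."""
    alpha = init * obs_probs[0]
    for t in range(1, len(obs_probs)):
        alpha = (alpha @ trans) * obs_probs[t]   # sum over paths into each state
    return alpha.sum()

def class_posteriors(likelihoods):
    """Bayes inversion assuming equal prior probabilities: normalise the
    per-word likelihoods into class probabilities (the output stage of
    an Alphanet)."""
    l = np.asarray(likelihoods, dtype=float)
    return l / l.sum()
```

For a real 15-state word model one would work with log-likelihoods (or scaled alphas) to avoid underflow over a 50-frame utterance; the plain recursion above is kept for clarity.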
This calculation can be thought of as being performed by a recurrent network of a special form. When we include the Bayes inversion to produce probabilities of the classes (this is a normalisation if we assume equal prior probabilities) we obtain the equivalent of the simple network of figure 1, which we call an Alphanet [Bri90a]. \n\nIn place of the 2-component linear transformation in figure 2 we use a constrained linear transformation based on [Hun81]: yi = ai xi-1 + bi xi + ci xi+1 + di, where xi, i = 1, ..., 28, is the log spectrum amplitude in frequency channel i. We tried three conditions: \n\n\u2022 Bias Only: ai = 0, bi = 1, ci = 0 (28 parameters) \n\n\u2022 Fixed Shift: ai = a, bi = b, ci = c (31 parameters) \n\n\u2022 Variable Shift: the general case (107 parameters) \n\nFigure 4 shows average word error rates for the three types of transformation, for different numbers of utterances taken together (N = 3, 10, 20, 78). N = 1 is the non-adaptive case. 'Cheat' Mode is a check on the power of the transformations: for each test speaker, all 78 utterances were used to set the parameters of the transformation, then recognition performance was measured on those same utterances using those parameters. \n\nWe see: \n\n\u2022 Use of unsupervised adaptation reduced the error rates. \n\n\u2022 The reductions are not spectacular (15% errors to 12% errors, a reduction in error rate of 20%) but they are statistically significant and may be practically significant too. \n\n\u2022 The performance in 'Cheat' Mode is only a little better than in unsupervised mode, so performance is being limited by the power of the transformation. \n\n\u2022 The Fixed Shift transformation gives quite good results even on only 3 words at a time. 
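Returning to the constrained transformation of [Hun81] above: it can be sketched as a single function of the four coefficient sets. This is an illustrative numpy sketch; the paper does not state how the edge channels (which lack a neighbour on one side) are handled, so taking the out-of-range neighbours as zero is our assumption.

```python
import numpy as np

def constrained_transform(x, a, b, c, d):
    """y_i = a_i*x_{i-1} + b_i*x_i + c_i*x_{i+1} + d_i applied to a
    28-channel log spectrum x.  Out-of-range neighbours are taken as
    zero (an assumption; the paper does not specify edge handling)."""
    x = np.asarray(x, dtype=float)
    left = np.concatenate(([0.0], x[:-1]))   # x_{i-1}
    right = np.concatenate((x[1:], [0.0]))   # x_{i+1}
    return a * left + b * x + c * right + d

# The three conditions differ only in which coefficients are free:
#   Bias Only:      a = 0, b = 1, c = 0, only d adapted (28 parameters)
#   Fixed Shift:    a, b, c shared scalars, plus d (31 parameters)
#   Variable Shift: all coefficients free per channel (the paper
#                   quotes 107 parameters)
```

Because numpy broadcasts scalars against the 28-vector, the same function covers all three conditions: pass scalars for a, b, c in the first two, full vectors in the third.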
\n\nWhen tested on a 120-talker telephone-line database of isolated digits collected at British Telecom, the best unsupervised speaker adaptation technique gave a 37% decrease in error-rate (for both supervised and unsupervised adaptation on 5 utterances) using a simple front-end consisting of 8 MFCCs (mel frequency-scale cepstrum coefficients). A more sophisticated front-end (using differential information and energy) improved the unadapted performance by 63% over the 8 MFCC front-end. Using this front-end, the best unsupervised adaptation technique (on 5 utterances) decreased the error-rate by a further 25%. \n\n4 CONCLUSIONS \n\nThe results reported here show that simultaneous word recognition and speaker normalisation can be made to work, that it improves performance over the corresponding speaker-independent version, and that given 3 to 10 unknown words performance can be almost as good as when the adaptation is done using knowledge of the word identities. The main extensions we are interested in are to use non-linear transformations, and to learn low-dimensional but effective speaker parameterisations. \n\nA Unsupervised Adaptation using Phantom Targets \n\nWe aim to motivate the 'phantom target' trick of feeding back twice each output of the network as a target. \n\nSuppose we have a classifier network, with a 1-from-N output coding, and a Softmax output nonlinearity. We write Qj for an output value, Vj for an input to the Softmax output stage, x for the input to the network, c for a class and θ for parameters which we may want to adjust. A typical output value is \n\nQj(x, θ) = exp(Vj(x, θ)) / Σk exp(Vk(x, θ)). \n\nThe output values are interpretable as estimates of posterior probabilities: Qj ≈ Pr(c = j | x, θ). 
For the next step we assume there are some implicit probability density functions Pj(x, θ) ≈ Pr(x | c = j, θ). Assuming equal prior probabilities of the classes for simplicity, Bayes' rule gives \n\nQj(x, θ) = Pj(x, θ) / Σk=1..N Pk(x, θ), \n\nso we suppose that \n\nPj(x, θ) = (1 / zj(θ)) exp(Vj(x, θ)), \n\nwhere the normalisation is \n\nzj(θ) = ∫ exp(Vj(x, θ)) dx. \n\nIn the networks we use, the same normalisation applies to all the classes, so we write zj(θ) = z(θ). \n\nA maximum-likelihood approach to unsupervised adaptation maximises the likelihood of the data given the set of (equally probable) distributions, which is \n\nP(x, θ) = (1/N) Σk=1..N Pk(x, θ). \n\nIt is simpler to maximise the log likelihood: \n\nL(x, θ) = log P(x, θ) = log Σk Pk(x, θ) - log N = log Σk exp(Vk(x, θ)) - log z(θ) - log N. \n\nWe shall need \n\n∂L/∂Vj = exp(Vj(x, θ)) / Σk exp(Vk(x, θ)) - (1/z(θ)) ∂z(θ)/∂Vj. \n\n(The likelihood of the whole training set is the product of the likelihoods of the individual patterns, and the log turns the product into a sum, so we can sum the derivatives of L over the training set.) \n\nWe can often assume that the normalisation is independent of θ, giving \n\n∂L/∂Vj = exp(Vj(x, θ)) / Σk exp(Vk(x, θ)) = Qj(x, θ). \n\nIf we have a supervised backprop network using the relative entropy based criterion (rather than squared error) [Bri90c], we are minimising J = - Σj Tj log Qj, where Tj is the target for the jth output. We know [Bri90b] that ∂J/∂Vj = Qj - Tj, so if we set Tj = 2Qj we have ∂J/∂Vj = -Qj = -∂L/∂Vj, and minimising J is equivalent to maximising L. 
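The derivation above can be checked numerically. The numpy sketch below (values arbitrary, our own construction) verifies by finite differences that ∂L/∂Vj = Qj when the z(θ) and log N terms are treated as constants, and that the Softmax/relative-entropy backprop delta Qj - Tj becomes -Qj = -∂L/∂Vj once the phantom targets Tj = 2Qj are fed back.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def log_likelihood(v):
    # L = log sum_k exp(V_k); the log z(theta) and log N terms are
    # dropped since they are constant with respect to the V_j here
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

v = np.array([0.3, -1.2, 0.7, 0.1])
q = softmax(v)

# finite-difference check that dL/dV_j = Q_j
eps = 1e-6
for j in range(len(v)):
    dv = np.zeros_like(v)
    dv[j] = eps
    num = (log_likelihood(v + dv) - log_likelihood(v - dv)) / (2 * eps)
    assert abs(num - q[j]) < 1e-5

# the backprop delta at the Softmax input is Q_j - T_j; with the
# phantom targets T_j = 2*Q_j this is -Q_j = -dL/dV_j, so a step that
# reduces J climbs the log-likelihood L
t = 2 * q
delta = q - t
assert np.allclose(delta, -q)
```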
\n\nFor the simple Gaussian network of figure 1, this unsupervised adaptation, applied to the reference points, can be understood as an on-line, gradient descent relative of the k-means cluster analysis procedure, or of the LBG vector quantiser design method, or indeed of Kohonen's feature map (without the neighbourhood constraints). \n\nCopyright \u00a9 Controller HMSO London 1989 \n\nReferences \n\n[Bri90a] J S Bridle. Alphanets: a recurrent 'neural' network architecture with a hidden Markov model interpretation. Speech Communication, special \"Neurospeech\" issue, February 1990. \n\n[Bri90b] J S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F Fougelman-Soulie and J Herault, editors, Neuro-computing: algorithms, architectures and applications, NATO ASI Series on Systems and Computer Science. Springer-Verlag, 1990. \n\n[Bri90c] J S Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2. Morgan Kaufmann, 1990. \n\n[CB90] S J Cox and J S Bridle. Simultaneous speaker normalisation and utterance labelling using Bayesian/neural net techniques. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1990. \n\n[Hun81] M J Hunt. Speaker adaptation for word-based speech recognition. J. Acoust. Soc. Amer., 69:S41-S42, 1981. (Abstract only.) \n\n[Sal89] J A S Salter. The RT5233 Alphabetic database for the Connex project. Technical Report RT52/G231/89, BT Technology Executive, 1989. \n", "award": [], "sourceid": 328, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}, {"given_name": "Stephen", "family_name": "Cox", "institution": null}]}