{"title": "Speech Recognition: Statistical and Neural Information Processing Approaches", "book": "Advances in Neural Information Processing Systems", "page_first": 796, "page_last": 801, "abstract": null, "full_text": "796 \n\nSPEECH RECOGNITION: STATISTICAL AND \n\nNEURAL INFORMATION PROCESSING \n\nAPPROACHES \n\nJohn S. Bridle \n\nSpeech Research Unit and \n\nNational Electronics Research Initiative in Pattern Recognition \n\nRoyal Signals and Radar Establishment \n\nMalvern UK \n\nAutomatic Speech Recognition (ASR) is an artificial perception problem: the input \nis raw, continuous patterns (no symbols!) and the desired output, which may be \nwords, phonemes, meaning or text, is symbolic. The most successful approach to \nautomatic speech recognition is based on stochastic models. A stochastic model is \na theoretical system whose internal state and output undergo a series of transfor(cid:173)\nmations governed by probabilistic laws [1]. In the application to speech recognition \nthe unknown patterns of sound are treated as if they were outputs of a stochastic \nsystem [18,2]. Information about the classes of patterns is encoded as the structure \nof these \"laws\" and the probabilities that govern their operation. The most popular \ntype of SM for ASR is also known as a \"hidden Markov model.\" \n\nThere are several reasons why the SM approach has been so successful for ASR. \nIt can describe the shape of the spectrum, and has a principled way of describ(cid:173)\ning temporal order, together with variability of both. It is compatible with the \nhierarchical nature of speech structure [20,18,4], there are powerful algorithms for \ndecoding with respect to the model (recognition), and for adapting the model to fit \nsignificant amounts of example data (learning). Firm theoretical (mathematical) \nfoundations enable extensions to be accommodated smoothly (e.g. [3]). \n\nThere are many deficiencies however. 
In a typical system the speech signal is first \ndescribed as a sequence of acoustic vectors (spectrum cross-sections or equivalent) \nat a rate of, say, 100 per second. The pattern is assumed to consist of a sequence of \nsegments corresponding to discrete states of the model. In each segment the acoustic \nvectors are drawn from a distribution characteristic of the state, but are otherwise \nindependent of one another and of the states before and after. In some systems there \nis a controlled relationship between states and the phonemes or phones of speech \nscience, but most of the properties and notions which speech scientists assume are \nimportant are ignored. \n\nMost SM approaches are also deficient at a pattern-recognition theory level: the \nparameters of the models are usually adjusted (using the Baum-Welch re-estimation \nmethod [5,2]) so as to maximise the likelihood of the data given the model. This \nis the right thing to do if the form of the model is actually appropriate for the \ndata, but if not, the parameter-optimisation method needs to be concerned with \ndiscrimination between classes (phonemes, words, meanings, ...) [28,29,30]. \n\nAn HMM recognition algorithm is designed to find the best explanation of the input in \nterms of the model. It tracks scores for all plausible current states of the generator \nand throws away explanations which lead to a current state for which there is a \nbetter explanation (Bellman's Dynamic Programming). It may also throw away \nexplanations which lead to a current state much worse than the best current state \n(score pruning), producing a Beam Search method. (It is important to keep many \nhypotheses in hand, particularly when the current input is ambiguous.) \n\nConnectionist (or \"Neural Network\") approaches start with a strong pre-conception \nof the types of process to be used. 
They can claim some legitimacy by reference \nto new (or renewed) theories of cognitive processing. The actual mechanisms used \nare usually simpler than those of the SM methods, but the mathematical theory \n(of what can be learnt or computed, for instance) is more difficult, particularly for \nstructures which have been proposed for dealing with temporal structure. \n\nOne of the dreams for connectionist approaches to speech is a network whose inputs \naccept the speech data as it arrives; it would have an internal state which contains all \nnecessary information about the past input, and its output would be as accurate \nand early as it could be. The training of networks with their own dynamics is \nparticularly difficult, especially when we are unable to specify what the internal state \nshould be. Some are working on methods for training the fixed points of continuous-valued recurrent non-linear networks [15,16,27]. Prager [6] has attempted to train \nvarious types of network in a full state-feedback arrangement. Watrous [9] limits \nhis recurrent connections to self-loops on hidden and output units, but even so the \ntheory of such recursive non-linear filters is formidable. \n\nAt the other extreme are systems which treat a whole time-frequency-amplitude \narray (resulting from initial acoustic analysis) as the input to a network, and require \na label as output. For example, the performance that Peeling et al. [7] report \non multi-speaker small-vocabulary isolated-word recognition tasks approaches that \nof the best HMM techniques available on the same data. Invariance to temporal \nposition was trained into the network by presenting the patterns at random positions \nin a fixed time-window. Waibel et al. 
[8] use a powerful compromise arrangement \nwhich can be thought of either as the replication of smaller networks across the time-window (a time-spread network [19]) or as a single small network with internal delay \nlines (a Time-Delay Neural Network [8]). There are no recurrent links except for \ntrivial ones at the output, so training (using Backpropagation) is no great problem. \nWe may think of this as a finite-impulse-response non-linear filter. Reported results \non consonant discrimination are encouraging, and better than those of an HMM \nsystem on the same data. The system is insensitive to position by virtue of its \nconstruction. \n\nKohonen has constructed and demonstrated large-vocabulary isolated-word [12] \nand unrestricted-vocabulary continuous speech transcription [13] systems which are \ninspired by neural network ideas, but implemented as algorithms more suitable for \ncurrent programmed digital signal processor and CPU chips. Kohonen's phonotopic \nmap technique can be thought of as an unsupervised adaptive quantiser constrained \nto put its reference points in a non-linear low-dimensional sub-space. His learning \nvector quantiser technique, used for initial labeling, combines the advantages of the \nclassic nearest-neighbor method and discriminant training. \n\nAmong other types of network which have been applied to speech we must mention \nan interesting class based not on correlations with weight vectors (dot products) but \non distances from reference points. Radial Basis Function theory [22] was developed \nfor multi-dimensional interpolation, and was shown by Broomhead and Lowe [23] \nto be suitable for many of the jobs that feed-forward networks are used for. The \nadvantage is that it is not difficult to find useful positions for the reference points \nwhich define the first, non-linear, transformation. 
If this is followed by a linear \noutput transformation then the weights can be found by methods which are fast and \nstraightforward. The reference points can be adapted using methods based on backpropagation. Related methods include potential functions [24], kernel methods [25] \nand the modified Kanerva network [26]. \n\nThere is much to be gained from a careful comparison of the theory of stochastic \nmodel and neural network approaches to speech recognition. If a NN is to perform speech decoding in a way anything like a SM algorithm it will have a state \nwhich is not just one of the states of the hypothetical generative model; the state \nmust include information about the distribution of possible generator states given \nthe pattern so far, and the state transition function must update this distribution \ndepending on the current speech input. It is not clear whether such an internal representation and behavior can be 'learned' from scratch by an otherwise unstructured \nrecurrent network. \n\nStochastic model based algorithms seem to have the edge at present for dealing with \ntemporal sequences. Discrimination-based training inspired by NN techniques may \nmake a significant difference in performance. \n\nIt would seem that the area where NNs have most to offer is in finding non-linear \ntransformations of the data which take us to a space (perhaps related to formant or \narticulatory parameters) where comparisons are more relevant to phonetic decisions \nthan purely auditory ones (e.g., [17,10,11]). The resulting transformation could also \nbe viewed as a set of 'feature detectors'. Or perhaps the NN should deliver posterior \nprobabilities of the states of a SM directly [14]. 
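The HMM decoding procedure described earlier (dynamic programming over generator states, with score pruning giving a beam search) can be sketched as follows. The two-state model and all its numbers are invented for illustration, not taken from the paper.

```python
def beam_viterbi(obs, states, log_init, log_trans, log_emit, beam=6.0):
    """Track the best log-score for each plausible current generator state,
    discarding explanations that a better path already dominates (dynamic
    programming) or that fall too far below the best score (beam pruning)."""
    scores = {s: log_init[s] + log_emit[s](obs[0]) for s in states}
    for x in obs[1:]:
        best = max(scores.values())
        # score pruning: keep only hypotheses within `beam` of the best one
        live = {s: v for s, v in scores.items() if v >= best - beam}
        new_scores = {}
        for s in states:
            cand = [v + log_trans[(p, s)] for p, v in live.items() if (p, s) in log_trans]
            if cand:
                # Bellman: only the best explanation entering state s survives
                new_scores[s] = max(cand) + log_emit[s](x)
        scores = new_scores
    best_state = max(scores, key=scores.get)
    return best_state, scores[best_state]

# Hypothetical two-state example: state "a" prefers low observations,
# state "b" high ones; staying in a state is cheaper than switching.
states = ["a", "b"]
log_init = {"a": 0.0, "b": -10.0}
log_trans = {("a", "a"): -0.1, ("a", "b"): -1.0,
             ("b", "b"): -0.1, ("b", "a"): -1.0}
log_emit = {"a": lambda x: -(x - 0.0) ** 2,
            "b": lambda x: -(x - 1.0) ** 2}
state, score = beam_viterbi([0.1, 0.1, 0.9, 1.0], states, log_init, log_trans, log_emit)
```

On this input the decoder ends in state "b", having kept the low-scoring "b" hypothesis alive only while it remained within the beam of the best path.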
\n\nThe art of applying a stochastic model or neural network approach is to choose \na class of models or networks which is realistic enough to be likely to be able to \ncapture the distinctions (between speech sounds or words, for instance) and yet \nhas a structure which makes it amenable to algorithms for building the detail of \nthe models based on examples, and for interpreting particular unknown patterns. \nFuture systems will need to exploit the regularities described by phonetics, to allow \nthe construction of high-performance systems with large vocabularies, and their \nadaptation to the characteristics of each new user. \n\nThere is no doubt that the stochastic model based methods work best at present, \nbut current systems are generally far inferior to humans even in situations where the \nusefulness of higher-level processing is minimal. I predict that the next generation \nof ASR systems will be based on a combination of connectionist and SM theory and \ntechniques, with mainstream speech knowledge used in a rather soft way to decide \nthe structure. It should not be long before the distinction I have been making will \ndisappear [29]. \n\n[1] D. R. Cox and H. D. Miller, \"The Theory of Stochastic Processes\", Methuen, \n1965, pp. 721-741. \n\n[2] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, \"An introduction to the \napplication of the theory of probabilistic functions of a Markov process to \nautomatic speech recognition\", Bell Syst. Tech. J., vol. 62, no. 4, pp. 1035-1074, Apr. 1983. \n\n[3] M. J. Russell and R. K. Moore, \"Explicit modeling of state occupancy in \nhidden Markov models of automatic speech recognition\", IEEE ICASSP-85. \n\n[4] S. E. Levinson, \"A unified theory of composite pattern analysis for automatic \nspeech recognition\", in F. Fallside and W. Woods (eds.), \"Computer Speech \nProcessing\", Prentice-Hall, 1984. \n\n[5] L. E. 
Baum, \"An inequality and associated maximisation technique in statis(cid:173)\n\ntical estimation of probabilistic functions of a Markov process\", Inequalities, \nvol. 3, pp. 1-8, 1972. \n\n[6] R. G. Prager et al., \"Boltzmann machines for speech recognition\", Computer \n\nSpeech and Language, vol. 1., no. 1, 1986. \n\n[7] S. M. Peeling, R. K. moore and M. J. Tomlinson, \"The multi-layer perceptron \nas a tool for speech pattern processing research\", Proc. Inst. Acoustics Conf. \non Speech and Hearing, Windermere, November 1986. \n\n[8] Waibel et al., ICASSP88, NIPS88 and ASSP forthcoming. \n\n[9] R. 1. Watrous, \"Connectionist speech recognition using the Temporal Flow \nmodel\", Proc. IEEE Workshop on Speech Recognition, Harriman NY, June \n1988. \n\n[10] I. S. Howard and M. A. Huckvale, \"Acoustic-phonetic attribute determination \n\nusing multi-layer perceptrons\", IEEE Colloquium Digest 1988/11. \n\n[11] M. A. Huckvale and I. S. Howard, \"High performance phonetic feature analysis \n\nfor automatic speech recognition\", ICASSP89. \n\n[12J T. Kohonen et al., \"On-line recognition of spoken words from a large vocabu(cid:173)\n\nlary\", Information Sciences 33, 3-30 (1984). \n\n\f800 \n\nBridle \n\n[13] T. Kohonen, \"The 'Neural' phonetic typewriter\", IEEE Computer, March \n\n1988. \n\n[14] H. Bourlard and C. J. Wellekens, \"Multilayer perceptrons and automatic speech \n\nrecognition\", IEEE First IntI. Conf. Neural Networks, San Diego, 1987. \n\n[15] R. Rohwer and S. Renals, \"Training recurrent networks\", Proc. N'Euro-88, \n\nParis, June 1988. \n\n[16] L. Almeida, \"A learning rule for asynchronous perceptrons with feedback in \na combinatorial environment\", Proc. IEEE IntI. Conf. Neural Networks, San \nDiego 1987. \n\n[17] A. R. Webb and D. Lowe, \"Adaptive feed-forward layered networks as pattern \nclassifiers: a theorem illuminating their success in discriminant analysis\" , sub. \nto N eural Networks. \n\n[18] J. K. 
Baker, \"The Dragon system: an overview\", IEEE Trans. ASSP-23, no. \n\n1, pp. 24-29, Feb. 1975. \n\n[19] J. S. Bridle and R. K. Moore, \"Boltzmann machines for speech pattern pro(cid:173)\n\ncessing\", Proc. Inst. Acoust., November 1984, pp. 1-8. \n\n[20] B. H. Repp, \"On levels of description in speech research\", J. Acoust. Soc. Amer. \n\nvol. 69 p. 1462-1464, 1981. \n\n[21] R. A. Cole et aI, \"Performing fine phonetic distinctions: templates vs. features\" , \nin J. Perkell and D. H. Klatt (eds.), \"Symposium on invariance and variability \nof speech processes\", Hillsdale, NJ, Erlbaum 1984. \n\n[22] M. J. D. Powell, \"Radial basis functions for multi-variate interpolation: a \nreview\", IMA Conf. on algorithms for the approximation offunctions and data, \nShrivenham 1985. \n\n[23] D. Broomhead and D. Lowe, \"Multi-variable interpolation and adaptive net(cid:173)\n\nworks\", RSRE memo 4148, Royal Signals and Radar Est., 1988. \n\n[24] M. A. Aizerman, E. M. Braverman and L. 1. Rozonoer, \"On the method of \npotential functions\", Automatika i Telemekhanika, vol. 26 no. 11, pp. 2086-\n2088, 1964. \n\n[25] Hand, \"Kernel discriminant analysis\", Research Studies Press, 1982. \n\n[26] R. W. Prager and F. Fallside, \"Modified Kanerva model for automatic speech \n\nrecognition\", submitted to Cmputer Speech and Language. \n\n[27] F. J. Pineda, \"Generalisation of back propagation to recurrent neural net(cid:173)\n\nworks\", Physical Review Letters 1987. \n\n[28] L. R. Bahl et aI., Proc. ICASSP88, pp. 493-496. \n\n\fSpeech Recognition \n\n801 \n\n[29] H. Bourlard and C. J. Wellekens, \"Links between Markov models and multi(cid:173)\n\nlayer perceptrons\", this volume. \n\n[30] L. Niles, H. Silverman, G. Tajchman, M. Bush, \"How limited training data can \nallow a neural network to out-perform an 'optimal' classifier\" , Proc. ICASSP89. \n\n\f", "award": [], "sourceid": 174, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}]}