{"title": "From Speech Recognition to Spoken Language Understanding: The Development of the MIT SUMMIT and VOYAGER Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 255, "page_last": 261, "abstract": null, "full_text": "From Speech Recognition to Spoken Language \nUnderstanding: The Development of the MIT \n\nSUMMIT and VOYAGER Systems \n\nVictor Zue, James Glass, David Goodine, Lynette Hirschman, \n\nHong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff' \n\nRoom NE43-601 \n\nSpoken Language Systems Group \nLaboratory for Computer Science \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 U.S.A. \n\nAbstract \n\nSpoken language is one of the most natural, efficient, flexible, and econom(cid:173)\nical means of communication among humans. As computers play an ever \nincreasing role in our lives, it is important that we address the issue of \nproviding a graceful human-machine interface through spoken language. \nIn this paper, we will describe our recent efforts in moving beyond the \nscope of speech recognition into the realm of spoken-language understand(cid:173)\ning. Specifically, we report on the development of an urban navigation and \nexploration system called VOYAGER, an application which we have used as \na basis for performing research in spoken-language understanding. \n\n1 \n\nIntroduction \n\nOver the past decade, research in speech coding and synthesis has matured to the \nextent that speech can now be transmitted efficiently and generated with high in(cid:173)\ntelligibility. Spoken input to computers, however, has yet to pass the threshold of \npracticality. Despite some recent successful demonstrations, current speech recog(cid:173)\nnition systems typically fall far short of human capabilities of continuous speech \nrecognition with essentially unrestricted vocabulary and speakers, under adverse \nacoustic environments. This is largely due to our incomplete knowledge of the en(cid:173)\ncoding of linguistic information in the speech signal, and the inherent variabilities of \n\n255 \n\n\f256 \n\nZue, Glass, Goodine, Hirschman, Leung, Phillips, lblifroni, and Seneff \n\nthis process. Our approach to system development is to seek a good understanding \nof human communication through spoken language, to capture the essential features \nof the process in appropriate models, and to develop the necessary computational \nframework to make use of these models for machine understanding. \n\nOur research in spoken language system development is based on the premise that \nmany of the applications suitable for human/machine interaction using speech typ(cid:173)\nically involve interactive problem solving. That is, in addition to converting the \nspeech signal to text, the computer must also understand the user's request, in \norder to generate an appropriate response. As a result, we have focused our atten(cid:173)\ntion on three main issues. First, the system must operate in a realistic application \ndomain, where domain-specific information can be utilized to translate spoken in(cid:173)\nput into appropriate actions. The use of a realistic application is also critical to \ncollecting data on how people would like to use machines to access information and \nsolve problems. Use of a constrained task also makes possible rigorous evaluations \nof system performance. Second and perhaps most importantly the system must \nintegrate speech recognition and natural language technologies to achieve speech \nunderstanding. 
Finally, the system must begin to deal with interactive speech, where the computer is an active conversational participant, and where people produce spontaneous speech, including false starts, hesitations, etc.

In this paper, we will describe our recent efforts in developing a spoken language interface for an urban navigation system (VOYAGER). We begin by describing our overall system architecture, paying particular attention to the interface between speech and natural language. We then describe the application domain and some of the issues that arise in realistic interactive problem solving applications, particularly in terms of conversational interaction. Finally, we report results of some performance evaluations we have made, using a spontaneous speech corpus we collected for this task.

2 System Architecture

Our spoken language system contains three important components. The SUMMIT speech recognition system converts the speech signal into a set of word hypotheses. The TINA natural language system interacts with the speech recognizer in order to obtain a word string, as well as a linguistic interpretation of the utterance. A control strategy mediates between the recognizer and the language understanding component, using the language understanding constraints to help control the search of the speech recognition system.

2.1 Continuous Speech Recognition: The SUMMIT System

The SUMMIT system (Zue et al., 1989) starts the recognition process by first transforming the speech signal into a representation that models some of the known properties of the human auditory system (Seneff, 1988). Using the output of the auditory model, acoustic landmarks of varying robustness are located and embedded in a hierarchical structure called a dendrogram (Glass, 1988). The acoustic segments in the dendrogram are then mapped to phoneme hypotheses, using a set of automatically determined acoustic attributes in conjunction with conventional pattern recognition algorithms. The result is a phoneme network, in which each arc is characterized by a vector of probabilities for all the possible candidates. Recently, we have begun to experiment with the use of artificial neural nets for phonetic classification. To date, we have been able to improve the system's classification performance by over 5% (Leung and Zue, 1990).

Words in the lexicon are represented as pronunciation networks, which are generated automatically by a set of phonological rules (Zue et al., 1990). Weights derived from training data are assigned to each arc, using a corrective training procedure, to reflect the likelihood of a particular pronunciation. Presently, lexical decoding is accomplished by using the Viterbi algorithm to find the best path that matches the acoustic-phonetic network with the lexical network.
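To make the decoding step concrete, the following is a minimal illustrative sketch, not code from SUMMIT itself, of Viterbi-style best-path matching between a sequence of scored acoustic segments and a word pronunciation network. The graph representation, the simplifying assumption that each network arc consumes exactly one segment, and all names below are our own:

import math

def viterbi_lexical_decode(segment_scores, pron_arcs, start, final):
    """Best-path match of a segment sequence against a pronunciation network.

    segment_scores: list of dicts, one per acoustic segment, mapping each
        candidate phoneme label to its log probability (the "vector of
        probabilities" carried by a phoneme-network arc).
    pron_arcs: dict mapping a network node to a list of (phoneme, weight,
        next_node) arcs; the weights play the role of the corrective-training
        arc weights described above.
    Returns (total log score, phoneme sequence) of the best alignment in
    which each network arc consumes exactly one segment.
    """
    # best[node] = (score, phoneme path) after consuming the segments so far
    best = {start: (0.0, [])}
    for scores in segment_scores:
        new_best = {}
        for node, (score, path) in best.items():
            for phoneme, weight, nxt in pron_arcs.get(node, []):
                s = score + weight + scores.get(phoneme, -math.inf)
                if s > new_best.get(nxt, (-math.inf, None))[0]:
                    new_best[nxt] = (s, path + [phoneme])
        best = new_best
    return best.get(final, (-math.inf, []))

# Toy pronunciation network for "MIT" spelled out as /em ay tee/.
arcs = {0: [("em", 0.0, 1)], 1: [("ay", 0.0, 2)], 2: [("tee", 0.0, 3)]}
segs = [{"em": -0.1, "en": -2.3}, {"ay": -0.2}, {"tee": -0.3, "dee": -1.4}]
print(viterbi_lexical_decode(segs, arcs, start=0, final=3))

In the actual system the search operates over a full acoustic-phonetic network and a lexicon-wide pronunciation network rather than a single word, but the dynamic-programming recurrence is the same in spirit.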
2.2 Natural Language Processing: The TINA System

In a spoken language system, the natural language component should perform two critical functions: 1) provide constraint for the recognizer component, and 2) provide an interpretation of the meaning of the sentence to the back-end. Our natural language system, TINA, was specifically designed to meet these two needs.

TINA is a probabilistic parser which operates top-down, using an agenda-based control strategy which favors the most likely analyses. The basic design of TINA has been described elsewhere (Seneff, 1989), but will be briefly reviewed here. The grammar is entered as a set of simple context-free rules which are automatically converted to a shared network structure. The nodes in the network are augmented with constraint filters (both syntactic and semantic) that operate only on locally available parameters. All arcs in the network are associated with probabilities, acquired automatically from a set of training sentences. Note that the probabilities are established not on the rule productions but rather on arcs connecting sibling pairs in a shared structure for a number of linked rules. The effect of such pooling is essentially a hierarchical bigram model. We believe this mechanism offers the capability of generating probabilities in a reasonable way by sharing counts on syntactically/semantically identical units in differing structural environments.

2.3 Control Strategy

The current interface between the SUMMIT speech recognition system and the TINA natural language system uses an N-best algorithm (Chow and Schwartz, 1989; Soong and Huang, 1990; Zue et al., 1990), in which the recognizer can propose its best N complete sentence hypotheses one by one, stopping with the first sentence that is successfully analyzed by the natural language component TINA. In this case, TINA acts as a filter on whole-sentence hypotheses.

In order to produce N-best hypotheses, we use a search strategy that involves an initial Viterbi search all the way to the end of the sentence, to provide a "best" hypothesis, followed by an A* search to produce next-best hypotheses in turn, provided that the first hypothesis failed to parse. If all hypotheses fail to parse, the system produces the rejection message, "I'm sorry but I didn't understand you."

Even with the parser acting as a filter of whole-sentence hypotheses, it is appropriate to also provide the recognizer with an inexpensive language model that can partially constrain the theories. This is currently done with a word-pair language model, in which each word in the vocabulary is associated with a list of words that could possibly follow that word anywhere in the sentence.
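As an illustration of this control strategy, here is a minimal sketch of the N-best filtering loop and of the word-pair constraint. It is our own reconstruction: the recognizer and parser interfaces (hypotheses(), parse()) are assumed for the example rather than taken from the paper.

REJECTION = "I'm sorry but I didn't understand you."

def understand(recognizer, parser, waveform, n_best=100):
    """Ask the recognizer for up to n_best sentence hypotheses, best first
    (Viterbi best path, then A* next-best paths), and return the first one
    that the natural language component can analyze."""
    for words in recognizer.hypotheses(waveform, n_best):
        analysis = parser.parse(words)        # assumed to return None on failure
        if analysis is not None:
            return words, analysis
    return None, REJECTION                    # every hypothesis failed to parse

def word_pair_allows(words, follow_sets):
    """Inexpensive word-pair language model: each word may only be followed,
    anywhere in the sentence, by a word in its follow set."""
    return all(b in follow_sets.get(a, ()) for a, b in zip(words, words[1:]))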
3 The VOYAGER Application Domain

VOYAGER is an urban navigation and exploration system that enables the user to ask about places of interest and obtain directions. It has been under development since early 1989 (Zue et al., 1989; Zue et al., 1990). In this section, we describe the application domain, the interface between our language understanding system TINA and the application back-end, and the discourse capabilities of the current system.

3.1 Domain Description

For our first attempt at exploring issues related to a fully-interactive spoken-language system, we selected a task in which the system knows about the physical environment of a specific geographical area and can provide assistance on how to get from one location to another within this area. The system, which we call VOYAGER, can also provide information concerning certain objects located inside this area. The current version of VOYAGER focuses on the geographic area of the city of Cambridge, MA between MIT and Harvard University.

The application database is an enhanced version of the Direction Assistance program developed at the Media Laboratory at MIT (Davis and Trobaugh, 1987). It consists of a map database, including the locations of various classes of objects (streets, buildings, rivers) and properties of these objects (address, phone number, etc.). The application supports a set of retrieval functions to access these data. The application must convert the semantic representation produced by TINA into the appropriate function call to the VOYAGER back-end. The answer is given to the user in three forms. It is graphically displayed on a map, with the object(s) of interest highlighted. In addition, a textual answer is printed on the screen, and is also spoken verbally using synthesized speech. The current implementation handles various types of queries, such as the location of objects, simple properties of objects, how to get from one place to another, and the distance and time for travel between objects.

3.2 Application Interface to VOYAGER

Once an utterance has been processed by the language understanding system, it is passed to an interface component which constructs a command function from the natural language representation. This function is subsequently passed to the back-end, where a response is generated. There are three function types used in the current command framework of VOYAGER, which we will illustrate with the following example:

Query: Where is the nearest bank to MIT?
Function: (LOCATE (NEAREST (BANK nil) (SCHOOL "MIT")))

LOCATE is an example of a major function that determines the primary action to be performed by the command. It shows the physical location of an object or set of objects on the map. Functions such as BANK and SCHOOL in the above example access the database to return an object or a set of objects. When null arguments are provided, all possible candidates are returned from the database. Thus, for example, (SCHOOL "MIT") and (BANK nil) will return the object MIT and all known banks, respectively. Finally, there are a number of functions in VOYAGER that act as filters, whereby the subset that fulfills some requirement is returned. The function (NEAREST X y), for example, returns the object in the set X that is closest to the object y. These filter functions can be nested, so that they can quite easily construct a complicated object. For example, "the Chinese restaurant on Main Street nearest to the hotel in Harvard Square that is closest to City Hall" would be represented by:

(NEAREST
  (ON-STREET
    (SERVE (RESTAURANT nil) "Chinese")
    (STREET "Main" "Street"))
  (NEAREST
    (IN-REGION (HOTEL nil) (SQUARE "Harvard"))
    (PUBLIC-BUILDING "City Hall")))
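To show how such a command might be evaluated, here is a small sketch against a toy map database. The database entries, coordinates, and Python function names below are invented for illustration and are not VOYAGER's actual data or code; only the nesting discipline mirrors the example above.

import math

# Toy map database; the real system uses the Direction Assistance map data.
PLACES = [
    {"name": "MIT", "category": "school", "xy": (0.0, 0.0)},
    {"name": "Example Bank A", "category": "bank", "xy": (1.2, 0.3)},
    {"name": "Example Bank B", "category": "bank", "xy": (2.5, 1.0)},
]

def select(category, name=None):
    """Database-access function: all objects of a category, or the single
    named object when a name argument is supplied (nil -> name=None)."""
    hits = [p for p in PLACES if p["category"] == category]
    return [p for p in hits if p["name"] == name] if name else hits

def nearest(candidates, reference):
    """Filter function: the member of `candidates` closest to `reference`."""
    ref = reference[0]["xy"]
    return [min(candidates, key=lambda p: math.dist(p["xy"], ref))]

def locate(objects):
    """Major function: the real system highlights the objects on a map and
    speaks a reply; here we simply report their names."""
    return "Located: " + ", ".join(p["name"] for p in objects)

# (LOCATE (NEAREST (BANK nil) (SCHOOL "MIT")))
print(locate(nearest(select("bank"), select("school", "MIT"))))

Nesting works exactly as in the paper's example: each database or filter function returns a set of objects that the enclosing function consumes.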
3.3 Discourse Capabilities

Carrying on a conversation requires the use of context and discourse history. Without context, some user input may appear underspecified, vague, or even ill-formed. However, in context, these queries are generally easily understood. The discourse capabilities of the current VOYAGER system are simplistic but nonetheless effective in handling the majority of the interactions within the designated task. We describe briefly how a discourse history is maintained, and how the system keeps track of incomplete requests, querying the user for more information as needed to fill in ambiguous material.

Two slots are reserved for discourse history. The first slot refers to the location of the user, which can be set during the course of the conversation and then later referred to. The second slot refers to the most recently referenced set of objects. This slot can hold a single object, a set of objects, or two separate objects in the case where the previous command involved a calculation involving both a source and a destination. With these slots, the system can process queries that include pronominal reference, as in "What is their address?" or "How far is it from here?"

VOYAGER can also handle underspecified or vague queries, in which a function argument has either no value or multiple values. Examples of such queries would be "How far is a bank?" or "How far is MIT?" when no [FROM-LOCATION] has been specified. VOYAGER points out such underspecification to the user by asking for specific clarification. The underspecified command is also pushed onto a stack of incompletely specified commands. When the user provides additional information that is evaluated successfully, the top command in the stack is popped for reevaluation. If the additional information is not sufficient to resolve the original command, the command is again pushed onto the stack, with the new information incorporated. A protection mechanism automatically clears the history stack whenever the user abandons a line of discussion before all underspecified queries are clarified.
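A minimal sketch of this bookkeeping, again our own reconstruction with hypothetical names and a deliberately simplified command representation (a dictionary whose None-valued entries mark underspecified arguments), might look as follows:

class DiscourseState:
    """Two history slots plus a stack of incompletely specified commands."""

    def __init__(self):
        self.user_location = None       # slot 1: location of the user, if stated
        self.referenced_objects = None  # slot 2: most recently referenced object(s)
        self.pending = []               # stack of underspecified commands

    def evaluate(self, command, run):
        """Run `command` through the back-end `run`, or ask for clarification."""
        missing = [name for name, value in command.items() if value is None]
        if missing:
            self.pending.append(command)
            return "Could you be more specific? I still need: " + ", ".join(missing)
        result = run(command)
        self.referenced_objects = result
        return result

    def add_information(self, updates, run):
        """Fold new information into the most recent pending command and retry;
        if it is still underspecified, evaluate() pushes it back on the stack."""
        if not self.pending:
            return None
        command = self.pending.pop()
        command.update(updates)
        return self.evaluate(command, run)

    def abandon_topic(self):
        """Protection mechanism: clear the stack when the user changes the subject."""
        self.pending.clear()

# Example: "How far is MIT?" with no [FROM-LOCATION], then a clarification.
state = DiscourseState()
run = lambda cmd: "Directions from %s to %s." % (cmd["from"], cmd["to"])
print(state.evaluate({"action": "distance", "from": None, "to": "MIT"}, run))
print(state.add_information({"from": "Harvard Square"}, run))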
4 Performance Evaluation

In this section, we describe our experience with performance evaluation of spoken language systems. The version of VOYAGER that we evaluated has a vocabulary of 350 words. The word-pair language model for the speech recognition sub-system has a perplexity of 72. For the N-best algorithm, the number of sentence hypotheses was arbitrarily set at 100. The system was implemented on a SUN-4, using four commercially available signal processing boards. This configuration processes an utterance in 3 to 5 times real time.

The system was trained and tested using a corpus of spontaneous speech recorded from 50 male and 50 female subjects (Zue et al., 1989). We arbitrarily designated the data from 70 speakers, equally divided between male and female, to be the training set. Data from 20 of the remaining speakers were designated as the development set. The test set consisted of 485 utterances generated by the remaining 5 male and 5 female subjects. The average number of words per sentence was 7.7.

VOYAGER generated an action for 51.7% of the sentences in the test set. The system failed to generate a parse on the remaining 48.3% of the sentences, due either to recognizer errors, unknown words, unseen linguistic structures, or back-end inadequacy. Specifically, 20.3% failed to generate an action due to recognition errors or the system's inability to deal with spontaneous speech phenomena, 17.2% were found to contain unknown words, and an additional 10.5% would not have parsed even if recognized correctly. VOYAGER almost never failed to provide a response once a parse had been generated. This is a direct result of our conscious decision to constrain TINA according to the capabilities of the back-end. Although 48.3% of the sentences were judged to be incorrect, only 13% generated the wrong response. For the remainder of the errors, the system responded with the message, "I'm sorry but I didn't understand you."

Finally, we solicited judgments from three naive subjects who had had no previous experience with VOYAGER, in order to assess the capabilities of the back-end. About 80% of the responses were judged to be appropriate, with an additional 5% being verbose but otherwise correct. Only about 4% of the sentences produced diagnostic error messages, for which the system was judged to give an appropriate response about two thirds of the time. The response was judged incorrect about 10% of the time. The subjects judged about 87% of the user queries to be reasonable.

5 Summary

This paper summarizes the status of our recent efforts in spoken language system development. It is clear that spoken language systems will incorporate research from, and provide a useful testbed for, a variety of disciplines including speech, natural language processing, knowledge acquisition, databases, expert systems, and human factors. In the near term, our plans include improving the phonetic recognition accuracy of SUMMIT by incorporating context-dependent models, and investigating control strategies which more fully integrate our speech recognition and natural language components.

Acknowledgements

This research was supported by DARPA under Contract N00014-89-J-1332, monitored through the Office of Naval Research.

References

Chow, Y., and R. Schwartz, (1989) "The N-Best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses," Proc. DARPA Speech and Natural Language Workshop, pp. 199-202, October.

Davis, J. R., and T. F. Trobaugh, (1987) "Back Seat Driver," Technical Report 1, MIT Media Laboratory Speech Group, December.

Glass, J. R., (1988) "Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition," Ph.D. thesis, Massachusetts Institute of Technology, May.

Leung, H., and V. Zue, (1990) "Phonetic Classification Using Multi-Layer Perceptrons," Proc. ICASSP-90, pp. 525-528, Albuquerque, NM.

Seneff, S., (1988) "A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing," J. of Phonetics, vol. 16, pp. 55-76, January.

Seneff, S., (1989) "TINA: A Probabilistic Syntactic Parser for Speech Understanding Systems," Proc. DARPA Speech and Natural Language Workshop, pp. 168-178, February.

Soong, F., and E. Huang, (1990) "A Tree-Trellis Based Fast Search for Finding the N-best Sentence Hypotheses in Continuous Speech Recognition," Proc. DARPA Speech and Natural Language Workshop, pp. 199-202, June.

Zue, V., J. Glass, M. Phillips, and S. Seneff, (1989) "Acoustic Segmentation and Phonetic Classification in the SUMMIT System," Proc. ICASSP-89, pp. 389-392, Glasgow, Scotland.

Zue, V., J. Glass, D. Goodine, H. Leung, M. Phillips, J. Polifroni, and S. Seneff, (1989) "The VOYAGER Speech Understanding System: A Progress Report," Proc. DARPA Speech and Natural Language Workshop, pp. 51-59, October.
Soelof, (1989) \"The Collection and Preliminary Analysis of a Spontaneous \nSpeech Database,\" Proc. DARPA Speech and Natural Language Workshop, pp. \n126-134, October. \n\nZue, V., J. Glass, D. Goodine, M. Phillips, and S. Seneff, (1990) \"The SUMMIT \nSpeech Recognition System: Phonological Modelling and Lexical Access,\" Proc. \nICASSP-90, pp. 49-52, Albuquerque, NM. \n\nZue, V., J. Glass, D. Goodine, H. Leung, M. Phillips, J. Polifroni, and S. Seneff, \n(1990) \"The VOYAGER Speech Understanding System: Preliminary Development \nand Evaluation,\" Proc. ICASSP-90, pp. 73-76, Albuquerque, NM. \nZue, V., J. Glass, D. Goodine, H. Leung, M. Phillips, J. Polifroni, and S. Seneff, \n(1990) \"Recent Progress on the VOYAGER System,\" Proc. DARPA Speech and \nNatural Language Workshop, pp. 206-211, June. \n\n\f", "award": [], "sourceid": 403, "authors": [{"given_name": "Victor", "family_name": "Zue", "institution": null}, {"given_name": "James", "family_name": "Glass", "institution": null}, {"given_name": "David", "family_name": "Goodine", "institution": null}, {"given_name": "Lynette", "family_name": "Hirschman", "institution": null}, {"given_name": "Hong", "family_name": "Leung", "institution": null}, {"given_name": "Michael", "family_name": "Phillips", "institution": null}, {"given_name": "Joseph", "family_name": "Polifroni", "institution": null}, {"given_name": "Stephanie", "family_name": "Seneff", "institution": null}]}