Figure of Merit Training for Detection and Spotting

Eric I. Chang and Richard P. Lippmann
MIT Lincoln Laboratory
Lexington, MA 02173-0073, USA

Advances in Neural Information Processing Systems, pp. 1019-1026

Abstract

Spotting tasks require detection of target patterns from a background of richly varied non-target inputs. The performance measure of interest for these tasks, called the figure of merit (FOM), is the detection rate for target patterns when the false alarm rate is in an acceptable range. A new approach to training spotters is presented which computes the FOM gradient for each input pattern and then directly maximizes the FOM using backpropagation. This eliminates the need for thresholds during training. It also uses network resources to model Bayesian a posteriori probability functions accurately only for patterns which have a significant effect on the detection accuracy over the false alarm rate of interest. FOM training increased detection accuracy by 5 percentage points for a hybrid radial basis function (RBF) - hidden Markov model (HMM) wordspotter on the credit-card speech corpus.

1 INTRODUCTION

Spotting tasks require accurate detection of target patterns from a background of richly varied non-target inputs. Examples include keyword spotting from continuous acoustic input, spotting cars in satellite images, detecting faults in complex systems over a wide range of operating conditions, detecting earthquakes from continuous seismic signals, and finding printed text on images which contain complex graphics. These problems share three common characteristics. First, the number of instances of target patterns is unknown. Second, patterns from background, non-target, classes are varied and often difficult to model accurately.
Third, the performance measure of interest, called the figure of merit (FOM), is the detection rate for target patterns when the false alarm rate is over a specified range.

Neural network classifiers are often used for detection problems by training on target and background classes, optionally normalizing target outputs using the background output, and thresholding the resulting score to generate putative hits, as shown in Figure 1. Putative hits in this figure are input patterns which generate normalized scores above a threshold.

[Figure 1. Block diagram of a spotting system: an input pattern feeds a classifier with target and background outputs, followed by normalization and thresholding to produce putative hits.]

We have developed a hybrid radial basis function (RBF) - hidden Markov model (HMM) keyword spotter. This wordspotter was evaluated using the NIST credit card speech database as in (Rohlicek, 1993; Zeppenfeld, 1993), using the same train/evaluation split of the training conversations as was used in (Zeppenfeld, 1993). The system spots 20 target keywords, includes one general filler class, and uses a Viterbi decoding backtrace as described in (Lippmann, 1993) to backpropagate errors over a sequence of input speech frames. The performance of this spotting system and its improved versions is analyzed by plotting detection versus false alarm rate curves as shown in Figure 2. These curves are generated by adjusting the classifier output threshold to allow few or many putative hits. Wordspotter putative hits used to generate Figure 2 correspond to speech frames when the difference between the cumulative log Viterbi scores in output HMM nodes of word and filler models is above a threshold.
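The putative-hit generation described above can be sketched as a simple threshold test on the word-minus-filler score difference. This is a minimal illustration, not the authors' implementation; the function name, the toy scores, and the threshold value are all assumptions made for the example.

```python
# Illustrative sketch of putative-hit generation for the hybrid
# RBF-HMM wordspotter: a speech frame becomes a putative hit when the
# cumulative log Viterbi score of a keyword model, normalized by the
# filler (background) model score, exceeds a threshold.

def putative_hits(word_log_scores, filler_log_scores, threshold):
    """Return (frame_index, normalized_score) pairs above threshold.

    word_log_scores / filler_log_scores: per-frame cumulative log
    Viterbi scores at the output HMM nodes of the word and filler models.
    """
    hits = []
    for t, (w, f) in enumerate(zip(word_log_scores, filler_log_scores)):
        score = w - f          # normalization by the background output
        if score > threshold:  # thresholding step of Figure 1
            hits.append((t, score))
    return hits

# Toy scores: frames 1 and 3 exceed the (illustrative) threshold of 0.5.
word = [-10.0, -8.0, -9.5, -7.2]
filler = [-9.8, -9.1, -9.6, -8.4]
hits = putative_hits(word, filler, 0.5)
```

Sweeping the threshold in such a sketch is what traces out a detection versus false alarm rate curve: a low threshold admits many putative hits (high detection, many false alarms), a high threshold admits few.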
The FOM for this wordspotter is defined as the average keyword detection rate when the false alarm rate ranges from 1 to 10 false alarms per keyword per hour. The 69.7% figure of merit for this system means that 69.7% of keyword occurrences are detected on the average while generating from 20 to 200 false alarms per hour of input speech.

2 PROBLEMS WITH BACKPROPAGATION TRAINING

Neural network classifiers used for spotting tasks can be trained using conventional backpropagation procedures with 1-of-N desired outputs and a squared-error cost function. This approach to training does not maximize the FOM because it attempts to estimate Bayesian a posteriori probability functions accurately for all inputs, even if a particular input has little effect on detection accuracy at false alarm rates of interest. Excessive network resources may be allocated to modeling the distribution of common background inputs dissimilar from targets and of high-scoring target inputs which are easily detected. This problem can be addressed by training only when network outputs are above thresholds. This approach is problematic because it is difficult to set the threshold for different keywords, because using fixed target values of 1.0 and 0.0 requires careful normalization of network output scores to prevent saturation and maintain backpropagation effectiveness, and because the gradient calculated from a fixed target value does not reflect the actual impact on the FOM.
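The FOM definition above, the average detection rate over operating points from 1 to 10 false alarms per keyword per hour, can be computed by ranking putative hits by score and sweeping the threshold. The sketch below is a minimal reconstruction under that definition; the function signature and the toy data are assumptions, not the paper's code.

```python
# Minimal sketch of the FOM computation: sort the putative hits for one
# keyword by score (highest accepted first as the threshold is lowered),
# then average the detection rate at the operating points corresponding
# to 1 through 10 false alarms per keyword per hour.

def figure_of_merit(scored_hits, n_true_keywords, hours_of_speech):
    """scored_hits: (score, is_true_hit) putative hits for one keyword.
    Returns the average detection rate over 1..10 false alarms/kw/hr."""
    ranked = sorted(scored_hits, key=lambda h: h[0], reverse=True)
    detection_rates = []
    for fa_per_hour in range(1, 11):
        allowed_fas = fa_per_hour * hours_of_speech
        detected, fas = 0, 0
        for score, is_true_hit in ranked:
            if is_true_hit:
                detected += 1
            else:
                fas += 1
                if fas > allowed_fas:
                    break  # false alarm budget exhausted at this operating point
        detection_rates.append(detected / n_true_keywords)
    return sum(detection_rates) / len(detection_rates)

# Toy example: 4 true keyword occurrences in 1 hour of speech,
# 5 putative hits of which 3 are true hits.
hits = [(0.9, True), (0.8, False), (0.7, True), (0.5, False), (0.4, True)]
fom = figure_of_merit(hits, n_true_keywords=4, hours_of_speech=1)  # 0.725
```

In the full system this average would also be taken across the 20 target keywords, since true hits and false alarms are sorted separately for each keyword.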
[Figure 2. Detection vs. false alarm rate curve for a 20-word hybrid wordspotter, on a split of the credit-card training data. Percent correct detections is plotted against false alarms per keyword per hour (0 to 10) for three systems: FOM back-propagation (FOM: 69.7%), embedded reestimation (FOM: 64.5%), and isolated word training (FOM: 62.5%).]

Figure 3 shows the gradient of true hits and false alarms when target values are set to 1.0 for true hits and 0.0 for false alarms, the output unit is sigmoidal, and the threshold for a putative hit is set to roughly 0.6. The gradient is the derivative of the squared-error cost with respect to the input of the sigmoidal output unit. As can be seen, low-scoring hits or false alarms that may affect the FOM are ignored, the gradient is discontinuous at the threshold, the gradient does not fall to zero fast enough at high values, and the relative sizes of the hit and false alarm gradients do not reflect the true effect of a hit or false alarm on the FOM.

3 FIGURE OF MERIT TRAINING

A new approach to training a spotter system, called "figure of merit training", is to directly compute the FOM and its derivative. This derivative is the change in FOM over the change in the output score of a putative hit, and can be used instead of the derivative of a squared-error or other cost function during training. Since the FOM is calculated by sorting true hits and false alarms separately for each target class and forming detection versus false alarm curves, these measures and their derivatives can not be computed analytically. Instead, the FOM and its derivative are computed using fast sort routines. These routines insert a new

[Figure 3 (partial): gradient of the squared-error cost vs. output score, showing the hit gradient and the putative-hit threshold.]
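The problems Figure 3 illustrates can be reproduced with a few lines of code. This is a sketch of the setup the text describes, squared-error cost through a sigmoidal output with training gated at a putative-hit threshold of roughly 0.6; the function names and evaluation points are illustrative assumptions.

```python
# Sketch of the thresholded squared-error gradient discussed above:
# derivative of (y - target)^2 with respect to the sigmoid input x,
# zeroed for outputs below the putative-hit threshold.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def thresholded_gradient(x, target, threshold=0.6):
    y = sigmoid(x)
    if y < threshold:
        return 0.0  # low-scoring hits and false alarms are ignored entirely
    # Chain rule: d/dx (y - target)^2 = 2 (y - target) * y * (1 - y)
    return 2.0 * (y - target) * y * (1.0 - y)

# The discontinuity at the threshold: a false alarm scoring just below
# 0.6 contributes no gradient, while one just above contributes a
# finite gradient (sigmoid(0.3) ~ 0.574, sigmoid(0.5) ~ 0.622).
below = thresholded_gradient(0.3, target=0.0)
above = thresholded_gradient(0.5, target=0.0)
```

The `y * (1 - y)` factor also shows why the gradient shrinks only slowly at high output values, and the fixed targets of 1.0 and 0.0 show why hit and false alarm gradients bear no relation to their actual effect on the FOM, which is the motivation for the FOM-gradient training of Section 3.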