{"title": "Minimum Bayes Error Feature Selection for Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 806, "abstract": null, "full_text": "Minimum Bayes Error Feature Selection for \n\nContinuous Speech Recognition \n\nGeorge Saon and Mukund Padmanabhan \n\nIBM T. 1. Watson Research Center, Yorktown Heights, NY, 10598 \nE-mail: {saon.mukund}@watson.ibm.com. Phone: (914)-945-2985 \n\nAbstract \n\nWe consider the problem of designing a linear transformation () E lRPx n, \nof rank p ~ n, which projects the features of a classifier x E lRn onto \ny = ()x E lRP such as to achieve minimum Bayes error (or probabil(cid:173)\nity of misclassification). Two avenues will be explored: the first is to \nmaximize the ()-average divergence between the class densities and the \nsecond is to minimize the union Bhattacharyya bound in the range of (). \nWhile both approaches yield similar performance in practice, they out(cid:173)\nperform standard LDA features and show a 10% relative improvement \nin the word error rate over state-of-the-art cepstral features on a large \nvocabulary telephony speech recognition task. \n\n1 \n\nIntroduction \n\nModern speech recognition systems use cepstral features characterizing the short-term \nspectrum of the speech signal for classifying frames into phonetic classes. These features \nare augmented with dynamic information from the adjacent frames to capture transient \nspectral events in the signal. What is commonly referred to as MFCC+~ + ~~ features \nconsist in \"static\" mel-frequency cepstral coefficients (usually 13) plus their first and sec(cid:173)\nond order derivatives computed over a sliding window of typicaJly 9 consecutive frames \nyielding 39-dimensional feature vectors every IOms. One major drawback of this front-end \nscheme is that the same computation is performed regardless of the application, channel \nconditions, speaker variability, etc. 
In recent years, an alternative feature extraction procedure based on discriminant techniques has emerged: the consecutive cepstral frames are spliced together forming a supervector which is then projected down to a manageable dimension. One of the most popular objective functions for designing the feature space projection is linear discriminant analysis.

LDA [2, 3] is a standard technique in statistical pattern classification for dimensionality reduction with a minimal loss in discrimination. Its application to speech recognition has shown consistent gains for small vocabulary tasks and mixed results for large vocabulary applications [4, 6]. Recently, there has been an interest in extending LDA to heteroscedastic discriminant analysis (HDA) by incorporating the individual class covariances in the objective function [6, 8]. Indeed, the equal class covariance assumption made by LDA does not always hold true in practice, making the LDA solution highly suboptimal for specific cases [8].

However, since both LDA and HDA are heuristics, they do not guarantee an optimal projection in the sense of a minimum Bayes classification error. The aim of this paper is to study feature space projections according to objective functions which are more intimately linked to the probability of misclassification. More specifically, we will define the probability of misclassification in the original space, ε, and in the projected space, ε_θ, and give conditions under which ε_θ = ε. Since discrimination information is usually lost after a projection y = θx, the Bayes error in the projected space will always increase, that is ε_θ ≥ ε; therefore minimizing ε_θ amounts to finding θ for which the equality case holds. An alternative approach is to define an upper bound on ε_θ and to directly minimize this bound.
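To make the spliced-supervector LDA projection concrete, here is a minimal NumPy/SciPy sketch (not the authors' implementation): build within- and between-class scatter matrices and keep the top-p generalized eigenvectors. The splicing geometry (9 frames × 13 cepstra = 117 dimensions projected to 39) and the random data are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, p):
    """Return a p x n LDA projection maximizing between- over
    within-class scatter (generalized eigenproblem Sb v = w Sw v)."""
    n = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)  # between-class scatter
    # eigh solves the symmetric generalized eigenproblem; eigenvalues ascend,
    # so reverse-sort and keep the p most discriminative directions.
    # (Note: Sb has rank at most C-1, so only C-1 directions carry
    # discrimination in the strict LDA sense.)
    w, V = eigh(Sb, Sw)
    order = np.argsort(w)[::-1][:p]
    return V[:, order].T                              # theta, shape (p, n)

# spliced supervectors: e.g. 9 frames x 13 cepstra = 117 dims -> 39
X = np.random.randn(500, 117)
labels = np.random.randint(0, 5, size=500)
theta = lda_projection(X, labels, 39)
y = X @ theta.T                                       # projected features
```

The projected features y = θx are what replace the MFCC+Δ+ΔΔ front end in the discriminant scheme described above.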
\n\nThe paper is organized as follows: in section 2 we recall the definition of the Bayes error \nrate and its link to the divergence and the Bhattacharyya bound, section 3 deals with the \nexperiments and results and section 4 provides a final discussion. \n\n2 Bayes error, divergence and Bhattacharyya bound \n\n2.1 Bayes error \n\nConsider the general problem of classifying an n-dimensional vector x into one of C dis(cid:173)\ntinct classes. Let each class i be characterized by its own prior Ai and probability density \nfunction Pi, i = 1, ... ,C. Suppose x is classified as belonging to class j through the Bayes \nassignment j = argmax1