{"title": "Minimum Bayes Error Feature Selection for Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 806, "abstract": null, "full_text": "Minimum Bayes Error Feature Selection for \n\nContinuous Speech Recognition \n\nGeorge Saon and Mukund Padmanabhan \n\nIBM T.  1. Watson Research Center, Yorktown Heights, NY,  10598 \nE-mail:  {saon.mukund}@watson.ibm.com. Phone:  (914)-945-2985 \n\nAbstract \n\nWe consider the problem of designing a linear transformation ()  E lRPx n, \nof rank p  ~ n, which  projects the features of a classifier x  E  lRn  onto \ny  =  ()x  E  lRP  such  as  to  achieve  minimum Bayes  error (or probabil(cid:173)\nity  of misclassification).  Two  avenues  will  be  explored:  the  first  is  to \nmaximize the ()-average divergence between the  class  densities  and  the \nsecond is  to minimize the union Bhattacharyya bound in the range of (). \nWhile both  approaches yield  similar performance in  practice,  they  out(cid:173)\nperform standard  LDA  features  and  show  a  10%  relative improvement \nin  the  word  error rate  over state-of-the-art cepstral  features  on  a  large \nvocabulary telephony speech recognition task. \n\n1 \n\nIntroduction \n\nModern  speech  recognition  systems  use  cepstral  features  characterizing  the  short-term \nspectrum of the speech signal for classifying frames into phonetic classes.  These features \nare  augmented  with  dynamic  information  from  the  adjacent frames  to  capture  transient \nspectral events in the signal.  What is  commonly referred to as  MFCC+~ +  ~~ features \nconsist in \"static\" mel-frequency cepstral coefficients (usually  13) plus their first and sec(cid:173)\nond order derivatives computed over a sliding  window  of typicaJly  9 consecutive frames \nyielding 39-dimensional feature vectors every IOms.  
One major drawback of this front-end scheme is that the same computation is performed regardless of the application, channel conditions, speaker variability, etc. In recent years, an alternative feature extraction procedure based on discriminant techniques has emerged: the consecutive cepstral frames are spliced together forming a supervector which is then projected down to a manageable dimension. One of the most popular objective functions for designing the feature space projection is linear discriminant analysis.

LDA [2, 3] is a standard technique in statistical pattern classification for dimensionality reduction with a minimal loss in discrimination. Its application to speech recognition has shown consistent gains for small vocabulary tasks and mixed results for large vocabulary applications [4, 6]. Recently, there has been an interest in extending LDA to heteroscedastic discriminant analysis (HDA) by incorporating the individual class covariances in the objective function [6, 8]. Indeed, the equal class covariance assumption made by LDA does not always hold true in practice, making the LDA solution highly suboptimal in specific cases [8].

However, since both LDA and HDA are heuristics, they do not guarantee an optimal projection in the sense of a minimum Bayes classification error. The aim of this paper is to study feature space projections according to objective functions which are more intimately linked to the probability of misclassification.
More specifically, we will define the probability of misclassification in the original space, ε, and in the projected space, ε_θ, and give conditions under which ε_θ = ε. Since discrimination information is usually lost after a projection y = θx, the Bayes error in the projected space can only increase, that is ε_θ ≥ ε; therefore minimizing ε_θ amounts to finding θ for which the equality case holds. An alternative approach is to define an upper bound on ε_θ and to directly minimize this bound.

The paper is organized as follows: in section 2 we recall the definition of the Bayes error rate and its link to the divergence and the Bhattacharyya bound, section 3 deals with the experiments and results, and section 4 provides a final discussion.

2 Bayes error, divergence and Bhattacharyya bound

2.1 Bayes error

Consider the general problem of classifying an n-dimensional vector x into one of C distinct classes. Let each class i be characterized by its own prior λ_i and probability density function p_i, i = 1, ..., C. Suppose x is classified as belonging to class j through the Bayes assignment j = argmax_{1≤i≤C} λ_i p_i(x). The expected error rate for this classifier is called the Bayes error [3], or probability of misclassification, and is defined as

  ε = 1 - ∫_{ℝ^n} max_{1≤i≤C} λ_i p_i(x) dx        (1)

Suppose next that we wish to perform the linear transformation f : ℝ^n → ℝ^p, y = f(x) = θx, with θ a p × n matrix of rank p ≤ n. Moreover, let us denote by p_i^θ the transformed density for class i. The Bayes error in the range of θ now becomes

  ε_θ = 1 - ∫_{ℝ^p} max_{1≤i≤C} λ_i p_i^θ(y) dy        (2)

Since the transformation y = θx produces a vector whose coefficients are linear combinations of the input vector x, it can be shown [1] that, in general, information is lost and ε_θ ≥ ε.
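The inequality ε_θ ≥ ε is easy to check numerically. Below is a minimal Monte Carlo sketch on a hypothetical two-class Gaussian example (not from the paper): the Bayes error is estimated as one minus the average maximum posterior over samples of the mixture, and a rank-1 projection that discards the only discriminating direction drives the error up to chance.

```python
import numpy as np

# Monte Carlo check of eps_theta >= eps on a made-up two-class example.
# eps = 1 - E_x[max_i posterior_i(x)], estimated from samples of the mixture.

def gauss_pdf(x, mu, sigma):
    # multivariate normal density evaluated at the rows of x
    d = mu.shape[0]
    diff = x - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

def bayes_error(samples, mus, sigmas, priors):
    # eps = 1 - E[max_i lambda_i p_i(x) / sum_j lambda_j p_j(x)]
    dens = np.stack([w * gauss_pdf(samples, m, s)
                     for w, m, s in zip(priors, mus, sigmas)])
    return 1.0 - (dens / dens.sum(axis=0)).max(axis=0).mean()

rng = np.random.default_rng(0)
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

# equal priors -> draw the same number of samples from each class
X = np.vstack([rng.multivariate_normal(m, s, 20000) for m, s in zip(mus, sigmas)])
eps = bayes_error(X, mus, sigmas, priors)

# theta = [0 1] keeps only the non-discriminating coordinate;
# normal densities transform as N(theta mu, theta Sigma theta^T)
theta = np.array([[0.0, 1.0]])
eps_theta = bayes_error(X @ theta.T,
                        [theta @ m for m in mus],
                        [theta @ s @ theta.T for s in sigmas],
                        priors)

print(round(eps, 3), round(eps_theta, 3))  # eps stays near Phi(-1) ~ 0.16, eps_theta rises to 0.5
```

Here the original Bayes error equals Φ(-‖μ_1 - μ_2‖/2) for equal identity covariances, while the projected classes become indistinguishable, so ε_θ = 0.5.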
For a fixed p, the feature selection problem can be stated as finding θ* such that

  θ* = argmin_{θ ∈ ℝ^(p×n), rank(θ)=p} ε_θ        (3)

We will, however, take an indirect approach to (3): by maximizing the average pairwise divergence and relating it to ε_θ (subsection 2.2), and by minimizing the union Bhattacharyya bound on ε_θ (subsection 2.3).

2.2 Interclass divergence

Following Kullback [5], the symmetric divergence between classes i and j is given by

  D(i,j) = ∫_{ℝ^n} p_i(x) log(p_i(x)/p_j(x)) + p_j(x) log(p_j(x)/p_i(x)) dx        (4)

D(i,j) represents a measure of the degree of difficulty of discriminating between the classes (the larger the divergence, the greater the separability between the classes). Similarly, one can define D_θ(i,j), the pairwise divergence in the range of θ. Kullback [5] showed that D_θ(i,j) ≤ D(i,j). If the equality case holds, θ is called a sufficient statistic for discrimination. The average pairwise divergence is defined as D = 2/(C(C-1)) Σ_{1≤i<j≤C} D(i,j) and, respectively, D_θ = 2/(C(C-1)) Σ_{1≤i<j≤C} D_θ(i,j). It follows that D_θ ≤ D. The next theorem, due to Decell [1], provides a link between Bayes error and divergence for classes with uniform priors λ_1 = ... = λ_C (= 1/C).

Theorem [Decell'72] If D_θ = D then ε_θ = ε.

The main idea of the proof is to show that if the divergences are the same then the Bayes assignment is preserved, because the likelihood ratios are preserved almost everywhere: p_i(x)/p_j(x) = p_i^θ(θx)/p_j^θ(θx), i ≠ j. The result follows by noting that for any measurable set A ⊂ ℝ^p

  ∫_A p_i^θ(y) dy = ∫_{θ⁻¹(A)} p_i(x) dx        (5)

where θ⁻¹(A) = {x ∈ ℝ^n | θx ∈ A}. The previous theorem provides a basis for selecting θ so as to maximize D_θ.
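For normal class densities, the divergence admits the closed form (6) derived in the next subsection. As a small self-contained sketch (with made-up means and covariances, not from the paper), the snippet below illustrates Kullback's inequality D_θ(i,j) ≤ D(i,j) under a rank-1 projection:

```python
import numpy as np

# Symmetric divergence between two normal densities, following the closed
# form (6): D(i,j) = 1/2 tr{Si^-1 [Sj + d d^T] + Sj^-1 [Si + d d^T]} - n,
# where d = mu_i - mu_j.
def sym_divergence(mu_i, mu_j, sig_i, sig_j):
    n = mu_i.shape[0]
    d = np.outer(mu_i - mu_j, mu_i - mu_j)
    return 0.5 * np.trace(np.linalg.inv(sig_i) @ (sig_j + d)
                          + np.linalg.inv(sig_j) @ (sig_i + d)) - n

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
s1, s2 = np.eye(2), np.diag([1.0, 2.0])

theta = np.array([[1.0, 0.0]])  # rank-1 projection onto the first coordinate
D = sym_divergence(mu1, mu2, s1, s2)
D_theta = sym_divergence(theta @ mu1, theta @ mu2,
                         theta @ s1 @ theta.T, theta @ s2 @ theta.T)

print(D, D_theta)  # the projected divergence never exceeds the original one
```

Identical densities give D(i,j) = 0, and any projection can only shrink the divergence, consistent with Kullback's result quoted above.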
\n\nLet us  make next the assumption that each class i  is normally distributed with mean /-Li  and \ncovariance ~i, that is Pi (x)  =  N(x; /-Li,  ~i) andpf (y)  =  N(y; ()/-Li, ()~i()T), i  =  1, ... , c. \nIt is straightforward to show that in  this case the divergence is given by \n\n1 \n\nD(i,j) =  2\"  trace{~i1[~j + (/-Li  - /-Lj)(/-Li - /-Lj)T] +~j1[~i+ (/-Li - /-Lj)(/-Li  - /-Lj)T]}-n \n(6) \n\nThus, the objective function to be maximized becomes \n\nDo  = C(C -1) trace{~(()~i())  ()Si()} - P \n\nT  -1 \n\n1 \n\nT \n\nC \n'\"' \n\n(7) \n\nwhere Si = L ~j + (/-Li  -\n\n/-Lj)(/-Li  -\n\n/-Ljf, i  = 1, ... , C. \n\n#i \n\nFollowing matrix differentiation results from [9],  the gradient of Do  with respect to ()  has \nthe expression \n\n\fUnfortunately, it turns out that 8f//  = a has no analytical solutions for the stationary points. \nInstead, one has to use numerical optimization routines for the maximization of D(). \n\n2.3  Bhattacharyya bound \n\nAn  alternative way  of minimizing the Bayes error is  to minimize an  upper bound on  this \nquantity. We will first prove the following statement \n\nIndeed, from  (1), the Bayes error can be rewritten as \n\n(9) \n\n(10) \n\n-+ \nand  for  every  x,  there  exists  a  permutation  of  the  indices  U x \n{l, ... , C}  such  that the  terms  A1P1 (x), ... , ACPC (x)  are  sorted in increasing order, i.e. \nAux (1)Pux (1) (x) S  ... S  Aux(C)Pux(C)(x) .  Moreover, for 1 S  k  S C  - 1 \n\n:  {l, ... , C} \n\nfrom which follows that \n\nl~~cLAjPj(X) =  L \n-\n\nj#i \n\n-\n\nk=l \n\nC-1 \n\nAux (k)Pu x (k) (x) s L \n\nC-1 \n\nk=l \n\nAUx(k)Pux(k) (X)A ux (k+1)Pux (k+1) (x) \n\n(12) \n\n< \n\nL \n\nl::;i<j::;C \n\nVAiPi (X)Ajpj (x) \n\nwhich, when integrated over :R n, leads to (9). 
\n\nAs  previously, if we  assume that the Pi'S are normal distributions with  means  JLi  and co(cid:173)\nvariances ~i, the bound given by the right-hand side of (9) has the closed form expression \n\nL \n\nVAiAje-P(i,j) \n\n19<j::;C \n\nwhere \n\n(13) \n\n(14) \n\n\fis  called the Bhattacharyya distance between the normal distributions Pi  and Pj  [3].  Simi(cid:173)\nlarly, one can define Po (i, j), the Bhattacharyya distance between the projected densities pf \nand p~. Combining (9) and (13), one obtains the following inequality involving the Bayes \nerror rate in the projected space \n\nEO  ~  L \n\n..jAiAje- P9 (i,j) (= Bo) \n\n(15) \n\n15,i<j5,C \n\nIt is necessary at this point to introduce the following simplifying notations: \n\n\u2022  Bij = ~ (JLi  - JLj )(JLi  - JLj) T and \n\u2022  Wij = ~(~i + ~j), 1 ~ i  < j  ~ c. \n\nFrom (14), it follows that \n\npo(i,j) =  -21  trace{(OWijOT)-lOBijOT} +  -21  log \n\nIOWijOTI \n\n..jIO~iOTIIO~jOTI \n\n(16) \n\nand the gradient of Bo  with respect to 0 is \n\naBo \naO  = \n\n(17) \n\nwith, again by  making use of differentiation results from [9] \n\napo(i,j) \n\naO \n\n1 =  \"2 (OWijOT)-l[OBijOT (OWijOT)-lOWij  - OBij ] + (OWijOT)-lOWij-\n\n-~[(O~iOT)-lO~i + (O~jOT)-lO~j] \n\n(18) \n\n3  Experiments and results \n\nThe speech recognition experiments were conducted on a voicemail transcription task [7]. \nThe  baseline  system  has  2.3K  context dependent HMM  states  and  134K  diagonal  gaus(cid:173)\nsian  mixture  components  and  was  trained  on  approximately  70 hours  of data.  The  test \nset  consists  of 86  messages  (approximately 7000 words).  The  baseline  system uses  39-\ndimensional frames  (13 cepstral coefficients plus deltas and double deltas computed from \n9  consecutive frames).  
For the divergence and Bhattacharyya projections, every 9 consecutive 24-dimensional cepstral vectors were spliced together forming 216-dimensional feature vectors, which were then clustered to estimate one full covariance gaussian density for each state. Subsequently, a 39×216 transformation θ was computed using the objective functions for the divergence (7) and the Bhattacharyya bound (15), which projected the models and feature space down to 39 dimensions. As mentioned in [4], it is not clear what the most appropriate class definition for the projections should be. The best results were obtained by considering each individual HMM state as a separate class, with the priors of the gaussians summing up to one across states. Both optimizations were initialized with the LDA matrix and carried out using a conjugate gradient descent routine with user-supplied analytic gradient from the NAG¹ Fortran library. The routine performs an iterative update of the inverse of the hessian of the objective function by accumulating curvature information during the optimization.

Figure 1 shows the evolution of the objective functions for the divergence and the Bhattacharyya bound.

[Figure 1: Evolution of the objective functions (interclass divergence and Bhattacharyya bound versus iteration).]

The parameters of the baseline system (with 134K gaussians) were then re-estimated in the transformed spaces using the EM algorithm. Table 1 summarizes the improvements in the word error rates for the different systems.
\n\nINumerical Algebra Group \n\n\fSystem \nBaseline (MFCC+~ +  ~~) \nLDA \nInterclass divergence \nBhattacharyya bound \n\nWord error rate \n\n39.61 % \n37.39% \n36.32% \n35.73% \n\nTable 1:  Word error rates for the different systems. \n\n4  Summary \n\nTwo methods for performing discriminant feature  space projections have  been presented. \nUnlike LDA, they both aim to minimize the probability of misclassification in the projected \nspace by  either maximizing the interclass divergence and relating it to the Bayes error or \nby  directly minimizing an  upper bound on  the classification error.  Both  methods  lead  to \ndefining smooth objective functions which have as argument projection matrices and which \ncan be numerically optimized. Experimental results on large vocabulary continuous speech \nrecognition over the telephone show the superiority of the resulting features over their LDA \nor cepstral counterparts. \n\nReferences \n\n[1]  H. P. Decell and 1.  A.  Quirein.  An iterative approach to the feature selection problem. \nProc.  Purdue  Univ.  Conf.  on  Machine  Processing  of Remotely  Sensed Data,  3B 1-\n3BI2,1972. \n\n[2]  R.  O. Duda and P. B. Hart.  Pattern classification and scene analysis.  Wiley,  New York, \n\n1973. \n\n[3]  K.  Fukunaga.  Introduction to  statistical pattern recognition.  Academic Press,  New \n\nYork,1990. \n\n[4]  R.  Haeb-Umbach and H.  Ney.  Linear Discriminant Analysis for improved large vo(cid:173)\n\ncabulary continuous speech recognition.  Proceedings of lCASSP'92, volume 1, pages \n13-16, 1992. \n\n[5]  S. Kullback.  Information theory and statistics.  Wiley,  New York,  1968. \n\n[6]  N.  Kumar and  A.  G.  Andreou.  Heteroscedastic discriminant  analysis  and  reduced \nrank HMMs for improved speech recognition.  Speech  Communcation,  26:283-297, \n1998. \n\n[7]  M.  Padmanabhan, G. Saon, S.  Basu, 1.  Huang and G. Zweig.  Recent improvements \nin  voicemail  transcription.  
Proceedings of EUROSPEECH'99, Budapest, Hungary, 1999.

[8] G. Saon, M. Padmanabhan, R. Gopinath and S. Chen. Maximum likelihood discriminant feature spaces. Proceedings of ICASSP'2000, Istanbul, Turkey, 2000.

[9] S. R. Searle. Matrix algebra useful for statistics. Wiley Series in Probability and Mathematical Statistics, New York, 1982.
", "award": [], "sourceid": 1794, "authors": [{"given_name": "George", "family_name": "Saon", "institution": null}, {"given_name": "Mukund", "family_name": "Padmanabhan", "institution": null}]}