{"title": "Analysis of Information in Speech Based on MANOVA", "book": "Advances in Neural Information Processing Systems", "page_first": 1213, "page_last": 1220, "abstract": null, "full_text": "Analysis  of Information  in  Speech based \n\non MANOVA \n\nSachin s.  Kajarekarl  and Hynek Hermansky l ,2 \n1 Department of Electrical and Computer Engineering \n\nOGI School of Science and Engineering at  OHSU \n\nBeaverton, OR \n\n2International Computer Science Institute \n\nBerkeley,  CA \n\n{ sachin,hynek} @asp.ogi.edu \n\nAbstract \n\nWe  propose  analysis  of information  in  speech  using  three  sources \n- language  (phone),  speaker and channeL  Information in speech is \nmeasured as mutual information between the source and the set of \nfeatures  extracted  from  speech  signaL  We  assume  that  distribu(cid:173)\ntion of features  can be modeled  using  Gaussian distribution.  The \nmutual  information  is  computed  using  the  results  of  analysis  of \nvariability in speech.  We  observe similarity in the results of phone \nvariability and phone information, and show that the results of the \nproposed  analysis  have  more  meaningful  interpretations  than  the \nanalysis of variability. \n\n1 \n\nIntroduction \n\nSpeech  signal  carries  information  about  the  linguistic  message,  the  speaker,  the \ncommunication  channeL  In  the  previous  work  [1,  2],  we  proposed  analysis  of in(cid:173)\nformation  in  speech  as  analysis  of variability  in  a  set  of features  extracted  from \nthe speech signal.  The variability was  measured as  covariance of the features , and \nanalysis  was  performed  using  using multivariate  analysis  of variance  (MANOVA). \nTotal  variability  was  divided  into  three  types  of variabilities,  namely,  intra-phone \n(or  phone)  variability,  speaker  variability,  and  channel  variability.  Effect  of each \ntype was  measured as its contribution to the total variability. \n\nIn this  paper, we  extend our previous  work by  proposing an information-theoretic \nanalysis  of information  in  speech.  Similar  to  MANOVA,  we  assume  that  speech \ncarries  information from  three  main  sources- language,  speaker,  and  channeL  We \nmeasure  information  from  a  source  as  mutual  information  (MI)  [3]  between  the \ncorresponding class labels and features.  For example, linguistic information is  mea(cid:173)\nsured  as  MI  between  phone labels  and features.  The effect  of sources  is  measured \nin  nats  (or  bits).  In  this  work,  we  show  it is  easier to interpret  the  results  of this \nanalysis than the analysis of variability. \n\nIn  general,  MI  between  two  random  variables  X  and  Y  can  be  measured  using \nthree  different  methods  [4].  First,  assuming that  X  and  Y  have  a  joint  Gaussian \n\n\fdistribution.  However, we cannot use this method because one of the variables - a set \nof class labels - is discrete.  Second, modeling distribution of X  or Y  using parametric \nform , for example, mixture of Gaussians [4].  Third, using non-parametric techniques \nto  estimate  distributions  of X  and  Y  [5].  The  proposed  analysis  is  based  on  the \nsecond method, where distribution of features is modeled as a Gaussian distribution. \nAlthough it is  a strong assumption, we show that results of this analysis are similar \nto the results obtained using the third method  [5]. \n\nThe  paper  is  organized  as  follows.  Section  2  describes  the  experimental  setup. 
2 Experimental Setup

In previous work [1, 2], we analyzed variability in the features using three databases: HTIMIT, OGI Stories, and TIMIT. In this work, we present results of MANOVA using the OGI Stories database, mainly for comparison with Yang's results [5, 6]. The English part of the OGI Stories database consists of 207 speakers, each speaking for approximately 1 minute. Each utterance is transcribed at the phone level; therefore, phone is considered a source of variability or a source of information. The utterances are not labeled separately by speaker and channel, so we cannot measure speaker and channel as separate sources. Instead, we assume that different speakers have used different channels and consider speaker+channel as a single source of variability or a single source of information.

Figure 1 shows a commonly used time-frequency representation of energy in the speech signal. The y-axis represents frequency, the x-axis represents time, and the darkness of each element shows the energy at a given frequency and time. A spectral vector is defined by the set of points along the y-axis at a given time, S(w, t_m). In this work, this vector contains 15 points on the Bark spectrum. The vector is estimated every 10 ms using a 25 ms speech segment. It is labeled by the phone and the speaker and channel label of the corresponding speech segment. A temporal vector is defined by a sequence of points along time at a given frequency, S(w_n, t). In this work, it consists of 50 points each in the past and the future with respect to the current observation, plus the observation itself. As the spectral vectors are computed every 10 ms, the temporal vector represents 1 sec of temporal information. The temporal vectors are labeled by the phone and the speaker and channel label of the current speech segment. In this work, the analysis is performed independently using spectral and temporal vectors.

Figure 1: Time-frequency representation of logarithmic energies from the speech signal. The spectral vector runs along frequency (spectral domain); the temporal vector runs along time (temporal domain).
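To make the two feature views concrete, the following Python/NumPy sketch builds spectral and temporal vectors from a stand-in Bark spectrogram. The sizes follow the description above, but the random data, function names, and the chosen frame and band indices are ours, for illustration only.

```python
import numpy as np

# Toy stand-in for a Bark spectrogram: 15 critical bands x T frames,
# one frame every 10 ms (the paper's setting). A real input would come
# from an auditory front end; here we use random data for illustration.
rng = np.random.default_rng(0)
n_bands, n_frames = 15, 6000            # roughly 1 minute at 10 ms frames
S = rng.standard_normal((n_bands, n_frames))

def spectral_vector(S, t):
    """Spectral vector S(w, t_m): one 15-point column at frame t."""
    return S[:, t]

def temporal_vector(S, band, t, context=50):
    """Temporal vector S(w_n, t): 50 past + current + 50 future frames
    at one band, i.e. 101 points = 1 sec. Assumes context <= t < n_frames - context."""
    return S[band, t - context:t + context + 1]

x_spec = spectral_vector(S, 300)
x_temp = temporal_vector(S, band=4, t=300)
print(x_spec.shape, x_temp.shape)       # (15,) (101,)
```

Each such vector would then be tagged with the phone label and the speaker+channel label of the current frame before the analyses below.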
3 MANOVA

Multivariate analysis of variance (MANOVA) [7] is used to measure the variation in the data, {X in R^n}, with respect to two or more factors. In this work, we use two factors: phone and speaker+channel. The underlying model of MANOVA is

X_{ijk} = \bar{X}_{..} + (\bar{X}_{i.} - \bar{X}_{..}) + (\bar{X}_{ij.} - \bar{X}_{i.}) + \epsilon_{ijk},    (1)

where i = 1, ..., p indexes phones and j = 1, ..., sc indexes speakers and channels. This equation shows that any feature vector X_{ijk} can be expressed using \bar{X}_{..}, the mean of the data; \bar{X}_{i.}, the mean of phone i; \bar{X}_{ij.}, the mean of speaker and channel j within phone i; and \epsilon_{ijk}, the error in this approximation. Using this model, the total covariance can be decomposed as follows:

\Sigma_{total} = \Sigma_p + \Sigma_{sc} + \Sigma_{residual},    (2)

where

\Sigma_p = \sum_i \frac{N_i}{N} (\bar{X}_{i.} - \bar{X}_{..})^t (\bar{X}_{i.} - \bar{X}_{..}),

\Sigma_{sc} = \sum_i \sum_j \frac{N_{ij}}{N} (\bar{X}_{ij.} - \bar{X}_{i.})^t (\bar{X}_{ij.} - \bar{X}_{i.}),

\Sigma_{residual} = \frac{1}{N} \sum_i \sum_j \sum_k (X_{ijk} - \bar{X}_{ij.})^t (X_{ijk} - \bar{X}_{ij.}),

N is the data size, and N_i and N_{ij} refer to the number of samples associated with the particular combination of factors indicated by the subscript.

The covariance terms are computed as follows. First, all the feature vectors X belonging to each phone i are collected and their mean \bar{X}_{i.} is computed. The covariance of these phone means, \Sigma_p, is the estimate of phone variability. Next, the data for each speaker and channel j within each phone i is collected and its mean \bar{X}_{ij.} is computed. The covariance of these speaker means about the phone means, averaged over all phones, \Sigma_{sc}, is the estimate of speaker variability. Not all the variability in the data is explained by these sources. Unaccounted sources, such as context and coarticulation, cause variability in the data collected from one speaker speaking one phone through one channel. The covariance within each phone, speaker, and channel is averaged over all phones, speakers, and channels, and the resulting covariance, \Sigma_{residual}, is the estimate of residual variability.
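The decomposition in Equation 2 is straightforward to compute. The Python/NumPy sketch below is our illustration, not the authors' code: it forms the three covariance terms from labeled feature vectors, checks that they add up to the total covariance, and prints the trace-ratio contribution measure used in the results below.

```python
import numpy as np

def manova_decomposition(X, phone, spkr):
    """Split the total covariance into phone, speaker+channel and
    residual parts (Equation 2). X: (N, n) feature vectors; phone,
    spkr: (N,) integer labels. A sketch under the paper's model."""
    N, n = X.shape
    grand = X.mean(axis=0)
    sig_p = np.zeros((n, n)); sig_sc = np.zeros((n, n)); sig_res = np.zeros((n, n))
    for i in np.unique(phone):
        Xi = X[phone == i]
        mi = Xi.mean(axis=0)
        d = (mi - grand)[:, None]
        sig_p += len(Xi) / N * (d @ d.T)            # phone variability
        for j in np.unique(spkr[phone == i]):
            Xij = Xi[spkr[phone == i] == j]
            mij = Xij.mean(axis=0)
            d = (mij - mi)[:, None]
            sig_sc += len(Xij) / N * (d @ d.T)      # speaker+channel variability
            R = Xij - mij
            sig_res += R.T @ R / N                  # residual variability
    return sig_p, sig_sc, sig_res

# Toy data: the labels and dimensions are illustrative only.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 15))
phone = rng.integers(0, 40, 2000); spkr = rng.integers(0, 20, 2000)
sp, ssc, sres = manova_decomposition(X, phone, spkr)
total = sp + ssc + sres
print(np.allclose(total, np.cov(X.T, bias=True)))  # True: Equation 2 holds exactly
print(np.trace(sp) / np.trace(total), np.trace(ssc) / np.trace(total))
```

The final line computes trace(\Sigma_{source})/trace(\Sigma_{total}), the contribution measure reported in Section 3.1.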
3.1 Results

Results of MANOVA are interpreted at two levels: feature element and feature vector. Results for each feature element are shown in Figure 2; Table 1 shows the results for the complete feature vector. The contribution of each source is calculated as trace(\Sigma_{source})/trace(\Sigma_{total}). Note that this measure cannot be used to compare variabilities across feature sets with different numbers of features; therefore, we cannot directly compare contributions of variabilities in the time and frequency domains. For comparison, the contribution of sources in the temporal domain is calculated as trace(E^t \Sigma_{source} E)/trace(E^t \Sigma_{total} E), where E (101 x 15) is the matrix of the 15 leading eigenvectors of \Sigma_{total}.

Table 1: Contribution of sources (% of total variability) in spectral and temporal domains

source            Spectral Domain   Temporal Domain
phone                  35.3               4.0
speaker+channel        41.1              30.3

In the spectral domain, the highest phone variability is between 4-6 Barks. The highest speaker and channel variability is between 1-2 Barks, where phone variability is the lowest. In the temporal domain, phone variability spreads for approximately 250 ms around the current phone. Speaker and channel variability is almost constant except around the current frame. This deviation is explained by the difference in phonetic context among the phone instances across different speakers: features for speakers within a phone differ not only because of different speaker characteristics but also because of different phonetic contexts. This deviation is also seen in the speaker and channel information in the proposed analysis. In the overall results for each domain, the spectral domain has higher phone variability than the temporal domain; it also has higher speaker and channel variability than the temporal domain.

The disadvantage of this analysis is that its results are difficult to interpret. For example, how much phone variability is needed for perfect phone recognition? And is 4% phone variability in the temporal domain significant? In order to answer these questions, we propose an information-theoretic analysis.

Figure 2: Results of analysis of variability: phone and speaker+channel variance per feature element, plotted against frequency (critical band index, 1-15) in the spectral domain and against time (-250 to 250 ms) in the temporal domain.

4 Information-theoretic Analysis

Results of MANOVA cannot be directly converted to MI because the determinants of the source and residual covariances do not add up to the determinant of the total covariance. Therefore, we propose a different formulation for the information-theoretic analysis, as follows. Let {X in R^n} be a set of feature vectors with probability distribution P(X), and let h(X) be the entropy of X. Let Y = {Y_1, ..., Y_m} be a set of factors, where each Y_i is a set of classes within a factor. For example, Y_1 = {y_1^i} can represent the phone factor, with each y_1^i representing a phone class. Assume that X has two parts: one completely characterized by Y, and another part, Z, characterized by N(0, I_n), where I_n is the n x n identity matrix. Let I(X; Y) be the MI between X and Y. Assuming that we consider all the possible factors in the analysis,

I(X; Y) = I(X; Y_1, ..., Y_m) = h(X) - h(X | Y_1, ..., Y_m) = h(X) - h(Z) = D(P || N),

where D(. || .) is the Kullback-Leibler distance [3] between the distributions P and N. Using the chain rule, the left-hand side can be expanded as

I(X; Y_1, ..., Y_m) = I(X; Y_1) + I(X; Y_2 | Y_1) + \sum_{i=3}^{m} I(X; Y_i | Y_{i-1}, ..., Y_2, Y_1).    (3)

If we assume that only two factors, Y_1 and Y_2, are used for the analysis, then this equation is analogous to the decomposition performed using MANOVA (Equation 2). The term on the left-hand side is the information in X that can be explained using Y; it corresponds to the total variability in MANOVA. On the right-hand side, the first term is analogous to the phone variability, the second term is analogous to the speaker variability, and the last term, which measures the effect of the unaccounted factors (Y_3, ..., Y_m), is analogous to the residual variability.

The first and second terms on the right-hand side of Equation 3 are computed as follows:

I(X; Y_1) = h(X) - h(X | Y_1),    (4)

I(X; Y_2 | Y_1) = h(X | Y_1) - h(X | Y_1, Y_2).    (5)

The h(.) terms are estimated using a parametric approximation to the total and conditional distributions. It is assumed that the total distribution of the features is a Gaussian with covariance \Sigma; therefore, h(X) = \frac{1}{2} \log((2 \pi e)^n |\Sigma|). Similarly, we assume that the distribution of the features of each phone i is a Gaussian with covariance \Sigma_i; therefore,

h(X | Y_1) = \frac{1}{2} \sum_{y_1^i \in Y_1} P(y_1^i) \log((2 \pi e)^n |\Sigma_i|).    (6)

Finally, we assume that the distribution of the features of each phone spoken by each speaker is also a Gaussian, with covariance \Sigma_{ij}; therefore,

h(X | Y_1, Y_2) = \frac{1}{2} \sum_{y_1^i \in Y_1, y_2^j \in Y_2} P(y_1^i, y_2^j) \log((2 \pi e)^n |\Sigma_{ij}|).    (7)

Substituting Equations 6 and 7 in Equations 4 and 5, we get

I(X; Y_1) = \frac{1}{2} \log \frac{|\Sigma|}{\prod_{y_1^i \in Y_1} |\Sigma_i|^{P(y_1^i)}},    (8)

I(X; Y_2 | Y_1) = \frac{1}{2} \log \frac{\prod_{y_1^i \in Y_1} |\Sigma_i|^{P(y_1^i)}}{\prod_{y_1^i \in Y_1, y_2^j \in Y_2} |\Sigma_{ij}|^{P(y_1^i, y_2^j)}}.    (9)
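Under the Gaussian assumption, Equations 4-9 reduce to log-determinants of covariance matrices weighted by empirical class priors. The sketch below (Python/NumPy, ours; maximum-likelihood covariances and empirical priors are assumptions about implementation detail the paper does not spell out) computes I(X; Y_1) and I(X; Y_2 | Y_1) from labeled data.

```python
import numpy as np

def gauss_entropy(X):
    """Differential entropy (nats) of a Gaussian fit to X:
    h = 0.5 * log((2*pi*e)^n * |Sigma|)."""
    n = X.shape[1]
    _, logdet = np.linalg.slogdet(np.cov(X.T, bias=True))
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

def mutual_information(X, y):
    """I(X; Y) = h(X) - sum_i P(y_i) h(X | y_i), per Equations 4, 6 and 8.
    y holds discrete class labels."""
    mi = gauss_entropy(X)
    for c in np.unique(y):
        Xc = X[y == c]
        mi -= (len(Xc) / len(X)) * gauss_entropy(Xc)
    return mi  # nats

def conditional_mi(X, y1, y2):
    """I(X; Y2 | Y1) = I(X; (Y1, Y2)) - I(X; Y1), per Equations 5, 7 and 9."""
    joint = np.array([f"{a}-{b}" for a, b in zip(y1, y2)])
    return mutual_information(X, joint) - mutual_information(X, y1)

# Toy demo: X depends on y1 but not on y2, so I(X; Y1) is clearly
# positive while I(X; Y2 | Y1) is near zero (up to estimation noise).
rng = np.random.default_rng(2)
y1 = rng.integers(0, 5, 3000); y2 = rng.integers(0, 3, 3000)
X = rng.standard_normal((3000, 4)) + y1[:, None]
print(mutual_information(X, y1), conditional_mi(X, y1, y2))
```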
4.1 Results

Figure 3 shows the results of the information-theoretic analysis in the spectral and temporal domains. These results are computed independently for each feature element. In the spectral domain, phone information is highest between 3-6 Barks. Speaker and channel information is lowest in that range and highest between 1-2 Barks. Since the OGI Stories database was collected over different telephones, the speaker+channel information below 2 Barks (approximately 200 Hz) is due to different telephone channels. In the temporal domain, the highest phone information is at the center (0 ms), and it spreads for approximately 200 ms around the center. Speaker and channel information is almost constant across time except near the center.

Figure 3: Results of the information-theoretic analysis: phone and speaker+channel MI (nats) per feature element, plotted against frequency (critical band index) in the spectral domain and against time (-250 to 250 ms) in the temporal domain.

Note that the speaker and channel variability also deviates from a constant around the current frame, but at the current frame, phone variability is higher than speaker and channel variability. The results of the analysis of information show that, at the current frame, phone information is lower than speaker and channel information. This difference is explained by comparing our MI results with the results of Yang et al. [6] in the next section.

Table 2 shows the results for the complete feature vector.

Table 2: Mutual information (nats) between features and the phone and speaker+channel labels in spectral and temporal domains

source            Spectral Domain   Temporal Domain
phone                   1.6               1.2
speaker+channel         0.5               5.9

Note that there are some practical issues in computing the determinants in Equations 4 and 5. They are related to data insufficiency, specifically in the temporal domain, where the feature vector has 101 points and there are only approximately 60 vectors per speaker per phone. We observe that without proper conditioning of the covariances, the analysis overestimates MI (I(X; Y_1, Y_2) > H(Y_1, Y_2)). This is addressed by using the condition number to limit the number of eigenvalues used in the calculation of the determinants. Our hypothesis is that in the presence of insufficient data, only a few leading eigenvectors are properly estimated. We used a condition number of 1000 to estimate the determinants of \Sigma and \Sigma_i, and a condition number of 100 to estimate the determinants of \Sigma_{ij}.
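A minimal sketch of this conditioning as we read it: keep only the eigenvalues within the chosen condition-number factor of the largest one when forming the log-determinant. The function name and the exact thresholding rule are our assumptions; the paper does not spell out the rule.

```python
import numpy as np

def logdet_cond(sigma, cond=1000.0):
    """Log-determinant of a covariance matrix using only the leading
    eigenvalues, i.e. those within a factor `cond` of the largest one.
    A sketch of the conditioning described above, not the authors' code."""
    w = np.linalg.eigvalsh(sigma)     # ascending, real for symmetric sigma
    keep = w > w[-1] / cond           # discard unreliably small eigenvalues
    # Caveat: if the number of kept eigenvalues differs across the
    # covariances being compared, the (2*pi*e)^n terms in the entropies
    # no longer cancel exactly; we gloss over that here.
    return float(np.sum(np.log(w[keep])))
```

In the sketch of Section 4, logdet_cond(np.cov(Xc.T, bias=True), 1000.0) would stand in for the slogdet call when estimating |\Sigma| and |\Sigma_i|, and cond=100.0 when estimating |\Sigma_{ij}|.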
With the covariances conditioned in this way, the results show that phone information in the spectral domain is 1.6 nats and speaker and channel information is 0.5 nats. In the temporal domain, phone information is about 1.2 nats and speaker and channel information is 5.9 nats. Comparison of the two domains shows that the spectral domain has higher phone information than the temporal domain, while the temporal domain has higher speaker and channel information than the spectral domain.

Using these results, we can answer the questions raised in Section 3. The first question was how much phone variability is needed for perfect phone recognition. The answer is H(Y_1), because the maximum value of I(X; Y_1) is H(Y_1). We compute H(Y_1) using the phone priors; for this database, H(Y_1) = 3.42 nats, which means we need 3.42 nats of information for perfect phone recognition. The question about the significance of phone information in the temporal domain is addressed by comparing it with an information-less MI level. The information-less MI is computed as the MI between the current phone label and features at 500 ms in the past or in the future. From our results, the information-less MI is 0.0013 nats for the feature at 500 ms in the past and 0.0010 nats for the feature at 500 ms in the future.(a) The phone information in the temporal domain, 1.2 nats, is greater than both levels; therefore, it is significant.

(a) The information-less MI calculated using Yang et al. is 0.019 bits.
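The ceiling H(Y_1) used above is simply the entropy of the empirical phone priors; a one-function sketch (ours, with toy labels):

```python
import numpy as np

def label_entropy(y):
    """H(Y) from empirical class priors, in nats: the upper bound on
    I(X; Y). For the OGI phone set the paper reports 3.42 nats."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

phones = np.array(["aa", "iy", "s", "aa", "iy", "aa"])  # toy prior
print(label_entropy(phones))
```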
5 Results in Perspective

In the proposed analysis, we estimated MI assuming a Gaussian distribution for the features. This assumption is validated by comparing our results with the results of a study by Yang et al. [6], where MI was computed without assuming any parametric model for the distribution of the features. Note that only entropies can be directly compared across estimation techniques [3]; MI under the Gaussian assumption can be equal to, less than, or greater than the actual MI. In comparing our results with Yang's results, we therefore consider only the nature of the information observed in both studies; the difference in absolute MI levels across the two studies is attributed to the difference in the estimation techniques.

In the spectral domain, Yang's study showed higher phone information between 3-8 Barks, with the highest phone information at 4 Barks. Higher speaker and channel information was observed around 1-2 Barks. In the temporal domain, their study showed that phone information spreads for approximately 200 ms around the current time frame. Comparison of the results from their analysis and ours shows that the nature of phone information is similar in both studies. The nature of speaker and channel information in the spectral domain is also similar. We could not compare the speaker and channel information in the temporal domain because Yang's study did not present those results.

In Section 4.1, we observed a difference between the nature of the speaker and channel variability and the speaker and channel information at 5 Barks. Comparing MI levels from our study with those from Yang's study, we observe that Yang's results show that speaker and channel information at 5 Barks is less than the corresponding phone information. This is consistent with the results of the analysis of variability, but not with the proposed analysis of information. As mentioned before, this difference is due to the difference in the density estimation techniques used for computing MI. In future work, we plan to model the densities using more sophisticated techniques and so improve the estimation of speaker and channel information.

6 Conclusions

We proposed analysis of information in speech using three sources of information: language (phone), speaker, and channel. Information in speech was measured as MI between the class labels and the set of features extracted from the speech signal; for example, linguistic information was measured using phone labels and the features. We modeled the distribution of the features using a Gaussian distribution, and thereby related the analysis to the previously proposed analysis of variability in speech. We observed similar results for phone variability and phone information. The speaker and channel variability and the speaker and channel information around the current frame were different; this was shown to be related to the overestimation of speaker and channel information under the unimodal Gaussian model. The analysis of information was proposed because its results have more meaningful interpretations than the results of the analysis of variability. To address the overestimation, we plan to use more complex models, such as mixtures of Gaussians, for computing MI in future work.

Acknowledgments

The authors thank Prof. Andrew Fraser of Portland State University for numerous discussions and helpful insights on this topic.

References

[1] S. S. Kajarekar, N. Malayath and H. Hermansky, "Analysis of sources of variability in speech," in Proc. of EUROSPEECH, Budapest, Hungary, 1999.

[2] S. S. Kajarekar, N. Malayath and H. Hermansky, "Analysis of speaker and channel variability in speech," in Proc. of ASRU, Colorado, 1999.

[3] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991.

[4] J. A. Bilmes, "Maximum mutual information based reduction strategies for cross-correlation based joint distribution modeling," in Proc. of ICASSP, Seattle, USA, 1998.

[5] H. H. Yang, S. van Vuuren and H. Hermansky, "Relevancy of time-frequency features for phonetic classification measured by mutual information," in Proc. of ICASSP, Phoenix, Arizona, USA, 1999.

[6] H. H. Yang, S. Sharma, S. van Vuuren and H. Hermansky, "Relevance of time-frequency features for phonetic and speaker-channel classification," Speech Communication, Aug. 2000.

[7] R. V. Hogg and E. A. Tanis, Statistical Analysis and Inference, Prentice Hall, fifth edition, 1997.
", "award": [], "sourceid": 2161, "authors": [{"given_name": "Sachin", "family_name": "Kajarekar", "institution": null}, {"given_name": "Hynek", "family_name": "Hermansky", "institution": null}]}