{"title": "Controlling the Complexity of HMM Systems by Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 743, "abstract": null, "full_text": "Controlling the Complexity of HMM Systems by Regularization \n\nChristoph Neukirchen, Gerhard Rigoll \n\nDepartment of Computer Science \n\nGerhard-Mercator-University Duisburg \n\n47057 Duisburg, Germany \n\nemail:  {chn.rigoll}@fb9-ti.uni-duisburg.de \n\nAbstract \n\nThis paper introduces a method for regularization of HMM systems that avoids parameter overfitting caused by insufficient training data. Regularization is done by augmenting the EM training method by a penalty term that favors simple and smooth HMM systems. The penalty term is constructed as a mixture model of negative exponential distributions that is assumed to generate the state-dependent emission probabilities of the HMMs. This new method is the successful transfer of a well-known regularization approach in neural networks to the HMM domain and can be interpreted as a generalization of traditional state-tying for HMM systems. The effect of regularization is demonstrated for continuous speech recognition tasks by improving overfitted triphone models and by speaker adaptation with limited training data. \n\n1  Introduction \n\nOne general problem when constructing statistical pattern recognition systems is to ensure the capability to generalize well, i.e. the system must be able to classify data that is not contained in the training data set. Hence the classifier should learn the true underlying data distribution instead of overfitting to the few data examples seen during system training. One way to cope with the problem of overfitting is to balance the system's complexity and flexibility against the limited amount of data that is available for training. 
\nIn the neural network community it is well known that the amount of information used in system training that is required for a good generalization performance should be larger than the number of adjustable weights (Baum, 1989). A common method to train a large neural network sufficiently well is to reduce the number of adjustable parameters, either by removing those weights that seem to be less important (in (le Cun, 1990) the sensitivity of individual network weights is estimated by the second-order gradient) or by sharing the weights among many network connections (in (Lang, 1990) the connections that share identical weight values are determined in advance by using prior knowledge about invariances in the problem to be solved). A second approach to avoid overfitting in neural networks is to make use of regularization methods. Regularization adds an extra term to the training objective function that penalizes network complexity. The simplest regularization method is weight decay (Plaut, 1986), which assigns high penalties to large weights. A more complex regularization term is used in soft weight-sharing (Nowlan, 1992) by favoring neural network weights that fall into a finite set of small weight-clusters. The traditional neural weight-sharing technique can be interpreted as a special case of soft weight-sharing regularization when the cluster variances tend towards zero. \nIn continuous speech recognition the Hidden Markov Model (HMM) method is common. When using detailed context-dependent triphone HMMs, the number of HMM-states and parameters to estimate in the state-dependent probability density functions (pdfs) is increasingly large and overfitting becomes a serious problem.  
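Before turning to HMMs, the soft weight-sharing penalty described above can be written down compactly. The following is a minimal illustration, assuming a Gaussian mixture prior with fixed cluster means, variances and mixing coefficients; all function and variable names are our own, not from (Nowlan, 1992):

```python
import numpy as np

def soft_weight_sharing_penalty(weights, means, variances, mix_coeffs):
    """Negative log-likelihood of the network weights under a Gaussian
    mixture prior -- the soft weight-sharing complexity penalty."""
    w = np.asarray(weights, dtype=float)[:, None]    # (n_weights, 1)
    # density of each weight under each mixture component
    dens = mix_coeffs * np.exp(-0.5 * (w - means) ** 2 / variances) \
           / np.sqrt(2.0 * np.pi * variances)        # (n_weights, n_clusters)
    return -np.sum(np.log(dens.sum(axis=1)))
```

Weights that gather near the cluster centers yield a small penalty, while weights far from all clusters are penalized heavily; it is this clustering behavior that the regularizer presented in this paper transfers to HMM emission pdfs.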
The most common approach to balance the complexity of triphone HMM systems against the training data set is to reduce the number of parameters by tying, i.e. parameter sharing (Young, 1992). A popular sharing method is state-tying with the HMM-states to be tied selected in advance, either by data-driven state-clustering based on a pdf-dependent distance metric (Young, 1993), or by constructing binary decision trees that incorporate higher phonetic knowledge (Bahl, 1991). In these methods, the number of state-clusters and the decision tree sizes, respectively, must be chosen adequately to match the training data size. However, a possible drawback of both methods is that two different states may be selected to be tied (and their pdfs are forced to be identical) although there is enough training data to estimate the different pdfs of both states sufficiently well. In the following, a method to reduce the complexity of general HMM systems based on a regularization term is presented. Due to its close relationship to the soft weight-sharing method for neural networks, this novel approach can be interpreted as soft state-tying. \n\n2  Maximum likelihood training in HMM systems \n\nTraditionally, the method most commonly used to determine the set of adjustable parameters Θ in an HMM system is maximum likelihood (ML) estimation via the expectation maximization (EM) algorithm. If the training observation vector sequence is denoted as X = (x(1), ..., x(T)) and the corresponding HMM is denoted as W, the ML estimator is given by: \n\nΘ_ML = argmax_Θ { log p_Θ(X|W) }    (1) \n\nIn the following, the total number of different HMM states is given by K. 
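The ML criterion in Eqn. (1) requires the likelihood log p_Θ(X|W), which is computed by the forward algorithm. A minimal log-domain sketch for a discrete HMM follows; the names log_A, log_B and log_pi (log transition, emission and initial-state probabilities) are illustrative, not symbols from this paper:

```python
import numpy as np

def log_likelihood(log_A, log_B, log_pi, obs):
    """Forward algorithm: log p(X | W) for a discrete HMM.

    log_A[i, j] : log transition probability from state i to state j
    log_B[k, m] : log probability of emitting VQ label m in state k
    log_pi[k]   : log initial probability of state k
    obs         : sequence of VQ labels m(x(1)), ..., m(x(T))
    """
    alpha = log_pi + log_B[:, obs[0]]            # initialize forward scores
    for m in obs[1:]:
        # sum over predecessor states in the log domain, then emit label m
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, m]
    return np.logaddexp.reduce(alpha)            # sum over final states
```

Working in the log domain avoids the numerical underflow that the product of many small probabilities would cause on long observation sequences.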
The emission pdf of the k-th state is denoted as b_k(x); for continuous HMMs, b_k(x) is most commonly a mixture of Gaussian pdfs; in the case of discrete HMMs the observation vector x is mapped by a vector quantizer (VQ) onto the discrete VQ-label m(x) and the emission pdf is replaced by the discrete output probability b_k(m). By the forward-backward algorithm the probabilistic state counts γ_k(t) can be determined for each training observation, and the log-likelihood over the training data can be decomposed into the auxiliary function Q(Θ) optimized in the EM steps (state transition probabilities are neglected here): \n\nQ(Θ) = Σ_{t=1}^{T} Σ_{k=1}^{K} γ_k(t) · log b_k(x(t))    (2) \n\nSometimes the observation vector x is split up into several independent streams. If the total number of streams is given by Z, the features in the z-th stream comprise the subvector x^{(z)} and, in the case of application of a VQ, the corresponding VQ label is denoted as m^{(z)}(x^{(z)}). The observation subvectors in different streams are assumed to be statistically independent, thus the states' pdfs can be written as: \n\nb_k(x) = Π_{z=1}^{Z} b_k^{(z)}(x^{(z)})    (3) \n\n3  A complexity measure for HMM systems \n\nWhen using regularization methods to train the HMM system, the traditional training objective function Q(Θ) is augmented by a complexity penalization term Ω and the new optimization problem becomes: \n\nΘ_reg = argmax_Θ { Q(Θ) + ν · Ω(Θ) }    (4) \n\nHere, the regularizer term Ω should be small if the HMM system has high complexity and parameter overfitting becomes a problem; Ω should be large if the HMM-states' pdfs are shaped smoothly and system generalization works well. The constant ν ≥ 0 is a control parameter that adjusts the tradeoff between the pure ML solution and the smoothness penalization. In Eqn. (4) the term Q(Θ) grows the more data is used for training (which makes the ML estimation more reliable), so the influence of the term ν · Ω becomes relatively less important. \nThe basic idea when constructing an expression for the regularizer Ω that favors smooth HMM systems is that, in the case of simple and smooth systems, the state-dependent emission pdfs b_k(·) should fall into several groups of similar pdfs. This is in contrast to traditional state-tying, which forces identical pdfs in each group. In the following, these clusters of similar emission pdfs are described by a probabilistic mixture model. Each pdf is assumed to be generated by a mixture of I different mixture components p_i(·). In this case the probability (density) of generating the emission pdf b_k(·) is given by: \n\np(b_k(·)) = Σ_{i=1}^{I} c_i · p_i(b_k(·))    (5) \n\nwith the mixture weights c_i constrained to 0 ≤ c_i ≤ 1 and Σ_{i=1}^{I} c_i = 1. The i-th mixture component p_i(·) is used to model the i-th cluster of HMM emission pdfs. Each cluster is represented by a prototype pdf that is denoted as β_i(·) for the i-th cluster; the distance (using a suitable metric) between an HMM emission pdf b_k(·) and the i-th prototype pdf is denoted as D_i(b_k(·)). If these distances are small for all HMM emission probabilities, there are several small clusters of emission probabilities and the regularizer term Ω should be large. Now it is assumed that the distances follow a negative exponential distribution (with a deviation parameter λ_i), yielding an expression for the mixture components: \n\np_i(b_k(·)) = ( Π_{z=1}^{Z} λ_{i,z} ) · exp( − Σ_{z=1}^{Z} λ_{i,z} · D_{i,z}(b_k^{(z)}(·)) )    (6) \n\nIn Eqn. (6) the more general case of Z independent streams is given. Hence the HMM emission pdfs and the cluster prototype pdfs are split up into Z different pdfs b_k^{(z)}(·) and β_i^{(z)}(·), respectively, and the stream-dependent distances D_{i,z} and parameters λ_{i,z} are used. Now, for the regularizer term Ω, the log-likelihood of the mixture model in Eqn. (5) over all emission pdfs in the HMM system can be used: \n\nΩ(Θ) = Σ_{k=1}^{K} log p(b_k(·))    (7) \n\n4  Regularization example: discrete HMMs \n\nAs an example for parameter estimation in the regularization framework, a discrete HMM system with different VQs for each of the Z streams is considered here: each VQ subdivides the feature space into J_z different partitions (i.e. the z-th codebook size is J_z) and the VQ-partition labels are denoted m_j^{(z)}. If the observation subvector x^{(z)} lies in the j-th VQ-partition, the VQ output is m^{(z)}(x^{(z)}) = m_j^{(z)}. \nSince discrete HMM output probabilities b_k^{(z)}(m^{(z)}) are used here, the regularizer's prototypes are the discrete probabilities β_i^{(z)}(m^{(z)}). As the distance metric between the HMM emission probabilities and the prototype probabilities used in Eqn. (6), the asymmetric Kullback-Leibler divergence is applied: \n\nD_{i,z}(b_k^{(z)}(·)) = Σ_{j=1}^{J_z} β_i^{(z)}(m_j^{(z)}) · log ( β_i^{(z)}(m_j^{(z)}) / b_k^{(z)}(m_j^{(z)}) )    (8) \n\n4.1  Estimation of HMM parameters using regularization \n\nThe parameter set Θ of the HMM system to be estimated mainly consists of the discrete HMM emission probabilities (transition probabilities are not subject to regularization here). To get an iterative parameter estimation in the EM style, Eqn. (4) must be maximized, e.g. by setting the derivative of Eqn. (4) with respect to the HMM parameter b_k^{(z)}(m_j^{(z)}) to zero and applying Lagrange multipliers with regard to the constraint 1 = Σ_{j=1}^{J_z} b_k^{(z)}(m_j^{(z)}). This leads to a quite complex solution that can only be solved numerically. \nThe optimization problem can be simplified if the mixture in Eqn. (5) is replaced by the maximum approximation, i.e. only the maximum component in the sum is considered. 
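For the discrete case just introduced, the distance computation and this maximum-approximation step can be sketched for a single stream as follows. This is a minimal illustration; the distance direction D_i(b) = Σ_j β_i(m_j) · log(β_i(m_j) / b(m_j)) and all names are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def kl_distance(beta, b):
    """Asymmetric Kullback-Leibler divergence between a prototype
    probability beta and an HMM emission probability b (single stream)."""
    return float(np.sum(beta * np.log(beta / b)))

def best_cluster(b, prototypes, mix_coeffs, lambdas):
    """Maximum approximation: index i* of the mixture component that
    maximizes c_i * p_i(b), with p_i(b) = lambda_i * exp(-lambda_i * D_i(b))."""
    scores = [np.log(c) + np.log(lam) - lam * kl_distance(beta, b)
              for c, lam, beta in zip(mix_coeffs, lambdas, prototypes)]
    return int(np.argmax(scores))
```

With equal mixture weights and deviation parameters, the winning cluster is simply the prototype closest to the emission probability in the KL sense.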
The corresponding index of the maximum component is denoted i*: \n\ni* = argmax_i { c_i · p_i(b_k(·)) }    (9) \n\nIn this simplified case the HMM parameter estimation is given by: \n\nb_k^{(z)}(m_j^{(z)}) = ( Σ_{t=1}^{T} γ_k(t) · δ(m^{(z)}(x^{(z)}(t)), m_j^{(z)}) + ν · λ_{i*,z} · β_{i*}^{(z)}(m_j^{(z)}) ) / ( Σ_{t=1}^{T} γ_k(t) + ν · λ_{i*,z} )    (10) \n\nwhere δ(·,·) denotes the Kronecker delta. This is a weighted sum of the well-known ML solution and the regularizer's prototype probability β_{i*}^{(z)}(·) that is selected by the maximum search in Eqn. (9). The larger the value of the constant ν, the stronger is the force that pushes the estimate of the HMM emission probability b_k^{(z)}(m_j^{(z)}) towards the prototype probability β_{i*}^{(z)}(·). The situation when ν tends towards infinity corresponds to the case of traditional state-tying, because all different states that fall into the same cluster i* make use of β_{i*}^{(z)}(·) as emission probability in the z-th stream. \n\n4.2  Estimation of regularizer parameters \n\nThe parameter set Ξ of the regularizer consists of the mixture weights c_i, the deviation parameters λ_{i,z}, and, in the case of regularizing discrete HMMs, of the discrete prototype probabilities β_i^{(z)}(m_j^{(z)}). These parameters can be set in advance by making use of prior knowledge; e.g. the prototype probabilities can be obtained from a simple HMM system that uses a small number of states. Alternatively, the regularizer's parameters can be estimated in a similar way as in (Nowlan, 1992) by maximizing Eqn. (7). Since there is no direct solution to this optimization problem, maximization must be performed in an EM-like iterative procedure that uses the HMM emission pdfs b_k(·) as training data for the mixture model and increases the following auxiliary function in each step: \n\nR(Ξ) = Σ_{k=1}^{K} Σ_{i=1}^{I} P(i|b_k(·)) · log p(i, b_k(·)) = Σ_{k=1}^{K} Σ_{i=1}^{I} P(i|b_k(·)) · log ( c_i · p_i(b_k(·)) )    (11) \n\nwith the posterior probability used as weighting factor given by: \n\nP(i|b_k(·)) = c_i · p_i(b_k(·)) / Σ_{l=1}^{I} c_l · p_l(b_k(·))    (12) \n\nAgain, maximization of Eqn. (11) can be performed by setting the derivative of R(Ξ) with respect to the regularizer's parameters to zero under consideration of the constraints 1 = Σ_{i=1}^{I} c_i and 1 = Σ_{j=1}^{J_z} β_i^{(z)}(m_j^{(z)}) by application of Lagrange multipliers. For the estimation of the regularizer parameters this yields: \n\nc_i = (1/K) · Σ_{k=1}^{K} P(i|b_k(·))    (13) \n\nλ_{i,z} = Σ_{k=1}^{K} P(i|b_k(·)) / Σ_{k=1}^{K} D_{i,z}(b_k^{(z)}(·)) · P(i|b_k(·))    (14) \n\nβ_i^{(z)}(m_j^{(z)}) = exp( Σ_{k=1}^{K} P(i|b_k(·)) · log b_k^{(z)}(m_j^{(z)}) / Σ_{k=1}^{K} P(i|b_k(·)) ) / Σ_{l=1}^{J_z} exp( Σ_{k=1}^{K} P(i|b_k(·)) · log b_k^{(z)}(m_l^{(z)}) / Σ_{k=1}^{K} P(i|b_k(·)) )    (15) \n\nThe estimate c_i can be interpreted as the average probability that an HMM emission probability falls into the i-th mixture cluster; λ_{i,z} is the inverse of the weighted average distance between the emission probabilities and the prototype probability β_i^{(z)}(·). The estimate β_i^{(z)}(m_j^{(z)}) is the average probability over all emission probabilities for the VQ-label m_j^{(z)}, weighted in the log-domain. \nIf the Euclidean distance between the discrete probabilities is used instead of Eqn. (8) to measure the differences between the HMM emission probabilities and the prototypes, \n\nD_{i,z}(b_k^{(z)}(·)) = Σ_{j=1}^{J_z} ( β_i^{(z)}(m_j^{(z)}) − b_k^{(z)}(m_j^{(z)}) )^2    (16) \n\nthe estimate of the prototype probabilities is given by the average of the HMM probabilities weighted in the original space: \n\nβ_i^{(z)}(m_j^{(z)}) = Σ_{k=1}^{K} P(i|b_k(·)) · b_k^{(z)}(m_j^{(z)}) / Σ_{k=1}^{K} P(i|b_k(·))    (17) \n\n5  Experimental results \n\nTo investigate the performance of the regularization methods described above, an HMM speech recognition system for the speaker-independent Resource Management (RM) continuous speech task is built up. For training, 3990 sentences from 109 different speakers are used. 
Recognition results are given as word error rates averaged over the official DARPA RM test sets feb'89, oct'89, feb'91 and sep'92, consisting of 1200 sentences from 40 different speakers in total. Recognition is done via a beam-search-guided Viterbi decoder using the DARPA RM word-pair grammar (perplexity: 60). \nAs acoustic features, every 10 ms 12 MFCC coefficients and the relative signal power are extracted from the speech signal along with the dynamic delta- and delta-delta-features, comprising 39 features per frame. The HMM system makes use of standard 3-state phonetic models with discrete output probabilities. Four different neural networks, trained by the MMI method that is described in (Rigoll, 1997) and extended in (Neukirchen, 1998), are used as VQs to quantize the features into Z = 4 different streams of discrete labels. The codebook size in each stream is set to 200. \nA simple system with models for 47 monophones and for the 33 most prominent function words (394 states in total) yields a word error rate of 8.6%. A system that makes use of the more detailed (but untied) word-internal triphone models (6921 states in total) yields 12.2% word error. Hence HMM overfitting because of insufficient training data is a severe problem in this case. Traditional methods to overcome the effects of overfitting, like interpolating between triphones and monophones (Bahl, 1983), data-driven state-clustering and decision tree clustering, yield error rates of 6.5%, 8.3% and 6.4%, respectively. It must be noted that, in contrast to the usual training procedure in (Rigoll, 1996), no further smoothing methods are applied to the HMM emission probabilities here. \nIn a first series of experiments the untied triphone system is regularized by a quite simple mixture of I = 394 density components, i.e. the number of clusters in the penalty term is identical to the number of states in the monophone system. In this case the prototype probabilities are initialized by the emission probabilities of the monophone system; the mixture weights and the deviation parameters in the regularizer are initially set to be uniform. In order to test the influence of the tradeoff parameter ν, it is set to 50, 10 and 2, respectively. The corresponding word error rates are 8.4%, 6.9% and 6.3%, respectively. In the case of large ν, regularization degrades to a tying of triphone states to monophone states and the error rate tends towards the monophone system performance. For smaller ν there is a good tradeoff between data fitting and HMM smoothness, yielding improved system performance. The initial prototype probability settings provided by the monophone system do not seem to be changed much by regularizer parameter estimation, since the system performance only changes slightly when the regularizer's parameter reestimation is not incorporated. \nIn preliminary experiments the regularization method is also used for speaker adaptation. A speaker-independent system trained on the Wall Street Journal (WSJ) database yields an error rate of 32.4% on the Nov. 93 S3 test set with 10 different non-native speakers. The speaker-independent HMM emission probabilities are used to initialize the prototype probabilities of the regularizer. Then, speaker-dependent systems are built up for each speaker using only 40 fast enrollment sentences for training along with regularization (ν is set to 10). The error rate drops to 25.7%, which is better than the speaker adaptation method described in (Rottland, 1998) that yields 27.3% by a linear feature space transformation. In combination, both methods achieve 23.0% word error. 
\n\n6  Summary and Discussion \n\nA method to avoid parameter overfitting in HMM systems by application of a regularization term that favors smooth and simple models has been presented here. The complexity measure applied to the HMMs is based on a finite mixture of negative exponential distributions that generates the state-dependent emission probabilities. This kind of regularization term can be interpreted as soft state-tying, since it forces the HMM emission probabilities to form a finite set of clusters. The effect of regularization has been demonstrated on the RM task by improving overfitted triphone models. On a WSJ non-native speaker adaptation task with limited training data, regularization outperforms feature space transformations. \nEqn. (4) may also be interpreted from a perspective of Bayesian inference: the term ν · Ω plays the role of setting a prior distribution on the HMM parameters to be estimated. Hence, the use of a mixture model for Ω is equivalent to using a special kind of prior in the framework of MAP estimation for HMMs (Gauvain, 1994). \n\nReferences \n\nL.R. Bahl, F. Jelinek, R.L. Mercer, (1983) A maximum likelihood approach to continuous speech recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 5(2):179-190. \nL.R. Bahl, P.V. de Souza, P.S. Gopalakrishnan, D. Nahamoo, M.A. Picheny, (1991) Context dependent modeling of phones in continuous speech using decision trees. Proc. DARPA Speech and Natural Language Processing Workshop, 264-270. \nE.B. Baum, D. Haussler, (1989) What size net gives valid generalization? Neural Computation, 1:151-160. \nY. le Cun, J. Denker, S. Solla, R.E. Howard, L.D. Jackel, (1990) Optimal brain damage. Advances in Neural Information Processing Systems 2, San Mateo, CA, Morgan Kaufmann. \nJ.L. Gauvain, C.-H. Lee, (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291-298. \nK.J. Lang, A.H. Waibel, G.E. Hinton, (1990) A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \nCh. Neukirchen, D. Willett, S. Eickeler, S. Muller, (1998) Exploiting acoustic feature correlations by joint neural vector quantizer design in a discrete HMM system. Proc. ICASSP'98, 5-8. \nS.J. Nowlan, G.E. Hinton, (1992) Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473-493. \nD.C. Plaut, S.J. Nowlan, G.E. Hinton, (1986) Experiments on learning by back-propagation. Technical report CMU-CS-86-126, Carnegie-Mellon University, Pittsburgh, PA. \nG. Rigoll, Ch. Neukirchen, J. Rottland, (1996) A new hybrid system based on MMI-neural networks for the RM speech recognition task. Proc. ICASSP'96, 865-868. \nG. Rigoll, Ch. Neukirchen, (1997) A new approach to hybrid HMM/ANN speech recognition using mutual information neural networks. Advances in Neural Information Processing Systems 9, Cambridge, MA, MIT Press, 772-778. \nJ. Rottland, Ch. Neukirchen, G. Rigoll, (1998) Speaker adaptation for hybrid MMI-connectionist speech recognition systems. Proc. ICASSP'98, 465-468. \nS.J. Young, (1992) The general use of tying in phoneme-based HMM speech recognizers. Proc. ICASSP'92, 569-572. \nS.J. Young, P.C. Woodland, (1993) The use of state tying in continuous speech recognition. Proc. Eurospeech'93, 2203-2206. \n", "award": [], "sourceid": 1519, "authors": [{"given_name": "Christoph", "family_name": "Neukirchen", "institution": null}, {"given_name": "Gerhard", "family_name": "Rigoll", "institution": null}]}