{"title": "Speech Recognition using SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1197, "page_last": 1204, "abstract": null, "full_text": "Speech Recognition using SVMs \n\nNathan Smith \n\nCambridge University \n\nEngineering Dept \n\nCambridge, CB2 1PZ, U.K. \n\nndsl 002@eng.cam.ac.uk \n\nMark Gales \n\nCambridge University \n\nEngineering Dept \n\nCambridge, CB2 1PZ, U.K. \n\nmjfg@eng.cam.ac.uk \n\nAbstract \n\nAn important issue in applying SVMs to speech recognition is the \nability to classify variable length sequences. This paper presents \nextensions to a standard scheme for handling this variable length \ndata, the Fisher score. A more useful mapping is introduced based \non the likelihood-ratio. The score-space defined by this mapping \navoids some limitations of the Fisher score. Class-conditional gen(cid:173)\nerative models are directly incorporated into the definition of the \nscore-space. The mapping, and appropriate normalisation schemes, \nare evaluated on a speaker-independent isolated letter task where \nthe new mapping outperforms both the Fisher score and HMMs \ntrained to maximise likelihood. \n\n1 \n\nIntroduction \n\nSpeech recognition is a complex, dynamic classification task. State-of-the-art sys(cid:173)\ntems use Hidden Markov Models (HMMs), either trained to maximise likelihood or \ndiscriminatively, to achieve good levels of performance. One of the reasons for the \npopularity of HMMs is that they readily handle the variable length data sequences \nwhich result from variations in word sequence, speaker rate and accent. Support \nVector Machines (SVMs) [1] are a powerful, discriminatively-trained technique that \nhave been shown to work well on a variety of tasks. However they are typically only \napplied to static binary classification tasks. This paper examines the application \nof SVMs to speech recognition. There are two major problems to address. 
First, how to handle the variable length sequences. Second, how to handle multi-class decisions. This paper concentrates only on dealing with variable length sequences. It develops our earlier research in [2] and is detailed more fully in [7]. A similar approach for protein classification is adopted in [3].

There have been a variety of methods proposed to map variable length sequences to vectors of fixed dimension. These include vector averaging and selecting a 'representative' number of observations from each utterance. However, these methods may discard useful information. This paper adopts an approach similar to that of [4] which makes use of all the available data. Their scheme uses generative probability models of the data to define a mapping into a fixed dimension space, the Fisher score-space. When incorporated within an SVM kernel, the kernel is known as the Fisher kernel. Relevant regularisation issues are discussed in [5]. This paper examines the suitability of the Fisher kernel for classification in speech recognition and proposes an alternative, more useful, kernel. In addition, some normalisation issues associated with using this kernel for speech recognition are addressed.

Initially a general framework for defining alternative score-spaces is required. First, define an observation sequence as O = (o_1, ..., o_t, ..., o_T) where o_t \in \mathbb{R}^D, and a set of generative probability models of the observation sequences as P = {p_k(O|\theta_k)}, where \theta_k is the vector of parameters for the kth member of the set. The observation sequence O can be mapped into a vector of fixed dimension [4],

\varphi_F^f(O) = F f(P)    (1)

f(\cdot) is the score-argument and is a function of the members of the set of generative models P. \varphi_F^f is the score-mapping and is defined using a score-operator F. \varphi_F^f(O) is the score and occupies the fixed-dimension score-space.
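As an illustration of this mapping, the sketch below uses a single one-dimensional Gaussian as the generative model, its log-likelihood as the score-argument f, and the 1st-order derivative as the score-operator F. The function name and parameters are hypothetical, chosen only to show how sequences of different lengths land in the same fixed-dimension score-space.

```python
import numpy as np

def score_map(obs, mu, var):
    """Map a variable-length 1-D sequence to a fixed-dimension score.

    Illustrative sketch: the generative model is a single Gaussian
    N(mu, var), the score-argument f is its log-likelihood, and the
    score-operator F is the 1st-order derivative w.r.t. (mu, var).
    """
    obs = np.asarray(obs, dtype=float)
    d_mu = np.sum((obs - mu) / var)                         # d/dmu  ln p(O|theta)
    d_var = np.sum(((obs - mu) ** 2 - var) / (2 * var ** 2))  # d/dvar ln p(O|theta)
    return np.array([d_mu, d_var])

# Sequences of different lengths map to the same 2-dimensional score-space.
print(score_map([0.1, -0.2, 0.3], mu=0.0, var=1.0).shape)             # (2,)
print(score_map([0.5, 0.1, -0.4, 0.2, 0.0], mu=0.0, var=1.0).shape)   # (2,)
```

Whatever the sequence length T, the score has the dimensionality of the model's parameter vector, which is what makes a fixed-dimension classifier applicable.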
Our investigation of score-spaces falls into three divisions: what are the best generative models, score-arguments and score-operators to use?

2 Score-spaces

As HMMs have proved successful in speech recognition, they are a natural choice as the generative models for this task, in particular HMMs with state output distributions formed by Gaussian mixture models. There is also the choice of the score-argument. For a two-class problem, let p_i(O|\theta_i) represent a generative model, where i \in {g, 1, 2} (g denotes the global 2-class generative model, and 1 and 2 denote the class-conditional generative models for the two competing classes). Previous schemes have used the log of a single generative model, \ln p_i(O|\theta_i), representing either both classes as in the original Fisher score (i = g) [4], or one of the classes (i = 1 or 2) [6]. This score-space is termed the likelihood score-space, \varphi_F^{lik}(O). The score-space proposed in this paper uses the log of the ratio of the two class-conditional generative models, \ln(p_1(O|\theta_1)/p_2(O|\theta_2)), where \theta = [\theta_1^T, \theta_2^T]^T. The corresponding score-space is called the likelihood-ratio score-space, \varphi_F^{lr}(O). Thus,

\varphi_F^{lik}(O) = F \ln p_i(O|\theta_i)    (2)

\varphi_F^{lr}(O) = F \ln(p_1(O|\theta_1)/p_2(O|\theta_2))    (3)

The likelihood-ratio score-space can be shown to avoid some of the limitations of the likelihood score-space, and may be viewed as a generalisation of the standard generative model classifier. These issues will be discussed later.

Having proposed forms for the generative models and score-arguments, the score-operators must be selected. The original score-operator in [4] was the 1st-order derivative operator applied to HMMs with discrete output distributions. Consider a continuous density HMM with N emitting states, j \in {1 ... N}. Each state, j, has an output distribution formed by a mixture of K Gaussian components, N(\mu_{jk}, \Sigma_{jk}) where k \in {1 ... K}.
Each component has parameters of weight w_{jk}, mean \mu_{jk} and covariance \Sigma_{jk}. The 1st-order derivatives of the log probability of the sequence O with respect to the model parameters are given below[1], where the derivative operator has been defined to give column vectors,

\nabla_{\mu_{jk}} \ln p(O|\theta) = \sum_{t=1}^T \gamma_{jk}(t) S_{[t,jk]}    (4)

\nabla_{vec(\Sigma_{jk})} \ln p(O|\theta) = \frac{1}{2} \sum_{t=1}^T \gamma_{jk}(t) vec(S_{[t,jk]} S_{[t,jk]}^T - \Sigma_{jk}^{-1})    (5)

\nabla_{w_{jk}} \ln p(O|\theta) = \sum_{t=1}^T \frac{\gamma_{jk}(t)}{w_{jk}}    (6)

where

S_{[t,jk]} = \Sigma_{jk}^{-1} (o_t - \mu_{jk})    (7)

\gamma_{jk}(t) is the posterior probability of component k of state j at time t. Assuming the HMM is left-to-right with no skips and assuming that a state only appears once in the HMM (i.e. there is no state-tying), then the 1st-order derivative for the self-transition probability for state j, a_{jj}, is,

\nabla_{a_{jj}} \ln p(O|\theta) = \frac{\sum_{t=1}^T \gamma_j(t) - 1}{a_{jj}} - \frac{1}{1 - a_{jj}}    (8)

[1] For fuller details of the derivations see [2].

The 1st-order derivatives for each Gaussian parameter and self-transition probability in the HMM can be spliced together into a 'super-vector' which is the score[2].

From the definitions above, the score for an utterance is a weighted sum of scores for individual observations. If the scores for the same utterance spoken at different speaking rates were calculated, they would lie in different regions of score-space simply because of differing numbers of observations. To ease the task of the classifier in score-space, the score-space may be normalised by the number of observations, called sequence length normalisation. Duration information can be retained in the derivatives of the transition probabilities. One method of normalisation redefines score-spaces using generative models trained to maximise a modified log likelihood function, \ln \hat{p}(O|\theta). Consider that state j has entry time \tau_j and duration d_j (both in numbers of observations) and output probability b_j(o_t) for observation o_t [7]. So,

\ln \hat{p}(O|\theta) = \sum_{j=1}^N \frac{1}{d_j} \left( (d_j - 1) \ln a_{jj} + \ln a_{j(j+1)} + \sum_{t=\tau_j}^{\tau_j + d_j - 1} \ln b_j(o_t) \right)    (9)

It is not possible to maximise \ln \hat{p}(O|\theta) using the EM algorithm.
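To make the weighted-sum structure of the derivatives in Equation 4 concrete, the sketch below computes the mean-derivative score for a single-state, diagonal-covariance Gaussian mixture, so that the posterior reduces to the within-mixture component posterior. This is an illustrative toy, with hypothetical function names and parameters, not the system used in the experiments.

```python
import numpy as np

def mean_derivative_scores(obs, weights, means, variances):
    """1st-order derivative of ln p(O|theta) w.r.t. each component mean
    of a one-state, diagonal-covariance Gaussian mixture: a sketch of
    Equation 4 with the state posterior fixed at 1."""
    obs = np.asarray(obs, dtype=float)        # shape (T, D)
    K, D = means.shape
    scores = np.zeros((K, D))
    for o_t in obs:
        # component likelihoods w_k N(o_t; mu_k, Sigma_k), diagonal Sigma
        comp = weights * np.prod(
            np.exp(-0.5 * (o_t - means) ** 2 / variances)
            / np.sqrt(2.0 * np.pi * variances), axis=1)
        gamma = comp / comp.sum()             # component posteriors gamma_k(t)
        # weighted sum of S_[t,k] = Sigma_k^{-1} (o_t - mu_k)
        scores += gamma[:, None] * (o_t - means) / variances
    return scores.ravel()                     # spliced into a 'super-vector'
```

As in the text, the score for an utterance is a weighted sum of per-observation scores; splicing such derivatives for every parameter gives the super-vector described above.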
Hill-climbing techniques could be used. However, in this paper, a simpler normalisation method is employed. The generative models are trained to maximise the standard likelihood function. Rather than define the score-space using standard state posteriors \gamma_j(t) (the posterior probability of state j at time t), it is defined on state posteriors normalised by the total state occupancy over the utterance. The standard component posteriors \gamma_{jk}(t) are replaced in Equations 4 to 6 and 8 by their normalised form \hat{\gamma}_{jk}(t),

\hat{\gamma}_{jk}(t) = \frac{\gamma_j(t)}{\sum_{\tau=1}^T \gamma_j(\tau)} \left( \frac{w_{jk} N(o_t; \mu_{jk}, \Sigma_{jk})}{\sum_{i=1}^K w_{ji} N(o_t; \mu_{ji}, \Sigma_{ji})} \right)    (10)

In effect, each derivative is divided by the sum of state posteriors. This is preferred to division by the total number of observations T, which assumes that when the utterance length varies, the occupation of every state in the state sequence is scaled by the same ratio. This is not necessarily the case for speech.

[2] Due to the sum-to-unity constraints, one of the weight parameters in each Gaussian mixture is discarded from the definition of the super-vector, as are the forward transitions in the left-to-right HMM with no skips.

The nature of the score-space affects the discriminative power of classifiers built in the score-space. For example, the likelihood score-space defined on a two-class generative model is susceptible to wrap-around [7]. This occurs when two different locations in acoustic-space map to a single point in score-subspace. As an example, consider two classes modelled by two widely-spaced Gaussians. If an observation is generated at the peak of the first Gaussian, then the derivative relative to the mean of that Gaussian is zero because S_{[t,jk]} is zero (see Equation 4). However, the derivative relative to the mean of the distant second Gaussian is also zero because of a zero component posterior \gamma_{jk}(t).
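The zero-derivative effect at the first Gaussian's peak can be checked numerically. The sketch below uses two widely-spaced one-dimensional Gaussians with made-up parameters; both mean-derivatives evaluate to effectively zero for an observation at the first peak.

```python
import numpy as np

def mean_grads(o, w, mus, var):
    """Per-component mean-derivatives gamma_k * (o - mu_k) / var for a
    two-component 1-D mixture (illustrative parameters only)."""
    ll = w * np.exp(-0.5 * (o - mus) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    gamma = ll / ll.sum()          # component posteriors
    return gamma * (o - mus) / var

# Two widely-spaced Gaussians; observation at the peak of the first one.
grads = mean_grads(0.0, np.array([0.5, 0.5]), np.array([0.0, 20.0]), 1.0)
# d/dmu_1 is zero because (o - mu_1) = 0; d/dmu_2 is effectively zero
# because the posterior gamma_2 vanishes for a distant observation.
print(grads)
```

Both entries of `grads` are numerically indistinguishable from zero, which is the ambiguity described in the text.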
A similar problem occurs with an observation generated at the peak of the second Gaussian. This ambiguity in mapping two possible locations in acoustic-space to the zero of the score-subspace of the means represents a wrapping of the acoustic space onto this subspace. This also occurs in the subspace of the variances. Thus wrap-around can increase class confusion. A likelihood-ratio score-space defined on these two Gaussians does not suffer wrap-around since the component posteriors for each Gaussian are forced to unity.

So far, only 1st-order derivative score-operators have been considered. It is possible to include the zeroth, 2nd and higher-order derivatives. Of course there is an interaction between the score-operator and the score-argument. For example, the zeroth-order derivative for the likelihood score-space is expected to be less useful than its counterpart in the likelihood-ratio score-space because of its greater sensitivity to acoustic conditions. A principled approach to using derivatives in score-spaces would be useful. Consider the simple case of true class-conditional generative models p_1(O|\theta_1) and p_2(O|\theta_2) with respective estimates of the same functional form p_1(O|\hat{\theta}_1) and p_2(O|\hat{\theta}_2). Expressing the true models as Taylor series expansions about the parameter estimates \hat{\theta}_1 and \hat{\theta}_2 (see [7] for more details, and [3]),

\ln p_i(O|\theta_i) = \ln p_i(O|\hat{\theta}_i) + (\theta_i - \hat{\theta}_i)^T \nabla_{\hat{\theta}_i} \ln p_i(O|\hat{\theta}_i) + \frac{1}{2} (\theta_i - \hat{\theta}_i)^T [\nabla_{\hat{\theta}_i} \nabla_{\hat{\theta}_i}^T \ln p_i(O|\hat{\theta}_i)] (\theta_i - \hat{\theta}_i) + O_{\hat{\theta}_i}(3)
                   = w_i^T [1, \nabla_{\hat{\theta}_i}^T, vec(\nabla_{\hat{\theta}_i} \nabla_{\hat{\theta}_i}^T)^T ...]^T \ln p_i(O|\hat{\theta}_i)    (11)

The output from the operator in square brackets is an infinite number of derivatives arranged as a column vector. w_i is also a column vector.
The expressions for the two true models can be incorporated into an optimal minimum Bayes error decision rule as follows, where \hat{\theta} = [\hat{\theta}_1^T, \hat{\theta}_2^T]^T, w = [w_1^T, w_2^T]^T, and b encodes the class priors,

\ln p_1(O|\theta_1) - \ln p_2(O|\theta_2) + b \gtrless 0

w_1^T [1, \nabla_{\hat{\theta}_1}^T, vec(\nabla_{\hat{\theta}_1} \nabla_{\hat{\theta}_1}^T)^T ...]^T \ln p_1(O|\hat{\theta}_1) - w_2^T [1, \nabla_{\hat{\theta}_2}^T, vec(\nabla_{\hat{\theta}_2} \nabla_{\hat{\theta}_2}^T)^T ...]^T \ln p_2(O|\hat{\theta}_2) + b \gtrless 0

w^T [1, \nabla_{\hat{\theta}}^T, vec(\nabla_{\hat{\theta}} \nabla_{\hat{\theta}}^T)^T ...]^T \ln \frac{p_1(O|\hat{\theta}_1)}{p_2(O|\hat{\theta}_2)} + b \gtrless 0

w^T \varphi^{lr}(O) + b \gtrless 0    (12)

\varphi^{lr}(O) is a score in the likelihood-ratio score-space formed by an infinite number of derivatives with respect to the parameter estimates \hat{\theta}. Therefore, the optimal decision rule can be recovered by constructing a well-trained linear classifier in \varphi^{lr}(O). In this case, the standard SVM margin can be interpreted as the log posterior margin. This justifies the use of the likelihood-ratio score-space and encourages the use of higher-order derivatives. However, most HMMs used in speech recognition are 1st-order Markov processes but speech is a high-order or infinite-order Markov process. Therefore, a linear decision boundary in the likelihood-ratio score-space defined on 1st-order Markov model estimates is unlikely to be sufficient for recovering the optimal decision rule due to model incorrectness. However, powerful non-linear classifiers may be trained in such a likelihood-ratio score-space to try to compensate for model incorrectness and approximate the optimal decision rule. SVMs with non-linear kernels such as polynomials or Gaussian Radial Basis Functions (GRBFs) may be used. Although gains are expected from incorporating higher-order derivatives into the score-space, the size of the score-space dramatically increases.
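The growth can be made concrete with a rough parameter count. Assuming HMMs with N states and K diagonal-covariance Gaussian components of dimension D, with one weight per mixture and the forward transitions discarded as above, the sketch below counts the score-space dimensionality for each derivative order (a back-of-envelope illustration; the function is hypothetical):

```python
def score_space_dims(N, K, D, order):
    """Rough size of a likelihood-ratio score-space built on two HMMs,
    each with N states and K diagonal-covariance Gaussian components of
    dimension D. Counts the zeroth-order term (1), the 1st-order block
    (P parameters per model) and a full 2nd-order block (P^2 per model,
    ignoring symmetry): a back-of-envelope sketch only."""
    # per-model parameters: means + diagonal variances + free weights + self-transitions
    P = N * K * D + N * K * D + N * (K - 1) + N
    dims = 1                 # zeroth-order: the log likelihood-ratio itself
    if order >= 1:
        dims += 2 * P        # 1st-order derivatives w.r.t. both models
    if order >= 2:
        dims += 2 * P * P    # full 2nd-order blocks
    return dims

# 10-state, 2-component, 39-dimensional models as in the experiments:
print(score_space_dims(10, 2, 39, 1))   # 3161, the wmvtr dimensionality in Table 2
print(score_space_dims(10, 2, 39, 2))   # 4995961: millions of dimensions
```

Truncating after the 1st-order derivatives keeps the space in the low thousands of dimensions; a full 2nd-order block would inflate it by three orders of magnitude.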
Therefore, practical systems may truncate the likelihood-ratio score-space after the 1st-order derivatives, and hence use linear approximations to the Taylor series expansions[3]. However, an example of a 2nd-order derivative is \nabla_{\mu_{jk}} (\nabla_{\mu_{jk}}^T \ln p(O|\theta)),

\nabla_{\mu_{jk}} (\nabla_{\mu_{jk}}^T \ln p(O|\theta)) \approx - \sum_{t=1}^T \gamma_{jk}(t) \Sigma_{jk}^{-1}    (13)

For simplicity the component posterior \gamma_{jk}(t) is assumed independent of \mu_{jk}. Once the score-space has been defined, an SVM classifier can be built in the score-space. If standard linear, polynomial or GRBF kernels are used in the score-space, then the space is assumed to have a Euclidean metric tensor. Therefore, the score-space should first be whitened (i.e. decorrelated and scaled) before the standard kernels are applied. Failure to perform such score-space normalisation for a linear kernel in score-space results in a kernel similar to the Plain kernel [5]. This is expected to perform poorly when different dimensions of score-space have different dynamic ranges [2]. Simple scaling has been found to be a reasonable approximation to full whitening and avoids inverting large matrices in [2] (though for classification of single observations rather than sequences, on a different database). The Fisher kernel in [4] uses the Fisher Information matrix to normalise the score-space. This is only an acceptable normalisation for a likelihood score-space under conditions that give a zero expectation in score-space. The appropriate SVM kernel to use between two utterances O_i and O_j in the normalised score-space is therefore the Normalised kernel, k_N(O_i, O_j) (where \Sigma_{sc} is the covariance matrix in score-space),

k_N(O_i, O_j) = \varphi^T(O_i) \Sigma_{sc}^{-1} \varphi(O_j)    (14)

3 Experimental Results

The ISOLET speaker-independent isolated letter database [8] was used for evaluation. The data was coded at a 10 msec frame rate with a 25.6 msec window-size.
The data was parameterised into 39-dimensional feature vectors including 12 MFCCs and a log energy term with corresponding delta and acceleration parameters. 240 utterances per letter from isolet{1,2,3,4} were used for training and 60 utterances per letter from isolet5 for testing. There was no overlap between the training and test speakers. Two sets of letters were tested, the highly confusable E-set, {B C D E G P T V Z}, and the full 26 letters. The baseline HMM system was well-trained to maximise likelihood. Each letter was modelled by a 10-emitting-state left-to-right continuous density HMM with no skips, and silence by a single-emitting-state HMM with no skips. Each state output distribution had the same number of Gaussian components with diagonal covariance matrices. The models were tested using a Viterbi recogniser constrained to a silence-letter-silence network.

[3] It is useful to note that a linear decision boundary, with zero bias, constructed in a single-dimensional likelihood-ratio score-space formed by the zeroth-order derivative operator would, under equal class priors, give the standard minimum Bayes error classifier.

The baseline HMMs were used as generative models for SVM kernels. A modified version of SVMlight Version 3.02 [9] was used to train 1v1 SVM classifiers on each possible class pairing. The sequence length normalisation in Equation 10, and simple scaling for score-space normalisation, were used during training and testing. Linear kernels were used in the normalised score-space, since they gave better performance than GRBFs of variable width and polynomial kernels of degree 2 (including homogeneous, inhomogeneous, and inhomogeneous with zero-mean score-space). The linear kernel did not require parameter-tuning and, in initial experiments, was found to be fairly insensitive to variations in the SVM trade-off parameter C.
C was fixed at 100, and biased hyperplanes were permitted. A variety of score-subspaces were examined. The abbreviations m, v, w and t refer to the score-subspaces \nabla_{\mu_{jk}} \ln p_i(O|\theta_i), \nabla_{vec(\Sigma_{jk})} \ln p_i(O|\theta_i), \nabla_{w_{jk}} \ln p_i(O|\theta_i) and \nabla_{a_{jj}} \ln p_i(O|\theta_i) respectively. l refers to the log likelihood \ln p_i(O|\theta_i) and r to the log likelihood-ratio \ln[p_2(O|\theta_2)/p_1(O|\theta_1)]. The binary SVM classification results (and, as a baseline, the binary HMM results) were combined to obtain a single classification for each utterance. This was done using a simple majority voting scheme among the full set of 1v1 binary classifiers (for tied letters, the relevant 1v1 classifiers were inspected and then, if necessary, random selection performed [2]).

Table 1: Error-rates (%) for HMM baselines and SVM score-spaces (E-set)

Num comp.      HMM                          SVM score-space
per class   min. Bayes   majority   lik-ratio       lik         lik
per state   error        voting     (stat. sign.)   (1-class)   (2-class)
1           11.3         11.3       6.9 (99.8%)     7.6         6.1
2           8.7          8.7        5.0 (98.9%)     6.3         9.3
4           6.7          6.7        5.7 (13.6%)     8.0         23.2
6           7.2          7.2        6.1 (59.5%)     7.8         30.6

Table 1 compares the baseline HMM and SVM classifiers as the complexity of the generative models was varied. Statistical significance confidence levels are given in brackets comparing the baseline HMM and SVM classifiers with the same generative models, where 95% was taken as a significant result (confidence levels were defined by (100 - P), where P was given by McNemar's Test and was the percentage probability that the two classifiers had the same error rates and differences were simply due to random error; for this, a decision by random selection for tied letters was assigned to an 'undecided' class [7]). The baseline HMMs were comparable to reported results on the E-set for different databases [10].
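The 1v1 majority-voting combination described above can be sketched as follows. This is a simplified illustration with hypothetical names; the tie-handling here is plain random selection rather than the classifier-inspection step used in the experiments.

```python
from collections import Counter
import itertools
import random

def majority_vote(classes, pairwise_decide, rng=random.Random(0)):
    """Combine 1v1 binary classifiers into a single decision by
    majority voting over all class pairings; ties are broken by random
    selection (a simplified sketch of the scheme described above)."""
    votes = Counter()
    for a, b in itertools.combinations(classes, 2):
        votes[pairwise_decide(a, b)] += 1       # winner of this 1v1 pairing
    top = max(votes.values())
    winners = [c for c, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Toy pairwise rule: always prefer the letter earlier in the alphabet.
print(majority_vote(list("BCDE"), lambda a, b: min(a, b)))   # 'B'
```

With C classes this evaluates C(C-1)/2 binary classifiers per utterance, which is 36 pairings for the 9-letter E-set and 325 for the full alphabet.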
The majority voting scheme gave the same performance as the minimum Bayes error scheme, indicating that majority voting was an acceptable multi-class scheme for the E-set experiments. For the SVMs, each likelihood-ratio score-space was defined using its competing class-conditional generative models and projected into an mr score-space. Each likelihood (1-class) score-space was defined using only the generative model for the first of its two classes, and projected into an ml score-space. Each likelihood (2-class) score-space was defined using a generative model for both of its classes, and projected into an ml score-space (the original Fisher score, which is a projection into its m score-subspace, was also tested but was found to yield slightly higher error rates). SVMs built using the likelihood-ratio score-space achieved lower error rates than HMM systems, as low as 5.0%. The likelihood (1-class) score-space performed slightly worse than the likelihood-ratio score-space because it contained about half the information and did not contain the log likelihood-ratio. In both cases, the optimum number of components in the generative models was 2 per state, possibly reflecting the gender division within each class. The likelihood (2-class) score-space performed poorly, possibly because of wrap-around. However, there was an exception for generative models with 1 component per class per state (in total the models had 2 components per state since they modelled both classes). The 2 components per state did not generally reflect the gender division in the 2-class data, as first supposed, but the class division. A possible explanation is that each Gaussian component modelled a class with a bi-modal distribution caused by gender differences. Most of the data modelled did not sit at the peaks of the two Gaussians and was not mapped to the ambiguous zero in score-subspace.
Hence there was still sufficient class discrimination in score-space [7]. This task was too small to fully assess possible decorrelation in error structure between HMM and SVM classifiers [6].

Without scaling for score-space normalisation, the error-rate for the likelihood-ratio score-space defined on models with 2 components per state increased from 5.0% to 11.1%. Some likelihood-ratio mr score-spaces were then augmented with 2nd-order derivatives \nabla_{\mu_{jk}} (\nabla_{\mu_{jk}}^T \ln p(O|\theta)). The resulting classifiers showed increases in error rate. The disappointing performance was probably due to the simplicity of the task, the independence assumption between component posteriors and component means, and the effect of noise with so few training scores in such large score-spaces.

It is known that some dimensions of feature-space are noisy and degrade classification performance. For this reason, experiments were performed which selected subsets of the likelihood-ratio score-space and then built SVM classifiers in those score-subspaces. First, the score-subspaces were selected by parameter type. Error rates for the resulting classifiers, otherwise identical to the baseline SVMs, are detailed in Table 2. Again, the generative models were class-conditional HMMs with 2 components per state. The log likelihood-ratio was shown to be a powerful discriminating feature[4]. Increasing the number of dimensions in score-space allowed more discriminative classifiers. There was more discrimination, or less noise, in the derivatives of the component means than the component variances. As expected in a dynamic task, the derivatives of the transitions were also useful since they contained some duration information.
\n\nTable 2: Error rates for subspaces of the likelihood-ratio score-space (E-set) \n\nscore-space \n\nerror rate, % \n\nr \nv \nm \nmv \nmvt \nwmvtr \n\n8.5 \n7.2 \n5.2 \n5.0 \n4.4 \n4.1 \n\nscore-space \ndimensionality \n1 \n1560 \n1560 \n3120 \n3140 \n3161 \n\nNext, subsets of the mr and wmvtr score-spaces were selected according to dimen(cid:173)\nsions with highest Fisher-ratios [7] . The lowest error rates for the mr and wmvtr \nscore-spaces were respectively 3.7% at 200 dimensions and 3.2% at 500 dimensions \n(respectively significant at 99.1% and 99.7% confidence levels relative to the best \nHMM system with 4 components per state). Generally, adding the most discrimina(cid:173)\ntive dimensions lowered error-rate until less discriminative dimensions were added. \nFor most binary classifiers, the most discriminative dimension was the log likelihood(cid:173)\nratio. As expected for the E-set, the most discriminative dimensions were dependent \non initial HMM states. The low-order MFCCs and log energy term were the most \nimportant coefficients. Static, delta and acceleration streams were all useful. \n\n4The error rate at 8.5% differed from that for the HMM baseline at 8.7% because of \n\nthe non-zero bias for the SVMs. \n\n\fThe HMM and SVM classifiers were run on the full alphabet. The best HMM clas(cid:173)\nsifier, with 4 components per state, gave 3.4% error rate. Computational expense \nprecluded a full optimisation of the SVM classifier. However, generative models \nwith 2 components per state and a wmvtr score-space pruned to 500 dimensions by \nFisher-ratios, gave a lower error rate of 2.1% (significant at a 99.0% confidence level \nrelative to the HMM system). Preliminary experiments evaluating sequence length \nnormalisation on the full alphabet and E-set are detailed in [7]. \n\n4 Conclusions \n\nIn this work, SVMs have been successfully applied to the classification of speech \ndata. 
The paper has concentrated on the nature of the score-space when handling variable length speech sequences. The standard likelihood score-space of the Fisher kernel has been extended to the likelihood-ratio score-space, and normalisation schemes introduced. The new score-space avoids some of the limitations of the Fisher score-space, and incorporates the class-conditional generative models directly into the SVM classifier. The different score-spaces have been compared on a speaker-independent isolated letter task. The likelihood-ratio score-space out-performed the likelihood score-spaces and HMMs trained to maximise likelihood.

Acknowledgements

N. Smith would like to thank EPSRC; his CASE sponsor, the Speech Group at IBM U.K. Laboratories; and Thorsten Joachims, University of Dortmund, for SVMlight.

References

[1] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[2] N. Smith, M. Gales, and M. Niranjan. Data-dependent kernels in SVM classification of speech patterns. Tech. Report CUED/F-INFENG/TR.387, Cambridge University Eng. Dept., April 2001.
[3] K. Tsuda et al. A New Discriminative Kernel from Probabilistic Models. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[4] T. Jaakkola and D. Haussler. Exploiting Generative Models in Discriminative Classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
[5] N. Oliver, B. Scholkopf, and A. Smola. Natural Regularization from Generative Models. Chapter in Advances in Large-Margin Classifiers. MIT Press, 2000.
[6] S. Fine, J. Navratil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, volume 1, May 2001. Utah, USA.
[7] N.
Smith and M. Gales. Using SVMs to classify variable length speech patterns. Tech. Report CUED/F-INFENG/TR.412, Cambridge University Eng. Dept., June 2001.
[8] M. Fanty and R. Cole. Spoken Letter Recognition. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Neural Information Processing Systems 3, pages 220-226. Morgan Kaufmann Publishers, 1991.
[9] T. Joachims. Making Large-Scale SVM Learning Practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[10] P.C. Loizou and A.S. Spanias. High-Performance Alphabet Recognition. IEEE Transactions on Speech and Audio Processing, 4(6):430-445, Nov. 1996.
", "award": [], "sourceid": 2044, "authors": [{"given_name": "N.", "family_name": "Smith", "institution": null}, {"given_name": "Mark", "family_name": "Gales", "institution": null}]}