{"title": "Dynamic Time-Alignment Kernel in Support Vector Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": "Dynamic Time-Alignment Kernel in\nSupport Vector Machine\nHiroshi Shimodaira\n\nSchool of Information Science,\nJapan Advanced Institute of\nScience and Technology\n\nsim@jaist.ac.jp\n\nKen-ichi Noma\n\nSchool of Information Science,\nJapan Advanced Institute of\nScience and Technology\n\nknoma@jaist.ac.jp\nMitsuru Nakai\n\nSchool of Information Science,\nJapan Advanced Institute of\nScience and Technology\n\nmit@jaist.ac.jp\n\nShigeki Sagayama\n\nGraduate School of Information Science\nand Technology,\nThe University of Tokyo\n\nsagayama@hil.t.u-tokyo.ac.jp\nAbstract\n\nA new class of Support Vector Machine (SVM) that is applica-\nble to sequential-pattern recognition such as speech recognition is\ndeveloped by incorporating an idea of non-linear time alignment\ninto the kernel function. Since the time-alignment operation of\nsequential pattern is embedded in the new kernel function, stan-\ndard SVM training and classification algorithms can be employed\nwithout further modifications. The proposed SVM (DTAK-SVM)\nis evaluated in speaker-dependent speech recognition experiments\nof hand-segmented phoneme recognition. Preliminary experimen-\ntal results show comparable recognition performance with hidden\nMarkov models (HMMs).\n1 Introduction\n\nSupport Vector Machine (SVM) [1] is one of the latest and most successful statistical\npattern classifier that utilizes a kernel technique [2, 3]. 
The basic form of an SVM classifier, which classifies an input vector x \in R^n, is expressed as

    g(x) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b,   (1)

where \Phi is a non-linear mapping function \Phi(x): R^n \to R^{n'} (n \le n'), ``\cdot'' denotes the inner product operator, x_i, y_i and \alpha_i are the i-th training sample, its class label, and its Lagrange multiplier, respectively, K is a kernel function, and b is a bias.
Despite the successful applications of SVMs in pattern recognition fields such as character recognition and text classification, SVMs have rarely been applied to speech recognition. This is because the SVM assumes that each sample is a vector of fixed dimension, and hence it cannot deal directly with variable-length sequences. For this reason, most of the efforts made so far to apply SVMs to speech recognition employ linear time normalization, where input feature-vector sequences of different lengths are aligned to the same length [4]. A variant of this approach is a hybrid of SVM and HMM (hidden Markov model), in which the HMM works as a pre-processor that feeds time-aligned, fixed-dimensional vectors to the SVM [5]. Another approach is to utilize probabilistic generative models as the SVM kernel function. This includes the Fisher kernels [6, 7] and conditional symmetric independence (CSI) kernels [8], both of which employ HMMs as the generative models. Since HMMs can treat sequential patterns, an SVM that employs HMM-based generative models can handle sequential patterns as well.
In contrast to those approaches, our approach is a direct extension of the original SVM to the case of variable-length sequences. The idea is to incorporate the operation of dynamic time alignment into the kernel function itself. For this reason, the proposed new SVM is called the ``Dynamic Time-Alignment Kernel SVM (DTAK-SVM)''. 
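As a concrete illustration of the decision function (1), the following is a minimal Python sketch; the function name and all toy values are our own, not from the paper:

```python
import numpy as np

def svm_decision(x, samples, labels, alphas, b, kernel):
    """Eq. (1): g(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, labels, samples)) + b

# Toy example with a linear kernel K(u, v) = u . v (values purely illustrative).
linear = lambda u, v: float(np.dot(u, v))
samples = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
g = svm_decision(np.array([1.0, 1.0]), samples, [+1, -1], [0.5, 0.5], 0.0, linear)
```

With these symmetric toy values the two terms cancel, placing the test point exactly on the decision boundary (g = 0).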
Unlike the SVM with a Fisher kernel, which requires two training stages with different training criteria (one for training the generative models and a second for training the SVM), the DTAK-SVM uses a single training criterion, as does the original SVM.

2 Dynamic Time-Alignment Kernel

We consider a sequence of vectors X = (x_1, x_2, ..., x_L), where x_i \in R^n and L is the length of the sequence; the notation |X| is sometimes used instead to represent the length of the sequence. For simplicity, we at first assume the so-called linear SVM that does not employ a non-linear mapping function \Phi. In that case, the kernel operation in (1) is identical to the inner product operation.

2.1 Formulation for linear kernel

Assume that we have two vector sequences X and V. If these two patterns are equal in length, i.e. |X| = |V| = L, then the inner product between X and V can be obtained easily as the summation of the inner products between x_k and v_k for k = 1, ..., L:

    X \cdot V = \sum_{k=1}^{L} x_k \cdot v_k,   (2)

and therefore an SVM classifier can be defined as given in (1). On the other hand, when the two sequences differ in length, the inner product cannot be calculated directly. Even in that case, however, an inner-product-like operation can be defined if we align the lengths of the patterns. To that end, let \psi(k), \theta(k) be the time-warping functions of normalized time frame k for the patterns X and V, respectively, and let ``\otimes'' be the new inner product operator in place of the original inner product ``\cdot''. Then the new inner product between the two vector sequences X and V is given by

    X \otimes V = \frac{1}{L} \sum_{k=1}^{L} x_{\psi(k)} \cdot v_{\theta(k)},   (3)

where L is a normalized length that can be either |X|, |V|, or an arbitrary positive integer.
There are two possible types of time-warping functions. 
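For given warping functions, the time-warped inner product of Eq. (3) can be sketched as follows (a toy illustration with hand-supplied warps; the function name is ours):

```python
import numpy as np

def warped_inner_product(X, V, psi, theta):
    """Eq. (3): (1/L) * sum_{k=1..L} x_{psi(k)} . v_{theta(k)},
    where psi and theta are 1-based frame-index warping functions."""
    L = len(psi)
    return sum(float(np.dot(X[psi[k] - 1], V[theta[k] - 1]))
               for k in range(L)) / L

# With identity warps and |X| = |V| = L, Eq. (3) reduces to Eq. (2) up to 1/L.
X = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
V = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
ip = warped_inner_product(X, V, psi=[1, 2], theta=[1, 2])  # (1 + 2) / 2 = 1.5
```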
One is a linear time-warping function and the other is a non-linear time-warping function. The linear time-warping function takes the form

    \psi(k) = \lceil (|X|/L) k \rceil,   \theta(k) = \lceil (|V|/L) k \rceil,

where \lceil x \rceil is the ceiling function, which gives the smallest integer greater than or equal to x. As can be seen from this definition, the linear warping function is not suitable for continuous speech recognition, i.e. frame-synchronous processing, because the sequence lengths |X| and |V| must be known beforehand. On the other hand, non-linear time warping, in other words dynamic time warping (DTW) [9], enables frame-synchronous processing. Furthermore, past research on speech recognition has shown that non-linear time normalization outperforms linear time normalization in recognition performance. For these reasons, we focus on non-linear time warping based on DTW.
Though the original DTW uses a distance/distortion measure and finds the optimal path that minimizes the accumulated distance/distortion, the DTW employed for the SVM uses the inner product or a kernel function instead and finds the optimal path that maximizes the accumulated similarity:

    X \otimes V = \max_{\psi, \theta} \frac{1}{M_{\psi\theta}} \sum_{k=1}^{L} m(k) \, x_{\psi(k)} \cdot v_{\theta(k)},   (4)

    subject to 1 \le \psi(k) \le \psi(k+1) \le |X|,  \psi(k+1) - \psi(k) \le Q,   (5)
               1 \le \theta(k) \le \theta(k+1) \le |V|,  \theta(k+1) - \theta(k) \le Q,

where m(k) is a nonnegative (path) weighting coefficient, M_{\psi\theta} is a (path) normalizing factor, and Q is a constant constraining the local continuity. 
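The ceiling-based linear time-warping function above can be sketched directly (a small illustration with our own toy lengths):

```python
import math

def linear_warp(length, L):
    """Linear time-warping: psi(k) = ceil((length / L) * k), for k = 1..L."""
    return [math.ceil(length * k / L) for k in range(1, L + 1)]

# Aligning a 4-frame and a 6-frame sequence to a common normalized length L = 6.
psi = linear_warp(4, 6)    # frame indices into X: some frames are repeated
theta = linear_warp(6, 6)  # frame indices into V: the identity warp here
```

Note how both sequence lengths must be known up front to build the warp, which is exactly why this scheme is unsuitable for frame-synchronous processing.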
In the standard DTW, the normalizing factor M_{\psi\theta} is given as \sum_{k=1}^{L} m(k), and the weighting coefficients m(k) are chosen so that M_{\psi\theta} is independent of the warping functions.
The above optimization problem can be solved efficiently by dynamic programming. The recursive formula of the dynamic programming employed in the present study is

    G(i, j) = \max \{ G(i-1, j) + Inp(i, j),  G(i-1, j-1) + 2\,Inp(i, j),  G(i, j-1) + Inp(i, j) \},   (6)

where Inp(i, j) is the standard inner product between the two vectors corresponding to points i and j. As a result, we have

    X \otimes V = G(|X|, |V|) / (|X| + |V|).   (7)

2.2 Formulation for non-linear kernel

In the last subsection, a linear kernel, i.e. the inner product, for two vector sequences of different lengths was formulated in the framework of dynamic time warping. With a small constraint, a similar formulation is possible for the case where the SVM's non-linear mapping function \Phi is applied to the vector sequences. To that end, \Phi is restricted to the following form:

    \Phi(X) = (\phi(x_1), \phi(x_2), ..., \phi(x_L)),   (8)

where \phi is a non-linear mapping function applied to each frame vector x_i, as given in (1). It should be noted that under this restriction \Phi preserves the original length of the sequence at the cost of losing long-term correlations, such as the one between x_1 and x_L. As a result, a new class of kernel can be defined by using the extended inner product introduced in the previous section:

    K_s(X, V) = \Phi(X) \otimes \Phi(V)   (9)
              = \max_{\psi, \theta} \frac{1}{M_{\psi\theta}} \sum_{k=1}^{L} m(k) \, \phi(x_{\psi(k)}) \cdot \phi(v_{\theta(k)})   (10)
              = \max_{\psi, \theta} \frac{1}{M_{\psi\theta}} \sum_{k=1}^{L} m(k) \, K(x_{\psi(k)}, v_{\theta(k)}). 
  (11)

We call this new kernel the ``dynamic time-alignment kernel (DTAK)''.

2.3 Properties of the dynamic time-alignment kernel

It has not been proven that the proposed function K_s(\cdot, \cdot) is an admissible SVM kernel, i.e. one that guarantees the existence of a feature space. This is because the mapping function to a feature space is not independent of, but dependent on, the given vector sequences. Although a class of data-dependent asymmetric kernels for SVM has been developed in [10], our proposed function is more complicated and difficult to analyze, because the input data is a vector sequence of variable length and non-linear time normalization is embedded in the function. What is known about the proposed function so far is that (1) K_s is symmetric, and (2) K_s satisfies the Cauchy-Schwarz-like inequality described below:

Proposition 1

    K_s(X, V)^2 \le K_s(X, X) \, K_s(V, V)   (12)

Proof. For simplicity, we assume that the normalized length L is fixed, and omit m(k) and M_{\psi\theta} in (11). Using the standard Cauchy-Schwarz inequality, the following holds:

    K_s(X, V) = \max_{\psi, \theta} \sum_{k=1}^{L} \phi(x_{\psi(k)}) \cdot \phi(v_{\theta(k)}) = \sum_{k=1}^{L} \phi(x_{\psi^*(k)}) \cdot \phi(v_{\theta^*(k)})   (13)
              \le \sum_{k=1}^{L} \| \phi(x_{\psi^*(k)}) \| \, \| \phi(v_{\theta^*(k)}) \|,   (14)

where \psi^*(k), \theta^*(k) represent the optimal warping functions that maximize the RHS of (13). On the other hand,

    K_s(X, X) = \max_{\psi, \theta} \sum_{k=1}^{L} \phi(x_{\psi(k)}) \cdot \phi(x_{\theta(k)}) = \sum_{k=1}^{L} \phi(x_{\psi^+(k)}) \cdot \phi(x_{\theta^+(k)}).   (15)

Because \psi^+(k), \theta^+(k) are here assumed to be the optimal warping functions that maximize (15), for any warping functions, including \psi^*(k), the following inequality holds:

    K_s(X, X) \ge \sum_{k=1}^{L} \phi(x_{\psi^*(k)}) \cdot \phi(x_{\psi^*(k)}) = \sum_{k=1}^{L} \| \phi(x_{\psi^*(k)}) \|^2. 
  (16)

In the same manner, the following holds:

    K_s(V, V) \ge \sum_{k=1}^{L} \phi(v_{\theta^*(k)}) \cdot \phi(v_{\theta^*(k)}) = \sum_{k=1}^{L} \| \phi(v_{\theta^*(k)}) \|^2.   (17)

Therefore,

    K_s(X, X) K_s(V, V) - K_s(X, V)^2
      \ge \left( \sum_{k=1}^{L} \| \phi(x_{\psi^*(k)}) \|^2 \right) \left( \sum_{k=1}^{L} \| \phi(v_{\theta^*(k)}) \|^2 \right) - \left( \sum_{k=1}^{L} \| \phi(x_{\psi^*(k)}) \| \, \| \phi(v_{\theta^*(k)}) \| \right)^2
      = \sum_{i=1}^{L} \sum_{j=i+1}^{L} \left( \| \phi(x_{\psi^*(i)}) \| \, \| \phi(v_{\theta^*(j)}) \| - \| \phi(x_{\psi^*(j)}) \| \, \| \phi(v_{\theta^*(i)}) \| \right)^2 \ge 0.   (18)
\qed

3 DTAK-SVM

Using the dynamic time-alignment kernel (DTAK) introduced in the previous section, the discriminant function of the SVM for a sequential pattern is expressed as

    g(X) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(X^{(i)}) \otimes \Phi(X) + b   (19)
         = \sum_{i=1}^{N} \alpha_i y_i \, K_s(X^{(i)}, X) + b,   (20)

where X^{(i)} represents the i-th training pattern. As can be seen from these expressions, the SVM discriminant function for a time sequence has the same form as the original SVM, except for the difference in kernels. It is straightforward to deduce the learning problem, which is given as

    \min_{W, b, \xi_i} \frac{1}{2} W \otimes W + C \sum_{i=1}^{N} \xi_i,   (21)

    subject to y_i (W \otimes \Phi(X^{(i)}) + b) \ge 1 - \xi_i,   (22)
               \xi_i \ge 0,  i = 1, ..., N.

Again, since the formulation of the learning problem defined above is almost the same as that for the original SVM, the same training algorithms can be used to solve the problem.

4 Experiments

Speech recognition experiments were carried out to evaluate the classification performance of the DTAK-SVM. 
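The kernel evaluated in these experiments combines the dynamic-programming recursion (6), the normalization (7), and the frame-level kernel substitution of (11). A minimal sketch, with our own function names and toy values:

```python
import numpy as np

def dtak(X, V, frame_kernel):
    """Dynamic time-alignment kernel: the DP recursion of Eq. (6) with
    path weights (1, 2, 1), normalized as in Eq. (7) by |X| + |V|.
    frame_kernel plays the role of Inp(i, j), or of K(., .) in Eq. (11)."""
    G = np.full((len(X) + 1, len(V) + 1), -np.inf)
    G[0, 0] = 0.0
    for i in range(1, len(X) + 1):
        for j in range(1, len(V) + 1):
            k = frame_kernel(X[i - 1], V[j - 1])
            G[i, j] = max(G[i - 1, j] + k,          # horizontal step, weight 1
                          G[i - 1, j - 1] + 2 * k,  # diagonal step, weight 2
                          G[i, j - 1] + k)          # vertical step, weight 1
    return G[len(X), len(V)] / (len(X) + len(V))

# Self-similarity of a 2-frame sequence under an RBF frame kernel (sigma = 2):
rbf = lambda u, v: float(np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / 2.0 ** 2))
X = [np.array([0.0]), np.array([1.0])]
k_self = dtak(X, X, rbf)  # the diagonal path dominates, giving exactly 1.0 here
```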
As our objective is to evaluate the basic performance of the proposed method, a very limited task was chosen: a hand-segmented phoneme recognition task in which the positions of the target patterns in the utterance are known. A continuous speech recognition task that does not require phoneme labeling would be our next step.

4.1 Experimental conditions

The details of the experimental conditions are given in Table 1. The training and evaluation samples were collected from the ATR speech database, A-set (5,240 Japanese words in the vocabulary).

Table 1: Experimental conditions

                      Experiment-1                 Experiment-2
  Speaker dependency  dependent                    dependent
  Phoneme classes     6 voiced consonants          5 vowels
  Speakers            5 males                      5 males and 5 females
  Training samples    200 samples per phoneme      500 samples per phoneme
  Evaluation samples  2,035 samples in all         2,500 samples in all
                      per speaker                  per speaker
  Signal sampling     12 kHz, 10 ms frame shift (both experiments)
  Feature values      13 MFCCs and 13 delta-MFCCs (both experiments)
  Kernel type         RBF (radial basis function):
                      K(x_i, x_j) = \exp(- \| x_i - x_j \|^2 / \sigma^2) (both experiments)

Figure 1: Experimental results for Experiment-1 (6 voiced-consonants recognition), showing (a) the correct classification rate and (b) the number of SVs as a function of \sigma (the RBF kernel parameter), for C = 0.1, 1.0, and 10.

In the consonant-recognition task (Experiment-1), only six voiced consonants /b, d, g, m, n, N/ were used to save time. 
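The frame-level RBF kernel listed in Table 1 can be written down directly; a small sketch (the feature values below are illustrative placeholders, not real MFCC frames):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=2.0):
    """Table 1 frame kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(d, d) / sigma ** 2))

# A 26-dimensional frame (13 MFCCs + 13 delta-MFCCs) compared with itself:
frame_a = np.zeros(26)
frame_b = np.zeros(26)
k_same = rbf_kernel(frame_a, frame_b)  # identical frames give exactly 1.0
```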
The classification task of these 6 phonemes without using contextual information is considered relatively difficult, whereas the classification of the 5 vowels /a, i, u, e, o/ (Experiment-2) is considered an easier task.
To apply the SVM, which is basically formulated as a two-class classifier, to the multi-class problem, a ``one against the others'' strategy was chosen. The proposed DTAK-SVM was implemented with the publicly available toolkit SVMTorch [11].

4.2 Experimental results

Fig. 1 depicts the experimental results for Experiment-1, where average values over the 5 speakers are shown. It can be seen in Fig. 1 that the best performance of 95.8% was achieved at \sigma = 2.0 and C = 10. Similar results were obtained for Experiment-2, as given in Fig. 2.

Figure 2: Experimental results for Experiment-2 (5 vowels recognition), showing (a) the correct classification rate and (b) the number of SVs as a function of \sigma (the RBF kernel parameter).

Table 2: Recognition performance comparison of DTAK-SVM with HMM. Results of Experiment-1 for 1 male and 1 female speaker are shown. (Numbers represent the correct classification rate [%].)

                       # training samples/phoneme
                       male                 female
  Model                50    100   200      50    100   200
  HMM (1 mix.)         75.0  69.1  77.1     72.2  65.5  76.6
  HMM (4 mix.)         83.3  84.7  90.9     77.3  76.4  86.4
  HMM (8 mix.)         82.8  87.0  92.4     74.6  79.3  88.5
  HMM (16 mix.)        79.9  85.0  93.2     72.9  78.7  89.8
  DTAK-SVM             83.8  85.9  92.1     83.5  81.8  87.7

Next, the classification performance of the DTAK-SVM was compared with that of a state-of-the-art HMM. 
In order to see the effect of the size of the training data set and of model complexity on generalization performance, experiments were carried out by varying the number of training samples (50, 100, 200) and the number of mixtures per HMM state (1, 4, 8, 16). The HMM used in this experiment was a 3-state, continuous-density, context-independent model with diagonal-covariance Gaussian mixtures. HTK [12] was employed for this purpose. The parameters of the DTAK-SVM were fixed at C = 10, \sigma = 2.0. The results for Experiment-1 with respect to 1 male and 1 female speaker are given in Table 2.
The experimental results show that the DTAK-SVM yields better classification performance when the number of training samples is 50, and comparable performance when the number of samples is 200. One might argue that the number of training samples used in this experiment is not at all sufficient for the HMM to achieve its best performance. However, such a shortage of training samples occurs often in HMM-based real-world speech recognition, especially when context-dependent models are employed, which prevents the HMM from improving its generalization performance.

5 Conclusions

A novel approach to extending the SVM framework to the sequential-pattern classification problem has been proposed by embedding a dynamic time-alignment operation into the kernel. Though long-term correlations between the feature vectors are discarded at the cost of achieving frame-synchronous processing for speech recognition, the proposed DTAK-SVM demonstrated performance comparable with HMMs in hand-segmented phoneme recognition. The DTAK-SVM is potentially applicable to continuous speech recognition with some extension of the one-pass search algorithm [9].

References

[1] V. N. Vapnik, Statistical Learning Theory. Wiley, 1998.
[2] B. Scholkopf, C. J. Burges, and A. J. Smola, eds., Advances in Kernel Methods. The MIT Press, 1998.
[3] ``Kernel machine website,'' 2000. 
http://www.kernel-machines.org/.
[4] P. Clarkson, ``On the Use of Support Vector Machines for Phonetic Classification,'' in ICASSP99, pp. 585-588, 1999.
[5] A. Ganapathiraju and J. Picone, ``Hybrid SVM/HMM architectures for speech recognition,'' in ICSLP2000, 2000.
[6] T. S. Jaakkola and D. Haussler, ``Exploiting generative models in discriminative classifiers,'' in Advances in Neural Information Processing Systems 11 (M. S. Kearns, S. A. Solla, and D. A. Cohn, eds.), pp. 487-493, The MIT Press, 1999.
[7] N. Smith and M. Niranjan, ``Data-dependent kernels in SVM classification of speech patterns,'' in ICSLP-2000, vol. 1, pp. 297-300, 2000.
[8] C. Watkins, ``Dynamic Alignment Kernels,'' in Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans, eds.), ch. 3, pp. 39-50, The MIT Press, 2000.
[9] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[10] K. Tsuda, ``Support Vector Classifier with Asymmetric Kernel Functions,'' in European Symposium on Artificial Neural Networks (ESANN), pp. 183-188, 1999.
[11] R. Collobert, ``SVMTorch: A Support Vector Machine for Large-Scale Regression and Classification Problems,'' 2000. http://www.idiap.ch/learning/SVMTorch.html.
[12] ``The Hidden Markov Model Toolkit (HTK).'' http://htk.eng.cam.ac.uk/.
", "award": [], "sourceid": 2131, "authors": [{"given_name": "Hiroshi", "family_name": "Shimodaira", "institution": null}, {"given_name": "Ken-ichi", "family_name": "Noma", "institution": null}, {"given_name": "Mitsuru", "family_name": "Nakai", "institution": null}, {"given_name": "Shigeki", "family_name": "Sagayama", "institution": null}]}