{"title": "Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1401, "page_last": 1408, "abstract": "", "full_text": "Eigenvoice Speaker Adaptation via Composite\n\nKernel PCA\n\nJames T. Kwok, Brian Mak and Simon Ho\n\nDepartment of Computer Science\n\nHong Kong University of Science and Technology\n\nClear Water Bay, Hong Kong\n\n[jamesk,mak,csho]@cs.ust.hk\n\nAbstract\n\nEigenvoice speaker adaptation has been shown to be effective when only\na small amount of adaptation data is available. At the heart of the method\nis principal component analysis (PCA) employed to \ufb01nd the most im-\nportant eigenvoices. In this paper, we postulate that nonlinear PCA, in\nparticular kernel PCA, may be even more effective. One major challenge\nis to map the feature-space eigenvoices back to the observation space so\nthat the state observation likelihoods can be computed during the estima-\ntion of eigenvoice weights and subsequent decoding. Our solution is to\ncompute kernel PCA using composite kernels, and we will call our new\nmethod kernel eigenvoice speaker adaptation. On the TIDIGITS corpus,\nwe found that compared with a speaker-independent model, our kernel\neigenvoice adaptation method can reduce the word error rate by 28\u201333%\nwhile the standard eigenvoice approach can only match the performance\nof the speaker-independent model.\n\n1\n\nIntroduction\n\nIn recent years, there has been a lot of interest in the study of kernel methods [1]. The basic\nidea is to map data in the input space X to a feature space via some nonlinear map \u2019, and\nthen apply a linear method there. It is now well known that the computational procedure\ndepends only on the inner products1 \u2019(xi)0\u2019(xj) in the feature space (where xi; xj 2\nX ), which can be obtained ef\ufb01ciently from a suitable kernel function k((cid:1);(cid:1)). 
Besides, kernel methods have the important computational advantage that no nonlinear optimization is involved. Thus, the use of kernels provides elegant nonlinear generalizations of many existing linear algorithms. A well-known example in supervised learning is the support vector machine (SVM). In unsupervised learning, the kernel idea has also led to methods such as kernel-based clustering algorithms and kernel principal component analysis [2].

In the field of automatic speech recognition, eigenvoice speaker adaptation [3] has drawn some attention in recent years, as it is found particularly useful when only a small amount of adaptation speech is available, e.g. a few seconds. At the heart of the method is principal component analysis (PCA), employed to find the most important eigenvoices.

¹In this paper, vector/matrix transpose is denoted by the prime superscript '.

A new speaker is then represented as a linear combination of a few (most important) eigenvoices, and the eigenvoice weights are usually estimated by maximizing the likelihood of the adaptation data. Conventionally, these eigenvoices are found by linear PCA. In this paper, we investigate the use of nonlinear PCA to find the eigenvoices by kernel methods. In effect, the nonlinear PCA problem is converted to a linear PCA problem in the high-dimensional feature space using the kernel trick. One of the major challenges is to map the feature-space eigenvoices back to the observation space, in order to compute the state observation likelihood of the adaptation data during the estimation of eigenvoice weights, and the likelihood of the test data during decoding. Our solution is to compute kernel PCA using composite kernels. We call our new method kernel eigenvoice speaker adaptation.

Kernel eigenvoice adaptation has to deal with several parameter spaces.
To avoid confusion, we denote the several spaces as follows: the d1-dimensional observation space as O, the d2-dimensional speaker (supervector) space as X, and the d3-dimensional speaker feature space as F. Notice that d1 ≪ d2 ≪ d3 in general.

The rest of this paper is organized as follows. Brief overviews of eigenvoice speaker adaptation and kernel PCA are given in Sections 2 and 3. Sections 4 and 5 then describe our proposed kernel eigenvoice method and its robust extension. Experimental results are presented in Section 6, and the last section gives some concluding remarks.

2 Eigenvoice

In the standard eigenvoice approach [3], speech training data are collected from many speakers with diverse characteristics. A set of speaker-dependent (SD) acoustic hidden Markov models (HMMs) is trained from each speaker, where each HMM state is modeled as a mixture of Gaussian distributions. A speaker's voice is then represented by a speaker supervector composed by concatenating the mean vectors of all HMM Gaussian distributions. For simplicity, we assume that each HMM state consists of one Gaussian only; the extension to mixtures of Gaussians is straightforward. Thus, the ith speaker supervector consists of R constituents, one from each Gaussian, and is denoted by x_i = [x_i1' ... x_iR']' ∈ R^d2. The similarity between any two speaker supervectors x_i and x_j is measured by their dot product

x_i' x_j = Σ_{r=1}^{R} x_ir' x_jr .   (1)

PCA is then performed on a set of training speaker supervectors, and the resulting eigenvectors are called eigenvoices. To adapt to a new speaker, his/her supervector s is treated as a linear combination of the first M eigenvoices {v_1, ..., v_M}, i.e., s = s^(ev) = Σ_{m=1}^{M} w_m v_m, where w = [w_1, ..., w_M]' is the eigenvoice weight vector. Usually, only a few eigenvoices (e.g., M < 10) are employed, so that only a small amount of adaptation speech (e.g., a few seconds) is required. Given the adaptation data o_t, t = 1, ..., T, the eigenvoice weights are in turn estimated by maximizing the likelihood of the o_t's. Mathematically, one finds w by maximizing the Q function Q(w) = Q_π + Q_a + Q_b(w), where

Q_π = Σ_{r=1}^{R} γ_1(r) log(π_r) ,   Q_a = Σ_{p,r=1}^{R} Σ_{t=1}^{T-1} ξ_t(p,r) log(a_pr) ,
Q_b(w) = Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t(r) log(b_r(o_t; w)) ,   (2)

and π_r is the initial probability of state r; γ_t(r) is the posterior probability of the observation sequence being at state r at time t; ξ_t(p,r) is the posterior probability of the observation sequence being at state p at time t and at state r at time t+1; and b_r is the Gaussian pdf of the rth state after re-estimation. Furthermore, Q_b is related to the new speaker supervector s by

Q_b(w) = -(1/2) Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t(r) [ d1 log(2π) + log|C_r| + ||o_t - s_r(w)||²_Cr ] ,   (3)

where ||o_t - s_r(w)||²_Cr = (o_t - s_r(w))' C_r^{-1} (o_t - s_r(w)) and C_r is the covariance matrix of the Gaussian at state r.

3 Kernel PCA

In this paper, the computation of eigenvoices is generalized by performing kernel PCA instead of linear PCA. In the following, let k(·,·) be the kernel with associated mapping φ, which maps a pattern x in the speaker supervector space X to φ(x) in the speaker feature space F. Given a set of N patterns (speaker supervectors) {x_1, ..., x_N}, denote the mean of the φ-mapped feature vectors by φ̄ = (1/N) Σ_{i=1}^{N} φ(x_i), and the “centered” map by φ̃ (with φ̃(x) = φ(x) - φ̄).
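In matrix terms, centering the φ-mapped vectors is equivalent to double-centering the kernel matrix with H = I - (1/N)11'. A small numerical check of this equivalence, using an explicit quadratic feature map as an illustrative stand-in for φ (the map, data, and shapes below are assumptions for the sketch, not the paper's setup):

```python
import numpy as np

# Check that centering feature vectors matches double-centering the kernel
# matrix: (phi(x_i) - mean)'(phi(x_j) - mean) = (H K H)_ij with H = I - 11'/N.
rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.standard_normal((N, d))

def phi(x):
    # Explicit feature map of the polynomial kernel k(x, y) = (x'y)^2.
    return np.outer(x, x).ravel()

Phi = np.array([phi(x) for x in X])
K = (X @ X.T) ** 2                      # kernel trick: k(x, y) = (x'y)^2

H = np.eye(N) - np.ones((N, N)) / N
K_centered = H @ K @ H                  # double-centered kernel matrix

Phi_centered = Phi - Phi.mean(axis=0)   # the centered map, feature by feature
```

Here Phi_centered @ Phi_centered.T reproduces K_centered exactly, which is why the centered kernel matrix below can stand in for the (possibly infinite-dimensional) centered feature vectors.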
Eigendecomposition is performed on K̃, the centered version of K = [k(x_i, x_j)]_ij, as K̃ = UΛU', where U = [α_1, ..., α_N] with α_i = [α_i1, ..., α_iN]', and Λ = diag(λ_1, ..., λ_N). Notice that K̃ is related to K by K̃ = HKH, where H = I - (1/N)11' is the centering matrix, I is the N × N identity matrix, and 1 = [1, ..., 1]' is an N-dimensional vector. The mth orthonormal eigenvector of the covariance matrix in the feature space is then given by [2] as v_m = Σ_{i=1}^{N} (α_mi / √λ_m) φ̃(x_i).

4 Kernel Eigenvoice

As seen from Eqn (3), the estimation of the eigenvoice weights requires evaluating the distance between the adaptation data o_t and the Gaussian means of the new speaker in the observation space O. In the standard eigenvoice method, this is done by first breaking down the adapted speaker supervector s into its R constituent Gaussians s_1, ..., s_R. However, the use of kernel PCA does not allow us to access each constituent Gaussian directly. To get around this problem, we investigate the use of composite kernels.

4.1 Definition of the Composite Kernel

For the ith speaker supervector x_i, we map each constituent x_ir separately via a kernel k_r(·,·) to φ_r(x_ir), and then construct φ(x_i) as φ(x_i) = [φ_1(x_i1)', ..., φ_R(x_iR)']'.
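The composite-kernel PCA described above can be sketched as follows. The data, dimensions, and plain Gaussian constituent kernels (without the covariance weighting introduced later) are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Sketch of kernel PCA with a composite kernel: k(x_i, x_j) is a sum of
# per-constituent Gaussian kernels.  Shapes and the width beta are illustrative.
rng = np.random.default_rng(1)
N, R, d1, beta = 20, 16, 13, 0.01
X = rng.standard_normal((N, R, d1))      # N supervectors, each split into R constituents

def composite_kernel(xi, xj):
    # k(x_i, x_j) = sum_r exp(-beta * ||x_ir - x_jr||^2)
    return np.sum(np.exp(-beta * np.sum((xi - xj) ** 2, axis=-1)))

K = np.array([[composite_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

# Centre the kernel matrix: K~ = H K H with H = I - 11'/N.
H = np.eye(N) - np.ones((N, N)) / N
K_centered = H @ K @ H

# Eigendecomposition K~ = U Lambda U'; columns of U play the role of alpha_m.
lam, U = np.linalg.eigh(K_centered)
lam, U = lam[::-1], U[:, ::-1]           # sort by decreasing eigenvalue
```

Note that np.linalg.eigh returns eigenvalues in ascending order, hence the final reversal to obtain the leading eigenvoices first.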
Analogous to Eqn (1), the similarity between two speaker supervectors x_i and x_j in the composite feature space is measured by

k(x_i, x_j) = Σ_{r=1}^{R} k_r(x_ir, x_jr) .

Note that if the k_r's are valid Mercer kernels, so is k [1]. Using this composite kernel, we can then proceed with the usual kernel PCA on the set of N training speaker supervectors and obtain the α_m's, the λ_m's, and the orthonormal eigenvectors v_m (m = 1, ..., M) of the covariance matrix in the feature space F.

4.2 New Speaker in the Feature Space

In the following, we denote the supervector of a new speaker by s. Similar to the standard eigenvoice approach, its φ̃-mapped speaker feature vector² φ̃^(kev)(s) is assumed to be a linear combination of the first M eigenvectors, i.e.,

φ̃^(kev)(s) = Σ_{m=1}^{M} w_m v_m = Σ_{m=1}^{M} Σ_{i=1}^{N} (w_m α_mi / √λ_m) φ̃(x_i) .   (4)

Its rth constituent is then given by

φ̃_r^(kev)(s_r) = Σ_{m=1}^{M} Σ_{i=1}^{N} (w_m α_mi / √λ_m) φ̃_r(x_ir) .

Hence, the similarity between φ_r^(kev)(s_r) and φ_r(o_t) is given by

k_r^(kev)(s_r, o_t) ≡ φ_r^(kev)(s_r)' φ_r(o_t)
= [ Σ_{m=1}^{M} Σ_{i=1}^{N} (w_m α_mi / √λ_m) φ̃_r(x_ir) + φ̄_r ]' φ_r(o_t)
= [ Σ_{m=1}^{M} Σ_{i=1}^{N} (w_m α_mi / √λ_m) (φ_r(x_ir) - φ̄_r) + φ̄_r ]' φ_r(o_t)
= Σ_{m=1}^{M} Σ_{i=1}^{N} (w_m α_mi / √λ_m) (k_r(x_ir, o_t) - φ̄_r' φ_r(o_t)) + φ̄_r' φ_r(o_t)
≡ A(r,t) + Σ_{m=1}^{M} (w_m / √λ_m) B(m,r,t) ,   (5)

where φ̄_r = (1/N) Σ_{i=1}^{N} φ_r(x_ir) is the rth part of φ̄, and

A(r,t) = φ̄_r' φ_r(o_t) = (1/N) Σ_{j=1}^{N} k_r(x_jr, o_t) ,
B(m,r,t) = ( Σ_{i=1}^{N} α_mi k_r(x_ir, o_t) ) - A(r,t) ( Σ_{i=1}^{N} α_mi ) .

4.3 Maximum Likelihood Adaptation Using an Isotropic Kernel

On adaptation, we have to express ||o_t - s_r||²_Cr of Eqn (3) as a function of w. Consider using isotropic kernels for k_r, so that k_r(x_ir, x_jr) = κ(||x_ir - x_jr||_Cr). Then k_r^(kev)(s_r, o_t) = κ(||o_t - s_r||²_Cr), and if κ is invertible, ||o_t - s_r||²_Cr will be a function of k_r^(kev)(s_r, o_t), which in turn is a function of w by Eqn (5). In the sequel, we will use the Gaussian kernel k_r(x_ir, x_jr) = exp(-β_r ||x_ir - x_jr||²_Cr), and hence

||o_t - s_r||²_Cr = -(1/β_r) log k_r^(kev)(s_r, o_t) = -(1/β_r) log ( A(r,t) + Σ_{m=1}^{M} (w_m / √λ_m) B(m,r,t) ) .   (6)

Substituting Eqn (6) into the Q_b function of Eqn (3), and differentiating with respect to each eigenvoice weight w_j, j = 1, ..., M, we obtain

∂Q_b/∂w_j = (1 / (2√λ_j)) Σ_{r=1}^{R} Σ_{t=1}^{T} (γ_t(r) / β_r) · B(j,r,t) / k_r^(kev)(s_r, o_t) .   (7)

Since Q_π and Q_a do not depend on w, ∂Q/∂w_j = ∂Q_b/∂w_j.

²The notation for a new speaker in the feature space requires some explanation. If s exists, then its centered image is φ̃^(kev)(s). However, since the pre-image of a speaker in the feature space may not exist, the notation φ̃^(kev)(s) is not exactly correct. It is nevertheless adopted for its intuitiveness, and readers are advised to infer the existence of s from the context.

4.4 Generalized EM Algorithm

Because of the nonlinear nature of kernel PCA, Eqn (6) is nonlinear in w and there is no closed-form solution for the optimal w.
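Before turning to the optimization, the quantities A(r,t) and B(m,r,t) of Eqn (5) and the distance recovery of Eqn (6) can be sketched numerically. The data, shapes, and plain (unweighted) Gaussian kernel below are stand-in assumptions, not the paper's C_r-weighted setup:

```python
import numpy as np

# Eqns (5)-(6) for one state r and one frame o_t, with toy stand-in data.
rng = np.random.default_rng(0)
N, M, d1, beta = 20, 5, 13, 0.05
Xr = rng.standard_normal((N, d1))        # rth constituents of the N training speakers
o_t = rng.standard_normal(d1)            # one adaptation observation
alpha = rng.standard_normal((M, N))      # eigenvector coefficients alpha_mi
lam = np.linspace(2.0, 1.0, M)           # eigenvalues lambda_m
w = 0.01 * rng.standard_normal(M)        # current eigenvoice weights (kept small)

k_vec = np.exp(-beta * np.sum((Xr - o_t) ** 2, axis=1))   # k_r(x_ir, o_t)

A = k_vec.mean()                                           # A(r, t)
B = alpha @ k_vec - A * alpha.sum(axis=1)                  # B(m, r, t), m = 1..M

k_kev = A + np.sum(w / np.sqrt(lam) * B)                   # Eqn (5)
d2 = -np.log(k_kev) / beta                                 # Eqn (6): squared distance
```

In a real implementation the kernel value must stay positive for the logarithm in Eqn (6) to be defined, which is one reason small, well-initialized weights matter in the optimization that follows.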
In this paper, we instead apply the generalized EM algorithm (GEM) [4] to find the optimal weights. GEM is similar to standard EM except for the maximization step: EM looks for the w that maximizes the expected likelihood of the E-step, whereas GEM only requires a w that improves the likelihood. Many numerical methods may be used to update w based on the derivatives of Q. In this paper, gradient ascent is used to obtain w(n) from w(n-1) based only on the first-order derivative:

w(n) = w(n-1) + η(n) Q'|_{w = w(n-1)} ,

where Q' = ∂Q_b/∂w and η(n) is the learning rate at the nth iteration. Methods such as Newton's method, which uses second-order derivatives, may also be used for faster convergence, at the expense of computing the more costly Hessian in each iteration.

The initial value w(0) can be important for numerical methods like gradient ascent. One reasonable approach is to start with the eigenvoice weights of the supervector composed from the speaker-independent model x^(si). That is,

w_m = v_m' φ̃(x^(si)) = Σ_{i=1}^{N} (α_mi / √λ_m) φ̃(x_i)' φ̃(x^(si))
    = Σ_{i=1}^{N} (α_mi / √λ_m) [φ(x_i) - φ̄]' [φ(x^(si)) - φ̄]
    = Σ_{i=1}^{N} (α_mi / √λ_m) [ k(x_i, x^(si)) + (1/N²) Σ_{p,q=1}^{N} k(x_p, x_q) - (1/N) Σ_{p=1}^{N} ( k(x_i, x_p) + k(x^(si), x_p) ) ] .   (8)

5 Robust Kernel Eigenvoice

The success of the eigenvoice approach for fast speaker adaptation is due to two factors: (1) a good collection of “diverse” speakers, so that the whole speaker space is captured by the eigenvoices; and (2) the reduction of the number of adaptation parameters to a few eigenvoice weights. However, since the amount of adaptation data is so small, the adaptation performance may vary widely.
To get more robust performance, we propose to interpolate the kernel eigenvoice φ̃^(kev)(s) obtained in Eqn (4) with the φ̃-mapped speaker-independent (SI) supervector φ̃(x^(si)) to obtain the final speaker-adapted model φ̃^(rkev)(s):

φ̃^(rkev)(s) = w_0 φ̃(x^(si)) + (1 - w_0) φ̃^(kev)(s) ,   0 ≤ w_0 ≤ 1 ,   (9)

where φ̃^(kev)(s) is found by Eqn (4). By replacing φ̃^(kev)(s) with φ̃^(rkev)(s) in the computation of the kernel value of Eqn (5), and following the mathematical steps of Section 4, one may derive the required gradients for the joint maximum-likelihood estimation of w_0 and the other eigenvoice weights in the GEM algorithm.

Notice that φ̃^(rkev)(s) also contains components of φ̃(x^(si)) from eigenvectors beyond the M kernel eigenvoices selected for adaptation. Thus, robust KEV adaptation may have the additional benefit of preserving the speaker-independent projections on the remaining, less important but robust, eigenvoices in the final speaker-adapted model.

6 Experimental Evaluation

The proposed kernel eigenvoice adaptation method was evaluated on the TIDIGITS speech corpus [5]. Its performance was compared with that of the speaker-independent model and the standard eigenvoice adaptation method, using only 3s, 5.5s, and 13s of adaptation speech. Excluding the leading and ending silence, the average durations of adaptation speech are 2.1s, 4.1s, and 9.6s respectively.

6.1 TIDIGITS Corpus

The TIDIGITS corpus contains clean connected-digit utterances sampled at 20 kHz. It is divided into a standard training set and a test set. There are 163 speakers (of both genders) in each set, each pronouncing 77 utterances of one to seven digits (out of the eleven digits: “0”, “1”, ..., “9”, and “oh”).
The speaker population is quite diverse, with speakers coming from 22 dialect regions of the USA and ages ranging from 6 to 70 years.

In all the following experiments, only the training set was used to train the speaker-independent (SI) HMMs and the speaker-dependent (SD) HMMs from which the SI and SD speaker supervectors were derived.

6.2 Acoustic Models

All training data were processed to extract 12 mel-frequency cepstral coefficients and the normalized frame energy from each speech frame of 25 ms at every 10 ms. Each of the eleven digit models was a strictly left-to-right HMM comprising 16 states, with one Gaussian with diagonal covariance per state. In addition, there were a 3-state “sil” model to capture silence and a 1-state “sp” model to capture the short pauses between digits. All HMMs were trained by the EM algorithm. Thus, the dimension of the observation space is d1 = 13, and that of the speaker supervector space is d2 = 11 × 16 × 13 = 2288.

First, the SI models were trained. Then an SD model was trained for each individual speaker by borrowing the variances and transition matrices from the corresponding SI models; only the Gaussian means were estimated. Furthermore, the sil and sp models were simply copied to the SD model.

6.3 Experiments

The following five models/systems were compared:

SI: speaker-independent model.
EV: speaker-adapted model found by the standard eigenvoice adaptation method.
Robust-EV: speaker-adapted model found by our robust version of EV, which is the interpolation between the SI supervector and the supervector found by EV.
That is,

s^(rev) = w_0 s^(si) + (1 - w_0) s^(ev) ,   0 ≤ w_0 ≤ 1 .

KEV: speaker-adapted model found by our new kernel eigenvoice adaptation method as described in Section 4.
Robust-KEV: speaker-adapted model found by our robust KEV as described in Section 5.

All adaptation results are averages of 5-fold cross-validation over all 163 test speakers. The detailed results using different numbers of eigenvoices are shown in Figure 1, while the best result for each model is shown in Table 1.

Table 1: Word recognition accuracies (%) of the SI model and the best adapted models found by EV, robust EV, KEV, and robust KEV using 2.1s, 4.1s, and 9.6s of adaptation speech.

SYSTEM        2.1s     4.1s     9.6s
SI            96.25    96.25    96.25
EV            95.61    95.65    95.67
robust EV     96.26    96.26    96.27
KEV           96.85    97.05    97.05
robust KEV    97.28    97.44    97.50

From Table 1, we observe that the standard eigenvoice approach cannot obtain better performance than the SI model³. On the other hand, using our kernel eigenvoice (KEV) method, we obtain word error rate (WER) reductions of 16.0%, 21.3%, and 21.3% over the SI model with 2.1s, 4.1s, and 9.6s of adaptation speech. When the SI model is interpolated with the KEV model in our robust KEV method, the WER reductions further improve to 27.5%, 31.7%, and 33.3% respectively. These best results are obtained with 7 to 8 eigenvoices.
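The quoted reductions follow directly from the accuracies in Table 1, with WER taken as 100 minus accuracy and reductions computed relative to the SI error rate:

```python
# Recompute the quoted WER reductions from the Table 1 accuracies.
si_acc = 96.25
kev_acc = {'2.1s': 96.85, '4.1s': 97.05, '9.6s': 97.05}
rkev_acc = {'2.1s': 97.28, '4.1s': 97.44, '9.6s': 97.50}

def wer_reduction(acc, base=si_acc):
    # WER = 100 - accuracy; reduction is relative to the SI word error rate.
    return 100.0 * ((100.0 - base) - (100.0 - acc)) / (100.0 - base)

for dur in ('2.1s', '4.1s', '9.6s'):
    print(dur, round(wer_reduction(kev_acc[dur]), 1), round(wer_reduction(rkev_acc[dur]), 1))
# prints: 2.1s 16.0 27.5 / 4.1s 21.3 31.7 / 9.6s 21.3 33.3
```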
The results show that nonlinear PCA using composite kernels can be more effective in finding the eigenvoices.

[Figure 1 plots word recognition accuracy (%) against the number of kernel eigenvoices (0 to 10) for the SI model and for KEV and robust KEV with 2.1s and 9.6s of adaptation speech.]

Figure 1: Word recognition accuracies of adapted models found by KEV and robust KEV using different numbers of eigenvoices.

From Figure 1, the KEV method can outperform the SI model even with only two eigenvoices and only 2.1s of speech. Its performance then improves slightly with more eigenvoices or more adaptation data. If we allow interpolation with the SI model as in robust KEV, the saturation effect is even more pronounced: even with one eigenvoice, the adaptation performance is already better than that of the SI model, and the performance then does not change much with more eigenvoices or adaptation data. The results seem to suggest that the requirement that the adapted speaker supervector be a weighted sum of a few eigenvoices is both the strength and the weakness of the method: on the one hand, fast adaptation becomes possible since the number of estimated parameters is small; on the other, adaptation saturates quickly because the constraint is so restrictive that the mean vectors of all the different acoustic models have to undergo the same linear combination of the eigenvoices.

³The word accuracy of our SI model is not as good as the best reported result on TIDIGITS, which is about 99.7%. The main reasons are that we used only 13-dimensional static cepstra and energy, and that each state was modelled by a single Gaussian with diagonal covariance. The use of this simple model allowed us to run experiments with 5-fold cross-validation using very short adaptation speech. Right now our approach requires the computation of many kernel function values and is computationally expensive. As a first attempt at the approach, we feel that the use of this simple model is justified. We are now working on its speed-up and its extension to HMM states with Gaussian mixtures.

7 Conclusions

In this paper, we improve the standard eigenvoice speaker adaptation method using kernel PCA with a composite kernel. On the TIDIGITS task, it is found that while the standard eigenvoice approach does not help, our kernel eigenvoice method can outperform the speaker-independent model by about 28–33% (in terms of error rate improvement).

Right now, recognition using the adapted model obtained from our kernel eigenvoice method is slower than with the standard eigenvoice method, because the state observation likelihoods cannot be computed directly but only through evaluating kernel values with all training speaker supervectors. One possible solution is to apply sparse kernel PCA [6], so that computation of the first M principal components involves only M (instead of N, with M ≪ N) kernel functions. Another direction is to use compactly supported kernels [7], in which the value of κ(||x_i - x_j||) vanishes when ||x_i - x_j|| is greater than a certain threshold. The kernel matrix then becomes sparse.
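A minimal sketch of how a compactly supported kernel induces a sparse kernel matrix; the truncated-triangle form below is one simple illustrative choice, not necessarily a kernel proposed in [7]:

```python
import numpy as np

# A compactly supported kernel: kappa(||x_i - x_j||) is exactly zero once the
# distance exceeds a threshold theta, so far-apart pairs cost nothing to store.
def cs_kernel(xi, xj, theta=1.0):
    d = np.linalg.norm(xi - xj)
    return max(0.0, 1.0 - d / theta)    # zero whenever d >= theta

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(30, 2))     # toy points spread out in 2-D
K = np.array([[cs_kernel(a, b) for b in X] for a in X])

sparsity = np.mean(K == 0.0)
print(f'{sparsity:.0%} of kernel entries are exactly zero')
```

With points spread over a region much wider than the support threshold, the vast majority of kernel entries vanish, so sparse matrix storage and arithmetic become applicable.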
Moreover, no further computation is required when ||x_i - x_j|| is large.

8 Acknowledgements

This research is partially supported by the Research Grants Council of the Hong Kong SAR under grant numbers HKUST2033/00E, HKUST6195/02E, and HKUST6201/02E.

References

[1] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, 2002.
[2] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[3] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(4):695–707, Nov 2000.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977.
[5] R.G. Leonard. A database for speaker-independent digit recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 4211–4214, 1984.
[6] A.J. Smola, O.L. Mangasarian, and B. Schölkopf. Sparse kernel feature analysis. Technical Report 99-03, Data Mining Institute, University of Wisconsin, Madison, 1999.
[7] M.G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.
", "award": [], "sourceid": 2421, "authors": [{"given_name": "James", "family_name": "Kwok", "institution": null}, {"given_name": "Brian", "family_name": "Mak", "institution": null}, {"given_name": "Simon", "family_name": "Ho", "institution": null}]}