{"title": "Speaker Comparison with Inner Product Discriminant Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 207, "page_last": 215, "abstract": "Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications---speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison: support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.", "full_text": "Speaker Comparison with Inner Product\n\nDiscriminant Functions\n\nW. M. Campbell\n\nMIT Lincoln Laboratory\nLexington, MA 02420\nwcampbell@ll.mit.edu\n\nZ. N. Karam\n\nDSPG, MIT RLE, Cambridge MA\n\nMIT Lincoln Laboratory, Lexington, MA\n\nzahi@mit.edu\n\nD. E. 
Sturim\n\nMIT Lincoln Laboratory\nLexington, MA 02420\n\nsturim@ll.mit.edu\n\nAbstract\n\nSpeaker comparison, the process of \ufb01nding the speaker similarity between two\nspeech signals, occupies a central role in a variety of applications\u2014speaker ver-\ni\ufb01cation, clustering, and identi\ufb01cation. Speaker comparison can be placed in a\ngeometric framework by casting the problem as a model comparison process. For\na given speech signal, feature vectors are produced and used to adapt a Gaussian\nmixture model (GMM). Speaker comparison can then be viewed as the process of\ncompensating and \ufb01nding metrics on the space of adapted models. We propose\na framework, inner product discriminant functions (IPDFs), which extends many\ncommon techniques for speaker comparison\u2014support vector machines, joint fac-\ntor analysis, and linear scoring. The framework uses inner products between the\nparameter vectors of GMM models motivated by several statistical methods. Com-\npensation of nuisances is performed via linear transforms on GMM parameter\nvectors. Using the IPDF framework, we show that many current techniques are\nsimple variations of each other. We demonstrate, on a 2006 NIST speaker recog-\nnition evaluation task, new scoring methods using IPDFs which produce excellent\nerror rates and require signi\ufb01cantly less computation than current techniques.\n\n1 Introduction\n\nComparing speakers in speech signals is a common operation in many applications including foren-\nsic speaker recognition, speaker clustering, and speaker veri\ufb01cation. Recent popular approaches\nto text-independent comparison include Gaussian mixture models (GMMs) [1], support vector ma-\nchines [2, 3], and combinations of these techniques. When comparing two speech utterances, these\napproaches are used in a train and test methodology. One utterance is used to produce a model which\nis then scored against the other utterance. 
The resulting comparison score is then used to cluster,\nverify or identify the speaker.\n\nComparing speech utterances with kernel functions has been a common theme in the speaker recog-\nnition SVM literature [2, 3, 4]. The resulting framework has an intuitive geometric structure. Vari-\nable length sequences of feature vectors are mapped to a large dimensional SVM expansion vector.\nThese vectors are \u201csmoothed\u201d to eliminate nuisances [2]. Then, a kernel function is applied to the\n\n\u2217This work was sponsored by the Federal Bureau of Investigation under Air Force Contract FA8721-05-\nC-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not\nnecessarily endorsed by the United States Government.\n\n1\n\n\ftwo vectors. The kernel function is an inner product which induces a metric on the set of vectors, so\ncomparison is analogous to \ufb01nding the distances between SVM expansion vectors.\n\nA recent trend in the speaker recognition literature has been to move towards a more linear geo-\nmetric view for non-SVM systems. Compensation via linear subspaces and supervectors of mean\nparameters of GMMs is presented in joint factor analysis [5]. Also, comparison of utterances via\nlinear scoring is presented in [6]. These approaches have introduced many new ideas and perform\nwell in speaker comparison tasks.\n\nAn unrealized effort in speaker recognition is to bridge the gap between SVMs and some of the new\nproposed GMM methods. One dif\ufb01culty is that most SVM kernel functions in speaker comparison\nsatisfy the Mercer condition. This restricts the scope of investigation of potential comparison strate-\ngies for two speaker utterances. Therefore, in this paper, we introduce the idea of inner product\ndiscriminant functions (IPDFs).\n\nIPDFs are based upon the same basic operations as SVM kernel functions with some relaxation in\nstructure. First, we map input utterances to vectors of \ufb01xed dimension. 
Second, we compensate the input feature vectors. Typically, this compensation takes the form of a linear transform. Third, we compare two compensated vectors with an inner product. The resulting comparison function is then used in an application-specific way.\n\nThe focus of our initial investigations of the IPDF structure is the following. First, we show that many of the common techniques such as factor analysis, nuisance projection, and various types of scoring can be placed in the framework. Second, we systematically describe the various inner product and compensation techniques used in the literature. Third, we propose new inner products and compensation. Finally, we explore the space of possible combinations of techniques and demonstrate several novel methods that are computationally efficient and produce excellent error rates.\n\nThe outline of the paper is as follows. In Section 2, we describe the general setup for speaker comparison using GMMs. In Section 3, we introduce the IPDF framework. Section 4 explores inner products for the IPDF framework. Section 5 looks at methods for compensating for variability. In Section 6, we perform experiments on the NIST 2006 speaker recognition evaluation and explore different combinations of IPDF comparisons and compensations.\n\n2 Speaker Comparison\n\nA standard distribution used for text-independent speaker recognition is the Gaussian mixture model [1],\n\ng(x) = \u2211_{i=1}^{N} \u03bb_i N(x | m_i, \u03a3_i).    (1)\n\nFeature vectors are typically cepstral coefficients with associated smoothed first- and second-order derivatives. We map a sequence of feature vectors, x_1^{N_x}, from a speaker to a GMM by adapting a GMM universal background model (UBM). Here, we use the shorthand x_1^{N_x} to denote the sequence x_1, \u00b7\u00b7\u00b7, x_{N_x}. For the purpose of this paper, we will assume only the mixture weights, \u03bb_i, and means, m_i, in (1) are adapted. 
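As a concrete illustration, the mixture density in (1) can be evaluated directly. The following is a minimal sketch assuming diagonal covariances; the dimensions, weights, and covariance values are synthetic stand-ins, not values from our experiments.

```python
import numpy as np

# Illustrative evaluation of the GMM density g(x) = sum_i lambda_i N(x | m_i, Sigma_i)
# with diagonal covariances. All values below are synthetic stand-ins.

def gmm_density(x, weights, means, variances):
    """Evaluate a diagonal-covariance GMM at a single feature vector x."""
    diff = x - means                                   # shape (N, n)
    exponents = -0.5 * np.sum(diff ** 2 / variances, axis=1)
    norms = np.prod(2.0 * np.pi * variances, axis=1) ** -0.5
    return float(np.sum(weights * norms * np.exp(exponents)))

N, n = 3, 2                                            # mixture components, feature dimension
rng = np.random.default_rng(0)
weights = np.array([0.2, 0.3, 0.5])                    # mixture weights sum to one
means = rng.normal(size=(N, n))
variances = np.full((N, n), 1.0)

g = gmm_density(np.zeros(n), weights, means, variances)
assert g > 0.0                                         # a density value is positive
```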
Adaptation of the means is performed with standard relevance MAP [1]. We estimate the mixture weights using the standard ML estimate. The adaptation yields new parameters which we stack into a parameter vector, a_x, where\n\na_x = [\u03bb_x^t  m_x^t]^t    (2)\n= [\u03bb_{x,1} \u00b7\u00b7\u00b7 \u03bb_{x,N}  m_{x,1}^t \u00b7\u00b7\u00b7 m_{x,N}^t]^t.    (3)\n\nIn speaker comparison, the problem is to compare two sequences of feature vectors, e.g., x_1^{N_x} and y_1^{N_y}. To compare these two sequences, we adapt a GMM UBM to produce two sets of parameter vectors, a_x and a_y, as in (2). The goal of our speaker comparison process can now be recast as a function that compares the two parameter vectors, C(a_x, a_y), and produces a value that reflects the similarity of the speakers. Initial work in this area was performed using kernels from support vector machines [4, 7, 2], but we expand the scope to other types of discriminant functions.\n\n3 Inner Product Discriminant Functions\n\nThe basic framework we propose for speaker comparison functions is composed of two parts\u2014compensation and comparison. For compensation, the parameter vectors generated by adaptation in (2) can be transformed to remove nuisances or projected onto a speaker subspace. The second part of our framework is comparison. For the comparison of parameter vectors, we will consider natural distances that result in inner products between parameter vectors.\n\nWe propose the following inner product discriminant function (IPDF) framework for exploring speaker comparison,\n\nC(a_x, a_y) = (L_x a_x)^t D^2 (L_y a_y)    (4)\n\nwhere L_x, L_y are linear transforms and potentially dependent on \u03bb_x and/or \u03bb_y. The matrix D is positive definite, usually diagonal, and possibly dependent on \u03bb_x and/or \u03bb_y. Note that we also consider simple combinations of IPDFs to be in our framework\u2014e.g., positively-weighted sums of IPDFs.\n\nSeveral questions from this framework are: 1) what inner product gives the best speaker comparison performance, 2) what compensation strategy works best, 3) what tradeoffs can be made between accuracy and computational cost, and 4) how do the compensation and the inner product interact. We explore theoretical and experimental answers to these questions in the following sections.\n\n4 Inner Products for IPDFs\n\nIn general, an inner product of the parameters should be based on a distance arising from a statistical comparison. We derive three straightforward methods in this section. We also relate some other methods, without being exhaustive, that fall in this framework and have been described in detail in the literature.\n\n4.1 Approximate KL Comparison (C_KL)\n\nA straightforward strategy for comparing the GMM parameter vectors is to use an approximate form of the KL divergence applied to the induced GMM models. This strategy was used in [2] successfully with an approximation based on the log-sum inequality; i.e., for the GMMs, g_x and g_y, with parameters a_x and a_y,\n\nD(g_x \u2016 g_y) \u2264 \u2211_{i=1}^{N} \u03bb_{x,i} D(N(\u00b7; m_{x,i}, \u03a3_i) \u2016 N(\u00b7; m_{y,i}, \u03a3_i)).    (5)\n\nHere, D(\u00b7\u2016\u00b7) is the KL divergence, and \u03a3_i is from the UBM. By symmetrizing (5) and substituting in the KL divergence between two Gaussian distributions, we obtain a distance, d_s, which upper bounds the symmetric KL divergence,\n\nd_s(a_x, a_y) = D_s(\u03bb_x \u2016 \u03bb_y) + \u2211_{i=1}^{N} (0.5\u03bb_{x,i} + 0.5\u03bb_{y,i})(m_{x,i} \u2212 m_{y,i})^t \u03a3_i^{\u22121} (m_{x,i} \u2212 m_{y,i}).    (6)\n\nWe focus on the second term in (6) for this paper, but note that the first term could also be converted to a comparison function on the mixture weights. 
Using polarization on the second term, we obtain the inner product\n\nC_KL(a_x, a_y) = \u2211_{i=1}^{N} (0.5\u03bb_{x,i} + 0.5\u03bb_{y,i}) m_{x,i}^t \u03a3_i^{\u22121} m_{y,i}.    (7)\n\nNote that (7) can also be expressed more compactly as\n\nC_KL(a_x, a_y) = m_x^t ((0.5\u03bb_x + 0.5\u03bb_y) \u2297 I_n) \u03a3^{\u22121} m_y    (8)\n\nwhere \u03a3 is the block matrix with the \u03a3_i on the diagonal, n is the feature vector dimension, and \u2297 is the Kronecker product. Note that the non-symmetric form of the KL distance in (5) would result in the average mixture weights in (8) being replaced by \u03bb_x. Also, note that shifting the means by the UBM will not affect the distance in (6), so we can replace the means in (8) by the UBM-centered means.\n\n4.2 GLDS kernel (C_GLDS)\n\nAn alternate inner product approach is to use generalized linear discriminants and the corresponding kernel [4]. The overall structure of this GLDS kernel is as follows. A per-feature-vector expansion function is defined as\n\nb(x_i) = [b_1(x_i) \u00b7\u00b7\u00b7 b_m(x_i)]^t.    (9)\n\nThe mapping of an input sequence, x_1^{N_x}, is then defined as\n\nx_1^{N_x} \u21a6 b_x = (1/N_x) \u2211_{i=1}^{N_x} b(x_i).    (10)\n\nThe corresponding kernel between two sequences is then\n\nK_GLDS(x_1^{N_x}, y_1^{N_y}) = b_x^t \u0393^{\u22121} b_y    (11)\n\nwhere\n\n\u0393 = (1/N_z) \u2211_{i=1}^{N_z} b(z_i) b(z_i)^t,    (12)\n\nand z_1^{N_z} is a large set of feature vectors which is representative of the speaker population. In the context of a GMM UBM, we can define an expansion as follows\n\nb(x_i) = [p(1|x_i)(x_i \u2212 m_1)^t \u00b7\u00b7\u00b7 p(N|x_i)(x_i \u2212 m_N)^t]^t    (13)\n\nwhere p(j|x_i) is the posterior probability of mixture component j given x_i, and m_j is from a UBM. Using (13) in (10), we see that\n\nb_x = (\u03bb_x \u2297 I_n)(m_x \u2212 m) and b_y = (\u03bb_y \u2297 I_n)(m_y \u2212 m)    (14)\n\nwhere m is the stacked means of the UBM. 
Thus, the GLDS kernel inner product is\n\nC_GLDS(a_x, a_y) = (m_x \u2212 m)^t (\u03bb_x \u2297 I_n) \u0393^{\u22121} (\u03bb_y \u2297 I_n) (m_y \u2212 m).    (15)\n\nNote that \u0393 in (12) is almost the UBM covariance matrix, but is not quite the same because of a squaring of the p(j|z_i) in the diagonal. As is commonly assumed, we will consider a diagonal approximation of \u0393; see [4].\n\n4.3 Gaussian-Distributed Vectors\n\nA common assumption in the factor analysis literature [5] is that the parameter vector m_x, as x varies, has a Gaussian distribution. If we assume a single covariance for the entire space, then the resulting likelihood ratio test between two Gaussian distributions results in a linear discriminant [8]. More formally, suppose that we have a distribution with mean m_x and we are trying to distinguish it from a distribution with the UBM mean m; then the discriminant function is [8],\n\nh(x) = (m_x \u2212 m)^t \u03a5^{\u22121} (x \u2212 m) + c_x    (16)\n\nwhere c_x is a constant that depends on m_x, and \u03a5 is the covariance in the parameter vector space. We will assume that the comparison function can be normalized (e.g., by Z-norm [1]), so that c_x can be dropped. We now apply the discriminant function to another mean vector, m_y, and obtain the following comparison function\n\nC_G(a_x, a_y) = (m_x \u2212 m)^t \u03a5^{\u22121} (m_y \u2212 m).    (17)\n\n4.4 Other Methods\n\nSeveral other methods are possible for comparing the parameter vectors that arise either from ad hoc methods or from work in the literature. We describe a few of these in this section.\n\nGeometric Mean Comparison (C_GM). A simple symmetric function that is similar to the KL (8) and GLDS (15) comparison functions is arrived at by replacing the arithmetic mean in C_KL by a geometric mean. The resulting kernel is\n\nC_GM(a_x, a_y) = (m_x \u2212 m)^t (\u03bb_x^{1/2} \u2297 I_n) \u03a3^{\u22121} (\u03bb_y^{1/2} \u2297 I_n) (m_y \u2212 m)    (18)\n\nwhere \u03a3 is the block diagonal UBM covariances.\n\nFisher Kernel (C_F). 
The Fisher kernel specialized to the UBM case has several forms [3]. The main variations are the choice of covariance in the inner product and the choice of normalization of the gradient term. We took the best-performing configuration for this paper\u2014we normalize the gradient by the number of frames, which results in a mixture weight scaling of the gradient. We also use a diagonal data-trained covariance term. The resulting comparison function is\n\nC_F(a_x, a_y) = [(\u03bb_x \u2297 I_n) \u03a3^{\u22121} (m_x \u2212 m)]^t \u03a6^{\u22121} [(\u03bb_y \u2297 I_n) \u03a3^{\u22121} (m_y \u2212 m)]    (19)\n\nwhere \u03a6 is a diagonal matrix acting as a variance normalizer.\n\nLinearized Q-function (C_Q). Another form of inner product may be derived from the linear Q-scoring shown in [6]. In this case, the scoring is given as (m_train \u2212 m)^t \u03a3^{\u22121} (F \u2212 Nm) where N and F are the zeroth- and first-order sufficient statistics of a test utterance, m is the UBM means, m_train is the mean of a training model, and \u03a3 is the block diagonal UBM covariances. A close approximation of this function can be made by using a small relevance factor in MAP adaptation of the means to obtain the following comparison function\n\nC_Q(a_x, a_y) = (m_x \u2212 m)^t \u03a3^{\u22121} (\u03bb_y \u2297 I_n) (m_y \u2212 m).    (20)\n\nNote that if we symmetrize C_Q, this gives us C_KL; this analysis ignores for a moment that in [6], compensation is also asymmetric.\n\nKL Kernel (K_KL). By assuming the mixture weights are constant and equal to the UBM mixture weights in the comparison function C_KL (7), we obtain the KL kernel,\n\nK_KL(m_x, m_y) = m_x^t (\u03bb \u2297 I_n) \u03a3^{\u22121} m_y    (21)\n\nwhere \u03bb are the UBM mixture weights. This kernel has been used extensively in SVM speaker recognition [2].\n\nAn analysis of the different inner products in the preceding sections shows that many of the methods presented in the literature have a similar form, but are interestingly derived with quite disparate techniques. Our goal in the experimental section is to understand how these comparison functions perform and how they interact with compensation.\n\n5 Compensation in IPDFs\n\nOur next task is to explore compensation methods for IPDFs. Our focus will be on subspace-based methods. With these methods, the fundamental assumption is that either speakers and/or nuisances are confined to a small subspace in the parameter vector space. The problem is to use this knowledge to produce a higher signal (speaker) to noise (nuisance) representation of the speaker. Standard notation is to use U to represent the nuisance subspace and to have V represent the speaker subspace. Our goal in this section is to recast many of the methods in the literature in a standard framework with oblique and orthogonal projections.\n\nTo make a cohesive presentation, we introduce some notation. We define an orthogonal projection with respect to a metric, P_{U,D}, where D and U are full-rank matrices, as\n\nP_{U,D} = U (U^t D^2 U)^{\u22121} U^t D^2    (22)\n\nwhere DU is a linearly independent set, and the metric is \u2016x \u2212 y\u2016_D = \u2016Dx \u2212 Dy\u2016_2. The process of projection, e.g., y = P_{U,D} b, is equivalent to solving the least-squares problem x\u0302 = argmin_x \u2016Ux \u2212 b\u2016_D and letting y = U x\u0302. For convenience, we also define the projection onto the orthogonal complement of U, U^\u22a5, as Q_{U,D} = P_{U\u22a5,D} = I \u2212 P_{U,D}. 
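The defining properties of the projection and its complement are easy to check numerically. The following sketch uses a random stand-in subspace U and a random diagonal metric D, not trained nuisance subspaces:

```python
import numpy as np

# Numerical check of the metric-weighted projection P_{U,D} = U (U^t D^2 U)^{-1} U^t D^2
# and its complement Q_{U,D} = I - P_{U,D}. U and D are random stand-ins here.

rng = np.random.default_rng(0)
dim, r = 6, 2
U = rng.normal(size=(dim, r))            # basis for the (stand-in) nuisance subspace
d = rng.uniform(0.5, 2.0, size=dim)      # diagonal of the metric D
D2 = np.diag(d ** 2)

P = U @ np.linalg.inv(U.T @ D2 @ U) @ U.T @ D2
Q = np.eye(dim) - P

# P is idempotent, and Q annihilates the nuisance directions: Q U = 0.
assert np.allclose(P @ P, P)
assert np.allclose(Q @ U, np.zeros((dim, r)))
```

Because D is known before any comparison, P can be precomputed once and applied as a plain matrix multiply, which is the computational point exploited later in the experiments.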
Note that we can regularize the projection P_{U,D} by adding a diagonal term to the inverse in (22); the resulting operation remains linear but is no longer a projection.\n\nWe also define the oblique projection onto V with null space U + (U + V)^\u22a5 and metric induced by D. Let QR be the (skinny) QR decomposition of the matrix [U V] in the D norm (i.e., Q^t D^2 Q = I), and Q_V be the columns corresponding to V in the matrix Q. Then, the oblique (non-orthogonal) projection onto V is\n\nO_{V,U,D} = V (Q_V^t D^2 V)^{\u22121} Q_V^t D^2.    (23)\n\nThe use of projections in our development will add geometric understanding to the process of compensation.\n\n5.1 Nuisance Attribute Projection (NAP)\n\nA framework for eliminating nuisances in the parameter vector based on projection was shown in [2]. The basic idea is to assume that nuisances are confined to a small subspace and can be removed via an orthogonal projection, m_x \u21a6 Q_{U,D} m_x. One justification for using subspaces comes from the perspective that channel classification can be performed with inner products along one-dimensional subspaces. Therefore, the projection removes channel-specific directions from the parameter space.\n\nThe NAP projection uses the metric induced by a kernel in an SVM. For the GMM context, the standard kernel used is the approximate KL comparison (8) [2]. We note that since D is known a priori to speaker comparison, we can orthonormalize the matrix DU and apply the projection as a matrix multiply. The resulting projection has D = (\u03bb^{1/2} \u2297 I_n) \u03a3^{\u22121/2}.\n\n5.2 Factor Analysis and Joint Factor Analysis\n\nThe joint factor analysis (JFA) model assumes that the mean parameter vector can be expressed as\n\nm_{s,sess} = m + Ux + Vy    (24)\n\nwhere m_{s,sess} is the speaker- and session-dependent mean parameter vector, U and V are matrices with small rank, and m is typically the UBM. 
Note that for this section, we will use the standard variables for factor analysis, x and y, even though they conflict with our earlier development. The goal of joint factor analysis is to find solutions to the latent variables x and y given training data. In (24), the matrix U represents a nuisance subspace, and V represents a speaker subspace. Existing work on this approach for speaker recognition uses both maximum likelihood (ML) estimates and MAP estimates of x and y [9, 5]. In the latter case, a Gaussian prior with zero mean and diagonal covariance for x and y is assumed. For our work, we focus on the ML estimates [9] of x and y in (24), since we did not observe substantially different performance from MAP estimates in our experiments.\n\nAnother form of modeling that we will consider is factor analysis (FA). In this case, the term Vy is replaced by a constant vector representing the true speaker model, m_s; the goal is then to estimate x. Typically, as a simplification, m_s is assumed to be zero when calculating sufficient statistics for estimation of x [10].\n\nThe solution to both JFA and FA can be unified. For the JFA problem, if we stack the matrices [U V], then the problem reverts to the FA problem. Therefore, we initially study the FA problem. Note that we also restrict our work to only one EM iteration of the estimation of the factors, since this strategy works well in practice.\n\nThe standard ML solution to FA [9] for one EM iteration can be written as\n\n[U^t \u03a3^{\u22121} (N \u2297 I_n) U] x = U^t \u03a3^{\u22121} [F \u2212 (N \u2297 I_n) m]    (25)\n\nwhere F is the vector of first-order sufficient statistics, and N is the diagonal matrix of zeroth-order statistics (expected counts). The sufficient statistics are obtained from the UBM applied to an input set of feature vectors. We first let N_t = \u2211_{i=1}^{N} N_i and multiply both sides of (25) by 1/N_t. 
Now use relevance MAP with a small relevance factor and F and N to obtain m_s; i.e., both m_s \u2212 m and F \u2212 (N \u2297 I_n)m will be nearly zero in the entries corresponding to small N_i. We obtain\n\n[U^t \u03a3^{\u22121} (\u03bb_s \u2297 I_n) U] x = U^t \u03a3^{\u22121} (\u03bb_s \u2297 I_n) [m_s \u2212 m]    (26)\n\nwhere \u03bb_s is the speaker-dependent mixture weights. We note that (26) are the normal equations for the least-squares problem, x\u0302 = argmin_x \u2016Ux \u2212 (m_s \u2212 m)\u2016_D, where D is given below. This solution is not unexpected since ML estimates commonly lead to least-squares problems with GMM-distributed data [11].\n\nOnce the solution to (26) is obtained, the resulting Ux is subtracted from an estimate of the speaker mean, m_s, to obtain the compensated mean. If we assume that m_s is obtained by relevance MAP adaptation from the statistics F and N with a small relevance factor, then the FA process is well approximated by\n\nm_s \u21a6 Q_{U,D} m_s    (27)\n\nwhere\n\nD = (\u03bb_s^{1/2} \u2297 I_n) \u03a3^{\u22121/2}.    (28)\n\nJFA becomes an extension of the FA process we have demonstrated. One first projects onto the stacked U V space. Then another projection is performed to eliminate the U component of variability. This can be expressed as a single oblique projection; i.e., the JFA process is\n\nm_s \u21a6 O_{V,U,I} P_{[U V],D} m_s = O_{V,U,D} m_s.    (29)\n\n5.3 Comments and Analysis\n\nSeveral comments should be made on compensation schemes and their use in speaker comparison. First, although NAP and ML FA (27) were derived in substantially different ways, they are essentially the same operation, an orthogonal projection. The main difference is in the choice of metrics under which they were originally proposed. 
For NAP, the metric depends on the UBM only, and for FA it is utterance and UBM dependent.\n\nA second observation is that the JFA oblique projection onto V has substantially different properties than a standard orthogonal projection. When JFA is used in speaker recognition [5, 6], typically JFA is performed in training, but the test utterance is compensated only with FA. In our notation, applying JFA with linear scoring [6] gives\n\nC_Q(O_{V,U,D_1} m_1, Q_{U,D_2} m_2)    (30)\n\nwhere m_1 and m_2 are the mean parameter vectors estimated from the training and testing utterances, respectively; also, D_1 = (\u03bb_1^{1/2} \u2297 I_n) \u03a3^{\u22121/2} and D_2 = (\u03bb_2^{1/2} \u2297 I_n) \u03a3^{\u22121/2}. Our goal in the experiments section is to disentangle and understand some of the properties of scoring methods such as (30). What is significant in this process\u2014mismatched train/test compensation, data-dependent metrics, or asymmetric scoring?\n\nA final note is that training the subspaces for the various projections optimally is not a process that is completely understood. One difficulty is that the metric used for the inner product may not correspond to the metric for compensation. As a baseline, we used the same subspace for all comparison functions. The subspace was obtained with an ML-style procedure for training subspaces similar to [11] but specialized to the factor analysis problem as in [5].\n\n6 Speaker Comparison Experiments\n\nExperiments were performed on the NIST 2006 speaker recognition evaluation (SRE) data set. Enrollment/verification methodology and the evaluation criterion, equal error rate (EER) and minDCF, were based on the NIST SRE evaluation plan [12]. The main focus of our efforts was the one-conversation enroll, one-conversation verification task for telephone-recorded speech. T-Norm models and Z-Norm [13] speech utterances were drawn from the NIST 2004 SRE corpus. 
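For reference, the equal error rate used as one evaluation criterion is the operating point where the miss and false-alarm rates coincide. The following sketch computes it by a simple threshold sweep; the score values are synthetic stand-ins, not our evaluation code:

```python
import numpy as np

# Illustrative EER computation: find the threshold where the miss rate and the
# false-alarm rate are closest, and report their average there.

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over the pooled scores; return the EER approximation."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, best_eer = np.inf, 1.0
    for t in thresholds:
        p_miss = np.mean(target_scores < t)     # targets rejected
        p_fa = np.mean(nontarget_scores >= t)   # nontargets accepted
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, best_eer = gap, 0.5 * (p_miss + p_fa)
    return best_eer

targets = np.array([2.0, 3.0, 4.0])      # synthetic same-speaker trial scores
nontargets = np.array([-1.0, 0.0, 1.0])  # synthetic different-speaker trial scores

# Perfectly separated scores give an EER of zero.
assert equal_error_rate(targets, nontargets) == 0.0
```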
Results were obtained for both the English-only task (Eng) and for all trials (All), which includes speakers that enroll/verify in different languages.\n\nFeature extraction was performed using HTK [14] with 20 MFCC coefficients, deltas, and acceleration coefficients for a total of 60 features. A GMM UBM with 512 mixture components was trained using data from NIST SRE 2004 and from Switchboard corpora. The dimension of the nuisance subspace, U, was fixed at 100; the dimension of the speaker space, V, was fixed at 300. Results are in Table 1. In the table, we use the following notation,\n\nD_UBM = (\u03bb^{1/2} \u2297 I_n) \u03a3^{\u22121/2},  D_1 = (\u03bb_1^{1/2} \u2297 I_n) \u03a3^{\u22121/2},  D_2 = (\u03bb_2^{1/2} \u2297 I_n) \u03a3^{\u22121/2}    (31)\n\nwhere \u03bb are the UBM mixture weights, \u03bb_1 are the mixture weights estimated from the enrollment utterance, and \u03bb_2 are the mixture weights estimated from the verification utterance. We also use the notation D_L, D_G, and D_F to denote the parameters of the metric for the GLDS, Gaussian, and Fisher comparison functions from Sections 4.2, 4.3, and 4.4, respectively.\n\nAn analysis of the results in Table 1 shows several trends. First, the performance of the best IPDF configurations is as good as or better than the state-of-the-art SVM and JFA implementations. Second, the compensation method that dominates good performance is an orthogonal complement of the nuisance subspace, Q_{U,D}. 
Table 1: A comparison of baseline systems and different IPDF implementations\n\nComparison Function | Enroll Comp. | Verify Comp. | EER All (%) | minDCF All (x100) | EER Eng (%) | minDCF Eng (x100)\nBaseline SVM | Q_{U,D_UBM} | Q_{U,D_UBM} | 3.82 | 1.82 | 2.62 | 1.17\nBaseline JFA, C_Q | O_{V,U,D_1} | Q_{U,D_2} | 3.07 | 1.57 | 2.11 | 1.23\nC_KL | O_{V,U,D_1} | Q_{U,D_2} | 3.21 | 1.70 | 2.32 | 1.32\nC_KL | O_{V,U,D_1} | O_{V,U,D_2} | 8.73 | 5.06 | 8.06 | 4.45\nC_KL | Q_{U,D_1} | Q_{U,D_2} | 2.93 | 1.55 | 1.89 | 0.93\nC_KL | Q_{U,D_UBM} | Q_{U,D_UBM} | 3.03 | 1.55 | 1.92 | 0.95\nC_KL | I \u2212 O_{U,V,D_1} | I \u2212 O_{U,V,D_2} | 7.10 | 3.60 | 6.49 | 3.13\nC_GM | Q_{U,D_1} | Q_{U,D_2} | 2.90 | 1.59 | 1.73 | 0.98\nC_GM | Q_{U,D_UBM} | Q_{U,D_UBM} | 3.01 | 1.66 | 1.89 | 1.05\nC_GM | Q_{U,D_UBM} | I | 3.95 | 1.93 | 2.76 | 1.26\nK_KL | Q_{U,D_UBM} | Q_{U,D_UBM} | 4.95 | 2.46 | 3.73 | 1.75\nK_KL | Q_{U,D_1} | Q_{U,D_2} | 5.52 | 2.85 | 4.43 | 2.15\nC_GLDS | Q_{U,D_L} | Q_{U,D_L} | 3.60 | 1.93 | 2.27 | 1.23\nC_G | Q_{U,D_G} | Q_{U,D_G} | 5.07 | 2.52 | 3.89 | 1.87\nC_F | Q_{U,D_F} | Q_{U,D_F} | 3.56 | 1.89 | 2.22 | 1.12\n\nTable 2: Summary of some IPDF performances and computation time normalized to a baseline system. Compute time includes compensation and inner product only.\n\nComparison Function | Enroll Comp. | Verify Comp. | EER Eng (%) | minDCF Eng (x100) | Compute time\nC_Q | O_{V,U,D_1} | Q_{U,D_2} | 2.11 | 1.23 | 1.00\nC_GM | Q_{U,D_1} | Q_{U,D_2} | 1.73 | 0.98 | 0.17\nC_GM | Q_{U,D_UBM} | Q_{U,D_UBM} | 1.89 | 1.05 | 0.08\nC_GM | Q_{U,D_UBM} | I | 2.76 | 1.26 | 0.04\n\nCombining a nuisance projection with an oblique projection is fine, but using only oblique projections onto V gives high error rates. A third observation is that comparison functions whose metrics incorporate \u03bb_1 and \u03bb_2 perform significantly better than ones with fixed \u03bb from the UBM. In terms of best performance, C_KL, C_Q, and C_GM perform similarly. 
For example, the 95% confidence interval for 2.90% EER is [2.6, 3.3]%. We also observe that a nuisance projection with fixed D_UBM gives similar performance to a projection involving a \u201cvariable\u201d metric, D_i. This property is fortuitous since a fixed projection can be precomputed and stored and involves significantly reduced computation. Table 2 shows a comparison of error rates and compute times normalized by a baseline system. For the table, we used precomputed data as much as possible to minimize compute times. We see that with an order-of-magnitude reduction in computation and a significantly simpler implementation, we can achieve the same error rate.\n\n7 Conclusions and Future Work\n\nWe proposed a new framework for speaker comparison, IPDFs, and showed that several recent systems in the speaker recognition literature can be placed in this framework. We demonstrated that using mixture weights in the inner product is the key component to achieve significant reductions in error rates over a baseline SVM system. We also showed that elimination of the nuisance subspace via an orthogonal projection is a computationally simple and effective method of compensation. Most effective methods of compensation in the literature (NAP, FA, JFA) are straightforward variations of this idea. By exploring different IPDFs using these insights, we showed that computation can be reduced substantially over baseline systems with similar accuracy to the best-performing systems. Future work includes understanding the performance of IPDFs for different tasks, incorporating them into an SVM system, and hyperparameter training.\n\nReferences\n\n[1] Douglas A. Reynolds, T. F. Quatieri, and R. Dunn, \u201cSpeaker verification using adapted Gaussian mixture models,\u201d Digital Signal Processing, vol. 10, no. 1-3, pp. 19\u201341, 2000.\n\n[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. 
Solomonoff, \u201cSVM based speaker verification using a GMM supervector kernel and NAP variability compensation,\u201d in Proc. ICASSP, 2006, pp. I97\u2013I100.\n\n[3] C. Longworth and M. J. F. Gales, \u201cDerivative and parametric kernels for speaker verification,\u201d in Proc. Interspeech, 2007, pp. 310\u2013313.\n\n[4] W. M. Campbell, \u201cGeneralized linear discriminant sequence kernels for speaker recognition,\u201d in Proc. ICASSP, 2002, pp. 161\u2013164.\n\n[5] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, \u201cA study of inter-speaker variability in speaker verification,\u201d IEEE Transactions on Audio, Speech and Language Processing, 2008.\n\n[6] Ondrej Glembek, Lukas Burget, Najim Dehak, Niko Brummer, and Patrick Kenny, \u201cComparison of scoring methods used in speaker recognition with joint factor analysis,\u201d in Proc. ICASSP, 2009.\n\n[7] Pedro J. Moreno, Purdy P. Ho, and Nuno Vasconcelos, \u201cA Kullback-Leibler divergence based kernel for SVM classification in multimedia applications,\u201d in Adv. in Neural Inf. Proc. Systems 16, S. Thrun, L. Saul, and B. Sch\u00f6lkopf, Eds. MIT Press, Cambridge, MA, 2004.\n\n[8] Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.\n\n[9] Simon Lucey and Tsuhan Chen, \u201cImproved speaker verification through probabilistic subspace adaptation,\u201d in Proc. Interspeech, 2003, pp. 2021\u20132024.\n\n[10] Robbie Vogt, Brendan Baker, and Sridha Sridharan, \u201cModelling session variability in text-independent speaker verification,\u201d in Proc. Interspeech, 2005, pp. 3117\u20133120.\n\n[11] Mark J. F. Gales, \u201cCluster adaptive training of hidden Markov models,\u201d IEEE Trans. Speech and Audio Processing, vol. 8, no. 4, pp. 417\u2013428, 2000.\n\n[12] M. A. Przybocki, A. F. Martin, and A. N. 
Le, \u201cNIST speaker recognition evaluations utilizing the Mixer corpora\u20142004, 2005, 2006,\u201d IEEE Trans. on Speech, Audio, Lang., vol. 15, no. 7, pp. 1951\u20131959, 2007.\n\n[13] Roland Auckenthaler, Michael Carey, and Harvey Lloyd-Thomas, \u201cScore normalization for text-independent speaker verification systems,\u201d Digital Signal Processing, vol. 10, pp. 42\u201354, 2000.\n\n[14] J. Odell, D. Ollason, P. Woodland, S. Young, and J. Jansen, The HTK Book for HTK V2.0, Cambridge University Press, Cambridge, UK, 1995.\n", "award": [], "sourceid": 364, "authors": [{"given_name": "Zahi", "family_name": "Karam", "institution": null}, {"given_name": "Douglas", "family_name": "Sturim", "institution": null}, {"given_name": "William", "family_name": "Campbell", "institution": null}]}