{"title": "Phoneme Classification using Constrained Variational Gaussian Process Dynamical System", "book": "Advances in Neural Information Processing Systems", "page_first": 2006, "page_last": 2014, "abstract": "This paper describes a new acoustic model for phoneme classification based on the variational Gaussian process dynamical system (VGPDS). The proposed model overcomes the limitations of the classical HMM in modeling real speech data by adopting a nonlinear and nonparametric model. In our model, the GP prior on the dynamics function enables the complex dynamic structure of speech to be represented, while the GP prior on the emission function successfully models the global dependency over the observations. Additionally, we introduce a variance constraint to the original VGPDS to mitigate the sparse approximation error of the kernel matrix. The effectiveness of the proposed model is demonstrated with extensive experimental results, including parameter estimation and classification performance on synthetic and benchmark datasets.", "full_text": "Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Hyunsin Park, Department of EE, KAIST, Daejeon, South Korea, hs.park@kaist.ac.kr
Sungrack Yun, Qualcomm Korea, Seoul, South Korea, sungrack@qualcomm.com
Sanghyuk Park, Department of EE, KAIST, Daejeon, South Korea, shine0624@kaist.ac.kr
Jongmin Kim, Department of EE, KAIST, Daejeon, South Korea, kimjm0309@gmail.com
Chang D. Yoo, Department of EE, KAIST, Daejeon, South Korea, cdyoo@ee.kaist.ac.kr

Abstract

For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. 
The Gaussian process priors on the dynamics and emission functions enable the complex dynamic structure and the long-range dependency of speech to be represented better than by an HMM. In addition, a variance constraint is introduced into the VGPDS to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three sets of experimental results, including parameter estimation and classification performance, on synthetic and benchmark datasets.

1 Introduction

Automatic speech recognition (ASR), the process of automatically translating spoken words into text, has been an important research topic for several decades owing to its wide array of potential applications in the area of human-computer interaction (HCI). State-of-the-art ASR systems typically use hidden Markov models (HMMs) [1] to model the sequential articulatory structure of speech signals. There are various issues to consider in designing a successful ASR system, and the following two limitations of an HMM certainly need to be overcome. 1) An HMM with a first-order Markovian structure captures only short-range dependency in observations, whereas speech requires a more flexible model that can capture long-range dependency. 2) Discrete latent state variables and sudden state transitions give an HMM limited capacity to represent the continuous and complex dynamic structure of speech. These limitations must be addressed when seeking to improve the performance of an ASR system.
To overcome these limitations, various models have been considered for the complex structure of speech. For example, the stochastic segment model [2] is a well-known generalization of the HMM that represents long-range dependency over observations using a time-dependent emission function. 
The hidden dynamical model [3] has been used to model the complex nonlinear dynamics of the physiological articulator.
Another promising research direction is to consider a nonparametric Bayesian model for nonlinear probabilistic modeling of speech. Because nonparametric models do not assume any fixed model structure, they are generally more flexible than parametric models and can naturally allow dependency among observations. The Gaussian process (GP) [4], a stochastic process over a real-valued function, has been a key ingredient in solving problems such as nonlinear regression and classification. As a standard supervised learning task using the GP, Gaussian process regression (GPR) offers a nonparametric Bayesian framework to infer the nonlinear latent function relating the input and output data. Recently, researchers have begun applying the GP to unsupervised learning tasks with high-dimensional data, such as the Gaussian process latent variable model (GP-LVM) for dimensionality reduction [5-6]. In [7], a variational inference framework was proposed for training the GP-LVM; the variational approach is one of the sparse approximation approaches [8]. The framework was extended to the variational Gaussian process dynamical system (VGPDS) in [9] by augmenting latent dynamics for modeling high-dimensional time series data. High-dimensional time series arise in many applications of machine learning, such as robotics (sensor data), computational biology (gene expression data), computer vision (video sequences), and graphics (motion capture data). However, no previous work has considered a GP-based approach for speech recognition tasks, which involve high-dimensional time series data.
In this paper, we propose a GP-based acoustic model for phoneme classification. 
The proposed model is based on the assumption that the continuous dynamics and nonlinearity of the VGPDS can represent the statistical characteristics of real speech better than an HMM. The GP prior over the emission function allows the model to represent long-range dependency over the observations of speech, which the HMM cannot. Furthermore, the GP prior over the dynamics function enables the model to capture the nonlinear dynamics of the physiological articulator.
Our contributions are as follows: 1) we introduce a GP-based model for phoneme classification tasks for the first time, showing that the model has the potential to describe the underlying characteristics of speech in a nonparametric way; 2) we propose a prior on the hyperparameters and a variance constraint that are specially designed for ASR; and 3) we provide extensive experimental results and analyses that clearly reveal the strengths of the proposed model.
The remainder of the paper is structured as follows: Section 2 introduces the proposed model after a brief description of the VGPDS. Section 3 provides extensive experimental evaluations demonstrating the effectiveness of our model, and Section 4 concludes the paper with a discussion and plans for future work.

2 Acoustic modeling using Gaussian Processes

2.1 Variational Gaussian Process Dynamical System

The VGPDS [9] models time series data by assuming that there exist latent states that govern the data. Let Y = [[y_11, ..., y_N1]^T, ..., [y_1D, ..., y_ND]^T] ∈ R^{N×D}, t = [t_1, ..., t_N]^T ∈ R_+^N, and X = [[x_11, ..., x_N1]^T, ..., [x_1Q, ..., x_NQ]^T] ∈ R^{N×Q} be the observed data, time, and corresponding latent states, where N, D, and Q (< D) are the number of samples, the dimension of the observation space, and the dimension of the latent space, respectively. 
In the VGPDS, these variables are related as follows:

x_nj = g_j(t_n) + η_nj,  η_nj ~ N(0, 1/β_j^x),
y_ni = f_i(x_n) + ε_ni,  ε_ni ~ N(0, 1/β_i^y),    (1)

where f_i(x) ~ GP(μ_i^f(x), k_i^f(x, x')) and g_j(t) ~ GP(μ_j^g(t), k_j^g(t, t')) are the emission function from the latent space to the i-th dimension of the observation space and the dynamics function from the time space to the j-th dimension of the latent space, respectively. Here, n ∈ {1, ..., N}, i ∈ {1, ..., D}, and j ∈ {1, ..., Q}. In this paper, a zero-mean function is used for all GPs. Fig. 1 shows graphical representations of the HMM and the VGPDS. Although the Gaussian process dynamical model (GPDM) [10], which involves an auto-regressive dynamics function, is also a GP-based model for time series, it is not considered in this paper.
The marginal likelihood of the VGPDS is given as

p(Y|t) = ∫ p(Y|X) p(X|t) dX.    (2)

Figure 1: Graphical representations of (left) the left-to-right HMM and (right) the VGPDS: In the left figure, y_n ∈ R^D and x_n ∈ {1, ..., C} are observations and discrete latent states. In the right figure, y_ni, f_ni, x_nj, g_nj, and t_n are observations, emission function points, latent states, dynamics function points, and times, respectively. All function points in the same plate are fully connected.

Since the integral in Eq. (2) is not tractable, a variational method is used by introducing a variational distribution q(X). 
A variational lower bound on the logarithm of the marginal likelihood is

log p(Y|t) ≥ ∫ q(X) log [p(Y|X) p(X|t) / q(X)] dX
           = ∫ q(X) log p(Y|X) dX − ∫ q(X) log [q(X) / p(X|t)] dX
           = L − KL(q(X) || p(X|t)).    (3)

By the assumption of independence over the observation dimensions, the first term in Eq. (3) is given as

L = Σ_{i=1}^D ∫ q(X) log p(y_i|X) dX = Σ_{i=1}^D L_i.    (4)

In [9], a variational approach involving a sparse approximation of the covariance matrix obtained from the GP is proposed. The variational lower bound on L_i is given as

L_i ≥ log [ (β_i^y)^{N/2} |K̃_i|^{1/2} / ((2π)^{N/2} |β_i^y Ψ_{2i} + K̃_i|^{1/2}) · exp(−(1/2) y_i^T W_i y_i) ] − (β_i^y / 2) (ψ_{0i} − Tr(K̃_i^{−1} Ψ_{2i})),    (5)

where W_i = β_i^y I_N − (β_i^y)^2 Ψ_{1i} (β_i^y Ψ_{2i} + K̃_i)^{−1} Ψ_{1i}^T. Here, K̃_i ∈ R^{M×M} is a kernel matrix calculated using the i-th kernel function and the inducing input variables X̃ ∈ R^{M×Q} that are used for the sparse approximation of the full kernel matrix K_i. The closed forms of the statistics {ψ_{0i}, Ψ_{1i}, Ψ_{2i}}_{i=1}^D, which are functions of the variational parameters and the inducing points, can be found in [9]. In the second term of Eq. (3), p(X|t) = Π_{j=1}^Q p(x_j) and q(X) = Π_{n=1}^N Π_{j=1}^Q N(μ_nj, s_nj) are the prior on the latent states and the variational distribution used for approximating the posterior of the latent states, respectively.
The parameter set Θ, which consists of the hyperparameters {θ^f, θ^g} of the kernel functions, the noise variances {β^y, β^x}, the variational parameters {[μ_n1, ..., μ_nQ], [s_n1, ..., s_nQ]}_{n=1}^N of q(X), and the inducing input points X̃, is estimated by maximizing the lower bound on log p(Y|t) in Eq. 
(3) using a scaled conjugate gradient (SCG) algorithm.

2.2 Acoustic modeling using VGPDS

For several decades, the HMM has been the predominant model for acoustic speech modeling. However, as mentioned in Section 1, the model suffers from two major limitations: discrete state variables and a first-order Markovian structure that can model only short-range dependency over the observations.
To overcome these limitations of the HMM, we propose an acoustic speech model based on the VGPDS, a nonlinear and nonparametric model that can represent the complex dynamic structure of speech and long-range dependency over the observations. In addition, we describe various implementation issues that arise in fitting the model to large-scale speech data.

2.2.1 Time scale modification

The time length of each phoneme segment in an utterance varies with conditions such as the position of the phoneme segment in the utterance, emotion, gender, and other speaker and environment conditions. To incorporate this fact into the proposed acoustic model, the time points t_n are modified as follows:

t_n = (n − 1) / (N − 1),    (6)

where n and N are the observation index and the number of observations in a phoneme segment, respectively. This time scale modification makes all phoneme signals have unit time length.

2.2.2 Hyperparameters

To compute the kernel matrices in Eq. (5), the kernel functions must be defined. We use the radial basis function (RBF) kernel for the emission function f:

k^f(x, x') = α^f exp(−Σ_{j=1}^Q ω_j^f (x_j − x_j')^2),    (7)

where α^f and ω_j^f are the RBF kernel variance and the j-th inverse length scale, respectively. The RBF kernel function is adopted to represent the smoothness of speech. 
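As an illustration, the time scale modification of Eq. (6) and the per-dimension RBF emission kernel of Eq. (7) can be sketched in NumPy as follows (a minimal sketch; the function names are ours, not from any speech toolkit):

```python
import numpy as np

def modified_times(N):
    # Eq. (6): map frame indices n = 1..N onto the unit interval [0, 1],
    # so every phoneme segment has unit time length.
    n = np.arange(1, N + 1)
    return (n - 1.0) / (N - 1.0)

def rbf_emission_kernel(X, Xp, alpha, omega):
    # Eq. (7): k^f(x, x') = alpha * exp(-sum_j omega_j * (x_j - x'_j)^2),
    # with one inverse length scale omega_j per latent dimension.
    # X: (N, Q), Xp: (M, Q), alpha: scalar variance, omega: (Q,)
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2 * omega).sum(axis=-1)
    return alpha * np.exp(-sq)
```

A large ω_j makes points that differ in latent dimension j decorrelate quickly, which is how the kernel assigns a different smoothness to each latent coordinate.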
For the dynamics function g, the following kernel function is used:

k^g(t, t') = α^g exp(−ω^g (t − t')^2) + λ t t' + b,    (8)

where λ and b are the linear kernel variance and the bias, respectively. The above dynamics kernel, which consists of both linear and nonlinear components, is used to represent the complex dynamics of the articulator. All hyperparameters are assumed to be independent in this paper.
In [9], the same kernel function parameters are shared over all dimensions of human-motion capture data and high-dimensional raw video data. However, such extensive sharing of the hyperparameters is unsuitable for speech modeling. Even though each dimension of the observations is normalized in advance to have unit variance, the signal-to-noise ratio (SNR) is not consistent over all dimensions. To handle this problem, this paper models each dimension independently using different kernel function parameters. Therefore, the hyperparameter sets are defined as θ^f = {α_i^f, {ω_{1i}^f, ..., ω_{Qi}^f}}_{i=1}^D and θ^g = {α_j^g, ω_j^g, λ_j, b_j}_{j=1}^Q.

2.2.3 Priors on the hyperparameters

In the parameter estimation of the VGPDS, the SCG algorithm does not guarantee an optimal solution. To alleviate this problem, we place the following prior on the hyperparameters of the kernel functions:

p(γ) ∝ exp(−γ^2 / γ̄),    (9)

where γ ∈ {θ^f, θ^g} is a hyperparameter and γ̄ is the corresponding parameter of the prior. In this paper, γ̄ is set to the sample variance for the hyperparameters of the emission kernel functions, and γ̄ is set to 1 for the hyperparameters of the dynamics kernel functions. 
Uniform priors are adopted for the other hyperparameters, and the parameters of the VGPDS are then estimated by maximizing the joint distribution p(Y, Θ|t) = p(Y|t, Θ) p(Θ).

2.2.4 Variance constraint

In the lower bound of Eq. (5), the second term on the right-hand side is a regularization term that represents the sparse approximation error of the full kernel matrix K_i. Note that with more inducing input points, the approximation error becomes smaller. However, only a small number of inducing input points can be used owing to the limited availability of computational power, which increases the effect of the regularization term.
To mitigate this problem, we introduce the following constraint on the diagonal terms of the covariance matrix:

Tr(⟨K_i⟩_{q(X)}) / N + 1/β_i^y = σ_i^2,    (10)

where ⟨K_i⟩_{q(X)} and σ_i^2 are the expectation of the full kernel matrix K_i and the sample variance of the i-th dimension of the observations, respectively. This constraint is designed so that the variance of each observation dimension calculated from the estimated model equals the sample variance. Using ψ_{0i} = Tr(⟨K_i⟩_{q(X)}), the inverse noise variance parameter is obtained directly as β_i^y = (σ_i^2 − ψ_{0i}/N)^{−1} without separate gradient-based optimization. The partial derivative ∂ log β_i^y / ∂ψ_{0i} = 1 / (N σ_i^2 − ψ_{0i}) is then used in the SCG-based optimization. In Section 3.1, the effectiveness of the variance constraint is demonstrated empirically.

2.3 Classification

For classification with trained VGPDSs, maximum-likelihood (ML) decoding is used. Let D^(l) = {Y^(l), t^(l)} and Θ^(l) be the observation and parameter sets of the l-th VGPDS, respectively. 
Given the test data D* = {Y*, t*}, the classification result l̂ ∈ {1, ..., L} can be obtained by

l̂ = argmax_l log p(Y*|t*, Y^(l), t^(l), Θ^(l))
  = argmax_l log [ p(Y^(l), Y*|t^(l), t*, Θ^(l)) / p(Y^(l)|t^(l), Θ^(l)) ].    (11)

3 Experiments

To evaluate the effectiveness of the proposed model, three kinds of experiments have been designed:

1. Parameter estimation: validating the effectiveness of the proposed variance constraint (Section 2.2.4) on model parameter estimation
2. Two-class classification using synthetic data: demonstrating explicitly the advantages of the proposed model over the HMM with respect to the degree of dependency over the observations
3. Phoneme classification: evaluating the performance of the proposed model on real speech data

Each experiment is described in detail in the following subsections. In this paper, the proposed model is referred to as the constrained VGPDS (CVGPDS).

3.1 Parameter estimation

In this subsection, parameter estimation experiments on synthetic data are described. The synthetic data are generated using a phoneme model that is selected from the models trained in Section 3.3 and then modified: the RBF kernel variances of the emission functions and the emission noise variances are changed from those of the selected model. In this experiment, the emission noise variances and the inducing input points are estimated, while all other parameters are fixed to the true values used in generating the data.
Fig. 2 shows the parameter estimation results. The estimates of the 39-dimensional noise variances of the emission functions are shown together with the true noise variances, the true RBF kernel variances, and the sample variances of the synthetic data. 
The top row shows the estimation results without the variance constraint, and the bottom row shows those with the variance constraint.

Figure 2: Results of parameter estimation: (top-left) VGPDS with M = 5, (top-right) VGPDS with M = 30, and (bottom) CVGPDS with M = 5

By comparing the two figures on the top row, we can confirm that the estimate of the noise variance with M = 30 inducing input points is better than that with M = 5 inducing input points. This result is expected, in the sense that smaller values of M produce larger errors in the sparse approximation of the covariance matrix. However, both noise variance estimates still differ from the true values. By comparing the top and bottom rows, we can see that the proposed CVGPDS outperforms the VGPDS in terms of parameter estimation. Remarkably, the estimate from the CVGPDS with M = 5 inducing input points is much better than that from the VGPDS with M = 30. Based on these observations, we conclude that the proposed CVGPDS is considerably more robust to the sparse approximation error than the VGPDS, as claimed in Section 2.2.4.

3.2 Two-class classification using synthetic data

This section aims to show that when there is strong dependency over the observations, the proposed CVGPDS is a more appropriate model than the HMM for the classification task. To this end, we first generated several two-class classification datasets with different degrees of dependency over the observations. The classification task considered is to map each input segment to one of two class labels. Using s ∈ {1, ..., S} as the segment index, the synthetic dataset D = {Y_s, t_s, l_s}_{s=1}^S consists of S segments, where the s-th segment has N_s samples. Here, Y_s ∈ R^{N_s×D}, t_s ∈ R^{N_s}, and l_s are the observation data, time, and class label of the s-th segment, respectively. 
The synthetic dataset is generated as follows:

• The mean and kernel functions of the two GPs g_j(t) and f_i(x) are defined as

g_j(t): μ_j^g(t) = a_j t + b_j,  k_j^g(t, t') = 1_{t=t'},
f_i(x): μ_i^f(x) = Σ_{z=1}^{Z_i} w^z N(x; m_i^z, Λ_i^z),  k_i^f(x, x') = α_i exp(−ω_i ||x − x'||),    (12)

where {a_j, b_j}, {w^z, m_i^z, Λ_i^z}, and {α_i, ω_i} are respectively the parameters of the linear, Gaussian mixture, and RBF kernel functions. The superscript z denotes the component index of the Gaussian mixture, and Z_i is the number of components in f_i(x).

• For the s-th segment, {Y_s, t_s, l_s},
1. l_s is selected as either class 1 or class 2.
2. N_s is randomly selected from the interval [20, 30], and t_s is obtained by using Eq. (6).
3. From t_s, the mean vector μ_j^g(t_s) and covariance matrix K_j^g are obtained for j = 1, ..., Q. Let X_s ∈ R^{N_s×Q} be the latent state of the s-th segment. Then, the j-th column of X_s is generated from the N_s-dimensional Gaussian distribution N(μ_j^g(t_s), K_j^g).
4. From X_s, the mean vector μ_i^f(X_s) and covariance matrix K_i^f are obtained for i = 1, ..., D. Then, the i-th column of Y_s is generated from the N_s-dimensional Gaussian distribution N(μ_i^f(X_s), K_i^f).

Note that the parameter ω_i controls the degree of dependency over the observations. For instance, if ω_i decreases, the off-diagonal terms of the emission kernel matrix K_i^f increase, which means stronger correlations over the observations.
The experimental setups are as follows. The synthesized dataset consists of 200 segments in total (100 segments per class). The dimensions of the latent space and the observation space are set to Q = 2 and D = 5, respectively. We use Z_i = 6 components for the mean function of the emission function. 
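The generation procedure above can be sketched as follows for a single segment. This is an illustrative reconstruction, not the authors' code: the slope/offset values, the single-component mixture mean, and the values of ω_i and α_i below are placeholders, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, D = 2, 5                   # latent and observation dimensions, as in the paper
omega_i, alpha_i = 0.5, 1.0   # placeholder RBF kernel parameters (shared over i here)

# Step 2: segment length from [20, 30] and unit-scaled times (Eq. (6))
Ns = int(rng.integers(20, 31))
ts = (np.arange(1, Ns + 1) - 1.0) / (Ns - 1.0)

# Step 3: latent states; mean mu_j^g(t) = a_j * t + b_j, kernel k_j^g = 1_{t = t'}
a = rng.normal(size=Q)
b = rng.normal(size=Q)
Xs = np.column_stack([
    rng.multivariate_normal(aj * ts + bj, np.eye(Ns)) for aj, bj in zip(a, b)
])

# Step 4: observations; k_i^f(x, x') = alpha_i * exp(-omega_i * ||x - x'||),
# with a one-component stand-in for the Gaussian-mixture mean mu_i^f(x)
m = rng.normal(size=Q)
dist = np.sqrt(((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1))
Kf = alpha_i * np.exp(-omega_i * dist) + 1e-8 * np.eye(Ns)  # jitter for stability
mu_f = np.exp(-0.5 * ((Xs - m) ** 2).sum(1))  # unnormalized single-component mean
Ys = np.column_stack([rng.multivariate_normal(mu_f, Kf) for _ in range(D)])
```

Shrinking omega_i inflates the off-diagonal entries of Kf, i.e. strengthens the correlation over observations, which is exactly the knob that separates the two classes in this experiment.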
In this experiment, three datasets are synthesized and used to compare the CVGPDS and the HMM. When generating each dataset, we use two different ω_i values, one for each class, while all other parameters in Eq. (12) are shared between the two classes. As a result, the degree of correlation between the observations is the only factor that distinguishes the two classes. The three generated datasets have different degrees of correlation over the observations, obtained by setting different ω_i values for each dataset. In particular, the third dataset is constructed with the two limitations of the HMM in mind, so that it is well represented by an HMM. This is achieved simply by changing the form of the mean function μ_j^g(t) from a linear to a step function and setting ω_i = ∞ so that each data sample is generated independently of the others. In the third dataset, the two classes are instead set to have different α_i values. The classification experiments are conducted using an HMM and the CVGPDS.

Table 1: Classification accuracy for the two-class synthetic datasets (10-fold CV average [%]): All parameters except ω_i are set to be equal for classes 1 and 2. In the case of ω_i = ∞, the α_i are set to be different.

ω_i (class 1 : class 2) | 0.1 : 0.5 | 1.0 : 2.0 | ∞ : ∞
HMM                     | 61.0      | 68.5      | 88.5
CVGPDS                  | 78.0      | 79.0      | 92.0

Table 1 summarizes the classification performance of the HMM and the CVGPDS on the three synthetic datasets. Remarkably, in all cases the proposed CVGPDS outperforms the HMM, even in the case of ω_i = ∞ (the fourth column), where the dataset is assumed to follow HMM-like characteristics. Comparing the second and third columns of Table 1, we can see that the performance of the HMM degrades by 7.5% (from 68.5% to 61.0%) as ω_i becomes smaller, while the proposed CVGPDS almost maintains its performance with only a 1.0% reduction. 
This result demonstrates the superiority of the proposed CVGPDS in modeling data with strong correlations over the observations. Apparently, the HMM failed to distinguish two classes that differ only in the degree of dependency over the observations. In contrast, the proposed CVGPDS distinguishes the two classes more effectively by capturing the different degrees of inter-dependency over the observations incorporated in each class.

3.3 Phoneme classification

In this section, phoneme classification experiments on real speech data from the TIMIT database are described. The TIMIT database contains a total of 6300 phonetically rich utterances, each of which is manually segmented based on 61 phoneme transcriptions. Following the standard regrouping of phoneme labels [11], the 61 phonemes are reduced to 48 phonemes selected for modeling. As observations, 39-dimensional Mel-frequency cepstral coefficients (MFCCs) (13 static coefficients, Δ, and ΔΔ), extracted from the speech signals with a standard 25 ms frame size and 10 ms frame shift, are used. The dimension of the latent space is set to Q = 2.
For the first phoneme classification experiment, 100 segments per phoneme are randomly selected using the phoneme boundary information provided in the TIMIT database. The number of inducing input points is set to M = 10. A 10-fold cross-validation test was conducted to evaluate the proposed model in comparison with an HMM that has three states and a single Gaussian distribution with a full covariance matrix per state. 
Parameters of the HMMs are estimated by using the conventional expectation-maximization (EM) algorithm with a maximum-likelihood criterion.

Table 2: Classification accuracy on the 48-phoneme dataset (10-fold CV average [%]): 100 segments are used for training and testing each phoneme model

HMM   | VGPDS | CVGPDS
48.17 | 49.19 | 49.36

Table 2 shows the experimental results of the 48-phoneme classification. Compared to the HMM and the VGPDS, the proposed CVGPDS performs more effectively.
For the second phoneme classification experiment, the TIMIT core test set consisting of 192 sentences is used for evaluation. We use the same 100 segments for training the phoneme models as in the first phoneme classification experiment. The size of the training dataset is smaller than that of conventional approaches due to our limited computational resources. When evaluating the models, we merge the labels of the 48 phonemes into the commonly used 39 phonemes [11]. Given speech observations with boundary information, a sequence of log-likelihoods is obtained, and a bigram is then constructed to incorporate linguistic information into the classification score. In this experiment, the number of inducing input points is set to M = 5.

Table 3: Classification accuracy on the TIMIT core test set [%]: 100 segments are used for training each phoneme model

HMM   | VGPDS | CVGPDS
57.83 | 61.44 | 61.54

Table 3 shows the experimental results of phoneme classification on the TIMIT core test set. As with the results in Table 2, the proposed CVGPDS performed better than the HMM and the VGPDS. However, the classification accuracies in Table 3 are lower than the state-of-the-art phoneme classification results [12-13]. 
The reasons for the low accuracy are as follows: 1) an insufficient amount of data is used for training the models owing to the limited availability of computational power; and 2) a mixture model for the emission is not considered. These remaining issues need to be addressed for improved performance.

4 Conclusion

In this paper, a VGPDS-based acoustic model for phoneme classification was considered. The proposed acoustic model can represent nonlinear latent dynamics and dependency among observations through GP priors. In addition, we introduced a variance constraint on the VGPDS. Although the proposed model could not achieve state-of-the-art phoneme classification performance, the experimental results showed that the proposed acoustic model has potential for speech modeling. For future work, extensions to phonetic recognition and mixtures of the VGPDS will be considered.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2012-0005378 and No. 2012-0000985).

References

[1] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of the IEEE, Vol. 64, pp. 532-556, 1976.
[2] M. Ostendorf, V. Digalakis, and J. Rohlicek, “From HMMs to segment models: A unified view of stochastic modeling for speech recognition,” IEEE Trans. on Speech and Audio Processing, Vol. 4, pp. 360-378, 1996.
[3] L. Deng, D. Yu, and A. Acero, “Structured speech modeling,” IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, pp. 1492-1504, 2006.
[4] C. E. Rasmussen and C. K. I. Williams, “Gaussian Processes for Machine Learning,” MIT Press, Cambridge, MA, 2006.
[5] N. D. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research (JMLR), Vol. 6, pp. 1783-1816, 2005.
[6] N. D. 
Lawrence, “Learning for larger datasets with the Gaussian process latent variable model,” International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 243-250, 2007.
[7] M. K. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 844-851, 2010.
[8] J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research (JMLR), Vol. 6, pp. 1939-1959, 2005.
[9] A. C. Damianou, M. K. Titsias, and N. D. Lawrence, “Variational Gaussian process dynamical systems,” Advances in Neural Information Processing Systems (NIPS), 2011.
[10] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 30, pp. 283-298, 2008.
[11] K. F. Lee and H. W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, pp. 1641-1648, 1989.
[12] A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. on Audio, Speech, and Language Processing, Vol. 20, no. 1, pp. 14-22, 2012.
[13] F. Sha and L. K. Saul, “Large margin hidden Markov models for automatic speech recognition,” Advances in Neural Information Processing Systems (NIPS), 2007.", "award": [], "sourceid": 990, "authors": [{"given_name": "Hyunsin", "family_name": "Park", "institution": null}, {"given_name": "Sungrack", "family_name": "Yun", "institution": null}, {"given_name": "Sanghyuk", "family_name": "Park", "institution": null}, {"given_name": "Jongmin", "family_name": "Kim", "institution": null}, {"given_name": "Chang", "family_name": "Yoo", "institution": null}]}