{"title": "Sequential Noise Compensation by Sequential Monte Carlo Method", "book": "Advances in Neural Information Processing Systems", "page_first": 1205, "page_last": 1212, "abstract": null, "full_text": "Sequential noise compensation by\nsequential Monte Carlo method\nKaisheng Yao and Satoshi Nakamura\n\nATR Spoken Language Translation Research Laboratories\n2-2-2, Hikaridai Seika-cho, Souraku-gun, Kyoto, 619-0288, Japan\nE-mail: {kaisheng.yao, satoshi.nakamura}@slt.atr.co.jp\nAbstract\n\nWe present a sequential Monte Carlo method applied to additive\nnoise compensation for robust speech recognition in time-varying\nnoise. The method generates a set of samples according to the prior\ndistribution given by clean speech models and a noise prior evolved\nfrom the previous estimate. An explicit model representing noise\neffects on speech features is used, so that an extended Kalman filter\nis constructed for each sample, generating the updated continuous\nstate estimate as the estimate of the noise parameter, and the\nprediction likelihood for weighting each sample. Minimum mean square\nerror (MMSE) inference of the time-varying noise parameter is carried\nout over these samples by fusing the estimates of the samples\naccording to their weights. A residual resampling selection step and\na Metropolis-Hastings smoothing step are used to improve calculation\nefficiency. Experiments were conducted on speech recognition\nin simulated non-stationary noises, where the noise power changed\nartificially, and in highly non-stationary Machinegun noise. In all the\nexperiments carried out, we observed that the method yields a\nsignificant improvement in recognition performance over that achieved\nby noise compensation under a stationary noise assumption.\n1 Introduction\n\nSpeech recognition in noise is considered essential for real\napplications, and there have been active research efforts in this area. 
Among many approaches,\nthe model-based approach assumes explicit models representing noise effects on speech\nfeatures. In this approach, most research has focused on stationary or slowly varying\nnoise conditions. In this situation, environment noise parameters are often estimated\nbefore speech recognition from a small set of environment adaptation data.\nThe estimated environment noise parameters are then used to compensate noise\neffects in the feature or model space for recognition of noisy speech.\nHowever, it is well known that noise statistics may vary during recognition. In\nthis situation, the noise parameters estimated prior to speech recognition of the\nutterances may not be relevant to the subsequent frames of input speech if the\nenvironment changes.\nA number of techniques have been proposed to compensate time-varying noise\neffects. They can be categorized into two approaches. In the first approach,\ntime-varying environment sources are modeled by Hidden Markov Models (HMMs) or\nGaussian mixtures that were trained by prior measurement of environments, so\nthat noise compensation is a task of identifying the underlying state sequences\nof the noise HMMs, e.g., in [1], by maximum a posteriori (MAP) decision. This\napproach requires a model representing different environment conditions\n(signal-to-noise ratio, types of noise, etc.), so that statistics at some states or\nmixtures obtained before speech recognition are close to the real testing environments.\nIn the second approach, environment model parameters are assumed to be\ntime-varying, so it is not only an inference problem but also related to environment\nstatistics estimation during speech recognition. The parameters can be estimated\nby maximum likelihood estimation, e.g., the sequential EM algorithm [2][3][4]. They\ncan also be estimated by Bayesian methods. 
In the Bayesian methods, all relevant\ninformation on the set of environment parameters and speech parameters, which are\ndenoted as $\\theta(t)$ at frame $t$, is included in the posterior distribution given the\nobservation sequence $Y(0:t)$, i.e., $p(\\theta(t)|Y(0:t))$. Except for a few cases including the linear\nGaussian state space model (Kalman filter), it is intractable to evaluate the\ndistribution update analytically. Approximation techniques are required. For example,\nin [5], Laplace's method is used to approximate the joint distribution of speech\nand noise parameters by vector Taylor series. The approximated joint distribution\ngives an analytical formula for updating the posterior distribution.\nWe report an alternative approach for Bayesian estimation and compensation of\nnoise effects on speech features. The method is based on the sequential Monte Carlo\nmethod [6]. In the method, a set of samples is generated hierarchically from the prior\ndistribution given by speech models. A state space model representing noise effects\non speech features is used explicitly, and an extended Kalman filter (EKF) is\nconstructed in each sample. The prediction likelihood of the EKF in each sample gives\nits weight for selection, smoothing, and inference of the time-varying noise\nparameter, so that noise compensation is carried out afterwards. Since noise parameter\nestimation, noise compensation and speech recognition are carried out frame by\nframe, we denote this approach as sequential noise compensation.\n2 Speech and noise model\n\nOur work is on speech features derived from Mel Frequency Cepstral Coefficients\n(MFCC). They are generated by transforming signal power into the log-spectral domain, and\nfinally, by the discrete cosine transform (DCT), to the cepstral domain. The following\nderivation of the algorithm is in the log-spectral domain. Let $t$ denote the frame (time)\nindex.\nIn our work, speech and noise are respectively modeled by HMMs and a Gaussian\nmixture. 
For speech recognition in stationary additive noise, the following formula [4]\nhas been shown to be effective in compensating noise effects. For Gaussian\nmixture $k_t$ at state $s_t$, the Log-Add method transforms the mean vector $\\mu^l_{s_t k_t}$ of\nthe Gaussian mixture by\n$$\\hat{\\mu}^l_{s_t k_t} = \\mu^l_{s_t k_t} + \\log(1 + \\exp(\\mu^l_n - \\mu^l_{s_t k_t})) \\quad (1)$$\nwhere $\\mu^l_n$ is the mean vector in the noise model, $s_t \\in \\{1, \\ldots, S\\}$, and $k_t \\in \\{1, \\ldots, M\\}$.\n$S$ and $M$ denote the number of states in the speech models and the number of\nmixtures at each state, respectively. Superscript $l$ indicates that parameters are in the\nlog-spectral domain.\nAfter the transformation, the mean vector $\\hat{\\mu}^l_{s_t k_t}$ is further transformed by DCT,\nand then plugged into speech models for recognition of noisy speech. In the case of\ntime-varying noise, $\\mu^l_n$ should be a function of time, i.e., $\\mu^l_n(t)$. Accordingly,\nthe compensated mean is $\\hat{\\mu}^l_{s_t k_t}(t)$.\nFigure 1: The graphical model representation of the dependences of the speech and\nnoise model parameters. $s_t$ and $k_t$ denote the state and Gaussian mixture\nat frame $t$ in the speech models, respectively. $\\mu^l_{s_t k_t}(t)$ and $\\mu^l_n(t)$ denote the speech and noise\nparameters, respectively. $Y^l(t)$ is the noisy speech observation.\nThe following analysis is illustrated in Figure 1. In Gaussian mixture $k_t$ at state $s_t$\nof the speech model, the speech parameter $\\mu^l_{s_t k_t}(t)$ is assumed to be Gaussian distributed\nwith mean $\\mu^l_{s_t k_t}$ and variance $\\Sigma^l_{s_t k_t}$. 
On the other hand, since the environment\nparameter is assumed to be time-varying, the evolution of the environment mean\nvector can be modeled by a random walk, i.e.,\n$$\\mu^l_n(t) = \\mu^l_n(t-1) + v(t) \\quad (2)$$\nwhere $v(t)$ is the environment driving noise, Gaussian distributed with zero mean\nand variance $V$.\nThen, we have\n$$p(s_t, k_t, \\mu^l_{s_t k_t}(t), \\mu^l_n(t) | s_{t-1}, k_{t-1}, \\mu^l_{s_{t-1} k_{t-1}}(t-1), \\mu^l_n(t-1)) = a_{s_{t-1} s_t} p_{s_t k_t} \\mathcal{N}(\\mu^l_{s_t k_t}(t); \\mu^l_{s_t k_t}, \\Sigma^l_{s_t k_t}) \\mathcal{N}(\\mu^l_n(t); \\mu^l_n(t-1), V) \\quad (3)$$\nwhere $a_{s_{t-1} s_t}$ is the state transition probability from $s_{t-1}$ to $s_t$, and $p_{s_t k_t}$ is the\nmixture weight. The above formula gives the prior distribution of the set of speech\nand noise model parameters $\\theta(t) = \\{s_t, k_t, \\mu^l_{s_t k_t}(t), \\mu^l_n(t)\\}$.\nFurthermore, given the observation $Y^l(t)$, assume that the transformation by (1) has\nmodeling and measurement uncertainty that is Gaussian distributed, i.e.,\n$$Y^l(t) = \\mu^l_{s_t k_t}(t) + \\log(1 + \\exp(\\mu^l_n(t) - \\mu^l_{s_t k_t}(t))) + w_{s_t k_t}(t) \\quad (4)$$\nwhere $w_{s_t k_t}(t)$ is Gaussian with zero mean and variance $\\Sigma^l_{s_t k_t}$, i.e., $\\mathcal{N}(\\cdot; 0, \\Sigma^l_{s_t k_t})$.\nThus, the likelihood of the observation $Y^l(t)$ at state $s_t$ and mixture $k_t$ is\n$$p(Y^l(t)|\\theta(t)) = \\mathcal{N}(Y^l(t); \\mu^l_{s_t k_t}(t) + \\log(1 + \\exp(\\mu^l_n(t) - \\mu^l_{s_t k_t}(t))), \\Sigma^l_{s_t k_t}) \\quad (5)$$\nReferring to (3) and (5), the posterior distribution of $\\theta(t)$ given $Y^l(t)$ is\n$$p(s_t, k_t, \\mu^l_{s_t k_t}(t), \\mu^l_n(t) | Y^l(t)) \\propto p(Y^l(t)|\\theta(t)) a_{s_{t-1} s_t} p_{s_t k_t} \\mathcal{N}(\\mu^l_{s_t k_t}(t); \\mu^l_{s_t k_t}, \\Sigma^l_{s_t k_t}) \\mathcal{N}(\\mu^l_n(t); \\mu^l_n(t-1), V) \\quad (6)$$\nThe time-varying noise parameter is estimated by MMSE, given as\n$$\\hat{\\mu}^l_n(t) = \\int_{\\mu^l_n(t)} \\mu^l_n(t) \\sum_{s_t, k_t} \\int_{\\mu^l_{s_t k_t}(t)} p(\\theta(t)|Y^l(0:t)) d\\mu^l_{s_t k_t}(t) d\\mu^l_n(t) \\quad (7)$$\nHowever, it is difficult to obtain the posterior distribution $p(\\theta(t)|Y^l(0:t))$ analytically,\nsince $p(\\mu^l_{s_t k_t}(t), \\mu^l_n(t)|Y^l(t))$ is non-Gaussian in $\\mu^l_{s_t k_t}(t)$ and $\\mu^l_n(t)$ due to\nthe non-linearity in (4). It is thus difficult, if possible at all, to assign a conjugate prior\nof $\\mu^l_n(t)$ to the likelihood function $p(Y^l(t)|\\theta(t))$. Another difficulty is that the\nspeech state and mixture sequence is hidden in (7). We thus rely on a solution\nby a computational Bayesian approach [6].\n3 Time-varying noise parameter estimation by sequential\nMonte Carlo method\n\nWe apply the sequential Monte Carlo method [6] for posterior distribution updating.\nAt each frame $t$, samples are drawn from a proposal importance distribution whose target\nis the posterior distribution in (7), and this is implemented by sampling from lower\ndistributions in the hierarchy. The method goes through the sampling, selection, and\nsmoothing steps frame by frame. MMSE inference of the time-varying noise parameter\nis a by-product of the steps, carried out after the smoothing step.\nIn the sampling step, the prior distribution given by the speech models is\nused as the proposal importance distribution, i.e.,\n$q(\\theta(t)|\\theta(t-1)) = a_{s_{t-1} s_t} p_{s_t k_t} \\mathcal{N}(\\mu^l_{s_t k_t}(t); \\mu^l_{s_t k_t}, \\Sigma^l_{s_t k_t})$.\nThe samples are then generated by sampling\nhierarchically from the prior distribution as follows: set $i = 1$ and perform\nthe following steps:\n1. sample $s^{(i)}_t \\sim a_{s^{(i)}_{t-1} s_t}$\n2. sample $k^{(i)}_t \\sim p_{s^{(i)}_t k_t}$\n3. sample $\\mu^{l(i)}_{s^{(i)}_t k^{(i)}_t}(t) \\sim \\mathcal{N}(\\cdot; \\mu^l_{s^{(i)}_t k^{(i)}_t}, \\Sigma^l_{s^{(i)}_t k^{(i)}_t})$, and set $i = i + 1$\n4. repeat steps 1 to 3 until $N$ samples have been drawn\n\nwhere superscript $(i)$ denotes the sample index and $N$ denotes the number of\nsamples. 
Each sample represents certain speech and noise parameters, which are\ndenoted as $\\theta^{(i)}(t) = (s^{(i)}_t, k^{(i)}_t, \\mu^{l(i)}_{s^{(i)}_t k^{(i)}_t}(t), \\mu^{l(i)}_n(t))$. The weight of each sample is\ngiven by $\\prod_{\\tau=1}^{t} \\frac{p(\\theta^{(i)}(\\tau)|Y^l(\\tau))}{q(\\theta^{(i)}(\\tau)|\\theta^{(i)}(\\tau-1))}$. Referring to (6), the weight is calculated by\n$$\\beta^{(i)}(t) = p(Y^l(t)|\\theta^{(i)}(t)) \\mathcal{N}(\\mu^{l(i)}_n(t); \\mu^{l(i)}_n(t-1), V) \\bar{\\beta}^{(i)}(t-1) \\quad (8)$$\nwhere $\\bar{\\beta}^{(i)}(t-1)$ is the sample weight from the previous frame. The remaining part\non the right side of the above equation in fact represents the prediction likelihood of\nthe state space model given by (2) and (4) for each sample $(i)$. This likelihood\ncan be obtained analytically since, after linearization of (4) with respect to $\\mu^l_n(t)$ at\n$\\mu^{l(i)}_n(t-1)$, an extended Kalman filter (EKF) can be obtained, where the prediction\nlikelihood of the EKF gives the weight, and the updated continuous state of the EKF\ngives $\\mu^{l(i)}_n(t)$.\nIn practice, after the above sampling step, the weights of all but several samples may\nbecome insignificant. Given the fixed number of samples, this will result in degeneracy\nof the estimation, where not only are some computational resources wasted, but\nthe estimate might also be biased because of losing detailed information on some parts\nimportant to the parameter estimation. A selection step by residual resampling [6]\nis therefore adopted after the sampling step. The method avoids the degeneracy by\ndiscarding those samples with insignificant weights, and in order to keep the number of\nsamples constant, samples with significant weights are duplicated. Accordingly, the\nweights after the selection step are also proportionally redistributed. 
Denote the\nset of samples after the selection step as $\\tilde{\\theta}(t) = \\{\\tilde{\\theta}^{(i)}(t); i = 1 \\ldots N\\}$ with weights\n$\\tilde{\\beta}(t) = \\{\\tilde{\\beta}^{(i)}(t); i = 1 \\ldots N\\}$.\n\nAfter the selection step at frame $t$, these $N$ samples are distributed approximately\naccording to the posterior distribution in (7). However, the discrete nature of\nthe approximation can lead to a skewed importance weight distribution, where\nin the extreme case all the samples have the same estimated $\\tilde{\\theta}(t)$. A Metropolis-Hastings\nsmoothing step [7] is therefore introduced for each sample, where the step involves\nsampling a candidate $\\theta^{*(i)}(t)$ given the current $\\tilde{\\theta}^{(i)}(t)$ according to the proposal\nimportance distribution $q(\\theta^{*}(t)|\\tilde{\\theta}^{(i)}(t))$. The Markov chain then moves towards\n$\\theta^{*(i)}(t)$ with acceptance probability\n$\\min\\{1, \\frac{p(\\theta^{*(i)}|Y^l(t)) q(\\tilde{\\theta}^{(i)}|\\theta^{*(i)})}{p(\\tilde{\\theta}^{(i)}|Y^l(t)) q(\\theta^{*(i)}|\\tilde{\\theta}^{(i)})}\\}$;\notherwise it remains at $\\tilde{\\theta}^{(i)}$. To simplify calculation, we assume that the importance\ndistribution $q(\\theta^{*}(t)|\\tilde{\\theta}^{(i)}(t))$ is symmetric, and after some mathematical manipulation, it\ncan be shown that the acceptance probability is given by $\\min\\{1, \\frac{\\beta^{*(i)}(t)}{\\tilde{\\beta}^{(i)}(t)}\\}$.\n\nDenote the\nobtained samples as $\\bar{\\theta}(t) = \\{\\bar{\\theta}^{(i)}(t); i = 1 \\ldots N\\}$ with weights\n$\\bar{\\beta}(t) = \\{\\bar{\\beta}^{(i)}(t); i = 1 \\ldots N\\}$.\n\nNoise parameter $\\mu^l_n(t)$ is estimated via MMSE over the samples, i.e.,\n$$\\hat{\\mu}^l_n(t) = \\sum_{i=1}^{N} \\frac{\\bar{\\beta}^{(i)}(t)}{\\sum_{j=1}^{N} \\bar{\\beta}^{(j)}(t)} \\bar{\\mu}^{l(i)}_n(t)$$\nwhere $\\bar{\\mu}^{l(i)}_n(t)$ is the updated continuous state of the EKF in the sample after the\nsmoothing step. 
Once the estimate $\\hat{\\mu}^l_n(t)$ has been obtained, it is plugged into (1)\nto perform the non-linear transformation of the clean speech models.\n4 Experimental results\n\n4.1 Experimental setup\n\nExperiments were performed on the TI-Digits database down-sampled to 16kHz.\nFive hundred clean speech utterances from 15 speakers and 111 utterances unseen\nin the training set were used for training and testing, respectively. Digits and\nsilence were respectively modeled by 10-state and 3-state whole word HMMs with\n4 diagonal Gaussian mixtures in each state.\nThe window size was 25.0ms with a 10.0ms shift. Twenty-six filter banks were used\nin the binning stage. The features were MFCC + $\\Delta$ MFCC. The baseline system\nhad a 98.7% Word Accuracy under clean conditions.\nWe compared three systems. The first was the baseline trained on clean speech without\nnoise compensation, and the second was the system with noise compensation by\n(1) assuming stationary noise [4]. They were denoted as Baseline and Stationary\nCompensation, respectively. The third, the sequential method, was unsupervised, i.e., without a training\ntranscript, and it was denoted according to the number of samples and the variance of\nthe environment driving noise $V$. Four seconds of contaminating noise was used in\neach experiment to obtain the noise mean vector $\\mu^l_n$ in (1) for Stationary Compensation.\nIt was also used for initialization of $\\mu^l_n(0)$ in the sequential method. The initial\n$\\mu^{l(i)}_n(0)$ for each sample was sampled from $\\mathcal{N}(\\mu^l_n(0), 0.01) + \\mathcal{N}(\\mu^l_n(0) + \\zeta(0), 10.0)$,\nwhere $\\zeta(0)$ was a flat distribution in $[-1.0, 9.0]$.\n\n4.2 Speech recognition in simulated non-stationary noise\n\nA white noise signal was multiplied by a chirp signal and a rectangular signal, so that\nthe noise power of the contaminating white noise changed continuously, denoted\nas experiment A, and dramatically, denoted as experiment B. As a result, the\nsignal-to-noise ratio (SNR) of the contaminating noise ranged from 0dB to 20.4dB. 
We\nplotted the noise power in the 12th filter bank versus frames in Figure 2, together with\nthe noise power estimated by the sequential method with the number of samples set to\n120 and the environment driving noise variance set to 0.0001. As a comparison, we also\nplotted in Figure 3 the noise power and its estimate by the method with the same number of\nsamples but a larger driving noise variance of 0.001.\nFrom Figure 2 and Figure 3, we make the following observations. First, the method\ncan track the evolution of the noise power. Second, a larger driving noise variance\n$V$ gives faster convergence but a larger estimation error. In terms\nof recognition performance, Table 1 shows that the method can effectively improve\nsystem robustness to the time-varying noise. For example, with 60 samples and\nthe environment driving noise variance $V$ set to 0.001, the method improves\nword accuracy from 75.30% achieved by ``Stationary Compensation'' to 94.28% in\nexperiment A. The table also shows that the word accuracies can be improved\nby increasing the number of samples. For example, with the environment driving noise\nvariance $V$ set to 0.0001, increasing the number of samples from 60 to 120 improves\nword accuracy from 77.11% to 85.84% in experiment B.\nTable 1: Word Accuracy (in %) in simulated non-stationary noises, achieved by\nthe sequential Monte Carlo method in comparison with baseline without noise\ncompensation, denoted as Baseline, and noise compensation assuming stationary noise,\ndenoted as Stationary Compensation.\nExperiment  Baseline  Stationary     # samples = 60          # samples = 120\n                      Compensation   V = 0.001  V = 0.0001   V = 0.001  V = 0.0001\nA           48.19     75.30          94.28      93.98        94.28      94.58\nB           53.01     78.01          82.23      77.11        85.84      85.84\n4.3 Speech recognition in real noise\n\nIn this experiment, speech signals were contaminated by highly non-stationary\nMachinegun noise at different SNRs. The number of samples was set to 120, and the\nenvironment driving noise variance $V$ was set to 0.0001. 
Recognition performances\nare shown in Table 2, together with ``Baseline'' and ``Stationary Compensation''.\nFigure 2: Estimation of the time-varying parameter $\\mu^l_n(t)$ by the sequential Monte\nCarlo method at the 12th filter bank in experiment A. Number of samples is 120.\nEnvironment driving noise variance is 0.0001. 
Solid curve is the true noise power.\nDash-dotted curve is the estimated noise power.\nIt is observed that, in all SNR conditions, the method improves system\nperformance beyond that obtained by ``Stationary Compensation'', which in turn\nimproves over ``Baseline''. For example, at 8.86dB SNR, the method improves word accuracy\nfrom 75.60% by ``Stationary Compensation'' to 83.13%. Averaged over the four SNR\nconditions, the method achieves a relative 39.9% word error rate reduction compared to\n``Stationary Compensation''.\nTable 2: Word Accuracy (in %) in Machinegun noise, achieved by the sequential\nMonte Carlo method in comparison with baseline without noise compensation,\ndenoted as Baseline, and noise compensation assuming stationary noise, denoted as\nStationary Compensation.\nSNR (dB)  Baseline  Stationary Compensation  # samples = 120, V = 0.0001\n28.86     90.36     92.77                    97.59\n14.88     64.46     76.81                    88.25\n8.86      56.02     75.60                    83.13\n1.63      50.0      68.98                    72.89\n5 Summary\n\nWe have presented a sequential Monte Carlo method for Bayesian estimation of\nthe time-varying noise parameter, for sequential noise compensation applied to\nrobust speech recognition. 
The method uses samples to approximate the posterior\ndistribution of the additive noise and speech parameters given the observation sequence.\nFigure 3: Estimation of the time-varying parameter $\\mu^l_n(t)$ by the sequential Monte\nCarlo method at the 12th filter bank in experiment A. Number of samples is 120.\nEnvironment driving noise variance is 0.001. Solid curve is the true noise power.\nDash-dotted curve is the estimated noise power.\nOnce the noise parameter has been inferred, it is plugged into a non-linear\ntransformation of clean speech models. 
Experiments conducted on digit recognition in\nsimulated non-stationary noises and real noises have shown that the method is very\neffective in improving system robustness to time-varying additive noise.\nReferences\n\n[1] A. Varga and R.K. Moore, ``Hidden Markov model decomposition of speech and noise,''\nin ICASSP, 1990, pp. 845--848.\n[2] N.S. Kim, ``Nonstationary environment compensation based on sequential estimation,''\nIEEE Signal Processing Letters, vol. 5, no. 3, March 1998.\n[3] K. Yao, K. K. Paliwal, and S. Nakamura, ``Sequential noise compensation by a sequential\nKullback proximal algorithm,'' in EUROSPEECH, 2001, pp. 1139--1142, extended\npaper submitted for publication.\n[4] K. Yao, B. E. Shi, S. Nakamura, and Z. Cao, ``Residual noise compensation by a\nsequential EM algorithm for robust speech recognition in nonstationary noise,'' in\nICSLP, 2000, vol. 1, pp. 770--773.\n[5] B. Frey, L. Deng, A. Acero, and T. Kristjansson, ``Algonquin: Iterating Laplace's\nmethod to remove multiple types of acoustic distortion for robust speech recognition,''\nin EUROSPEECH, 2001, pp. 901--904.\n[6] J. S. Liu and R. Chen, ``Sequential Monte Carlo methods for dynamic systems,'' J.\nAm. Stat. Assoc., vol. 93, pp. 1032--1044, 1998.\n[7] W. K. Hastings, ``Monte Carlo sampling methods using Markov chains and their\napplications,'' Biometrika, vol. 57, pp. 97--109, 1970.\n", "award": [], "sourceid": 2093, "authors": [{"given_name": "K.", "family_name": "Yao", "institution": null}, {"given_name": "S.", "family_name": "Nakamura", "institution": null}]}