{"title": "Speech Denoising and Dereverberation Using Probabilistic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 758, "page_last": 764, "abstract": null, "full_text": "Speech Denoising and Dereverberation Using \n\nProbabilistic Models \n\nHagai Attias \n\nJohn C. Platt \n\nAlex Acero \n\nMicrosoft Research \n\n1 Microsoft Way \n\nLi Deng \n\nRedmond, WA 98052 \n\n{hagaia,jplatt,alexac,deng} @microsoft.com \n\nAbstract \n\nThis paper presents a unified probabilistic framework for denoising and \ndereverberation of speech signals. The framework transforms the denois(cid:173)\ning and dereverberation problems into Bayes-optimal signal estimation. \nThe key idea is to use a strong speech model that is pre-trained on a \nlarge data set of clean speech. Computational efficiency is achieved by \nusing variational EM, working in the frequency domain, and employing \nconjugate priors. The framework covers both single and multiple micro(cid:173)\nphones. We apply this approach to noisy reverberant speech signals and \nget results substantially better than standard methods. \n\n1 \n\nIntroduction \n\nThis paper presents a statistical-model-based algorithm for reconstructing a speech source \nfrom microphone signals recorded in a stationary noisy reverberant environment. Speech \nenhancement in a realistic environment is a challenging problem, which remains largely \nunsolved in spite of more than three decades of research. Speech enhancement has many \napplications and is particularly useful for robust speech recognition [7] and for telecommu(cid:173)\nnication. \nThe difficulty of speech enhancement depends strongly on environmental conditions. If a \nspeaker is close to a microphone, reverberation effects are minimal and traditional methods \ncan handle typical moderate noise levels. 
However, if the speaker is far away from a microphone, there are more severe distortions, including large amounts of noise and noticeable reverberation. Denoising and dereverberation of speech in this condition has proven to be a very difficult problem [4].
Current speech enhancement methods can be placed into two categories: single-microphone methods and multiple-microphone methods. A large body of literature exists on single-microphone speech enhancement methods. These methods often use a probabilistic framework with statistical models of a single speech signal corrupted by Gaussian noise [6, 8]. These models have not been extended to dereverberation or multiple microphones.
Multiple-microphone methods start with microphone array processing, where an array of microphones with a known geometry is deployed to make both spatial and temporal measurements of sounds. A microphone array offers significant advantages compared to single-microphone methods. Non-adaptive algorithms can denoise a signal reasonably well, as long as it originates from a limited range of azimuth. These algorithms do not handle reverberation, however. Adaptive algorithms can handle reverberation to some extent [4], but existing methods are not derived from a principled probabilistic framework and hence may be sub-optimal.
Work on blind source separation has attempted to remove the need for fixed array geometries and pre-specified room models. Blind separation attempts the full multi-source, multi-microphone case. In practice, the most successful algorithms concentrate on instantaneous noise-free mixing with the same number of sources as sensors and with very weak probabilistic models for the source [5]. Some algorithms for noisy non-square instantaneous mixing have been developed [1], as well as algorithms for convolutive square noise-free mixing [9]. 
However, the full problem including noise and convolution has so far remained open.
In this paper, we present a new method for speech denoising and dereverberation. We use the framework of probabilistic models, which allows us to integrate the different aspects of the whole problem, including strong speech models, environmental noise and reverberation, and microphone arrays. This integration is performed in a principled manner facilitating a coherent unified treatment. The framework allows us to produce a Bayes-optimal estimation algorithm. Using a strong speech model leads to computational intractability, which we overcome using a variational approach. The computational efficiency is further enhanced by working in the frequency domain and by employing conjugate priors. The resulting algorithm has complexity O(N log N). Results on noisy speech show significant improvement over standard methods.
Due to space limitations, the full derivation and mathematical details for this method are provided in the technical report [3].
Notation and conventions. We work with time series data using a frame-by-frame analysis with N-point frames. Thus, all signals and systems, e.g. $y^i_n$, have a time point subscript extending over n = 0, ..., N - 1. With the superscript i omitted, $y_n$ denotes all microphone signals. When n is also omitted, y denotes all signals at all time points. Superscripts may become subscripts and vice versa when no confusion arises. The discrete Fourier transform (DFT) of $x_n$ is $X_k = \sum_n \exp(-i\omega_k n)\, x_n$. We define the primed quantity

$\bar{a}'_k = 1 - \sum_{n=1}^{p} e^{-i\omega_k n} a_n$   (1)

for variables $a_n$ with n = 1, ..., p.
The Gaussian distribution for a random vector a with mean $\mu$ and precision matrix V (defined as the inverse covariance matrix) is denoted $\mathcal{N}(a \mid \mu, V)$. 
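As a concrete illustration of the primed quantity in (1), the following minimal numpy sketch (the function name `primed` and the example coefficients are our own, not from the paper) computes $\bar{a}'_k$ and checks that it equals the N-point DFT of the zero-padded sequence (1, -a_1, ..., -a_p, 0, ...):

```python
import numpy as np

def primed(a, N):
    """a'_k = 1 - sum_{n=1}^p exp(-i w_k n) a_n, with w_k = 2*pi*k/N (eq. 1)."""
    p = len(a)
    k = np.arange(N)
    n = np.arange(1, p + 1)
    return 1.0 - np.exp(-2j * np.pi * np.outer(k, n) / N) @ a

# The same values via FFT: a'_k is the DFT of (1, -a_1, ..., -a_p, 0, ...),
# using the paper's DFT sign convention, which numpy shares.
a = np.array([1.2, -0.5])                        # example AR(2) coefficients
seq = np.concatenate(([1.0], -a, np.zeros(13)))  # length N = 16
assert np.allclose(primed(a, 16), np.fft.fft(seq))
```

The FFT identity is what later lets the algorithm evaluate all N frequency terms in O(N log N).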
The Gamma distribution for a non-negative random variable $\nu$ with $\alpha$ degrees of freedom and inverse scale $\beta$ is denoted $\mathcal{G}(\nu \mid \alpha, \beta) \propto \nu^{\alpha/2 - 1} \exp(-\beta\nu/2)$. Their product, the Normal-Gamma distribution

$\mathcal{NG}(a, \nu \mid \mu, V, \alpha, \beta) = \mathcal{N}(a \mid \mu, \nu V)\, \mathcal{G}(\nu \mid \alpha, \beta)$,   (2)

turns out to be particularly useful. Notice that it relates the precision of a to $\nu$.
Problem formulation. We consider the case where a single speech source is present and M microphones are available. The treatment of the single-microphone case is the special case M = 1, but is not qualitatively different.
Let $x_n$ be the signal emitted by the source at time n, and let $y^i_n$ be the signal received at microphone i at the same time. Then

$y^i_n = h^i_n * x_n + u^i_n = \sum_m h^i_m x_{n-m} + u^i_n$,   (3)

where $h^i_m$ is the impulse response of the filter (of length $K_i \le N$) operating on the source as it propagates toward microphone i, * is the convolution operator, and $u^i_n$ denotes the noise recorded at that microphone. Noise may originate from both microphone responses and from environmental sources.
In a given environment, the task is to provide an optimal estimate of the clean speech signal x from the noisy microphone signals $y^i$. This requires the estimation of the convolving filters $h^i$ and characteristics of the noise $u^i$. This estimation is accomplished by Bayesian inference on probabilistic models for x and $u^i$.

2 Probabilistic Signal Models

We now turn to our model for the speech source. Much of the work on speech denoising in the past has employed very simple source models: AR or ARMA descriptions [6]. One exception is [8], which uses an HMM whose observations are Gaussian AR models. These simple denoising models incorporate very little information on the structure of speech. Such an approach a priori allows any value for the model coefficients, including values that are unlikely to occur in a speech signal. 
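The observation model (3) can be simulated directly. The sketch below (all signal values are hypothetical stand-ins, with white noise in place of actual speech and an arbitrary random filter in place of a real room response) forms one noisy reverberant microphone frame:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 256, 8                      # frame length N and filter length K <= N
x = rng.standard_normal(N)         # stand-in "clean speech" frame
h = 0.3 * rng.standard_normal(K)   # hypothetical room impulse response h^i
u = 0.1 * rng.standard_normal(N)   # microphone noise u^i

# y^i_n = sum_m h^i_m x_{n-m} + u^i_n  (eq. 3), truncated to N samples
y = np.convolve(x, h)[:N] + u
```

The enhancement task is the inverse problem: recover x from y given neither h nor the noise statistics.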
Without a strong prior, it is difficult to estimate the convolving filters accurately due to a lack of identifiability. A source prior is especially important in the single-microphone case, where N clean samples plus model coefficients must be estimated from only N noisy samples. Thus, the absence of a strong speech model degrades reconstruction quality.
The most detailed statistical speech models available are those employed by state-of-the-art speech recognition engines. These systems are generally based on mixtures of diagonal Gaussian models in the mel-cepstral domain. These models are endowed with temporal Markov dynamics and have a very large (approx. 100,000) number of states corresponding to individual atoms of speech. However, in the mel-cepstral domain, the noisy reverberant speech has a strongly non-linear relationship to the clean speech.
Physical speech production model. In this paper, we work in the linear time/frequency domain using a statistical model and take an intermediate approach regarding the model size. We model speech production with an AR(p) model:

$x_n = \sum_{m=1}^{p} a_m x_{n-m} + v_n$,   (4)

where the coefficients $a_m$ are related to the physical shape of a "lossless tube" model of the vocal tract.
To turn this physical model into a probabilistic model, we assume that the $v_n$ are independent zero-mean Gaussian variables with scalar precision $\nu$. Each speech frame $x = (x_0, \ldots, x_{N-1})$ has its own parameters $\theta = (a_1, \ldots, a_p, \nu)$. Given $\theta$, the joint distribution of x is a zero-mean Gaussian, $p(x \mid \theta) = \mathcal{N}(x \mid 0, A)$, where A is the N x N precision matrix. Specifically, the joint distribution is given by the product

$p(x \mid \theta) = \prod_n \mathcal{N}(x_n \mid \sum_m a_m x_{n-m}, \nu)$.   (5)

Probabilistic model in the frequency domain. 
However, rather than employing this product form directly, we work in the frequency domain and use the DFT to write

$p(x \mid \theta) \propto \exp\!\left(-\frac{\nu}{2N} \sum_{k=0}^{N-1} |\bar{a}'_k|^2 \, |X_k|^2\right)$,   (6)

where $\bar{a}'_k$ is defined in (1). The precision matrix A is now given by an inverse DFT, $A_{nm} = (\nu/N) \sum_k e^{i\omega_k(n-m)} |\bar{a}'_k|^2$. This matrix belongs to a sub-class of Toeplitz matrices called circulant Toeplitz. It follows from (6) that the mean power spectrum of x is related to $\theta$ via $S_k = \langle |X_k|^2 \rangle = N / (\nu |\bar{a}'_k|^2)$.
Conjugate priors. To complete our speech model, we must specify a distribution over the speech production parameters $\theta$. We use an S-state mixture model with a Normal-Gamma distribution (2) for each component s = 1, ..., S: $p(\theta \mid s) = \mathcal{N}(a_1, \ldots, a_p \mid \mu_s, \nu V_s)\, \mathcal{G}(\nu \mid \alpha_s, \beta_s)$. This form is chosen by invoking the idea of a conjugate prior, which is defined as follows. Given the model $p(x \mid \theta) p(\theta \mid s)$, the prior $p(\theta \mid s)$ is conjugate to $p(x \mid \theta)$ iff the posterior $p(\theta \mid x, s)$, computed by Bayes' rule, has the same functional form as the prior. This choice has the advantage of being quite general while keeping the clean speech model analytically tractable.
It turns out, as discussed below, that significant computational savings result if we restrict the p x p precision matrices $V_s$ to have a circulant Toeplitz structure. To do this without having to impose an explicit constraint, we reparametrize $p(\theta \mid s)$ in terms of $\xi^s_k, \eta^s_k$ instead of $\mu^s_n, V^s_{nm}$, and work in the frequency domain:

$p(\theta \mid s) \propto \exp\!\left(-\frac{\nu}{2p} \sum_{k=0}^{p-1} |\xi^s_k \bar{a}_k - \eta^s_k|^2\right) \nu^{\alpha_s/2 - 1} \exp\!\left(-\frac{\beta_s \nu}{2}\right)$.   (7)

Note that we use a p-point rather than an N-point DFT. The precisions are now given by the inverse DFT $V^s_{nm} = (1/p) \sum_k e^{i\omega_k(n-m)} |\xi^s_k|^2$ and are manifestly circulant. It is easy to show that conjugacy still holds.
Finally, the mixing fractions are given by $p(s) = \pi_s$. 
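The spectrum relation $S_k = \langle |X_k|^2 \rangle = N/(\nu |\bar{a}'_k|^2)$ can be checked numerically. A minimal sketch, assuming numpy: it samples the circulant frequency-domain model (6) by dividing the DFT of white excitation noise by $\bar{a}'_k$, then averages the periodogram over many frames (the AR coefficients below are illustrative, not trained values):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, v = 256, 2, 4.0               # frame length, AR order, precision v
a = np.array([0.75, -0.5])          # illustrative stable AR(2) coefficients

# a'_k from eq. (1), then the model spectrum S_k = N / (v |a'_k|^2)
k, n = np.arange(N), np.arange(1, p + 1)
a_primed = 1.0 - np.exp(-2j * np.pi * np.outer(k, n) / N) @ a
S = N / (v * np.abs(a_primed) ** 2)

# Sample the circulant model: X_k = E_k / a'_k with e_n ~ N(0, 1/v),
# then average the periodogram |X_k|^2 over many frames.
E = np.fft.fft(rng.standard_normal((4000, N)) / np.sqrt(v), axis=1)
emp = np.mean(np.abs(E / a_primed) ** 2, axis=0)
assert np.allclose(emp, S, rtol=0.25)   # empirical spectrum matches S_k
```

Sampling directly in the frequency domain mirrors the circulant structure of A that the paper exploits for efficiency.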
This completes the specification of our clean speech model p(x) in terms of the latent variable model $p(x, \theta, s) = p(x \mid \theta)\, p(\theta \mid s)\, p(s)$. The model is parametrized by $W = (\xi^s_k, \eta^s_k, \alpha_s, \beta_s, \pi_s)$.
Speech model training. We pre-train the speech model parameters W on 10,000 sentences of the Wall Street Journal corpus, recorded with a close-talking microphone from 150 male and female speakers of North American English. We used 16 msec overlapping frames with N = 256 time points at a 16 kHz sampling rate. Training was performed using an EM algorithm derived specifically for this model [3]. We used S = 256 clusters and p = 12. W was initialized by extracting the AR(p) coefficients from each frame using the autocorrelation method. These coefficients were converted into cepstral coefficients and clustered into S classes by k-means clustering. We then considered the corresponding hard clusters of the AR(p) coefficients, and separately fit a model $p(\theta \mid s)$ (7) to each. The resulting parameters were used as initial values for the full EM algorithm.
Noise model. In this paper, we use an AR(q) description for the noise recorded by microphone i, $u^i_n = \sum_m b^i_m u^i_{n-m} + w^i_n$. The noise parameters are $\phi^i = (b^i_m, \lambda^i)$, where $\lambda^i$ is the precision of the zero-mean Gaussian excitations $w^i_n$. In the frequency domain we have the joint distribution

$p(u^i \mid \phi^i) \propto \exp\!\left(-\frac{\lambda^i}{2N} \sum_{k=0}^{N-1} |\bar{b}'^i_k|^2 \, |U^i_k|^2\right)$.   (8)

As in (6), the parameters $\phi^i$ determine the spectrum of the noise. But unlike the speech model, the AR(q) noise model is chosen for mathematical convenience rather than for its relation to an underlying physical model.
Noisy speech model. The form (8) now implies that given the clean speech x, the distribution of the data $y^i$ is

$p(y^i \mid x) \propto \exp\!\left(-\frac{\lambda^i}{2N} \sum_{k=0}^{N-1} |\bar{b}'^i_k|^2 \, |Y^i_k - \bar{h}^i_k X_k|^2\right)$.   (9)

This completes the specification of our noisy speech model p(y) in terms of the joint distribution $\prod_i p(y^i \mid x)\, p(x \mid \theta)\, p(\theta \mid s)\, p(s)$.

3 Variational Speech Enhancement (VSE) Algorithm

The denoising and dereverberation task is accomplished by estimating the clean speech x, which requires estimating the speech parameters $\theta$, the filter coefficients $h^i$, and the noise parameters $\phi^i$. These tasks can be performed by the EM algorithm. This algorithm receives the data $y^i$ from an utterance (a long sequence of frames) as input and proceeds iteratively. In the E-step, the algorithm computes the sufficient statistics of the clean speech x and the production parameters $\theta$ for each frame. In the M-step, the algorithm uses the sufficient statistics to update the values of $h^i$ and