{"title": "A Harmonic Excitation State-Space Approach to Blind Separation of Speech", "book": "Advances in Neural Information Processing Systems", "page_first": 993, "page_last": 1000, "abstract": null, "full_text": "A harmonic excitation state-space approach to\n\nblind separation of speech\n\nRasmus Kongsgaard Olsson and Lars Kai Hansen\n\nInformatics and Mathematical Modelling\n\nTechnical University of Denmark, 2800 Lyngby, Denmark\n\nrko,lkh@imm.dtu.dk\n\nAbstract\n\nWe discuss an identi\ufb01cation framework for noisy speech mixtures. A\nblock-based generative model is formulated that explicitly incorporates\nthe time-varying harmonic plus noise (H+N) model for a number of latent\nsources observed through noisy convolutive mixtures. All parameters\nincluding the pitches of the source signals, the amplitudes and phases of\nthe sources, the mixing \ufb01lters and the noise statistics are estimated by\nmaximum likelihood, using an EM-algorithm. Exact averaging over the\nhidden sources is obtained using the Kalman smoother. We show that\npitch estimation and source separation can be performed simultaneously.\nThe pitch estimates are compared to laryngograph (EGG) measurements.\nArti\ufb01cial and real room mixtures are used to demonstrate the viability\nof the approach. Intelligible speech signals are re-synthesized from the\nestimated H+N models.\n\n1 Introduction\n\nOur aim is to understand the properties of mixtures of speech signals within a generative\nstatistical framework. We consider convolutive mixtures, i.e.,\n\nxt =\n\nAkst\u2212k + nt,\n\n(1)\n\nk=0\n\nwhere the elements of the source signal vector, st, i.e., the ds statistically independent\nsource signals, are convolved with the corresponding elements of the \ufb01lter matrix, Ak.\nThe multichannel sensor signal, xt, is furthermore degraded by additive Gaussian white\nnoise.\n\nIt is well-known that separation of the source signals based on second order statistics is\ninfeasible in general. 
Consider the second order statistic

\langle x_t x_{t'}^\top \rangle = \sum_{k,k'=0}^{L-1} A_k \langle s_{t-k} s_{t'-k'}^\top \rangle A_{k'}^\top + R,   (2)

where R is the (diagonal) noise covariance matrix. If the sources can be assumed stationary white noise, the source covariance matrix can be assumed proportional to the unit matrix without loss of generality, and we see that the statistic is symmetric to a common rotation of all mixing matrices, A_k -> A_k U. This rotational invariance means that the acquired statistic is not informative enough to identify the mixing matrix, hence, the source time series.

However, if we consider stationary sources with known, non-trivial autocorrelations \langle s_t s_{t'}^\top \rangle = G(t - t'), and we are given access to measurements involving multiple values of G(t - t'), the rotational degrees of freedom are constrained and we will be able to recover the mixing matrices up to a choice of sign and scale of each source time series. Extending this argument by the observation that the mixing model (1) is invariant to filtering of a given column of the convolutive filter, provided that the inverse filter is applied to the corresponding source signal, we see that it is infeasible to identify the mixing matrices if these arbitrary inverse filters can be chosen so that they 'whiten' the sources; see also [1].

For non-stationary sources, on the other hand, the autocorrelation functions vary through time and it is not possible to choose a single common whitening filter for each source. This means that the mixing matrices may be identifiable from multiple estimates of the second order correlation statistic (2) for non-stationary sources. Analysis in terms of the number of free parameters vs. 
the number of linear conditions is provided in [1] and [2].

Also in [2], the constraining effect of source non-stationarity was exploited by the simultaneous diagonalization of multiple estimates of the source power spectrum. In [3] we formulated a generative probabilistic model of this process and proved that it could estimate sources and mixing matrices in noisy mixtures. Blind source separation based on state-space models has been studied, e.g., in [4] and [5]. The approach is especially useful for including prior knowledge about the source signals and for handling noisy mixtures. One example of considerable practical importance is the case of speech mixtures.

For speech mixtures the generative model based on white noise excitation may be improved using more realistic priors. Speech models based on sinusoidal excitation have been quite popular in speech modelling since [6]. This approach assumes that the speech signal is a time-varying mixture of a harmonic signal and a noise signal (H+N model). A recent application of this model to pitch estimation can be found in [7]. Also [8] and [9] exploit the harmonic structure of certain classes of signals for enhancement purposes. A related application is the BSS algorithm of [10], which uses the cross-correlation of the amplitude in different frequency bands. The state-space model naturally leads to maximum-likelihood estimation using the EM-algorithm, e.g., [11], [12]. The EM algorithm has been used in related models: [13] and [14].

In this work we generalize our previous work on state-space models for blind source separation to include harmonic excitation and demonstrate that it is possible to perform simultaneous un-mixing and pitch tracking.

2 The model

The assumption of time-variant source statistics helps identify parameters that would otherwise not be unique within the model. 
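This identifiability issue can be illustrated numerically with the second order statistic (2) in the instantaneous (single-lag) case: for stationary white sources, \langle x_t x_t^\top \rangle = A A^\top + R is unchanged by a rotation A -> AU, so this one statistic cannot pin down A. A minimal check (all names and sizes illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Instantaneous mixing (one lag) keeps the argument transparent.
dx, ds = 3, 3
A = rng.normal(size=(dx, ds))
R = np.diag(rng.uniform(0.1, 0.5, size=dx))   # diagonal sensor-noise covariance

# A random orthogonal matrix U, obtained via QR of a Gaussian matrix.
U, _ = np.linalg.qr(rng.normal(size=(ds, ds)))

# For spatially white sources, <s s^T> = I, so <x x^T> = A A^T + R.
cov_A  = A @ A.T + R
cov_AU = (A @ U) @ (A @ U).T + R   # identical statistic after the rotation A -> AU
```

Since `cov_A` and `cov_AU` coincide to machine precision, any rotated mixing matrix explains the data equally well; breaking this symmetry requires additional statistics, e.g. from non-stationary frames.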
In the following, the measured signals are segmented into frames, in which they are assumed stationary. The mixing filters and observation noise covariance matrix are assumed stationary across all frames.

The colored noise (AR) process that was used in [3] to model the sources is augmented to include a periodic excitation signal that is also time-varying. The specific choice of periodic basis function, i.e. the sinusoid, is motivated by the fact that the phase is linearizable, facilitating one-step optimization. In frame n, source i is represented by:

s^n_{i,t} = \sum_{t'=1}^{p} f^n_{i,t'} s^n_{i,t-t'} + \sum_{k=1}^{K} \alpha^n_{i,k} \sin(\omega^n_{0,i} k t + \beta^n_{i,k}) + v^n_{i,t}
          = \sum_{t'=1}^{p} f^n_{i,t'} s^n_{i,t-t'} + \sum_{k=1}^{K} \big[ c^n_{i,2k-1} \sin(\omega^n_{0,i} k t) + c^n_{i,2k} \cos(\omega^n_{0,i} k t) \big] + v^n_{i,t}   (3)

where n \in {1, 2, .., N} and i \in {1, 2, .., d_s}. The innovation noise, v^n_{i,t}, is i.i.d. Gaussian. Clearly, (3) represents a H+N model. The fundamental frequency, \omega^n_{0,i}, enters the estimation problem in an inherently non-linear manner.

In order to benefit from well-established estimation theory, the above recursion is fitted into the framework of Gaussian linear models, see [15]. The Kalman filter model is an instance of this model. The augmented state space is constructed by including a history of past samples for each source. Source vector i in frame n is defined: s^n_{i,t} = [ s^n_{i,t}  s^n_{i,t-1}  ...  s^n_{i,t-p+1} ]^\top. All s^n_{i,t}'s are stacked in the total source vector: \bar{s}^n_t = [ (s^n_{1,t})^\top  (s^n_{2,t})^\top  ...  (s^n_{d_s,t})^\top ]^\top. The resulting state-space model is:

\bar{s}^n_t = F^n \bar{s}^n_{t-1} + C^n u^n_t + \bar{v}^n_t
x^n_t = A \bar{s}^n_t + n^n_t

where \bar{v}_t \sim N(0, Q), n_t \sim N(0, R) and \bar{s}^n_1 \sim N(\mu^n, \Sigma^n). The combined harmonics input vector is defined: u^n_t = [ (u^n_{1,t})^\top  (u^n_{2,t})^\top  ...  (u^n_{d_s,t})^\top ]^\top, where the harmonics corresponding to source i in frame n are:

u^n_{i,t} = [ \sin(\omega^n_{0,i} t)  \cos(\omega^n_{0,i} t)  ...  \sin(K \omega^n_{0,i} t)  \cos(K \omega^n_{0,i} t) ]^\top

It is apparent that the matrix multiplication by A constitutes a convolutive mixing of the sources, where the d_x \times d_s channel filters a_{ij} are stacked in A. In order to implement the H+N source model, the parameter matrices are constrained as follows:

A = \begin{bmatrix} a_{11}^\top & a_{12}^\top & \cdots & a_{1 d_s}^\top \\ a_{21}^\top & a_{22}^\top & \cdots & a_{2 d_s}^\top \\ \vdots & \vdots & \ddots & \vdots \\ a_{d_x 1}^\top & a_{d_x 2}^\top & \cdots & a_{d_x d_s}^\top \end{bmatrix}

and F^n, Q^n and C^n are block-diagonal, F^n = diag(F^n_1, .., F^n_{d_s}), Q^n = diag(Q^n_1, .., Q^n_{d_s}), C^n = diag(C^n_1, .., C^n_{d_s}), where

F^n_i = \begin{bmatrix} f^n_{i,1} & f^n_{i,2} & \cdots & f^n_{i,p-1} & f^n_{i,p} \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},
(Q^n_i)_{jj'} = \begin{cases} q^n_i & j = j' = 1 \\ 0 & j \neq 1 \vee j' \neq 1 \end{cases},
C^n_i = \begin{bmatrix} c^n_{i,1} & c^n_{i,2} & \cdots & c^n_{i,2K} \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}

3 Learning

Having described the convolutive mixing problem in the general framework of linear Gaussian models, more specifically the Kalman filter model, optimal inference of the sources is obtained by the Kalman smoother. However, since the problem at hand is effectively blind, we also need to estimate the parameters. Along the lines of, e.g., [15], we will invoke an EM approach. The log-likelihood is bounded from below: L(\theta) \geq F(\theta, \hat{p}) \equiv J(\theta, \hat{p}) - R(\hat{p}), with the definitions J(\theta, \hat{p}) \equiv \int dS \hat{p}(S) \log p(X, S|\theta) and R(\hat{p}) \equiv \int dS \hat{p}(S) \log \hat{p}(S). In accordance with standard EM theory, J(\theta, \hat{p}) is optimized wrt. \theta in the M-step. The E-step infers the relevant moments of the marginal posterior, \hat{p} = p(S|X, \theta). For the Gaussian model the means are also source MAP estimates. The combined E and M steps are guaranteed not to decrease L(\theta).

3.1 E-step

The forward-backward recursions which comprise the Kalman smoother are employed in the E-step to infer moments of the source posterior, p(S|X, \theta), i.e. the joint posterior of the sources conditioned on all observations. The relevant second-order statistics of this distribution in segment n are the marginal posterior mean, \hat{\bar{s}}^n_t \equiv \langle \bar{s}^n_t \rangle, the autocorrelation, M^n_{i,t} \equiv \langle s^n_{i,t} (s^n_{i,t})^\top \rangle \equiv [ m^n_{i,1,t} .. m^n_{i,L,t} ]^\top, and the marginal lag-one covariance, M^{1,n}_{i,t} \equiv \langle s^n_{i,t} (s^n_{i,t-1})^\top \rangle \equiv [ m^{1,n}_{i,1,t} .. m^{1,n}_{i,L,t} ]^\top. In particular, m^n_{i,t} is the first element of m^n_{i,1,t}. All averages are performed over p(S|X, \theta). The forward recursion also yields the log-likelihood, L(\theta).

3.2 M-step

The M-step utility function, J(\theta, \hat{p}), is defined:

J(\theta, \hat{p}) = -\frac{1}{2} \sum_{n=1}^{N} \Big[ \sum_{i=1}^{d_s} \big\langle (s^n_{i,1} - \mu^n_i)^\top (\Sigma^n_i)^{-1} (s^n_{i,1} - \mu^n_i) \big\rangle + \sum_{i=1}^{d_s} \log\det \Sigma^n_i + (\tau - 1) \sum_{i=1}^{d_s} \log q^n_i
  + \sum_{t=2}^{\tau} \sum_{i=1}^{d_s} \frac{1}{q^n_i} \big\langle (s^n_{i,t} - (d^n_i)^\top z^n_{i,t})^2 \big\rangle + \tau \log\det R + \sum_{t=1}^{\tau} \big\langle (x^n_t - A \bar{s}^n_t)^\top R^{-1} (x^n_t - A \bar{s}^n_t) \big\rangle \Big]

where \langle \cdot \rangle signifies averaging over the source posterior from the previous E-step, p(S|X, \theta), and \tau is the frame length. The linear source parameters are grouped as

d^n_i \equiv [ (f^n_i)^\top  (c^n_i)^\top ]^\top,   z^n_{i,t} \equiv [ (s^n_{i,t-1})^\top  (u^n_{i,t})^\top ]^\top

where f^n_i \equiv [ f_{i,1}  f_{i,2}  ..  f_{i,p} ]^\top and c^n_i \equiv [ c_{i,1}  c_{i,2}  ..  c_{i,2K} ]^\top. Optimization of J(\theta, \hat{p}) wrt. \theta is straightforward (except for the \omega^n_{0,i}'s). Relatively minor changes are introduced to the estimators of, e.g., [12] in order to respect the special constrained format of the parameter matrices and to allow for an external input to the model. More details on the estimators for the correlated source model are given in [3]. It is in general difficult to maximize J(\theta, \hat{p}) wrt. 
\omega^n_{0,i}, since several local maxima exist, e.g. at multiples of \omega^n_{0,i}, see e.g. [6]. This problem is addressed by narrowing the search range based on prior knowledge of the domain, e.g. that the pitch of speech lies in the range 50-400 Hz. A candidate estimate for \omega^n_{0,i} is obtained by computing the autocorrelation function of \hat{s}^n_{i,t}. Grid search is performed in the vicinity of the candidate. For each point in the grid we optimize d^n_i:

d^n_{i,new} = \Big[ \sum_{t=2}^{\tau} \begin{bmatrix} M^n_{i,t-1} & \hat{s}^n_{i,t-1} (u^n_{i,t})^\top \\ u^n_{i,t} (\hat{s}^n_{i,t-1})^\top & u^n_{i,t} (u^n_{i,t})^\top \end{bmatrix} \Big]^{-1} \sum_{t=2}^{\tau} \begin{bmatrix} m^{1,n}_{i,1,t} \\ \hat{s}^n_{i,t} u^n_{i,t} \end{bmatrix}   (4)

At each step of the EM-algorithm, the parameters are normalized by enforcing ||A_i|| = 1, that is, enforcing a unity norm on the filter coefficients related to source i.

Figure 1: Amplitude spectrograms of the frequency range 0-4000 Hz, from left to right: the true sources, the estimated sources and the re-synthesized sources.

4 Experiment I: BSS and pitch tracking in a noisy artificial mixture

The performance of a pitch detector can be evaluated using electro-laryngograph (EGG) recordings, which are obtained from electrodes placed on the neck, see [7]. In the following experiment, speech signals from the TIMIT [16] corpus are used for which the EGG signals were measured, kindly provided by the 'festvox' project (http://festvox.org). Two male speech signals (Fs = 16 kHz) were mixed through known mixing filters and degraded by additive white noise (SNR ~20 dB), constructing two observation signals. The pitches of the speech signals were overlapping. 
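Re-synthesis from an estimated H+N model, as in figure 1, amounts to running recursion (3) forward with the estimated parameters. A minimal sketch of one frame of that recursion follows; the function name and the parameter values are illustrative, not the estimates from the experiment:

```python
import numpy as np

rng = np.random.default_rng(2)

def hn_frame(f, c, w0, tau=320, noise_std=0.0, rng=rng):
    """One frame of the H+N source recursion (3): an AR(p) process
    driven by K harmonics of w0 plus i.i.d. Gaussian innovation noise."""
    p, K = len(f), len(c) // 2
    t = np.arange(1, tau + 1)
    # Harmonic excitation: sum_k c_{2k-1} sin(k w0 t) + c_{2k} cos(k w0 t).
    harm = sum(c[2 * k - 2] * np.sin(k * w0 * t) +
               c[2 * k - 1] * np.cos(k * w0 * t) for k in range(1, K + 1))
    v = noise_std * rng.normal(size=tau)
    s = np.zeros(tau + p)                      # p leading zeros: initial AR state
    for i in range(tau):
        ar = sum(f[j] * s[p + i - 1 - j] for j in range(p))
        s[p + i] = ar + harm[i] + v[i]
    return s[p:]

# Illustrative parameters: 120 Hz pitch at Fs = 16 kHz, p = 1, K = 3.
w0 = 2 * np.pi * 120 / 16000
frame = hn_frame(np.array([0.3]), np.array([1.0, 0.0, 0.5, 0.0, 0.25, 0.0]), w0)
```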
The filter coefficients (of 2 \times 2 = 4 FIR filter impulse responses) were:

A = \begin{bmatrix} 1.00 & 0.35 & -0.20 & 0.00 & 0.00 & \quad & 0.00 & 0.00 & -0.50 & -0.30 & 0.20 \\ 0.00 & 0.00 & 0.70 & -0.20 & 0.15 & \quad & 1.30 & 0.60 & 0.30 & 0.00 & 0.00 \end{bmatrix}

The signals were segmented into frames, \tau = 320 ~ 20 ms, and the order of the AR-process was set to p = 1. The number of harmonics was limited to K = 40. The pitch grid search involved 30 re-estimations of d^n_i. Figure 1 shows the spectrograms of approximately 1 second of 1) the original sources, 2) the MAP source estimates and 3) the re-synthesized sources (from the estimated model parameters). It is seen that the sources were well separated. Also, the re-synthesizations are almost indistinguishable from the source estimates. In figure 2, the estimated pitches of both speech signals are shown along with the pitch of the EGG measurements.1 The voiced sections of the speech were manually preselected; this step is easily automated. The estimated pitches do follow the 'true' pitches as provided by the EGG. The smoothness of the estimates further indicates the viability of the approach, as the pitch estimates are frame-local.

Figure 2: The estimated (dashed) and EGG-provided (solid) pitches as a function of time. The speech mixtures were artificially mixed from TIMIT utterances and white noise was added.

5 Experiment II: BSS and pitch tracking in a real mixture

The algorithm was further evaluated on real room recordings that were also used in [17].2 Two male speakers synchronously count in English and Spanish (Fs = 16 kHz). The mixtures were degraded with noise (SNR ~20 dB). 
The filter length, the frame length, the order of the AR-process and the number of harmonics were set to L = 25, \tau = 320, p = 1 and K = 40, respectively. Figure 3 shows the MAP source estimates and the re-synthesized sources. Features of speech such as amplitude modulation are clearly evident in the estimates and re-synthesizations.3 A listening test confirms: 1) the separation of the sources and 2) the good quality of the synthesized sources, reconfirming the applicability of the H+N model. Figure 4 displays the estimated pitches of the sources, where the voiced sections were manually preselected. Although the 'true' pitch is unavailable in this experiment, the smoothness of the frame-local pitch estimates is further support for the approach.

1 The EGG data are themselves noisy measurements of the hypothesized 'truth'. Bandpass filtering was used for preprocessing.
2 The mixtures were obtained from http://inc2.ucsd.edu/~tewon/ica_cnl.html.
3 Note that the 'English' counter lowers the pitch throughout the sentence.

Figure 3: Spectrograms of the estimated (left) and re-synthesized sources (right) extracted from the 'one two . . . ' and 'uno dos . . . ' mixtures, source 1 and 2, respectively.

6 Conclusion

It was shown that prior knowledge on speech signals, and quasi-periodic signals in general, can be integrated into a linear non-stationary state-space model. As a result, the simultaneous separation of the speech sources and estimation of their pitches could be achieved. It was demonstrated that the method could cope with noisy artificially mixed signals and real room mixtures. Future research concerns more realistic mixtures in terms of reverberation time and inclusion of further domain knowledge. 
It should be noted that the approach is computationally intensive; we are also investigating means for approximate inference and parameter estimation that would allow real-time implementation.

Figure 4: Pitch tracking in 'one two . . . '/'uno dos . . . ' mixtures.

Acknowledgement

This work is supported by the Danish 'Oticon Fonden'.

References

[1] Weinstein, E., Feder, M., Oppenheim, A. V., Multi-channel signal separation by decorrelation, IEEE Trans. on speech and audio processing, vol. 1, no. 4, pp. 405-413, 1993.

[2] Parra, L., Spence, C., Convolutive blind separation of non-stationary sources, IEEE Trans. on speech and audio processing, vol. 8, pp. 320-327, 2000.

[3] Olsson, R. K., Hansen, L. K., Probabilistic blind deconvolution of non-stationary sources, Proc. EUSIPCO, 2004, accepted. Olsson, R. K., Hansen, L. K., Estimating the number of sources in a noisy convolutive mixture using BIC, International Conference on Independent Component Analysis, 2004, accepted. Preprints may be obtained from http://www.imm.dtu.dk/~rko/research.htm.

[4] Gharbi, A. B. A., Salam, F., Blind separation of independent sources in linear dynamical media, NOLTA, Hawaii, 1993. http://www.egr.msu.edu/bsr/papers/blind_separation/nolta93.pdf

[5] Zhang, L., Cichocki, A., Blind deconvolution of dynamical systems: a state space approach, Journal of Signal Processing, vol. 4, no. 2, pp. 111-130, 2000.

[6] McAulay, R. J., Quatieri, T. F., Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. on acoustics, speech and signal processing, vol. 34, no. 4, pp. 744-754, 1986.

[7] Parra, L., Jain, U., Approximate Kalman filtering for the harmonic plus noise model, IEEE Workshop on applications of signal processing to audio and acoustics, pp. 75-78, 2001.

[8] Nakatani, T., Miyoshi, M., Kinoshita, K., One microphone blind dereverberation based on quasi-periodicity of speech signals, Advances in Neural Information Processing Systems 16 (to appear), MIT Press, 2004.

[9] Hu, G., Wang, D., Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. neural networks, in press, 2004.

[10] Anemüller, J., Kollmeier, B., Convolutive blind source separation of speech signals based on amplitude modulation decorrelation, Journal of the Acoustical Society of America, vol. 108, pp. 2630, 2000.

[11] Dempster, A. P., Laird, N. M., Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, vol. 39, pp. 1-38, 1977.

[12] Shumway, R. H., Stoffer, D. S., An approach to time series smoothing and forecasting using the EM algorithm, Journal of Time Series Analysis, vol. 3, pp. 253-264, 1982.

[13] Moulines, E., Cardoso, J. F., Gassiat, E., Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models, ICASSP, vol. 5, pp. 3617-3620, 1997.

[14] Cardoso, J. F., Snoussi, H., Delabrouille, J., Patanchon, G., Blind separation of noisy Gaussian stationary sources. Application to cosmic microwave background imaging, Proc. EUSIPCO, pp. 561-564, 2002.

[15] Roweis, S., Ghahramani, Z., A unifying review of linear Gaussian models, Neural Computation, vol. 11, pp. 305-345, 1999.

[16] Center for Speech Technology Research, University of Edinburgh, http://www.cstr.ed.ac.uk/

[17] Lee, T.-W., Bell, A. J., Orglmeister, R., Blind source separation of real world signals, Proc. IEEE International Conference on Neural Networks, pp. 2129-2135, 1997.
", "award": [], "sourceid": 2712, "authors": [{"given_name": "Rasmus", "family_name": "Olsson", "institution": null}, {"given_name": "Lars", "family_name": "Hansen", "institution": null}]}