{"title": "Source Separation with a Sensor Array using Graphical Models and Subband Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 1229, "page_last": 1236, "abstract": "", "full_text": "Source Separation with a Sensor Array Using Graphical Models and Subband Filtering\n\nHagai Attias\n\nMicrosoft Research\nRedmond, WA 98052\n\nhagaia@microsoft.com\n\nAbstract\n\nSource separation is an important problem at the intersection of several fields, including machine learning, signal processing, and speech technology. Here we describe new separation algorithms which are based on probabilistic graphical models with latent variables. In contrast with existing methods, these algorithms exploit detailed models to describe source properties. They also use subband filtering ideas to model the reverberant environment, and employ an explicit model for background and sensor noise. We leverage variational techniques to keep the computational complexity per EM iteration linear in the number of frames.\n\n1 The Source Separation Problem\n\nFig. 1 illustrates the problem of source separation with a sensor array. In this problem, signals from $K$ independent sources are received by each of $L \\ge K$ sensors. The task is to extract the sources from the sensor signals. It is a difficult task, partly because the received signals are distorted versions of the originals. There are two types of distortions. The first type arises from propagation through a medium, and is approximately linear but also history dependent. This type is usually termed reverberations. The second type arises from background noise and sensor noise, which are assumed additive. Hence, the actual task is to obtain an optimal estimate of the sources from data. The task is difficult for another reason, which is lack of advance knowledge of the properties of the sources, the propagation medium, and the noises. 
This difficulty gave rise to adaptive source separation algorithms, where parameters that are related to those properties are adjusted to optimize a chosen cost function.\n\nFigure 1: The source separation problem. Signals from $K = 2$ speakers propagate toward $L = 2$ sensors. Each sensor receives a linear mixture of the speaker signals, distorted by multipath propagation, medium response, and background and sensor noise. The task is to infer the original signals from sensor data.\n\nUnfortunately, the intense activity this problem has attracted over the last several years [1\u20139] has not yet produced a satisfactory solution. In our opinion, the reason is that existing techniques fail to address three major factors. The first is noise robustness: algorithms typically ignore background and sensor noise, sometimes assuming they may be treated as additional sources. It seems plausible that to produce a noise robust algorithm, noise signals and their properties must be modeled explicitly, and these models should be exploited to compute optimal source estimators. The second factor is mixing filters: algorithms typically seek, and directly optimize, a transformation that would unmix the sources. However, in many situations, the filters describing medium propagation are non-invertible, or have an unstable inverse, or have a stable inverse that is extremely long. It may hence be advantageous to estimate the mixing filters themselves, then use them to estimate the sources. The third factor is source properties: algorithms typically use a very simple source model (e.g., a one time point histogram). But in many cases one may easily obtain detailed models of the source signals. This is particularly true for speech sources, where large datasets exist and much modeling expertise has developed over decades of research. Separation of speakers is also one of the major potential commercial applications of source separation algorithms. 
It seems plausible that incorporating strong source models could improve performance. Such models may potentially have two more advantages: first, they could help limit the range of possible mixing filters by constraining the optimization problem. Second, they could help avoid whitening the extracted signals by effectively limiting their spectral range to the range characteristic of the source model.\n\nThis paper makes several contributions to the problem of real world source separation. In the following, we present new separation algorithms that are the first to address all three factors. We work in the framework of probabilistic graphical models. This framework allows us to construct models for sources and for noise, combine them with the reverberant mixing transformation in a principled manner, and compute parameter and source estimates from data which are Bayes optimal. We identify three technical ideas that are key to our approach: (1) a strong speech model, (2) subband filtering, and (3) variational EM.\n\n2 Frames, Subband Signals, and Subband Filtering\n\nWe start with the concept of subband filtering. This is also a good point to define our notation. Let $x_m$ denote a time domain signal, e.g., the value of a sound pressure waveform at time point $m = 0, 1, 2, \\ldots$. Let $X_n[k]$ denote the corresponding subband signal at time frame $n$ and subband frequency $k$. The subband signals are obtained from the time domain signal by imposing an $N$-point window $w_m$, $m = 0 : N - 1$, on that signal at equally spaced points $nJ$, $n = 0, 1, 2, \\ldots$, and FFT-ing the windowed signal,\n\n$$X_n[k] = \\sum_{m=0}^{N-1} e^{-i\\omega_k m} w_m x_{nJ+m}, \\tag{1}$$\n\nwhere $\\omega_k = 2\\pi k/N$ and $k = 0 : N - 1$. The subband signals are also termed frames. Notice the difference in time scale between the time frame index $n$ in $X_n[k]$ and the time point index $n$ in $x_n$.\n\nThe chosen value of the spacing $J$ depends on the window length $N$. 
For $J \\le N$ the original signal $x_m$ can be synthesized exactly from the subband signals (synthesis formula omitted).\n\nAn important consideration for selecting $J$, as well as the window shape, is behavior under filtering. Consider a filter $h_m$ applied to $x_m$, and denote by $y_m$ the filtered signal. In the simple case $h_m = h\\delta_{m,0}$ (no filtering), the subband signals keep the same dependence as the time domain ones, $y_n = h x_n \\rightarrow Y_n[k] = h X_n[k]$. For an arbitrary filter $h_m$, we use the relation\n\n$$y_n = \\sum_m h_m x_{n-m} \\;\\rightarrow\\; Y_n[k] = \\sum_m H_m[k] X_{n-m}[k], \\tag{2}$$\n\nwith complex coefficients $H_m[k]$ for each $k$. This relation between the subband signals is termed subband filtering, and the $H_m[k]$ are termed subband filters. Unlike the simple case of non-filtering, the relation (2) holds approximately, but quite accurately using an appropriate choice of $J$ and $w_m$; see [13] for details on accuracy. Throughout this paper, we will assume that an arbitrary filter $h_m$ can be modeled by the subband filters $H_m[k]$ to a sufficient accuracy for our purposes.\n\nOne advantage of subband filtering is that it replaces a long filter $h_m$ by a set of short independent filters $H_m[k]$, one per frequency. This will turn out to decompose the source separation problem into a set of small (albeit coupled) problems, one per frequency. Another advantage is that this representation allows using a detailed speech model on the same footing with the filter model. This is because a speech model is defined on the time scale of a single frame, whereas the original filter $h_m$, in contrast with $H_m[k]$, is typically as long as 10 or more frames.\n\nAs a final point on notation, we define a Gaussian distribution over a complex number $Z$ by $p(Z) = \\mathcal{N}(Z \\mid \\mu, \\nu) = \\frac{\\nu}{\\pi} \\exp(-\\nu |Z - \\mu|^2)$. 
Notice that this is a joint distribution over the real and imaginary parts of $Z$. The mean is $\\mu = \\langle Z \\rangle$ and the precision (inverse variance) $\\nu$ satisfies $\\nu^{-1} = \\langle |Z|^2 \\rangle - |\\mu|^2$.\n\n3 A Model for Speech Signals\n\nWe assume independent sources, and model the distribution of source $j$ by a mixture model over its subband signals $X_{jn}$,\n\n$$p(X_{jn} \\mid S_{jn} = s) = \\prod_{k=1}^{N/2-1} \\mathcal{N}(X_{jn}[k] \\mid 0, A_{js}[k]),$$\n$$p(S_{jn} = s) = \\pi_{js},$$\n$$p(X, S) = \\prod_{jn} p(X_{jn} \\mid S_{jn})\\, p(S_{jn}), \\tag{3}$$\n\nwhere the components are labeled by $S_{jn}$. Component $s$ of source $j$ is a zero mean Gaussian with precision $A_{js}$. The mixing proportions of source $j$ are $\\pi_{js}$. The DAG representing this model is shown in Fig. 2. A similar model was used in [10] for one microphone speech enhancement for recognition (see also [11]).\n\nHere are several things to note about this model. (1) Each component has a characteristic spectrum, which may describe a particular part of a speech phoneme. This is because the precision corresponds to the inverse spectrum: the mean energy (w.r.t. the above distribution) of source $j$ at frequency $k$, conditioned on label $s$, is $\\langle |X_{jn}[k]|^2 \\rangle = A_{js}[k]^{-1}$. (2) A zero mean model is appropriate given the physics of the problem, since the mean of a sound pressure waveform is zero. (3) $k$ runs from 1 to $N/2 - 1$, since for $k > N/2$, $X_{jn}[k] = X_{jn}[N - k]^{\\star}$; the subbands $k = 0, N/2$ are real and are omitted from the model, a common practice in speech recognition engines. (4) Perhaps most importantly, for each source the subband signals are correlated via the component label $s$, as $p(X_{jn}) = \\sum_s p(X_{jn}, S_{jn} = s) \\neq \\prod_k p(X_{jn}[k])$. 
Hence, when the source separation problem decomposes into one problem per frequency, these problems turn out to be coupled (see below), and independent frequency permutations are avoided. (5) To increase model accuracy, a state transition matrix $p(S_{jn} = s \\mid S_{j,n-1} = s')$ may be added for each source. The resulting HMM models are straightforward to incorporate without increasing the algorithm complexity.\n\nFigure 2: Graphical model describing speech signals in the subband domain. The model assumes i.i.d. frames; only the frame at time $n$ is shown. The node $X_n$ represents a complex $(N/2 - 1)$-dimensional vector $X_n[k]$, $k = 1 : N/2 - 1$.\n\nThere are several modes of using the speech model in the algorithms below. In one mode, the sources are trained online using the sensor data. In a second mode, source models are trained offline using available data on each source in the problem. A third mode corresponds to separation of sources known to be speech but whose speakers are unknown. In this case, all sources have the same model, which is trained offline on a large dataset of speech signals, including 150 male and female speakers reading sentences from the Wall Street Journal (see [10] for details). This is the case presented in this paper. The training algorithm used was standard EM (omitted) using 256 clusters, initialized by vector quantization.\n\n4 Separation of Non-Reverberant Mixtures\n\nWe now present a source separation algorithm for the case of non-reverberant (or instantaneous) mixing. Whereas many algorithms exist for this case, our contribution here is an algorithm that is significantly more robust to noise. 
Its robustness results, as indicated in the introduction, from three factors: (1) explicitly modeling the noise in the problem, (2) using a strong source model, in particular modeling the temporal statistics (over $N$ time points) of the sources, rather than one time point statistics, and (3) extracting each source signal from data by a Bayes optimal estimator obtained from $p(X \\mid Y)$. A more minor point is handling the case of fewer sources than sensors in a principled way.\n\nThe mixing situation is described by $y_{in} = \\sum_j h_{ij} x_{jn} + u_{in}$, where $x_{jn}$ is source signal $j$ at time point $n$, $y_{in}$ is sensor signal $i$, $h_{ij}$ is the instantaneous mixing matrix, and $u_{in}$ is the noise corrupting sensor $i$\u2019s signal. The corresponding subband signals satisfy $Y_{in}[k] = \\sum_j h_{ij} X_{jn}[k] + U_{in}[k]$.\n\nTo turn the last equation into a probabilistic graphical model, we assume that noise $i$ has precision (inverse spectrum) $B_i[k]$, and that noises at different sensors are independent (the latter assumption is often inaccurate but can be easily relaxed). This yields\n\n$$p(Y_{in} \\mid X) = \\prod_k \\mathcal{N}\\Big(Y_{in}[k] \\mid \\sum_j h_{ij} X_{jn}[k],\\, B_i[k]\\Big),$$\n$$p(Y \\mid X) = \\prod_{in} p(Y_{in} \\mid X), \\tag{4}$$\n\nwhich together with the speech model (3) forms a complete model $p(Y, X, S)$ for this problem. The DAG representing this model for the case $K = L = 2$ is shown in Fig. 3. Notice that this model generalizes [4] to the subband domain.\n\nFigure 3: Graphical model for noisy, non-reverberant $2 \\times 2$ mixing, showing a 3 frame-long sequence. All nodes $Y_{in}$ and $X_{jn}$ represent complex $(N/2 - 1)$-dimensional vectors (see Fig. 2). While $Y_{1n}$ and $Y_{2n}$ have the same parents, $X_{1n}$ and $X_{2n}$, the arcs from the parents to $Y_{2n}$ are omitted for clarity.\n\nThe model parameters $\\theta = \\{h_{ij}, B_i[k], A_{js}[k], \\pi_{js}\\}$ are estimated from data by an EM algorithm. 
However, as the number of speech components $M$ or the number of sources $K$ increases, the E-step becomes computationally intractable, as it requires summing over all $O(M^K)$ configurations of $(S_{1n}, \\ldots, S_{Kn})$ at each frame. We approximate the E-step using a variational technique: focusing on the posterior distribution $p(X, S \\mid Y)$, we compute an optimal tractable approximation $q(X, S \\mid Y) \\approx p(X, S \\mid Y)$, which we use to compute the sufficient statistics (SS). We choose\n\n$$q(X, S \\mid Y) = \\prod_{jn} q(X_{jn} \\mid S_{jn}, Y)\\, q(S_{jn} \\mid Y), \\tag{5}$$\n\nwhere the hidden variables are factorized over the sources, and also over the frames (the latter factorization is exact in this model, but is an approximation for reverberant mixing). This posterior maintains the dependence of $X$ on $S$, and thus the correlations between different subbands $X_{jn}[k]$. Notice also that this posterior implies a multimodal $q(X_{jn})$ (i.e., a mixture distribution), which is more accurate than unimodal posteriors often employed in variational approximations (e.g., [12]), but is also harder to compute. A slightly more general form which allows inter-frame correlations by employing $q(S \\mid Y) = \\prod_{jn} q(S_{jn} \\mid S_{j,n-1}, Y)$ may also be used, without increasing complexity.\n\nBy optimizing in the usual way (see [12,13]) a lower bound on the likelihood w.r.t. $q$, we obtain\n\n$$q(X_{jn}, S_{jn} = s \\mid Y) = \\Big[\\prod_k q(X_{jn}[k] \\mid S_{jn} = s, Y)\\Big]\\, q(S_{jn} = s \\mid Y), \\tag{6}$$\n\nwhere $q(X_{jn}[k] \\mid S_{jn} = s, Y) = \\mathcal{N}(X_{jn}[k] \\mid \\rho_{jns}[k], \\nu_{js}[k])$ and $q(S_{jn} = s \\mid Y) = \\gamma_{jns}$. Both the factorization over $k$ of $q(X_{jn} \\mid S_{jn})$ and its Gaussian functional form fall out from the optimization under the structural restriction (5) and need not be specified in advance. The variational parameters $\\{\\rho_{jns}[k], \\nu_{js}[k], \\gamma_{jns}\\}$, which depend on the data $Y$, constitute the SS and are computed in the E-step. The DAG representing this posterior is shown in Fig. 
4.\n\nFigure 4: Graphical model describing the variational posterior distribution applied to the model of Fig. 3. In the non-reverberant case, the components of this posterior at time frame $n$ are conditioned only on the data $Y_{in}$ at that frame; in the reverberant case, the components at frame $n$ are conditioned on the data $Y_{im}$ at all frames $m$. For clarity and space reasons, this distinction is not made in the figure.\n\nAfter learning, the sources are extracted from data by a variational approximation of the minimum mean squared error estimator,\n\n$$\\hat{X}_{jn}[k] = E(X_{jn}[k] \\mid Y) = \\int dX\\, q(X \\mid Y)\\, X_{jn}[k], \\tag{7}$$\n\ni.e., the posterior mean, where $q(X \\mid Y) = \\sum_S q(X, S \\mid Y)$. The time domain waveform $\\hat{x}_{jm}$ is then obtained by appropriately patching together the subband signals.\n\nM-step. The update rule for the mixing matrix $h_{ij}$ is obtained by solving the linear equation\n\n$$\\sum_k B_i[k]\\, \\eta_{ij,0}[k] = \\sum_{j'} h_{ij'} \\sum_k B_i[k]\\, \\lambda_{j'j,0}[k]. \\tag{8}$$\n\nThe update rule for the noise precisions $B_i[k]$ is omitted. The quantities $\\eta_{ij,m}[k]$ and $\\lambda_{j'j,m}[k]$ are computed from the SS; see [13] for details.\n\nE-step. The posterior means of the sources (7) are obtained by solving\n\n$$\\hat{X}_{jn}[k] = \\hat{\\nu}_{jn}[k]^{-1} \\sum_i B_i[k]\\, h_{ij} \\Big( Y_{in}[k] - \\sum_{j' \\neq j} h_{ij'} \\hat{X}_{j'n}[k] \\Big) \\tag{9}$$\n\nfor $\\hat{X}_{jn}[k]$, which is a $K \\times K$ linear system for each frequency $k$ and frame $n$. The equations for the SS are given in [13], which also describes experimental results.\n\n5 Separation of Reverberant Mixtures\n\nIn this section we extend the algorithm to the case of reverberant mixing. In that case, due to signal propagation in the medium, each sensor signal at time frame $n$ depends on the source signals not just at the same time but also at previous times. 
To describe this mathematically, the mixing matrix $h_{ij}$ must become a matrix of filters $h_{ij,m}$, and $y_{in} = \\sum_{jm} h_{ij,m} x_{j,n-m} + u_{in}$.\n\nIt may seem straightforward to extend the algorithm derived above to the present case. However, this appearance is misleading, because we have a time scale problem. Whereas our speech model $p(X, S)$ is frame based, the filters $h_{ij,m}$ are generally longer than the frame length $N$, typically 10 frames long and sometimes longer. It is unclear how one can work with both $X_{jn}$ and $h_{ij,m}$ on the same footing (and it is easy to see that a straightforward windowed FFT cannot solve this problem).\n\nThis is where the idea of subband filtering becomes very useful. Using (2) we have $Y_{in}[k] = \\sum_{jm} H_{ij,m}[k] X_{j,n-m}[k] + U_{in}[k]$, which yields the probabilistic model\n\n$$p(Y_{in} \\mid X) = \\prod_k \\mathcal{N}\\Big(Y_{in}[k] \\mid \\sum_{jm} H_{ij,m}[k] X_{j,n-m}[k],\\, B_i[k]\\Big). \\tag{10}$$\n\nHence, both $X$ and $Y$ are now frame based. Combining this equation with the speech model (3), we now have a complete model $p(Y, X, S)$ for the reverberant mixing problem. The DAG describing this model is shown in Fig. 5.\n\nFigure 5: Graphical model for noisy, reverberant $2 \\times 2$ mixing, showing a 3 frame-long sequence. Here we assume 2 frame-long filters, i.e., $m = 0, 1$ in Eq. (10), where the solid arcs from $X$ to $Y$ correspond to $m = 0$ (as in Fig. 3) and the dashed arcs to $m = 1$. While $Y_{1n}$ and $Y_{2n}$ have the same parents, $X_{1n}$ and $X_{2n}$, the arcs from the parents to $Y_{2n}$ are omitted for clarity.\n\nThe model parameters $\\theta = \\{H_{ij,m}[k], B_i[k], A_{js}[k], \\pi_{js}\\}$ are estimated from data by a variational EM algorithm, whose derivation generally follows the one outlined in the previous section. 
Notice that the exact E-step here is even more intractable, due to the history dependence introduced by the filters.\n\nM-step. The update rule for $H_{ij,m}$ is obtained by solving the Toeplitz system\n\n$$\\sum_{j'm'} H_{ij',m'}[k]\\, \\lambda_{j'j,m-m'}[k] = \\eta_{ij,m}[k], \\tag{11}$$\n\nwhere the quantities $\\lambda_{j'j,m}[k]$, $\\eta_{ij,m}[k]$ are computed from the SS (see [12]). The update rule for the $B_i[k]$ is omitted.\n\nE-step. The posterior means of the sources (7) are obtained by solving\n\n$$\\hat{X}_{jn}[k] = \\hat{\\nu}_{jn}[k]^{-1} \\sum_{im} B_i[k]\\, H_{ij,m-n}[k]^{\\star} \\Big( Y_{im}[k] - \\sum_{j'm' \\neq jm} H_{ij',m-m'}[k]\\, \\hat{X}_{j'm'}[k] \\Big) \\tag{12}$$\n\nfor $\\hat{X}_{jn}[k]$. Assuming $P$-frame-long filters $H_{ij,m}$, $m = 0 : P - 1$, this is a $KP \\times KP$ linear system for each frequency $k$. The equations for the SS are given in [13], which also describes experimental results.\n\n6 Extensions\n\nAn alternative technique we have been pursuing for approximating EM in our models is Sequential Rao-Blackwellized Monte Carlo. There, we sample state sequences $S$ from the posterior $p(S \\mid Y)$ and, for a given sequence, perform exact inference on the source signals $X$ conditioned on that sequence (observe that given $S$, the posterior $p(X \\mid S, Y)$ is Gaussian and can be computed exactly). In addition, we are extending our speech model to include features such as pitch [7] in order to improve separation performance, especially in cases with fewer sensors than sources [7\u20139]. Yet another extension is applying model selection techniques to infer the number of sources from data in a dynamic manner.\n\nAcknowledgments\n\nI thank Te-Won Lee for extremely valuable discussions.\n\nReferences\n\n[1] A.J. Bell, T.J. Sejnowski (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.\n\n[2] B.A. Pearlmutter, L.C. Parra (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. Proc. NIPS-96.\n\n[3] A. 
Cichocki, S.-I. Amari (2002). Adaptive Blind Signal and Image Processing. Wiley.\n\n[4] H. Attias (1999). Independent Factor Analysis. Neural Computation 11, 803-851.\n\n[5] T.-W. Lee et al. (Eds.) (2001). Proc. ICA 2001.\n\n[6] S. Griebel, M. Brandstein (2001). Microphone array speech dereverberation using coarse channel modeling. Proc. ICASSP 2001.\n\n[7] J. Hershey, M. Casey (2002). Audiovisual source separation via hidden Markov models. Proc. NIPS 2001.\n\n[8] S. Roweis (2001). One Microphone Source Separation. Proc. NIPS-00, 793-799.\n\n[9] G.-J. Jang, T.-W. Lee, Y.-H. Oh (2003). A probabilistic approach to single channel blind signal separation. Proc. NIPS 2002.\n\n[10] H. Attias, L. Deng, A. Acero, J.C. Platt (2001). A new method for speech denoising using probabilistic models for clean speech and for noise. Proc. Eurospeech 2001.\n\n[11] Y. Ephraim (1992). Statistical model based speech enhancement systems. Proc. IEEE 80(10), 1526-1555.\n\n[12] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods in graphical models. Machine Learning 37, 183-233.\n\n[13] H. Attias (2003). New EM algorithms for source separation and deconvolution with a microphone array. Proc. ICASSP 2003.\n", "award": [], "sourceid": 2238, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}