{"title": "Blind Separation of Delayed and Convolved Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 758, "page_last": 764, "abstract": null, "full_text": "Blind separation of delayed and convolved \n\nsources. \n\nTe-Won Lee \n\nMax-Planck-Society, GERMANY, \nAND Interactive Systems Group \n\nCarnegie Mellon University \nPittsburgh, PA 15213, USA \n\ntewonOes. emu. edu \n\nAnthony J. Bell \n\nComputational Neurobiology, \n\nThe Salk Institute \n\n10010 N. Torrey Pines Road \n\nLa Jolla, California 92037, USA \n\ntonyOsalk.edu \n\nRussell H. Lambert \n\nDept of Electrical Engineering \n\nUniversity of South California, USA \n\nrlambertOsipi.use.edu \n\nAbstract \n\nWe address the difficult problem of separating multiple speakers \nwith multiple microphones in a real room. We combine the work \nof Torkkola and Amari, Cichocki and Yang, to give Natural Gra(cid:173)\ndient information maximisation rules for recurrent (IIR) networks, \nblindly adjusting delays, separating and deconvolving mixed sig(cid:173)\nnals. While they work well on simulated data, these rules fail \nin real rooms which usually involve non-minimum phase transfer \nfunctions, not-invertible using stable IIR filters. An approach that \nsidesteps this problem is to perform infomax on a feedforward archi(cid:173)\ntecture in the frequency domain (Lambert 1996). We demonstrate \nreal-room separation of two natural signals using this approach. \n\n1 The problem. \n\nIn the linear blind signal processing problem ([3, 2] and references therein), N \nsignals, s(t) = [S1(t) ... SN(t)V, are transmitted through a medium so that an \narray of N sensors picks up a set of signals x(t) = [Xl (t) ... 
s_N(t)]^T, are transmitted through a medium so that an array of N sensors picks up a set of signals x(t) = [x_1(t) ... x_N(t)]^T, each of which has been mixed, delayed and filtered as follows: \n\nx_i(t) = Σ_{j=1}^{N} Σ_{k=0}^{M-1} a_{ijk} s_j(t - D_{ij} - k)    (1) \n\n(Here D_{ij} are entries in a matrix of delays and there is an M-point filter, a_{ij}, between the jth source and the ith sensor.) The problem is to invert this mixing without knowledge of it, thus recovering the original signals, s(t). \n\n2 Architectures. \n\nThe obvious architecture for inverting eq.1 is the feedforward one: \n\nu_i(t) = Σ_{j=1}^{N} Σ_{k=0}^{M-1} w_{ijk} x_j(t - d_{ij} - k)    (2) \n\nwhich has filters, w_{ij}, and delays, d_{ij}, which supposedly reproduce, at the u_i, the original uncorrupted source signals, s_i. This was the architecture implicitly assumed in [2]. However, it cannot solve the delay-compensation problem, since in eq.1 each delay, D_{ij}, delays a single source, while in eq.2 each delay, d_{ij}, is associated with a mixture, x_j. \n\nTorkkola [8] has addressed the problem of solving the delay-compensation problem with a feedback architecture. Such an architecture can, in principle, solve this problem, as shown earlier by Platt & Faggin [7]. Torkkola [9] also generalised the feedback architecture to remove dependencies across time, to achieve the deconvolution of mixtures which have been filtered, as in eq.1. \n\nHere we propose a slightly different architecture than Torkkola's ([9], eq.15). His architecture could fail since it is missing feedback cross-weights for t = 0, i.e. w_{ij0}. A full feedback system looks like: \n\nu_i(t) = x_i(t) - Σ_{j=1}^{N} Σ_{k=0}^{M-1} w_{ijk} u_j(t - d_{ij} - k)    (3) \n\nand is illustrated in Fig.1. 
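As a concrete illustration, the delayed and filtered mixing model of eq.1 can be sketched in a few lines of NumPy. This is our own minimal sketch, not code from the paper; the function name `mix` and the array layout are assumptions.

```python
import numpy as np

def mix(sources, A, D):
    """Mix N sources into N sensor signals per eq.1:
    x_i(t) = sum_j sum_k A[i,j,k] * s_j(t - D[i,j] - k).
    sources: (N, T) array; A: (N, N, M) filter taps; D: (N, N) integer delays."""
    N, T = sources.shape
    M = A.shape[2]
    x = np.zeros((N, T))
    for i in range(N):
        for j in range(N):
            for k in range(M):
                shift = D[i, j] + k          # total delay: matrix delay plus tap index
                if shift < T:
                    x[i, shift:] += A[i, j, k] * sources[j, :T - shift]
    return x
```

Each sensor thus receives every source through its own delay D_{ij} and M-point filter a_{ij}, which is exactly the structure the unmixing architectures below must invert.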
Because terms in u_i(t) appear on both sides, we rewrite this in vector terms: u(t) = x(t) - W_0 u(t) - Σ_{k=1}^{M-1} W_k u(t - k), in order to solve it as follows: \n\nu(t) = (I + W_0)^{-1} (x(t) - Σ_{k=1}^{M-1} W_k u(t - k))    (4) \n\nIn these equations, there is a feedback unmixing matrix, W_k, for each time point of the filter, but the 'leading matrix', W_0, has a special status in solving for u(t). The delay terms are useful since one metre of distance in air, at an 8kHz sampling rate, corresponds to a whole 25 zero-taps of a filter. Reintroducing them gives us: \n\nu(t) = (I + W_0)^{-1} (x(t) - net(t)),   net_i(t) = Σ_{j=1}^{N} Σ_{k=1}^{M-1} w_{ijk} u_j(t - d_{ij} - k)    (5) \n\nFigure 1: The feedback neural architecture of eq.5, which is used to separate and deconvolve signals. Each box represents a causal filter and each circle denotes a time delay. [Diagram: sources s_1, s_2 pass through the mixing system A(z) to give sensor signals x_1, x_2; the feedback network W(z) produces outputs u_1, u_2 while maximising the joint entropy H(y).] \n\n3 Algorithms. \n\nLearning in this architecture is performed by maximising the joint entropy, H(y(t)), of the random vector y(t) = g(u(t)), where g is a bounded monotonic nonlinear function (a sigmoid function). The success of this for separating sources depends on four assumptions: (1) that the sources are statistically independent, (2) that each source is white, i.e. there are no dependencies between time points, (3) that the non-linearity, g, has a derivative which has higher kurtosis than the probability density functions (pdf's) of the sources, and (4) that a stable IIR (feedback) inverse of the mixing exists, i.e. that A is minimum phase (see section 5). \n\nAssumption (1) is reasonable, and Assumption (3) allows some tailoring of our algorithm to fit data of different types. Assumption (2), on the other hand, is not true for natural signals. 
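The feedback recursion of eq.4 can be simulated directly: at each time step the leading matrix is inverted once, and past outputs are fed back through the filter matrices. A minimal sketch, assuming the delays d_{ij} of eq.5 are zero (the function name `feedback_separate` is ours, not the paper's):

```python
import numpy as np

def feedback_separate(x, W):
    """Run the feedback network of eq.4, sample by sample:
    u(t) = (I + W_0)^{-1} (x(t) - sum_{k>=1} W_k u(t-k)).
    x: (N, T) mixtures; W: (M, N, N) feedback matrices W_0..W_{M-1}.
    Delays d_ij are omitted here for simplicity."""
    N, T = x.shape
    M = W.shape[0]
    inv0 = np.linalg.inv(np.eye(N) + W[0])   # the 'leading matrix' solve
    u = np.zeros((N, T))
    for t in range(T):
        net = np.zeros(N)
        for k in range(1, min(M, t + 1)):    # feedback from past outputs
            net += W[k] @ u[:, t - k]
        u[:, t] = inv0 @ (x[:, t] - net)
    return u
```

For the static case M = 1, choosing W_0 = A - I makes (I + W_0)^{-1} the exact inverse of an instantaneous mixing matrix A, recovering the sources.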
Our algorithm will whiten: it will remove dependencies across time which already existed in the original source signals, s_i. However, it is possible to restore the characteristic autocorrelations (amplitude spectra) of the sources by post-processing. For the reasoning behind Assumption (3) see [2]. We will discuss Assumption (4) in section 5. \n\nIn the static feedback case of eq.5, when M = 1, the learning rule for the feedback weights W_0 is just a co-ordinate transform of the rule for the feedforward weights, Ŵ_0, in the equivalent architecture of u(t) = Ŵ_0 x(t). Since Ŵ_0 = (I + W_0)^{-1}, we have W_0 = Ŵ_0^{-1} - I, which, due to the quotient rule for matrix differentiation, differentiates as: \n\nΔW_0 = -Ŵ_0^{-1} ΔŴ_0 Ŵ_0^{-1}    (6) \n\nThe best way to maximise entropy in the feedforward system is not to follow the entropy gradient, as in [2], but to follow its 'natural' gradient, as reported by Amari et al [1]: \n\nΔŴ_0 ∝ (∂H(y)/∂Ŵ_0) Ŵ_0^T Ŵ_0    (7) \n\nThis is an optimal rescaling of the entropy gradient [1, 3]. It simplifies the learning rule and speeds convergence considerably. Evaluated, it gives [2]: \n\nΔŴ_0 ∝ (I + ŷu^T) Ŵ_0    (8) \n\nSubstituting into eq.6 gives the natural gradient rule for static feedback weights: \n\nΔW_0 ∝ -(I + W_0)(I + ŷu^T)    (9) \n\nThis reasoning may be extended to networks involving filters. For the feedforward filter architecture u(t) = Σ_{k=0}^{M-1} Ŵ_k x(t - k), we derive a natural gradient rule (for k > 0) of: \n\nΔŴ_k ∝ ŷ_t u_{t-k}^T Ŵ_k    (10) \n\nwhere, for convenience, time has become subscripted. Performing the same coordinate transforms as for W_0 above gives the rule: \n\nΔW_k ∝ -(I + W_0) ŷ_t u_{t-k}^T    (11) \n\n(We note that learning rules similar to these have been independently derived by Cichocki et al [4].) Finally, for the delays in eq.5, we derive [2, 8]: \n\nΔd_{ij} ∝ ∂H(y)/∂d_{ij} = -ŷ_i Σ_{k=1}^{M-1} w_{ijk} (∂/∂t) u_j(t - d_{ij} - k)    (12) \n\nThis rule is different from that in [8] because it uses the collected temporal gradient information from all the taps. The algorithms of eq.9, eq.11 and eq.12 are the ones we use in our experiments on the architecture of eq.5. \n\n4 Simulation results for the feedback architecture \n\nTo test the learning rules in eq.9, eq.11 and eq.12 we used an IIR filter system to recover two sources which had been mixed and delayed as follows (in Z-transform notation): \n\nA_11(z) = 0.9 + 0.5z^{-1} + 0.3z^{-2} \nA_21(z) = -0.7z^{-5} - 0.3z^{-6} - 0.2z^{-7} \nA_12(z) = 0.5z^{-5} + 0.3z^{-6} + 0.2z^{-7} \nA_22(z) = 0.8 - 0.1z^{-1}    (13) \n\nThe mixing system, A(z), is a minimum-phase system with all its zeros inside the unit circle. Hence, A(z) can be inverted using a stable causal IIR system, since all poles of the inverting system are also inside the unit circle. For this experiment, we chose an artificially-generated source: a white process with a Laplacian distribution [f(x) = exp(-|x|)]. In the frequency domain the deconvolving system looks as follows: \n\nu(z) = (1/D(z)) [W_11(z)  W_12(z); W_21(z)  W_22(z)] x(z)    (14) \n\nwhere D(z) = W_11(z)W_22(z) - W_12(z)W_21(z). This leads to the following solution for the weight filters: \n\nW_11(z) = A_22(z)   W_22(z) = A_11(z) \nW_21(z) = -A_21(z)   W_12(z) = -A_12(z)    (15) \n\nThe learning rule we used was that of eq.9 and eq.11 with the logistic non-linearity, y_i = 1/(1 + exp(-u_i)). Fig.2A shows the four filters learnt by our IIR algorithm. The bottom row shows the inverting system convolved with the mixing system, proving that W * A is approximately the identity mapping. Delay learning is not demonstrated here, though for periodic signals like speech we observed that it is subject to local minima problems [8, 9]. 
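The closed-form solution of eq.15 can be checked numerically: it takes W(z) to be the adjugate of the A(z) in eq.13, so the product W(z)A(z) reduces to det A(z) times the identity, i.e. each output is a single (filtered but unmixed) source. A sketch in NumPy, using polynomial convolution for the filter products (the helper `padd`, which zero-pads before adding, is ours):

```python
import numpy as np

# Filters of eq.13 as coefficient arrays in powers of z^{-1} (z^0, z^-1, ...).
A11 = np.array([0.9, 0.5, 0.3])
A21 = np.array([0, 0, 0, 0, 0, -0.7, -0.3, -0.2])
A12 = np.array([0, 0, 0, 0, 0, 0.5, 0.3, 0.2])
A22 = np.array([0.8, -0.1])

# Unmixing solution of eq.15 (the adjugate of A(z)).
W11, W22 = A22, A11
W21, W12 = -A21, -A12

def padd(p, q):
    """Add two polynomial coefficient arrays of different lengths."""
    n = max(len(p), len(q))
    return np.pad(p, (0, n - len(p))) + np.pad(q, (0, n - len(q)))

# Polynomial matrix product C(z) = W(z) A(z): the off-diagonal entries cancel
# exactly, and both diagonal entries equal det A(z).
C11 = padd(np.convolve(W11, A11), np.convolve(W12, A21))
C12 = padd(np.convolve(W11, A12), np.convolve(W12, A22))
C21 = padd(np.convolve(W21, A11), np.convolve(W22, A21))
C22 = padd(np.convolve(W21, A12), np.convolve(W22, A22))
```

Because A(z) is minimum phase, the residual diagonal filter det A(z) has a stable causal inverse, which is what the feedback (IIR) network realises implicitly.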
\n\nFigure 2: (A) Feedback (IIR) learning. (B) Feedforward (FIR) learning. Top two rows: learned unmixing filters for (A) IIR learning on minimum-phase mixing, and (B) FIR frequency-domain learning on non-minimum-phase mixing. Bottom row: the convolved mixing and unmixing systems. The delta-like response indicates successful blind unmixing. In (B) this occurs acausally, with a time-shift. \n\n5 Back to the feedforward architecture. \n\nThe feedback architecture is elegant but limited. It can only invert minimum-phase mixing (all zeros are inside the unit circle, meaning that all poles of the inverting system are as well). Unfortunately, real room acoustics usually involves non-minimum-phase mixing. \n\nThere does exist, however, a stable non-causal feedforward (FIR) inverse for non-minimum-phase mixing systems. The learning rules for such a system can be formulated using the FIR polynomial matrix algebra described by Lambert [5]. This may be performed in the time or frequency domain, the only requirements being that the inverting filters are long enough and that their main energy occurs more-or-less in their centre. This allows for the non-causal expansion of the non-minimum-phase roots, causing the roughly symmetrical \"flanged\" appearance of the filters in Fig.2B. \n\nFor convenience, we formulate the infomax and natural gradient infomax rules [2, 1] in the frequency domain: \n\nΔW ∝ W^{-H} + fft(ŷ)X^H    (16) \nΔW ∝ (I + fft(ŷ)U^H)W    (17) \n\nwhere the H superscript denotes the Hermitian transpose (complex conjugate). In these rules, as in eq.14, W is a matrix of filters and U and X are blocks of multi-sensor signal in the frequency domain. Note that the nonlinearity ŷ_i = (∂/∂u_i) ln|∂y_i/∂u_i| still operates in the time domain and the fft is applied at the output. 
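The per-frequency-bin update of eq.17 can be sketched as follows. This is a minimal illustration of one gradient step, assuming the logistic nonlinearity (for which ŷ = 1 - 2y); block overlap, normalisation and the learning rate are omitted, and the function name `natgrad_update` is ours:

```python
import numpy as np

def natgrad_update(W, u_block):
    """One natural-gradient infomax step per eq.17, per frequency bin:
    dW = (I + fft(yhat) U^H) W.
    W: (F, N, N) unmixing filter, one N x N matrix per frequency bin;
    u_block: (N, F) block of time-domain network outputs."""
    N, F = u_block.shape
    # Score function of the logistic nonlinearity, in the time domain:
    # y = 1/(1+exp(-u))  =>  yhat = d/du ln|dy/du| = 1 - 2y.
    yhat = 1.0 - 2.0 / (1.0 + np.exp(-u_block))
    Yf = np.fft.fft(yhat, axis=1)      # fft applied at the output, as in the text
    Uf = np.fft.fft(u_block, axis=1)
    dW = np.empty_like(W)
    for f in range(F):
        outer = np.outer(Yf[:, f], np.conj(Uf[:, f]))  # fft(yhat) U^H at this bin
        dW[f] = (np.eye(N) + outer) @ W[f]
    return dW
```

Because the update right-multiplies by W itself (the natural gradient rescaling), no matrix inversion is needed per bin, unlike the plain infomax rule of eq.16.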
\n\n6 Simulation results for the feedforward architecture \n\nTo show the learning rule in eq.17 working, we altered the transfer function in eq.13 as follows: \n\n(18) \n\nThis system is now non-minimum phase, having zeros outside the unit circle. The inverse system can be approximated by stable non-causal FIR filters. These were learnt using the learning rule of eq.17 (again, with the logistic non-linearity). The resulting learnt filters are shown in Fig.2B, where the leading weights were chosen to be at half the filter size (M/2). Non-causality of the filters can be clearly observed for W_12 and W_21, where there are non-zero coefficients before the maximum-amplitude weights. The bottom row of Fig.2B shows the successful separation by plotting the complete unmixing/mixing transfer function: W * A. \n\n7 Experiments with real recordings \n\nTo demonstrate separation in a real room, we set up two microphones and recorded firstly two people speaking and then one person speaking with music in the background. The microphones and the sources were each 60cm apart and 60cm from each other (arranged in a square), and the sampling rate was 16kHz. Fig.3A shows the two recordings of a person saying the digits \"one\" to \"ten\" while loud music plays in the background. The IIR system of eq.5, eq.9 and eq.11 was unable to separate these signals, presumably due to the non-minimum-phase nature of the room transfer functions. However, the algorithm of eq.17 converged after 30 passes through the 10-second recordings. The filter lengths were 256 (corresponding to 16ms). The separated signals are shown in Fig.3B. Listening to them conveys a sense of almost-clean separation, though interference is audible. The results on the two people speaking were similar. 
\n\nAn important application is in spontaneous speech recognition tasks, where the best recognizer may fail completely in the presence of background music or competing speakers (as in the teleconferencing problem). To test this application, we fed into a speech recognizer ten sentences recorded with loud music in the background and ten sentences recorded with a simultaneous speaker interference. After separation, the recognition rate increased considerably in both cases. These results are reported in detail in [6]. \n\n8 Conclusions \n\nStarting with 'Natural gradient infomax' IIR learning rules for blind time delay adjustment, separation and deconvolution, we showed how these worked well on minimum-phase mixing, but not on non-minimum-phase mixing, as usually occurs in rooms. This led us to an FIR frequency-domain infomax approach suggested by Lambert [5]. The latter approach shows much better separation of speech and music mixed in a real room. Based on these techniques, it should now be possible to develop real-world applications. \n\nFigure 3: Real-room separation/deconvolution. (A) recorded mixtures (B) separated speech (spoken digits 1-10) and music. \n\nAcknowledgments \n\nT.W.L. is supported by the Daimler-Benz-Fellowship, and A.J.B. by a grant from the Office of Naval Research. We are grateful to Kari Torkkola for sharing his results with us, and to Jürgen Fritsch, Terry Sejnowski and Alex Waibel for discussions and comments. \n\nReferences \n\n[1] Amari S-I., Cichocki A. & Yang H.H. 1996. A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems 8, MIT Press. \n\n[2] Bell A.J. & Sejnowski T.J. 1995. 
An information maximisation approach to blind separation and blind deconvolution, Neural Computation, 7, 1129-1159. \n\n[3] Cardoso J-F. & Laheld B. 1996. Equivariant adaptive source separation, IEEE Trans. on Signal Processing, Dec. 1996. \n\n[4] Cichocki A., Amari S-I. & Cao J. 1996. Blind separation of delayed and convolved signals with self-adaptive learning rate, in Proc. Intern. Symp. on Nonlinear Theory and Applications (NOLTA '96), Kochi, Japan. \n\n[5] Lambert R. 1996. Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures, PhD Thesis, University of Southern California, Department of Electrical Engineering, May 1996. \n\n[6] Lee T-W. & Orglmeister R. 1997. Blind source separation of real-world signals, submitted to Proc. ICNN, Houston, USA, 1997. \n\n[7] Platt J.C. & Faggin F. 1992. Networks for the separation of sources that are superimposed and delayed, in Moody J.E. et al (eds) Advances in Neural Information Processing Systems 4, Morgan-Kaufmann. \n\n[8] Torkkola K. 1996. Blind separation of delayed sources based on information maximisation, Proc. IEEE ICASSP, Atlanta, May 1996. \n\n[9] Torkkola K. 1996. Blind separation of convolved sources based on information maximisation, Proc. IEEE Workshop on Neural Networks for Signal Processing, Kyoto, Japan, Sept. 1996. \n", "award": [], "sourceid": 1235, "authors": [{"given_name": "Te-Won", "family_name": "Lee", "institution": null}, {"given_name": "Anthony", "family_name": "Bell", "institution": null}, {"given_name": "Russell", "family_name": "Lambert", "institution": null}]}