{"title": "One Microphone Source Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 799, "abstract": null, "full_text": "One Microphone Source Separation \n\nSam T. Roweis \n\nGatsby Unit, University College London \n\nroweis@gatsby.ucl.ac.uk \n\nAbstract \n\nSource separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (\"masking\") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering. \n\n1 Learning from data in computational auditory scene analysis \n\nImagine listening to many pianos being played simultaneously. If each pianist were striking keys randomly it would be very difficult to tell which note came from which piano. But if each were playing a coherent song, separation would be much easier because of the structure of music. Now imagine teaching a computer to do the separation by showing it many musical scores as \"training data\". Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. 
Bregman [1] called this process auditory scene analysis in analogy to vision. Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings and is very similar to the musical task described above. It has been investigated by research groups with different emphases. The CASA community has focused on both multiple and single microphone source separation problems under highly realistic acoustic conditions, but has used almost exclusively hand-designed systems which incorporate substantial knowledge of the human auditory system and its psychophysical characteristics (e.g. [2,3]). Unfortunately, it is difficult to incorporate large amounts of detailed statistical knowledge about the problem into such an approach. On the other hand, machine learning researchers, especially those working on independent components analysis (ICA) and related algorithms, have focused on the case of multiple microphones in simplified mixing environments and have used powerful \"blind\" statistical techniques. These \"unmixing\" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. Furthermore, since they often depend only on the joint amplitude histogram of the observations, they can be very sensitive to the details of filtering and reverberation in the environment. The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording. \n\n2 Refiltering vs. unmixing \n\nUnmixing algorithms reweight multiple simultaneous recordings mk(t) (generically called microphones) to form a new source object s(t): \n\ns(t) = α1 m1(t) + α2 m2(t) + ... + αK mK(t)    (1) \n\nwhere s(t) is the estimated source and mk(t) is the kth microphone. The unmixing coefficients αi are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. The intuition is that unmixing algorithms are finding spikes (or dents, for low kurtosis sources) in the marginal amplitude histogram. The time ordering of the datapoints is often irrelevant. \n\nUnmixing depends on a fine timescale, sample-by-sample comparison of several observation signals. Humans, on the other hand, cannot hear histogram spikes¹ and perform well on many monaural separation tasks. We are doing structural analysis, or a kind of perceptual grouping, on the incoming sound. But what is being grouped? There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. To generate these parts, one can perform multiband analysis: break the original signal y(t) into many subband signals bi(t), each filtered to contain only energy from a small portion of the spectrum. The results of such an analysis are often displayed as a spectrogram, which shows energy (using colour or grayscale) as a function of time (abscissa) and frequency (ordinate). (For example, one is shown on the top left of figure 5.) In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key. \n\nThe basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals bi(t). Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. 
Given a set of masking signals, denoted αi(t), a source s(t) can be recovered by modulating the corresponding subband signals from the original input and summing: \n\ns(t) = α1(t) b1(t) + α2(t) b2(t) + ... + αK(t) bK(t)    (2) \n\nwhere s(t) is the estimated source, αi(t) is the ith masking signal, and bi(t) is the ith sub-band. The αi(t) are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. This performs masking on the original spectrogram. (An equivalent operation can be performed in the frequency domain.²) This approach, illustrated in figure 1, forms the basis of many CASA approaches (e.g. [2,3,4]). \n\nFor any specific choice of masking signals αi(t), refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. Different sources can be isolated by choosing different masking signals. Henceforth, I will make a strong simplifying assumption that the αi(t) are binary and constant over a timescale T of roughly 30ms. This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. However, in practice, for small numbers of sources this approximation works quite well (figure 3). (Think of ignoring collisions by assuming separate piano players do not often hit the same note at the same time.) \n\n¹Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back. 
\n\n²Make a conventional spectrogram of the original signal y(t) and modulate the magnitude of each short-time DFT while preserving its phase: s_w(τ) = F⁻¹{ α_iw · |F{y_w(τ)}| · ∠F{y_w(τ)} }, where s_w(τ) and y_w(τ) are the wth windows (blocks) of the recovered and original signals, α_iw is the masking signal for subband i in window w, F[·] is the DFT, and ∠ denotes the phase. \n\nFigure 1: The refiltering approach to one microphone source separation. Multiband analysis of the original signal y(t) gives sub-band signals bi(t) which are modulated by masking signals αi(t) (binary or real valued between 0 and 1) and recombined to give the estimated source or object s(t). \n\nRefiltering can also be thought of as a highly nonstationary Wiener filter in which both the signal and noise spectra are re-estimated at a rate 1/T; the binary assumption is equivalent to assuming that over a timescale T the signal and noise spectra are nonoverlapping. \n\nIt is a fortunate empirical fact that refiltering, even with binary masking signals, can cleanly separate sources from a single mixed recording. This can be demonstrated by taking several isolated sources or noises and mixing them in a controlled way. Since the original components are known, an \"optimal\" set of masking signals can be computed. For example, we might set αi(t) equal to the ratio of energy from one source in band i around times t ± T to the sum of energies from all sources in the same band at that time (as recommended by the Wiener filter), or to a binary version which thresholds this ratio. Constructing masks in this way is also useful for generating labeled training data, as discussed below. \n\n3 Multiband grouping as a statistical pattern recognition problem \n\nSince one-microphone source separation using refiltering is possible if the masking signals are well chosen, the essential problem becomes: how can the αi(t) be computed automatically from a single mixed recording? 
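As a concrete illustration of refiltering with an "optimal" binary mask (possible here because the sources are known before mixing), the following NumPy sketch mixes two synthetic tones, masks each non-overlapping DFT block by comparing the two sources' band magnitudes, and resynthesizes while preserving the mixture's phase, as in footnote 2. The signals, block length, and frequencies are illustrative choices of mine, not values from the paper.

```python
import numpy as np

# Two synthetic "sources": sinusoids concentrated in different bands.
fs, T = 8000, 8192
t = np.arange(T) / fs
s1 = np.sin(2 * np.pi * 440 * t)           # low-frequency source
s2 = 0.8 * np.sin(2 * np.pi * 2500 * t)    # high-frequency source
y = s1 + s2                                # single-microphone mixture

W = 256  # block length; non-overlapping rectangular windows keep inversion exact
Y = np.fft.rfft(y.reshape(-1, W), axis=1)  # per-block DFT of the mixture

# "Optimal" binary mask: in each block and band, compare the (known)
# magnitudes of the two isolated sources, as suggested in section 2.
S1 = np.fft.rfft(s1.reshape(-1, W), axis=1)
S2 = np.fft.rfft(s2.reshape(-1, W), axis=1)
alpha = (np.abs(S1) > np.abs(S2)).astype(float)   # 1 where source 1 dominates

# Refilter: modulate the mixture magnitudes, keep the mixture phase, invert.
S1_hat = alpha * np.abs(Y) * np.exp(1j * np.angle(Y))
s1_hat = np.fft.irfft(S1_hat, n=W, axis=1).ravel()

# SNR of source 1 before and after refiltering.
snr_before = 10 * np.log10(np.sum(s1**2) / np.sum((y - s1)**2))
snr_after = 10 * np.log10(np.sum(s1**2) / np.sum((s1_hat - s1)**2))
```

Because the two tones occupy nearly disjoint bands, the binary mask removes almost all of the interfering source, illustrating why binary masking works well when sources rarely "collide" in time-frequency.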
The goal is to group or \"tag\" together regions of the spectrogram that belong to the same auditory object. Fortunately, in audition (as in vision), natural signals, especially speech, exhibit a lot of regularity in the way energy is distributed across the time-frequency plane. Grouping cues based on these regularities have been studied for many years by psychophysicists and are hand built into many CASA systems. Cues are based on the idea of suspicious coincidences: roughly, \"things that move together likely belong together\". Thus, frequencies which exhibit common onsets, offsets, or upward/downward sweeps are more likely to be grouped into the same stream (figure 2). Also, many real world sounds have harmonic spectra, so frequencies which lie exactly on a harmonic \"stack\" are often perceptually grouped together. (Musically, piano players do not hit keys randomly, but instead use chords and repeated melodies.) \n\nFigure 2: Examples of three common grouping cues for energy which often comes from a single source. (left) Harmonic stacking: frequencies which lie exactly on harmonic multiples of a single base frequency. (middle) Common onset: frequencies which suddenly increase or decrease their energy together. (right) Frequency co-modulation: energy which moves up or down in frequency at the same time. \n\nThere are several ways that statistical pattern recognition might be applied to take advantage of these cues. Methods may be roughly grouped into unsupervised ones, which learn models of isolated sources and then try to explain mixed input as being caused by the interaction of individual source models; and supervised methods, which explicitly model grouping in mixed acoustic input but require labeled data consisting of mixed input as well as masking signals. 
Luckily it is very easy to generate such data by mixing isolated sources in a controlled way, although the subsequent supervised learning can be difficult.³ \n\nFigure 3: Each point represents the energy from one source versus another in a narrow frequency band over a 32ms window. The plot shows all frequencies over a 2 second period from a speech mixture. Typically when one source has large energy the other does not. The binary assumption on the masking signals αi(t) is equivalent to projecting the points shown onto either the horizontal or vertical axis. \n\n4 Results using factorial-max HMMs \n\nHere, I will describe one (purely unsupervised) method I have pursued for automatically generating masking signals from a single microphone. The approach first trains speaker dependent hidden Markov models (HMMs) on isolated data from single talkers. These pre-trained models are then combined in a particular way to build a separation system. \n\nFirst, for each speaker, a simple HMM is fit using patches of narrowband spectrograms as the pattern vectors.⁴ The emission densities model the typical spectral patterns produced by each talker, while the transition probabilities encourage spectral continuity. HMM training was initialized by first training a mixture of Gaussians on each speaker's data (with a single shared covariance matrix) independent of time order. Each mixture had 8192 components of dimension 1026 = 513 × 2; thus each HMM had 8192 states. To avoid overfitting, the transition matrices were regularized after training so that each transition (even those unobserved in the training set) had a small finite probability. \n\nNext, to separate a new single recording which is a mixture of known speakers, these pre-trained models are combined into a factorial hidden Markov model (FHMM) architecture [5]. A FHMM consists of two or more underlying Markov chains (the hidden states) which evolve independently. 
The observation yt at any time depends on the states of all the chains. A simple way to model this dependence is to have each chain c independently propose an output yt^c and then combine them to generate the observation according to some rule yt = Q(yt^1, yt^2, ..., yt^C). Below, I use a model with only two chains, whose states are denoted xt and zt. At each time, one chain proposes an output vector a_xt and the other proposes b_zt. The key part of the model is the function Q: observations are generated by taking the elementwise maximum of the proposals and adding noise. This maximum operation reflects the observation that the log magnitude spectrogram of a mixture of sources is very nearly the elementwise maximum of the individual spectrograms. The full generative model for this \"factorial-max HMM\" can be written simply as: \n\np(xt = j | xt-1 = i) = Tij    (3) \np(zt = j | zt-1 = i) = Uij    (4) \np(yt | xt, zt) = N(max[a_xt, b_zt], R)    (5) \n\n³Recall that refiltering can only isolate one auditory stream at a time from the scene (we are always separating \"a source\" from \"the background\"). This makes learning the masking signals an unusual problem because for any input (spectrogram) there are as many correct answers as objects in the scene. Such a highly multimodal distribution on outputs given inputs means that the mapping from auditory input to masking signals cannot be learned using backprop or other single-valued function approximators, which take the average of the possible maskings present in the training data. \n\n⁴The observations are created by concatenating the values of 2 adjacent columns of the log magnitude periodogram into a single vector. The original waveforms were sampled at 16kHz. Periodogram windows of 32ms at a frame rate of 16ms were analyzed using a Hamming-tapered DFT zero padded to length 1024. This gave 513 frequency samples from DC to Nyquist. 
Average signal energy was normalized across the most recent 8 frames before computing each DFT. \n\nwhere N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ, and max[·] is the elementwise maximum operation on two vectors. (There are also densities on the initial states x1 and z1.) This model is illustrated in figure 4. It ignores two aspects of the spectrogram data: first, Gaussian noise is used although the observations are nonnegative; second, the probability factor requiring the non-maximum output proposal to be less than the maximum proposal is missing. However, in practice these approximations are not too severe, and making them allows an efficient inference procedure (see below). \n\nFigure 4: Factorial HMM with max output semantics. Two Markov chains xt and zt evolve independently. Observations yt are the elementwise max of the individual emission vectors, max[a_xt, b_zt], plus Gaussian noise. \n\nIn the experiment presented below, each chain represents a speaker dependent HMM (one male and one female). The emission and transition probabilities from each speaker's pre-trained HMM were used as the parameters for the combined FHMM. (The output noise covariance R is shared between the two HMMs.) \n\nGiven an input waveform, the observation sequence Y = y1, ..., yT is created from the spectrogram as before.⁴ Separation is done by first inferring a joint underlying state sequence {xt, zt} of the two Markov chains in the model and then using the difference of their individual output predictions to compute a binary masking signal: \n\nαt(i) = 1 if a_xt(i) > b_zt(i), and 0 if a_xt(i) ≤ b_zt(i)    (6) \n\nIdeally, the inferred state sequences {xt, zt} should be the mode of the posterior distribution p(xt, zt | Y). 
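To make the generative semantics of equations (3)-(5) and the mask of equation (6) concrete, here is a tiny NumPy simulation of a factorial-max HMM. All sizes, transition matrices, and per-state output vectors are toy values invented for illustration; the paper's chains are 8192-state speaker models with learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny factorial-max HMM: two independent chains (eqs 3-4), each with its own
# transition matrix and a nonnegative "spectral" output vector per state.
Kx, Kz, D, T = 3, 3, 8, 50
Tx = np.full((Kx, Kx), 0.1) + 0.7 * np.eye(Kx)   # sticky toy transitions
Tz = np.full((Kz, Kz), 0.1) + 0.7 * np.eye(Kz)
Tx /= Tx.sum(axis=1, keepdims=True)
Tz /= Tz.sum(axis=1, keepdims=True)
A = rng.uniform(0.0, 1.0, (Kx, D))   # output proposals a_x for chain 1
B = rng.uniform(0.0, 1.0, (Kz, D))   # output proposals b_z for chain 2

x = np.zeros(T, dtype=int)
z = np.zeros(T, dtype=int)
Y = np.zeros((T, D))
for t in range(T):
    if t > 0:
        x[t] = rng.choice(Kx, p=Tx[x[t - 1]])
        z[t] = rng.choice(Kz, p=Tz[z[t - 1]])
    # Observation = elementwise max of the two proposals, plus noise (eq 5).
    Y[t] = np.maximum(A[x[t]], B[z[t]]) + 0.01 * rng.standard_normal(D)

# Eq (6): binary mask from comparing the chains' output predictions
# (here using the true states; in the paper the states are inferred).
alpha = (A[x] > B[z]).astype(float)   # T x D masking signal
```

Each row of `alpha` marks the sub-bands where chain 1's prediction dominates, which is exactly the refiltering mask used in section 4.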
Since the hidden chains share a single visible output variable, naive inference in the FHMM graphical model yields an intractable amount of work, exponential in the size of the state space of each submodel. However, because all of the observations are nonnegative and the max operation is used to combine output proposals, there is an efficient trick for computing the best joint state trajectory. At each time, we can upper bound the log-probability of generating the observation vector if one chain is in state i, no matter what state the other chain is in. Computing these bounds for each state setting of each chain requires only a linear amount of work in the size of the state spaces. With these bounds in hand, each time we evaluate the probability of a specific pair of states we can eliminate from consideration all state settings of either chain whose bounds are worse than the achieved probability. If pairs of states are evaluated in a sensible heuristic order (for example by ranking the bounds), in practice this results in almost all possible configurations being quickly eliminated. (This trick turns out to be equivalent to αβ search in game trees.) \n\nThe training data for the model consists only of spectrograms of isolated examples of each speaker, but inference can be done on test data which is a spectrogram of a single mixture of known speakers. The results of separating a simple two speaker mixture are shown below. The test utterance was formed by linearly mixing two out-of-sample utterances (one male and one female) from the same speakers as the models were trained on. Figure 5 shows the original mixed spectrogram (top left) as well as the sequence of outputs a_xt (bottom left) and b_zt (bottom right) from each chain. The chain with the maximum output in any sub-band at any time has αi(t) = 1, otherwise αi(t) = 0 (top right). The FHMM system achieves good separation from only a single microphone (see figure 6). 
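The per-frame bounding trick can be sketched as follows. Because the combined emission is the elementwise max of the two proposals, it is elementwise at least as large as either chain's proposal alone, so any dimension where a proposal already overshoots the observation contributes an unavoidable squared error. That gives a cheap upper bound on the Gaussian log-likelihood for each single-chain state, whatever the other chain does. The sketch below uses invented toy sizes and a shared spherical noise variance; the full algorithm would embed this pruning inside a Viterbi-style search over time.

```python
import numpy as np

rng = np.random.default_rng(1)
Kx, Kz, D = 40, 40, 16               # toy state-space and feature sizes
A = rng.uniform(0.0, 1.0, (Kx, D))   # chain-1 output proposals (nonnegative)
B = rng.uniform(0.0, 1.0, (Kz, D))   # chain-2 output proposals (nonnegative)
y = np.maximum(A[7], B[23]) + 0.01 * rng.standard_normal(D)  # one observation
sigma2 = 0.01 ** 2

def pair_ll(i, j):
    """Exact log-likelihood (up to an additive constant) of the pair (i, j)."""
    r = y - np.maximum(A[i], B[j])
    return -0.5 * np.dot(r, r) / sigma2

def bound(P):
    """Upper bound on pair_ll for each row of P, for ANY state of the other
    chain: the combined output is elementwise >= the proposal, so wherever the
    proposal exceeds y the residual is at least that overshoot."""
    over = np.maximum(P - y, 0.0)
    return -0.5 * (over ** 2).sum(axis=1) / sigma2

bx, bz = bound(A), bound(B)

# Evaluate pairs in decreasing order of the bound min(bx[i], bz[j]); once the
# best exact score reaches the next pair's bound, no remaining pair can win.
order = sorted(((i, j) for i in range(Kx) for j in range(Kz)),
               key=lambda p: -min(bx[p[0]], bz[p[1]]))
best, best_pair, evaluated = -np.inf, None, 0
for i, j in order:
    if min(bx[i], bz[j]) <= best:
        break                        # all later pairs are bounded out
    ll = pair_ll(i, j)
    evaluated += 1
    if ll > best:
        best, best_pair = ll, (i, j)
```

Computing `bx` and `bz` costs linear work in the state-space sizes, and with a low noise variance the exact score of the first good pair typically eliminates almost all of the Kx × Kz candidates, mirroring the pruning behavior described in the text.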
\n\nFigure 5: (top left) Original spectrogram of mixed utterance. (bottom) Male and female spectrograms predicted by the factorial HMM and used to compute refiltering masks. (top right) Masking signals αi(t), computed by comparing the magnitudes of each model's predictions. \n\n5 Conclusions \n\nIn this paper I have argued for the marriage of learning algorithms with the refiltering approach to CASA. I have presented results from a simple factorial HMM system on a speaker dependent separation problem which indicate that automatically learned one-microphone separation systems may be possible. In the machine learning community, the one-microphone separation problem has received much less attention than unmixing problems, while CASA researchers have not employed automatic learning techniques to full effect. Scene analysis is an interesting and challenging learning problem with exciting and practical applications, and the refiltering setup has many nice properties. First, it can work if the masking signals are chosen properly. Second, it is easy to generate lots of training data, both supervised and unsupervised. Third, a good learning algorithm, when presented with enough data, should automatically discover the sorts of grouping cues which have been built into existing systems by hand. \n\nFurthermore, in the refiltering paradigm there is no need to make a hard decision about the number of sources present in an input. Each proposed masking has an associated score or probability; groupings with high scores can be considered \"sources\", while ones with low scores might be parts of the background or mixtures of other faint sources. CASA returns a collection of candidate maskings and their associated scores, and then it is up to the user to decide, based on the range of scores, the number of sources in the scene. 
\n\nMany existing approaches to speech and audio processing have the potential to be applied to the monaural source separation problem. The unsupervised factorial HMM system presented in this paper is very similar to the work in the speech recognition community on parallel model combination [6,7]; however, rather than using the combined models to evaluate the likelihood of speech in noise, the efficiently inferred states are being used to generate a masking signal for refiltering. Wan and Nelson have developed dual EKF methods [8] and applied them to speech denoising, but have also informally demonstrated their potential application to monaural source separation. Attias and colleagues [9] developed a fully probabilistic model of speech in noise and used variational Bayesian techniques to perform inference and learning, allowing denoising and dereverberation; their approach clearly has the potential to be applied to the separation problem as well. Cauwenberghs [10] has a very promising approach to the problem for purely harmonic signals that takes advantage of powerful phase constraints which are ignored by other algorithms. \n\nUnsupervised and supervised approaches can be combined to various degrees. Learning models of isolated sounds may be useful for developing feature detectors; conjunctions of such feature detectors can then be trained in a supervised fashion using labeled data. \n\nFigure 6: Test separation results, using a 2-chain speaker dependent factorial-max HMM, followed by refiltering. (See figure 4 and text for details.) (A) Original waveform of mixed utterance. (B) Original isolated male & female waveforms. (C) Estimated male and female waveforms. \n\nThe oscillatory correlation algorithm of Brown and Wang [4] has a low level module to detect features in the correlogram and a high level module to do grouping. 
Related ideas in machine vision, such as Markov networks [11] and minimum normalized cut [12], use low level operations to define weights between pixels and then higher level computations to group pixels together. \n\nAcknowledgements \n\nThanks to Hagai Attias, Guy Brown, Geoff Hinton and Lawrence Saul for many insightful discussions about the CASA problem, and to three anonymous referees and many visitors to my poster for helpful comments, criticisms and references to work I had overlooked. \n\nReferences \n\n[1] A.S. Bregman. (1994) Auditory Scene Analysis. MIT Press. \n[2] G. Brown & M. Cooke. (1994) Computational auditory scene analysis. Computer Speech and Language 8. \n[3] D. Ellis. (1994) A computer implementation of psychoacoustic grouping rules. Proc. 12th Intl. Conf. on Pattern Recognition, Jerusalem. \n[4] G. Brown & D.L. Wang. (2000) An oscillatory correlation framework for computational auditory scene analysis. NIPS 12. \n[5] Z. Ghahramani & M.I. Jordan. (1997) Factorial hidden Markov models. Machine Learning 29. \n[6] A.P. Varga & R.K. Moore. (1990) Hidden Markov model decomposition of speech and noise. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'90). \n[7] M.J.F. Gales & S.J. Young. (1996) Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech & Audio Processing 4. \n[8] E.A. Wan & A.T. Nelson. (1998) Removal of noise from speech using the dual EKF algorithm. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'98). \n[9] H. Attias, J.C. Platt & A. Acero. (2001) Speech denoising and dereverberation using probabilistic models. This volume. \n[10] G. Cauwenberghs. (1999) Monaural separation of independent acoustical components. IEEE Symp. Circuits & Systems (ISCAS'99). \n[11] W. Freeman & E. Pasztor. (1999) Markov networks for low-level vision. Mitsubishi Electric Research Laboratory Technical Report TR99-08. \n[12] J. Shi & J. Malik. 
(1997) Normalized cuts and image segmentation. IEEE Conf. Computer Vision and Pattern Recognition, Puerto Rico (CVPR'97). \n", "award": [], "sourceid": 1885, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}]}