{"title": "Blind One-microphone Speech Separation: A Spectral Learning Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": null, "full_text": " Blind one-microphone speech separation:\n A spectral learning approach\n\n\n\n Francis R. Bach Michael I. Jordan\n Computer Science Computer Science and Statistics\n University of California University of California\n Berkeley, CA 94720 Berkeley, CA 94720\n fbach@cs.berkeley.edu jordan@cs.berkeley.edu\n\n\n\n\n Abstract\n\n We present an algorithm to perform blind, one-microphone speech sep-\n aration. Our algorithm separates mixtures of speech without modeling\n individual speakers. Instead, we formulate the problem of speech sep-\n aration as a problem in segmenting the spectrogram of the signal into\n two or more disjoint sets. We build feature sets for our segmenter using\n classical cues from speech psychophysics. We then combine these fea-\n tures into parameterized affinity matrices. We also take advantage of the\n fact that we can generate training examples for segmentation by artifi-\n cially superposing separately-recorded signals. Thus the parameters of\n the affinity matrices can be tuned using recent work on learning spectral\n clustering [1]. This yields an adaptive, speech-specific segmentation al-\n gorithm that can successfully separate one-microphone speech mixtures.\n\n\n1 Introduction\n\nThe problem of recovering signals from linear mixtures, with only partial knowledge of the\nmixing process and the signals--a problem often referred to as blind source separation--\nis a central problem in signal processing. It has applications in many fields, including\nspeech processing, network tomography and biomedical imaging [2]. 
When the problem is\nover-determined, i.e., when there are no more signals to estimate (the sources) than signals\nthat are observed (the sensors), generic assumptions such as statistical independence of the\nsources can be used in order to demix successfully [2]. Many interesting applications,\nhowever, involve under-determined problems (more sources than sensors), where more\nspecific assumptions must be made in order to demix. In problems involving at least two\nsensors, progress has been made by appealing to sparsity assumptions [3, 4].\n\nHowever, the most extreme case, in which there is only one sensor and two or more sources,\nis a much harder and still-open problem for complex signals such as speech. In this setting,\nsimple generic statistical assumptions do not suffice. One approach to the problem involves\na return to the spirit of classical engineering methods such as matched filters, and estimating\nspecific models for specific sources--e.g., specific speakers in the case of speech [5, 6].\nWhile such an approach is reasonable, it departs significantly from the desideratum of\n\"blindness.\" In this paper we present an algorithm that is a blind separation algorithm--our\nalgorithm separates speech mixtures from a single microphone without requiring models\nof specific speakers.\n\n\f\nOur approach involves a \"discriminative\" approach to the problem of speech separation.\nThat is, rather than building a complex model of speech, we instead focus directly on the\ntask of separation and optimize parameters that determine separation performance. We\nwork within a time-frequency representation (a spectrogram), and exploit the sparsity of\nspeech signals in this representation. That is, although two speakers might speak simul-\ntaneously, there is relatively little overlap in the time-frequency plane if the speakers are\ndifferent [5, 4]. We thus formulate speech separation as a problem in segmentation in the\ntime-frequency plane. 
In principle, we could appeal to classical segmentation methods from vision (see, e.g., [7]) to solve this two-dimensional segmentation problem. Speech segments are, however, very different from visual segments, reflecting very different underlying physics. Thus we must design features for segmenting speech from first principles.\n\nIt also proves essential to combine knowledge-based feature design with learning methods. In particular, we exploit the fact that in speech we can generate \"training examples\" by artificially superposing two separately-recorded signals. Making use of our earlier work on learning methods for spectral clustering [1], we use the training data to optimize the parameters of a spectral clustering algorithm. This yields an adaptive, \"discriminative\" segmentation algorithm that is optimized to separate speech signals.\n\nWe highlight one other aspect of the problem here--the major computational challenge involved in applying spectral methods to speech separation. Indeed, four seconds of speech sampled at 5.5 kHz yields 22,000 samples, and thus we need to manipulate affinity matrices of dimension at least 22,000 × 22,000. Thus a major part of our effort has involved the design of numerical approximation schemes that exploit the different time scales present in speech signals.\n\nThe paper is structured as follows. Section 2 provides a review of basic methodology. In Section 3 we describe our approach to feature design based on known cues for speech separation [8, 9]. Section 4 shows how parameterized affinity matrices based on these cues can be optimized in the spectral clustering setting. 
We describe our experimental results in Section 5 and present our conclusions in Section 6.\n\n2 Speech separation as spectrogram segmentation\n\nIn this section, we first review the relevant properties of speech signals in the time-frequency representation and then describe how our training sets are constructed.\n\n2.1 Spectrogram\n\nThe spectrogram is a two-dimensional (time and frequency) redundant representation of a one-dimensional signal [10]. Let f[t], t = 0, ..., T − 1 be a signal in R^T. The spectrogram is defined through windowed Fourier transforms and is commonly referred to as a short-time Fourier transform or as Gabor analysis [10]. The value (Uf)_mn of the spectrogram at time window n and frequency m is defined as (Uf)_mn = (1/M) Σ_{t=0}^{T−1} f[t] w[t − na] e^{i2πmt/M}, where w is a window of length T with small support of length c. We assume that the number of samples T is an integer multiple of a and c. There are then N = T/a different windows of length c. The spectrogram is thus an N × M image which provides a redundant time-frequency representation of time signals1 (see Figure 1).\n\nInversion Our speech separation framework is based on the segmentation of the spectrogram of a signal f[t] into S ≥ 2 disjoint subsets Ai, i = 1, ..., S of [0, N − 1] × [0, M − 1].\n\n 1In our simulations, the sampling frequency is f0 = 5.5 kHz and we use a Hanning window of length c = 216 (i.e., 43.2 ms). The spacing between windows is equal to a = 54 (i.e., 10.8 ms). We use a 512-point FFT (M = 512). 
For a speech sample of length 4 sec, we have T = 22,000 samples and then N = 407, which makes ≈ 2 × 10^5 spectrogram pixels.\n\n\f\nFigure 1: Spectrogram of speech; (left) single speaker, (right) two simultaneous speakers. The gray intensity is proportional to the magnitude of the spectrogram.\n\nThis leads to S spectrograms Ui such that (Ui)_mn = U_mn if (m, n) ∈ Ai and zero otherwise--note that the phase is kept the same as that of the original mixed signal. We now need to find S speech signals fi[t] such that each Ui is the spectrogram of fi. In general there are no exact solutions (because the representation is redundant), and a classical technique is to find the minimum L2 norm approximation, i.e., find fi such that ||Ui − Ufi||^2 is minimal [10]. The solution of this minimization problem involves the pseudo-inverse of the linear operator U [10] and is equal to fi = (U*U)^{−1} U* Ui. By our choice of window (Hanning), U*U is proportional to the identity matrix, so that the solution to this problem can simply be obtained by applying the adjoint operator U*.\n\nNormalization and subsampling There are several ways of normalizing a speech signal. In this paper, we chose to rescale all speech signals as follows: for each time window n, we compute the total energy en = Σ_m |(Uf)_mn|^2, and its 20-point moving average. The signals are normalized so that the 80th percentile of those values is equal to one.\n\nIn order to reduce the number of spectrogram samples to consider, for a given pre-normalized speech signal, we threshold coefficients whose magnitudes are less than a value that was chosen so that the distortion is inaudible.\n\n2.2 Generating training samples\n\nOur approach is based on a learning algorithm that optimizes a segmentation criterion. The training examples that we provide to this algorithm are obtained by mixing separately-normalized speech signals. 
That is, given two volume-normalized speech signals f1, f2 of the same duration, with spectrograms U1 and U2, we build a training sample as U^train = U1 + U2, with a segmentation given by z = arg min{U1, U2}. In order to obtain better training partitions (and in particular to be more robust to the choice of normalization), we also search over all λ ∈ [0, 1] such that the least squares reconstruction error of the waveform obtained from segmenting/reconstructing using z = arg min{λU1, (1 − λ)U2} is minimized. An example of such a partition is shown in Figure 2 (left).\n\n3 Features and grouping cues for speech separation\n\nIn this section we describe our approach to the design of features for the spectral segmentation. We base our design on classical cues suggested from studies of perceptual grouping [11]. Our basic representation is a \"feature map,\" a two-dimensional representation that has the same layout as the spectrogram. Each of these cues is associated with a specific time scale, which we refer to as \"small\" (less than 5 frames), \"medium\" (10 to 20 frames), and \"large\" (across all frames). (These scales will be of particular relevance to the design of numerical approximation methods in Section 4.3.) No single feature is sufficient for separation by itself; rather, it is the combination of several features that makes our approach successful.\n\n3.1 Non-harmonic cues\n\nThe following non-harmonic cues have counterparts in visual scenes and for these cues we are able to borrow from feature design techniques used in image segmentation [7].\n\nContinuity Two time-frequency points are likely to belong to the same segment if they are close in time or frequency; we thus use time and frequency directly as features. This cue acts at a small time scale.\n\nCommon fate cues Elements that exhibit the same time variation are likely to belong to the same source. This takes several particular forms. 
The first is simply common offset and common onset. We thus build an offset map and an onset map, with elements that are zero when no variation occurs, and are large when there is a sharp decrease or increase (with respect to time) for that particular time-frequency point. The onset and offset maps are built using oriented energy filters as used in vision (with one vertical orientation). These are obtained by convolving the spectrogram with derivatives of Gaussian windows [7].\n\nAnother form of the common fate cue is frequency co-modulation, the situation in which frequency components of a single source tend to move in sync. To capture this cue we simply use oriented filter outputs for a set of orientation angles (8 in our simulations). Those features act mainly at a medium time scale.\n\n3.2 Harmonic cues\n\nThis is the major cue for voiced speech [12, 9, 8], and it acts at all time scales (small, medium and large): voiced speech is locally periodic and the local period is usually referred to as the pitch.\n\nPitch estimation In order to use harmonic information, we need to estimate potentially several pitches. We have developed a simple pattern matching framework for doing this that we present in Appendix A. If S pitches are sought, the output that we obtain from the pitch extractor is, for each time frame n, the S pitches ω_n1, ..., ω_nS, as well as the strength y_nms of the s-th pitch for each frequency m.\n\nTimbre The pitch extraction algorithm presented in Appendix A also outputs the spectral envelope of the signal [12]. This can be used to design an additional feature related to timbre which helps integrate information regarding speaker identification across time. Timbre can be loosely defined as the set of properties of a voiced speech signal once the pitch has been factored out [8]. 
We add the spectral envelope as a feature (reducing its dimensionality using principal component analysis).\n\nBuilding feature maps from pitch information We build a set of features from the pitch information. Given a time-frequency point (m, n), let s(m, n) = arg max_s y_nms denote the highest energy pitch, and define the features ω_{n s(m,n)}, y_{nm s(m,n)}, y_{nm s(m,n)} / Σ_{m'} y_{nm' s(m,n)}, and y_{nm s(m,n)} / (Σ_{m'} y_{nm' s(m,n)})^{1/2}. We use a partial normalization with the square root to avoid including very low energy signals, while allowing a significant difference between the local amplitudes of the speakers.\n\nThose features all come with some form of energy level, and all features involving pitch values should take this energy into account when the affinity matrix is built in Section 4. Indeed, the values of the harmonic features have no meaning when no energy in that pitch is present.\n\n4 Spectral clustering and affinity matrices\n\nGiven the features described in the previous section, we now show how to build affinity (i.e., similarity) matrices that can be used to define a spectral segmenter. In particular, our approach builds parameterized affinity matrices, and uses a learning algorithm to adjust these parameters.\n\n4.1 Spectral clustering\n\nGiven P data points to partition into S ≥ 2 disjoint groups, spectral clustering methods use a symmetric affinity matrix W of size P × P that encodes topological knowledge about the problem. Once W is available, it is normalized and its first S (P-dimensional) eigenvectors are computed. Then, forming a P × S matrix with these eigenvectors as columns, we cluster the P rows of this matrix as points in R^S using K-means (or a weighted version thereof). 
These clusters define the final partition [7, 1].\n\nWe prefer spectral clustering methods over other clustering algorithms such as K-means or mixtures of Gaussians estimated by the EM algorithm because we do not have any reason to expect the segments of interest in our problem to form convex shapes in the feature representation.\n\n4.2 Parameterized affinity matrices\n\nThe success of spectral methods for clustering depends heavily on the construction of the affinity matrix W. In [1], we have shown how learning can play a role in optimizing over affinity matrices. Our algorithm assumes that fully partitioned datasets are available, and uses these datasets as training data for optimizing the parameters of affinity matrices. As we have discussed in Section 2.2, such training data are easily obtained in the speech separation setting. It remains for us to describe how we parameterize the affinity matrices.\n\nFrom each of the features defined in Section 3, we define a basis affinity matrix Wj = Wj(θj), where θj is a (vector) parameter. We restrict ourselves to affinity matrices whose elements are between zero and one, and with unit diagonal. We distinguish between harmonic and non-harmonic features. For non-harmonic features, we use a radial basis function to define affinities. Thus, if fa is the value of the feature for data point a, we use a basis affinity matrix defined as W_ab = exp(−α_1 ||fa − fb||^{α_2}), where α_2 > 1.\n\nFor a harmonic feature, on the other hand, we need to take into account the strength of the feature: if fa is the value of the feature for data point a, with strength ya, we use W_ab = exp(−|g(ya, yb) + α_3|^{α_4} ||fa − fb||^2), where g(u, v) = (u e^{α_5 u} + v e^{α_5 v}) / (e^{α_5 u} + e^{α_5 v}) ranges from the minimum of u and v for α_5 = −∞ to their maximum for α_5 = +∞.\n\nGiven m basis matrices, we use the following parameterization of W: W = Σ_{k=1}^{K} W_1^{γ_{k1}} ⋯ W_m^{γ_{km}}, where the products and powers are taken pointwise. 
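This combination involves only pointwise operations, so it costs no more than a few elementwise passes over the basis matrices. The following sketch is our own illustration of a pointwise product-of-powers combination (the function name, toy matrices, and exponent values are ours, not taken from the paper):

```python
def combine_affinities(basis, gamma):
    """Pointwise combination W = sum_k prod_j basis[j] ** gamma[k][j].

    basis: list of m symmetric affinity matrices (P x P, entries in [0, 1]).
    gamma: K rows of m nonnegative exponents, one row per term (time scale).
    Products and powers are taken entry by entry, so each row of gamma acts
    like a soft conjunction of basis matrices and the outer sum over the K
    terms like a disjunction.
    """
    P = len(basis[0])
    W = [[0.0] * P for _ in range(P)]
    for exps in gamma:  # one term per time scale
        for a in range(P):
            for b in range(P):
                term = 1.0
                for Wj, g in zip(basis, exps):
                    term *= Wj[a][b] ** g  # pointwise power, then product
                W[a][b] += term
    return W

# Two 2x2 basis matrices combined with K = 2 terms.
W1 = [[1.0, 0.5], [0.5, 1.0]]
W2 = [[1.0, 0.2], [0.2, 1.0]]
W = combine_affinities([W1, W2], gamma=[[1.0, 0.0], [2.0, 1.0]])
# W[0][1] = 0.5**1 * 0.2**0 + 0.5**2 * 0.2**1 = 0.55
```

Note that an exponent row such as [1, 0] ignores the second basis matrix entirely (anything to the zeroth power is one), which is how learning the exponents can select which cues matter at each time scale.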
Intuitively, if we consider the values of affinity as soft boolean variables, taking the product of two affinity matrices is equivalent to considering the conjunction of two matrices, while taking the sum can be seen as their disjunction: our final affinity matrix can thus be seen as a disjunctive normal form. For our application to speech separation, we consider a sum of K = 3 matrices, one matrix for each time scale. This has the advantage of allowing different approximation schemes for each of the time scales, an issue we address in the following section.\n\n4.3 Approximations of affinity matrices\n\nThe affinity matrices that we consider are huge, of size at least 50,000 by 50,000. Thus a significant part of our effort has involved finding computationally efficient approximations of affinity matrices.\n\nLet us assume that the time-frequency plane is vectorized by stacking one time frame after the other. In this representation, the time scale of a basis affinity matrix W exerts an effect on the degree of \"bandedness\" of W. The matrix W is said to be band-diagonal with bandwidth B if, for all i, j, |i − j| ≥ B implies W_ij = 0. On a small time scale, W has a small bandwidth; for a medium time scale, the band is larger but still small compared to the total size of the matrix, while for large scale effects, the matrix W has no band structure. Note that the bandwidth B can be controlled by the coefficient of the radial basis function involving the time feature n.\n\nFor each of these three cases, we have designed a particular way of approximating the matrix, while ensuring that in each case the time and space requirements are linear in the number of time frames.\n\nSmall scale If the bandwidth B is very small, we use a simple direct sparse approximation. 
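A direct sparse treatment of a band matrix amounts to storing only its nonzero diagonals, which makes both storage and matrix-vector products linear in the matrix size. A minimal sketch, with our own storage convention (one list per retained super-diagonal; nothing here is from the authors' code):

```python
def banded_matvec(diagonals, B, x):
    """Multiply a symmetric band matrix by a vector in O(P * B) time.

    The matrix satisfies W[i][j] = 0 whenever |i - j| >= B, so it is fully
    described by diagonals[k][i] = W[i][i + k] for k = 0, ..., B - 1;
    symmetry supplies the sub-diagonals.
    """
    P = len(x)
    y = [0.0] * P
    for i in range(P):
        y[i] += diagonals[0][i] * x[i]  # main diagonal
        for k in range(1, B):
            j = i + k
            if j < P:
                y[i] += diagonals[k][i] * x[j]  # super-diagonal entry W[i][j]
                y[j] += diagonals[k][i] * x[i]  # mirrored entry W[j][i]
    return y

# Check against a dense product on a small tridiagonal example (B = 2).
P, B = 5, 2
diagonals = [[2.0] * P, [1.0] * P]  # main diagonal 2, first off-diagonal 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
W = [[2.0 if i == j else (1.0 if abs(i - j) == 1 else 0.0) for j in range(P)]
     for i in range(P)]
dense = [sum(W[i][j] * x[j] for j in range(P)) for i in range(P)]
assert banded_matvec(diagonals, B, x) == dense
```

The same idea applies per frame of the vectorized time-frequency plane, so that memory grows with P and B but never with P^2.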
The complexity of such an approximation grows linearly in the number of time frames.\n\nMedium and large scale We use a low-rank approximation of the matrix W similar in spirit to the algorithm of [13]. If we assume that the index set {1, ..., P} is partitioned randomly into I and J, and that A = W(I, I) and B = W(J, I), then W(I, J) = B^T (by symmetry) and we approximate C = W(J, J) by a linear combination of the columns in I, i.e., C = BE, where E ∈ R^{|I| × |J|}. The matrix E is chosen so that when the linear combination defined by E is applied to the columns in I, the error is minimal, which leads to an approximation of W(J, J) by B(A^2 + εI)^{−1} A B^T.\n\nIf G is the dimension of I, then the complexity of finding the approximation is O(G^3 + G^2 P), and the complexity of a matrix-vector product with the low-rank approximation is O(G^2 P). The storage requirement is O(GP). For large bandwidths, we use a constant G, i.e., we make the assumption that the rank that is required to encode a speaker is independent of the duration of the signals.\n\nFor mid-range interactions, we need an approximation whose rank grows with time, but whose complexity does not grow quadratically with time. This is done by using the banded structure of A and W. If η is the proportion of retained indices, then the complexity of storage and matrix-vector multiplication is O(η^3 B P).\n\n5 Experiments\n\nWe have trained our segmenter using data from four different speakers, with speech signals of duration 3 seconds. There were 28 parameters to estimate using our spectral learning algorithm. For testing, we have used mixtures from five speakers that were different from those in the training set.\n\nIn Figure 2, for two speakers from the testing set, we show on the left an example of the segmentation that is obtained when the two speech signals are known in advance (obtained as described in Section 2.2), and on the right, the segmentation that is output by our algorithm. 
Although some components of the \"black\" speaker are missing, the segmentation performance is good enough to obtain audible signals of reasonable quality. The speech samples for this example can be downloaded from www.cs.berkeley.edu/~fbach/speech/ . On this web site, there are additional examples of speech separation, with various speakers, in French and in English.\n\nAn important point is that our method does not require knowing the speakers in advance in order to demix successfully; rather, it only requires that the two speakers have distinct and sufficiently separated pitches most of the time (another, less crucial, condition is that neither pitch is too close to twice the other).\n\nAs mentioned earlier, there was a major computational challenge in applying spectral methods to single microphone speech separation. Using the techniques described in Section 4.3, the separation algorithm has linear running time and memory requirements and,\n\n\f\nFigure 2: (Left) Optimal segmentation for the spectrogram in Figure 1 (right), where the two speakers are \"black\" and \"grey;\" this segmentation is obtained from the known separated signals. (Right) The blind segmentation obtained with our algorithm.\n\ncoded in Matlab and C, it takes 30 minutes to separate 4 seconds of speech on a 1.8 GHz processor with 1 GB of RAM.\n\n6 Conclusions\n\nWe have presented an algorithm to perform blind source separation of speech signals from a single microphone. To do so, we have combined knowledge of physical and psychophysical properties of speech with learning methods. The former provide parameterized affinity matrices for spectral clustering, and the latter make use of our ability to generate segmented training data. 
The result is an optimized segmenter for spectrograms of speech mixtures. We have successfully demixed speech signals from two speakers using this approach.\n\nOur work thus far has been limited to the setting of ideal acoustics and equal-strength mixing of two speakers. There are several obvious extensions that warrant investigation. First, the mixing conditions should be weakened to allow some form of delay or echo. Second, there are multiple applications where speech has to be separated from a non-stationary noise; we believe that our method can be extended to this situation. Third, our framework is based on segmentation of the spectrogram and, as such, distortions are inevitable since this is a \"lossy\" formulation [6, 4]. We are currently working on post-processing methods that remove some of those distortions. Finally, while the running time and memory requirements of our algorithm are linear in the duration of the signal to be separated, the resource requirements remain a concern. We are currently working on further numerical techniques that we believe will bring our method significantly closer to real-time.\n\nAppendix A. Pitch estimation\n\nPitch estimation for one pitch In this paragraph, we assume that we are given one time slice s of the spectrogram magnitude, s ∈ R^M. The goal is to match s with a specific pattern. Since the speech signals are real, the spectrogram is symmetric and we can consider only M/2 samples.\n\nIf the signal is exactly periodic, then the spectrogram magnitude for that time frame is exactly a superposition of bumps at multiples of the fundamental frequency. The patterns we consider thus have the following parameters: a \"bump\" function u → b(u), a pitch ω ∈ [0, M/2] and a sequence of harmonics x_1, ..., x_H at frequencies ω_1 = ω, ..., ω_H = Hω, where H is the largest acceptable harmonic multiple, i.e., H = ⌊M/2ω⌋. 
The pattern s̃ = s̃(x, b, ω) is then built as a weighted sum of bumps.\n\nBy pattern matching, we mean finding the pattern s̃ closest to s in the L2-norm sense. We impose a constraint on the harmonic strengths (x_h), namely, that they are samples at ω_h of a function g with small second-derivative norm ∫_0^{M/2} |g''(ω)|^2 dω. The function g can be seen as the envelope of the signal and is related to the \"timbre\" of the speaker [8]. The explicit consideration of the envelope and its smoothness is necessary for two reasons: (a) it will provide a timbre feature helpful for separation, (b) it helps avoid pitch-halving, a traditional problem of pitch extractors [12].\n\nGiven b and ω, we minimize with respect to x the objective ||s − s̃(x)||^2 + λ ∫_0^{M/2} |g''(ω)|^2 dω, where x_h = g(ω_h). Since s̃(x) is a linear function of x, this is a spline smoothing problem, and the solution can be obtained in closed form with complexity O(H^3) [14].\n\nWe now have to search over b and ω, knowing that the harmonic strengths x can be found in closed form. We use exhaustive search on a grid for ω, while we take only a few bump shapes. The main reason for several bump shapes is to account for the only approximate periodicity of voiced speech. For further details and extensions, see [15].\n\nPitch estimation for several pitches If we are to estimate S pitches, we estimate them recursively, by removing the estimated harmonic signals. In this paper, we assume that the number of speakers and hence the maximum number of pitches is known. Note, however, that since all our pitch features are always used with their strengths, our separation method is relatively robust to situations where we try to look for too many pitches.\n\nAcknowledgments\n\nWe wish to acknowledge support from a grant from Intel Corporation, and a graduate fellowship to Francis Bach from Microsoft Research.\n\nReferences\n\n[1] F. R. Bach and M. I. Jordan. Learning spectral clustering. 
In NIPS 16, 2004.\n\n[2] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.\n\n[3] M. Zibulevsky, P. Kisilev, Y. Y. Zeevi, and B. A. Pearlmutter. Blind source separation via multinode sparse representation. In NIPS 14, 2002.\n\n[4] O. Yilmaz and S. Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Sig. Proc., 52(7):1830-1847, 2004.\n\n[5] S. T. Roweis. One microphone source separation. In NIPS 13, 2001.\n\n[6] G.-J. Jang and T.-W. Lee. A probabilistic approach to single channel source separation. In NIPS 15, 2003.\n\n[7] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22(8):888-905, 2000.\n\n[8] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.\n\n[9] G. J. Brown and M. P. Cooke. Computational auditory scene analysis. Computer Speech and Language, 8:297-333, 1994.\n\n[10] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.\n\n[11] M. Cooke and D. P. W. Ellis. The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35(3-4):141-177, 2001.\n\n[12] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley Press, 1999.\n\n[13] S. Belongie, C. Fowlkes, F. Chung, and J. Malik. Spectral partitioning with indefinite kernels using the Nyström extension. In ECCV, 2002.\n\n[14] G. Wahba. Spline Models for Observational Data. SIAM, 1990.\n\n[15] F. R. Bach and M. I. Jordan. Discriminative training of hidden Markov models for multiple pitch tracking. In ICASSP, 2005.\n\n\f\n", "award": [], "sourceid": 2572, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}