{"title": "A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1713, "abstract": "In this paper we present an algorithm for separating mixed sounds from a monophonic recording. Our approach makes use of training data which allows us to learn representations of the types of sounds that compose the mixture. In contrast to popular methods that attempt to extract compact generalizable models for each sound from training data, we employ the training data itself as a representation of the sources in the mixture. We show that mixtures of known sounds can be described as sparse combinations of the training data itself, and in doing so produce significantly better separation results as compared to similar systems based on compact statistical models.", "full_text": "A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds\n\nParis Smaragdis\nAdobe Systems Inc.\nparis@adobe.com\n\nMadhusudana Shashanka\nMars Inc.\nshashanka@alum.bu.edu\n\nBhiksha Raj\nCarnegie Mellon University\nbhiksha@cs.cmu.edu\n\nAbstract\n\nIn this paper we present an algorithm for separating mixed sounds from a monophonic recording. 
Our approach makes use of training data which allows us to learn representations of the types of sounds that compose the mixture. In contrast to popular methods that attempt to extract compact generalizable models for each sound from training data, we employ the training data itself as a representation of the sources in the mixture. We show that mixtures of known sounds can be described as sparse combinations of the training data itself, and in doing so produce significantly better separation results as compared to similar systems based on compact statistical models.\n\nKeywords: Example-Based Representation, Signal Separation, Sparse Models.\n\n1 Introduction\n\nThis paper deals with the problem of single-channel signal separation \u2013 separating out signals from individual sources in a mixed recording. A popular statistical approach has been to obtain compact characterizations of individual sources and employ them to identify and extract their counterpart components from mixture signals. Statistical characterizations may include codebooks [1], Gaussian mixture densities [2], HMMs [3], independent components [4, 5], sparse dictionaries [6], non-negative decompositions [7\u20139] and latent variable models [10, 11]. All of these methods attempt to derive a generalizable model that captures the salient characteristics of each source. Separation is achieved by abstracting components from the mixed signal that conform to the statistical characterizations of the individual sources. The key here is the specific statistical model employed \u2013 the more effectively it captures the specific characteristics of the signal sources, the better the separation that may be achieved.\n\nIn this paper we argue that, given any sufficiently large collection of data from a source, the best possible characterization of any data is, quite simply, the data themselves. 
This has been the basis of several example-based characterizations of a data source, such as nearest-neighbor, K-nearest-neighbor, and Parzen-window based models of source distributions. Here, we use the same idea to develop a monaural source-separation algorithm that directly uses samples from the training data to represent the sources in a mixture. Using this approach we sidestep the need for a model training step, and we can rely on a very flexible reconstruction process, especially as compared with previously used statistical models. Identifying the proper samples from the training data that best approximate a sample of the mixture is of course a hard combinatorial problem, which can be computationally demanding. We therefore formulate this as a sparse approximation problem and proceed to solve it with an efficient algorithm. We additionally show that this approach results in source estimates which are guaranteed to lie on the source manifold, as opposed to trained-basis approaches which can produce arbitrary outputs that will not necessarily be plausible source estimates.\n\nExperimental evaluations show that this approach results in separated signals that exhibit significantly higher performance metrics as compared to conceptually similar techniques which are based on various types of combinations of generalizable bases representing the sources.\n\n2 Proposed Method\n\nIn this section we cover the underlying statistical model we will use, introduce some of the complications that one might encounter when using it, and finally propose an algorithm that resolves these issues.\n\n2.1 The Basic Model\n\nGiven a magnitude spectrogram of a single source, each spectral frame is modeled as a histogram of repeated draws from a multinomial distribution over the frequency bins. At a given time frame t, consider a random process characterized by the probability Pt(f) of drawing frequency f in a given draw. 
The distribution Pt(f) is unknown, but what one can observe instead is the result of multiple draws from the process, that is, the observed spectral vector. The model assumes that Pt(f) is comprised of bases indexed by a latent variable z. The latent factors are represented by P(f|z). The probability of picking the z-th distribution in the t-th time frame can be represented by Pt(z). We use this model to learn the source-specific bases given by Ps(f|z) as done in [10, 11]. At this point this model is conceptually very similar to the non-negative factorization models in [8, 9].\n\nNow let the F \u00d7 T matrix V with entries vft represent the magnitude spectrogram of the mixture sound, and let vt represent time frame t (the t-th column vector of V). Each mixture spectral frame is again modeled as a histogram of repeated draws from the multinomial distributions corresponding to every source. The model for each mixture frame includes an additional latent variable s representing each source, and is given by\n\nPt(f) = \u03a3_s Pt(s) \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z|s),   (1)\n\nwhere Pt(f) is the probability of observing frequency f in time frame t in the mixture spectrogram, Ps(f|z) is the probability of frequency f in the z-th learned basis vector from source s, Pt(z|s) is the probability of observing the z-th basis vector of source s at time t, {z_s} represents the set of values the latent variable z can take for source s, and Pt(s) is the probability of observing source s at time t.\n\nWe can assume that for each source in the mixture we have an already trained model in the form of basis vectors Ps(f|z). These bases represent a dictionary of spectra that best describe each source. Armed with this knowledge we can decompose a new mixture of these known sources in terms of the contributions of the dictionaries for each source. 
To do so we can use the EM algorithm to estimate Pt(z|s) and Pt(s). The E-step is\n\nPt(s,z|f) = Pt(s) Pt(z|s) Ps(f|z) / [ \u03a3_s Pt(s) \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z|s) ],   (2)\n\nand the M-step updates are\n\nPt(z|s) = \u03a3_f vft Pt(s,z|f) / \u03a3_{f,z} vft Pt(s,z|f),   (3)\n\nPt(s) = \u03a3_f vft \u03a3_{z\u2208{z_s}} Pt(s,z|f) / [ \u03a3_f vft \u03a3_s \u03a3_{z\u2208{z_s}} Pt(s,z|f) ].   (4)\n\nThe reconstruction of the contribution of source s in the mixture can then be computed as\n\nv\u0302(s)_ft = [ Pt(s) \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z|s) / ( \u03a3_s Pt(s) \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z|s) ) ] vft.\n\n[Figure 1: scatter plot. Legend: Source A, Source B, Mixture, Convex Hull A, Convex Hull B, Simplex.]\n\nFigure 1: Illustration of the basic model. The triangles denote the position of basis functions for two source classes. The square is an instance of a mixture of the two sources. The mixture point is not within the convex hull which covers either source, but it is within the convex hull defined by all the bases combined.\n\nThese reconstructions will approximate the magnitude spectrogram of each source in the mixture. Once we obtain these reconstructions we can use them to modulate the original phase spectrogram of the mixture and obtain the time-series representation of the sources.\n\nLet us now pursue a brief pictorial understanding of this algorithm, which will help us introduce the concepts in the next section. Each basis vector and the mixture input will lie in an F \u2212 1 dimensional simplex (due to the fact that these quantities are normalized to sum to unity). Each source's basis set will define a convex hull within which any point can be approximated using these bases. Assuming that the training data is accurate, all potential inputs from that source should lie in that area. The union of all the source bases will define a larger space inside which a mixture input will lie. Any mixture point can then be approximated as a weighted sum of multiple bases from both sources. 
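As a concrete illustration of the updates in equations (2)-(4), here is a minimal NumPy sketch of the supervised separation step with fixed, pre-trained bases. The paper's experiments were run in MATLAB; the dimensions, random data, and variable names below are ours and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, Z, S = 8, 5, 3, 2            # freq bins, frames, bases per source, sources

# Known (pre-trained) per-source dictionaries P_s(f|z); columns sum to one
B = [rng.random((F, Z)) for _ in range(S)]
B = [b / b.sum(axis=0) for b in B]

V = rng.random((F, T))             # magnitude spectrogram of the mixture, v_ft

Pzs = np.full((S, Z, T), 1.0 / Z)  # P_t(z|s)
Ps = np.full((S, T), 1.0 / S)      # P_t(s)

for _ in range(100):
    per_src = np.stack([B[s] @ Pzs[s] for s in range(S)])    # sum_z P_s(f|z) P_t(z|s)
    Pf = np.einsum('st,sft->ft', Ps, per_src) + 1e-12        # mixture model P_t(f)
    for s in range(S):
        # E-step, eq. (2): posterior P_t(s,z|f) for every basis of source s
        post = Ps[s] * B[s][:, :, None] * Pzs[s][None] / Pf[:, None, :]
        w = (V[:, None, :] * post).sum(axis=0)               # sum_f v_ft P_t(s,z|f)
        Pzs[s] = w / (w.sum(axis=0) + 1e-12)                 # M-step, eq. (3)
        Ps[s] = w.sum(axis=0)                                # numerator of eq. (4)
    Ps /= Ps.sum(axis=0, keepdims=True)                      # M-step, eq. (4)

# Per-source reconstructions: Wiener-style masks applied to the mixture
per_src = np.stack([B[s] @ Pzs[s] for s in range(S)])
num = Ps[:, None, :] * per_src
Vhat = num / num.sum(axis=0, keepdims=True) * V              # (S, F, T)
```

Because the per-source reconstructions are masks applied to the mixture, they add back up to the mixture spectrogram by construction, at every iterate.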
For visualization of these concepts for F = 3, see figure 1.\n\n2.2 Using Training Data Directly as a Dictionary\n\nIn this paper, we would like to explain the mixture frame from the training spectral frames instead of using a smaller set of learned bases. There are two rationales behind this decision. The first is that the resulting large dictionary provides a better description of the sources, as opposed to the less expressive learned-basis models. As we show later on, this holds even for learned-basis models with dictionaries as large as the proposed method's. The secondary rationale behind this operation is based on the observation that the points defined by the convex hull of a source's model do not necessarily all fall on that source's manifold. To visualize this problem consider the plots in figure 2. In both of these plots the sources exhibit a clear structure. In the left plot both sources appear in a circular pattern, and in the right plot in a spiral form. As shown in [12], learning a set of bases that explains these sources results in defining a convex hull that surrounds the training data. Under this model potential source estimates can now lie anywhere inside these hulls. Using trained-basis models, if we decompose the mixture points in these figures we obtain two source estimates which do not lie on the same manifold as the original sources. Although the input was adequately approximated, there is no guarantee that the extracted sources are indeed appropriate outcomes for their sound class.\n\nIn order to address this problem, and to also provide a richer dictionary for the source reconstructions, we will make direct use of the training data in order to explain the mixture, and bypass the basis representation as an abstraction. To do so we will use each frame of the spectrograms of the training sequences as the bases Ps(f|z). 
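In code, turning training frames into such a dictionary is just a column normalization; a small sketch under assumed names (the sizes are arbitrary, and the paper's own implementation was in MATLAB):

```python
import numpy as np

def frames_to_bases(W, eps=1e-12):
    # Each column (spectral frame) of a training magnitude spectrogram
    # becomes one dictionary entry P_s(f|z), normalized to sum to one over f.
    return W / (W.sum(axis=0, keepdims=True) + eps)

W = np.abs(np.random.default_rng(0).standard_normal((513, 100)))  # toy spectrogram
B = frames_to_bases(W)
```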
More specifically, let W(s) be the F \u00d7 T(s) training spectrogram from source s and let w(s)_t represent time frame t (the t-th column of that spectrogram). In this case, the latent variable z for source s takes T(s) values, and the z-th basis function is given by the (normalized) z-th column vector of W(s).\n\n[Figure 2: two scatter plots. Legend: Source A, Source B, Mixture, Convex Hull A, Convex Hull B, Estimate for A, Estimate for B, Approximation of mixture.]\n\nFigure 2: Two examples where the separation process using trained bases provides poor source estimates. In both plots the training data for each source are denoted by \u25b3 and \u25bd, and the mixture sample by a square. The learned bases of each source are the vertices of the two dashed convex hulls that enclose each class. The source estimates and the approximation of the mixture are denoted by \u00d7, + and a circle. In the left case the two sources lie on two overlapping circular areas; the source estimates however lie outside these areas. On the right, the two sources form two intertwined spirals. The recovered sources lie very close to the competing source's area, thereby providing a highly inappropriate decomposition. Although the mixture was well approximated in both cases, the estimated sources were poor representations of their classes.\n\nWith the above model we would ideally want to use one dictionary element per source at any point in time. Doing so will ensure that the outputs lie on the source manifold, and also offset any issues of potential overcompleteness. One way to ensure this is to perform a reconstruction such that we only use one element of each source at any time, much akin to a nearest-neighbor model, albeit in an additive setting. 
This kind of search can be computationally very demanding, so we instead treat this as a sparse approximation problem. The intuition is that at any given point in time, the mixture frame is explained by very few active elements from the training data. In other words, we need the mixture weight distributions and the speaker priors to be sparse at every time instant.\n\nWe use the concept of the entropic prior introduced in [13] to enforce sparsity. Given a probability distribution \u03b8, the entropic prior is defined as\n\nPe(\u03b8) = e^{\u2212H(\u03b8)},   (5)\n\nwhere H(\u03b8) = \u2212\u03a3_i \u03b8_i log \u03b8_i is the entropy of the distribution. A sparse representation, by definition, has few \u201cactive\u201d elements, which means that the representation has low entropy. Hence, imposing this prior during maximum a posteriori estimation is a way to minimize entropy during estimation, which will result in a sparse \u03b8 distribution. We would like to minimize the entropies of both the speaker-dependent mixture weight distributions (given by Pt(z|s)) and the source priors (given by Pt(s)) at every frame. In other words, we want to minimize H(z|s) and H(s) at every time frame. However, we know from information theory that\n\nH(z,s) = H(z|s) + H(s).\n\nThus, reducing the entropy of the joint distribution Pt(z,s) is equivalent to reducing the conditional entropy of the source-dependent mixture weights and the entropy of the source priors.\n\nSince the dictionary is already known and is given by the normalized spectral frames from the source training spectrograms, the parameter to be estimated is Pt(z,s). The model, written in terms of this parameter, is given by\n\nPt(f) = \u03a3_s \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z,s),\n\nwhere we have modified equation (1) by representing Pt(s)Pt(z|s) as Pt(z,s). We use the Expectation-Maximization algorithm to derive the update equations. Let all parameters to be estimated be represented by \u039b. 
We impose an entropic prior distribution on Pt(z,s), given by\n\nlog P(\u039b) = \u03b2 \u03a3_t \u03a3_s \u03a3_{z\u2208{z_s}} Pt(z,s) log Pt(z,s),\n\nwhere \u03b2 is a parameter indicating the extent of sparsity desired. The E-step is given by\n\nPt(z,s|f) = Pt(z,s) Ps(f|z) / [ \u03a3_s \u03a3_{z\u2208{z_s}} Pt(z,s) Ps(f|z) ]\n\nand the M-step by\n\n\u03c9 / Pt(z,s) + \u03b2 + \u03b2 log Pt(z,s) + \u03bbt = 0,   (6)\n\nwhere we have let \u03c9 represent \u03a3_f vft Pt(z,s|f) and \u03bbt is the Lagrange multiplier. The above M-step equation is a system of simultaneous transcendental equations for Pt(z,s). Brand [13] proposes a method to solve such problems using the Lambert W function [14]. It can be shown that Pt(z,s) can be estimated as\n\nP\u0302t(z,s) = (\u2212\u03c9/\u03b2) / W(\u2212\u03c9 e^{1+\u03bbt/\u03b2} / \u03b2).   (7)\n\nEquations (6) and (7) form a set of fixed point iterations that typically converge in 2-5 iterations [13].\n\nOnce Pt(z,s) is estimated, the reconstruction of source s can be computed as\n\nv\u0302(s)_ft = [ \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z,s) / ( \u03a3_s \u03a3_{z\u2208{z_s}} Ps(f|z) Pt(z,s) ) ] vft.\n\n[Figure 3: two scatter plots. Legend: Source A, Source B, Mixture, Estimate for A, Estimate for B, Approximation of mixture.]\n\nFigure 3: Using a sparse reconstruction on the data in figure 2. Note how in contrast to that figure the source estimates are now identified as training data points, and are thus plausible solutions. The approximation of the mixture is the point on the line connecting the two source estimates that is nearest to the actual mixture input. Note that the proper solution is the one that results in such a line that is as close as possible to the mixture point, and not one that is defined by two training points close to the mixture.\n\nNow let us consider how this problem resolves the issues presented in figure 2. 
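For concreteness, equations (6) and (7) can be solved per time frame as sketched below. We implement the lower real branch W_{-1} of the Lambert W function with Newton's method (for \u03b2 > 0 this branch yields the low-entropy root) and, instead of the fixed-point iteration of [13], find the Lagrange multiplier by bisection, which is valid when \u03b2 is well below the largest expected count. All names, tolerances, and the bisection shortcut are ours, not the paper's:

```python
import numpy as np

def lambertw_m1(z, iters=60):
    # Lower real branch W_{-1}(z) for z in (-1/e, 0), via Newton on w*exp(w) = z.
    z = np.asarray(z, dtype=float)
    w = np.log(-z) - np.log(-np.log(-z))        # asymptotic guess, good near 0-
    for _ in range(iters):
        ew = np.exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
        w = np.minimum(w, -1.0 - 1e-12)         # stay on the lower branch
    return w

def sparse_mstep(omega, beta):
    # Solve omega_i/theta_i + beta + beta*log(theta_i) + lam = 0, sum(theta) = 1,
    # i.e. theta_i = (-omega_i/beta) / W_{-1}(-(omega_i/beta)*exp(1 + lam/beta)),
    # by bisecting on the multiplier lam (sum(theta) grows monotonically with lam).
    omega = np.asarray(omega, dtype=float) + 1e-300
    def theta(lam):
        z = -(omega / beta) * np.exp(1.0 + lam / beta)
        return (-omega / beta) / lambertw_m1(z)
    hi = beta * (np.log(beta / omega.max()) - 2.0) - 0.05   # keeps every z > -1/e
    lo = hi - 1.0
    while theta(lo).sum() > 1.0:                # grow the bracket downwards
        lo -= 2.0 * (hi - lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if theta(mid).sum() < 1.0 else (lo, mid)
    return theta(0.5 * (lo + hi))

beta = 0.5
omega = np.array([6.0, 3.0, 1.0])               # expected counts from the E-step
flat = omega / omega.sum()                      # the beta = 0 (no prior) update
sparse = sparse_mstep(omega, beta)
```

The returned distribution satisfies the stationarity condition of equation (6) and has lower entropy than the \u03b2 = 0 solution \u03c9/\u03a3\u03c9, i.e. its mass is pushed toward the dominant dictionary elements.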
In figure 3 we show the results obtained using this approach on the same data. The sparsity parameter \u03b2 was set to 0.1. In both plots we see that the source reconstructions lie on a training point, thereby being plausible source estimates. The approximation of the mixture is not as exact as before, since now it has to lie on the line connecting the two active source elements. This is not however an issue of concern, since in practice the approximation is always good enough, and the guarantee of a plausible source estimate is more valuable than an exact approximation of the mixture.\n\nAlternative means to strive towards similar results would be to make use of priors such as in [15, 16]. In these approaches the priors are imposed on the mixture weights and thus are not as effective for this particular task, since they still suffer from the symptoms of learned-basis models. This was verified through cursory simulations, which also revealed an additional computational complexity penalty against such models.\n\n[Figure 4: four panels. Top row: input waveforms (amplitude vs. time frame index) titled Input source 1 and Input source 2. Bottom row: training sample index vs. time frame index, titled Training sample weights 1 and Training sample weights 2.]\n\nFigure 4: An oracle case where we fit training data from two speakers on the mixture of that data. The top plots show the input waveforms, and the bottom plots show the estimated weights multiplied with the source priors. As expected, the weights exhibit two diagonal traces which imply that the algorithm we used has fit the data appropriately.\n\n3 Experimental Results\n\nIn this section we present the results of experiments done with real speech data. 
All of these experiments were performed on data from the TIMIT speech database, on 0 dB male/female mixtures. The sources were sampled at 16 kHz; we used 64 ms windows for the spectrogram computation and an overlap of 32 ms. Before the FFT computation, the input was tapered using a square-root Hann window. The training data was around 25 sec worth of speech for each speaker, and the testing mixture was about 3 sec long. We evaluated the separation performance using the metrics provided in [17]. These metrics include the Signal to Interference Ratio (SIR), the Signal to Distortion Ratio (SDR), and the Signal to Artifacts Ratio (SAR). The first is a measure of how well we suppress the interfering speaker, whereas the other two provide us with a sense of how much the extracted source is corrupted due to the separation process. All of these are measured in dB, and the higher they are, the better the performance is deemed to be.\n\nIn the following sections we first present some \u201coracle tests\u201d that validate that this algorithm is indeed performing as expected, and we then proceed to more realistic testing. Finally, we show the performance impact of pruning the training data in order to speed up computation time.\n\n3.1 Oracle tests\n\nIn order to verify that this approach works we go through a few oracle experiments. In these tests we include the actual solutions as training data and we make sure that the answers are exactly what we would expect to find. The first experiment we perform is on a mixture for which the training data includes its isolated constituent sentences. In this experiment we would expect to see two dictionary components active at each point in time, one from each speaker's dictionary, and both of these progressing through the component index linearly through time. As shown in figure 4, we observe exactly that behavior. 
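The analysis/synthesis chain described above can be sketched as follows. We assume 1024-sample frames (64 ms at 16 kHz) with a 512-sample hop and a periodic square-root Hann window, so that the analysis and synthesis windows multiply to a constant-overlap-add Hann; the separated magnitudes modulate the mixture phase on resynthesis, as described in section 2.1:

```python
import numpy as np

fs, win_len, hop = 16000, 1024, 512     # 64 ms frames, 32 ms hop at 16 kHz
n = np.arange(win_len)
win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / win_len))  # periodic sqrt-Hann

def stft(x):
    frames = [win * x[i:i + win_len] for i in range(0, len(x) - win_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T          # (freq, time)

def istft(X, out_len):
    y = np.zeros(out_len)
    for t, frame in enumerate(X.T):                         # overlap-add synthesis
        i = t * hop
        y[i:i + win_len] += win * np.fft.irfft(frame, n=win_len)
    return y

x = np.random.default_rng(1).standard_normal(fs)            # 1 s test signal
X = stft(x)
V, phase = np.abs(X), np.angle(X)       # magnitudes feed the separator
vhat = V                                 # placeholder for a separated magnitude
y = istft(vhat * np.exp(1j * phase), len(x))
```

With this window choice the chain is perfectly invertible away from the signal edges, so any degradation in the outputs comes from the estimated magnitudes, not the front-end.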
This test provides a sanity check which verifies that, given an answer, this algorithm can properly identify it.\n\nA more comprehensive oracle test is shown in figure 5. In this experiment, the training data were again the same as the testing data. We averaged the results from 10 runs using different combinations of speakers, varying sparsity parameters and numbers of bases. The sparsity parameter \u03b2 was checked for various values from 0 to 0.8, and we used trained-basis models with 5, 10, 20, 40, 80, 160 and 320 bases, as well as the proposed scenario where all the training data is used as a dictionary. The primary observation from this experiment is that the more bases we use, the better the results get. We also see that increasing the sparsity parameter yields a modest improvement in most cases.\n\n[Figure 5: three surface plots (SDR, SIR, SAR in dB) over the number of bases (5 to 320, plus \u201cTrain\u201d) and the sparsity parameter \u03b2 (0 to 0.8).]\n\nFigure 5: Average separation performance metrics for oracle cases, as dependent on the choice of different numbers of elements in the speaker's dictionary, and different choices of the entropic prior parameter \u03b2. The left plot shows the SDR, the middle plot the SIR, and the right plot the SAR, all in dB. The basis row labeled as \u201cTrain\u201d is the case where we use all the training data as a basis set.\n\n[Figure 6: three surface plots (SDR, SIR, SAR in dB) over the number of bases (5 to 320, plus \u201cTrain\u201d) and the sparsity parameter \u03b2 (0 to 0.8).]\n\nFigure 6: Average separation performance metrics for real-world cases, as dependent on the choice of different numbers of elements in the speaker's dictionary, and different choices of the entropic prior parameter \u03b2. The left plot shows the SDR, the middle plot the SIR, and the right plot the SAR, all in dB. Sparsely using all of the training data clearly outperforms low-rank models by a significant margin on all metrics.\n\n3.2 Results on Realistic Situations\n\nLet us now consider the more realistic case where the mixture data is different from the training set. In the following simulation we repeat the previous experiment, but in this case there are no common elements between the training and testing data. The input mixture has to be reconstructed using approximate samples. The results are now very different in nature. We do not obtain numbers as high as in the oracle case, but we also see a stronger trend in favor of sparsity and the use of all the training data as a dictionary. The results are shown in figure 6. We can clearly see that on all metrics using all the training data significantly outperforms trained-basis models. More importantly, we see that this is not because we have a larger dictionary. 
For trained bases we see a performance peak at around 80 bases, but then we observe a deterioration in performance as we use a larger dictionary. Using the actual training data results in a significant boost though. Due to the high dimensionality of the data the effect of sparsity is a little more subtle, but we still see a helpful boost, especially for the SIR, which is the most important of the performance measures. We see some decrease in the SAR, which is expected since the reconstructions are made using elements that look like the remaining data, and are not made to approximate the actual input mixture. This does not mean that the extracted sources are distorted and of poor quality, but rather that they don't match the original inputs exactly. The use of sparsity ensures that the output is a plausible speech signal devoid of artifacts like distortion and musical noise. The effects of sparsity alone in the proposed case are shown separately in figure 7.\n\n[Figure 7: bar plot of SDR, SIR and SAR (dB) for sparsity parameter \u03b2 values 0, 0.01, 0.1, 0.2, 0.3, 0.5 and 0.8.]\n\nFigure 7: A slice of the results in figure 6 in which we only show the case where we use all the training data as a dictionary. The horizontal axis represents various values of the sparsity parameter \u03b2.\n\n[Figure 8: bar plot of SDR, SIR and SAR (dB) when discarding 0% to 95% of the lowest-energy training frames.]\n\nFigure 8: Effect of discarding low energy training frames. The horizontal axis denotes the percentage of training frames that have been discarded. These are averaged results using a sparsity parameter \u03b2 = 0.1.\n\nThe unfortunate side effect of the proposed method is that we need to use a dictionary which can be substantially larger than otherwise. 
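One simple mitigation is to drop low-energy (mostly silent) training frames before building the dictionary. In code this reduces to a top-k selection on frame energies; a small sketch with assumed names and toy sizes:

```python
import numpy as np

def prune_frames(W, keep_fraction=0.3):
    # Keep only the highest-energy frames (columns) of a training
    # magnitude spectrogram W, preserving their time order.
    energy = W.sum(axis=0)
    k = max(1, int(round(keep_fraction * W.shape[1])))
    keep = np.sort(np.argsort(energy)[-k:])
    return W[:, keep]

W = np.abs(np.random.default_rng(0).standard_normal((5, 10)))
P = prune_frames(W, 0.3)                 # the 3 loudest of 10 frames survive
```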
In order to address this concern we show that the size of the training data can easily be pruned down to a size comparable to trained-basis models and still outperform them. Since sound signals, especially speech, tend to have a considerable amount of short-term pauses and regions of silence, we can use an energy threshold in order to select the loudest frames of the training spectrogram as bases. In figure 8 we show how the separation performance metrics are influenced as we increasingly remove bases which lie under various energy percentiles. It is clear that even after discarding up to at least 70% of the lowest energy training frames the performance is still approximately the same. After that we see some degradation, since we start discarding significant parts of the training data. Regardless, this scheme outperforms trained-basis models of equivalent size. For the 80% percentile case, a trained-basis model with a dictionary of the same size results in roughly half the values on all performance metrics, a very significant handicap for the same amount of computational and memory requirements.\n\nThe experiments in this paper were all conducted in MATLAB on an average modern desktop machine. Overall computations for a single mixture took roughly 4 sec when not using the sparsity prior, 14 sec when using the sparsity prior (primarily due to the slow computation of Lambert's function), and dropped down to 5 sec when using the 30% highest energy frames from the training data.\n\n4 Conclusion\n\nIn this paper we present a new approach to solving the monophonic source separation problem. The contribution of this paper lies primarily in the choice of using all the training data as opposed to a trained-basis model. In order to do so we present a sparse learning algorithm which can efficiently solve this problem, and which also guarantees that the returned source estimates are plausible given the training data. 
We provide experiments that show how this approach is influenced by the use of varying sparsity constraints and training data selection. Finally, we demonstrate how this approach can generate significantly superior results as compared to trained-basis methods.\n\nReferences\n\n[1] S. T. Roweis. One microphone source separation. In Advances in Neural Information Processing Systems, 2001.\n\n[2] A. M. Reddy and B. Raj. Soft mask methods for single-channel speaker separation. IEEE Transactions on Audio, Speech, and Language Processing, 15(6), Aug. 2007.\n\n[3] T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, and R. Gopinath. Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In International Conference on Spoken Language Processing (INTERSPEECH), pp. 97-100, 2006.\n\n[4] M. A. Casey and A. Westner. Separation of mixed audio sources by independent subspace analysis. In Proceedings of the International Computer Music Conference, 2000.\n\n[5] G.-J. Jang and T.-W. Lee. A maximum likelihood approach to single-channel source separation. Journal of Machine Learning Research, 4:1365-1392, 2003.\n\n[6] B. Pearlmutter and M. Zibulevsky. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13:863-882, 2001.\n\n[7] L. Benaroya, L. M. Donagh, F. Bimbot, and R. Gribonval. Non-negative sparse representation for Wiener based source separation with a single sensor. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 613-616, 2003.\n\n[8] M. N. Schmidt and R. K. Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing (INTERSPEECH), 2006.\n\n[9] T. Virtanen. Sound source separation using sparse coding with temporal continuity objective. In International Computer Music Conference (ICMC), 2003.\n\n[10] P. Smaragdis, B. Raj, and M. V. Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In Proceedings of ICA 2007, London, UK, September 2007.\n\n[11] B. Raj and P. Smaragdis. Latent variable decomposition of spectrograms for single channel speaker separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, October 2005.\n\n[12] M. V. Shashanka, B. Raj, and P. Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, December 2007.\n\n[13] M. E. Brand. Pattern discovery via entropy minimization. In AISTATS, 1999.\n\n[14] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 1996.\n\n[15] N. Bouguila and D. Ziou. Using unsupervised learning of a finite Dirichlet mixture model to improve pattern recognition applications. Pattern Recognition Letters, 26(12), September 2005.\n\n[16] A. Hinneburg, H.-H. Gabriel, and A. Gohr. Bayesian folding-in with Dirichlet kernels for PLSI. In Seventh IEEE International Conference on Data Mining, Oct. 2007.\n\n[17] C. F\u00e9votte, R. Gribonval, and E. Vincent. BSS EVAL toolbox user guide. IRISA Technical Report 1706, Rennes, France, April 2005.\n", "award": [], "sourceid": 1113, "authors": [{"given_name": "Paris", "family_name": "Smaragdis", "institution": null}, {"given_name": "Madhusudana", "family_name": "Shashanka", "institution": null}, {"given_name": "Bhiksha", "family_name": "Raj", "institution": null}]}