{"title": "Low-Rank Time-Frequency Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 3563, "page_last": 3571, "abstract": "Many single-channel signal decomposition techniques rely on a low-rank factorization of a time-frequency transform. In particular, nonnegative matrix factorization (NMF) of the spectrogram -- the (power) magnitude of the short-time Fourier transform (STFT) -- has been considered in many audio applications. In this setting, NMF with the Itakura-Saito divergence was shown to underlie a generative Gaussian composite model (GCM) of the STFT, a step forward from more empirical approaches based on ad-hoc transform and divergence specifications. Still, the GCM is not yet a generative model of the raw signal itself, but only of its STFT. The work presented in this paper fills this last gap by proposing a novel signal synthesis model with low-rank time-frequency structure. In particular, our new approach opens doors to multi-resolution representations that were not possible in the traditional NMF setting. We describe two expectation-maximization algorithms for estimation in the new model and report audio signal processing results with music decomposition and speech enhancement.", "full_text": "Low-Rank Time-Frequency Synthesis

Cédric Févotte
Laboratoire Lagrange
(CNRS, OCA & Université de Nice)
Nice, France
cfevotte@unice.fr

Matthieu Kowalski*
Laboratoire des Signaux et Systèmes
(CNRS, Supélec & Université Paris-Sud)
Gif-sur-Yvette, France
kowalski@lss.supelec.fr

Abstract

Many single-channel signal decomposition techniques rely on a low-rank factorization of a time-frequency transform. In particular, nonnegative matrix factorization (NMF) of the spectrogram -- the (power) magnitude of the short-time Fourier transform (STFT) -- has been considered in many audio applications. 
In this setting, NMF with the Itakura-Saito divergence was shown to underlie a generative Gaussian composite model (GCM) of the STFT, a step forward from more empirical approaches based on ad-hoc transform and divergence specifications. Still, the GCM is not yet a generative model of the raw signal itself, but only of its STFT. The work presented in this paper fills this last gap by proposing a novel signal synthesis model with low-rank time-frequency structure. In particular, our new approach opens doors to multi-resolution representations that were not possible in the traditional NMF setting. We describe two expectation-maximization algorithms for estimation in the new model and report audio signal processing results with music decomposition and speech enhancement.

1 Introduction

Matrix factorization methods currently enjoy a large popularity in machine learning and signal processing. In the latter field, the input data is usually a time-frequency transform of some original time series x(t). For example, in the audio setting, nonnegative matrix factorization (NMF) is commonly used to decompose magnitude or power spectrograms into elementary components [1]; the spectrogram, say S, is approximately factorized into WH, where W is the dictionary matrix collecting spectral patterns in its columns and H is the activation matrix. The approximation WH is generally of lower rank than S, unless additional constraints are imposed on the factors.

NMF was originally designed in a deterministic setting [2]: a measure of fit between S and WH is minimized with respect to (w.r.t.) W and H. Choosing the "right" measure for a specific type of data and task is not straightforward. Furthermore, NMF-based spectral decompositions often arbitrarily discard phase information: only the magnitude of the complex-valued short-time Fourier transform (STFT) is considered. 
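For reference, the kind of spectrogram factorization discussed above can be sketched with the standard multiplicative updates for the Itakura-Saito divergence. This is a minimal illustration on random data, not the paper's implementation; the exponent-1 updates used here are the common heuristic, and the divergence is observed to decrease in practice:

```python
import numpy as np

rng = np.random.default_rng(4)
F, N, K = 12, 30, 4
S = rng.random((F, N)) + 1e-3                     # stand-in power spectrogram
W = rng.random((F, K)) + 0.1                      # dictionary (spectral patterns)
H = rng.random((K, N)) + 0.1                      # activations

def d_is(S, V):
    # Itakura-Saito divergence D_IS(S | V), summed over all entries
    return np.sum(S / V - np.log(S / V) - 1)

losses = []
for _ in range(50):
    V = W @ H
    W *= ((S / V**2) @ H.T) / ((1 / V) @ H.T)     # multiplicative update of W
    V = W @ H
    H *= (W.T @ (S / V**2)) / (W.T @ (1 / V))     # multiplicative update of H
    losses.append(d_is(S, W @ H))

assert losses[-1] <= losses[0]                    # the fit improves over iterations
```

The factors stay nonnegative by construction, since each update multiplies the current value by a ratio of nonnegative terms.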
To remedy these limitations, a generative probabilistic latent factor model of the STFT was proposed in [3]. Denoting by {y_{fn}} the complex-valued coefficients of the STFT of x(t), where f and n index frequencies and time frames, respectively, the so-called Gaussian Composite Model (GCM) introduced in [3] writes simply

y_{fn} \sim N_c(0, [WH]_{fn}),   (1)

where N_c refers to the circular complex-valued normal distribution.^1 As shown by Eq. (1), in the GCM the STFT is assumed centered (reflecting an equivalent assumption in the time domain, which is valid for many signals such as audio signals) and its variance has a low-rank structure. Under these assumptions, the negative log-likelihood -\log p(Y|W, H) of the STFT matrix Y given parameters W and H is equal, up to a constant, to the Itakura-Saito (IS) divergence D_{IS}(S|WH) between the power spectrogram S = |Y|^2 and WH [3].

The GCM is a step forward from traditional NMF approaches that fail to provide a valid generative model of the STFT itself -- other approaches have only considered probabilistic models of the magnitude spectrogram under Poisson or multinomial assumptions, see [1] for a review. Still, the GCM is not yet a generative model of the raw signal x(t) itself, but of its STFT. The work reported in this paper fills this last gap. It describes a novel signal synthesis model with low-rank time-frequency structure. 

* Authorship based on alphabetical order to reflect an equal contribution.
^1 A random variable x has distribution N_c(x|\mu, \lambda) = (\pi\lambda)^{-1} \exp(-|x - \mu|^2/\lambda) if and only if its real and imaginary parts are independent and with distribution N(Re(\mu), \lambda/2) and N(Im(\mu), \lambda/2), respectively.
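The stated equivalence between the GCM likelihood and IS-NMF can be checked numerically: the negative log-likelihood and the IS divergence differ by a term that depends only on the data, not on the factorization. A minimal numpy sketch, with random data and hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 8, 10, 3

# Low-rank variance model V = WH and a matching random complex "STFT" Y.
W, H = rng.random((F, K)), rng.random((K, N))
V = W @ H
Y = np.sqrt(V / 2) * (rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N)))
S = np.abs(Y) ** 2                    # power spectrogram

def neg_log_lik(S, V):
    # -log p(Y | WH) for y_fn ~ Nc(0, [WH]_fn)
    return np.sum(S / V + np.log(np.pi * V))

def d_is(S, V):
    # Itakura-Saito divergence D_IS(S | V)
    return np.sum(S / V - np.log(S / V) - 1)

# The difference is a constant independent of the candidate variances V:
const = neg_log_lik(S, V) - d_is(S, V)
V2 = rng.random((F, K)) @ rng.random((K, N))   # a different candidate factorization
assert np.isclose(neg_log_lik(S, V2) - d_is(S, V2), const)
```

So minimizing the negative log-likelihood over (W, H) and minimizing D_IS(S|WH) select the same factorization.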
Besides improved accuracy of representation thanks to modeling at the lowest level, our new approach opens doors to multi-resolution representations that were not possible in the traditional NMF setting. Because of the synthesis approach, we may represent the signal as a sum of layers with their own time resolution, and their own latent low-rank structure.

The paper is organized as follows. Section 2 introduces the new low-rank time-frequency synthesis (LRTFS) model. Section 3 addresses estimation in LRTFS. We present two maximum likelihood estimation approaches with companion EM algorithms. Section 4 describes how LRTFS can be adapted to multiple-resolution representations. Section 5 reports experiments with audio applications, namely music decomposition and speech enhancement. Section 6 concludes.

2 The LRTFS model

2.1 Generative model

The LRTFS model is defined by the following set of equations. For t = 1, ..., T, f = 1, ..., F, n = 1, ..., N:

x(t) = \sum_{fn} \alpha_{fn} \phi_{fn}(t) + e(t)   (2)
\alpha_{fn} \sim N_c(0, [WH]_{fn})   (3)
e(t) \sim N_c(0, \lambda)   (4)

For generality and simplicity of presentation, all the variables in Eq. (2) are assumed complex-valued. In the real case, the Hermitian symmetry of the time-frequency (t-f) frame can be exploited: one only needs to consider the atoms relative to positive frequencies, generate the corresponding complex signal and then generate the real signal satisfying the Hermitian symmetry on the coefficients. W and H are nonnegative matrices of dimensions F x K and K x N, respectively.^2 For a fixed t-f point (f, n), the signal \phi_{fn} = {\phi_{fn}(t)}_t, referred to as an atom, is an element of an arbitrary t-f basis, for example a Gabor frame (a collection of tapered oscillating functions with short temporal support). e(t) is an independently and identically distributed (i.i.d.) Gaussian residual term. 
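Sampling from the generative model of Eqs. (2)-(4) is direct once a synthesis dictionary is fixed. In the sketch below, a random complex matrix stands in for the Gabor frame (its columns would be the atoms \phi_{fn} in the paper); all sizes are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, F, N, K = 64, 16, 8, 2
M = F * N

# Stand-in for a t-f synthesis frame: a random complex T x M matrix
# (in the paper, the columns would be Gabor atoms phi_fn).
Phi = (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))) / np.sqrt(2 * T)

# Low-rank variances v_fn = [WH]_fn (Eq. 3), vectorized to length M.
W, H = rng.random((F, K)), rng.random((K, N))
v = (W @ H).flatten()

def sample_nc(var, rng):
    # Circular complex normal Nc(0, var): real and imaginary parts are N(0, var/2).
    return np.sqrt(var / 2) * (rng.standard_normal(var.shape) + 1j * rng.standard_normal(var.shape))

alpha = sample_nc(v, rng)               # synthesis coefficients (Eq. 3)
e = sample_nc(np.full(T, 0.01), rng)    # i.i.d. residual with lambda = 0.01 (Eq. 4)
x = Phi @ alpha + e                     # synthesized signal (Eq. 2)
```

The low-rank structure enters only through the coefficient variances v; the atoms themselves are fixed.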
The variables {\alpha_{fn}} are synthesis coefficients, assumed conditionally independent. Loosely speaking, they are dual of the analysis coefficients, defined by y_{fn} = \sum_t x(t) \phi^*_{fn}(t). The coefficients of the STFT can be interpreted as analysis coefficients obtained with a Gabor frame. The synthesis coefficients are assumed centered, ensuring that x(t) has zero expectation as well. A low-rank latent structure is imposed on their variance. This is in contrast with the GCM introduced at Eq. (1), which instead imposes a low-rank structure on the variance of the analysis coefficients.

2.2 Relation to sparse Bayesian learning

Eq. (2) may be written in matrix form as

x = \Phi \alpha + e,   (5)

where x and e are column vectors of dimension T with coefficients x(t) and e(t), respectively. Given an arbitrary mapping from (f, n) \in {1, ..., F} x {1, ..., N} to m \in {1, ..., M}, where M = FN, \alpha is a column vector of dimension M with coefficients {\alpha_{fn}}_{fn} and \Phi is a matrix of size T x M with columns {\phi_{fn}}_{fn}. In the following we will sometimes slightly abuse notation by indexing the coefficients of \alpha (and other variables) by either m or (f, n). It should be understood that m and (f, n) are in one-to-one correspondence and the notation should be clear from the context.

Let us denote by v the column vector of dimension M with coefficients v_{fn} = [WH]_{fn}. Then, from Eq. (3), we may write the prior distribution for \alpha as

p(\alpha|v) = N_c(\alpha|0, diag(v)).   (6)

^2 In the general unsupervised setting where both W and H are estimated, WH must be low-rank such that K < F and K < N. However, in supervised settings where W is known, we may have K > F.

Ignoring the low-rank constraint, Eqs. 
(5)-(6) resemble sparse Bayesian learning (SBL), as introduced in [4, 5], where it is shown that marginal likelihood estimation of the variance induces sparse solutions of v and thus \alpha. The essential difference between our model and SBL is that the coefficients are no longer unstructured in LRTFS. Indeed, in SBL, each coefficient \alpha_m has a free variance parameter v_m. This property is fundamental to the sparsity-inducing effect of SBL [4]. In contrast, in LRTFS, the variances are tied together, such that v_m = v_{fn} = [WH]_{fn}.

2.3 Latent components reconstruction

As its name suggests, the GCM described by Eq. (1) is a composite model, in the following sense. We may introduce independent complex-valued latent components y_{kfn} \sim N_c(0, w_{fk} h_{kn}) and write y_{fn} = \sum_{k=1}^{K} y_{kfn}. Marginalizing the components from this simple Gaussian additive model leads to Eq. (1). In this perspective, the GCM implicitly assumes the data STFT Y to be a sum of elementary STFT components Y_k = {y_{kfn}}_{fn}. In the GCM, the components can be reconstructed after estimation of W and H, using any statistical estimator. In particular, the minimum mean square error (MMSE) estimator, given by the posterior mean, reduces to so-called Wiener filtering:

\hat{y}_{kfn} = \frac{w_{fk} h_{kn}}{[WH]_{fn}} y_{fn}.   (7)

The components may then be STFT-inverted to obtain temporal reconstructions that form the output of the overall signal decomposition approach.

Of course, the same principle applies to LRTFS. The synthesis coefficients \alpha_{fn} may equally be written as a sum of latent components, such that \alpha_{fn} = \sum_k \alpha_{kfn}, with \alpha_{kfn} \sim N_c(0, w_{fk} h_{kn}). Denoting by \alpha_k the column vector of dimension M with coefficients {\alpha_{kfn}}_{fn}, Eq. (5) may be written as

x = \sum_k \Phi \alpha_k + e = \sum_k c_k + e,   (8)

where c_k = \Phi \alpha_k. 
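The Wiener-type gains of Eq. (7) sum to one across components, so the reconstructed components always add back up to the input coefficients. A small numpy check with random matrices, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
F, N, K = 6, 5, 3
W, H = rng.random((F, K)), rng.random((K, N))
V = W @ H

# Random stand-ins for the coefficients being decomposed (y or alpha-hat).
coef = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))

# Component k gets the Wiener gain w_fk h_kn / [WH]_fn at each t-f point (Eq. 7).
components = np.stack([np.outer(W[:, k], H[k, :]) / V * coef for k in range(K)])

# Since sum_k w_fk h_kn = [WH]_fn, the gains sum to one over k,
# and the components sum back to the input exactly.
assert np.allclose(components.sum(axis=0), coef)
```

This conservation property is what makes the decomposition usable as a source separation output: nothing is lost or double-counted across components.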
The component c_k is the "temporal expression" of the spectral pattern w_k, the kth column of W. Given estimates of W and H, the components may be reconstructed in various ways. The equivalent of the Wiener filtering approach used traditionally with the GCM would consist in computing \hat{c}_k^{MMSE} = \Phi \hat{\alpha}_k^{MMSE}, with \hat{\alpha}_k^{MMSE} = E{\alpha_k | x, W, H}. Though the expression of \hat{\alpha}_k^{MMSE} is available in closed form, it requires the inversion of a too large matrix, of dimensions T x T (see also Section 3.2). We will instead use \hat{c}_k = \Phi \hat{\alpha}_k with \hat{\alpha}_k = E{\alpha_k | \hat{\alpha}, W, H}, where \hat{\alpha} is the available estimate of \alpha. In this case, the coefficients of \hat{\alpha}_k are given by

\hat{\alpha}_{kfn} = \frac{w_{fk} h_{kn}}{[WH]_{fn}} \hat{\alpha}_{fn}.   (9)

3 Estimation in LRTFS

We now consider two approaches to estimation of W, H and \alpha in the LRTFS model defined by Eqs. (2)-(4). The first approach, described in the next section, is maximum joint likelihood estimation (MJLE). It relies on the minimization of -\log p(x, \alpha|W, H, \lambda). The second approach is maximum marginal likelihood estimation (MMLE), described in Section 3.2. It relies on the minimization of -\log p(x|W, H, \lambda), i.e., involves the marginalization of \alpha from the joint likelihood, following the principle of SBL. Though we present MMLE for the sake of completeness, our current implementation does not scale with the dimensions involved in the audio signal processing applications presented in Section 5, and large-scale algorithms for MMLE are left as future work.

3.1 Maximum joint likelihood estimation (MJLE)

Objective. 
MJLE relies on the optimization of

C_{JL}(\alpha, W, H, \lambda) \overset{def}{=} -\log p(x, \alpha|W, H, \lambda)   (10)
= \frac{1}{\lambda} \|x - \Phi\alpha\|_2^2 + D_{IS}(|\alpha|^2 | v) + \log(|\alpha|^2) + M \log \pi,   (11)

where we recall that v is the vectorized version of WH and where D_{IS}(A|B) = \sum_{ij} d_{IS}(a_{ij}|b_{ij}) is the IS divergence between nonnegative matrices (or vectors, as a special case), with d_{IS}(x|y) = (x/y) - \log(x/y) - 1. The first term in Eq. (11) measures the discrepancy between the raw signal and its approximation. The second term ensures that the synthesis coefficients are approximately low-rank. Unexpectedly, a third term that favors sparse solutions of \alpha, thanks to the log function, naturally appears from the derivation of the joint likelihood. The objective function (11) is not convex and the EM algorithm described next may only ensure convergence to a local solution.

EM algorithm. In order to minimize C_{JL}, we employ an EM algorithm based on the architecture proposed by Figueiredo & Nowak [6]. It consists of rewriting Eq. (5) as

z = \alpha + \sqrt{\beta} e_1,   (12)
x = \Phi z + e_2,   (13)

where z acts as a hidden variable, e_1 \sim N_c(0, I), e_2 \sim N_c(0, \lambda I - \beta \Phi\Phi^*), with the operator ^* denoting Hermitian transpose. Provided that \beta \le \lambda/\delta_\Phi, where \delta_\Phi is the largest eigenvalue of \Phi\Phi^*, the likelihood function p(x|\alpha, \lambda) under Eqs. (12)-(13) is the same as under Eq. (5). Denoting the set of parameters by \theta_{JL} = {\alpha, W, H, \lambda}, the EM algorithm relies on the iterative minimization of

Q(\theta_{JL}|\tilde{\theta}_{JL}) = -\int_z \log p(x, \alpha, z|W, H, \lambda) \, p(z|x, \tilde{\theta}_{JL}) \, dz,   (14)

where \tilde{\theta}_{JL} acts as the current parameter value. 
Loosely speaking, the EM algorithm relies on the idea that if z were known, then the estimation of \alpha and of the other parameters would boil down to the mere white noise denoising problem described by Eq. (12). As z is not known, the posterior mean w.r.t. z of the joint likelihood is considered instead.

The complete likelihood in Eq. (14) may be decomposed as

\log p(x, \alpha, z|W, H, \lambda) = \log p(x|z, \lambda) + \log p(z|\alpha) + \log p(\alpha|WH).   (15)

The hidden variable posterior simplifies to p(z|x, \theta_{JL}) = p(z|x, \lambda). From there, using standard manipulations with Gaussian distributions, the (i+1)th iteration of the resulting algorithm writes as follows.

E-step: z^{(i)} = E{z|x, \lambda^{(i)}} = \alpha^{(i)} + \frac{\beta}{\lambda^{(i)}} \Phi^*(x - \Phi\alpha^{(i)})   (16)

M-step: \forall (f, n), \alpha^{(i+1)}_{fn} = \frac{v^{(i)}_{fn}}{v^{(i)}_{fn} + \beta} z^{(i)}_{fn}   (17)

(W^{(i+1)}, H^{(i+1)}) = \arg\min_{W,H \ge 0} \sum_{fn} d_{IS}(|\alpha^{(i+1)}_{fn}|^2 \,|\, [WH]_{fn})   (18)

\lambda^{(i+1)} = \frac{1}{T} \|x - \Phi\alpha^{(i+1)}\|_2^2   (19)

In Eq. (17), v^{(i)}_{fn} is a shorthand for [W^{(i)}H^{(i)}]_{fn}. Eq. (17) is simply the application of Wiener filtering to Eq. (12) with z = z^{(i)}. Eq. (18) amounts to solving an NMF problem with the IS divergence; it may be solved using majorization-minimization, resulting in the standard multiplicative update rules given in [3]. Only a local solution may be obtained with this approach, but it still decreases the negative log-likelihood at every iteration. The update rule for \lambda is not the one that exactly derives from the EM procedure (that one has a more complicated expression), but it still decreases the negative log-likelihood at every iteration, as explained in [6].

Note that the overall algorithm is rather computationally friendly, as no matrix inversion is required. The \Phi\alpha and \Phi^* x operations in Eq. 
(16) correspond to synthesis and analysis operations that can be realized efficiently using optimized packages, such as the Large Time-Frequency Analysis Toolbox (LTFAT) [7].

3.2 Maximum marginal likelihood estimation (MMLE)

Objective. The second estimation method relies on the optimization of

C_{ML}(W, H, \lambda) \overset{def}{=} -\log p(x|W, H, \lambda)   (20)
= -\log \int_\alpha p(x|\alpha, \lambda) \, p(\alpha|WH) \, d\alpha.   (21)

It corresponds to the "type-II" maximum likelihood procedure employed in [4, 5]. By treating \alpha as a nuisance parameter, the number of parameters involved in the data likelihood is significantly reduced, yielding more robust estimation with fewer local minima in the objective function [5].

EM algorithm. In order to minimize C_{ML}, we may use the EM architecture described in [4, 5] that quite naturally uses \alpha as the hidden data. Denoting the set of parameters by \theta_{ML} = {W, H, \lambda}, the EM algorithm relies on the iterative minimization of

Q(\theta_{ML}|\tilde{\theta}_{ML}) = -\int_\alpha \log p(x, \alpha|W, H, \lambda) \, p(\alpha|x, \tilde{\theta}_{ML}) \, d\alpha,   (22)

where \tilde{\theta}_{ML} acts as the current parameter value. As the derivations closely follow [4, 5], we skip details for brevity. 
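One practical detail of the updates that follow deserves a note: the E-step covariance \Sigma^{(i)} of Eq. (23) nominally requires an M x M inverse, but the Woodbury matrix identity reduces it to a T x T inverse, as discussed after the updates. A numpy check of that reduction, on random data with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
T, M, lam = 20, 50, 0.1   # M > T, as for overcomplete t-f frames

Phi = (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))) / np.sqrt(T)
v = rng.random(M) + 0.1   # coefficient variances v_m = [WH]_m

# Direct E-step covariance (Eq. 23): an M x M inverse.
Sigma_direct = np.linalg.inv(Phi.conj().T @ Phi / lam + np.diag(1.0 / v))

# Woodbury identity: (D^{-1} + Phi* Phi / lam)^{-1}
#   = D - D Phi* (lam I + Phi D Phi*)^{-1} Phi D,
# so only a T x T matrix needs inverting.
D = np.diag(v)
inner = lam * np.eye(T) + Phi @ D @ Phi.conj().T
Sigma_woodbury = D - D @ Phi.conj().T @ np.linalg.inv(inner) @ Phi @ D

assert np.allclose(Sigma_direct, Sigma_woodbury)
```

With M in the order of twice T or more, this trades an O(M^3) inverse for an O(T^3) one, though, as noted below, T itself is still prohibitively large for long audio signals.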
Using rather standard results about Gaussian distributions, the (i+1)th iteration of the algorithm writes as follows.

E-step: \Sigma^{(i)} = (\Phi^*\Phi/\lambda^{(i)} + diag(v^{(i-1)})^{-1})^{-1}   (23)
\alpha^{(i)} = \Sigma^{(i)} \Phi^* x / \lambda^{(i)}   (24)
v^{(i)} = E{|\alpha|^2 | x, v^{(i-1)}, \lambda^{(i)}} = diag(\Sigma^{(i)}) + |\alpha^{(i)}|^2   (25)

M-step: (W^{(i+1)}, H^{(i+1)}) = \arg\min_{W,H \ge 0} \sum_{fn} d_{IS}(v^{(i)}_{fn} | [WH]_{fn})   (26)

\lambda^{(i+1)} = \frac{1}{T} \left[ \|x - \Phi\alpha^{(i)}\|_2^2 + \lambda^{(i)} \sum_{m=1}^{M} (1 - \Sigma^{(i)}_{mm}/v^{(i)}_m) \right]   (27)

The complexity of this algorithm can be problematic, as it involves the computation of the inverse of a matrix of size M in the expression of \Sigma^{(i)}. M is typically at least twice as large as T, the signal length. Using the Woodbury matrix identity, the expression of \Sigma^{(i)} can be reduced to the inversion of a matrix of size T, but this is still too large for most signal processing applications (e.g., 3 min of music sampled at CD quality makes T in the order of 10^6). As such, we discard MMLE in the experiments of Section 5, but the methodology presented in this section can be relevant to other problems with smaller dimensions.

4 Multi-resolution LRTFS

Besides the advantage of modeling the raw signal itself, and not its STFT, another major strength of LRTFS is that it offers the possibility of multi-resolution modeling. The latter consists of representing a signal as a sum of t-f atoms with different temporal (and thus frequency) resolutions. This is for example relevant in audio, where transients, such as the attacks of musical notes, are much shorter than sustained parts such as the tonal components (the steady, harmonic part of musical notes). Another example is speech, where different classes of phonemes can have different resolutions. 
At an even higher level, stationarity of female speech holds at shorter resolution than male speech. Because traditional spectral factorization approaches work on the transformed data, the time resolution is set once and for all at feature computation and cannot be adapted during decomposition.

In contrast, LRTFS can accommodate multiple t-f bases in the following way. Assume for simplicity that x is to be expanded on the union of two frames \Phi_a and \Phi_b, with common column size T and with t-f grids of sizes F_a x N_a and F_b x N_b, respectively. \Phi_a may be for example a Gabor frame with short time resolution and \Phi_b a Gabor frame with larger resolution -- such a setting has been considered in many audio applications, e.g., [8, 9], together with sparse synthesis coefficient models. The multi-resolution LRTFS model becomes

x = \Phi_a \alpha_a + \Phi_b \alpha_b + e   (28)

with

\forall (f, n) \in {1, ..., F_a} x {1, ..., N_a}, \alpha_{a,fn} \sim N_c(0, [W_a H_a]_{fn}),   (29)
\forall (f, n) \in {1, ..., F_b} x {1, ..., N_b}, \alpha_{b,fn} \sim N_c(0, [W_b H_b]_{fn}),   (30)

and where {\alpha_{a,fn}}_{fn} and {\alpha_{b,fn}}_{fn} are the coefficients of \alpha_a and \alpha_b, respectively.

By stacking the bases and synthesis coefficients into \Phi = [\Phi_a \Phi_b] and \alpha = [\alpha_a^T \alpha_b^T]^T and introducing a latent variable z = [z_a^T z_b^T]^T, the negative joint log-likelihood -\log p(x, \alpha|W_a, H_a, W_b, H_b, \lambda) in the multi-resolution LRTFS model can be optimized using the EM algorithm described in Section 3.1. The resulting algorithm at iteration (i+1) writes as follows.

E-step: for \ell = {a, b}, z_\ell^{(i)} = \alpha_\ell^{(i)} + \frac{\beta}{\lambda} \Phi_\ell^* (x - \Phi_a \alpha_a^{(i)} - \Phi_b \alpha_b^{(i)})   (31)

M-step: for \ell = {a, b}, \forall (f, n) \in {1, ..., F_\ell} x {1, ..., N_\ell}, \alpha_{\ell,fn}^{(i+1)} = \frac{v_{\ell,fn}^{(i)}}{v_{\ell,fn}^{(i)} + \beta} z_{\ell,fn}^{(i)}   (32)

for \ell = {a, b}, (W_\ell^{(i+1)}, H_\ell^{(i+1)}) = \arg\min_{W_\ell,H_\ell \ge 0} \sum_{fn} d_{IS}(|\alpha_{\ell,fn}^{(i+1)}|^2 \,|\, [W_\ell H_\ell]_{fn})   (33)

\lambda^{(i+1)} = \|x - \Phi_a \alpha_a^{(i+1)} - \Phi_b \alpha_b^{(i+1)}\|_2^2 / T   (34)

The complexity of the algorithm remains fully compatible with signal processing applications. Of course, the proposed setting can be extended to more than two bases.

5 Experiments

We illustrate the effectiveness of our approach with two experiments. The first one, purely illustrative, decomposes a jazz excerpt into two layers (tonal and transient), plus a residual layer, according to the hybrid/morphological model presented in [8, 10]. The second one is a speech enhancement problem, based on a semi-supervised source separation approach in the spirit of [11]. Even though we provided update rules for \lambda for the sake of completeness, this parameter was not estimated in our experiments, but instead treated as a hyperparameter, like in [5, 6]. Indeed, the estimation of \lambda with all the other parameters free was found to perform poorly in practice, a phenomenon observed with SBL as well.

5.1 Hybrid decomposition of music

We consider a 6 s jazz excerpt sampled at 44.1 kHz, corrupted with additive white Gaussian noise at 20 dB input signal-to-noise ratio (SNR). The hybrid model aims to decompose the signal as

x = x_{tonal} + x_{transient} + e = \Phi_{tonal} \alpha_{tonal} + \Phi_{transient} \alpha_{transient} + e,   (35)

using the multi-resolution LRTFS method described in Section 4. As already mentioned, a classical design consists of working with Gabor frames. 
We use a 2048-sample (~46 ms) Hann window for the tonal layer, and a 128-sample (~3 ms) Hann window for the transient layer, both with 50% time overlap. The number of latent components in the two layers is set to K = 3. We experimented with several values of the hyperparameter \lambda and selected the results leading to the best output SNR (about 26 dB). The estimated components are shown in Fig. 1. When listening to the signal components (available in the supplementary material), one can identify the hi-hat in the first and second components of the transient layer, and the bass and piano attacks in the third component. In the tonal layer, one can identify the bass and some piano in the first component, some piano in the second component, and some hi-hat "ring" in the third component.

Figure 1: Top: spectrogram of the original signal (left), estimated transient coefficients log |\alpha_{transient}| (center), estimated tonal coefficients log |\alpha_{tonal}| (right). Middle: the 3 latent components (of rank 1) from the transient layer. Bottom: the 3 latent components (of rank 1) from the tonal layer.

5.2 Speech enhancement

The second experiment considers a semi-supervised speech enhancement example (treated as a single-channel source separation problem). The goal is to recover a speech signal corrupted by a texture sound, namely applause. 
The synthesis model considered is given by

x = \Phi_{tonal} (\alpha^{speech}_{tonal} + \alpha^{noise}_{tonal}) + \Phi_{transient} (\alpha^{speech}_{transient} + \alpha^{noise}_{transient}) + e,   (36)

with

\alpha^{speech}_{tonal} \sim N_c(0, W^{train}_{tonal} H^{speech}_{tonal}), \quad \alpha^{noise}_{tonal} \sim N_c(0, W^{noise}_{tonal} H^{noise}_{tonal}),   (37)

and

\alpha^{speech}_{transient} \sim N_c(0, W^{train}_{transient} H^{speech}_{transient}), \quad \alpha^{noise}_{transient} \sim N_c(0, W^{noise}_{transient} H^{noise}_{transient}).   (38)

W^{train}_{tonal} and W^{train}_{transient} are fixed pre-trained dictionaries of dimension K = 500, obtained from 30 min of training speech containing male and female speakers. The training data, with sampling rate 16 kHz, is extracted from the TIMIT database [12]. The noise dictionaries W^{noise}_{tonal} and W^{noise}_{transient} are learnt from the noisy data, using K = 2. The two t-f bases are Gabor frames with Hann windows of length 512 samples (~32 ms) for the tonal layer and 32 samples (~2 ms) for the transient layer, both with 50% overlap. The hyperparameter \lambda is gradually decreased to a negligible value during iterations (resulting in a negligible residual e), a form of warm-restart strategy [13].

We considered 10 test signals composed of 10 different speech excerpts (from the TIMIT dataset as well, among excerpts not used for training) mixed in the middle of a 7 s-long applause sample. 
For every test signal, the estimated speech signal is computed as

\hat{x} = \Phi_{tonal} \hat{\alpha}^{speech}_{tonal} + \Phi_{transient} \hat{\alpha}^{speech}_{transient}   (39)

and an SNR improvement is computed as the difference between the output and input SNRs. With our approach, the average SNR improvement over the 10 test signals was 6.6 dB. Fig. 2 displays the spectrograms of one noisy test signal with short and long windows, and the clean speech synthesis coefficients estimated in the two layers. As a baseline, we applied IS-NMF in a similar setting using one Gabor transform with a window of intermediate length (256 samples, ~16 ms). The average SNR improvement was 6 dB in that case. We also applied the standard OMLSA speech enhancement method [14] (using the implementation available from the author with default parameters) and the average SNR improvement was 4.6 dB with this approach. Other experiments with other noise types (such as helicopter and train sounds) gave similar trends of results. Sound examples are provided in the supplementary material.

Figure 2: Time-frequency representations of the noisy data (top) and of the estimated tonal and transient layers from the speech (bottom).

6 Conclusion

We have presented a new model that bridges the gap between t-f synthesis and traditional NMF approaches. The proposed algorithm for maximum joint likelihood estimation of the synthesis coefficients and their low-rank variance can be viewed as an iterative shrinkage algorithm with an additional Itakura-Saito NMF penalty term. 
In [15], Elad explains in the context of sparse representations that soft thresholding of analysis coefficients corresponds to the first iteration of the forward-backward algorithm for LASSO/basis pursuit denoising. Similarly, Itakura-Saito NMF followed by Wiener filtering corresponds to the first iteration of the proposed EM algorithm for MJLE.

As opposed to traditional NMF, LRTFS accommodates multi-resolution representations very naturally, with no extra difficulty at the estimation level. The model can be extended in a straightforward manner to various additional penalties on the matrices W or H (such as smoothness or sparsity). Future work will include the design of a scalable algorithm for MMLE, using for example message passing [16], and a comparison of MJLE and MMLE for LRTFS. Moreover, our generative model can be considered for more general inverse problems such as multichannel audio source separation [17]. More extensive experimental studies are planned in this direction.

Acknowledgments

The authors are grateful to the organizers of the Modern Methods of Time-Frequency Analysis Semester held at the Erwin Schrödinger Institute in Vienna in December 2012, for arranging a very stimulating event where the presented work was initiated.

References

[1] P. Smaragdis, C. Févotte, G. Mysore, N. Mohammadiha, and M. Hoffman. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66-75, May 2014.

[2] D. D. Lee and H. S. Seung. 
Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788-791, 1999.

[3] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3):793-830, Mar. 2009.

[4] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.

[5] D. P. Wipf and B. D. Rao. Sparse Bayesian learning for basis selection. IEEE Transactions on Signal Processing, 52(8):2153-2164, Aug. 2004.

[6] M. Figueiredo and R. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906-916, Aug. 2003.

[7] Z. Průša, P. Søndergaard, P. Balazs, and N. Holighaus. LTFAT: A Matlab/Octave toolbox for sound processing. In Proc. 10th International Symposium on Computer Music Multidisciplinary Research (CMMR), pages 299-314, Marseille, France, Oct. 2013.

[8] L. Daudet and B. Torrésani. Hybrid representations for audiophonic signal encoding. Signal Processing, 82(11):1595-1617, 2002.

[9] M. Kowalski and B. Torrésani. Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251-264, 2009.

[10] M. Elad, J.-L. Starck, D. L. Donoho, and P. Querre. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Journal on Applied and Computational Harmonic Analysis, 19:340-358, Nov. 2005.

[11] P. Smaragdis, B. Raj, and M. V. Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In Proc. 7th International Conference on Independent Component Analysis and Signal Separation (ICA), London, UK, Sep. 2007.

[12] TIMIT: acoustic-phonetic continuous speech corpus. 
Linguistic Data Consortium, 1993.

[13] A. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107-1130, 2008.

[14] I. Cohen. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 11(5):466-475, 2003.

[15] M. Elad. Why simple shrinkage is still relevant for redundant representations? IEEE Transactions on Information Theory, 52(12):5559-5569, 2006.

[16] M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813, 2008.

[17] A. Ozerov and C. Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):550-563, Mar. 2010.
", "award": [], "sourceid": 1872, "authors": [{"given_name": "Cédric", "family_name": "Févotte", "institution": "CNRS"}, {"given_name": "Matthieu", "family_name": "Kowalski", "institution": "Univ Paris-Sud"}]}