{"title": "Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation", "book": "Advances in Neural Information Processing Systems", "page_first": 2267, "page_last": 2275, "abstract": "In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetic datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.", "full_text": "Nonnegative dictionary learning in the exponential\nnoise model for adaptive music signal representation\n\nOnur Dikmen\n\nCNRS LTCI; T\u00e9l\u00e9com ParisTech\n\n75014, Paris, France\n\nC\u00e9dric F\u00e9votte\n\nCNRS LTCI; T\u00e9l\u00e9com ParisTech\n\n75014, Paris, France\n\ndikmen@telecom-paristech.fr\n\nfevotte@telecom-paristech.fr\n\nAbstract\n\nIn this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. 
Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetic datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.\n\n1 Introduction\n\nIn this paper we address the task of nonnegative dictionary learning described by\n\nV \u2248 W H,   (1)\n\nwhere V, W, H are nonnegative matrices of dimensions F \u00d7 N, F \u00d7 K and K \u00d7 N, respectively. V is the data matrix, where each column vn is a data point, W is the dictionary matrix, with columns {wk} acting as \u201cpatterns\u201d or \u201cexplanatory variables\u201d representative of the data, and H is the activation matrix, with columns {hn}. For example, in this paper we will be interested in music data such that V is a time-frequency spectrogram matrix and W is a collection of spectral signatures of latent elementary audio components. The most common approach to nonnegative dictionary learning is nonnegative matrix factorization (NMF) [1], which consists in retrieving the factorization (1) by solving\n\nmin_{W,H} D(V|WH)   s.t. W, H \u2265 0,   where D(V|WH) def= \u03a3_{fn} d(v_{fn} | [WH]_{fn}),   (2)\n\nwhere d(x|y) is a measure of fit between nonnegative scalars, v_{fn} are the entries of V, and A \u2265 0 expresses nonnegativity of the entries of matrix A. 
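For concreteness, the measure of fit used throughout this paper (the Itakura-Saito divergence, whose scalar form is given in Section 3.1) takes only a few lines of NumPy to evaluate. This is our own illustrative sketch, not code from the paper; the matrix names follow the text:

```python
import numpy as np

def is_divergence(V, V_hat):
    # D(V | V_hat) = sum_{fn} d_IS(v_fn | v_hat_fn), with
    # d_IS(x | y) = x/y - log(x/y) - 1  (nonnegative, zero iff V == V_hat)
    R = V / V_hat
    return np.sum(R - np.log(R) - 1.0)

rng = np.random.default_rng(0)
F, K, N = 5, 2, 4
W = rng.random((F, K)) + 0.1  # dictionary
H = rng.random((K, N)) + 0.1  # activations
V = W @ H                     # a perfectly factorized data matrix

print(is_divergence(V, W @ H))        # 0 up to rounding
print(is_divergence(V, 2 * (W @ H)))  # positive for any mismatch
```

Note that d_IS is scale-invariant (d_IS(λx|λy) = d_IS(x|y)), which is one reason it suits power spectra with large dynamic range.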
The cost function D(V|WH) is often a likelihood function \u2212log p(V|W, H) in disguise: e.g., the Euclidean distance underlies additive Gaussian noise, the Kullback-Leibler (KL) divergence underlies Poissonian noise, while the Itakura-Saito (IS) divergence underlies multiplicative exponential noise [2]. The latter noise model will be central to this work because it underlies a suitable generative model of the power spectrogram, as shown in [3] and later recalled.\n\n1\n\n\fA criticism about NMF is that little can be said about the asymptotic optimality of the learnt dictionary W. Indeed, because W is estimated jointly with H, the total number of parameters FK + KN grows with the number of data points N. As such, this paper instead addresses optimization of the likelihood in the marginal model described by\n\np(V|W) = \u222b_H p(V|W, H) p(H) dH,   (3)\n\nwhere H is treated as a random latent variable with prior p(H). The evaluation and optimization of the marginal likelihood is not trivial in general, and this paper is precisely devoted to these tasks in the multiplicative exponential noise model.\n\nThe maximum marginal likelihood estimation approach we seek here is related to IS-NMF in the same way that Latent Dirichlet Allocation (LDA) [4] is related to probabilistic Latent Semantic Indexing (pLSI) [5]. LDA and pLSI are two estimators in the same model, but LDA seeks estimation of the topic distributions in the marginal model, from which the topic weights describing each document have been integrated out. In contrast, pLSI (which is essentially equivalent to KL-NMF as shown in [6]) performs maximum joint likelihood estimation (MJLE) for the topics and weights. Blei et al. [4] show the better performance of LDA w.r.t. pLSI. Welling et al. 
[7] also report similar\nresults with a discussion, stating that deterministic latent variable models assign zero probability to\ninput con\ufb01gurations that do not appear in the training set. A similar approach is Discrete Component\nAnalysis (DCA) [8] which considers maximum marginal a posteriori estimation in the Gamma-\nPoisson (GaP) model [9], see also [10] for the maximum marginal likelihood estimation on the same\nmodel. In this paper, we will follow the same objective for the multiplicative exponential noise\nmodel.\n\nWe will describe a variational algorithm for the evaluation and optimization of (3); note that the\nalgorithm exploits speci\ufb01cities of the model and is not a mere adaptation of LDA or DCA to an\nalternative setting. We will consider a nonnegative Generalized inverse-Gaussian (GIG) distribution\nas a prior for H, a \ufb02exible distribution which takes the Gamma and inverse-Gamma as special\ncases. As will be detailed later, this work relates to recent work by Hoffman et al. [11], which\nconsiders full Bayesian integration of W and H (both assumed random) in the exponential noise\nmodel, in a nonparametric setting allowing for model order selection. We will show that our more\nsimple maximum likelihood approach inherently performs model selection as well by automatically\npruning \u201cirrelevant\u201d dictionary elements. Applied to a short well structured piano sequence, our\napproach is shown to capture the correct number of components, corresponding to the expected note\nspectra, and outperforms the nonparametric Bayesian approach of [11].\n\nThe paper is organized as follows. Section 2 introduces the multiplicative exponential noise model\nwith the prior distribution for the expansion coef\ufb01cients p(H). Sections 3 and 4 describe the MJLE\nand MMLE approaches, respectively. 
Section 5 reports results on synthetic and real audio data. Section 6 concludes.\n\n2 Model\n\nThe generative model assumed in this paper is\n\nv_{fn} = \u02c6v_{fn} \u00b7 \u03b5_{fn},   (4)\n\nwhere \u02c6v_{fn} = \u03a3_k w_{fk} h_{kn} and \u03b5_{fn} is a nonnegative multiplicative noise with exponential distribution \u03b5_{fn} \u223c exp(\u2212\u03b5_{fn}). In other words, and under independence assumptions, the likelihood function is\n\np(V|W, H) = \u03a0_{fn} (1/\u02c6v_{fn}) exp(\u2212v_{fn}/\u02c6v_{fn}).   (5)\n\nWhen V is a power spectrogram matrix such that v_{fn} = |x_{fn}|^2 and {x_{fn}} are the complex-valued short-time Fourier transform (STFT) coefficients of some signal data, where f typically acts as a frequency index and n acts as a time-frame index, it was shown in [3] that an equivalent generative model of v_{fn} is\n\nx_{fn} = \u03a3_k c_{fkn},   c_{fkn} \u223c Nc(0, w_{fk} h_{kn}),   (6)\n\n2\n\n\fwhere Nc refers to the circular complex Gaussian distribution.1 In other words, the exponential multiplicative noise model underlies a generative composite model of the STFT. The complex-valued matrix {c_{fkn}}_{fn}, referred to as the kth component, is characterized by a spectral signature w_k, amplitude-modulated in time by the frame-dependent coefficient h_{kn}, which accounts for nonstationarity. In analogy with LDA or DCA, if our data consisted of word counts, with f indexing words and n indexing documents, then the columns of W would describe topics and c_{fkn} would denote the number of occurrences of word f stemming from topic k in document n.\nIn our setting W is considered a free deterministic parameter to be estimated by maximum likelihood. 
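The equivalence between the multiplicative exponential noise model (4)-(5) and the composite Gaussian model (6) can be checked numerically: summing K circular complex Gaussians with variances w_{fk} h_{kn} and squaring the modulus yields an exponential variable with mean v_hat_{fn}. A small Monte Carlo sketch of our own, with arbitrary made-up variances:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_nc(variance, size):
    # draw from Nc(0, v): real and imaginary parts are N(0, v/2)
    s = np.sqrt(variance / 2.0)
    return rng.normal(0.0, s, size) + 1j * rng.normal(0.0, s, size)

# K = 3 latent components at a single time-frequency bin (f, n),
# with variances w_fk * h_kn chosen arbitrarily for the demo
variances = np.array([0.5, 1.2, 0.3])
x = sum(sample_nc(v, 200_000) for v in variances)  # x_fn = sum_k c_fkn

v = np.abs(x) ** 2       # power spectrogram entry v_fn = |x_fn|^2
v_hat = variances.sum()  # = sum_k w_fk * h_kn
print(v.mean(), v_hat)   # sample mean close to v_hat: |x_fn|^2 ~ Exp(mean v_hat)
```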
In contrast, H is treated as a nonnegative random latent variable over which we will integrate. It is assigned a GIG prior, such that\n\nh_{kn} \u223c GIG(\u03b1_k, \u03b2_k, \u03b3_k),   (7)\n\nwith\n\nGIG(x|\u03b1, \u03b2, \u03b3) = [(\u03b2/\u03b3)^{\u03b1/2} / (2 K_\u03b1(2\u221a(\u03b2\u03b3)))] x^{\u03b1\u22121} exp(\u2212(\u03b2x + \u03b3/x)),   (8)\n\nwhere K_\u03b1 is a modified Bessel function of the second kind and x, \u03b2 and \u03b3 are nonnegative scalars. The GIG distribution unifies the Gamma (\u03b1 > 0, \u03b3 = 0) and inverse-Gamma (\u03b1 < 0, \u03b2 = 0) distributions. Its sufficient statistics are x, 1/x and log x, and in particular we have\n\n\u27e8x\u27e9 = [K_{\u03b1+1}(2\u221a(\u03b2\u03b3)) / K_\u03b1(2\u221a(\u03b2\u03b3))] \u221a(\u03b3/\u03b2),   \u27e81/x\u27e9 = [K_{\u03b1\u22121}(2\u221a(\u03b2\u03b3)) / K_\u03b1(2\u221a(\u03b2\u03b3))] \u221a(\u03b2/\u03b3),   (9)\n\nwhere \u27e8x\u27e9 denotes expectation. Although all derivations and implementations are done for the general case, in practice we will only consider the special case of the Gamma distribution for simplicity. In that case, the \u03b2 parameter merely acts as a scale parameter, which we fix so as to solve the scale ambiguity between the columns of W and the rows of H. 
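The expectations in equation (9) are straightforward to evaluate with SciPy's modified Bessel function of the second kind (`scipy.special.kv`). The quadrature comparison below is our own sanity check on an arbitrary parameter setting, not part of the paper:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind, K_alpha

def gig_moments(alpha, beta, gamma):
    # <x> and <1/x> under GIG(alpha, beta, gamma), equation (9)
    z = 2.0 * np.sqrt(beta * gamma)
    mean_x = kv(alpha + 1, z) / kv(alpha, z) * np.sqrt(gamma / beta)
    mean_inv = kv(alpha - 1, z) / kv(alpha, z) * np.sqrt(beta / gamma)
    return mean_x, mean_inv

# check against naive quadrature of x^(alpha-1) * exp(-(beta*x + gamma/x))
alpha, beta, gamma = 2.0, 1.0, 1.0
x = np.linspace(1e-6, 60.0, 400_001)
dx = x[1] - x[0]
pdf = x ** (alpha - 1) * np.exp(-(beta * x + gamma / x))
pdf /= pdf.sum() * dx                 # normalize numerically
num_mean = (x * pdf).sum() * dx
num_inv = (pdf / x).sum() * dx
m, minv = gig_moments(alpha, beta, gamma)
print(m, num_mean)    # both ~2.55
print(minv, num_inv)  # both ~0.55
```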
We will also assume the shape parameters {\u03b1_k} fixed to arbitrary values (typically, \u03b1_k = 1, which corresponds to the exponential distribution). Given the generative model specified by equations (4) and (7) we now describe two estimators for W.\n\n3 Maximum joint likelihood estimation\n\n3.1 Estimator\n\nThe joint (penalized) log-likelihood of W and H is defined by\n\nC_JL(W, H) def= log p(V|W, H) + log p(H)   (10)\n= \u2212D_IS(V|WH) \u2212 \u03a3_{kn} [(1 \u2212 \u03b1_k) log h_{kn} + \u03b2_k h_{kn} + \u03b3_k/h_{kn}] + cst,   (11)\n\nwhere D_IS(V|WH) is defined as in equation (2) with d_IS(x|y) = x/y \u2212 log(x/y) \u2212 1 (Itakura-Saito divergence) and \u201ccst\u201d denotes terms constant w.r.t. W and H. The subscript JL stands for joint likelihood, and the estimation of W by maximization of C_JL(W, H) will be referred to as maximum joint likelihood estimation (MJLE).\n\n3.2 MM algorithm for MJLE\n\nWe describe an iterative algorithm which sequentially updates W given H and H given W. Each of the two steps can be achieved in a minorization-maximization (MM) setting [12], where the original problem is replaced by the iterative optimization of an easier-to-optimize auxiliary function. We first describe the update of H, from which the update of W will be easily deduced. Given W, our task consists in maximizing C(H) = \u2212D_IS(V|WH) \u2212 L(H), where L(H) = \u03a3_{kn} (1 \u2212 \u03b1_k) log h_{kn} + \u03b2_k h_{kn} + \u03b3_k/h_{kn}. 
Using Jensen\u2019s inequality to majorize the convex part of D_IS(V|WH) (terms in v_{fn}/\u02c6v_{fn}) and a first-order Taylor approximation to majorize its concave part (terms in log \u02c6v_{fn}), as in [13], the functional\n\nG(H, \u02dcH) = \u2212\u03a3_{kn} (p_{kn}/h_{kn} + q_{kn} h_{kn}) \u2212 L(H) + cst,   (12)\n\nwhere p_{kn} = \u02dch_{kn}^2 \u03a3_f w_{fk} v_{fn}/\u02dcv_{fn}^2, q_{kn} = \u03a3_f w_{fk}/\u02dcv_{fn}, and \u02dcv_{fn} = [W \u02dcH]_{fn}, can be shown to be a tight lower bound of C(H), i.e., G(H, \u02dcH) \u2264 C(H) and G(\u02dcH, \u02dcH) = C(\u02dcH). Its iterative maximization w.r.t. H, where \u02dcH = H^{(i)} acts as the current iterate at iteration i, produces an ascent algorithm, such that C(H^{(i+1)}) \u2265 C(H^{(i)}). The update is easily shown to amount to solving an order-2 polynomial with a single positive root given by\n\nh_{kn} = [(\u03b1_k \u2212 1) + \u221a((\u03b1_k \u2212 1)^2 + 4(p_{kn} + \u03b3_k)(q_{kn} + \u03b2_k))] / [2(q_{kn} + \u03b2_k)].   (13)\n\nThe update preserves nonnegativity given positive initialization. By exchangeability of W and H when the data is transposed (V^T = H^T W^T), and dropping the penalty term (\u03b1_k = 1, \u03b2_k = 0, \u03b3_k = 0), the update of W is given by the multiplicative update\n\nw_{fk} = \u02dcw_{fk} \u221a[(\u03a3_n h_{kn} v_{fn}/\u02dcv_{fn}^2) / (\u03a3_n h_{kn}/\u02dcv_{fn})],   (14)\n\nwhich is known from [13].\n\n1A complex random variable has distribution Nc(\u00b5, \u03bb) if and only if its real and imaginary parts are independent and distributed as N(\u211c(\u00b5), \u03bb/2) and N(\u2111(\u00b5), \u03bb/2), respectively.\n\n3\n\n\f4 Maximum marginal likelihood estimation\n\n4.1 Estimator\n\nWe define the marginal log-likelihood objective function as\n\nC_ML(W) def= log \u222b p(V|W, H) p(H) dH.   (15)\n\nThe subscript ML stands for marginal likelihood, and the estimation of W by maximization of C_ML(W) will be referred to as maximum marginal likelihood estimation (MMLE). 
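A full MM sweep of Section 3.2 (an H update via equation (13) followed by the W update of equation (14)) amounts to a handful of matrix operations. The NumPy sketch below is our own, using the hyperparameter choice made later in the paper (\u03b1_k = 1, \u03b2_k = 1, \u03b3_k = 0), and checks the ascent property on random data:

```python
import numpy as np

def penalized_cost(V, W, H, beta=1.0):
    # equals -C_JL up to constants, for alpha_k = 1, gamma_k = 0:
    # D_IS(V | WH) + beta * sum(H); the MM sweep should never increase it
    R = V / (W @ H)
    return np.sum(R - np.log(R) - 1.0) + beta * H.sum()

def mm_sweep(V, W, H, alpha=1.0, beta=1.0, gamma=0.0):
    # H update, equation (13)
    V_hat = W @ H
    P = H ** 2 * (W.T @ (V / V_hat ** 2))  # p_kn
    Q = W.T @ (1.0 / V_hat)                # q_kn
    H = ((alpha - 1) + np.sqrt((alpha - 1) ** 2
         + 4 * (P + gamma) * (Q + beta))) / (2 * (Q + beta))
    # W update, equation (14)
    V_hat = W @ H
    W = W * np.sqrt(((V / V_hat ** 2) @ H.T) / ((1.0 / V_hat) @ H.T))
    return W, H

rng = np.random.default_rng(2)
F, K, N = 8, 3, 20
V = rng.random((F, N)) + 0.1
W = rng.random((F, K)) + 0.1
H = rng.random((K, N)) + 0.1
before = penalized_cost(V, W, H)
for _ in range(50):
    W, H = mm_sweep(V, W, H)
after = penalized_cost(V, W, H)
print(before, after)  # the penalized cost decreases (ascent in C_JL)
```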
Note that in\nBayesian estimation the term marginal likelihood is sometimes used as a synonym for the model\nevidence, which is the likelihood of data given the model, i.e., where all random parameters (in-\ncluding W ) have been marginalized. This is not the case here where W is treated as a deterministic\nparameter and marginal likelihood only refers to the likelihood of W , where H has been integrated\nout. The integral in equation (15) is intractable given our model. In the next section we resort to a\nvariational Bayes procedure for the evaluation and maximization of CML(W ).\n\n4.2 Variational algorithm for MMLE\n\nIn the following we propose an iterative lower bound evaluation/maximization procedure for\napproximate maximization of CML(W ). We will construct a bound B(W, \u02dcW ) such that\n\u2200(W, \u02dcW ), CML(W ) \u2265 B(W, \u02dcW ), where \u02dcW acts as the current iterate and W acts as the free pa-\nrameter over which the bound is maximized. The maximization is approximate in that the bound\nwill only satisfy B( \u02dcW , \u02dcW ) \u2248 CML( \u02dcW ), i.e., is loosely tight in the current update \u02dcW , which fails to\nensure ascent of the objective function like in the MM setting of Section 3.2.\nWe propose to construct the bound from a variational Bayes perspective [14]. The following in-\nequality holds for any distribution function q(H)\n\nCML(W ) \u2265 hlog p(V |W, H)iq + hlog p(H)iq \u2212 hlog q(H)iq\n\n(16)\nThe inequality becomes an equality when q(H) = p(H|V, W ); when the latter is available in close\nform, the EM algorithm consists in using \u02dcq(H) = p(H|V, \u02dcW ) and maximize Bvb\n\u02dcq (W ) w.r.t W ,\nand iterate. 
The true posterior of H being intractable in our case, we take q(H) to be a factorized,\n\nq (W ) .\n\ndef\n= Bvb\n\n4\n\n\fparametric distribution q\u03b8(H), whose parameter \u03b8 is updated so as to tighten Bvb\nLike in [11], we choose q\u03b8(H) to be in the same family as the prior, such that\n\nq ( \u02dcW ) to C( \u02dcW ).\n\nq\u03b8(H) = Ykn GIG(\u00af\u03b1kn, \u00af\u03b2kn, \u00af\u03b3kn) .\n\n(17)\n\nThe \ufb01rst term of Bvb\nq (W ) essentially involves the expectation of \u2212DIS(V |W H) w.r.t to the vari-\national distribution q\u03b8(H). The product W H introduces some coupling of the coef\ufb01cients of H\n(via the sum Pk wf khkn) which makes the integration dif\ufb01cult. Following [11] and similar to\nSection 3.2, we propose to lower bound this term using Jensen\u2019s and Taylor\u2019s type inequalities to\nmajorize the convex and concave parts of \u2212DIS(V |W H). The contributions of the elements of H\nbecome decoupled w.r.t to k, which allows for evaluation and maximization of the bound. This leads\nto\n\n1\n\n\u03c62\nf kn\n\nvf n\n\nwf k(cid:28) 1\n\nhkn(cid:29)q! + log \u03c8f n +\n\nhlog p(V |H, W )iq \u2265 \u2212Xf n Xk\nwhere {\u03c8f n} and {\u03c6f kn} are nonnegative free parameters such that Pk \u03c6f kn = 1. We de\ufb01ne\n\nB\u03b8,\u03c6,\u03c8(W ) as Bvb\nq (W ) but where the expectation of the joint log-likelihood is replaced by its lower\nbound given right side of equation (18). From there, our algorithm is a two-step procedure consisting\nin 1) computing \u02dc\u03b8, \u02dc\u03c6, \u02dc\u03c8 so as to tighten B\u03b8,\u03c6,\u03c8( \u02dcW ) to CML( \u02dcW ), and 2) maximizing B \u02dc\u03b8, \u02dc\u03c6, \u02dc\u03c8(W )\nw.r.t W . The corresponding updates are given next. Note that evaluation of the bound only involves\nexpectations of hkn and 1/hkn w.r.t to the GIG distribution, which is readily given by equation (9).\n\nwf khhkniq \u2212 1! 
,\n\n\u03c8f n Xk\n\n(18)\n\nStep 1: Tightening the bound Given current dictionary update \u02dcW , run the following \ufb01xed-point\nequations.\n\n\u03c6f kn =\n\n\u00af\u03b1kn = \u03b1k,\n\n,\n\n\u02dcwf k/h1/hkniq\nPj \u02dcwf j /h1/hjniq\n\u00af\u03b2kn = \u03b2k +Xf\n\n\u02dcwf k\n\u03c8f n\n\n,\n\n\u03c8f n =Xj\n\u02dcwf jhhjniq\n\u00af\u03b3kn = \u03b3k +Xf\n\nvf n\u03c62\n\u02dcwf k\n\nf kn\n\n.\n\nStep 2: Optimizing the bound Given the variational distribution \u02dcq = q \u02dc\u03b8 from previous step,\nupdate W as\n\nwf k = \u02dcwf kvuuuut\n\n\u02dcq i\u22122\nPn vf nhPj \u02dcwf jh1/hjni\u22121\nPnhPj \u02dcwf jhhjni\u02dcqi\u22121\n\n\u02dcq\n\nh1/hkni\u22121\nhhkni\u02dcq\n\n.\n\n(19)\n\nThe VB update has a similar form to the MM update of equation (14) but the contributions of H are\nreplaced by expected values w.r.t the variational distribution.\n\n4.3 Relation to other works\n\nA variational algorithm using the activation matrix H and the latent components C = {cf kn} as\nhidden data can easily be devised, as sketched in [2]. Including C in the variational distribution also\nallows to decouple the contributions of the activation coef\ufb01cients w.r.t to k but leads from our expe-\nrience to a looser bound, a \ufb01nding also reported in [11]. In a fully Bayesian setting, Hoffman et al.\n\n[11] assume Gamma priors for both W and H. The model is such that \u02c6vf n =Pk \u03bbkwf khkn, where\n\u03bbk acts as a component weight parameter. The number of components is potentially in\ufb01nite but,\nin a nonparametric setting, the prior for \u03bbk favors a \ufb01nite number of active components. Posterior\ninference of the parameters W , H, {\u03bbk} is achieved in a variational setting similar to Section 4.2,\nby maximizing a lower bound on p(V ). 
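To make Steps 1 and 2 of Section 4.2 concrete, here is our own NumPy sketch of one MMLE iteration: the fixed-point equations of Step 1, then the W update of equation (19). We use a small \u03b3_k > 0 so that the GIG expectations of equation (9) are directly computable via `scipy.special.kv`; the paper instead works in the Gamma limit \u03b3_k = 0. Names and problem sizes are illustrative:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_pair(a, b, g):
    # <h> and <1/h> for GIG(a, b, g), equation (9); works elementwise on arrays
    z = 2.0 * np.sqrt(b * g)
    kz = kv(a, z)
    return kv(a + 1, z) / kz * np.sqrt(g / b), kv(a - 1, z) / kz * np.sqrt(b / g)

def mmle_iteration(V, W, alpha=1.0, beta=1.0, gamma=1e-3, inner=5):
    K, N = W.shape[1], V.shape[1]
    a = np.full((K, N), alpha)  # alpha_bar_kn stays equal to alpha_k
    b = np.full((K, N), beta)
    g = np.full((K, N), gamma)
    for _ in range(inner):  # Step 1: fixed-point equations tightening the bound
        Eh, Einv = gig_pair(a, b, g)
        # phi_fkn proportional to w_fk / <1/h_kn>, normalized over k
        phi = (W[:, :, None] / Einv[None, :, :]) / (W @ (1.0 / Einv))[:, None, :]
        psi = W @ Eh  # psi_fn = sum_j w_fj <h_jn>
        b = beta + W.T @ (1.0 / psi)
        g = gamma + np.einsum('fn,fkn,fk->kn', V, phi ** 2, 1.0 / W)
    Eh, Einv = gig_pair(a, b, g)
    # Step 2: multiplicative update of W, equation (19)
    num = (V / (W @ (1.0 / Einv)) ** 2) @ (1.0 / Einv).T
    den = (1.0 / (W @ Eh)) @ Eh.T
    return W * np.sqrt(num / den), (Eh, Einv)

rng = np.random.default_rng(3)
F, K, N = 8, 3, 30
V = rng.random((F, N)) + 0.1
W = rng.random((F, K)) + 0.1
W_new, (Eh, Einv) = mmle_iteration(V, W)
print(W_new.shape)  # (8, 3); entries stay positive
```

As in the text, the update has the same form as the MM update of equation (14), with the contributions of H replaced by expectations under the variational distribution.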
In contrast to this method, our approach does not require specifying a prior for W, leads to simple updates for W that are directly comparable to IS-NMF, and experiments will reveal that our approach embeds model order selection as well, by automatically pruning unnecessary columns of W, without resorting to the nonparametric framework.\n\n5\n\n\f[Figure 1 panels: (a) C_ML by MMLE, (b) C_JL by MJLE, and (c) C_ML by MJLE, each plotted against K = 5, ..., 25.]\n\nFigure 1: Marginal likelihood C_ML (a) and joint likelihood C_JL (b) versus number of components K. C_ML values corresponding to dictionaries estimated by C_JL maximization (c).\n\n5 Experiments\n\nIn this section, we study the performance of the MJLE and MMLE methods on both synthetic and real-world datasets.2 The prior hyperparameters are fixed to \u03b1_k = 1, \u03b3_k = 0 (exponential distribution) and \u03b2_k = 1, i.e., h_{kn} \u223c exp(\u2212h_{kn}). We used 5000 algorithm iterations and nonnegative random initializations in all cases. In order to minimize the odds of getting stuck in local optima, we adapted the deterministic annealing method proposed in [15] for MMLE. Deterministic annealing is applied by multiplying the entropy term \u2212\u27e8log q(H)\u27e9 in the lower bound in (16) by 1/\u03b7^{(i)}. The initial \u03b7^{(0)} is chosen in (0, 1) and increased through iterations. In our experiments, we set \u03b7^{(0)} = 0.6 and updated it with the rule \u03b7^{(i+1)} = min(1, 1.005 \u03b7^{(i)}).\n\n5.1 Swimmer dataset\n\nFirst, we consider the synthetic Swimmer dataset [16], for which the ground truth of the dictionary is available. 
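The annealing schedule described above (\u03b7^{(0)} = 0.6, multiplied by 1.005 at each iteration and capped at 1) reaches \u03b7 = 1 after roughly a hundred iterations, well within the 5000-iteration budget; a two-line sketch:

```python
eta = 0.6
schedule = [eta]
while eta < 1.0:
    eta = min(1.0, 1.005 * eta)  # eta_(i+1) = min(1, 1.005 * eta_(i))
    schedule.append(eta)
print(len(schedule) - 1, schedule[-1])  # eta reaches 1 after 103 updates
```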
The dataset is composed of 256 images of size 32 \u00d7 32, representing a swimmer built\nof an invariant torso and 4 limbs. Each of the 4 limbs can be in one of 4 positions and the dataset\nis formed of all combinations. Hence, the ground truth dictionary corresponds to the collection of\nindividual limb positions. As explained in [16] the torso is an unidenti\ufb01able component that can\nbe paired with any of the limbs, or even split among the limbs. In our experiments, we mapped the\nvalues in the dataset onto the range [1, 100] and multiplied with exponential noise, see some samples\nin Fig. 2 (a).\n\nWe ran the MM and VB algorithms (for MJLE and MMLE, respectively) for K = 1 . . . 20 and the\njoint and marginal log-likelihood end values (after the 5000 iterations) are displayed in Fig. 1. The\nmarginal log-likelihood is here approximated by its lower bound, as described in Section 4.2. In\nFig. 1(a) and (b) the respective objective criteria (CML and CJL) maximized by MMLE and MJLE\nare shown. The increase of CML stops after K = 16, whereas CJL continues to increase as K gets\nlarger. Fig. 1 (c) displays the corresponding marginal likelihood values, CML, of the dictionaries\nobtained by MJLE in Fig. 1 (b); this \ufb01gure empirically shows that maximizing the joint likelihood\ndoes not necessarily imply maximization of the marginal likelihood. These \ufb01gures display the mean\nand standard deviation values obtained from 7 experiments.\n\nThe likelihood values increase with the number of components, as expected from nested models.\nHowever, the marginal likelihood stagnates after K = 16. Manual inspection reveals that passed\nthis value of K, the extra columns of W are pruned to zero, leaving the criterion unchanged. Hence,\nMMLE appears to embed automatic order selection, similar to automatic relevance determination\n[17, 18]. The dictionaries learnt from MJLE and MMLE with K = 20 components are shown in\nFig. 2 (b) and (c). As can be seen from Fig. 
2 (b), MJLE produces spurious or duplicated compo-\nnents. In contrast, the ground truth is well recovered with MMLE.\n\n2MATLAB code is available at http://perso.telecom-paristech.fr/\u223cdikmen/nips11/\n\n6\n\n\f(a) Data\n\n(b) WMJLE\n\n(c) WMMLE\n\nFigure 2: Data samples and dictionaries learnt on the swimmer dataset with K = 20.\n\n5.2 A piano excerpt\n\nIn this section, we consider the piano data used in [3]. It is a toy audio sequence recorded in real\nconditions, consisting of four notes played all together in the \ufb01rst measure and in all possible pairs in\nthe subsequent measures. A power spectrogram with analysis window of size 46 ms was computed,\nleading to F = 513 frequency bins and N = 676 time frames. We ran MMLE with K = 20 on the\nspectrogram. We reconstructed STFT component estimates from the factorization \u02c6W \u02c6H, where \u02c6W is\nthe MMLE dictionary estimate and \u02c6H = hHiq. We used the minimum mean square error (MMSE)\nestimate given by \u02c6cf kn = gf kn. xf n, where gf kn is the time-frequency Wiener mask de\ufb01ned by\n\u02c6wf k\u02c6hkn/Pj \u02c6wf j \u02c6hjn. The estimated dictionary and the reconstructed components in the time do-\n\nmain after inverse STFT are shown in Fig. 3 (a). Out of the 20 components, 12 were assigned to zero\nduring inference. The remaining 8 are displayed. 3 of the nonzero dictionary columns have very\nsmall values, leading to inaudible reconstructions. The \ufb01ve signi\ufb01cant dictionary vectors correspond\nto the frequency templates of the four notes and the transients. For comparison, we applied the non-\nparametric approach by Hoffman et al. [11] on the same data with the same hyperparameters for H.\nThe estimated dictionary and the reconstructed components are presented in Fig. 3 (b). 10 out of\n20 components had very small weight values. The most signi\ufb01cant 8 of the remaining components\nare presented in the \ufb01gure. 
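The Wiener reconstruction used above (c_hat_{fkn} = g_{fkn} x_{fn}, with g_{fkn} = w_hat_{fk} h_hat_{kn} / \u03a3_j w_hat_{fj} h_hat_{jn}) is conservative: the masks sum to one over k, so the component estimates sum back to the original STFT. A self-contained sketch of ours, with random matrices standing in for the estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
F, K, N = 6, 4, 10
W_hat = rng.random((F, K)) + 0.1  # dictionary estimate
H_hat = rng.random((K, N)) + 0.1  # posterior mean activations <H>_q
X = rng.normal(size=(F, N)) + 1j * rng.normal(size=(F, N))  # stand-in STFT

V_hat = W_hat @ H_hat
# Wiener masks g_fkn = w_fk * h_kn / sum_j w_fj * h_jn, shape (F, K, N)
G = W_hat[:, :, None] * H_hat[None, :, :] / V_hat[:, None, :]
C = G * X[:, None, :]  # MMSE component estimates c_hat_fkn

print(np.allclose(G.sum(axis=1), 1.0), np.allclose(C.sum(axis=1), X))  # True True
```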
These components do not exactly correspond to individual notes and\ntransients as they did with MMLE. The fourth note is mainly represented in the \ufb01fth component, but\npartially appears in the \ufb01rst three components as well. In general, the performance of the nonpara-\nmetric approach depends more on initialization, i.e., requires more repetitions than MMLE. For the\nabove results, we used 200 repetitions for the nonparametric method and 20 for MMLE (without\nannealing, same stopping criterion) and chose the repetition with the highest likelihood.\n\n5.3 Decomposition of a real song\n\nIn this last experiment, we decompose the \ufb01rst 40 seconds of God Only Knows by the Beach Boys.\nThis song was produced in mono and we retrieved a downsampled version of it at 22kHz from the\nCD release. We computed a power spectrogram with 46 ms analysis window and ran our VB algo-\nrithm with K = 50. Fig. 4 displays the original data, and two examples of estimated time-frequency\nmasks and reconstructed components. The \ufb01gure also shows the variance of the reconstructed com-\nponents and the evolution of the variational bound along iterations. In this example, 5 components\nout of the 50 are completely pruned in the factorization and 7 others are inaudible. Such decompo-\nsition can be used in various music editing settings, for example for mono to stereo remixing, see,\ne.g., [3].\n\n7\n\n\fW\n\nMMLE\n\nc\nMMLE\n\nW\n\nhoffman\n\nc\nhoffman\n\n(a) MMLE\n\n(b) Hoffman et al.\n\nFigure 3: The estimated dictionary and the reconstructed components by MMLE and the nonpara-\nmetric approach by Hoffman et al. 
with K = 20.\n\n[Figure 4 panels: log power data spectrogram; temporal data; variance of reconstructed components; variational bound against iterations; time-frequency Wiener masks of components 13 and 18; reconstructed components 13 and 18.]\n\nFigure 4: Decomposition results of a real song. The Wiener masks take values between 0 (white) and 1 (black). The first example of reconstructed component captures the first chord of the song, repeated 4 times in the intro. The other component captures the cymbal, which starts with the first verse of the song.\n\nAcknowledgments\n\nThis work is supported by project ANR-09-JCJC-0073-01 TANGERINE (Theory and applications of nonnegative matrix factorization).\n\n6 Conclusions\n\nIn this paper we have challenged the standard NMF approach to nonnegative dictionary learning, based on maximum joint likelihood estimation, with a better-posed approach consisting of maximum marginal likelihood estimation. The proposed algorithm based on variational inference has comparable computational complexity to standard NMF/MJLE. 
Our experiments on synthetical and real\ndata have brought up a very attractive feature of MMLE, namely its self-ability to discard irrelevant\ncolumns in the dictionary, without resorting to elaborate schemes such as Bayesian nonparametrics.\n\n8\n\n\fReferences\n\n[1] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization.\n\nNature, 401:788\u2013791, 1999.\n\n[2] C. F\u00b4evotte and A. T. Cemgil. Nonnegative matrix factorisations as probabilistic inference in\ncomposite models. In Proc. 17th European Signal Processing Conference (EUSIPCO), pages\n1913\u20131917, Glasgow, Scotland, Aug. 2009.\n\n[3] C. F\u00b4evotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-\nSaito divergence. With application to music analysis. Neural Computation, 21(3):793\u2013830,\nMar. 2009.\n\n[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of\n\nMachine Learning Research, 3:993\u20131022, Jan. 2003.\n\n[5] Thomas Hofman. Probabilistic latent semantic indexing. In Proc. 22nd International Confer-\n\nence on Research and Development in Information Retrieval (SIGIR), 1999.\n\n[6] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In Proc. 28th\nannual international ACM SIGIR conference on Research and development in information\nretrieval (SIGIR\u201905), pages 601\u2013602, New York, NY, USA, 2005. ACM.\n\n[7] M. Welling, C. Chemudugunta, and N. Sutter. Deterministic latent variable models and their\n\npitfalls. In SIAM Conference on Data Mining (SDM), pages 196\u2013207, 2008.\n\n[8] W. L. Buntine and A. Jakulin. Discrete component analysis. In Lecture Notes in Computer\n\nScience, volume 3940, pages 1\u201333. Springer, 2006.\n\n[9] John F. Canny. GaP: A factor model for discrete data. In Proceedings of the 27th ACM in-\nternational Conference on Research and Development of Information Retrieval (SIGIR), pages\n122\u2013129, 2004.\n\n[10] O. 
Dikmen and C. F\u00b4evotte. Maximum marginal likelihood estimation for nonnegative dictio-\nnary learning. In Proc. of International Conference on Acoustics, Speech and Signal Process-\ning (ICASSP\u201911), Prague, Czech Republic, 2011.\n\n[11] M. Hoffman, D. Blei, and P. Cook. Bayesian nonparametric matrix factorization for recorded\nmusic. In Proc. 27th International Conference on Machine Learning (ICML), Haifa, Israel,\n2010.\n\n[12] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58:30 \u2013\n\n37, 2004.\n\n[13] Y. Cao, P. P. B. Eggermont, and S. Terebey. Cross Burg entropy maximization and its applica-\ntion to ringing suppression in image reconstruction. IEEE Transactions on Image Processing,\n8(2):286\u2013292, Feb. 1999.\n\n[14] C. M. Bishop. Pattern Recognition And Machine Learning. Springer, 2008. ISBN-13: 978-\n\n0387310732.\n\n[15] K. Katahira, K. Watanabe, and M. Okada. Deterministic annealing variant of variational\nBayes method. In International Workshop on Statistical-Mechanical Informatics 2007 (IW-\nSMI 2007), 2007.\n\n[16] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct de-\ncomposition into parts? In Sebastian Thrun, Lawrence Saul, and Bernhard Sch\u00a8olkopf, editors,\nAdvances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n[17] D. J. C. Mackay. Probable networks and plausible predictions \u2013 a review of practical Bayesian\nmodels for supervised neural networks. Network: Computation in Neural Systems, 6(3):469\u2013\n505, 1995.\n\n[18] C. M. Bishop. Bayesian PCA. In Advances in Neural Information Processing Systems (NIPS),\n\npages 382\u2013388, 1999.\n\n9\n\n\f", "award": [], "sourceid": 1234, "authors": [{"given_name": "Onur", "family_name": "Dikmen", "institution": null}, {"given_name": "C\u00e9dric", "family_name": "F\u00e9votte", "institution": null}]}