{"title": "Algorithms for Non-negative Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 556, "page_last": 562, "abstract": "Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.", "full_text": "Algorithms for Non-negative Matrix Factorization \n\nDaniel D. Lee* \n*Bell Laboratories \nLucent Technologies \nMurray Hill, NJ 07974 \n\nH. Sebastian Seung*† \n†Dept. of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02138 \n\nAbstract \n\nNon-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. 
The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence. \n\n1 Introduction \n\nUnsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. \n\nWe have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data. \n\n2 Non-negative matrix factorization \n\nWe formally consider algorithms for solving the following problem: \n\nNon-negative matrix factorization (NMF) Given a non-negative matrix V, find non-negative matrix factors W and H such that: \n\nV ≈ WH  (1) \n\nNMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V where m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix W and an r x m matrix H. 
Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. This results in a compressed version of the original data matrix. \n\nWhat is the significance of the approximation in Eq. (1)? It can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H. In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can be regarded as containing a basis that is optimized for the linear approximation of the data in V. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data. \n\nThe present submission is not about applications of NMF, but focuses instead on the technical aspects of finding non-negative matrix factorizations. Of course, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the non-negativity constraint makes much of this previous work inapplicable to the present case [8]. \n\nHere we discuss two algorithms for NMF based on iterative updates of W and H. Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications. Other algorithms may possibly be more efficient in overall computation time, but are more difficult to implement and may not generalize to different cost functions. Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12]. \n\nAt each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). 
We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules. In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization. \n\n3 Cost functions \n\nTo find an approximate factorization V ≈ WH, we first need to define cost functions that quantify the quality of the approximation. Such a cost function can be constructed using some measure of distance between two non-negative matrices A and B. One useful measure is simply the square of the Euclidean distance between A and B [13], \n\n||A - B||² = Σ_ij (A_ij - B_ij)²  (2) \n\nThis is lower bounded by zero, and clearly vanishes if and only if A = B. \n\nAnother useful measure is \n\nD(A||B) = Σ_ij ( A_ij log(A_ij / B_ij) - A_ij + B_ij )  (3) \n\nLike the Euclidean distance this is also lower bounded by zero, and vanishes if and only if A = B. But it cannot be called a \"distance\", because it is not symmetric in A and B, so we will refer to it as the \"divergence\" of A from B. It reduces to the Kullback-Leibler divergence, or relative entropy, when Σ_ij A_ij = Σ_ij B_ij = 1, so that A and B can be regarded as normalized probability distributions. \n\nWe now consider two alternative formulations of NMF as optimization problems: \n\nProblem 1 Minimize ||V - WH||² with respect to W and H, subject to the constraints W, H ≥ 0. \n\nProblem 2 Minimize D(V||WH) with respect to W and H, subject to the constraints W, H ≥ 0. \n\nAlthough the functions ||V - WH||² and D(V||WH) are convex in W only or H only, they are not convex in both variables together. Therefore it is unrealistic to expect an algorithm to solve Problems 1 and 2 in the sense of finding global minima. However, there are many techniques from numerical optimization that can be applied to find local minima. 
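As a concrete reference point, the two measures in Eqs. (2) and (3) are straightforward to compute. The following NumPy sketch is our own illustration (the function names are ours, and the handling of zero entries of A uses the common convention 0 log 0 = 0, which the text does not spell out):

```python
import numpy as np

def euclidean_cost(A, B):
    """Squared Euclidean distance, Eq. (2): sum_ij (A_ij - B_ij)^2."""
    return np.sum((A - B) ** 2)

def divergence(A, B):
    """Generalized KL divergence, Eq. (3):
    sum_ij (A_ij log(A_ij / B_ij) - A_ij + B_ij),
    with the convention 0 log 0 = 0 for zero entries of A."""
    pos = A > 0
    safe_A = np.where(pos, A, 1.0)               # avoid log(0); masked out below
    terms = np.where(pos, A * np.log(safe_A / B) - A + B, B)
    return np.sum(terms)

# Both measures are nonnegative and vanish if and only if A equals B.
A = np.array([[1.0, 2.0], [0.0, 4.0]])
B = np.array([[1.5, 2.0], [0.5, 3.0]])
print(euclidean_cost(A, A), divergence(A, A))    # both 0
print(euclidean_cost(A, B), divergence(A, B))    # both positive
```

Note that the divergence is finite only when B is strictly positive wherever A is; this is guaranteed during NMF iterations as long as W and H are initialized with strictly positive entries.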
\n\nGradient descent is perhaps the simplest technique to implement, but convergence can be slow. Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but are more complicated to implement than gradient descent [8]. Gradient based methods also have the disadvantage of being very sensitive to the choice of step size, which can be very inconvenient for large applications. \n\n4 Multiplicative update rules \n\nWe have found that the following \"multiplicative update rules\" are a good compromise between speed and ease of implementation for solving Problems 1 and 2. \n\nTheorem 1 The Euclidean distance ||V - WH|| is nonincreasing under the update rules \n\nH_aμ ← H_aμ (W^T V)_aμ / (W^T W H)_aμ,  W_ia ← W_ia (V H^T)_ia / (W H H^T)_ia  (4) \n\nThe Euclidean distance is invariant under these updates if and only if W and H are at a stationary point of the distance. \n\nTheorem 2 The divergence D(V||WH) is nonincreasing under the update rules \n\nH_aμ ← H_aμ [ Σ_i W_ia V_iμ/(WH)_iμ ] / [ Σ_k W_ka ],  W_ia ← W_ia [ Σ_μ H_aμ V_iμ/(WH)_iμ ] / [ Σ_ν H_aν ]  (5) \n\nThe divergence is invariant under these updates if and only if W and H are at a stationary point of the divergence. \n\nProofs of these theorems are given in a later section. For now, we note that each update consists of multiplication by a factor. In particular, it is straightforward to see that this multiplicative factor is unity when V = WH, so that perfect reconstruction is necessarily a fixed point of the update rules. \n\n5 Multiplicative versus additive update rules \n\nIt is useful to contrast these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for H that reduces the squared distance can be written as \n\nH_aμ ← H_aμ + η_aμ [ (W^T V)_aμ - (W^T W H)_aμ ]  (6) \n\nIf η_aμ are all set equal to some small positive number, this is equivalent to conventional gradient descent. 
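The multiplicative updates of Theorems 1 and 2 translate almost line for line into NumPy. The sketch below is our own illustration, not code from the paper; the small EPS added to each denominator is a standard numerical guard against division by zero and is not part of Eqs. (4) and (5) themselves:

```python
import numpy as np

EPS = 1e-12  # numerical guard only; not part of Eqs. (4) and (5)

def update_euclidean(V, W, H):
    """One pass of Eq. (4); ||V - WH||^2 is nonincreasing."""
    H = H * (W.T @ V) / (W.T @ W @ H + EPS)
    W = W * (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H

def update_divergence(V, W, H):
    """One pass of Eq. (5); D(V || WH) is nonincreasing."""
    H = H * (W.T @ (V / (W @ H + EPS))) / (W.sum(axis=0)[:, None] + EPS)
    W = W * ((V / (W @ H + EPS)) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return W, H

# Monotone decrease of the squared error on random nonnegative data.
rng = np.random.default_rng(0)
V = rng.random((20, 30))
W, H = rng.random((20, 5)), rng.random((5, 30))

costs = [np.sum((V - W @ H) ** 2)]
for _ in range(100):
    W, H = update_euclidean(V, W, H)
    costs.append(np.sum((V - W @ H) ** 2))

assert all(b <= a + 1e-9 for a, b in zip(costs, costs[1:]))
```

Starting from strictly positive W and H keeps every factor well defined, and nonnegativity is preserved automatically since each update multiplies by a nonnegative factor; Theorem 2's updates can be exercised the same way by tracking the divergence instead of the squared error.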
As long as this number is sufficiently small, the update should reduce ||V - WH||. \n\nNow if we diagonally rescale the variables and set \n\nη_aμ = H_aμ / (W^T W H)_aμ,  (7) \n\nthen we obtain the update rule for H that is given in Theorem 1. Note that this rescaling results in a multiplicative factor with the positive component of the gradient in the denominator and the absolute value of the negative component in the numerator of the factor. \n\nFor the divergence, diagonally rescaled gradient descent takes the form \n\nH_aμ ← H_aμ + η_aμ [ Σ_i W_ia V_iμ/(WH)_iμ - Σ_i W_ia ].  (8) \n\nAgain, if the η_aμ are small and positive, this update should reduce D(V||WH). If we now set \n\nη_aμ = H_aμ / Σ_i W_ia,  (9) \n\nthen we obtain the update rule for H that is given in Theorem 2. This rescaling can also be interpreted as a multiplicative rule with the positive component of the gradient in the denominator and the negative component as the numerator of the multiplicative factor. \n\nSince our choices for η_aμ are not small, it may seem that there is no guarantee that such a rescaled gradient descent should cause the cost function to decrease. Surprisingly, this is indeed the case, as shown in the next section. \n\n6 Proofs of convergence \n\nTo prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16]. \n\nDefinition 1 G(h, h') is an auxiliary function for F(h) if the conditions \n\nG(h, h') ≥ F(h),  G(h, h) = F(h)  (10) \n\nare satisfied. \n\nThe auxiliary function is a useful concept because of the following lemma, which is also graphically illustrated in Fig. 1. \n\nLemma 1 If G is an auxiliary function, then F is nonincreasing under the update \n\nh^{t+1} = arg min_h G(h, h^t)  (11) \n\nProof: F(h^{t+1}) ≤ G(h^{t+1}, h^t) ≤ G(h^t, h^t) = F(h^t). ∎ \n\nNote that F(h^{t+1}) = F(h^t) only if h^t is a local minimum of G(h, h^t). 
If the derivatives of F exist and are continuous in a small neighborhood of h^t, this also implies that the derivatives ∇F(h^t) = 0. Thus, by iterating the update in Eq. (11) we obtain a sequence of estimates that converge to a local minimum h_min = arg min_h F(h) of the objective function: \n\nF(h_min) ≤ ... ≤ F(h^{t+1}) ≤ F(h^t) ≤ ... ≤ F(h^1) ≤ F(h^0)  (12) \n\nWe will show that by defining the appropriate auxiliary functions G(h, h^t) for both ||V - WH|| and D(V||WH), the update rules in Theorems 1 and 2 easily follow from Eq. (11). \n\nFigure 1: Minimizing the auxiliary function G(h, h^t) ≥ F(h) guarantees that F(h^{t+1}) ≤ F(h^t) for h^{t+1} = arg min_h G(h, h^t). \n\nLemma 2 If K(h^t) is the diagonal matrix \n\nK_ab(h^t) = δ_ab (W^T W h^t)_a / h^t_a  (13) \n\nthen \n\nG(h, h^t) = F(h^t) + (h - h^t)^T ∇F(h^t) + (1/2)(h - h^t)^T K(h^t)(h - h^t)  (14) \n\nis an auxiliary function for \n\nF(h) = (1/2) Σ_i ( v_i - Σ_a W_ia h_a )²  (15) \n\nProof: Since G(h, h) = F(h) is obvious, we need only show that G(h, h^t) ≥ F(h). To do this, we compare \n\nF(h) = F(h^t) + (h - h^t)^T ∇F(h^t) + (1/2)(h - h^t)^T (W^T W)(h - h^t)  (16) \n\nwith Eq. (14) to find that G(h, h^t) ≥ F(h) is equivalent to \n\n0 ≤ (h - h^t)^T [ K(h^t) - W^T W ](h - h^t)  (17) \n\nTo prove positive semidefiniteness, consider the matrix¹ \n\nM_ab(h^t) = h^t_a [ K(h^t) - W^T W ]_ab h^t_b  (18) \n\nwhich is just a rescaling of the components of K - W^T W. Then K - W^T W is positive semidefinite if and only if M is, and \n\nν^T M ν = Σ_ab ν_a M_ab ν_b  (19) \n= Σ_ab [ h^t_a (W^T W)_ab h^t_b ν_a² - ν_a h^t_a (W^T W)_ab h^t_b ν_b ]  (20) \n= Σ_ab (W^T W)_ab h^t_a h^t_b [ (1/2)ν_a² + (1/2)ν_b² - ν_a ν_b ]  (21) \n= (1/2) Σ_ab (W^T W)_ab h^t_a h^t_b (ν_a - ν_b)²  (22) \n≥ 0  (23) \n\n∎ \n\n¹One can also show that K - W^T W is positive semidefinite by considering the matrix K^{-1/2}(K - W^T W)K^{-1/2} = I - K^{-1/2} W^T W K^{-1/2}. Then ν = K^{1/2} h^t is a positive eigenvector of K^{-1/2} W^T W K^{-1/2} with unity eigenvalue, and application of the Frobenius-Perron theorem shows that Eq. (17) holds. 
We can now demonstrate the convergence of Theorem 1: \n\nProof of Theorem 1 Replacing G(h, h^t) in Eq. (11) by Eq. (14) results in the update rule: \n\nh^{t+1} = h^t - K(h^t)^{-1} ∇F(h^t)  (24) \n\nSince Eq. (14) is an auxiliary function, F is nonincreasing under this update rule, according to Lemma 1. Writing the components of this equation explicitly, we obtain \n\nh^{t+1}_a = h^t_a (W^T v)_a / (W^T W h^t)_a  (25) \n\nBy reversing the roles of W and H in Lemmas 1 and 2, F can similarly be shown to be nonincreasing under the update rules for W. ∎ \n\nWe now consider the following auxiliary function for the divergence cost function: \n\nLemma 3 Define \n\nG(h, h^t) = Σ_i ( v_i log v_i - v_i ) + Σ_ia W_ia h_a  (26) \n  - Σ_ia v_i [ W_ia h^t_a / Σ_b W_ib h^t_b ] ( log(W_ia h_a) - log[ W_ia h^t_a / Σ_b W_ib h^t_b ] )  (27) \n\nThis is an auxiliary function for \n\nF(h) = Σ_i ( v_i log[ v_i / Σ_a W_ia h_a ] - v_i + Σ_a W_ia h_a )  (28) \n\nProof: It is straightforward to verify that G(h, h) = F(h). To show that G(h, h^t) ≥ F(h), we use convexity of the log function to derive the inequality \n\n-log Σ_a W_ia h_a ≤ -Σ_a α_a log( W_ia h_a / α_a )  (29) \n\nwhich holds for all nonnegative α_a that sum to unity. Setting \n\nα_a = W_ia h^t_a / Σ_b W_ib h^t_b  (30) \n\nwe obtain \n\n-log Σ_a W_ia h_a ≤ -Σ_a [ W_ia h^t_a / Σ_b W_ib h^t_b ] ( log(W_ia h_a) - log[ W_ia h^t_a / Σ_b W_ib h^t_b ] )  (31) \n\nFrom this inequality it follows that F(h) ≤ G(h, h^t). ∎ \n\nTheorem 2 then follows from the application of Lemma 1: \n\nProof of Theorem 2: The minimum of G(h, h^t) with respect to h is determined by setting the gradient to zero: \n\ndG(h, h^t)/dh_a = -Σ_i v_i [ W_ia h^t_a / Σ_b W_ib h^t_b ] (1/h_a) + Σ_i W_ia = 0  (32) \n\nThus, the update rule of Eq. 
(11) takes the form \n\nh^{t+1}_a = [ h^t_a / Σ_k W_ka ] Σ_i [ v_i / Σ_b W_ib h^t_b ] W_ia  (33) \n\nSince G is an auxiliary function, F in Eq. (28) is nonincreasing under this update. Rewritten in matrix form, this is equivalent to the update rule in Eq. (5). By reversing the roles of H and W, the update rule for W can similarly be shown to be nonincreasing. ∎ \n\n7 Discussion \n\nWe have shown that application of the update rules in Eqs. (4) and (5) is guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence proofs rely upon defining an appropriate auxiliary function. We are currently working to generalize these theorems to more complex constraints. The update rules themselves are extremely easy to implement computationally, and will hopefully be utilized by others for a wide variety of applications. \n\nWe acknowledge the support of Bell Laboratories. We would also like to thank Carlos Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam Roweis, Larry Saul, and Margaret Wright for helpful discussions. \n\nReferences \n\n[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag. \n[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71-86. \n[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad. Press. \n[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings of the Conference on Neural Information Processing Systems 9, 515-521. \n[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791. \n[6] Field, DJ (1994). What is the goal of sensory coding? Neural Comput. 6, 559-601. \n[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. 
The Handbook of Brain Theory and Neural Networks, 895-898. (MIT Press, Cambridge, MA). \n[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical Recipes: The Art of Scientific Computing. (Cambridge University Press, Cambridge, England). \n[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. MI-2, 113-122. \n[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55-59. \n[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. J. 74, 745-754. \n[12] Bouman, CA & Sauer, K (1996). A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Proc. 5, 480-492. \n[13] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analysis. Chemometr. Intell. Lab. 37, 23-35. \n[14] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132, 1-64. \n[15] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39, 1-38. \n[16] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel (eds). Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 81-89. ACL Press. \n", "award": [], "sourceid": 1861, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}