{"title": "Generalized Nonnegative Matrix Approximations with Bregman Divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 283, "page_last": 290, "abstract": null, "full_text": "Generalized Nonnegative Matrix Approximations with Bregman Divergences\n\nInderjit S. Dhillon Suvrit Sra Dept. of Computer Sciences The Univ. of Texas at Austin Austin, TX 78712. {inderjit,suvrit}@cs.utexas.edu\n\nAbstract\nNonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of \"link\" functions for modeling nonlinear relationships are also discussed.\n\n1\n\nIntroduction\n\nNonnegative matrix approximation (NNMA) is a method for dimensionality reduction and data analysis that has gained favor over the past few years. NNMA has previously been called positive matrix factorization [13] and nonnegative matrix factorization1 [12]. Assume that a1 , . . . , aN are N nonnegative input (M -dimensional) vectors. We organize these vectors as the columns of a nonnegative data matrix . A a1 a2 . . . 
a N NNMA seeks a small set of K nonnegative representative vectors b1 , . . . , bK that can be nonnegatively (or conically) combined to approximate the input vectors ai . That is, an kK ckn bk , 1 n N,\n\n=1\n\n1 We use the word approximation instead of factorization to emphasize the inexactness of the process since, the input A is approximated by B C .\n\n\f\nwhere the combining coefficientsnckn are restricted to be nonnegative. If ckn and bk are unrestricted, and we minimize an - B cn 2, the Truncated Singular Value Decomposition (TSVD) of A yields the optimal bk and ckn values. If the bk are unrestricted, but the coefficient vectors cn are restricted to be indicator vectors, then we obtain the problem of hard-clustering (See [16, Chapter 8] for related discussion regarding different constraints on cn and bk ). In this paper we consider problems where all involved matrices are nonnegative. For many practical problems nonnegativity is a natural requirement. For example, color intensities, chemical concentrations, frequency counts etc., are all nonnegative entities, and approximating their measurements by nonnegative representations leads to greater interpretability. NNMA has found a significant number of applications, not only due to increased interpretability, but also because admitting only nonnegative combinations of the bk leads to sparse representations. This paper contributes to the algorithmic advancement of NNMA by generalizing the problem significantly, and by deriving efficient algorithms based on multiplicative updates for the generalized problems. The scope of this paper is primarily on generic methods for NNMA, rather than on specific applications. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms, which seek to minimize Bregman divergences between the nonnegative input A and its approximation. 
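The approximation model above can be sketched numerically in a few lines of NumPy (a minimal illustration with arbitrarily chosen dimensions, not part of the original derivation): conic combinations of nonnegative parts are automatically nonnegative, and the resulting approximation has rank at most K.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 10, 3
B = rng.random((M, K))   # columns b_1, ..., b_K: nonnegative "parts"
C = rng.random((K, N))   # nonnegative combining coefficients c_kn
A_hat = B @ C            # column n is sum_k c_kn * b_k

# Conic combinations of nonnegative vectors stay nonnegative,
# and the approximation has rank at most K.
assert np.all(A_hat >= 0)
assert np.linalg.matrix_rank(A_hat) <= K
```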
In addition, we discuss the use of penalty functions for incorporating constraints other than nonnegativity into the problem. Further, we illustrate an interesting extension of our algorithms for handling nonlinear relationships through the use of \"link\" functions.\n\n2 Problems\n\nGiven a nonnegative matrix A as input, the classical NNMA problem is to approximate it by a lower-rank nonnegative matrix of the form BC, where B = [b_1, ..., b_K] and C = [c_1, ..., c_N] are themselves nonnegative. That is, we seek the approximation\n\nA ≈ BC,  where B, C ≥ 0.  (2.1)\n\nWe judge the goodness of the approximation in (2.1) by using a general class of distortion measures called Bregman divergences. For any strictly convex function φ: S ⊆ R → R that has a continuous first derivative, the corresponding Bregman divergence D_φ: S × int(S) → R_+ is defined as\n\nD_φ(x, y) ≜ φ(x) − φ(y) − φ′(y)(x − y),\n\nwhere int(S) is the interior of the set S [1, 2]. Bregman divergences are nonnegative, convex in the first argument, and zero if and only if x = y. These divergences play an important role in convex optimization [2]. In the sequel we consider only separable Bregman divergences, i.e., D_φ(X, Y) = Σ_ij D_φ(x_ij, y_ij). We further require x_ij, y_ij ∈ dom φ ⊆ R_+. Formally, the resulting generalized nonnegative matrix approximation problems are:\n\nmin_{B,C ≥ 0} D_φ(BC, A) + α(B) + β(C),  (2.2)\n\nmin_{B,C ≥ 0} D_φ(A, BC) + α(B) + β(C).  (2.3)\n\nThe functions α and β serve as penalty functions, and they allow us to enforce regularization (or other constraints) on B and C. We consider both (2.2) and (2.3) since Bregman divergences are generally asymmetric. Table 1 gives a small sample of NNMA problems to illustrate the breadth of our formulation.\n\n3 Algorithms\n\nIn this section we present algorithms that seek to optimize (2.2) and (2.3). Our algorithms are iterative in nature, and are directly inspired by the efficient algorithms of Lee and Seung [11].
Appealing properties include ease of implementation and computational efficiency.\n\nDivergence D_φ | φ(x) | α | β | Remarks\n‖A − BC‖²_F | ½x² | 0 | 0 | Lee and Seung [11, 12]\n‖A − BC‖²_F | ½x² | 0 | λ1ᵀC1 | Hoyer [10]\n‖W ⊙ (A − BC)‖²_F | ½x² | 0 | 0 | Paatero and Tapper [13]\nKL(A, BC) | x log x | 0 | 0 | Lee and Seung [11]\nKL(A, WBC) | x log x | 0 | 0 | Guillamet et al. [9]\nKL(A, BC) | x log x | c₁1ᵀBᵀB1 | −c₂‖C‖²_F | Feng et al. [8]\nD_φ(A, W₁BCW₂) | φ(x) | α(B) | β(C) | Weighted NNMA (new)\n\nTable 1: Some example NNMA problems that may be obtained from (2.3). The corresponding asymmetric problems (2.2) have not been previously treated in the literature. KL(x, y) ≜ Σ_i (x_i log(x_i/y_i) − x_i + y_i) denotes the generalized KL-divergence (also called I-divergence).\n\nNote that the problems (2.2) and (2.3) are not jointly convex in B and C, so it is not easy to obtain globally optimal solutions in polynomial time. Our iterative procedures start by initializing B and C randomly or otherwise. Then, B and C are alternately updated until there is no further appreciable change in the objective function value.\n\n3.1 Algorithms for (2.2)\n\nWe utilize the concept of auxiliary functions [11] for our derivations. It is sufficient to illustrate our methods using a single column of C (or row of B), since our divergences are separable.\n\nDefinition 3.1 (Auxiliary function). A function G(c, c′) is called an auxiliary function for F(c) if:\n1. G(c, c) = F(c), and\n2. G(c, c′) ≥ F(c) for all c′.\n\nAuxiliary functions turn out to be useful due to the following lemma.\n\nLemma 3.2 (Iterative minimization). If G(c, c′) is an auxiliary function for F(c), then F is non-increasing under the update c^{t+1} = argmin_c G(c, c^t).\n\nProof. F(c^{t+1}) ≤ G(c^{t+1}, c^t) ≤ G(c^t, c^t) = F(c^t).\n\nAs can be observed, the iterative application of Lemma 3.2 leads to a monotonic decrease in the objective function value F(c).
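Lemma 3.2 can be illustrated numerically. The sketch below (our own toy data, not from the paper) takes the Frobenius case φ(x) = ½x², for which minimizing the auxiliary function yields the familiar Lee-Seung multiplicative rule, and verifies that F(c) never increases along the iteration.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.random((8, 3))        # fixed nonnegative basis
a = rng.random(8)             # column of A to be approximated
c = rng.random(3)             # current coefficient vector

F = lambda c: 0.5 * np.sum((B @ c - a) ** 2)

vals = [F(c)]
for _ in range(50):
    # c^{t+1} = argmin_c G(c, c^t); for phi(x) = x^2/2 this is the
    # multiplicative rule c <- c * (B^T a) / (B^T B c).
    c = c * (B.T @ a) / (B.T @ (B @ c))
    vals.append(F(c))

# Monotonic decrease, as guaranteed by Lemma 3.2.
assert all(v2 <= v1 + 1e-10 for v1, v2 in zip(vals, vals[1:]))
```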
For an algorithm that iteratively updates c in its quest to minimize F(c), the method for proving convergence boils down to the construction of an appropriate auxiliary function. Auxiliary functions have been used in many places before; see for example [5, 11]. We now construct simple auxiliary functions for (2.2) that yield multiplicative updates. To avoid clutter we drop the functions α and β from (2.2), noting that our methods can easily be extended to incorporate them.\n\nSuppose B is fixed and we wish to compute an updated column of C. We wish to minimize\n\nF(c) = D_φ(Bc, a),  (3.1)\n\nwhere a is the column of A corresponding to the column c of C. The lemma below shows how to construct an auxiliary function for (3.1). (For convenience of notation, we write φ′ for the derivative of φ throughout the rest of this section.)\n\nLemma 3.3 (Auxiliary function). The function\n\nG(c, c′) = Σ_ij λ_ij φ(b_ij c_j / λ_ij) − Σ_i ( φ(a_i) + φ′(a_i)((Bc)_i − a_i) ),  (3.2)\n\nwith λ_ij = (b_ij c′_j) / (Σ_l b_il c′_l), is an auxiliary function for (3.1). Note that by definition Σ_j λ_ij = 1, and as both b_ij and c′_j are nonnegative, λ_ij ≥ 0.\n\nProof. It is easy to verify that G(c, c) = F(c). Since Σ_j λ_ij = 1 and λ_ij ≥ 0, using the convexity of φ we conclude that\n\nF(c) = Σ_i ( φ(Σ_j b_ij c_j) − φ(a_i) − φ′(a_i)((Bc)_i − a_i) )\n≤ Σ_ij λ_ij φ(b_ij c_j / λ_ij) − Σ_i ( φ(a_i) + φ′(a_i)((Bc)_i − a_i) ) = G(c, c′).\n\nTo obtain the update, we minimize G(c, c′) w.r.t. c. Let ∇φ(x) denote the vector [φ′(x_1), ..., φ′(x_n)]ᵀ. Since b_ip c_p / λ_ip = (Bc′)_i c_p / c′_p, the partial derivative is\n\n∂G/∂c_p = Σ_i b_ip φ′( (Bc′)_i c_p / c′_p ) − (Bᵀ∇φ(a))_p.  (3.3)\n\nWe need to solve for c_p by setting ∂G/∂c_p = 0. Solving this equation analytically is not always possible. However, for a broad class of functions, we can obtain an analytic solution.
For example, if φ′ is multiplicative (i.e., φ′(xy) = φ′(x)φ′(y)), we obtain the following iterative update relations for b (a row of B) and c (see [7]):\n\nb_p ← b_p · (φ′)⁻¹( [∇φ(aᵀ)Cᵀ]_p / [∇φ(bᵀC)Cᵀ]_p ),  (3.4)\n\nc_p ← c_p · (φ′)⁻¹( [Bᵀ∇φ(a)]_p / [Bᵀ∇φ(Bc)]_p ).  (3.5)\n\nIt turns out that when φ is a convex function of Legendre type, (φ′)⁻¹ can be obtained as the derivative of the conjugate function φ* of φ, i.e., (φ′)⁻¹ = (φ*)′ [14].\n\nNote. (3.4) & (3.5) coincide with the updates derived by Lee and Seung [11] if φ(x) = ½x².\n\n3.1.1 Examples of New NNMA Problems\n\nWe illustrate the power of the generic auxiliary functions given above by deriving algorithms with multiplicative updates for some specific interesting problems. First we consider the problem that seeks to minimize the divergence\n\nKL(Bc, a) = Σ_i ( (Bc)_i log( (Bc)_i / a_i ) − (Bc)_i + a_i ),  B, c ≥ 0.  (3.6)\n\nLet φ(x) = x log x − x. Then φ′(x) = log x, and as φ′(xy) = φ′(x) + φ′(y), upon substituting into (3.3) and setting the result to zero we obtain\n\nΣ_i b_ip log( c_p (Bc′)_i / c′_p ) − Σ_i b_ip log a_i = 0\n⟹ (Bᵀ1)_p log(c_p / c′_p) = [Bᵀ log a − Bᵀ log(Bc′)]_p\n⟹ c_p = c′_p · exp( [Bᵀ log( a / (Bc′) )]_p / [Bᵀ1]_p ).\n\nThe update for b can be derived similarly.\n\nConstrained NNMA. Next we consider NNMA problems that have additional constraints. We illustrate our ideas on a problem with linear constraints:\n\nmin_c D_φ(Bc, a)  s.t. Pc ≤ 0, c ≥ 0.  (3.7)\n\nWe can solve problem (3.7) using our method by making use of an appropriate (differentiable) penalty function that enforces Pc ≤ 0. We consider\n\nF(c) = D_φ(Bc, a) + ρ‖max(0, Pc)‖²,  (3.8)\n\nwhere ρ > 0 is some penalty constant. Assuming φ′ is multiplicative and following the auxiliary function technique described above, we obtain the following update for c:\n\nc_k ← c_k · (φ′)⁻¹( ( [Bᵀ∇φ(a)]_k − ρ[Pᵀ(Pc)₊]_k ) / [Bᵀ∇φ(Bc)]_k ),\n\nwhere (Pc)₊ = max(0, Pc). Note that care must be taken to ensure that the addition of this penalty term does not violate the nonnegativity of c, and that the argument of (φ′)⁻¹ lies in its domain.\n\nRemarks.
Incorporating additional constraints into (3.6) is, however, easier, since the exponential updates ensure nonnegativity. Given 1ᵀa = 1, with appropriate penalty functions, our solution to (3.6) can be utilized for maximizing the entropy of Bc subject to linear or nonlinear constraints on c.\n\nNonlinear models with \"link\" functions. If A ≈ h(BC), where h is a \"link\" function that models a nonlinear relationship between A and the approximant BC, we may wish to minimize D_φ(h(BC), A). We can easily extend our methods to handle this case for appropriate h. Recall that the auxiliary function that we used depended upon the convexity of φ. Thus, if (φ ∘ h) is a convex function whose derivative is \"factorizable,\" then we can easily derive algorithms for this problem with link functions. We exclude explicit examples for lack of space and refer the reader to [7] for further details.\n\n3.2 Algorithms using KKT conditions\n\nWe now derive efficient multiplicative update relations for (2.3), and these updates turn out to be simpler than those for (2.2). To avoid clutter, we describe our methods with α ≡ 0 and β ≡ 0, noting that if α and β are differentiable, then it is easy to incorporate them in our derivations. For convenience we use ψ(x) to denote ∇²φ(x) for the rest of this section. Using matrix algebra, one can show that the gradients of D_φ(A, BC) w.r.t. B and C are\n\n∇_B D_φ(A, BC) = ( ψ(BC) ⊙ (BC − A) ) Cᵀ,\n∇_C D_φ(A, BC) = Bᵀ ( ψ(BC) ⊙ (BC − A) ),\n\nwhere ⊙ denotes the elementwise or Hadamard product, and ψ is applied elementwise to BC. According to the KKT conditions, there exist Lagrange multiplier matrices Λ = [λ_mk] ≥ 0 and M = [μ_kn] ≥ 0 such that\n\n∇_B D_φ(A, BC) = Λ,  ∇_C D_φ(A, BC) = M,  (3.9a)\nλ_mk b_mk = μ_kn c_kn = 0.  (3.9b)\n\nWriting out the gradient ∇_B D_φ(A, BC) elementwise, multiplying by b_mk, and making use of (3.9a,b), we obtain\n\n[ ( ψ(BC) ⊙ (BC − A) ) Cᵀ ]_mk b_mk = λ_mk b_mk = 0,\n\nwhich suggests the iterative scheme\n\nb_mk ← b_mk · [ ( ψ(BC) ⊙ A ) Cᵀ ]_mk / [ ( ψ(BC) ⊙ (BC) ) Cᵀ ]_mk.  (3.10)\n\nProceeding in a similar fashion we obtain the analogous iterative formula for c_kn:\n\nc_kn ← c_kn · [ Bᵀ( ψ(BC) ⊙ A ) ]_kn / [ Bᵀ( ψ(BC) ⊙ (BC) ) ]_kn.  (3.11)\n\n3.2.1 Examples of New and Old NNMA Problems as Special Cases\n\nWe now illustrate the power of our approach by showing how one can easily obtain iterative update relations for many NNMA problems, including known and new problems. For more examples and further generalizations we refer the reader to [7].\n\nLee and Seung's Algorithms. Let α ≡ 0, β ≡ 0. If we set φ(x) = ½x² or φ(x) = x log x, then (3.10) and (3.11) reduce to the Frobenius norm and KL-divergence update rules originally derived by Lee and Seung [11].\n\nElementwise weighted distortion. Here we wish to minimize ‖W ⊙ (A − BC)‖²_F. Using the substitutions X → W ⊙ X and A → W ⊙ A in (3.10) and (3.11), one obtains\n\nB ← B ⊙ ( (W ⊙ A) Cᵀ ) / ( (W ⊙ (BC)) Cᵀ ),  C ← C ⊙ ( Bᵀ(W ⊙ A) ) / ( Bᵀ(W ⊙ (BC)) ).\n\nThese iterative updates are significantly simpler than the PMF algorithms of [13].\n\nThe Multifactor NNMA Problem (new). The above ideas can be extended to the multifactor NNMA problem that seeks to minimize the divergence D_φ(A, B₁B₂⋯B_R), where all matrices involved are nonnegative (see [7]). A typical usage of the multifactor NNMA problem would be to obtain a three-factor NNMA, namely A ≈ RBC. Such an approximation is closely tied to the problem of co-clustering [3], and can be used to produce relaxed co-clustering solutions [7].\n\nWeighted NNMA Problem (new). We can follow the same derivation method as above (based on the KKT conditions) to obtain multiplicative updates for the weighted NNMA problem: min D_φ(A, W₁BCW₂), where W₁ and W₂ are nonnegative (and nonsingular) weight matrices. The work of [9] is a special case, as mentioned in Table 1. Please refer to [7] for more details.\n\n4 Experiments and Discussion\n\nWe have looked at generic algorithms for minimizing Bregman divergences between the input and its approximation.
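As a sanity check on the derivations of Section 3.2, the KKT-based updates (3.10) and (3.11) translate directly into code. The sketch below is ours (function and parameter names are assumptions, not from the paper): ψ = φ″ is supplied by the caller, and instantiating ψ ≡ 1 recovers the Frobenius-norm problem.

```python
import numpy as np

def nnma_kkt(A, K, psi, iters=200, seed=0):
    """Alternate the multiplicative updates (3.10)-(3.11) for
    min_{B,C >= 0} D_phi(A, BC); psi = phi'' is applied elementwise to BC."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    B = rng.random((M, K)) + 0.1          # strictly positive start
    C = rng.random((K, N)) + 0.1
    for _ in range(iters):
        P = psi(B @ C)
        B *= ((P * A) @ C.T) / ((P * (B @ C)) @ C.T)      # update (3.10)
        P = psi(B @ C)
        C *= (B.T @ (P * A)) / (B.T @ (P * (B @ C)))      # update (3.11)
    return B, C

# psi = 1 corresponds to phi(x) = x^2/2, i.e. min ||A - BC||_F^2.
A = np.random.default_rng(2).random((20, 8))
B, C = nnma_kkt(A, K=4, psi=np.ones_like)
assert np.all(B >= 0) and np.all(C >= 0)
assert np.linalg.norm(A - B @ C) < np.linalg.norm(A)
```

The updates stay nonnegative by construction, since each factor is multiplied by a ratio of nonnegative quantities.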
One important question arises: which Bregman divergence should one use for a given problem? Consider the following factor-analytic model: A = BC + N, where N represents some additive noise present in the measurements A, and the aim is to recover B and C. If we assume that the noise is distributed according to some member of the exponential family, then minimizing the corresponding Bregman divergence [1] is appropriate. For example, if the noise is modeled as i.i.d. Gaussian noise, then the Frobenius norm based problem is natural.\n\nAnother question is: which version of the problem should we use, (2.2) or (2.3)? For φ(x) = ½x², both problems coincide. For other φ, the choice between (2.2) and (2.3) can be guided by computational issues or the sparsity pattern of A. Clearly, further work is needed to answer this question in more detail. Some other open problems involve characterizing the class of minimization problems to which the iterative methods of Section 3.2 may be applied. For example, determining the class of functions h for which these methods may be used to minimize D_φ(A, h(BC)). Other possible methods for solving both (2.2) and (2.3), such as the use of alternating projections (AP) for NNMA, also merit study.\n\nOur methods for (2.2) decrease the objective function monotonically (by construction). However, we did not demonstrate such a guarantee for the updates (3.10) & (3.11). Figure 1 offers encouraging empirical evidence in favor of a monotonic behavior of these updates. It is still an open problem to formally prove this monotonic decrease. Preliminary results that yield new monotonicity proofs for the Frobenius norm and KL-divergence NNMA problems may be found in [7].
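The empirical monotonicity just described is easy to probe. The sketch below is our own check (dimensions chosen to match Figure 1): it runs updates (3.10)-(3.11) for φ(x) = x log x, i.e. ψ(x) = 1/x, and verifies that the generalized KL objective never increases over 100 iterations.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((20, 8)) + 0.1          # random nonnegative input, as in Figure 1
B = rng.random((20, 4)) + 0.1
C = rng.random((4, 8)) + 0.1

def kl(A, Y):
    """Generalized KL-divergence KL(A, Y) = sum a log(a/y) - a + y."""
    return np.sum(A * np.log(A / Y) - A + Y)

vals = [kl(A, B @ C)]
for _ in range(100):
    # (3.10)-(3.11) with psi(x) = 1/x, i.e. phi(x) = x log x.
    B *= ((A / (B @ C)) @ C.T) / (np.ones_like(A) @ C.T)
    C *= (B.T @ (A / (B @ C))) / (B.T @ np.ones_like(A))
    vals.append(kl(A, B @ C))

# Empirically non-increasing, in line with Figure 1.
assert all(v2 <= v1 + 1e-8 for v1, v2 in zip(vals, vals[1:]))
```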
Figure 1: Objective function values over 100 iterations for different NNMA problems (three panels: the PMF objective, φ(x) = −log x, and φ(x) = x log x − x; each plots the objective function value against the number of iterations). The input matrix A was a random 20 × 8 nonnegative matrix. Matrices B and C were 20 × 4 and 4 × 8, respectively.\n\nNNMA has been used in a large number of applications, a fact that attests to its importance and appeal. We believe that special cases of our generalized problems will prove to be useful for applications in data mining and machine learning.\n\n5 Related Work\n\nPaatero and Tapper [13] introduced NNMA as positive matrix factorization, and they aimed to minimize ‖W ⊙ (A − BC)‖_F, where W was a fixed nonnegative matrix of weights. NNMA remained confined to applications in environmetrics and chemometrics until the pioneering papers of Lee and Seung [11, 12] popularized the problem. Lee and Seung [11] provided simple and efficient algorithms for the NNMA problems that sought to minimize ‖A − BC‖_F and KL(A, BC). Lee and Seung called these problems nonnegative matrix factorization (NNMF), and their algorithms have inspired our generalizations. NNMA has been applied to a host of applications including text analysis, face/image recognition, language modeling, and speech processing, amongst others. We refer the reader to [7] for pointers to the literature on various applications of NNMA.\n\nSrebro and Jaakkola [15] discuss elementwise weighted low-rank approximations without any nonnegativity constraints. Collins et al.
[6] discuss algorithms for obtaining a low-rank approximation of the form A ≈ BC, where the loss functions are Bregman divergences; however, there is no restriction on B and C. More recently, Cichocki et al. [4] presented schemes for NNMA with Csiszár's divergences, though rigorous convergence proofs seem to be unavailable. Our approach of Section 3.2 also yields heuristic methods for minimizing Csiszár's divergences.\n\nAcknowledgments\n\nThis research was supported by NSF grant CCF-0431257, NSF Career Award ACI-0093404, and NSF-ITR award IIS-0325116.\n\nReferences\n\n[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 2004. SIAM.\n[2] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Numerical Mathematics and Scientific Computation. Oxford University Press, 1997.\n[3] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue based co-clustering of gene expression data. In Proc. 4th SIAM International Conference on Data Mining (SDM), pages 114–125, Florida, 2004. SIAM.\n[4] A. Cichocki, R. Zdunek, and S. Amari. Csiszár's divergences for non-negative matrix factorization: Family of new algorithms. In 6th Int. Conf. on ICA & BSS, USA, March 2006.\n[5] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost, and Bregman distances. In Thirteenth Annual Conference on COLT, 2000.\n[6] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal components analysis to the exponential family. In NIPS 2001, 2001.\n[7] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations. Technical report, Computer Sciences, University of Texas at Austin, 2005.\n[8] T. Feng, S. Z. Li, H.-Y. Shum, and H. Zhang. Local nonnegative matrix factorization as a visual representation. In Proceedings of the 2nd International Conference on Development and Learning, pages 178–193, Cambridge, MA, June 2002.
[9] D. Guillamet, M. Bressan, and J. Vitrià. A weighted nonnegative matrix factorization for local representations. In CVPR. IEEE, 2001.\n[10] P. O. Hoyer. Non-negative sparse coding. In Proc. IEEE Workshop on Neural Networks for Signal Processing, pages 557–565, 2002.\n[11] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000.\n[12] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, October 1999.\n[13] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.\n[14] R. T. Rockafellar. Convex Analysis. Princeton Univ. Press, 1970.\n[15] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proc. of 20th ICML, 2003.\n[16] J. A. Tropp. Topics in Sparse Approximation. PhD thesis, The Univ. of Texas at Austin, 2004.\n", "award": [], "sourceid": 2757, "authors": [{"given_name": "Suvrit", "family_name": "Sra", "institution": null}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": null}]}