{"title": "Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 5154, "page_last": 5164, "abstract": "Symmetric nonnegative matrix factorization (NMF)---a special but important class of the general NMF---is demonstrated to be useful for data analysis and in particular for various clustering tasks. Unfortunately, designing fast algorithms for Symmetric NMF is not as easy as for the nonsymmetric counterpart, the latter admitting the splitting property that allows efficient alternating-type algorithms. To overcome this issue, we transfer the symmetric NMF to a nonsymmetric one, then we can adopt the idea from the state-of-the-art algorithms for nonsymmetric NMF to design fast algorithms solving symmetric NMF. We rigorously establish that solving nonsymmetric reformulation returns a solution for symmetric NMF and then apply fast alternating based algorithms for the corresponding reformulated problem. Furthermore, we show these fast algorithms admit strong convergence guarantee in the sense that the generated sequence is convergent at least at a sublinear rate and it converges globally to a critical point of the symmetric NMF. 
We conduct experiments on both synthetic data and image clustering to support our result.", "full_text": "Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization\n\nZhihui Zhu∗\nMathematical Institute for Data Science\nJohns Hopkins University\nBaltimore, MD, USA\nzzhu29@jhu.edu\n\nXiao Li∗\nDepartment of Electronic Engineering\nThe Chinese University of Hong Kong\nShatin, NT, Hong Kong\nxli@ee.cuhk.edu.hk\n\nKai Liu\nDepartment of Computer Science\nColorado School of Mines\nGolden, CO, USA\nkaliu@mines.edu\n\nQiuwei Li\nDepartment of Electrical Engineering\nColorado School of Mines\nGolden, CO, USA\nqiuli@mines.edu\n\nAbstract\n\nSymmetric nonnegative matrix factorization (NMF)\u2014a special but important class of the general NMF\u2014is demonstrated to be useful for data analysis and in particular for various clustering tasks. Unfortunately, designing fast algorithms for symmetric NMF is not as easy as for its nonsymmetric counterpart, the latter admitting the splitting property that allows efficient alternating-type algorithms. To overcome this issue, we transfer the symmetric NMF to a nonsymmetric one, and then adopt ideas from state-of-the-art algorithms for nonsymmetric NMF to design fast algorithms for solving symmetric NMF. We rigorously establish that solving the nonsymmetric reformulation returns a solution for symmetric NMF, and then apply fast alternating-based algorithms to the reformulated problem. Furthermore, we show that these fast algorithms admit a strong convergence guarantee, in the sense that the generated sequence is convergent at least at a sublinear rate and converges globally to a critical point of the symmetric NMF. 
We conduct experiments on both synthetic data and image clustering to support our results.\n\n1 Introduction\n\nGeneral nonnegative matrix factorization (NMF) refers to the following problem: given a matrix Y ∈ R^{n×m} and a factorization rank r, solve\n\nmin_{U∈R^{n×r}, V∈R^{m×r}} (1/2)‖Y − UV^T‖_F^2, s.t. U ≥ 0, V ≥ 0, (1)\n\nwhere U ≥ 0 means each element of U is nonnegative. NMF has been successfully used in applications such as face feature extraction [1, 2], document clustering [3], source separation [4] and many others [5]. Because of the ubiquitous applications of NMF, many efficient algorithms have been proposed for solving (1). Well-known algorithms include MUA [6], projected gradient descent [7], alternating nonnegative least squares (ANLS) [8], and hierarchical ALS (HALS) [9]. In particular, ANLS (which uses the block principal pivoting algorithm to very efficiently solve the nonnegative least squares subproblems) and HALS achieve state-of-the-art performance.\n\n∗Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOne special but important class of NMF, called symmetric NMF, requires the two factors U and V to be identical; i.e., it factorizes a PSD matrix X ∈ R^{n×n} by solving\n\nmin_{U∈R^{n×r}} (1/2)‖X − UU^T‖_F^2, s.t. U ≥ 0. (2)\n\nBy contrast, (1) is referred to as nonsymmetric NMF. Symmetric NMF (2) has its own applications in data analysis, machine learning and signal processing [10, 11, 12]. 
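Problem (2) can be prototyped directly with projected gradient descent; the sketch below is a minimal NumPy illustration, using the gradient 2(UU^T − X)U of the smooth loss for symmetric X. The step size, iteration count, and uniform initialization are hypothetical choices for demonstration, not settings from this paper.

```python
import numpy as np

def pgd_symnmf(X, r, eta=1e-3, iters=500, seed=0):
    """Projected gradient descent for min_{U >= 0} 0.5*||X - U U^T||_F^2.

    Illustrative only: `eta`, `iters`, and the uniform initialization are
    hypothetical choices, not settings from the paper.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(X.shape[0], r))
    for _ in range(iters):
        grad = 2.0 * (U @ U.T - X) @ U       # gradient of the smooth loss (X symmetric)
        U = np.maximum(U - eta * grad, 0.0)  # Euclidean projection onto U >= 0
    return U

# Sanity check on a synthetic nonnegative PSD matrix X = U* (U*)^T.
rng = np.random.default_rng(1)
Ustar = np.abs(rng.standard_normal((20, 3)))
X = Ustar @ Ustar.T
U = pgd_symnmf(X, r=3)
rel_err = np.linalg.norm(X - U @ U.T, "fro") ** 2 / np.linalg.norm(X, "fro") ** 2
```

Each iteration costs one dense gradient evaluation plus an entrywise clamp; as discussed next, this simplicity comes at the price of slow convergence.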
In particular, symmetric NMF is equivalent to the classical kernel K-means clustering in [11] and is inherently suitable for clustering nonlinearly separable data from a similarity matrix [10].\n\nAt first glance, since (2) has only one variable, one may think it is easier to solve (2) than (1), or at least that (2) can be solved by directly utilizing efficient algorithms developed for nonsymmetric NMF. However, the state-of-the-art alternating-based algorithms (such as ANLS and HALS) for nonsymmetric NMF utilize the splitting property of (1) and thus cannot be used for (2). On the other hand, first-order methods such as projected gradient descent (PGD) for solving (2) suffer from very slow convergence. As a proof of concept, we show in Figure 1 the convergence of PGD for solving symmetric NMF and, as a comparison, the convergence of gradient descent (GD) for solving a matrix factorization (MF) (i.e., (2) without the nonnegative constraint), which is proved to admit linear convergence [13, 14]. This phenomenon also appears in nonsymmetric NMF and is the main motivation for the many efficient algorithms such as ANLS and HALS.\n\nFigure 1: Convergence of MF by GD and symmetric NMF by PGD with the same initialization.\n\nMain Contributions This paper addresses the above issue by considering a simple framework that allows us to design alternating-type algorithms for solving symmetric NMF, similar to the alternating minimization algorithms (such as ANLS and HALS) developed for nonsymmetric NMF. The main contributions of this paper are summarized as follows.\n• Motivated by the splitting property exploited in the ANLS and HALS algorithms, we split the bilinear form of U into two different factors and transfer the symmetric NMF into a nonsymmetric one:\n\nmin_{U,V} f(U, V) = (1/2)‖X − UV^T‖_F^2 + (λ/2)‖U − V‖_F^2, s.t. 
U ≥ 0, V ≥ 0, (3)\n\nwhere the regularizer ‖U − V‖_F^2 is introduced to force the two factors to be identical and λ > 0 is a balancing parameter. Our first main contribution is to guarantee that any critical point of (3) with bounded energy satisfies U = V for a sufficiently large λ. We further show that any local-search algorithm with a decreasing property is guaranteed to solve (2) by targeting (3). To the best of our knowledge, this is the first work to rigorously establish that symmetric NMF can be efficiently solved by fast alternating-type algorithms.\n• Our second contribution is to provide a convergence analysis for the proposed alternating-based algorithms solving (3). By exploiting the specific structure of (3), we show that our proposed algorithms (without any proximal terms and without any additional constraints on U and V except nonnegativity) are convergent. Moreover, we establish point-wise global sequence convergence and show that the proposed alternating-type algorithms achieve at least a global sublinear convergence rate. Our sequence convergence result provides theoretical guarantees for the practical use of alternating-based algorithms that solve (3) directly, without the proximal terms or additional constraints on the factors that are usually needed to guarantee convergence.\n\nRelated Work Due to the slow convergence of PGD for symmetric NMF, different algorithms have been proposed to efficiently solve (2), either directly or, similar to (3), by splitting the two factors. Vandaele et al. [15] proposed an alternating algorithm that cyclically optimizes over each element in U by solving a nonnegatively constrained nonconvex univariate fourth-order polynomial minimization. A quasi-Newton second-order method was used in [10] to directly solve the symmetric NMF optimization problem (2). 
However, both the element-wise updating approach and the second-order method are observed to be computationally expensive in large-scale applications. We will illustrate this with experiments in Section 4.\n\nThe idea of solving symmetric NMF by targeting (3) also appears in [10]. However, aside from an algorithm for solving (3), no formal guarantee (such as that solving (3) returns a solution of (2)) was provided in [10]. Lu et al. [16] considered an alternative problem to (3) that also enjoys the splitting property and utilized the alternating direction method of multipliers (ADMM) to tackle the corresponding problem with the equality constraint U = V. Unlike the sequence convergence guarantee of algorithms solving (3), the ADMM in [16] is only guaranteed to have subsequence convergence, with an additional proximal term2 and a constraint on the boundedness of the columns of U, rendering the problem hard to solve.\nA different line of research that may also be related to our work is the classical literature on exact penalty methods [17, 18].3 On one hand, our work shares a similar spirit with the exact penalty methods, as both approaches attempt to transfer constraints as penalties into the objective function. 
On the other hand, our approach differs from the exact penalty methods in the following three aspects: 1) as stated in [18], most results on exact penalty methods require the feasible set to be compact, which is not satisfied in our problem (2); 2) the exact penalty methods replace all the constraints with certain penalty functions, while our reformulated problem (3) still has the nonnegative constraint; 3) in general the exact penalty methods use either nonsmooth penalty functions or differentiable exact penalty functions based on replacing the conventional multipliers with continuously differentiable multiplier functions, while the penalty in (3) is smooth and the parameter λ is fixed and independent of U and V. We note that our result builds on the specific structure of the objective function in (3), while the exact penalty methods address general nonlinear programming problems.\nFinally, our work is also closely related to recent advances in convergence analysis for alternating minimization. A sequence convergence result for general alternating minimization with an additional proximal term was provided in [19]. When specialized to NMF, as pointed out in [20], with the aid of this additional proximal term (and also an additional constraint to bound the factors), the convergence of ANLS and HALS can be established from [19, 21]. With a similar proximal term and constraint, the subsequence convergence of ADMM for symmetric NMF was obtained in [16]. Although the convergence of these algorithms is observed without the proximal term and constraint (which are also not used in practice), these are in general necessary to formally establish convergence. For alternating minimization methods solving (3), we show that, without any additional constraint, the factors are indeed bounded through the iterations, and that, without the proximal term, the algorithms admit a sufficient decrease property. 
These observations then guarantee the sequence convergence of the original algorithms as they are used in practice. The convergence result for algorithms solving (3) is not limited to alternating-type algorithms, though we only consider these as they achieve state-of-the-art performance.\n\n2 Guarantee when Transferring Symmetric NMF to Nonsymmetric NMF\n\nCompared with (2), at first glance (3) is slightly more complicated as it has one more variable. However, because of this new variable, f(U, V) is now strongly convex with respect to either U or V, though it is still nonconvex in the joint variable (U, V). Moreover, the two decision variables U and V in (3) are well split, as in nonsymmetric NMF. This observation suggests an interesting and useful fact: (3) can be solved by tailored alternating minimization type algorithms developed for tackling nonsymmetric NMF. On the other hand, a theoretical question raised by the regularized form (3) is that we are not guaranteed U = V, and hence solving (3) is not obviously equivalent to solving (2). In this section, we provide a formal guarantee that solving (3) (to a critical point) indeed gives a critical point of (2). Note that (3) is nonconvex and many local search algorithms are only guaranteed to converge to a critical point rather than a global solution. Thus, our goal is to guarantee that any critical point the algorithms may converge to has identical factors U = V, and further that U is a solution (critical point) of the original symmetric NMF (2).\nBefore stating the formal result, as an intuitive example, we first consider a simple scalar case where f(u, v) = (1 − uv)^2/2 + λ(u − v)^2/2. Its partial derivatives are ∂_u f(u, v) = (uv − 1)v + λ(u − v) and ∂_v f(u, v) = (uv − 1)u − λ(u − v). 
Thus, any critical point of f satisfies (uv − 1)v + λ(u − v) = 0 and (uv − 1)u − λ(u − v) = 0, which further implies (u − v)(2λ + 1 − uv) = 0. Therefore, any critical point (u, v) such that |uv| < 2λ + 1 must satisfy u = v. Although (3) is more complicated than this example as it also has the nonnegative constraint, the following result establishes a similar guarantee for (3).4\n\n2 In the k-th iteration, a proximal term ‖U − U_{k−1}‖_F^2 is added to the objective function when updating U.\n3 We gratefully acknowledge the anonymous area chair for pointing out this related work.\n\nTheorem 1. Let (U⋆, V⋆) be any critical point of (3) satisfying ‖U⋆V⋆^T‖ < 2λ + σ_n(X), where σ_n(·) denotes the n-th largest singular value. Then U⋆ = V⋆ and U⋆ is a critical point of (2).\n\nTowards interpreting Theorem 1, we note that for any λ > 0, Theorem 1 ensures a certain region (whose size depends on λ) in which each critical point of (3) has identical factors and also returns a solution for the original symmetric NMF (2). This further suggests the opportunity of choosing an appropriate λ such that the corresponding region (i.e., all (U, V) such that ‖UV^T‖ < 2λ + σ_n(X)) contains all the possible points to which the algorithms may converge. Towards that end, the next result indicates that for any local search algorithm, if it decreases the objective function, then the iterates are bounded.\n\nLemma 1. For any local search algorithm solving (3) with initialization V_0 = U_0, U_0 ≥ 0, suppose it sequentially decreases the objective value. 
Then, for any k ≥ 0, the iterates (U_k, V_k) generated by this algorithm satisfy\n\n‖U_k‖_F^2 + ‖V_k‖_F^2 ≤ (1/λ + 2√r)‖X − U_0U_0^T‖_F^2 + 2√r‖X‖_F := B_0,\n‖U_kV_k^T‖_F ≤ ‖X − U_0V_0^T‖_F + ‖X‖_F. (4)\n\nTwo interesting facts about the iterates can be read from (4). The first inequality of (4) implies that both U_k and V_k are bounded, and the upper bound decays as λ increases. Specifically, as long as λ is not too close to zero, the RHS of (4) gives a meaningful bound, which will be used in the convergence analysis of local search algorithms in the next section. In terms of U_kV_k^T, the second inequality of (4) indicates that it is upper bounded by a quantity independent of λ. This suggests a key consequence: if an iterative algorithm is convergent and the iterates (U_k, V_k) converge to a critical point (U⋆, V⋆), then U⋆V⋆^T is also bounded, irrespective of the value of λ. Together with Theorem 1, this ensures that many local search algorithms can be utilized to find a critical point of (2) by choosing a sufficiently large λ.\n\nTheorem 2. Choose λ > (1/2)(‖X‖_2 + ‖X − U_0U_0^T‖_F − σ_n(X)) for (3), where ‖X‖_2 denotes the largest singular value of X. 
For any local search algorithm solving (3) with initialization V_0 = U_0, if it sequentially decreases the objective value, is convergent, and converges to a critical point (U⋆, V⋆) of (3), then U⋆ = V⋆ and U⋆ is also a critical point of (2).\n\nTheorem 2 indicates that instead of directly solving the symmetric NMF (2), one can solve (3) with a sufficiently large regularization parameter λ. The latter is very similar to the nonsymmetric NMF (1) and obeys a similar splitting property, which enables us to utilize efficient alternating-type algorithms. In the next section, we propose alternating-based algorithms for tackling (3) with strong guarantees on the descent property and convergence.\n\n3 Fast Algorithms for Symmetric NMF with Guaranteed Convergence\n\nIn the last section, we showed that the symmetric NMF (2) can be transferred to problem (3), the latter admitting a splitting property that enables us to design alternating-type algorithms to solve symmetric NMF. Specifically, we exploit the splitting property by adopting the main ideas of ANLS and HALS for nonsymmetric NMF to design fast algorithms for (3). Moreover, the objective function f in (3) is strongly convex with respect to U (or V) when V (or U) is fixed, because of the regularization term (λ/2)‖U − V‖_F^2. Together with Lemma 1, this ensures the strong descent property and the point-wise sequence convergence guarantees of the proposed alternating-type algorithms. 
With Theorem 2, we are then guaranteed that the algorithms converge to a critical point of the symmetric NMF (2).\n\n3.1 ANLS for symmetric NMF (SymANLS)\n\n4 All the proofs can be found in the supplementary material.\n\nAlgorithm 1 SymANLS\nInitialization: k = 1 and U_0 = V_0.\n1: while stop criterion not met do\n2: U_k = argmin_{U ≥ 0} (1/2)‖X − UV_{k−1}^T‖_F^2 + (λ/2)‖U − V_{k−1}‖_F^2;\n3: V_k = argmin_{V ≥ 0} (1/2)‖X − U_kV^T‖_F^2 + (λ/2)‖U_k − V‖_F^2;\n4: k = k + 1.\n5: end while\nOutput: factorization (U_k, V_k).\n\nANLS is an alternating-type algorithm customized for nonsymmetric NMF (1); its main idea is to keep one factor fixed and update the other by solving a nonnegatively constrained least squares problem. We use a similar idea for solving (3) and refer to the corresponding algorithm as SymANLS. Specifically, at the k-th iteration, SymANLS first updates U_k by\n\nU_k = argmin_{U∈R^{n×r}, U ≥ 0} (1/2)‖X − UV_{k−1}^T‖_F^2 + (λ/2)‖U − V_{k−1}‖_F^2. (5)\n\nV_k is then updated in a similar way. We depict the whole procedure of SymANLS in Algorithm 1. With respect to solving the subproblem (5), we first note that there exists a unique minimizer (i.e., U_k) for (5), as it involves a strongly convex objective function and a convex feasible region. However, because of the nonnegative constraint and unlike least squares, in general there is no closed-form solution for (5) unless r = 1. Fortunately, there exist many feasible methods to solve the nonnegatively constrained least squares problem, such as projected gradient descent, the active set method and projected Newton's method. 
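For illustration, the subproblem (5) decouples across the rows of U: row i solves min_{u ≥ 0} (1/2)‖x_i − V_{k−1}u‖_2^2 + (λ/2)‖u − v_i‖_2^2, which becomes an ordinary nonnegative least squares problem once √λ·I and √λ·v_i are stacked as extra rows. The sketch below uses SciPy's generic `nnls` solver in place of the far faster block principal pivoting method used in our experiments; it is a minimal illustration, not the experimental implementation.

```python
import numpy as np
from scipy.optimize import nnls

def anls_half_step(X, V, lam):
    """Solve subproblem (5) exactly:
    argmin_{U >= 0} 0.5*||X - U V^T||_F^2 + (lam/2)*||U - V||_F^2.

    Each row of U is an independent NNLS after the penalty is appended
    as sqrt(lam)*I extra rows of the design matrix.
    """
    n, r = X.shape[0], V.shape[1]
    A = np.vstack([V, np.sqrt(lam) * np.eye(r)])          # (n + r) x r
    U = np.zeros((n, r))
    for i in range(n):
        b = np.concatenate([X[i, :], np.sqrt(lam) * V[i, :]])
        U[i, :], _ = nnls(A, b)
    return U

def sym_anls(X, r, lam, iters=50, seed=0):
    """Alternate the two half-steps of Algorithm 1 from a common start."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(X.shape[0], r))
    V = U.copy()
    for _ in range(iters):
        U = anls_half_step(X, V, lam)    # update U with V fixed
        V = anls_half_step(X.T, U, lam)  # same subproblem with roles swapped
    return U, V
```

Because each half-step solves its subproblem exactly, the objective value of (3) is monotonically nonincreasing along the iterates, matching the descent property required by Theorem 2.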
Among these methods, a block principal pivoting method is remarkably efficient for tackling the subproblem (5) (and also the one for updating V) [8].\nGiven the specific structure of (3) (i.e., its objective function is strongly convex in each factor and its feasible region is convex), we first show that SymANLS monotonically decreases the function value at each iteration, as required in Theorem 2.\n\nLemma 2. Let {(U_k, V_k)} be the sequence of iterates generated by Algorithm 1. Then\n\nf(U_k, V_k) − f(U_{k+1}, V_{k+1}) ≥ (λ/2)(‖U_{k+1} − U_k‖_F^2 + ‖V_{k+1} − V_k‖_F^2).\n\nWe now give the following main convergence guarantee for Algorithm 1.\n\nTheorem 3 (Sequence convergence of Algorithm 1). Let {(U_k, V_k)} be the sequence generated by Algorithm 1. Then\n\nlim_{k→∞} (U_k, V_k) = (U⋆, V⋆),\n\nwhere (U⋆, V⋆) is a critical point of (3). Furthermore, the convergence rate is at least sublinear.\n\nEquipped with the machinery developed above, the global sublinear sequence convergence of SymANLS to a critical point of symmetric NMF (2) is formally guaranteed in the following result, which is a direct consequence of Theorem 2, Lemma 2 and Theorem 3.\n\nCorollary 1 (Convergence of Algorithm 1 to a critical point of (2)). Suppose Algorithm 1 is initialized with V_0 = U_0. Choose\n\nλ > (1/2)(‖X‖_2 + ‖X − U_0U_0^T‖_F − σ_n(X)).\n\nLet {(U_k, V_k)} be the sequence generated by Algorithm 1. Then {(U_k, V_k)} is convergent and converges to (U⋆, V⋆) with U⋆ = V⋆ and U⋆ a critical point of (2). Furthermore, the convergence rate is at least sublinear.\n\nRemark. 
We emphasize that the specific structure of (3) enables Corollary 1 to dispense with any assumption on the boundedness of the iterates (U_k, V_k) and with the proximal term that is usually required for convergence analysis but not necessarily used in practice. By contrast, and as pointed out in [20], to provide a convergence guarantee for standard ANLS solving nonsymmetric NMF (1), one needs to modify it by adding an additional proximal term as well as an additional constraint to keep the factors bounded.\n\n3.2 HALS for symmetric NMF (SymHALS)\n\nAs stated before, due to the nonnegative constraint there is no closed-form solution for (5), although one may utilize efficient algorithms for solving it. By contrast, there does exist a closed-form solution when r = 1. HALS exploits this observation by splitting the pair of variables (U, V) into columns (u_1, ..., u_r, v_1, ..., v_r) and then optimizing column by column. We utilize a similar idea for solving (3). Specifically, rewrite UV^T = u_iv_i^T + Σ_{j≠i} u_jv_j^T and denote by\n\nX_i = X − Σ_{j≠i} u_jv_j^T\n\nthe factorization residual X − UV^T excluding u_iv_i^T. Now, minimizing the objective function f in (3) only with respect to u_i admits the closed-form solution\n\nargmin_{u_i ≥ 0} (1/2)‖X_i − u_iv_i^T‖_F^2 + (λ/2)‖u_i − v_i‖_2^2 = max((X_i + λI)v_i / (‖v_i‖_2^2 + λ), 0).\n\nA similar closed-form solution also holds when optimizing with respect to v_i. With this observation, we utilize an alternating-type minimization that at each step minimizes the objective function in (3) with respect to a single column of U or V, and we denote the corresponding algorithm SymHALS. 
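In NumPy terms, one full sweep of these column updates can be sketched as follows. The residual X_i is recomputed from scratch here purely for clarity; an efficient implementation maintains it recursively. The transposed residual in the v-update is the exact block minimizer; for symmetric X the distinction from the formula above is minor.

```python
import numpy as np

def symhals_sweep(X, U, V, lam):
    """One sweep of the column-wise closed-form updates
    u_i = max((X_i + lam*I) v_i / (||v_i||_2^2 + lam), 0)
    and the analogous update for v_i. Illustrative sketch only."""
    for i in range(U.shape[1]):
        # Residual excluding the rank-one term u_i v_i^T; it does not depend
        # on column i itself, so it stays valid for both updates below.
        Xi = X - U @ V.T + np.outer(U[:, i], V[:, i])
        vi = V[:, i]
        U[:, i] = np.maximum((Xi @ vi + lam * vi) / (vi @ vi + lam), 0.0)
        ui = U[:, i]
        V[:, i] = np.maximum((Xi.T @ ui + lam * ui) / (ui @ ui + lam), 0.0)
    return U, V
```

Since each column update is an exact minimization over that block (the Hessian of each block subproblem is a positive multiple of the identity, so clamping at zero is the exact constrained minimizer), the objective of (3) is nonincreasing across sweeps.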
We depict SymHALS in Algorithm 2, where the superscript k denotes the k-th iteration. Note that, to keep the presentation easy to follow, Algorithm 2 computes the residual directly as X_i^k = X − Σ_{j=1}^{i−1} u_j^k(v_j^k)^T − Σ_{j=i+1}^{r} u_j^{k−1}(v_j^{k−1})^T, which is not how it is done in practice. Instead, letting X_1^k = X − U^{k−1}(V^{k−1})^T, we can then update X_i^k with only the computation of u_i^k(v_i^k)^T by recursively reusing the previous residual. Details on the efficient implementation of SymHALS can be found in the supplementary material (see the corresponding Algorithm 3 in Section 5).\n\nAlgorithm 2 SymHALS\nInitialization: U^0, V^0, iteration k = 1.\n1: while stop criterion not met do\n2: for i = 1 : r do\n3: X_i^k = X − Σ_{j=1}^{i−1} u_j^k(v_j^k)^T − Σ_{j=i+1}^{r} u_j^{k−1}(v_j^{k−1})^T;\n4: u_i^k = argmin_{u_i ≥ 0} (1/2)‖X_i^k − u_i(v_i^{k−1})^T‖_F^2 + (λ/2)‖u_i − v_i^{k−1}‖_2^2 = max((X_i^k + λI)v_i^{k−1} / (‖v_i^{k−1}‖_2^2 + λ), 0);\n5: v_i^k = argmin_{v_i ≥ 0} (1/2)‖X_i^k − u_i^k v_i^T‖_F^2 + (λ/2)‖u_i^k − v_i‖_2^2 = max((X_i^k + λI)u_i^k / (‖u_i^k‖_2^2 + λ), 0);\n6: end for\n7: k = k + 1.\n8: end while\nOutput: factorization (U^k, V^k).\n\nSymHALS enjoys a similar descent property and convergence guarantee to SymANLS, as both are alternating-based algorithms. We therefore directly state the following result ensuring that Algorithm 2 converges to a critical point of symmetric NMF (2).\n\nCorollary 2 (Convergence of Algorithm 2 to a critical point of (2)). Suppose it is initialized with V_0 = U_0. 
Choose\n\nλ > (1/2)(‖X‖_2 + ‖X − U_0U_0^T‖_F − σ_n(X)).\n\nLet {(U_k, V_k)} be the sequence generated by Algorithm 2. Then {(U_k, V_k)} is convergent and converges to (U⋆, V⋆) with U⋆ = V⋆ and U⋆ being a critical point of (2). Furthermore, the convergence rate is at least sublinear.\n\nRemark. Similar to Corollary 1 for SymANLS, Corollary 2 makes no assumption on the boundedness of the iterates (U_k, V_k), and it establishes a convergence guarantee for SymHALS without the aid of a proximal term. By contrast, to establish the subsequence convergence of classical HALS solving nonsymmetric NMF [9, 22] (i.e., setting λ = 0 in SymHALS), one needs the assumption that every column of (U_k, V_k) is nonzero through all iterations. Though such an assumption can be satisfied by imposing additional constraints, doing so solves a slightly different problem than the original nonsymmetric NMF (1). SymHALS overcomes this issue and admits sequence convergence because of the additional regularizer in (3).\nWe end this section by noting that both Theorem 2 and the convergence guarantees in Corollary 1 and Corollary 2 can be extended to the case with an additional convex constraint and/or regularizer on U. For example, to promote sparsity, one can use an ℓ1 constraint or regularizer, which can also be efficiently incorporated into SymHALS or SymGCD. 
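For concreteness, the λ threshold appearing in Theorem 2 and Corollaries 1 and 2 can be evaluated directly from X and the initialization; a minimal NumPy sketch:

```python
import numpy as np

def lambda_threshold(X, U0):
    """Threshold from Theorem 2 / Corollaries 1-2: any
    lam > 0.5*(||X||_2 + ||X - U0 U0^T||_F - sigma_n(X)) qualifies.
    """
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    resid = np.linalg.norm(X - U0 @ U0.T, "fro")
    return 0.5 * (s[0] + resid - s[-1])
```

The threshold is only a sufficient condition; as Section 4.1 observes, much smaller values of λ often work in practice.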
A formal guarantee for this extension is the subject of ongoing work.\n\n4 Experiments on Synthetic Data and Real Image Data\n\nIn this section, we conduct experiments on both synthetic data and real data to illustrate the performance of our proposed algorithms and compare them with other state-of-the-art ones, in terms of both convergence behavior and image clustering performance. For ease of comparison, we define\n\nE_k = ‖X − U_kU_k^T‖_F^2 / ‖X‖_F^2\n\nas the normalized fitting error at the k-th iteration.\nBesides SymANLS and SymHALS, we also apply the greedy coordinate descent (GCD) algorithm of [23] (designed for tackling nonsymmetric NMF) to the reformulated problem (3) and denote the corresponding algorithm SymGCD. SymGCD is expected to have sequence convergence guarantees similar to SymANLS and SymHALS. The algorithms we compare against are: 1) the ADMM of [16], which has a regularization term in its augmented Lagrangian whose weight we tune for a fair comparison; 2) SymNewton [10], a Newton-like algorithm that approximates the Hessian matrix in Newton's method for computational efficiency; and 3) the PGD of [7]. The algorithm in [15] is inefficient for large-scale data, since it applies alternating minimization over each coordinate, which entails many inner loops when U is large.\n\n4.1 Convergence verification\n\nWe randomly generate a matrix U ∈ R^{50×5} (n = 50, r = 5) with each entry independently following a standard Gaussian distribution. To enforce nonnegativity, we then take the absolute value of each entry of U to obtain U⋆. The data matrix X is constructed as U⋆(U⋆)^T, which is nonnegative and PSD. We initialize all the algorithms with the same U_0 and V_0, whose entries are i.i.d. 
uniformly distributed between 0 and 1.\nTo study the effect of the parameter λ in (3), we show the value of ‖U_k − V_k‖_F^2 versus iteration for different choices of λ in SymHALS in Figure 2. While for this experimental setting the lower bound on λ provided by Theorem 2 is 39.9, we observe that ‖U_k − V_k‖_F^2 still converges to 0 for much smaller λ. This suggests that the sufficient condition on the choice of λ in Theorem 2 is stronger than necessary, leaving room for future improvement. In particular, we suspect that SymHALS converges to a critical point (U⋆, V⋆) with U⋆ = V⋆ (i.e., a critical point of symmetric NMF) for any λ > 0; we leave this theoretical justification as future work. On the other hand, although SymHALS finds a critical point of symmetric NMF for most values of λ, the convergence speed varies with λ: either a very large or a very small λ yields slow convergence. In the sequel, we tune the best parameter λ for each experiment.\n\nFigure 2: SymHALS with different λ (λ ∈ {0.01, 0.1, 1, 10, 39.9, 100}). Here n = 50, r = 5.\n\nWe also test on the real-world dataset CBCL⁵, which contains 2429 face images of dimension 19 × 19. We construct the similarity matrix X following [10, section 7.1, steps 1 to 3]. The convergence results on synthetic data and real-world data are shown in Figure 3 (a1)-(a2) and Figure 3 (b1)-(b2), respectively. We observe that SymANLS, SymHALS, and SymGCD 1) converge faster; and 2) empirically exhibit a linear convergence rate in terms of E_k.\n\nFigure 3: Synthetic data where n = 50, r = 5: (a1)-(a2) fitting error versus iteration and running time. 
Real image dataset CBCL where n = 2429, r = 49: (b1)-(b2) fitting error versus iteration and running time.\n\n4.2 Image clustering\n\nSymmetric NMF can be used for graph clustering [10, 11], where each element X_{ij} denotes the similarity between data points i and j. In this subsection, we apply different symmetric NMF algorithms to graph clustering on image datasets and compare their clustering accuracy [24].\nWe put all images to be clustered in a data matrix M, where each row is a vectorized image. We construct the similarity matrix following the procedure in [10, section 7.1, steps 1 to 3], using the self-tuning method to construct the similarity matrix X. Upon deriving Ũ from the symmetric NMF X ≈ ŨŨ^T, the label of the i-th image is obtained by\n\nl(M_i) = argmax_j Ũ_{ij}. (6)\n\nWe conduct the experiments on four image datasets:\nORL⁶: 400 facial images of 40 persons, with 10 images per person taken from different angles and with different expressions.\nCOIL-20⁷: 1440 images of 20 objects.\nTDT2⁸: 10,212 news articles from 30 categories. We extract the first 3147 items for the experiments (containing only 2 categories).\nMNIST⁹: the classical handwritten digits dataset, with 60,000 training images (denoted MNISTtrain) and 10,000 test images (denoted MNISTtest). 
We test on the first 3,147 images from MNISTtrain (containing all 10 digits) and the first 3,147 images from MNISTtest (containing only 3 digits).

⁵ http://cbcl.mit.edu/software-datasets/FaceData2.html
⁶ http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
⁷ http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
⁸ https://www.ldc.upenn.edu/collaborations/past-projects
⁹ http://yann.lecun.com/exdb/mnist/

In Figure 4 (a1) and Figure 4 (a2), we display the clustering accuracy on the ORL dataset with respect to iterations and time (showing only the first 10 seconds), respectively. Similar results for the COIL-20 dataset are plotted in Figure 4 (b1)-(b2). We observe that in terms of iteration count, SymNewton has performance comparable to the three alternating methods for (3) (i.e., SymANLS, SymHALS, and SymGCD), but the latter outperform the former in terms of running time. This superiority becomes more apparent as the size of the dataset increases. We show the comparison on the larger truncated datasets MNISTtrain and MNISTtest in the supplementary materials. We note that the performance of ADMM improves as the iterations proceed, and after almost 3,500 iterations on the ORL dataset it reaches a result comparable to the other algorithms. Moreover, it requires even more iterations on larger datasets. This observation makes ADMM impractical for image clustering. We run ADMM for 5,000 iterations on the ORL dataset and display the result in the supplementary material.
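For completeness, clustering accuracy of the kind compared above is computed by matching predicted cluster ids to the ground-truth labels under the best relabeling. A minimal sketch (brute force over permutations, which suffices for small r; the Hungarian algorithm is the scalable alternative) might look like:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, k):
    """Accuracy under the best relabeling of the predicted cluster ids.
    Brute force over all k! permutations; fine for small k."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in pred_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

# Predicted ids [0, 1, 0, 1] match truth [1, 0, 1, 0] after swapping labels.
acc = clustering_accuracy([1, 0, 1, 0], [0, 1, 0, 1], k=2)  # -> 1.0
```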
These results, as well as the experimental results shown in the last subsection, demonstrate (i) the power of transferring the symmetric NMF (2) to a nonsymmetric one (3); and (ii) the efficiency of alternating-type algorithms for solving (3) by exploiting the splitting property of the optimization variables in (3).

Figure 4: Real datasets: (a1) and (a2) image clustering quality on the ORL dataset, n = 400, r = 40; (b1) and (b2) image clustering quality on the COIL-20 dataset, n = 1440, r = 20.

Table 1 shows the clustering accuracies of the different algorithms on the different datasets, where we run enough iterations for ADMM so that it obtains its best result. We observe from Table 1 that SymANLS, SymHALS, and SymGCD perform better than, or comparably to, the others in most cases.

Table 1: Summary of image clustering accuracy of different algorithms on five datasets

            ORL     COIL-20  MNISTtrain  TDT2    MNISTtest
SymANLS     0.8075  0.7979   0.6477      0.9800  0.8589
SymHALS     0.7550  0.5854   0.6657      0.9806  0.8608
SymGCD      0.7900  0.7076   0.6293      0.9803  0.9882
ADMM        0.7650  0.6903   0.5803      0.9800  0.8713
SymNewton   0.7625  0.7472   0.5990      0.9793  0.8589
PGD         0.7700  0.7243   0.6475      0.9800  0.8710

Acknowledgment

The authors thank Dr. Songtao Lu for sharing the code used in [16], and the three anonymous reviewers as well as the area chair for their constructive comments.

References

[1] Daniel D Lee and H Sebastian Seung.
Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

[2] David Guillamet and Jordi Vitria. Non-negative matrix factorization for face recognition. In Topics in Artificial Intelligence, pages 336–344. Springer, 2002.

[3] Farial Shahnaz, Michael W Berry, V Paul Pauca, and Robert J Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373–386, 2006.

[4] Wing-Kin Ma, José M Bioucas-Dias, Tsung-Han Chan, Nicolas Gillis, Paul Gader, Antonio J Plaza, ArulMurugan Ambikapathi, and Chong-Yung Chi. A signal processing perspective on hyperspectral unmixing: Insights from remote sensing. IEEE Signal Processing Magazine, 31(1):67–81, 2014.

[5] Nicolas Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.

[6] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

[7] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

[8] Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In International Conference on Data Mining, pages 353–362, 2008.

[9] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92(3):708–721, 2009.

[10] Da Kuang, Sangwoon Yun, and Haesun Park. SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3):545–574, 2015.

[11] Chris Ding, Xiaofeng He, and Horst D Simon.
On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the International Conference on Data Mining, pages 606–610, 2005.

[12] Zhaoshui He, Shengli Xie, Rafal Zdunek, Guoxu Zhou, and Andrzej Cichocki. Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering. IEEE Transactions on Neural Networks, 22(12):2117–2131, 2011.

[13] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.

[14] Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. The global optimization geometry of low-rank matrix optimization. arXiv preprint arXiv:1703.01256, 2017.

[15] Arnaud Vandaele, Nicolas Gillis, Qi Lei, Kai Zhong, and Inderjit Dhillon. Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.

[16] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on Signal Processing, 65(12):3120–3135, 2017.

[17] G Di Pillo and L Grippo. Exact penalty functions in constrained optimization. SIAM Journal on Control and Optimization, 27(6):1333–1360, 1989.

[18] Gianni Di Pillo. Exact penalty methods. In Algorithms for Continuous Optimization, pages 209–253. Springer, 1994.

[19] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.

[20] Kejun Huang, Nicholas D Sidiropoulos, and Athanasios P Liavas.
A \ufb02exible and ef\ufb01cient algorithmic\nframework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing,\n64(19):5052\u20135065, 2016.\n\n10\n\n\f[21] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A uni\ufb01ed convergence analysis of block successive\n\nminimization methods for nonsmooth optimization. SIAM J. Optimization, 23(2):1126\u20131153, 2013.\n\n[22] Nicolas Gillis and Fran\u00e7ois Glineur. Accelerated multiplicative updates and hierarchical als algorithms for\n\nnonnegative matrix factorization. Neural computation, 24(4):1085\u20131105, 2012.\n\n[23] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-\nnegative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on\nKnowledge discovery and data mining, pages 1064\u20131072. ACM, 2011.\n\n[24] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization.\nIn International ACM SIGIR conference on Research and development in informaion retrieval, pages\n267\u2013273, 2003.\n\n11\n\n\f", "award": [], "sourceid": 2471, "authors": [{"given_name": "Zhihui", "family_name": "Zhu", "institution": "Johns Hopkins University"}, {"given_name": "Xiao", "family_name": "Li", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Kai", "family_name": "Liu", "institution": "Colorado School of Mines"}, {"given_name": "Qiuwei", "family_name": "Li", "institution": "Colorado School of Mines"}]}