{"title": "An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 847, "page_last": 855, "abstract": "Many problems in machine learning and statistics can be formulated as (generalized) eigenproblems. In terms of the associated optimization problem, computing linear eigenvectors amounts to finding critical points of a quadratic function subject to quadratic constraints. In this paper we show that a certain class of constrained optimization problems with nonquadratic objective and constraints can be understood as nonlinear eigenproblems. We derive a generalization of the inverse power method which is guaranteed to converge to a nonlinear eigenvector. We apply the inverse power method to 1-spectral clustering and sparse PCA which can naturally be formulated as nonlinear eigenproblems. In both applications we achieve state-of-the-art results in terms of solution quality and runtime. Moving beyond the standard eigenproblem should be useful also in many other applications and our inverse power method can be easily adapted to new problems.", "full_text": "An Inverse Power Method for Nonlinear\n\nEigenproblems with Applications in\n1-Spectral Clustering and Sparse PCA\n\nMatthias Hein\n\nThomas B\u00a8uhler\nSaarland University, Saarbr\u00a8ucken, Germany\n{hein,tb}@cs.uni-saarland.de\n\nAbstract\n\nMany problems in machine learning and statistics can be formulated as (general-\nized) eigenproblems. In terms of the associated optimization problem, comput-\ning linear eigenvectors amounts to \ufb01nding critical points of a quadratic function\nsubject to quadratic constraints. In this paper we show that a certain class of con-\nstrained optimization problems with nonquadratic objective and constraints can be\nunderstood as nonlinear eigenproblems. 
We derive a generalization of the inverse\npower method which is guaranteed to converge to a nonlinear eigenvector. We\napply the inverse power method to 1-spectral clustering and sparse PCA which\ncan naturally be formulated as nonlinear eigenproblems. In both applications we\nachieve state-of-the-art results in terms of solution quality and runtime. Moving\nbeyond the standard eigenproblem should be useful also in many other applica-\ntions and our inverse power method can be easily adapted to new problems.\n\n1\n\nIntroduction\n\nEigenvalue problems associated to a symmetric and positive semi-de\ufb01nite matrix are quite abundant\nin machine learning and statistics. However, considering the eigenproblem from a variational point\nof view using Courant-Fischer-theory, the objective is a ratio of quadratic functions, which is quite\nrestrictive from a modeling perspective. We show in this paper that using a ratio of p-homogeneous\nfunctions leads quite naturally to a nonlinear eigenvalue problem, associated to a certain nonlin-\near operator. Clearly, such a generalization is only interesting if certain properties of the standard\nproblem are preserved and ef\ufb01cient algorithms for the computation of nonlinear eigenvectors are\navailable. In this paper we present an ef\ufb01cient generalization of the inverse power method (IPM)\nto nonlinear eigenvalue problems and study the relation to the standard problem. While our IPM\nis a general purpose method, we show for two unsupervised learning problems that it can be easily\nadapted to a particular application.\nThe \ufb01rst application is spectral clustering [20]. In prior work [5] we proposed p-spectral clustering\nbased on the graph p-Laplacian, a nonlinear operator on graphs which reduces to the standard graph\nLaplacian for p = 2. For p close to one, we obtained much better cuts than standard spectral clus-\ntering, at the cost of higher runtime. 
Using the new IPM, we efficiently compute eigenvectors of the 1-Laplacian for 1-spectral clustering. Similar to the recent work of [19], we improve considerably compared to [5] both in terms of runtime and the achieved Cheeger cuts. However, as opposed to the method suggested in [19], our IPM is guaranteed to converge to an eigenvector of the 1-Laplacian.

The second application is sparse Principal Component Analysis (PCA). The motivation for sparse PCA is that the largest PCA component is difficult to interpret as usually all components are nonzero. In order to allow a direct interpretation it is therefore desirable to have only a few features with nonzero components which still explain most of the variance. This kind of trade-off has been widely studied in recent years, see [15] and references therein. We show that sparse PCA also has a natural formulation as a nonlinear eigenvalue problem and can be efficiently solved with the IPM. All proofs had to be omitted due to space restrictions and can be found in the supplementary material.

2 Nonlinear Eigenproblems

The standard eigenproblem for a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is of the form
$$Af - \lambda f = 0, \qquad (1)$$
where $f \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$. It is a well-known result from linear algebra that for symmetric matrices $A$, the eigenvectors of $A$ can be characterized as critical points of the functional
$$F_{\mathrm{Standard}}(f) = \frac{\langle f, Af\rangle}{\|f\|_2^2}. \qquad (2)$$
The eigenvectors of $A$ can be computed using the Courant-Fischer Min-Max principle. While the ratio of quadratic functions is useful in several applications, it is a severe modeling restriction. 
This restriction, however, can be overcome using nonlinear eigenproblems. In this paper we consider functionals $F$ of the form
$$F(f) = \frac{R(f)}{S(f)}, \qquad (3)$$
where, with $\mathbb{R}_+ = \{x \in \mathbb{R} \mid x \ge 0\}$, we assume $R\colon \mathbb{R}^n \to \mathbb{R}_+$ and $S\colon \mathbb{R}^n \to \mathbb{R}_+$ to be convex, Lipschitz continuous, even and positively $p$-homogeneous$^1$ with $p \ge 1$. Moreover, we assume that $S(f) = 0$ if and only if $f = 0$. The condition that $R$ and $S$ are $p$-homogeneous and even will imply for any eigenvector $v$ that $\alpha v$ for $\alpha \in \mathbb{R}$ is also an eigenvector. It is easy to see that the functional of the standard eigenvalue problem in Equation (2) is a special case of the general functional in (3).

To gain some intuition, let us first consider the case where $R$ and $S$ are differentiable. Then for every critical point $f^*$ of $F$ it holds that
$$\nabla F(f^*) = 0 \quad \Longleftrightarrow \quad \nabla R(f^*) - \frac{R(f^*)}{S(f^*)}\,\nabla S(f^*) = 0.$$
Let $r, s\colon \mathbb{R}^n \to \mathbb{R}^n$ be the operators defined as $r(f) = \nabla R(f)$, $s(f) = \nabla S(f)$ and let $\lambda^* = \frac{R(f^*)}{S(f^*)}$; then every critical point $f^*$ of $F$ satisfies the nonlinear eigenproblem
$$r(f^*) - \lambda^*\, s(f^*) = 0, \qquad (4)$$
which is in general a system of nonlinear equations, as $r$ and $s$ are nonlinear operators. If $R$ and $S$ are both quadratic, $r$ and $s$ are linear operators and one gets back the standard eigenproblem (1). Before we proceed to the general nondifferentiable case, we have to introduce some important concepts from nonsmooth analysis. Note that $F$ is in general nonconvex and nondifferentiable. In the following we denote by $\partial F(f)$ the generalized gradient of $F$ at $f$ according to Clarke [9],
$$\partial F(f) = \{\xi \in \mathbb{R}^n \mid F^0(f, v) \ge \langle \xi, v\rangle \ \text{for all } v \in \mathbb{R}^n\},$$
where $F^0(f, v) = \lim_{g \to f,\, t \downarrow 0} \sup \frac{F(g + tv) - F(g)}{t}$. 
In the case where $F$ is convex, $\partial F$ is the subdifferential of $F$ and $F^0(f, v)$ the directional derivative for each $v \in \mathbb{R}^n$. A characterization of critical points of nonsmooth functionals is as follows.

Definition 2.1 ([7]) A point $f \in \mathbb{R}^n$ is a critical point of $F$ if $0 \in \partial F(f)$.

This generalizes the well-known fact that the gradient of a differentiable function vanishes at a critical point. We now show that the nonlinear eigenproblem (4) is a necessary condition for a critical point and in some cases even sufficient. A useful tool is the generalized Euler identity.

Theorem 2.1 ([21]) Let $R\colon \mathbb{R}^n \to \mathbb{R}$ be a positively $p$-homogeneous and convex continuous function. Then, for each $x \in \mathbb{R}^n$ and $r^* \in \partial R(x)$ it holds that $\langle x, r^*\rangle = p\, R(x)$.

$^1$A function $G\colon \mathbb{R}^n \to \mathbb{R}$ is positively homogeneous of degree $p$ if $G(\gamma x) = \gamma^p G(x)$ for all $\gamma \ge 0$.

The next theorem characterizes the relation between nonlinear eigenvectors and critical points of $F$.

Theorem 2.2 Suppose that $R$, $S$ fulfill the stated conditions. Then a necessary condition for $f^*$ being a critical point of $F$ is
$$0 \in \partial R(f^*) - \lambda^*\, \partial S(f^*), \quad \text{where} \quad \lambda^* = \frac{R(f^*)}{S(f^*)}. \qquad (5)$$
If $S$ is continuously differentiable at $f^*$, then this condition is also sufficient.

Finally, the definition of the associated nonlinear operators in the nonsmooth case is a bit tricky, as $r$ and $s$ can be set-valued. However, as we assume $R$ and $S$ to be Lipschitz, the set where $R$ and $S$ are nondifferentiable has measure zero and thus $r$ and $s$ are single-valued almost everywhere.

3 The inverse power method for nonlinear eigenproblems

A standard technique to obtain the smallest eigenvalue of a positive semi-definite symmetric matrix $A$ is the inverse power method [12]. 
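As a point of reference for the generalization that follows, here is a minimal sketch of the classical inverse power method for a symmetric positive definite matrix. This is our own illustration, not code from the paper; the function name and stopping rule are ours.

```python
import numpy as np

def inverse_power_method(A, tol=1e-12, max_iter=1000, seed=0):
    """Classical inverse power method: repeatedly solve A f_new = f and
    renormalize; f converges to an eigenvector of the smallest eigenvalue."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(A.shape[0])
    f /= np.linalg.norm(f)
    for _ in range(max_iter):
        f_new = np.linalg.solve(A, f)        # the step A f^{k+1} = f^k
        f_new /= np.linalg.norm(f_new)
        # stop when the direction no longer changes (up to sign)
        if np.linalg.norm(f_new - np.sign(f_new @ f) * f) < tol:
            f = f_new
            break
        f = f_new
    return f @ A @ f, f                      # Rayleigh quotient and eigenvector

A = np.array([[2.0, 1.0], [1.0, 2.0]])       # eigenvalues 1 and 3
lam, f = inverse_power_method(A)             # lam is close to 1
```

Each iteration damps the components of $f$ along the large eigenvalues by the ratio of eigenvalues, which is what the nonlinear generalization below mimics through the inner optimization problem.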
Its main building block is the fact that the iterative scheme
$$A f^{k+1} = f^k \qquad (6)$$
converges to the smallest eigenvector of $A$. Transforming (6) into the optimization problem
$$f^{k+1} = \arg\min_{u} \ \tfrac{1}{2}\langle u, A u\rangle - \langle u, f^k\rangle \qquad (7)$$
is the motivation for the general IPM. The direct generalization tries to solve
$$f^{k+1} = \arg\min_{u} \ R(u) - \langle u, s(f^k)\rangle, \quad \text{or equivalently} \quad 0 \in r(f^{k+1}) - s(f^k), \qquad (8)$$
where $r(f) \in \partial R(f)$ and $s(f) \in \partial S(f)$. For $p > 1$ this leads directly to Algorithm 2; for $p = 1$, however, the direct generalization fails. In particular, the ball constraint has to be introduced in Algorithm 1, as the objective in the optimization problem (8) is otherwise unbounded from below. (Note that the 2-norm is only chosen for algorithmic convenience.) Moreover, the introduction of $\lambda^k$ in Algorithm 1 is necessary to guarantee descent, whereas in Algorithm 2 it would just yield a rescaled solution of the problem in the inner loop (called the inner problem in the following).

For both methods we show convergence to a solution of (4), which by Theorem 2.2 is a necessary condition for a critical point of $F$ and often also sufficient. Interestingly, both applications are naturally formulated as 1-homogeneous problems, so that we use Algorithm 1 in both cases. Nevertheless, we state the second algorithm for completeness. Note that we cannot guarantee convergence to the smallest eigenvector even though our experiments suggest that we often do so. 
However, as the method is fast, one can afford to run it multiple times with different initializations and use the eigenvector with the smallest eigenvalue.

Algorithm 1 Computing a nonlinear eigenvector for convex positively p-homogeneous functions $R$ and $S$ with $p = 1$
1: Initialization: $f^0 =$ random with $\|f^0\|_2 = 1$, $\lambda^0 = F(f^0)$
2: repeat
3:   $f^{k+1} = \arg\min_{\|u\|_2 \le 1} \big\{ R(u) - \lambda^k \langle u, s(f^k)\rangle \big\}$, where $s(f^k) \in \partial S(f^k)$
4:   $\lambda^{k+1} = R(f^{k+1})/S(f^{k+1})$
5: until $|\lambda^{k+1} - \lambda^k| / \lambda^k < \epsilon$
6: Output: eigenvalue $\lambda^{k+1}$ and eigenvector $f^{k+1}$.

The inner optimization problem is convex for both algorithms. It turns out that both for 1-spectral clustering and sparse PCA the inner problem can be solved very efficiently; for sparse PCA it even has a closed form solution. While we do not yet have results about convergence speed, empirical observation shows that one usually converges quite quickly to an eigenvector.

Algorithm 2 Computing a nonlinear eigenvector for convex positively p-homogeneous functions $R$ and $S$ with $p > 1$
1: Initialization: $f^0 =$ random, $\lambda^0 = F(f^0)$
2: repeat
3:   $g^{k+1} = \arg\min_{u} \big\{ R(u) - \langle u, s(f^k)\rangle \big\}$, where $s(f^k) \in \partial S(f^k)$
4:   $f^{k+1} = g^{k+1} / S(g^{k+1})^{1/p}$
5:   $\lambda^{k+1} = R(f^{k+1})/S(f^{k+1})$
6: until $|\lambda^{k+1} - \lambda^k| / \lambda^k < \epsilon$
7: Output: eigenvalue $\lambda^{k+1}$ and eigenvector $f^{k+1}$.

To the best of our knowledge, both suggested methods have not been considered before. In [4] an inverse power method specially tailored towards the continuous p-Laplacian for $p > 1$ is proposed, which can be seen as a special case of Algorithm 2. In [15] a generalized power method has been proposed, which will be discussed in Section 5. 
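To make the iteration concrete, the following is a small sketch of the scheme of Algorithm 2 (our own illustrative code, not from the paper; function and parameter names are ours). As a sanity check it uses the quadratic pair $R(u) = \langle u, Au\rangle$, $S(u) = \|u\|_2^2$, both 2-homogeneous, for which the inner problem is solved by $u = A^{-1} f^k$ up to scaling, so the scheme reduces to the linear inverse power method.

```python
import numpy as np

def ipm_p_homogeneous(inner_solve, R, S, grad_S, f0, p, tol=1e-10, max_iter=100):
    """Sketch of the nonlinear IPM for p-homogeneous R, S with p > 1.
    inner_solve(s_fk) returns argmin_u R(u) - <u, s_fk>."""
    f = f0 / S(f0) ** (1.0 / p)
    lam = R(f) / S(f)
    for _ in range(max_iter):
        g = inner_solve(grad_S(f))           # step 3: inner problem
        f = g / S(g) ** (1.0 / p)            # step 4: normalization
        lam_new = R(f) / S(f)                # step 5: eigenvalue estimate
        if abs(lam_new - lam) / lam < tol:   # step 6: stopping criterion
            return lam_new, f
        lam = lam_new
    return lam, f

# Quadratic sanity check: grad S(f) = 2f, and minimizing <u, Au> - <u, s>
# gives 2Au = s, i.e. u = (2A)^{-1} s = A^{-1} f -- the linear IPM.
A = np.array([[2.0, 1.0], [1.0, 2.0]])       # eigenvalues 1 and 3
lam, f = ipm_p_homogeneous(
    inner_solve=lambda s: np.linalg.solve(2 * A, s),
    R=lambda u: u @ A @ u,
    S=lambda u: u @ u,
    grad_S=lambda u: 2 * u,
    f0=np.array([1.0, 0.3]),
    p=2,
)
```

In the quadratic case the estimate `lam` is the Rayleigh quotient and decreases monotonically to the smallest eigenvalue, matching Lemma 3.1 below.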
Finally, both methods can be easily adapted to compute the largest nonlinear eigenvalue, which however we have to omit due to space constraints.

Lemma 3.1 The sequences $f^k$ produced by Alg. 1 and 2 satisfy $F(f^k) > F(f^{k+1})$ for all $k \ge 0$ or the sequences terminate.

Theorem 3.1 The sequences $f^k$ produced by Algorithms 1 and 2 converge to an eigenvector $f^*$ with eigenvalue $\lambda^* \in [0, F(f^0)]$ in the sense that it solves the nonlinear eigenproblem (5). If $S$ is continuously differentiable at $f^*$, then $F$ has a critical point at $f^*$.

Practical implementation: By the proof of Lemma 3.1, descent in $F$ is guaranteed not only for the optimal solution of the inner problem, but for any vector $u$ which has inner objective value $\Phi_{f^k}(u) < 0 = \Phi_{f^k}(f^k)$ for Alg. 1 and $\Psi_{f^k}(u) < \Psi_{f^k}\big(F(f^k)^{\frac{1}{1-p}} f^k\big)$ in the case of Alg. 2. This has two important practical implications. First, for the convergence of the IPM it is sufficient to use a vector $u$ satisfying the above conditions instead of the optimal solution of the inner problem. In particular, in an early stage where one is far away from the limit, it makes no sense to invest much effort to solve the inner problem accurately. Second, if the inner problem is solved by a descent method, a good initialization for the inner problem at step $k+1$ is given by $f^k$ in the case of Alg. 1 and $F(f^k)^{\frac{1}{1-p}} f^k$ in the case of Alg. 2, as descent in $F$ is guaranteed after one step.

4 Application 1: 1-spectral clustering and Cheeger cuts

Spectral clustering is a graph-based clustering method (see [20] for an overview) based on a relaxation of the NP-hard problem of finding the optimal balanced cut of an undirected graph. The spectral relaxation has as its solution the second eigenvector of the graph Laplacian, and the final partition is found by optimal thresholding. 
While spectral clustering is usually understood as a relaxation of the so-called ratio/normalized cut, it can equally be seen as a relaxation of the ratio/normalized Cheeger cut, see [5]. Given a weighted undirected graph with vertex set $V$ and weight matrix $W$, the ratio Cheeger cut (RCC) of a partition $(C, \overline{C})$, where $C \subset V$ and $\overline{C} = V \setminus C$, is defined as
$$\mathrm{RCC}(C, \overline{C}) := \frac{\mathrm{cut}(C, \overline{C})}{\min\{|C|, |\overline{C}|\}}, \quad \text{where} \quad \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij},$$
and we assume in the following that the graph is connected. Due to limited space the normalized version is omitted, but the proposed IPM can be adapted to this case. In [5] we proposed p-spectral clustering, a generalization of spectral clustering based on the second eigenvector of the nonlinear graph p-Laplacian (the graph Laplacian is recovered for $p = 2$). The main motivation was the relation between the optimal Cheeger cut $h_{\mathrm{RCC}} = \min_{C \subset V} \mathrm{RCC}(C, \overline{C})$ and the Cheeger cut $h^*_{\mathrm{RCC}}$ obtained by optimal thresholding of the second eigenvector of the p-Laplacian, see [5, 8]: for all $p > 1$,
$$\frac{h_{\mathrm{RCC}}}{\max_{i \in V} d_i} \ \le\ \frac{h^*_{\mathrm{RCC}}}{\max_{i \in V} d_i} \ \le\ p \left( \frac{h_{\mathrm{RCC}}}{\max_{i \in V} d_i} \right)^{\frac{1}{p}},$$
where $d_i = \sum_{j \in V} w_{ij}$ denotes the degree of vertex $i$. While the inequality is quite loose for spectral clustering ($p = 2$), it becomes tight for $p \to 1$. Indeed, in [5] much better cuts than standard spectral clustering were obtained, at the expense of higher runtime. In [19] the idea was taken up, and the variational characterization of the ratio Cheeger cut was considered directly, see also [8]:
$$h_{\mathrm{RCC}} = \min_{f \ \mathrm{nonconstant}} \ \frac{\frac{1}{2}\sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f - \mathrm{median}(f)\mathbf{1}\|_1} \ = \ \min_{\substack{f \ \mathrm{nonconstant},\\ \mathrm{median}(f) = 0}} \ \frac{\frac{1}{2}\sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f\|_1}. \qquad (9)$$
In [19] a minimization scheme based on the Split Bregman method [11] was proposed. 
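As a small illustration of the ratio Cheeger cut and of optimal thresholding (our own sketch; function names are ours, not from the paper), the following computes $\mathrm{RCC}$ for every threshold set $C^t = \{i \mid f_i > t\}$ of a given vector $f$ and keeps the best:

```python
import numpy as np

def rcc(W, mask):
    """Ratio Cheeger cut of the partition (C, V\\C), C given as boolean mask."""
    cut = W[np.ix_(mask, ~mask)].sum()          # sum of w_ij with i in C, j not in C
    return cut / min(mask.sum(), (~mask).sum())

def optimal_threshold(W, f):
    """Threshold f at every level f_i and keep the best ratio Cheeger cut."""
    best_val, best_mask = np.inf, None
    for t in np.unique(f)[:-1]:                 # the largest level gives an empty C
        mask = f > t
        val = rcc(W, mask)
        if val < best_val:
            best_val, best_mask = val, mask
    return best_val, best_mask

# Toy graph: two triangles (edge weight 1) joined by one weak bridge (weight 0.1).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
val, mask = optimal_threshold(W, np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]))
# val == 0.1 / 3: only the bridge is cut and both sides contain 3 vertices
```

This is exactly the post-processing step applied to the eigenvectors in the experiments of Section 6.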
Their method produces cuts comparable to the ones in [5], while being computationally much more efficient. However, they could not provide any convergence guarantee for their method. In this paper we consider the functional associated to the 1-Laplacian $\Delta_1$,
$$F_1(f) = \frac{\langle f, \Delta_1 f\rangle}{\|f\|_1} = \frac{\frac{1}{2}\sum_{i,j=1}^n w_{ij}|f_i - f_j|}{\|f\|_1}, \qquad (10)$$
where
$$(\Delta_1 f)_i = \Big\{ \sum_{j=1}^n w_{ij} u_{ij} \ \Big|\ u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j) \Big\} \quad \text{and} \quad \mathrm{sign}(x) = \begin{cases} -1, & x < 0, \\ [-1, 1], & x = 0, \\ 1, & x > 0, \end{cases}$$
and study its associated nonlinear eigenproblem $0 \in \Delta_1 f - \lambda\, \mathrm{sign}(f)$.

Proposition 4.1 Any non-constant eigenvector $f^*$ of the 1-Laplacian has median zero. Moreover, let $\lambda_2$ be the second eigenvalue of the 1-Laplacian; then, if $G$ is connected, it holds that $\lambda_2 = h_{\mathrm{RCC}}$.

For the computation of the second eigenvector we have to modify the IPM, which is discussed in the next section.

4.1 Modification of the IPM for computing the second eigenvector of the 1-Laplacian

The direct minimization of (10) would be compatible with the IPM, but the global minimizer is the first eigenvector, which is constant. For computing the second eigenvector note that, unlike in the case $p = 2$, we cannot simply project onto the space orthogonal to the constant eigenvector, since mutual orthogonality of the eigenvectors does not hold in the nonlinear case. Algorithm 3 is a modification of Algorithm 1 which computes a nonconstant eigenvector of the 1-Laplacian. The notation $|f^{k+1}_+|$, $|f^{k+1}_-|$ and $|f^{k+1}_0|$ refers to the cardinality of positive, negative and zero elements of $f^{k+1}$, respectively. 
Note that Algorithm 1 requires in each step the computation of some subgradient $s(f^k) \in \partial S(f^k)$, whereas in Algorithm 3 the subgradient $v^k$ has to satisfy $\langle v^k, \mathbf{1}\rangle = 0$. This condition ensures that the inner objective is invariant under addition of a constant and thus not affected by the subtraction of the median. As opposed to [19], we can prove convergence to a nonconstant eigenvector of the 1-Laplacian. However, we cannot guarantee convergence to the second eigenvector. Thus we recommend using multiple random initializations and taking the result which achieves the best ratio Cheeger cut.

Theorem 4.1 The sequence $f^k$ produced by Algorithm 3 converges to an eigenvector $f^*$ of the 1-Laplacian with eigenvalue $\lambda^* \in [h_{\mathrm{RCC}}, F_1(f^0)]$. Furthermore, $F_1(f^k) > F_1(f^{k+1})$ for all $k \ge 0$ or the sequence terminates.

4.2 Quality guarantee for 1-spectral clustering

Even though we cannot guarantee that we obtain the optimal ratio Cheeger cut, we can guarantee that 1-spectral clustering always leads to a ratio Cheeger cut at least as good as the one found by standard spectral clustering. Let $(C^*_f, \overline{C}^*_f)$ be the partition of $V$ obtained by optimal thresholding of $f$, where $C^*_f = \arg\min_t \mathrm{RCC}(C^t_f, \overline{C}^t_f)$ and, for $t \in \mathbb{R}$, $C^t_f = \{i \in V \mid f_i > t\}$. Furthermore, $\mathbf{1}_C$ denotes the vector which is $1$ on $C$ and $0$ else.

Lemma 4.1 Let $C, \overline{C}$ be a partitioning of the vertex set $V$, and assume that $|C| \le |\overline{C}|$. 
Then for\n\nany vector f \u2208 Rn of the form f = \u03b11C, where \u03b1 \u2208 R, it holds that F1(f) = RCC(C, C).\n\nf , C\u2217\nf , C t\n\n5\n\n\fAlgorithm 3 Computing a nonconstant 1-eigenvector of the graph 1-Laplacian\n1: Input: weight matrix W\n\n2: Initialization: nonconstant f 0 with median(f 0) = 0 and(cid:13)(cid:13)f 0(cid:13)(cid:13)1 = 1, accuracy \u0001\nf k+1 = gk+1 \u2212 median(cid:0)gk+1(cid:1)\n\ni,j=1 wij|fi \u2212 fj| \u2212 \u03bbk(cid:10)f, vk(cid:11)(cid:111)\n(cid:110) 1\n(cid:80)n\n(cid:40) sign(f k+1\n\ngk+1 = arg min\n2\u22641\n\n3: repeat\n4:\n\n(cid:107)f(cid:107)2\n\n5:\n\n2\n\n),\n\ni\n\nif f k+1\nif f k+1\n\ni\n\n(cid:54)= 0,\n= 0.\n\n,\n\ni\n\n+ |\u2212|f k+1\u2212 |\n\n|f k+1\n\n|\n\n0\n\n,\n\n6:\n\ni =\nvk+1\n\n\u2212|f k+1\n\u03bbk+1 = F1(f k+1)\n< \u0001\n\n8: until |\u03bbk+1\u2212\u03bbk|\n\n7:\n\n\u03bbk\n\nLemma 4.2 Let f \u2208 Rn with median(f) = 0, and C = arg min{|C\u2217\nf\u2217 = 1C satis\ufb01es F1(f) \u2265 F1(f\u2217).\nTheorem 4.2 Let u denote the second eigenvector of the standard graph Laplacian, and f denote\nthe result of Algorithm 3 after initializing with the vector 1|C| 1C, where C = arg min{|C\u2217\nu|}.\nThen RCC(C\u2217\n\nf|}. Then the vector\n\nf|,|C\u2217\n\nu|,|C\u2217\n\nu) \u2265 RCC(C\u2217\n\nu, C\u2217\n\nf , C\u2217\nf ).\n\n4.3 Solution of the inner problem\n\nThe inner problem is convex, thus a solution can be computed by any standard method for solving\nconvex nonsmooth programs, e.g. subgradient methods [3]. However, in this particular case we can\nexploit the structure of the problem and use the equivalent dual formulation of the inner problem.\nLemma 4.3 Let E \u2282 V \u00d7 V denote the set of edges and A : RE \u2192 RV be de\ufb01ned as (A\u03b1)i =\n\n(cid:80)\n\nj | (i,j)\u2208E wij\u03b1ij. 
The inner problem is equivalent to
$$\min_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \Psi(\alpha) := \big\| A\alpha - F(f^k)\, v^k \big\|_2^2.$$
The Lipschitz constant of the gradient of $\Psi$ is upper bounded by $2 \max_r \sum_{s=1}^n w_{rs}^2$.

Compared to the primal problem, the objective of the dual problem is smooth. Moreover, it can be efficiently solved using FISTA [2], a two-step subgradient method with guaranteed convergence rate $O(\frac{1}{k^2})$, where $k$ is the number of steps. The only input of FISTA is an upper bound on the Lipschitz constant of the gradient of the objective. FISTA provides a good solution in a few steps which guarantees descent in the functional (9) and thus makes the modified IPM very fast. The implementation can be found in the supplementary material.

5 Application 2: Sparse PCA

Principal Component Analysis (PCA) is a standard technique for dimensionality reduction and data analysis [13]. PCA finds the $k$-dimensional subspace of maximal variance in the data. For $k = 1$, given a data matrix $X \in \mathbb{R}^{n \times p}$ where each column has mean $0$, in PCA one computes
$$f^* = \arg\max_{f \in \mathbb{R}^p} \ \frac{\langle f, X^T X f\rangle}{\|f\|_2^2}, \qquad (11)$$
where the maximizer $f^*$ is the largest eigenvector of the covariance matrix $\Sigma = X^T X \in \mathbb{R}^{p \times p}$. The interpretation of the PCA component $f^*$ is difficult, as usually all components are nonzero. In sparse PCA one wants a small number of features which still capture most of the variance. For instance, in the case of gene expression data one would like the principal components to consist only of a few significant genes, making them easy to interpret by a human. 
Thus one needs to enforce sparsity of the PCA component, which yields a trade-off between explained variance and sparsity. While standard PCA leads to an eigenproblem, adding a constraint on the cardinality, i.e. the number of nonzero coefficients, makes the problem NP-hard. The first approaches performed simple thresholding of the principal components, which was shown to be misleading [6]. Since then several methods have been proposed, mainly based on penalizing the $L_1$ norm of the principal components, including SCoTLASS [14] and SPCA [22]. D'Aspremont et al. [10] focused on the $L_0$-constrained formulation, proposed a greedy algorithm to compute a full set of good candidate solutions up to a specified target sparsity, and derived sufficient conditions for a vector to be globally optimal. Moghaddam et al. [16] used branch and bound to compute optimal solutions for small problem instances. Other approaches include D.C. [18] and EM-based methods [17]. Recently, Journée et al. [15] proposed two single-unit (computation of one component only) and two block (simultaneous computation of multiple components) methods based on $L_0$-penalization and $L_1$-penalization.

Problem (11) is equivalent to
$$f^* = \arg\min_{f \in \mathbb{R}^p} \ \frac{\|f\|_2^2}{\langle f, \Sigma f\rangle} = \arg\min_{f \in \mathbb{R}^p} \ \frac{\|f\|_2}{\|Xf\|_2}.$$
In order to enforce sparsity, we replace the $L_2$ norm in the numerator by a convex combination of an $L_1$ norm and an $L_2$ norm, which yields the functional
$$F(f) = \frac{(1 - \alpha)\|f\|_2 + \alpha \|f\|_1}{\|Xf\|_2}, \qquad (12)$$
with sparsity controlling parameter $\alpha \in [0, 1]$. Standard PCA is recovered for $\alpha = 0$, whereas $\alpha = 1$ yields the sparsest non-trivial solution: the single component with the maximal variance. 
One easily sees that the formulation (12) fits in our general framework, as both numerator and denominator are 1-homogeneous functions. The inner problem of the IPM becomes
$$g^{k+1} = \arg\min_{\|f\|_2 \le 1} \ (1 - \alpha)\|f\|_2 + \alpha\|f\|_1 - \lambda^k \langle f, \mu^k\rangle, \quad \text{where} \quad \mu^k = \frac{\Sigma f^k}{\sqrt{\langle f^k, \Sigma f^k\rangle}}. \qquad (13)$$
This problem has a closed form solution. In the following we use the notation $x_+ = \max\{0, x\}$.

Lemma 5.1 The convex optimization problem (13) has the analytical solution
$$g^{k+1}_i = \frac{1}{s}\, \mathrm{sign}(\mu^k_i) \big( \lambda^k |\mu^k_i| - \alpha \big)_+, \quad \text{where} \quad s = \sqrt{\sum\nolimits_{i=1}^n \big( \lambda^k |\mu^k_i| - \alpha \big)_+^2}.$$
As $s$ is just a scaling factor, we can omit it and obtain the simple and efficient scheme to compute sparse principal components shown in Algorithm 4. While the derivation is quite different from [15], the resulting algorithms are very similar. The subtle difference is that in our formulation the thresholding parameter of the inner problem depends on the current eigenvalue estimate, whereas it is fixed in [15]. 
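For illustration, here is a small self-contained sketch of the resulting thresholding iteration, written from Lemma 5.1 and the scheme described above. This is our own code: the function name and the deterministic all-ones initialization are our choices (the paper initializes randomly), so treat it as a sketch rather than the reference implementation.

```python
import numpy as np

def sparse_pca_ipm(X, alpha, tol=1e-10, max_iter=500):
    """Sparse PCA via the nonlinear inverse power iteration:
    soft-threshold the scaled gradient direction, then renormalize.
    alpha in [0, 1] trades explained variance against sparsity."""
    n, p = X.shape
    Sigma = X.T @ X                          # covariance (columns of X are centered)
    f = np.ones(p) / np.linalg.norm(X @ np.ones(p))
    lam = (1 - alpha) * np.linalg.norm(f) + alpha * np.abs(f).sum()
    for _ in range(max_iter):
        mu = Sigma @ f / np.sqrt(f @ Sigma @ f)
        g = np.sign(mu) * np.maximum(lam * np.abs(mu) - alpha, 0.0)  # Lemma 5.1
        if not g.any():                      # alpha too large: all coordinates thresholded away
            break
        f = g / np.linalg.norm(X @ g)        # renormalize so that ||X f||_2 = 1
        lam_new = (1 - alpha) * np.linalg.norm(f) + alpha * np.abs(f).sum()
        if abs(lam_new - lam) / lam < tol:
            lam = lam_new
            break
        lam = lam_new
    return lam, f

# Toy data whose first feature carries almost all the variance.
X = np.array([[ 3.0,  0.1,  0.0],
              [-3.0,  0.0,  0.1],
              [ 3.0, -0.1,  0.0],
              [-3.0,  0.0, -0.1]])          # columns have mean zero
lam, f = sparse_pca_ipm(X, alpha=0.9)       # strong sparsity: one nonzero component
```

For $\alpha = 0$ the thresholding is inactive and the loop is the plain power method on $\Sigma$, recovering the largest PCA component; for large $\alpha$ only the dominant feature survives the soft threshold.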
Empirically, this leads to the fact that we need slightly fewer iterations to converge.

Algorithm 4 Sparse PCA
1: Input: data matrix $X$, sparsity controlling parameter $\alpha$, accuracy $\epsilon$
2: Initialization: $f^0 =$ random with $S(f^0) = 1$, $\lambda^0 = F(f^0)$
3: repeat
4:   $g^{k+1}_i = \mathrm{sign}(\mu^k_i)\big(\lambda^k |\mu^k_i| - \alpha\big)_+$
5:   $f^{k+1} = g^{k+1} / \|X g^{k+1}\|_2$
6:   $\mu^{k+1} = \Sigma f^{k+1} / \|X f^{k+1}\|_2$
7:   $\lambda^{k+1} = (1 - \alpha)\|f^{k+1}\|_2 + \alpha\|f^{k+1}\|_1$
8: until $|\lambda^{k+1} - \lambda^k| / \lambda^k < \epsilon$

6 Experiments

1-Spectral Clustering: We compare our IPM with the total variation (TV) based algorithm by [19], p-spectral clustering with $p = 1.1$ [5], as well as standard spectral clustering with optimal thresholding of the second eigenvector of the graph Laplacian ($p = 2$). The graph and the two-moons dataset are constructed as in [5]. The following table shows the average ratio Cheeger cut (RCC) and error (classification as in [5]) for 100 draws of a two-moons dataset with 2000 points. In the case of the IPM, we use the best result of 10 runs with random initializations and one run initialized with the second eigenvector of the unnormalized graph Laplacian. For [19] we initialize once with the second eigenvector of the normalized graph Laplacian as proposed in [19] and 10 times randomly. The IPM and the TV-based method yield similar results, slightly better than 1.1-spectral clustering and clearly outperforming standard spectral clustering. In terms of runtime, the IPM and [19] are on the same level.

           | Inverse Power Method | Szlam & Bresson [19] | 1.1-spectral [5]  | Standard spectral
Avg. RCC   | 0.0195 (± 0.0015)    | 0.0195 (± 0.0015)    | 0.0196 (± 0.0016) | 0.0247 (± 0.0016)
Avg. error | 0.0462 (± 0.0161)    | 0.0491 (± 0.0181)    | 0.0578 (± 0.0285) | 0.1685 (± 0.0200)

Figure 1: Left and middle: Second eigenvector of the 1-Laplacian and 2-Laplacian, respectively. Right: Relative variance (relative to the maximal possible variance) versus number of non-zero components for the three datasets Lung2, GCM and Prostate1.

Next we perform unnormalized 1-spectral clustering on the full USPS and MNIST datasets (9298 resp. 70000 points). As clustering criterion we use the multicut version of the ratio cut, given as
$$\mathrm{RCut}(C_1, \ldots, C_K) = \sum_{i=1}^{K} \frac{\mathrm{cut}(C_i, \overline{C}_i)}{|C_i|}.$$
We successively subdivide clusters until the desired number of clusters ($K = 10$) is reached. In each substep the eigenvector obtained on the subgraph is thresholded such that the multicut criterion is minimized. This recursive partitioning scheme is used for all methods. As in the previous experiment, we perform one run initialized with the thresholded second eigenvector of the unnormalized graph Laplacian in the case of the IPM, and with the second eigenvector of the normalized graph Laplacian in the case of [19]. In both cases we add 100 runs with random initializations. The next table shows the obtained RCut and errors.

             | Inverse Power Method | S.&B. [19] | 1.1-spectral [5] | Standard spectral
MNIST  RCut  | 0.1507               | 0.1545     | 0.1529           | 0.2252
MNIST  Error | 0.1244               | 0.1318     | 0.1293           | 0.1883
USPS   RCut  | 0.6661               | 0.6663     | 0.6676           | 0.8180
USPS   Error | 0.1349               | 0.1309     | 0.1308           | 0.1686

Again the three nonlinear eigenvector methods clearly outperform standard spectral clustering. Note that our method requires additional effort (100 runs) but obtains better results. For both datasets our method achieves the best RCut. 
However, if one wants to do only a single run, by Theorem 4.2\nfor bi-partitions one achieves a cut at least as good as the one of standard spectral clustering if one\ninitializes with the thresholded 2nd eigenvector of the 2-Laplacian.\n\nSparse PCA: We evaluate our IPM for sparse PCA on gene expression datasets obtained from\n[1]. We compare with two recent algorithms: the L1 based single-unit power algorithm of [15]\nas well as the EM-based algorithm in [17]. For all considered datasets, the three methods achieve\nvery similar performance in terms of the tradeoff between explained variance and sparsity of the\nsolution, see Fig.1 (Right). In fact the results are so similar that for each dataset, the plots of all\nthree methods coincide in one line. In [15] it also has been observed that the best state-of-the-art\nalgorithms produce the same trade-off curve if one uses the same initialization strategy.\n\nAcknowledgments: This work has been supported by the Excellence Cluster on Multimodal Com-\nputing and Interaction at Saarland University.\n\n8\n\n\fReferences\n[1] http://www.stat.ucla.edu/\u02dcwxl/research/microarray/DBC/index.htm.\n[2] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising\n\nand deblurring problems. IEEE Transactions on Image Processing, 18(11):2419\u20132434, 2009.\n\n[3] D.P. Bertsekas. Nonlinear Programming. Athena Scienti\ufb01c, 1999.\n[4] R.J. Biezuner, G. Ercole, and E.M. Martins. Computing the \ufb01rst eigenvalue of the p-Laplacian via the\n\ninverse power method. Journal of Functional Analysis, 257:243\u2013270, 2009.\n\n[5] T. B\u00a8uhler and M. Hein. Spectral Clustering based on the graph p-Laplacian. In Proceedings of the 26th\n\nInternational Conference on Machine Learning, pages 81\u201388. Omnipress, 2009.\n\n[6] J. Cadima and I.T. Jolliffe. Loading and correlations in the interpretation of principal components. 
Journal\n\nof Applied Statistics, 22:203\u2013214, 1995.\n\n[7] K.-C. Chang. Variational methods for non-differentiable functionals and their applications to partial\n\ndifferential equations. Journal of Mathematical Analysis and Applications, 80:102\u2013129, 1981.\n\n[8] F.R.K. Chung. Spectral Graph Theory. AMS, 1997.\n[9] F.H. Clarke. Optimization and Nonsmooth Analysis. Wiley New York, 1983.\n[10] A. d\u2019Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis.\n\nJournal of Machine Learning Research, 9:1269\u20131294, 2008.\n\n[11] T. Goldstein and S. Osher. The Split Bregman method for L1-Regularized Problems. SIAM Journal on\n\nImaging Sciences, 2(2):323\u2013343, 2009.\n\n[12] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.\n[13] I.T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.\n[14] I.T. Jolliffe, N. Trenda\ufb01lov, and M. Uddin. A modi\ufb01ed principal component technique based on the\n\nLASSO. Journal of Computational and Graphical Statistics, 12:531\u2013547, 2003.\n\n[15] M. Journ\u00b4ee, Y. Nesterov, P. Richt\u00b4arik, and R. Sepulchre. Generalized Power Method for Sparse Principal\n\nComponent Analysis. Journal of Machine Learning Research, 11:517\u2013553, 2010.\n\n[16] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms.\n\nIn Advances in Neural Information Processing Systems, pages 915\u2013922. MIT Press, 2006.\n\n[17] C.D. Sigg and J.M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In Proceed-\n\nings of the 25th International Conference on Machine Learning, pages 960\u2013967. ACM, 2008.\n\n[18] B.K. Sriperumbudur, D.A. Torres, and G.R.G. Lanckriet. Sparse eigen methods by D.C. programming.\nIn Proceedings of the 24th International Conference on Machine Learning, pages 831\u2013838. ACM, 2007.\n[19] A. Szlam and X. Bresson. 
Total variation and Cheeger cuts. In Proceedings of the 27th International\n\nConference on Machine Learning, pages 1039\u20131046. Omnipress, 2010.\n\n[20] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395\u2013416, 2007.\n[21] F. Yang and Z. Wei. Generalized Euler identity for subdifferentials of homogeneous functions and appli-\n\ncations. Mathematical Analysis and Applications, 337:516\u2013523, 2008.\n\n[22] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and\n\nGraphical Statistics, 15:265\u2013286, 2006.\n\n9\n\n\f", "award": [], "sourceid": 1133, "authors": [{"given_name": "Matthias", "family_name": "Hein", "institution": null}, {"given_name": "Thomas", "family_name": "B\u00fchler", "institution": null}]}