{"title": "Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts", "book": "Advances in Neural Information Processing Systems", "page_first": 2366, "page_last": 2374, "abstract": "Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately.", "full_text": "Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts\n\nMatthias Hein\n\nSaarland University, Saarbrücken, Germany\n\nhein@cs.uni-saarland.de\n\nSimon Setzer\n\nSaarland University, Saarbrücken, Germany\n\nsetzer@mia.uni-saarland.de\n\nAbstract\n\nSpectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. 
Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately.\n\n1 Introduction\n\nThe problem of finding the best balanced cut of a graph is an important problem in computer science [9, 24, 13]. It has been used for minimizing the communication cost in parallel computing, reordering of sparse matrices, image segmentation and clustering. In particular, in machine learning spectral clustering is one of the most popular graph-based clustering methods as it can be applied to any graph-based data or to data where similarity information is available so that one can build a neighborhood graph. Spectral clustering is originally based on a relaxation of the combinatorial normalized/ratio graph cut problem, see [28]. The relaxation with the best known worst case approximation guarantee yields a semi-definite program, see [3]. However, it is practically infeasible for graphs with more than 100 vertices due to the presence of $O(n^3)$ constraints, where $n$ is the number of vertices in the graph. In contrast, the computation of eigenvectors of a sparse graph scales easily to large graphs. In a line of recent work [6, 26, 14] it has been shown that relaxations based on the nonlinear graph p-Laplacian lead to similar runtime performance while providing much better cuts. In particular, for $p = 1$ one obtains a tight relaxation of the Cheeger cut, see [8, 26, 14].\n\nIn this work, we generalize this result considerably. Namely, we provide for almost any balanced graph cut problem a tight relaxation into a continuous problem. This allows flexible modeling of different graph cut criteria. The resulting non-convex, non-smooth continuous optimization problem can be efficiently solved by our new method for the minimization of ratios of differences of convex functions, called RatioDCA. 
Moreover, compared to [14], we also provide a more efficient way to solve the resulting convex inner problems by transferring recent methods from total variation denoising, cf. [7], to the graph setting. In first experiments, we illustrate the effect of different balancing terms and show improved clustering results on USPS and MNIST compared to [14].\n\n2 Set Functions, Submodularity, Convexity and the Lovász Extension\n\nIn this section we gather some material from the literature on set functions, submodularity and the Lovász extension, which we need in the next section. We refer the reader to [11, 4] for a more detailed exposition. We work on weighted, undirected graphs $G = (V, W)$ with vertex set $V$ and a symmetric, non-negative weight matrix $W$. We define $n := |V|$ and denote by $\\overline{A} = V \\setminus A$ the complement of $A$ in $V$. Set functions are denoted with a hat, $\\hat{S}$, whereas the corresponding Lovász extension is simply $S$. The indicator vector of a set $A$ is written as $\\mathbf{1}_A$. In the following we always assume that for any considered set function $\\hat{S}$ it holds $\\hat{S}(\\emptyset) = 0$. The Lovász extension is a way to extend a set function from $2^V$ to $\\mathbb{R}^V$.\n\nDefinition 2.1 Let $\\hat{S} : 2^V \\to \\mathbb{R}$ be a set function with $\\hat{S}(\\emptyset) = 0$. Let $f \\in \\mathbb{R}^V$ be ordered in increasing order $f_1 \\le f_2 \\le \\ldots \\le f_n$ and define $C_i = \\{j \\in V \\mid f_j > f_i\\}$ where $C_0 = V$. Then $S : \\mathbb{R}^V \\to \\mathbb{R}$ given by\n\n$$S(f) = \\sum_{i=1}^{n} f_i \\left( \\hat{S}(C_{i-1}) - \\hat{S}(C_i) \\right) = \\sum_{i=1}^{n-1} \\hat{S}(C_i) (f_{i+1} - f_i) + f_1 \\hat{S}(V)$$\n\nis called the Lovász extension of $\\hat{S}$. Note that $S(\\mathbf{1}_A) = \\hat{S}(A)$ for all $A \\subset V$.\n\nNote that for symmetric set functions $\\hat{S}$, that is $\\hat{S}(A) = \\hat{S}(\\overline{A})$ for all $A \\subset V$, the property $\\hat{S}(\\emptyset) = 0$ implies $\\hat{S}(V) = 0$. 
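The sum in Definition 2.1 can be evaluated directly once the entries of f are sorted. The following is a minimal sketch, not taken from the paper: the names lovasz_extension and make_cheeger_balance are hypothetical, the set function is passed as a Python callable on frozensets, and unit vertex weights are assumed. It checks numerically that S agrees with the set function on an indicator vector, and that for the balancing term min(|A|, n - |A|) the result coincides with the l1 distance to the median.

```python
import numpy as np

def lovasz_extension(set_fn, f):
    # Evaluate S(f) = sum_i f_i * (S_hat(C_{i-1}) - S_hat(C_i)) as in Definition 2.1.
    # set_fn maps a frozenset of vertex indices to a float and must return 0 on the empty set.
    f = np.asarray(f, dtype=float)
    order = np.argsort(f)  # vertex indices sorted by increasing value of f
    value = 0.0
    for k in range(len(f)):
        upper_incl = frozenset(order[k:].tolist())      # plays the role of C_{i-1}
        upper_excl = frozenset(order[k + 1:].tolist())  # plays the role of C_i
        value += f[order[k]] * (set_fn(upper_incl) - set_fn(upper_excl))
    return value

def make_cheeger_balance(n):
    # Balancing term S_hat(A) = min(|A|, n - |A|) for unit vertex weights (assumption).
    return lambda A: float(min(len(A), n - len(A)))

n = 5
balance = make_cheeger_balance(n)
indicator = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # indicator vector of A = {0, 2}
f = np.array([3.0, -1.0, 0.5, 2.0, 7.0])

print(lovasz_extension(balance, indicator))  # value of S on the indicator of A
print(lovasz_extension(balance, f))          # value of S on a generic vector
print(np.abs(f - np.median(f)).sum())        # l1 distance to the median, for comparison
```

Ties in f need no special handling in this sketch because the contributions of equal entries telescope.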
A particularly interesting class of set functions are the submodular set functions, as their Lovász extension is convex.\n\nDefinition 2.2 A set function $\\hat{F} : 2^V \\to \\mathbb{R}$ is submodular if for all $A, B \\subset V$,\n\n$$\\hat{F}(A \\cup B) + \\hat{F}(A \\cap B) \\le \\hat{F}(A) + \\hat{F}(B).$$\n\n$\\hat{F}$ is called strictly submodular if the inequality is strict whenever $A \\not\\subseteq B$ and $B \\not\\subseteq A$.\n\nNote that symmetric submodular set functions are always non-negative, as for all $A \\subset V$,\n\n$$2 \\hat{F}(A) = \\hat{F}(A) + \\hat{F}(\\overline{A}) \\ge \\hat{F}(A \\cup \\overline{A}) + \\hat{F}(A \\cap \\overline{A}) = \\hat{F}(V) + \\hat{F}(\\emptyset) = 0.$$\n\nAn important class of set functions for clustering are cardinality-based set functions.\n\nProposition 2.1 ([4]) Let $s \\in \\mathbb{R}^V_+$ and let $g : \\mathbb{R}_+ \\to \\mathbb{R}$ be a concave function, then $\\hat{F} : A \\mapsto g(s(A))$, with $s(A) = \\sum_{i \\in A} s_i$, is submodular. If $\\hat{F} : A \\mapsto g(s(A))$ is submodular for all $s \\in \\mathbb{R}^V_+$, then $g$ is concave.\n\nThe following properties hold for the Lovász extension.\n\nProposition 2.2 ([11, 4]) Let $S : \\mathbb{R}^V \\to \\mathbb{R}$ be the Lovász extension of $\\hat{S} : 2^V \\to \\mathbb{R}$ with $\\hat{S}(\\emptyset) = 0$.\n\n• $\\hat{S}$ is submodular if and only if $S$ is convex,\n• $S$ is positively one-homogeneous,\n• $S(f) \\ge 0$, $\\forall f \\in \\mathbb{R}^V$ and $S(\\mathbf{1}) = 0$ if and only if $\\hat{S}(A) \\ge 0$, $\\forall A \\subset V$ and $\\hat{S}(V) = 0$,\n• $S(f + \\alpha \\mathbf{1}) = S(f)$ for all $f \\in \\mathbb{R}^V$, $\\alpha \\in \\mathbb{R}$ if and only if $\\hat{S}(V) = 0$,\n• $S$ is even, if $\\hat{S}$ is symmetric.\n\nOne might wonder if the Lovász extension of all submodular set functions generates the set of all positively one-homogeneous convex functions. This is not the case, as already Lovász [19] gave a counter-example. In the next section we will be interested in the class of positively one-homogeneous, even, convex functions $S$ with $S(f + \\alpha \\mathbf{1}) = S(f)$ for all $f \\in \\mathbb{R}^V$. 
From the above proposition we deduce that these properties are fulfilled for the Lovász extension of any symmetric, submodular set function. However, also for this special class there exists a counter-example. Take\n\n$$S(f) = \\left\\lVert f - \\frac{1}{|V|} \\langle f, \\mathbf{1} \\rangle \\mathbf{1} \\right\\rVert_\\infty.$$\n\nIt fulfills all the stated conditions but it induces the set function $\\hat{S}(A) := S(\\mathbf{1}_A)$ given as\n\n$$\\hat{S}(A) = \\begin{cases} \\frac{1}{|V|} \\max\\{|A|, |V \\setminus A|\\}, & 0 < |A| < |V|, \\\\\\\\ 0, & \\text{else.} \\end{cases}$$\n\nIt is easy to check that this function is not submodular. Thus different convex one-homogeneous functions can induce the same set function via $\\hat{S}(A) := S(\\mathbf{1}_A)$.\n\nIt is known [15] that a large class of functions, e.g. every $f \\in C^2(\\mathbb{R}^n)$, can be written as a difference of convex functions. As submodular functions correspond to convex functions in the sense of the Lovász extension, one can ask if the same result holds for set functions: Is every set function a difference of submodular set functions? The following result has been reported in [21]. As some properties assumed in the proof in [21] do not hold, we give an alternative constructive proof.\n\nProposition 2.3 Every set function $\\hat{S} : 2^V \\to \\mathbb{R}$ can be written as the difference of two submodular functions. The corresponding Lovász extension $S : \\mathbb{R}^V \\to \\mathbb{R}$ can be written as a difference of convex functions.\n\nNote that the proof of Proposition 2.3 is constructive. 
Thus we can always find the decomposition of the set function into a difference of two submodular functions and thus also the decomposition of its Lovász extension into a difference of convex functions.\n\n3 Tight Relaxations of Balanced Graph Cuts\n\nIn graph-based clustering a popular criterion to partition the graph is to minimize the cut $\\mathrm{cut}(A, \\overline{A})$, defined as\n\n$$\\mathrm{cut}(A, \\overline{A}) = \\sum_{i \\in A,\\, j \\in \\overline{A}} w_{ij},$$\n\nwhere $(w_{ij}) \\in \\mathbb{R}^{|V| \\times |V|}$ are the non-negative, symmetric weights of the undirected graph $G = (V, W)$, usually interpreted as similarities of vertices $i$ and $j$. Direct minimization of the cut typically leads to very unbalanced partitions, where often just a single vertex is split off. Therefore one has to introduce a balancing term which biases the criterion towards balanced partitions. Two popular balanced graph cut criteria are the Cheeger cut $\\mathrm{RCC}(A, \\overline{A})$ and the ratio cut $\\mathrm{RCut}(A, \\overline{A})$,\n\n$$\\mathrm{RCC}(A, \\overline{A}) = \\frac{\\mathrm{cut}(A, \\overline{A})}{\\min\\{|A|, |\\overline{A}|\\}}, \\qquad \\mathrm{RCut}(A, \\overline{A}) = |V| \\frac{\\mathrm{cut}(A, \\overline{A})}{|A| |\\overline{A}|} = \\mathrm{cut}(A, \\overline{A}) \\left( \\frac{1}{|A|} + \\frac{1}{|\\overline{A}|} \\right).$$\n\nWe consider later on also their normalized versions. Spectral clustering is derived as relaxation of the ratio cut criterion based on the second eigenvector of the graph Laplacian. While the second eigenvector can be efficiently computed, it is well-known that this relaxation is far from being tight. In particular there exist graphs where the spectral relaxation is as bad [12] as the isoperimetric inequality suggests [1]. 
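To make the cut, Cheeger cut and ratio cut criteria concrete, here is a small sketch on a toy graph. The graph (two triangles joined by a single edge, all weights 1) and the function names cut_value, cheeger_cut and ratio_cut are illustrative assumptions and do not appear in the paper. The example compares a balanced partition with a split of a single vertex.

```python
import numpy as np

def cut_value(W, A):
    # cut(A, complement of A): total weight of the edges leaving A.
    in_A = np.zeros(W.shape[0], dtype=bool)
    in_A[list(A)] = True
    return float(W[np.ix_(in_A, ~in_A)].sum())

def cheeger_cut(W, A):
    # RCC = cut / min(|A|, |complement of A|), for a non-empty proper subset A.
    n, size_A = W.shape[0], len(set(A))
    return cut_value(W, A) / min(size_A, n - size_A)

def ratio_cut(W, A):
    # RCut = |V| * cut / (|A| * |complement of A|), for a non-empty proper subset A.
    n, size_A = W.shape[0], len(set(A))
    return n * cut_value(W, A) / (size_A * (n - size_A))

# Toy graph (assumption): two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

balanced = {0, 1, 2}
single_vertex = {0}
for name, A in [('balanced', balanced), ('single vertex', single_vertex)]:
    print(name, cut_value(W, A), cheeger_cut(W, A), ratio_cut(W, A))
```

On this particular graph both balanced criteria prefer the split into the two triangles over splitting off vertex 0.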
In a recent line of work [6, 26, 14] it has been shown that a tight relaxation for the Cheeger cut can be achieved by moving from the linear eigenproblem to a nonlinear eigenproblem associated with the nonlinear graph 1-Laplacian [14].\n\nIn this work we generalize this result considerably by showing in Theorem 3.1 that a tight relaxation exists for every balanced graph cut measure which is of the form cut divided by balancing term. More precisely, let $\\hat{S} : 2^V \\to \\mathbb{R}$ be a symmetric non-negative set function. Then a balanced graph cut criterion $\\phi : 2^V \\to \\mathbb{R}_+$ of a partition $(A, \\overline{A})$ has the form\n\n$$\\phi(A) := \\frac{\\mathrm{cut}(A, \\overline{A})}{\\hat{S}(A)}. \\qquad (1)$$\n\nAs we consider undirected graphs, the cut is a symmetric set function and thus $\\phi(A) = \\phi(\\overline{A})$. In order to get a balanced graph cut, $\\hat{S}$ is typically chosen as a function of $|A|$ (or some other type of volume) which is monotonically increasing on $[0, |V|/2]$. The first part of the theorem showing the equivalence of combinatorial and continuous problem is motivated by a result derived by Rothaus in [25] in the context of isoperimetric inequalities on Riemannian manifolds. It has been transferred to graphs by Tillich and independently by Houdré in [27, 16]. We generalize their result further so that it now holds for all possible non-negative symmetric set functions. 
In order to establish the link to the result of Rothaus, we first state the following characterization.\n\nLemma 3.1 A function $S : \\mathbb{R}^V \\to \\mathbb{R}$ is positively one-homogeneous, even, convex and $S(f + \\alpha \\mathbf{1}) = S(f)$ for all $f \\in \\mathbb{R}^V$, $\\alpha \\in \\mathbb{R}$ if and only if $S(f) = \\sup_{u \\in U} \\langle u, f \\rangle$ where $U \\subset \\mathbb{R}^n$ is a closed symmetric convex set and $\\langle u, \\mathbf{1} \\rangle = 0$ for any $u \\in U$.\n\nTheorem 3.1 Let $G = (V, E)$ be a finite, weighted undirected graph and $S : \\mathbb{R}^V \\to \\mathbb{R}$ and let $\\hat{S} : 2^V \\to \\mathbb{R}$ be symmetric with $\\hat{S}(\\emptyset) = 0$, then\n\n$$\\inf_{f \\in \\mathbb{R}^V} \\frac{\\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |f_i - f_j|}{S(f)} = \\inf_{A \\subset V} \\frac{\\mathrm{cut}(A, \\overline{A})}{\\hat{S}(A)},$$\n\nif either one of the following two conditions holds\n\n1. $S$ is positively one-homogeneous, even, convex and $S(f + \\alpha \\mathbf{1}) = S(f)$ for all $f \\in \\mathbb{R}^V$, $\\alpha \\in \\mathbb{R}$ and $\\hat{S}$ is defined as $\\hat{S}(A) := S(\\mathbf{1}_A)$ for all $A \\subset V$.\n\n2. $S$ is the Lovász extension of the non-negative, symmetric set function $\\hat{S}$ with $\\hat{S}(\\emptyset) = 0$.\n\nLet $f \\in \\mathbb{R}^V$ and denote by $C_t := \\{i \\in V \\mid f_i > t\\}$, then it holds under both conditions,\n\n$$\\min_{t \\in \\mathbb{R}} \\frac{\\mathrm{cut}(C_t, \\overline{C_t})}{\\hat{S}(C_t)} \\le \\frac{\\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |f_i - f_j|}{S(f)}.$$\n\nTheorem 3.1 can be generalized by replacing the cut with an arbitrary other set function. However, the emphasis of this paper is to use the new degree of freedom for balanced graph clustering. The more general approach will be discussed elsewhere. Note that the first condition in Theorem 3.1 implies that $\\hat{S}$ is symmetric as\n\n$$\\hat{S}(A) = S(\\mathbf{1}_A) = S(-\\mathbf{1}_A) = S(\\mathbf{1} - \\mathbf{1}_A) = S(\\mathbf{1}_{\\overline{A}}) = \\hat{S}(\\overline{A}).$$\n\nMoreover, $\\hat{S}$ is non-negative with $\\hat{S}(\\emptyset) = \\hat{S}(V) = 0$ as $S$ is even, convex and positively one-homogeneous. For the second condition note that by Proposition 2.3 the Lovász extension of any set function can be written as a difference of convex (d.c.) functions. 
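The thresholding inequality of Theorem 3.1 can be checked numerically on a small example. The sketch below is illustrative only: the toy graph, the vector f and the names total_variation and optimal_threshold are assumptions, and the balancing term is fixed to min(|A|, n - |A|) with unit vertex weights, whose Lovász extension is the l1 distance to the median.

```python
import numpy as np

def total_variation(W, f):
    # (1/2) * sum_{i,j} w_ij * |f_i - f_j| for a symmetric weight matrix W.
    f = np.asarray(f, dtype=float)
    return 0.5 * float(np.sum(W * np.abs(f[:, None] - f[None, :])))

def optimal_threshold(W, f, balance):
    # Scan all level sets C_t = {i : f_i > t} and return the best ratio cut / balance.
    f = np.asarray(f, dtype=float)
    best_ratio, best_set = np.inf, None
    for t in np.unique(f)[:-1]:  # dropping the largest value keeps C_t non-empty
        in_C = f > t
        denom = balance(in_C)
        if denom <= 0:
            continue  # ratio undefined or infinite, skip this level set
        ratio = float(W[np.ix_(in_C, ~in_C)].sum()) / denom
        if ratio < best_ratio:
            best_ratio = ratio
            best_set = frozenset(np.flatnonzero(in_C).tolist())
    return best_ratio, best_set

def cheeger_balance(in_C):
    # S_hat(C) = min(|C|, |complement of C|), unit vertex weights (assumption).
    return float(min(in_C.sum(), (~in_C).sum()))

# Toy graph (assumption): two triangles joined by the edge (2, 3), all weights 1.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

f = np.array([0.9, 1.0, 0.7, -0.2, -0.5, -0.6])
continuous_ratio = total_variation(W, f) / np.abs(f - np.median(f)).sum()
best_ratio, best_set = optimal_threshold(W, f, cheeger_balance)
print(continuous_ratio, best_ratio, sorted(best_set))
```

In this example the best level set is one of the two triangles, and its Cheeger cut value does not exceed the value of the continuous ratio at f.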
As the total variation term in the numerator is convex, we thus have to minimize a ratio of a convex and a d.c. function. The efficient minimization of such problems will be the topic of the next section.\n\nWe would like to point out a related line of work for the case where the balancing term $\\hat{S}$ is submodular and the balanced graph cut measure is directly optimized using submodular minimization techniques. In [23] this idea is proposed for the ratio cut and subsequently generalized [22, 17] so that every submodular balancing function $\\hat{S}$ can be used. While the general framework is appealing, it is unclear if the minimization can be done efficiently. Moreover, note that Theorem 3.1 goes well beyond the case where $\\hat{S}$ is submodular.\n\n3.1 Examples of Balancing Set Functions\n\nTheorem 3.1 opens up new modeling possibilities for clustering based on balanced graph cuts. We discuss in the experiments differences and properties of the individual balancing terms. However, it is out of the scope of this paper to answer the question which balancing term is the “best”. An answer to such a question is likely to be application-dependent. However, for a given random graph model it might be possible to suggest a suitable balancing term given one knows how cut and volume behave. A first step in this direction has been done in [20] where the limit of cut and volume has been discussed for different neighborhood graph types.\n\nIn the following we assume that we work with graphs which have non-negative edge weights $W = (w_{ij})$ and non-negative vertex weights $e : V \\to \\mathbb{R}_+$. The volume $\\mathrm{vol}(A)$ of a set $A \\subset V$ is defined as $\\mathrm{vol}(A) = \\sum_{i \\in A} e_i$. The volume reduces to the cardinality if $e_i = 1$ for all $i \\in V$ (unnormalized case) or to the volume considered in the normalized cut, $\\mathrm{vol}(A) = \\sum_{i \\in A} d_i$, for $e_i = d_i$ for all $i \\in V$ (normalized case), where $d_i$ is the degree of vertex $i$. We denote by $E$ the diagonal matrix with $E_{ii} = e_i$, $i = 1, \\ldots, n$. Using general vertex weights allows us to present the unnormalized and normalized case in a unified framework. Moreover, general vertex weights allow more modeling freedom, e.g. one can give two different vertices very large vertex weights and so implicitly enforce that they will be in different partitions.\n\nName | $S(f)$ | $\\hat{S}(A)$\nCheeger p-cut | $\\left( \\sum_{i=1}^n e_i \\lvert f_i - \\mathrm{wmean}_p(f) \\rvert^p \\right)^{1/p}$ | $\\left( \\mathrm{vol}(A)\\, \\mathrm{vol}(\\overline{A}) \\right)^{1/p} \\Big/ \\left( \\mathrm{vol}(A)^{\\frac{1}{p-1}} + \\mathrm{vol}(\\overline{A})^{\\frac{1}{p-1}} \\right)^{1 - \\frac{1}{p}}$\nNormalized p-cut | $\\left( \\sum_{i=1}^n e_i \\left\\lvert f_i - \\frac{\\langle e, f \\rangle}{\\mathrm{vol}(V)} \\right\\rvert^p \\right)^{1/p}$ | $\\left( \\mathrm{vol}(A)\\, \\mathrm{vol}(\\overline{A})^p + \\mathrm{vol}(A)^p\\, \\mathrm{vol}(\\overline{A}) \\right)^{1/p} \\Big/ \\mathrm{vol}(V)$\nTrunc. Cheeger cut | $g_{\\max,\\alpha}(f) - g_{\\min,\\alpha}(f)$ | $\\mathrm{vol}(A)$ if $\\mathrm{vol}(A) \\le \\alpha\\, \\mathrm{vol}(V)$; $\\mathrm{vol}(\\overline{A})$ if $\\mathrm{vol}(\\overline{A}) \\le \\alpha\\, \\mathrm{vol}(V)$; $\\alpha\\, \\mathrm{vol}(V)$ else\nHard balanced cut | $g_{\\max, K/\\lvert V \\rvert}(f) - g_{\\min, K/\\lvert V \\rvert}(f) - \\left( g_{\\max, (K-1)/\\lvert V \\rvert}(f) - g_{\\min, (K-1)/\\lvert V \\rvert}(f) \\right)$ | $1$ if $\\min\\{\\lvert A \\rvert, \\lvert \\overline{A} \\rvert\\} \\ge K$; $0$ else\nHard Cheeger cut | $\\lVert f - \\mathrm{median}(f) \\mathbf{1} \\rVert_1 - \\left( g_{\\max, (K-1)/\\lvert V \\rvert}(f) - g_{\\min, (K-1)/\\lvert V \\rvert}(f) \\right)$ | $0$ if $\\min\\{\\lvert A \\rvert, \\lvert \\overline{A} \\rvert\\} < K$; $\\min\\{\\lvert A \\rvert, \\lvert \\overline{A} \\rvert\\} - (K - 1)$ else\n\nTable 1: Examples of balancing set functions and their continuous counterpart. For the hard balanced and hard Cheeger cut we have unit vertex weights, that is $e_i \\equiv 1$.\n\nWe report here the Lovász extension of two important set functions which will be needed in the sequel. 
For that we define the functions $g_{\\max,\\alpha}$ and $g_{\\min,\\alpha}$ as:\n\n$$g_{\\max,\\alpha}(f) = \\max\\Big\\{ \\langle \\rho, f \\rangle \\;\\Big|\\; 0 \\le \\rho_i \\le e_i,\\ \\forall\\, i = 1, \\ldots, n,\\ \\sum_{i=1}^n \\rho_i = \\alpha\\, \\mathrm{vol}(V) \\Big\\},$$\n\n$$g_{\\min,\\alpha}(f) = \\min\\Big\\{ \\langle \\rho, f \\rangle \\;\\Big|\\; 0 \\le \\rho_i \\le e_i,\\ \\forall\\, i = 1, \\ldots, n,\\ \\sum_{i=1}^n \\rho_i = \\alpha\\, \\mathrm{vol}(V) \\Big\\},$$\n\nand the weighted p-mean $\\mathrm{wmean}_p(f)$ is defined as $\\mathrm{wmean}_p(f) = \\operatorname{arg\\,min}_{a \\in \\mathbb{R}} \\sum_{i=1}^n e_i |f_i - a|^p$. Note that $g_{\\max,\\alpha}$ is convex, whereas $g_{\\min,\\alpha}$ is concave. Both functions can be easily evaluated by sorting the componentwise product $e_i f_i$.\n\nProposition 3.1 Let $\\hat{S} : 2^V \\to \\mathbb{R}$, $\\hat{S}(A) := \\min\\{\\mathrm{vol}(A), \\mathrm{vol}(\\overline{A})\\}$. Then the Lovász extension $S : \\mathbb{R}^V \\to \\mathbb{R}$ is given by $S(f) = \\lVert E(f - \\mathrm{wmean}_1(f) \\mathbf{1}) \\rVert_1$.\n\nLet $e_i = 1$, $\\forall i \\in V$ and $\\hat{S} : 2^V \\to \\mathbb{R}$,\n\n$$\\hat{S}(A) := \\begin{cases} \\min\\{|A|, |\\overline{A}|\\}, & \\text{if } \\min\\{|A|, |\\overline{A}|\\} \\le K, \\\\\\\\ K, & \\text{else.} \\end{cases}$$\n\nThen the Lovász extension $S : \\mathbb{R}^V \\to \\mathbb{R}$ is given as $S(f) = g_{\\max, K/|V|}(f) - g_{\\min, K/|V|}(f)$.\n\nIn Table 1 we collect a set of interesting set functions enforcing different levels of balancing. For the Cheeger and Normalized p-cut family and the truncated Cheeger cut the functions $S$ are convex and not necessarily the Lovász extension of the induced set functions $\\hat{S}$ (first case in Theorem 3.1). In the case of hard balanced and hard Cheeger cut the set function $\\hat{S}$ is not submodular. However, in both cases we know an explicit decomposition of the set function $\\hat{S}$ into a difference of submodular functions and thus their Lovász extension $S$ can be written as a difference of convex functions. The derivations can be found in the supplementary material.\n\n4 Minimization of Ratios of Non-negative Differences of Convex Functions\n\nIn [14], the problem of computing the optimal Cheeger cut partition is formulated as a nonlinear eigenproblem. 
Hein and Bühler show that the second eigenvector of the nonlinear 1-graph Laplacian is equal to the indicator function of the optimal partition. In Theorem 3.1, we have generalized this relation considerably. In this section, we discuss the efficient computation of critical points of the continuous ratios of Theorem 3.1. We propose a general scheme called RatioDCA for minimizing ratios of non-negative differences of convex functions, which thus generalizes Algorithm 1 of [14], which could handle only ratios of convex functions. As the optimization problem is non-smooth and non-convex, only convergence to critical points can be guaranteed. However, we will show that for every balanced graph cut criterion our algorithm improves a given partition or it terminates directly. Note that such types of algorithms have been considered for specific graph cut criteria [23, 22, 2].\n\nFigure 1: Left: Illustration of different balancing functions (rescaled so that they attain value $|V|/2$ at $|V|/2$). Right: Log-log plot of the duality gap of the inner problem vs. the number of iterations of PDHG (dashed) and FISTA (solid) in outer iterations 3 (black), 5 (blue) and 7 (red) of RatioDCA corresponding to increasing difficulty of the problem. PDHG significantly outperforms FISTA.\n\n4.1 General Scheme\n\nThe continuous optimization problem in Theorem 3.1 has the form\n\n$$\\min_{f \\in \\mathbb{R}^V} \\frac{\\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |f_i - f_j|}{S(f)}, \\qquad (2)$$\n\nwhere $S$ is one-homogeneous and either convex or the Lovász extension of a non-negative symmetric set function. By Proposition 2.3 the Lovász extension of any set function can be written as a difference of one-homogeneous convex functions. Using the third property of Proposition 2.2 the Lovász extension $S$ is non-negative, that is $S(f) \\ge 0$ for all $f \\in \\mathbb{R}^V$. 
With the algorithm RatioDCA below, we provide a general scheme for the minimization of a ratio $F(f) := R(f)/S(f)$, where $R$ and $S$ are non-negative and one-homogeneous and each can be written as a difference of convex functions: $R(f) = R_1(f) - R_2(f)$ and $S(f) = S_1(f) - S_2(f)$ with $R_1, R_2, S_1, S_2$ being convex. In our setting $R(f) = R_1(f) = \\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |f_i - f_j|$. We refer to the convex optimization problem which has to be solved at each step in RatioDCA (line 4) as the inner problem.\n\nAlgorithm RatioDCA: Minimization of a non-negative ratio of 1-homogeneous d.c. functions\n1: Initialization: $f^0$ = random with $\\lVert f^0 \\rVert = 1$, $\\lambda^0 = F(f^0)$\n2: repeat\n3: $s_1(f^k) \\in \\partial S_1(f^k)$, $r_2(f^k) \\in \\partial R_2(f^k)$\n4: $f^{k+1} = \\operatorname{arg\\,min}_{\\lVert u \\rVert_2 \\le 1} \\left\\{ R_1(u) - \\langle u, r_2(f^k) \\rangle + \\lambda^k \\left( S_2(u) - \\langle u, s_1(f^k) \\rangle \\right) \\right\\}$\n5: $\\lambda^{k+1} = (R_1(f^{k+1}) - R_2(f^{k+1})) / (S_1(f^{k+1}) - S_2(f^{k+1}))$\n6: until $\\frac{|\\lambda^{k+1} - \\lambda^k|}{\\lambda^k} < \\epsilon$\n7: Output: eigenvalue $\\lambda^{k+1}$ and eigenvector $f^{k+1}$.\n\nProposition 4.1 The sequence $f^k$ produced by RatioDCA satisfies $F(f^k) > F(f^{k+1})$ for all $k \\ge 0$ or the sequence terminates.\n\nThe sequence $F(f^k)$ is not only monotonically decreasing, but the cluster points of $f^k$ are also generalized nonlinear eigenvectors as introduced in [14].\n\nTheorem 4.1 Each cluster point $f^*$ of the sequence $f^k$ produced by RatioDCA is a nonlinear eigenvector with eigenvalue $\\lambda^* = \\frac{R(f^*)}{S(f^*)} \\in \\left[0, F(f^0)\\right]$ in the sense that it fulfills\n\n$$0 \\in \\partial R_1(f^*) - \\partial R_2(f^*) - \\lambda^* \\left( \\partial S_1(f^*) - \\partial S_2(f^*) \\right).$$\n\nIf $S_1 - S_2$ is continuously differentiable at $f^*$, then $F$ has a critical point at $f^*$.\n\nIn the balanced graph cut problem (2) we minimize implicitly over non-constant functions. 
Thus it is important to guarantee that the RatioDCA for this particular problem always converges to a non-constant vector.\n\nLemma 4.1 For every balanced graph cut problem, the RatioDCA converges to a non-constant $f^*$ given that the initial vector $f^0$ is non-constant.\n\nNow we are ready to state the following key property of our balanced graph clustering algorithm.\n\nTheorem 4.2 Let $(A, \\overline{A})$ be a given partition of $V$ and let $S : \\mathbb{R}^V \\to \\mathbb{R}_+$ satisfy one of the conditions stated in Theorem 3.1. If one uses as initialization of RatioDCA, $f^0 = \\mathbf{1}_A$, then either RatioDCA terminates after one step or it yields an $f^1$ which after optimal thresholding as in Theorem 3.1 gives a partition $(B, \\overline{B})$ which satisfies\n\n$$\\frac{\\mathrm{cut}(B, \\overline{B})}{\\hat{S}(B)} < \\frac{\\mathrm{cut}(A, \\overline{A})}{\\hat{S}(A)}.$$\n\nThe above “improvement theorem” implies that we can use the result of any other graph partitioning method as initialization. In particular, we can always improve the result of spectral clustering.\n\n4.2 Solution of the Convex Inner Optimization Problems\n\nThe performance of RatioDCA depends heavily on how fast we can solve the corresponding inner problem. We propose to use a primal-dual algorithm for the inner problem and show experimentally that this approach yields faster convergence than the FISTA method of [5] which was applied in [14]. Let us restrict our attention to the case where $R(f) = R_1(f) = \\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |f_i - f_j|$ and $S_2 = 0$. In other words, we apply the RatioDCA algorithm to (2) with $S = S_1$ which is what we need, e.g., for the tight relaxations of the Cheeger cut, normalized cut and truncated Cheeger cut families. 
Hence, the inner problem of the RatioDCA algorithm (line 4) has the form\n\n$$f^{k+1} = \\operatorname{arg\\,min}_{\\lVert u \\rVert_2 \\le 1} \\Big\\{ \\frac{1}{2} \\sum_{i,j=1}^n w_{ij} |u_i - u_j| - \\lambda^k \\langle u, s_1(f^k) \\rangle \\Big\\}. \\qquad (3)$$\n\nRecently, Arrow-Hurwicz-type primal-dual algorithms have become popular, e.g., in image processing, to solve problems whose objective function consists of the sum of convex terms, cf., e.g., [10, 7]. We propose to use the following primal-dual algorithm of [7] where it is referred to as Algorithm 2. We call this method a primal-dual hybrid gradient algorithm (PDHG) here since this term is used for similar algorithms in the literature. Note that the operator $P_{\\lVert \\cdot \\rVert_\\infty \\le 1}$ in the first step is the componentwise projection onto the interval $[-1, 1]$. For the sake of readability, we define the linear operator $B : \\mathbb{R}^V \\to \\mathbb{R}^E$ by $Bu = (w_{ij}(u_i - u_j))_{i,j=1}^n$ and its transpose is then $B^T \\beta = \\left( \\sum_{j=1}^n w_{ij} (\\beta_{i,j} - \\beta_{j,i}) \\right)_{i=1}^n$.\n\nAlgorithm PDHG: Solution of the inner problem of RatioDCA for (2) and $S$ convex\n1: Initialization: $u^0$, $\\bar{u}^0$, $\\beta^0 = 0$, $\\gamma, \\sigma_0, \\tau_0 > 0$ with $\\sigma_0 \\tau_0 \\le 1 / \\lVert B \\rVert_2^2$\n2: repeat\n3: $\\beta^{l+1} = P_{\\lVert \\cdot \\rVert_\\infty \\le 1}(\\beta^l + \\sigma_l B \\bar{u}^l)$\n4: $u^{l+1} = \\frac{1}{1 + \\tau_l} \\left( u^l - \\tau_l (B^T \\beta^{l+1} - 2 \\lambda^k s_1(f^k)) \\right)$\n5: $\\theta_l = 1 / \\sqrt{1 + 2 \\gamma \\tau_l}$, $\\tau_{l+1} = \\theta_l \\tau_l$, $\\sigma_{l+1} = \\sigma_l / \\theta_l$\n6: $\\bar{u}^{l+1} = u^{l+1} + \\theta_l (u^{l+1} - u^l)$\n7: until duality gap $< \\epsilon$\n8: Output: $f^{k+1} \\approx u^{l+1} / \\lVert u^{l+1} \\rVert_2$\n\nAlthough PDHG and FISTA have the same guaranteed convergence rates of $O(1/l^2)$, our experiments show that for clustering applications, PDHG can outperform FISTA substantially. In Fig. 1, we illustrate this difference on a toy problem. 
Note that a single step takes about the same computation time for both algorithms so that the number of iterations is a valid criterion for comparison. In the supplementary material, we also consider the inner problem of RatioDCA for the tight relaxation of the hard balanced cut. Although in this case we have to deal with $S_2 \\neq 0$ in the inner problem of RatioDCA, we can derive a similar PDHG method since the objective function is still a sum of convex terms.\n\n5 Experiments\n\nIn a first experiment, we study the influence of the different balancing criteria on the obtained clustering. The data is a Gaussian mixture in $\\mathbb{R}^{20}$ where the projection onto the first two dimensions is shown in Figure 2; the remaining 18 dimensions are just noise. The 2000 points are distributed over the three mixture components as [1200, 600, 200]. A symmetric k-NN graph with $k = 20$ is built with Gaussian weights $e^{-\\frac{2 \\lVert x - y \\rVert^2}{\\max\\{\\sigma_{x,k}^2, \\sigma_{y,k}^2\\}}}$ where $\\sigma_{x,k}$ is the k-NN distance of point $x$.\n\nFigure 2: From left to right: Cheeger 1-cut, Normalized 1-cut, truncated Cheeger cut (TCC), hard balanced cut (HBC), hard Cheeger cut (HCC). The criteria are the normalized ones, i.e., the vertex weights are $e_i = d_i$.\n\nFor better interpretation, we report all resulting partitions with respect to all balanced graph cut criteria, cut and the size of the largest component in the following table. The parameter for truncated, hard Cheeger cut and hard balanced cut is set to $K = 200$. One observes that the normalized 1-cut results in a less balanced partition but with a much smaller cut than the Cheeger 1-cut, which is itself less balanced than the hard Cheeger cut. The latter is fully balanced but has an even higher cut. The truncated Cheeger cut has a smaller cut than the hard balanced cut but its partition is not feasible. 
Note that the hard balanced cut is similar to the normalized 1-cut but achieves a smaller cut at the price of a larger maximal component. Thus, the example nicely shows how the different balancing criteria influence the final partition.\n\nCriterion \\ Obj. | Cut | $\\max\\{\\lvert A \\rvert, \\lvert \\overline{A} \\rvert\\}$ | Ch. 1-cut | N. 1-cut | TCC200 | HBC200 | HCC200\nCheeger 1-cut | 408.4 | 1301 | 0.099 | 0.079 | 2.042 | 408.4 | 0.817\nNorm. 1-cut | 178.3 | 1775 | 0.132 | 0.075 | 0.892 | 178.3 | 6.858\nTrunc. Ch. cut | 153.6 | 1945 | 0.513 | 0.263 | 0.768 | ∞ | ∞\nHard bal. cut | 175.4 | 1785 | 0.134 | 0.076 | 0.877 | 175.4 | 10.96\nHard Ch. cut | 639.2 | 1000 | 0.119 | 0.115 | 3.196 | 639.2 | 0.798\n\nNext we perform unnormalized 1-spectral clustering on the full USPS, normal and extended¹ MNIST datasets (resp. 9298, 70000 and 630000 points) in the same setting as in [14] with no vertex weights, that is $e_i = 1$, $\\forall i \\in V$. As clustering criterion for multi-partitioning we use the multicut version of the normalized 1-cut, given as $\\mathrm{RCut}(C_1, \\ldots, C_M) = \\sum_{i=1}^M \\frac{\\mathrm{cut}(C_i, \\overline{C_i})}{|C_i|}$. We successively subdivide clusters until the desired number of clusters ($M = 10$) is reached. This recursive partitioning scheme is used for all methods. In [14] the Cheeger 1-cut has been used which is not compatible with the multi-cut criterion. We expect that using the normalized 1-cut for the bipartitioning steps we should get better results. The results of the other methods for USPS and MNIST (normal) are taken from [14]. Each bipartitioning step is initialized randomly. Out of 100 obtained multi-partitionings we report the results of the best clustering with respect to the multi-cut criterion. The next table shows the obtained RCut and errors.\n\nDataset (Vertices/Edges) | Measure | N. 1-cut | Ch. 1-cut [14] | S.&B. [26] | 1.1-SCl [6] | Standard spectral\nUSPS (9K/272K) | Rcut | 0.6629 | 0.6661 | 0.6663 | 0.6676 | 0.8180\nUSPS (9K/272K) | Error | 0.1301 | 0.1349 | 0.1309 | 0.1308 | 0.1686\nMNIST (Normal) (70K/1043K) | Rcut | 0.1499 | 0.1507 | 0.1545 | 0.1529 | 0.2252\nMNIST (Normal) (70K/1043K) | Error | 0.1236 | 0.1244 | 0.1318 | 0.1293 | 0.1883\nMNIST (Ext) (630K/9192K) | Rcut | 0.0996 | 0.0997 | – | – | 0.1594\nMNIST (Ext) (630K/9192K) | Error | 0.1180 | 0.1223 | – | – | 0.2297\n\nWe see for all datasets improvements in the obtained cut. Also a slight decrease in the obtained error can be observed. The improvements are not so drastic as the clustering is already very good. The problem is that for both datasets one digit is split (0) and two are merged (4 and 9) resulting in seemingly large errors. Similar results hold for the extended MNIST dataset. Note that the resulting error is comparable to recently reported results on semi-supervised learning [18].\n\n¹ The extended MNIST dataset is generated by translating each original input image of MNIST by one pixel (8 directions).\n\nReferences\n\n[1] N. Alon and V. D. Milman. $\\lambda_1$, isoperimetric inequalities for graphs, and superconcentrators. J. Combin. Theory Ser. B, 38(1):73–88, 1985.\n\n[2] R. Andersen and K. Lang. An algorithm for improving graph partitions. In Proc. of the 19th ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), pages 651–660, 2008.\n\n[3] S. Arora, J. R. Lee, and A. Naor. Expander flows, geometric embeddings and graph partitioning. In Proc. 36th Annual ACM Symp. on Theory of Computing (STOC), pages 222–231. ACM, 2004.\n\n[4] F. Bach. Convex analysis and optimization with submodular functions, 2010. arXiv:1010.4207v2.\n\n[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2:183–202, 2009.\n\n[6] T. Bühler and M. Hein. Spectral clustering based on the graph p-Laplacian. In L. Bottou and M. Littman, editors, Proc. of the 26th Int. Conf. on Machine Learning (ICML), pages 81–88. Omnipress, 2009.\n\n[7] A. Chambolle and T. Pock. 
A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120\u2013145, 2011.\n[8] F. Chung. Spectral Graph Theory. AMS, Providence, RI, 1997.\n[9] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 17:420\u2013425, 1973.\n[10] E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sciences, 3(4):1015\u20131046, 2010.\n[11] S. Fujishige. Submodular functions and optimization, volume 58 of Annals of Discrete Mathematics. Elsevier B. V., Amsterdam, second edition, 2005.\n[12] S. Guattery and G. L. Miller. On the quality of spectral separators. SIAM Journal on Matrix Analysis and Applications, 19:701\u2013719, 1998.\n[13] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. Proc. IEEE Intl. Conf. on Computer-Aided Design, pages 10\u201313, November 1991.\n[14] M. Hein and T. B\u00fchler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in Neural Information Processing Systems 23 (NIPS 2010), pages 847\u2013855, 2010.\n[15] J.-B. Hiriart-Urruty. Generalized differentiability, duality and optimization for problems dealing with differences of convex functions. In Convexity and duality in optimization, pages 37\u201370. 1985.\n[16] C. Houdr\u00e9. Mixed and Isoperimetric Estimates on the Log-Sobolev Constants of Graphs and Markov Chains. Combinatorica, 21:489\u2013513, 2001.\n[17] Y. Kawahara, K. Nagano, and Y. Okamoto. Submodular fractional programming for balanced clustering. Pattern Recognition Letters, 32:235\u2013243, 2011.\n[18] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proc. of the 27th Int. 
Conf. on Machine Learning (ICML), 2010.\n[19] L. Lov\u00e1sz. Submodular functions and convexity. In Mathematical programming: the state of the art (Bonn, 1982), pages 235\u2013257. Springer, Berlin, 1983.\n[20] M. Maier, U. von Luxburg, and M. Hein. Influence of graph construction on graph-based clustering measures. In Advances in Neural Information Processing Systems 21 (NIPS), pages 1025\u20131032, 2009.\n[21] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In 21st Conference on Uncertainty in Artificial Intelligence (UAI), 2005.\n[22] M. Narasimhan and J. Bilmes. Local search for balanced submodular clusterings. In 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.\n[23] S. B. Patkar and H. Narayanan. Improving graph partitions using submodular functions. Discrete Appl. Math., 131(2):535\u2013553, 2003.\n[24] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11:430\u2013452, 1990.\n[25] O. S. Rothaus. Analytic inequalities, Isoperimetric Inequalities and Logarithmic Sobolev Inequalities. Journal of Functional Analysis, 64:296\u2013313, 1985.\n[26] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In Proceedings of the 27th International Conference on Machine Learning, pages 1039\u20131046. Omnipress, 2010.\n[27] J.-P. Tillich. Edge isoperimetric inequalities for product graphs. Discrete Mathematics, 213:291\u2013320, 2000.\n[28] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395\u2013416, 2007.\n", "award": [], "sourceid": 1268, "authors": [{"given_name": "Matthias", "family_name": "Hein", "institution": null}, {"given_name": "Simon", "family_name": "Setzer", "institution": null}]}