{"title": "Tight Continuous Relaxation of the Balanced k-Cut Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 3131, "page_last": 3139, "abstract": "Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or heuristics which have weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related recently proposed relaxation is in most cases loose leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the hard sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method beats all existing approaches for ratio cut and other balanced k-cut criteria.", "full_text": "Tight Continuous Relaxation of the Balanced k-Cut\n\nProblem\n\nSyama Sundar Rangapuram, Pramod Kaushik Mudrakarta and Matthias Hein\n\nDepartment of Mathematics and Computer Science\n\nSaarland University, Saarbr\u00a8ucken\n\nAbstract\n\nSpectral Clustering as a relaxation of the normalized/ratio cut has become one of\nthe standard graph-based clustering methods. Existing methods for the compu-\ntation of multiple clusters, corresponding to a balanced k-cut of the graph, are\neither based on greedy techniques or heuristics which have weak connection to\nthe original motivation of minimizing the normalized cut. In this paper we pro-\npose a new tight continuous relaxation for any balanced k-cut problem and show\nthat a related recently proposed relaxation is in most cases loose leading to poor\nperformance in practice. 
For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.\n\n1 Introduction\n\nGraph-based techniques for clustering have become very popular in machine learning as they allow for an easy integration of pairwise relationships in data. The problem of finding k clusters in a graph can be formulated as a balanced k-cut problem [1, 2, 3, 4], where ratio and normalized cut are famous instances of balanced graph cut criteria employed for clustering, community detection and image segmentation. The balanced k-cut problem is known to be NP-hard [4] and thus in practice relaxations [4, 5] or greedy approaches [6] are used for finding the optimal multi-cut. The most famous approach is spectral clustering [7], which corresponds to the spectral relaxation of the ratio/normalized cut and uses k-means in the embedding of the vertices given by the first k eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has been shown to be loose for k = 2 [8], and for k > 2 no guarantees are known on the quality of the obtained k-cut with respect to the optimal one. Moreover, in practice even greedy approaches [6] frequently outperform spectral clustering.\n\nThis paper is motivated by another line of recent work [9, 10, 11, 12] where it has been shown that an exact continuous relaxation for the two-cluster case (k = 2) is possible for a quite general class of balancing functions. Moreover, efficient algorithms for its optimization have been proposed which produce much better cuts than the standard spectral relaxation. However, the multi-cut problem has 
However, the multi-cut problem has\nstill to be solved via the greedy recursive splitting technique.\nInspired by the recent approach in [13], in this paper we tackle directly the general balanced k-cut\nproblem based on a new tight continuous relaxation. We show that the relaxation for the asymmetric\nratio Cheeger cut proposed recently by [13] is loose when the data does not contain k well-separated\nclusters and thus leads to poor performance in practice. Similar to [13] we can also integrate label\ninformation leading to a transductive clustering formulation. Moreover, we propose an ef\ufb01cient\nalgorithm for the minimization of our continuous relaxation for which we can prove monotonic\ndescent. This is in contrast to the algorithm proposed in [13] for which no such guarantee holds.\nIn extensive experiments we show that our method outperforms all existing methods in terms of the\n\n1\n\n\fachieved balanced k-cuts. Moreover, our clustering error is competitive with respect to several other\nclustering techniques based on balanced k-cuts and recently proposed approaches based on non-\nnegative matrix factorization. Also we observe that already with small amount of label information\nthe clustering error improves signi\ufb01cantly.\n\n2 Balanced Graph Cuts\n\nGraphs are used in machine learning typically as similarity graphs, that is the weight of an edge\nbetween two instances encodes their similarity. Given such a similarity graph of the instances, the\nclustering problem into k sets can be transformed into a graph partitioning problem, where the goal\nis to construct a partition of the graph into k sets such that the cut, that is the sum of weights of the\nedge from each set to all other sets, is small and all sets in the partition are roughly of equal size.\nBefore we introduce balanced graph cuts, we brie\ufb02y \ufb01x the setting and notation. 
Let G(V, W) denote an undirected, weighted graph with vertex set V, n = |V| vertices and weight matrix W ∈ R^{n×n}_+ with W = W^T. There is an edge between two vertices i, j ∈ V if w_ij > 0. The cut between two sets A, B ⊂ V is defined as cut(A, B) = Σ_{i∈A, j∈B} w_ij, and we write 1_A for the indicator vector of a set A ⊂ V. A collection of k sets (C_1, ..., C_k) is a partition of V if ∪_{i=1}^k C_i = V, C_i ∩ C_j = ∅ if i ≠ j, and |C_i| ≥ 1, i = 1, ..., k. We denote the set of all k-partitions of V by P_k. Furthermore, we denote by Δ_k the simplex {x : x ∈ R^k, x ≥ 0, Σ_{i=1}^k x_i = 1}.\n\nFinally, a set function Ŝ : 2^V → R is called submodular if for all A, B ⊂ V, Ŝ(A∪B) + Ŝ(A∩B) ≤ Ŝ(A) + Ŝ(B). Furthermore, we need the concept of the Lovasz extension of a set function.\n\nDefinition 1 Let Ŝ : 2^V → R be a set function with Ŝ(∅) = 0. Let f ∈ R^V be ordered in increasing order f_1 ≤ f_2 ≤ ... ≤ f_n and define C_i = {j ∈ V | f_j > f_i} where C_0 = V. Then S : R^V → R given by S(f) = Σ_{i=1}^n f_i (Ŝ(C_{i−1}) − Ŝ(C_i)) is called the Lovasz extension of Ŝ. Note that S(1_A) = Ŝ(A) for all A ⊂ V.\n\nThe Lovasz extension of a set function is convex if and only if the set function is submodular [14]. The cut function cut(C, C̄), where C̄ = V \\ C, is submodular and its Lovasz extension is given by TV(f) = (1/2) Σ_{i,j=1}^n w_ij |f_i − f_j|.\n\n2.1 Balanced k-cuts\n\nThe balanced k-cut problem is defined as\n\nmin_{(C_1,...,C_k) ∈ P_k} Σ_{i=1}^k cut(C_i, C̄_i)/Ŝ(C_i) =: BCut(C_1, ..., C_k)    (1)\n\nwhere Ŝ : 2^V → R_+ is a balancing function with the goal that all sets C_i are of the same “size”. In this paper, we assume that Ŝ(∅) = 0 and that for any C ⊊ V, C ≠ ∅, Ŝ(C) ≥ m for some m > 0. In the literature one finds mainly the following submodular balancing functions (in brackets is the name of the overall balanced graph cut criterion BCut(C_1, ..., C_k)):\n\nŜ(C) = |C|  (Ratio Cut),\nŜ(C) = min{|C|, |C̄|}  (Ratio Cheeger Cut),\nŜ(C) = min{(k−1)|C|, |C̄|}  (Asymmetric Ratio Cheeger Cut).    (2)\n\nThe Ratio Cut is well studied in the literature, e.g. [3, 7, 6], and corresponds to a balancing function without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently proposed in [13] has a bias towards sets of size |V|/k (Ŝ(C) attains its maximum at this point), which makes perfect sense if one expects clusters of roughly equal size. An intermediate version between the two is the Ratio Cheeger Cut, which has a symmetric balancing function and strongly penalizes overly large clusters. For ease of presentation we restrict ourselves to these balancing functions. However, we can also handle the corresponding weighted cases, e.g., Ŝ(C) = vol(C) = Σ_{i∈C} d_i, where d_i = Σ_{j=1}^n w_ij, leading to the normalized cut [4].\n\n3 Tight Continuous Relaxation for the Balanced k-Cut Problem\n\nIn this section we discuss our proposed relaxation for the balanced k-cut problem (1). It turns out that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that the continuous problem also yields a partition (together with a suitable rounding scheme). The motivation for our relaxation is taken from the recent work of [9, 10, 11], where exact relaxations are shown for the case k = 2. 
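Since the relaxation that follows is built entirely from Lovasz extensions, it helps to have Definition 1 in executable form. The sketch below (function and variable names are ours, not from the paper) telescopes Ŝ over the superlevel sets of f; for the cut function it reproduces TV(f) = (1/2) Σ_ij w_ij |f_i − f_j|, and for any Ŝ it satisfies S(1_A) = Ŝ(A).

```python
import numpy as np

def lovasz_extension(S_hat, f):
    """Lovasz extension of a set function S_hat (with S_hat(empty set) = 0),
    following Definition 1: visit the entries of f in increasing order and
    telescope S_hat over the superlevel sets C = {j : f_j > f_{j_i}}."""
    n = len(f)
    order = np.argsort(f, kind="stable")      # indices of f sorted increasingly
    C_prev = frozenset(range(n))              # C_0 = V
    val = 0.0
    for j in order:
        C = frozenset(i for i in range(n) if f[i] > f[j])
        val += f[j] * (S_hat(C_prev) - S_hat(C))
        C_prev = C
    return val
```

As a sanity check, applying it to the cut set function of a small weighted graph matches the total variation formula exactly.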
Basically, they replace the ratio of set functions with the ratio of the corresponding Lovasz extensions. We use the same idea for the objective of our continuous relaxation of the k-cut problem (1), which is given as\n\nmin_{F=(F_1,...,F_k), F ∈ R^{n×k}_+} Σ_{l=1}^k TV(F_l)/S(F_l)\nsubject to: F_(i) ∈ Δ_k, i = 1, ..., n,  (simplex constraints)\nmax{F_(i)} = 1, ∀i ∈ I,  (membership constraints)\nS(F_l) ≥ m, l = 1, ..., k,  (size constraints)    (3)\n\nwhere S is the Lovasz extension of the set function Ŝ and m = min_{C⊊V, C≠∅} Ŝ(C). We have m = 1 for Ratio Cut and Ratio Cheeger Cut, whereas m = k − 1 for the Asymmetric Ratio Cheeger Cut. Note that TV is the Lovasz extension of the cut functional cut(C, C̄). In order to simplify notation, we denote for a matrix F ∈ R^{n×k} by F_l the l-th column of F and by F_(i) the i-th row of F. Note that the rows of F correspond to the vertices of the graph and the j-th column of F corresponds to the set C_j of the desired partition. The set I ⊂ V in the membership constraints is chosen adaptively by our method during the sequential optimization described in Section 4.\n\nAn obvious question is how to get from the continuous solution F* of (3) to a partition (C_1, ..., C_k) ∈ P_k, a step typically called rounding. Given F*, we construct the sets by assigning each vertex to the column in which its row attains its maximum. Formally,\n\nC_i = {j ∈ V | i = arg max_{s=1,...,k} F*_{js}}, i = 1, ..., k,  (Rounding)    (4)\n\nwhere ties are broken randomly. If there exists a row for which the rounding is not unique, we say that the solution is weakly degenerated. If furthermore the resulting sets (C_1, . . .
, C_k) do not form a partition, that is, one of the sets is empty, then we say that the solution is strongly degenerated.\n\nFirst, we connect our relaxation to the previous work of [11] for the case k = 2. Indeed, for a symmetric balancing function such as the Ratio Cheeger Cut, our continuous relaxation (3) is exact even without membership and size constraints.\n\nTheorem 1 Let Ŝ be a non-negative symmetric balancing function, Ŝ(C) = Ŝ(C̄), and denote by p* the optimal value of (3) without membership and size constraints for k = 2. Then it holds that\n\np* = min_{(C_1,C_2) ∈ P_2} Σ_{i=1}^2 cut(C_i, C̄_i)/Ŝ(C_i).\n\nFurthermore, there exists a solution F* of (3) such that F* = [1_{C*}, 1_{C̄*}], where (C*, C̄*) is the optimal balanced 2-cut partition.\n\nNote that rounding trivially yields a solution in the setting of the previous theorem. A second result shows that our proposed optimization problem (3) is indeed a relaxation of the balanced k-cut problem (1). Furthermore, the relaxation is exact if I = V.\n\nProposition 1 The continuous problem (3) is a relaxation of the k-cut problem (1). The relaxation is exact, i.e., both problems are equivalent, if I = V.\n\nThe row-wise simplex and membership constraints enforce that each vertex in I belongs to exactly one component. Note that these constraints alone (even if I = V) still cannot guarantee that F corresponds to a k-way partition, since an entire column of F can be zero. This is avoided by the column-wise size constraints, which enforce that each component has at least one vertex.\n\nIf I = V it is immediate from the proof that problem (3) is no longer a continuous problem, as the feasible set consists only of indicator matrices of partitions. 
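The rounding step (4), used throughout, is simply a row-wise argmax with random tie-breaking. A minimal sketch (helper names are ours) that also reports the two degeneracy notions:

```python
import numpy as np

def round_to_partition(F, seed=0):
    """Round a continuous solution F (n x k, rows on the simplex) to sets
    C_1, ..., C_k via (4): each vertex goes to an argmax column of its row,
    ties broken randomly. Also reports weak/strong degeneracy."""
    rng = np.random.default_rng(seed)
    n, k = F.shape
    C = [set() for _ in range(k)]
    weakly = False
    for j in range(n):
        winners = np.flatnonzero(F[j] == F[j].max())
        weakly = weakly or len(winners) > 1       # non-unique argmax in some row
        C[int(rng.choice(winners))].add(j)
    strongly = any(len(c) == 0 for c in C)        # some set empty: no k-partition
    return C, weakly, strongly
```

For instance, if every row of F attains its maximum in the first column, the result has only one non-empty set and is flagged as strongly degenerated.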
In this case rounding trivially yields a partition. On the other hand, if I = ∅ (i.e., no membership constraints) and k > 2, it is not guaranteed that rounding the solution of the continuous problem yields a partition. Indeed, we will see in the following that for symmetric balancing functions one can, under these conditions, show that the solution is always strongly degenerated and rounding does not yield a partition (see Theorem 2). Thus we observe that the index set I controls the degree to which the partition constraint is enforced. The idea behind our suggested relaxation is that it is well known in image processing that minimizing the total variation yields piecewise constant solutions (in fact this follows from seeing the total variation as the Lovasz extension of the cut). Thus if |I| is sufficiently large, the vertices whose values are fixed to 0 or 1 propagate this to their neighboring vertices and finally to the whole graph. We discuss the choice of I in more detail in Section 4.\n\nSimplex constraints alone are not sufficient to yield a partition: Our approach has been inspired by [13], who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut:\n\nmin_{F=(F_1,...,F_k), F ∈ R^{n×k}_+} Σ_{l=1}^k TV(F_l)/‖F_l − quant_{k−1}(F_l)‖_1\nsubject to: F_(i) ∈ Δ_k, i = 1, ..., n,  (simplex constraints)    (5)\n\nwhere S(f) = ‖f − quant_{k−1}(f)‖_1 is the Lovasz extension of Ŝ(C) = min{(k−1)|C|, |C̄|} and quant_{k−1}(f) is the (k−1)-quantile of f ∈ R^n. Note that in their approach no membership constraints and size constraints are present.\n\nWe now show that the usage of simplex constraints alone in the optimization problem (3) is not sufficient to guarantee that the solution F* can be rounded to a partition for any symmetric balancing function in (1). 
For asymmetric balancing functions, as employed for the Asymmetric Ratio Cheeger Cut by [13] in their relaxation (5), we can prove such a strong result only in the case where the graph is disconnected. However, note that if the number of connected components of the graph is smaller than the number of desired clusters k, the multi-cut problem is still non-trivial.\n\nTheorem 2 Let Ŝ(C) be any non-negative symmetric balancing function. Then the continuous relaxation\n\nmin_{F=(F_1,...,F_k), F ∈ R^{n×k}_+} Σ_{l=1}^k TV(F_l)/S(F_l)\nsubject to: F_(i) ∈ Δ_k, i = 1, ..., n,  (simplex constraints)    (6)\n\nof the balanced k-cut problem (1) is void in the sense that the optimal solution F* of the continuous problem can be constructed from the optimal solution of the 2-cut problem, and F* cannot be rounded into a k-way partition, see (4). If the graph is disconnected, then the same holds also for any non-negative asymmetric balancing function.\n\nThe proof of Theorem 2 additionally shows that for any balancing function, if the graph is disconnected, the solution of the continuous relaxation (6) is always zero, while clearly the solution of the balanced k-cut problem need not be zero. This shows that the relaxation can be arbitrarily bad in this case. In fact, the relaxation for the asymmetric case can even fail if the graph is not disconnected but there exists a very small cut of the graph, as the following corollary indicates.\n\nCorollary 1 Let Ŝ be an asymmetric balancing function, let C* = arg min_{C⊂V} cut(C, C̄)/Ŝ(C), and suppose that φ* := (k−1) cut(C*, C̄*)/Ŝ(C*) + cut(C*, C̄*)/Ŝ(C̄*) < min_{(C_1,...,C_k) ∈ P_k} Σ_{i=1}^k cut(C_i, C̄_i)/Ŝ(C_i). Then there exists a feasible F for (6) with F_1 = 1_{C̄*} and F_l = α_l 1_{C*}, l = 2, ..., k, where Σ_{l=2}^k α_l = 1, α_l > 0, which has objective Σ_{i=1}^k TV(F_i)/S(F_i) = φ* and which cannot be rounded to a k-way partition.\n\nTheorem 2 shows that the membership and size constraints which we have introduced in our relaxation (3) are essential to obtain a partition for symmetric balancing functions.\n\nFigure 1: Toy example illustrating that the relaxation of [13] converges to a degenerate solution when applied to a graph with a dominating 2-cut. (a) 10NN-graph generated from three Gaussians in 10 dimensions; (b) continuous solution of (5) from [13] for k = 3; (c) rounding of the continuous solution of [13] does not yield a 3-partition; (d) continuous solution found by our method together with the vertices i ∈ I (black) where the membership constraint is enforced (our continuous solution already corresponds to a partition); (e) clustering found by rounding our continuous solution (trivial as we have converged to a partition). In (b)-(e), we color data point i according to F_(i) ∈ R^3.\n\nFor the asymmetric balancing function, failure of the relaxation (6), and thus also of the relaxation (5) of [13], is only guaranteed for disconnected graphs. However, Corollary 1 indicates that degenerate solutions should also be a problem when the graph is still connected but there exists a dominating cut. We illustrate this with a toy example in Figure 1, where the algorithm of [13] for solving (5) fails as it converges exactly to the solution predicted by Corollary 1 and thus only produces a 2-partition instead of the desired 3-partition. 
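The degenerate point of Corollary 1 is easy to construct explicitly. The sketch below (helper name ours) builds the feasible F for the simplex-only relaxation (6) from a single 2-cut; by construction, argmax rounding can never produce more than two non-empty sets, mirroring the toy example of Figure 1.

```python
import numpy as np

def corollary1_solution(n, C_star, k, alphas):
    """Feasible point of (6) from Corollary 1: column 1 is the indicator of
    the complement of C*, columns 2..k are alphas[l] * indicator of C*,
    with alphas positive and summing to 1, so every row lies on the simplex."""
    in_C = np.zeros(n, dtype=bool)
    in_C[list(C_star)] = True
    F = np.zeros((n, k))
    F[~in_C, 0] = 1.0                 # F_1 = indicator of complement of C*
    for l in range(1, k):
        F[in_C, l] = alphas[l - 1]    # F_l = alpha_l * indicator of C*
    return F
```

Rounding this F assigns every vertex outside C* to the first set and every vertex in C* to a single other set, so at most two sets are non-empty.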
The algorithm for our relaxation, which enforces membership constraints, converges to a continuous solution which is in fact a partition matrix, so that no rounding is necessary.\n\n4 Monotonic Descent Method for Minimization of a Sum of Ratios\n\nApart from the new relaxation, another key contribution of this paper is the derivation of an algorithm which yields a sequence of feasible points for the difficult non-convex problem (3) and monotonically reduces the corresponding objective. We would like to note that the algorithm proposed by [13] for (5) does not yield monotonic descent. In fact, it is unclear what the derived guarantee for the algorithm in [13] implies for the generated sequence. Moreover, our algorithm works for any non-negative submodular balancing function.\n\nThe key insight in order to derive a monotonic descent method for solving the sum-of-ratios minimization problem (3) is to eliminate the ratios by introducing a new set of variables β = (β_1, ..., β_k):\n\nmin_{F=(F_1,...,F_k), F ∈ R^{n×k}_+, β ∈ R^k_+} Σ_{l=1}^k β_l\nsubject to: TV(F_l) ≤ β_l S(F_l), l = 1, ..., k,  (descent constraints)\nF_(i) ∈ Δ_k, i = 1, ..., n,  (simplex constraints)\nmax{F_(i)} = 1, ∀i ∈ I,  (membership constraints)\nS(F_l) ≥ m, l = 1, ..., k.  (size constraints)    (7)\n\nNote that for the optimal solution (F*, β*) of this problem it holds that TV(F*_l) = β*_l S(F*_l), l = 1, ..., k (otherwise one could decrease β*_l and hence the objective), and thus equivalence holds. This is still a non-convex problem, as the descent, membership and size constraints are non-convex. Our algorithm now proceeds in a sequential manner. At each iterate we construct a convex inner approximation of the constraint set, that is, the convex approximation is a subset of the non-convex constraint set, based on the current iterate (F^t, β^t). 
Then we optimize the resulting convex optimization problem and repeat the process. In this way we obtain a sequence of feasible points for the original problem (7) for which we prove monotonic descent in the sum of ratios.\n\nConvex approximation: As Ŝ is submodular, S is convex. Let s^t_l ∈ ∂S(F^t_l) be an element of the sub-differential of S at the current iterate F^t_l. By Prop. 3.2 in [14] we have (s^t_l)_{j_i} = Ŝ(C_{l,i−1}) − Ŝ(C_{l,i}), where j_i is the index of the i-th smallest component of F^t_l and C_{l,i} = {j ∈ V | (F^t_l)_j > (F^t_l)_{j_i}}. Moreover, using the definition of the subgradient and the fact that S(F^t_l) = ⟨s^t_l, F^t_l⟩, we have S(F_l) ≥ S(F^t_l) + ⟨s^t_l, F_l − F^t_l⟩ = ⟨s^t_l, F_l⟩.\n\nFor the descent constraints, let λ^t_l = TV(F^t_l)/S(F^t_l) and introduce new variables δ_l = β_l − λ^t_l that capture the amount of change in each ratio. We further decompose δ_l as δ_l = δ^+_l − δ^−_l, δ^+_l ≥ 0, δ^−_l ≥ 0. Let M = max_{f ∈ [0,1]^n} S(f) = max_{C⊂V} Ŝ(C); then for S(F_l) ≥ m,\n\nTV(F_l) − β_l S(F_l) = TV(F_l) − λ^t_l S(F_l) − δ^+_l S(F_l) + δ^−_l S(F_l) ≤ TV(F_l) − λ^t_l ⟨s^t_l, F_l⟩ − δ^+_l m + δ^−_l M.\n\nFinally, note that because of the simplex constraints, the membership constraints can be rewritten as max{F_(i)} ≥ 1. Let i ∈ I and define j_i := arg max_j F^t_{ij} (ties are broken randomly). Then the membership constraints can be relaxed as follows: 0 ≥ 1 − max{F_(i)} ≥ 1 − F_{i j_i}, which implies F_{i j_i} ≥ 1. As F_{ij} ≤ 1, we get F_{i j_i} = 1. 
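The subgradient s^t_l used above follows directly from the sorting formula; a numerical sketch (names ours, not from the paper):

```python
import numpy as np

def lovasz_subgradient(S_hat, f):
    """An element s of the subdifferential of the Lovasz extension S at f:
    (s)_{j_i} = S_hat(C_{i-1}) - S_hat(C_i), with j_i the index of the i-th
    smallest entry of f. It satisfies <s, f> = S(f) and S(g) >= <s, g>."""
    n = len(f)
    order = np.argsort(f, kind="stable")
    s = np.zeros(n)
    C_prev = frozenset(range(n))              # C_0 = V
    for j in order:
        C = frozenset(i for i in range(n) if f[i] > f[j])
        s[j] = S_hat(C_prev) - S_hat(C)
        C_prev = C
    return s
```

For the cut function, S = TV, so one can check the identity ⟨s, f⟩ = TV(f) and the linear lower bound TV(g) ≥ ⟨s, g⟩ on small examples.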
Thus the convex approximation of the membership constraints fixes the assignment of the i-th point to a cluster and can thus be interpreted as a “label constraint”. However, unlike in the transductive setting, the labels for the vertices in I are chosen automatically by our method. The actual choice of the set I is discussed in Section 4.1. We use the notation L = {(i, j_i) | i ∈ I} for the label set generated from I (note that L is fixed once I is fixed).\n\nDescent algorithm: Our descent algorithm for minimizing (7) solves at each iteration t the following convex optimization problem (8):\n\nmin_{F ∈ R^{n×k}_+, δ^+ ∈ R^k_+, δ^− ∈ R^k_+} Σ_{l=1}^k δ^+_l − δ^−_l\nsubject to: TV(F_l) ≤ λ^t_l ⟨s^t_l, F_l⟩ + δ^+_l m − δ^−_l M, l = 1, ..., k,  (descent constraints)\nF_(i) ∈ Δ_k, i = 1, ..., n,  (simplex constraints)\nF_{i j_i} = 1, ∀(i, j_i) ∈ L,  (label constraints)\n⟨s^t_l, F_l⟩ ≥ m, l = 1, ..., k.  (size constraints)    (8)\n\nAs its solution F^{t+1} is feasible for (3), we update λ^{t+1}_l = TV(F^{t+1}_l)/S(F^{t+1}_l) and s^{t+1}_l ∈ ∂S(F^{t+1}_l), l = 1, ..., k, and repeat the process until the sequence terminates, that is, no further descent is possible as the following theorem states, or the relative descent in Σ_{l=1}^k λ^t_l is smaller than a predefined ε. The following Theorem 3 states the monotonic descent property of our algorithm.\n\nTheorem 3 The sequence {F^t} produced by the above algorithm satisfies Σ_{l=1}^k TV(F^{t+1}_l)/S(F^{t+1}_l) < Σ_{l=1}^k TV(F^t_l)/S(F^t_l) for all t ≥ 0, or the algorithm terminates.\n\nThe inner problem (8) is convex but contains the non-smooth term TV in the constraints. 
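The outer loop around (8) only tracks the ratios λ_l. A schematic sketch of that loop, where `solve_inner` is a placeholder for a solver of the convex inner problem (8) (all names here are ours, not from the paper):

```python
import numpy as np

def monotonic_descent(tv, S, subgrad, solve_inner, F0, eps=1e-6, max_iter=100):
    """Outer loop of the descent method: build the convex inner approximation
    (8) around the current iterate, re-solve, and stop when the relative
    descent in sum_l lambda_l falls below eps."""
    F = F0
    ratios = lambda G: np.array([tv(G[:, l]) / S(G[:, l]) for l in range(G.shape[1])])
    lam = ratios(F)
    for _ in range(max_iter):
        subgrads = [subgrad(F[:, l]) for l in range(F.shape[1])]
        F_new = solve_inner(F, lam, subgrads)   # hypothetical solver for (8)
        lam_new = ratios(F_new)
        if lam.sum() - lam_new.sum() < eps * lam.sum():
            break                               # no further (relative) descent
        F, lam = F_new, lam_new
    return F, lam
```

With a stub inner solver that returns the iterate unchanged, the loop terminates immediately with the initial ratios, which illustrates the termination criterion.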
We eliminate the non-smoothness by introducing additional variables and derive an equivalent linear programming (LP) formulation. We solve this LP via the PDHG algorithm [15, 16]. The LP and the exact iterates can be found in the supplementary material.\n\n4.1 Choice of membership constraints I\n\nThe overall algorithm scheme for solving problem (1) is given in the supplementary material. For the membership constraints we start initially with I^0 = ∅ and sequentially solve the inner problem (8). From its solution F^{t+1} we construct a P'_k = (C_1, ..., C_k) via rounding, see (4). We repeat this process until we either do not improve the resulting balanced k-cut or P'_k is not a partition. In this case we update I^{t+1} and double the number of membership constraints. Let (C*_1, ..., C*_k) be the current best partition. For each l ∈ {1, ..., k} and i ∈ C*_l we compute\n\nb*_{li} = cut(C*_l \\ {i}, C̄*_l ∪ {i})/Ŝ(C*_l \\ {i}) + min_{s≠l} [ cut(C*_s ∪ {i}, C̄*_s \\ {i})/Ŝ(C*_s ∪ {i}) + Σ_{j≠l, j≠s} cut(C*_j, C̄*_j)/Ŝ(C*_j) ]    (9)\n\nand define O_l = {(π_1, ..., π_{|C*_l|}) | b*_{lπ_1} ≥ b*_{lπ_2} ≥ ... ≥ b*_{lπ_{|C*_l|}}}. The top-ranked vertices in O_l correspond to the ones which lead to the largest minimal increase in BCut when moved from C*_l to another component and thus are most likely to belong to their current component. Thus it is natural to fix the top-ranked vertices for each component first. Note that the rankings O_l, l = 1, ..., k, are updated whenever a better partition is found. 
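The ranking (9) can be sketched as follows (helper names are ours; the balancing function Ŝ is passed in as a set function):

```python
import numpy as np

def rank_vertices(W, partition, S_hat):
    """For each cluster C_l, rank its vertices by b*_{li} from (9): the
    smallest BCut value reachable by moving vertex i from C_l to another
    component; a larger b means i belongs more strongly to C_l."""
    n = W.shape[0]
    def cut(A):
        mask = np.zeros(n, dtype=bool)
        mask[list(A)] = True
        return W[np.ix_(mask, ~mask)].sum()
    def bcut(parts):
        return sum(cut(C) / S_hat(C) for C in parts)
    rankings = []
    for l, C_l in enumerate(partition):
        score = {}
        for i in C_l:
            best = np.inf
            for s in range(len(partition)):
                if s == l:
                    continue
                moved = [set(C) for C in partition]
                moved[l].discard(i)
                moved[s].add(i)
                if moved[l]:                      # keep all sets non-empty
                    best = min(best, bcut(moved))
            score[i] = best
        rankings.append(sorted(C_l, key=lambda i: -score[i]))
    return rankings
```

On a graph with two dense pairs joined by a single weak edge, the purely internal vertex of each cluster is ranked first, since moving it would increase the cut the most.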
Thus the membership constraints always correspond to the vertices which lead to the largest minimal increase in BCut when moved to another component. In Figure 1 one can observe that the fixed labeled points lie close to the centers of the found clusters. The number of membership constraints depends on the graph: the better separated the clusters are, the fewer membership constraints need to be enforced in order to avoid degenerate solutions. Finally, we stop the algorithm if we see no more improvement in the cut or the continuous objective and the continuous solution corresponds to a partition.\n\n5 Experiments\n\nWe evaluate our method against a diverse selection of state-of-the-art clustering methods: spectral clustering (Spec) [7], BSpec [11], Graclus [6], the NMF-based approaches PNMF [18], NSC [19], ONMF [20], LSD [21], NMFR [22], and MTV [13], which optimizes (5). We used the publicly available code [22, 13] with default settings. We run our method using 5 random initializations and 7 initializations based on the spectral clustering solution, similar to [13] (who use 30 such initializations). In addition to the datasets provided in [13], we also selected a variety of datasets from the UCI repository, shown below. For all the datasets not in [13], symmetric k-NN graphs are built with Gaussian weights exp(−s ‖x − y‖² / min{σ²_{x,k}, σ²_{y,k}}), where σ_{x,k} is the k-NN distance of point x. We chose the parameters s and k in a method-independent way by testing for each dataset several graphs using all the methods over different choices of k ∈ {3, 5, 7, 10, 15, 20, 40, 60, 80, 100} and s ∈ {0.1, 1, 4}. The best choice in terms of the clustering error across all the methods and datasets is s = 1, k = 15.\n\nDataset:    Iris, wine, vertebral, ecoli, 4moons, webkb4, optdigits, USPS, pendigits, 20news, MNIST\n# vertices: 150, 178, 310, 336, 4000, 4196, 5620, 9298, 10992, 19928, 70000\n# classes:  3, 3, 3, 6, 4, 4, 10, 10, 10, 20, 10\n\nQuantitative results: In our first experiment we evaluate our method in terms of solving the balanced k-cut problem for various balancing functions, datasets and graph parameters. The following table reports the fraction of times a method achieves the best as well as the strictly best balanced k-cut over all constructed graphs and datasets (in total 30 graphs per dataset). For reference, we also report (in italics) the obtained cuts for clustering methods that do not directly minimize this criterion; methods that directly optimize the criterion are shown in normal font. Our algorithm can handle all balancing functions and significantly outperforms all other methods across all criteria. For the ratio and normalized cut cases we achieve better results than [7, 11, 6], which directly optimize these criteria. This shows that greedy recursive bi-partitioning badly affects the performance of [11], which otherwise was shown to obtain the best cuts on several benchmark datasets [23]. This further shows the need for methods that directly minimize the multi-cut. It is striking that the competing method of [13], which directly minimizes the asymmetric ratio cut, is beaten significantly by Graclus as well as by our method. 
As this clear trend is less visible in the qualitative experiments, we\nsuspect that extreme graph parameters lead to fast convergence to a degenerate solution.\n\nRCC-asym\n\nRCC-sym\n\nNCC-asym\n\nNCC-sym\n\nRcut\n\nNcut\n\nBest (%)\n\nStrictly Best (%)\n\nBest (%)\n\nStrictly Best (%)\n\nBest (%)\n\nStrictly Best (%)\n\nBest (%)\n\nStrictly Best (%)\n\nBest (%)\n\nStrictly Best (%)\n\nBest (%)\n\nStrictly Best (%)\n\n2.01\n0.00\n\n8.72\n0.00\n\n37.58\n4.70\n\n7.38\n0.00\n\n6.71\n0.00\n\n38.26\n4.70\n\n2.01\n0.00\n\n0.00\n0.00\n\n23.49\n1.34\n\n25.50\n10.74\n\nOurs MTV BSpec Spec Graclus PNMF NSC ONMF LSD NMFR\n80.54\n44.97\n94.63\n61.74\n93.29\n56.38\n98.66\n59.06\n85.91\n58.39\n95.97\n61.07\n\n10.07\n2.01\n\n20.13\n2.68\n\n32.89\n8.72\n\n2.01\n1.34\n\n1.34\n0.00\n\n4.03\n0.00\n\n4.03\n0.00\n\n4.70\n0.00\n\n0.67\n0.00\n\n0.67\n0.00\n\n1.34\n0.67\n\n0.67\n0.00\n\n2.01\n0.67\n\n0.67\n0.00\n\n10.07\n0.00\n\n4.70\n0.00\n\n3.36\n0.00\n\n1.34\n0.00\n\n3.36\n0.00\n\n38.26\n2.01\n\n5.37\n0.00\n\n4.03\n0.00\n\n5.37\n0.00\n\n4.03\n0.00\n\n0.67\n0.00\n\n13.42\n2.01\n\n1.34\n0.00\n\n0.67\n0.00\n\n1.34\n0.00\n\n1.34\n0.00\n\n0.67\n0.00\n\n0.00\n0.00\n\n0.67\n0.00\n\n10.07\n0.00\n\n19.46\n0.67\n\n20.13\n0.00\n\n20.81\n0.00\n\n7.38\n0.00\n\n10.07\n0.00\n\n20.13\n0.00\n\n9.40\n0.00\n\n9.40\n0.00\n\n40.27\n1.34\n\n37.58\n4.03\n\nQualitative results: In the following table, we report the clustering errors and the balanced k-cuts\nobtained by all methods using the graphs built with k = 15, s = 1 for all datasets. As the main goal\n\n1Since [6], a multi-level algorithm directly minimizing Rcut/Ncut, is shown to be superior to METIS [17], we do not compare with [17].\n\n7\n\n\fis to compare to [13] we choose their balancing function (RCC-asym). Again, our method always\nachieved the best cuts across all datasets. In three cases, the best cut also corresponds to the best\nclustering performance. In case of vertebral, 20news, and webkb4 the best cuts actually result in\nhigh errors. 
However, our next experiment shows that integrating ground-truth label information significantly improves the clustering performance in these cases.

[Table: clustering errors (Err, %) and balanced k-cuts (BCut) obtained by BSpec, Spec, PNMF, NSC, ONMF, LSD, NMFR, Graclus, MTV and our method on Iris, wine, vertebral, ecoli, 4moons, webkb4, optdigits, USPS, pendigits, 20news and MNIST. Our method achieves the best (possibly tied) cut on every dataset; NMFR fails on MNIST.]

Transductive Setting: As in [13], we randomly sample either one label or a fixed percentage of labels per class from the ground truth. We report the clustering errors and the cuts (RCC-asym) of both methods for different choices of labels. In the labeled experiments their initialization strategy seems to work better, as the cuts improve compared to the unlabeled case.
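The label-sampling protocol just described (one label, or a fixed percentage of labels, per class) can be sketched as follows; `sample_labels` is a hypothetical helper illustrating the protocol, not code from the paper or from [13].

```python
import numpy as np

def sample_labels(truth, frac=None, per_class=1, seed=0):
    """Pick labeled seed indices from the ground-truth assignment.

    Either a fixed number of seeds per class (per_class) or a fraction
    of each class (frac) is drawn; at least one seed per class is kept.
    Returns the sorted indices of the labeled points; all other points
    remain unlabeled.
    """
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth)
    picked = []
    for c in np.unique(truth):
        idx = np.flatnonzero(truth == c)  # members of class c
        n = max(1, int(round(frac * len(idx)))) if frac is not None else per_class
        picked.extend(rng.choice(idx, size=min(n, len(idx)), replace=False))
    return np.sort(np.array(picked))
```

With `per_class=1` this reproduces the "one label per class" setting, and `frac=0.05` the "5%" column.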
However, observe that in some cases their method seems to fail completely (Iris and 4moons with one label per class).

[Table: clustering errors (Err, %) and cuts (BCut, RCC-asym) of MTV and our method on all datasets when one label, 1%, 5%, or 10% of the labels per class are given.]

6 Conclusion

We presented a framework for directly minimizing the balanced k-cut problem based on a new continuous relaxation. Apart from ratio/normalized cut, our method can also handle new application-specific balancing functions.
Moreover, in contrast to a recursive splitting approach [24], our method enables the direct integration of prior information available in the form of must/cannot-link constraints, which is an interesting topic for future research. Finally, the monotonic descent algorithm proposed for the difficult sum-of-ratios problem is another key contribution that is of independent interest.

Acknowledgements. The authors would like to acknowledge support by the DFG excellence cluster MMCI and the ERC starting grant NOLEPRO.

References

[1] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM J. Res. Develop., 17:420–425, 1973.

[2] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430–452, 1990.

[3] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. In ICCAD, pages 10–13, 1991.

[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22:888–905, 2000.

[5] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.

[6] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., pages 1944–1957, 2007.

[7] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.

[8] S. Guattery and G. Miller. On the quality of spectral separators. SIAM J. Matrix Anal. Appl., 19:701–719, 1998.

[9] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In ICML, pages 1039–1046, 2010.

[10] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pages 847–855, 2010.

[11] M. Hein and S. Setzer.
Beyond spectral clustering - tight relaxations of balanced graph cuts. In NIPS, pages 2366–2374, 2011.

[12] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Convergence and energy landscape for Cheeger cut clustering. In NIPS, pages 1394–1402, 2012.

[13] X. Bresson, T. Laurent, D. Uminsky, and J. H. von Brecht. Multiclass total variation clustering. In NIPS, pages 1421–1429, 2013.

[14] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[15] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. of Math. Imaging and Vision, 40:120–145, 2011.

[16] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In ICCV, pages 1762–1769, 2011.

[17] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998.

[18] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

[19] C. Ding, T. Li, and M. I. Jordan. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In ICDM, pages 183–192, 2008.

[20] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126–135, 2006.

[21] R. Arora, M. R. Gupta, A. Kapila, and M. Fazel. Clustering by left-stochastic matrix factorization. In ICML, pages 761–768, 2011.

[22] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja. Clustering by nonnegative matrix factorization using graph random walk. In NIPS, pages 1088–1096, 2012.

[23] A. J. Soper, C. Walshaw, and M. Cross.
A combined evolutionary search and multilevel optimisation approach to graph-partitioning. J. of Global Optimization, 29(2):225–241, 2004.

[24] S. S. Rangapuram and M. Hein. Constrained 1-spectral clustering. In AISTATS, pages 1143–1151, 2012.", "award": [], "sourceid": 1618, "authors": [{"given_name": "Syama Sundar", "family_name": "Rangapuram", "institution": "Saarland University"}, {"given_name": "Pramod Kaushik", "family_name": "Mudrakarta", "institution": "University of Chicago"}, {"given_name": "Matthias", "family_name": "Hein", "institution": "Saarland University"}]}