{"title": "The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited", "book": "Advances in Neural Information Processing Systems", "page_first": 2427, "page_last": 2435, "abstract": "Hypergraphs allow to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or  on tensor methods which are only applicable under special conditions. In this paper we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element  is a family of regularization functionals based on  the total variation on hypergraphs.", "full_text": "The Total Variation on Hypergraphs - Learning on\n\nHypergraphs Revisited\n\nMatthias Hein, Simon Setzer, Leonardo Jost and Syama Sundar Rangapuram\n\nDepartment of Computer Science\n\nSaarland University\n\nAbstract\n\nHypergraphs allow one to encode higher-order relationships in data and are thus a\nvery \ufb02exible modeling tool. Current learning methods are either based on approx-\nimations of the hypergraphs via graphs or on tensor methods which are only appli-\ncable under special conditions. In this paper, we present a new learning framework\non hypergraphs which fully uses the hypergraph structure. The key element is a\nfamily of regularization functionals based on the total variation on hypergraphs.\n\n1\n\nIntroduction\n\nGraph-based learning is by now well established in machine learning and is the standard way to deal\nwith data that encode pairwise relationships. Hypergraphs are a natural extension of graphs which\nallow to model also higher-order relations in data. 
It has been recognized in several application\nareas such as computer vision [1, 2], bioinformatics [3, 4] and information retrieval [5, 6] that such\nhigher-order relations are available and help to improve the learning performance.\nCurrent approaches in hypergraph-based learning can be divided into two categories. The \ufb01rst one\nuses tensor methods for clustering as the higher-order extension of matrix (spectral) methods for\ngraphs [7, 8, 9]. While tensor methods are mathematically quite appealing, they are limited to so-\ncalled k-uniform hypergraphs, that is, each hyperedge contains exactly k vertices. Thus, they are not\nable to model mixed higher-order relationships. The second main approach can deal with arbitrary\nhypergraphs [10, 11]. The basic idea of this line of work is to approximate the hypergraph via a stan-\ndard weighted graph. In a second step, one then uses methods developed for graph-based clustering\nand semi-supervised learning. The two main ways of approximating the hypergraph by a standard\ngraph are the clique and the star expansion which were compared in [12]. One can summarize [12]\nby stating that no approximation fully encodes the hypergraph structure. Earlier, [13] have proven\nthat an exact representation of the hypergraph via a graph retaining its cut properties is impossible.\nIn this paper, we overcome the limitations of both existing approaches. For both clustering and semi-\nsupervised learning the key element, either explicitly or implicitly, is the cut functional. Our aim is\nto directly work with the cut de\ufb01ned on the hypergraph. We discuss in detail the differences of the\nhypergraph cut and the cut induced by the clique and star expansion in Section 2.1. Then, in Section\n2.2, we introduce the total variation on a hypergraph as the Lovasz extension of the hypergraph\ncut. 
Based on this, we propose a family of regularization functionals which interpolate between\nthe total variation and a regularization functional enforcing smoother functions on the hypergraph\ncorresponding to Laplacian-type regularization on graphs. They are the key for the semi-supervised\nlearning method introduced in Section 3. In Section 4, we show, in line with recent research [14, 15, 16,\n17], that there exists a tight relaxation of the normalized hypergraph cut. In both learning problems,\nconvex optimization problems have to be solved for which we derive scalable methods in Section\n5. The main ingredients of these algorithms are proximal mappings for which we provide a novel\nalgorithm and analyze its complexity. In the experimental Section 6, we show that fully incorporating\nhypergraph structure is beneficial. All proofs are moved to the supplementary material.\n\n2 The Total Variation on Hypergraphs\n\nA large class of graph-based algorithms in semi-supervised learning and clustering is based either\nexplicitly or implicitly on the cut. Thus, we discuss first in Section 2.1 the hypergraph cut and the\ncorresponding approximations. In Section 2.2, we introduce, in analogy to graphs, the total variation\non hypergraphs as the Lovasz extension of the hypergraph cut.\n\n2.1 Hypergraphs, Graphs and Cuts\n\nHypergraphs allow modeling relations which are not only pairwise as in graphs but involve multiple\nvertices. In this paper, we consider weighted undirected hypergraphs $H = (V, E, w)$ where $V$ is\nthe vertex set with $|V| = n$ and $E$ the set of hyperedges with $|E| = m$. Each hyperedge $e \\in E$\ncorresponds to a subset of vertices, i.e., to an element of $2^V$. The vector $w \\in \\mathbb{R}^m$ contains for\neach hyperedge $e$ its non-negative weight $w_e$. 
In the following, we use the letter $H$ also for the\nincidence matrix $H \\in \\mathbb{R}^{|V| \\times |E|}$, which is given for $i \\in V$ and $e \\in E$ by $H_{i,e} = 1$ if $i \\in e$ and $H_{i,e} = 0$ else.\nThe degree of a vertex $i \\in V$ is defined as $d_i = \\sum_{e \\in E} w_e H_{i,e}$ and the cardinality of an edge $e$ can be written\nas $|e| = \\sum_{j \\in V} H_{j,e}$. We would like to emphasize that we do not impose the restriction that the\nhypergraph is k-uniform, i.e., that each hyperedge contains exactly k vertices.\nThe considered class of hypergraphs contains the set of undirected, weighted graphs which is\nequivalent to the set of 2-uniform hypergraphs. The motivation for the total variation on hypergraphs\ncomes from the correspondence between the cut on a graph and the total variation functional. Thus,\nwe recall the definition of the cut on weighted graphs $G = (V, W)$ with weight matrix $W$. Let\n$\\overline{C} = V \\setminus C$ denote the complement of $C$ in $V$. Then, for a partition $(C, \\overline{C})$, the cut is defined as\n\n$$\\mathrm{cut}_G(C, \\overline{C}) = \\sum_{i,j:\\, i \\in C,\\, j \\in \\overline{C}} w_{ij}.$$\n\nThis standard definition of the cut carries over naturally to a hypergraph $H$:\n\n$$\\mathrm{cut}_H(C, \\overline{C}) = \\sum_{e \\in E:\\, e \\cap C \\neq \\emptyset,\\, e \\cap \\overline{C} \\neq \\emptyset} w_e. \\qquad (1)$$\n\nThus, the cut functional on a hypergraph is just the sum of the weights of the hyperedges which have\nvertices both in $C$ and $\\overline{C}$. It is not biased towards a particular way the hyperedge is cut, that is, how\nmany vertices of the hyperedge are in $C$ resp. $\\overline{C}$. This emphasizes that the vertices in a hyperedge\nbelong together and we penalize every cut of a hyperedge with the same value.\nIn order to handle hypergraphs with existing methods developed for graphs, the focus in previous\nworks [11, 12] has been on transforming the hypergraph into a graph. In [11], they suggest using\nthe clique expansion (CE), i.e., every hyperedge $e \\in E$ is replaced with a fully connected subgraph\nwhere every edge in this subgraph has weight $w_e / |e|$. 
This leads to the cut functional $\\mathrm{cut}_{CE}$,\n\n$$\\mathrm{cut}_{CE}(C, \\overline{C}) := \\sum_{e \\in E:\\, e \\cap C \\neq \\emptyset,\\, e \\cap \\overline{C} \\neq \\emptyset} \\frac{w_e}{|e|}\\, |e \\cap C|\\, |e \\cap \\overline{C}|. \\qquad (2)$$\n\nNote that in contrast to the hypergraph cut (1), the value of $\\mathrm{cut}_{CE}$ depends on the way each\nhyperedge is cut since the term $|e \\cap C|\\, |e \\cap \\overline{C}|$ makes the weights dependent on the partition. In particular,\nthe smallest weight is attained if only a single vertex is split off, whereas the largest weight is attained\nif the partition of the hyperedge is most balanced. In comparison to the hypergraph cut, this leads\nto a bias towards cuts that favor splitting off single vertices from a hyperedge which in our point of\nview is an undesired property for most applications. We illustrate this with an example in Figure\n1, where the minimum hypergraph cut ($\\mathrm{cut}_H$) leads to a balanced partition, whereas the minimum\nclique expansion cut ($\\mathrm{cut}_{CE}$) not only cuts an additional hyperedge but is also unbalanced. This is\ndue to its bias towards splitting off single nodes of a hyperedge. Another argument against the clique\nexpansion is computational complexity. For large hyperedges the clique expansion leads to (almost)\nfully connected graphs which makes computations slow and is prohibitive for large hypergraphs.\nWe omit the discussion of the star graph approximation of hypergraphs discussed in [12] as it is\nshown there that the star graph expansion is very similar to the clique expansion. Instead, we want\nto recall the result of Ihler et al. [13] which states that in general there exists no graph with the same\nvertex set $V$ which has for every partition $(C, \\overline{C})$ the same cut value as the hypergraph cut.\n\nFigure 1: Minimum hypergraph cut $\\mathrm{cut}_H$ vs. minimum cut of the clique expansion $\\mathrm{cut}_{CE}$: For edge\nweights $w_1 = w_4 = 10$, $w_2 = w_5 = 0.1$ and $w_3 = 0.6$ the minimum hypergraph cut is $(C_1, \\overline{C}_1)$\nwhich is perfectly balanced. 
Although cutting one hyperedge more and being unbalanced, $(C_2, \\overline{C}_2)$\nis the optimal cut for the clique expansion approximation.\n\nFinally, note that for weighted 3-uniform hypergraphs it is always possible to find a corresponding\ngraph such that any cut of the graph is equal to the corresponding cut of the hypergraph.\nProposition 2.1. Suppose $H = (V, E, w)$ is a weighted 3-uniform hypergraph. Then, $W \\in\n\\mathbb{R}^{|V| \\times |V|}$ defined as $W = \\frac{1}{2} H \\mathrm{diag}(w) H^T$ defines the weight matrix of a graph $G = (V, W)$ where\neach cut of $G$ has the same value as the corresponding hypergraph cut of $H$.\n\n2.2 The Total Variation on Hypergraphs\n\nIn this section, we define the total variation on hypergraphs. The key technical element is the Lovasz\nextension which extends a set function, seen as a mapping on $2^V$, to a function on $\\mathbb{R}^{|V|}$.\nDefinition 2.1. Let $\\hat{S} : 2^V \\to \\mathbb{R}$ be a set function with $\\hat{S}(\\emptyset) = 0$. Let $f \\in \\mathbb{R}^{|V|}$, let $V$ be ordered\nsuch that $f_1 \\leq f_2 \\leq \\ldots \\leq f_n$ and define $C_i = \\{j \\in V \\mid j > i\\}$. Then, the Lovasz extension\n$S : \\mathbb{R}^{|V|} \\to \\mathbb{R}$ of $\\hat{S}$ is given by\n\n$$S(f) = \\sum_{i=1}^{n} f_i \\left( \\hat{S}(C_{i-1}) - \\hat{S}(C_i) \\right) = \\sum_{i=1}^{n-1} \\hat{S}(C_i)(f_{i+1} - f_i) + f_1 \\hat{S}(V).$$\n\nNote that for the characteristic function of a set $C \\subset V$, we have $S(1_C) = \\hat{S}(C)$.\nIt is well-known that the Lovasz extension $S$ is a convex function if and only if $\\hat{S}$ is submodular\n[18]. For graphs $G = (V, W)$, the total variation on graphs is defined as the Lovasz extension of the\ngraph cut [18] given as $\\mathrm{TV}_G : \\mathbb{R}^{|V|} \\to \\mathbb{R}$, $\\mathrm{TV}_G(f) = \\frac{1}{2} \\sum_{i,j=1}^{n} w_{ij} |f_i - f_j|$.\nProposition 2.2. 
The total variation $\\mathrm{TV}_H : \\mathbb{R}^{|V|} \\to \\mathbb{R}$ on a hypergraph $H = (V, E, w)$ defined as\nthe Lovasz extension of the hypergraph cut, $\\hat{S}(C) = \\mathrm{cut}_H(C, \\overline{C})$, is a convex function given by\n\n$$\\mathrm{TV}_H(f) = \\sum_{e \\in E} w_e \\left( \\max_{i \\in e} f_i - \\min_{j \\in e} f_j \\right) = \\sum_{e \\in E} w_e \\max_{i,j \\in e} |f_i - f_j|.$$\n\nNote that the total variation on a hypergraph reduces to the total variation on graphs if $H$ is\n2-uniform (standard graph). There is an interesting relation of the total variation on hypergraphs\nto sparsity inducing group norms. Namely, defining for each edge $e \\in E$ the difference operator\n$D_e : \\mathbb{R}^{|V|} \\to \\mathbb{R}^{|V| \\times |V|}$ by $(D_e f)_{ij} = f_i - f_j$ if $i, j \\in e$ and 0 otherwise, $\\mathrm{TV}_H$ can be written\nas $\\mathrm{TV}_H(f) = \\sum_{e \\in E} w_e \\|D_e f\\|_\\infty$, which can be seen as inducing group sparse structure on the\ngradient level. The groups are the hyperedges and thus are typically overlapping. This could lead\npotentially to extensions of the elastic net on graphs to hypergraphs.\nIt is known that using the total variation on graphs as a regularization functional in semi-supervised\nlearning (SSL) leads to very spiky solutions for small numbers of labeled points. Thus, one would\nlike to have regularization functionals enforcing more smoothness of the solutions. For graphs this\nis achieved by using the family of regularization functionals $\\Omega_{G,p} : \\mathbb{R}^{|V|} \\to \\mathbb{R}$,\n\n$$\\Omega_{G,p}(f) = \\frac{1}{2} \\sum_{i,j=1}^{n} w_{ij} |f_i - f_j|^p.$$\n\nFor p = 2 we get the regularization functional of the graph Laplacian which is the basis of a large\nclass of methods on graphs. In analogy to graphs, we define a corresponding family on hypergraphs.\nDefinition 2.2. 
The regularization functionals $\\Omega_{H,p} : \\mathbb{R}^{|V|} \\to \\mathbb{R}$ for a hypergraph $H = (V, E, w)$\nare defined for $p \\geq 1$ as\n\n$$\\Omega_{H,p}(f) = \\sum_{e \\in E} w_e \\left( \\max_{i \\in e} f_i - \\min_{j \\in e} f_j \\right)^p.$$\n\nLemma 2.1. The functionals $\\Omega_{H,p} : \\mathbb{R}^{|V|} \\to \\mathbb{R}$ are convex.\nNote that $\\Omega_{H,1}(f) = \\mathrm{TV}_H(f)$. If $H$ is a graph and $p \\geq 1$, $\\Omega_{H,p}$ reduces to the Laplacian\nregularization $\\Omega_{G,p}$. Note that for characteristic functions of sets, $f = 1_C$, it holds $\\Omega_{H,p}(1_C) = \\mathrm{cut}_H(C, \\overline{C})$.\nThus, the difference between the hypergraph cut and its approximations such as clique and star\nexpansion carries over to $\\Omega_{H,p}$ and $\\Omega_{G_{CE},p}$, respectively.\n\n3 Semi-supervised Learning\n\nWith the regularization functionals derived in the last section, we can immediately write down a\nformulation for two-class semi-supervised learning on hypergraphs similar to the well-known\napproaches of [19, 20]. Given the label set $L$ we construct the vector $Y \\in \\mathbb{R}^n$ with $Y_i = 0$ if $i \\notin L$\nand $Y_i$ equal to the label in $\\{-1, 1\\}$ if $i \\in L$. We propose solving\n\n$$f^* = \\arg\\min_{f \\in \\mathbb{R}^{|V|}} \\frac{1}{2} \\|f - Y\\|_2^2 + \\lambda\\, \\Omega_{H,p}(f), \\qquad (3)$$\n\nwhere $\\lambda > 0$ is the regularization parameter. In Section 5, we discuss how this convex optimization\nproblem can be solved efficiently for the case p = 1 and p = 2. Note that other loss functions than\nthe squared loss could be used. However, the regularizer aims at contracting the function and we\nuse the label set $\\{-1, 1\\}$ so that $f^* \\in [-1, 1]^{|V|}$. Hence, on the interval $[-1, 1]$ the squared loss\nbehaves very similar to other margin-based loss functions. In general, we recommend using p = 2\nas it corresponds to Laplacian-type regularization for graphs which is known to work well. For\ngraphs p = 1 is known to produce spiky solutions for small numbers of labeled points. 
This is due\nto the effect that cutting \u201cout\u201d the labeled points leads to a much smaller cut than, e.g., producing a\nbalanced partition. However, in the case where one has only a small number of hyperedges this effect\nis much smaller and we will see in the experiments that p = 1 also leads to reasonable solutions.\n\n4 Balanced Hypergraph Cuts\n\nIn Section 2.1, we discussed the difference between the hypergraph cut (1) and the graph cut of\nthe clique expansion (2) of the hypergraph and gave a simple example in Figure 1 where these\ncuts yield quite different results. Clearly, this difference carries over to the famous normalized cut\ncriterion introduced in [21, 22] for clustering of graphs with applications in image segmentation.\nFor a hypergraph the ratio resp. normalized cut can be formulated as\n\n$$\\mathrm{RCut}(C, \\overline{C}) = \\frac{\\mathrm{cut}_H(C, \\overline{C})}{|C|\\,|\\overline{C}|}, \\qquad \\mathrm{NCut}(C, \\overline{C}) = \\frac{\\mathrm{cut}_H(C, \\overline{C})}{\\mathrm{vol}(C)\\, \\mathrm{vol}(\\overline{C})},$$\n\nwhich incorporate different balancing criteria. Note that, in contrast to the normalized cut for graphs,\nthe normalized hypergraph cut allows no relaxation into a linear eigenproblem (spectral relaxation).\nThus, we follow a recent line of research [14, 15, 16, 17] where it has been shown that the standard\nspectral relaxation of the normalized cut used in spectral clustering [22] is loose and that a tight, in\nfact exact, relaxation can be formulated in terms of a nonlinear eigenproblem. Although nonlinear\neigenproblems are non-convex, one can compute nonlinear eigenvectors quite efficiently at the price\nof losing global optimality. However, it has been shown that the potentially non-optimal solutions\nof the exact relaxation outperform in practice the globally optimal solution of the loose relaxation,\noften by a large margin. 
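
To make the definitions used so far concrete, the following minimal Python sketch evaluates the hypergraph cut (1), the clique-expansion cut (2), the total variation TV_H and the normalized hypergraph cut on a small toy hypergraph. The toy hypergraph, its weights, the chosen partition and all function names are illustrative assumptions and are not taken from the paper; vol(C) is assumed to be the sum of the vertex degrees d_i over C.

```python
# Toy hypergraph (hypothetical): vertices 0..4, hyperedges as vertex lists.
edges = [[0, 1, 2], [2, 3], [3, 4], [0, 1, 2, 3, 4]]
weights = [1.0, 2.0, 1.5, 0.5]
# Partition: in_c[i] is True if vertex i belongs to C, here C = {0, 1, 2}.
in_c = [True, True, True, False, False]


def cut_h(edges, weights, in_c):
    # Hypergraph cut (1): sum of weights of hyperedges with vertices on both sides.
    total = 0.0
    for e, w in zip(edges, weights):
        k = sum(1 for i in e if in_c[i])
        if 0 < k < len(e):
            total += w
    return total


def cut_ce(edges, weights, in_c):
    # Clique-expansion cut (2): each hyperedge contributes
    # (w_e / |e|) * |e cap C| * |e cap complement(C)|.
    total = 0.0
    for e, w in zip(edges, weights):
        k = sum(1 for i in e if in_c[i])
        total += (w / len(e)) * k * (len(e) - k)
    return total


def tv_h(edges, weights, f):
    # Total variation on the hypergraph: sum_e w_e * (max_{i in e} f_i - min_{j in e} f_j).
    return sum(w * (max(f[i] for i in e) - min(f[i] for i in e))
               for e, w in zip(edges, weights))


def ncut_h(edges, weights, in_c):
    # Normalized hypergraph cut: cut_H / (vol(C) * vol(complement(C))),
    # with vol(C) assumed to be the sum of degrees d_i = sum_e w_e H_{i,e}.
    deg = [0.0] * len(in_c)
    for e, w in zip(edges, weights):
        for i in e:
            deg[i] += w
    vol_c = sum(d for d, b in zip(deg, in_c) if b)
    vol_cbar = sum(d for d, b in zip(deg, in_c) if not b)
    return cut_h(edges, weights, in_c) / (vol_c * vol_cbar)


indicator = [1.0 if b else 0.0 for b in in_c]
print(cut_h(edges, weights, in_c), cut_ce(edges, weights, in_c))
print(tv_h(edges, weights, indicator), ncut_h(edges, weights, in_c))
```

On this example the total variation of the indicator vector of C coincides with the hypergraph cut, in agreement with the property S(1_C) = S^(C) of the Lovasz extension, while the clique-expansion cut weights each cut hyperedge by how it is split.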
In this section, we extend their approach to hypergraphs and consider general\nbalanced hypergraph cuts $\\mathrm{Bcut}(C, \\overline{C})$ of the form $\\mathrm{Bcut}(C, \\overline{C}) = \\frac{\\mathrm{cut}_H(C, \\overline{C})}{\\hat{S}(C)}$, where $\\hat{S} : 2^V \\to \\mathbb{R}_+$\nis a non-negative, symmetric set function (that is, $\\hat{S}(C) = \\hat{S}(\\overline{C})$). For the normalized cut one has\n$\\hat{S}(C) = \\mathrm{vol}(C)\\, \\mathrm{vol}(\\overline{C})$ whereas for the Cheeger cut one has $\\hat{S}(C) = \\min\\{\\mathrm{vol}\\, C, \\mathrm{vol}\\, \\overline{C}\\}$. Other\nexamples of balancing functions can be found in [16]. Our following result shows that the balanced\nhypergraph cut also has an exact relaxation into a continuous nonlinear eigenproblem [14].\nTheorem 4.1. Let $H = (V, E, w)$ be a finite, weighted hypergraph and $S : \\mathbb{R}^{|V|} \\to \\mathbb{R}$ be the Lovasz\nextension of the symmetric, non-negative set function $\\hat{S} : 2^V \\to \\mathbb{R}$. Then, it holds that\n\n$$\\min_{f \\in \\mathbb{R}^{|V|}} \\frac{\\sum_{e \\in E} w_e \\left( \\max_{i \\in e} f_i - \\min_{j \\in e} f_j \\right)}{S(f)} = \\min_{C \\subset V} \\frac{\\mathrm{cut}_H(C, \\overline{C})}{\\hat{S}(C)}.$$\n\nFurther, let $f \\in \\mathbb{R}^{|V|}$ and define $C_t := \\{i \\in V \\mid f_i > t\\}$. Then,\n\n$$\\min_{t \\in \\mathbb{R}} \\frac{\\mathrm{cut}_H(C_t, \\overline{C_t})}{\\hat{S}(C_t)} \\leq \\frac{\\sum_{e \\in E} w_e \\left( \\max_{i \\in e} f_i - \\min_{j \\in e} f_j \\right)}{S(f)}.$$\n\nThe last part of the theorem shows that \u201coptimal thresholding\u201d (turning $f \\in \\mathbb{R}^{|V|}$ into a partition)\namong all level sets of any $f \\in \\mathbb{R}^{|V|}$ can only lead to a better or equal balanced hypergraph cut.\nThe question remains how to minimize the ratio $Q(f) = \\frac{\\mathrm{TV}_H(f)}{S(f)}$. As discussed in [16], every\nLovasz extension $S$ can be written as a difference of convex positively 1-homogeneous functions$^1$\n$S = S_1 - S_2$. Moreover, as shown in Prop. 2.2 the total variation $\\mathrm{TV}_H$ is convex. Thus, we have\nto minimize a non-negative ratio of a convex and a difference of convex (d.c.) function. We employ\nthe RatioDCA algorithm [16] shown in Algorithm 1. The main part is the convex inner problem. 
In our case $R_1 = \\mathrm{TV}_H$, $R_2 = 0$, and thus the inner problem reads\n\n$$\\min_{\\|u\\|_2 \\leq 1} \\left\\{ \\mathrm{TV}_H(u) + \\lambda^k \\left( S_2(u) - \\langle u, s_1(f^k) \\rangle \\right) \\right\\}. \\qquad (4)$$\n\nFor simplicity we restrict ourselves to submodular balancing functions, in which case $S$ is convex\nand thus $S_2 = 0$. For the general case, see [16]. Note that the balancing functions of ratio/normalized\ncut and Cheeger cut are submodular. It turns out that the inner problem is very similar to the\nsemi-supervised learning formulation (3). The efficient solution of both problems is discussed next.\n\nAlgorithm 1 RatioDCA \u2013 Minimization of a non-negative ratio of 1-homogeneous d.c. functions\n1: Objective: $Q(f) = \\frac{R_1(f) - R_2(f)}{S_1(f) - S_2(f)}$. Initialization: $f^0$ = random with $\\|f^0\\| = 1$, $\\lambda^0 = Q(f^0)$\n2: repeat\n3:   $s_1(f^k) \\in \\partial S_1(f^k)$, $r_2(f^k) \\in \\partial R_2(f^k)$\n4:   $f^{k+1} = \\arg\\min_{\\|u\\|_2 \\leq 1} \\left\\{ R_1(u) - \\langle u, r_2(f^k) \\rangle + \\lambda^k \\left( S_2(u) - \\langle u, s_1(f^k) \\rangle \\right) \\right\\}$\n5:   $\\lambda^{k+1} = (R_1(f^{k+1}) - R_2(f^{k+1})) / (S_1(f^{k+1}) - S_2(f^{k+1}))$\n6: until $\\frac{|\\lambda^{k+1} - \\lambda^k|}{\\lambda^k} < \\epsilon$\n7: Output: eigenvalue $\\lambda^{k+1}$ and eigenvector $f^{k+1}$.\n\n5 Algorithms for the Total Variation on Hypergraphs\n\nThe problem (3) we want to solve for semi-supervised learning and the inner problem (4) of\nRatioDCA have a common structure. They are the sum of two convex functions: one of them is the\nnovel regularizer $\\Omega_{H,p}$ and the other is a data term denoted by $G$ here, cf. Table 1. We propose\nsolving these problems using a primal-dual algorithm, denoted by PDHG, which was proposed in\n[23, 24]. Its main idea is to iteratively solve for each convex term in the objective function a proximal\nproblem. The proximal map $\\mathrm{prox}_g$ w.r.t. 
a mapping $g : \\mathbb{R}^n \\to \\mathbb{R}$ is defined by\n\n$$\\mathrm{prox}_g(\\tilde{x}) = \\arg\\min_{x \\in \\mathbb{R}^n} \\left\\{ \\frac{1}{2} \\|x - \\tilde{x}\\|_2^2 + g(x) \\right\\}.$$\n\n$^1$ A function $f : \\mathbb{R}^d \\to \\mathbb{R}$ is positively 1-homogeneous if $\\forall \\alpha > 0$, $f(\\alpha x) = \\alpha f(x)$.\n\nThe key idea is that often proximal problems can be solved efficiently leading to fast convergence\nof the overall algorithm. We see in Table 1 that for both $G$ the proximal problems have an explicit\nsolution. However, note that smooth convex terms can also be directly exploited [25]. For $\\Omega_{H,p}$, we\ndistinguish two cases, p = 1 and p = 2. Detailed descriptions of the algorithms can be found in the\nsupplementary material.\n\nSSL (3): $G(f) = \\frac{1}{2} \\|f - Y\\|_2^2$, $\\mathrm{prox}_{\\tau G}(\\tilde{x}) = \\frac{1}{1 + \\tau} (\\tilde{x} + \\tau Y)$\nInner problem of RatioDCA (4): $G(f) = -\\langle s_1(f^k), f \\rangle + \\iota_{\\|\\cdot\\|_2 \\leq 1}(f)$, $\\mathrm{prox}_{\\tau G}(\\tilde{x}) = \\frac{\\tilde{x} + \\tau s_1(f^k)}{\\max\\{1, \\|\\tilde{x} + \\tau s_1(f^k)\\|_2\\}}$\n\nTable 1: Data term and proximal map for SSL (3) (first row) and the inner problem of RatioDCA (4)\n(second row). The indicator function is defined as $\\iota_{\\|\\cdot\\|_2 \\leq 1}(x) = 0$ if $\\|x\\|_2 \\leq 1$ and $+\\infty$ otherwise.\n\nPDHG algorithm for $\\Omega_{H,1}$. Let $m_e$ be the number of vertices in hyperedge $e \\in E$. We write\n\n$$\\lambda\\, \\Omega_{H,1}(f) = F(Kf) := \\sum_{e \\in E} \\left( F_{(e,1)}(K_e f) + F_{(e,2)}(K_e f) \\right), \\qquad (5)$$\n\nwhere the rows of the matrices $K_e \\in \\mathbb{R}^{m_e \\times n}$ are the i-th standard unit vectors for $i \\in e$ and the\nfunctionals $F_{(e,j)} : \\mathbb{R}^{m_e} \\to \\mathbb{R}$ are defined as\n\n$$F_{(e,1)}(\\alpha^{(e,1)}) = \\lambda w_e \\max(\\alpha^{(e,1)}), \\qquad F_{(e,2)}(\\alpha^{(e,2)}) = -\\lambda w_e \\min(\\alpha^{(e,2)}).$$\n\nIn contrast to the function $G$, we need in the PDHG algorithm the proximal maps for the conjugate\nfunctions of $F_{(e,j)}$. 
They are given by\n\n$$F^*_{(e,1)} = \\iota_{S_{\\lambda w_e}}, \\qquad F^*_{(e,2)} = \\iota_{-S_{\\lambda w_e}},$$\n\nwhere $S_{\\lambda w_e} = \\{x \\in \\mathbb{R}^{m_e} : \\sum_{i=1}^{m_e} x_i = \\lambda w_e,\\ x_i \\geq 0\\}$ is the scaled simplex in $\\mathbb{R}^{m_e}$. The solutions\nof the proximal problem for $F^*_{(e,1)}$ and $F^*_{(e,2)}$ are the orthogonal projections onto the simplexes\nwritten here as $P_{S^e_{\\lambda w_e}}$ and $P_{-S^e_{\\lambda w_e}}$, respectively. These projections can be done in linear time [26].\nWith the proximal maps we have presented so far, the PDHG algorithm has the following form.\n\nAlgorithm 2 PDHG for $\\Omega_{H,1}$\n1: Initialization: $f^{(0)} = \\bar{f}^{(0)} = 0$, $\\theta \\in [0, 1]$, $\\sigma, \\tau > 0$ with $\\sigma\\tau < 1/(2 \\max_{i=1,\\ldots,n}\\{c_i\\})$\n2: repeat\n3:   $\\alpha^{(e,1)(k+1)} = P_{S^e_{\\lambda w_e}}(\\alpha^{(e,1)(k)} + \\sigma K_e \\bar{f}^{(k)})$, $e \\in E$\n4:   $\\alpha^{(e,2)(k+1)} = P_{-S^e_{\\lambda w_e}}(\\alpha^{(e,2)(k)} + \\sigma K_e \\bar{f}^{(k)})$, $e \\in E$\n5:   $f^{(k+1)} = \\mathrm{prox}_{\\tau G}(f^{(k)} - \\tau \\sum_{e \\in E} K_e^T (\\alpha^{(e,1)(k+1)} + \\alpha^{(e,2)(k+1)}))$\n6:   $\\bar{f}^{(k+1)} = f^{(k+1)} + \\theta (f^{(k+1)} - f^{(k)})$\n7: until relative duality gap $< \\epsilon$\n8: Output: $f^{(k+1)}$.\n\nThe value $c_i = \\sum_{e \\in E} H_{i,e}$ is the number of hyperedges the vertex $i$ lies in. It is important to\npoint out here that the algorithm decouples the problem in the sense that in every iteration we\nsolve subproblems which treat the functionals $G$, $F_{(e,1)}$, $F_{(e,2)}$ separately and thus can be solved\nefficiently.\n\nPDHG algorithm for $\\Omega_{H,2}$. We define the matrices $K_e$ as above. Moreover, we introduce for\nevery hyperedge $e \\in E$ the functional\n\n$$F_e(\\alpha_e) = \\lambda w_e (\\max(\\alpha_e) - \\min(\\alpha_e))^2. \\qquad (6)$$\n\nHence, we can write $\\lambda\\, \\Omega_{H,2}(f) = \\sum_{e \\in E} F_e(K_e f)$. As we show in the supplementary material, the\nconjugate functions $F_e^*$ are not indicator functions and we thus solve the corresponding proximal\nproblems via proximal problems for $F_e$. 
More specifically, we exploit the fact that\n\n$$\\mathrm{prox}_{\\sigma F_e^*}(\\tilde{\\alpha}_e) = \\tilde{\\alpha}_e - \\mathrm{prox}_{\\frac{1}{\\sigma} F_e}(\\tilde{\\alpha}_e), \\qquad (7)$$\n\nand use the following novel result concerning the proximal problem on the right-hand side of (7).\n\nProp. \\\\ Dataset       Zoo     Mushrooms   Covertype (4,5)   Covertype (6,7)   20Newsgroups\nNumber of classes     7       2           2                 2                 4\n$|V|$                 101     8124        12240             37877             16242\n$|E|$                 42      112         104               123               100\n$\\sum_{e \\in E} |e|$  1717    170604      146880            454522            65451\n$|E|$ of Clique Exp.  10201   65999376    143008092         1348219153        53284642\n\nTable 2: Datasets used for SSL and clustering. Note that the clique expansion leads for all datasets\nto a graph which is close to being fully connected as all datasets contain large hyperedges. For\ncovertype (6,7) the weight matrix needs over 10GB of memory, the original hypergraph only 4MB.\n\nProposition 5.1. For any $\\sigma > 0$ and any $\\tilde{\\alpha}_e \\in \\mathbb{R}^{m_e}$ the proximal map\n\n$$\\mathrm{prox}_{\\frac{1}{\\sigma} F_e}(\\tilde{\\alpha}_e) = \\arg\\min_{\\alpha_e \\in \\mathbb{R}^{m_e}} \\left\\{ \\frac{1}{2} \\|\\alpha_e - \\tilde{\\alpha}_e\\|_2^2 + \\frac{1}{\\sigma} \\lambda w_e (\\max(\\alpha_e) - \\min(\\alpha_e))^2 \\right\\}$$\n\ncan be computed with $O(m_e \\log m_e)$ arithmetic operations.\nA corresponding algorithm which is new to the best of our knowledge is provided in the\nsupplementary material. 
We note here that the complexity is due to the fact that we sort the input vector $\\tilde{\\alpha}_e$.\nThe PDHG algorithm for p = 2 is provided in the supplementary material. It has the same structure\nas Algorithm 2 with the only difference that we now solve (7) for every hyperedge.\n\n6 Experiments\n\nThe method of Zhou et al. [11] seems to be the standard algorithm for clustering and SSL on\nhypergraphs. We compare to them on a selection of UCI datasets summarized in Table 2. Zoo, Mushrooms\nand 20Newsgroups$^2$ have been used also in [11] and contain only categorical features. As in [11],\na hyperedge of weight one is created by all data points which have the same value of a categorical\nfeature. For covertype we quantize the numerical features into 10 bins of equal size. Two datasets\nare created each with two classes (4,5 and 6,7) of the original dataset.\n\nSemi-supervised Learning (SSL). In [11], they suggest using a regularizer induced by the\nnormalized Laplacian $L_{CE}$ arising from the clique expansion\n\n$$L_{CE} = I - D_{CE}^{-\\frac{1}{2}} H W' H^T D_{CE}^{-\\frac{1}{2}},$$\n\nwhere $D_{CE}$ is a diagonal matrix with entries $d_{CE}(i) = \\sum_{e \\in E} H_{i,e} \\frac{w_e}{|e|}$ and $W' \\in \\mathbb{R}^{|E| \\times |E|}$ is a\ndiagonal matrix with entries $w'(e) = w_e / |e|$. The SSL problem can then be formulated, for $\\lambda > 0$, as\n\n$$\\arg\\min_{f \\in \\mathbb{R}^{|V|}} \\left\\{ \\|f - Y\\|_2^2 + \\lambda \\langle f, L_{CE} f \\rangle \\right\\}.$$\n\nThe advantage of this formulation is that the solution can be found via a linear system. However, as\nTable 2 indicates the obvious downside is that $L_{CE}$ is a potentially very dense matrix and thus one\nneeds in the worst case $|V|^2$ memory and $O(|V|^3)$ computations. This is in contrast to our method\nwhich needs $2 \\sum_{e \\in E} |e| + |V|$ memory. For the largest example (covertype 6,7), where the clique\nexpansion fails due to memory problems, our method takes 30-100s (depending on $\\lambda$). 
We stop our\nmethod for all experiments when we achieve a relative duality gap of $10^{-6}$. In the experiments we\ndo 10 trials for different numbers of labeled points. The reg. parameter $\\lambda$ is chosen for both methods\nfrom the set $10^{-k}$, where $k \\in \\{0, 1, 2, 3, 4, 5, 6\\}$, via 5-fold cross validation. The resulting errors\nand standard deviations can be found in the following table (first row lists the no. of labeled points).\nOur SSL methods based on $\\Omega_{H,p}$, p = 1, 2 outperform consistently the clique expansion technique\nof Zhou et al. [11] on all datasets except 20newsgroups$^3$. However, 20newsgroups is a very difficult\ndataset as only 10,267 out of the 16,242 data points are different which leads to a minimum possible\nerror of 9.6%. A method based on pairwise interaction such as the clique expansion can better deal\nwith such label noise as the large hyperedges for this dataset accumulate the label noise.\n\n$^2$ This is a modified version by Sam Roweis of the original 20 newsgroups dataset available at\nhttp://www.cs.nyu.edu/~roweis/data/20news_w100.mat.\n\n$^3$ Communications with the authors of [11] could not clarify the difference to their results on 20newsgroups.\n\nOn all\nother datasets we observe that incorporating hypergraph structure leads to much better results. As\nexpected our squared TV functional (p = 2) outperforms slightly the total variation (p = 1) even\nthough the difference is small. 
Thus, as \u2126H,2 reduces to the standard regularization based on the\ngraph Laplacian, which is known to work well, we recommend \u2126H,2 for SSL on hypergraphs.\n\nZoo\nZhou et al.\n\u2126H,1\n\u2126H,2\nMushr.\nZhou et al.\n\u2126H,1\n\u2126H,2\ncovert45\nZhou et al.\n\u2126H,1\n\u2126H,2\ncovert67\n\u2126H,1\n\u2126H,2\n20news\nZhou et al.\n\u2126H,1\n\u2126H,2\n\n30\n40.7\u00b1 14.2\n2.2 \u00b1 2.1\n2.9 \u00b1 2.3\n\n35\n29.7 \u00b1 8.8\n0.7 \u00b1 1.0\n0.9 \u00b1 1.4\n\n50\n25.3\u00b1 14.4\n1.9 \u00b1 3.0\n1.6 \u00b1 2.9\n\n20\n25\n35.1\u00b1 17.2\n30.3 \u00b1 7.9\n2.9 \u00b1 3.0\n1.4 \u00b1 2.2\n2.3 \u00b1 1.9\n1.5 \u00b1 2.4\n80\n20\n40\n10.3\u00b12.0\n15.5 \u00b1 12.8 10.9\u00b14.4\n5.6 \u00b1 1.9\n19.5\u00b1 10.5\n10.8\u00b13.7\n6.4 \u00b1 2.7\n9.8 \u00b1 4.5\n18.4 \u00b1 7.4\n80\n40\n20\n16.6\u00b16.4\n18.9 \u00b1 4.6\n18.3\u00b15.2\n7.6 \u00b1 3.5\n21.4 \u00b1 0.9\n17.6\u00b12.6\n16.1 \u00b1 4.1 10.9 \u00b1 4.9 5.9 \u00b1 3.7\n20.7 \u00b1 2.0\n20\n40\n40.6 \u00b1 8.9\n6.4\u00b110.4\n25.2 \u00b1 18.3 4.3 \u00b1 9.6\n20\n45.5 \u00b1 7.5\n65.7 \u00b1 6.1\n55.0 \u00b1 4.8\n\n200\n9.3 \u00b1 1.0\n5.6 \u00b1 3.8\n3.0 \u00b1 0.6\n200\n20.4\u00b12.9\n1.5 \u00b1 1.3\n1.0 \u00b1 1.1\n200\n1.2 \u00b1 0.9\n1.1 \u00b1 0.8\n40\n200\n34.4 \u00b1 3.1 31.5 \u00b1 1.4 29.8 \u00b1 4.0 27.0 \u00b1 1.3 27.3 \u00b1 1.5 25.7 \u00b1 1.4 25.0 \u00b1 1.3\n61.4\u00b16.1\n34.7\u00b13.6\n48.0\u00b16.0\n34.1\u00b12.0\n\n100\n9.0 \u00b1 4.5\n5.7 \u00b1 2.2\n6.3 \u00b1 2.5\n100\n17.6\u00b15.2\n6.2 \u00b1 3.8\n4.6 \u00b1 3.4\n100\n1.8 \u00b1 0.8\n1.4 \u00b1 1.1\n100\n42.4\u00b13.3\n38.3\u00b12.7\n\n40\n32.9\u00b1 16.8\n0.7 \u00b1 1.5\n0.8 \u00b1 1.7\n120\n8.8 \u00b1 1.4\n5.4 \u00b1 2.4\n4.5 \u00b1 1.8\n120\n18.4\u00b15.1\n4.5 \u00b1 3.6\n3.3 \u00b1 3.1\n120\n1.3 \u00b1 0.9\n1.0 \u00b1 0.8\n120\n40.9\u00b13.2\n38.1\u00b12.6\n\n45\n27.6\u00b1 10.8\n0.9 \u00b1 1.4\n1.2 \u00b1 1.8\n160\n8.8 \u00b1 2.3\n4.9 \u00b1 3.8\n4.4 \u00b1 2.1\n160\n19.2\u00b14.0\n2.6 \u00b1 1.6\n2.2 \u00b1 
1.8\n160\n0.9 \u00b1 0.4\n0.7 \u00b1 0.4\n160\n36.1\u00b11.5\n35.0\u00b12.8\n\n80\n3.3 \u00b1 2.5\n2.2 \u00b1 1.4\n80\n46.2\u00b13.7\n40.3\u00b13.0\n\n60\n9.5 \u00b1 2.7\n7.4 \u00b1 3.8\n9.9 \u00b1 5.5\n60\n17.2\u00b16.7\n12.6\u00b14.3\n\n60\n3.6 \u00b1 3.2\n2.1 \u00b1 2.0\n60\n53.2\u00b15.7\n45.0\u00b15.9\n\nTest error and standard deviation of the SSL methods over 10 runs for varying number of labeled\n\npoints.\n\nClustering. We use the normalized hypergraph cut as clustering objective. For more than two\nclusters we recursively partition the hypergraph until the desired number of clusters is reached.\nFor comparison we use the normalized spectral clustering approach based on the Laplacian LCE\n[11](clique expansion). The \ufb01rst part (\ufb01rst 6 columns) of the following table shows the clustering\nerrors (majority vote on each cluster) of both methods as well as the normalized cuts achieved by\nthese methods on the hypergraph and on the graph resulting from the clique expansion. Moreover,\nwe show results (last 4 columns) which are obtained based on a kNN graph (unit weights) which\nis built based on the Hamming distance (note that we have categorical features) in order to check if\nthe hypergraph modeling of the problem is actually useful compared to a standard similarity based\ngraph construction. The number k is chosen as the smallest number for which the graph becomes\nconnected and we compare results of normalized 1-spectral clustering [14] and the standard spectral\nclustering [22]. 
Note that the employed hypergraph construction has no free parameter.\n\n                 Clustering Error %  Hypergraph Ncut     Graph(CE) Ncut      Clustering Error %  kNN-Graph Ncut\nDataset          Ours      [11]      Ours      [11]      Ours      [11]      [14]      [22]      [14]      [22]\nMushrooms        10.98     32.25     0.0011    0.0013    0.6991    0.7053    48.2      48.2      1e-4      1e-4\nZoo              16.83     15.84     0.6739    0.6784    5.1315    5.1703    5.94      5.94      1.636     1.636\n20-newsgroup     47.77     33.20     0.0176    0.0303    2.3846    1.8492    66.38     66.38     0.1031    0.1034\ncovertype (4,5)  22.44     22.44     0.0018    0.0022    0.7400    0.6691    22.44     22.44     0.02182   0.0152\ncovertype (6,7)  8.16      -         8.18e-4   -         0.6882    -         45.85     45.85     0.0041    0.0041\n\nFirst, we observe that our approach optimizing the normalized hypergraph cut yields better or similar results in terms of clustering errors compared to the clique expansion (except for 20-newsgroup for the same reason given in the previous paragraph). The improvement is significant in the case of Mushrooms, while for Zoo our clustering error is slightly higher. However, we always achieve smaller normalized hypergraph cuts. Moreover, our method sometimes has even smaller cuts on the graphs resulting from the clique expansion, although it does not directly optimize this objective in contrast to [11]. Again, we could not run the method of [11] on covertype (6,7) since the weight matrix is very dense. Second, the comparison to a standard graph-based approach where the similarity structure is obtained using the Hamming distance on the categorical features shows that using the hypergraph structure is indeed useful. Nevertheless, we think that there is room for improvement regarding the construction of the hypergraph, which is a topic for future research.\n\nAcknowledgments\nM.H. would like to acknowledge support by the ERC Starting Grant NOLEPRO and L.J. acknowledges support by the DFG SPP-1324.\n\nReferences\n[1] Y. Huang, Q. Liu, and D. Metaxas. Video object segmentation by hypergraph cut. 
In CVPR, pages 1738\u20131745, 2009.\n[2] P. Ochs and T. Brox. Higher order motion models and spectral clustering. In CVPR, pages 614\u2013621, 2012.\n[3] S. Klamt, U.-U. Haus, and F. Theis. Hypergraphs and cellular networks. PLoS Computational Biology, 5:e1000385, 2009.\n[4] Z. Tian, T. Hwang, and R. Kuang. A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge. Bioinformatics, 25:2831\u20132838, 2009.\n[5] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. VLDB Journal, 8:222\u2013236, 2000.\n[6] J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He. Music recommendation by unified hypergraph: Combining social media information and music content. In Proc. of the Int. Conf. on Multimedia (MM), pages 391\u2013400, 2010.\n[7] A. Shashua, R. Zass, and T. Hazan. Multi-way clustering using super-symmetric non-negative tensor factorization. In ECCV, pages 595\u2013608, 2006.\n[8] S. Rota Bul\u00f2 and M. Pelillo. A game-theoretic approach to hypergraph clustering. In NIPS, pages 1571\u20131579, 2009.\n[9] M. Leordeanu and C. Sminchisescu. Efficient hypergraph clustering. In AISTATS, pages 676\u2013684, 2012.\n[10] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. J. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, pages 838\u2013845, 2005.\n[11] D. Zhou, J. Huang, and B. Sch\u00f6lkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, pages 1601\u20131608, 2006.\n[12] S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In ICML, pages 17\u201324, 2006.\n[13] E. Ihler, D. Wagner, and F. Wagner. Modeling hypergraphs by graphs with the same mincut properties. Information Processing Letters, 45:171\u2013175, 1993.\n[14] M. Hein and T. B\u00fchler. 
An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pages 847\u2013855, 2010.\n[15] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In ICML, pages 1039\u20131046, 2010.\n[16] M. Hein and S. Setzer. Beyond spectral clustering - tight relaxations of balanced graph cuts. In NIPS, pages 2366\u20132374, 2011.\n[17] T. B\u00fchler, S. Rangapuram, S. Setzer, and M. Hein. Constrained fractional set programs and their application in local clustering and community detection. In ICML, pages 624\u2013632, 2013.\n[18] F. Bach. Learning with submodular functions: A convex optimization perspective. CoRR, abs/1111.6453, 2011.\n[19] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning, 56:209\u2013239, 2004.\n[20] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Sch\u00f6lkopf. Learning with local and global consistency. In NIPS, volume 16, pages 321\u2013328, 2004.\n[21] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 22(8):888\u2013905, 2000.\n[22] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395\u2013416, 2007.\n[23] E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015\u20131046, 2010.\n[24] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. of Math. Imaging and Vision, 40:120\u2013145, 2011.\n[25] L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optimization Theory and Applications, 158(2):460\u2013479, 2013.\n[26] K. Kiwiel. On linear-time algorithms for the continuous quadratic knapsack problem. J. Opt. 
Theory Appl., 134(3):549\u2013554, 2007.\n", "award": [], "sourceid": 1146, "authors": [{"given_name": "Matthias", "family_name": "Hein", "institution": "Saarland University"}, {"given_name": "Simon", "family_name": "Setzer", "institution": "Saarland University"}, {"given_name": "Leonardo", "family_name": "Jost", "institution": "Saarland University"}, {"given_name": "Syama Sundar", "family_name": "Rangapuram", "institution": "Saarland University"}]}