{"title": "Quadratic Decomposable Submodular Function Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1054, "page_last": 1064, "abstract": "We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization. The problem is closely related to decomposable submodular function minimization and arises in many learning on graphs and hypergraphs settings, such as graph-based semi-supervised learning and PageRank. We approach the problem via a new dual strategy and describe an objective that may be optimized via random coordinate descent (RCD) methods and projections onto cones. We also establish the linear convergence rate of the RCD algorithm and develop efficient projection algorithms with provable performance guarantees. Numerical experiments in semi-supervised learning on hypergraphs confirm the efficiency of the proposed algorithm and demonstrate the significant improvements in prediction accuracy with respect to state-of-the-art methods.", "full_text": "Quadratic Decomposable Submodular Function\n\nMinimization\n\nPan Li\nUIUC\n\npanli2@illinois.edu\n\nNiao He\nUIUC\n\nniaohe@illinois.edu\n\nOlgica Milenkovic\n\nUIUC\n\nmilenkov@illinois.edu\n\nAbstract\n\nWe introduce a new convex optimization problem, termed quadratic decomposable\nsubmodular function minimization. The problem is closely related to decomposable\nsubmodular function minimization and arises in many learning on graphs and\nhypergraphs settings, such as graph-based semi-supervised learning and PageRank.\nWe approach the problem via a new dual strategy and describe an objective that\nmay be optimized via random coordinate descent (RCD) methods and projections\nonto cones. 
We also establish the linear convergence rate of the RCD algorithm and develop efficient projection algorithms with provable performance guarantees. Numerical experiments in semi-supervised learning on hypergraphs confirm the efficiency of the proposed algorithm and demonstrate the significant improvements in prediction accuracy with respect to state-of-the-art methods.1

1 Introduction

Given [N] = {1, 2, ..., N}, a submodular function F : 2^[N] → R is a set function that for any S1, S2 ⊆ [N] satisfies F(S1) + F(S2) ≥ F(S1 ∪ S2) + F(S1 ∩ S2). Submodular functions are ubiquitous in machine learning as they capture rich combinatorial properties of set functions and provide useful regularization functions for supervised and unsupervised learning [1]. Submodular functions also have continuous Lovász extensions [2], which establish solid connections between combinatorial and continuous optimization problems.
Due to their versatility, submodular functions and their Lovász extensions are frequently used in applications such as learning on directed/undirected graphs and hypergraphs [3, 4], image denoising via total variation regularization [5, 6] and MAP inference in high-order Markov random fields [7]. In many optimization settings involving submodular functions, one encounters the convex program

min_x Σ_{i∈[N]} (x_i − a_i)² + Σ_{r∈[R]} [f_r(x)]^p,

where a ∈ R^N, p ∈ {1, 2}, and where for all r in some index set [R], f_r stands for the Lovász extension of a submodular function F_r that describes a combinatorial structure over the set [N].
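To make the two definitions above concrete, the following sketch (illustrative code, not from the paper) checks the submodular inequality by brute force for a small graph-cut function, and evaluates its Lovász extension by the standard sorted-prefix formula (restated as equation (1) in Section 2); for a cut function the extension reduces to the total variation Σ w_ij |x_i − x_j|.

```python
import itertools

def cut_value(S, edges, w):
    """F(S): total weight of edges crossing the cut (S, [N] \\ S); a classic submodular function."""
    S = set(S)
    return sum(w[e] for e in edges if len(S & set(e)) == 1)

def is_submodular(F, N):
    """Brute-force check of F(S1) + F(S2) >= F(S1 | S2) + F(S1 & S2) over all subset pairs."""
    subsets = [frozenset(c) for k in range(N + 1)
               for c in itertools.combinations(range(N), k)]
    return all(F(s1) + F(s2) >= F(s1 | s2) + F(s1 & s2) - 1e-12
               for s1 in subsets for s2 in subsets)

def lovasz_extension(F, x):
    """f(x) = sum_k F({i_1..i_k})(x_{i_k} - x_{i_{k+1}}) + F([N]) x_{i_N}, with x sorted decreasingly."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    prefix, value = [], 0.0
    for k, i in enumerate(order):
        prefix.append(i)
        nxt = x[order[k + 1]] if k + 1 < len(order) else 0.0
        value += F(frozenset(prefix)) * (x[i] - nxt)
    return value
```

The function and variable names here are hypothetical; only the two formulas they implement come from the text.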
For example, in image denoising, each parameter a_i may correspond to the observed value of a pixel i, while the functions [f_r(x)]^p may be used to impose smoothness constraints on pixel neighborhoods. One of the main difficulties in solving this optimization problem comes from the nondifferentiability of the second term: a direct application of subgradient methods leads to convergence rates as slow as 1/√k, where k denotes the number of iterations [8].
In recent years, the above described optimization problem with p = 1 has received significant interest in the context of decomposable submodular function minimization (DSFM) [9].

1The code for QDSFM is available at https://github.com/lipan00123/QDSDM.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The motivation for studying this particular setup is two-fold: first, solving the convex optimization problem directly recovers the combinatorial solution to the submodular min-cut problem min_{S⊆[N]} F(S), where F(S) = Σ_{r∈[R]} F_r(S) − 2Σ_{i∈S} a_i [10]; second, minimizing a submodular function decomposed into a sum of simpler components F_r, r ∈ [R], is much easier than minimizing an unrestricted submodular function F over a large set [N]. There are several milestone results for the DSFM problem: Jegelka et al. [11] first tackled the problem by considering its dual and proposed a solver based on Douglas-Rachford splitting. Nishihara et al. [12] established the linear convergence rate of alternating projection methods for solving the dual problem. Ene et al. [13, 14] presented linear convergence rates of coordinate descent methods and subsequently tightened the results via submodular flows. Pan et al.
[15] improved those methods by leveraging incidence relations of the arguments of submodular function components.
Here, we focus on the other important case, p = 2; we refer to the underlying optimization problem as quadratic DSFM (QDSFM). QDSFM appears naturally in a wide spectrum of applications, including learning on graphs and hypergraphs, and in particular, semi-supervised learning and PageRank. It has also been demonstrated both theoretically [16] and empirically [4, 17] that employing regularization with quadratic terms offers significantly improved predictive performance when compared to the case p = 1. Despite the importance of the QDSFM problem, its theoretical and algorithmic developments have not reached the same level of maturity as those for the DSFM problem. To the best of our knowledge, only a few reported works [17, 18] have provided solutions for specific instances of QDSFM, with sublinear convergence guarantees.
This work takes a substantial step towards solving the QDSFM problem in its most general form by developing a family of algorithms with linear convergence rate and small iteration cost, including the randomized coordinate descent (RCD) and alternating projection (AP) algorithms. Our contributions are as follows. First, we derive a new dual formulation for the QDSFM problem, since an analogue of the dual transformation for the DSFM problem is not applicable. Interestingly, the dual QDSFM problem requires one to find the best approximation of a hyperplane via a product cone, as opposed to the product polytope encountered in the dual DSFM problem. Second, we develop a linearly convergent RCD (and AP) algorithm for solving the dual QDSFM. Because of the special underlying conic structure, new analytic approaches are needed to prove the weak strong convexity of the dual QDSFM, which essentially guarantees linear convergence.
Third, we develop generalized Frank-Wolfe and min-norm-point methods for efficiently computing the conic projection required in each step of RCD (and AP) and provide a 1/k-rate convergence analysis. Finally, we evaluate our methods on semi-supervised learning over hypergraphs using synthetic and real datasets, and demonstrate superior performance both in convergence rate and prediction accuracy compared to existing methods. We postpone all the detailed proofs and supplementary discussion to the full version of this paper.

2 Notation and Problem Formulation

For a submodular function F defined over the ground set [N], the Lovász extension is a convex function f : R^N → R, defined for all x ∈ R^N according to

f(x) = Σ_{k=1}^{N−1} F({i_1, ..., i_k})(x_{i_k} − x_{i_{k+1}}) + F([N]) x_{i_N},    (1)

where x_{i_1} ≥ x_{i_2} ≥ ··· ≥ x_{i_N}. For a vector x ∈ R^N and a set S ⊆ [N], let x(S) = Σ_{i∈S} x_i, where x_i is the component of x in the ith dimension. Then, the base polytope of F, denoted by B, is defined as

B = {y ∈ R^N | y(S) ≤ F(S), ∀S ⊂ [N], y([N]) = F([N])}.    (2)

Using the base polytope, the Lovász extension can also be written as f(x) = max_{y∈B} ⟨y, x⟩.
We say that an element i ∈ [N] is incident to F if there exists an S ⊂ [N] such that F(S) ≠ F(S ∪ {i}). Furthermore, we use (x)_+ to denote the function max{x, 0}. Given a positive diagonal matrix W ∈ R^{N×N} and a vector x ∈ R^N, we define the W-norm according to ‖x‖_W = √(Σ_{i=1}^N W_ii x_i²), and simply use ‖·‖ when W = I, the identity matrix. For an index set [R], we denote the R-product of N-dimensional Euclidean spaces by ⊗_{r∈[R]} R^N. A vector y ∈ ⊗_{r∈[R]} R^N is written as (y_1, y_2, ..., y_R), where y_r ∈ R^N for all r ∈ [R]. The W-norm induced on ⊗_{r∈[R]} R^N equals ‖y‖_{I(W)} = √(Σ_{r=1}^R ‖y_r‖²_W). We reserve the symbol ρ for max_{y_r∈B_r, ∀r} √(Σ_{r∈[R]} ‖y_r‖²_1).

Next, we formally state the QDSFM problem. Consider a collection of submodular functions {F_r}_{r∈[R]} defined over the ground set [N], and denote their Lovász extensions and base polytopes by {f_r}_{r∈[R]} and {B_r}_{r∈[R]}, respectively. We use S_r ⊆ [N] to denote the set of variables incident to F_r and make the further assumption that the functions F_r are normalized and nonnegative, i.e., that F_r(∅) = 0 and F_r ≥ 0. These two mild constraints are satisfied by almost all submodular functions that arise in practical applications. We consider the following minimization problem:

QDSFM:    min_{x∈R^N} ‖x − a‖²_W + Σ_{r∈[R]} [f_r(x)]²,    (3)

where a ∈ R^N is a given vector and W ∈ R^{N×N} is a positive diagonal matrix. As an immediate observation, the problem has a unique solution, denoted by x*, due to the strong convexity of (3).

3 Applications

We start by reviewing some important machine learning problems that give rise to QDSFM.
Semi-supervised learning (SSL) is a learning paradigm that allows one to utilize the underlying structure or distribution of unlabeled samples whenever the information provided by labeled samples does not suffice for learning an inductive predictor [19, 20]. A standard setting for a K-class transductive learner is as follows: given N data points {z_i}_{i∈[N]}, and labels for the first l (≪ N) samples {y_i | y_i ∈ [K]}_{i∈[l]}, the learner is asked to infer the labels of all the remaining data points i ∈ [N]/[l].
The widely-used SSL problem with least-squares loss requires one to solve K regularization problems: for each class k ∈ [K], set the scores of data points within the class to

x̂^(k) = arg min_{x^(k)} β‖x^(k) − a^(k)‖² + Ω(x^(k)),

where a^(k) represents the information provided by the known labels, i.e., a^(k)_i = 1 if y_i = k, and 0 otherwise, β denotes a hyperparameter and Ω stands for a smoothness regularizer. The labels of the data points are inferred according to ŷ_i = arg max_k {x̂^(k)_i}. For typical graph and hypergraph learning problems, Ω is often chosen to be a Laplacian regularizer constructed using {z_i}_{i∈[N]} (see Table 1). In Laplacian regularization, each edge/hyperedge corresponds to one functional component in the QDSFM problem. Note that the variables may also be normalized with respect to their degrees, in which case the normalized Laplacian is used instead. For example, in graph learning, one of the terms in Ω assumes the form w_ij (x_i/√d_i − x_j/√d_j)², where d_i and d_j correspond to the degrees of the vertex variables i and j, respectively. It can be shown using some simple algebra that the normalization term reduces to the matrix W used in the definition of the QDSFM problem (3).

Description of the combinatorial structure | One component in Ω(x) | The submodular function
Graphs: nearest neighbors [4]/Gaussian similarity [21] | w_ij (x_i − x_j)², S_r = {i, j} | F_r(S) = √w_ij if |S ∩ {i, j}| = 1
Hypergraphs: categorical features [17] | w_r max_{i,j∈S_r} (x_i − x_j)² | F_r(S) = √w_r if |S ∩ S_r| ∈ [1, |S_r| − 1]
Directed hypergraphs: citation networks [18] | w_r max_{(i,j)∈H_r×T_r} ((x_i − x_j)_+)² | F_r(S) = √w_r if |S ∩ H_r| ≥ 1, |([N]/S) ∩ T_r| ≥ 1
Submodular hypergraphs: mutual information [22, 23] | general [f_r(x)]² | a symmetric submodular function

Table 1: Laplacian regularization in semi-supervised learning. In the third column, whenever the stated conditions are not satisfied, it is assumed that F_r = 0. For directed hypergraphs, H_r and T_r are subsets of S_r termed the head and the tail set. When H_r = T_r = S_r, one recovers the setting for undirected hypergraphs.

PageRank (PR) is a well-known method used for ranking Web pages [24]. Web pages are linked and thus naturally give rise to a graph G = (V, E), where, without loss of generality, one may assume that V = [N]. Let A and D be the adjacency matrix and diagonal degree matrix of G, respectively. PR essentially finds a fixed point p ∈ R^N via the iterative procedure p^(t+1) = (1 − α)s + αAD^{−1}p^(t), where s ∈ R^N is a fixed vector and α ∈ (0, 1].
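The fixed-point iteration just described can be sketched in a few lines (illustrative helper names, not the paper's code):

```python
import numpy as np

def pagerank(A, s, alpha=0.85, tol=1e-12, max_iter=100000):
    """Iterate p <- (1 - alpha) s + alpha A D^{-1} p to a fixed point."""
    d = A.sum(axis=0)                      # degrees; A D^{-1} normalizes columns of A
    p = s.astype(float).copy()
    for _ in range(max_iter):
        p_next = (1 - alpha) * s + alpha * (A @ (p / d))
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next
    return p
```

Since AD^{−1} is column-stochastic, the iteration preserves the total mass of s and contracts at rate α, so it converges for any α ∈ (0, 1).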
It is easy to verify that p is a solution of the problem

min_p (1 − α)/α ‖p − s‖²_{D^{−1}} + (D^{−1}p)^T (D − A)(D^{−1}p) = ‖x − a‖²_W + Σ_{ij∈E} (x_i − x_j)²,    (4)

where x = D^{−1}p, a = D^{−1}s and W = ((1 − α)/α) D. Obviously, (4) may be viewed as a special instance of the QDSFM problem. Note that the PR iterations on graphs take the form D^{−1/2}(p^(t+1) − p^(t)) = (1 − α)D^{−1/2}(s − p^(t)) − αL(D^{−1/2}p^(t)), where L = I − D^{−1/2}AD^{−1/2} is the normalized Laplacian of the graph. The PR problem for hypergraphs is significantly more involved, and may be formulated using diffusion processes (DP) based on a normalized hypergraph Laplacian operator L [25]. The underlying PR procedure reads as dx/dt = (1 − α)(a − x) − αL(x), where x(t) ∈ R^N is the potential vector at time t. Tracking this DP precisely for every time point t is a difficult task which requires solving a densest subset problem [25]. However, the stationary point of this process, i.e., a point x that satisfies (1 − α)(a − x) − αL(x) = 0, may be easily found by solving the optimization problem

min_x (1 − α)‖x − a‖² + α⟨x, L(x)⟩.

The term ⟨x, L(x)⟩ matches the normalized regularization term for hypergraphs listed in Table 1, i.e., Σ_r w_r max_{i,j∈S_r} (x_i/√d_i − x_j/√d_j)². Clearly, once again this leads to the QDSFM problem. The PR equation for directed or submodular hypergraphs can be stated similarly using the Laplacian operators described in [26, 23, 27].
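For the graph case, the stationary-point characterization above amounts to one linear solve, and the solution makes the gradient of the quadratic objective vanish; a small numerical sanity check (toy 3-node path graph, chosen here for illustration):

```python
import numpy as np

# Stationary point of the diffusion (1 - alpha)(a - x) - alpha L(x) = 0 on a graph,
# where L = I - D^{-1/2} A D^{-1/2} is the normalized Laplacian.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
d = A.sum(axis=1)
L = np.eye(3) - A / np.sqrt(np.outer(d, d))
alpha = 0.85
a = np.array([1.0, 0.0, 0.0])

# (1 - alpha) I + alpha L is positive definite, so the solve is well posed.
x = np.linalg.solve((1 - alpha) * np.eye(3) + alpha * L, (1 - alpha) * a)

# x is also the minimizer of (1 - alpha)||x - a||^2 + alpha <x, L x>: its gradient vanishes.
grad = 2 * (1 - alpha) * (x - a) + 2 * alpha * (L @ x)
```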
The PR algorithm defined in this manner has many advantages over the multilinear PR method based on higher-order Markov chains [28], since it allows for arbitrarily large orders and is guaranteed to converge for any α ∈ (0, 1]. In the full version of this paper, we provide a more detailed analysis of the above described PR method.

4 Algorithms for Solving the QDSFM Problem

We describe next the first known linearly convergent algorithms for solving the QDSFM problem. To start with, observe that the QDSFM problem is convex, since the Lovász extensions f_r are convex and nonnegative. But the objective is in general nondifferentiable. To address this issue, we consider the dual of the QDSFM problem. A natural idea is to try to mimic the approach used for DSFM by invoking the characterization of the Lovász extension, f_r(x) = max_{y_r∈B_r} ⟨y_r, x⟩, ∀r. However, this leads to a semidefinite programming problem for the dual variables {y_r}_{r∈[R]}, which is complex to solve for large problems. Instead, we establish a new dual formulation that overcomes this obstacle. The dual formulation hinges upon the following key observation:

[f_r(x)]² = max_{φ_r≥0} φ_r f_r(x) − φ_r²/4 = max_{φ_r≥0} max_{y_r∈φ_rB_r} ⟨y_r, x⟩ − φ_r²/4.    (5)

Let y = (y_1, y_2, ..., y_R) and φ = (φ_1, φ_2, ..., φ_R). Using equation (5), we arrive at
Lemma 4.1. The following optimization problem is dual to (3):

min_{y,φ} g(y, φ) := ‖Σ_{r∈[R]} y_r − 2Wa‖²_{W^{−1}} + Σ_{r∈[R]} φ_r²,  s.t. y ∈ ⊗_{r∈[R]} φ_rB_r, φ ∈ ⊗_{r∈[R]} R_{≥0}.    (6)

By introducing Λ = (λ_r) ∈ ⊗_{r∈[R]} R^N, the previous optimization problem can be rewritten as

min_{y,φ,Λ} Σ_{r∈[R]} [ R‖y_r − λ_r‖²_{W^{−1}} + φ_r² ],  s.t. y ∈ ⊗_{r∈[R]} φ_rB_r, φ ∈ ⊗_{r∈[R]} R_{≥0}, Σ_{r∈[R]} λ_r = 2Wa,    (7)

where the equivalence follows because minimizing R Σ_{r∈[R]} ‖y_r − λ_r‖²_{W^{−1}} over Λ subject to Σ_{r∈[R]} λ_r = 2Wa yields exactly ‖Σ_{r∈[R]} y_r − 2Wa‖²_{W^{−1}}. The primal variables in both cases are recovered via x = a − (1/2) W^{−1} Σ_{r∈[R]} y_r.

Counterparts of the above results for the DSFM problem were discussed in Lemma 2 of [11]. However, there is a significant difference between [11] and the QDSFM problem, since in the latter setting we use a conic set constructed from base polytopes of submodular functions. More precisely, for each r, we define a convex cone C_r = {(y_r, φ_r) | φ_r ≥ 0, y_r ∈ φ_rB_r}, which gives the feasible set of the dual variables (y_r, φ_r). The optimization problem (7) essentially asks one to find the best approximation of an affine space in terms of a product cone ⊗_{r∈[R]} C_r, as opposed to the product polytope encountered in DSFM. Several algorithms have been developed for solving the DSFM problem, including the Douglas-Rachford splitting method (DR) [11], the alternating projection method (AP) [12] and the random coordinate descent method (RCD) [13]. Similarly, for QDSFM, we propose to solve the dual problem (6) using the RCD method, exploiting the separable structure of the feasible set, and to solve (7) using the AP method. Although these methods for DSFM may be adapted to QDSFM, a novel scheme of analysis handling the conic structure is required, which takes up the rest of this section and the next one. Due to the page limitation, the analysis of the AP method is deferred to the full version of this paper.
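The variational identity (5) that drives the dual construction is easy to verify numerically: for f ≥ 0, the concave function φ ↦ φf − φ²/4 is maximized at φ = 2f with value f². A tiny grid-search check (illustrative only):

```python
import numpy as np

def quadratic_via_linear(f_val, phis):
    """max over a grid of phi >= 0 of (phi * f_val - phi**2 / 4); for f_val >= 0 this
    equals f_val**2, attained at phi = 2 * f_val -- the content of identity (5)."""
    return max(phi * f_val - phi ** 2 / 4.0 for phi in phis)

f_val = 1.7
phis = np.linspace(0.0, 10.0, 100001)   # fine grid containing the maximizer phi = 3.4
best = quadratic_via_linear(f_val, phis)
```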
Also, it is worth mentioning that the results of this work can be easily extended to the DR method, as well as to accelerated and parallel variants of the RCD method [13, 15].

RCD Algorithm. Define the projection Π onto a convex cone C_r as follows: for a given point b in R^N, let Π_{C_r}(b) = arg min_{(y_r,φ_r)∈C_r} ‖y_r − b‖²_{W^{−1}} + φ_r². For each coordinate r, optimizing over the dual variables (y_r, φ_r) is equivalent to computing a projection onto the cone C_r. This gives rise to the RCD method summarized in Algorithm 1.

Algorithm 1: RCD Solver for (6)
0: For all r, initialize y_r^(0) ← 0, φ_r^(0) ← 0, and set k ← 0
1: In iteration k:
2:   Uniformly at random pick an r ∈ [R].
3:   Set (y_r^(k+1), φ_r^(k+1)) ← Π_{C_r}(2Wa − Σ_{r'≠r} y_{r'})
4:   Set y_{r'}^(k+1) ← y_{r'}^(k), φ_{r'}^(k+1) ← φ_{r'}^(k) for r' ≠ r

In Section 5, we describe efficient methods to compute the projections, but throughout the remainder of this section, we treat the projections as provided by an oracle. Note that each iteration of the RCD method only requires the computation of one projection onto a single cone. In contrast, methods such as DR, AP and the primal-dual hybrid gradient descent (PDHG) proposed in [29] and used for SSL on hypergraphs [17] require performing a complete gradient descent and computing a total of R projections at each iteration. Thus, from the perspective of iteration cost, RCD is significantly more efficient, especially when R is large and computing Π(·) is costly.
The objective g(y, φ) described in (6) is not strongly convex in general. Inspired by the work on DSFM [13], in what follows we show that this objective indeed satisfies a weak strong convexity condition, which guarantees linear convergence of the RCD algorithm.
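The outer loop of the RCD scheme described above can be sketched as follows; the projection oracle is supplied by the caller, and all names are illustrative rather than the paper's implementation:

```python
import random

def rcd_qdsfm_dual(project_cone, R, two_W_a, num_iters, dim, seed=0):
    """Sketch of the RCD outer loop: repeatedly pick a block r at random and reset
    (y_r, phi_r) by projecting the residual 2Wa - sum_{r' != r} y_{r'} onto the cone C_r.
    `project_cone(r, b)` is a caller-supplied oracle (exact or approximate)."""
    rng = random.Random(seed)
    y = [[0.0] * dim for _ in range(R)]
    phi = [0.0] * R
    for _ in range(num_iters):
        r = rng.randrange(R)
        b = [two_W_a[i] - sum(y[rp][i] for rp in range(R) if rp != r)
             for i in range(dim)]
        y[r], phi[r] = project_cone(r, b)
    return y, phi
```

For a singleton base polytope B = {q} (with W = I), the projection has the closed form φ* = max(0, ⟨q, b⟩/(1 + ‖q‖²)), y* = φ*q, which makes the skeleton easy to exercise end to end.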
Note that due to the additional\nterm \u03c6 that characterizes the conic structures, extra analytic effort is required than that for the DSFM\ncase. We start by providing a general result that characterizes relevant geometric properties of the\ncone \u2297r\u2208[R]Cr.\nLemma 4.2. Consider a feasible solution (y, \u03c6) \u2208 \u2297r\u2208[R]Cr and a nonnegative vector \u03c6(cid:48) = (\u03c6(cid:48)\ntwo positive diagonal matrices. Then, there exists a y(cid:48) \u2208 \u2297r\u2208[R]\u03c6(cid:48)\n\n\u2297r\u2208[R]R\u22650. Let s be an arbitrary point in the base polytope of(cid:80)\n\uf8ee\uf8f0(cid:107) (cid:88)\n\nr) \u2208\nrFr, and let W (1), W (2) be\nr = s and\n\nr\u2208[R] y(cid:48)\nW (2) + (cid:107)\u03c6 \u2212 \u03c6(cid:48)(cid:107)2\n\nrBr such that(cid:80)\n\n(cid:107)y \u2212 y(cid:48)(cid:107)2\n\nyr \u2212 s(cid:107)2\n\n\uf8f9\uf8fb ,\n\nr\u2208[R] \u03c6(cid:48)\n\nr\u2208[R]\n\n\u03c12 (cid:88)\n\ni\u2208[N ]\n\n9\n4\n\nW (1)\n\nii + 1\n\n\uf8fc\uf8fd\uf8fe .\n\nW (1)\nii\n\n1/W (2)\njj ,\n\nj\u2208[N ]\n\n(8)\n\nwhere\n\n\u00b5(W (1), W (2)) = max\n\nI(W (1)) + (cid:107)\u03c6 \u2212 \u03c6(cid:48)(cid:107)2 \u2264 \u00b5(W (1), W (2))\n(cid:88)\n\n\uf8f1\uf8f2\uf8f3(cid:88)\n\u039e = {(y, \u03c6)| (cid:88)\n\ni\u2208[N ]\n\nAs a corollary of Lemma 4.2, the next result establishes the weak strong convexity of g(y, \u03c6). To\nproceed, we introduce some additional notation. Denote the set of solutions of problem (6) by\n\nyr = 2W (a \u2212 x\u2217), \u03c6r = inf\nyr\u2208\u03b8Br\n\n\u03b8,\u2200r}.\n\nr\u2208[R]\n\nNote that this representation arises from the relationship between the optimal primal and dual solution\nas stated in Lemma 4.1. We denote the optimal value of the objective over (y, \u03c6) \u2208 \u039e by g\u2217 = g(y, \u03c6),\nand de\ufb01ne a distance function d((y, \u03c6), \u039e) =\nLemma 4.3. 
Suppose that (y, φ) ∈ ⊗_{r∈[R]} C_r and that (y*, φ*) ∈ Ξ minimizes ‖y − y*‖²_{I(W^{−1})} + ‖φ − φ*‖², so that d((y, φ), Ξ) = √(‖y − y*‖²_{I(W^{−1})} + ‖φ − φ*‖²). Then

‖Σ_{r∈[R]} (y_r − y_r*)‖²_{W^{−1}} + ‖φ − φ*‖² ≥ d²((y, φ), Ξ) / μ(W^{−1}, W^{−1}).

Based on Lemma 4.3, we can establish the linear convergence rate of the RCD algorithm.
Theorem 4.4. After k iterations of Algorithm 1, we obtain a pair (y^(k), φ^(k)) that satisfies

E[g(y^(k), φ^(k)) − g* + d²((y^(k), φ^(k)), Ξ)] ≤ [1 − 2/(R[1 + μ(W^{−1}, W^{−1})])]^k [g(y^(0), φ^(0)) − g* + d²((y^(0), φ^(0)), Ξ)].

Theorem 4.4 implies that O(Rμ(W^{−1}, W^{−1}) log(1/ε)) iterations are required to obtain an ε-optimal solution. Below we give an explicit characterization of this complexity for the SSL and PR problems with normalized Laplacian regularization, as discussed in Section 3.
Corollary 4.5. Suppose that W = βD, where β is a hyperparameter and D is a diagonal degree matrix such that D_ii = Σ_{r∈[R]: i∈S_r} max_{S⊆V} [F_r(S)]². Algorithm 1 requires an expected number of O(N²R max{1, 9β^{−1}} max_{i,j∈[N]} (D_ii/D_jj) log(1/ε)) iterations to return an ε-optimal solution.

The term N²R also appears in the expression for the complexity of the RCD method for solving the DSFM problem [14]. The term max{1, 9β^{−1}} implies that whenever β is small, the convergence rate is slow. This makes sense: for example, in the PR problem (4), a small β corresponds to a large α, which typically implies longer mixing times of the underlying Markov process. The term max_{i,j∈[N]} D_ii/D_jj arises due to the degree-based normalization.

5 Computing the Projections Π_{C_r}(·)

In this section, we provide efficient routines for computing the projection onto the conic set, Π_{C_r}(·). As the procedure works for all values of r ∈ [R], we drop the subscript r for simplicity of notation. First, recall that

Π_C(a) = arg min_{(y,φ)} h(y, φ) ≜ ‖y − a‖²_W̃ + φ²,  s.t. y ∈ φB, φ ≥ 0,    (9)

where W̃ = W^{−1}, and where B denotes the base polytope of the submodular function F. Let h* and (y*, φ*) be the optimal value of the objective function and the argument that optimizes it, respectively. When performing projections, one only needs to consider the variables incident to F, and set all other variables to zero. For ease of exposition, we assume that all variables in [N] are incident to F.
Unlike QDSFM, DSFM involves the computation of projections onto the base polytopes of submodular functions. Two algorithms, the Frank-Wolfe (FW) method [30] and the Fujishige-Wolfe minimum norm point algorithm (MNP) [31], are used for this purpose. Both methods assume cheap linear minimization oracles on polytopes and attain a 1/k-convergence rate. The MNP algorithm is more sophisticated and empirically more efficient. Nonetheless, neither of these methods can be applied directly to cones. To this end, we modify both methods by adjusting them to the conic structure in (9) and show that a 1/k-convergence rate still holds. We refer to the procedures as the conic MNP method and the conic FW method, respectively. Here we focus mainly on the conic MNP method described in Algorithm 2, as it is more sophisticated.
A detailed discussion of the conic FW method and its convergence guarantees can be found in the full version of this work.
The conic MNP algorithm keeps track of an active set S = {q_1, q_2, ...} and searches for the best solution in its conic hull. Let us denote the cone of an active set S by cone(S) = {Σ_{q_i∈S} α_i q_i | α_i ≥ 0} and its linear set by lin(S) = {Σ_{q_i∈S} α_i q_i | α_i ∈ R}. Similar to the original MNP algorithm, Algorithm 2 contains two levels of loops: MAJOR and MINOR. In the MAJOR loop, we greedily add a new active point q^(k) to the set S, obtained from the linear minimization oracle w.r.t. the base polytope (Step 2), and by the end of the MAJOR loop, we obtain a y^(k+1) that minimizes h(y, φ) over cone(S) (Steps 3-8). The MINOR loop is activated when lin(S) contains some point z that guarantees a smaller value of the objective function than that of the optimal point in cone(S), provided that some active points from S may be removed. Compared to the original MNP method, Steps 2 and 5, as well as the termination Step 3, are specialized to the conic structure.
The following convergence result implies that the conic MNP algorithm also has a convergence rate of order 1/k; the proof is essentially independent of the submodularity assumption and represents a careful modification of the arguments in [32] for conic structures.
Maintain \u03c6(k) =(cid:80)\n\n\u02dcW\n\n1 \u2190 (cid:104)a,q1(cid:105) \u02dcW\n1+(cid:107)q1(cid:107)2\n\n, y(0) \u2190 \u03bb1q1, k \u2190 0\n\nq(k) \u2190 arg minq\u2208B(cid:104)\u2207yh(y(k), \u03c6(k)), q(cid:105) \u02dcW\nIf (cid:104)y(k) \u2212 a, q(k)(cid:105) \u02dcW + \u03c6(k) \u2265 \u2212\u03b4, then break; Else S(k) \u2190 S(k) \u222a {q(k)}.\nIteratively execute (MINOR LOOP):\ni \u2208S(k) \u03b1iq(k)\nq(k)\n\nChoose an arbitrary q1 \u2208 B. Set S(0) \u2190 {q1}, \u03bb(0)\n1. Iteratively execute (MAJOR LOOP):\n2.\n3.\n4.\n5.\n6.\n7.\n8.\n9. y(k+1) \u2190 z(k), \u03bb(k+1) \u2190 \u03b1, S(k+1) \u2190 {i : \u03bb(k+1) > 0}, k \u2190 k + 1\n\ni \u2212 a(cid:107)2\nIf \u03b1i \u2265 0 for all i then break\nElse \u03b8 = mini:\u03b1i<0 \u03bbi/(\u03bbi \u2212 \u03b1i), \u03bb(k+1)\n\n\u03b1 \u2190 arg min\u03b1 (cid:107)(cid:80)\n\ni \u2208S \u03b1i)2, z(k) \u2190(cid:80)\n\ny(k+1) \u2190 \u03b8z(k) + (1 \u2212 \u03b8)y(k), S(k+1) \u2190 {i : \u03bb(k+1) > 0}, k \u2190 k + 1\n\ni\n\n\u2190 \u03b8\u03b1i + (1 \u2212 \u03b8)\u03bb(k)\n\n,\n\ni\n\n+ ((cid:80)\n\n\u02dcW\n\nq(k)\n\ni \u2208S \u03b1iq(k)\nq(k)\n\ni\n\nTheorem 5.1. Let B be an arbitrary polytope in RN and let C = {(y, \u03c6)|y \u2208 \u03c6B, \u03c6 \u2265 0} be the\ncone induced by the polytope. For some positive diagonal matrix \u02dcW , de\ufb01ne Q = maxq\u2208B (cid:107)q(cid:107) \u02dcW .\nAlgorithm 2 yields a sequence of (y(k), \u03c6(k))k=1,2,... such that h(y(k), \u03c6(k)) decreases monotonically.\nAlgorithm 2 terminates when k = O(N(cid:107)a(cid:107) \u02dcW max{Q2, 1}/\u03b4), with h(y(k), \u03c6(k)) \u2264 h\u2217 + \u03b4(cid:107)a(cid:107) \u02dcW .\nBoth the (conic) FW and MNP are approximate algorithms for computing the projections for generic\npolytopes B and their induced cones. We also devised an algorithm of complexity O(N log N ) that\nexactly computes the projection for polytopes B arising in learning on (un)directed hypergraphs. 
A detailed description of the exact projection algorithm can be found in the full version of this paper.

6 Extension to mix-DSFM

With tools for solving both the QDSFM and DSFM problems, it is simple to derive an efficient solver for the following mix-DSFM problem. Suppose that {F_r}_{r∈[R_1+R_2]} is a collection of submodular functions with F_r ≥ 0 for r ∈ [R_1], and let f_r be the Lovász extension of F_r, r ∈ [R_1 + R_2]. We are to solve

mix-DSFM:    min_{x∈R^N} ‖x − a‖²_W + Σ_{r=1}^{R_1} [f_r(x)]² + Σ_{r=R_1+1}^{R_1+R_2} f_r(x).    (10)

By using the same trick as in (5) for the quadratic terms, one may show that the dual of mix-DSFM essentially asks for the best approximation of an affine space in terms of a mixed product of cones and base polytopes. Furthermore, all other related results, including the weak strong convexity of the dual, the linear convergence of RCD/AP, and the 1/k-rate convergence of the MNP/FW methods, can be generalized to the mix-DSFM case via the same techniques developed in this work.

7 Experiments

Our experiments focus on SSL for hypergraphs on both real and synthetic datasets. For this particular problem, the QDSFM instance can be formulated as follows:

min_{x∈R^N} β‖x − a‖² + Σ_{r∈[R]} max_{i,j∈S_r} (x_i/√W_ii − x_j/√W_jj)²,    (11)

where a_i ∈ {−1, 0, 1} indicates whether the corresponding variable i has a negative, missing, or positive label, respectively, β is a parameter that balances the influence of the observations and the regularization term, and {W_ii}_{i∈[N]} defines a positive measure over the variables; it may be chosen as the degree matrix D with D_ii = |{r ∈ [R] : i ∈ S_r}|. Each part in the decomposition corresponds to one hyperedge.
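Evaluating objective (11) is straightforward, since the inner maximum over pairs in a hyperedge reduces to the spread (max minus min) of the normalized scores; a minimal sketch with illustrative names:

```python
import numpy as np

def qdsfm_ssl_objective(x, a, hyperedges, W_diag, beta):
    """Objective (11): beta * ||x - a||^2
    + sum_r max_{i,j in S_r} (x_i / sqrt(W_ii) - x_j / sqrt(W_jj))^2."""
    z = np.asarray(x, dtype=float) / np.sqrt(W_diag)   # degree-normalized scores
    reg = sum((z[list(S)].max() - z[list(S)].min()) ** 2 for S in hyperedges)
    diff = np.asarray(x, dtype=float) - a
    return beta * float(np.dot(diff, diff)) + float(reg)
```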
We compare eight different solvers falling into three categories: (a) our proposed general QDSFM solvers, QRCD-SPE, QRCD-MNP, QRCD-FW and QAP-SPE; (b) alternative solvers for the specific problem (11), including PDHG [17] and SGD [18]; (c) SSL solvers that do not use QDSFM as the objective, including DRCD [13] and InvLap [33]. The first three methods all have outer loops that execute RCD, but with different inner-loop projections, computed via the exact projection algorithm for undirected hyperedges or via the generic MNP and FW methods. The QAP-SPE method uses AP in the outer loop and exact inner-loop projections. PDHG and SGD are the only known solvers for the specific objective (11). DRCD is a state-of-the-art solver for DSFM that also uses a combination of outer-loop RCD and inner-loop projections. InvLap first transforms hyperedges into cliques and then solves a Laplacian-based linear problem. All the aforementioned methods, except InvLap, are implemented in C++ in a nonparallel fashion. InvLap is executed via matrix inversion operations in Matlab, which may be parallelized. The CPU times of all methods were recorded on a 3.2 GHz Intel Core i5. The results are summarized over 100 independent tests. When reporting the results, we use the primal gap ("gap") to characterize the convergence properties of the different solvers. Additional descriptions of the settings and experimental results for the QRCD-MNP and QRCD-FW methods for general submodular functions can be found in the full version of this paper.
Synthetic data. We generated a hypergraph with N = 1000 vertices that belong to two equal-sized clusters. We uniformly at random generated 500 hyperedges within each cluster and 1000 hyperedges across the two clusters. Note that in higher-order clustering, one does not need many hyperedges within each cluster to obtain good clustering results. Each hyperedge includes 20 vertices.
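The synthetic hypergraph described above can be generated along the following lines (a minimal sketch; the helper name `synthetic_hypergraph` and the convention of sampling cross-cluster hyperedges uniformly from the full vertex set are our own illustrative choices, as the paper does not specify the exact sampling of the cross-cluster hyperedges):

```python
import random

def synthetic_hypergraph(N=1000, k=20, within=500, across=1000, seed=0):
    """Two equal-sized clusters; `within` k-vertex hyperedges sampled inside
    each cluster and `across` hyperedges drawn from the whole vertex set."""
    rng = random.Random(seed)
    c1, c2 = list(range(N // 2)), list(range(N // 2, N))
    edges = [rng.sample(c1, k) for _ in range(within)]        # cluster-1 hyperedges
    edges += [rng.sample(c2, k) for _ in range(within)]       # cluster-2 hyperedges
    edges += [rng.sample(c1 + c2, k) for _ in range(across)]  # cross-cluster hyperedges
    return edges
```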
We also uniformly at random picked l = 1, 2, 3, 4 vertices from each cluster to represent labeled datapoints. With the vector x obtained by solving (11), we classified the variables based on the Cheeger cut rule [17]: suppose that x_{i₁}/√W_{i₁i₁} ≥ x_{i₂}/√W_{i₂i₂} ≥ ··· ≥ x_{i_N}/√W_{i_N i_N}, and define Sⱼ = {i₁, i₂, ..., iⱼ}. We partition [N] into two sets (S_{j*}, S̄_{j*}), where

j* = arg min_{j∈[N]} c(Sⱼ) ≜ |{r ∈ [R] : Sᵣ ∩ Sⱼ ≠ ∅, Sᵣ ∩ S̄ⱼ ≠ ∅}| / max{Σ_{r∈[R]} |Sᵣ ∩ Sⱼ|, Σ_{r∈[R]} |Sᵣ ∩ S̄ⱼ|}.

The classification error is defined as (# of incorrectly classified vertices)/N. In the experiment, we used Wᵢᵢ = Dᵢᵢ for all i, and tuned β to be nearly optimal for the different objectives with respect to the classification error rates.
The top-left plot in Figure 1 shows that QRCD-SPE converges much faster than all other methods when solving problem (11) according to the gap metric (we only plot the curve for l = 3, as all other values of l produce similar patterns). To avoid clutter, we postpone the results for QRCD-MNP and QRCD-FW to the full version of this paper, as these methods are typically 100 to 1000 times slower than QRCD-SPE. In the table that follows, we describe the performance of the different methods for similar CPU times. Note that when QRCD-SPE converges (with primal gap 10⁻⁹, achieved after 0.83 s), the obtained x leads to a much smaller classification error than that of the other methods. QAP-SPE, PDHG and SGD all have large classification errors, as they do not converge within short CPU time frames. QAP-SPE and PDHG perform only a small number of iterations, but each iteration computes the projections for all the hyperedges, which is time-consuming. The poor performance of DRCD implies that DSFM is not a good objective for SSL.
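The sweep over the Cheeger cut rule described above can be sketched as follows (a minimal quadratic-time sketch with our own helper name `cheeger_sweep`; the cut counts and volumes can of course be maintained incrementally for efficiency):

```python
import math

def cheeger_sweep(x, W, hyperedges):
    """Sort vertices by x_i / sqrt(W_ii), then pick the prefix S_j minimizing
    c(S_j) = #{r : S_r intersects both S_j and its complement}
             / max(sum_r |S_r ∩ S_j|, sum_r |S_r ∩ complement|)."""
    N = len(x)
    order = sorted(range(N), key=lambda i: x[i] / math.sqrt(W[i]), reverse=True)
    best_j, best_c = None, float("inf")
    for j in range(1, N):  # proper, nonempty prefixes S_j only
        Sj = set(order[:j])
        cut = sum(1 for S in hyperedges
                  if any(v in Sj for v in S) and any(v not in Sj for v in S))
        vol_in = sum(sum(1 for v in S if v in Sj) for S in hyperedges)
        vol_out = sum(sum(1 for v in S if v not in Sj) for S in hyperedges)
        c = cut / max(vol_in, vol_out)
        if c < best_c:
            best_c, best_j = c, j
    return set(order[:best_j]), best_c
```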
InvLap achieves moderate classification errors, but still does not match the performance of QRCD-SPE. Note that InvLap uses Matlab, which is optimized for matrix operations, and is hence fairly efficient. However, for experiments on real datasets, where one encounters fewer but significantly larger hyperedges, InvLap does not offer as good a performance as it does on synthetic data. The column "Average 100c(S_{j*})" also illustrates that the QDSFM objective is a good choice for finding approximately balanced cuts of hypergraphs.
We also evaluated the influence of parameter choices on the convergence of the QRCD methods. According to Theorem 4.4, the required number of RCD iterations for achieving an ε-optimal solution of (11) is roughly O(N²R max(1, 9/(2β)) max_{i,j∈[N]} (Wᵢᵢ/Wⱼⱼ) log(1/ε)) (see the full version of this paper). We mainly focus on testing the dependence on the parameters β and max_{i,j∈[N]} Wᵢᵢ/Wⱼⱼ, as the term N²R also appears in the iteration complexity of DSFM and was shown to be necessary for certain submodular structures [15]. To test the effect of β, we fix Wᵢᵢ = 1 for all i and vary β ∈ [10⁻³, 10³]. To test the effect of W, we fix β = 1, randomly choose half of the vertices and set their Wᵢᵢ values to lie in {1, 0.1, 0.01, 0.001}, and set the remaining ones to 1. The two top-right plots of Figure 1 show our results. The logarithm of the gap ratio is proportional to log β⁻¹ for small β and to log max_{i,j∈[N]} Wᵢᵢ/Wⱼⱼ, which is not as sensitive as predicted by Theorem 4.4. Moreover, when β is relatively large (> 1), the dependence on β levels out.
Real data. We also evaluated the proposed algorithms on three UCI datasets: Mushroom, Covertype45 and Covertype67, used as standard datasets for SSL on hypergraphs [33, 17, 18].
Solver | Average 100c(S_{j*}), l = 1 / 2 / 3 / 4 | #iterations | cputime(s)
QRCD-SPE | 6.81 / 6.04 / 5.71 / 5.41 | 4.8×10⁵ | 0.83
QAP-SPE | 9.51 / 9.21 / 8.14 / 7.09 | 2.7×10² | 0.85
PDHG | 8.64 / 7.32 / 6.81 / 6.11 | 3.0×10² | 0.83
SGD | 8.22 / 7.11 / 7.01 / 6.53 | 1.5×10⁴ | 0.86
DRCD | 9.97 / 9.97 / 9.96 / 9.97 | 3.8×10⁶ | 0.85
InvLap | 8.89 / 7.11 / 6.18 / 5.60 | — | 0.07

[The bottom table of Figure 1 additionally reports the classification error rates (%) for l = 1, 2, 3, 4 (MN: mean, MD: median) of each solver.]

Figure 1: Experimental results on synthetic datasets. Top-left: gap vs. CPU time of different QDSFM solvers (average ± standard deviation). Bottom: classification error rates & Average 100c(S_{j*}) for different solvers (MN: mean, MD: median). Top-right: the primal gap of QRCD after 2 × 10⁵ iterations with respect to different choices of the parameters β & max_{i,j∈[N]} Wᵢᵢ/Wⱼⱼ.

Figure 2: Convergence of different solvers for QDSFM over three different real datasets.

Each dataset corresponds to a hypergraph model as described in [17]: entries correspond to vertices, while each categorical feature is modeled as one hyperedge; numerical features are first quantized into 10 bins of equal size and then mapped to hyperedges. Compared to the synthetic data, in these datasets the size of most hyperedges is much larger (≥ 1000), while the number of hyperedges is small (≈ 100). Previous works have shown that fewer classification errors can be achieved by using QDSFM as an objective instead of DSFM or InvLap [17].
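The hyperedge construction for numerical features can be sketched as follows (a minimal sketch with our own helper name `feature_hyperedges`; equal-width bins are assumed here, while [17] may use a different quantization):

```python
def feature_hyperedges(values, num_bins=10):
    """Quantize one numerical feature into `num_bins` bins and group the
    entries (vertices) that fall into the same bin into one hyperedge."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant features
    groups = {}
    for idx, v in enumerate(values):
        b = min(int((v - lo) / width), num_bins - 1)  # clamp the max into the last bin
        groups.setdefault(b, []).append(idx)
    return list(groups.values())
```

Categorical features need no binning: each category directly yields one hyperedge containing the entries that take that value.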
In our experiments, we focused on comparing the convergence of different solvers for QDSFM. We set β = 100 and Wᵢᵢ = 1 for all i, and set the number of observed labels to 100, which is a proper setting as described in [17]. Figure 2 shows the results. Again, the proposed QRCD-SPE and QAP-SPE methods both converge faster than PDHG and SGD, while QRCD-SPE performs the best. Note that we did not plot the results for QRCD-MNP and QRCD-FW, as these methods converge extremely slowly due to the large sizes of the hyperedges. InvLap requires 22, 114 and 1802 seconds to run on the Mushroom, Covertype45 and Covertype67 datasets, respectively. Hence, this method does not scale well.

8 Acknowledgement

The authors gratefully acknowledge many useful suggestions by the reviewers. This work was supported in part by the NIH grant 1u01 CA198943A and the NSF grant CCF 15-27636.

References

[1] F. Bach, "Learning with submodular functions: A convex optimization perspective," Foundations and Trends in Machine Learning, vol. 6, no. 2-3, pp. 145–373, 2013.

[2] L. Lovász, "Submodular functions and convexity," in Mathematical Programming: The State of the Art. Springer, 1983, pp. 235–257.

[3] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 912–919.

[4] D. Zhou, O.
Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, 2004, pp. 321–328.

[5] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, "An iterative regularization method for total variation-based image restoration," Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 460–489, 2005.

[6] A. Chambolle and J. Darbon, "On total variation minimization and surface evolution using parametric maximum flows," International Journal of Computer Vision, vol. 84, no. 3, p. 288, 2009.

[7] S. Kumar and M. Hebert, "Discriminative random fields: A discriminative framework for contextual interaction in classification," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2003, pp. 1150–1157.

[8] N. Z. Shor, Minimization Methods for Non-differentiable Functions. Springer Science & Business Media, 2012, vol. 3.

[9] P. Stobbe and A. Krause, "Efficient minimization of decomposable submodular functions," in Advances in Neural Information Processing Systems, 2010, pp. 2208–2216.

[10] S. Fujishige, Submodular Functions and Optimization. Elsevier, 2005, vol. 58.

[11] S. Jegelka, F. Bach, and S. Sra, "Reflection methods for user-friendly submodular optimization," in Advances in Neural Information Processing Systems, 2013, pp. 1313–1321.

[12] R. Nishihara, S. Jegelka, and M. I. Jordan, "On the convergence rate of decomposable submodular function minimization," in Advances in Neural Information Processing Systems, 2014, pp. 640–648.

[13] A. Ene and H. Nguyen, "Random coordinate descent methods for minimizing decomposable submodular functions," in Proceedings of the International Conference on Machine Learning, 2015, pp. 787–795.

[14] A. Ene, H. Nguyen, and L. A.
V\u00e9gh, \u201cDecomposable submodular function minimization:\ndiscrete and continuous,\u201d in Advances in Neural Information Processing Systems, 2017, pp.\n2874\u20132884.\n\n[15] P. Li and O. Milenkovic, \u201cRevisiting decomposable submodular function minimization with\n\nincidence relations,\u201d in Advances in Neural Information Processing Systems, 2018.\n\n[16] R. Johnson and T. Zhang, \u201cOn the effectiveness of laplacian normalization for graph semi-\nsupervised learning,\u201d Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1489\u20131517,\n2007.\n\n[17] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram, \u201cThe total variation on hypergraphs-learning\non hypergraphs revisited,\u201d in Advances in Neural Information Processing Systems, 2013, pp.\n2427\u20132435.\n\n[18] C. Zhang, S. Hu, Z. G. Tang, and T. H. Chan, \u201cRe-revisiting learning on hypergraphs: con\ufb01dence\ninterval and subgradient method,\u201d in Proceedings of the International Conference on Machine\nLearning, 2017, pp. 4026\u20134034.\n\n[19] A. Gammerman, V. Vovk, and V. Vapnik, \u201cLearning by transduction,\u201d in Proceedings of the\nFourteenth conference on Uncertainty in arti\ufb01cial intelligence. Morgan Kaufmann Publishers\nInc., 1998, pp. 148\u2013155.\n\n[20] T. Joachims, \u201cTransductive learning via spectral graph partitioning,\u201d in Proceedings of the 20th\n\nInternational Conference on Machine Learning, 2003, pp. 290\u2013297.\n\n[21] X. Zhu, J. Lafferty, and Z. Ghahramani, \u201cCombining active learning and semi-supervised learn-\ning using gaussian \ufb01elds and harmonic functions,\u201d in ICML 2003 workshop on the continuum\nfrom labeled to unlabeled data in machine learning and data mining, vol. 3, 2003.\n\n[22] P. Li and O. Milenkovic, \u201cInhomogeneous hypergraph clustering with applications,\u201d in Advances\n\nin Neural Information Processing Systems, 2017, pp. 
2305–2315.

[23] ——, "Submodular hypergraphs: p-laplacians, cheeger inequalities and spectral clustering," in Proceedings of the International Conference on Machine Learning, 2018.

[24] L. Page, S. Brin, R. Motwani, and T. Winograd, "The pagerank citation ranking: Bringing order to the web," Stanford InfoLab, Tech. Rep., 1999.

[25] T.-H. H. Chan, A. Louis, Z. G. Tang, and C. Zhang, "Spectral properties of hypergraph laplacian and approximation algorithms," Journal of the ACM (JACM), vol. 65, no. 3, p. 15, 2018.

[26] T. Chan, Z. G. Tang, X. Wu, and C. Zhang, "Diffusion operator and spectral analysis for directed hypergraph laplacian," arXiv preprint arXiv:1711.01560, 2017.

[27] Y. Yoshida, "Cheeger inequalities for submodular transformations," arXiv preprint arXiv:1708.08781, 2017.

[28] D. F. Gleich, L.-H. Lim, and Y. Yu, "Multilinear pagerank," SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 4, pp. 1507–1541, 2015.

[29] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.

[30] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics, vol. 3, no. 1-2, pp. 95–110, 1956.

[31] S. Fujishige and S. Isotani, "A submodular function minimization algorithm based on the minimum-norm base," Pacific Journal of Optimization, vol. 7, no. 1, pp. 3–17, 2011.

[32] D. Chakrabarty, P. Jain, and P. Kothari, "Provable submodular minimization using Wolfe's algorithm," in Advances in Neural Information Processing Systems, 2014, pp. 802–809.

[33] D. Zhou, J. Huang, and B.
Sch\u00f6lkopf, \u201cLearning with hypergraphs: Clustering, classi\ufb01cation,\nand embedding,\u201d in Advances in Neural Information Processing Systems, 2007, pp. 1601\u20131608.\n\n11\n\n\f", "award": [], "sourceid": 561, "authors": [{"given_name": "Pan", "family_name": "Li", "institution": "University of Illinois Urbana-Champaign"}, {"given_name": "Niao", "family_name": "He", "institution": "UIUC"}, {"given_name": "Olgica", "family_name": "Milenkovic", "institution": "University of Illinois at Urbana-Champaign"}]}