{"title": "Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 2742, "page_last": 2750, "abstract": "We investigate three related and important problems connected to machine learning, namely approximating a submodular function everywhere, learning a submodular function (in a PAC like setting [26]), and constrained minimization of submodular functions. In all three problems, we provide improved bounds which depend on the \u201ccurvature\u201d of a submodular function and improve on the previously known best results for these problems [9, 3, 7, 25] when the function is not too curved \u2013 a property which is true of many real-world submodular functions. In the former two problems, we obtain these bounds through a generic black-box transformation (which can potentially work for any algorithm), while in the case of submodular minimization, we propose a framework of algorithms which depend on choosing an appropriate surrogate for the submodular function. In all these cases, we provide almost matching lower bounds. While improved curvature-dependent bounds were shown for monotone submodular maximization [4, 27], the existence of similar improved bounds for the aforementioned problems has been open. We resolve this question in this paper by showing that the same notion of curvature provides these improved results. Empirical experiments add further support to our claims.", "full_text": "Curvature and Optimal Algorithms for Learning and\n\nMinimizing Submodular Functions\n\nRishabh Iyer\u2020, Stefanie Jegelka\u2217, Jeff Bilmes\u2020\n\n\u2020 University of Washington, Dept. of EE, Seattle, U.S.A.\n\u2217 University of California, Dept. 
of EECS, Berkeley, U.S.A.\n\nrkiyer@uw.edu, stefje@eecs.berkeley.edu, bilmes@uw.edu\n\nAbstract\n\nWe investigate three related and important problems connected to machine learning:\napproximating a submodular function everywhere, learning a submodular function\n(in a PAC-like setting [28]), and constrained minimization of submodular functions.\nWe show that the complexity of all three problems depends on the \u201ccurvature\u201d of the\nsubmodular function, and provide lower and upper bounds that re\ufb01ne and improve\nprevious results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either\nuse a black-box transformation of the function (for approximation and learning),\nor a transformation of algorithms to use an appropriate surrogate function (for\nminimization). Curiously, curvature has been known to in\ufb02uence approximations\nfor submodular maximization [3, 29], but its effect on minimization, approximation\nand learning has hitherto been open. We complete this picture, and also support\nour theoretical claims by empirical results.\n\n1\n\nIntroduction\n\nSubmodularity is a pervasive and important property in the areas of combinatorial optimization,\neconomics, operations research, and game theory. In recent years, submodularity\u2019s use in machine\nlearning has begun to proliferate as well. A set function f : 2V \u2192 R over a \ufb01nite set V =\n{1, 2, . . . , n} is submodular if for all subsets S, T \u2286 V , it holds that f(S) + f(T ) \u2265 f(S \u222a T ) +\nf(S \u2229 T ). Given a set S \u2286 V , we de\ufb01ne the gain of an element j /\u2208 S in the context S as\nf(j|S) (cid:44) f(S \u222a j) \u2212 f(S). 
A function f is submodular if it satis\ufb01es diminishing marginal returns,\nnamely f(j|S) \u2265 f(j|T ) for all S \u2286 T, j /\u2208 T , and is monotone if f(j|S) \u2265 0 for all j /\u2208 S, S \u2286 V .\nWhile submodularity, like convexity, occurs naturally in a wide variety of problems, recent studies\nhave shown that in the general case, many submodular problems of interest are very hard: the\nproblems of learning a submodular function or of submodular minimization under constraints do\nnot even admit constant or logarithmic approximation factors in polynomial time [2, 7, 8, 10, 27].\nThese rather pessimistic results however stand in sharp contrast to empirical observations, which\nsuggest that these lower bounds are speci\ufb01c to rather contrived classes of functions, whereas much\nbetter results can be achieved in many practically relevant cases. Given the increasing importance\nof submodular functions in machine learning, these observations beg the question of qualifying and\nquantifying properties that make sub-classes of submodular functions more amenable to learning and\noptimization. Indeed, limited prior work has shown improved results for constrained minimization\nand learning of sub-classes of submodular functions, including symmetric functions [2, 25], concave\nfunctions [7, 18, 24], label cost or covering functions [9, 31].\nIn this paper, we take additional steps towards addressing the above problems and show how the\ngeneric notion of the curvature \u2013 the deviation from modularity\u2013 of a submodular function determines\nboth upper and lower bounds on approximation factors for many learning and constrained optimization\nproblems. In particular, our quanti\ufb01cation tightens the generic, function-independent bounds in [8, 2,\n27, 7, 10] for many practically relevant functions. Previously, the concept of curvature has been used to\n\n1\n\n\ftighten bounds for submodular maximization problems [3, 29]. 
Hence, our results complete a unifying picture of the effect of curvature on submodular problems. Moreover, curvature is still a fairly generic concept, as it only depends on the marginal gains of the submodular function. It allows a smooth transition between the 'easy' functions and the 'really hard' subclasses of submodular functions.

2 Problem statements, definitions and background

Before stating our main results, we provide some necessary definitions and introduce a new concept, the curve-normalized version of a submodular function. Throughout this paper, we assume that the submodular function f is defined on a ground set V of n elements, that it is nonnegative, and that f(∅) = 0. We also use normalized modular (or additive) functions w : 2^V → R, which are those that can be written as a sum of weights, w(S) = Σ_{i∈S} w(i). We are concerned with the following three problems:

Problem 1. (Approximation [8]) Given a submodular function f in the form of a value oracle, find an approximation ˆf (within polynomial time and representable within polynomial space) such that for all X ⊆ V it holds that ˆf(X) ≤ f(X) ≤ α1(n) ˆf(X) for a polynomial α1(n).

Problem 2. (PMAC-Learning [2]) Given i.i.d. training samples {(Xi, f(Xi))}_{i=1}^m from a distribution D, learn an approximation ˆf(X) that is, with probability 1 − δ, within a multiplicative factor of α2(n) from f.

Problem 3. (Constrained optimization [27, 7, 10, 16]) Minimize a submodular function f over a family C of feasible sets, i.e., min_{X∈C} f(X).

In its general form, the approximation problem was first studied by Goemans et al. [8], who approximate any monotone submodular function to within a factor of O(√(n log n)), with a lower bound of α1(n) = Ω(√n / log n).
Building on this result, Balcan and Harvey [2] show how to PMAC-learn a monotone submodular function within a factor of α2(n) = O(√n), and prove a lower bound of Ω(n^{1/3}) for the learning problem. Subsequent work extends these results to sub-additive and fractionally sub-additive functions [1]. Better learning results are possible for the subclass of submodular shells [23] and Fourier sparse set functions [26]. Both Problems 1 and 2 have numerous applications in algorithmic game theory and economics [2, 8] as well as machine learning [2, 22, 23, 26, 15]. Constrained submodular minimization arises in applications such as power assignment or transportation problems [19, 30, 13]. In machine learning, it occurs, for instance, in the form of MAP inference in high-order graphical models [17] or in size-constrained corpus extraction [21]. Recent results show that almost all constraints make it hard to solve the minimization even within a constant factor [27, 6, 16]. Here, we will focus on the constraint of imposing a lower bound on the cardinality, and on combinatorial constraints where C is the set of all s-t paths, s-t cuts, spanning trees, or perfect matchings in a graph.

A central concept in this work is the total curvature κf of a submodular function f and the curvature κf(S) with respect to a set S ⊆ V, defined as [3, 29]

    κf = 1 − min_{j∈V} f(j | V∖j) / f(j),    κf(S) = 1 − min_{j∈S} f(j | S∖j) / f(j).    (1)

Without loss of generality, assume that f(j) > 0 for all j ∈ V. It is easy to see that κf(S) ≤ κf(V) = κf, and hence κf(S) is a tighter notion of curvature. A modular function has curvature κf = 0, and a matroid rank function has maximal curvature κf = 1. Intuitively, κf measures how far away f is from being modular.
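The quantities in (1) are cheap to evaluate: κf needs only the n singleton values f(j) and the n gains f(j | V∖j). The following sketch (small hypothetical functions chosen for illustration, not examples from the paper) shows the two extremes, a modular function with κf = 0 and a rank-like truncation with κf = 1:

```python
from math import sqrt

V = frozenset(range(5))

def gain(f, j, S):
    """Marginal gain f(j | S) = f(S + j) - f(S)."""
    return f(S | {j}) - f(S)

def curvature(f, S):
    """kappa_f(S) = 1 - min_{j in S} f(j | S - j) / f(j), as in Eq. (1)."""
    return 1 - min(gain(f, j, S - {j}) / f(frozenset({j})) for j in S)

# Hypothetical examples on |V| = 5 elements:
modular  = lambda S: float(len(S))          # kappa_f = 0
sqrt_mod = lambda S: sqrt(len(S))           # 0 < kappa_f < 1
truncate = lambda S: float(min(len(S), 2))  # kappa_f = 1 (matroid-rank-like)

print(curvature(modular, V))   # 0.0
print(curvature(truncate, V))  # 1.0
# kappa_f(S) <= kappa_f(V) = kappa_f, the total curvature:
assert curvature(sqrt_mod, frozenset({0, 1})) <= curvature(sqrt_mod, V)
```

Computing κf this way touches the singletons, the sets V∖j, and V itself, i.e., the 2n + 1 oracle queries mentioned in Section 3.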
Conceptually, curvature is distinct from the recently proposed submodularity ratio [5] that measures how far a function is from being submodular. Curvature has served to tighten bounds for submodular maximization problems, e.g., from (1 − 1/e) to (1/κf)(1 − e^{−κf}) for monotone submodular maximization subject to a cardinality constraint [3] or matroid constraints [29], and these results are tight. For submodular minimization, learning, and approximation, however, the role of curvature has not yet been addressed (an exception are the upper bounds in [13] for minimization). In the following sections, we complete the picture of how curvature affects the complexity of submodular maximization and minimization, approximation, and learning.

The above-cited lower bounds for Problems 1–3 were established with functions of maximal curvature (κf = 1) which, as we will see, is the worst case. By contrast, many practically interesting functions have smaller curvature, and our analysis will provide an explanation for the good empirical results observed with such functions [13, 22, 14]. An example for functions with κf < 1 is the class of concave over modular functions that have been used in speech processing [22] and computer vision [17]. This class comprises, for instance, functions of the form f(X) = Σ_{i=1}^k (wi(X))^a, for some a ∈ [0, 1] and nonnegative weight vectors wi. Such functions may be defined over clusters Ci ⊆ V, in which case the weights wi(j) are nonzero only if j ∈ Ci [22, 17, 11].

Curvature-dependent analysis. To analyze Problems 1–3, we introduce the concept of a curve-normalized polymatroid¹.
Specifically, we define the κf-curve-normalized version of f as

    f^κ(X) = ( f(X) − (1 − κf) Σ_{j∈X} f(j) ) / κf.    (2)

If κf = 0, then we set f^κ ≡ 0. We call f^κ the curve-normalized version of f because its curvature is κ_{f^κ} = 1. The function f^κ allows us to decompose a submodular function f into a "difficult" polymatroid function and an "easy" modular part as f(X) = f_difficult(X) + m_easy(X), where f_difficult(X) = κf f^κ(X) and m_easy(X) = (1 − κf) Σ_{j∈X} f(j). Moreover, we may modulate the curvature of any given function g with κg = 1 by constructing a function f(X) ≜ c g(X) + (1 − c)|X| with curvature κf = c but otherwise the same polymatroidal structure as g. Our curvature-based decomposition is different from decompositions such as that into a totally normalized function and a modular function [4]. Indeed, the curve-normalized function has some specific properties that will be useful later on (proved in [12]):

Lemma 2.1. If f is monotone submodular with κf > 0, then f(X) ≤ Σ_{j∈X} f(j) and f(X) ≥ (1 − κf) Σ_{j∈X} f(j).

Lemma 2.2. If f is monotone submodular, then f^κ(X) in Eqn. (2) is a monotone non-negative submodular function. Furthermore, f^κ(X) ≤ Σ_{j∈X} f(j).

The function f^κ will be our tool for analyzing the hardness of submodular problems. Previous information-theoretic lower bounds for Problems 1–3 [6, 8, 10, 27] are independent of curvature and use functions with κf = 1. These curvature-independent bounds are proven by constructing two essentially indistinguishable matroid rank functions h and f^R, one of which depends on a random set R ⊆ V.
One then argues that any algorithm would need to make a super-polynomial number of queries to the functions for being able to distinguish h and f^R with high enough probability. The lower bound will be the ratio max_{X∈C} h(X)/f^R(X). We extend this proof technique to functions with a fixed given curvature. To this end, we define the functions

    f^R_κ(X) = κf f^R(X) + (1 − κf)|X|    and    h_κ(X) = κf h(X) + (1 − κf)|X|.    (3)

Both of these functions have curvature κf. This construction enables us to explicitly introduce the effect of curvature into information-theoretic bounds for all monotone submodular functions.

Main results. The curve normalization (2) leads to refined upper bounds for Problems 1–3, while the curvature modulation (3) provides matching lower bounds. The following are some of our main results: for approximating submodular functions (Problem 1), we replace the known bound of α1(n) = O(√(n log n)) [8] by an improved curvature-dependent O(√(n log n) / (1 + (√(n log n) − 1)(1 − κf))). We complement this with a lower bound of Ω̃(√n / (1 + (√n − 1)(1 − κf))). For learning submodular functions (Problem 2), we refine the known bound of α2(n) = O(√n) [2] in the PMAC setting to a curvature-dependent bound of O(√n / (1 + (√n − 1)(1 − κf))), with a lower bound of Ω̃(n^{1/3} / (1 + (n^{1/3} − 1)(1 − κf))). Finally, Table 1 summarizes our curvature-dependent approximation bounds for constrained minimization (Problem 3). These bounds refine many of the results in [6, 27, 10, 16]. In general, our new curvature-dependent upper and lower bounds refine known theoretical results whenever κf < 1, in many cases replacing known polynomial bounds by a curvature-dependent constant factor 1/(1 − κf).
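Definitions (2) and (3) invert each other, which is easy to verify numerically. The sketch below (hypothetical toy functions for illustration, not the matroid constructions of the actual proofs) checks that f decomposes exactly as f(X) = κf f^κ(X) + (1 − κf) Σ_{j∈X} f(j) with f^κ fully curved, and that blending a fully curved g with the modular |X| as in (3) yields total curvature exactly κ:

```python
from itertools import combinations
from math import isclose, sqrt

V = frozenset(range(4))
w = [1.0, 2.0, 3.0, 4.0]
f = lambda S: sqrt(sum(w[j] for j in S))   # hypothetical, 0 < kappa_f < 1

def curvature(f, S):
    # kappa_f(S) = 1 - min_{j in S} f(j | S - j) / f(j), Eq. (1)
    return 1 - min((f(S) - f(S - {j})) / f(frozenset({j})) for j in S)

kf = curvature(f, V)
singletons = lambda X: sum(f(frozenset({j})) for j in X)

def f_kappa(X):
    """Curve-normalized version, Eq. (2); identically 0 when kappa_f = 0."""
    return 0.0 if (kf == 0 or not X) else (f(X) - (1 - kf) * singletons(X)) / kf

subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
# Decomposition into "difficult" and "easy" parts holds on every subset:
assert all(isclose(f(X), kf * f_kappa(X) + (1 - kf) * singletons(X))
           for X in subsets)
assert isclose(curvature(f_kappa, V), 1.0)   # f^kappa is fully curved

# Curvature modulation, Eq. (3): kappa * g + (1 - kappa) * |X| has curvature kappa.
g = lambda S: float(min(len(S), 2))          # kappa_g = 1 (rank-like)
for kappa in (0.25, 0.5, 0.9):
    fk = lambda S, c=kappa: c * g(S) + (1 - c) * len(S)
    assert abs(curvature(fk, V) - kappa) < 1e-12
```

The same modulation is what Section 5.1 uses to dial the hardness of the experimental instances.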
Besides making these bounds precise, the decomposition and the curve-normalized version (2) are the basis for constructing tight algorithms that (up to logarithmic factors) achieve the lower bounds.

¹A polymatroid function is a monotone increasing, nonnegative, submodular function satisfying f(∅) = 0.

Constraint     | Modular approx. (MUB)  | Ellipsoid approx. (EA)                   | Lower bound
Card. LB       | k/(1+(k−1)(1−κf))      | O(√(n log n)/(1+(√(n log n)−1)(1−κf)))   | Ω̃(n^{1/2}/(1+(n^{1/2}−1)(1−κf)))
Spanning Tree  | n/(1+(n−1)(1−κf))      | O(√(m log m)/(1+(√(m log m)−1)(1−κf)))   | Ω̃(n/(1+(n−1)(1−κf)))
Matchings      | n/(2+(n−2)(1−κf))      | O(√(m log m)/(1+(√(m log m)−1)(1−κf)))   | Ω̃(n/(1+(n−1)(1−κf)))
s-t path       | n/(1+(n−1)(1−κf))      | O(√(m log m)/(1+(√(m log m)−1)(1−κf)))   | Ω̃(n^{2/3}/(1+(n^{2/3}−1)(1−κf)))
s-t cut        | m/(1+(m−1)(1−κf))      | O(√(m log m)/(1+(√(m log m)−1)(1−κf)))   | Ω̃(√n/(1+(√n−1)(1−κf)))

Table 1: Summary of our results for constrained minimization (Problem 3).

3 Approximating submodular functions everywhere

We first address improved bounds for the problem of approximating a monotone submodular function everywhere. Previous work established α-approximations g to a submodular function f satisfying g(S) ≤ f(S) ≤ αg(S) for all S ⊆ V [8].
We begin with a theorem showing how any algorithm computing such an approximation may be used to obtain a curvature-specific, improved approximation. Note that the curvature of a monotone submodular function can be obtained within 2n + 1 queries to f. The key idea of Theorem 3.1 is to only approximate the curved part of f, and to retain the modular part exactly. The full proof is in [12].

Theorem 3.1. Given a polymatroid function f with κf < 1, let f^κ be its curve-normalized version defined in Equation (2), and let ˆf^κ be a submodular function satisfying ˆf^κ(X) ≤ f^κ(X) ≤ α(n) ˆf^κ(X), for some X ⊆ V. Then the function ˆf(X) ≜ κf ˆf^κ(X) + (1 − κf) Σ_{j∈X} f(j) satisfies

    ˆf(X) ≤ f(X) ≤ [α(n) / (1 + (α(n) − 1)(1 − κf))] ˆf(X) ≤ ˆf(X) / (1 − κf).    (4)

Theorem 3.1 may be directly applied to tighten recent results on approximating submodular functions everywhere. An algorithm by Goemans et al. [8] computes an approximation to a polymatroid function f in polynomial time by approximating the submodular polyhedron via an ellipsoid. This approximation (which we call the ellipsoidal approximation) satisfies α(n) = O(√(n log n)), and has the form √(w_f(X)) for a certain weight vector w_f. Corollary 3.2 states that a tighter approximation is possible for functions with κf < 1.

Corollary 3.2. Let f be a polymatroid function with κf < 1, and let √(w_{f^κ}(X)) be the ellipsoidal approximation to the κ-curve-normalized version f^κ(X) of f. Then the function f_ea(X) = κf √(w_{f^κ}(X)) + (1 − κf) Σ_{j∈X} f(j) satisfies

    f_ea(X) ≤ f(X) ≤ O(√(n log n) / (1 + (√(n log n) − 1)(1 − κf))) f_ea(X).    (5)

If κf = 0, then the approximation is exact. This is not surprising since a modular function can be inferred exactly within O(n) oracle calls. The following lower bound (proved in [12]) shows that Corollary 3.2 is tight up to logarithmic factors. It refines the lower bound in [8] to include κf.

Theorem 3.3. Given a submodular function f with curvature κf, there does not exist a (possibly randomized) polynomial-time algorithm that computes an approximation to f within a factor of n^{1/2−ε} / (1 + (n^{1/2−ε} − 1)(1 − κf)), for any ε > 0.

The simplest alternative approximation to f one might conceive is the modular function ˆf_m(X) ≜ Σ_{j∈X} f(j), which can easily be computed by querying the n values f(j).

Lemma 3.1. Given a monotone submodular function f, it holds that²

    f(X) ≤ ˆf_m(X) = Σ_{j∈X} f(j) ≤ [|X| / (1 + (|X| − 1)(1 − κf(X)))] f(X).    (6)

²In [12], we show this result with a stronger notion of curvature: ˆκf(X) = 1 − Σ_{j∈X} f(j | X∖j) / Σ_{j∈X} f(j).

The form of Lemma 3.1 is slightly different from Corollary 3.2. However, there is a straightforward correspondence: given ˆf such that ˆf(X) ≤ f(X) ≤ α′(n) ˆf(X), by defining ˆf′(X) = α′(n) ˆf(X), we get that f(X) ≤ ˆf′(X) ≤ α′(n) f(X). Lemma 3.1 for the modular approximation is complementary to Corollary 3.2: First, the modular approximation is better whenever |X| ≤ √n.
Second, the bound in Lemma 3.1 depends on the curvature κf(X) with respect to the set X, which is stronger than κf. Third, ˆf_m is extremely simple to compute. For sets of larger cardinality, however, the ellipsoidal approximation of Corollary 3.2 provides a better approximation, in fact, the best possible one (Theorem 3.3). In a similar manner, Lemma 3.1 is tight for any modular approximation to a submodular function:

Lemma 3.2. For any κ > 0, there exists a monotone submodular function f with curvature κ such that no modular upper bound on f can approximate f(X) to a factor better than |X| / (1 + (|X| − 1)(1 − κf)).

The improved curvature-dependent bounds immediately imply better bounds for the class of concave over modular functions used in [22, 17, 11].

Corollary 3.4. Given weight vectors w1, ..., wk ≥ 0 and a submodular function f(X) = Σ_{i=1}^k λi [wi(X)]^a, λi ≥ 0, for a ∈ (0, 1), it holds that f(X) ≤ Σ_{j∈X} f(j) ≤ |X|^{1−a} f(X). In particular, when a = 1/2, the modular upper bound approximates the sum of square-root over modular functions by a factor of √|X|.

4 Learning Submodular functions

We next address the problem of learning submodular functions in a PMAC setting [2]. The PMAC (Probably Mostly Approximately Correct) framework is an extension of the PAC framework [28] that allows multiplicative errors in the function values, on samples from a fixed but unknown distribution D over 2^V. We are given training samples {(Xi, f(Xi))}_{i=1}^m drawn i.i.d. from D. The algorithm may take time polynomial in n, 1/ε, 1/δ to compute a (polynomially-representable) function ˆf that is a good approximation to f with respect to D.
Formally, ˆf must satisfy

    Pr_{X1,...,Xm∼D}[ Pr_{X∼D}[ ˆf(X) ≤ f(X) ≤ α(n) ˆf(X) ] ≥ 1 − ε ] ≥ 1 − δ    (7)

for some approximation factor α(n). Balcan and Harvey [2] propose an algorithm that PMAC-learns any monotone, nonnegative submodular function within a factor α(n) = √(n + 1) by reducing the problem to that of learning a binary classifier. If we assume that we have an upper bound on the curvature κf, or that we can estimate it³, and have access to the values of the singletons f(j), j ∈ V, then we can obtain better learning results with non-maximal curvature:

Lemma 4.1. Let f be a monotone submodular function for which we know an upper bound on its curvature and the singleton weights f(j) for all j ∈ V. For every ε, δ > 0 there is an algorithm that uses a polynomial number of training examples, runs in time polynomial in (n, 1/ε, 1/δ), and PMAC-learns f within a factor of √(n + 1) / (1 + (√(n + 1) − 1)(1 − κf)). If D is a product distribution, then there exists an algorithm that PMAC-learns f within a factor of O(log(1/ε) / (1 + (log(1/ε) − 1)(1 − κf))).

The algorithm of Lemma 4.1 uses the reduction of Balcan and Harvey [2] to learn the κf-curve-normalized version f^κ of f. From the learned function ˆf^κ(X), we construct the final estimate ˆf(X) ≜ κf ˆf^κ(X) + (1 − κf) Σ_{j∈X} f(j). Theorem 3.1 implies Lemma 4.1 for this ˆf(X). Moreover, no polynomial-time algorithm can be guaranteed to PMAC-learn f within a factor of n^{1/3−ε′} / (1 + (n^{1/3−ε′} − 1)(1 − κf)), for any ε′ > 0 [12]. We end this section by showing how we can learn with a construction analogous to that in Lemma 3.1.

Lemma 4.2.
If f is a monotone submodular function with known curvature (or a known upper bound) ˆκf(X), ∀X ⊆ V, then for every ε, δ > 0 there is an algorithm that uses a polynomial number of training examples, runs in time polynomial in (n, 1/ε, 1/δ), and PMAC-learns f(X) within a factor of 1 + |X| / (1 + (|X| − 1)(1 − ˆκf(X))).

³Note that κf can be estimated from a set of 2n + 1 samples {(j, f(j))}_{j∈V}, {(V, f(V))}, and {(V∖j, f(V∖j))}_{j∈V} included in the training samples.

Compare this result to Lemma 4.1. Lemma 4.2 leads to better bounds for small sets, whereas Lemma 4.1 provides a better general bound. Moreover, in contrast to Lemma 4.1, here we only need an upper bound on the curvature and do not need to know the singleton weights {f(j), j ∈ V}. Note also that, while κf itself is an upper bound of ˆκf(X), often one does have an upper bound on ˆκf(X) if one knows the function class of f (for example, say, concave over modular). In particular, an immediate corollary is that the class of concave over modular functions f(X) = Σ_{i=1}^k λi [wi(X)]^a, λi ≥ 0, for a ∈ (0, 1), can be learnt within a factor of min{√(n + 1), 1 + |X|^{1−a}}.

5 Constrained submodular minimization

Next, we apply our results to the minimization of submodular functions under constraints. Most algorithms for constrained minimization use one of two strategies: they apply a convex relaxation [10, 16], or they optimize a surrogate function ˆf that should approximate f well [6, 8, 16]. We follow the second strategy and propose a new, widely applicable curvature-dependent choice for surrogate functions. A suitable selection of ˆf will ensure theoretically optimal results. Throughout this section, we refer to the optimal solution as X* ∈ argmin_{X∈C} f(X).

Lemma 5.1. Given a submodular function f, let ˆf1 be an approximation of f such that ˆf1(X) ≤ f(X) ≤ α(n) ˆf1(X) for all X ⊆ V. Then any minimizer ˆX1 ∈ argmin_{X∈C} ˆf1(X) satisfies f(ˆX1) ≤ α(n) f(X*). Likewise, if an approximation of f is such that f(X) ≤ ˆf2(X) ≤ α(X) f(X) for a set-specific factor α(X), then its minimizer ˆX2 ∈ argmin_{X∈C} ˆf2(X) satisfies f(ˆX2) ≤ α(X*) f(X*). If only β-approximations⁴ are possible for minimizing ˆf1 or ˆf2 over C, then the final bounds are βα(n) and βα(X*), respectively.

For Lemma 5.1 to be practically useful, it is essential that ˆf1 and ˆf2 be efficiently optimizable over C. We discuss two general curvature-dependent approximations that work for a large class of combinatorial constraints. In particular, we use Theorem 3.1: we decompose f into f^κ and a modular part f^m, and then approximate f^κ while retaining f^m, i.e., ˆf = ˆf^κ + f^m. The first approach uses a simple modular upper bound (MUB) and the second relies on the ellipsoidal approximation (EA) we used in Section 3.

MUB: The simplest approximation to a submodular function is the modular approximation ˆf_m(X) ≜ Σ_{j∈X} f(j) ≥ f(X). Since here ˆf^κ happens to be equivalent to f^m, we obtain the overall approximation ˆf = ˆf_m. Lemmas 5.1 and 3.1 directly imply a set-dependent approximation factor for ˆf_m:

Corollary 5.1. Let ˆX ∈ C be a β-approximate solution for minimizing Σ_{j∈X} f(j) over C, i.e., Σ_{j∈ˆX} f(j) ≤ β min_{X∈C} Σ_{j∈X} f(j). Then

    f(ˆX) ≤ [β|X*| / (1 + (|X*| − 1)(1 − κf(X*)))] f(X*).    (8)

Corollary 5.1 has also been shown in [13].
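To make MUB concrete, here is a minimal sketch (a hypothetical toy instance, not from the paper's experiments) for the cardinality lower bound constraint C = {X : |X| ≥ k}: the surrogate Σ_{j∈X} f(j) is minimized exactly by the k cheapest singletons, and the resulting solution is checked against the brute-force optimum and the factor of Corollary 5.1 (with β = 1, since the surrogate is solved exactly):

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy instance: f is a sum of two concave-over-modular terms.
n, k = 6, 3
V = frozenset(range(n))
w1 = [3.0, 1.0, 4.0, 1.5, 2.0, 0.5]
w2 = [0.2, 2.5, 0.3, 2.0, 0.1, 3.0]

def f(S):
    return sqrt(sum(w1[j] for j in S)) + sqrt(sum(w2[j] for j in S))

def curvature(S):
    """kappa_f(S) = 1 - min_{j in S} f(j | S - j) / f(j)."""
    return 1 - min((f(S) - f(S - {j})) / f(frozenset({j})) for j in S)

# MUB surrogate: minimize sum_{j in X} f(j) over C = {X : |X| >= k},
# which is solved exactly by taking the k cheapest singletons.
mub_sol = frozenset(sorted(V, key=lambda j: f(frozenset({j})))[:k])

# Brute-force optimum X* for comparison (only feasible at toy size).
opt = min((frozenset(c) for r in range(k, n + 1)
           for c in combinations(V, r)), key=f)

ratio = f(mub_sol) / f(opt)
bound = len(opt) / (1 + (len(opt) - 1) * (1 - curvature(opt)))
assert 1.0 <= ratio <= bound    # Corollary 5.1 with beta = 1
```

On this instance MUB returns a suboptimal but certified solution: the empirical ratio is roughly 1.1, well inside the curvature-dependent bound of roughly 2.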
Similar to the algorithms in [13], MUB can be extended to an iterative algorithm yielding performance gains in practice. In particular, Corollary 5.1 implies improved approximation bounds for practically relevant concave over modular functions, such as those used in [17]. For instance, for f(X) = Σ_{i=1}^k √(Σ_{j∈X} wi(j)), we obtain a worst-case approximation bound of √|X*| ≤ √n. This is significantly better than the worst-case factor of |X*| for general submodular functions.

EA: Instead of employing a modular upper bound, we can approximate f^κ using the construction by Goemans et al. [8], as in Corollary 3.2. In that case, ˆf(X) = κf √(w_{f^κ}(X)) + (1 − κf) f^m(X) has a special form: a weighted sum of a concave function and a modular function. Minimizing such a function over constraints C is harder than minimizing a merely modular function, but with the algorithm in [24] we obtain an FPTAS⁵ for minimizing ˆf over C whenever we can minimize a nonnegative linear function over C.

⁴A β-approximation algorithm for minimizing a function g finds a set X : g(X) ≤ β min_{X∈C} g(X).
⁵The FPTAS will yield a β = (1 + ε)-approximation through an algorithm polynomial in 1/ε.

Corollary 5.2. For a submodular function with curvature κf < 1, algorithm EA will return a solution ˆX that satisfies

    f(ˆX) ≤ O(√(n log n) / ((√(n log n) − 1)(1 − κf) + 1)) f(X*).    (9)

Next, we apply the results of this section to specific optimization problems, for which we show (mostly tight) curvature-dependent upper and lower bounds. We just state our main results; a more extensive discussion along with the proofs can be found in [12].

Cardinality lower bounds (SLB).
A simple constraint is a lower bound on the cardinality of the solution, i.e., C = {X ⊆ V : |X| ≥ k}. Svitkina and Fleischer [27] prove that for monotone submodular functions of arbitrary curvature, it is impossible to find a polynomial-time algorithm with an approximation factor better than √(n / log n). They show an algorithm which matches this approximation factor. Corollaries 5.1 and 5.2 immediately imply curvature-dependent approximation bounds of k / (1 + (k − 1)(1 − κf)) and O(√(n log n) / (1 + (√(n log n) − 1)(1 − κf))). These bounds are improvements over the results of [27] whenever κf < 1. Here, MUB is preferable to EA whenever k is small. Moreover, the bound of EA is tight up to poly-log factors, in that no polynomial-time algorithm can achieve a general approximation factor better than n^{1/2−ε} / (1 + (n^{1/2−ε} − 1)(1 − κf)), for any ε > 0.

In the following problems, our ground set V consists of the set of edges in a graph G = (V, E) with two distinct nodes s, t ∈ V and n = |V|, m = |E|. The submodular function is f : 2^E → R.

Shortest submodular s-t path (SSP). Here, we aim to find an s-t path X of minimum (submodular) length f(X). Goel et al. [6] show an O(n^{2/3})-approximation with matching curvature-independent lower bound Ω(n^{2/3}). By Corollary 5.1, the curvature-dependent worst-case bound for MUB is n / (1 + (n − 1)(1 − κf)), since any minimal s-t path has at most n edges. Similarly, the factor for EA is O(√(m log m) / (1 + (√(m log m) − 1)(1 − κf))). The bound of EA will be tighter for sparse graphs while MUB provides better results for dense ones.
Our curvature-dependent lower bound for SSP is n^{2/3−ε} / (1 + (n^{2/3−ε} − 1)(1 − κf)), for any ε > 0, which reduces to the result in [6] for κf = 1.

Minimum submodular s-t cut (SSC): This problem, also known as the cooperative cut problem [16, 17], asks to minimize a monotone submodular function f such that the solution X ⊆ E is a set of edges whose removal disconnects s from t in G. Using curvature, we can also refine the lower bound of [16] to n^{1/2−ε} / (1 + (n^{1/2−ε} − 1)(1 − κf)), for any ε > 0. Corollaries 5.1 and 5.2 imply an approximation factor of m / (1 + (m − 1)(1 − κf)) for MUB, where m = |E| is the number of edges in the graph, and a factor of O(√(m log m) / ((√(m log m) − 1)(1 − κf) + 1)) for EA. Hence the factor for EA is tight for sparse graphs. Specifically for cut problems, there is yet another useful surrogate function that is exact on local neighborhoods. Jegelka and Bilmes [16] demonstrate how this approximation may be optimized via a generalized maximum flow algorithm that maximizes a polymatroidal network flow [20]. This algorithm still applies to the combination ˆf = κf ˆf^κ + (1 − κf) f^m, where we only approximate f^κ. We refer to this approximation as the Polymatroidal Network Approximation (PNA).

Corollary 5.3. Algorithm PNA achieves a worst-case approximation factor of n / (2 + (n − 2)(1 − κf)) for the cooperative cut problem.

For dense graphs, this factor is theoretically tighter than that of the EA approximation.

Minimum submodular spanning tree (SST). Here, C is the family of all spanning trees in a given graph G. Such constraints occur, for example, in power assignment problems [30]. Goel et al. [6] show a curvature-independent optimal approximation factor of O(n) for this problem.
Corollary 5.1 refines this bound to n/(1 + (n−1)(1−κ_f)) when using MUB; Corollary 5.2 implies a slightly worse bound for EA. We also show that the bound of MUB is tight: no polynomial-time algorithm can guarantee a factor better than n^(1−ε)/(1 + (n^(1−ε) − 1)(1−κ_f) + δκ_f), for any ε, δ > 0.

Minimum submodular perfect matching (SPM). Here, we aim to find a perfect matching in a graph that minimizes a monotone submodular function. Corollary 5.1 implies that an MUB approximation will achieve an approximation factor of at most n/(2 + (n−2)(1−κ_f)). Similar to the spanning tree case, the bound of MUB is also tight [12].

Figure 1: Minimization of g_κ for cardinality lower bound constraints. (a) fixed κ = 1, α = n^(1/2+ε), β = n^(2ε) for varying ε; (b) fixed ε = 0.1, but varying κ; (c) different choices of α for β = 1; (d) varying κ with α = n/2, β = 1. Dashed lines: MUB; dotted lines: EA; solid lines: theoretical bound. The results of EA are not visible in some instances since it obtains a factor of 1.

5.1 Experiments

We end this section by empirically demonstrating the performance of MUB and EA and their precise dependence on curvature. We focus on cardinality lower bound constraints, C = {X ⊆ V : |X| ≥ α}, and the "worst-case" class of functions that has been used throughout this paper to prove lower bounds, f^R(X) = min{|X ∩ R̄| + β, |X|, α}, where R̄ = V \ R and R ⊆ V is a random set such that |R| = α. We adjust α = n^(1/2+ε) and β = n^(2ε) by a parameter ε. The smaller ε is, the harder the problem. This function has curvature κ_f = 1.
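Since f^R drives all of the experiments, it is easy to sanity-check. The sketch below is ours: it fixes α and β to small illustrative constants rather than n^(1/2+ε) and n^(2ε), and assumes the standard total-curvature definition κ_f = 1 − min_j f(j | V∖{j})/f({j}). It implements f^R, confirms that its curvature is 1, and checks that the interpolation f^R_κ(X) = κ f^R(X) + (1 − κ)|X| has curvature κ:

```python
import random

n = 100
alpha, beta = 10, 2                      # illustrative values, not n^(1/2+eps), n^(2eps)
V = set(range(n))
R = set(random.sample(range(n), alpha))  # random R with |R| = alpha
R_bar = V - R                            # complement of R

def f_R(X):
    """Worst-case function f^R(X) = min{|X ∩ R̄| + β, |X|, α}."""
    return min(len(X & R_bar) + beta, len(X), alpha)

def total_curvature(f):
    """kappa_f = 1 - min_j f(j | V\\{j}) / f({j})  (standard definition)."""
    return 1 - min((f(V) - f(V - {j})) / f({j}) for j in V)

print(total_curvature(f_R))              # 1.0, as stated in the text

# The interpolation f^R_kappa has curvature kappa (up to float error):
for kappa in (0.3, 0.7):
    f_k = lambda X, k=kappa: k * f_R(X) + (1 - k) * len(X)
    print(kappa, total_curvature(f_k))
```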
To obtain a function with specific curvature κ, we define f^R_κ(X) = κ f^R(X) + (1 − κ)|X|, as in Equation (3).

In all our experiments, we take the average over 20 random draws of R. We first set κ = 1 and vary ε. Figure 1(a) shows the empirical approximation factors obtained using EA and MUB, and the theoretical bound. The empirical factors follow the theoretical results very closely. Empirically, we also see that the problem becomes harder as ε decreases. Next we fix ε = 0.1 and vary the curvature κ in f^R_κ. Figure 1(b) illustrates that the theoretical and empirical approximation factors improve significantly as κ decreases. Hence, much better approximations than the previous theoretical lower bounds are possible if κ is not too large. This observation can be very important in practice. Here, too, the empirical upper bounds follow the theoretical bounds very closely.

Figures 1(c) and (d) show results for larger α and β = 1. In Figure 1(c), as α increases, the empirical factors improve. In particular, as predicted by the theoretical bounds, EA outperforms MUB for large α and, for α ≥ n^(2/3), EA finds the optimal solution. In addition, Figures 1(b) and (d) illustrate the theoretical and empirical effect of curvature: as n grows, the bounds saturate and approach a constant 1/(1 − κ); they do not grow polynomially in n. Overall, we see that the empirical results quite closely follow our theoretical results and that, as the theory suggests, curvature significantly affects the approximation factors.

6 Conclusion and Discussion

In this paper, we study the effect of curvature on the problems of approximating, learning, and minimizing submodular functions under constraints. We prove tightened, curvature-dependent upper bounds with almost matching lower bounds. These results complement known results for submodular maximization [3, 29].
Given that the functional form and effect of the submodularity ratio proposed in [5] are similar to those of curvature, an interesting extension is the question of whether there is a single quantity unifying both of these terms. Another open question is whether a quantity similar to curvature can be defined for subadditive functions, thus refining the results in [1] for learning subadditive functions. Finally, it also seems that the techniques in this paper could be used to provide improved curvature-dependent regret bounds for constrained online submodular minimization [15].

Acknowledgments: Special thanks to Kai Wei for pointing out that Corollary 3.4 holds and for other discussions, to Bethany Herwaldt for reviewing an early draft of this manuscript, and to the anonymous reviewers. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1162606, a Google and a Microsoft award, and by the Intel Science and Technology Center for Pervasive Computing. Stefanie Jegelka's work is supported by the Office of Naval Research under contract/grant number N00014-11-1-0688, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware and Yahoo!.

References

[1] M. F. Balcan, F. Constantin, S. Iwata, and L. Wang. Learning valuation functions. In COLT, 2011.
[2] N. Balcan and N. Harvey.
Submodular functions: Learnability, structure, and optimization. arXiv preprint, 2012.
[3] M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.
[4] W. H. Cunningham. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.
[5] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.
[6] G. Goel, C. Karande, P. Tripathi, and L. Wang. Approximability of combinatorial problems with multi-agent submodular cost functions. In FOCS, 2009.
[7] G. Goel, P. Tripathi, and L. Wang. Combinatorial problems with discounted price functions in multi-agent systems. In FSTTCS, 2010.
[8] M. Goemans, N. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, pages 535–544, 2009.
[9] R. Hassin, J. Monnot, and D. Segev. Approximation algorithms and hardness results for labeled connectivity problems. Journal of Combinatorial Optimization, 14(4):437–453, 2007.
[10] S. Iwata and K. Nagano. Submodular function minimization under covering constraints. In FOCS, pages 671–680. IEEE, 2009.
[11] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications. In UAI, 2012.
[12] R. Iyer, S. Jegelka, and J. Bilmes. Curvature and optimal algorithms for learning and optimization of submodular functions: Extended arXiv version, 2013.
[13] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential-based submodular function optimization. In ICML, 2013.
[14] S. Jegelka. Combinatorial problems with submodular coupling in machine learning and computer vision. PhD thesis, ETH Zurich, 2012.
[15] S. Jegelka and J. Bilmes.
Online submodular minimization for combinatorial structures. In ICML, 2011.
[16] S. Jegelka and J. A. Bilmes. Approximation bounds for inference using cooperative cuts. In ICML, 2011.
[17] S. Jegelka and J. A. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011.
[18] P. Kohli, A. Osokin, and S. Jegelka. A principled deep random field for image segmentation. In CVPR, 2013.
[19] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2005.
[20] E. Lawler and C. Martel. Computing maximal "polymatroidal" network flows. Mathematics of Operations Research, 7(3):334–347, 1982.
[21] H. Lin and J. Bilmes. Optimal selection of limited vocabulary speech corpora. In Interspeech, 2011.
[22] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In The 49th Meeting of the Assoc. for Comp. Ling.: Human Lang. Technologies (ACL/HLT-2011), Portland, OR, June 2011.
[23] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.
[24] E. Nikolova. Approximation algorithms for offline risk-averse combinatorial optimization, 2010.
[25] J. Soto and M. Goemans. Symmetric submodular function minimization under hereditary family constraints. arXiv:1007.2140, 2010.
[26] P. Stobbe and A. Krause. Learning Fourier sparse set functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
[27] Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. In FOCS, pages 697–706, 2008.
[28] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[29] J. Vondrák. Submodularity and curvature: the optimal algorithm. RIMS Kokyuroku Bessatsu, 23, 2010.
[30] P.-J.
Wan, G. Calinescu, X.-Y. Li, and O. Frieder. Minimum-energy broadcasting in static ad hoc wireless networks. Wireless Networks, 8:607–617, 2002.
[31] P. Zhang, J.-Y. Cai, L.-Q. Tang, and W.-B. Zhao. Approximation and hardness results for label cut and related problems. Journal of Combinatorial Optimization, 21(2):192–208, 2011.