{"title": "Gradient Methods for Submodular Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 5841, "page_last": 5851, "abstract": "In this paper, we study the problem of maximizing continuous submodular functions that naturally arise in many learning applications such as those involving utility functions in active learning and sensing, matrix approximations and network inference. Despite the apparent lack of convexity in such functions, we prove that stochastic projected gradient methods can provide strong approximation guarantees for maximizing continuous submodular functions with convex constraints. More specifically, we prove that for monotone continuous DR-submodular functions, all fixed points of projected gradient ascent provide a factor $1/2$ approximation to the global maxima. We also study stochastic gradient methods and show that after $\\mathcal{O}(1/\\epsilon^2)$ iterations these methods reach solutions which achieve in expectation objective values exceeding $(\\frac{\\text{OPT}}{2}-\\epsilon)$. An immediate application of our results is to maximize submodular functions that are defined stochastically, i.e. the submodular function is defined as an expectation over a family of submodular functions with an unknown distribution. We will show how stochastic gradient methods are naturally well-suited for this setting, leading to a factor $1/2$ approximation when the function is monotone. In particular, it allows us to approximately maximize discrete, monotone submodular optimization problems via projected gradient ascent on a continuous relaxation, directly connecting the discrete and continuous domains. 
Finally, experiments on real data demonstrate that our projected gradient methods consistently achieve the best utility compared to other continuous baselines while remaining competitive in terms of computational effort.", "full_text": "Gradient Methods for Submodular Maximization

Hamed Hassani, ESE Department, University of Pennsylvania, Philadelphia, PA, hassani@seas.upenn.edu
Mahdi Soltanolkotabi, EE Department, University of Southern California, Los Angeles, CA, soltanol@usc.edu
Amin Karbasi, ECE Department, Yale University, New Haven, CT, amin.karbasi@yale.edu

Abstract

In this paper, we study the problem of maximizing continuous submodular functions that naturally arise in many learning applications such as those involving utility functions in active learning and sensing, matrix approximations and network inference. Despite the apparent lack of convexity in such functions, we prove that stochastic projected gradient methods can provide strong approximation guarantees for maximizing continuous submodular functions with convex constraints. More specifically, we prove that for monotone continuous DR-submodular functions, all fixed points of projected gradient ascent provide a factor 1/2 approximation to the global maxima. We also study stochastic gradient methods and show that after O(1/ε²) iterations these methods reach solutions which achieve in expectation objective values exceeding (OPT/2 − ε). An immediate application of our results is to maximize submodular functions that are defined stochastically, i.e. the submodular function is defined as an expectation over a family of submodular functions with an unknown distribution. We will show how stochastic gradient methods are naturally well-suited for this setting, leading to a factor 1/2 approximation when the function is monotone. In particular, it allows us to approximately maximize discrete, monotone submodular optimization problems via projected gradient ascent on a continuous relaxation, directly connecting the discrete and continuous domains. Finally, experiments on real data demonstrate that our projected gradient methods consistently achieve the best utility compared to other continuous baselines while remaining competitive in terms of computational effort.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Submodular set functions exhibit a natural diminishing returns property, resembling concave functions in continuous domains. At the same time, they can be minimized exactly in polynomial time (while they can only be maximized approximately), which makes them similar to convex functions. They have found numerous applications in machine learning, including viral marketing [1], dictionary learning [2], network monitoring [3, 4], sensor placement [5], product recommendation [6, 7], document and corpus summarization [8], data summarization [9], crowd teaching [10, 11], and probabilistic models [12, 13]. However, submodularity is in general a property that goes beyond set functions and can be defined for continuous functions. In this paper, we consider the following stochastic continuous submodular optimization problem:

max_{x∈K} F(x) ≜ E_{θ∼D}[F_θ(x)],    (1.1)

where K is a bounded convex body, D is generally an unknown distribution, and the F_θ's are continuous submodular functions for every θ ∈ D. We also use OPT ≜ max_{x∈K} F(x) to denote the optimum value. We note that the function F(x) is itself also continuous submodular, as a non-negative combination of submodular functions is still submodular [14]. The formulation covers popular instances of submodular optimization.
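To make the stochastic formulation (1.1) concrete, the sketch below builds an unbiased value/gradient oracle for a synthetic coverage-style family F_θ (an illustrative choice of ours, not an objective from the paper), with θ drawn coordinate-wise from Uniform[0, 1]; each F_θ(x) = 1 − ∏_i (1 − θ_i x_i) is monotone and DR-submodular on [0, 1]^n:

```python
import random

def F_theta(x, theta):
    # Coverage-style DR-submodular function on [0, 1]^n:
    # probability that at least one coordinate "fires".
    p = 1.0
    for xi, ti in zip(x, theta):
        p *= 1.0 - ti * xi
    return 1.0 - p

def grad_F_theta(x, theta):
    # Partial derivative w.r.t. x_j: theta_j * prod_{i != j} (1 - theta_i x_i).
    n = len(x)
    g = []
    for j in range(n):
        prod = 1.0
        for i in range(n):
            if i != j:
                prod *= 1.0 - theta[i] * x[i]
        g.append(theta[j] * prod)
    return g

def stochastic_oracle(x, n_samples, rng):
    # Unbiased Monte Carlo estimates of F(x) = E_theta[F_theta(x)] and of its
    # gradient, with theta ~ D; here D is i.i.d. Uniform[0, 1] per coordinate.
    n = len(x)
    val, grad = 0.0, [0.0] * n
    for _ in range(n_samples):
        theta = [rng.random() for _ in range(n)]
        val += F_theta(x, theta)
        for j, gj in enumerate(grad_F_theta(x, theta)):
            grad[j] += gj
    return val / n_samples, [gj / n_samples for gj in grad]

# For this particular D the expectation has a closed form,
# F(x) = 1 - prod_i (1 - x_i / 2), which lets us sanity-check the estimator.
rng = random.Random(0)
x = [0.3, 0.6, 0.9]
est_val, est_grad = stochastic_oracle(x, 5000, rng)
exact_val = 1.0
for xi in x:
    exact_val *= 1.0 - xi / 2.0
exact_val = 1.0 - exact_val
```

Only such sampled values and gradients are available to the algorithms studied below; the distribution D itself never needs to be known.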
For instance, when D puts all the probability mass on a single function, (1.1) reduces to deterministic continuous submodular optimization. Another common objective is the finite-sum continuous submodular optimization where D is uniformly distributed over m instances, i.e., F(x) ≜ (1/m) Σ_{θ=1}^{m} F_θ(x).

A natural approach to solving problems of the form (1.1) is to use projected stochastic gradient methods. As we shall see in Section 5, these local search heuristics are surprisingly effective. However, the reasons for this empirical success were completely unclear. The main challenge is that maximizing F corresponds to a nonconvex optimization problem (as the function F is not concave), and a priori it is not clear why gradient methods should yield a reliable solution. This leads us to the main challenge of this paper:

Do projected gradient methods lead to provably good solutions for continuous submodular maximization with general convex constraints?

We answer the above question in the affirmative, proving that projected gradient methods produce a competitive solution with respect to the optimum. More specifically, given a general bounded convex body K and a continuous function F that is monotone, smooth, and (weakly) DR-submodular, we show that:

• All stationary points of a DR-submodular function F over K provide a 1/2 approximation to the global maximum. Thus, projected gradient methods with sufficiently small step sizes (a.k.a. gradient flows) always lead to solutions with 1/2 approximation guarantees.

• Projected gradient ascent after O(L2/ε) iterations produces a solution with objective value larger than (OPT/2 − ε). When calculating the gradient is difficult but an unbiased estimate can be easily obtained, stochastic projected gradient ascent in O(L2/ε + σ²/ε²) iterations finds a solution with objective value exceeding (OPT/2 − ε). Here, L2 is the smoothness of the continuous submodular function measured in the ℓ2-norm, σ² is the variance of the stochastic gradient with respect to the true gradient, and OPT is the function value at the global optimum.

• More generally, for weakly continuous DR-submodular functions with parameter γ (defined in (2.6)), we prove the above results with a γ²/(1+γ²) approximation guarantee.

Our results have some important implications. First, they show that projected gradient methods are an efficient way of maximizing the multilinear extension of (weakly) submodular set functions for any submodularity ratio γ (note that γ = 1 corresponds to submodular functions) [2]. Second, in contrast to conditional gradient methods for submodular maximization that require the initial point to be the origin [15, 16], projected gradient methods can start from any initial point in the constraint set K and still produce a competitive solution. Third, such conditional gradient methods, when applied to the stochastic setting (with a fixed batch size), perform poorly and can produce arbitrarily bad solutions when applied to continuous submodular functions (see [17, Appendix B] in the long version of this paper for an example and further discussion on why conditional gradient methods do not easily admit stochastic variants). In contrast, stochastic projected gradient methods are stable by design and provide a solution with an approximation ratio of at least 1/2 in expectation. Finally, our work provides a unifying approach for solving the stochastic submodular maximization problem [18]

f(S) ≜ E_{θ∼D}[f_θ(S)],    (1.2)

where the functions f_θ: 2^V → R+ are submodular set functions defined over the ground set V. Such objective functions naturally arise in many data summarization applications [19] and have been recently introduced and studied in [18]. Since D is unknown, problem (1.2) cannot be directly solved. Instead, [18] showed that in the case of coverage functions, it is possible to efficiently maximize f by lifting the problem to the continuous domain and using stochastic gradient methods on a continuous relaxation to reach a solution that is within a factor (1 − 1/e) of the optimum. In contrast, our work provides a general recipe with a 1/2 approximation guarantee for problem (1.2) in which the f_θ's can be any monotone submodular function.

2 Continuous Submodular Maximization

A set function f: 2^V → R+, defined on the ground set V, is called submodular if for all subsets A, B ⊆ V, we have

f(A) + f(B) ≥ f(A ∩ B) + f(A ∪ B).

Even though submodularity is mostly considered on discrete domains, the notion can be naturally extended to arbitrary lattices [20]. To this aim, let us consider a subset of R+^n of the form X = ∏_{i=1}^{n} X_i, where each X_i is a compact subset of R+. A function F: X → R+ is submodular [21] if for all (x, y) ∈ X × X, we have

F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),    (2.1)

where x ∨ y ≜ max(x, y) (component-wise) and x ∧ y ≜ min(x, y) (component-wise). A submodular function is monotone if for any x, y ∈ X obeying x ≤ y, we have F(x) ≤ F(y) (here, by x ≤ y we mean that every element of x is less than that of y). Like set functions, we can define submodularity in an equivalent way, reminiscent of diminishing returns, as follows [14]: the function F is submodular if for any x ∈ X, any two distinct basis vectors e_i, e_j ∈ R^n, and any two non-negative real numbers z_i, z_j ∈ R+ obeying x_i + z_i ∈ X_i and x_j + z_j ∈ X_j, we have

F(x + z_i e_i) + F(x + z_j e_j) ≥ F(x) + F(x + z_i e_i + z_j e_j).    (2.2)

Clearly, the above definition includes submodularity over a set (by restricting the X_i's to {0, 1}) or over an integer lattice (by restricting the X_i's to Z+) as special cases.
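The lattice characterization above, F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y), is easy to test numerically. A small sketch using the concave-over-modular example F(x) = g(Σ_i x_i) with g = sqrt (an illustrative choice; such functions are both submodular and concave):

```python
import math
import random

def F(x):
    # Concave over modular: F(x) = sqrt(x_1 + ... + x_n); submodular (and concave).
    return math.sqrt(sum(x))

def join(x, y):
    # Component-wise maximum, x v y.
    return [max(a, b) for a, b in zip(x, y)]

def meet(x, y):
    # Component-wise minimum, x ^ y.
    return [min(a, b) for a, b in zip(x, y)]

def submodularity_gap(x, y):
    # Non-negative at every pair (x, y) whenever F is submodular.
    return F(x) + F(y) - F(join(x, y)) - F(meet(x, y))

rng = random.Random(0)
gaps = []
for _ in range(1000):
    x = [rng.random() for _ in range(3)]
    y = [rng.random() for _ in range(3)]
    gaps.append(submodularity_gap(x, y))
```

Swapping in a convex g (say g(t) = t²) drives the gap negative, which illustrates that the inequality tracks submodularity rather than concavity.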
However, in the remainder of this paper we consider continuous submodular functions defined on a product of sub-intervals of R+. We note that, when twice differentiable, F is submodular if and only if all cross-second-derivatives are non-positive [14], i.e.,

∀ i ≠ j, ∀ x ∈ X,  ∂²F(x)/∂x_i∂x_j ≤ 0.    (2.3)

The above expression makes it clear that continuous submodular functions are in general neither convex nor concave, as concavity (convexity) implies ∇²F ⪯ 0 (resp. ∇²F ⪰ 0). Indeed, we can have functions that are both submodular and convex/concave. For instance, for a concave function g and non-negative weights λ_i ≥ 0, the function F(x) = g(Σ_{i=1}^{n} λ_i x_i) is submodular and concave. Trivially, affine functions are submodular, concave, and convex. A proper subclass of submodular functions are called DR-submodular [16, 22]: for any x, y ∈ X obeying x ≤ y, any standard basis vector e_i ∈ R^n, and any non-negative number z ∈ R+ obeying z e_i + x ∈ X and z e_i + y ∈ X, we have

F(z e_i + x) − F(x) ≥ F(z e_i + y) − F(y).    (2.4)

One can easily verify that for a differentiable DR-submodular function the gradient is an antitone mapping, i.e., for all x, y ∈ X such that x ≤ y we have ∇F(x) ≥ ∇F(y) [16]. When twice differentiable, DR-submodularity is equivalent to

∀ i, j, ∀ x ∈ X,  ∂²F(x)/∂x_i∂x_j ≤ 0.    (2.5)

The above twice differentiable functions are sometimes called smooth submodular functions in the literature [23]. However, in this paper, we say a differentiable submodular function F is L-smooth w.r.t. a norm ‖·‖ (and its dual norm ‖·‖*) if for all x, y ∈ X we have

‖∇F(x) − ∇F(y)‖* ≤ L ‖x − y‖.

Here, ‖·‖* is the dual norm of ‖·‖, defined as ‖g‖* = sup_{x∈R^n: ‖x‖≤1} gᵀx. When the function is smooth w.r.t. the ℓ2-norm we use L2 (note that the ℓ2 norm is self-dual). We say that a function is weakly DR-submodular with parameter γ if

γ = inf_{x,y∈X, x≤y}  inf_{i∈[n]}  [∇F(x)]_i / [∇F(y)]_i.    (2.6)

See [24] for related definitions. Clearly, for a differentiable DR-submodular function we have γ = 1. An important example of a DR-submodular function is the multilinear extension [15] F: [0, 1]^n → R of a discrete submodular function f, namely,

F(x) = Σ_{S⊆V} ∏_{i∈S} x_i ∏_{j∉S} (1 − x_j) f(S).

We note that for set functions, DR-submodularity (i.e., Eq. (2.4)) and submodularity (i.e., Eq. (2.1)) are equivalent. However, this is not true for general submodular functions defined on integer lattices or products of sub-intervals [16, 22].

The focus of this paper is on the continuous submodular maximization problem defined in (1.1). More specifically, we assume that K ⊂ X is a general bounded convex set (not necessarily down-closed as considered in [16]) with diameter R. Moreover, we consider the F_θ's to be monotone (weakly) DR-submodular functions with parameter γ.

3 Background and Related Work

Submodular set functions [25, 20] originated in combinatorial optimization and operations research, but they have recently attracted significant interest in machine learning.
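As an aside before turning to prior work, the multilinear extension defined above can be evaluated directly for tiny ground sets; a brute-force sketch (the coverage function and its item sets are made-up illustrative data, and the enumeration is exponential in |V|):

```python
from itertools import combinations

# Toy ground set and a coverage-style submodular set function
# (the item sets are hypothetical data for illustration).
V = (0, 1, 2, 3)
ITEMS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d"}}

def f(S):
    # Number of distinct items covered by the elements of S.
    covered = set()
    for e in S:
        covered |= ITEMS[e]
    return len(covered)

def multilinear_extension(x):
    # F(x) = sum_{S subseteq V} prod_{i in S} x_i * prod_{j notin S} (1 - x_j) * f(S):
    # the expected value of f(S) when each i enters S independently w.p. x_i.
    total = 0.0
    for r in range(len(V) + 1):
        for S in combinations(V, r):
            S = set(S)
            p = 1.0
            for i in V:
                p *= x[i] if i in S else 1.0 - x[i]
            total += p * f(S)
    return total
```

Because evaluating F exactly costs 2^|V| evaluations of f, in practice F and its gradient are estimated by sampling random sets, which is exactly the kind of unbiased oracle the stochastic methods of Section 4 are designed to exploit. At integral x the extension agrees with f.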
Even though they are usually considered over discrete domains, their optimization is inherently related to continuous optimization methods. In particular, Lovász [26] showed that the Lovász extension is convex if and only if the corresponding set function is submodular. Moreover, minimizing a submodular set function is equivalent to minimizing its Lovász extension.¹ This idea has recently been extended to the minimization of strict continuous submodular functions (i.e., functions whose cross-order derivatives in (2.3) are strictly negative) [14]. Similarly, approximate submodular maximization is linked to a different continuous extension known as the multilinear extension [28]. The multilinear extension (which is an example of the DR-submodular functions studied in this paper) is in general neither concave nor convex. However, a variant of the conditional gradient method, called continuous greedy, can be used to approximately maximize it. Recently, Chekuri et al. [23] proposed an interesting multiplicative weight update algorithm that achieves a (1 − 1/e − ε) approximation guarantee after Õ(n²/ε²) steps for twice differentiable monotone DR-submodular functions (also called smooth submodular functions) subject to a polytope constraint. Similarly, Bian et al. [16] proved that a conditional gradient method, similar to the continuous greedy algorithm, achieves a (1 − 1/e − ε) approximation guarantee after O(L2/ε) iterations for maximizing a monotone DR-submodular function subject to special convex constraints called down-closed convex bodies. A few remarks are in order. First, the proposed conditional gradient methods cannot handle the general stochastic setting we consider in Problem (1.1) (in fact, projection is the key). Second, there is no near-optimality guarantee if conditional gradient methods do not start from the origin.
More precisely, for the continuous greedy algorithm it is necessary to start from the 0 vector (to be able to remain in the convex constraint set at each iteration), and the 0 vector must be a feasible point of the constraint set; otherwise, the iterates of the algorithm may fall out of the convex constraint set, leading to an infeasible final solution. Third, due to the starting-point requirement, such methods can only handle special convex constraints, called down-closed. And finally, the dependency on L2 is very suboptimal, as L2 can be as large as the dimension n (e.g., for the multilinear extensions of some submodular set functions; see [17, Appendix B] in the long version of this paper). Our work resolves all of these issues by showing that projected gradient methods can also approximately maximize monotone DR-submodular functions subject to general convex constraints, albeit with a lower, 1/2, approximation guarantee.

Generalizations of submodular set functions have lately received a lot of attention. For instance, a line of recent work considered DR-submodular function maximization over an integer lattice [29, 30, 22]. Interestingly, Ene and Nguyen [31] provided an efficient reduction from an integer-lattice DR-submodular function to a submodular set function, thus suggesting a simple way to solve integer-lattice DR-submodular maximization. Note that such reductions cannot be applied to the optimization problem (1.1), as expressing general convex body constraints may require solving a continuous optimization problem.

4 Algorithms and Main Results

In this section we discuss our algorithms together with the corresponding theoretical guarantees.
In what follows, we assume that F is a weakly DR-submodular function with parameter γ.

¹The idea of using stochastic methods for submodular minimization has recently been used in [27].

4.1 Characterizing the quality of stationary points

We begin with the definition of a stationary point.

Definition 4.1 A vector x ∈ K is called a stationary point of a function F: X → R+ over the set K ⊂ X if max_{y∈K} ⟨∇F(x), y − x⟩ ≤ 0.

Stationary points are of interest because they characterize the fixed points of the Gradient Ascent (GA) method. Furthermore, (projected) gradient ascent with a sufficiently small step size is known to converge to a stationary point for smooth functions [32]. To gain some intuition regarding this connection, let us consider the GA procedure. Roughly speaking, at any iteration t of the GA procedure, the value of F increases (to the first order) by ⟨∇F(x_t), x_{t+1} − x_t⟩. Hence, the progress at time t is at most max_{y∈K} ⟨∇F(x_t), y − x_t⟩. If at any time t we have max_{y∈K} ⟨∇F(x_t), y − x_t⟩ ≤ 0, then the GA procedure will not make any progress, and it will be stuck once it falls into a stationary point.

The next natural question is how small the value of F can be at a stationary point compared to the global maximum. The following theorem relates the value of F at a stationary point to OPT.

Theorem 4.2 Let F: X → R+ be monotone and weakly DR-submodular with parameter γ, and assume K ⊆ X is a convex set. Then,
(i) If x is a stationary point of F in K, then F(x) ≥ (γ²/(1+γ²)) OPT.
(ii) Furthermore, if F is L-smooth, gradient ascent with a step size smaller than 1/L will converge to a stationary point.

The theorem above guarantees that all fixed points of the GA method yield a solution whose function value is at least (γ²/(1+γ²)) OPT. Thus, all fixed points of GA provide a factor γ²/(1+γ²) approximation ratio. The particular case of γ = 1, i.e., when F is DR-submodular, asserts that at any stationary point F is at least OPT/2. This lower bound is in fact tight. In the long version of this paper (in particular [17, Appendix A]) we provide a simple instance of a differentiable DR-submodular function that attains OPT/2 at a stationary point that is also a local maximum.

We would like to note that our result on the quality of stationary points (i.e., the first part of Theorem 4.2 above) can be viewed as a simple extension of the results in [33]. In particular, the special case of γ = 1 follows directly from [28, Lemma 3.2]. See the long version of this paper [17, Section 7] for how this lemma is used in our proofs. However, we note that the main focus of this paper is whether such a stationary point can be found efficiently using stochastic schemes that do not require exact evaluations of gradients. This is the subject of the next section.

4.2 (Stochastic) gradient methods

We now discuss our first algorithmic approach. For simplicity we focus our exposition on the DR-submodular case, i.e., γ = 1, and discuss how this extends to the more general case in the long version of this paper ([17, Section 7]). A simple approach to maximizing DR-submodular functions is to use the (projected) Gradient Ascent (GA) method. Starting from an initial estimate x_1 ∈ K obeying the constraints, GA iteratively applies the update

x_{t+1} = P_K(x_t + µ_t ∇F(x_t)).    (4.1)

Here, µ_t is the learning rate and P_K(v) denotes the Euclidean projection of v onto the set K.

However, in many problems of practical interest we do not have direct access to the gradient of F. In these cases it is natural to use a stochastic estimate of the gradient in lieu of the actual gradient. This leads to the Stochastic Gradient Method (SGM). Starting from an initial estimate x_1 ∈ K obeying the constraints, SGM iteratively applies the update

x_{t+1} = P_K(x_t + µ_t g_t).    (4.2)

Specifically, at every iteration t, the current iterate x_t is updated by adding µ_t g_t, where g_t is an unbiased estimate of the gradient ∇F(x_t) and µ_t is the learning rate. The result is then projected onto the set K.
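A self-contained sketch of the updates (4.1)–(4.2) follows. The constraint set, the projection routine, and the toy objective are our illustrative choices: we take K = {x ∈ [0,1]^n : Σ_i x_i ≤ k} (the polytope later used in Section 5), for which the Euclidean projection reduces to a clipped shift found by bisection, and we maximize a coverage-style DR-submodular function F(x) = 1 − ∏_i (1 − θ_i x_i) with hypothetical weights θ:

```python
def project(y, k, tol=1e-10):
    # Euclidean projection onto K = {x in [0,1]^n : sum(x) <= k}.
    # KKT conditions give x_i = clip(y_i - lam, 0, 1) for some lam >= 0;
    # lam is found by bisection on the (monotone) constraint residual.
    def clip(v):
        return min(1.0, max(0.0, v))
    if sum(clip(v) for v in y) <= k:
        return [clip(v) for v in y]
    lo, hi = 0.0, max(y)
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if sum(clip(v - lam) for v in y) > k:
            lo = lam
        else:
            hi = lam
    return [clip(v - hi) for v in y]

theta = [0.9, 0.8, 0.3, 0.2]  # hypothetical weights for the toy objective

def F(x):
    p = 1.0
    for ti, xi in zip(theta, x):
        p *= 1.0 - ti * xi
    return 1.0 - p

def grad_F(x):
    g = []
    for j in range(len(x)):
        prod = 1.0
        for i, (ti, xi) in enumerate(zip(theta, x)):
            if i != j:
                prod *= 1.0 - ti * xi
        g.append(theta[j] * prod)
    return g

def stationarity_gap(x, k):
    # max_{y in K} <grad F(x), y - x>: for this K, the maximizing y puts 1 on
    # the k coordinates with the largest positive gradient entries.
    g = grad_F(x)
    best = sum(sorted((gi for gi in g if gi > 0), reverse=True)[:k])
    return best - sum(gi * xi for gi, xi in zip(g, x))

# Projected gradient ascent, update (4.1): x_{t+1} = P_K(x_t + mu * grad F(x_t)).
k, x = 2, [0.0] * 4
for t in range(500):
    x = project([xi + 0.1 * gi for xi, gi in zip(x, grad_F(x))], k)
```

From x_1 = 0 the iterates concentrate mass on the two largest θ coordinates, and at convergence the stationarity gap of Definition 4.1 is (approximately) non-positive, matching the fixed-point characterization above.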
We note that when g_t = ∇F(x_t), i.e., when there is no randomness in the updates, the SGM updates (4.2) reduce to the GA updates (4.1). We detail the SGM method in Algorithm 1.

Algorithm 1 (Stochastic) Gradient Method for Maximizing F(x) over a convex set K
Parameters: integer T > 0 and scalars η_t > 0, t ∈ [T]
Initialize: x_1 ∈ K
for t = 1 to T do
    y_{t+1} ← x_t + η_t g_t, where g_t is a random vector s.t. E[g_t | x_t] = ∇F(x_t)
    x_{t+1} ← arg min_{x∈K} ‖x − y_{t+1}‖_2
end for
Pick τ uniformly at random from {1, 2, . . . , T}.
Output x_τ

As we shall see in our experiments detailed in Section 5, the SGM method is surprisingly effective for maximizing monotone DR-submodular functions. However, the reasons for this empirical success were previously unclear. The main challenge is that maximizing F corresponds to a nonconvex optimization problem (as the function F is not concave), and a priori it is not clear why gradient methods should yield a competitive ratio. Thus, studying gradient methods for such nonconvex problems poses new challenges:

Do (stochastic) gradient methods converge to a stationary point?

The next theorem addresses some of these challenges. To be able to state this theorem, let us recall the standard definition of smoothness. We say that a continuously differentiable function F is L-smooth (in Euclidean norm) if the gradient ∇F is L-Lipschitz, that is, ‖∇F(x) − ∇F(y)‖_{ℓ2} ≤ L ‖x − y‖_{ℓ2}. We also define the diameter (in Euclidean norm) as R² = sup_{x,y∈K} (1/2) ‖x − y‖²_{ℓ2}. We now have all the elements in place to state our first theorem.

Theorem 4.3 (Stochastic Gradient Method) Let us assume that F is L-smooth w.r.t. the Euclidean norm ‖·‖_{ℓ2}, monotone, and DR-submodular. Furthermore, assume that we have access to a stochastic oracle g_t obeying

E[g_t] = ∇F(x_t)  and  E[‖g_t − ∇F(x_t)‖²_{ℓ2}] ≤ σ².

We run stochastic gradient updates of the form (4.2) with µ_t = 1/(L + (σ/R)√t). Let τ be a random variable taking values in {1, 2, . . . , T} with equal probability. Then,

E[F(x_τ)] ≥ OPT/2 − (R²L + OPT)/(2T) − Rσ/√T.    (4.3)

Remark 4.4 We would like to note that if we pick τ to be a random variable taking values in {2, . . . , T − 1} with probability 1/(T − 1), and 1 and T each with probability 1/(2(T − 1)), then

E[F(x_τ)] ≥ OPT/2 − R²L/(2T) − Rσ/√T.

The above results roughly state that T = O(R²L/ε + R²σ²/ε²) iterations of the stochastic gradient method, from any initial point, yield a solution whose objective value is in expectation at least OPT/2 − ε. Stated differently, T = O(R²L/ε + R²σ²/ε²) iterations of the stochastic gradient method provide in expectation a value that exceeds OPT/2 − ε for DR-submodular maximization. As explained in Section 4.1, it is not possible to go beyond the factor 1/2 approximation ratio using gradient ascent from an arbitrary initialization.

An important aspect of the above result is that it only requires an unbiased estimate of the gradient. This flexibility is crucial for many DR-submodular maximization problems (see (1.1)), as in many cases calculating the function F and its derivative is not feasible. However, it is possible to provide a good unbiased estimator for these quantities.

We would like to point out that our results are similar in nature to known results about stochastic methods for convex optimization. Indeed, this result interpolates between the 1/√T rate for stochastic smooth optimization and the 1/T rate for deterministic smooth optimization. The special case of σ = 0, which corresponds to Gradient Ascent, deserves particular attention. In this case, and under the assumptions of Theorem 4.3, it is possible to show that F(x_T) ≥ OPT/2 − R²L/(2T), without the need for a randomized choice of τ ∈ [T].

Finally, we would like to note that while the first term in (4.3) decreases as 1/T, the pre-factor L could be rather large in many applications. For instance, this quantity may depend on the dimension of the input n (see [17, Appendix C] in the extended version of this paper). Thus, the number of iterations needed to reach a desirable accuracy may be very large. Such a large computational load renders (stochastic) gradient methods infeasible in some application domains. It is possible to overcome this deficiency by using stochastic mirror methods (see [17, Section 4.3] in the extended version of this paper).

5 Experiments

In our experiments, we consider a movie recommendation application [19] consisting of N users and n movies.
Each user i has a user-specific utility function f_i for evaluating sets of movies. The goal is to find a set of k movies such that, in expectation over users' preferences, it provides the highest utility, i.e., max_{|S|≤k} f(S), where f(S) ≜ E_{i∼D}[f_i(S)]. This is an instance of the stochastic submodular maximization problem defined in (1.2). We consider a setting that consists of N users and the empirical objective function (1/N) Σ_{j=1}^{N} f_j. In other words, the distribution D is assumed to be uniform on the integers between 1 and N. We can then run the (discrete) greedy algorithm on the empirical objective function to find a good set of size k. However, as N is a large number, the greedy algorithm requires a high computational complexity. Another way of solving this problem is to evaluate the multilinear extension F_i of any sampled function f_i and solve the problem in the continuous domain, as follows. Let F(x) = E_{i∼D}[F_i(x)] for x ∈ [0, 1]^n and define the constraint set P_k = {x ∈ [0, 1]^n : Σ_{i=1}^{n} x_i ≤ k}. The discrete and continuous optimization formulations lead to the same optimal value [15]:

max_{S: |S|≤k} f(S) = max_{x∈P_k} F(x).

Therefore, by running the stochastic versions of projected gradient methods, we can find a solution in the continuous domain that is at least a 1/2 approximation to the optimal value. By rounding that fractional solution (for instance via randomized Pipage rounding [15]) we obtain a set whose utility is at least 1/2 of the optimum solution set of size k. We note that randomized Pipage rounding does not need access to the value of f. We also remark that projection onto P_k can be done very efficiently in O(n) time (see [18, 34, 35]). Therefore, such an approach easily scales to big-data scenarios where the size of the data set (e.g., the number of users) or the number of items n (e.g., the number of movies) is very large.

In our experiments, we consider the following baselines:

(i) Stochastic Gradient Ascent (SG): with step size µ_t = c/√t and mini-batch size B. The details for computing an unbiased estimator for the gradient of F are given in [17, Appendix D] of the extended version of this paper.

(ii) Frank-Wolfe (FW) variant of [16]: with T denoting the total number of iterations and B the mini-batch size (we further let α = 1, δ = 0; see Algorithm 1 in [16] for more details).

(iii) Batch-mode Greedy (Greedy): we run the vanilla greedy algorithm (in the discrete domain) in the following way. At each round of the algorithm (for selecting a new element), B random users are picked and the function f is estimated by the average over the B selected users.

To run the experiments we use the MovieLens data set. It consists of 1 million ratings (from 1 to 5) by N = 6041 users for n = 4000 movies. Let r_{i,j} denote the rating of user i for movie j (if such a rating does not exist we assign r_{i,j} = 0). In our experiments, we consider two well-motivated objective functions. The first one is called "facility location", where the valuation function of user i is defined as f(S, i) = max_{j∈S} r_{i,j}. In words, user i evaluates a set S by picking the highest rated movie in S. Thus, the objective function is equal to

f_fac(S) = (1/N) Σ_{i=1}^{N} max_{j∈S} r_{i,j}.

In our second experiment, we consider a different user-specific valuation function, which is a concave function composed with a modular function, i.e., f(S, i) = (Σ_{j∈S} r_{i,j})^{1/2}. Again, by considering the uniform distribution over the set of users, we obtain

f_con(S) = (1/N) Σ_{i=1}^{N} (Σ_{j∈S} r_{i,j})^{1/2}.

Note that the multilinear extensions of f_fac and f_con are neither concave nor convex.

[Figure 1: (a) Performance of the algorithms w.r.t. the cardinality constraint k for the concave-over-modular objective; each continuous algorithm (SG and FW) runs for T = 2000 iterations. (b) Performance of SG versus the number of iterations for fixed k = 20 for the concave-over-modular objective; the green dashed line indicates the value obtained by Greedy (with B = 1000); recall that the step size of SG is c/√t. (c) Performance of the algorithms w.r.t. the cardinality constraint k for the facility location objective; each continuous algorithm (SG and FW) runs for T = 2000 iterations. (d) Performance of the algorithms versus the number of simple function computations (i.e., the number of f_i's evaluated during the algorithm) for the facility location objective; for the greedy algorithm a larger number of function computations corresponds to a larger batch size, while for SG it corresponds to more iterations.]

Figure 1 depicts the performance of the different algorithms for the two proposed objective functions. As Figures 1a and 1c show, the FW algorithm needs a much higher mini-batch size to be comparable in performance to our stochastic gradient methods. Note that a smaller batch size leads to less computational effort (using the same values for B and T, the computational complexity of FW and SG is almost the same).
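The two objectives can be made concrete on a toy ratings matrix (made-up numbers standing in for the MovieLens r_{i,j}; for brevity the exact empirical f is used in place of its B-sample estimate):

```python
import math

# Hypothetical ratings r[i][j] for N = 3 users and n = 4 movies;
# 0 means "not rated", as in the experimental setup.
r = [
    [5, 0, 3, 0],
    [0, 4, 0, 2],
    [1, 5, 0, 0],
]
N, n = len(r), len(r[0])

def f_fac(S):
    # Facility location: each user is credited with their best movie in S.
    return sum(max((r[i][j] for j in S), default=0) for i in range(N)) / N

def f_con(S):
    # Concave over modular: square root of each user's total rating over S.
    return sum(math.sqrt(sum(r[i][j] for j in S)) for i in range(N)) / N

def greedy(f, k):
    # Vanilla greedy in the discrete domain (here with the exact empirical f;
    # the batch-mode baseline replaces f by an average over B sampled users).
    S = set()
    for _ in range(k):
        j_star = max((j for j in range(n) if j not in S),
                     key=lambda j: f(S | {j}))
        S.add(j_star)
    return S
```

On this matrix, greedy(f_fac, 2) returns {0, 1}: movie 1 gives the best singleton value, and adding movie 0 then raises every user's best-in-set rating the most.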
Figure 1b shows that after a few hundred iterations SG with B = 20 obtains almost the same utility as the Greedy method with a large batch size (B = 1000). Finally, Figure 1d shows the performance of the algorithms with respect to the number of times the single functions (f_i's) are evaluated. This further shows that gradient-based methods have complexity comparable to that of the Greedy algorithm in the discrete domain.

6 Conclusion

In this paper we studied gradient methods for submodular maximization. Despite the lack of convexity of the objective function, we demonstrated that local search heuristics are effective at finding approximately optimal solutions. In particular, we showed that all fixed points of projected gradient ascent provide a factor 1/2 approximation to the global maxima. We also demonstrated that stochastic gradient and mirror methods achieve an objective value of OPT/2 − ε in O(1/ε²) iterations. We further demonstrated the effectiveness of our methods with experiments on real data.

While in this paper we have focused on convex constraints, our framework may allow non-convex constraints as well. For instance, it may be possible to combine our framework with recent results in [36, 37, 38] to deal with general nonconvex constraints. Furthermore, in some cases projection onto the constraint set may be computationally intensive or even intractable, but calculating an approximate projection may be possible with significantly less effort. One of the advantages of gradient descent-based proofs is that they continue to work even when some perturbations are introduced in the updates. Therefore, we believe that our framework can deal with approximate projections, and we hope to pursue this in future work.

Acknowledgments

This work was done while the authors were visiting the Simons Institute for the Theory of Computing. A. K. is supported by DARPA YFA D16AP00046.
The authors would like to thank Jeff Bilmes, Volkan Cevher, Chandra Chekuri, Maryam Fazel, Stefanie Jegelka, Mohammad-Reza Karimi, Andreas Krause, Mario Lucic, and Andrea Montanari for helpful discussions.

References

[1] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.

[2] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.

[3] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Van Briesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.

[4] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD, 2010.

[5] C. Guestrin, A. Krause, and A. P. Singh. Near-optimal sensor placements in Gaussian processes. In ICML, 2005.

[6] K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. Turning down the noise in the blogosphere. In KDD, 2009.

[7] B. Mirzasoleiman, A. Badanidiyuru, and A. Karbasi. Fast constrained submodular maximization: Personalized data summarization. In ICML, 2016.

[8] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[9] B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.

[10] A. Singla, I. Bogunovic, G. Bartok, A. Karbasi, and A. Krause. Near-optimally teaching the crowd to classify. In ICML, 2014.

[11] B. Kim, O. Koyejo, and R. Khanna. Examples are not enough, learn to criticize! Criticism for interpretability. In NIPS, 2016.

[12] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In NIPS, 2014.

[13] R. Iyer and J. Bilmes.
Submodular point processes with applications to machine learning. In Artificial Intelligence and Statistics, 2015.

[14] F. Bach. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

[15] G. Calinescu, C. Chekuri, M. Pal, and J. Vondrak. Maximizing a submodular set function subject to a matroid constraint. SIAM Journal on Computing, 2011.

[16] A. Bian, B. Mirzasoleiman, J. M. Buhmann, and A. Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. arXiv preprint arXiv:1606.05615, 2016.

[17] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. arXiv preprint arXiv:1708.03949, 2017.

[18] M. Karimi, M. Lucic, H. Hassani, and A. Krause. Stochastic submodular maximization: The case for coverage functions. 2017.

[19] S. A. Stan, M. Zadimoghaddam, A. Krause, and A. Karbasi. Probabilistic submodular maximization in sub-linear time. In ICML, 2017.

[20] S. Fujishige. Submodular Functions and Optimization, volume 58 of Annals of Discrete Mathematics. North Holland, Amsterdam, 2nd edition, 2005.

[21] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.

[22] T. Soma and Y. Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In NIPS, 2015.

[23] C. Chekuri, T. S. Jayram, and J. Vondrak. On multiplicative weight updates for concave and submodular function maximization. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 201–210. ACM, 2015.

[24] R. Eghbali and M. Fazel. Designing smoothing functions for improved worst-case competitive ratio in online optimization. In NIPS, pages 3287–3295, 2016.

[25] J. Edmonds. Matroids and the greedy algorithm.
Mathematical Programming, 1(1):127–136, 1971.

[26] L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

[27] D. Chakrabarty, Y. T. Lee, A. Sidford, and S. C. W. Wong. Subquadratic submodular function minimization. In STOC, 2017.

[28] C. Chekuri, J. Vondrák, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011.

[29] T. Soma, N. Kakimura, K. Inaba, and K. Kawarabayashi. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In ICML, 2014.

[30] C. Gottschalk and B. Peis. Submodular function maximization on the bounded integer lattice. In International Workshop on Approximation and Online Algorithms, 2015.

[31] A. Ene and H. L. Nguyen. A reduction for optimizing lattice submodular functions with diminishing returns. arXiv preprint arXiv:1606.08362, 2016.

[32] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[33] J. Vondrak, C. Chekuri, and R. Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 783–792. ACM, 2011.

[34] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

[35] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46(1):321–328, 1990.

[36] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time–data tradeoffs for linear inverse problems. arXiv preprint arXiv:1507.04793, 2015.

[37] M. Soltanolkotabi.
Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization. arXiv preprint arXiv:1702.06175, 2017.

[38] M. Soltanolkotabi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.