{"title": "The Power of Optimization from Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 4017, "page_last": 4025, "abstract": "We consider the problem of optimization from samples of monotone submodular functions with bounded curvature. In numerous applications, the function optimized is not known a priori, but instead learned from data. What are the guarantees we have when optimizing functions from sampled data? In this paper we show that for any monotone submodular function with curvature c there is a (1 - c)/(1 + c - c^2) approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. Moreover, we show that this algorithm is optimal. That is, for any c < 1, there exists a submodular function with curvature c for which no algorithm can achieve a better approximation. The curvature assumption is crucial as for general monotone submodular functions no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable.", "full_text": "The Power of Optimization from Samples\n\nEric Balkanski\nHarvard University\n\nericbalkanski@g.harvard.edu\n\nAviad Rubinstein\n\nUniversity of California, Berkeley\n\naviad@eecs.berkeley.edu\n\nYaron Singer\n\nHarvard University\n\nyaron@seas.harvard.edu\n\nAbstract\n\nWe consider the problem of optimization from samples of monotone submodular\nfunctions with bounded curvature. In numerous applications, the function opti-\nmized is not known a priori, but instead learned from data. 
What are the guarantees we have when optimizing functions from sampled data?
In this paper we show that for any monotone submodular function with curvature c there is a (1 - c)/(1 + c - c^2) approximation algorithm for maximization under cardinality constraints when polynomially-many samples are drawn from the uniform distribution over feasible sets. Moreover, we show that this algorithm is optimal. That is, for any c < 1, there exists a submodular function with curvature c for which no algorithm can achieve a better approximation. The curvature assumption is crucial as for general monotone submodular functions no algorithm can obtain a constant-factor approximation for maximization under a cardinality constraint when observing polynomially-many samples drawn from any distribution over feasible sets, even when the function is statistically learnable.

1 Introduction

Traditionally, machine learning is concerned with predictions: assuming data is generated from some model, the goal is to predict the behavior of the model on data similar to that observed. In many cases, however, we harness machine learning to make decisions: given observations from a model, the goal is to find its optimum rather than predict its behavior. Some examples include:

• Ranking in information retrieval: In ranking the goal is to select k ∈ ℕ documents that are most relevant for a given query. The underlying model is a function which maps a set of documents and a given query to its relevance score. Typically we do not have access to the scoring function, and thus learn it from data. In the learning-to-rank framework, for example, the input consists of observations of document-query pairs and their relevance score. 
The goal is to construct a scoring function of query-document pairs so that given a query we can decide on the k most relevant documents.

• Optimal tagging: The problem of optimal tagging consists of picking k tags for some new content to maximize incoming traffic. The model is a function which captures the way in which users navigate through content given their tags. Since the algorithm designer cannot know the behavior of every online user, the model is learned from observations of user navigation in order to make a decision on which k tags maximize incoming traffic.

• Influence in networks: In influence maximization the goal is to identify a subset of individuals who can spread information in a manner that generates a large cascade. The underlying assumption is that there is a model of influence that governs the way in which individuals forward information from one to another. Since the model of influence is not known, it is learned from data. The observed data is pairs of a subset of nodes who initiated a cascade and the total number of individuals influenced. The decision is the optimal set of influencers.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In the interest of maintaining theoretical guarantees on the decisions, we often assume that the generative model has some structure which is amenable to optimization. When the decision variables are discrete quantities, a natural structure for the model is submodularity. A function f : 2^N → ℝ defined over a ground set N = {e_1, . . . , e_n} of elements is submodular if it exhibits a diminishing marginal returns property, i.e., f_S(e) ≥ f_T(e) for all sets S ⊆ T ⊆ N and elements e ∉ T, where f_S(e) = f(S ∪ {e}) - f(S) is the marginal contribution of element e to the set S ⊆ N. 
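To make the definition concrete, here is a minimal Python sketch of the diminishing-returns inequality f_S(e) ≥ f_T(e) for S ⊆ T, on a tiny coverage function (the instance is invented for illustration and is not from the paper):

```python
# Coverage function: f(S) = number of universe items covered by the
# sets associated with the elements of S. Coverage functions are
# monotone submodular, so marginal contributions shrink as sets grow.

def f(S, covers):
    covered = set()
    for e in S:
        covered |= covers[e]
    return len(covered)

def marginal(e, S, covers):
    # f_S(e) = f(S ∪ {e}) - f(S)
    return f(S | {e}, covers) - f(S, covers)

# Invented instance: three elements covering overlapping items.
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
S, T = {1}, {1, 2}  # S ⊆ T
print(marginal(3, S, covers), marginal(3, T, covers))  # 2 1
```

Element 3 contributes two new items to the smaller set S but only one to the larger set T, exactly the diminishing-returns behavior.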
This diminishing returns property encapsulates numerous applications in machine learning and data mining and is particularly appealing due to its theoretical guarantees on optimization (see related work below). The guarantees on optimization of submodular functions apply to the case in which the algorithm designer has access to some succinct description of the function, or alternatively to some idealized value oracle which allows querying for function values of any given set. In numerous settings, such as in the above examples, we do not have access to the function or its value oracle, but rather learn the function from observed data. If the function learned from data is submodular we can optimize it and obtain a solution with provable guarantees on the learned model. But how do the guarantees of this solution on the learned model relate to its guarantees on the generative model? If we obtain an approximate optimum on the learned model which turns out to be far from the optimum of the submodular function we aim to optimize, the provable guarantees at hand do not apply.

Optimization from samples. For concreteness, suppose that the generative model is a monotone submodular function f : 2^N → ℝ and we wish to find a solution to max_{S:|S|≤k} f(S). To formalize the concept of observations in standard learning-theoretic terms, we can assume that we observe samples of sets drawn from some distribution D and their function values, i.e., {(S_i, f(S_i))}_{i=1}^m. In terms of learnability, under some assumptions about the distribution and the function, submodular functions are statistically learnable (see the discussion about PMAC learnability). In terms of approximation guarantees for optimization, a simple greedy algorithm obtains a (1 - 1/e)-approximation.
Recent work shows that optimization from samples is generally impossible [4], even for models that are learnable and optimizable. 
In particular, even for maximizing coverage functions, which are a special case of submodular functions and widely used in practice, no algorithm can obtain a constant-factor approximation using fewer than exponentially many samples of feasible solutions drawn from any distribution. In practice, however, the functions we aim to optimize may be better behaved.
An important property of submodular functions that has been heavily explored recently is that of curvature. Informally, the curvature is a measure of how far the function is from being modular. A function f is modular if f(S) = Σ_{e∈S} f(e), and has curvature c ∈ [0, 1] if f_S(e) ≥ (1 - c) f(e) for any S ⊆ N. Curvature plays an important role since the hard instances of submodular optimization often occur only when the curvature is unbounded, i.e., c close to 1. The hardness results for optimization from samples are no different, and apply when the curvature is unbounded.

What are the guarantees for optimization from samples of submodular functions with bounded curvature?

In this paper we study the power of optimization from samples when the curvature is bounded. Our main result shows that for any monotone submodular function with curvature c there is an algorithm which observes polynomially-many samples from the uniform distribution over feasible sets and obtains an approximation ratio of (1 - c)/(1 + c - c^2) - o(1). Furthermore, we show that this bound is tight. For any c < 1, there exist monotone submodular functions with curvature c for which no algorithm can obtain an approximation better than (1 - c)/(1 + c - c^2) + o(1) given polynomially many samples. We also perform experiments on synthetic hard instances of monotone submodular functions that convey some interpretation of our results.
For the case of modular functions a (1 - o(1))-approximation algorithm can be obtained, which as a consequence leads to a (1 - c)^2 approximation algorithm for submodular functions with bounded curvature [4]. 
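For intuition, the curvature of a small function can be computed directly from the definition f_S(e) ≥ (1 - c) f(e); a brute-force Python sketch (the square-root example is mine, not from the paper):

```python
import math

# Curvature c of a monotone submodular f is the smallest c with
# f_S(e) >= (1 - c) f(e) for all S and e not in S. For submodular f the
# binding constraint is S = N \ {e}, so it suffices to check those sets.

def curvature(f, N):
    c = 0.0
    for e in N:
        rest = frozenset(N) - {e}
        ratio = (f(rest | {e}) - f(rest)) / f(frozenset([e]))
        c = max(c, 1.0 - ratio)
    return c

# Example: f(S) = sqrt(|S|), a monotone submodular function.
N = {1, 2, 3, 4}
f = lambda S: math.sqrt(len(S))
print(round(curvature(f, N), 3))  # sqrt(3) - 1 ≈ 0.732
```

A modular function makes every ratio in the loop equal to 1, giving curvature 0, which matches the definition above.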
The goal of this work is to exploit the curvature property to obtain the optimal algorithm for optimization from samples.

A high-level overview of the techniques. The algorithm estimates the expected marginal contribution of each element to a random set. It then returns the (approximately) best set between the set of elements with the highest estimates and a random set. The curvature property is used to bound the differences between the marginal contribution of each element to: (1) a random set, (2) the set of elements with highest (estimated) marginal contributions to a random set, and (3) the optimal set. A key observation in the analysis is that if the difference between (1) and (3) is large, then a random set has large value (in expectation).
To obtain our matching inapproximability result, we construct an instance where, after viewing polynomially many samples, the elements of the optimal set cannot be distinguished from a much larger set of elements that have high marginal contribution to a random set, but low marginal contribution when combined with each other. The main challenge is constructing the optimal elements such that they have lower marginal contribution to a random set than to the other optimal elements. This requires carefully defining the way different types of elements interact with each other, while maintaining the global properties of monotonicity, submodularity, and bounded curvature.

1.1 Related work
Submodular maximization. In the traditional value oracle model, an algorithm may adaptively query polynomially many sets S_i and obtain via a black box their values f(S_i). It is well known that in this model, the greedy algorithm obtains a (1 - 1/e)-approximation for a wide range of constraints including cardinality constraints [23], and that no algorithm can do better [6]. 
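For reference, the greedy algorithm in the value oracle model is a few lines of Python (the coverage oracle below is a toy instance invented for illustration):

```python
# Greedy for max_{|S| <= k} f(S) with a monotone submodular f queried as a
# black-box value oracle; it achieves the optimal (1 - 1/e)-approximation.

def greedy(f, N, k):
    S = set()
    for _ in range(k):
        # add the element with the largest marginal contribution f_S(e)
        e = max((x for x in N if x not in S),
                key=lambda x: f(S | {x}) - f(S))
        S.add(e)
    return S

# Toy coverage oracle (instance invented for the example).
covers = {1: {"a", "b"}, 2: {"b"}, 3: {"c"}, 4: {"a", "c"}}
def f(S):
    return len(set().union(*(covers[e] for e in S))) if S else 0

print(sorted(greedy(f, covers, 2)))  # [1, 3]
```

Each iteration makes one pass of oracle queries, so the total query count is O(nk), polynomially many as the related-work discussion assumes.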
Submodular optimization is an essential tool for problems in machine learning and data mining such as sensor placement [20, 12], information retrieval [28, 14], optimal tagging [24], influence maximization [19, 13], information summarization [21, 22], and vision [17, 18].

Learning. A recent line of work focuses on learning submodular functions from samples [3, 8, 2, 10, 11, 1, 9]. The standard model for learning submodular functions is α-PMAC learnability, introduced by Balcan and Harvey [3], which generalizes the well-known PAC learnability framework of Valiant [26]. Informally, a function is PAC or PMAC learnable if, given polynomially many samples, it is possible to construct a function that is likely to mimic the function from which the samples are coming. Monotone submodular functions are α-PMAC learnable from samples coming from a product distribution for some constant α and under some assumptions [3].

Curvature. In the value oracle model, the greedy algorithm is a (1 - e^{-c})/c approximation algorithm for cardinality constraints [5]. Recently, Sviridenko et al. [25] improved this approximation to 1 - c/e with variants of the continuous greedy and local search algorithms. Submodular optimization and curvature have also been studied for more general constraints [27, 15] and for submodular minimization [16]. The curvature assumption has applications in problems such as maximum entropy sampling [25], column-subset selection [25], and submodular welfare [27].

2 Optimization from samples

We precisely define the framework of optimization from samples. A sample (S, f(S)) of a function f(·) is a set and its value. As in the PMAC-learning framework, the samples (S_i, f(S_i)) are such that the sets S_i are drawn i.i.d. from a distribution D. 
As in the standard optimization framework, the goal is to return a set S satisfying some constraint M ⊆ 2^N such that f(S) is an α-approximation to the optimal solution f(S*) with S* ∈ M.
A class of functions F is α-optimizable from samples under constraint M and over distribution D if for all functions f(·) ∈ F there exists an algorithm which, given polynomially many samples (S_i, f(S_i)), returns with high probability over the samples a set S ∈ M such that

f(S) ≥ α · max_{T∈M} f(T).

In the unconstrained case, a random set achieves a 1/4-approximation for general (not necessarily monotone) submodular functions [7]. We focus on the constrained case and consider a simple cardinality constraint M, i.e., M = {S : |S| ≤ k}. To avoid trivialities in the framework, it is important to fix a distribution D. We consider the distribution D to be the uniform distribution over all feasible sets, i.e., all sets of size at most k.

We are interested in functions that are both learnable and optimizable. It is already known that there exist classes of functions, such as coverage and submodular functions, that are both learnable and optimizable but not optimizable from samples for the M and D defined above. This paper studies optimization from samples under an additional assumption: curvature. We assume that the curvature c of the function is known to the algorithm designer. In the appendix, we show an impossibility result for learning the curvature of a function from samples.

3 An optimal algorithm
We design a ((1 - c)/(1 + c - c^2) - o(1))-optimization from samples algorithm for monotone submodular functions with curvature c. In the next section, we show that this approximation ratio is tight. The main contribution is improving over the (1 - c)^2 - o(1) approximation algorithm from [4] to obtain a tight bound on the approximation.

The algorithm. 
Algorithm 1 first estimates the expected marginal contribution of each element e_i to a uniformly random set of size k - 1, which we denote by R for the remainder of this section. These expected marginal contributions E_R[f_R(e_i)] are estimated with v̂_i. The estimates v̂_i are the differences between the average value avg(S_{k,i}) := (Σ_{T∈S_{k,i}} f(T))/|S_{k,i}| of the collection S_{k,i} of samples of size k containing e_i and the average value of the collection S_{k-1,-i} of samples of size k - 1 not containing e_i. We then wish to return the best set between the random set R and the set S consisting of the k elements with the largest estimates v̂_i. Since we do not know the value of S, we lower bound it with v̂_S using the curvature property. We estimate the expected value E_R[f(R)] of R with v̂_R, which is the average value of the collection S_{k-1} of all samples of size k - 1. Finally, we compare the values of S and R using v̂_S and v̂_R to return the best of these two sets.

Algorithm 1 A tight ((1 - c)/(1 + c - c^2) - o(1))-optimization from samples algorithm for monotone submodular functions with curvature c
Input: S = {S_i : (S_i, f(S_i)) is a sample}
1: v̂_i ← avg(S_{k,i}) - avg(S_{k-1,-i})
2: S ← argmax_{|T|=k} Σ_{i∈T} v̂_i
3: v̂_S ← (1 - c) Σ_{e_i∈S} v̂_i   ▷ a lower bound on the value of f(S)
4: v̂_R ← avg(S_{k-1})   ▷ an estimate of the value of a random set R
5: if v̂_S ≥ v̂_R then
6:     return S
7: else
8:     return R
9: end if

The analysis. Without loss of generality, let S = {e_1, . . . , e_k} be the set defined in Line 2 of the algorithm and define S_i to be the first i elements in S, i.e., S_i := {e_1, . . . , e_i}. Similarly, for the optimal solution S*, we have S* = {e*_1, . . . , e*_k} and S*_i := {e*_1, . . . , e*_i}. 
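The procedure described above can be sketched in a few lines of Python (the sample bookkeeping and helper names are my assumptions, not from the paper):

```python
import random
from statistics import mean

# Sketch of Algorithm 1: estimate each element's expected marginal
# contribution to a random (k-1)-set from the samples, take the k elements
# with the largest estimates, lower-bound that set's value via curvature,
# and return the better of it and a random feasible set.

def optimize_from_samples(samples, n, k, c):
    vhat = {}
    for i in range(n):
        # avg over size-k samples containing i, minus avg over
        # size-(k-1) samples not containing i
        with_i = [v for T, v in samples if len(T) == k and i in T]
        without_i = [v for T, v in samples if len(T) == k - 1 and i not in T]
        vhat[i] = mean(with_i) - mean(without_i)
    S = set(sorted(range(n), key=lambda i: -vhat[i])[:k])
    v_S = (1 - c) * sum(vhat[i] for i in S)               # lower bound on f(S)
    v_R = mean(v for T, v in samples if len(T) == k - 1)  # estimate of E[f(R)]
    R = set(random.sample(range(n), k - 1))
    return S if v_S >= v_R else R
```

On a modular function (c = 0) the estimates v̂_i concentrate around the element weights, so with enough samples the sketch recovers the top-k elements, matching the 1 - o(1) modular guarantee discussed above.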
We abuse notation and denote by f(R) and f_R(e) the expected values E_R[f(R)] and E_R[f_R(e)], where the randomization is over the random set R of size k - 1.
At a high level, the curvature property is used to bound the loss from f(S) to Σ_{i≤k} f_R(e_i) and from Σ_{i≤k} f_R(e*_i) to f(S*). By the algorithm, Σ_{i≤k} f_R(e_i) is greater than Σ_{i≤k} f_R(e*_i). When bounding the loss from Σ_{i≤k} f_R(e*_i) to f(S*), a key observation is that if this loss is large, then it must be the case that R has a high expected value. This observation is formalized in our analysis by bounding this loss in terms of f(R) and motivates Algorithm 1 returning the best of R and S. Lemma 1 is the main part of the analysis and gives an approximation for S. The approximation guarantee for Algorithm 1 (formalized as Theorem 1) follows by finding the worst-case ratios of f(R) and f(S).
Lemma 1. Let S be the set defined in Algorithm 1 and f(·) be a monotone submodular function with curvature c; then

f(S) ≥ (1 - o(1)) v̂_S ≥ ((1 - c)(1 - c · f(R)/f(S*)) - o(1)) f(S*).

Proof. First, observe that

f(S) = Σ_{i≤k} f_{S_{i-1}}(e_i) ≥ (1 - c) Σ_{i≤k} f(e_i) ≥ (1 - c) Σ_{i≤k} f_R(e_i)

where the first inequality is by curvature and the second is by monotonicity. We now claim that w.h.p., and with a sufficiently large polynomial number of samples, the estimates of the marginal contribution of an element are precise,

f_R(e_i) + f(S*)/n^2 ≥ v̂_i ≥ f_R(e_i) - f(S*)/n^2,

and defer the proof to the appendix. Thus f(S) ≥ (1 - c) Σ_{i≤k} v̂_i - f(S*)/n ≥ v̂_S - f(S*)/n. Next, by the definition of S in the algorithm, we get

v̂_S/(1 - c) = Σ_{i≤k} v̂_i ≥ Σ_{i≤k} v̂*_i ≥ Σ_{i≤k} f_R(e*_i) - f(S*)/n.

It is possible to obtain a 1 - c loss between Σ_{i≤k} f_R(e*_i) and f(S*) with a similar argument as in the first part. The key idea to improve this loss is to use the curvature property on the elements in R instead of on the elements e*_i ∈ S*. By curvature, we have that f_{S*}(R) ≥ (1 - c) f(R). We now wish to relate f_{S*}(R) and Σ_{i≤k} f_R(e*_i). Note that f(S*) + f_{S*}(R) = f(R ∪ S*) = f(R) + f_R(S*), and Σ_{i≤k} f_R(e*_i) ≥ f_R(S*) by submodularity. We get Σ_{i≤k} f_R(e*_i) ≥ f(S*) + f_{S*}(R) - f(R) by combining the previous equation and inequality. By the previous curvature observation, we conclude that

Σ_{i≤k} f_R(e*_i) ≥ f(S*) + (1 - c) f(R) - f(R) = (1 - c · f(R)/f(S*)) f(S*).

Combining Lemma 1 and the fact that we obtain value at least max{f(R), (1 - c) Σ_{i=1}^k v̂_i}, we obtain the main result of this section.
Theorem 1. Let f(·) be a monotone submodular function with curvature c. Then Algorithm 1 is a ((1 - c)/(1 + c - c^2) - o(1))-optimization from samples algorithm.
Proof. In the appendix, we show that the estimate v̂_R of f(R) is precise: f(R) + f(S*)/n^2 ≥ v̂_R ≥ f(R) - f(S*)/n^2. In addition, by the first inequality in Lemma 1, f(S) ≥ (1 - o(1)) v̂_S. So by the algorithm and the second inequality in Lemma 1, the approximation obtained by the set returned is at least

(1 - o(1)) · max{ f(R)/f(S*), v̂_S/f(S*) } ≥ (1 - o(1)) · max{ f(R)/f(S*), (1 - c)(1 - c · f(R)/f(S*)) }.

Let x := f(R)/f(S*); the best of x and (1 - c)(1 - cx) - o(1) is minimized when x = (1 - c)(1 - cx), i.e., when x = (1 - c)/(1 + c - c^2). Thus, the approximation obtained is at least (1 - c)/(1 + c - c^2) - o(1).

4 Hardness

We show that the approximation obtained by Algorithm 1 is tight. For every c < 1, there exist monotone submodular functions that cannot be ((1 - c)/(1 + c - c^2))-optimized from samples. This impossibility result is information-theoretic; we show that with high probability the samples do not contain the right information to obtain a better approximation.

Technical overview. To obtain a tight bound, all the losses from Algorithm 1 must be tight. We need to obtain a 1 - c · f(R)/f(S*) gap between the contribution of optimal elements to a random set, Σ_{i≤k} f_R(e*_i), and the value f(S*). This gap implies that as a set grows with additional random elements, the contribution of optimal elements must decrease. The main difficulty is in obtaining this decrease while maintaining random sets of small value, submodularity, and the curvature.

Figure 1: The symmetric functions g(s_G, s_P) and b(s_B). (The plot shows, as a function of set size s, the curves g(s, 0), g(s, s_P ≥ k - 2 log n), and b(s), with a 1/(1 + c - c^2) loss for g and a 1 - c loss for b.)

The ground set of elements is partitioned into three parts: the good elements G, the bad elements B, and the poor elements P. In relation to the analysis of the algorithm, the optimal solution S* is G, the set S consists mostly of elements in B, and a random set consists mostly of elements in P. The values of the good, bad, and poor elements are given by the good, bad, and poor functions g(·), b(·), and p(·), to be defined later, and the functions f(·) we construct for the impossibility result are:

f^G(S) := g(S ∩ G, S ∩ P) + b(S ∩ B) + p(S ∩ P).

The value of the good function also depends on the poor elements, to obtain the decrease in the marginal contribution of good elements mentioned above. 
The proof of the hardness result (Theorem 2) starts with concentration bounds in Lemma 2 to show that w.h.p. every sample contains a small number of good and bad elements and a large number of poor elements. Using these concentration bounds, Lemma 3 gives two conditions on the functions g(·), b(·), and p(·) that yield the desired result. Informally, the first condition is that good and bad elements cannot be distinguished, while the second is that G has larger value than a set with a small number of good elements. We then construct these functions and show that they satisfy the two conditions in Lemma 4. Finally, Lemma 5 shows that f(·) is monotone submodular with curvature c.
Theorem 2. For every c < 1, there exists a hypothesis class of monotone submodular functions with curvature c that is not ((1 - c)/(1 + c - c^2) + o(1))-optimizable from samples.
The remainder of this section is devoted to the proof of Theorem 2. Let ε > 0 be some small constant. The set of poor elements P is fixed and has size n - n^{2/3-ε}. The good elements G are then a uniformly random subset of P^C of size k := n^{1/3}, and the remaining elements B are the bad elements. The following concentration bound is used to show that elements in G and B cannot be distinguished. The proof is deferred to the appendix.
Lemma 2. W.h.p., all samples S are such that |S ∩ (G ∪ B)| ≤ log n and |S ∩ P| ≥ k - 2 log n.
We now give two conditions on the good, bad, and poor functions to obtain an impossibility result based on the above concentration bounds. The first condition ensures that good and bad elements cannot be distinguished. The second condition quantifies the gap between the value of k good elements and a set with a small number of good elements. We denote by s_G the number of good elements in a set S, i.e., s_G := |S ∩ G|, and define s_B and s_P similarly. 
The good, bad, and poor functions are symmetric, meaning they each have equal value over sets of equal size, and we abuse notation with g(s_G, s_P) = g(S ∩ G, S ∩ P), and similarly for b(s_B) and p(s_P). Figure 1 is a simplified illustration of these two conditions.
Lemma 3. Consider sets S and S', and assume g(·), b(·), and p(·) are such that

1. g(s_G, s_P) + b(s_B) = g(s'_G, s'_P) + b(s'_B) if
   • s_G + s_B = s'_G + s'_B ≤ log n and s_P, s'_P ≥ k - 2 log n,
2. g(s_G, s_P) + b(s_B) + p(s_P) < α · g(k, 0) if
   • s_G ≤ n^ε and s_G + s_B + s_P ≤ k,

then the hypothesis class of functions F = {f^G(·) : G ⊆ P^C, |G| = k} is not α-optimizable from samples.

Proof. By Lemma 2, for any two samples S and S', we have s_G + s_B ≤ log n, s'_G + s'_B ≤ log n, and s_P, s'_P ≥ k - 2 log n with high probability. If s_G + s_B = s'_G + s'_B, then by the first assumption, g(s_G, s_P) + b(s_B) = g(s'_G, s'_P) + b(s'_B). Recall that G is a uniformly random subset of the fixed set P^C and that B consists of the remaining elements in P^C. Thus, w.h.p., the value f^G(S) of all samples S is independent of which random subset G is. In other words, no algorithm can distinguish good elements from bad elements with polynomially many samples. Let T be the set returned by the algorithm. Since any decision of the algorithm is independent of G, the expected number of good elements in T is t_G ≤ k · |G|/|G ∪ B| = k^2/n^{2/3-ε} = n^ε. Thus,

E_G[f^G(T)] = g(t_G, t_P) + b(t_B) + p(t_P) ≤ g(n^ε, t_P) + b(t_B) + p(t_P) < α · g(k, 0)

where the first inequality is by the submodularity and monotonicity properties of the good elements G for f^G(·) and the second inequality is by the second condition of the lemma. 
By expectation, the set returned by the algorithm is therefore not an α-approximation to the solution G for at least one function f^G(·) ∈ F, and F is not α-optimizable from samples.

Constructing g(·), b(·), p(·). The goal is now to construct g(·), b(·), and p(·) that satisfy the above conditions. We start with the good and bad functions:

g(s_G, s_P) = s_G · (1 - (1 - 1/(1 + c - c^2)) · s_P · 1/(k - 2 log n))  if s_P ≤ k - 2 log n, and
g(s_G, s_P) = s_G · 1/(1 + c - c^2)  otherwise;

b(s_B) = s_B · 1/(1 + c - c^2)  if s_B ≤ log n, and
b(s_B) = (s_B - log n) · (1 - c)/(1 + c - c^2) + log n · 1/(1 + c - c^2)  otherwise.

These functions exactly exhibit the losses from the analysis of the algorithm in the case where the algorithm returns bad elements. As illustrated in Figure 1, there is a 1 - c loss between the contribution 1/(1 + c - c^2) of a bad element to a random set and its contribution (1 - c)/(1 + c - c^2) to a set with at least log n bad elements. There is also a 1/(1 + c - c^2) loss between the contribution 1 of a good element to a set with no poor elements and its contribution 1/(1 + c - c^2) to a random set. We add a function p(s_P) to f^G(·) so that it is monotone increasing when adding poor elements:

p(s_P) = s_P · (1 - c)/(1 + c - c^2) · k/(k - 2 log n)  if s_P ≤ k - 2 log n, and
p(s_P) = (s_P - (k - 2 log n)) · (1 - c)^2/(1 + c - c^2) · k/(k - 2 log n) + (k - 2 log n) · (1 - c)/(1 + c - c^2) · k/(k - 2 log n)  otherwise.

The next two lemmas show that these functions satisfy Lemma 3 and that f^G(·) is monotone submodular with curvature c, which concludes the proof of Theorem 2.
Lemma 4. The functions g(·), b(·), and p(·) defined above satisfy the conditions of Lemma 3 with α = (1 - c)/(1 + c - c^2) + o(1).
Proof. We start with the first condition. Assume s_G + s_B = s'_G + s'_B ≤ log n and s_P, s'_P ≥ k - 2 log n. 
Then,

g(s_G, s_P) + b(s_B) = (s_G + s_B) · 1/(1 + c - c^2) = (s'_G + s'_B) · 1/(1 + c - c^2) = g(s'_G, s'_P) + b(s'_B).

For the second condition, assume s_G ≤ n^ε and s_G + s_B + s_P ≤ k. It is without loss of generality to assume that s_B + s_P ≥ k - n^ε; then

f^G(S) ≤ (1 + o(1)) · (s_B + s_P) · (1 - c)/(1 + c - c^2) ≤ k · ((1 - c)/(1 + c - c^2) + o(1)).

We conclude by noting that g(k, 0) = k.
Lemma 5. The function f^G(·) is a monotone submodular function with curvature c.
Proof. We show that the marginal contributions are positive (monotonicity) and decreasing (submodularity), but not by more than a 1 - c factor (curvature), i.e., that f_S(e) ≥ f_T(e) ≥ (1 - c) f_S(e) ≥ 0 for all S ⊆ T and e ∉ T. Let e be a good element; then

f^G_S(e) = 1 - (1 - 1/(1 + c - c^2)) · s_P · 1/(k - 2 log n)  if s_P ≤ k - 2 log n, and
f^G_S(e) = 1/(1 + c - c^2)  otherwise.

Since s_P ≤ t_P for S ⊆ T, we obtain f_S(e) ≥ f_T(e) ≥ 0. It is also easy to see that f_T(e) ≥ 1/(1 + c - c^2) ≥ (1 - c) ≥ (1 - c) f_S(e). For bad elements,

f^G_S(e) = 1/(1 + c - c^2)  if s_B ≤ log n, and
f^G_S(e) = (1 - c)/(1 + c - c^2)  otherwise.

Thus, f_S(e) ≥ f_T(e) ≥ (1 - c) f_S(e) ≥ 0 for all S ⊆ T and e ∉ T. Finally, for poor elements,

f^G_S(e) = (1 - c)/(1 + c - c^2) · k/(k - 2 log n) - (1 - 1/(1 + c - c^2)) · s_G · 1/(k - 2 log n)  if s_P ≤ k - 2 log n, and
f^G_S(e) = (1 - c)^2/(1 + c - c^2) · k/(k - 2 log n)  otherwise.

Since s_G ≤ k,

(1 - c)/(1 + c - c^2) · k/(k - 2 log n) ≥ f^G_S(e) ≥ (1 - c)^2/(1 + c - c^2) · k/(k - 2 log n).

Consider S ⊆ T; then s_G ≤ t_G, and f_S(e) ≥ f_T(e) ≥ (1 - c) f_S(e) ≥ 0.

5 Experiments

Figure 2: The objective f(·) as a function of the cardinality constraint k.
Figure 3: The approximation as a function of the curvature 1 - c when k = 100.

We perform simulations on simple synthetic functions. These experiments are meant to complement the theoretical analysis by conveying some interpretations of the bounds obtained. The synthetic functions are a simplification of the construction for the impossibility result. The motivation for these functions is to obtain hard instances that are challenging for the algorithm. More precisely, the function considered is

f(S) = |S ∩ (G ∪ B)|  if |S ∩ B| ≤ 10, and
f(S) = |S ∩ G| + |S ∩ B| · (1 - c) + 10c  otherwise,

where G and B are fixed sets of size 10^2 and 10^3 respectively. The ground set N contains 10^5 elements. It is easy to verify that f(·) has curvature c. This function is hard to optimize since the elements in G and B cannot be distinguished from samples.
We consider several benchmarks. The first is the value obtained by the learn-then-optimize approach, where we first learn the function and then optimize the learned function. Equivalently, this is a random set of size k, since the learned function is a constant with the algorithm from [3]. We also compare our algorithm to the value of the best sample observed. The solution returned by the greedy algorithm is an upper bound and is a solution obtainable only in the full-information setting. The results are summarized in Figures 2 and 3. In Figure 2, the values of greedy, best sample, and random set do not change for different curvatures c since w.h.p. they pick at most 10 elements from B. For curvature c = 0, when the function is modular, our algorithm performs as well as the greedy algorithm, which is optimal. As the curvature increases, the solution obtained by our algorithm worsens, but it still significantly outperforms the best sample and a random set. The power of our algorithm is that it is capable of distinguishing elements in G ∪ B from the other elements.

References
[1] Balcan, M. (2015). Learning submodular functions with applications to multi-agent systems. In AAMAS.
[2] Balcan, M., Constantin, F., Iwata, S., and Wang, L. (2012). 
Learning valuation functions. In COLT.\n[3] Balcan, M. and Harvey, N. J. A. (2011). Learning submodular functions. In STOC.\n[4] Balkanski, E., Rubinstein, A., and Singer, Y. (2015). The limitations of optimization from samples. arXiv\n\npreprint arXiv:1512.06238.\n\n[5] Conforti, M. and Cornu\u00e9jols, G. (1984). Submodular set functions, matroids and the greedy algorithm: tight\nworst-case bounds and some generalizations of the rado-edmonds theorem. Discrete applied mathematics.\n\n[6] Feige, U. (1998). A threshold of ln n for approximating set cover. JACM.\n[7] Feige, U., Mirrokni, V. S., and Vondrak, J. (2011). Maximizing non-monotone submodular functions. SIAM\n\nJournal on Computing.\n\n[8] Feldman, V. and Kothari, P. (2014). Learning coverage functions and private release of marginals. In COLT.\n[9] Feldman, V., Kothari, P., and Vondr\u00e1k, J. (2013). Representation, approximation and learning of submodular\n\nfunctions using low-rank decision trees. In COLT.\n\n[10] Feldman, V. and Vondr\u00e1k, J. (2013). Optimal bounds on approximation of submodular and XOS functions\n\nby juntas. In FOCS.\n\n[11] Feldman, V. and Vondr\u00e1k, J. (2015). Tight bounds on low-degree spectral concentration of submodular and\n\nXOS functions. CoRR.\n\n[12] Golovin, D., Faulkner, M., and Krause, A. (2010). Online distributed sensor selection. In IPSN.\n[13] Gomez Rodriguez, M., Leskovec, J., and Krause, A. (2010). Inferring networks of diffusion and in\ufb02uence.\n\nIn SIGKDD.\n\n[14] Hang, L. (2011). A short introduction to learning to rank. IEICE.\n[15] Iyer, R. K. and Bilmes, J. A. (2013). Submodular optimization with submodular cover and submodular\n\nknapsack constraints. In NIPS.\n\n[16] Iyer, R. K., Jegelka, S., and Bilmes, J. A. (2013). Curvature and optimal algorithms for learning and\n\nminimizing submodular functions. In NIPS.\n\n[17] Jegelka, S. and Bilmes, J. (2011a). Submodularity beyond submodular energies: coupling edges in graph\n\ncuts. 
In CVPR.\n\n[18] Jegelka, S. and Bilmes, J. A. (2011b). Approximation bounds for inference using cooperative cuts. In\n\nICML.\n\n[19] Kempe, D., Kleinberg, J., and Tardos, \u00c9. (2003). Maximizing the spread of in\ufb02uence through a social\n\nnetwork. In SIGKDD.\n\n[20] Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., and Glance, N. (2007). Cost-effective\n\noutbreak detection in networks. In SIGKDD.\n\n[21] Lin, H. and Bilmes, J. (2011a). A class of submodular functions for document summarization. In NAACL\n\nHLT.\n\n[22] Lin, H. and Bilmes, J. A. (2011b). Optimal selection of limited vocabulary speech corpora. In INTER-\n\nSPEECH.\n\n[23] Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. (1978). An analysis of approximations for maximizing\n\nsubmodular set functions ii. Math. Programming Study 8.\n\n[24] Rosenfeld, N. and Globerson, A. (2016). Optimal Tagging with Markov Chain Optimization. arXiv\n\npreprint arXiv:1605.04719.\n\n[25] Sviridenko, M., Vondr\u00e1k, J., and Ward, J. (2015). Optimal approximation for submodular and supermodular\n\noptimization with bounded curvature. In SODA.\n\n[26] Valiant, L. G. (1984). A Theory of the Learnable. Commun. ACM.\n[27] Vondr\u00e1k, J. (2010). Submodularity and curvature: the optimal algorithm. RIMS.\n[28] Yue, Y. and Joachims, T. (2008). Predicting diverse subsets using structural svms. In ICML.\n\n9\n\n\f", "award": [], "sourceid": 2011, "authors": [{"given_name": "Eric", "family_name": "Balkanski", "institution": "Harvard University"}, {"given_name": "Aviad", "family_name": "Rubinstein", "institution": "UC Berkeley"}, {"given_name": "Yaron", "family_name": "Singer", "institution": "Harvard University"}]}